45 Sampling and Estimation
45.1 Population, Sample, Census
A population is the complete set of items or elements under investigation. A sample is a subset of the population chosen for actual study. A census surveys every member of the population (kothari2019?; gupta2021?).
Population and sample quantities use parallel notation and terminology:
| Concept | Population | Sample |
|---|---|---|
| Mean | \(\mu\) | \(\bar X\) |
| Variance | \(\sigma^2\) | \(s^2\) |
| Proportion | \(P\) | \(p\) |
| Size | \(N\) | \(n\) |
| General term for the measure | Parameter | Statistic |
45.2 Why Sample?
Samples are used because of:
- Cost — surveying the whole population is expensive.
- Time — sampling is faster.
- Practicality — destructive testing makes census impossible.
- Accuracy — well-designed samples can be more accurate than rushed censuses.
- Depth of investigation — each sampled unit can be studied more thoroughly than is feasible for every unit in a census.
45.3 Methods of Sampling
Sampling methods divide into probability (each unit has a known, non-zero chance of selection) and non-probability (selection is by judgement or convenience).
| Family | Methods |
|---|---|
| Probability | Simple random, Stratified, Systematic, Cluster, Multi-stage, PPS |
| Non-probability | Convenience, Judgement / Purposive, Quota, Snowball, Volunteer |
45.3.1 Probability sampling — six methods
| Method | How it works | When to use |
|---|---|---|
| Simple Random Sampling (SRS) | Every unit has an equal chance of selection | Homogeneous population |
| Stratified Random Sampling | Population split into strata; sample drawn from each | Heterogeneous strata; reduce variance |
| Systematic Sampling | Every \(k\)-th unit after random start | Ordered list; large frame |
| Cluster Sampling | Population split into clusters; clusters chosen at random | Wide geography; cost reduction |
| Multi-stage Sampling | Sampling at successive stages | National household surveys |
| Probability Proportional to Size (PPS) | Larger units have higher chance of selection | Clusters of unequal size |
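The first three methods in the table can be sketched in a few lines of Python. This is a minimal illustration, not a survey-grade implementation: the frame of 1,000 unit IDs, the "urban"/"rural" strata, and the function names are all hypothetical, and the stratified version assumes proportional allocation (one of several possible allocation rules).

```python
import random

def simple_random_sample(frame, n, rng):
    """Draw n units without replacement; every unit is equally likely."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Take every k-th unit after a random start within the first interval."""
    k = len(frame) // n          # sampling interval
    start = rng.randrange(k)     # random start in 0 .. k-1
    return [frame[start + i * k] for i in range(n)]

def stratified_sample(strata, n, rng):
    """Proportional allocation: each stratum contributes in proportion to its size.

    Rounding can make the total differ slightly from n when shares are not whole numbers.
    """
    total = sum(len(units) for units in strata.values())
    sample = []
    for units in strata.values():
        n_h = round(n * len(units) / total)   # stratum sample size
        sample.extend(rng.sample(units, n_h))
    return sample

rng = random.Random(42)                      # fixed seed for reproducibility
frame = list(range(1, 1001))                 # hypothetical frame, N = 1000
strata = {"urban": frame[:600], "rural": frame[600:]}  # hypothetical strata

srs = simple_random_sample(frame, 50, rng)
sys_s = systematic_sample(frame, 50, rng)
strat = stratified_sample(strata, 50, rng)
print(len(srs), len(sys_s), len(strat))
```

With this frame the systematic sample uses an interval of \(k = 1000/50 = 20\), and the stratified sample draws 30 urban and 20 rural units, mirroring the 60:40 split of the strata.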
45.3.2 Non-probability sampling — five methods
| Method | How it works |
|---|---|
| Convenience | Whoever is easiest to access |
| Judgement / Purposive | Researcher’s expertise picks units |
| Quota | Specified number from each subgroup, chosen non-randomly |
| Snowball | Existing respondents refer further respondents |
| Volunteer | Self-selected participants (online polls) |
45.4 Sampling vs Non-Sampling Errors
| Error | Source | Reduced by |
|---|---|---|
| Sampling error | Random variation between sample and population | Larger sample size, better design |
| Non-sampling error | Faulty design, measurement, recording, processing | Better questionnaire, training, editing |
A census has zero sampling error but typically higher non-sampling error than a well-designed sample.
45.5 Sampling Distribution
The sampling distribution of a statistic is the probability distribution of the statistic computed across all possible samples of a given size from the population.
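The most-tested example is the sampling distribution of the sample mean. By the Central Limit Theorem, for large \(n\) the sample mean is approximately normally distributed, whatever the shape of the population (provided its variance is finite):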
\[ \bar X \sim N\left( \mu, \dfrac{\sigma^2}{n} \right) \]
The standard deviation of the sampling distribution is the standard error:
\[ \text{SE}(\bar X) = \dfrac{\sigma}{\sqrt{n}} \]
For a proportion: \(\text{SE}(p) = \sqrt{p(1-p)/n}\).
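A small simulation can make the idea of a sampling distribution concrete. The sketch below assumes a hypothetical, deliberately skewed population (exponential with mean 50) and draws repeated samples with replacement; it then compares the standard deviation of the sample means with \(\sigma / \sqrt{n}\). The population, the seed, and the number of repetitions are illustrative choices, not values from the text.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for reproducibility

# Hypothetical population: 10,000 values from a skewed (exponential) distribution.
population = [rng.expovariate(1 / 50) for _ in range(10_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)   # population SD (divides by N)

n = 100
# Draw 5,000 samples of size n with replacement and record each sample mean.
sample_means = [
    statistics.fmean(rng.choices(population, k=n)) for _ in range(5_000)
]

theoretical_se = sigma / n ** 0.5               # sigma / sqrt(n)
empirical_se = statistics.stdev(sample_means)   # SD of the simulated sample means

print(f"population mean      : {mu:.2f}")
print(f"mean of sample means : {statistics.fmean(sample_means):.2f}")
print(f"theoretical SE       : {theoretical_se:.3f}")
print(f"empirical SE         : {empirical_se:.3f}")
```

The two standard errors should agree closely, and a histogram of `sample_means` would look roughly bell-shaped even though the population itself is strongly skewed.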
45.6 Estimation
Estimation is the process of using sample statistics to infer unknown population parameters (gupta2021?). Two kinds:
- Point estimation — single best-guess value (e.g. \(\bar X\) for \(\mu\)).
- Interval estimation — a range with associated confidence level (e.g. 95 % CI).
A good point estimator has four desirable properties:
| Property | Meaning |
|---|---|
| Unbiasedness | \(E(\hat\theta) = \theta\) |
| Consistency | \(\hat\theta \to \theta\) as \(n \to \infty\) |
| Efficiency | Smallest variance among unbiased estimators |
| Sufficiency | Uses all relevant information in the sample |
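Unbiasedness can be checked by simulation. The sketch below is an illustration under assumed values (a hypothetical normal population with mean 10 and variance 4, samples of size 5): it averages two variance estimators over many samples. The estimator that divides by \(n - 1\) should average close to \(\sigma^2\), while the one that divides by \(n\) should average close to \(\sigma^2 (n-1)/n\), which is a biased result.

```python
import random
import statistics

rng = random.Random(1)   # fixed seed for reproducibility
sigma2 = 4.0             # true variance of the hypothetical N(10, 2^2) population
n = 5                    # small sample size makes the bias easy to see
reps = 10_000

biased, unbiased = [], []
for _ in range(reps):
    sample = [rng.gauss(10, 2) for _ in range(n)]
    biased.append(statistics.pvariance(sample))    # divides by n
    unbiased.append(statistics.variance(sample))   # divides by n - 1

avg_biased = statistics.fmean(biased)
avg_unbiased = statistics.fmean(unbiased)

print(f"true variance            : {sigma2}")
print(f"average, divisor n - 1   : {avg_unbiased:.3f}")
print(f"average, divisor n       : {avg_biased:.3f}")
print(f"expected biased average  : {sigma2 * (n - 1) / n:.3f}")
```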
45.7 Confidence Intervals
The general form for the population mean \(\mu\), with known \(\sigma\):
\[ \bar X \pm z_{\alpha/2} \cdot \dfrac{\sigma}{\sqrt{n}} \]
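When \(\sigma\) is unknown and \(n\) is small, replace \(\sigma\) with the sample standard deviation \(s\) and \(z\) with Student’s t on \(n - 1\) degrees of freedom: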
\[ \bar X \pm t_{\alpha/2, n-1} \cdot \dfrac{s}{\sqrt{n}} \]
| Confidence | \(\alpha\) | \(z_{\alpha/2}\) |
|---|---|---|
| 90 % | 0.10 | 1.645 |
| 95 % | 0.05 | 1.96 |
| 99 % | 0.01 | 2.58 |
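The critical values in the table and the known-\(\sigma\) interval can be reproduced with Python's standard library. This is a minimal sketch: the function names `z_critical` and `mean_ci_known_sigma` are hypothetical, and the inputs in the usage lines (mean 50,000, \(\sigma\) = 5,000, \(n\) = 100) are illustrative values.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-tailed critical value z_{alpha/2} for a given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def mean_ci_known_sigma(xbar, sigma, n, confidence=0.95):
    """Confidence interval for mu: xbar +/- z_{alpha/2} * sigma / sqrt(n)."""
    margin = z_critical(confidence) * sigma / n ** 0.5
    return xbar - margin, xbar + margin

# Critical values for the three common confidence levels.
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")

# Illustrative inputs: mean 50,000, sigma 5,000, n = 100.
lower, upper = mean_ci_known_sigma(50_000, 5_000, 100)
print(f"95% CI: ({lower:,.0f}, {upper:,.0f})")
```

To three decimals the 99 % value is 2.576, which the table rounds to 2.58. The standard library has no Student's t quantile function, so the small-sample interval would need an external package (for example `scipy.stats.t.ppf`, assuming SciPy is installed).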
45.8 Sample Size Determination
For estimating \(\mu\) within a margin of error \(E\) at confidence \(1 - \alpha\):
\[ n = \left( \dfrac{z_{\alpha/2} \cdot \sigma}{E} \right)^2 \]
For estimating \(P\):
\[ n = \dfrac{z_{\alpha/2}^2 \cdot P(1-P)}{E^2} \]
When \(P\) is unknown, use \(P = 0.5\) for the most conservative (largest) sample size.
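Both sample-size formulas translate directly into code. The sketch below uses hypothetical function names and rounds up to the next whole unit, since a fractional respondent is not possible. It also evaluates the proportion formula on a few trial values of \(P\) to show why \(P = 0.5\) is the conservative choice.

```python
import math
from statistics import NormalDist

def _z(confidence):
    """Two-tailed critical value z_{alpha/2}."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def sample_size_mean(sigma, margin, confidence=0.95):
    """n = (z * sigma / E)^2, rounded up."""
    return math.ceil((_z(confidence) * sigma / margin) ** 2)

def sample_size_proportion(margin, confidence=0.95, p=0.5):
    """n = z^2 * P(1 - P) / E^2, rounded up; p = 0.5 gives the largest n."""
    return math.ceil(_z(confidence) ** 2 * p * (1 - p) / margin ** 2)

# Illustrative inputs: sigma = 5,000 and a margin of error of 100.
print(sample_size_mean(sigma=5_000, margin=100))

# Margin of error of 5 percentage points for several trial values of P.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, sample_size_proportion(margin=0.05, p=p))
```

Because \(P(1-P)\) peaks at \(P = 0.5\), the required \(n\) is largest there and falls off symmetrically on either side.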
45.9 Worked Numerical
A sample of 100 students has mean income ₹50,000 with sample SD ₹5,000.
- Standard error of mean = \(5{,}000 / \sqrt{100} = 500\).
- 95 % confidence interval for \(\mu\): \(50{,}000 \pm 1.96 \times 500 = 50{,}000 \pm 980 =\) (₹49,020, ₹50,980).
To estimate \(\mu\) within ±₹100 at 95 % confidence with \(\sigma\) ≈ ₹5,000:
\[ n = (1.96 \times 5{,}000 / 100)^2 = 98^2 = 9{,}604 \]
45.10 Exam-Pattern MCQs
Q1. Which of the following is not a probability-sampling method?
A. Simple random sampling B. Stratified random sampling C. Convenience sampling D. Cluster sampling
Answer: C. Convenience sampling is non-probability.
Q2. Match each sampling method with its description:
| | Method | | Description |
|---|---|---|---|
| (i) | Simple Random | (a) | Population split into strata; random sample from each |
| (ii) | Stratified | (b) | Every k-th unit after a random start |
| (iii) | Systematic | (c) | Each unit has equal chance |
| (iv) | Cluster | (d) | Population split into clusters; some clusters fully sampled |
A. (i)-(c), (ii)-(a), (iii)-(b), (iv)-(d) B. (i)-(a), (ii)-(b), (iii)-(c), (iv)-(d) C. (i)-(b), (ii)-(c), (iii)-(d), (iv)-(a) D. (i)-(d), (ii)-(c), (iii)-(a), (iv)-(b)
Answer: A.
Q3. A sample of 400 has mean ₹2,000 and SD ₹100. The standard error of the mean is:
A. ₹0.25 B. ₹5 C. ₹10 D. ₹100
Answer: B. SE = \(100 / \sqrt{400} = 100/20 =\) ₹5.
Q4. Match each property of a good estimator with its meaning:
| | Property | | Meaning |
|---|---|---|---|
| (i) | Unbiasedness | (a) | Smallest variance among unbiased estimators |
| (ii) | Consistency | (b) | Uses all relevant information |
| (iii) | Efficiency | (c) | \(E(\hat\theta) = \theta\) |
| (iv) | Sufficiency | (d) | \(\hat\theta \to \theta\) as \(n \to \infty\) |
A. (i)-(c), (ii)-(d), (iii)-(a), (iv)-(b) B. (i)-(a), (ii)-(b), (iii)-(c), (iv)-(d) C. (i)-(b), (ii)-(c), (iii)-(d), (iv)-(a) D. (i)-(d), (ii)-(a), (iii)-(b), (iv)-(c)
Answer: A.
Q5. The 95 % confidence z-value is approximately:
A. 1.645 B. 1.96 C. 2.33 D. 2.58
Answer: B. z = 1.96 for 95 % confidence (two-tailed).
Q6. What sample size is needed to estimate a population proportion within ±5 % at 95 % confidence, when \(P\) is unknown?
A. 96 B. 196 C. 385 D. 1,000
Answer: C. With \(P = 0.5\), \(n = (1.96)^2 \times 0.5 \times 0.5 / (0.05)^2 = 0.9604 / 0.0025 = 384.16\), rounded up to 385.
Q7. Arrange the steps of inferential statistics in correct order:
- (i) Compute confidence interval
- (ii) Determine sample size
- (iii) Collect sample
- (iv) Calculate sample statistic and standard error
A. (ii), (iii), (iv), (i) B. (i), (ii), (iii), (iv) C. (iv), (iii), (ii), (i) D. (iii), (iv), (ii), (i)
Answer: A. Sample-size → Collect → Compute statistic and SE → Confidence interval.
Q8. Match each error with its source / mitigation:
| | Error | | Source / Mitigation |
|---|---|---|---|
| (i) | Sampling error | (a) | Faulty questionnaire; reduced by training and editing |
| (ii) | Non-sampling error | (b) | Random variation; reduced by larger n and better design |
A. (i)-(b), (ii)-(a) B. (i)-(a), (ii)-(b)
Answer: A.
- Population (size \(N\)) vs Sample (size \(n\)). Parameter (\(\mu\), \(\sigma\), \(P\)) vs Statistic (\(\bar X\), \(s\), \(p\)).
- Probability sampling: SRS, Stratified, Systematic, Cluster, Multi-stage, PPS.
- Non-probability sampling: Convenience, Judgement, Quota, Snowball, Volunteer.
- Census has zero sampling error but typically larger non-sampling error.
- Standard error of mean = \(\sigma / \sqrt{n}\). SE(p) = \(\sqrt{p(1-p)/n}\).
- CLT: \(\bar X \sim N(\mu, \sigma^2/n)\) for large \(n\).
- Properties of good estimator: unbiased, consistent, efficient, sufficient.
- 95 % CI: \(\bar X \pm 1.96 \cdot \sigma / \sqrt{n}\). (Use t for unknown \(\sigma\) and small \(n\).)
- Sample size: \(n = (z \sigma / E)^2\). For proportions, conservative \(P = 0.5\) → ≈ 385 for ±5 % at 95 %.