46 Sampling and estimation: Concepts; Methods of sampling - probability and non-probability methods; Sampling distribution; Central limit theorem; Standard error; Statistical estimation

46.1 Population, Sample, Census

A population (universe) is the entire set of units relevant to a research question. A sample is a subset of the population. Sampling is the process of selecting a sample, and the sampling design is the plan for doing so. The alternative, a census, surveys every unit. A census gives complete coverage but is costly, time-consuming, and often impractical; sampling is quicker, cheaper, and, when properly designed, accurate enough. India conducts a decennial Census (last completed in 2011; the 2021 Census was deferred); routine surveys use sampling. Sampling theory rests on two foundations: the Central Limit Theorem (the distribution of sample means tends to normal as the sample size grows) and the Law of Large Numbers (the sample mean converges to the population mean).

46.2 Why Sample?

Advantages of Sampling over Census

Lower cost.
Less time.
Greater detail per unit — better quality.
Necessary when population is infinite or destructive testing is involved.
Reliable under proper sampling design.

46.3 Probability Sampling Methods

In probability sampling, every unit has a known, non-zero probability of selection. This allows statistical inference from the sample to the population.

Major Probability Methods

Method	Working
Simple Random Sampling (SRS)	Every unit has equal chance; with or without replacement
Systematic Sampling	Pick every k-th unit after random start; k = N/n
Stratified Sampling	Population divided into homogeneous strata; random sample from each
Cluster Sampling	Population divided into clusters; some clusters randomly selected; all units in chosen clusters surveyed
Multi-stage Sampling	Sample units selected in stages (e.g., state → district → village → household)
Probability Proportional to Size (PPS)	Probability of selection proportional to unit size
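
Three of the methods in the table can be sketched in a few lines of Python. This is a minimal illustration, not a survey-grade implementation: the population of 1,000 unit IDs, the "urban"/"rural" strata, the equal allocation per stratum, and all function names are assumptions made for the example.

```python
import random

def simple_random_sample(units, n, rng):
    """SRS without replacement: every unit has the same chance of selection."""
    return rng.sample(units, n)

def systematic_sample(units, n, rng):
    """Every k-th unit after a random start in the first interval, k = N // n."""
    k = len(units) // n
    start = rng.randrange(k)
    return [units[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """An SRS of fixed size from every stratum (equal allocation, for simplicity)."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

rng = random.Random(42)              # fixed seed so the sketch is reproducible
population = list(range(1, 1001))    # hypothetical frame of N = 1000 unit IDs

srs = simple_random_sample(population, 50, rng)
sys_sample = systematic_sample(population, 50, rng)   # k = 1000 // 50 = 20

# Hypothetical strata: the first 400 IDs are "urban", the rest "rural".
strata = {"urban": population[:400], "rural": population[400:]}
strat = stratified_sample(strata, 25, rng)

print(len(srs), len(sys_sample), {name: len(s) for name, s in strat.items()})
```

Note that the systematic sample is fully determined by its random start, whereas the stratified sample guarantees representation from every stratum.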

46.3.1 Stratified vs Cluster Sampling

Aspect	Stratified	Cluster
Group composition	Homogeneous within, heterogeneous between	Heterogeneous within, homogeneous between
Sample	From every stratum	From selected clusters only
Efficiency	Higher precision per cost	Lower precision but cheaper
Used when	Population is naturally divisible into subgroups	Population is geographically dispersed
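
The key contrast with stratified sampling is that cluster sampling randomises over whole groups and then surveys everyone inside the chosen groups. A minimal sketch follows, assuming a hypothetical frame of 10 villages with 20 households each; the names `cluster_sample` and `villages` are illustrative only.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters, then take every unit in each chosen one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

rng = random.Random(7)

# Hypothetical frame: 10 villages (clusters) with 20 households each.
villages = {
    f"village_{v}": [f"v{v}_hh{h}" for h in range(20)]
    for v in range(10)
}

surveyed = cluster_sample(villages, 3, rng)
total_households = sum(len(households) for households in surveyed.values())
print(sorted(surveyed), total_households)
```

Only 3 of the 10 villages are visited, which is what makes the design cheap for dispersed populations, at the cost of precision if households within a village resemble each other.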

46.4 Non-Probability Sampling Methods

In non-probability sampling, units are selected on bases other than chance. Inference to the population is riskier because selection probabilities are unknown.

Major Non-Probability Methods

Method	Working
Convenience sampling	Units chosen because they are easy to access
Purposive / Judgement	Researcher selects units based on judgement
Quota	Fill predefined quotas of categories (gender, age)
Snowball / Chain referral	One respondent refers others; used for hidden populations (e.g., undocumented migrants, drug users)
Self-selection / Voluntary	Volunteers participate (online polls)

46.5 Sampling Error and Non-Sampling Error

Two Types of Error

Type	Source	Reduced by
Sampling error	Difference between sample estimate and population parameter; arises from chance	Larger sample; better design
Non-sampling error	Errors in measurement, response, processing, non-response	Better instrument and procedures

46.6 Sampling Distribution

The sampling distribution of a statistic (e.g., sample mean) is the probability distribution of its values across all possible samples of a fixed size from the population. Its standard deviation is called the standard error.
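
For a very small population the sampling distribution can be listed exhaustively, which makes the definition concrete. The sketch below assumes a hypothetical population of four values and samples of size 2 drawn with replacement; it checks that the mean of all sample means equals μ and that their standard deviation equals σ/√n.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

population = [2, 4, 6, 8]   # hypothetical population, N = 4
n = 2                       # sample size

# All N**n = 16 ordered samples drawn with replacement, and their means.
sample_means = [mean(sample) for sample in product(population, repeat=n)]

mu = mean(population)                    # population mean
sigma = pstdev(population)               # population standard deviation
mean_of_means = mean(sample_means)       # centre of the sampling distribution
standard_error = pstdev(sample_means)    # spread of the sampling distribution

print(mu, mean_of_means)
print(sigma / sqrt(n), standard_error)
```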

Standard Error Formulas

Statistic	Formula
Standard error of mean (σ known)	SE(x̄) = σ/√n
Standard error of mean (σ unknown, sample size large)	SE(x̄) = s/√n
Standard error of proportion	SE(p̂) = √(p(1−p)/n)
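
The formulas in the table translate directly into code. The helper names `se_mean` and `se_proportion` are illustrative, and the input values are chosen only for the example.

```python
from math import sqrt

def se_mean(sd, n):
    """Standard error of the sample mean: sd / sqrt(n), with sd = sigma or s."""
    return sd / sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1 - p) / n)."""
    return sqrt(p * (1 - p) / n)

print(se_mean(20, 100))          # sigma = 20, n = 100 gives 2.0
print(se_proportion(0.5, 100))   # p = 0.5, n = 100
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.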

46.7 Central Limit Theorem (CLT) — Revisited

CLT: For random samples of size n from a population with mean μ and finite variance σ², the sampling distribution of x̄ tends to be normal with mean μ and SE σ/√n as n → ∞, regardless of the population’s distribution. Rule of thumb: n ≥ 30.
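
A small simulation can illustrate the theorem. The sketch below assumes an Exponential(1) population (strongly right-skewed, with μ = 1 and σ = 1) and draws 5,000 samples of size n = 30; the population, the seed, and the number of repetitions are all arbitrary choices for the example.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)    # fixed seed for reproducibility
n, repeats = 30, 5000

def sample_mean(n, rng):
    """Mean of n independent draws from an Exponential(1) population."""
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

sample_means = [sample_mean(n, rng) for _ in range(repeats)]

theoretical_se = 1 / sqrt(n)    # sigma / sqrt(n) with sigma = 1
# Share of sample means within 1.96 SE of mu; close to 0.95 if roughly normal.
within = sum(abs(m - 1) <= 1.96 * theoretical_se for m in sample_means) / repeats

print(round(mean(sample_means), 3), round(stdev(sample_means), 3),
      round(theoretical_se, 3), round(within, 3))
```

Even though individual observations are far from normal, the sample means cluster around μ with spread close to σ/√n.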

46.8 Law of Large Numbers

The Law of Large Numbers states that the sample mean converges in probability to the population mean as the sample size grows. This is the formal foundation for “more data is better”.
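
The convergence can be seen with simulated rolls of a fair die, whose population mean is 3.5. This is only an illustration; the seed and the sample sizes are arbitrary.

```python
import random

def sample_mean_of_rolls(n, rng):
    """Mean of n simulated rolls of a fair six-sided die (population mean 3.5)."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

rng = random.Random(1)    # fixed seed for reproducibility
for n in (10, 100, 10_000, 100_000):
    print(n, round(sample_mean_of_rolls(n, rng), 3))
```

The printed means wander for small n and settle near 3.5 as n grows.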

46.9 Statistical Estimation

Estimation is the use of sample statistics to infer population parameters. Two forms:

Two Forms of Estimation

Form	Working
Point estimation	Single value (e.g., x̄ for μ)
Interval estimation (Confidence Interval)	A range with stated confidence (e.g., 95 % CI for μ)

46.9.1 Properties of a Good Estimator

Four Properties of a Good Estimator

Unbiasedness: E(θ̂) = θ.
Consistency: θ̂ → θ in probability as n → ∞.
Efficiency: minimum variance among unbiased estimators.
Sufficiency: uses all the information about θ contained in the sample.
(BLUE, Best Linear Unbiased Estimator, is a related but separate idea: by the Gauss-Markov theorem, OLS is BLUE under the classical assumptions.)
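
Unbiasedness can be checked by brute force on a tiny population. The example below is an illustration that goes slightly beyond the list above: it uses the sample variance, which the notes do not discuss, and compares the n − 1 divisor with the n divisor across every possible sample of size 2 (drawn with replacement) from a hypothetical four-value population.

```python
from itertools import product
from statistics import mean, pvariance, variance

population = [2, 4, 6, 8]    # hypothetical population
samples = list(product(population, repeat=2))   # all 16 samples, with replacement

sigma2 = pvariance(population)    # true population variance

# Average of each estimator over all possible samples = its expected value.
avg_unbiased = mean(variance(s) for s in samples)    # divisor n - 1
avg_biased = mean(pvariance(s) for s in samples)     # divisor n

print(sigma2, avg_unbiased, avg_biased)
```

The n − 1 version averages exactly to σ², so it is unbiased; the n version averages to half of σ² here, so it underestimates.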

46.9.2 Confidence Interval

For a population mean with a large sample (n ≥ 30); if σ is unknown, the sample standard deviation s is used in its place:

\[CI = \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\]

For small samples (n < 30, σ unknown) from an approximately normal population, use the t-distribution with n − 1 degrees of freedom:

\[CI = \bar{x} \pm t_{\alpha/2, n-1} \cdot \frac{s}{\sqrt{n}}\]

Critical z-values:

90 % CI: ±1.645
95 % CI: ±1.96
99 % CI: ±2.58
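
Both interval formulas can be sketched with the Python standard library. The z critical values come from `statistics.NormalDist`; the standard library has no t-distribution, so the t value below is hard-coded from a printed t-table (t = 2.262 for 95 % confidence and 9 degrees of freedom). The inputs x̄ = 50, sd = 10 and the helper names are assumptions for the example.

```python
from math import sqrt
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value z_{alpha/2} of the standard normal."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def mean_ci(xbar, sd, n, critical):
    """Interval xbar ± critical × sd / sqrt(n)."""
    margin = critical * sd / sqrt(n)
    return xbar - margin, xbar + margin

for level in (0.90, 0.95, 0.99):
    print(level, round(z_critical(level), 3))    # 1.645, 1.96, 2.576

# Large sample: xbar = 50, sigma = 10, n = 100, 95 % confidence.
low, high = mean_ci(50, 10, 100, z_critical(0.95))
print(round(low, 2), round(high, 2))

# Small sample: n = 10, s = 10; t value taken from a t-table (df = 9).
T_CRIT_95_DF9 = 2.262
t_low, t_high = mean_ci(50, 10, 10, T_CRIT_95_DF9)
print(round(t_low, 2), round(t_high, 2))
```

The small-sample interval is much wider, both because n is smaller and because the t critical value exceeds 1.96.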

46.10 Sample Size Determination

For estimating a population mean with margin of error E at confidence level (1 − α):

\[n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2\]

For estimating a population proportion:

\[n = \frac{z_{\alpha/2}^2 \cdot p(1-p)}{E^2}\]

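(The required n is largest at p = 0.5, so p = 0.5 gives a conservative sample size when no prior estimate of p is available.)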
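
The two sample-size formulas can be sketched as follows. Rounding up to the next whole unit is a common convention assumed here, so results may be one higher than hand-rounded figures; the function names and the default p = 0.5 are also choices made for the example.

```python
from math import ceil
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value z_{alpha/2} of the standard normal."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def n_for_mean(sigma, margin, confidence=0.95):
    """n = (z * sigma / E)^2, rounded up."""
    z = z_critical(confidence)
    return ceil((z * sigma / margin) ** 2)

def n_for_proportion(margin, confidence=0.95, p=0.5):
    """n = z^2 * p(1 - p) / E^2, rounded up; p = 0.5 is the conservative default."""
    z = z_critical(confidence)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(n_for_mean(sigma=10, margin=2))      # (1.96 * 10 / 2)^2 is about 96.04
print(n_for_proportion(margin=0.05))       # ±5 percentage points at 95 %
print(n_for_proportion(margin=0.05, p=0.2))
```

Halving the margin of error roughly quadruples the required sample size, since E is squared in the denominator.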

flowchart TB
  S[Sampling Methods] --> PR[Probability]
  S --> NP[Non-Probability]
  PR --> SRS[SRS]
  PR --> SYS[Systematic]
  PR --> STR[Stratified]
  PR --> CL[Cluster]
  PR --> MS[Multi-stage]
  PR --> PPS[PPS]
  NP --> CN[Convenience]
  NP --> PU[Purposive]
  NP --> QU[Quota]
  NP --> SN[Snowball]
  NP --> SS[Self-selection]
  classDef default fill:#003366,color:#ffffff,stroke:#ffcc00,stroke-width:3px,rx:10px,ry:10px;

Distractor warning

PYQ (previous-year question) trap: Standard error of the mean = σ/√n, not σ²/n (σ²/n is the variance of x̄). A confidence interval is the point estimate ± z (or t) × SE.

46.11 Practice Questions

Q 01 · Definition · Easy

A **census** surveys:

A. A random sample
B. Every unit of the population
C. Selected clusters only
D. 10 % of the population

Correct Option: B

**Census** = complete enumeration.

Q 02 · SRS · Easy

In SRS:

A. Every unit has an equal chance of selection
B. Selection is convenience-based
C. Researcher uses judgement
D. Snowball referrals are used

Correct Option: A

Equal-probability random selection.

Q 03 · Methods · Medium

Match each method with its description:

	Method		Description
(i)	Stratified	(a)	Random selection of clusters
(ii)	Cluster	(b)	Random selection from each subgroup
(iii)	Systematic	(c)	Every k-th unit selected after random start
(iv)	Quota	(d)	Predefined quotas of categories filled

A. (i)-(b), (ii)-(a), (iii)-(c), (iv)-(d)
B. (i)-(a), (ii)-(b), (iii)-(c), (iv)-(d)
C. (i)-(c), (ii)-(b), (iii)-(a), (iv)-(d)
D. (i)-(d), (ii)-(c), (iii)-(b), (iv)-(a)

Correct Option: A

Stratified — from each subgroup; Cluster — clusters; Systematic — k-th; Quota — categories.

Q 04 · SE · Medium

σ = 20; n = 100. Standard error of the sample mean is:

A. 2
B. 0.2
C. 20
D. 100

Correct Option: A

SE = σ/√n = 20/10 = **2**.

Q 05 · CLT · Medium

By the Central Limit Theorem, the sampling distribution of the sample mean is approximately Normal when:

A. Population is normal only
B. n ≥ 30, regardless of population distribution
C. n < 5
D. σ is unknown

Correct Option: B

**n ≥ 30** rule of thumb; the CLT holds regardless of the population distribution, provided its variance is finite.

Q 06 · CI · Medium

For a 95 % CI for the population mean (large sample), the critical z-value is:

A. 1.645
B. 1.96
C. 2.33
D. 2.58

Correct Option: B

**z_{0.025} = ±1.96** for 95 % CI.

Q 07 · CI compute · Medium

x̄ = 50, σ = 10, n = 100. 95 % CI for μ:

A. [40, 60]
B. [48.04, 51.96]
C. [45, 55]
D. [49, 51]

Correct Option: B

SE = 10/10 = 1; 50 ± 1.96 × 1 = **[48.04, 51.96]**.

Q 08 · Estimator · Medium

An estimator is *unbiased* if:

A. Its variance is minimum
B. E(θ̂) = θ
C. θ̂ → θ as n → ∞
D. It uses all sample information

Correct Option: B

**Expected value equals true parameter**.

Q 09 · Consistency · Medium

An estimator is *consistent* if:

A. Expected value equals parameter
B. Variance is minimum among unbiased estimators
C. Estimator converges (in probability) to parameter as n → ∞
D. Uses all sample data

Correct Option: C

**Consistency** — convergence to true value with sample size.

Q 10 · Strat vs Cluster · Hard

Stratified and cluster sampling differ chiefly in that:

A. Strata are homogeneous within; clusters are heterogeneous within
B. They are the same
C. Clusters are smaller
D. Strata are randomly selected

Correct Option: A

**Stratified** — homogeneous within strata; **Cluster** — heterogeneous within clusters.

Q 11 · Non-prob · Medium

Which is a *non-probability* method?

A. SRS
B. Stratified
C. Convenience
D. Cluster

Correct Option: C

**Convenience** — non-probability; the others are probability.

Q 12 · Snowball · Medium

Snowball sampling is most suitable for:

A. Large general populations
B. Hidden or hard-to-reach populations (drug users, undocumented migrants)
C. Random surveys
D. Census

Correct Option: B

Snowball — chain-referral for hidden populations.

Q 13 · Sample size · Hard

Required sample size for estimating μ at 95 % confidence with margin of error E = 2, σ = 10:

A. 25
B. 96
C. 100
D. 1000

Correct Option: B

n = (1.96 × 10 / 2)² = 9.8² = 96.04, so **96** is the closest option (in practice n is rounded up to 97).

Q 14 · Error · Medium

Sampling error is reduced primarily by:

A. Better questionnaire wording
B. Larger sample size and better sampling design
C. Better enumerator training
D. Switch to census

Correct Option: B

Sampling error ↓ with larger n and better design.

Q 15 · BLUE · Hard

By the Gauss-Markov theorem, OLS estimators are:

A. Best Linear Unbiased Estimators (BLUE)
B. Always non-linear
C. Biased
D. Inefficient

Correct Option: A

**Gauss-Markov: OLS is BLUE** under classical assumptions.

Q 16 · Systematic · Medium

In systematic sampling with N = 1000 and n = 50, the interval k is:

A. 10
B. 20
C. 50
D. 100

Correct Option: B

k = N/n = 1000/50 = **20**.

Q 17 · Census India · Medium

India conducts a census every:

A. 5 years
B. 10 years (decennial)
C. 2 years
D. 25 years

Correct Option: B

First census in 1872; decennial since 1881; last completed **2011**.

Q 18 · SE proportion · Hard

Standard error of a sample proportion is:

A. σ/√n
B. √(p(1−p)/n)
C. σ²/n
D. p/n

Correct Option: B

**SE(p̂) = √(p(1−p)/n)**.

Q 19 · 99 % z · Medium

Critical z for 99 % CI:

A. 1.645
B. 1.96
C. 2.33
D. 2.58

Correct Option: D

99 % CI → z = **±2.58**.

Q 20 · Quota · Medium

Quota sampling is:

A. A probability method
B. A non-probability method analogous to stratified, but with non-random selection within each quota
C. Same as census
D. Random within strata

Correct Option: B

**Quota = non-probability analogue of stratified**.

46.12 Quick Recall

Population vs Sample; Sampling vs Census. India: first census 1872, decennial since 1881; last completed 2011.
Probability methods: SRS, Systematic (k = N/n), Stratified (homogeneous within strata), Cluster (heterogeneous within clusters), Multi-stage, PPS.
Non-probability methods: Convenience, Purposive, Quota, Snowball, Self-selection.
Errors: Sampling (chance, reduced by n) vs Non-sampling (measurement / response / processing).
CLT: for large n, x̄ is approximately Normal with mean μ and standard error σ/√n.
SE of mean = σ/√n; SE of proportion = √(p(1−p)/n).
Estimation: Point vs Interval (CI). Properties of a good estimator: unbiasedness, consistency, efficiency, sufficiency. Gauss-Markov: OLS is BLUE (Best Linear Unbiased Estimator).
Critical z: 90 % → 1.645; 95 % → 1.96; 99 % → 2.58.
Sample size: n = (z σ / E)² for mean; n = z² p(1−p)/E² for proportion.