45 Data: Collection and classification of data

45.1 Concept of Data

Data are raw facts and figures (observations, measurements, responses) collected to answer research questions. Once processed and given context, data become information; when patterns are extracted from that information, it becomes knowledge. Research is fundamentally an exercise in transforming data into evidence. The credibility of any analysis depends on (a) the quality of data collection and (b) the appropriateness of classification, that is, turning chaotic raw observations into an orderly, comparable form ready for analysis.

45.2 Types of Data

Major Classifications of Data

Basis	Categories
Source	Primary · Secondary
Nature	Quantitative · Qualitative
Measurement scale	Nominal · Ordinal · Interval · Ratio
Time	Cross-sectional · Time-series · Panel / Longitudinal
Form	Continuous · Discrete
Granularity	Individual · Aggregated
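
The time-based categories in the table can be told apart mechanically by counting units and periods. The following is a minimal sketch, assuming each observation is a hypothetical `(unit, period)` pair; the function name `data_structure` and the label "repeated cross-sections" (different units in each period, so not a true panel) are illustrative and not part of the syllabus text.

```python
from collections import defaultdict

def data_structure(records):
    """Label (unit, period) records by their time structure."""
    periods_by_unit = defaultdict(set)
    for unit, period in records:
        periods_by_unit[unit].add(period)

    all_periods = set().union(*periods_by_unit.values())
    if len(all_periods) <= 1:
        return "cross-sectional"      # many units, one point in time
    if len(periods_by_unit) == 1:
        return "time-series"          # one unit, many points in time
    # A panel needs the SAME units observed in more than one period.
    tracked = all(len(p) > 1 for p in periods_by_unit.values())
    return "panel" if tracked else "repeated cross-sections"

# Hypothetical observations: (unit id, year)
cross = [("HH1", 2023), ("HH2", 2023), ("HH3", 2023)]
series = [("India", 2021), ("India", 2022), ("India", 2023)]
panel = [("HH1", 2022), ("HH1", 2023), ("HH2", 2022), ("HH2", 2023)]

for name, recs in [("cross", cross), ("series", series), ("panel", panel)]:
    print(name, "->", data_structure(recs))
```

The check on repeated units mirrors the definition used later in the practice set: panel data track the same sample across time.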

45.3 Scales of Measurement — Stevens (1946)

S.S. Stevens (1946) proposed the four-level taxonomy of measurement scales.

Four Measurement Scales

Scale	Defining property	Example	Permissible statistics
Nominal	Categories only; no order	Gender, religion, jersey numbers	Mode, frequencies, χ²
Ordinal	Ordered categories; intervals not equal	Customer satisfaction (1-5), army ranks	Median, percentiles, Spearman ρ
Interval	Equal intervals; no true zero	Temperature in °C, calendar year	Mean, SD, Pearson r
Ratio	Equal intervals + true zero	Weight, height, income, age	All statistics including GM, CV

Distractor warning

PYQ trap: The interval scale has no true zero: 0 °C does not mean “no temperature”, so 20 °C is not “twice as hot” as 10 °C. The ratio scale has a true zero: 0 kg means no weight, so ratios (twice as heavy) are meaningful.
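
The true-zero distinction can be checked numerically. The sketch below is illustrative only: it assumes two sample readings (10 and 20) and uses the standard °C to °F formula and an approximate kg to lb factor. It shows that a ratio of interval-scale readings changes with the unit, while a ratio of ratio-scale readings does not.

```python
import math

def celsius_to_fahrenheit(c):
    """Convert between two temperature units that both have an arbitrary zero."""
    return c * 9 / 5 + 32

def kg_to_lb(kg):
    """Convert weight; zero stays zero in both units."""
    return kg * 2.20462  # approximate conversion factor

# Interval scale: the ratio of two readings depends on the unit chosen.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Differences, however, stay comparable on an interval scale.
diff_ratio_c = (30 - 20) / (20 - 10)
diff_ratio_f = (celsius_to_fahrenheit(30) - celsius_to_fahrenheit(20)) / (
    celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
)

# Ratio scale: the ratio survives a change of unit.
ratio_kg = 20 / 10
ratio_lb = kg_to_lb(20) / kg_to_lb(10)

print("Temperature ratio in C vs F:", ratio_celsius, round(ratio_fahrenheit, 2))
print("Difference ratio in C vs F:", diff_ratio_c, diff_ratio_f)
print("Weight ratio in kg vs lb:", ratio_kg, round(ratio_lb, 6))
```

This is why statistics that rely on ratios, such as the geometric mean and the coefficient of variation, are listed only against the ratio scale.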

45.4 Primary vs Secondary Data

Dimension	Primary	Secondary
Source	Collected first-hand by the researcher	Already available from other sources
Specific to research?	Yes — tailored	No — collected for other purposes
Cost	Higher	Lower
Time	Longer	Shorter
Reliability	Researcher controls quality	Depends on original source
Examples	Survey, experiment, interview, observation	Government reports, RBI, NSO, IMF, World Bank, company filings, journals

45.5 Methods of Collecting Primary Data

Primary Data Methods

Observation method — direct, structured/unstructured, participant/non-participant.
Interview method — personal, telephone, online; structured / semi-structured / unstructured / focus group.
Schedule / questionnaire — printed list of questions; schedule administered by enumerator.
Mail / online survey — self-administered.
Experiments — laboratory and field.
Panel — longitudinal sample tracked over time.
Projective techniques — Rorschach, word association, sentence completion (qualitative).

45.5.1 Questionnaire vs Schedule

Aspect	Questionnaire	Schedule
Administered by	Respondent (self-fill)	Enumerator (interviewer)
Cost	Low	High
Response rate	Low	High
Reach	Wide	Limited by field staff
Suitability	Literate respondents	Both literate and illiterate

45.6 Sources of Secondary Data

Major Secondary Sources

Government: NSO (National Statistical Office), MoSPI, RBI, SEBI, IRDAI, Ministry reports, Economic Survey.
International: IMF, World Bank, OECD, UN, ILO, BIS, IFC, ADB.
Trade and industry: CII, FICCI, ASSOCHAM, sector-specific bodies.
Company sources: annual reports, prospectuses, regulatory filings.
Research and academia: journals, books, working papers, theses.
Commercial databases: Bloomberg, Refinitiv (Eikon), CMIE Prowess, Capitaline, Ace Equity.
Mass media: newspapers, magazines, websites.

45.6.1 Caveats on Secondary Data

Evaluating Secondary Data

Reliability: trustworthiness of the original source.
Suitability: data collected for a different purpose may not fit the current question.
Adequacy: coverage of variables and time period.
Timeliness: currency of the data.
Definitions and units: may differ across sources.

45.7 Classification of Data

Classification is the process of arranging data into groups or classes according to some common characteristic.

Bases of Classification

Geographical / Spatial — by location (state-wise GDP).
Chronological / Temporal — by time (year-wise sales).
Qualitative — by attribute (gender, education).
Quantitative — by magnitude / value (income brackets).
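
The four bases amount to grouping the same records by different keys. A minimal sketch follows, assuming a small set of made-up records and hypothetical income brackets; none of the figures come from a real source.

```python
from collections import Counter

# Hypothetical records: (state, year, gender, monthly income in rupees)
records = [
    ("Kerala", 2022, "F", 18000),
    ("Kerala", 2023, "M", 42000),
    ("Punjab", 2022, "M", 27000),
    ("Punjab", 2023, "F", 65000),
    ("Kerala", 2023, "F", 31000),
]

def income_bracket(income):
    """Assign an income to an illustrative bracket (cut-offs are assumptions)."""
    if income < 25000:
        return "below 25,000"
    if income < 50000:
        return "25,000-49,999"
    return "50,000 and above"

geographical = Counter(state for state, _, _, _ in records)     # by location
chronological = Counter(year for _, year, _, _ in records)      # by time
qualitative = Counter(gender for _, _, gender, _ in records)    # by attribute
quantitative = Counter(income_bracket(inc) for *_, inc in records)  # by magnitude

for label, table in [
    ("Geographical", geographical),
    ("Chronological", chronological),
    ("Qualitative", qualitative),
    ("Quantitative", quantitative),
]:
    print(label, dict(table))
```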

45.7.1 Frequency Distribution

A frequency distribution classifies quantitative data into class intervals with their corresponding frequencies. Key terms:

Frequency Distribution Terminology

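Class interval: range of values covered by a class.
Class limits: lowest and highest values that can be included in the class (lower and upper).
Class boundaries: exact limits separating adjacent classes (for continuous data).
Class mark / mid-point: (lower limit + upper limit) / 2.
Class width: upper boundary − lower boundary.
Frequency (f): number of observations in the class.
Cumulative frequency: running total of frequencies.
Relative frequency: f / N, where N is the total number of observations.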
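
The terms above can be seen together in a small tabulation. This is a sketch under stated assumptions: the marks are invented, classes are exclusive (each class includes its lower boundary and excludes its upper boundary), and the helper name `frequency_table` is hypothetical.

```python
def frequency_table(values, lower, width, k):
    """Tabulate values into k exclusive classes [lower, lower + width), ..."""
    counts = [0] * k
    for v in values:
        idx = int((v - lower) // width)
        if not 0 <= idx < k:
            raise ValueError(f"value {v} falls outside the class range")
        counts[idx] += 1

    n = sum(counts)
    table, running = [], 0
    for i, f in enumerate(counts):
        lo = lower + i * width
        hi = lo + width
        running += f                       # cumulative frequency
        table.append({
            "class": (lo, hi),             # class boundaries
            "mid": (lo + hi) / 2,          # class mark
            "f": f,                        # frequency
            "cf": running,
            "rf": f / n if n else 0.0,     # relative frequency f / N
        })
    return table

# Hypothetical marks of 12 students
marks = [12, 25, 37, 8, 41, 19, 33, 28, 45, 22, 16, 39]
table = frequency_table(marks, lower=0, width=10, k=5)

for row in table:
    lo, hi = row["class"]
    print(f"{lo}-{hi}  mid={row['mid']}  f={row['f']}  cf={row['cf']}  rf={row['rf']:.3f}")
```

The last cumulative frequency equals N, and the relative frequencies sum to 1, which gives two quick checks on any hand-built table.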

45.7.2 Sturges’ Rule

For N observations, the suggested number of classes (k, rounded to a whole number) and class width (h) are:

\[k = 1 + 3.322 \log_{10} N\] \[h = \text{Range} / k\]

(H.A. Sturges, 1926.)
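
A minimal sketch of the rule in code, using the 3.322 constant quoted above. Rounding k up to the next whole number is an assumption made here; the rule itself yields a fractional value, and the sample data are invented.

```python
import math

def sturges_classes(n):
    """Number of classes k = 1 + 3.322 * log10(N), rounded up (an assumption)."""
    if n < 1:
        raise ValueError("need at least one observation")
    return math.ceil(1 + 3.322 * math.log10(n))

def sturges_width(values):
    """Class width h = Range / k for a list of observations."""
    k = sturges_classes(len(values))
    data_range = max(values) - min(values)
    return k, data_range / k

# Hypothetical data: 100 observations running from 0 to 99
values = list(range(100))
k, h = sturges_width(values)
print("k =", k, "h =", h)
```

For N = 100 the unrounded value is 7.644, so the sketch reports 8 classes. In practice h is usually rounded to a convenient figure.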

45.8 Tabulation

Tabulation is “the orderly arrangement of data in rows and columns” to facilitate comparison and analysis. Components of a statistical table: title, head-note, captions (column headings), stubs (row headings), body, footnotes, source.

45.9 Graphical Presentation

Common Graphs

Data type	Graph
Categorical	Bar chart, Pie chart
Frequency distribution	Histogram, Frequency polygon, Ogive
Time series	Line chart, Z-chart
Two-variable	Scatter plot
Distribution shape	Box plot
Geographic	Cartogram
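
For the ogive, the plotted points are cumulative frequencies set against class boundaries. The sketch below computes the points of a "less-than" ogive, assuming hypothetical class boundaries and frequencies; the function name is illustrative and no plotting library is used.

```python
from itertools import accumulate

def less_than_ogive(boundaries, freqs):
    """Return (boundary, cumulative frequency) points for a less-than ogive.

    boundaries must hold one more entry than freqs: the lower boundary of
    the first class followed by the upper boundary of every class.
    """
    if len(boundaries) != len(freqs) + 1:
        raise ValueError("need len(freqs) + 1 class boundaries")
    cumulative = [0, *accumulate(freqs)]   # starts at 0 below the first class
    return list(zip(boundaries, cumulative))

# Hypothetical distribution: classes 0-10, 10-20, ..., 40-50
points = less_than_ogive([0, 10, 20, 30, 40, 50], [1, 3, 3, 3, 2])
for x, cf in points:
    print(f"less than {x}: {cf}")
```

Joining these points gives the cumulative frequency curve; a histogram of the same data would instead use the individual class frequencies as bar heights.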

Flowchart: data sources and bases of classification
Data → Primary → Observation, Interview, Questionnaire, Experiment
Data → Secondary
Data → Classification → Geographical, Chronological, Qualitative, Quantitative

45.10 Practice Questions

Q 01 · Stevens · Easy

The four-level taxonomy of measurement scales was given by:

A. Likert
B. S.S. Stevens (1946)
C. Sturges
D. Galton

Correct Option: B

**Stevens 1946** — Nominal, Ordinal, Interval, Ratio.

Q 02 · Scales · Medium

Match each scale with its example:

	Scale		Example
(i)	Nominal	(a)	Temperature in °C
(ii)	Ordinal	(b)	Income in ₹
(iii)	Interval	(c)	Customer satisfaction 1-5
(iv)	Ratio	(d)	Gender (Male/Female)

A. (i)-(d), (ii)-(c), (iii)-(a), (iv)-(b)
B. (i)-(a), (ii)-(b), (iii)-(c), (iv)-(d)
C. (i)-(c), (ii)-(d), (iii)-(b), (iv)-(a)
D. (i)-(b), (ii)-(a), (iii)-(d), (iv)-(c)

Correct Option: A

Nominal → Gender; Ordinal → Satisfaction; Interval → Temp °C; Ratio → Income.

Q 03 · Interval · Medium

Which scale has *equal intervals but no true zero*?

A. Nominal
B. Ordinal
C. Interval
D. Ratio

Correct Option: C

**Interval** — no true zero (e.g., temperature in °C).

Q 04 · Pri-Sec · Easy

Data collected through a researcher's own survey is:

A. Primary
B. Secondary
C. Aggregated
D. Tertiary

Correct Option: A

First-hand collection = **primary**.

Q 05 · Sturges · Medium

Sturges' rule for the number of classes is:

A. k = √N
B. k = 1 + 3.322 log₁₀ N
C. k = N/10
D. k = 2N

Correct Option: B

**k = 1 + 3.322 log₁₀ N** (Sturges 1926).

Q 06 · Schedule · Medium

A schedule is administered by:

A. Respondent only
B. Enumerator / interviewer
C. Online robot
D. No one (self-recorded)

Correct Option: B

Schedule — administered by *enumerator*; questionnaire — self-filled.

Q 07 · Classification · Medium

Match each basis with its example:

	Basis		Example
(i)	Geographical	(a)	Year-wise sales
(ii)	Chronological	(b)	State-wise GDP
(iii)	Qualitative	(c)	Income brackets
(iv)	Quantitative	(d)	Gender / education

A. (i)-(b), (ii)-(a), (iii)-(d), (iv)-(c)
B. (i)-(a), (ii)-(b), (iii)-(c), (iv)-(d)
C. (i)-(c), (ii)-(d), (iii)-(b), (iv)-(a)
D. (i)-(d), (ii)-(c), (iii)-(b), (iv)-(a)

Correct Option: A

Geographical — state-wise; Chrono — year-wise; Qualitative — attribute; Quantitative — value-brackets.

Q 08 · Ratio · Medium

Which scale permits all mathematical operations including ratios?

A. Nominal
B. Ordinal
C. Interval
D. Ratio

Correct Option: D

**Ratio** — true zero; meaningful ratios; permits GM, CV.

Q 09 · Histogram · Easy

A histogram is used to represent:

A. Time series
B. Frequency distribution of continuous data
C. Categorical comparison
D. Two-variable scatter

Correct Option: B

Histogram — continuous frequency distribution.

Q 10 · Ogive · Medium

A graph of *cumulative* frequency distribution is called:

A. Histogram
B. Ogive
C. Pie chart
D. Z-chart

Correct Option: B

**Ogive** — cumulative frequency curve.

Q 11 · Secondary source · Medium

Which is a *secondary* source of Indian commercial / corporate data?

A. RBI publications
B. CMIE Prowess
C. NSO / MoSPI
D. All of the above

Correct Option: D

All are common secondary sources in India.

Q 12 · Class width · Medium

N = 100. Approximate number of classes by Sturges' rule:

A. 5
B. 7-8
C. 12-15
D. 20

Correct Option: B

k = 1 + 3.322 × log₁₀ 100 = 1 + 3.322 × 2 = 7.644, i.e. **7-8 classes**.

Q 13 · Ordinal · Medium

A 5-point Likert scale ("strongly disagree → strongly agree") is best classified as:

A. Nominal
B. Ordinal
C. Interval
D. Ratio

Correct Option: B

Likert is strictly **ordinal** (often treated as interval in practice).

Q 14 · Panel · Medium

Tracking the *same sample* of households over several years yields:

A. Cross-sectional data
B. Time-series data
C. Panel / longitudinal data
D. Categorical data

Correct Option: C

**Panel data** — same units across time.

Q 15 · Quest vs Sched · Medium

Which statement is **true** of a questionnaire compared with a schedule?

A. The response rate of a questionnaire is typically lower than that of a schedule
B. A schedule has lower cost
C. A questionnaire is administered by an enumerator
D. Both are identical

Correct Option: A

Self-administered questionnaire — lower response rate; lower cost.

Q 16 · Observation · Medium

"Researcher records customer behaviour in a shop without their knowledge." This is:

A. Participant observation
B. Non-participant observation
C. Interview
D. Experiment

Correct Option: B

**Non-participant** — observer does not engage; covert observation.

Q 17 · Pie · Easy

A pie chart is most suitable for:

A. Time-series
B. Showing parts of a whole as percentages
C. Continuous distribution
D. Outliers

Correct Option: B

Pie — composition / shares.

Q 18 · Information · Easy

Information is:

A. Raw data
B. Processed data with context
C. Knowledge derived from analysis
D. Wisdom

Correct Option: B

Data → Information → Knowledge → Wisdom.

Q 19 · Secondary caveat · Medium

A researcher using secondary data should especially check:

A. Reliability, suitability, adequacy, timeliness
B. Personal opinions of original collector
C. Statistical sophistication only
D. Only cost

Correct Option: A

Key checks: reliability, suitability, adequacy, timeliness (section 45.6.1 also lists consistency of definitions and units).

Q 20 · Continuous · Easy

Number of children in a family is:

A. Continuous
B. Discrete
C. Qualitative
D. Ordinal

Correct Option: B

Count → **discrete** (integer-valued).

45.11 Quick Recall

Data → Information → Knowledge → Wisdom.
Types: Primary vs Secondary; Quantitative vs Qualitative; Nominal/Ordinal/Interval/Ratio (Stevens 1946); Cross-sectional/Time-series/Panel; Discrete vs Continuous.
Stevens scales: Nominal (mode), Ordinal (median), Interval (mean; no true zero, e.g., °C), Ratio (all stats; true zero, e.g., weight).
Primary methods: Observation, Interview, Questionnaire, Schedule, Experiment, Panel, Projective.
Questionnaire (self-fill, low cost, low response) vs Schedule (enumerator-administered).
Secondary sources (India): RBI, NSO/MoSPI, SEBI, Ministry reports, CMIE Prowess, Capitaline. Check reliability, suitability, adequacy, timeliness.
Classification bases: Geographical, Chronological, Qualitative, Quantitative.
Sturges’ rule: k = 1 + 3.322 log₁₀ N; class width h = Range/k.
Graphs: Histogram, frequency polygon, Ogive (cumulative), Bar chart, Pie chart, Scatter plot.