45  Data: Collection and classification of data

45.1 Concept of Data

Data are raw facts and figures — observations, measurements, responses — collected to answer research questions. Once processed and given context, data become information, and when patterns are extracted from information, they become knowledge. Research is fundamentally an exercise in transforming data into evidence. The credibility of any analysis depends on (a) the quality of data collection, and (b) the appropriateness of classification — turning chaotic raw observations into orderly, comparable form ready for analysis.

45.2 Types of Data

Tip: Major Classifications of Data

| Basis | Categories |
|---|---|
| Source | Primary · Secondary |
| Nature | Quantitative · Qualitative |
| Measurement scale | Nominal · Ordinal · Interval · Ratio |
| Time | Cross-sectional · Time-series · Panel / Longitudinal |
| Form | Continuous · Discrete |
| Granularity | Individual · Aggregated |
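
The time-based categories are easiest to tell apart on a concrete dataset. The sketch below is illustrative only: the household records, the variable names and the income figures are all invented, and it simply shows how one set of observations can be read as cross-sectional, time-series or panel data.

```python
# Hypothetical toy records: (household_id, year, income). All values are made up.
records = [
    ("H1", 2021, 310), ("H1", 2022, 325), ("H1", 2023, 340),
    ("H2", 2021, 280), ("H2", 2022, 295), ("H2", 2023, 300),
]

# Cross-sectional: many units observed at one point in time.
cross_section = [(h, inc) for h, yr, inc in records if yr == 2022]

# Time-series: one unit observed over successive periods.
time_series = [(yr, inc) for h, yr, inc in records if h == "H1"]

# Panel / longitudinal: the same units observed in every period.
# Here every household appears in every year (a balanced panel).
units = {h for h, _, _ in records}
years = {yr for _, yr, _ in records}
is_balanced_panel = len(records) == len(units) * len(years)

print(cross_section)
print(time_series)
print(is_balanced_panel)
```

The same six rows serve all three views; what changes is whether the analysis fixes the period, fixes the unit, or keeps both dimensions.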

45.3 Scales of Measurement — Stevens (1946)

S.S. Stevens (1946) proposed the four-level taxonomy of measurement scales.

Tip: Four Measurement Scales

| Scale | Properties | Example | Permissible statistics |
|---|---|---|---|
| Nominal | Categories only; no order | Gender, religion, jersey numbers | Mode, frequencies, χ² |
| Ordinal | Ordered categories; intervals not equal | Customer satisfaction (1-5), army ranks | Median, percentiles, Spearman ρ |
| Interval | Equal intervals; no true zero | Temperature in °C, calendar year | Mean, SD, Pearson r |
| Ratio | Equal intervals + true zero | Weight, height, income, age | All statistics including GM, CV |

Each scale also permits the statistics of the scales above it in the table.
Note: Distractor warning

PYQ trap: Interval scale has no true zero — 0 °C does not mean “no temperature”. Ratio has a true zero — 0 kg means no weight, and ratios (twice as heavy) are meaningful.
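
The interval-versus-ratio distinction can be checked with a little arithmetic. The following is a minimal sketch with invented sample values; it uses only Python's standard `statistics` module, and the kelvin conversion is added here purely to illustrate why a ratio of Celsius readings is not meaningful.

```python
import statistics

# Hypothetical sample values, invented for illustration.
jersey_numbers = [7, 10, 7, 18, 7]   # nominal: numbers act only as labels
satisfaction = [1, 3, 4, 4, 5]       # ordinal: 1-5 rating
temp_c = [10.0, 20.0]                # interval: degrees Celsius
weight_kg = [10.0, 20.0]             # ratio: kilograms

# Nominal data: only counting is meaningful, so the mode is the usual summary.
nominal_mode = statistics.mode(jersey_numbers)

# Ordinal data: order is meaningful, so the median can be used.
ordinal_median = statistics.median(satisfaction)

# Ratio scale: zero is a true zero, so 20 kg really is twice 10 kg.
weight_ratio = weight_kg[1] / weight_kg[0]

# Interval scale: 20 / 10 = 2 is NOT "twice as hot".
# Expressed in kelvin, which has a true zero, the ratio is only about 1.035.
kelvin = [t + 273.15 for t in temp_c]
temp_ratio_kelvin = kelvin[1] / kelvin[0]

# Differences on an interval scale remain meaningful.
temp_difference = temp_c[1] - temp_c[0]

print(nominal_mode, ordinal_median, weight_ratio)
print(round(temp_ratio_kelvin, 3), temp_difference)
```

Averaging the jersey numbers would run without error but would have no interpretation, which is the practical meaning of "permissible statistic".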

45.4 Primary vs Secondary Data

Tip: Primary vs Secondary Data

| Dimension | Primary | Secondary |
|---|---|---|
| Source | Collected first-hand by the researcher | Already available from other sources |
| Specific to research? | Yes, tailored | No, collected for other purposes |
| Cost | Higher | Lower |
| Time | Longer | Shorter |
| Reliability | Researcher controls quality | Depends on original source |
| Examples | Survey, experiment, interview, observation | Government reports, RBI, NSO, IMF, World Bank, company filings, journals |

45.5 Methods of Collecting Primary Data

Tip: Primary Data Methods
  • Observation method — direct, structured/unstructured, participant/non-participant.
  • Interview method — personal, telephone, online; structured / semi-structured / unstructured / focus group.
  • Schedule / questionnaire — printed list of questions; schedule administered by enumerator.
  • Mail / online survey — self-administered.
  • Experiments — laboratory and field.
  • Panel — longitudinal sample tracked over time.
  • Projective techniques — Rorschach, word association, sentence completion (qualitative).

45.5.1 Questionnaire vs Schedule

Tip: Questionnaire vs Schedule

| Aspect | Questionnaire | Schedule |
|---|---|---|
| Administered by | Respondent (self-fill) | Enumerator (interviewer) |
| Cost | Low | High |
| Response rate | Low | High |
| Reach | Wide | Limited by field staff |
| Suitability | Literate respondents | Both literate and illiterate |

45.6 Sources of Secondary Data

Tip: Major Secondary Sources
  • Government: NSO (National Statistical Office), MoSPI, RBI, SEBI, IRDAI, Ministry reports, Economic Survey.
  • International: IMF, World Bank, OECD, UN, ILO, BIS, IFC, ADB.
  • Trade and industry: CII, FICCI, ASSOCHAM, sector-specific bodies.
  • Company sources: annual reports, prospectuses, regulatory filings.
  • Research and academia: journals, books, working papers, theses.
  • Commercial databases: Bloomberg, Refinitiv (Eikon), CMIE Prowess, Capitaline, Ace Equity.
  • Mass media: newspapers, magazines, websites.

45.6.1 Caveats on Secondary Data

Tip: Evaluating Secondary Data
  • Reliability: how trustworthy the original source is.
  • Suitability: data collected for a different purpose may not fit the current research question.
  • Adequacy: coverage of the required variables and time period.
  • Timeliness: how current the data are.
  • Definitions and units: these may differ across sources.

45.7 Classification of Data

Classification is the process of arranging data into groups or classes according to some common characteristic.

Tip: Bases of Classification
  • Geographical / Spatial — by location (state-wise GDP).
  • Chronological / Temporal — by time (year-wise sales).
  • Qualitative — by attribute (gender, education).
  • Quantitative — by magnitude / value (income brackets).
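
The four bases amount to grouping the same records by different keys. The sketch below assumes a small invented dataset and a hypothetical helper named `classify`; the income bracket width of 25,000 is an arbitrary choice made for illustration.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration: (state, year, gender, income).
records = [
    ("Kerala", 2022, "F", 42000),
    ("Kerala", 2023, "M", 18000),
    ("Punjab", 2022, "F", 27000),
    ("Punjab", 2023, "F", 61000),
]

def classify(rows, key):
    """Group rows into classes that share the value returned by `key`."""
    classes = defaultdict(list)
    for row in rows:
        classes[key(row)].append(row)
    return dict(classes)

def income_bracket(row):
    """Assumed brackets of width 25,000; the cut-offs are illustrative only."""
    lower = (row[3] // 25000) * 25000
    return f"{lower}-{lower + 24999}"

geographical = classify(records, key=lambda r: r[0])   # by location
chronological = classify(records, key=lambda r: r[1])  # by time
qualitative = classify(records, key=lambda r: r[2])    # by attribute
quantitative = classify(records, key=income_bracket)   # by magnitude

# Number of records falling in each income bracket.
counts = {name: len(rows) for name, rows in quantitative.items()}
print(counts)
```

Only the key changes between the four calls, which is why a single dataset can be classified on several bases at once.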

45.7.1 Frequency Distribution

A frequency distribution classifies quantitative data into class intervals with their corresponding frequencies. Key terms:

Tip: Frequency Distribution Terminology
  • Class interval: the range of values covered by a class.
  • Class limits: the lower and upper values of a class.
  • Class boundaries: the exact limits (for continuous data).
  • Class mark / mid-point: average of the two limits, (lower + upper) / 2.
  • Class width: upper boundary − lower boundary.
  • Frequency (f): number of observations in the class.
  • Cumulative frequency: running total of frequencies.
  • Relative frequency: f / N, where N is the total number of observations.
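
The terms above can be seen together in a small worked table. This is a sketch under stated assumptions: the marks are invented, the classes (10-20, 20-30, 30-40, 40-50) are chosen by hand, and the exclusive method is used, so a value equal to an upper limit is counted in the next class.

```python
# Hypothetical marks of 10 students, invented for illustration.
marks = [12, 18, 25, 27, 31, 34, 36, 42, 45, 49]

lower, width, k = 10, 10, 4  # assumed classes: 10-20, 20-30, 30-40, 40-50

table = []
cumulative = 0
n = len(marks)
for i in range(k):
    lo = lower + i * width
    hi = lo + width
    # Exclusive method: a value equal to the upper limit goes to the next class.
    f = sum(1 for x in marks if lo <= x < hi)
    cumulative += f
    table.append({
        "class": f"{lo}-{hi}",
        "class_mark": (lo + hi) / 2,   # mid-point of the class
        "f": f,                        # frequency
        "cum_f": cumulative,           # running total
        "rel_f": f / n,                # share of all observations
    })

for row in table:
    print(row)
```

The last cumulative frequency equals N and the relative frequencies sum to 1, which are two quick checks on any hand-built table.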

45.7.2 Sturges’ Rule

For N observations, the rule suggests the number of classes (k); the class width (h) then follows from the range of the data:

\[k = 1 + 3.322 \log_{10} N\]

\[h = \frac{\text{Range}}{k}\]

(H.A. Sturges, 1926.)
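
As a quick numerical check of the rule, the sketch below evaluates k for N = 100 and derives a class width. The function names, the data range of 80 and the choice to round k up are assumptions made for illustration; some texts round to the nearest integer instead.

```python
import math

def sturges_classes(n):
    """Number of classes k = 1 + 3.322 * log10(N), before rounding."""
    return 1 + 3.322 * math.log10(n)

def class_width(data_range, k):
    """Class width h = Range / k."""
    return data_range / k

n = 100
k_exact = sturges_classes(n)   # 1 + 3.322 * 2 = 7.644
k = math.ceil(k_exact)         # rounding up is an assumption
h = class_width(80, k)         # assumed range of 80 (e.g. values from 10 to 90)

print(round(k_exact, 3), k, h)
```

Because k grows with the logarithm of N, even a tenfold increase in sample size adds only about 3.3 classes.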

45.8 Tabulation

Tabulation is “the orderly arrangement of data in rows and columns” to facilitate comparison and analysis. Components of a statistical table: title, head-note, captions, stubs, body, footnotes, source.

45.9 Graphical Presentation

Tip: Common Graphs

| Data type | Graph |
|---|---|
| Categorical | Bar chart, Pie chart |
| Frequency distribution | Histogram, Frequency polygon, Ogive |
| Time series | Line chart, Z-chart |
| Two-variable | Scatter plot |
| Distribution shape | Box plot |
| Geographic | Cartogram |

flowchart TB
  D[Data] --> P[Primary]
  D --> S[Secondary]
  P --> O[Observation]
  P --> I[Interview]
  P --> Q[Questionnaire]
  P --> X[Experiment]
  D --> CL[Classification]
  CL --> G[Geographical]
  CL --> C[Chronological]
  CL --> QU[Qualitative]
  CL --> QN[Quantitative]
    classDef default fill:#003366,color:#ffffff,stroke:#ffcc00,stroke-width:3px,rx:10px,ry:10px;

45.10 Practice Questions

Q 01 · Stevens · Easy

The four-level taxonomy of measurement scales was given by:

  • A. Likert
  • B. S.S. Stevens (1946)
  • C. Sturges
  • D. Galton
Correct Option: B
**Stevens 1946** — Nominal, Ordinal, Interval, Ratio.
Q 02 · Scales · Medium

Match each scale with its example:

| Scale | Example |
|---|---|
| (i) Nominal | (a) Temperature in °C |
| (ii) Ordinal | (b) Income in ₹ |
| (iii) Interval | (c) Customer satisfaction 1-5 |
| (iv) Ratio | (d) Gender (Male/Female) |

  • A. (i)-(d), (ii)-(c), (iii)-(a), (iv)-(b)
  • B. (i)-(a), (ii)-(b), (iii)-(c), (iv)-(d)
  • C. (i)-(c), (ii)-(d), (iii)-(b), (iv)-(a)
  • D. (i)-(b), (ii)-(a), (iii)-(d), (iv)-(c)
Correct Option: A
Nominal → Gender; Ordinal → Satisfaction; Interval → Temp °C; Ratio → Income.
Q 03 · Interval · Medium

Which scale has *equal intervals but no true zero*?

  • A. Nominal
  • B. Ordinal
  • C. Interval
  • D. Ratio
Correct Option: C
**Interval** — no true zero (e.g., temperature in °C).
Q 04 · Pri-Sec · Easy

Data collected through a researcher's own survey is:

  • A. Primary
  • B. Secondary
  • C. Aggregated
  • D. Tertiary
Correct Option: A
First-hand collection = **primary**.
Q 05 · Sturges · Medium

Sturges' rule for the number of classes is:

  • A. k = √N
  • B. k = 1 + 3.322 log₁₀ N
  • C. k = N/10
  • D. k = 2N
Correct Option: B
**k = 1 + 3.322 log₁₀ N** (Sturges 1926).
Q 06 · Schedule · Medium

A schedule is administered by:

  • A. Respondent only
  • B. Enumerator / interviewer
  • C. Online robot
  • D. No one (self-recorded)
Correct Option: B
Schedule — administered by *enumerator*; questionnaire — self-filled.
Q 07 · Classification · Medium

Match each basis with its example:

| Basis | Example |
|---|---|
| (i) Geographical | (a) Year-wise sales |
| (ii) Chronological | (b) State-wise GDP |
| (iii) Qualitative | (c) Income brackets |
| (iv) Quantitative | (d) Gender / education |

  • A. (i)-(b), (ii)-(a), (iii)-(d), (iv)-(c)
  • B. (i)-(a), (ii)-(b), (iii)-(c), (iv)-(d)
  • C. (i)-(c), (ii)-(d), (iii)-(b), (iv)-(a)
  • D. (i)-(d), (ii)-(c), (iii)-(b), (iv)-(a)
Correct Option: A
Geographical — state-wise; Chrono — year-wise; Qualitative — attribute; Quantitative — value-brackets.
Q 08 · Ratio · Medium

Which scale permits all mathematical operations including ratios?

  • A. Nominal
  • B. Ordinal
  • C. Interval
  • D. Ratio
Correct Option: D
**Ratio** — true zero; meaningful ratios; permits GM, CV.
Q 09 · Histogram · Easy

A histogram is used to represent:

  • A. Time series
  • B. Frequency distribution of continuous data
  • C. Categorical comparison
  • D. Two-variable scatter
Correct Option: B
Histogram — continuous frequency distribution.
Q 10 · Ogive · Medium

A graph of a *cumulative* frequency distribution is called:

  • A. Histogram
  • B. Ogive
  • C. Pie chart
  • D. Z-chart
Correct Option: B
**Ogive** — cumulative frequency curve.
Q 11 · Secondary source · Medium

Which is a *secondary* source of Indian commercial / corporate data?

  • A. RBI publications
  • B. CMIE Prowess
  • C. NSO / MoSPI
  • D. All of the above
Correct Option: D
All are common secondary sources in India.
Q 12 · Class width · Medium

N = 100. Approximate number of classes by Sturges' rule:

  • A. 5
  • B. 7-8
  • C. 12-15
  • D. 20
Correct Option: B
1 + 3.322 × log 100 = 1 + 3.322 × 2 = **≈ 7.6**.
Q 13 · Ordinal · Medium

A 5-point Likert scale ("strongly disagree → strongly agree") is best classified as:

  • A. Nominal
  • B. Ordinal
  • C. Interval
  • D. Ratio
Correct Option: B
Likert is strictly **ordinal** (often treated as interval in practice).
Q 14 · Panel · Medium

Tracking the *same sample* of households over several years yields:

  • A. Cross-sectional data
  • B. Time-series data
  • C. Panel / longitudinal data
  • D. Categorical data
Correct Option: C
**Panel data** — same units across time.
Q 15 · Quest vs Sched · Medium

Which is **true** of a questionnaire vs a schedule?

  • A. Questionnaire response rate is typically lower than that of a schedule
  • B. Schedule has lower cost
  • C. Questionnaire is administered by enumerator
  • D. Both are identical
Correct Option: A
Self-administered questionnaire — lower response rate; lower cost.
Q 16 · Observation · Medium

"Researcher records customer behaviour in a shop without their knowledge." This is:

  • A. Participant observation
  • B. Non-participant observation
  • C. Interview
  • D. Experiment
Correct Option: B
**Non-participant** — observer does not engage; covert observation.
Q 17 · Pie · Easy

A pie chart is most suitable for:

  • A. Time-series
  • B. Showing parts of a whole as percentages
  • C. Continuous distribution
  • D. Outliers
Correct Option: B
Pie — composition / shares.
Q 18 · Information · Easy

Information is:

  • A. Raw data
  • B. Processed data with context
  • C. Knowledge derived from analysis
  • D. Wisdom
Correct Option: B
Data → Information → Knowledge → Wisdom.
Q 19 · Secondary caveat · Medium

A researcher using secondary data should especially check:

  • A. Reliability, suitability, adequacy, timeliness
  • B. Personal opinions of original collector
  • C. Statistical sophistication only
  • D. Only cost
Correct Option: A
Four checks: reliability, suitability, adequacy, timeliness.
Q 20 · Continuous · Easy

Number of children in a family is:

  • A. Continuous
  • B. Discrete
  • C. Qualitative
  • D. Ordinal
Correct Option: B
Count → **discrete** (integer-valued).

45.11 Quick Recall

Important: Quick recall
  • Data → Information → Knowledge → Wisdom.
  • Types: Primary vs Secondary; Quantitative vs Qualitative; Nominal/Ordinal/Interval/Ratio (Stevens 1946); Cross-sectional/Time-series/Panel; Discrete vs Continuous.
  • Stevens scales: Nominal (mode), Ordinal (median), Interval (mean; no true zero, e.g., °C), Ratio (all stats; true zero, e.g., weight).
  • Primary methods: Observation, Interview, Questionnaire, Schedule, Experiment, Panel, Projective.
  • Questionnaire (self-fill, low cost, low response) vs Schedule (enumerator-administered).
  • Secondary sources (India): RBI, NSO/MoSPI, SEBI, Ministry reports, CMIE Prowess, Capitaline. Check reliability, suitability, adequacy, timeliness.
  • Classification bases: Geographical, Chronological, Qualitative, Quantitative.
  • Sturges’ rule: k = 1 + 3.322 log₁₀ N; class width h = Range/k.
  • Graphs: Histogram, frequency polygon, Ogive (cumulative), Bar chart, Pie chart, Scatter plot.