flowchart TB
D[Data] --> P[Primary]
D --> S[Secondary]
P --> O[Observation]
P --> I[Interview]
P --> Q[Questionnaire]
P --> X[Experiment]
D --> CL[Classification]
CL --> G[Geographical]
CL --> C[Chronological]
CL --> QU[Qualitative]
CL --> QN[Quantitative]
classDef default fill:#003366,color:#ffffff,stroke:#ffcc00,stroke-width:3px,rx:10px,ry:10px;
45 Data: Collection and classification of data
45.1 Concept of Data
Data are raw facts and figures (observations, measurements, responses) collected to answer research questions. Once processed and given context, data become information; when patterns are extracted from that information, it becomes knowledge. Research is fundamentally an exercise in transforming data into evidence. The credibility of any analysis depends on (a) the quality of data collection and (b) the appropriateness of classification, which turns chaotic raw observations into an orderly, comparable form ready for analysis.
45.2 Types of Data
| Basis | Categories |
|---|---|
| Source | Primary · Secondary |
| Nature | Quantitative · Qualitative |
| Measurement scale | Nominal · Ordinal · Interval · Ratio |
| Time | Cross-sectional · Time-series · Panel / Longitudinal |
| Form | Continuous · Discrete |
| Granularity | Individual · Aggregated |
45.3 Scales of Measurement — Stevens (1946)
S.S. Stevens (1946) proposed the four-level taxonomy of measurement scales.
| Scale | Defining property | Example | Permissible statistics |
|---|---|---|---|
| Nominal | Categories only; no order | Gender, religion, jersey numbers | Mode, frequencies, χ² |
| Ordinal | Ordered categories; intervals not equal | Customer satisfaction (1-5), army ranks | Median, percentiles, Spearman ρ |
| Interval | Equal intervals; no true zero | Temperature in °C, calendar year | Mean, SD, Pearson r |
| Ratio | Equal intervals + true zero | Weight, height, income, age | All statistics, including geometric mean (GM) and coefficient of variation (CV) |
PYQ trap: Interval scale has no true zero — 0 °C does not mean “no temperature”. Ratio has a true zero — 0 kg means no weight, and ratios (twice as heavy) are meaningful.
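The interval-versus-ratio distinction can be checked numerically. The sketch below is illustrative only: the function names, the sample readings and the rounded conversion factor are assumptions, not part of the chapter. It shows that a ratio of Celsius readings changes once the zero point is moved to Kelvin, whereas a ratio of weights survives a change of unit.

```python
# Illustrative sketch: ratios are meaningful on a ratio scale but not on
# an interval scale. All names and values here are hypothetical.

def celsius_to_kelvin(c: float) -> float:
    """Shift the arbitrary Celsius zero to absolute zero."""
    return c + 273.15

def kg_to_pounds(kg: float) -> float:
    """Change the unit without moving the zero point (approximate factor)."""
    return kg * 2.20462

# Interval scale: 20 °C looks like "twice" 10 °C, but that ratio depends
# entirely on where zero happens to sit.
ratio_celsius = 20 / 10
ratio_kelvin = celsius_to_kelvin(20) / celsius_to_kelvin(10)

# Ratio scale: 20 kg is twice 10 kg in any unit.
ratio_kg = 20 / 10
ratio_lb = kg_to_pounds(20) / kg_to_pounds(10)

print(f"Celsius ratio: {ratio_celsius:.3f}  Kelvin ratio: {ratio_kelvin:.3f}")
print(f"kg ratio: {ratio_kg:.3f}  lb ratio: {ratio_lb:.3f}")
```

The Celsius and Kelvin ratios disagree, so "twice as hot" is not a valid statement on the Celsius scale. The kilogram and pound ratios agree, so "twice as heavy" is valid.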
45.4 Primary vs Secondary Data
| Dimension | Primary | Secondary |
|---|---|---|
| Source | Collected first-hand by the researcher | Already available from other sources |
| Specific to research? | Yes — tailored | No — collected for other purposes |
| Cost | Higher | Lower |
| Time | Longer | Shorter |
| Reliability | Researcher controls quality | Depends on original source |
| Examples | Survey, experiment, interview, observation | Government reports, RBI, NSO, IMF, World Bank, company filings, journals |
45.5 Methods of Collecting Primary Data
- Observation method — direct, structured/unstructured, participant/non-participant.
- Interview method — personal, telephone, online; structured / semi-structured / unstructured / focus group.
- Schedule / questionnaire — printed list of questions; a questionnaire is filled in by the respondent, a schedule by an enumerator.
- Mail / online survey — self-administered.
- Experiments — laboratory and field.
- Panel — longitudinal sample tracked over time.
- Projective techniques — Rorschach, word association, sentence completion (qualitative).
45.5.1 Questionnaire vs Schedule
| Aspect | Questionnaire | Schedule |
|---|---|---|
| Administered by | Respondent (self-fill) | Enumerator (interviewer) |
| Cost | Low | High |
| Response rate | Low | High |
| Reach | Wide | Limited by field staff |
| Suitability | Literate respondents | Both literate and illiterate |
45.6 Sources of Secondary Data
- Government: NSO (National Statistical Office), MoSPI, RBI, SEBI, IRDAI, Ministry reports, Economic Survey.
- International: IMF, World Bank, OECD, UN, ILO, BIS, IFC, ADB.
- Trade and industry: CII, FICCI, ASSOCHAM, sector-specific bodies.
- Company sources: annual reports, prospectuses, regulatory filings.
- Research and academia: journals, books, working papers, theses.
- Commercial databases: Bloomberg, Refinitiv (Eikon), CMIE Prowess, Capitaline, Ace Equity.
- Mass media: newspapers, magazines, websites.
45.6.1 Caveats on Secondary Data
- Reliability of original source.
- Suitability — data collected for a different purpose may not fit the current question.
- Adequacy — coverage of variables, time period.
- Timeliness — currency of data.
- Definitions and units — may differ across sources.
45.7 Classification of Data
Classification is the process of arranging data into groups or classes according to some common characteristic.
- Geographical / Spatial — by location (state-wise GDP).
- Chronological / Temporal — by time (year-wise sales).
- Qualitative — by attribute (gender, education).
- Quantitative — by magnitude / value (income brackets).
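The four bases can be applied to one and the same set of records. The following is a minimal sketch under stated assumptions: the records, the field order and the income cut-offs in `income_bracket` are all invented for illustration.

```python
from collections import Counter

# Hypothetical raw records: (state, year, gender, monthly income in rupees).
records = [
    ("Kerala", 2022, "Female", 18000),
    ("Kerala", 2023, "Male", 42000),
    ("Punjab", 2022, "Male", 27000),
    ("Punjab", 2023, "Female", 65000),
    ("Kerala", 2023, "Female", 31000),
]

def income_bracket(income: int) -> str:
    """Assign an income to a bracket (cut-offs are illustrative assumptions)."""
    if income < 25000:
        return "Below 25,000"
    if income < 50000:
        return "25,000-49,999"
    return "50,000 and above"

# One Counter per basis of classification.
geographical = Counter(state for state, _, _, _ in records)
chronological = Counter(year for _, year, _, _ in records)
qualitative = Counter(gender for _, _, gender, _ in records)
quantitative = Counter(income_bracket(income) for _, _, _, income in records)

for name, table in [("Geographical", geographical),
                    ("Chronological", chronological),
                    ("Qualitative", qualitative),
                    ("Quantitative", quantitative)]:
    print(name, dict(table))
```

The raw data never change; only the characteristic used for grouping does, which is what distinguishes one basis of classification from another.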
45.7.1 Frequency Distribution
A frequency distribution classifies quantitative data into class intervals with their corresponding frequencies. Key terms:
- Class interval — range of values.
- Class limits — lower and upper.
- Class boundaries — exact limits (for continuous data).
- Class mark / mid-point — (lower limit + upper limit) / 2.
- Class width — upper boundary − lower boundary.
- Frequency (f) — number of observations in the class.
- Cumulative frequency — running total.
- Relative frequency — f / N.
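The terms above can be seen together in a small worked example. This is a sketch, not a prescribed method: the marks, the class scheme (exclusive classes of width 10 starting at 10) and the function name `frequency_table` are assumptions made for illustration.

```python
# Hypothetical marks of 20 students (scores out of 50).
marks = [12, 17, 23, 25, 28, 31, 31, 34, 35, 36,
         38, 39, 41, 42, 42, 44, 45, 47, 48, 49]

def frequency_table(values, start, width, n_classes):
    """Build rows for exclusive classes [lower, upper): upper limit excluded."""
    total = len(values)
    rows, cumulative = [], 0
    for i in range(n_classes):
        lower = start + i * width
        upper = lower + width
        f = sum(1 for v in values if lower <= v < upper)
        cumulative += f
        rows.append({
            "class": f"{lower}-{upper}",
            "mid_point": (lower + upper) / 2,   # class mark
            "frequency": f,
            "cumulative": cumulative,           # running total
            "relative": f / total,              # f / N
        })
    return rows

table = frequency_table(marks, start=10, width=10, n_classes=4)

print(f"{'Class':<8}{'Mid':>8}{'f':>6}{'cf':>6}{'rf':>8}")
for row in table:
    print(f"{row['class']:<8}{row['mid_point']:>8.1f}{row['frequency']:>6}"
          f"{row['cumulative']:>6}{row['relative']:>8.2f}")
```

The last cumulative frequency equals N and the relative frequencies sum to 1, which makes a quick check on any hand-built table.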
45.7.2 Sturges’ Rule
Number of classes (k) and class width (h):
\[k = 1 + 3.322 \log_{10} N\]
\[h = \frac{\text{Range}}{k}\]
(H.A. Sturges, 1926.)
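Sturges' rule can be applied mechanically, as in the sketch below. The function name `sturges` and the sample data are assumptions. Rounding k up to the next whole number is also an assumed convention; some texts round to the nearest integer instead.

```python
import math

def sturges(values):
    """Return (k, h): number of classes and class width by Sturges' rule."""
    n = len(values)
    # k = 1 + 3.322 log10(N), rounded up to a whole class (assumed convention).
    k = math.ceil(1 + 3.322 * math.log10(n))
    # h = Range / k
    h = (max(values) - min(values)) / k
    return k, h

# Hypothetical data set: the integers 1 to 100, so N = 100 and Range = 99.
k, h = sturges(list(range(1, 101)))
print(f"k = {k}, h = {h:.3f}")
```

For N = 100 the formula gives 1 + 3.322 × 2 = 7.644, so about 8 classes. In practice h is then rounded to a convenient width.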
45.8 Tabulation
Tabulation is “the orderly arrangement of data in rows and columns” to facilitate comparison and analysis. Components of a statistical table: title, head-note, captions, stubs, body, footnotes, source.
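The components listed above can be assembled into a simple two-way table. This sketch is illustrative only: the counts, labels and the helper `render_table` are invented, and the plain-text layout is just one possible arrangement.

```python
# Hypothetical counts: stubs are row labels, captions are column labels.
title = "Table 1: Respondents by gender and education (illustrative data)"
head_note = "(Number of persons)"
captions = ["Graduate", "Non-graduate"]
stubs = ["Female", "Male"]
body = [[30, 20],
        [25, 25]]
footnote = "Note: figures are invented for illustration."
source = "Source: hypothetical survey."

def render_table(title, head_note, captions, stubs, body, footnote, source):
    """Return the table as text lines, adding row and column totals."""
    lines = [title, head_note]
    lines.append(f"{'':<10}" + "".join(f"{c:>14}" for c in captions)
                 + f"{'Total':>14}")
    for stub, row in zip(stubs, body):
        lines.append(f"{stub:<10}" + "".join(f"{v:>14}" for v in row)
                     + f"{sum(row):>14}")
    col_totals = [sum(col) for col in zip(*body)]
    lines.append(f"{'Total':<10}" + "".join(f"{v:>14}" for v in col_totals)
                 + f"{sum(col_totals):>14}")
    lines += [footnote, source]
    return lines

lines = render_table(title, head_note, captions, stubs, body, footnote, source)
print("\n".join(lines))
```

Each argument corresponds to one named component of a statistical table, so a missing title, footnote or source is easy to spot.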
45.9 Graphical Presentation
| Data type | Graph |
|---|---|
| Categorical | Bar chart, Pie chart |
| Frequency distribution | Histogram, Frequency polygon, Ogive |
| Time series | Line chart, Z-chart |
| Two-variable | Scatter plot |
| Distribution shape | Box plot |
| Geographic | Cartogram |
45.10 Practice Questions
The four-level taxonomy of measurement scales was given by:
Match each scale with its example:
| | Scale | | Example |
|---|---|---|---|
| (i) | Nominal | (a) | Temperature in °C |
| (ii) | Ordinal | (b) | Income in ₹ |
| (iii) | Interval | (c) | Customer satisfaction 1-5 |
| (iv) | Ratio | (d) | Gender (Male/Female) |
Which scale has *equal intervals but no true zero*?
Data collected through a researcher's own survey is:
Sturges' rule for the number of classes is:
A schedule is administered by:
Match each basis with its example:
| | Basis | | Example |
|---|---|---|---|
| (i) | Geographical | (a) | Year-wise sales |
| (ii) | Chronological | (b) | State-wise GDP |
| (iii) | Qualitative | (c) | Income brackets |
| (iv) | Quantitative | (d) | Gender / education |
Which scale permits all mathematical operations including ratios?
Histogram is used to represent:
A graph of *cumulative* frequency distribution is called:
Which is a *secondary* source of Indian commercial / corporate data?
N = 100. Approximate number of classes by Sturges' rule:
A 5-point Likert scale ("strongly disagree → strongly agree") is best classified as:
Tracking the *same sample* of households over several years yields:
Which is **true** of a questionnaire vs schedule?
"Researcher records customer behaviour in a shop without their knowledge." This is:
Pie chart is most suitable for:
Information is:
A researcher using secondary data should especially check:
Number of children in a family is:
45.11 Quick Recall
- Data → Information → Knowledge → Wisdom.
- Types: Primary vs Secondary; Quantitative vs Qualitative; Nominal/Ordinal/Interval/Ratio (Stevens 1946); Cross-sectional/Time-series/Panel; Discrete vs Continuous.
- Stevens scales: Nominal (mode), Ordinal (median), Interval (mean; no true zero, e.g., °C), Ratio (all stats; true zero, e.g., weight).
- Primary methods: Observation, Interview, Questionnaire, Schedule, Experiment, Panel, Projective.
- Questionnaire (self-fill, low cost, low response) vs Schedule (enumerator-administered).
- Secondary sources (India): RBI, NSO/MoSPI, SEBI, Ministry reports, CMIE Prowess, Capitaline. Check reliability, suitability, adequacy, timeliness, and consistency of definitions and units.
- Classification bases: Geographical, Chronological, Qualitative, Quantitative.
- Sturges’ rule: k = 1 + 3.322 log₁₀ N; class width h = Range/k.
- Graphs: Histogram, frequency polygon, Ogive (cumulative), Bar chart, Pie chart, Scatter plot.