40 Correlation and Regression

This topic combines the two foundational tools of bivariate analysis: correlation measures how strongly two variables move together, while regression estimates how one variable changes with another.

41 Part A — Correlation

41.1 Meaning

Correlation is the statistical relationship between two or more variables — the degree to which they move together. Two variables are correlated when changes in one tend to be associated with changes in the other (Gupta 2021; Elhance 2020).

Tip: Three Types of Correlation

| Basis | Categories |
| --- | --- |
| Direction | Positive (move together) vs negative (move opposite) vs zero (no association) |
| Number of variables | Simple (two), multiple (more than two), partial (controlling for others) |
| Form | Linear (constant rate of change) vs non-linear (curvilinear) |

A spurious correlation arises when two variables are statistically related but the relationship is not causal — they both depend on a third common factor, or the association is coincidental. Correlation, famously, is not causation.

41.2 Methods of Studying Correlation

Tip: Four Methods of Studying Correlation

| Method | Details |
| --- | --- |
| Scatter diagram | Plot pairs of (X, Y); visual inspection of pattern and direction |
| Karl Pearson’s correlation coefficient (r) | Algebraic measure for interval / ratio data; assumes linearity |
| Spearman’s rank correlation (\(\rho\)) | For ordinal data, when only ranks are available |
| Kendall’s tau (\(\tau\)) | Alternative rank-based measure, less affected by outliers |
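In practice these coefficients are rarely computed by hand; a minimal sketch, assuming SciPy is installed, applies the three coefficient-based methods to the same paired data:

```python
# Sketch assuming SciPy is available: the three coefficient-based methods
# applied to the same paired observations.
from scipy.stats import pearsonr, spearmanr, kendalltau

x = [2, 3, 5, 7, 8]
y = [4, 6, 8, 10, 12]

r, _ = pearsonr(x, y)        # linear association on interval / ratio data
rho, _ = spearmanr(x, y)     # rank-based; here the ranks agree exactly
tau, _ = kendalltau(x, y)    # rank-based, less affected by outliers
print(round(r, 3), rho, tau)  # 0.992 1.0 1.0
```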

41.3 Karl Pearson’s Coefficient of Correlation

Karl Pearson (1895) generalised Galton’s earlier work into the modern formula. The Pearson product-moment correlation coefficient is:

\[ r = \dfrac{\sum (X - \bar X)(Y - \bar Y)}{\sqrt{\sum (X - \bar X)^2 \cdot \sum (Y - \bar Y)^2}} \]

A more convenient computational form:

\[ r = \dfrac{n \sum XY - \sum X \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}} \]
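As a cross-check on the two forms, a minimal sketch in plain Python (the function names are illustrative; the data are the five pairs used in the worked numerical of Part B):

```python
# Minimal sketch: Pearson's r via the definitional and computational forms.
from math import sqrt

def pearson_r(xs, ys):
    """Definitional form: sum of cross-deviations over the root product."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def pearson_r_shortcut(xs, ys):
    """Computational form: raw sums, no explicit deviations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))

xs, ys = [2, 3, 5, 7, 8], [4, 6, 8, 10, 12]
r = pearson_r(xs, ys)
print(round(r, 3), round(pearson_r_shortcut(xs, ys), 3))  # 0.992 0.992
print(round(r ** 2, 3))  # coefficient of determination, about 0.985
```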

Tip: Interpretation of r

| Value of r | Interpretation |
| --- | --- |
| \(r = +1\) | Perfect positive correlation |
| \(0 < r < +1\) | Some positive correlation |
| \(r = 0\) | No (linear) correlation |
| \(-1 < r < 0\) | Some negative correlation |
| \(r = -1\) | Perfect negative correlation |

41.3.1 Properties of r

Tip: Six Properties of Pearson’s r

| Property | Details |
| --- | --- |
| Range | \(-1 \leq r \leq +1\) |
| Symmetric | \(r_{XY} = r_{YX}\) |
| Independent of change of origin | Adding a constant to X or Y leaves \(r\) unchanged |
| Independent of change of scale (positive) | Multiplying X or Y by a positive constant leaves \(r\) unchanged |
| Geometric mean of regression coefficients | \(r = \pm\sqrt{b_{YX} \cdot b_{XY}}\), taking the common sign of the coefficients |
| Sign matches both regression coefficients | \(r\), \(b_{YX}\) and \(b_{XY}\) all carry the same sign |

41.4 Coefficient of Determination

The coefficient of determination, \(r^2\), is the fraction of the variation in Y explained by X. It lies between 0 and 1; the larger the \(r^2\), the better the linear fit. The complement \(1 - r^2\) is the coefficient of non-determination — the share of variation not explained.

41.4.1 Probable Error of r

\[ P.E. = 0.6745 \cdot \dfrac{1 - r^2}{\sqrt{n}} \]

If \(|r| > 6 \cdot P.E.\), the correlation is judged significant; if \(|r| < P.E.\), insignificant. The 0.6745 factor converts the standard error to the probable error (50 per cent confidence).
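A short sketch of this screening rule; the values of \(r\) and \(n\) below are illustrative:

```python
# Minimal sketch of the probable-error screen for Pearson's r.
from math import sqrt

def probable_error(r, n):
    return 0.6745 * (1 - r ** 2) / sqrt(n)

r, n = 0.8, 25                     # illustrative values
pe = probable_error(r, n)
if abs(r) > 6 * pe:
    verdict = "significant"
elif abs(r) < pe:
    verdict = "insignificant"
else:
    verdict = "inconclusive by this rule"
print(round(pe, 4), verdict)       # 0.0486 significant (6 * P.E. ≈ 0.29 < 0.8)
```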

41.5 Spearman’s Rank Correlation

Charles Spearman (1904) developed a rank-based alternative for ordinal data and small samples. The formula:

\[ \rho = 1 - \dfrac{6 \sum d^2}{n (n^2 - 1)} \]

where \(d\) is the difference between ranks of paired observations and \(n\) is the number of pairs.

When ties occur, a correction is added to \(\sum d^2\):

\[ \rho = 1 - \dfrac{6 \left[\sum d^2 + \sum \dfrac{m^3 - m}{12}\right]}{n(n^2 - 1)} \]

where \(m\) is the number of observations tied at a given rank; the correction term is summed over every tied group in both series.

41.5.1 Worked example

Five candidates’ ranks by two judges:

| Candidate | Judge 1 | Judge 2 | \(d\) | \(d^2\) |
| --- | --- | --- | --- | --- |
| A | 1 | 2 | −1 | 1 |
| B | 2 | 1 | 1 | 1 |
| C | 3 | 4 | −1 | 1 |
| D | 4 | 3 | 1 | 1 |
| E | 5 | 5 | 0 | 0 |

\(\sum d^2 = 4\), so \(\rho = 1 - \dfrac{6 \times 4}{5(5^2 - 1)} = 1 - \dfrac{24}{120} = 1 - 0.2 = 0.8\).
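A minimal sketch that reproduces this result and also handles ties via average ranks; `average_ranks` and `tie_term` are illustrative helper names:

```python
# Minimal sketch: Spearman's rho with average ranks and the tie correction.
from collections import Counter

def average_ranks(values):
    """Rank values (1 = smallest); tied items share the mean of their ranks."""
    positions = {}
    for i, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(i)
    return [sum(positions[v]) / len(positions[v]) for v in values]

def tie_term(values):
    """Sum of (m^3 - m) / 12 over every group of m tied values."""
    return sum((m**3 - m) / 12 for m in Counter(values).values() if m > 1)

def spearman_rho(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = tie_term(xs) + tie_term(ys)   # zero when there are no ties
    return 1 - 6 * (d2 + correction) / (n * (n**2 - 1))

judge1 = [1, 2, 3, 4, 5]   # ranks from the worked example above
judge2 = [2, 1, 4, 3, 5]
print(spearman_rho(judge1, judge2))  # 0.8
```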

41.6 Correlation Is Not Causation

Two variables can be highly correlated for several non-causal reasons:

  • A third common cause affecting both (the lurking variable).
  • Reverse causation (Y causes X, not X causes Y).
  • Coincidence in finite samples — the spurious correlation.

Establishing causation requires controlled experimentation (randomised trials) or quasi-experimental methods (instrumental variables, regression discontinuity).

42 Part B — Regression

42.1 Meaning and Origin

Regression is the statistical technique used to estimate the relationship between a dependent variable and one or more independent variables. The term was coined by Sir Francis Galton (1886) — he observed that the height of children of tall parents tended to regress toward the population mean (the regression to the mean).

While correlation tells us the strength and direction of a linear relationship, regression provides a line (or surface) that allows prediction.

42.2 Linear Regression

For two variables X and Y, two regression lines exist:

  • Y on X — predicts Y for a given X — used when X is treated as the independent variable.
  • X on Y — predicts X for a given Y — used when Y is treated as the independent variable.

The Y on X regression line is:

\[ Y - \bar Y = b_{YX} (X - \bar X) \]

The X on Y line is:

\[ X - \bar X = b_{XY} (Y - \bar Y) \]

Tip: Regression Coefficients

| Coefficient | Formula |
| --- | --- |
| \(b_{YX}\) — regression of Y on X | \(\dfrac{\sum(X - \bar X)(Y - \bar Y)}{\sum(X - \bar X)^2} = r \cdot \dfrac{\sigma_Y}{\sigma_X}\) |
| \(b_{XY}\) — regression of X on Y | \(\dfrac{\sum(X - \bar X)(Y - \bar Y)}{\sum(Y - \bar Y)^2} = r \cdot \dfrac{\sigma_X}{\sigma_Y}\) |

42.3 Properties of Regression Coefficients

Tip: Six Properties of Regression Coefficients

| Property | Details |
| --- | --- |
| Same sign as \(r\) | \(r\), \(b_{YX}\) and \(b_{XY}\) all carry the same sign |
| Geometric-mean identity | \(r = \pm \sqrt{b_{YX} \cdot b_{XY}}\) |
| Product equals \(r^2\) | \(b_{YX} \cdot b_{XY} = r^2 \leq 1\) |
| At most one can exceed 1 | Since their product is at most 1, both cannot exceed 1 simultaneously |
| Independent of origin | Adding a constant to X or Y changes neither coefficient |
| Affected by change of scale | Multiplying Y by \(k > 0\) multiplies \(b_{YX}\) by \(k\) and divides \(b_{XY}\) by \(k\) (and symmetrically for X) |

The two regression lines intersect at the point \((\bar X, \bar Y)\) — the means of the two variables.
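A short sketch, using the data from the worked numerical below, that computes both coefficients and confirms the product identity and the intersection at the means:

```python
# Minimal sketch: both regression coefficients from the data, with checks on
# the product identity and on where the two regression lines intersect.
from math import sqrt

xs, ys = [2, 3, 5, 7, 8], [4, 6, 8, 10, 12]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

b_yx, b_xy = sxy / sxx, sxy / syy        # slopes of Y on X and X on Y
r = sxy / sqrt(sxx * syy)
print(round(b_yx * b_xy, 3) == round(r ** 2, 3))  # True: product equals r^2

# Intercepts of Y = a1 + b_yx * X and X = a2 + b_xy * Y, solved simultaneously:
a1, a2 = my - b_yx * mx, mx - b_xy * my
x_star = (a2 + b_xy * a1) / (1 - b_xy * b_yx)
y_star = a1 + b_yx * x_star
print(round(x_star, 6), round(y_star, 6))  # 5.0 8.0, the mean point
```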

42.4 Method of Least Squares

The regression line is fitted by minimising the sum of squared vertical deviations of observed points from the line — Gauss’s method of least squares.

For Y on X, the line \(Y = a + b X\) is fitted by:

\[ b = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}, \quad a = \bar Y - b \bar X \]

42.5 Worked Numerical

Five pairs: (X, Y) = (2, 4), (3, 6), (5, 8), (7, 10), (8, 12).

  • \(n = 5\), \(\sum X = 25\), \(\sum Y = 40\), \(\sum X^2 = 151\), \(\sum Y^2 = 360\), \(\sum XY = 232\).
  • \(\bar X = 5\), \(\bar Y = 8\).
  • \(r = \dfrac{5(232) - 25 \times 40}{\sqrt{[5(151) - 625][5(360) - 1600]}} = \dfrac{1160 - 1000}{\sqrt{130 \times 200}} = \dfrac{160}{\sqrt{26000}} = \dfrac{160}{161.25} \approx 0.992\).
  • \(b_{YX} = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \dfrac{160}{130} \approx 1.231\).
  • \(a = \bar Y - b_{YX} \bar X = 8 - 1.231 \times 5 = 1.846\).
  • Y-on-X line: \(Y = 1.846 + 1.231 X\).
  • \(r^2 \approx 0.984\) — about 98 per cent of variation in Y is explained by X.
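These numbers can be checked mechanically; a minimal sketch implementing the raw-sum formulas of the previous subsection:

```python
# Sketch reproducing the worked numerical with the raw-sum formulas above.
from math import sqrt

X, Y = [2, 3, 5, 7, 8], [4, 6, 8, 10, 12]
n = len(X)
sx, sy = sum(X), sum(Y)                    # 25, 40
sxx = sum(x * x for x in X)                # 151
syy = sum(y * y for y in Y)                # 360
sxy = sum(x * y for x, y in zip(X, Y))     # 232

b = (n * sxy - sx * sy) / (n * sxx - sx**2)   # 160 / 130
a = sy / n - b * sx / n                       # y-bar minus b times x-bar
r = (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))

print(round(b, 3), round(a, 3))        # 1.231 1.846
print(round(r, 3), round(r ** 2, 3))   # 0.992 0.985
```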

42.6 Standard Error of Estimate

The standard error of estimate measures the typical deviation of observed Y values from the regression line:

\[ S_{YX} = \sqrt{\dfrac{\sum (Y - \hat Y)^2}{n}} = \sigma_Y \sqrt{1 - r^2} \]

Lower \(S_{YX}\) → tighter fit; in the limit \(r = \pm 1\), \(S_{YX} = 0\) (perfect fit).
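A short sketch on the same worked data, confirming that the residual definition and the shortcut \(\sigma_Y \sqrt{1 - r^2}\) give the same value:

```python
# Minimal sketch: the residual definition and the shortcut formula agree.
from math import sqrt

X, Y = [2, 3, 5, 7, 8], [4, 6, 8, 10, 12]
n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
sxx = sum((x - mx) ** 2 for x in X)
syy = sum((y - my) ** 2 for y in Y)
b = sxy / sxx
a = my - b * mx
r = sxy / sqrt(sxx * syy)

residuals = [y - (a + b * x) for x, y in zip(X, Y)]
s_from_residuals = sqrt(sum(e ** 2 for e in residuals) / n)
s_from_formula = sqrt(syy / n) * sqrt(1 - r ** 2)   # sigma_Y * sqrt(1 - r^2)
print(round(s_from_residuals, 3), round(s_from_formula, 3))  # 0.351 0.351
```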

42.7 Differences Between Correlation and Regression

Tip: Correlation vs Regression

| Dimension | Correlation | Regression |
| --- | --- | --- |
| Purpose | Measure strength of association | Estimate functional relationship; predict |
| Symmetric? | Yes (\(r_{XY} = r_{YX}\)) | No (Y-on-X ≠ X-on-Y) |
| Direction | One measure for both directions | Two distinct regression lines |
| Output | A single dimensionless number | An equation; a line |
| Causation | Does not imply causation | Does not imply causation |
| Use | Hypothesis testing on association | Forecasting and inference |

42.8 Multiple and Partial Correlation

When more than two variables are involved, multiple correlation measures the joint association of several X’s with one Y; partial correlation measures the association of two variables holding the others constant. Standard notation:

  • \(R_{1.23}\) — multiple correlation of \(X_1\) with \(X_2\) and \(X_3\).
  • \(r_{12.3}\) — partial correlation between \(X_1\) and \(X_2\), controlling for \(X_3\).

The natural extension of regression — multiple regression — fits \(Y = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k\) and is the workhorse of applied econometrics.
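A minimal sketch of such a fit, assuming NumPy is available; the data are hypothetical, constructed so that the true coefficients \((a, b_1, b_2) = (1, 1, 2)\) are recovered exactly:

```python
# Sketch assuming NumPy is available: multiple regression by least squares.
# Hypothetical data constructed so that Y = 1 + X1 + 2*X2 exactly.
import numpy as np

X1 = np.array([2.0, 3.0, 5.0, 7.0, 8.0])
X2 = np.array([1.0, 4.0, 2.0, 6.0, 5.0])
Y = 1 + X1 + 2 * X2

A = np.column_stack([np.ones_like(X1), X1, X2])  # design matrix [1, X1, X2]
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)     # minimises sum of squared errors
a, b1, b2 = coef
print(round(a, 3), round(b1, 3), round(b2, 3))   # 1.0 1.0 2.0
```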

42.9 Exam-Pattern MCQs

Q 01
Which of the following statements about Karl Pearson's coefficient of correlation $r$ is not true?
  • A. $-1 \leq r \leq +1$
  • B. $r$ is independent of the change of origin
  • C. $r$ is independent of the change of scale (positive)
  • D. $r$ implies a causal relationship between X and Y
Correct Option: D
Correlation does not imply causation. The coefficient measures association, not cause and effect.
Q 02
Match each method of studying correlation with the type of data it suits best:
| Method | Best for |
| --- | --- |
| (i) Scatter diagram | (a) Visual inspection of any data type |
| (ii) Karl Pearson's $r$ | (b) Ordinal / ranked data |
| (iii) Spearman's $\rho$ | (c) Rank-based, robust to outliers |
| (iv) Kendall's $\tau$ | (d) Interval / ratio data with linear pattern |
  • A. (i)-(a), (ii)-(d), (iii)-(b), (iv)-(c)
  • B. (i)-(b), (ii)-(a), (iii)-(c), (iv)-(d)
  • C. (i)-(c), (ii)-(b), (iii)-(d), (iv)-(a)
  • D. (i)-(d), (ii)-(c), (iii)-(a), (iv)-(b)
Correct Option: A
Q 03
If the two regression coefficients $b_{YX}$ and $b_{XY}$ are 0.4 and 0.9, the correlation coefficient $r$ is:
  • A. ± 0.6
  • B. ± 0.36
  • C. ± 0.65
  • D. ± 0.45
Correct Option: A
$r = \pm \sqrt{0.4 \times 0.9} = \pm \sqrt{0.36} = \pm 0.6$.
Q 04
Match each property with the regression coefficient or correlation it characterises:
| Property | Statistic |
| --- | --- |
| (i) Range $-1$ to $+1$ | (a) Coefficient of determination |
| (ii) Independent of change of origin and scale | (b) Karl Pearson's $r$ |
| (iii) Equal to $r^2$ | (c) Regression coefficients $b_{YX} \cdot b_{XY}$ |
| (iv) Affected by change of scale | (d) Regression coefficient $b_{YX}$ |
  • A. (i)-(b), (ii)-(b), (iii)-(c), (iv)-(d)
  • B. (i)-(a), (ii)-(d), (iii)-(b), (iv)-(c)
  • C. (i)-(c), (ii)-(b), (iii)-(d), (iv)-(a)
  • D. (i)-(d), (ii)-(a), (iii)-(c), (iv)-(b)
Correct Option: A
Q 05
In Spearman's rank correlation with five paired observations, $\sum d^2 = 4$. The rank correlation coefficient is:
  • A. 0.5
  • B. 0.7
  • C. 0.8
  • D. 0.92
Correct Option: C
$\rho = 1 - \dfrac{6(4)}{5(5^2 - 1)} = 1 - \dfrac{24}{120} = 1 - 0.2 = 0.8$.
Q 06
The point at which the two regression lines intersect is:
  • A. The origin (0, 0)
  • B. The mean point $(\bar X, \bar Y)$
  • C. The first quartile of both variables
  • D. The mode of both variables
Correct Option: B
The two regression lines always pass through the means; their intersection is the point $(\bar X, \bar Y)$.
Q 07
Arrange the following in chronological order of contribution: (i) Pearson's correlation coefficient (ii) Galton's regression to the mean (iii) Spearman's rank correlation (iv) Gauss's method of least squares
  • A. (iv), (ii), (i), (iii)
  • B. (i), (ii), (iii), (iv)
  • C. (iii), (iv), (i), (ii)
  • D. (ii), (iii), (iv), (i)
Correct Option: A
Gauss (1809) → Galton (1886) → Pearson (1895) → Spearman (1904).
Q 08
Match each statistic with what it captures:
| Statistic | Captures |
| --- | --- |
| (i) $r$ | (a) Functional relationship; allows prediction |
| (ii) $r^2$ | (b) Strength of linear association |
| (iii) Regression equation | (c) Typical deviation of observed Y from the fitted line |
| (iv) Standard error of estimate | (d) Fraction of variation in Y explained by X |
  • A. (i)-(b), (ii)-(d), (iii)-(a), (iv)-(c)
  • B. (i)-(a), (ii)-(b), (iii)-(c), (iv)-(d)
  • C. (i)-(c), (ii)-(a), (iii)-(b), (iv)-(d)
  • D. (i)-(d), (ii)-(c), (iii)-(b), (iv)-(a)
Correct Option: A
Important: Quick recall
  • Correlation = strength and direction of association. Regression = functional relationship and prediction.
  • Types: positive / negative / zero; simple / multiple / partial; linear / non-linear.
  • Methods: scatter, Pearson r, Spearman \(\rho\), Kendall \(\tau\).
  • Pearson r: \(-1 \leq r \leq +1\); symmetric; independent of origin and (positive) scale.
  • Coefficient of determination: \(r^2\) = share of Y’s variation explained by X.
  • Probable error: \(P.E. = 0.6745 (1 - r^2)/\sqrt{n}\). Significant if \(|r| > 6 \cdot P.E.\)
  • Spearman: \(\rho = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)}\); tie correction adds \(\sum (m^3 - m)/12\) to \(\sum d^2\).
  • Galton (1886) coined “regression”; Pearson (1895) formalised correlation; Spearman (1904) rank correlation; Gauss (1809) method of least squares.
  • Two regression lines: Y on X (\(Y - \bar Y = b_{YX}(X - \bar X)\)) and X on Y. They intersect at \((\bar X, \bar Y)\).
  • Regression coefficients: \(b_{YX} = r \sigma_Y / \sigma_X\); \(b_{XY} = r \sigma_X / \sigma_Y\). \(r = \pm \sqrt{b_{YX} \cdot b_{XY}}\).
  • Both \(b\)’s and \(r\) have the same sign. At most one of \(b_{YX}, b_{XY}\) can exceed 1.
  • Standard error of estimate: \(S_{YX} = \sigma_Y \sqrt{1 - r^2}\).
  • Correlation ≠ causation.