40 Correlation and Regression
This topic combines the two foundational tools of bivariate analysis: correlation measures how strongly two variables move together, while regression estimates how one variable changes with another.
41 Part A — Correlation
41.1 Meaning
Correlation is the statistical relationship between two or more variables — the degree to which they move together. Two variables are correlated when changes in one tend to be associated with changes in the other (Gupta 2021; Elhance 2020).
| Basis | Categories |
|---|---|
| Direction | Positive (move together) vs Negative (move opposite) vs Zero (no association) |
| Number of variables | Simple (two), Multiple (more than two), Partial (controlling others) |
| Form | Linear (constant rate of change) vs Non-linear (curvilinear) |
A spurious correlation arises when two variables are statistically related but the relationship is not causal — they both depend on a third common factor, or the association is coincidental. Correlation, famously, is not causation.
41.2 Methods of Studying Correlation
| Method | Working content |
|---|---|
| Scatter diagram | Plot pairs of (X, Y); visual inspection of pattern and direction |
| Karl Pearson’s correlation coefficient (r) | Algebraic measure for interval / ratio data; assumes linearity |
| Spearman’s rank correlation (\(\rho\)) | For ordinal data, when only ranks are available |
| Kendall’s tau (\(\tau\)) | Alternative rank-based measure, less affected by outliers |
41.3 Karl Pearson’s Coefficient of Correlation
Karl Pearson (1895) generalised Galton’s earlier work into the modern formula. The Pearson product-moment correlation coefficient is:
\[ r = \dfrac{\sum (X - \bar X)(Y - \bar Y)}{\sqrt{\sum (X - \bar X)^2 \cdot \sum (Y - \bar Y)^2}} \]
A more convenient computational form:
\[ r = \dfrac{n \sum XY - \sum X \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}} \]
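A minimal Python sketch of this computational form, assuming plain lists of paired observations (the helper name `pearson_r` is illustrative; the five sample pairs are the same ones used in the regression worked numerical in Part B):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r via the computational (raw-sums) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))

x = [2, 3, 5, 7, 8]
y = [4, 6, 8, 10, 12]
print(round(pearson_r(x, y), 3))  # 0.992
```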
| Value of r | Interpretation |
|---|---|
| \(r = +1\) | Perfect positive correlation |
| \(0 < r < +1\) | Some positive correlation |
| \(r = 0\) | No (linear) correlation |
| \(-1 < r < 0\) | Some negative correlation |
| \(r = -1\) | Perfect negative correlation |
41.3.1 Properties of r
| Property | Working content |
|---|---|
| Range | \(-1 \leq r \leq +1\) |
| Symmetric | \(r_{XY} = r_{YX}\) |
| Independent of change of origin | Adding a constant leaves \(r\) unchanged |
| Independent of change of scale (positive) | Multiplying by a positive constant leaves \(r\) unchanged |
| Geometric mean of regression coefficients | \(r = \pm\sqrt{b_{YX} \cdot b_{XY}}\) |
| Sign matches both regression coefficients | All three carry the same sign |
41.4 Coefficient of Determination
The coefficient of determination, \(r^2\), is the fraction of the variation in Y explained by X. It lies between 0 and 1; the larger the \(r^2\), the better the linear fit. The complement \(1 - r^2\) is the coefficient of non-determination — the share of variation not explained.
41.4.1 Probable Error of r
\[ P.E. = 0.6745 \cdot \dfrac{1 - r^2}{\sqrt{n}} \]
If \(|r| > 6 \cdot P.E.\), the correlation is judged significant; if \(|r| < P.E.\), insignificant. The 0.6745 factor converts the standard error to the probable error (50 per cent confidence).
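A short sketch of this rule, using hypothetical values \(r = 0.8\) and \(n = 25\):

```python
from math import sqrt

def probable_error(r, n):
    """Probable error of r: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r**2) / sqrt(n)

r, n = 0.8, 25           # hypothetical correlation and sample size
pe = probable_error(r, n)
print(round(pe, 4))      # 0.0486
print(abs(r) > 6 * pe)   # True -> judged significant by the 6 * P.E. rule
```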
41.5 Spearman’s Rank Correlation
Charles Spearman (1904) developed a rank-based alternative for ordinal data and small samples. The formula:
\[ \rho = 1 - \dfrac{6 \sum d^2}{n (n^2 - 1)} \]
where \(d\) is the difference between ranks of paired observations and \(n\) is the number of pairs.
When ties occur, a correction is added to \(\sum d^2\):
\[ \rho = 1 - \dfrac{6 \left[\sum d^2 + \sum \dfrac{m^3 - m}{12}\right]}{n(n^2 - 1)} \]
where \(m\) is the number of observations tied at a given rank; the correction term is summed over every group of ties.
41.5.1 Worked example
Five candidates’ ranks by two judges:
| Candidate | Judge 1 | Judge 2 | \(d\) | \(d^2\) |
|---|---|---|---|---|
| A | 1 | 2 | −1 | 1 |
| B | 2 | 1 | 1 | 1 |
| C | 3 | 4 | −1 | 1 |
| D | 4 | 3 | 1 | 1 |
| E | 5 | 5 | 0 | 0 |
\(\sum d^2 = 4\). \(\rho = 1 - \dfrac{6 \times 4}{5(5^2 - 1)} = 1 - \dfrac{24}{120} = 1 - 0.2 = 0.8\).
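The same computation in a few lines of Python (ranks taken straight from the table; `spearman_rho` is an illustrative helper, and `scipy.stats.spearmanr`, if available, should return the same value for untied ranks):

```python
def spearman_rho(rank1, rank2):
    """Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank1)
    d2 = sum((a - b) ** 2 for a, b in zip(rank1, rank2))
    return 1 - 6 * d2 / (n * (n * n - 1))

judge1 = [1, 2, 3, 4, 5]
judge2 = [2, 1, 4, 3, 5]
print(spearman_rho(judge1, judge2))  # 0.8
```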
41.6 Correlation Is Not Causation
Two variables can be highly correlated for several non-causal reasons:
- A third common cause affecting both (the lurking variable).
- Reverse causation (Y causes X, not X causes Y).
- Coincidence in finite samples — the spurious correlation.
Establishing causation requires either controlled experimentation (randomised trials) or quasi-experimental methods (instrumental variables, regression discontinuity).
42 Part B — Regression
42.1 Meaning and Origin
Regression is the statistical technique used to estimate the relationship between a dependent variable and one or more independent variables. The term was coined by Sir Francis Galton (1886) — he observed that the height of children of tall parents tended to regress toward the population mean (the regression to the mean).
While correlation tells us the strength and direction of a linear relationship, regression provides a line (or surface) that allows prediction.
42.2 Linear Regression
For two variables X and Y, two regression lines exist:
- Y on X — predicts Y for a given X — used when X is treated as the explanatory variable.
- X on Y — predicts X for a given Y — used when Y is treated as the explanatory variable.
The Y on X regression line is:
\[ Y - \bar Y = b_{YX} (X - \bar X) \]
The X on Y line is:
\[ X - \bar X = b_{XY} (Y - \bar Y) \]
| Coefficient | Formula |
|---|---|
| \(b_{YX}\) — regression of Y on X | \(\dfrac{\sum(X - \bar X)(Y - \bar Y)}{\sum(X - \bar X)^2} = r \cdot \dfrac{\sigma_Y}{\sigma_X}\) |
| \(b_{XY}\) — regression of X on Y | \(\dfrac{\sum(X - \bar X)(Y - \bar Y)}{\sum(Y - \bar Y)^2} = r \cdot \dfrac{\sigma_X}{\sigma_Y}\) |
42.3 Properties of Regression Coefficients
| Property | Working content |
|---|---|
| Same sign as \(r\) | All three carry the same sign |
| Geometric-mean identity | \(r = \pm \sqrt{b_{YX} \cdot b_{XY}}\) |
| Product equals \(r^2\) | \(b_{YX} \cdot b_{XY} = r^2 \leq 1\) |
| At most one can exceed 1 | Both cannot be > 1 simultaneously |
| Independent of origin | Shift in origin does not change either coefficient |
| Affected by change of scale | Rescaling a variable rescales the coefficients (e.g. multiplying Y by \(k\) multiplies \(b_{YX}\) by \(k\) and divides \(b_{XY}\) by \(k\)) |
The two regression lines intersect at the point \((\bar X, \bar Y)\) — the means of the two variables.
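A brief numerical check of these properties, using the same five illustrative pairs as the worked numerical below (numpy is assumed to be available):

```python
import numpy as np

x = np.array([2, 3, 5, 7, 8], dtype=float)
y = np.array([4, 6, 8, 10, 12], dtype=float)

dx, dy = x - x.mean(), y - y.mean()
b_yx = (dx * dy).sum() / (dx**2).sum()   # regression coefficient of Y on X
b_xy = (dx * dy).sum() / (dy**2).sum()   # regression coefficient of X on Y
r = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

print(np.isclose(b_yx * b_xy, r**2))                 # True: product equals r^2
print(np.sign(b_yx) == np.sign(b_xy) == np.sign(r))  # True: all carry the same sign
```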
42.4 Method of Least Squares
The regression line is fitted by minimising the sum of squared vertical deviations of observed points from the line — Gauss’s method of least squares.
For Y on X, the line \(Y = a + b X\) is fitted by:
\[ b = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}, \quad a = \bar Y - b \bar X \]
42.5 Worked Numerical
Five pairs: (X, Y) = (2, 4), (3, 6), (5, 8), (7, 10), (8, 12).
- \(n = 5\), \(\sum X = 25\), \(\sum Y = 40\), \(\sum X^2 = 151\), \(\sum Y^2 = 360\), \(\sum XY = 232\).
- \(\bar X = 5\), \(\bar Y = 8\).
- \(r = \dfrac{5(232) - 25 \times 40}{\sqrt{[5(151) - 625][5(360) - 1600]}} = \dfrac{1160 - 1000}{\sqrt{130 \times 200}} = \dfrac{160}{\sqrt{26000}} = \dfrac{160}{161.25} \approx 0.992\).
- \(b_{YX} = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \dfrac{160}{130} \approx 1.231\).
- \(a = \bar Y - b_{YX} \bar X = 8 - 1.231 \times 5 = 1.846\).
- Y-on-X line: \(Y = 1.846 + 1.231 X\).
- \(r^2 \approx 0.984\) — about 98 per cent of variation in Y is explained by X.
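The same numbers can be reproduced with numpy as a sanity check (a rough sketch; `np.polyfit` fits the least-squares line directly):

```python
import numpy as np

x = np.array([2, 3, 5, 7, 8], dtype=float)
y = np.array([4, 6, 8, 10, 12], dtype=float)

b, a = np.polyfit(x, y, 1)          # degree-1 fit returns [slope, intercept]
r = np.corrcoef(x, y)[0, 1]

print(round(b, 3), round(a, 3))     # 1.231 1.846
print(round(r, 3), round(r**2, 2))  # 0.992 0.98 -- about 98 per cent explained
```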
42.6 Standard Error of Estimate
The standard error of estimate measures the typical deviation of observed Y values from the regression line:
\[ S_{YX} = \sqrt{\dfrac{\sum (Y - \hat Y)^2}{n}} = \sigma_Y \sqrt{1 - r^2} \]
Lower \(S_{YX}\) → tighter fit; in the limit \(r = \pm 1\), \(S_{YX} = 0\) (perfect fit).
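Both routes to \(S_{YX}\) agree for a least-squares fit, as this small sketch suggests (same illustrative data; population standard deviation, i.e. the \(n\) denominator, to match the formula above):

```python
import numpy as np

x = np.array([2, 3, 5, 7, 8], dtype=float)
y = np.array([4, 6, 8, 10, 12], dtype=float)

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
r = np.corrcoef(x, y)[0, 1]

s_direct   = np.sqrt(((y - y_hat) ** 2).mean())   # sqrt(sum of squared residuals / n)
s_shortcut = y.std() * np.sqrt(1 - r**2)          # sigma_Y * sqrt(1 - r^2), ddof=0

print(round(s_direct, 4), round(s_shortcut, 4))   # both ~= 0.3508
```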
42.7 Differences Between Correlation and Regression
| Dimension | Correlation | Regression |
|---|---|---|
| Purpose | Measure strength of association | Estimate functional relationship; predict |
| Symmetric? | Yes (\(r_{XY} = r_{YX}\)) | No (Y-on-X ≠ X-on-Y) |
| Direction | Both directions same | Two distinct regression lines |
| Output | A single dimensionless number | An equation; a line |
| Causation | Does not imply causation | Does not imply causation |
| Use | Hypothesis-testing on association | Forecasting and inference |
42.8 Multiple and Partial Correlation
When more than two variables are involved, multiple correlation measures the joint association of several X’s with one Y; partial correlation measures the association of two variables holding others constant. The standard notation is:
- \(R_{1.23}\) — multiple correlation of \(X_1\) with \(X_2\) and \(X_3\).
- \(r_{12.3}\) — partial correlation between \(X_1\) and \(X_2\), controlling for \(X_3\).
The natural extension of regression — multiple regression — fits \(Y = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k\) and is the workhorse of applied econometrics.
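A minimal multiple-regression sketch with `numpy.linalg.lstsq` (the second regressor X2 and all data values are purely hypothetical, added only to illustrate the fit):

```python
import numpy as np

# Hypothetical data: one response Y and two regressors X1, X2.
X1 = np.array([2, 3, 5, 7, 8], dtype=float)
X2 = np.array([1, 2, 2, 3, 5], dtype=float)
Y  = np.array([4, 6, 8, 10, 12], dtype=float)

# Design matrix with a leading column of ones for the intercept a.
A = np.column_stack([np.ones_like(X1), X1, X2])
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coeffs

print(a, b1, b2)   # fitted equation: Y = a + b1*X1 + b2*X2
```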
42.9 Exam-Pattern MCQs
Match each method with the data it suits best:

| | Method | | Best for |
|---|---|---|---|
| (i) | Scatter diagram | (a) | Visual inspection of any data type |
| (ii) | Karl Pearson's $r$ | (b) | Ordinal / ranked data |
| (iii) | Spearman's $\rho$ | (c) | Rank-based, robust to outliers |
| (iv) | Kendall's $\tau$ | (d) | Interval / ratio data with linear pattern |
Match each property with the statistic it describes:

| | Property | | Statistic |
|---|---|---|---|
| (i) | Range $-1$ to $+1$ | (a) | Coefficient of determination |
| (ii) | Independent of change of origin and scale | (b) | Karl Pearson's $r$ |
| (iii) | Equal to $r^2$ | (c) | Regression coefficients $b_{YX} \cdot b_{XY}$ |
| (iv) | Affected by change of scale | (d) | Regression coefficient $b_{YX}$ |
Match each statistic with what it captures:

| | Statistic | | Captures |
|---|---|---|---|
| (i) | $r$ | (a) | Functional relationship; allows prediction |
| (ii) | $r^2$ | (b) | Strength of linear association |
| (iii) | Regression equation | (c) | Typical deviation of observed Y from the fitted line |
| (iv) | Standard error of estimate | (d) | Fraction of variation in Y explained by X |
42.10 Key Points
- Correlation = strength and direction of association. Regression = functional relationship and prediction.
- Types: positive / negative / zero; simple / multiple / partial; linear / non-linear.
- Methods: scatter, Pearson r, Spearman \(\rho\), Kendall \(\tau\).
- Pearson r: \(-1 \leq r \leq +1\); symmetric; independent of origin and (positive) scale.
- Coefficient of determination: \(r^2\) = share of Y’s variation explained by X.
- Probable error: \(P.E. = 0.6745 (1 - r^2)/\sqrt{n}\). Significant if \(|r| > 6 \cdot P.E.\)
- Spearman: \(\rho = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)}\); tie correction adds \(\sum (m^3 - m)/12\) to \(\sum d^2\).
- Galton (1886) coined “regression”; Pearson (1895) formalised correlation; Spearman (1904) rank correlation; Gauss (1809) method of least squares.
- Two regression lines: Y on X (\(Y - \bar Y = b_{YX}(X - \bar X)\)) and X on Y. They intersect at \((\bar X, \bar Y)\).
- Regression coefficients: \(b_{YX} = r \sigma_Y / \sigma_X\); \(b_{XY} = r \sigma_X / \sigma_Y\). \(r = \pm \sqrt{b_{YX} \cdot b_{XY}}\).
- Both \(b\)’s and \(r\) have the same sign. At most one of \(b_{YX}, b_{XY}\) can exceed 1.
- Standard error of estimate: \(S_{YX} = \sigma_Y \sqrt{1 - r^2}\).
- Correlation ≠ causation.