40 Correlation and Regression
This topic combines the two foundational tools of bivariate analysis: correlation measures how strongly two variables move together, while regression estimates how one variable changes with another.
41 Part A — Correlation
41.1 Meaning
Correlation is the statistical relationship between two or more variables — the degree to which they move together. Two variables are correlated when changes in one tend to be associated with changes in the other (Gupta 2021; Elhance 2020).
| Basis | Categories |
|---|---|
| Direction | Positive (move together) vs Negative (move opposite) vs Zero (no association) |
| Number of variables | Simple (two), Multiple (more than two), Partial (controlling others) |
| Form | Linear (constant rate of change) vs Non-linear (curvilinear) |
A spurious correlation arises when two variables are statistically related but the relationship is not causal — they both depend on a third common factor, or the association is coincidental. Correlation, famously, is not causation.
41.2 Methods of Studying Correlation
| Method | Working content |
|---|---|
| Scatter diagram | Plot pairs of (X, Y); visual inspection of pattern and direction |
| Karl Pearson’s correlation coefficient (r) | Algebraic measure for interval / ratio data; assumes linearity |
| Spearman’s rank correlation (\(\rho\)) | For ordinal data, when only ranks are available |
| Kendall’s tau (\(\tau\)) | Alternative rank-based measure, less affected by outliers |
41.3 Karl Pearson’s Coefficient of Correlation
Karl Pearson (1895) generalised Galton’s earlier work into the modern formula. The Pearson product-moment correlation coefficient is:
\[ r = \dfrac{\sum (X - \bar X)(Y - \bar Y)}{\sqrt{\sum (X - \bar X)^2 \cdot \sum (Y - \bar Y)^2}} \]
A more convenient computational form:
\[ r = \dfrac{n \sum XY - \sum X \sum Y}{\sqrt{[n \sum X^2 - (\sum X)^2][n \sum Y^2 - (\sum Y)^2]}} \]
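A minimal Python sketch of this computational form, assuming plain lists of paired observations (the helper name `pearson_r` is illustrative; the five sample pairs are the same ones used in the regression worked numerical in Part B):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r via the computational (raw-sums) formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx**2) * (n * syy - sy**2))

x = [2, 3, 5, 7, 8]
y = [4, 6, 8, 10, 12]
print(round(pearson_r(x, y), 3))  # 0.992
```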
| Value of r | Interpretation |
|---|---|
| \(r = +1\) | Perfect positive correlation |
| \(0 < r < +1\) | Some positive correlation |
| \(r = 0\) | No (linear) correlation |
| \(-1 < r < 0\) | Some negative correlation |
| \(r = -1\) | Perfect negative correlation |
41.3.1 Properties of r
| Property | Working content |
|---|---|
| Range | \(-1 \leq r \leq +1\) |
| Symmetric | \(r_{XY} = r_{YX}\) |
| Independent of change of origin | Adding a constant leaves \(r\) unchanged |
| Independent of change of scale (positive) | Multiplying by a positive constant leaves \(r\) unchanged |
| Geometric mean of regression coefficients | \(r = \pm\sqrt{b_{YX} \cdot b_{XY}}\) |
| Sign matches both regression coefficients | All three carry the same sign |
41.4 Coefficient of Determination
The coefficient of determination, \(r^2\), is the fraction of the variation in Y explained by X. It lies between 0 and 1; the larger the \(r^2\), the better the linear fit. The complement \(1 - r^2\) is the coefficient of non-determination — the share of variation not explained.
41.4.1 Probable Error of r
\[ P.E. = 0.6745 \cdot \dfrac{1 - r^2}{\sqrt{n}} \]
If \(|r| > 6 \cdot P.E.\), the correlation is judged significant; if \(|r| < P.E.\), insignificant. The 0.6745 factor converts the standard error to the probable error (50 per cent confidence).
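A short sketch of this rule, using hypothetical values \(r = 0.8\) and \(n = 25\):

```python
from math import sqrt

def probable_error(r, n):
    """Probable error of r: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r**2) / sqrt(n)

r, n = 0.8, 25           # hypothetical correlation and sample size
pe = probable_error(r, n)
print(round(pe, 4))      # 0.0486
print(abs(r) > 6 * pe)   # True -> judged significant by the 6 * P.E. rule
```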
41.5 Spearman’s Rank Correlation
Charles Spearman (1904) developed a rank-based alternative for ordinal data and small samples. The formula:
\[ \rho = 1 - \dfrac{6 \sum d^2}{n (n^2 - 1)} \]
where \(d\) is the difference between ranks of paired observations and \(n\) is the number of pairs.
When ties occur, a correction is added to \(\sum d^2\):
\[ \rho = 1 - \dfrac{6 \left[\sum d^2 + \sum \dfrac{m^3 - m}{12}\right]}{n(n^2 - 1)} \]
where \(m\) is the number of observations tied at a given rank; the correction term is summed over every group of ties.
41.5.1 Worked example
Five candidates’ ranks by two judges:
| Candidate | Judge 1 | Judge 2 | \(d\) | \(d^2\) |
|---|---|---|---|---|
| A | 1 | 2 | −1 | 1 |
| B | 2 | 1 | 1 | 1 |
| C | 3 | 4 | −1 | 1 |
| D | 4 | 3 | 1 | 1 |
| E | 5 | 5 | 0 | 0 |
\(\sum d^2 = 4\). \(\rho = 1 - \dfrac{6 \times 4}{5(5^2 - 1)} = 1 - \dfrac{24}{120} = 1 - 0.2 = 0.8\).
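The same computation in a few lines of Python (ranks taken straight from the table; `spearman_rho` is an illustrative helper, and `scipy.stats.spearmanr`, if available, should return the same value for untied ranks):

```python
def spearman_rho(rank1, rank2):
    """Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank1)
    d2 = sum((a - b) ** 2 for a, b in zip(rank1, rank2))
    return 1 - 6 * d2 / (n * (n * n - 1))

judge1 = [1, 2, 3, 4, 5]
judge2 = [2, 1, 4, 3, 5]
print(spearman_rho(judge1, judge2))  # 0.8
```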
41.6 Correlation Is Not Causation
Two variables can be highly correlated for several non-causal reasons:
- A third common cause affecting both (the lurking variable).
- Reverse causation (Y causes X, not X causes Y).
- Coincidence in finite samples — the spurious correlation.
Establishing causation requires either controlled experimentation (randomised trials) or quasi-experimental methods (instrumental variables, regression discontinuity).
42 Part B — Regression
42.1 Meaning and Origin
Regression is the statistical technique used to estimate the relationship between a dependent variable and one or more independent variables. The term was coined by Sir Francis Galton (1886) — he observed that the height of children of tall parents tended to regress toward the population mean (the regression to the mean).
While correlation tells us the strength and direction of a linear relationship, regression provides a line (or surface) that allows prediction.
42.2 Linear Regression
For two variables X and Y, two regression lines exist:
- Y on X — predicts Y for a given X — used when X is treated as the explanatory variable.
- X on Y — predicts X for a given Y — used when Y is treated as the explanatory variable.
The Y on X regression line is:
\[ Y - \bar Y = b_{YX} (X - \bar X) \]
The X on Y line is:
\[ X - \bar X = b_{XY} (Y - \bar Y) \]
| Coefficient | Formula |
|---|---|
| \(b_{YX}\) — regression of Y on X | \(\dfrac{\sum(X - \bar X)(Y - \bar Y)}{\sum(X - \bar X)^2} = r \cdot \dfrac{\sigma_Y}{\sigma_X}\) |
| \(b_{XY}\) — regression of X on Y | \(\dfrac{\sum(X - \bar X)(Y - \bar Y)}{\sum(Y - \bar Y)^2} = r \cdot \dfrac{\sigma_X}{\sigma_Y}\) |
42.3 Properties of Regression Coefficients
| Property | Working content |
|---|---|
| Same sign as \(r\) | All three carry the same sign |
| Geometric-mean identity | \(r = \pm \sqrt{b_{YX} \cdot b_{XY}}\) |
| Product equals \(r^2\) | \(b_{YX} \cdot b_{XY} = r^2 \leq 1\) |
| At most one can exceed 1 | Both cannot be > 1 simultaneously |
| Independent of origin | Shift in origin does not change either coefficient |
| Affected by change of scale | Rescaling a variable rescales the coefficients (e.g. multiplying Y by \(k\) multiplies \(b_{YX}\) by \(k\) and divides \(b_{XY}\) by \(k\)) |
The two regression lines intersect at the point \((\bar X, \bar Y)\) — the means of the two variables.
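A brief numerical check of these properties, using the same five illustrative pairs as the worked numerical below (numpy is assumed to be available):

```python
import numpy as np

x = np.array([2, 3, 5, 7, 8], dtype=float)
y = np.array([4, 6, 8, 10, 12], dtype=float)

dx, dy = x - x.mean(), y - y.mean()
b_yx = (dx * dy).sum() / (dx**2).sum()   # regression coefficient of Y on X
b_xy = (dx * dy).sum() / (dy**2).sum()   # regression coefficient of X on Y
r = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

print(np.isclose(b_yx * b_xy, r**2))                 # True: product equals r^2
print(np.sign(b_yx) == np.sign(b_xy) == np.sign(r))  # True: all carry the same sign
```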
42.4 Method of Least Squares
The regression line is fitted by minimising the sum of squared vertical deviations of observed points from the line — Gauss’s method of least squares.
For Y on X, the line \(Y = a + b X\) is fitted by:
\[ b = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2}, \quad a = \bar Y - b \bar X \]
42.5 Worked Numerical
Five pairs: (X, Y) = (2, 4), (3, 6), (5, 8), (7, 10), (8, 12).
- \(n = 5\), \(\sum X = 25\), \(\sum Y = 40\), \(\sum X^2 = 151\), \(\sum Y^2 = 360\), \(\sum XY = 232\).
- \(\bar X = 5\), \(\bar Y = 8\).
- \(r = \dfrac{5(232) - 25 \times 40}{\sqrt{[5(151) - 625][5(360) - 1600]}} = \dfrac{1160 - 1000}{\sqrt{130 \times 200}} = \dfrac{160}{\sqrt{26000}} = \dfrac{160}{161.25} \approx 0.992\).
- \(b_{YX} = \dfrac{n \sum XY - \sum X \sum Y}{n \sum X^2 - (\sum X)^2} = \dfrac{160}{130} \approx 1.231\).
- \(a = \bar Y - b_{YX} \bar X = 8 - 1.231 \times 5 = 1.846\).
- Y-on-X line: \(Y = 1.846 + 1.231 X\).
- \(r^2 \approx 0.984\) — about 98 per cent of variation in Y is explained by X.
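The same numbers can be reproduced with numpy as a sanity check (a rough sketch; `np.polyfit` fits the least-squares line directly):

```python
import numpy as np

x = np.array([2, 3, 5, 7, 8], dtype=float)
y = np.array([4, 6, 8, 10, 12], dtype=float)

b, a = np.polyfit(x, y, 1)          # degree-1 fit returns [slope, intercept]
r = np.corrcoef(x, y)[0, 1]

print(round(b, 3), round(a, 3))     # 1.231 1.846
print(round(r, 3), round(r**2, 2))  # 0.992 0.98 -- about 98 per cent explained
```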
42.6 Standard Error of Estimate
The standard error of estimate measures the typical deviation of observed Y values from the regression line:
\[ S_{YX} = \sqrt{\dfrac{\sum (Y - \hat Y)^2}{n}} = \sigma_Y \sqrt{1 - r^2} \]
Lower \(S_{YX}\) → tighter fit; in the limit \(r = \pm 1\), \(S_{YX} = 0\) (perfect fit).
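Both routes to \(S_{YX}\) agree for a least-squares fit, as this small sketch suggests (same illustrative data; population standard deviation, i.e. the \(n\) denominator, to match the formula above):

```python
import numpy as np

x = np.array([2, 3, 5, 7, 8], dtype=float)
y = np.array([4, 6, 8, 10, 12], dtype=float)

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x
r = np.corrcoef(x, y)[0, 1]

s_direct   = np.sqrt(((y - y_hat) ** 2).mean())   # sqrt(sum of squared residuals / n)
s_shortcut = y.std() * np.sqrt(1 - r**2)          # sigma_Y * sqrt(1 - r^2), ddof=0

print(round(s_direct, 4), round(s_shortcut, 4))   # both ~= 0.3508
```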
42.7 Differences Between Correlation and Regression
| Dimension | Correlation | Regression |
|---|---|---|
| Purpose | Measure strength of association | Estimate functional relationship; predict |
| Symmetric? | Yes (\(r_{XY} = r_{YX}\)) | No (Y-on-X ≠ X-on-Y) |
| Direction | Both directions same | Two distinct regression lines |
| Output | A single dimensionless number | An equation; a line |
| Causation | Does not imply causation | Does not imply causation |
| Use | Hypothesis-testing on association | Forecasting and inference |
42.8 Multiple and Partial Correlation
When more than two variables are involved, multiple correlation measures the joint association of several X’s with one Y; partial correlation measures the association of two variables holding others constant. The standard notation is:
- \(R_{1.23}\) — multiple correlation of \(X_1\) with \(X_2\) and \(X_3\).
- \(r_{12.3}\) — partial correlation between \(X_1\) and \(X_2\), controlling for \(X_3\).
The natural extension of regression — multiple regression — fits \(Y = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k\) and is the workhorse of applied econometrics.
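A minimal multiple-regression sketch with `numpy.linalg.lstsq` (the second regressor X2 and all data values are purely hypothetical, added only to illustrate the fit):

```python
import numpy as np

# Hypothetical data: one response Y and two regressors X1, X2.
X1 = np.array([2, 3, 5, 7, 8], dtype=float)
X2 = np.array([1, 2, 2, 3, 5], dtype=float)
Y  = np.array([4, 6, 8, 10, 12], dtype=float)

# Design matrix with a leading column of ones for the intercept a.
A = np.column_stack([np.ones_like(X1), X1, X2])
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coeffs

print(a, b1, b2)   # fitted equation: Y = a + b1*X1 + b2*X2
```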
42.9 Exam-Pattern MCQs
Match each method with the data it suits best:

| | Method | | Best for |
|---|---|---|---|
| (i) | Scatter diagram | (a) | Visual inspection of any data type |
| (ii) | Karl Pearson's $r$ | (b) | Ordinal / ranked data |
| (iii) | Spearman's $\rho$ | (c) | Rank-based, robust to outliers |
| (iv) | Kendall's $\tau$ | (d) | Interval / ratio data with linear pattern |
Match each property with the statistic it describes:

| | Property | | Statistic |
|---|---|---|---|
| (i) | Range $-1$ to $+1$ | (a) | Coefficient of determination |
| (ii) | Independent of change of origin and scale | (b) | Karl Pearson's $r$ |
| (iii) | Equal to $r^2$ | (c) | Regression coefficients $b_{YX} \cdot b_{XY}$ |
| (iv) | Affected by change of scale | (d) | Regression coefficient $b_{YX}$ |
Match each statistic with what it captures:

| | Statistic | | Captures |
|---|---|---|---|
| (i) | $r$ | (a) | Functional relationship; allows prediction |
| (ii) | $r^2$ | (b) | Strength of linear association |
| (iii) | Regression equation | (c) | Typical deviation of observed Y from the fitted line |
| (iv) | Standard error of estimate | (d) | Fraction of variation in Y explained by X |
42.10 Key Points
- Correlation = strength and direction of association. Regression = functional relationship and prediction.
- Types: positive / negative / zero; simple / multiple / partial; linear / non-linear.
- Methods: scatter, Pearson r, Spearman \(\rho\), Kendall \(\tau\).
- Pearson r: \(-1 \leq r \leq +1\); symmetric; independent of origin and (positive) scale.
- Coefficient of determination: \(r^2\) = share of Y’s variation explained by X.
- Probable error: \(P.E. = 0.6745 (1 - r^2)/\sqrt{n}\). Significant if \(|r| > 6 \cdot P.E.\)
- Spearman: \(\rho = 1 - \dfrac{6 \sum d^2}{n(n^2 - 1)}\); tie correction adds \(\sum (m^3 - m)/12\) to \(\sum d^2\).
- Galton (1886) coined “regression”; Pearson (1895) formalised correlation; Spearman (1904) rank correlation; Gauss (1809) method of least squares.
- Two regression lines: Y on X (\(Y - \bar Y = b_{YX}(X - \bar X)\)) and X on Y. They intersect at \((\bar X, \bar Y)\).
- Regression coefficients: \(b_{YX} = r \sigma_Y / \sigma_X\); \(b_{XY} = r \sigma_X / \sigma_Y\). \(r = \pm \sqrt{b_{YX} \cdot b_{XY}}\).
- Both \(b\)’s and \(r\) have the same sign. At most one of \(b_{YX}, b_{XY}\) can exceed 1.
- Standard error of estimate: \(S_{YX} = \sigma_Y \sqrt{1 - r^2}\).
- Correlation ≠ causation.