flowchart LR
CR[Correlation] -->|measure of association| R[r ∈ −1, +1]
RG[Regression] -->|functional form| L[Y = a + bX]
R -.-> RG
RG --> SE[Standard Error of Estimate]
RG --> R2[R² Coefficient of Determination]
classDef default fill:#003366,color:#ffffff,stroke:#ffcc00,stroke-width:3px,rx:10px,ry:10px;
41 Correlation and regression of two variables
41.1 Two Foundational Tools of Bivariate Analysis
This topic combines the two foundational tools of bivariate analysis. Correlation measures how strongly two variables move together; regression estimates how one variable changes with another and so lets us predict one from the other. Both are essential to empirical research in commerce, economics, marketing and finance. They are usually taught together because they share the same mathematical machinery (sums of squares and cross-products), and because both are powerful when used carefully and misleading when used carelessly: "correlation does not imply causation" is the most-quoted caveat in statistics.
41.2 Correlation — Concept
Correlation measures the degree of linear association between two variables. The coefficient is bounded between −1 and +1. Positive correlation: as X rises, Y tends to rise. Negative correlation: as X rises, Y tends to fall. Zero correlation: no linear association.
| Basis | Categories |
|---|---|
| Direction | Positive · Negative · Zero |
| Number of variables | Simple (two) · Partial · Multiple |
| Relationship | Linear · Non-linear (curvilinear) |
| Method | Karl Pearson’s · Spearman’s rank · Concurrent deviation · Scatter diagram |
41.3 Karl Pearson’s Coefficient of Correlation (r)
\[r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \cdot \sum (Y - \bar{Y})^2}} = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}\]
- Range: −1 ≤ r ≤ +1.
- Symmetric: r(X, Y) = r(Y, X).
- Independent of change of origin and scale (the sign of r flips only if X and Y are rescaled by factors of opposite sign).
- Measures linear relationship only — does not capture curved relationships.
- r² = coefficient of determination — fraction of variation in Y explained by X.
- Not a measure of causation.
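The deviation formula and two of the properties above can be checked numerically. The following is a minimal Python sketch, not a reference implementation: the function name `pearson_r` and the five data points are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the two means."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and the two sums of squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_swapped = pearson_r(y, x)  # symmetry: r(X, Y) = r(Y, X)
# Change of origin and (positive) scale leaves r unchanged.
r_rescaled = pearson_r([10 * v + 3 for v in x], [0.5 * v - 1 for v in y])

print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4))
```

For this data the sum of cross-products is 6 and the sums of squares are 10 and 6, so r = 6/√60 ≈ 0.7746 in all three calls.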
41.3.1 Interpretation Guide
| Absolute value of r | Strength of linear association |
|---|---|
| 0.0 to 0.3 | Weak / none |
| 0.3 to 0.7 | Moderate |
| 0.7 to 1.0 | Strong |
| 1.0 | Perfect linear |
41.4 Spearman’s Rank Correlation (ρ)
For ordinal data or non-linear monotonic relationships, use Spearman’s rank correlation:
\[\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\]
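where \(d_i\) is the difference between the two ranks of the i-th pair and n is the number of pairs. The formula is exact only when there are no tied ranks.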
Range: −1 to +1; same interpretation as r but for ranked data.
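As a hedged illustration of the rank formula, the sketch below ranks two made-up score lists and applies \(1 - 6\sum d_i^2 / [n(n^2 - 1)]\). The helper names `ranks` and `spearman_rho` are hypothetical, and the sketch deliberately rejects ties rather than guessing at a tie correction.

```python
def ranks(values):
    """Rank values from 1 (largest) to n. Assumes no tied values."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch does not handle tied values")
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman's rho from squared rank differences (no-ties formula)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    rank_x, rank_y = ranks(xs), ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Illustrative (made-up) scores awarded by two judges to five entries.
judge_a = [35, 23, 47, 17, 10]
judge_b = [30, 33, 45, 23, 8]

rho = spearman_rho(judge_a, judge_b)
print(ranks(judge_a), ranks(judge_b), round(rho, 2))
```

Here the ranks are [2, 3, 1, 4, 5] and [3, 2, 1, 4, 5], so Σd² = 2 and ρ = 1 − 12/120 = 0.9.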
41.5 Concurrent Deviations Method
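Quick approximation: \[r_c = \pm \sqrt{\pm \frac{2c - n}{n}}\]
where c = number of concurrent deviations (pairs in which X and Y change in the same direction) and n = total number of paired deviations, which is one fewer than the number of observations. The same sign is used inside and outside the root: plus when 2c − n is positive, minus when it is negative, so the quantity under the root is never negative.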
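The method can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a standard library routine: the function name and data are made up, deviations are taken as the direction of change from one observation to the next, and a pair in which neither series changes is counted as concurrent (conventions for zero changes vary between textbooks).

```python
import math


def sign(v):
    """Return -1, 0 or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def concurrent_deviation_r(xs, ys):
    """Coefficient of concurrent deviations for two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    # Direction of change between successive observations.
    dx = [sign(b - a) for a, b in zip(xs, xs[1:])]
    dy = [sign(b - a) for a, b in zip(ys, ys[1:])]
    n = len(dx)  # number of paired deviations
    c = sum(1 for sx, sy in zip(dx, dy) if sx == sy)  # concurrent pairs
    inner = (2 * c - n) / n
    # The same sign is applied inside and outside the root.
    return math.copysign(math.sqrt(abs(inner)), inner)


# Illustrative (made-up) series.
x = [10, 12, 15, 14, 18, 20]
y = [5, 7, 6, 5, 9, 10]
y_opposite = [10, 9, 8, 9, 7, 5]  # always moves against x

r_c = concurrent_deviation_r(x, y)
r_c_opposite = concurrent_deviation_r(x, y_opposite)
print(round(r_c, 4), r_c_opposite)
```

With x and y, four of the five deviation pairs are concurrent, giving √(3/5) ≈ 0.7746; with `y_opposite` none are, giving −1.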
41.6 Regression — Concept
Regression analysis estimates the functional relationship between a dependent variable Y and one or more independent variables X. The standard fitting method, Ordinary Least Squares (OLS), chooses the line that minimises the sum of squared residuals.
41.6.1 Simple Linear Regression — Two Lines
| Regression | Equation | Slope formula |
|---|---|---|
| Y on X (predict Y from X) | \(Y - \bar{Y} = b_{yx}(X - \bar{X})\) | \(b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x}\) |
| X on Y (predict X from Y) | \(X - \bar{X} = b_{xy}(Y - \bar{Y})\) | \(b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y}\) |
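A minimal Python sketch of the two lines follows, using made-up data and hypothetical function names. It relies on the identity \(r \cdot \sigma_y / \sigma_x = \sum (X - \bar{X})(Y - \bar{Y}) / \sum (X - \bar{X})^2\) (and the mirror image for X on Y), so the slopes are computed directly from sums of squares and cross-products.

```python
def regression_lines(xs, ys):
    """Return (mean_x, mean_y, b_yx, b_xy) for the two regression lines."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b_yx = sxy / sxx  # slope of Y on X
    b_xy = sxy / syy  # slope of X on Y
    return mean_x, mean_y, b_yx, b_xy


# Illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
mean_x, mean_y, b_yx, b_xy = regression_lines(x, y)


def predict_y(x_new):
    """Y on X: Y - mean_y = b_yx * (X - mean_x)."""
    return mean_y + b_yx * (x_new - mean_x)


def predict_x(y_new):
    """X on Y: X - mean_x = b_xy * (Y - mean_y)."""
    return mean_x + b_xy * (y_new - mean_y)


print(round(b_yx, 2), round(b_xy, 2))
print(round(predict_y(6), 2), round(predict_x(5), 2))
```

For this data the means are (3, 4), b_yx = 0.6 and b_xy = 1.0, so the Y-on-X line predicts Y = 5.8 at X = 6 while the X-on-Y line predicts X = 4.0 at Y = 5. The two lines are different tools: each minimises squared errors in the variable it predicts.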
41.6.2 Properties of Regression Coefficients
- Both slopes have the same sign — same sign as r.
- \(r = \pm\sqrt{b_{yx} \cdot b_{xy}}\): r is the geometric mean of the two regression slopes and takes their common sign.
- Independent of change of origin.
- Dependent on change of scale.
- Two regression lines intersect at (X̄, Ȳ).
- They are identical only when r = ±1.
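- The angle between the two lines indicates the strength of correlation: the smaller the angle, the stronger the correlation. At r = 0 the lines are perpendicular (each parallel to an axis); at r = ±1 they coincide.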
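The geometric-mean property follows directly from the two slope formulas. A short derivation, with made-up slope values for the numeric step:

\[b_{yx} \cdot b_{xy} = \left(r \cdot \frac{\sigma_y}{\sigma_x}\right)\left(r \cdot \frac{\sigma_x}{\sigma_y}\right) = r^2 \quad\Rightarrow\quad r = \pm\sqrt{b_{yx} \cdot b_{xy}}\]

For example, if \(b_{yx} = 0.4\) and \(b_{xy} = 0.9\) (illustrative values), then \(r^2 = 0.36\) and, since both slopes are positive, \(r = +0.6\). Because \(r^2 \le 1\), the product of the two slopes can never exceed 1.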
41.6.3 Standard Error of Estimate
\[S_{y.x} = \sigma_y \sqrt{1 - r^2}\]
Measures the typical prediction error of Y given X. When r = ±1, \(S_{y.x} = 0\): perfect prediction. When r = 0, \(S_{y.x} = \sigma_y\): knowing X does not reduce the error at all.
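The formula can be cross-checked against the residuals of the fitted line. The sketch below is illustrative only: the function name and data are made up, and it assumes the population convention (divisor n) for both \(\sigma_y\) and the residual variance, under which the identity holds exactly. Sample-based versions that divide by n − 2 give a slightly larger value.

```python
import math


def standard_error_two_ways(xs, ys):
    """Return S_y.x computed (a) from residuals and (b) from sigma_y * sqrt(1 - r^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)

    # (a) Root mean squared residual of the Y-on-X line (divisor n).
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    from_residuals = math.sqrt(sse / n)

    # (b) The closed-form expression, with sigma_y also using divisor n.
    r = sxy / math.sqrt(sxx * syy)
    sigma_y = math.sqrt(syy / n)
    from_formula = sigma_y * math.sqrt(1 - r * r)
    return from_residuals, from_formula


# Illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
from_residuals, from_formula = standard_error_two_ways(x, y)
print(round(from_residuals, 4), round(from_formula, 4))
```

Both routes give √0.48 ≈ 0.6928 for this data.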
41.7 Coefficient of Determination (R²)
\[R^2 = r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = 1 - \frac{SSE}{SST}\]
Range: 0 to 1. The higher the value, the better the linear fit: R² = 0.8 means 80% of the variation in Y is explained by X.
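The split of total variation into explained and unexplained parts can be made concrete with a small sketch. The function name and data are made up for illustration; SST, SSR and SSE denote the total, explained (regression) and residual sums of squares for the least-squares line of Y on X.

```python
def variation_breakdown(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line of Y on X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * x for x in xs]
    sst = sum((y - mean_y) ** 2 for y in ys)               # total variation
    ssr = sum((f - mean_y) ** 2 for f in fitted)           # explained variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # unexplained variation
    return sst, ssr, sse


# Illustrative (made-up) data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sst, ssr, sse = variation_breakdown(x, y)
r_squared = 1 - sse / sst
print(round(sst, 2), round(ssr, 2), round(sse, 2), round(r_squared, 2))
```

For this data SST = 6 splits into SSR = 3.6 and SSE = 2.4, so R² = 3.6/6 = 1 − 2.4/6 = 0.6: the line accounts for 60% of the variation in Y.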
41.8 Multiple Regression
Extends to several independent variables: \(Y = a + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k + e\). Multiple R² and adjusted R² measure overall fit.
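A rough sketch of a two-predictor fit is given below, in plain Python so that it needs no external libraries. Everything in it is an assumption made for illustration: the function names, the made-up data, the choice to solve the normal equations by Gauss-Jordan elimination, and the usual textbook definition of adjusted R², \(1 - (1 - R^2)(n - 1)/(n - k - 1)\) for k predictors. In practice a statistics package would normally be used instead.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple(rows, ys):
    """OLS fit of Y on several predictors; returns (coefficients, R2, adjusted R2)."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept a
    p = len(design[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    coef = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, d)) for d in design]
    mean_y = sum(ys) / len(ys)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    sst = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - sse / sst
    n, k = len(ys), p - 1  # k = number of predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return coef, r2, adj_r2


# Illustrative (made-up) data: roughly Y = 1 + 2*X1 + 3*X2 plus small errors.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [9.5, 7.5, 19.0, 18.5, 28.5, 28.0]
coef, r2, adj_r2 = fit_multiple(rows, y)
print([round(c, 3) for c in coef], round(r2, 4), round(adj_r2, 4))
```

Adjusted R² is never larger than R² because it penalises each extra predictor; that makes it the fairer figure when comparing models with different numbers of independent variables.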
Correlation and regression are related but distinct. Correlation is symmetric: r(X,Y) = r(Y,X). Regression is asymmetric: b_yx ≠ b_xy in general. r² = b_yx × b_xy (sign of r given by sign of slopes).
41.9 Practice Questions
Karl Pearson's r lies between:
Spearman's rank correlation formula uses:
The two regression coefficients are related to r by:
The two regression lines intersect at:
If r = 0.8, the coefficient of determination is:
If b_yx = 0.5 and b_xy = 0.8, then r equals:
Two variables have no linear relationship (are uncorrelated) if r equals:
"Correlation implies causation" is:
Spearman's rank correlation is most appropriate for:
The two regression lines are *identical* when:
Standard error of estimate of Y on X equals:
The slope of regression of Y on X is:
OLS minimises:
A *negative* correlation means:
If b_yx = 0.6 and b_xy = 0.7, then |r| equals approximately:
The Concurrent Deviation method gives only the:
A *scatter diagram* is used to:
In Spearman's rank correlation, tied ranks are typically handled by:
r = +1 indicates:
To **predict** Y from X, use:
41.10 Quick Recall
- Correlation: degree of linear association; range [−1, +1]. Karl Pearson's r = Cov(X,Y)/(σ_x σ_y).
- r² = Coefficient of Determination — fraction of Y variation explained by X.
- Properties: symmetric, independent of origin/scale; measures linear only; doesn’t imply causation.
- Spearman ρ = 1 − 6Σd²/[n(n²−1)]: for ranked / ordinal / monotonic data.
- Regression: Y on X (b_yx = r σ_y/σ_x); X on Y (b_xy = r σ_x/σ_y).
- r = ±√(b_yx · b_xy) (geometric mean of slopes; same sign).
- Both regression lines intersect at (X̄, Ȳ); identical when r = ±1.
- Standard error of estimate S_y.x = σ_y √(1 − r²).
- OLS minimises the sum of squared residuals.