41  Correlation and regression of two variables

41.1 Two Foundational Tools of Bivariate Analysis

This topic combines the two foundational tools of bivariate analysis: correlation measures how strongly two variables move together, while regression estimates how one variable changes with another and, by extension, predicts one from the other. Both are essential to empirical research in commerce, economics, marketing and finance. They are often discussed together because they share mathematical machinery (sums of squares and cross-products), and both are powerful when used carefully and misleading when used carelessly; "correlation does not imply causation" is the most-quoted caveat in statistics.

41.2 Correlation — Concept

Correlation measures the degree of linear association between two variables. It is bounded between −1 and +1. Positive correlation: as X rises, Y tends to rise. Negative correlation: as X rises, Y tends to fall. Zero correlation: no linear association.

Tip: Types of Correlation

| Basis | Categories |
|---|---|
| Direction | Positive · Negative · Zero |
| Number of variables | Simple (two) · Partial · Multiple |
| Relationship | Linear · Non-linear (curvilinear) |
| Method | Karl Pearson’s · Spearman’s rank · Concurrent deviation · Scatter diagram |

41.3 Karl Pearson’s Coefficient of Correlation (r)

\[r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \cdot \sum (Y - \bar{Y})^2}} = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}\]

Tip: Properties of Pearson’s r
  • Range: −1 ≤ r ≤ +1.
  • Symmetric: r(X, Y) = r(Y, X).
  • Independent of change of origin and scale (only the sign of r flips if exactly one variable is multiplied by a negative constant).
  • Measures linear relationship only — does not capture curved relationships.
  • r² = coefficient of determination — fraction of variation in Y explained by X.
  • Not a measure of causation.
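
The deviation form of the formula and the first three properties can be checked numerically. The following is a minimal Python sketch, not a reference implementation; the function name `pearson_r` and the advertising and sales figures are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the two means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxy = sum(a * b for a, b in zip(dx, dy))   # sum of cross-products
    sxx = sum(a * a for a in dx)               # sum of squares of X deviations
    syy = sum(b * b for b in dy)               # sum of squares of Y deviations
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r = pearson_r(advertising, sales)              # 6 / sqrt(10 * 6), about 0.775
r_swapped = pearson_r(sales, advertising)      # symmetry: same value
r_rescaled = pearson_r([10 * x + 3 for x in advertising], sales)  # origin and scale changed
print(round(r, 3), round(r_swapped, 3), round(r_rescaled, 3))
```

All three printed values agree, which is what symmetry and independence of origin and scale predict.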

41.3.1 Interpretation Guide

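Tip: Interpreting r

| Absolute value of r | Strength of linear association |
|---|---|
| 0.0 to 0.3 | Weak / none |
| 0.3 to 0.7 | Moderate |
| 0.7 to below 1.0 | Strong |
| 1.0 | Perfect linear |

The sign of r gives the direction; the cut-offs are conventional rules of thumb rather than fixed standards.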

41.4 Spearman’s Rank Correlation (ρ)

For ordinal data or non-linear monotonic relationships, use Spearman’s rank correlation:

\[\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\]

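where \(d_i\) is the difference between the two ranks of the i-th pair and n is the number of pairs. The formula in this form assumes there are no tied ranks.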

Range: −1 to +1; same interpretation as r but for ranked data.
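
As a rough illustration of the rank-difference formula, here is a small Python sketch. It assumes there are no ties, and the helper names `ranks` and `spearman_rho` and the two judges' scores are hypothetical.

```python
def ranks(values):
    """Rank 1 = smallest value. This sketch assumes no tied values."""
    if len(set(values)) != len(values):
        raise ValueError("tied values need average ranks and a corrected formula")
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), with d = rank difference."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))

# Hypothetical scores given by two judges to five contestants.
judge_a = [86, 70, 91, 65, 78]   # ranks 4, 2, 5, 1, 3
judge_b = [80, 75, 88, 60, 72]   # ranks 4, 3, 5, 1, 2
rho = spearman_rho(judge_a, judge_b)   # sum(d^2) = 2, so rho = 1 - 12/120

# A monotonic but non-linear relationship still gives rho = 1.
rho_cubic = spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])
print(round(rho, 3), rho_cubic)
```

The second call shows why ρ suits monotonic relationships: only the ordering matters, not the shape of the curve.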

41.5 Concurrent Deviations Method

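Quick approximation: \[r_c = \pm \sqrt{\pm \frac{2c - n}{n}}\]

where c = number of concurrent deviations (pairs in which X and Y change in the same direction) and n = total number of paired deviations, which is one less than the number of observations. Both ± signs take the sign of (2c − n): positive when concurrent deviations are in the majority (2c > n) and negative otherwise, so the quantity under the root is never negative.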
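
The counting procedure can be sketched in Python as follows. This is an illustrative sketch under one stated assumption: a pair counts as concurrent only when both series move strictly in the same direction (textbooks differ on how to treat "no change"). The function name and the data are hypothetical.

```python
import math

def sign(v):
    """Return +1, -1 or 0 according to the sign of v."""
    return (v > 0) - (v < 0)

def concurrent_deviation_r(xs, ys):
    """Coefficient of concurrent deviations from directions of change only."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    # Direction of change from each observation to the next.
    dx = [sign(b - a) for a, b in zip(xs, xs[1:])]
    dy = [sign(b - a) for a, b in zip(ys, ys[1:])]
    n = len(dx)                                        # number of paired deviations
    c = sum(1 for a, b in zip(dx, dy) if a * b > 0)    # assumption: strict same direction
    ratio = (2 * c - n) / n
    # Take the root of the absolute value, then restore the sign of (2c - n).
    return math.copysign(math.sqrt(abs(ratio)), ratio)

# Hypothetical series, for illustration only.
supply = [10, 12, 11, 15, 17, 16, 20]   # changes: + - + + - +
price = [5, 7, 6, 8, 7, 6, 9]           # changes: + - + - - +
r_c = concurrent_deviation_r(supply, price)   # c = 5, n = 6, about 0.816
print(round(r_c, 3))
```

Only the directions of change enter the calculation, which is why the method is quick but rough.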

41.6 Regression — Concept

Regression analysis estimates the functional relationship between a dependent variable Y and one or more independent variables X. The standard estimation method is Ordinary Least Squares (OLS), which fits the line that minimises the sum of squared residuals.

41.6.1 Simple Linear Regression — Two Lines

Tip: Two Regression Equations

| Regression | Equation | Slope formula |
|---|---|---|
| Y on X (predict Y from X) | \(Y - \bar{Y} = b_{yx}(X - \bar{X})\) | \(b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x}\) |
| X on Y (predict X from Y) | \(X - \bar{X} = b_{xy}(Y - \bar{Y})\) | \(b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y}\) |
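
Both slopes can also be computed directly from sums of squares and cross-products, because r·σ_y/σ_x simplifies to Σ(X − X̄)(Y − Ȳ)/Σ(X − X̄)², and similarly for the other line. The Python sketch below illustrates this; the function name `regression_lines` and the data are made up for the example.

```python
def regression_lines(xs, ys):
    """Return (mean_x, mean_y, b_yx, b_xy) for the two regression lines."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("slopes are undefined when a variable is constant")
    b_yx = sxy / sxx   # slope of Y on X, equal to r * sigma_y / sigma_x
    b_xy = sxy / syy   # slope of X on Y, equal to r * sigma_x / sigma_y
    return mean_x, mean_y, b_yx, b_xy

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
mean_x, mean_y, b_yx, b_xy = regression_lines(x, y)

# Y on X: predict Y when X = 6.
y_hat = mean_y + b_yx * (6 - mean_x)
# X on Y: predict X when Y = 5.
x_hat = mean_x + b_xy * (5 - mean_y)
print(b_yx, b_xy, round(y_hat, 2), round(x_hat, 2))
```

Note that the two lines give different equations for the same data: each minimises squared errors in its own dependent variable, so the Y-on-X line should not be inverted to predict X.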

41.6.2 Properties of Regression Coefficients

Tip: Properties
  • Both slopes have the same sign, which is also the sign of r.
  • \(r = \pm\sqrt{b_{yx} \cdot b_{xy}}\): r is the geometric mean of the two regression slopes and takes their common sign.
  • Independent of change of origin.
  • Dependent on change of scale.
  • Two regression lines intersect at (X̄, Ȳ).
  • They are identical only when r = ±1.
  • The angle between the two lines indicates the strength of correlation: smaller angle → stronger correlation.
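
The first two properties follow from the slope formulas \(b_{yx} = r\,\sigma_y/\sigma_x\) and \(b_{xy} = r\,\sigma_x/\sigma_y\). A short worked derivation, offered as a sketch:

\[b_{yx} \cdot b_{xy} = r\frac{\sigma_y}{\sigma_x} \cdot r\frac{\sigma_x}{\sigma_y} = r^2 \quad\Rightarrow\quad r = \pm\sqrt{b_{yx} \cdot b_{xy}}\]

Because \(\sigma_x\) and \(\sigma_y\) are positive, each slope carries the sign of r, so the two slopes always agree in sign and r takes that sign. Since \(r^2 \le 1\), the product of the two slopes can never exceed 1. For example, with \(b_{yx} = 0.5\) and \(b_{xy} = 0.8\) (both positive):

\[r = +\sqrt{0.5 \times 0.8} = \sqrt{0.4} \approx 0.632\]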

41.6.3 Standard Error of Estimate

\[S_{y.x} = \sigma_y \sqrt{1 - r^2}\]

Measures the typical prediction error of Y given X. When r = ±1, S_y.x = 0 — perfect prediction.
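
To see that the formula really is a typical residual size, the sketch below computes it two ways: from σ_y and r, and as the root mean square of the residuals about the Y-on-X line. It assumes the population form (dividing by n), which is what the formula above implies; some texts divide by n − 2 instead. The function name and data are hypothetical.

```python
import math

def standard_error_of_estimate(xs, ys):
    """Return S_y.x computed (a) from sigma_y and r, (b) from the residuals."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("undefined when a variable is constant")

    sigma_y = math.sqrt(syy / n)                 # population SD (divide by n)
    r = sxy / math.sqrt(sxx * syy)
    # max() guards against a tiny negative value from rounding when |r| = 1.
    from_formula = sigma_y * math.sqrt(max(0.0, 1 - r * r))

    b_yx = sxy / sxx
    residuals = [(y - mean_y) - b_yx * (x - mean_x) for x, y in zip(xs, ys)]
    from_residuals = math.sqrt(sum(e * e for e in residuals) / n)
    return from_formula, from_residuals

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
s_formula, s_residuals = standard_error_of_estimate(x, y)
print(round(s_formula, 4), round(s_residuals, 4))   # both about 0.6928
```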

41.7 Coefficient of Determination (R²)

\[R^2 = r^2 = \frac{\text{Explained variation}}{\text{Total variation}} = 1 - \frac{SSE}{SST}\]

Range: 0 to 1. The higher, the better the linear fit. R² = 0.8 means 80 % of variation in Y is explained by X.
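
The split of total variation into explained and unexplained parts can be illustrated with a short Python sketch. The function name `variation_breakdown` and the data are assumptions made for this example; SSR here denotes the explained (regression) sum of squares.

```python
def variation_breakdown(xs, ys):
    """Fit Y = a + bX by least squares and return (SST, SSR, SSE)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("X must not be constant")
    b = sxy / sxx
    a = mean_y - b * mean_x
    fitted = [a + b * x for x in xs]

    sst = sum((y - mean_y) ** 2 for y in ys)              # total variation
    ssr = sum((f - mean_y) ** 2 for f in fitted)          # explained variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained variation
    return sst, ssr, sse

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sst, ssr, sse = variation_breakdown(x, y)
r_squared = 1 - sse / sst
print(round(sst, 2), round(ssr, 2), round(sse, 2), round(r_squared, 2))
# SST = 6.0, SSR = 3.6, SSE = 2.4, so R² = 0.6
```

For this data r ≈ 0.775, and 0.775² ≈ 0.6, matching R² = r² for simple linear regression.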

41.8 Multiple Regression

Extends to several independent variables: \(Y = a + b_1 X_1 + b_2 X_2 + \ldots + b_k X_k + e\). Multiple R² and adjusted R² measure overall fit.
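
As a sketch of how such a model can be fitted, the Python code below solves the least-squares normal equations (X′X)β = X′y with a small Gauss-Jordan routine, then reports R² and adjusted R² using the usual formula 1 − (1 − R²)(n − 1)/(n − k − 1). In practice a numerical library would be used; the function names and the data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a·x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(n):
            if i != col:
                factor = m[i][col] / m[col][col]
                m[i] = [u - factor * v for u, v in zip(m[i], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_multiple(rows, ys):
    """Return OLS coefficients [a, b1, ..., bk] from the normal equations."""
    design = [[1.0] + list(r) for r in rows]         # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

def r2_and_adjusted(rows, ys, coefs):
    """Return (R², adjusted R²) for fitted coefficients."""
    n, k = len(ys), len(coefs) - 1                   # k = number of predictors
    fitted = [coefs[0] + sum(b * x for b, x in zip(coefs[1:], r)) for r in rows]
    mean_y = sum(ys) / n
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    sst = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - sse / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

# Hypothetical data: six observations of (X1, X2) and Y.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
ys = [9.1, 7.9, 19.2, 17.8, 29.0, 28.1]
coefs = fit_multiple(rows, ys)
r2, adj = r2_and_adjusted(rows, ys, coefs)
print([round(c, 3) for c in coefs], round(r2, 4), round(adj, 4))
```

Adjusted R² penalises extra predictors, so it never exceeds R² and is the fairer figure when comparing models with different numbers of variables.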

flowchart LR
  CR[Correlation] -->|measure of association| R["r ∈ [−1, +1]"]
  RG[Regression] -->|functional form| L[Y = a + bX]
  R -.-> RG
  RG --> SE[Standard Error of Estimate]
  RG --> R2[R² Coefficient of Determination]
    classDef default fill:#003366,color:#ffffff,stroke:#ffcc00,stroke-width:3px,rx:10px,ry:10px;

Note: Distractor warning

Correlation and regression are related but distinct. Correlation is symmetric: r(X,Y) = r(Y,X). Regression is asymmetric: b_yx ≠ b_xy in general. r² = b_yx × b_xy (sign of r given by sign of slopes).

41.9 Practice Questions

Q 01 · Range · Easy

Karl Pearson's r lies between:

  • A. 0 and 1
  • B. −1 and +1
  • C. −∞ and +∞
  • D. 0 and 100

Correct Option: B
**−1 ≤ r ≤ +1**.

Q 02 · Spearman · Medium

Spearman's rank correlation formula uses:

  • A. Sum of products of ranks
  • B. Sum of squared rank differences
  • C. Geometric mean of slopes
  • D. Covariance

Correct Option: B
ρ = 1 − 6 Σd² / [n(n² − 1)].

Q 03 · Slopes · Medium

The two regression coefficients are related to r by:

  • A. r = AM of slopes
  • B. r = GM of slopes
  • C. r = HM of slopes
  • D. r = SD of slopes

Correct Option: B
**r = ±√(b_yx · b_xy)**, the geometric mean of the two slopes.

Q 04 · Intersect · Medium

The two regression lines intersect at:

  • A. Origin (0, 0)
  • B. (X̄, Ȳ)
  • C. (X_max, Y_max)
  • D. Anywhere on the diagonal

Correct Option: B
Both lines pass through **(mean X, mean Y)**.

Q 05 · Medium

If r = 0.8, the coefficient of determination is:

  • A. 0.16
  • B. 0.64
  • C. 0.80
  • D. 1.0

Correct Option: B
r² = 0.8² = **0.64**, so 64 % of the variation in Y is explained by X.

Q 06 · Sign · Medium

If b_yx = 0.5 and b_xy = 0.8, then r equals:

  • A. 0.4
  • B. 0.63
  • C. 0.65
  • D. 1.3

Correct Option: B
r = √(0.5 × 0.8) = √0.4 ≈ **0.632**, positive because both slopes are positive.

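Q 07 · Independent · Medium

Two variables have no linear relationship (are uncorrelated) if r equals:

  • A. 0
  • B. +1
  • C. −1
  • D.

Correct Option: A
r = 0 indicates only the absence of a *linear* relationship; true statistical independence is a stronger condition, so r = 0 alone does not establish it.
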
Q 08 · Causation · Medium

"Correlation implies causation" is:

  • A. True
  • B. False; correlation does not establish causation
  • C. True only for r > 0.5
  • D. True only for negative correlation

Correct Option: B
Spurious correlations, reverse causation and lurking variables can all produce correlation without causation.

Q 09 · Spearman use · Medium

Spearman's rank correlation is most appropriate for:

  • A. Continuous data with linear relationship
  • B. Ranked or ordinal data
  • C. Causal analysis
  • D. Multiple variables

Correct Option: B
Spearman's ρ suits ranked or ordinal data and monotonic relationships that are not linear.

Q 10 · Identical · Hard

The two regression lines are *identical* when:

  • A. r = 0
  • B. r = ±1
  • C. r = 0.5
  • D. Always

Correct Option: B
Perfect correlation → both lines coincide.

Q 11 · SE · Hard

Standard error of estimate of Y on X equals:

  • A. σ_y × √(1 − r²)
  • B. σ_x × r
  • C. σ_y / r
  • D.

Correct Option: A
**S_y.x = σ_y × √(1 − r²)**, which is zero when r = ±1.

Q 12 · Slope formula · Medium

The slope of regression of Y on X is:

  • A. r × (σ_x / σ_y)
  • B. r × (σ_y / σ_x)
  • C. σ_y × σ_x
  • D. σ_y / σ_x

Correct Option: B
**b_yx = r × σ_y/σ_x**.

Q 13 · Method · Hard

OLS minimises:

  • A. Sum of residuals
  • B. Sum of squared residuals
  • C. Maximum absolute residual
  • D. Product of residuals

Correct Option: B
**OLS** (Ordinary Least Squares) minimises Σ e_i².

Q 14 · Direction · Easy

A *negative* correlation means:

  • A. No relationship
  • B. As X rises, Y tends to fall
  • C. As X rises, Y rises
  • D. Non-linear

Correct Option: B
Negative correlation: the variables tend to move in opposite directions.

Q 15 · Limits · Hard

If b_yx = 0.6 and b_xy = 0.7, then |r| equals approximately:

  • A. 0.42
  • B. 0.65
  • C. 0.85
  • D. 0.95

Correct Option: B
|r| = √(0.6 × 0.7) = √0.42 ≈ **0.648**.

Q 16 · Concurrent · Hard

The Concurrent Deviation method gives only:

  • A. The exact value of r
  • B. A rough idea of the direction and strength of correlation
  • C. The sum of products
  • D. The slope of regression

Correct Option: B
It is a quick, rough method that uses only the directions of change, not their magnitudes.

Q 17 · Scatter · Easy

A *scatter diagram* is used to:

  • A. Compute exact r
  • B. Visually inspect the relationship between two variables
  • C. Test hypothesis
  • D. Average ranks

Correct Option: B
A scatter plot gives a visual indication of direction, strength, linearity and outliers.

Q 18 · Spearman ties · Hard

In Spearman's rank correlation, tied ranks are typically handled by:

  • A. Skipping the tied observations
  • B. Assigning the average of the tied ranks
  • C. Setting both to zero
  • D. Ignoring the ties entirely

Correct Option: B
For ties, assign the **mean rank** to the tied values and add a correction term to the formula.

Q 19 · Perfect · Medium

r = +1 indicates:

  • A. No relationship
  • B. Perfect positive linear relationship
  • C. Perfect negative linear relationship
  • D. Non-linear relationship

Correct Option: B
All points lie on a positively-sloped line.

Q 20 · Predict · Medium

To **predict** Y from X, use:

  • A. Regression of Y on X
  • B. Regression of X on Y
  • C. Correlation coefficient only
  • D. Coefficient of variation

Correct Option: A
Use **Y on X** equation (with slope b_yx).

41.10 Quick Recall

Important: Quick recall
  • Correlation: degree of linear association; range [−1, +1]. Karl Pearson's r = Cov(X,Y)/(σ_x σ_y).
  • r² = coefficient of determination: the fraction of variation in Y explained by X.
  • Properties of r: symmetric; independent of change of origin and scale; measures linear association only; does not imply causation.
  • Spearman's ρ = 1 − 6Σd²/[n(n² − 1)], for ranked, ordinal or monotonic data.
  • Regression: Y on X (b_yx = r σ_y/σ_x); X on Y (b_xy = r σ_x/σ_y).
  • r = ±√(b_yx · b_xy) (geometric mean of the slopes, with their common sign).
  • Both regression lines intersect at (X̄, Ȳ); they are identical when r = ±1.
  • Standard error of estimate: S_y.x = σ_y √(1 − r²).
  • OLS minimises the sum of squared residuals, Σ e_i².