📘 PART II: SUPERVISED LEARNING

Linear Regression & Gradient Descent

Master the foundational supervised learning algorithm. Learn to fit lines to data, derive cost functions from maximum likelihood, solve via the Normal Equation and Gradient Descent, and build production-ready predictive models.

📖 Chapter 4 · ⏱️ ~4 hours reading · 🧮 24 sections · 💻 3 implementations · 🏗️ 2 mini projects
01

Learning Objectives

After completing this chapter, you will be able to:

  1. Explain the intuition behind fitting a line to data and the concept of "best fit."
  2. Derive the Mean Squared Error cost function from Maximum Likelihood Estimation with Gaussian noise.
  3. Solve linear regression using the closed-form Normal Equation: w = (XᵀX)⁻¹Xᵀy.
  4. Implement Batch, Stochastic, and Mini-batch Gradient Descent from scratch in Python.
  5. Analyze the effects of learning rate, convergence criteria, and feature scaling on GD.
  6. Evaluate models using R², Adjusted R², MSE, RMSE, and MAE metrics.
  7. Verify linear regression assumptions via residual plots, Q-Q plots, and leverage plots.
  8. Build end-to-end regression pipelines using Scikit-Learn and TensorFlow.
  9. Apply regression to real-world Indian datasets: house prices, crop yields, and rainfall.
  10. Discuss ethical considerations in regression modeling (bias, fairness, interpretability).
02

Introduction

Imagine you're a real estate agent in Mumbai, and a client asks: "If I buy a 1,200 sq. ft. flat in Andheri, what should I expect to pay?" You've sold dozens of flats, so you know roughly that larger flats cost more. But can you turn that intuition into a precise number? That's exactly what linear regression does.

Linear regression is the "Hello, World!" of machine learning. It is the simplest, most interpretable, and arguably the most widely deployed supervised learning algorithm. From predicting house prices to forecasting stock markets, from estimating crop yields to modeling climate change, regression is everywhere.

In this chapter, we'll go from zero to expert. We'll start with the intuitive idea of "drawing a line through points," derive every formula from first principles, implement everything from scratch in Python, and then see how the same ideas power billion-dollar applications at Google, Flipkart, and ISRO.

Why Linear Regression Matters

03

Historical Background

The story of linear regression begins with the stars, literally. In 1805, the French mathematician Adrien-Marie Legendre published the method of least squares to predict the orbits of comets. He wanted to find the "best" line through noisy astronomical observations.

Just four years later, Carl Friedrich Gauss claimed he had invented the method even earlier (around 1795) and showed that least squares is the optimal estimator when errors follow a Gaussian (normal) distribution. This connection between least squares and Gaussian noise is the Maximum Likelihood derivation we'll explore in Section 6.

The term "regression" itself was coined by Sir Francis Galton in the 1880s, while studying the relationship between parents' heights and children's heights. He observed that children of very tall parents tended to be shorter than their parents (though still taller than average): they "regressed toward the mean." Though the biological interpretation has changed, the mathematical method kept the name.

Timeline of Key Milestones

| Year | Contributor | Milestone |
|------|-------------|-----------|
| 1795 | Carl Friedrich Gauss | First use of least squares (claimed; published 1809) |
| 1805 | Adrien-Marie Legendre | Published method of least squares |
| 1821 | Carl Friedrich Gauss | Gauss-Markov theorem: OLS is BLUE |
| 1885 | Francis Galton | Coined "regression" (regression to the mean) |
| 1897 | Karl Pearson | Multiple regression, correlation coefficient |
| 1922 | R.A. Fisher | Maximum likelihood estimation formalized |
| 1958 | Frank Rosenblatt | Perceptron (linear regression + threshold = classification) |
| 1970 | Hoerl & Kennard | Ridge regression (regularized linear regression) |
| 1996 | Robert Tibshirani | LASSO (L1-regularized linear regression) |
| 2005 | Zou & Hastie | Elastic Net (L1+L2 regularization) |
04

Conceptual Explanation

4.1 The Simplest Idea: Fitting a Line

Suppose you plot house sizes (x-axis) against prices (y-axis) and see a rough upward trend. A linear regression model says: "I believe the relationship is approximately a straight line."

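Simple Linear Regression Model: ŷ = w · x + b

Where ŷ is the predicted value, x is the input feature, w is the slope (weight), and b is the intercept (bias).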

4.2 What Makes a Line "Best"?

There are infinitely many lines you could draw through a scatter plot. We need a way to measure which line is "best." The idea is simple: the best line minimizes the total error between predicted values (ŷ) and actual values (y).

We define the residual for each data point as: eᵢ = yᵢ - ŷᵢ. The residual is positive if we under-predict and negative if we over-predict. We can't just sum residuals (positives cancel negatives), so we square them:

Mean Squared Error (MSE): J(w, b) = (1/n) · Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)²

This is the cost function (or loss function). Our goal is to find w and b that minimize J(w, b). This is an optimization problem.
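To make the cost function concrete, the following sketch evaluates J(w, b) for two candidate lines on the five flats used in Example 1 later in this chapter. The helper name `mse_cost` and the two candidate (w, b) pairs are illustrative assumptions, not part of the chapter's own code.

```python
import numpy as np

def mse_cost(w, b, x, y):
    """Mean squared error J(w, b) = (1/n) * sum((y - (w*x + b))**2)."""
    y_hat = w * x + b
    return np.mean((y - y_hat) ** 2)

# Flat areas (sq. ft.) and prices (Rs. Lakhs) from Example 1
x = np.array([600, 800, 1000, 1200, 1500], dtype=float)
y = np.array([30, 45, 55, 68, 80], dtype=float)

# Candidate 1 (assumed): a horizontal line at the mean price
cost_flat = mse_cost(0.0, 55.6, x, y)
# Candidate 2 (assumed): a line through the origin with a plausible slope
cost_sloped = mse_cost(0.055, 0.0, x, y)

print(f"J(w=0,     b=55.6) = {cost_flat:.2f}")    # 303.44
print(f"J(w=0.055, b=0)    = {cost_sloped:.2f}")  # 4.05
```

The sloped line has a far lower cost, which is exactly the sense in which it is a "better" fit.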

4.3 Multiple Linear Regression

Real problems have multiple features. A house price depends on area, bedrooms, age, location, etc. With p features:

Multiple Linear Regression: ŷ = w₁x₁ + w₂x₂ + ... + wₚxₚ + b

In vectorized form, we add a column of 1s to the feature matrix X (for the bias), and write:

Vectorized Form: ŷ = X · w (where X includes a bias column, w includes b)
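A minimal sketch of the bias-column trick, assuming three made-up houses with two features each and hypothetical parameter values. It checks that the vectorized product gives the same predictions as the explicit sum plus b.

```python
import numpy as np

# Three houses, two features each: [area_sqft, bedrooms] (made-up values)
X_raw = np.array([[600.0, 1.0],
                  [1000.0, 2.0],
                  [1500.0, 3.0]])

# Hypothetical parameters: bias b and one weight per feature
b = 5.0
w_features = np.array([0.05, 2.0])

# Prepend a column of 1s so the bias becomes the first entry of w
X = np.column_stack([np.ones(len(X_raw)), X_raw])   # shape (3, 3)
w = np.concatenate([[b], w_features])               # shape (3,)

y_hat_vectorized = X @ w                 # single matrix-vector product
y_hat_explicit = X_raw @ w_features + b  # w1*x1 + w2*x2 + b, row by row

print(y_hat_vectorized)   # [37. 59. 86.]
```

Folding b into w is purely a bookkeeping device: it lets every later formula be written in terms of a single matrix X and a single vector w.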

4.4 Two Solution Strategies

There are two fundamentally different ways to find the optimal w:

Strategy 1: Normal Equation (Analytical)

Set the derivative of J(w) to zero and solve directly. Gives an exact, closed-form answer. Fast for small/medium datasets. Complexity: roughly O(np² + p³) for n samples and p features, that is, O(np²) to form XᵀX and O(p³) to solve the resulting system.

Strategy 2: Gradient Descent (Iterative)

Start with a random (or zero) w and repeatedly nudge it in the direction that reduces J(w). For the convex MSE cost, this converges to the optimal solution when the learning rate is suitable. Essential for large datasets and neural networks. Complexity: O(npk) for n samples, p features, k iterations.

05

Mathematical Foundation

5.1 Notation

| Symbol | Meaning | Dimensions |
|--------|---------|------------|
| X | Feature matrix (with bias column) | n × (p+1) |
| y | Target vector | n × 1 |
| w | Weight vector (including bias) | (p+1) × 1 |
| ŷ | Prediction vector = Xw | n × 1 |
| ε | Error/residual vector = y - ŷ | n × 1 |
| n | Number of training samples | scalar |
| p | Number of features (before bias) | scalar |
| α | Learning rate | scalar |
| J(w) | Cost function | scalar |

5.2 The Linear Model Assumption

We assume the true relationship between features and target is:

Data Generating Process: y = Xw* + ε, where ε ~ N(0, σ²I)

Here w* is the true (unknown) weight vector, and ε is random noise that is (1) normally distributed, (2) has zero mean, (3) has constant variance σ², and (4) is independent across samples.

5.3 The Four Assumptions of Linear Regression

  1. Linearity: The relationship between X and y is linear in parameters. The true model is y = Xw + ε.
  2. Independence: The residuals εᵢ are independent of each other. No autocorrelation.
  3. Homoscedasticity: The variance of residuals is constant across all levels of X. Var(εᵢ) = σ² for all i.
  4. Normality: The residuals follow a normal distribution. εᵢ ~ N(0, σ²).

Bonus assumption: No multicollinearity. Features should not be perfectly correlated with each other (otherwise XᵀX is singular).
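The multicollinearity condition can be checked numerically. The sketch below is an illustration under an assumed setup: it builds a design matrix in which one feature is just another feature in different units (area in square feet and in square metres), so the columns are perfectly correlated and X loses rank.

```python
import numpy as np

area_sqft = np.array([600.0, 800.0, 1000.0, 1200.0, 1500.0])
area_sqm = area_sqft * 0.092903      # same information, different units

# Columns: bias, area in sq. ft., area in sq. m (perfectly collinear pair)
X = np.column_stack([np.ones(5), area_sqft, area_sqm])

rank = np.linalg.matrix_rank(X)
print("columns:", X.shape[1], "| rank:", rank)   # 3 columns but rank 2

# A rank-deficient X means X^T X is singular, so its condition number explodes
print("cond(X^T X):", np.linalg.cond(X.T @ X))
```

When the rank is lower than the number of columns, the Normal Equation has no unique solution; dropping one of the redundant features (or adding regularization) restores it.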

5.4 Evaluation Metrics

Mean Squared Error (MSE)

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)²

Root Mean Squared Error (RMSE)

RMSE = √MSE = √[(1/n) Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)²]

Mean Absolute Error (MAE)

MAE = (1/n) Σᵢ₌₁ⁿ |yᵢ - ŷᵢ|

R² (Coefficient of Determination)

R-Squared: R² = 1 - (SS_res / SS_tot)
SS_res = Σ(yᵢ - ŷᵢ)² (residual sum of squares)
SS_tot = Σ(yᵢ - ȳ)² (total sum of squares)

Interpretation: R² = 0.85 means the model explains 85% of the variance in y. R² = 1 is a perfect fit; R² = 0 means the model is no better than predicting the mean, and R² can be negative on held-out data when the model does worse than the mean.

Adjusted R²

Adjusted R-Squared: Adj-R² = 1 - [(1 - R²)(n - 1) / (n - p - 1)]

Adjusted R² penalizes adding useless features. If a new feature doesn't improve prediction, Adjusted R² decreases, while R² on the training data never decreases when features are added.
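The metrics above translate almost line for line into NumPy. This is a minimal sketch: the function name `regression_metrics` and the dictionary keys are assumptions chosen for illustration, and the predictions come from the line fitted in Example 1 (ŷ = 0.05541 · x - 0.918).

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R^2 and adjusted R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(y_true - y_pred)),
        "R2": r2,
        "Adj_R2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
    }

x = np.array([600, 800, 1000, 1200, 1500], dtype=float)
y = np.array([30, 45, 55, 68, 80], dtype=float)
y_hat = 0.05541 * x - 0.918        # fitted line from Example 1

metrics = regression_metrics(y, y_hat, n_features=1)
for name, value in metrics.items():
    print(f"{name:>6}: {value:.4f}")
```

Note that RMSE and MAE are in the same units as the target (₹ Lakhs here), which makes them easier to communicate than MSE.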

06

Formula Derivations

6.1 Derivation of MSE from Maximum Likelihood

Setup: Assume each observation follows: yᵢ = wᵀxᵢ + εᵢ where εᵢ ~ N(0, σ²).

Step 1: Write the likelihood. Since εᵢ = yᵢ - wᵀxᵢ is Gaussian:

P(yᵢ | xᵢ, w) = (1/√(2πσ²)) · exp(-(yᵢ - wᵀxᵢ)² / (2σ²))

Step 2: Likelihood of all n samples (independence assumption):

L(w) = Πᵢ₌₁ⁿ P(yᵢ | xᵢ, w) = Πᵢ₌₁ⁿ (1/√(2πσ²)) · exp(-(yᵢ - wᵀxᵢ)² / (2σ²))

Step 3: Take the log-likelihood (products → sums):

ln L(w) = -(n/2) · ln(2πσ²) - (1/(2σ²)) · Σᵢ₌₁ⁿ (yᵢ - wᵀxᵢ)²

Step 4: Maximize log-likelihood = Minimize the sum of squares

The first term is a constant (it doesn't depend on w), and the second term is the sum of squared residuals multiplied by the negative constant -1/(2σ²). Maximizing ln L(w) is therefore equivalent to minimizing that sum; rescaling it by 1/n does not move the minimizer, which gives:

MSE from Maximum Likelihood: Minimize J(w) = (1/n) Σᵢ₌₁ⁿ (yᵢ - wᵀxᵢ)²

Key result: Under Gaussian noise, Maximum Likelihood Estimation gives us MSE as the cost function. This is not an arbitrary choice: it's the statistically optimal loss under the normality assumption.
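The equivalence can be checked numerically. The sketch below assumes a synthetic one-feature dataset with no intercept (true slope 3, known σ = 2) and a grid of candidate slopes; all of these are illustrative choices. It confirms that the slope with the highest Gaussian log-likelihood is the same one with the lowest sum of squared errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 2.0
x = rng.uniform(0, 10, n)
y = 3.0 * x + rng.normal(0, sigma, n)    # synthetic data, true slope 3

def log_likelihood(w):
    """Sum of log Gaussian densities of the residuals (Steps 1 to 3)."""
    resid = y - w * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

def sse(w):
    """Sum of squared errors for slope w."""
    return np.sum((y - w * x) ** 2)

candidates = np.linspace(2.0, 4.0, 201)
best_by_likelihood = candidates[np.argmax([log_likelihood(w) for w in candidates])]
best_by_sse = candidates[np.argmin([sse(w) for w in candidates])]

print("argmax log-likelihood:", best_by_likelihood)
print("argmin squared error: ", best_by_sse)
```

Both searches pick the same candidate because the log-likelihood is just a constant minus a positive multiple of the squared error.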

6.2 Derivation of the Normal Equation

Goal: Find w that minimizes J(w) = (1/n)||y - Xw||²

Step 1: Expand the cost function in matrix form

J(w) = (1/n)(y - Xw)ᵀ(y - Xw)
= (1/n)(yᵀy - yᵀXw - wᵀXᵀy + wᵀXᵀXw)
= (1/n)(yᵀy - 2wᵀXᵀy + wᵀXᵀXw)

(Note: yᵀXw is a scalar, so yᵀXw = wᵀXᵀy.)

Step 2: Take the gradient with respect to w

Using matrix calculus rules: ∂(wᵀa)/∂w = a, and ∂(wᵀAw)/∂w = 2Aw (for symmetric A):

∇J(w) = (1/n)(-2Xᵀy + 2XᵀXw)

Step 3: Set gradient to zero and solve

-2Xᵀy + 2XᵀXw = 0
XᵀXw = Xᵀy

Step 4: Solve for w (multiply both sides by (XᵀX)⁻¹):

The Normal Equation: w = (XᵀX)⁻¹ Xᵀy

This assumes XᵀX is invertible (non-singular), which requires at least as many samples as columns of X (n ≥ p + 1) and no perfect multicollinearity.
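As a sketch of how the Normal Equation is typically evaluated in code, the example below solves the linear system XᵀXw = Xᵀy with `np.linalg.solve` instead of forming the inverse explicitly, then cross-checks against `np.linalg.lstsq`. The synthetic data and the "true" weights are assumptions made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 3
X_raw = rng.normal(size=(n, p))
true_w = np.array([4.0, 2.0, -1.0, 0.5])     # [bias, w1, w2, w3], made up

X = np.column_stack([np.ones(n), X_raw])     # add the bias column
y = X @ true_w + rng.normal(scale=0.1, size=n)

# Normal Equation: solve (X^T X) w = X^T y without computing the inverse
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equation:", np.round(w_normal, 3))
print("lstsq:          ", np.round(w_lstsq, 3))
```

Solving the system is generally preferred over computing (XᵀX)⁻¹ because it does less work and accumulates less rounding error; `lstsq` additionally copes with rank-deficient X.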

6.3 Derivation of Gradient Descent Update Rule

Idea: If we can't (or don't want to) invert matrices, we can iteratively walk "downhill" on the cost surface.

Step 1: We already computed the gradient:

∇J(w) = (2/n) Xᵀ(Xw - y)

Step 2: Update w in the negative gradient direction:

Gradient Descent Update Rule: w ← w - α · ∇J(w)
w ← w - α · (2/n) Xᵀ(Xw - y)

For simple linear regression (scalar w and b):

∂J/∂w = (2/n) Σᵢ₌₁ⁿ (ŷᵢ - yᵢ) · xᵢ
∂J/∂b = (2/n) Σᵢ₌₁ⁿ (ŷᵢ - yᵢ)

w ← w - α · ∂J/∂w
b ← b - α · ∂J/∂b
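A common way to gain confidence in a derived gradient is to compare it with a finite-difference approximation. The sketch below does this for the MSE gradient on the three points used in Example 2; the function names and the step size h are illustrative assumptions.

```python
import numpy as np

def cost(w, X, y):
    """MSE cost J(w) = (1/n) * ||Xw - y||^2."""
    r = X @ w - y
    return (r @ r) / len(y)

def analytic_gradient(w, X, y):
    """Gradient from the derivation: (2/n) * X^T (Xw - y)."""
    return (2 / len(y)) * X.T @ (X @ w - y)

def numerical_gradient(w, X, y, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (cost(w + step, X, y) - cost(w - step, X, y)) / (2 * h)
    return grad

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # bias column + x
y = np.array([2.0, 4.0, 5.0])
w = np.array([0.0, 0.0])                              # ordered as [b, w]

g_analytic = analytic_gradient(w, X, y)
g_numeric = numerical_gradient(w, X, y)
print("analytic: ", g_analytic)    # approximately [-7.333, -16.667]
print("numerical:", g_numeric)
```

The two gradients agree to many decimal places, and the values match the first-iteration gradients computed by hand in Example 2.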

6.4 Variants of Gradient Descent

Batch Gradient Descent

Uses all n samples to compute the gradient at each step.

w ← w - α · (2/n) Xᵀ(Xw - y)

Pros: Smooth convergence; reaches the global minimum for convex J, given a suitably small α.
Cons: Slow for large n (must process entire dataset each step).

Stochastic Gradient Descent (SGD)

Uses one random sample to estimate the gradient at each step.

w ← w - α · 2(ŷᵢ - yᵢ) · xᵢ (for randomly chosen i)

Pros: Very fast per step, can escape shallow local minima (relevant for non-convex losses; the MSE here is convex), enables online learning.
Cons: Noisy updates, may oscillate around minimum.

Mini-Batch Gradient Descent

Uses a small batch of B samples (typically B = 32, 64, 128).

w ← w - α · (2/B) X_batchᵀ(X_batch · w - y_batch)

Pros: Best of both: stable enough for convergence, fast enough for large datasets, leverages GPU parallelism.
Cons: Requires tuning batch size B.
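The three variants differ only in which rows feed the gradient. The sketch below, on assumed synthetic data, computes all three gradients at the same weights and shows that the batch gradient is exactly the average of the per-sample SGD gradients, which is why single-sample updates are noisy but point in the right direction on average.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 3
X = rng.normal(size=(n, d))     # synthetic design matrix
y = rng.normal(size=n)          # synthetic targets
w = rng.normal(size=d)          # arbitrary current weights

# Batch gradient: all n samples
batch_grad = (2 / n) * X.T @ (X @ w - y)

# SGD gradient for each single sample i: 2 * (y_hat_i - y_i) * x_i
sgd_grads = np.array([2 * (X[i] @ w - y[i]) * X[i] for i in range(n)])

# Mini-batch gradient for one random batch of B samples
B = 4
batch_idx = rng.choice(n, size=B, replace=False)
mini_grad = (2 / B) * X[batch_idx].T @ (X[batch_idx] @ w - y[batch_idx])

print("batch gradient:         ", batch_grad)
print("mean of SGD gradients:  ", sgd_grads.mean(axis=0))
print("one mini-batch gradient:", mini_grad)
```

The mini-batch gradient is likewise the average of the SGD gradients over just the sampled rows, so larger B trades per-step cost for lower variance.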

6.5 Learning Rate Effects

| Learning Rate | Behavior | Convergence |
|---------------|----------|-------------|
| Too small (e.g., 0.0001) | Very slow descent, tiny steps | Eventually converges but takes too long |
| Just right (e.g., 0.01) | Steady descent toward minimum | Converges in reasonable time |
| Too large (e.g., 1.0) | Overshoots minimum, oscillates wildly | May diverge (J increases!) |
| Adaptive (e.g., Adam) | Adjusts per-parameter learning rates | Robust convergence for most problems |
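The behaviours in the table can be reproduced on a toy problem. The sketch below runs 50 steps of batch gradient descent on the three points from Example 2 with three learning rates. The specific values 0.001, 0.1, and 0.2 are assumptions picked for this tiny unscaled dataset; the thresholds for "too small" and "too large" depend on the data and its scaling.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])   # bias column + feature

def run_gd(alpha, n_iters=50):
    """Batch GD from w = 0; returns the cost after every iteration."""
    w = np.zeros(2)
    costs = [np.mean((X @ w - y) ** 2)]
    for _ in range(n_iters):
        w -= alpha * (2 / len(y)) * X.T @ (X @ w - y)
        costs.append(np.mean((X @ w - y) ** 2))
    return costs

histories = {alpha: run_gd(alpha) for alpha in (0.001, 0.1, 0.2)}
for alpha, costs in histories.items():
    print(f"alpha={alpha:<5} start J={costs[0]:.3f}  J after 50 steps={costs[-1]:.3e}")
```

On this data the smallest rate is still far from the minimum after 50 steps, the middle rate is essentially converged, and the largest rate makes the cost grow without bound.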

6.6 Convergence Criteria

How do we know when to stop gradient descent? Common criteria:

  1. Maximum iterations: Stop after k iterations (e.g., k = 1000).
  2. Cost tolerance: Stop when |J(w_new) - J(w_old)| < ε (e.g., ε = 10⁻⁶).
  3. Gradient norm: Stop when ||∇J(w)|| < ε.
  4. Parameter change: Stop when ||w_new - w_old|| < ε.
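The four criteria can be combined in a single loop. The following is a minimal sketch, not a reference implementation: the function name, the shared tolerance `tol`, the default values, and the order in which the checks run are all assumptions made for illustration.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, max_iters=10_000, tol=1e-8):
    """Batch GD that reports which stopping rule fired first."""
    n = len(y)
    w = np.zeros(X.shape[1])
    cost = np.mean((X @ w - y) ** 2)
    for k in range(1, max_iters + 1):
        grad = (2 / n) * X.T @ (X @ w - y)
        w_new = w - lr * grad
        cost_new = np.mean((X @ w_new - y) ** 2)
        if abs(cost_new - cost) < tol:            # criterion 2
            return w_new, k, "cost tolerance"
        if np.linalg.norm(grad) < tol:            # criterion 3
            return w_new, k, "gradient norm"
        if np.linalg.norm(w_new - w) < tol:       # criterion 4
            return w_new, k, "parameter change"
        w, cost = w_new, cost_new
    return w, max_iters, "maximum iterations"     # criterion 1

# Three points from Example 2: bias column + x
X = np.column_stack([np.ones(3), [1.0, 2.0, 3.0]])
y = np.array([2.0, 4.0, 5.0])
w, n_steps, reason = gradient_descent(X, y)
print(f"stopped after {n_steps} steps ({reason}); w = {w}")
```

In practice the maximum-iteration cap acts as a safety net, while one of the tolerance checks usually ends training first.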
07

Worked Numerical Examples

Example 1: Normal Equation with 5 Data Points

Problem: Given the following data on flat area (sq. ft.) and price (₹ Lakhs) in Bangalore:

| Area (x) | Price (y) |
|----------|-----------|
| 600 | 30 |
| 800 | 45 |
| 1000 | 55 |
| 1200 | 68 |
| 1500 | 80 |

Find w (slope) and b (intercept) using the Normal Equation.

Step-by-Step Solution

Step 1: Construct the design matrix X (with bias column) and target vector y:

X = [[1, 600],
     [1, 800],
     [1, 1000],
     [1, 1200],
     [1, 1500]]

y = [30, 45, 55, 68, 80]ᵀ

Step 2: Compute XᵀX:

XᵀX[0,0] = 5 (sum of 1s)
XᵀX[0,1] = XᵀX[1,0] = 600+800+1000+1200+1500 = 5100
XᵀX[1,1] = 600²+800²+1000²+1200²+1500² = 360000+640000+1000000+1440000+2250000 = 5690000

XᵀX = [[5, 5100], [5100, 5690000]]

Step 3: Compute Xᵀy:

Xᵀy = [30+45+55+68+80, 600·30+800·45+1000·55+1200·68+1500·80]ᵀ
= [278, 18000+36000+55000+81600+120000]ᵀ
= [278, 310600]ᵀ

Step 4: Compute (XᵀX)⁻¹:

For a 2×2 matrix [[a,b],[c,d]], inverse = (1/(ad-bc)) · [[d,-b],[-c,a]]

det = 5·5690000 - 5100·5100 = 28450000 - 26010000 = 2440000

(XᵀX)⁻¹ = (1/2440000) · [[5690000, -5100], [-5100, 5]]

Step 5: Compute w = (XᵀX)⁻¹ Xᵀy:

w = (1/2440000) · [[5690000·278 + (-5100)·310600],
[(-5100)·278 + 5·310600]]

w[0] = (1/2440000)(1581820000 - 1584060000) = (-2240000)/2440000 ≈ -0.918
w[1] = (1/2440000)(-1417800 + 1553000) = 135200/2440000 ≈ 0.05541

Final Model: b ≈ -0.918, w ≈ 0.05541
ŷ = 0.05541 · x - 0.918

Interpretation: Each sq. ft. adds ≈ ₹0.0554 Lakhs ≈ ₹5,541 to the price.

Verification: For x=1000: ŷ = 0.05541×1000 - 0.918 = 55.41 - 0.918 = 54.49 ≈ 55 ✓
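The hand calculation can be double-checked in a few lines of NumPy. This sketch rebuilds the same design matrix and solves the normal equations with `np.linalg.solve`; the variable names are illustrative.

```python
import numpy as np

x = np.array([600, 800, 1000, 1200, 1500], dtype=float)
y = np.array([30, 45, 55, 68, 80], dtype=float)

X = np.column_stack([np.ones_like(x), x])   # design matrix with bias column
XtX = X.T @ X                               # [[5, 5100], [5100, 5690000]]
Xty = X.T @ y                               # [278, 310600]

b, w = np.linalg.solve(XtX, Xty)            # solves XtX @ [b, w] = Xty
print(f"b = {b:.3f}, w = {w:.5f}")          # b = -0.918, w = 0.05541
print(f"prediction at 1000 sq. ft. = {w * 1000 + b:.2f}")   # 54.49
```

The intermediate matrices and the final coefficients match Steps 2 to 5 above.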

Example 2: Two Iterations of Gradient Descent (α = 0.01)

Problem: Using simplified data points: (1, 2), (2, 4), (3, 5). Starting with w=0, b=0. Perform 2 iterations of batch gradient descent with α = 0.01.

Iteration 1

Predictions (ŷ = wx + b, with w=0, b=0):

ŷ₁ = 0·1 + 0 = 0, ŷ₂ = 0·2 + 0 = 0, ŷ₃ = 0·3 + 0 = 0

Errors (ŷᵢ - yᵢ):

e₁ = 0 - 2 = -2, e₂ = 0 - 4 = -4, e₃ = 0 - 5 = -5

Gradients:

∂J/∂w = (2/3)[(-2)(1) + (-4)(2) + (-5)(3)] = (2/3)[-2 - 8 - 15] = (2/3)(-25) = -16.667
∂J/∂b = (2/3)[(-2) + (-4) + (-5)] = (2/3)(-11) = -7.333

Updates:

w ← 0 - 0.01 × (-16.667) = 0.1667
b ← 0 - 0.01 × (-7.333) = 0.0733

Iteration 2

Predictions (w=0.1667, b=0.0733):

ŷ₁ = 0.1667·1 + 0.0733 = 0.240
ŷ₂ = 0.1667·2 + 0.0733 = 0.407
ŷ₃ = 0.1667·3 + 0.0733 = 0.573

Errors:

e₁ = 0.240 - 2 = -1.760, e₂ = 0.407 - 4 = -3.593, e₃ = 0.573 - 5 = -4.427

Gradients:

∂J/∂w = (2/3)[(-1.760)(1) + (-3.593)(2) + (-4.427)(3)]
= (2/3)[-1.760 - 7.186 - 13.281] = (2/3)(-22.227) = -14.818
∂J/∂b = (2/3)[-1.760 - 3.593 - 4.427] = (2/3)(-9.780) = -6.520

Updates:

w ← 0.1667 - 0.01 × (-14.818) = 0.1667 + 0.1482 = 0.3149
b ← 0.0733 - 0.01 × (-6.520) = 0.0733 + 0.0652 = 0.1385

After just 2 iterations, the line has already started fitting: w is moving toward ~1.5 and b toward ~0.67 (the least-squares optimum for these three points). With hundreds more iterations, gradient descent will converge to that solution to within numerical tolerance.
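The two iterations above can be replayed in code as a check on the arithmetic. This sketch uses the scalar update rules for w and b directly; the variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])
w, b, alpha = 0.0, 0.0, 0.01

for iteration in range(1, 3):
    errors = (w * x + b) - y                   # y_hat_i - y_i
    dw = (2 / len(x)) * np.sum(errors * x)     # dJ/dw
    db = (2 / len(x)) * np.sum(errors)         # dJ/db
    w -= alpha * dw
    b -= alpha * db
    print(f"iteration {iteration}: dJ/dw={dw:.3f}, dJ/db={db:.3f}, w={w:.4f}, b={b:.4f}")
```

The printed values agree with the hand computation up to rounding in the last digit (for example, w ≈ 0.3148 here versus 0.3149 above, because the hand calculation rounds intermediate results).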

Example 3: Indian Real Estate (4 Features)

Problem: Predict flat price in Delhi NCR using 4 features: Area (sq ft), Bedrooms, Floor, Age (years).

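| Area | Beds | Floor | Age | Price (₹L) |
|------|------|-------|-----|------------|
| 850 | 2 | 3 | 5 | 45 |
| 1100 | 3 | 7 | 2 | 72 |
| 1400 | 3 | 12 | 1 | 95 |
| 950 | 2 | 5 | 10 | 38 |
| 1600 | 4 | 15 | 0 | 120 |
| 750 | 1 | 2 | 15 | 28 |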

This requires the Normal Equation with a 6×5 design matrix. We'll solve this computationally in the Python section, but the interpretation is key:

Price = w₁·Area + w₂·Beds + w₃·Floor + w₄·Age + b

Expected signs: w₁ > 0 (more area → higher price)
w₂ > 0 (more bedrooms → higher price)
w₃ > 0 (higher floor → slightly higher price)
w₄ < 0 (older flat → lower price)
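As a rough sketch of how this system could be solved, the block below fits the six rows with `np.linalg.lstsq`. The feature values are typed in as read from the table (an assumption about how the table splits into columns). With only six rows for five parameters, the fit is almost fully determined by the data, so the estimated signs need not match the expected ones listed above.

```python
import numpy as np

# Rows as read from the table: area, beds, floor, age -> price (Rs. Lakhs)
features = np.array([
    [850,  2,  3,  5],
    [1100, 3,  7,  2],
    [1400, 3, 12,  1],
    [950,  2,  5, 10],
    [1600, 4, 15,  0],
    [750,  1,  2, 15],
], dtype=float)
price = np.array([45, 72, 95, 38, 120, 28], dtype=float)

X = np.column_stack([np.ones(len(features)), features])   # 6 x 5 design matrix
weights, *_ = np.linalg.lstsq(X, price, rcond=None)

labels = ["b", "w1 (area)", "w2 (beds)", "w3 (floor)", "w4 (age)"]
for name, value in zip(labels, weights):
    print(f"{name:>11}: {value: .4f}")
```

With so few samples, the coefficients are best read as a mechanical demonstration of the method rather than as reliable estimates of how each feature affects price.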
08

Visual Diagrams

Diagram 1: Scatter Plot with Best-Fit Line

📊 Linear Regression: Fitting a Line to Data
[Figure: scatter plot of Price (₹L, 0 to 120) against Area (sq ft, 0 to 2000). Stars mark the actual data points; the best-fit line ŷ = wx + b runs through them. The line minimizes the total squared distance from the points to the line.]

Diagram 2: Cost Function Surface

📊 MSE Cost Function J(w,b): A Bowl Shape (Convex)
[Figure: J(w,b) plotted against w as a bowl-shaped curve, high at both ends and low in the middle, with the global minimum marked at (w*, b*).]
Key insight: MSE for linear regression is CONVEX
→ There is exactly ONE global minimum
→ Gradient descent ALWAYS converges (with proper α)
→ Normal equation finds it directly

Diagram 3: Gradient Descent Steps

📊 Gradient Descent: Walking Downhill on the Cost Surface
[Figure: J(w) against w. Starting from a random w₀ at J ≈ 100, steps 1 to 4 (w₁ to w₄) lower the cost through roughly 80, 60, 40, and 20 until it converges near w* at J ≈ 5. An inset shows a learning rate α that is too high: the iterates oscillate across the minimum.]
Small α → many small steps → slow but safe
Large α → few big steps → fast but risky
Just right → efficient convergence!

Diagram 4: Residual Analysis

📊 Residual Plots: Good vs. Bad Patterns
[Figure: three residual-plot panels, each showing residuals scattered around a horizontal zero line.]
  1. GOOD (random scatter) → Assumptions MET → Model is appropriate
  2. BAD (curved pattern) → Non-linearity! → Add polynomial terms
  3. BAD (funnel shape) → Heteroscedasticity! → Use WLS or transform y
09

Flowcharts

Flowchart 1: Linear Regression Pipeline

🔄 End-to-End Linear Regression Pipeline
  1. Raw Data (CSV/DB/API)
  2. EDA & Cleaning (pandas/numpy) → Handle missing values, outliers; check for correlations
  3. Feature Eng. (StandardScaler) → Scaling, encoding, polynomial features
  4. Train/Test Split (train_test_split) → 80/20 or K-Fold CV
  5. Choose Method → Normal Equation for small data (p < 10K); Gradient Descent for large data / online
  6. Evaluate Model (score, predict) → R², RMSE, MAE on test set
  7. Diagnostics, Check Assumptions → Residual plot, Q-Q plot, VIF for multicollinearity
  8. Deploy/Report → Save model (pickle/joblib), API endpoint, dashboard

Flowchart 2: Gradient Descent Algorithm

🔄 Gradient Descent Algorithm Flow
  1. Initialize w randomly (or with zeros)
  2. Compute predictions: ŷ = Xw
  3. Compute cost J(w): J = (1/n)||y - ŷ||²
  4. Compute gradient: ∇J = (2/n)Xᵀ(ŷ - y)
  5. Update weights: w ← w - α·∇J
  6. Converged? If NO, loop back to step 2. If YES, return optimal w: the model is trained!
10

Python Implementation (From Scratch)

Let's build a complete LinearRegression class from scratch using only NumPy.

Python: LinearRegression from Scratch
import numpy as np
import matplotlib.pyplot as plt

class LinearRegressionScratch:
    """
    Linear Regression from scratch.
    Supports: Normal Equation, Batch GD, SGD, Mini-batch GD.
    """

    def __init__(self, method='normal', lr=0.01, n_iters=1000,
                 batch_size=32, tol=1e-6, verbose=False):
        """
        Parameters:
        -----------
        method : str, one of 'normal', 'batch_gd', 'sgd', 'mini_batch_gd'
        lr     : float, learning rate (alpha)
        n_iters: int, max iterations for GD methods
        batch_size: int, for mini-batch GD
        tol    : float, convergence tolerance
        verbose: bool, print progress
        """
        self.method = method
        self.lr = lr
        self.n_iters = n_iters
        self.batch_size = batch_size
        self.tol = tol
        self.verbose = verbose
        self.weights = None       # includes bias
        self.cost_history = []

    def _add_bias(self, X):
        """Add a column of 1s for the bias term."""
        return np.c_[np.ones(X.shape[0]), X]

    def _compute_cost(self, X, y):
        """Compute MSE cost: J = (1/n)||y - Xw||^2"""
        n = len(y)
        predictions = X @ self.weights
        cost = (1 / n) * np.sum((y - predictions) ** 2)
        return cost

    def _compute_gradient(self, X, y):
        """Compute gradient: (2/n) * X^T(Xw - y)"""
        n = len(y)
        predictions = X @ self.weights
        gradient = (2 / n) * X.T @ (predictions - y)
        return gradient

    def fit_normal_equation(self, X, y):
        """Closed-form solution: w = (X^TX)^{-1} X^Ty"""
        X_b = self._add_bias(X)
        # Use np.linalg.pinv for numerical stability
        self.weights = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
        if self.verbose:
            cost = self._compute_cost(X_b, y)
            print(f"Normal Equation | MSE: {cost:.6f}")

    def fit_batch_gd(self, X, y):
        """Full-batch Gradient Descent."""
        X_b = self._add_bias(X)
        n, p = X_b.shape
        self.weights = np.zeros(p)   # initialize to zeros
        self.cost_history = []

        for i in range(self.n_iters):
            gradient = self._compute_gradient(X_b, y)
            self.weights -= self.lr * gradient
            cost = self._compute_cost(X_b, y)
            self.cost_history.append(cost)

            if self.verbose and i % 100 == 0:
                print(f"Iter {i:4d} | MSE: {cost:.6f}")

            # Convergence check
            if len(self.cost_history) > 1:
                if abs(self.cost_history[-1] - self.cost_history[-2]) < self.tol:
                    if self.verbose:
                        print(f"Converged at iteration {i}")
                    break

    def fit_sgd(self, X, y):
        """Stochastic Gradient Descent (one sample at a time)."""
        X_b = self._add_bias(X)
        n, p = X_b.shape
        self.weights = np.zeros(p)
        self.cost_history = []

        for epoch in range(self.n_iters):
            indices = np.random.permutation(n)
            for idx in indices:
                xi = X_b[idx:idx+1]
                yi = y[idx:idx+1]
                gradient = (2) * xi.T @ (xi @ self.weights - yi)
                self.weights -= self.lr * gradient

            cost = self._compute_cost(X_b, y)
            self.cost_history.append(cost)
            if self.verbose and epoch % 100 == 0:
                print(f"Epoch {epoch:4d} | MSE: {cost:.6f}")

    def fit_mini_batch_gd(self, X, y):
        """Mini-batch Gradient Descent."""
        X_b = self._add_bias(X)
        n, p = X_b.shape
        self.weights = np.zeros(p)
        self.cost_history = []

        for epoch in range(self.n_iters):
            indices = np.random.permutation(n)
            for start in range(0, n, self.batch_size):
                end = min(start + self.batch_size, n)
                batch_idx = indices[start:end]
                X_batch = X_b[batch_idx]
                y_batch = y[batch_idx]
                gradient = (2 / len(y_batch)) * X_batch.T @ (X_batch @ self.weights - y_batch)
                self.weights -= self.lr * gradient

            cost = self._compute_cost(X_b, y)
            self.cost_history.append(cost)

    def fit(self, X, y):
        """Fit model using the chosen method."""
        X = np.array(X, dtype=np.float64)
        y = np.array(y, dtype=np.float64)

        if self.method == 'normal':
            self.fit_normal_equation(X, y)
        elif self.method == 'batch_gd':
            self.fit_batch_gd(X, y)
        elif self.method == 'sgd':
            self.fit_sgd(X, y)
        elif self.method == 'mini_batch_gd':
            self.fit_mini_batch_gd(X, y)
        else:
            raise ValueError(f"Unknown method: {self.method}")
        return self

    def predict(self, X):
        """Make predictions."""
        X = np.array(X, dtype=np.float64)
        X_b = self._add_bias(X)
        return X_b @ self.weights

    def score(self, X, y):
        """Compute R² score."""
        y = np.array(y, dtype=np.float64)
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)

    def get_metrics(self, X, y):
        """Return dict of all metrics."""
        y = np.array(y, dtype=np.float64)
        y_pred = self.predict(X)
        n = len(y)
        p = len(self.weights) - 1  # exclude bias

        mse = np.mean((y - y_pred) ** 2)
        rmse = np.sqrt(mse)
        mae = np.mean(np.abs(y - y_pred))
        r2 = self.score(X, y)
        adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))

        return {
            'MSE': mse, 'RMSE': rmse, 'MAE': mae,
            'R²': r2, 'Adj R²': adj_r2
        }

    def plot_cost_history(self):
        """Plot cost vs iterations (for GD methods)."""
        if not self.cost_history:
            print("No cost history (use a GD method).")
            return
        plt.figure(figsize=(10, 5))
        plt.plot(self.cost_history, color='#059669', linewidth=2)
        plt.xlabel('Iteration')
        plt.ylabel('MSE Cost')
        plt.title('Cost Function Convergence')
        plt.grid(True, alpha=0.3)
        plt.show()

    def plot_fit(self, X, y, feature_idx=0):
        """Plot data + regression line (for 1D visualization)."""
        X = np.array(X, dtype=np.float64)
        y = np.array(y, dtype=np.float64)
        y_pred = self.predict(X)

        plt.figure(figsize=(10, 6))
        plt.scatter(X[:, feature_idx], y, color='#0891b2',
                    s=80, alpha=0.7, label='Actual', zorder=5)

        # Sort for smooth line
        sort_idx = X[:, feature_idx].argsort()
        plt.plot(X[sort_idx, feature_idx], y_pred[sort_idx],
                 color='#059669', linewidth=2.5, label='Predicted')

        plt.xlabel(f'Feature {feature_idx}')
        plt.ylabel('Target')
        plt.title('Linear Regression Fit')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.show()

Using the Class

Python: Usage Example
# === Example: Bangalore flat price prediction ===
import numpy as np

# Data: [area_sqft, bedrooms, age_years]
X = np.array([
    [600,  1, 10],
    [800,  2,  7],
    [1000, 2,  5],
    [1200, 3,  3],
    [1500, 3,  1],
    [1800, 4,  0],
    [900,  2,  8],
    [1100, 3,  4],
])
y = np.array([30, 45, 55, 68, 82, 105, 40, 62])  # Price in ₹ Lakhs

# --- Method 1: Normal Equation ---
model_ne = LinearRegressionScratch(method='normal', verbose=True)
model_ne.fit(X, y)
print("Weights (Normal Eq):", model_ne.weights)
print("Metrics:", model_ne.get_metrics(X, y))

# --- Method 2: Batch Gradient Descent ---
# Feature scaling is CRITICAL for GD!
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

model_gd = LinearRegressionScratch(
    method='batch_gd', lr=0.01, n_iters=1000, verbose=True
)
model_gd.fit(X_scaled, y)
print("R² (Batch GD):", model_gd.score(X_scaled, y))
model_gd.plot_cost_history()

# --- Predict new flat ---
new_flat = np.array([[1300, 3, 2]])
price = model_ne.predict(new_flat)
print(f"Predicted price for 1300 sqft, 3BHK, 2yr old: ₹{price[0]:.1f} Lakhs")

Feature Scaling Implementation

Python: Feature Scaling
class StandardScaler:
    """Standardize features: z = (x - mean) / std"""

    def fit(self, X):
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        self.std_[self.std_ == 0] = 1  # avoid division by zero
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_

    def fit_transform(self, X):
        return self.fit(X).transform(X)

# Why is scaling critical for GD?
# Without scaling: area~1000, bedrooms~3, age~5
# The gradient for 'area' is ~300x larger than for 'bedrooms'
# GD oscillates wildly on the area axis while barely moving on bedrooms
# With scaling: all features have similar ranges -> smooth convergence
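One way to quantify the effect described in these comments is the condition number of XᵀX, which is the ratio of the steepest to the flattest curvature of the MSE surface. The sketch below compares it before and after standardization for the flat data from the usage example; the helper name `gram_condition_number` is an assumption made for illustration.

```python
import numpy as np

# [area_sqft, bedrooms, age_years] from the usage example
X = np.array([
    [600, 1, 10], [800, 2, 7], [1000, 2, 5], [1200, 3, 3],
    [1500, 3, 1], [1800, 4, 0], [900, 2, 8], [1100, 3, 4],
], dtype=float)

def gram_condition_number(features):
    """Condition number of X_b^T X_b, where X_b has a leading bias column."""
    X_b = np.column_stack([np.ones(len(features)), features])
    return np.linalg.cond(X_b.T @ X_b)

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = gram_condition_number(X)
cond_scaled = gram_condition_number(X_scaled)
print(f"condition number, raw features:    {cond_raw:.3e}")
print(f"condition number, scaled features: {cond_scaled:.3e}")
```

A smaller condition number means the cost surface is closer to a round bowl, so a single learning rate suits every direction and gradient descent needs far fewer iterations.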

Diagnostic Plots Implementation

Python: Diagnostic Plots
def diagnostic_plots(y_true, y_pred, feature_name="Feature"):
    """Generate diagnostic plots for regression analysis."""
    residuals = y_true - y_pred
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # 1. Residual Plot
    axes[0].scatter(y_pred, residuals, color='#0891b2', alpha=0.6, s=50)
    axes[0].axhline(y=0, color='#dc2626', linestyle='--', linewidth=1.5)
    axes[0].set_xlabel('Predicted Values')
    axes[0].set_ylabel('Residuals')
    axes[0].set_title('Residual Plot')
    axes[0].grid(True, alpha=0.3)

    # 2. Q-Q Plot: standardized residuals vs. standard normal quantiles
    from statistics import NormalDist  # standard library, Python 3.8+
    std_residuals = (residuals - np.mean(residuals)) / np.std(residuals)
    sorted_residuals = np.sort(std_residuals)
    n = len(sorted_residuals)
    # Plotting positions (i + 0.5) / n pushed through the normal inverse CDF
    # (the same role scipy.stats.norm.ppf would play)
    theoretical = np.array([
        NormalDist().inv_cdf((i + 0.5) / n)
        for i in range(n)
    ])
    axes[1].scatter(theoretical, sorted_residuals,
                    color='#059669', alpha=0.7, s=50)
    # Reference line y = x: points near it suggest approximately normal residuals
    lims = [min(theoretical.min(), sorted_residuals.min()),
            max(theoretical.max(), sorted_residuals.max())]
    axes[1].plot(lims, lims, 'r--', linewidth=1.5)
    axes[1].set_xlabel('Theoretical Quantiles')
    axes[1].set_ylabel('Standardized Residual Quantiles')
    axes[1].set_title('Q-Q Plot')
    axes[1].grid(True, alpha=0.3)

    # 3. Actual vs Predicted
    axes[2].scatter(y_true, y_pred, color='#8b5cf6', alpha=0.6, s=50)
    lims = [min(y_true.min(), y_pred.min()),
            max(y_true.max(), y_pred.max())]
    axes[2].plot(lims, lims, 'r--', linewidth=1.5)
    axes[2].set_xlabel('Actual Values')
    axes[2].set_ylabel('Predicted Values')
    axes[2].set_title('Actual vs Predicted')
    axes[2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

# Usage:
# y_pred = model.predict(X)
# diagnostic_plots(y, y_pred)
11

TensorFlow Implementation

Python – TensorFlow/Keras Linear Regression
import tensorflow as tf
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# === Generate sample Indian housing data ===
np.random.seed(42)
n_samples = 500

area = np.random.uniform(500, 2500, n_samples)
bedrooms = np.random.randint(1, 5, n_samples)
age = np.random.uniform(0, 20, n_samples)
floor = np.random.randint(1, 20, n_samples)

# True relationship + noise
price = 0.05 * area + 8 * bedrooms - 1.5 * age + 0.3 * floor + 5 + \
        np.random.normal(0, 5, n_samples)

X = np.column_stack([area, bedrooms, age, floor])
y = price

# Split & scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# === Build TensorFlow Model ===
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(4,),
                          kernel_initializer='normal',
                          name='linear_layer')
])

# Linear regression = 1 Dense layer with no activation!
# This is equivalent to ŷ = Xw + b

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
    loss='mse',               # Mean Squared Error
    metrics=['mae']            # Also track MAE
)

model.summary()

# === Train ===
history = model.fit(
    X_train_s, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.15,
    verbose=1
)

# === Evaluate ===
test_loss, test_mae = model.evaluate(X_test_s, y_test)
print(f"\nTest MSE: {test_loss:.4f}")
print(f"Test MAE: {test_mae:.4f}")

# === Extract learned weights ===
weights, bias = model.get_layer('linear_layer').get_weights()
print(f"\nLearned weights: {weights.flatten()}")
print(f"Learned bias: {bias[0]:.4f}")

# === Plot training history ===
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(history.history['loss'], label='Train MSE', color='#059669')
ax1.plot(history.history['val_loss'], label='Val MSE', color='#0891b2')
ax1.set_xlabel('Epoch'); ax1.set_ylabel('MSE')
ax1.set_title('Loss Convergence'); ax1.legend(); ax1.grid(True, alpha=0.3)

ax2.plot(history.history['mae'], label='Train MAE', color='#059669')
ax2.plot(history.history['val_mae'], label='Val MAE', color='#0891b2')
ax2.set_xlabel('Epoch'); ax2.set_ylabel('MAE')
ax2.set_title('MAE Over Training'); ax2.legend(); ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# === Predict new flat ===
new_flat = np.array([[1300, 3, 2, 10]])
new_flat_s = scaler.transform(new_flat)
pred_price = model.predict(new_flat_s)
print(f"Predicted price: ₹{pred_price[0][0]:.1f} Lakhs")
12

Scikit-Learn Implementation

Python – Complete Scikit-Learn Pipeline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score
)

# ============================================
# 1. CREATE SYNTHETIC INDIAN HOUSING DATASET
# ============================================
np.random.seed(42)
n = 1000

data = pd.DataFrame({
    'area_sqft':  np.random.uniform(400, 3000, n),
    'bedrooms':   np.random.randint(1, 6, n),
    'bathrooms':  np.random.randint(1, 4, n),
    'age_years':  np.random.uniform(0, 25, n),
    'floor':      np.random.randint(1, 25, n),
    'city_tier':  np.random.choice([1, 2, 3], n, p=[0.3, 0.4, 0.3]),
})

# Generate realistic prices (₹ Lakhs)
city_multiplier = {1: 1.5, 2: 1.0, 3: 0.6}
data['price_lakhs'] = (
    0.045 * data['area_sqft'] +
    7 * data['bedrooms'] +
    5 * data['bathrooms'] -
    1.2 * data['age_years'] +
    0.3 * data['floor'] +
    10 * data['city_tier'].map(city_multiplier) +
    np.random.normal(0, 8, n)
)

print(data.head())
print(data.describe())

# ============================================
# 2. EXPLORATORY DATA ANALYSIS
# ============================================
print("\nCorrelation with price:")
print(data.corr()['price_lakhs'].sort_values(ascending=False))

# ============================================
# 3. PREPARE FEATURES & TARGET
# ============================================
features = ['area_sqft', 'bedrooms', 'bathrooms', 'age_years', 'floor']
X = data[features].values
y = data['price_lakhs'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ============================================
# 4. MODEL 1: BASIC LINEAR REGRESSION
# ============================================
model = LinearRegression()
model.fit(X_train, y_train)

print("\n=== Linear Regression Results ===")
print(f"Intercept: {model.intercept_:.4f}")
for feat, coef in zip(features, model.coef_):
    print(f"  {feat:15s}: {coef:+.6f}")

y_pred = model.predict(X_test)
print(f"\nTest R²:   {r2_score(y_test, y_pred):.4f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"Test MAE:  {mean_absolute_error(y_test, y_pred):.4f}")

# ============================================
# 5. CROSS-VALIDATION
# ============================================
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"\n5-Fold CV R² scores: {cv_scores}")
print(f"Mean CV R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# ============================================
# 6. MODEL 2: PIPELINE WITH SCALING + SGD
# ============================================
sgd_pipeline = Pipeline([
    ('scaler', StandardScaler()),
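    # 'invscaling' (the SGDRegressor default) is used here; the 'optimal'
    # schedule can take very large early steps with squared error and diverge
    ('sgd', SGDRegressor(max_iter=1000, tol=1e-4,
                         learning_rate='invscaling', eta0=0.01,
                         random_state=42))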
])
sgd_pipeline.fit(X_train, y_train)
y_pred_sgd = sgd_pipeline.predict(X_test)
print(f"\nSGD R²:   {r2_score(y_test, y_pred_sgd):.4f}")

# ============================================
# 7. MODEL 3: POLYNOMIAL REGRESSION
# ============================================
poly_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('reg', LinearRegression())
])
poly_pipeline.fit(X_train, y_train)
y_pred_poly = poly_pipeline.predict(X_test)
print(f"Poly(2) R²: {r2_score(y_test, y_pred_poly):.4f}")

# ============================================
# 8. VISUALIZATION
# ============================================
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Actual vs Predicted
axes[0,0].scatter(y_test, y_pred, alpha=0.5, color='#059669', s=30)
axes[0,0].plot([y_test.min(), y_test.max()],
              [y_test.min(), y_test.max()], 'r--')
axes[0,0].set_title('Actual vs Predicted')

# Residuals
residuals = y_test - y_pred
axes[0,1].scatter(y_pred, residuals, alpha=0.5, color='#0891b2', s=30)
axes[0,1].axhline(y=0, color='red', linestyle='--')
axes[0,1].set_title('Residual Plot')

# Residual distribution
axes[1,0].hist(residuals, bins=30, color='#059669', alpha=0.7, edgecolor='white')
axes[1,0].set_title('Residual Distribution')

# Feature importance (coefficients)
coefs = pd.Series(model.coef_, index=features).sort_values()
coefs.plot.barh(ax=axes[1,1], color='#0891b2')
axes[1,1].set_title('Feature Coefficients')

plt.tight_layout()
plt.show()
13

Indian Case Studies

🏠 Case Study 1

House Price Prediction: Mumbai, Delhi NCR & Bangalore

Context: India's residential real estate market is valued at ~$300 billion (2024). Platforms like MagicBricks, 99acres, and Housing.com use regression models to provide instant price estimates.

Dataset Features

| Feature | Type | Example Values | Expected Coefficient Sign |
|---|---|---|---|
| Area (sq ft) | Continuous | 500–5000 | Positive (+) |
| Bedrooms (BHK) | Discrete | 1–5 | Positive (+) |
| Bathrooms | Discrete | 1–4 | Positive (+) |
| Floor | Discrete | 1–30 | Slightly positive (+) |
| Age (years) | Continuous | 0–30 | Negative (−) |
| Distance to metro (km) | Continuous | 0.1–15 | Negative (−) |
| City tier | Categorical | 1, 2, 3 | Tier 1 premium |

Key Findings

  • Mumbai (Tier 1): ₹18,000–₹25,000 per sq ft in prime areas; regression captures ~82% of the variance (R² = 0.82)
  • Bangalore: IT corridor proximity adds a ₹3,000/sq ft premium; modeled via the distance feature
  • Delhi NCR: Metro connectivity is the strongest price predictor after area
  • Limitation: Non-linear effects (e.g., sea-view premium in Mumbai) require polynomial features or tree models

Model Equation (Simplified Mumbai Model)

Price (₹L) = 0.062 × Area + 12.5 × Bedrooms − 2.1 × Age − 3.8 × Metro_Distance + 15.2

Interpretation: Each sq ft adds ₹6,200; each additional bedroom adds ₹12.5 Lakhs; each year of age reduces price by ₹2.1 Lakhs; each km from metro reduces price by ₹3.8 Lakhs.
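
To make the interpretation concrete, the simplified equation can be evaluated directly. The snippet below is a small sketch: the function name and the example flat are hypothetical, and only the coefficients come from the equation above.

```python
def mumbai_price_lakhs(area_sqft, bedrooms, age_years, metro_km):
    """Evaluate the simplified Mumbai model (price in ₹ Lakhs)."""
    return (0.062 * area_sqft + 12.5 * bedrooms
            - 2.1 * age_years - 3.8 * metro_km + 15.2)

# Hypothetical flat: 1,200 sq ft, 2 BHK, 5 years old, 1 km from the metro
# 74.4 + 25.0 - 10.5 - 3.8 + 15.2 = 100.3
price = mumbai_price_lakhs(1200, 2, 5, 1.0)
print(f"Estimated price: ₹{price:.1f} Lakhs")

# Marginal effect of one extra sq ft with everything else held fixed
delta = mumbai_price_lakhs(1201, 2, 5, 1.0) - price
print(f"One extra sq ft adds about ₹{delta * 100_000:,.0f}")
```

The second print recovers the area coefficient (0.062 Lakhs, i.e. ₹6,200), which is exactly what "each sq ft adds ₹6,200" means: the change in prediction when only that feature moves by one unit.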

🌾 Case Study 2

Agricultural Crop Yield Prediction

Context: India is the world's second-largest agricultural producer. The Ministry of Agriculture uses regression models to estimate production of wheat, rice, and cotton to plan storage, transportation, and pricing (MSP).

Data Sources

  • IMD (India Meteorological Department): Rainfall, temperature, humidity data
  • ICAR: Soil quality, fertilizer usage data
  • DAC&FW: Historical crop yield data by state and district

Feature Engineering

| Feature | Source | Impact on Yield |
|---|---|---|
| Total rainfall (mm) | IMD | Strong positive (up to a point) |
| Temperature (avg °C) | IMD | Crop-dependent |
| Fertilizer (kg/hectare) | ICAR | Positive (diminishing returns) |
| Soil pH | ICAR | Optimal range varies |
| Irrigation coverage (%) | Census | Strong positive |
| Sown area (hectares) | DAC&FW | Positive |

Results

Rice yield model (Punjab, 2010-2023): R² = 0.78, RMSE = 245 kg/hectare. The model correctly predicted the 2022 below-normal harvest due to irregular monsoon patterns detected via the rainfall feature.

Python – Crop Yield Example
# Simplified crop yield regression
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [rainfall_mm, temp_C, fertilizer_kg, irrigation_%]
X_crop = np.array([
    [800,  28, 120, 60],   # Normal monsoon
    [1200, 27, 150, 75],   # Good monsoon
    [600,  30, 100, 45],   # Drought year
    [950,  26, 180, 80],   # Well-irrigated
    [1100, 29, 130, 65],   # Above average rain
    [700,  31, 90,  40],   # Hot & dry
    [1050, 27, 160, 70],   # Good year
    [500,  32, 80,  35],   # Severe drought
])
y_crop = np.array([3200, 4100, 2400, 4500, 3800, 2100, 4000, 1800])

model_crop = LinearRegression()
model_crop.fit(X_crop, y_crop)

features = ['Rainfall', 'Temperature', 'Fertilizer', 'Irrigation']
for f, c in zip(features, model_crop.coef_):
    print(f"{f:12s}: {c:+.2f} kg/hectare per unit increase")

print(f"\nR² Score: {model_crop.score(X_crop, y_crop):.4f}")

🌧️ Case Study 3

Rainfall Prediction Using IMD Historical Data

Context: IMD maintains rainfall data from 1901 to present across 36 meteorological subdivisions. Linear regression with seasonal features (ENSO index, Indian Ocean Dipole, pre-monsoon temperatures) achieves R² ≈ 0.65 for June-September monsoon totals.

Challenge

Rainfall has inherent variability and chaotic dynamics. Linear regression captures the linear component of the signal but misses non-linear teleconnections. This is a great example of where linear regression provides a baseline that more complex models (Random Forest, LSTM) improve upon.

IMD's 2024 monsoon forecast used an ensemble of 5 statistical models, including multiple regression with ENSO, IOD, and Eurasian snow cover as predictors. Predicted: 106% of LPA (Long Period Average). Actual: 109% of LPA. Error: 3 percentage points, which is remarkably accurate for models built on linear regression.

14

Global Case Studies

🌍 Case Study 1

Boston Housing Dataset – Classic + Ethical Discussion

The Dataset: Created by Harrison & Rubinfeld (1978), the Boston Housing dataset has 506 samples with 13 features predicting median home value in Boston suburbs. For decades, it was the benchmark for regression algorithms.

Features

CRIM (per capita crime), ZN (residential zoning), INDUS (industrial proportion), CHAS (Charles River proximity), NOX (nitric oxide concentration), RM (average rooms), AGE (pre-1940 proportion), DIS (distance to employment), RAD (highway access), TAX (tax rate), PTRATIO (pupil-teacher), B (racial composition statistic), LSTAT (lower status %).

⚠️ Ethical Concerns

The 'B' feature is defined as 1000(Bk − 0.63)², where Bk is the proportion of Black residents. It was originally included to model the racist assumption that racial composition affects property values. Scikit-Learn deprecated `load_boston()` in version 1.0 (2021) and removed it in version 1.2 because:

  • The feature encodes racial bias as a "normal" predictor
  • Models trained on this data learn and perpetuate racial discrimination in housing
  • Students unknowingly absorb the idea that race should be a price predictor

Alternative: Use the California Housing dataset or create synthetic data. We include this discussion because understanding bias in data is as important as understanding algorithms.

🌍 Case Study 2

California Housing – The Modern Benchmark

20,640 samples from the 1990 US Census. Features: median income, house age, average rooms, average bedrooms, population, average occupancy, latitude, longitude. Target: median house value.

Python – California Housing
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"R²:   {r2_score(y_test, y_pred):.4f}")   # roughly 0.6
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

# Linear regression gets R² of roughly 0.6: a decent baseline!
# About 40% of the variance is left unexplained (non-linear patterns,
# missing features, and irreducible noise)
# → Motivation for polynomial regression, trees, neural nets

print("\nFeature importances:")
for name, coef in zip(housing.feature_names, model.coef_):
    print(f"  {name:15s}: {coef:+.4f}")

🌍 Case Study 3

Energy Consumption Prediction

Application: Google uses regression models (among others) in their DeepMind-powered data center cooling systems. By predicting energy consumption based on temperature, humidity, server load, and cooling settings, they reduced cooling energy by 40%.

A simplified linear model: Energy_kWh = 0.12 × Temperature + 0.08 × Humidity + 0.45 × Server_Load − 0.15 × Cooling_Setting + 50

While Google's actual system uses neural networks, the initial prototype was a linear regression model that proved the concept. This illustrates the ML workflow: start simple, prove value, then add complexity.

15

Startup Applications

🏠 PropTech Startups (Housing.com, NoBroker, Square Yards)

Use regression for instant price estimates. When a user lists a property, the model predicts a fair market value. Features include location coordinates, carpet area, amenities count, and nearby infrastructure scores. Revenue model: charge a premium for "AI-verified" price estimates.

🌾 AgriTech Startups (CropIn, DeHaat, Ninjacart)

CropIn uses satellite imagery + regression to predict crop yields per farm. DeHaat uses price regression to advise farmers on optimal selling time. These startups serve 15+ million Indian farmers and have raised $200M+ in funding.

💰 FinTech (Cred, Razorpay, PhonePe)

Credit scoring starts with logistic regression (Chapter 5), but expected transaction volume prediction uses linear regression. Razorpay predicts daily payment volumes to manage server scaling and cash reserves.

🚗 Mobility (Ola, Rapido)

Ride time prediction uses regression: ETA = β₁×distance + β₂×traffic_index + β₃×time_of_day + β₄×weather + intercept. Surge pricing models also start with regression on demand-supply ratios.

16

Government Applications

🏛️ NITI Aayog – GDP Forecasting

Uses multivariate regression with indicators: industrial production (IIP), credit growth, exports, government expenditure, FDI inflows. The model provides quarterly GDP growth estimates used in Union Budget planning.

🏛️ Census of India – Population Projection

Simple time-series regression on population data from 1951-2011 censuses projects India's population trajectory. The 2023 projection (1.43 billion) was within 1% of the actual figure.

🏛️ ISRO – Satellite Telemetry

ISRO uses regression models to predict satellite battery degradation, solar panel output vs. orbital angle, and fuel consumption rates. These models are interpretable and safety-critical; neural networks are too opaque for space missions.

🏛️ IMD – Monsoon Forecasting

As discussed in Case Study 3, IMD's ensemble regression model for monsoon prediction directly impacts agricultural planning, water resource management, and disaster preparedness for 1.4 billion people.

17

Industry Applications

| Industry | Application | Features Used | Impact |
|---|---|---|---|
| 🏦 Banking (SBI, HDFC) | Loan default amount estimation | Income, debt ratio, tenure, credit score | Risk-based pricing, ₹2L Cr loan portfolio |
| 🏥 Healthcare (Apollo) | Patient stay duration prediction | Age, diagnosis, vitals, procedure type | Bed allocation, staff scheduling |
| 🏭 Manufacturing (Tata) | Quality prediction (defect rate) | Temperature, pressure, speed, material | Reduced defect rate by 15% |
| 🛒 E-commerce (Flipkart) | Demand forecasting | Season, promotions, price, category | Inventory optimization, ₹500Cr savings |
| ⚡ Energy (Tata Power) | Load forecasting | Temperature, time, day type, historical load | Grid stability, peak management |
| 📱 Telecom (Jio) | Churn risk scoring | Usage, complaints, recharge frequency | Retention campaigns targeting top 10% |
| 🚚 Logistics (Delhivery) | Delivery time estimation | Distance, weight, route, warehouse load | SLA compliance improvement by 20% |
18

Mini Projects

🛠️ Mini Project 1: Indian House Price Predictor

Objective: Build a complete end-to-end house price prediction system for Indian cities.

Requirements

  1. Generate or collect a dataset with 500+ rows and 8+ features
  2. Perform EDA: histograms, correlation matrix, scatter plots
  3. Handle missing values and outliers
  4. Feature engineering: create price_per_sqft, log-transform skewed features
  5. Train-test split (80/20) with cross-validation
  6. Compare: Normal Equation vs Batch GD vs SGD (all from scratch)
  7. Also fit Scikit-Learn LinearRegression and compare results
  8. Generate diagnostic plots: residuals, Q-Q, actual vs predicted
  9. Report: R², Adjusted R², RMSE, MAE for all models
  10. Predict price for 5 new flats and discuss predictions

Python – Starter Code
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# === Generate synthetic Indian housing data ===
np.random.seed(42)
n = 800

cities = np.random.choice(
    ['Mumbai', 'Delhi', 'Bangalore', 'Hyderabad', 'Pune'],
    n, p=[0.25, 0.2, 0.25, 0.15, 0.15]
)
city_premium = {
    'Mumbai': 1.8, 'Delhi': 1.4, 'Bangalore': 1.3,
    'Hyderabad': 1.0, 'Pune': 0.9
}

data = pd.DataFrame({
    'city': cities,
    'area_sqft': np.random.uniform(400, 3000, n),
    'bedrooms': np.random.randint(1, 5, n),
    'bathrooms': np.random.randint(1, 4, n),
    'floor': np.random.randint(1, 20, n),
    'age_years': np.random.uniform(0, 20, n),
    'parking': np.random.randint(0, 3, n),
    'metro_dist_km': np.random.uniform(0.1, 10, n),
})

# Generate prices
data['price_lakhs'] = (
    0.05 * data['area_sqft'] +
    8 * data['bedrooms'] +
    5 * data['bathrooms'] -
    1.5 * data['age_years'] +
    0.4 * data['floor'] +
    4 * data['parking'] -
    2 * data['metro_dist_km'] +
    15 * data['city'].map(city_premium) +
    np.random.normal(0, 8, n)
).clip(lower=10)

print("Dataset created! Shape:", data.shape)
print(data.head())

# TODO: Students complete the following:
# 1. EDA (histograms, correlation matrix)
# 2. Feature engineering (one-hot encode city)
# 3. Train/test split
# 4. Implement LinearRegressionScratch.fit() with each method
# 5. Compare with sklearn LinearRegression
# 6. Diagnostic plots
# 7. Report all metrics

Expected Output

R² ≈ 0.90–0.95 (since data is synthetic with known linear relationships). Students should note that real-world data will typically give R² ≈ 0.6–0.85 due to non-linear effects and unobserved features.
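
For requirement 9, the four metrics can be collected by one small helper. The following is a sketch, not part of the starter code: the function name `regression_report` and the toy numbers are made up for illustration, and Adjusted R² uses the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1) with p features.

```python
import numpy as np

def regression_report(y_true, y_pred, n_features):
    """Return R2, Adjusted R2, RMSE and MAE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    errors = y_true - y_pred
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "R2": r2,
        "Adjusted R2": adj_r2,
        "RMSE": np.sqrt(ss_res / n),
        "MAE": np.mean(np.abs(errors)),
    }

# Toy check with made-up prices (₹ Lakhs) and a hypothetical 2-feature model
y_true_demo = np.array([50.0, 62.0, 71.0, 85.0, 90.0, 104.0])
y_pred_demo = np.array([52.0, 60.0, 73.0, 83.0, 93.0, 101.0])
report = regression_report(y_true_demo, y_pred_demo, n_features=2)
for name, value in report.items():
    print(f"{name:12s}: {value:.4f}")
```

Calling the same helper on the predictions of each model (Normal Equation, Batch GD, SGD, Scikit-Learn) gives directly comparable rows for the final report.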

🛠️ Mini Project 2: Crop Yield Estimator

Objective: Build a regression model that estimates crop yield (kg/hectare) based on weather, soil, and farming practice data.

Requirements

  1. Create/collect dataset: 300+ rows, 6+ features (rainfall, temperature, soil pH, fertilizer, irrigation, seed variety)
  2. Compare simple vs. multiple regression (1 feature vs. 6 features)
  3. Implement from scratch with Batch GD (with learning rate experiments)
  4. Generate cost convergence plots for α = 0.001, 0.01, 0.1, 0.5
  5. Show that feature scaling dramatically improves GD convergence
  6. Compute VIF (Variance Inflation Factor) for multicollinearity
  7. Polynomial regression (degree 2) vs. linear: is it worth it?
  8. Predict yield for upcoming season with different rainfall scenarios

Python – VIF Calculation
from sklearn.linear_model import LinearRegression
import numpy as np

def compute_vif(X, feature_names):
    """
    Compute Variance Inflation Factor for each feature.
    VIF > 5: moderate multicollinearity
    VIF > 10: severe multicollinearity
    """
    vif_data = []
    for i in range(X.shape[1]):
        # Regress feature i on all other features
        X_other = np.delete(X, i, axis=1)
        model = LinearRegression().fit(X_other, X[:, i])
        r2 = model.score(X_other, X[:, i])
        vif = 1 / (1 - r2) if r2 < 1 else float('inf')
        vif_data.append({
            'Feature': feature_names[i],
            'VIF': round(vif, 2),
            'Status': 'OK' if vif < 5 else ('Warning' if vif < 10 else 'REMOVE')
        })
    return vif_data

# Usage:
# vif_results = compute_vif(X, feature_names)
# for v in vif_results: print(v)
19

End-of-Chapter Exercises

Conceptual Questions

Exercise 1

Explain the difference between correlation and causation. If a linear regression model shows a positive coefficient for ice cream sales predicting drowning deaths, does ice cream cause drowning? Discuss confounding variables.

Exercise 2

State the four assumptions of linear regression (LINE). For each assumption, describe (a) what it means, (b) how to check it using diagnostic plots, and (c) what happens to OLS estimates if it's violated.

Exercise 3

Why does MSE emerge as the natural cost function from Maximum Likelihood under Gaussian noise? What cost function would arise if noise followed a Laplace distribution instead?

Exercise 4

Explain why R² always increases (or stays the same) when you add more features, but Adjusted R² can decrease. Give a concrete numerical example.

Exercise 5

Compare and contrast Batch GD, Stochastic GD, and Mini-Batch GD in terms of: (a) computation per step, (b) convergence smoothness, (c) memory requirements, (d) suitability for online learning.

Mathematical Problems

Exercise 6

Given data points: (1,3), (2,5), (3,7), (4,9), (5,11). Compute w and b using the Normal Equation by hand. Verify that these points are perfectly collinear and R² = 1.

Exercise 7

Starting from w=1, b=0, perform 3 iterations of gradient descent on data points (1,2), (2,3), (3,5) with learning rate α = 0.05. Show all intermediate calculations.

Exercise 8

Prove that the gradient of J(w) = (1/n)||y - Xw||² with respect to w is (2/n)Xᵀ(Xw - y). Use matrix calculus rules explicitly.

Exercise 9

Show that for simple linear regression (one feature), the Normal Equation simplifies to:
w = [nΣ(xᵢyᵢ) - Σ(xᵢ)Σ(yᵢ)] / [nΣ(xᵢ²) - (Σ(xᵢ))²]
b = ȳ - w·x̄

Exercise 10

A dataset has 100 samples and 5 features. The linear regression model achieves R² = 0.82. Compute Adjusted R². If we add a 6th feature and R² becomes 0.823, should we keep the new feature? Justify using Adjusted R².

Programming Exercises

Exercise 11

Implement gradient descent with momentum from scratch: v_t = β·v_{t-1} + ∇J(w), w ← w - α·v_t. Compare convergence speed with vanilla GD on a dataset of your choice.

Exercise 12

Create a function that generates an animated GIF showing gradient descent steps on a 2D cost surface (w vs J(w)). Use matplotlib.animation.

Exercise 13

Implement polynomial regression from scratch. Fit degree 1, 2, 3, and 5 polynomials to 20 noisy data points. Plot all fits and discuss underfitting vs. overfitting.

Exercise 14

Build a complete data pipeline: read a CSV file, handle missing values (mean imputation), encode categorical variables, scale features, train a model, evaluate, and save the model using pickle. The entire pipeline should run with a single function call.

Exercise 15

Implement k-fold cross-validation from scratch (without using sklearn). Compute mean and std of R² across 5 folds.

Application Exercises

Exercise 16

Collect electricity bills for your household over the past 12 months. Use month number and average temperature as features to predict the bill amount. Is the relationship linear? If not, try polynomial features.

Exercise 17

Use the IMD rainfall data (available on data.gov.in) for your state. Build a simple regression model: rainfall_this_year ~ f(rainfall_last_year, El_Nino_index). Report R² and discuss limitations.

Exercise 18

Scrape or collect prices of 50 second-hand cars from OLX/CarDekho. Features: year, km driven, fuel type, engine CC. Build a linear regression model and discuss which features matter most.

Exercise 19

Download the California Housing dataset from Scikit-Learn. Compare: (a) simple linear regression using only median income, (b) multiple regression with all 8 features, (c) polynomial regression (degree 2) with all features. Report R² for each.

Exercise 20

Build a simple web API (using Flask or FastAPI) that accepts house features via JSON and returns a predicted price. Deploy locally and test with curl or Postman.

Exercise 21

Implement Ridge Regression (L2 regularization) from scratch: J(w) = MSE + λ||w||². Derive the gradient and implement gradient descent. Show that as λ increases, coefficients shrink toward zero.

Exercise 22

Create a comprehensive comparison table: run linear regression on the California Housing dataset using (a) your scratch implementation (Normal Equation), (b) your scratch GD implementation, (c) sklearn LinearRegression, (d) sklearn SGDRegressor, (e) TensorFlow Dense(1). All should give approximately the same R². Discuss any differences.

20

Multiple Choice Questions

Q1. The Normal Equation for linear regression is:

  • (a) w = Xᵀ(XXᵀ)⁻¹y
  • (b) w = (XᵀX)⁻¹Xᵀy
  • (c) w = X(XᵀX)⁻¹y
  • (d) w = (XXᵀ)⁻¹Xy
Answer: (b). The Normal Equation w = (XᵀX)⁻¹Xᵀy is derived by setting the gradient of MSE to zero.

Q2. Which of the following is NOT an assumption of OLS linear regression?

  • (a) Linearity between features and target
  • (b) Normality of residuals
  • (c) Features must follow a normal distribution
  • (d) Homoscedasticity of residuals
Answer: (c). OLS assumes normality of residuals, not features. Features can have any distribution. This is a common exam trap.

Q3. If the learning rate in gradient descent is too large, what happens?

  • (a) The model converges very slowly
  • (b) The cost function may diverge (increase)
  • (c) The model always finds the global minimum faster
  • (d) The model underfits the data
Answer: (b). A large learning rate causes the parameters to overshoot the minimum, potentially diverging. The cost function may oscillate or increase indefinitely.

Q4. R² = 0.75 means:

  • (a) The model is 75% accurate
  • (b) 75% of data points are correctly predicted
  • (c) The model explains 75% of the variance in the target variable
  • (d) 75% of features are significant
Answer: (c). R² (coefficient of determination) represents the proportion of variance in the dependent variable explained by the model. R² = 1 - SS_res/SS_tot.

Q5. Which gradient descent variant uses a single randomly chosen sample per update?

  • (a) Batch Gradient Descent
  • (b) Mini-Batch Gradient Descent
  • (c) Stochastic Gradient Descent
  • (d) Adam Optimizer
Answer: (c). SGD uses one random sample per update, making it the most granular variant. Batch GD uses all samples; Mini-Batch uses a subset (e.g., 32).

Q6. Feature scaling is essential for gradient descent because:

  • (a) It makes the data normally distributed
  • (b) It ensures features are on similar scales, preventing oscillation
  • (c) It removes outliers from the dataset
  • (d) It is required by the Normal Equation
Answer: (b). Without scaling, features with large ranges dominate the gradient, causing oscillation. Scaling makes the cost surface more spherical, enabling efficient convergence. Note: the Normal Equation does NOT require scaling.

Q7. The MSE cost function for linear regression is:

  • (a) Concave – has a maximum
  • (b) Convex – has exactly one minimum
  • (c) Non-convex – has multiple local minima
  • (d) Flat – has no minimum or maximum
Answer: (b). For linear regression, MSE is a convex (bowl-shaped) function. This guarantees that gradient descent finds the global minimum, and the Normal Equation gives the exact solution.

Q8. If XᵀX is singular (non-invertible), which of the following is true?

  • (a) The Normal Equation gives a unique solution
  • (b) Some features are perfectly multicollinear
  • (c) Gradient descent cannot be used
  • (d) R² will be exactly 1
Answer: (b). XᵀX is singular when features are linearly dependent (perfect multicollinearity) or n < p. Use pseudoinverse (np.linalg.pinv) or regularization (Ridge). GD can still work, but the solution is not unique.
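
A minimal sketch of the singular case, using a made-up four-row design matrix whose third column is exactly twice the second (the data and variable names are illustrative, not taken from the chapter's datasets). It checks the rank of XᵀX and then uses np.linalg.pinv, which returns the minimum-norm least-squares solution.

```python
import numpy as np

# Columns: intercept, x, and a redundant copy 2 * x (perfect multicollinearity)
X_col = np.array([
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 4.0],
    [1.0, 3.0, 6.0],
    [1.0, 4.0, 8.0],
])
y_col = np.array([3.0, 5.0, 7.0, 9.0])   # generated by y = 1 + 2 * x

gram = X_col.T @ X_col
gram_rank = np.linalg.matrix_rank(gram)
print("rank of X^T X:", gram_rank, "out of", gram.shape[0])   # rank-deficient

# np.linalg.inv(gram) is not usable here; the pseudoinverse still gives a fit
w_pinv = np.linalg.pinv(X_col) @ y_col
print("pinv weights:", np.round(w_pinv, 4))
print("predictions: ", np.round(X_col @ w_pinv, 4))
```

The predictions reproduce y exactly, but the weight on x is split across the two collinear columns. Any split with w₁ + 2w₂ = 2 fits equally well, which is what "the solution is not unique" means in practice.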

Q9. Which metric is most robust to outliers?

  • (a) MSE
  • (b) RMSE
  • (c) MAE
  • (d) R²
Answer: (c). MAE uses absolute values instead of squares, so large errors don't disproportionately inflate the metric. MSE and RMSE amplify large errors due to squaring.
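
The difference is easy to see on a toy example. In the sketch below (made-up numbers), every prediction is off by 1, and then a single prediction is changed so that it is off by 20.

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

y_obs = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
pred_clean = np.array([11.0, 11.0, 12.0, 12.0, 13.0])   # every error is 1
pred_outlier = pred_clean.copy()
pred_outlier[-1] = 32.0                                  # one error of 20

for label, pred in [("clean", pred_clean), ("one outlier", pred_outlier)]:
    print(f"{label:12s} MAE = {mae(y_obs, pred):5.2f}   "
          f"MSE = {mse(y_obs, pred):6.2f}   "
          f"RMSE = {np.sqrt(mse(y_obs, pred)):5.2f}")
```

One bad prediction raises MAE from 1.0 to 4.8 but MSE from 1.0 to 80.8, because the single error of 20 contributes 400 to the sum of squares.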

Q10. Deriving MSE from Maximum Likelihood requires which assumption about noise?

  • (a) Noise follows a uniform distribution
  • (b) Noise follows a Gaussian (normal) distribution
  • (c) Noise is always zero
  • (d) Noise follows a Bernoulli distribution
Answer: (b). Under the assumption εᵢ ~ N(0, σ²), maximizing log-likelihood is equivalent to minimizing MSE. With Laplace noise, we'd get MAE; with uniform noise, we'd get minimax.
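
A quick numerical illustration of the Gaussian/Laplace contrast, using the simplest possible model: a single constant c fitted to made-up data. Minimizing squared error (the Gaussian case) should land on the mean, while minimizing absolute error (the Laplace case) should land on the median. The grid search here is only for illustration.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])     # made-up sample with one outlier
candidates = np.linspace(0, 100, 10001)          # candidate constants, step 0.01

# Average loss of each candidate constant against every data point
sq_loss = ((data[:, None] - candidates[None, :]) ** 2).mean(axis=0)
abs_loss = np.abs(data[:, None] - candidates[None, :]).mean(axis=0)

best_sq = candidates[sq_loss.argmin()]
best_abs = candidates[abs_loss.argmin()]

print(f"Squared-error minimizer:  {best_sq:.2f}  (mean   = {data.mean():.2f})")
print(f"Absolute-error minimizer: {best_abs:.2f}  (median = {np.median(data):.2f})")
```

The outlier drags the squared-error solution far from the bulk of the data, while the absolute-error solution stays with the median, which is the same robustness argument that applies to MAE as a metric.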

Q11. What is the time complexity of the Normal Equation?

  • (a) O(n)
  • (b) O(np)
  • (c) O(np² + p³)
  • (d) O(n²p²)
Answer: (c). Computing XᵀX takes O(np²), and inverting the p×p matrix takes O(p³). This is why the Normal Equation becomes impractical for p > 10,000 features.

Q12. In the residual plot, a funnel-shaped pattern indicates:

  • (a) Non-linearity
  • (b) Heteroscedasticity
  • (c) Autocorrelation
  • (d) Multicollinearity
Answer: (b). A funnel shape (residuals fanning out) means the variance is not constant across predictions, which violates the homoscedasticity assumption. Solutions: WLS, log-transform target, or robust regression.
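
A funnel can also be checked numerically, without a plot. The sketch below assumes a synthetic data-generating process in which the noise standard deviation grows with x, then compares the residual spread for low and high fitted values; it is a rough diagnostic, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
# Assumed process: the noise standard deviation is proportional to x.
y = 3.0 * x + 5.0 + rng.normal(scale=0.5 * x, size=n)

# Fit a straight line by least squares (polyfit returns slope, then intercept).
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

# Compare residual spread below and above the median fitted value.
cutoff = np.median(fitted)
spread_low = residuals[fitted <= cutoff].std()
spread_high = residuals[fitted > cutoff].std()

print("residual std, low fitted values: ", spread_low)
print("residual std, high fitted values:", spread_high)
```

A clearly larger spread in the upper half is the numerical counterpart of the funnel seen in the residual plot.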
21

Interview Questions

Q1. What is the difference between Linear Regression and Correlation?

Correlation measures the strength and direction of a linear relationship (a number from -1 to +1). Linear regression models the relationship to make predictions. Correlation is symmetric (corr(X,Y) = corr(Y,X)), but regression is not (regressing Y on X ≠ regressing X on Y). Neither establishes causation on its own: regression coefficients are often interpreted as "effect sizes", but that reading requires additional causal assumptions.
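
The asymmetry is easy to verify on synthetic data. In the sketch below (data and coefficients are made up), the two regression slopes differ, yet their product equals the squared correlation.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 2.0 * x + rng.normal(scale=2.0, size=300)

# Correlation does not care about the order of its arguments.
r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]

# Simple-regression slopes: covariance divided by the predictor's variance.
cov_xy = np.cov(x, y)[0, 1]                   # sample covariance (ddof=1)
slope_y_on_x = cov_xy / np.var(x, ddof=1)
slope_x_on_y = cov_xy / np.var(y, ddof=1)

print("corr(x, y), corr(y, x):", r_xy, r_yx)
print("slope of y on x:", slope_y_on_x)
print("slope of x on y:", slope_x_on_y)
print("product of slopes vs r^2:", slope_y_on_x * slope_x_on_y, r_xy ** 2)
```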

Q2. Why do we use MSE and not just Mean Error?

Mean Error (ME = Σ(yᵢ - ŷᵢ)/n) can be zero even with terrible predictions because positive and negative errors cancel out. MSE squares errors to make them all positive, penalizes large errors more heavily, and is mathematically convenient (differentiable, convex). Additionally, MSE arises naturally from the Maximum Likelihood principle under Gaussian noise assumptions.
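
The cancellation is visible with four made-up points, where every prediction misses by 2 but in alternating directions:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([3.0, 0.0, 5.0, 2.0])    # each prediction is wrong by 2

errors = y_true - y_pred                   # [-2, 2, -2, 2]
mean_error = errors.mean()                 # positive and negative errors cancel
mean_squared_error = (errors ** 2).mean()  # squaring removes the sign

print("ME: ", mean_error)
print("MSE:", mean_squared_error)
```

ME reports 0.0, as if the model were perfect, while MSE reports 4.0.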

Q3. When would you prefer Gradient Descent over the Normal Equation?

GD is preferred when: (1) p (the number of features) is large (roughly > 10,000), because matrix inversion is O(p³); (2) n is very large and the data doesn't fit in memory, since SGD processes one sample at a time; (3) data arrives online or as a stream, since SGD can update in real time; (4) non-linear extensions are needed, since GD generalizes to neural networks while the Normal Equation applies only to linear models. The Normal Equation is preferred for small datasets where the exact solution is cheap to compute.

Q4. How do you handle multicollinearity in linear regression?

Detection: (1) Compute the VIF (Variance Inflation Factor); VIF > 10 is commonly treated as severe; (2) Check the correlation matrix for pairs with |r| > 0.9. Solutions: (1) Remove one of the correlated features; (2) Use PCA to create uncorrelated components; (3) Use Ridge (L2) or LASSO (L1) regularization: Ridge shrinks correlated coefficients, LASSO can drive some to zero; (4) Use domain knowledge to choose the most meaningful feature.
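
The VIF of feature j is 1 / (1 - Rⱼ²), where Rⱼ² comes from regressing feature j on all the other features. A minimal from-scratch sketch follows; the helper name `vif` and the synthetic data are assumptions for illustration, and libraries such as statsmodels provide their own implementation.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([others, np.ones(n)])       # include an intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()           # R^2 of feature j on the rest
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                    # independent of x1
x3 = x1 + 0.1 * rng.normal(size=n)         # nearly a copy of x1
vifs = vif(np.column_stack([x1, x2, x3]))

print("VIFs for x1, x2, x3:", vifs)
```

Here x1 and x3 should show very large VIFs, while x2 stays close to 1.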

Q5. Can R² be negative? What does it mean?

Yes! R² = 1 - SS_res/SS_tot. If SS_res > SS_tot (the model's predictions are worse than simply predicting the mean), R² becomes negative. This happens when: (1) The model is very wrong (e.g., predicting a constant that's far from the mean); (2) You evaluate on a test set with a very different distribution; (3) The model is badly overfit. An R² < 0 model is worse than the "dumb baseline" of always predicting ȳ.
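
Case (1) can be checked by hand. The sketch below uses five made-up targets and a hypothetical helper `r_squared` that applies the formula directly.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r2_baseline = r_squared(y_true, np.full(5, y_true.mean()))   # always predict the mean
r2_bad = r_squared(y_true, np.full(5, 10.0))                 # always predict 10

print("R^2 of the mean baseline:", r2_baseline)
print("R^2 of the constant 10:  ", r2_bad)
```

For the constant 10, SS_res = 255 and SS_tot = 10, so R² = 1 - 25.5 = -24.5.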

Q6. Explain the bias-variance tradeoff in the context of linear regression.

OLS linear regression has low bias: when the linear model is correctly specified it is unbiased, and by the Gauss-Markov theorem it has the lowest variance among linear unbiased estimators. It can still have high variance when there are many features or multicollinearity (coefficients become unstable). Ridge regression introduces some bias (by shrinking coefficients) but can substantially reduce variance, often giving better test performance. The decomposition is: Expected Test Error = Bias² + Variance + Irreducible Noise. OLS minimizes bias; regularized methods optimize the tradeoff.
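
A small simulation shows both sides of the tradeoff. This is a sketch under assumed settings: two nearly collinear features, made-up true weights, a hypothetical Ridge strength `lam`, and no intercept. The design matrix is held fixed and only the noise is redrawn, so the spread of the fitted coefficients reflects estimator variance.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)        # nearly collinear with x1
X = np.column_stack([x1, x2])
true_w = np.array([1.5, 0.5])              # assumed ground-truth weights
lam = 1.0                                  # hypothetical Ridge strength

ols_fits, ridge_fits = [], []
for _ in range(500):
    y = X @ true_w + rng.normal(scale=1.0, size=n)    # fresh noise, same X
    ols_fits.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_fits.append(np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y))

ols_fits = np.array(ols_fits)
ridge_fits = np.array(ridge_fits)

ols_var = ols_fits.var(axis=0)             # variance of each coefficient
ridge_var = ridge_fits.var(axis=0)
ridge_bias = ridge_fits.mean(axis=0) - true_w

print("OLS coefficient variance:  ", ols_var)
print("Ridge coefficient variance:", ridge_var)
print("Ridge coefficient bias:    ", ridge_bias)
```

In this setup Ridge pulls the two coefficients toward each other, which is the bias, and in exchange their run-to-run variance drops sharply.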

Q7. What is the relationship between linear regression and a single-neuron neural network?

A single neuron without an activation function IS linear regression: output = wᵀx + b. With a sigmoid activation, it becomes logistic regression. With a ReLU activation, it's a "half-linear" model. Deep neural networks are essentially stacks of these linear transformations with non-linear activations. Understanding linear regression deeply is understanding the building block of all neural networks.
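
The three cases differ only in what is applied after the linear step. A minimal sketch, with a hypothetical `neuron` helper and made-up weights:

```python
import numpy as np

def neuron(x, w, b, activation=None):
    """Single neuron: a linear step followed by an optional activation."""
    z = w @ x + b                           # same form as linear regression
    if activation == "sigmoid":
        return 1.0 / (1.0 + np.exp(-z))     # logistic regression output
    if activation == "relu":
        return np.maximum(z, 0.0)           # passes positives, zeros out negatives
    return z                                # no activation: plain linear regression

x = np.array([1.0, 2.0])
w = np.array([0.5, -1.0])
b = 0.25

print("linear: ", neuron(x, w, b))
print("sigmoid:", neuron(x, w, b, activation="sigmoid"))
print("relu:   ", neuron(x, w, b, activation="relu"))
```

With these values the linear step gives 0.5 - 2.0 + 0.25 = -1.25, so the ReLU output is 0.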

Q8. How would you handle outliers in a regression problem?

Approaches: (1) Detection: IQR method, Z-score, Cook's distance (an influence measure, usually read alongside a leverage plot); (2) Removal: only if they're data errors, not genuine extreme values; (3) Robust regression: use Huber loss, which downweights large residuals, or RANSAC, which fits on inlier subsets and ignores points flagged as outliers; (4) Transform: log-transform heavy-tailed targets; (5) Use MAE instead of MSE as the loss function (less sensitive to outliers); (6) Winsorize: cap extreme values at the 5th/95th percentile. The best approach depends on whether outliers are signal or noise.
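
To see why Huber loss is gentler on outliers, compare it with the squared loss on a small and a large residual. The sketch below uses the standard Huber definition; the function name and the default δ = 1 are choices made for this example.

```python
import numpy as np

def huber_loss(residuals, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    r = np.abs(residuals)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.where(r <= delta, quadratic, linear)

residuals = np.array([0.5, -3.0])          # one small error, one large error

print("Huber loss:       ", huber_loss(residuals, delta=1.0))
print("Half squared loss:", 0.5 * residuals ** 2)
```

The small residual costs 0.125 under both losses, but the large one costs 2.5 under Huber versus 4.5 under the half squared loss, and the gap widens as the residual grows.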

Q9. You have a linear regression model with R² = 0.95 on training data but R² = 0.40 on test data. What happened and how do you fix it?

This is overfitting. The model memorized training data but fails to generalize. Likely causes: too many features relative to samples (high p/n ratio), polynomial features of high degree, or multicollinearity. Fixes: (1) Regularization: Ridge (L2) or LASSO (L1); (2) Feature selection: remove irrelevant features; (3) Cross-validation: use k-fold CV rather than a single train/test split to tune the model and to get a more reliable estimate of generalization; (4) More data, which reduces variance; (5) Reduce the polynomial degree; (6) Check for data leakage (test data information accidentally in training).
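
Fixes (1) and (3) are often combined: k-fold cross-validation is used to compare regularization strengths. The sketch below is a from-scratch illustration on synthetic data with a high p/n ratio; the helper `kfold_mse`, the candidate `lam` values, and the absence of an intercept are all simplifying assumptions, and which setting scores better depends on the data.

```python
import numpy as np

def kfold_mse(X, y, k=5, lam=0.0, seed=0):
    """Average validation MSE of (Ridge) linear regression over k folds."""
    n, p = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        A = X[train]
        # lam = 0 gives ordinary least squares; lam > 0 gives Ridge.
        w = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ y[train])
        scores.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(7)
n, p = 60, 30                               # many features relative to samples
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:3] = 1.0                            # only three features matter
y = X @ true_w + rng.normal(scale=1.0, size=n)

cv_ols = kfold_mse(X, y, k=5, lam=0.0)
cv_ridge = kfold_mse(X, y, k=5, lam=5.0)

print("5-fold CV MSE, OLS:  ", cv_ols)
print("5-fold CV MSE, Ridge:", cv_ridge)
```

Because both settings are scored on the same folds, the comparison is fairer than judging each on a different random split.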

Q10. Explain feature engineering for linear regression. Give 3 examples.

Feature engineering creates new informative features: (1) Interaction terms: area × bedrooms might capture a "spacious bedrooms" effect; (2) Polynomial features: x² captures non-linear relationships while keeping the model "linear in parameters"; (3) Log transform: log(income) linearizes exponential relationships; (4) Binning: convert continuous age to age groups; (5) One-hot encoding: convert categorical city to binary features; (6) Domain-specific: price_per_sqft = price/area normalizes for size differences (do not use it as an input when price itself is the target, since that leaks the target).
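
Several of these transformations are one-liners in NumPy. The sketch below uses four made-up flats; the values and city names are placeholders, not real data.

```python
import numpy as np

area = np.array([600.0, 900.0, 1200.0, 1500.0])    # sq. ft. (made-up values)
bedrooms = np.array([1.0, 2.0, 2.0, 3.0])
city = np.array(["Mumbai", "Pune", "Mumbai", "Delhi"])

interaction = area * bedrooms          # (1) interaction term
area_squared = area ** 2               # (2) polynomial feature
log_area = np.log(area)                # (3) log transform

# (5) One-hot encoding: one binary column per distinct city (sorted by name).
categories = np.unique(city)
one_hot = (city[:, None] == categories[None, :]).astype(float)

features = np.column_stack(
    [area, bedrooms, interaction, area_squared, log_area, one_hot]
)

print("categories:", categories)
print("feature matrix shape:", features.shape)
```

If the model also has an intercept, one of the one-hot columns is usually dropped, since the columns sum to 1 and would otherwise be perfectly collinear with it.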

Q11. What happens to linear regression if features are perfectly multicollinear?

If features are perfectly multicollinear (e.g., x₃ = 2x₁ + x₂), then XᵀX is singular (not invertible), and the Normal Equation has no unique solution. There are infinitely many weight vectors that achieve the same MSE. NumPy's pinv (pseudoinverse) returns the minimum-norm solution, but the individual coefficients are not interpretable. GD still converges in terms of MSE, but the weights it reaches depend on the initialization. Fix: use regularization, drop redundant features, or use PCA.
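
The behaviour of GD under perfect multicollinearity can be observed directly. The sketch below runs batch gradient descent from two different starting points on synthetic data with a redundant third feature; the step-size rule, the iteration count, and the (1/2n) scaling of the cost are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x1, x2, 2 * x1 + x2])     # third column is redundant
y = 3 * x1 - x2 + rng.normal(scale=0.1, size=n)

def gradient_descent(X, y, w0, steps=2000):
    """Batch GD on the cost (1 / 2n) * ||Xw - y||^2."""
    n = len(y)
    # Step size 1 / (largest eigenvalue of X^T X / n) keeps the updates stable.
    lr = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    w = w0.astype(float).copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w -= lr * grad
    return w

w_from_zero = gradient_descent(X, y, np.zeros(3))
w_from_ones = gradient_descent(X, y, np.ones(3))

print("weights from zeros:", w_from_zero)
print("weights from ones: ", w_from_ones)
print("same predictions:  ", np.allclose(X @ w_from_zero, X @ w_from_ones))
```

Both runs reach the same predictions with different weights. Starting from zeros, GD lands on the same minimum-norm solution that the pseudoinverse returns.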

22

Research Problems

🔬 Research Problem 1: Adaptive Learning Rate for Indian Agricultural Data

Problem: Indian agricultural data has high variance across states (Punjab vs. Rajasthan) and years (monsoon vs. drought). Standard learning rate schedules (step decay, cosine annealing) are designed for i.i.d. data. Design and implement an adaptive learning rate schedule that accounts for heterogeneity in agricultural data.

Suggested approach: Implement per-feature learning rates based on the curvature of the loss landscape (second-order information). Compare with Adam, AdaGrad, and RMSprop. Evaluate on IMD+ICAR datasets across 5 crop types and 10 states.

Expected outcome: A novel learning rate schedule that converges 2-3x faster on heterogeneous agricultural data compared to fixed learning rate.

🔬 Research Problem 2: Robust Regression for Noisy Urban Data

Problem: Indian urban datasets (real estate, traffic, pollution) contain extreme outliers and non-Gaussian noise. Standard MSE-based regression is highly sensitive to outliers. Investigate combinations of Huber loss, quantile regression, and RANSAC for Indian city datasets.

Research questions: (1) What is the optimal Huber δ parameter for Indian real estate data? (2) Does the optimal δ vary by city tier? (3) Can you derive a closed-form estimator for Huber regression? (4) Compare computational cost vs. robustness gain.

🔬 Research Problem 3: Interpretable Regression for Policy

Problem: Government policies in India (MSP pricing, subsidy allocation, tax rates) require interpretable models. But linear regression's assumption of linearity is often violated. Investigate Generalized Additive Models (GAMs) as a middle ground: y = β₀ + f₁(x₁) + f₂(x₂) + ... where each f is a smooth non-linear function.

Questions: (1) How much accuracy do GAMs gain over linear regression on Indian policy datasets? (2) Are GAMs interpretable enough for Parliamentary scrutiny? (3) Implement a simple GAM using basis splines and compare with linear regression and random forest on 3 government datasets.

23

Key Takeaways

1
Linear regression models the relationship ŷ = Xw + b by finding weights that minimize MSE. It's a foundation for much of supervised learning.
2
MSE arises from Maximum Likelihood under Gaussian noise. It's not an arbitrary choice: it is the maximum-likelihood loss function when errors are normally distributed.
3
The Normal Equation w = (XᵀX)⁻¹Xᵀy gives the exact solution in one step. Use it for small to medium datasets (p < 10,000).
4
Gradient Descent iteratively updates weights: w ← w - α·∇J. Three variants (Batch, SGD, Mini-batch) trade off computation, stability, and speed.
5
Feature scaling is critical for gradient descent but unnecessary for the Normal Equation. Without scaling, GD may converge very slowly or, for a given learning rate, diverge.
6
Evaluate with multiple metrics: R², Adjusted R², RMSE, MAE. Each tells a different story. Always check diagnostic plots (residuals, Q-Q) to validate assumptions.
7
The four assumptions (LINE), namely Linearity, Independence, Normality of residuals, and Equal variance, must be verified. Violations don't automatically invalidate the model, but they can make OLS sub-optimal and its standard errors and confidence intervals unreliable.
8
Linear regression is industry-standard in India: RBI uses it for GDP forecasting, IMD for monsoon prediction, ISRO for satellite telemetry, and every major tech company for baseline models.
9
Start with linear regression as a baseline before trying complex models. If linear regression gives R² = 0.90, a neural network that gives 0.92 may not be worth the added complexity and lost interpretability.
10
Every neuron in a neural network computes a linear combination of its inputs (the same wᵀx + b used in linear regression) followed by a non-linear activation. Mastering this chapter prepares you for deep learning.
24

References & Further Reading

Textbooks

  1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer. Chapters 3-4. [Free online: https://hastie.su.domains/ElemStatLearn/]
  2. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 3.
  3. Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed. O'Reilly. Chapter 4.
  4. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 11.
  5. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd ed. Springer. Chapter 3. [Free online: https://www.statlearning.com/]

Research Papers

  1. Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Paris. [Original least squares paper]
  2. Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. [Gaussian noise + MLE connection]
  3. Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55-67.
  4. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. JRSS-B, 58(1), 267-288.
  5. Harrison, D., & Rubinfeld, D. L. (1978). Hedonic Housing Prices and the Demand for Clean Air. Journal of Environmental Economics and Management, 5(1), 81-102.

Indian Data Sources

  1. data.gov.in: Government of India Open Data Platform [Housing, Agriculture, Weather datasets]
  2. IMD (mausam.imd.gov.in): India Meteorological Department historical rainfall data
  3. ICAR (icar.org.in): Indian Council of Agricultural Research crop data
  4. NHB (nhb.org.in): National Housing Bank RESIDEX (Real Estate Price Index)
  5. Kaggle: "Indian Houses Dataset", "Indian Real Estate Analysis", "IMD Rainfall Data"

Online Resources

  1. Scikit-Learn Documentation: sklearn.linear_model.LinearRegression
  2. TensorFlow Tutorials: Basic regression (tensorflow.org/tutorials)
  3. Stanford CS229 Lecture Notes: Andrew Ng's Linear Regression notes
  4. 3Blue1Brown: "The Essence of Linear Algebra" (YouTube)
  5. StatQuest with Josh Starmer: "Linear Regression" series (YouTube)

Ethical References

  1. Carlisle, M. (2020). "Racist data destruction? On the Boston Housing dataset." Medium article explaining why sklearn deprecated load_boston().
  2. O'Neil, C. (2016). Weapons of Math Destruction. Crown Books. [On biased models in housing, credit, and policing]
