Chapter 4: Linear Regression & Gradient Descent

01

Learning Objectives

After completing this chapter, you will be able to:

1

Explain the intuition behind fitting a line to data and the concept of "best fit."

2

Derive the Mean Squared Error cost function from Maximum Likelihood Estimation with Gaussian noise.

3

Solve linear regression using the closed-form Normal Equation: w = (X^TX)^-1X^Ty.

4

Implement Batch, Stochastic, and Mini-batch Gradient Descent from scratch in Python.

5

Analyze the effects of learning rate, convergence criteria, and feature scaling on GD.

6

Evaluate models using R², Adjusted R², MSE, RMSE, and MAE metrics.

7

Verify linear regression assumptions via residual plots, Q-Q plots, and leverage plots.

8

Build end-to-end regression pipelines using Scikit-Learn and TensorFlow.

9

Apply regression to real-world Indian datasets: house prices, crop yields, and rainfall.

10

Discuss ethical considerations in regression modeling (bias, fairness, interpretability).

02

Introduction

Imagine you're a real estate agent in Mumbai, and a client asks: "If I buy a 1,200 sq. ft. flat in Andheri, what should I expect to pay?" You've sold dozens of flats—you know roughly that larger flats cost more. But can you turn that intuition into a precise number? That's exactly what linear regression does.

Linear regression is the "Hello, World!" of machine learning. It is the simplest, most interpretable, and arguably the most widely deployed supervised learning algorithm. From predicting house prices to forecasting stock markets, from estimating crop yields to modeling climate change—regression is everywhere.

In this chapter, we'll go from zero to expert. We'll start with the intuitive idea of "drawing a line through points," derive every formula from first principles, implement everything from scratch in Python, and then see how the same ideas power billion-dollar applications at Google, Flipkart, and ISRO.

Why Linear Regression Matters

Interpretability: Every coefficient tells you exactly how much a feature affects the output.
Speed: Closed-form solution exists—training can be instantaneous.
Foundation: Logistic regression, ridge, lasso, elastic net, GLMs all extend linear regression.
Industry Standard: In regulated industries (banking, insurance, healthcare), interpretable models like linear regression are often legally required.
Baseline: Any ML project should start with a linear regression baseline before trying complex models.

03

Historical Background

The story of linear regression begins with the stars—literally. In 1805, the French mathematician Adrien-Marie Legendre published the method of least squares to predict the orbits of comets. He wanted to find the "best" line through noisy astronomical observations.

Just four years later, Carl Friedrich Gauss claimed he had invented the method even earlier (around 1795) and showed that least squares is the optimal estimator when errors follow a Gaussian (normal) distribution. This connection between least squares and Gaussian noise is the Maximum Likelihood derivation we'll explore in Section 6.

The term "regression" itself was coined by Sir Francis Galton in the 1880s, studying the relationship between parents' heights and children's heights. He observed that children of very tall parents tended to be shorter—they "regressed toward the mean." Though the biological interpretation has changed, the mathematical method kept the name.

Timeline of Key Milestones

Year	Contributor	Milestone
1795	Carl Friedrich Gauss	First use of least squares (claimed, published 1809)
1805	Adrien-Marie Legendre	Published method of least squares
1821	Carl Friedrich Gauss	Gauss-Markov theorem: OLS is BLUE
1885	Francis Galton	Coined "regression" (regression to the mean)
1897	Karl Pearson	Multiple regression, correlation coefficient
1922	R.A. Fisher	Maximum likelihood estimation formalized
1958	Frank Rosenblatt	Perceptron (linear regression + threshold = classification)
1970	Hoerl & Kennard	Ridge regression (regularized linear regression)
1996	Robert Tibshirani	LASSO (L1-regularized linear regression)
2005	Zou & Hastie	Elastic Net (L1+L2 regularization)

04

Conceptual Explanation

4.1 The Simplest Idea: Fitting a Line

Suppose you plot house sizes (x-axis) against prices (y-axis) and see a rough upward trend. A linear regression model says: "I believe the relationship is approximately a straight line."

Simple Linear Regression Model ŷ = w \cdot x + b

Where:

ŷ (y-hat): the predicted value
w (weight/slope): how much y changes per unit change in x
b (bias/intercept): the predicted y when x = 0
x: the input feature

4.2 What Makes a Line "Best"?

There are infinitely many lines you could draw through a scatter plot. We need a way to measure which line is "best." The idea is simple: the best line minimizes the total error between predicted values (ŷ) and actual values (y).

We define the residual for each data point as: eᵢ = yᵢ - ŷᵢ. The residual is positive if we under-predict and negative if we over-predict. We can't just sum residuals (positives cancel negatives), so we square them:

Mean Squared Error (MSE) J(w, b) = (1/n) \cdot Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)²

This is the cost function (or loss function). Our goal is to find w and b that minimize J(w, b). This is an optimization problem.

4.3 Multiple Linear Regression

Real problems have multiple features. A house price depends on area, bedrooms, age, location, etc. With p features:

Multiple Linear Regression ŷ = w₁x₁ + w₂x₂ + ... + wₚxₚ + b

In vectorized form, we add a column of 1s to the feature matrix X (for the bias), and write:

Vectorized Form ŷ = X \cdot w (where X includes a bias column, w includes b)

4.4 Two Solution Strategies

There are two fundamentally different ways to find the optimal w:

Strategy 1: Normal Equation (Analytical)

Set the derivative of J(w) to zero and solve directly. Gives an exact, closed-form answer. Fast for small/medium datasets. Complexity: O(p³) for p features.

Strategy 2: Gradient Descent (Iterative)

Start with random w, repeatedly nudge w in the direction that reduces J(w). Converges to the optimal solution. Essential for large datasets and neural networks. Complexity: O(npk) for n samples, p features, k iterations.

05

Mathematical Foundation

5.1 Notation

Symbol	Meaning	Dimensions
X	Feature matrix (with bias column)	n × (p+1)
y	Target vector	n × 1
w	Weight vector (including bias)	(p+1) × 1
ŷ	Prediction vector = Xw	n × 1
ε	Error/residual vector = y - ŷ	n × 1
n	Number of training samples	scalar
p	Number of features (before bias)	scalar
α	Learning rate	scalar
J(w)	Cost function	scalar

5.2 The Linear Model Assumption

We assume the true relationship between features and target is:

Data Generating Process y = Xw* + ε, where ε ~ N(0, σ²I)

Here w* is the true (unknown) weight vector, and ε is random noise that is (1) normally distributed, (2) has zero mean, (3) has constant variance σ², and (4) is independent across samples.

5.3 The Four Assumptions of Linear Regression

Linearity: The relationship between X and y is linear in parameters. The true model is y = Xw + ε.
Independence: The residuals εᵢ are independent of each other. No autocorrelation.
Homoscedasticity: The variance of residuals is constant across all levels of X. Var(εᵢ) = σ² for all i.
Normality: The residuals follow a normal distribution. εᵢ ~ N(0, σ²).

Bonus assumption: No multicollinearity — features should not be perfectly correlated with each other (otherwise X^TX is singular).

5.4 Evaluation Metrics

Mean Squared Error (MSE)

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)²

Root Mean Squared Error (RMSE)

RMSE = \sqrtMSE = \sqrt[(1/n) Σᵢ₌₁ⁿ (yᵢ - ŷᵢ)²]

Mean Absolute Error (MAE)

MAE = (1/n) Σᵢ₌₁ⁿ |yᵢ - ŷᵢ|

R² (Coefficient of Determination)

R-Squared R² = 1 - (SS_res / SS_tot) SS_res = Σ(yᵢ - ŷᵢ)² (residual sum of squares) SS_tot = Σ(yᵢ - ȳ)² (total sum of squares)

Interpretation: R² = 0.85 means the model explains 85% of the variance in y. R² = 1 is perfect fit; R² = 0 means the model is no better than predicting the mean.

Adjusted R²

Adjusted R-Squared Adj-R² = 1 - [(1 - R²)(n - 1) / (n - p - 1)]

Adjusted R² penalizes adding useless features. If a new feature doesn't improve prediction, Adjusted R² decreases, while R² always increases (or stays the same) with more features.

06

Formula Derivations

6.1 Derivation of MSE from Maximum Likelihood

Setup: Assume each observation follows: yᵢ = w^Txᵢ + εᵢ where εᵢ ~ N(0, σ²).

Step 1: Write the likelihood. Since εᵢ = yᵢ - w^Txᵢ is Gaussian:

P(yᵢ | xᵢ, w) = (1/\sqrt(2πσ²)) \cdot exp(-(yᵢ - w T xᵢ)² / (2σ²))

Step 2: Likelihood of all n samples (independence assumption):

L(w) = Πᵢ₌₁ⁿ P(yᵢ | xᵢ, w) = Πᵢ₌₁ⁿ (1/\sqrt(2πσ²)) \cdot exp(-(yᵢ - w T xᵢ)² / (2σ²))

Step 3: Take the log-likelihood (products → sums):

ln L(w) = -n/2 \cdot ln(2πσ²) - (1/(2σ²)) \cdot Σᵢ₌₁ⁿ (yᵢ - w T xᵢ)²

Step 4: Maximize log-likelihood = Minimize the sum of squares

The first term is a constant (doesn't depend on w). Maximizing ln L(w) is equivalent to:

MSE from Maximum Likelihood Minimize J(w) = (1/n) Σᵢ₌₁ⁿ (yᵢ - w T xᵢ)²

Key result: Under Gaussian noise, Maximum Likelihood Estimation gives us MSE as the cost function. This is not an arbitrary choice—it's the statistically optimal loss under the normality assumption.

6.2 Derivation of the Normal Equation

Goal: Find w that minimizes J(w) = (1/n)||y - Xw||²

Step 1: Expand the cost function in matrix form

J(w) = (1/n)(y - Xw) T (y - Xw) = (1/n)(y T y - y T Xw - w T X T y + w T X T Xw) = (1/n)(y T y - 2w T X T y + w T X T Xw)

(Note: y^TXw is a scalar, so y^TXw = w^TX^Ty.)

Step 2: Take the gradient with respect to w

Using matrix calculus rules: ∂(w^Ta)/∂w = a, and ∂(w^TAw)/∂w = 2Aw (for symmetric A):

\nablaₓJ(w) = (1/n)(-2X T y + 2X T Xw)

Step 3: Set gradient to zero and solve

-2X T y + 2X T Xw = 0 X T Xw = X T y

Step 4: Solve for w (multiply both sides by (X^TX)^-1):

The Normal Equation w = (X T X) -1 X T y

This assumes X^TX is invertible (non-singular), which requires n ≥ p and no perfect multicollinearity.

6.3 Derivation of Gradient Descent Update Rule

Idea: If we can't (or don't want to) invert matrices, we can iteratively walk "downhill" on the cost surface.

Step 1: We already computed the gradient:

\nablaJ(w) = (2/n) X T (Xw - y)

Step 2: Update w in the negative gradient direction:

Gradient Descent Update Rule w \leftarrow w - α \cdot \nablaJ(w) w \leftarrow w - α \cdot (2/n) X T (Xw - y)

For simple linear regression (scalar w and b):

\partialJ/\partialw = (2/n) Σᵢ₌₁ⁿ (ŷᵢ - yᵢ) \cdot xᵢ \partialJ/\partialb = (2/n) Σᵢ₌₁ⁿ (ŷᵢ - yᵢ) w \leftarrow w - α \cdot \partialJ/\partialw b \leftarrow b - α \cdot \partialJ/\partialb

6.4 Variants of Gradient Descent

Batch Gradient Descent

Uses all n samples to compute the gradient at each step.

w \leftarrow w - α \cdot (2/n) X T (Xw - y)

Pros: Smooth convergence, guaranteed to find global minimum (for convex J).
Cons: Slow for large n (must process entire dataset each step).

Stochastic Gradient Descent (SGD)

Uses one random sample to estimate the gradient at each step.

w \leftarrow w - α \cdot 2(ŷᵢ - yᵢ) \cdot xᵢ (for randomly chosen i)

Pros: Very fast per step, can escape shallow local minima, enables online learning.
Cons: Noisy updates, may oscillate around minimum.

Mini-Batch Gradient Descent

Uses a small batch of B samples (typically B = 32, 64, 128).

w \leftarrow w - α \cdot (2/B) X_batch T (X_batch \cdot w - y_batch)

Pros: Best of both—stable enough for convergence, fast enough for large datasets, leverages GPU parallelism.
Cons: Requires tuning batch size B.

6.5 Learning Rate Effects

Learning Rate	Behavior	Convergence
Too small (e.g., 0.0001)	Very slow descent, tiny steps	Eventually converges but takes too long
Just right (e.g., 0.01)	Steady descent toward minimum	Converges in reasonable time
Too large (e.g., 1.0)	Overshoots minimum, oscillates wildly	May diverge (J increases!)
Adaptive (e.g., Adam)	Adjusts per-parameter learning rates	Robust convergence for most problems

6.6 Convergence Criteria

How do we know when to stop gradient descent? Common criteria:

Maximum iterations: Stop after k iterations (e.g., k = 1000).
Cost tolerance: Stop when |J(w_new) - J(w_old)| < ε (e.g., ε = 10⁻⁶).
Gradient norm: Stop when ||∇J(w)|| < ε.
Parameter change: Stop when ||w_new - w_old|| < ε.

07

Worked Numerical Examples

Example 1: Normal Equation with 5 Data Points

Problem: Given the following data on flat area (sq. ft.) and price (₹ Lakhs) in Bangalore:

Area (x)	Price (y)
600	30
800	45
1000	55
1200	68
1500	80

Find w (slope) and b (intercept) using the Normal Equation.

Step-by-Step Solution

Step 1: Construct the design matrix X (with bias column) and target vector y:

X = [[1, 600], y = [30, 45, 55, 68, 80] T [1, 800], [1, 1000], [1, 1200], [1, 1500]]

Step 2: Compute X^TX:

X T X = [[5, 5100], [5100, 5540000]]

Detail: X^TX[0,0] = 5 (sum of 1s), X^TX[0,1] = 600+800+1000+1200+1500 = 5100, X^TX[1,1] = 600²+800²+1000²+1200²+1500² = 360000+640000+1000000+1440000+2250000 = 5690000.

X T X = [[5, 5100], [5100, 5690000]]

Step 3: Compute X^Ty:

X T y = [30+45+55+68+80, 600\cdot30+800\cdot45+1000\cdot55+1200\cdot68+1500\cdot80] T = [278, 18000+36000+55000+81600+120000] T = [278, 310600] T

Step 4: Compute (X^TX)^-1:

For a 2×2 matrix [[a,b],[c,d]], inverse = (1/(ad-bc)) · [[d,-b],[-c,a]]

det = 5\cdot5690000 - 5100\cdot5100 = 28450000 - 26010000 = 2440000 (X T X) -1 = (1/2440000) \cdot [[5690000, -5100], [-5100, 5]]

Step 5: Compute w = (X^TX)^-1 X^Ty:

w = (1/2440000) \cdot [[5690000\cdot278 + (-5100)\cdot310600], [(-5100)\cdot278 + 5\cdot310600]] w[0] = (1/2440000)(1581820000 - 1584060000) = (-2240000)/2440000 \approx -0.918 w[1] = (1/2440000)(-1417800 + 1553000) = 135200/2440000 \approx 0.05541

Final Model b \approx -0.918, w \approx 0.05541 ŷ = 0.05541 \cdot x - 0.918 Interpretation: Each sq. ft. adds \approx ₹0.0554 Lakhs \approx ₹5,541 to the price.

Verification: For x=1000: ŷ = 0.05541×1000 - 0.918 = 55.41 - 0.918 = 54.49 ≈ 55 ✓

Example 2: Two Iterations of Gradient Descent (α = 0.01)

Problem: Using simplified data points: (1, 2), (2, 4), (3, 5). Starting with w=0, b=0. Perform 2 iterations of batch gradient descent with α = 0.01.

Iteration 1

Predictions (ŷ = wx + b, with w=0, b=0):

ŷ₁ = 0\cdot1 + 0 = 0, ŷ₂ = 0\cdot2 + 0 = 0, ŷ₃ = 0\cdot3 + 0 = 0

Errors (ŷᵢ - yᵢ):

e₁ = 0 - 2 = -2, e₂ = 0 - 4 = -4, e₃ = 0 - 5 = -5

Gradients:

\partialJ/\partialw = (2/3)[(-2)(1) + (-4)(2) + (-5)(3)] = (2/3)[-2 - 8 - 15] = (2/3)(-25) = -16.667 \partialJ/\partialb = (2/3)[(-2) + (-4) + (-5)] = (2/3)(-11) = -7.333

Updates:

w \leftarrow 0 - 0.01 \times (-16.667) = 0.1667 b \leftarrow 0 - 0.01 \times (-7.333) = 0.0733

Iteration 2

Predictions (w=0.1667, b=0.0733):

ŷ₁ = 0.1667\cdot1 + 0.0733 = 0.240 ŷ₂ = 0.1667\cdot2 + 0.0733 = 0.407 ŷ₃ = 0.1667\cdot3 + 0.0733 = 0.573

Errors:

e₁ = 0.240 - 2 = -1.760, e₂ = 0.407 - 4 = -3.593, e₃ = 0.573 - 5 = -4.427

Gradients:

\partialJ/\partialw = (2/3)[(-1.760)(1) + (-3.593)(2) + (-4.427)(3)] = (2/3)[-1.760 - 7.186 - 13.281] = (2/3)(-22.227) = -14.818 \partialJ/\partialb = (2/3)[-1.760 - 3.593 - 4.427] = (2/3)(-9.780) = -6.520

Updates:

w \leftarrow 0.1667 - 0.01 \times (-14.818) = 0.1667 + 0.1482 = 0.3149 b \leftarrow 0.0733 - 0.01 \times (-6.520) = 0.0733 + 0.0652 = 0.1385

After just 2 iterations, the line has already started fitting: w is moving toward ~1.5 and b toward ~0.67 (the optimal values). With hundreds more iterations, gradient descent will converge to the exact solution.

Example 3: Indian Real Estate (4 Features)

Problem: Predict flat price in Delhi NCR using 4 features: Area (sq ft), Bedrooms, Floor, Age (years).

Area	Beds	Floor	Age	Price (₹L)
850	2	3	5	45
1100	3	7	2	72
1400	3	12	1	95
950	2	5	10	38
1600	4	15	0	120
750	1	2	15	28

This requires the Normal Equation with a 6×5 design matrix. We'll solve this computationally in the Python section, but the interpretation is key:

Price = w₁\cdotArea + w₂\cdotBeds + w₃\cdotFloor + w₄\cdotAge + b Expected signs: w₁ > 0 (more area \to higher price) w₂ > 0 (more bedrooms \to higher price) w₃ > 0 (higher floor \to slightly higher price) w₄ < 0 (older flat \to lower price)

08

Visual Diagrams

Diagram 1: Scatter Plot with Best-Fit Line

📊 Linear Regression: Fitting a Line to Data

Price (₹L) │ 120│ ★ ← actual │ ·/ 100│ ·/ │ ★ ·/ 80│ ·/ │ ★ ·/ 60│ ·/ │ ·/ 40│ ★ ·/ ← best-fit line: ŷ = wx + b │ ·/ 20│ ★ ·/ │ ·/ 0├──·/──────────────────────────────────── Area (sq ft) 0 400 800 1200 1600 2000 Legend: ★ = actual data point, ·/ = best-fit line The line minimizes the total squared distance from points to line.

Diagram 2: Cost Function Surface

📊 MSE Cost Function J(w,b) — A Bowl Shape (Convex)

J(w,b) │ high │ \ / │ \ / │ \ / │ \ / │ \ / │ \ ___ / │ \ / \ / low │ \ / \ / │ ·── global minimum (w*, b*) │ └─────────────────────────────── w ← optimal → Key insight: MSE for linear regression is CONVEX → There is exactly ONE global minimum → Gradient descent ALWAYS converges (with proper α) → Normal equation finds it directly

Diagram 3: Gradient Descent Steps

📊 Gradient Descent: Walking Downhill on the Cost Surface

J(w) │ 100 │ ★ start (random w) │ \ 80 │ ★ step 1 │ \ 60 │ ★ step 2 │ \ 40 │ ★ step 3 Learning rate α too HIGH: │ \ ★ ★ 20 │ ★ step 4 \ / ← oscillation! │ \ ★/ 5 │ ★★★ converged! │ └────────────────────────── w w₀ w₁ w₂ w₃ w₄ w* Small α → many small steps → slow but safe Large α → few big steps → fast but risky Just right → efficient convergence!

Diagram 4: Residual Analysis

📊 Residual Plots: Good vs. Bad Patterns

GOOD (Random scatter) BAD (Curved pattern) BAD (Funnel shape) ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ · · │ │ · · │ │ · │ │ · · · │ │ · │ │ · · │ │──────·──0──·─────│ │ · · │ │ · · │ │ · · │ │──────0───────────│ │──·──0───────────│ │ · · │ │ · · │ │ · · │ │ · · │ │ · · │ │· │ └──────────────────┘ └──────────────────┘ └──────────────────┘ → Assumptions MET → Non-linearity! → Heteroscedasticity! → Model is appropriate → Add polynomial terms → Use WLS or transform y

09

Flowcharts

Flowchart 1: Linear Regression Pipeline

🔄 End-to-End Linear Regression Pipeline

┌─────────────────┐ │ Raw Data │ │ (CSV/DB/API) │ └────────┬────────┘ ▼ ┌─────────────────┐ │ EDA & Cleaning │──→ Handle missing values, outliers │ (pandas/numpy) │ Check for correlations └────────┬────────┘ ▼ ┌─────────────────┐ │ Feature Eng. │──→ Scaling, encoding, polynomial features │ (StandardScaler)│ └────────┬────────┘ ▼ ┌─────────────────┐ │ Train/Test Split│──→ 80/20 or K-Fold CV │ (train_test_ │ │ split) │ └────────┬────────┘ ▼ ┌─────────────────┐ ┌────────────────┐ │ Choose Method │────►│ Normal Equation│──→ Small data (p < 10K) │ │ └────────────────┘ │ │ ┌────────────────┐ │ │────►│ Gradient Desc. │──→ Large data / online └────────┬────────┘ └────────────────┘ ▼ ┌─────────────────┐ │ Evaluate Model │──→ R², RMSE, MAE on test set │ (score, predict)│ └────────┬────────┘ ▼ ┌─────────────────┐ │ Diagnostics │──→ Residual plot, Q-Q plot │ Check Assump. │ VIF for multicollinearity └────────┬────────┘ ▼ ┌─────────────────┐ │ Deploy/Report │──→ Save model (pickle/joblib) │ │ API endpoint, dashboard └─────────────────┘

Flowchart 2: Gradient Descent Algorithm

🔄 Gradient Descent Algorithm Flow

┌───────────────────────┐ │ Initialize w randomly │ │ (or with zeros) │ └───────────┬───────────┘ ▼ ┌───────────────────────┐ │ Compute predictions │ │ ŷ = Xw │ └───────────┬───────────┘ ▼ ┌───────────────────────┐ │ Compute cost J(w) │ │ J = (1/n)||y - ŷ||² │ └───────────┬───────────┘ ▼ ┌───────────────────────┐ │ Compute gradient │ │ ∇J = (2/n)X^T(ŷ - y) │ └───────────┬───────────┘ ▼ ┌───────────────────────┐ │ Update weights │ │ w ← w - α·∇J │ └───────────┬───────────┘ ▼ ┌─────────┐ NO │Converged?├────────────┐ │ │ │ └────┬─────┘ │ │ YES │ ▼ │ ┌───────────────────────┐ │ │ Return optimal w │ │ │ Model is trained! │◄─────┘ └───────────────────────┘ (loop back)

10

Python Implementation (From Scratch)

Let's build a complete LinearRegression class from scratch using only NumPy.

Python — LinearRegression from Scratch

import numpy as np
import matplotlib.pyplot as plt

class LinearRegressionScratch:
    """
    Linear Regression from scratch.
    Supports: Normal Equation, Batch GD, SGD, Mini-batch GD.
    """

    def __init__(self, method='normal', lr=0.01, n_iters=1000,
                 batch_size=32, tol=1e-6, verbose=False):
        """
        Parameters:
        -----------
        method : str, one of 'normal', 'batch_gd', 'sgd', 'mini_batch_gd'
        lr     : float, learning rate (alpha)
        n_iters: int, max iterations for GD methods
        batch_size: int, for mini-batch GD
        tol    : float, convergence tolerance
        verbose: bool, print progress
        """
        self.method = method
        self.lr = lr
        self.n_iters = n_iters
        self.batch_size = batch_size
        self.tol = tol
        self.verbose = verbose
        self.weights = None       # includes bias
        self.cost_history = []

    def _add_bias(self, X):
        """Add a column of 1s for the bias term."""
        return np.c_[np.ones(X.shape[0]), X]

    def _compute_cost(self, X, y):
        """Compute MSE cost: J = (1/n)||y - Xw||^2"""
        n = len(y)
        predictions = X @ self.weights
        cost = (1 / n) * np.sum((y - predictions) ** 2)
        return cost

    def _compute_gradient(self, X, y):
        """Compute gradient: (2/n) * X^T(Xw - y)"""
        n = len(y)
        predictions = X @ self.weights
        gradient = (2 / n) * X.T @ (predictions - y)
        return gradient

    def fit_normal_equation(self, X, y):
        """Closed-form solution: w = (X^TX)^{-1} X^Ty"""
        X_b = self._add_bias(X)
        # Use np.linalg.pinv for numerical stability
        self.weights = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
        if self.verbose:
            cost = self._compute_cost(X_b, y)
            print(f"Normal Equation — MSE: {cost:.6f}")

    def fit_batch_gd(self, X, y):
        """Full-batch Gradient Descent."""
        X_b = self._add_bias(X)
        n, p = X_b.shape
        self.weights = np.zeros(p)   # initialize to zeros
        self.cost_history = []

        for i in range(self.n_iters):
            gradient = self._compute_gradient(X_b, y)
            self.weights -= self.lr * gradient
            cost = self._compute_cost(X_b, y)
            self.cost_history.append(cost)

            if self.verbose and i % 100 == 0:
                print(f"Iter {i:4d} | MSE: {cost:.6f}")

            # Convergence check
            if len(self.cost_history) > 1:
                if abs(self.cost_history[-1] - self.cost_history[-2]) < self.tol:
                    if self.verbose:
                        print(f"Converged at iteration {i}")
                    break

    def fit_sgd(self, X, y):
        """Stochastic Gradient Descent (one sample at a time)."""
        X_b = self._add_bias(X)
        n, p = X_b.shape
        self.weights = np.zeros(p)
        self.cost_history = []

        for epoch in range(self.n_iters):
            indices = np.random.permutation(n)
            for idx in indices:
                xi = X_b[idx:idx+1]
                yi = y[idx:idx+1]
                gradient = (2) * xi.T @ (xi @ self.weights - yi)
                self.weights -= self.lr * gradient

            cost = self._compute_cost(X_b, y)
            self.cost_history.append(cost)
            if self.verbose and epoch % 100 == 0:
                print(f"Epoch {epoch:4d} | MSE: {cost:.6f}")

    def fit_mini_batch_gd(self, X, y):
        """Mini-batch Gradient Descent."""
        X_b = self._add_bias(X)
        n, p = X_b.shape
        self.weights = np.zeros(p)
        self.cost_history = []

        for epoch in range(self.n_iters):
            indices = np.random.permutation(n)
            for start in range(0, n, self.batch_size):
                end = min(start + self.batch_size, n)
                batch_idx = indices[start:end]
                X_batch = X_b[batch_idx]
                y_batch = y[batch_idx]
                gradient = (2 / len(y_batch)) * X_batch.T @ (X_batch @ self.weights - y_batch)
                self.weights -= self.lr * gradient

            cost = self._compute_cost(X_b, y)
            self.cost_history.append(cost)

    def fit(self, X, y):
        """Fit model using the chosen method."""
        X = np.array(X, dtype=np.float64)
        y = np.array(y, dtype=np.float64)

        if self.method == 'normal':
            self.fit_normal_equation(X, y)
        elif self.method == 'batch_gd':
            self.fit_batch_gd(X, y)
        elif self.method == 'sgd':
            self.fit_sgd(X, y)
        elif self.method == 'mini_batch_gd':
            self.fit_mini_batch_gd(X, y)
        else:
            raise ValueError(f"Unknown method: {self.method}")
        return self

    def predict(self, X):
        """Make predictions."""
        X = np.array(X, dtype=np.float64)
        X_b = self._add_bias(X)
        return X_b @ self.weights

    def score(self, X, y):
        """Compute R² score."""
        y = np.array(y, dtype=np.float64)
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - (ss_res / ss_tot)

    def get_metrics(self, X, y):
        """Return dict of all metrics."""
        y = np.array(y, dtype=np.float64)
        y_pred = self.predict(X)
        n = len(y)
        p = len(self.weights) - 1  # exclude bias

        mse = np.mean((y - y_pred) ** 2)
        rmse = np.sqrt(mse)
        mae = np.mean(np.abs(y - y_pred))
        r2 = self.score(X, y)
        adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1))

        return {
            'MSE': mse, 'RMSE': rmse, 'MAE': mae,
            'R²': r2, 'Adj R²': adj_r2
        }

    def plot_cost_history(self):
        """Plot cost vs iterations (for GD methods)."""
        if not self.cost_history:
            print("No cost history (use a GD method).")
            return
        plt.figure(figsize=(10, 5))
        plt.plot(self.cost_history, color='#059669', linewidth=2)
        plt.xlabel('Iteration')
        plt.ylabel('MSE Cost')
        plt.title('Cost Function Convergence')
        plt.grid(True, alpha=0.3)
        plt.show()

    def plot_fit(self, X, y, feature_idx=0):
        """Plot data + regression line (for 1D visualization)."""
        X = np.array(X, dtype=np.float64)
        y = np.array(y, dtype=np.float64)
        y_pred = self.predict(X)

        plt.figure(figsize=(10, 6))
        plt.scatter(X[:, feature_idx], y, color='#0891b2',
                    s=80, alpha=0.7, label='Actual', zorder=5)

        # Sort for smooth line
        sort_idx = X[:, feature_idx].argsort()
        plt.plot(X[sort_idx, feature_idx], y_pred[sort_idx],
                 color='#059669', linewidth=2.5, label='Predicted')

        plt.xlabel(f'Feature {feature_idx}')
        plt.ylabel('Target')
        plt.title('Linear Regression Fit')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.show()

Using the Class

Python — Usage Example

# === Example: Bangalore flat price prediction ===
import numpy as np

# Data: [area_sqft, bedrooms, age_years]
X = np.array([
    [600,  1, 10],
    [800,  2,  7],
    [1000, 2,  5],
    [1200, 3,  3],
    [1500, 3,  1],
    [1800, 4,  0],
    [900,  2,  8],
    [1100, 3,  4],
])
y = np.array([30, 45, 55, 68, 82, 105, 40, 62])  # Price in ₹ Lakhs

# --- Method 1: Normal Equation ---
model_ne = LinearRegressionScratch(method='normal', verbose=True)
model_ne.fit(X, y)
print("Weights (Normal Eq):", model_ne.weights)
print("Metrics:", model_ne.get_metrics(X, y))

# --- Method 2: Batch Gradient Descent ---
# Feature scaling is CRITICAL for GD!
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

model_gd = LinearRegressionScratch(
    method='batch_gd', lr=0.01, n_iters=1000, verbose=True
)
model_gd.fit(X_scaled, y)
print("R² (Batch GD):", model_gd.score(X_scaled, y))
model_gd.plot_cost_history()

# --- Predict new flat ---
new_flat = np.array([[1300, 3, 2]])
price = model_ne.predict(new_flat)
print(f"Predicted price for 1300 sqft, 3BHK, 2yr old: ₹{price[0]:.1f} Lakhs")

Feature Scaling Implementation

Python — Feature Scaling

class StandardScaler:
    """Standardize features: z = (x - mean) / std"""

    def fit(self, X):
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        self.std_[self.std_ == 0] = 1  # avoid division by zero
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_

    def fit_transform(self, X):
        return self.fit(X).transform(X)

# Why is scaling critical for GD?
# Without scaling: area~1000, bedrooms~3, age~5
# The gradient for 'area' is ~300x larger than for 'bedrooms'
# GD oscillates wildly on the area axis while barely moving on bedrooms
# With scaling: all features have similar ranges → smooth convergence

Diagnostic Plots Implementation

Python — Diagnostic Plots

def diagnostic_plots(y_true, y_pred, feature_name="Feature"):
    """Generate diagnostic plots for regression analysis."""
    residuals = y_true - y_pred
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))

    # 1. Residual Plot
    axes[0].scatter(y_pred, residuals, color='#0891b2', alpha=0.6, s=50)
    axes[0].axhline(y=0, color='#dc2626', linestyle='--', linewidth=1.5)
    axes[0].set_xlabel('Predicted Values')
    axes[0].set_ylabel('Residuals')
    axes[0].set_title('Residual Plot')
    axes[0].grid(True, alpha=0.3)

    # 2. Q-Q Plot (manual implementation)
    sorted_residuals = np.sort(residuals)
    n = len(sorted_residuals)
    theoretical = np.array([
        np.sqrt(2) * 0.4769 *
        ((2 * (i + 1) - 1) / (2 * n) - 0.5)
        for i in range(n)
    ])  # Simplified; in practice, use scipy.stats.norm.ppf
    axes[1].scatter(theoretical, sorted_residuals,
                    color='#059669', alpha=0.7, s=50)
    # Reference line
    lims = [min(theoretical.min(), sorted_residuals.min()),
            max(theoretical.max(), sorted_residuals.max())]
    axes[1].plot(lims, lims, 'r--', linewidth=1.5)
    axes[1].set_xlabel('Theoretical Quantiles')
    axes[1].set_ylabel('Sample Quantiles')
    axes[1].set_title('Q-Q Plot')
    axes[1].grid(True, alpha=0.3)

    # 3. Actual vs Predicted
    axes[2].scatter(y_true, y_pred, color='#8b5cf6', alpha=0.6, s=50)
    lims = [min(y_true.min(), y_pred.min()),
            max(y_true.max(), y_pred.max())]
    axes[2].plot(lims, lims, 'r--', linewidth=1.5)
    axes[2].set_xlabel('Actual Values')
    axes[2].set_ylabel('Predicted Values')
    axes[2].set_title('Actual vs Predicted')
    axes[2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

# Usage:
# y_pred = model.predict(X)
# diagnostic_plots(y, y_pred)

11

TensorFlow Implementation

Python — TensorFlow/Keras Linear Regression

import tensorflow as tf
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# === Generate sample Indian housing data ===
np.random.seed(42)
n_samples = 500

area = np.random.uniform(500, 2500, n_samples)
bedrooms = np.random.randint(1, 5, n_samples)
age = np.random.uniform(0, 20, n_samples)
floor = np.random.randint(1, 20, n_samples)

# True relationship + noise
price = 0.05 * area + 8 * bedrooms - 1.5 * age + 0.3 * floor + 5 + \
        np.random.normal(0, 5, n_samples)

X = np.column_stack([area, bedrooms, age, floor])
y = price

# Split & scale
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# === Build TensorFlow Model ===
model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(4,),
                          kernel_initializer='normal',
                          name='linear_layer')
])

# Linear regression = 1 Dense layer with no activation!
# This is equivalent to ŷ = Xw + b

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
    loss='mse',               # Mean Squared Error
    metrics=['mae']            # Also track MAE
)

model.summary()

# === Train ===
history = model.fit(
    X_train_s, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.15,
    verbose=1
)

# === Evaluate ===
test_loss, test_mae = model.evaluate(X_test_s, y_test)
print(f"\nTest MSE: {test_loss:.4f}")
print(f"Test MAE: {test_mae:.4f}")

# === Extract learned weights ===
weights, bias = model.get_layer('linear_layer').get_weights()
print(f"\nLearned weights: {weights.flatten()}")
print(f"Learned bias: {bias[0]:.4f}")

# === Plot training history ===
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(history.history['loss'], label='Train MSE', color='#059669')
ax1.plot(history.history['val_loss'], label='Val MSE', color='#0891b2')
ax1.set_xlabel('Epoch'); ax1.set_ylabel('MSE')
ax1.set_title('Loss Convergence'); ax1.legend(); ax1.grid(True, alpha=0.3)

ax2.plot(history.history['mae'], label='Train MAE', color='#059669')
ax2.plot(history.history['val_mae'], label='Val MAE', color='#0891b2')
ax2.set_xlabel('Epoch'); ax2.set_ylabel('MAE')
ax2.set_title('MAE Over Training'); ax2.legend(); ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# === Predict new flat ===
new_flat = np.array([[1300, 3, 2, 10]])
new_flat_s = scaler.transform(new_flat)
pred_price = model.predict(new_flat_s)
print(f"Predicted price: ₹{pred_price[0][0]:.1f} Lakhs")

12

Scikit-Learn Implementation

Python — Complete Scikit-Learn Pipeline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score
)

# ============================================
# 1. CREATE SYNTHETIC INDIAN HOUSING DATASET
# ============================================
np.random.seed(42)
n = 1000

data = pd.DataFrame({
    'area_sqft':  np.random.uniform(400, 3000, n),
    'bedrooms':   np.random.randint(1, 6, n),
    'bathrooms':  np.random.randint(1, 4, n),
    'age_years':  np.random.uniform(0, 25, n),
    'floor':      np.random.randint(1, 25, n),
    'city_tier':  np.random.choice([1, 2, 3], n, p=[0.3, 0.4, 0.3]),
})

# Generate realistic prices (₹ Lakhs)
city_multiplier = {1: 1.5, 2: 1.0, 3: 0.6}
data['price_lakhs'] = (
    0.045 * data['area_sqft'] +
    7 * data['bedrooms'] +
    5 * data['bathrooms'] -
    1.2 * data['age_years'] +
    0.3 * data['floor'] +
    10 * data['city_tier'].map(city_multiplier) +
    np.random.normal(0, 8, n)
)

print(data.head())
print(data.describe())

# ============================================
# 2. EXPLORATORY DATA ANALYSIS
# ============================================
print("\nCorrelation with price:")
print(data.corr()['price_lakhs'].sort_values(ascending=False))

# ============================================
# 3. PREPARE FEATURES & TARGET
# ============================================
features = ['area_sqft', 'bedrooms', 'bathrooms', 'age_years', 'floor']
X = data[features].values
y = data['price_lakhs'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ============================================
# 4. MODEL 1: BASIC LINEAR REGRESSION
# ============================================
model = LinearRegression()
model.fit(X_train, y_train)

print("\n=== Linear Regression Results ===")
print(f"Intercept: {model.intercept_:.4f}")
for feat, coef in zip(features, model.coef_):
    print(f"  {feat:15s}: {coef:+.6f}")

y_pred = model.predict(X_test)
print(f"\nTest R²:   {r2_score(y_test, y_pred):.4f}")
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"Test MAE:  {mean_absolute_error(y_test, y_pred):.4f}")

# ============================================
# 5. CROSS-VALIDATION
# ============================================
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"\n5-Fold CV R² scores: {cv_scores}")
print(f"Mean CV R²: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# ============================================
# 6. MODEL 2: PIPELINE WITH SCALING + SGD
# ============================================
sgd_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('sgd', SGDRegressor(max_iter=1000, tol=1e-4,
                         learning_rate='optimal',
                         random_state=42))
])
sgd_pipeline.fit(X_train, y_train)
y_pred_sgd = sgd_pipeline.predict(X_test)
print(f"\nSGD R²:   {r2_score(y_test, y_pred_sgd):.4f}")

# ============================================
# 7. MODEL 3: POLYNOMIAL REGRESSION
# ============================================
poly_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('reg', LinearRegression())
])
poly_pipeline.fit(X_train, y_train)
y_pred_poly = poly_pipeline.predict(X_test)
print(f"Poly(2) R²: {r2_score(y_test, y_pred_poly):.4f}")

# ============================================
# 8. VISUALIZATION
# ============================================
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Actual vs Predicted
axes[0,0].scatter(y_test, y_pred, alpha=0.5, color='#059669', s=30)
axes[0,0].plot([y_test.min(), y_test.max()],
              [y_test.min(), y_test.max()], 'r--')
axes[0,0].set_title('Actual vs Predicted')

# Residuals
residuals = y_test - y_pred
axes[0,1].scatter(y_pred, residuals, alpha=0.5, color='#0891b2', s=30)
axes[0,1].axhline(y=0, color='red', linestyle='--')
axes[0,1].set_title('Residual Plot')

# Residual distribution
axes[1,0].hist(residuals, bins=30, color='#059669', alpha=0.7, edgecolor='white')
axes[1,0].set_title('Residual Distribution')

# Feature importance (coefficients)
coefs = pd.Series(model.coef_, index=features).sort_values()
coefs.plot.barh(ax=axes[1,1], color='#0891b2')
axes[1,1].set_title('Feature Coefficients')

plt.tight_layout()
plt.show()

13

Indian Case Studies

🏠 Case Study 1

House Price Prediction: Mumbai, Delhi NCR & Bangalore

Context: India's residential real estate market is valued at ~$300 billion (2024). Platforms like MagicBricks, 99acres, and Housing.com use regression models to provide instant price estimates.

Dataset Features

Feature	Type	Example Values	Expected Coefficient Sign
Area (sq ft)	Continuous	500–5000	Positive (+)
Bedrooms (BHK)	Discrete	1–5	Positive (+)
Bathrooms	Discrete	1–4	Positive (+)
Floor	Discrete	1–30	Slightly positive (+)
Age (years)	Continuous	0–30	Negative (−)
Distance to metro (km)	Continuous	0.1–15	Negative (−)
City tier	Categorical	1, 2, 3	Tier 1 premium

Key Findings

Mumbai (Tier 1): ₹18,000–₹25,000 per sq ft in prime areas; regression captures ~82% variance (R²=0.82)
Bangalore: IT corridor proximity adds ₹3,000/sqft premium; modeled via distance feature
Delhi NCR: Metro connectivity is the strongest price predictor after area
Limitation: Non-linear effects (e.g., sea-view premium in Mumbai) require polynomial features or tree models

Model Equation (Simplified Mumbai Model)

Price (₹L) = 0.062 \times Area + 12.5 \times Bedrooms - 2.1 \times Age - 3.8 \times Metro_Distance + 15.2

Interpretation: Each sq ft adds ₹6,200; each additional bedroom adds ₹12.5 Lakhs; each year of age reduces price by ₹2.1 Lakhs; each km from metro reduces price by ₹3.8 Lakhs.

🌾 Case Study 2

Agricultural Crop Yield Prediction

Context: India is the world's second-largest agricultural producer. The Ministry of Agriculture uses regression models to estimate production of wheat, rice, and cotton to plan storage, transportation, and pricing (MSP).

Data Sources

IMD (India Meteorological Department): Rainfall, temperature, humidity data
ICAR: Soil quality, fertilizer usage data
DAC&FW: Historical crop yield data by state and district

Feature Engineering

Feature	Source	Impact on Yield
Total rainfall (mm)	IMD	Strong positive (up to a point)
Temperature (avg °C)	IMD	Crop-dependent
Fertilizer (kg/hectare)	ICAR	Positive (diminishing returns)
Soil pH	ICAR	Optimal range varies
Irrigation coverage (%)	Census	Strong positive
Sown area (hectares)	DAC&FW	Positive

Results

Rice yield model (Punjab, 2010-2023): R² = 0.78, RMSE = 245 kg/hectare. The model correctly predicted the 2022 below-normal harvest due to irregular monsoon patterns detected via the rainfall feature.

Python — Crop Yield Example

# Simplified crop yield regression
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [rainfall_mm, temp_C, fertilizer_kg, irrigation_%]
X_crop = np.array([
    [800,  28, 120, 60],   # Normal monsoon
    [1200, 27, 150, 75],   # Good monsoon
    [600,  30, 100, 45],   # Drought year
    [950,  26, 180, 80],   # Well-irrigated
    [1100, 29, 130, 65],   # Above average rain
    [700,  31, 90,  40],   # Hot & dry
    [1050, 27, 160, 70],   # Good year
    [500,  32, 80,  35],   # Severe drought
])
y_crop = np.array([3200, 4100, 2400, 4500, 3800, 2100, 4000, 1800])

model_crop = LinearRegression()
model_crop.fit(X_crop, y_crop)

features = ['Rainfall', 'Temperature', 'Fertilizer', 'Irrigation']
for f, c in zip(features, model_crop.coef_):
    print(f"{f:12s}: {c:+.2f} kg/hectare per unit increase")

print(f"\nR² Score: {model_crop.score(X_crop, y_crop):.4f}")

🌧️ Case Study 3

Rainfall Prediction Using IMD Historical Data

Context: IMD maintains rainfall data from 1901 to present across 36 meteorological subdivisions. Linear regression with seasonal features (ENSO index, Indian Ocean Dipole, pre-monsoon temperatures) achieves R² ≈ 0.65 for June-September monsoon totals.

Challenge

Rainfall has inherent variability and chaotic dynamics. Linear regression captures the linear component of the signal but misses non-linear teleconnections. This is a great example of where linear regression provides a baseline that more complex models (Random Forest, LSTM) improve upon.

IMD's 2024 monsoon forecast used an ensemble of 5 statistical models, including multiple regression with ENSO, IOD, and Eurasian snow cover as predictors. Predicted: 106% of LPA. Actual: 109% of LPA. Error: 3%—remarkably accurate for a linear model!

14

Global Case Studies

🌍 Case Study 1

Boston Housing Dataset — Classic + Ethical Discussion

The Dataset: Created by Harrison & Rubinfeld (1978), the Boston Housing dataset has 506 samples with 13 features predicting median home value in Boston suburbs. For decades, it was the benchmark for regression algorithms.

Features

CRIM (per capita crime), ZN (residential zoning), INDUS (industrial proportion), CHAS (Charles River proximity), NOX (nitric oxide concentration), RM (average rooms), AGE (pre-1940 proportion), DIS (distance to employment), RAD (highway access), TAX (tax rate), PTRATIO (pupil-teacher), B (racial composition statistic), LSTAT (lower status %).

⚠️ Ethical Concerns

The 'B' feature is defined as 1000(Bk - 0.63)² where Bk is the proportion of Black residents. This was originally included to model the racist assumption that racial composition affects property values. In 2020, Scikit-Learn deprecated `load_boston()` because:

The feature encodes racial bias as a "normal" predictor
Models trained on this data learn and perpetuate racial discrimination in housing
Students unknowingly absorb the idea that race should be a price predictor

Alternative: Use the California Housing dataset or create synthetic data. We include this discussion because understanding bias in data is as important as understanding algorithms.

🌍 Case Study 2

California Housing — The Modern Benchmark

20,640 samples from the 1990 US Census. Features: median income, house age, average rooms, average bedrooms, population, average occupancy, latitude, longitude. Target: median house value.

Python — California Housing

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"R²:   {r2_score(y_test, y_pred):.4f}")   # ~0.60
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

# Linear regression gets R² ≈ 0.60 — decent baseline!
# This means 40% of variance is in non-linear patterns
# → Motivation for polynomial regression, trees, neural nets

print("\nFeature importances:")
for name, coef in zip(housing.feature_names, model.coef_):
    print(f"  {name:15s}: {coef:+.4f}")

🌍 Case Study 3

Energy Consumption Prediction

Application: Google uses regression models (among others) in their DeepMind-powered data center cooling systems. By predicting energy consumption based on temperature, humidity, server load, and cooling settings, they reduced cooling energy by 40%.

A simplified linear model: Energy_kWh = 0.12 × Temperature + 0.08 × Humidity + 0.45 × Server_Load - 0.15 × Cooling_Setting + 50

While Google's actual system uses neural networks, the initial prototype was a linear regression model that proved the concept. This illustrates the ML workflow: start simple, prove value, then add complexity.

15

Startup Applications

🏠 PropTech Startups (Housing.com, NoBroker, Square Yards)

Use regression for instant price estimates. When a user lists a property, the model predicts a fair market value. Features include location coordinates, carpet area, amenities count, and nearby infrastructure scores. Revenue model: charge a premium for "AI-verified" price estimates.

🌾 AgriTech Startups (CropIn, DeHaat, Ninjacart)

CropIn uses satellite imagery + regression to predict crop yields per farm. DeHaat uses price regression to advise farmers on optimal selling time. These startups serve 15+ million Indian farmers and have raised $200M+ in funding.

💰 FinTech (Cred, Razorpay, PhonePe)

Credit scoring starts with logistic regression (Chapter 5), but expected transaction volume prediction uses linear regression. Razorpay predicts daily payment volumes to manage server scaling and cash reserves.

🚗 Mobility (Ola, Rapido)

Ride time prediction uses regression: ETA = β₁×distance + β₂×traffic_index + β₃×time_of_day + β₄×weather + intercept. Surge pricing models also start with regression on demand-supply ratios.

16

Government Applications

🏛️ NITI Aayog — GDP Forecasting

Uses multivariate regression with indicators: industrial production (IIP), credit growth, exports, government expenditure, FDI inflows. The model provides quarterly GDP growth estimates used in Union Budget planning.

🏛️ Census of India — Population Projection

Simple time-series regression on population data from 1951-2011 censuses projects India's population trajectory. The 2023 projection (1.43 billion) was within 1% of the actual figure.

🏛️ ISRO — Satellite Telemetry

ISRO uses regression models to predict satellite battery degradation, solar panel output vs. orbital angle, and fuel consumption rates. These models are interpretable and safety-critical—neural networks are too opaque for space missions.

🏛️ IMD — Monsoon Forecasting

As discussed in Case Study 3, IMD's ensemble regression model for monsoon prediction directly impacts agricultural planning, water resource management, and disaster preparedness for 1.4 billion people.

17

Industry Applications

Industry	Application	Features Used	Impact
🏦 Banking (SBI, HDFC)	Loan default amount estimation	Income, debt ratio, tenure, credit score	Risk-based pricing, ₹2L Cr loan portfolio
🏥 Healthcare (Apollo)	Patient stay duration prediction	Age, diagnosis, vitals, procedure type	Bed allocation, staff scheduling
🏭 Manufacturing (Tata)	Quality prediction (defect rate)	Temperature, pressure, speed, material	Reduced defect rate by 15%
🛒 E-commerce (Flipkart)	Demand forecasting	Season, promotions, price, category	Inventory optimization, ₹500Cr savings
⚡ Energy (Tata Power)	Load forecasting	Temperature, time, day type, historical load	Grid stability, peak management
📱 Telecom (Jio)	Churn risk scoring	Usage, complaints, recharge frequency	Retention campaigns targeting top 10%
🚚 Logistics (Delhivery)	Delivery time estimation	Distance, weight, route, warehouse load	SLA compliance improvement by 20%

18

Mini Projects

🛠️ Mini Project 1: Indian House Price Predictor

Objective: Build a complete end-to-end house price prediction system for Indian cities.

Requirements

Generate or collect a dataset with 500+ rows and 8+ features
Perform EDA: histograms, correlation matrix, scatter plots
Handle missing values and outliers
Feature engineering: create price_per_sqft, log-transform skewed features
Train-test split (80/20) with cross-validation
Compare: Normal Equation vs Batch GD vs SGD (all from scratch)
Also fit Scikit-Learn LinearRegression and compare results
Generate diagnostic plots: residuals, Q-Q, actual vs predicted
Report: R², Adjusted R², RMSE, MAE for all models
Predict price for 5 new flats and discuss predictions

Python — Starter Code

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# === Generate synthetic Indian housing data ===
np.random.seed(42)
n = 800

cities = np.random.choice(
    ['Mumbai', 'Delhi', 'Bangalore', 'Hyderabad', 'Pune'],
    n, p=[0.25, 0.2, 0.25, 0.15, 0.15]
)
city_premium = {
    'Mumbai': 1.8, 'Delhi': 1.4, 'Bangalore': 1.3,
    'Hyderabad': 1.0, 'Pune': 0.9
}

data = pd.DataFrame({
    'city': cities,
    'area_sqft': np.random.uniform(400, 3000, n),
    'bedrooms': np.random.randint(1, 5, n),
    'bathrooms': np.random.randint(1, 4, n),
    'floor': np.random.randint(1, 20, n),
    'age_years': np.random.uniform(0, 20, n),
    'parking': np.random.randint(0, 3, n),
    'metro_dist_km': np.random.uniform(0.1, 10, n),
})

# Generate prices
data['price_lakhs'] = (
    0.05 * data['area_sqft'] +
    8 * data['bedrooms'] +
    5 * data['bathrooms'] -
    1.5 * data['age_years'] +
    0.4 * data['floor'] +
    4 * data['parking'] -
    2 * data['metro_dist_km'] +
    15 * data['city'].map(city_premium) +
    np.random.normal(0, 8, n)
).clip(lower=10)

print("Dataset created! Shape:", data.shape)
print(data.head())

# TODO: Students complete the following:
# 1. EDA (histograms, correlation matrix)
# 2. Feature engineering (one-hot encode city)
# 3. Train/test split
# 4. Implement LinearRegressionScratch.fit() with each method
# 5. Compare with sklearn LinearRegression
# 6. Diagnostic plots
# 7. Report all metrics

Expected Output

R² ≈ 0.90–0.95 (since data is synthetic with known linear relationships). Students should note that real-world data will typically give R² ≈ 0.6–0.85 due to non-linear effects and unobserved features.

🛠️ Mini Project 2: Crop Yield Estimator

Objective: Build a regression model that estimates crop yield (kg/hectare) based on weather, soil, and farming practice data.

Requirements

Create/collect dataset: 300+ rows, 6+ features (rainfall, temperature, soil pH, fertilizer, irrigation, seed variety)
Compare simple vs. multiple regression (1 feature vs. 6 features)
Implement from scratch with Batch GD (with learning rate experiments)
Generate cost convergence plots for α = 0.001, 0.01, 0.1, 0.5
Show that feature scaling dramatically improves GD convergence
Compute VIF (Variance Inflation Factor) for multicollinearity
Polynomial regression (degree 2) vs. linear: is it worth it?
Predict yield for upcoming season with different rainfall scenarios

Python — VIF Calculation

from sklearn.linear_model import LinearRegression
import numpy as np

def compute_vif(X, feature_names):
    """
    Compute Variance Inflation Factor for each feature.
    VIF > 5: moderate multicollinearity
    VIF > 10: severe multicollinearity
    """
    vif_data = []
    for i in range(X.shape[1]):
        # Regress feature i on all other features
        X_other = np.delete(X, i, axis=1)
        model = LinearRegression().fit(X_other, X[:, i])
        r2 = model.score(X_other, X[:, i])
        vif = 1 / (1 - r2) if r2 < 1 else float('inf')
        vif_data.append({
            'Feature': feature_names[i],
            'VIF': round(vif, 2),
            'Status': 'OK' if vif < 5 else ('Warning' if vif < 10 else 'REMOVE')
        })
    return vif_data

# Usage:
# vif_results = compute_vif(X, feature_names)
# for v in vif_results: print(v)

19

End-of-Chapter Exercises

Conceptual Questions

Exercise 1

Explain the difference between correlation and causation. If a linear regression model shows a positive coefficient for ice cream sales predicting drowning deaths, does ice cream cause drowning? Discuss confounding variables.

Exercise 2

State the four assumptions of linear regression (LINE). For each assumption, describe (a) what it means, (b) how to check it using diagnostic plots, and (c) what happens to OLS estimates if it's violated.

Exercise 3

Why does MSE emerge as the natural cost function from Maximum Likelihood under Gaussian noise? What cost function would arise if noise followed a Laplace distribution instead?

Exercise 4

Explain why R² always increases (or stays the same) when you add more features, but Adjusted R² can decrease. Give a concrete numerical example.

Exercise 5

Compare and contrast Batch GD, Stochastic GD, and Mini-Batch GD in terms of: (a) computation per step, (b) convergence smoothness, (c) memory requirements, (d) suitability for online learning.

Mathematical Problems

Exercise 6

Given data points: (1,3), (2,5), (3,7), (4,9), (5,11). Compute w and b using the Normal Equation by hand. Verify that these points are perfectly collinear and R² = 1.

Exercise 7

Starting from w=1, b=0, perform 3 iterations of gradient descent on data points (1,2), (2,3), (3,5) with learning rate α = 0.05. Show all intermediate calculations.

Exercise 8

Prove that the gradient of J(w) = (1/n)||y - Xw||² with respect to w is (2/n)X^T(Xw - y). Use matrix calculus rules explicitly.

Exercise 9

Show that for simple linear regression (one feature), the Normal Equation simplifies to:
w = [nΣ(xᵢyᵢ) - Σ(xᵢ)Σ(yᵢ)] / [nΣ(xᵢ²) - (Σ(xᵢ))²]
b = ȳ - w·x̄

Exercise 10

A dataset has 100 samples and 5 features. The linear regression model achieves R² = 0.82. Compute Adjusted R². If we add a 6th feature and R² becomes 0.823, should we keep the new feature? Justify using Adjusted R².

Programming Exercises

Exercise 11

Implement gradient descent with momentum from scratch: v_t = β·v_{t-1} + ∇J(w), w ← w - α·v_t. Compare convergence speed with vanilla GD on a dataset of your choice.

Exercise 12

Create a function that generates an animated GIF showing gradient descent steps on a 2D cost surface (w vs J(w)). Use matplotlib.animation.

Exercise 13

Implement polynomial regression from scratch. Fit degree 1, 2, 3, and 5 polynomials to 20 noisy data points. Plot all fits and discuss underfitting vs. overfitting.

Exercise 14

Build a complete data pipeline: read a CSV file, handle missing values (mean imputation), encode categorical variables, scale features, train a model, evaluate, and save the model using pickle. The entire pipeline should run with a single function call.

Exercise 15

Implement k-fold cross-validation from scratch (without using sklearn). Compute mean and std of R² across 5 folds.

Application Exercises

Exercise 16

Collect electricity bills for your household over the past 12 months. Use month number and average temperature as features to predict the bill amount. Is the relationship linear? If not, try polynomial features.

Exercise 17

Use the IMD rainfall data (available on data.gov.in) for your state. Build a simple regression model: rainfall_this_year ~ f(rainfall_last_year, El_Nino_index). Report R² and discuss limitations.

Exercise 18

Scrape or collect prices of 50 second-hand cars from OLX/CarDekho. Features: year, km driven, fuel type, engine CC. Build a linear regression model and discuss which features matter most.

Exercise 19

Download the California Housing dataset from Scikit-Learn. Compare: (a) simple linear regression using only median income, (b) multiple regression with all 8 features, (c) polynomial regression (degree 2) with all features. Report R² for each.

Exercise 20

Build a simple web API (using Flask or FastAPI) that accepts house features via JSON and returns a predicted price. Deploy locally and test with curl or Postman.

Exercise 21

Implement Ridge Regression (L2 regularization) from scratch: J(w) = MSE + λ||w||². Derive the gradient and implement gradient descent. Show that as λ increases, coefficients shrink toward zero.

Exercise 22

Create a comprehensive comparison table: run linear regression on the California Housing dataset using (a) your scratch implementation (Normal Equation), (b) your scratch GD implementation, (c) sklearn LinearRegression, (d) sklearn SGDRegressor, (e) TensorFlow Dense(1). All should give approximately the same R². Discuss any differences.

20

Multiple Choice Questions

Q1. The Normal Equation for linear regression is:

(a) w = X^T(XX^T)^-1y
(b) w = (X^TX)^-1X^Ty
(c) w = X(X^TX)^-1y
(d) w = (XX^T)^-1Xy

Show Answer

Answer: (b). The Normal Equation w = (X^TX)^-1X^Ty is derived by setting the gradient of MSE to zero.

Q2. Which of the following is NOT an assumption of OLS linear regression?

(a) Linearity between features and target
(b) Normality of residuals
(c) Features must follow a normal distribution
(d) Homoscedasticity of residuals

Show Answer

Answer: (c). OLS assumes normality of residuals, not features. Features can have any distribution. This is a common exam trap.

Q3. If the learning rate in gradient descent is too large, what happens?

(a) The model converges very slowly
(b) The cost function may diverge (increase)
(c) The model always finds the global minimum faster
(d) The model underfits the data

Show Answer

Answer: (b). A large learning rate causes the parameters to overshoot the minimum, potentially diverging. The cost function may oscillate or increase indefinitely.

Q4. R² = 0.75 means:

(a) The model is 75% accurate
(b) 75% of data points are correctly predicted
(c) The model explains 75% of the variance in the target variable
(d) 75% of features are significant

Show Answer

Answer: (c). R² (coefficient of determination) represents the proportion of variance in the dependent variable explained by the model. R² = 1 - SS_res/SS_tot.

Q5. Which gradient descent variant uses a single randomly chosen sample per update?

(a) Batch Gradient Descent
(b) Mini-Batch Gradient Descent
(c) Stochastic Gradient Descent
(d) Adam Optimizer

Show Answer

Answer: (c). SGD uses one random sample per update, making it the most granular variant. Batch GD uses all samples; Mini-Batch uses a subset (e.g., 32).

Q6. Feature scaling is essential for gradient descent because:

(a) It makes the data normally distributed
(b) It ensures features are on similar scales, preventing oscillation
(c) It removes outliers from the dataset
(d) It is required by the Normal Equation

Show Answer

Answer: (b). Without scaling, features with large ranges dominate the gradient, causing oscillation. Scaling makes the cost surface more spherical, enabling efficient convergence. Note: the Normal Equation does NOT require scaling.

Q7. The MSE cost function for linear regression is:

(a) Concave — has a maximum
(b) Convex — has exactly one minimum
(c) Non-convex — has multiple local minima
(d) Flat — has no minimum or maximum

Show Answer

Answer: (b). For linear regression, MSE is a convex (bowl-shaped) function. This guarantees that gradient descent finds the global minimum, and the Normal Equation gives the exact solution.

Q8. If X^TX is singular (non-invertible), which of the following is true?

(a) The Normal Equation gives a unique solution
(b) Some features are perfectly multicollinear
(c) Gradient descent cannot be used
(d) R² will be exactly 1

Show Answer

Answer: (b). X^TX is singular when features are linearly dependent (perfect multicollinearity) or n < p. Use pseudoinverse (np.linalg.pinv) or regularization (Ridge). GD can still work, but the solution is not unique.

Q9. Which metric is most robust to outliers?

(a) MSE
(b) RMSE
(c) MAE
(d) R²

Show Answer

Answer: (c). MAE uses absolute values instead of squares, so large errors don't disproportionately inflate the metric. MSE and RMSE amplify large errors due to squaring.

Q10. Deriving MSE from Maximum Likelihood requires which assumption about noise?

(a) Noise follows a uniform distribution
(b) Noise follows a Gaussian (normal) distribution
(c) Noise is always zero
(d) Noise follows a Bernoulli distribution

Show Answer

Answer: (b). Under the assumption εᵢ ~ N(0, σ²), maximizing log-likelihood is equivalent to minimizing MSE. With Laplace noise, we'd get MAE; with uniform noise, we'd get minimax.

Q11. What is the time complexity of the Normal Equation?

(a) O(n)
(b) O(np)
(c) O(np² + p³)
(d) O(n²p²)

Show Answer

Answer: (c). Computing X^TX takes O(np²), and inverting the p×p matrix takes O(p³). This is why the Normal Equation becomes impractical for p > 10,000 features.

Q12. In the residual plot, a funnel-shaped pattern indicates:

(a) Non-linearity
(b) Heteroscedasticity
(c) Autocorrelation
(d) Multicollinearity

Show Answer

Answer: (b). A funnel shape (residuals fanning out) means variance is not constant across predictions — this violates the homoscedasticity assumption. Solutions: WLS, log-transform target, or robust regression.

21

Interview Questions

Q1. What is the difference between Linear Regression and Correlation?

Show Answer

Correlation measures the strength and direction of a linear relationship (a number from -1 to +1). Linear regression models the relationship to make predictions. Correlation is symmetric (corr(X,Y) = corr(Y,X)), but regression is not (regressing Y on X ≠ regressing X on Y). Additionally, correlation doesn't assume causation, whereas regression coefficients are often interpreted as "effect sizes" (though this requires causal assumptions).

Q2. Why do we use MSE and not just Mean Error?

Show Answer

Mean Error (ME = Σ(yᵢ - ŷᵢ)/n) can be zero even with terrible predictions because positive and negative errors cancel out. MSE squares errors to make them all positive, penalizes large errors more heavily, and is mathematically convenient (differentiable, convex). Additionally, MSE arises naturally from the Maximum Likelihood principle under Gaussian noise assumptions.

Q3. When would you prefer Gradient Descent over the Normal Equation?

Show Answer

GD is preferred when: (1) p (features) is large (>10,000) because matrix inversion is O(p³); (2) n is very large and data doesn't fit in memory — SGD processes one sample at a time; (3) Online/streaming data — SGD can update in real-time; (4) Non-linear extensions — GD generalizes to neural networks while Normal Equation is only for linear models. Normal Equation is preferred for small datasets where the exact solution is cheap to compute.

Q4. How do you handle multicollinearity in linear regression?

Show Answer

Detection: (1) Compute VIF (Variance Inflation Factor) — VIF > 10 is severe; (2) Check correlation matrix for |r| > 0.9 pairs. Solutions: (1) Remove one of the correlated features; (2) Use PCA to create uncorrelated components; (3) Use Ridge (L2) or LASSO (L1) regularization — Ridge shrinks correlated coefficients, LASSO can drive some to zero; (4) Domain knowledge to choose the most meaningful feature.

Q5. Can R² be negative? What does it mean?

Show Answer

Yes! R² = 1 - SS_res/SS_tot. If SS_res > SS_tot (the model's predictions are worse than simply predicting the mean), R² becomes negative. This happens when: (1) The model is very wrong (e.g., predicting a constant that's far from the mean); (2) You evaluate on a test set with a very different distribution; (3) The model is badly overfit. An R² < 0 model is worse than the "dumb baseline" of always predicting ȳ.

Q6. Explain the bias-variance tradeoff in the context of linear regression.

Show Answer

OLS linear regression has low bias (it's unbiased — Gauss-Markov theorem) but can have high variance when there are many features or multicollinearity (coefficients become unstable). Ridge regression introduces a small bias (by shrinking coefficients) but dramatically reduces variance, often giving better test performance. The bias-variance tradeoff is: Total Error = Bias² + Variance + Irreducible Noise. OLS minimizes bias; regularized methods optimize the tradeoff.

Q7. What is the relationship between linear regression and a single-neuron neural network?

Show Answer

A single neuron without activation function IS linear regression: output = w^Tx + b. With a sigmoid activation, it becomes logistic regression. With a ReLU activation, it's a "half-linear" model. Deep neural networks are essentially stacks of these linear transformations with non-linear activations. Understanding linear regression deeply is understanding the building block of all neural networks.

Q8. How would you handle outliers in a regression problem?

Show Answer

Approaches: (1) Detection — IQR method, Z-score, Cook's distance (leverage plot); (2) Removal — Only if they're data errors, not genuine extreme values; (3) Robust regression — Use Huber loss or RANSAC which downweight outliers; (4) Transform — Log-transform heavy-tailed targets; (5) Use MAE instead of MSE as the loss function (less sensitive to outliers); (6) Winsorize — Cap extreme values at the 5th/95th percentile. The best approach depends on whether outliers are signal or noise.

Q9. You have a linear regression model with R² = 0.95 on training data but R² = 0.40 on test data. What happened and how do you fix it?

Show Answer

This is overfitting. The model memorized training data but fails to generalize. Likely causes: too many features relative to samples (high p/n ratio), polynomial features of high degree, or multicollinearity. Fixes: (1) Regularization — Ridge (L2) or LASSO (L1); (2) Feature selection — remove irrelevant features; (3) Cross-validation — use k-fold CV instead of train/test split; (4) More data — reduce variance; (5) Reduce polynomial degree; (6) Check for data leakage (test data information accidentally in training).

Q10. Explain feature engineering for linear regression. Give 3 examples.

Show Answer

Feature engineering creates new informative features: (1) Interaction terms — area × bedrooms might capture "spacious bedrooms" effect; (2) Polynomial features — x² captures non-linear relationships while keeping the model "linear in parameters"; (3) Log transform — log(income) linearizes exponential relationships; (4) Binning — convert continuous age to age groups; (5) One-hot encoding — convert categorical city to binary features; (6) Domain-specific — price_per_sqft = price/area normalizes for size differences.

Q11. What happens to linear regression if features are perfectly multicollinear?

Show Answer

If features are perfectly multicollinear (e.g., x₃ = 2x₁ + x₂), then X^TX is singular (not invertible), and the Normal Equation has no unique solution. There are infinitely many weight vectors that achieve the same MSE. NumPy's pinv (pseudoinverse) will return one solution, but coefficients will be unreliable. GD may converge but to an arbitrary point in the solution space. Fix: use regularization, drop redundant features, or use PCA.

22

Research Problems

🔬 Research Problem 1: Adaptive Learning Rate for Indian Agricultural Data

Problem: Indian agricultural data has high variance across states (Punjab vs. Rajasthan) and years (monsoon vs. drought). Standard learning rate schedules (step decay, cosine annealing) are designed for i.i.d. data. Design and implement an adaptive learning rate schedule that accounts for heterogeneity in agricultural data.

Suggested approach: Implement per-feature learning rates based on the curvature of the loss landscape (second-order information). Compare with Adam, AdaGrad, and RMSprop. Evaluate on IMD+ICAR datasets across 5 crop types and 10 states.

Expected outcome: A novel learning rate schedule that converges 2-3x faster on heterogeneous agricultural data compared to fixed learning rate.

🔬 Research Problem 2: Robust Regression for Noisy Urban Data

Problem: Indian urban datasets (real estate, traffic, pollution) contain extreme outliers and non-Gaussian noise. Standard MSE-based regression is highly sensitive to outliers. Investigate combinations of Huber loss, quantile regression, and RANSAC for Indian city datasets.

Research questions: (1) What is the optimal Huber δ parameter for Indian real estate data? (2) Does the optimal δ vary by city tier? (3) Can you derive a closed-form estimator for Huber regression? (4) Compare computational cost vs. robustness gain.

🔬 Research Problem 3: Interpretable Regression for Policy

Problem: Government policies in India (MSP pricing, subsidy allocation, tax rates) require interpretable models. But linear regression's assumption of linearity is often violated. Investigate Generalized Additive Models (GAMs) as a middle ground: y = β₀ + f₁(x₁) + f₂(x₂) + ... where each f is a smooth non-linear function.

Questions: (1) How much accuracy do GAMs gain over linear regression on Indian policy datasets? (2) Are GAMs interpretable enough for Parliamentary scrutiny? (3) Implement a simple GAM using basis splines and compare with linear regression and random forest on 3 government datasets.

23

Key Takeaways

1

Linear regression models the relationship ŷ = Xw + b by finding weights that minimize MSE. It's the foundation of all supervised learning.

2

MSE arises from Maximum Likelihood under Gaussian noise. It's not an arbitrary choice—it's the statistically optimal loss function when errors are normally distributed.

3

The Normal Equation w = (X^TX)^-1X^Ty gives the exact solution in one step. Use it for small to medium datasets (p < 10,000).

4

Gradient Descent iteratively updates weights: w ← w - α·∇J. Three variants (Batch, SGD, Mini-batch) trade off computation, stability, and speed.

5

Feature scaling is critical for gradient descent but unnecessary for the Normal Equation. Without scaling, GD may not converge.

6

Evaluate with multiple metrics: R², Adjusted R², RMSE, MAE. Each tells a different story. Always check diagnostic plots (residuals, Q-Q) to validate assumptions.

7

The four assumptions (LINE) — Linearity, Independence, Normality of residuals, Equal variance — must be verified. Violations don't invalidate the model but make OLS sub-optimal.

8

Linear regression is industry-standard in India: RBI uses it for GDP forecasting, IMD for monsoon prediction, ISRO for satellite telemetry, and every major tech company for baseline models.

9

Start with linear regression as a baseline before trying complex models. If linear regression gives R² = 0.90, a neural network that gives 0.92 may not be worth the added complexity and lost interpretability.

10

Every neuron in a neural network performs a linear regression + non-linear activation. Mastering this chapter prepares you for all of deep learning.

24

References & Further Reading

Textbooks

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer. Chapters 3-4. [Free online: https://hastie.su.domains/ElemStatLearn/]
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 3.
Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed. O'Reilly. Chapter 4.
Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 11.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd ed. Springer. Chapter 3. [Free online: https://www.statlearning.com/]

Research Papers

Legendre, A.-M. (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Paris. [Original least squares paper]
Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. [Gaussian noise + MLE connection]
Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55-67.
Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. JRSS-B, 58(1), 267-288.
Harrison, D., & Rubinfeld, D. L. (1978). Hedonic Housing Prices and the Demand for Clean Air. Journal of Environmental Economics.

Indian Data Sources

data.gov.in — Government of India Open Data Platform [Housing, Agriculture, Weather datasets]
IMD (mausam.imd.gov.in) — India Meteorological Department historical rainfall data
ICAR (icar.org.in) — Indian Council of Agricultural Research crop data
NHB (nhb.org.in) — National Housing Bank RESIDEX (Real Estate Price Index)
Kaggle: "Indian Houses Dataset", "Indian Real Estate Analysis", "IMD Rainfall Data"

Online Resources

Scikit-Learn Documentation: sklearn.linear_model.LinearRegression
TensorFlow Tutorials: Basic regression (tensorflow.org/tutorials)
Stanford CS229 Lecture Notes: Andrew Ng's Linear Regression notes
3Blue1Brown: "The Essence of Linear Algebra" (YouTube)
StatQuest with Josh Starmer: "Linear Regression" series (YouTube)

Ethical References

Carlisle, M. (2020). "Racist data destruction? On the Boston Housing dataset." Medium article explaining why sklearn deprecated load_boston().
O'Neil, C. (2016). Weapons of Math Destruction. Crown Books. [On biased models in housing, credit, and policing]