Learning Objectives
After completing this chapter, you will be able to:
Introduction
Imagine you're a real estate agent in Mumbai, and a client asks: "If I buy a 1,200 sq. ft. flat in Andheri, what should I expect to pay?" You've sold dozens of flatsโyou know roughly that larger flats cost more. But can you turn that intuition into a precise number? That's exactly what linear regression does.
Linear regression is the "Hello, World!" of machine learning. It is the simplest, most interpretable, and arguably the most widely deployed supervised learning algorithm. From predicting house prices to forecasting stock markets, from estimating crop yields to modeling climate changeโregression is everywhere.
In this chapter, we'll go from zero to expert. We'll start with the intuitive idea of "drawing a line through points," derive every formula from first principles, implement everything from scratch in Python, and then see how the same ideas power billion-dollar applications at Google, Flipkart, and ISRO.
Why Linear Regression Matters
- Interpretability: Every coefficient tells you exactly how much a feature affects the output.
- Speed: Closed-form solution existsโtraining can be instantaneous.
- Foundation: Logistic regression, ridge, lasso, elastic net, GLMs all extend linear regression.
- Industry Standard: In regulated industries (banking, insurance, healthcare), interpretable models like linear regression are often legally required.
- Baseline: Any ML project should start with a linear regression baseline before trying complex models.
Historical Background
The story of linear regression begins with the starsโliterally. In 1805, the French mathematician Adrien-Marie Legendre published the method of least squares to predict the orbits of comets. He wanted to find the "best" line through noisy astronomical observations.
Just four years later, Carl Friedrich Gauss claimed he had invented the method even earlier (around 1795) and showed that least squares is the optimal estimator when errors follow a Gaussian (normal) distribution. This connection between least squares and Gaussian noise is the Maximum Likelihood derivation we'll explore in Section 6.
The term "regression" itself was coined by Sir Francis Galton in the 1880s, studying the relationship between parents' heights and children's heights. He observed that children of very tall parents tended to be shorterโthey "regressed toward the mean." Though the biological interpretation has changed, the mathematical method kept the name.
Timeline of Key Milestones
| Year | Contributor | Milestone |
|---|---|---|
| 1795 | Carl Friedrich Gauss | First use of least squares (claimed, published 1809) |
| 1805 | Adrien-Marie Legendre | Published method of least squares |
| 1821 | Carl Friedrich Gauss | Gauss-Markov theorem: OLS is BLUE |
| 1885 | Francis Galton | Coined "regression" (regression to the mean) |
| 1897 | Karl Pearson | Multiple regression, correlation coefficient |
| 1922 | R.A. Fisher | Maximum likelihood estimation formalized |
| 1958 | Frank Rosenblatt | Perceptron (linear regression + threshold = classification) |
| 1970 | Hoerl & Kennard | Ridge regression (regularized linear regression) |
| 1996 | Robert Tibshirani | LASSO (L1-regularized linear regression) |
| 2005 | Zou & Hastie | Elastic Net (L1+L2 regularization) |
Conceptual Explanation
4.1 The Simplest Idea: Fitting a Line
Suppose you plot house sizes (x-axis) against prices (y-axis) and see a rough upward trend. A linear regression model says: "I believe the relationship is approximately a straight line."
Where:
- ลท (y-hat): the predicted value
- w (weight/slope): how much y changes per unit change in x
- b (bias/intercept): the predicted y when x = 0
- x: the input feature
4.2 What Makes a Line "Best"?
There are infinitely many lines you could draw through a scatter plot. We need a way to measure which line is "best." The idea is simple: the best line minimizes the total error between predicted values (ลท) and actual values (y).
We define the residual for each data point as: eแตข = yแตข - ลทแตข. The residual is positive if we under-predict and negative if we over-predict. We can't just sum residuals (positives cancel negatives), so we square them:
This is the cost function (or loss function). Our goal is to find w and b that minimize J(w, b). This is an optimization problem.
4.3 Multiple Linear Regression
Real problems have multiple features. A house price depends on area, bedrooms, age, location, etc. With p features:
In vectorized form, we add a column of 1s to the feature matrix X (for the bias), and write:
4.4 Two Solution Strategies
There are two fundamentally different ways to find the optimal w:
Strategy 1: Normal Equation (Analytical)
Set the derivative of J(w) to zero and solve directly. Gives an exact, closed-form answer. Fast for small/medium datasets. Complexity: O(pยณ) for p features.
Strategy 2: Gradient Descent (Iterative)
Start with random w, repeatedly nudge w in the direction that reduces J(w). Converges to the optimal solution. Essential for large datasets and neural networks. Complexity: O(npk) for n samples, p features, k iterations.
Mathematical Foundation
5.1 Notation
| Symbol | Meaning | Dimensions |
|---|---|---|
| X | Feature matrix (with bias column) | n ร (p+1) |
| y | Target vector | n ร 1 |
| w | Weight vector (including bias) | (p+1) ร 1 |
| ลท | Prediction vector = Xw | n ร 1 |
| ฮต | Error/residual vector = y - ลท | n ร 1 |
| n | Number of training samples | scalar |
| p | Number of features (before bias) | scalar |
| ฮฑ | Learning rate | scalar |
| J(w) | Cost function | scalar |
5.2 The Linear Model Assumption
We assume the true relationship between features and target is:
Here w* is the true (unknown) weight vector, and ฮต is random noise that is (1) normally distributed, (2) has zero mean, (3) has constant variance ฯยฒ, and (4) is independent across samples.
5.3 The Four Assumptions of Linear Regression
- Linearity: The relationship between X and y is linear in parameters. The true model is y = Xw + ฮต.
- Independence: The residuals ฮตแตข are independent of each other. No autocorrelation.
- Homoscedasticity: The variance of residuals is constant across all levels of X. Var(ฮตแตข) = ฯยฒ for all i.
- Normality: The residuals follow a normal distribution. ฮตแตข ~ N(0, ฯยฒ).
Bonus assumption: No multicollinearity โ features should not be perfectly correlated with each other (otherwise XTX is singular).
5.4 Evaluation Metrics
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
Mean Absolute Error (MAE)
Rยฒ (Coefficient of Determination)
SS_res = ฮฃ(yแตข - ลทแตข)ยฒ (residual sum of squares)
SS_tot = ฮฃ(yแตข - ศณ)ยฒ (total sum of squares)
Interpretation: Rยฒ = 0.85 means the model explains 85% of the variance in y. Rยฒ = 1 is perfect fit; Rยฒ = 0 means the model is no better than predicting the mean.
Adjusted Rยฒ
Adjusted Rยฒ penalizes adding useless features. If a new feature doesn't improve prediction, Adjusted Rยฒ decreases, while Rยฒ always increases (or stays the same) with more features.
Formula Derivations
6.1 Derivation of MSE from Maximum Likelihood
Setup: Assume each observation follows: yแตข = wTxแตข + ฮตแตข where ฮตแตข ~ N(0, ฯยฒ).
Step 1: Write the likelihood. Since ฮตแตข = yแตข - wTxแตข is Gaussian:
Step 2: Likelihood of all n samples (independence assumption):
Step 3: Take the log-likelihood (products โ sums):
Step 4: Maximize log-likelihood = Minimize the sum of squares
The first term is a constant (doesn't depend on w). Maximizing ln L(w) is equivalent to:
Key result: Under Gaussian noise, Maximum Likelihood Estimation gives us MSE as the cost function. This is not an arbitrary choiceโit's the statistically optimal loss under the normality assumption.
6.2 Derivation of the Normal Equation
Goal: Find w that minimizes J(w) = (1/n)||y - Xw||ยฒ
Step 1: Expand the cost function in matrix form
= (1/n)(yTy - yTXw - wTXTy + wTXTXw)
= (1/n)(yTy - 2wTXTy + wTXTXw)
(Note: yTXw is a scalar, so yTXw = wTXTy.)
Step 2: Take the gradient with respect to w
Using matrix calculus rules: โ(wTa)/โw = a, and โ(wTAw)/โw = 2Aw (for symmetric A):
Step 3: Set gradient to zero and solve
XTXw = XTy
Step 4: Solve for w (multiply both sides by (XTX)-1):
This assumes XTX is invertible (non-singular), which requires n โฅ p and no perfect multicollinearity.
6.3 Derivation of Gradient Descent Update Rule
Idea: If we can't (or don't want to) invert matrices, we can iteratively walk "downhill" on the cost surface.
Step 1: We already computed the gradient:
Step 2: Update w in the negative gradient direction:
w โ w - ฮฑ ยท (2/n) XT(Xw - y)
For simple linear regression (scalar w and b):
โJ/โb = (2/n) ฮฃแตขโโโฟ (ลทแตข - yแตข)
w โ w - ฮฑ ยท โJ/โw
b โ b - ฮฑ ยท โJ/โb
6.4 Variants of Gradient Descent
Batch Gradient Descent
Uses all n samples to compute the gradient at each step.
Pros: Smooth convergence, guaranteed to find global minimum (for convex J).
Cons: Slow for large n (must process entire dataset each step).
Stochastic Gradient Descent (SGD)
Uses one random sample to estimate the gradient at each step.
Pros: Very fast per step, can escape shallow local minima, enables online learning.
Cons: Noisy updates, may oscillate around minimum.
Mini-Batch Gradient Descent
Uses a small batch of B samples (typically B = 32, 64, 128).
Pros: Best of bothโstable enough for convergence, fast enough for large datasets, leverages GPU parallelism.
Cons: Requires tuning batch size B.
6.5 Learning Rate Effects
| Learning Rate | Behavior | Convergence |
|---|---|---|
| Too small (e.g., 0.0001) | Very slow descent, tiny steps | Eventually converges but takes too long |
| Just right (e.g., 0.01) | Steady descent toward minimum | Converges in reasonable time |
| Too large (e.g., 1.0) | Overshoots minimum, oscillates wildly | May diverge (J increases!) |
| Adaptive (e.g., Adam) | Adjusts per-parameter learning rates | Robust convergence for most problems |
6.6 Convergence Criteria
How do we know when to stop gradient descent? Common criteria:
- Maximum iterations: Stop after k iterations (e.g., k = 1000).
- Cost tolerance: Stop when |J(w_new) - J(w_old)| < ฮต (e.g., ฮต = 10โปโถ).
- Gradient norm: Stop when ||โJ(w)|| < ฮต.
- Parameter change: Stop when ||w_new - w_old|| < ฮต.
Worked Numerical Examples
Example 1: Normal Equation with 5 Data Points
Problem: Given the following data on flat area (sq. ft.) and price (โน Lakhs) in Bangalore:
| Area (x) | Price (y) |
|---|---|
| 600 | 30 |
| 800 | 45 |
| 1000 | 55 |
| 1200 | 68 |
| 1500 | 80 |
Find w (slope) and b (intercept) using the Normal Equation.
Step-by-Step Solution
Step 1: Construct the design matrix X (with bias column) and target vector y:
[1, 800],
[1, 1000],
[1, 1200],
[1, 1500]]
Step 2: Compute XTX:
Detail: XTX[0,0] = 5 (sum of 1s), XTX[0,1] = 600+800+1000+1200+1500 = 5100, XTX[1,1] = 600ยฒ+800ยฒ+1000ยฒ+1200ยฒ+1500ยฒ = 360000+640000+1000000+1440000+2250000 = 5690000.
Step 3: Compute XTy:
= [278, 18000+36000+55000+81600+120000]T
= [278, 310600]T
Step 4: Compute (XTX)-1:
For a 2ร2 matrix [[a,b],[c,d]], inverse = (1/(ad-bc)) ยท [[d,-b],[-c,a]]
(XTX)-1 = (1/2440000) ยท [[5690000, -5100], [-5100, 5]]
Step 5: Compute w = (XTX)-1 XTy:
[(-5100)ยท278 + 5ยท310600]]
w[0] = (1/2440000)(1581820000 - 1584060000) = (-2240000)/2440000 โ -0.918
w[1] = (1/2440000)(-1417800 + 1553000) = 135200/2440000 โ 0.05541
ลท = 0.05541 ยท x - 0.918
Interpretation: Each sq. ft. adds โ โน0.0554 Lakhs โ โน5,541 to the price.
Verification: For x=1000: ลท = 0.05541ร1000 - 0.918 = 55.41 - 0.918 = 54.49 โ 55 โ
Example 2: Two Iterations of Gradient Descent (ฮฑ = 0.01)
Problem: Using simplified data points: (1, 2), (2, 4), (3, 5). Starting with w=0, b=0. Perform 2 iterations of batch gradient descent with ฮฑ = 0.01.
Iteration 1
Predictions (ลท = wx + b, with w=0, b=0):
Errors (ลทแตข - yแตข):
Gradients:
โJ/โb = (2/3)[(-2) + (-4) + (-5)] = (2/3)(-11) = -7.333
Updates:
b โ 0 - 0.01 ร (-7.333) = 0.0733
Iteration 2
Predictions (w=0.1667, b=0.0733):
ลทโ = 0.1667ยท2 + 0.0733 = 0.407
ลทโ = 0.1667ยท3 + 0.0733 = 0.573
Errors:
Gradients:
= (2/3)[-1.760 - 7.186 - 13.281] = (2/3)(-22.227) = -14.818
โJ/โb = (2/3)[-1.760 - 3.593 - 4.427] = (2/3)(-9.780) = -6.520
Updates:
b โ 0.0733 - 0.01 ร (-6.520) = 0.0733 + 0.0652 = 0.1385
After just 2 iterations, the line has already started fitting: w is moving toward ~1.5 and b toward ~0.67 (the optimal values). With hundreds more iterations, gradient descent will converge to the exact solution.
Example 3: Indian Real Estate (4 Features)
Problem: Predict flat price in Delhi NCR using 4 features: Area (sq ft), Bedrooms, Floor, Age (years).
| Area | Beds | Floor | Age | Price (โนL) |
|---|---|---|---|---|
| 850 | 2 | 3 | 5 | 45 |
| 1100 | 3 | 7 | 2 | 72 |
| 1400 | 3 | 12 | 1 | 95 |
| 950 | 2 | 5 | 10 | 38 |
| 1600 | 4 | 15 | 0 | 120 |
| 750 | 1 | 2 | 15 | 28 |
This requires the Normal Equation with a 6ร5 design matrix. We'll solve this computationally in the Python section, but the interpretation is key:
Expected signs: wโ > 0 (more area โ higher price)
wโ > 0 (more bedrooms โ higher price)
wโ > 0 (higher floor โ slightly higher price)
wโ < 0 (older flat โ lower price)
Visual Diagrams
Diagram 1: Scatter Plot with Best-Fit Line
Diagram 2: Cost Function Surface
Diagram 3: Gradient Descent Steps
Diagram 4: Residual Analysis
Flowcharts
Flowchart 1: Linear Regression Pipeline
Flowchart 2: Gradient Descent Algorithm
Python Implementation (From Scratch)
Let's build a complete LinearRegression class from scratch using only NumPy.
import numpy as np import matplotlib.pyplot as plt class LinearRegressionScratch: """ Linear Regression from scratch. Supports: Normal Equation, Batch GD, SGD, Mini-batch GD. """ def __init__(self, method='normal', lr=0.01, n_iters=1000, batch_size=32, tol=1e-6, verbose=False): """ Parameters: ----------- method : str, one of 'normal', 'batch_gd', 'sgd', 'mini_batch_gd' lr : float, learning rate (alpha) n_iters: int, max iterations for GD methods batch_size: int, for mini-batch GD tol : float, convergence tolerance verbose: bool, print progress """ self.method = method self.lr = lr self.n_iters = n_iters self.batch_size = batch_size self.tol = tol self.verbose = verbose self.weights = None # includes bias self.cost_history = [] def _add_bias(self, X): """Add a column of 1s for the bias term.""" return np.c_[np.ones(X.shape[0]), X] def _compute_cost(self, X, y): """Compute MSE cost: J = (1/n)||y - Xw||^2""" n = len(y) predictions = X @ self.weights cost = (1 / n) * np.sum((y - predictions) ** 2) return cost def _compute_gradient(self, X, y): """Compute gradient: (2/n) * X^T(Xw - y)""" n = len(y) predictions = X @ self.weights gradient = (2 / n) * X.T @ (predictions - y) return gradient def fit_normal_equation(self, X, y): """Closed-form solution: w = (X^TX)^{-1} X^Ty""" X_b = self._add_bias(X) # Use np.linalg.pinv for numerical stability self.weights = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y if self.verbose: cost = self._compute_cost(X_b, y) print(f"Normal Equation โ MSE: {cost:.6f}") def fit_batch_gd(self, X, y): """Full-batch Gradient Descent.""" X_b = self._add_bias(X) n, p = X_b.shape self.weights = np.zeros(p) # initialize to zeros self.cost_history = [] for i in range(self.n_iters): gradient = self._compute_gradient(X_b, y) self.weights -= self.lr * gradient cost = self._compute_cost(X_b, y) self.cost_history.append(cost) if self.verbose and i % 100 == 0: print(f"Iter {i:4d} | MSE: {cost:.6f}") # Convergence check if len(self.cost_history) > 1: if abs(self.cost_history[-1] - self.cost_history[-2]) < self.tol: if self.verbose: print(f"Converged at iteration {i}") break def fit_sgd(self, X, y): """Stochastic Gradient Descent (one sample at a time).""" X_b = self._add_bias(X) n, p = X_b.shape self.weights = np.zeros(p) self.cost_history = [] for epoch in range(self.n_iters): indices = np.random.permutation(n) for idx in indices: xi = X_b[idx:idx+1] yi = y[idx:idx+1] gradient = (2) * xi.T @ (xi @ self.weights - yi) self.weights -= self.lr * gradient cost = self._compute_cost(X_b, y) self.cost_history.append(cost) if self.verbose and epoch % 100 == 0: print(f"Epoch {epoch:4d} | MSE: {cost:.6f}") def fit_mini_batch_gd(self, X, y): """Mini-batch Gradient Descent.""" X_b = self._add_bias(X) n, p = X_b.shape self.weights = np.zeros(p) self.cost_history = [] for epoch in range(self.n_iters): indices = np.random.permutation(n) for start in range(0, n, self.batch_size): end = min(start + self.batch_size, n) batch_idx = indices[start:end] X_batch = X_b[batch_idx] y_batch = y[batch_idx] gradient = (2 / len(y_batch)) * X_batch.T @ (X_batch @ self.weights - y_batch) self.weights -= self.lr * gradient cost = self._compute_cost(X_b, y) self.cost_history.append(cost) def fit(self, X, y): """Fit model using the chosen method.""" X = np.array(X, dtype=np.float64) y = np.array(y, dtype=np.float64) if self.method == 'normal': self.fit_normal_equation(X, y) elif self.method == 'batch_gd': self.fit_batch_gd(X, y) elif self.method == 'sgd': self.fit_sgd(X, y) elif self.method == 'mini_batch_gd': self.fit_mini_batch_gd(X, y) else: raise ValueError(f"Unknown method: {self.method}") return self def predict(self, X): """Make predictions.""" X = np.array(X, dtype=np.float64) X_b = self._add_bias(X) return X_b @ self.weights def score(self, X, y): """Compute Rยฒ score.""" y = np.array(y, dtype=np.float64) y_pred = self.predict(X) ss_res = np.sum((y - y_pred) ** 2) ss_tot = np.sum((y - np.mean(y)) ** 2) return 1 - (ss_res / ss_tot) def get_metrics(self, X, y): """Return dict of all metrics.""" y = np.array(y, dtype=np.float64) y_pred = self.predict(X) n = len(y) p = len(self.weights) - 1 # exclude bias mse = np.mean((y - y_pred) ** 2) rmse = np.sqrt(mse) mae = np.mean(np.abs(y - y_pred)) r2 = self.score(X, y) adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - p - 1)) return { 'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'Rยฒ': r2, 'Adj Rยฒ': adj_r2 } def plot_cost_history(self): """Plot cost vs iterations (for GD methods).""" if not self.cost_history: print("No cost history (use a GD method).") return plt.figure(figsize=(10, 5)) plt.plot(self.cost_history, color='#059669', linewidth=2) plt.xlabel('Iteration') plt.ylabel('MSE Cost') plt.title('Cost Function Convergence') plt.grid(True, alpha=0.3) plt.show() def plot_fit(self, X, y, feature_idx=0): """Plot data + regression line (for 1D visualization).""" X = np.array(X, dtype=np.float64) y = np.array(y, dtype=np.float64) y_pred = self.predict(X) plt.figure(figsize=(10, 6)) plt.scatter(X[:, feature_idx], y, color='#0891b2', s=80, alpha=0.7, label='Actual', zorder=5) # Sort for smooth line sort_idx = X[:, feature_idx].argsort() plt.plot(X[sort_idx, feature_idx], y_pred[sort_idx], color='#059669', linewidth=2.5, label='Predicted') plt.xlabel(f'Feature {feature_idx}') plt.ylabel('Target') plt.title('Linear Regression Fit') plt.legend() plt.grid(True, alpha=0.3) plt.show()
Using the Class
# === Example: Bangalore flat price prediction === import numpy as np # Data: [area_sqft, bedrooms, age_years] X = np.array([ [600, 1, 10], [800, 2, 7], [1000, 2, 5], [1200, 3, 3], [1500, 3, 1], [1800, 4, 0], [900, 2, 8], [1100, 3, 4], ]) y = np.array([30, 45, 55, 68, 82, 105, 40, 62]) # Price in โน Lakhs # --- Method 1: Normal Equation --- model_ne = LinearRegressionScratch(method='normal', verbose=True) model_ne.fit(X, y) print("Weights (Normal Eq):", model_ne.weights) print("Metrics:", model_ne.get_metrics(X, y)) # --- Method 2: Batch Gradient Descent --- # Feature scaling is CRITICAL for GD! X_scaled = (X - X.mean(axis=0)) / X.std(axis=0) model_gd = LinearRegressionScratch( method='batch_gd', lr=0.01, n_iters=1000, verbose=True ) model_gd.fit(X_scaled, y) print("Rยฒ (Batch GD):", model_gd.score(X_scaled, y)) model_gd.plot_cost_history() # --- Predict new flat --- new_flat = np.array([[1300, 3, 2]]) price = model_ne.predict(new_flat) print(f"Predicted price for 1300 sqft, 3BHK, 2yr old: โน{price[0]:.1f} Lakhs")
Feature Scaling Implementation
class StandardScaler: """Standardize features: z = (x - mean) / std""" def fit(self, X): self.mean_ = np.mean(X, axis=0) self.std_ = np.std(X, axis=0) self.std_[self.std_ == 0] = 1 # avoid division by zero return self def transform(self, X): return (X - self.mean_) / self.std_ def fit_transform(self, X): return self.fit(X).transform(X) # Why is scaling critical for GD? # Without scaling: area~1000, bedrooms~3, age~5 # The gradient for 'area' is ~300x larger than for 'bedrooms' # GD oscillates wildly on the area axis while barely moving on bedrooms # With scaling: all features have similar ranges โ smooth convergence
Diagnostic Plots Implementation
def diagnostic_plots(y_true, y_pred, feature_name="Feature"): """Generate diagnostic plots for regression analysis.""" residuals = y_true - y_pred fig, axes = plt.subplots(1, 3, figsize=(18, 5)) # 1. Residual Plot axes[0].scatter(y_pred, residuals, color='#0891b2', alpha=0.6, s=50) axes[0].axhline(y=0, color='#dc2626', linestyle='--', linewidth=1.5) axes[0].set_xlabel('Predicted Values') axes[0].set_ylabel('Residuals') axes[0].set_title('Residual Plot') axes[0].grid(True, alpha=0.3) # 2. Q-Q Plot (manual implementation) sorted_residuals = np.sort(residuals) n = len(sorted_residuals) theoretical = np.array([ np.sqrt(2) * 0.4769 * ((2 * (i + 1) - 1) / (2 * n) - 0.5) for i in range(n) ]) # Simplified; in practice, use scipy.stats.norm.ppf axes[1].scatter(theoretical, sorted_residuals, color='#059669', alpha=0.7, s=50) # Reference line lims = [min(theoretical.min(), sorted_residuals.min()), max(theoretical.max(), sorted_residuals.max())] axes[1].plot(lims, lims, 'r--', linewidth=1.5) axes[1].set_xlabel('Theoretical Quantiles') axes[1].set_ylabel('Sample Quantiles') axes[1].set_title('Q-Q Plot') axes[1].grid(True, alpha=0.3) # 3. Actual vs Predicted axes[2].scatter(y_true, y_pred, color='#8b5cf6', alpha=0.6, s=50) lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())] axes[2].plot(lims, lims, 'r--', linewidth=1.5) axes[2].set_xlabel('Actual Values') axes[2].set_ylabel('Predicted Values') axes[2].set_title('Actual vs Predicted') axes[2].grid(True, alpha=0.3) plt.tight_layout() plt.show() # Usage: # y_pred = model.predict(X) # diagnostic_plots(y, y_pred)
TensorFlow Implementation
import tensorflow as tf import numpy as np from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler # === Generate sample Indian housing data === np.random.seed(42) n_samples = 500 area = np.random.uniform(500, 2500, n_samples) bedrooms = np.random.randint(1, 5, n_samples) age = np.random.uniform(0, 20, n_samples) floor = np.random.randint(1, 20, n_samples) # True relationship + noise price = 0.05 * area + 8 * bedrooms - 1.5 * age + 0.3 * floor + 5 + \ np.random.normal(0, 5, n_samples) X = np.column_stack([area, bedrooms, age, floor]) y = price # Split & scale X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42 ) scaler = StandardScaler() X_train_s = scaler.fit_transform(X_train) X_test_s = scaler.transform(X_test) # === Build TensorFlow Model === model = tf.keras.Sequential([ tf.keras.layers.Dense(1, input_shape=(4,), kernel_initializer='normal', name='linear_layer') ]) # Linear regression = 1 Dense layer with no activation! # This is equivalent to ลท = Xw + b model.compile( optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss='mse', # Mean Squared Error metrics=['mae'] # Also track MAE ) model.summary() # === Train === history = model.fit( X_train_s, y_train, epochs=100, batch_size=32, validation_split=0.15, verbose=1 ) # === Evaluate === test_loss, test_mae = model.evaluate(X_test_s, y_test) print(f"\nTest MSE: {test_loss:.4f}") print(f"Test MAE: {test_mae:.4f}") # === Extract learned weights === weights, bias = model.get_layer('linear_layer').get_weights() print(f"\nLearned weights: {weights.flatten()}") print(f"Learned bias: {bias[0]:.4f}") # === Plot training history === import matplotlib.pyplot as plt fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5)) ax1.plot(history.history['loss'], label='Train MSE', color='#059669') ax1.plot(history.history['val_loss'], label='Val MSE', color='#0891b2') ax1.set_xlabel('Epoch'); ax1.set_ylabel('MSE') ax1.set_title('Loss Convergence'); ax1.legend(); ax1.grid(True, alpha=0.3) ax2.plot(history.history['mae'], label='Train MAE', color='#059669') ax2.plot(history.history['val_mae'], label='Val MAE', color='#0891b2') ax2.set_xlabel('Epoch'); ax2.set_ylabel('MAE') ax2.set_title('MAE Over Training'); ax2.legend(); ax2.grid(True, alpha=0.3) plt.tight_layout() plt.show() # === Predict new flat === new_flat = np.array([[1300, 3, 2, 10]]) new_flat_s = scaler.transform(new_flat) pred_price = model.predict(new_flat_s) print(f"Predicted price: โน{pred_price[0][0]:.1f} Lakhs")
Scikit-Learn Implementation
import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.linear_model import LinearRegression, SGDRegressor from sklearn.model_selection import train_test_split, cross_val_score from sklearn.preprocessing import StandardScaler, PolynomialFeatures from sklearn.pipeline import Pipeline from sklearn.metrics import ( mean_squared_error, mean_absolute_error, r2_score ) # ============================================ # 1. CREATE SYNTHETIC INDIAN HOUSING DATASET # ============================================ np.random.seed(42) n = 1000 data = pd.DataFrame({ 'area_sqft': np.random.uniform(400, 3000, n), 'bedrooms': np.random.randint(1, 6, n), 'bathrooms': np.random.randint(1, 4, n), 'age_years': np.random.uniform(0, 25, n), 'floor': np.random.randint(1, 25, n), 'city_tier': np.random.choice([1, 2, 3], n, p=[0.3, 0.4, 0.3]), }) # Generate realistic prices (โน Lakhs) city_multiplier = {1: 1.5, 2: 1.0, 3: 0.6} data['price_lakhs'] = ( 0.045 * data['area_sqft'] + 7 * data['bedrooms'] + 5 * data['bathrooms'] - 1.2 * data['age_years'] + 0.3 * data['floor'] + 10 * data['city_tier'].map(city_multiplier) + np.random.normal(0, 8, n) ) print(data.head()) print(data.describe()) # ============================================ # 2. EXPLORATORY DATA ANALYSIS # ============================================ print("\nCorrelation with price:") print(data.corr()['price_lakhs'].sort_values(ascending=False)) # ============================================ # 3. PREPARE FEATURES & TARGET # ============================================ features = ['area_sqft', 'bedrooms', 'bathrooms', 'age_years', 'floor'] X = data[features].values y = data['price_lakhs'].values X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42 ) # ============================================ # 4. MODEL 1: BASIC LINEAR REGRESSION # ============================================ model = LinearRegression() model.fit(X_train, y_train) print("\n=== Linear Regression Results ===") print(f"Intercept: {model.intercept_:.4f}") for feat, coef in zip(features, model.coef_): print(f" {feat:15s}: {coef:+.6f}") y_pred = model.predict(X_test) print(f"\nTest Rยฒ: {r2_score(y_test, y_pred):.4f}") print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}") print(f"Test MAE: {mean_absolute_error(y_test, y_pred):.4f}") # ============================================ # 5. CROSS-VALIDATION # ============================================ cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2') print(f"\n5-Fold CV Rยฒ scores: {cv_scores}") print(f"Mean CV Rยฒ: {cv_scores.mean():.4f} ยฑ {cv_scores.std():.4f}") # ============================================ # 6. MODEL 2: PIPELINE WITH SCALING + SGD # ============================================ sgd_pipeline = Pipeline([ ('scaler', StandardScaler()), ('sgd', SGDRegressor(max_iter=1000, tol=1e-4, learning_rate='optimal', random_state=42)) ]) sgd_pipeline.fit(X_train, y_train) y_pred_sgd = sgd_pipeline.predict(X_test) print(f"\nSGD Rยฒ: {r2_score(y_test, y_pred_sgd):.4f}") # ============================================ # 7. MODEL 3: POLYNOMIAL REGRESSION # ============================================ poly_pipeline = Pipeline([ ('poly', PolynomialFeatures(degree=2, include_bias=False)), ('scaler', StandardScaler()), ('reg', LinearRegression()) ]) poly_pipeline.fit(X_train, y_train) y_pred_poly = poly_pipeline.predict(X_test) print(f"Poly(2) Rยฒ: {r2_score(y_test, y_pred_poly):.4f}") # ============================================ # 8. VISUALIZATION # ============================================ fig, axes = plt.subplots(2, 2, figsize=(14, 10)) # Actual vs Predicted axes[0,0].scatter(y_test, y_pred, alpha=0.5, color='#059669', s=30) axes[0,0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--') axes[0,0].set_title('Actual vs Predicted') # Residuals residuals = y_test - y_pred axes[0,1].scatter(y_pred, residuals, alpha=0.5, color='#0891b2', s=30) axes[0,1].axhline(y=0, color='red', linestyle='--') axes[0,1].set_title('Residual Plot') # Residual distribution axes[1,0].hist(residuals, bins=30, color='#059669', alpha=0.7, edgecolor='white') axes[1,0].set_title('Residual Distribution') # Feature importance (coefficients) coefs = pd.Series(model.coef_, index=features).sort_values() coefs.plot.barh(ax=axes[1,1], color='#0891b2') axes[1,1].set_title('Feature Coefficients') plt.tight_layout() plt.show()
Indian Case Studies
House Price Prediction: Mumbai, Delhi NCR & Bangalore
Context: India's residential real estate market is valued at ~$300 billion (2024). Platforms like MagicBricks, 99acres, and Housing.com use regression models to provide instant price estimates.
Dataset Features
| Feature | Type | Example Values | Expected Coefficient Sign |
|---|---|---|---|
| Area (sq ft) | Continuous | 500โ5000 | Positive (+) |
| Bedrooms (BHK) | Discrete | 1โ5 | Positive (+) |
| Bathrooms | Discrete | 1โ4 | Positive (+) |
| Floor | Discrete | 1โ30 | Slightly positive (+) |
| Age (years) | Continuous | 0โ30 | Negative (โ) |
| Distance to metro (km) | Continuous | 0.1โ15 | Negative (โ) |
| City tier | Categorical | 1, 2, 3 | Tier 1 premium |
Key Findings
- Mumbai (Tier 1): โน18,000โโน25,000 per sq ft in prime areas; regression captures ~82% variance (Rยฒ=0.82)
- Bangalore: IT corridor proximity adds โน3,000/sqft premium; modeled via distance feature
- Delhi NCR: Metro connectivity is the strongest price predictor after area
- Limitation: Non-linear effects (e.g., sea-view premium in Mumbai) require polynomial features or tree models
Model Equation (Simplified Mumbai Model)
Interpretation: Each sq ft adds โน6,200; each additional bedroom adds โน12.5 Lakhs; each year of age reduces price by โน2.1 Lakhs; each km from metro reduces price by โน3.8 Lakhs.
Agricultural Crop Yield Prediction
Context: India is the world's second-largest agricultural producer. The Ministry of Agriculture uses regression models to estimate production of wheat, rice, and cotton to plan storage, transportation, and pricing (MSP).
Data Sources
- IMD (India Meteorological Department): Rainfall, temperature, humidity data
- ICAR: Soil quality, fertilizer usage data
- DAC&FW: Historical crop yield data by state and district
Feature Engineering
| Feature | Source | Impact on Yield |
|---|---|---|
| Total rainfall (mm) | IMD | Strong positive (up to a point) |
| Temperature (avg ยฐC) | IMD | Crop-dependent |
| Fertilizer (kg/hectare) | ICAR | Positive (diminishing returns) |
| Soil pH | ICAR | Optimal range varies |
| Irrigation coverage (%) | Census | Strong positive |
| Sown area (hectares) | DAC&FW | Positive |
Results
Rice yield model (Punjab, 2010-2023): Rยฒ = 0.78, RMSE = 245 kg/hectare. The model correctly predicted the 2022 below-normal harvest due to irregular monsoon patterns detected via the rainfall feature.
# Simplified crop yield regression import numpy as np from sklearn.linear_model import LinearRegression # Features: [rainfall_mm, temp_C, fertilizer_kg, irrigation_%] X_crop = np.array([ [800, 28, 120, 60], # Normal monsoon [1200, 27, 150, 75], # Good monsoon [600, 30, 100, 45], # Drought year [950, 26, 180, 80], # Well-irrigated [1100, 29, 130, 65], # Above average rain [700, 31, 90, 40], # Hot & dry [1050, 27, 160, 70], # Good year [500, 32, 80, 35], # Severe drought ]) y_crop = np.array([3200, 4100, 2400, 4500, 3800, 2100, 4000, 1800]) model_crop = LinearRegression() model_crop.fit(X_crop, y_crop) features = ['Rainfall', 'Temperature', 'Fertilizer', 'Irrigation'] for f, c in zip(features, model_crop.coef_): print(f"{f:12s}: {c:+.2f} kg/hectare per unit increase") print(f"\nRยฒ Score: {model_crop.score(X_crop, y_crop):.4f}")
Rainfall Prediction Using IMD Historical Data
Context: IMD maintains rainfall data from 1901 to present across 36 meteorological subdivisions. Linear regression with seasonal features (ENSO index, Indian Ocean Dipole, pre-monsoon temperatures) achieves Rยฒ โ 0.65 for June-September monsoon totals.
Challenge
Rainfall has inherent variability and chaotic dynamics. Linear regression captures the linear component of the signal but misses non-linear teleconnections. This is a great example of where linear regression provides a baseline that more complex models (Random Forest, LSTM) improve upon.
IMD's 2024 monsoon forecast used an ensemble of 5 statistical models, including multiple regression with ENSO, IOD, and Eurasian snow cover as predictors. Predicted: 106% of LPA. Actual: 109% of LPA. Error: 3%โremarkably accurate for a linear model!
Global Case Studies
Boston Housing Dataset โ Classic + Ethical Discussion
The Dataset: Created by Harrison & Rubinfeld (1978), the Boston Housing dataset has 506 samples with 13 features predicting median home value in Boston suburbs. For decades, it was the benchmark for regression algorithms.
Features
CRIM (per capita crime), ZN (residential zoning), INDUS (industrial proportion), CHAS (Charles River proximity), NOX (nitric oxide concentration), RM (average rooms), AGE (pre-1940 proportion), DIS (distance to employment), RAD (highway access), TAX (tax rate), PTRATIO (pupil-teacher), B (racial composition statistic), LSTAT (lower status %).
โ ๏ธ Ethical Concerns
The 'B' feature is defined as 1000(Bk - 0.63)ยฒ where Bk is the proportion of Black residents. This was originally included to model the racist assumption that racial composition affects property values. In 2020, Scikit-Learn deprecated `load_boston()` because:
- The feature encodes racial bias as a "normal" predictor
- Models trained on this data learn and perpetuate racial discrimination in housing
- Students unknowingly absorb the idea that race should be a price predictor
Alternative: Use the California Housing dataset or create synthetic data. We include this discussion because understanding bias in data is as important as understanding algorithms.
California Housing โ The Modern Benchmark
20,640 samples from the 1990 US Census. Features: median income, house age, average rooms, average bedrooms, population, average occupancy, latitude, longitude. Target: median house value.
from sklearn.datasets import fetch_california_housing from sklearn.linear_model import LinearRegression from sklearn.model_selection import train_test_split from sklearn.metrics import r2_score, mean_squared_error import numpy as np housing = fetch_california_housing() X, y = housing.data, housing.target X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42 ) model = LinearRegression() model.fit(X_train, y_train) y_pred = model.predict(X_test) print(f"Rยฒ: {r2_score(y_test, y_pred):.4f}") # ~0.60 print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}") # Linear regression gets Rยฒ โ 0.60 โ decent baseline! # This means 40% of variance is in non-linear patterns # โ Motivation for polynomial regression, trees, neural nets print("\nFeature importances:") for name, coef in zip(housing.feature_names, model.coef_): print(f" {name:15s}: {coef:+.4f}")
Energy Consumption Prediction
Application: Google uses regression models (among others) in their DeepMind-powered data center cooling systems. By predicting energy consumption based on temperature, humidity, server load, and cooling settings, they reduced cooling energy by 40%.
A simplified linear model: Energy_kWh = 0.12 ร Temperature + 0.08 ร Humidity + 0.45 ร Server_Load - 0.15 ร Cooling_Setting + 50
While Google's actual system uses neural networks, the initial prototype was a linear regression model that proved the concept. This illustrates the ML workflow: start simple, prove value, then add complexity.
Startup Applications
๐ PropTech Startups (Housing.com, NoBroker, Square Yards)
Use regression for instant price estimates. When a user lists a property, the model predicts a fair market value. Features include location coordinates, carpet area, amenities count, and nearby infrastructure scores. Revenue model: charge a premium for "AI-verified" price estimates.
๐พ AgriTech Startups (CropIn, DeHaat, Ninjacart)
CropIn uses satellite imagery + regression to predict crop yields per farm. DeHaat uses price regression to advise farmers on optimal selling time. These startups serve 15+ million Indian farmers and have raised $200M+ in funding.
๐ฐ FinTech (Cred, Razorpay, PhonePe)
Credit scoring starts with logistic regression (Chapter 5), but expected transaction volume prediction uses linear regression. Razorpay predicts daily payment volumes to manage server scaling and cash reserves.
๐ Mobility (Ola, Rapido)
Ride time prediction uses regression: ETA = ฮฒโรdistance + ฮฒโรtraffic_index + ฮฒโรtime_of_day + ฮฒโรweather + intercept. Surge pricing models also start with regression on demand-supply ratios.
Government Applications
๐๏ธ NITI Aayog โ GDP Forecasting
Uses multivariate regression with indicators: industrial production (IIP), credit growth, exports, government expenditure, FDI inflows. The model provides quarterly GDP growth estimates used in Union Budget planning.
๐๏ธ Census of India โ Population Projection
Simple time-series regression on population data from 1951-2011 censuses projects India's population trajectory. The 2023 projection (1.43 billion) was within 1% of the actual figure.
๐๏ธ ISRO โ Satellite Telemetry
ISRO uses regression models to predict satellite battery degradation, solar panel output vs. orbital angle, and fuel consumption rates. These models are interpretable and safety-criticalโneural networks are too opaque for space missions.
๐๏ธ IMD โ Monsoon Forecasting
As discussed in Case Study 3, IMD's ensemble regression model for monsoon prediction directly impacts agricultural planning, water resource management, and disaster preparedness for 1.4 billion people.
Industry Applications
| Industry | Application | Features Used | Impact |
|---|---|---|---|
| ๐ฆ Banking (SBI, HDFC) | Loan default amount estimation | Income, debt ratio, tenure, credit score | Risk-based pricing, โน2L Cr loan portfolio |
| ๐ฅ Healthcare (Apollo) | Patient stay duration prediction | Age, diagnosis, vitals, procedure type | Bed allocation, staff scheduling |
| ๐ญ Manufacturing (Tata) | Quality prediction (defect rate) | Temperature, pressure, speed, material | Reduced defect rate by 15% |
| ๐ E-commerce (Flipkart) | Demand forecasting | Season, promotions, price, category | Inventory optimization, โน500Cr savings |
| โก Energy (Tata Power) | Load forecasting | Temperature, time, day type, historical load | Grid stability, peak management |
| ๐ฑ Telecom (Jio) | Churn risk scoring | Usage, complaints, recharge frequency | Retention campaigns targeting top 10% |
| ๐ Logistics (Delhivery) | Delivery time estimation | Distance, weight, route, warehouse load | SLA compliance improvement by 20% |
Mini Projects
๐ ๏ธ Mini Project 1: Indian House Price Predictor
Objective: Build a complete end-to-end house price prediction system for Indian cities.
Requirements
- Generate or collect a dataset with 500+ rows and 8+ features
- Perform EDA: histograms, correlation matrix, scatter plots
- Handle missing values and outliers
- Feature engineering: create price_per_sqft, log-transform skewed features
- Train-test split (80/20) with cross-validation
- Compare: Normal Equation vs Batch GD vs SGD (all from scratch)
- Also fit Scikit-Learn LinearRegression and compare results
- Generate diagnostic plots: residuals, Q-Q, actual vs predicted
- Report: Rยฒ, Adjusted Rยฒ, RMSE, MAE for all models
- Predict price for 5 new flats and discuss predictions
import numpy as np import pandas as pd from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler # === Generate synthetic Indian housing data === np.random.seed(42) n = 800 cities = np.random.choice( ['Mumbai', 'Delhi', 'Bangalore', 'Hyderabad', 'Pune'], n, p=[0.25, 0.2, 0.25, 0.15, 0.15] ) city_premium = { 'Mumbai': 1.8, 'Delhi': 1.4, 'Bangalore': 1.3, 'Hyderabad': 1.0, 'Pune': 0.9 } data = pd.DataFrame({ 'city': cities, 'area_sqft': np.random.uniform(400, 3000, n), 'bedrooms': np.random.randint(1, 5, n), 'bathrooms': np.random.randint(1, 4, n), 'floor': np.random.randint(1, 20, n), 'age_years': np.random.uniform(0, 20, n), 'parking': np.random.randint(0, 3, n), 'metro_dist_km': np.random.uniform(0.1, 10, n), }) # Generate prices data['price_lakhs'] = ( 0.05 * data['area_sqft'] + 8 * data['bedrooms'] + 5 * data['bathrooms'] - 1.5 * data['age_years'] + 0.4 * data['floor'] + 4 * data['parking'] - 2 * data['metro_dist_km'] + 15 * data['city'].map(city_premium) + np.random.normal(0, 8, n) ).clip(lower=10) print("Dataset created! Shape:", data.shape) print(data.head()) # TODO: Students complete the following: # 1. EDA (histograms, correlation matrix) # 2. Feature engineering (one-hot encode city) # 3. Train/test split # 4. Implement LinearRegressionScratch.fit() with each method # 5. Compare with sklearn LinearRegression # 6. Diagnostic plots # 7. Report all metrics
Expected Output
Rยฒ โ 0.90โ0.95 (since data is synthetic with known linear relationships). Students should note that real-world data will typically give Rยฒ โ 0.6โ0.85 due to non-linear effects and unobserved features.
๐ ๏ธ Mini Project 2: Crop Yield Estimator
Objective: Build a regression model that estimates crop yield (kg/hectare) based on weather, soil, and farming practice data.
Requirements
- Create/collect dataset: 300+ rows, 6+ features (rainfall, temperature, soil pH, fertilizer, irrigation, seed variety)
- Compare simple vs. multiple regression (1 feature vs. 6 features)
- Implement from scratch with Batch GD (with learning rate experiments)
- Generate cost convergence plots for ฮฑ = 0.001, 0.01, 0.1, 0.5
- Show that feature scaling dramatically improves GD convergence
- Compute VIF (Variance Inflation Factor) for multicollinearity
- Polynomial regression (degree 2) vs. linear: is it worth it?
- Predict yield for upcoming season with different rainfall scenarios
from sklearn.linear_model import LinearRegression import numpy as np def compute_vif(X, feature_names): """ Compute Variance Inflation Factor for each feature. VIF > 5: moderate multicollinearity VIF > 10: severe multicollinearity """ vif_data = [] for i in range(X.shape[1]): # Regress feature i on all other features X_other = np.delete(X, i, axis=1) model = LinearRegression().fit(X_other, X[:, i]) r2 = model.score(X_other, X[:, i]) vif = 1 / (1 - r2) if r2 < 1 else float('inf') vif_data.append({ 'Feature': feature_names[i], 'VIF': round(vif, 2), 'Status': 'OK' if vif < 5 else ('Warning' if vif < 10 else 'REMOVE') }) return vif_data # Usage: # vif_results = compute_vif(X, feature_names) # for v in vif_results: print(v)
End-of-Chapter Exercises
Conceptual Questions
Explain the difference between correlation and causation. If a linear regression model shows a positive coefficient for ice cream sales predicting drowning deaths, does ice cream cause drowning? Discuss confounding variables.
State the four assumptions of linear regression (LINE). For each assumption, describe (a) what it means, (b) how to check it using diagnostic plots, and (c) what happens to OLS estimates if it's violated.
Why does MSE emerge as the natural cost function from Maximum Likelihood under Gaussian noise? What cost function would arise if noise followed a Laplace distribution instead?
Explain why Rยฒ always increases (or stays the same) when you add more features, but Adjusted Rยฒ can decrease. Give a concrete numerical example.
Compare and contrast Batch GD, Stochastic GD, and Mini-Batch GD in terms of: (a) computation per step, (b) convergence smoothness, (c) memory requirements, (d) suitability for online learning.
Mathematical Problems
Given data points: (1,3), (2,5), (3,7), (4,9), (5,11). Compute w and b using the Normal Equation by hand. Verify that these points are perfectly collinear and Rยฒ = 1.
Starting from w=1, b=0, perform 3 iterations of gradient descent on data points (1,2), (2,3), (3,5) with learning rate ฮฑ = 0.05. Show all intermediate calculations.
Prove that the gradient of J(w) = (1/n)||y - Xw||ยฒ with respect to w is (2/n)XT(Xw - y). Use matrix calculus rules explicitly.
Show that for simple linear regression (one feature), the Normal Equation simplifies to:
w = [nฮฃ(xแตขyแตข) - ฮฃ(xแตข)ฮฃ(yแตข)] / [nฮฃ(xแตขยฒ) - (ฮฃ(xแตข))ยฒ]
b = ศณ - wยทxฬ
A dataset has 100 samples and 5 features. The linear regression model achieves Rยฒ = 0.82. Compute Adjusted Rยฒ. If we add a 6th feature and Rยฒ becomes 0.823, should we keep the new feature? Justify using Adjusted Rยฒ.
Programming Exercises
Implement gradient descent with momentum from scratch: v_t = ฮฒยทv_{t-1} + โJ(w), w โ w - ฮฑยทv_t. Compare convergence speed with vanilla GD on a dataset of your choice.
Create a function that generates an animated GIF showing gradient descent steps on a 2D cost surface (w vs J(w)). Use matplotlib.animation.
Implement polynomial regression from scratch. Fit degree 1, 2, 3, and 5 polynomials to 20 noisy data points. Plot all fits and discuss underfitting vs. overfitting.
Build a complete data pipeline: read a CSV file, handle missing values (mean imputation), encode categorical variables, scale features, train a model, evaluate, and save the model using pickle. The entire pipeline should run with a single function call.
Implement k-fold cross-validation from scratch (without using sklearn). Compute mean and std of Rยฒ across 5 folds.
Application Exercises
Collect electricity bills for your household over the past 12 months. Use month number and average temperature as features to predict the bill amount. Is the relationship linear? If not, try polynomial features.
Use the IMD rainfall data (available on data.gov.in) for your state. Build a simple regression model: rainfall_this_year ~ f(rainfall_last_year, El_Nino_index). Report Rยฒ and discuss limitations.
Scrape or collect prices of 50 second-hand cars from OLX/CarDekho. Features: year, km driven, fuel type, engine CC. Build a linear regression model and discuss which features matter most.
Download the California Housing dataset from Scikit-Learn. Compare: (a) simple linear regression using only median income, (b) multiple regression with all 8 features, (c) polynomial regression (degree 2) with all features. Report Rยฒ for each.
Build a simple web API (using Flask or FastAPI) that accepts house features via JSON and returns a predicted price. Deploy locally and test with curl or Postman.
Implement Ridge Regression (L2 regularization) from scratch: J(w) = MSE + ฮป||w||ยฒ. Derive the gradient and implement gradient descent. Show that as ฮป increases, coefficients shrink toward zero.
Create a comprehensive comparison table: run linear regression on the California Housing dataset using (a) your scratch implementation (Normal Equation), (b) your scratch GD implementation, (c) sklearn LinearRegression, (d) sklearn SGDRegressor, (e) TensorFlow Dense(1). All should give approximately the same Rยฒ. Discuss any differences.
Multiple Choice Questions
Q1. The Normal Equation for linear regression is:
Show Answer
Q2. Which of the following is NOT an assumption of OLS linear regression?
Show Answer
Q3. If the learning rate in gradient descent is too large, what happens?
Show Answer
Q4. Rยฒ = 0.75 means:
Show Answer
Q5. Which gradient descent variant uses a single randomly chosen sample per update?
Show Answer
Q6. Feature scaling is essential for gradient descent because:
Show Answer
Q7. The MSE cost function for linear regression is:
Show Answer
Q8. If XTX is singular (non-invertible), which of the following is true?
Show Answer
Q9. Which metric is most robust to outliers?
Show Answer
Q10. Deriving MSE from Maximum Likelihood requires which assumption about noise?
Show Answer
Q11. What is the time complexity of the Normal Equation?
Show Answer
Q12. In the residual plot, a funnel-shaped pattern indicates:
Show Answer
Interview Questions
Q1. What is the difference between Linear Regression and Correlation?
Show Answer
Correlation measures the strength and direction of a linear relationship (a number from -1 to +1). Linear regression models the relationship to make predictions. Correlation is symmetric (corr(X,Y) = corr(Y,X)), but regression is not (regressing Y on X โ regressing X on Y). Additionally, correlation doesn't assume causation, whereas regression coefficients are often interpreted as "effect sizes" (though this requires causal assumptions).
Q2. Why do we use MSE and not just Mean Error?
Show Answer
Mean Error (ME = ฮฃ(yแตข - ลทแตข)/n) can be zero even with terrible predictions because positive and negative errors cancel out. MSE squares errors to make them all positive, penalizes large errors more heavily, and is mathematically convenient (differentiable, convex). Additionally, MSE arises naturally from the Maximum Likelihood principle under Gaussian noise assumptions.
Q3. When would you prefer Gradient Descent over the Normal Equation?
Show Answer
GD is preferred when: (1) p (features) is large (>10,000) because matrix inversion is O(pยณ); (2) n is very large and data doesn't fit in memory โ SGD processes one sample at a time; (3) Online/streaming data โ SGD can update in real-time; (4) Non-linear extensions โ GD generalizes to neural networks while Normal Equation is only for linear models. Normal Equation is preferred for small datasets where the exact solution is cheap to compute.
Q4. How do you handle multicollinearity in linear regression?
Show Answer
Detection: (1) Compute VIF (Variance Inflation Factor) โ VIF > 10 is severe; (2) Check correlation matrix for |r| > 0.9 pairs. Solutions: (1) Remove one of the correlated features; (2) Use PCA to create uncorrelated components; (3) Use Ridge (L2) or LASSO (L1) regularization โ Ridge shrinks correlated coefficients, LASSO can drive some to zero; (4) Domain knowledge to choose the most meaningful feature.
Q5. Can Rยฒ be negative? What does it mean?
Show Answer
Yes! Rยฒ = 1 - SS_res/SS_tot. If SS_res > SS_tot (the model's predictions are worse than simply predicting the mean), Rยฒ becomes negative. This happens when: (1) The model is very wrong (e.g., predicting a constant that's far from the mean); (2) You evaluate on a test set with a very different distribution; (3) The model is badly overfit. An Rยฒ < 0 model is worse than the "dumb baseline" of always predicting ศณ.
Q6. Explain the bias-variance tradeoff in the context of linear regression.
Show Answer
OLS linear regression has low bias (it's unbiased โ Gauss-Markov theorem) but can have high variance when there are many features or multicollinearity (coefficients become unstable). Ridge regression introduces a small bias (by shrinking coefficients) but dramatically reduces variance, often giving better test performance. The bias-variance tradeoff is: Total Error = Biasยฒ + Variance + Irreducible Noise. OLS minimizes bias; regularized methods optimize the tradeoff.
Q7. What is the relationship between linear regression and a single-neuron neural network?
Show Answer
A single neuron without activation function IS linear regression: output = wTx + b. With a sigmoid activation, it becomes logistic regression. With a ReLU activation, it's a "half-linear" model. Deep neural networks are essentially stacks of these linear transformations with non-linear activations. Understanding linear regression deeply is understanding the building block of all neural networks.
Q8. How would you handle outliers in a regression problem?
Show Answer
Approaches: (1) Detection โ IQR method, Z-score, Cook's distance (leverage plot); (2) Removal โ Only if they're data errors, not genuine extreme values; (3) Robust regression โ Use Huber loss or RANSAC which downweight outliers; (4) Transform โ Log-transform heavy-tailed targets; (5) Use MAE instead of MSE as the loss function (less sensitive to outliers); (6) Winsorize โ Cap extreme values at the 5th/95th percentile. The best approach depends on whether outliers are signal or noise.
Q9. You have a linear regression model with Rยฒ = 0.95 on training data but Rยฒ = 0.40 on test data. What happened and how do you fix it?
Show Answer
This is overfitting. The model memorized training data but fails to generalize. Likely causes: too many features relative to samples (high p/n ratio), polynomial features of high degree, or multicollinearity. Fixes: (1) Regularization โ Ridge (L2) or LASSO (L1); (2) Feature selection โ remove irrelevant features; (3) Cross-validation โ use k-fold CV instead of train/test split; (4) More data โ reduce variance; (5) Reduce polynomial degree; (6) Check for data leakage (test data information accidentally in training).
Q10. Explain feature engineering for linear regression. Give 3 examples.
Show Answer
Feature engineering creates new informative features: (1) Interaction terms โ area ร bedrooms might capture "spacious bedrooms" effect; (2) Polynomial features โ xยฒ captures non-linear relationships while keeping the model "linear in parameters"; (3) Log transform โ log(income) linearizes exponential relationships; (4) Binning โ convert continuous age to age groups; (5) One-hot encoding โ convert categorical city to binary features; (6) Domain-specific โ price_per_sqft = price/area normalizes for size differences.
Q11. What happens to linear regression if features are perfectly multicollinear?
Show Answer
If features are perfectly multicollinear (e.g., xโ = 2xโ + xโ), then XTX is singular (not invertible), and the Normal Equation has no unique solution. There are infinitely many weight vectors that achieve the same MSE. NumPy's pinv (pseudoinverse) will return one solution, but coefficients will be unreliable. GD may converge but to an arbitrary point in the solution space. Fix: use regularization, drop redundant features, or use PCA.
Research Problems
๐ฌ Research Problem 1: Adaptive Learning Rate for Indian Agricultural Data
Problem: Indian agricultural data has high variance across states (Punjab vs. Rajasthan) and years (monsoon vs. drought). Standard learning rate schedules (step decay, cosine annealing) are designed for i.i.d. data. Design and implement an adaptive learning rate schedule that accounts for heterogeneity in agricultural data.
Suggested approach: Implement per-feature learning rates based on the curvature of the loss landscape (second-order information). Compare with Adam, AdaGrad, and RMSprop. Evaluate on IMD+ICAR datasets across 5 crop types and 10 states.
Expected outcome: A novel learning rate schedule that converges 2-3x faster on heterogeneous agricultural data compared to fixed learning rate.
๐ฌ Research Problem 2: Robust Regression for Noisy Urban Data
Problem: Indian urban datasets (real estate, traffic, pollution) contain extreme outliers and non-Gaussian noise. Standard MSE-based regression is highly sensitive to outliers. Investigate combinations of Huber loss, quantile regression, and RANSAC for Indian city datasets.
Research questions: (1) What is the optimal Huber ฮด parameter for Indian real estate data? (2) Does the optimal ฮด vary by city tier? (3) Can you derive a closed-form estimator for Huber regression? (4) Compare computational cost vs. robustness gain.
๐ฌ Research Problem 3: Interpretable Regression for Policy
Problem: Government policies in India (MSP pricing, subsidy allocation, tax rates) require interpretable models. But linear regression's assumption of linearity is often violated. Investigate Generalized Additive Models (GAMs) as a middle ground: y = ฮฒโ + fโ(xโ) + fโ(xโ) + ... where each f is a smooth non-linear function.
Questions: (1) How much accuracy do GAMs gain over linear regression on Indian policy datasets? (2) Are GAMs interpretable enough for Parliamentary scrutiny? (3) Implement a simple GAM using basis splines and compare with linear regression and random forest on 3 government datasets.
Key Takeaways
References & Further Reading
Textbooks
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer. Chapters 3-4. [Free online: https://hastie.su.domains/ElemStatLearn/]
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 3.
- Gรฉron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed. O'Reilly. Chapter 4.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 11.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd ed. Springer. Chapter 3. [Free online: https://www.statlearning.com/]
Research Papers
- Legendre, A.-M. (1805). Nouvelles mรฉthodes pour la dรฉtermination des orbites des comรจtes. Paris. [Original least squares paper]
- Gauss, C. F. (1809). Theoria Motus Corporum Coelestium. [Gaussian noise + MLE connection]
- Hoerl, A. E., & Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1), 55-67.
- Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. JRSS-B, 58(1), 267-288.
- Harrison, D., & Rubinfeld, D. L. (1978). Hedonic Housing Prices and the Demand for Clean Air. Journal of Environmental Economics.
Indian Data Sources
- data.gov.in โ Government of India Open Data Platform [Housing, Agriculture, Weather datasets]
- IMD (mausam.imd.gov.in) โ India Meteorological Department historical rainfall data
- ICAR (icar.org.in) โ Indian Council of Agricultural Research crop data
- NHB (nhb.org.in) โ National Housing Bank RESIDEX (Real Estate Price Index)
- Kaggle: "Indian Houses Dataset", "Indian Real Estate Analysis", "IMD Rainfall Data"
Online Resources
- Scikit-Learn Documentation: sklearn.linear_model.LinearRegression
- TensorFlow Tutorials: Basic regression (tensorflow.org/tutorials)
- Stanford CS229 Lecture Notes: Andrew Ng's Linear Regression notes
- 3Blue1Brown: "The Essence of Linear Algebra" (YouTube)
- StatQuest with Josh Starmer: "Linear Regression" series (YouTube)
Ethical References
- Carlisle, M. (2020). "Racist data destruction? On the Boston Housing dataset." Medium article explaining why sklearn deprecated load_boston().
- O'Neil, C. (2016). Weapons of Math Destruction. Crown Books. [On biased models in housing, credit, and policing]
PART II: SUPERVISED LEARNING