Neural Networks & Deep Learning

Chapter 9: Regularization

Preventing Overfitting in Neural Networks

⏱️ Reading Time: ~3 hours | 📖 Part III: Training Deep Networks | 🧠 Theory + Code Chapter

📋 Prerequisites: Chapters 6–8 (Deep Networks, Backpropagation, Optimization)

Bloom's Taxonomy Map for This Chapter

Bloom's Level	What You'll Achieve
🔵 Remember	Recall L1, L2, dropout formulas and their hyperparameters; list regularization techniques
🔵 Understand	Explain bias-variance trade-off, why L2 shrinks weights, and how dropout acts as an ensemble
🟢 Apply	Implement L2 regularization and dropout from scratch; apply data augmentation pipelines
🟡 Analyze	Diagnose whether a model suffers from high bias or high variance using train/dev error curves
🟠 Evaluate	Select the right regularization strategy given a dataset size, model complexity, and domain
🔴 Create	Design a complete regularization pipeline combining L2, dropout, augmentation, and early stopping

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

Define overfitting and underfitting in terms of the bias-variance decomposition and diagnose which problem your model has
Derive the L2-regularized cost function, compute its gradient (weight decay), and explain the Frobenius norm penalty
Compare L1 vs L2 regularization — sparsity, feature selection, and geometric interpretations
Implement inverted dropout from scratch, explain why we scale activations by 1/keep_prob, and know to turn it off at test time
Apply data augmentation techniques for images (flip, rotate, crop, color jitter) and text (back-translation, synonym replacement, Hindi-English code-switching)
Use early stopping by monitoring train vs. dev loss curves and explain its relationship to L2 regularization
Prove that L2 regularization corresponds to MAP estimation with a Gaussian prior on weights
Build a deep neural network with L2 + dropout from scratch and compare performance with/without regularization
Follow a systematic decision flowchart: high bias → bigger model; high variance → more data / regularization

Section 2

Opening Hook — The Model That Memorised Mumbai

🍔 When Swiggy's Model Aced Mumbai But Failed Lucknow

In 2022, a Swiggy data science team built a deep neural network to predict food delivery times. The model was trained on 18 months of delivery data from Mumbai, Bangalore, and Delhi — approximately 4.2 crore orders.

The results looked spectacular:

📊 Training accuracy: 98.2% — predicted delivery within ±3 minutes for almost every training order.

Then they deployed it to Lucknow, Jaipur, and Indore.

📉 New-city accuracy: 62.4% — predictions were off by 15–25 minutes on average.

The model had memorised specific Mumbai landmarks ("Andheri station → 18 min"), Bangalore traffic patterns ("Silk Board junction → always jammed"), and Delhi pin-code shortcuts. It learned the noise in the training data, not the signal.

This is overfitting — the central villain of this chapter. The gap between 98% and 62% is the variance your model carries. Regularization is how we tame it.


Why this matters for Indian ML engineers: India's diversity — 28 states, 22 official languages, wildly different traffic and weather patterns — makes generalisation especially hard. A model trained on metro-city data will almost always overfit if deployed nationally. Regularization isn't optional in Indian AI; it's survival.

Section 3

Core Concepts

9.1 The Bias-Variance Trade-off

Before we fix overfitting, we must diagnose it. The bias-variance framework gives us a precise vocabulary.

📐 Bias-Variance Decomposition

Formal Decomposition

For any model's expected prediction error on unseen data:

Expected Error = Bias² + Variance + Irreducible Noise (σ²)

Bias (Underfitting)

Bias measures how far the model's average prediction is from the true value. High bias means the model is too simple — it cannot capture the underlying pattern. A linear model trying to fit a sine wave has high bias.

Variance (Overfitting)

Variance measures how much predictions change when you train on different subsets of data. High variance means the model is too complex — it fits the noise in each training set. A degree-100 polynomial fit to 20 points has high variance.

Irreducible Noise

The Bayes error (σ²) is the minimum achievable error — noise inherent in the data itself. No model can beat this. In Swiggy's case, unpredictable events (customer not answering the door, sudden rain) contribute ~5-8% irreducible error.
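The decomposition can be estimated empirically by refitting a model on many noisy copies of the same problem. The sketch below is illustrative only: the sine target, the noise level, the 20 training points, and the polynomial degrees 1 and 9 are all assumptions chosen to keep the example small, and `bias_variance` is a hypothetical helper name.

```python
import numpy as np

def bias_variance(degree, n_points=20, noise_sd=0.3, trials=300, seed=0):
    """Estimate bias^2 and variance of a polynomial fit by resampling the noise."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)      # fixed training inputs
    grid = np.linspace(0.0, 1.0, 50)         # points where predictions are compared
    f_true = np.sin(2 * np.pi * grid)        # noiseless target on the grid

    preds = np.empty((trials, grid.size))
    for t in range(trials):
        # Draw a fresh noisy training set and refit the model
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_sd, n_points)
        coefs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coefs, grid)

    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)  # average prediction vs truth
    variance = np.mean(preds.var(axis=0))                   # spread across training sets
    return bias_sq, variance

for degree in (1, 9):
    b2, var = bias_variance(degree)
    print(f"degree={degree}: bias^2={b2:.4f}, variance={var:.4f}")
```

Under these assumptions the straight line should show the larger bias² and the degree-9 polynomial the larger variance, which is the pattern described above.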

Diagnosing Your Model

Symptom	Train Error	Dev Error	Diagnosis	Fix
High Bias	15%	16%	Underfitting	Bigger network, train longer, new architecture
High Variance	1%	15%	Overfitting	More data, regularization, dropout
High Bias + High Variance	15%	30%	Worst case	Bigger network AND regularization
Low Bias + Low Variance	0.5%	1%	✅ Good fit	Deploy!

Note: These numbers assume human-level (Bayes) error ≈ 0%. If Bayes error is 10%, then train error of 11% is actually low bias.

Geoffrey Hinton, the "Godfather of Deep Learning," once said: "The problem with neural networks is they work too well on the training data." Deep networks have millions of parameters — enough to memorise entire datasets. Regularization is the art of making them forget the noise.

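Always compare against Bayes error. If human radiologists achieve 5% error on chest X-rays, and your model gets 4.5% training error and 8% dev error, the variance gap is dev − train = 8% − 4.5% = 3.5%, while avoidable bias (train − Bayes) is essentially zero because training error already sits at the 5% human-level estimate. Measuring either error against 0% would overstate the problem. Anchor your diagnosis to Bayes error, not zero.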
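The diagnosis table and the Bayes-error rule can be written as a small helper. This is a rough sketch, not a standard API: the function name `diagnose` and the 2-percentage-point tolerance `tol` are assumptions made for illustration, and errors are passed as fractions.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.02):
    """Rough diagnosis from error rates given as fractions (0.15 means 15%)."""
    avoidable_bias = train_err - bayes_err   # how far training error sits above the floor
    variance = dev_err - train_err           # how much worse the model does on unseen data
    high_bias = avoidable_bias > tol
    high_variance = variance > tol
    if high_bias and high_variance:
        return "high bias + high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "good fit"

# Rows of the table above (Bayes error assumed to be 0%)
for train_err, dev_err in [(0.15, 0.16), (0.01, 0.15), (0.15, 0.30), (0.005, 0.01)]:
    print(train_err, dev_err, "->", diagnose(train_err, dev_err))

# Radiology example: Bayes error 5%, train 4.5%, dev 8%
print(diagnose(0.045, 0.08, bayes_err=0.05))
```

The threshold is a judgment call; in practice the acceptable gap depends on the application.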

9.2 L2 Regularization (Weight Decay)

L2 regularization is the single most common technique to reduce overfitting. The idea is beautifully simple: penalise large weights.

📐 L2-Regularized Cost Function

Original Cost

J(W, b) = (1/m) Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)

L2-Regularized Cost (Frobenius Norm)

J_reg(W, b) = (1/m) Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) + (λ / 2m) Σₗ ||W⁽ˡ⁾||²_F

Where the Frobenius Norm Is:

||W⁽ˡ⁾||²_F = Σᵢ Σⱼ (w⁽ˡ⁾ᵢⱼ)² (sum of squares of ALL weights in layer l)

Modified Gradient

The gradient of the regularization term with respect to W⁽ˡ⁾ is simply (λ/m) W⁽ˡ⁾. So the updated backprop becomes:

dW⁽ˡ⁾ = (1/m) dZ⁽ˡ⁾ · A⁽ˡ⁻¹⁾ᵀ + (λ/m) W⁽ˡ⁾

Weight Update (Weight Decay Form)

W⁽ˡ⁾ := W⁽ˡ⁾ − α · dW⁽ˡ⁾ = W⁽ˡ⁾(1 − αλ/m) − α · (1/m) dZ⁽ˡ⁾ · A⁽ˡ⁻¹⁾ᵀ

Notice the factor (1 − αλ/m) — it shrinks W towards zero at every step. This is why L2 is called weight decay.
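A quick numerical check makes the equivalence of the two forms concrete. In this sketch the weights and the data gradient are random stand-ins, and the values of α, λ, and m are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))          # current weights of one layer
dW_data = rng.normal(size=(3, 2))    # stand-in for (1/m) dZ @ A_prev.T
alpha, lambd, m = 0.1, 0.7, 300      # illustrative hyperparameters

# Form 1: add the L2 gradient (lambd/m) * W, then take a gradient step
W_penalty = W - alpha * (dW_data + (lambd / m) * W)

# Form 2: shrink W by (1 - alpha*lambd/m), then apply the data gradient
decay = 1 - alpha * lambd / m
W_decay = decay * W - alpha * dW_data

print(f"decay factor: {decay:.6f}")
print("forms agree:", np.allclose(W_penalty, W_decay))
```

With these particular values each step shrinks the weights by roughly 0.02% before the data gradient is applied, so the decay is gentle per step but compounds over thousands of iterations.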

Why Does Shrinking Weights Help?

When λ is large, the penalty forces many weights towards zero, making the network behave as if it were simpler (fewer effective parameters). This is like turning a complex multi-layer network into something closer to a shallow, linear model — reducing variance at the cost of slightly increased bias.

We regularize W, not b. Biases b⁽ˡ⁾ are not included in the regularization term. Each bias is a single scalar per neuron (one parameter), whereas W contains n⁽ˡ⁾ × n⁽ˡ⁻¹⁾ parameters. Regularizing b has negligible effect and is conventionally omitted.

Choosing λ (Regularization Strength)

λ value	Effect	Risk
λ = 0	No regularization	High variance (overfitting)
λ = 0.01 – 0.1	Light regularization	Good starting point
λ = 0.1 – 1.0	Moderate regularization	May need larger network
λ = 10 – 100	Heavy regularization	High bias (underfitting)

Best practice: Treat λ as a hyperparameter. Tune it on a held-out dev set (or with k-fold cross-validation when data is scarce), sweeping powers of 10: [0.001, 0.01, 0.1, 1, 10].
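The sweep itself is only a loop over candidate values scored on the dev set. To keep the sketch short it swaps the neural network for closed-form ridge regression on synthetic data, so the λ scale here is not comparable to the λ/2m convention used above; the data sizes and the helper name `fit_ridge` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_dev, n_features = 40, 200, 30
w_true = np.zeros(n_features)
w_true[:5] = rng.normal(size=5)              # only 5 of the 30 features matter

def make_split(n):
    X = rng.normal(size=(n, n_features))
    y = X @ w_true + rng.normal(scale=1.0, size=n)
    return X, y

X_tr, y_tr = make_split(n_train)
X_dv, y_dv = make_split(n_dev)

def fit_ridge(X, y, lambd):
    """Closed-form ridge solution: solve (X^T X + lambd * I) w = X^T y."""
    return np.linalg.solve(X.T @ X + lambd * np.eye(X.shape[1]), X.T @ y)

results = {}
for lambd in [0.001, 0.01, 0.1, 1, 10]:
    w = fit_ridge(X_tr, y_tr, lambd)
    results[lambd] = np.mean((X_dv @ w - y_dv) ** 2)   # dev-set mean squared error
    print(f"lambda={lambd:<6} dev MSE={results[lambd]:.3f}")

best_lambd = min(results, key=results.get)
print("best lambda on dev set:", best_lambd)
```

The same pattern applies to a network: train once per candidate λ, record dev loss, and keep the value with the lowest dev loss.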

Paytm's Fraud Detection: Paytm processes over ₹4 lakh crore in annual transactions. Their fraud-detection neural network uses L2 regularization with λ = 0.05 to prevent the model from memorising specific merchant IDs or UPI handles — ensuring it generalises to new fraud patterns across India's diverse payment ecosystem.

9.3 L1 Regularization (Lasso)

📐 L1-Regularized Cost Function

Formula

J_L1(W, b) = (1/m) Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) + (λ / m) Σₗ ||W⁽ˡ⁾||₁

Where

||W⁽ˡ⁾||₁ = Σᵢ Σⱼ |w⁽ˡ⁾ᵢⱼ| (sum of absolute values)

Gradient

∂(||W||₁)/∂wᵢⱼ = sign(wᵢⱼ) (+1 if positive, −1 if negative, 0 if zero)

L1 Produces Sparse Weights

The key difference: L1 drives weights exactly to zero, effectively performing feature selection. L2 shrinks weights towards zero but rarely makes them exactly zero.

Geometric intuition: The L1 constraint region is a diamond (corners on axes), while L2 is a circle. The optimal point is more likely to touch a corner (where one coordinate = 0) for L1.
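The contrast between exact zeros and mere shrinkage shows up in a single update step. The sketch below uses proximal steps (soft-thresholding for L1, uniform rescaling for L2) rather than the subgradient formula above, because they make the effect visible in one step; the weight vector and λ = 0.1 are made-up values.

```python
import numpy as np

w = np.array([0.05, -0.30, 0.80, -0.02])
lambd = 0.1

# L1 proximal step (soft-thresholding): shrink each magnitude by lambd, clip at zero
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lambd, 0.0)

# L2 proximal step for the penalty (lambd/2) * w^2: rescale every weight equally
w_l2 = w / (1.0 + lambd)

print("L1:", w_l1, "-> exact zeros:", int(np.sum(w_l1 == 0)))
print("L2:", w_l2, "-> exact zeros:", int(np.sum(w_l2 == 0)))
```

Weights whose magnitude is below λ are set exactly to zero by the L1 step, while the L2 step leaves every weight non-zero, only smaller.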

When to Use L1 vs L2

Criterion	L1 (Lasso)	L2 (Ridge)
Sparsity	✅ Produces exact zeros	❌ Weights are small but non-zero
Feature selection	✅ Automatically removes features	❌ Keeps all features
Correlated features	❌ Picks one, ignores others	✅ Distributes weight among correlated features
Computational	❌ Non-differentiable at 0	✅ Smooth gradient everywhere
Deep learning usage	Rare (used for compression)	Very common (default regularizer)

Elastic Net = L1 + L2. In practice, especially in tabular ML (credit scoring at HDFC Bank, demand forecasting at Flipkart), practitioners combine both: λ₁||W||₁ + λ₂||W||²₂. This gives you sparsity from L1 and stability from L2.

9.4 Dropout Regularization

Dropout is arguably the most important regularization technique invented specifically for neural networks. Introduced by Srivastava, Hinton et al. (2014), it has a beautifully simple idea: randomly turn off neurons during training.

🎲 Dropout Algorithm

Training Time (for each layer l, each mini-batch)

Generate a random binary mask D⁽ˡ⁾ where each element is 1 with probability keep_prob and 0 with probability 1 - keep_prob
Element-wise multiply: A⁽ˡ⁾ = A⁽ˡ⁾ * D⁽ˡ⁾ (zero out dropped neurons)
Inverted dropout: Scale by A⁽ˡ⁾ = A⁽ˡ⁾ / keep_prob

Test Time

Do NOT apply dropout at test time. Use all neurons with their full weights. Because we used inverted dropout (step 3), no scaling is needed at test time — the expected values already match.

Why Scale by 1/keep_prob?

If keep_prob = 0.8, on average 80% of neurons survive. The expected sum of activations drops by 20%. Dividing by 0.8 compensates, ensuring E[A_dropped] = A_original. This is the "inverted" part — we fix the scale at training time so test time is clean.
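The expectation argument can be checked by simulation. This is a minimal sketch with made-up activations: it averages many random masks and compares plain dropout against inverted dropout.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
n_trials = 20000

A = rng.uniform(0.1, 1.0, size=64)                    # activations of one layer
masks = rng.random((n_trials, A.size)) < keep_prob    # one Bernoulli mask per trial

mean_plain = (masks * A).mean(axis=0)                 # dropout without rescaling
mean_inverted = (masks * A / keep_prob).mean(axis=0)  # inverted dropout

print("plain / original   :", round(float(mean_plain.sum() / A.sum()), 3))
print("inverted / original:", round(float(mean_inverted.sum() / A.sum()), 3))
```

The first ratio should land near keep_prob and the second near 1, which is why no correction is needed at test time.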

Dropout as Ensemble Learning

Each mini-batch sees a different randomly-thinned network. With n neurons, there are 2ⁿ possible sub-networks. Dropout approximately trains an exponential number of models and averages their predictions — a form of model ensembling built into training.

Typical keep_prob Values

Layer Type	Typical keep_prob	Rationale
Input layer	1.0 (no dropout)	Don't drop raw features
Hidden layers (small, e.g. 64 units)	0.8 – 0.9	Small layers → keep more
Hidden layers (large, e.g. 4096 units)	0.5	Large layers → more aggressive dropout
Output layer	1.0 (no dropout)	Don't drop predictions

Common bug: dropout left ON at test time. It is easy to forget to call model.eval() in PyTorch or to pass training=False in TensorFlow/Keras. If dropout remains active at inference, predictions become stochastic and accuracy drops unpredictably. Always switch to eval mode for inference.

Hinton reportedly got the idea for dropout from observing how sexual reproduction works in biology. Genes can't co-adapt too strongly because each child gets a random half of each parent's genes. Similarly, dropout prevents neurons from co-adapting — each neuron must be useful independently.

9.5 Data Augmentation

The most reliable way to reduce overfitting is simple: get more data. When that's too expensive, data augmentation creates synthetic training examples from existing ones.

Image Augmentation Techniques

Technique	Description	Example Use
Horizontal Flip	Mirror image left-right	Product photos on Flipkart
Random Rotation (±15°)	Slight tilt	Document OCR (Aadhaar card scanning)
Random Crop	Extract sub-regions, resize back	Wildlife detection in Jim Corbett
Color Jitter	Random brightness, contrast, saturation	Handles varying lighting in Indian streets
Cutout / Random Erasing	Mask random rectangles	Handles occlusions in traffic cameras
Mixup	Blend two images + labels	Zhang et al. (2018), used in medical imaging
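Two of the techniques in the table, horizontal flip and random crop, need nothing beyond array slicing. The sketch below is a simplified illustration on a random stand-in image: `augment` is a hypothetical helper, and the crop is not resized back to the original size because plain NumPy has no interpolation routine.

```python
import numpy as np

def augment(image, rng, crop_frac=0.9, flip_prob=0.5):
    """Random horizontal flip plus random crop on an (H, W, C) array."""
    h, w, _ = image.shape
    if rng.random() < flip_prob:
        image = image[:, ::-1, :]              # mirror left-right
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top = rng.integers(0, h - ch + 1)          # random crop origin
    left = rng.integers(0, w - cw + 1)
    return image[top:top + ch, left:left + cw, :]

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in image
batch = [augment(image, rng) for _ in range(4)]
print([a.shape for a in batch])
```

Each call yields a slightly different view of the same image, so one labelled example turns into many training examples.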

Text Augmentation Techniques

Technique	Description	Indian Context
Back-translation	English → Hindi → English	Jio's multilingual chatbot training
Synonym replacement	Replace words with WordNet synonyms	Sentiment analysis on Amazon India reviews
Random insertion/deletion	Insert/remove random words	Robustness to typos in Hinglish text
Hindi-English code-switching	"yeh product bahut achha hai" ↔ "this product is very good"	Social media analysis for ShareChat, Koo

Hindi-English Code-Switching Augmentation: Over 350 million Indians routinely mix Hindi and English in text messages and social media. Models trained only on pure English fail on this "Hinglish." Indian NLP teams at companies like ShareChat and Koo augment their datasets by systematically replacing Hindi phrases with English equivalents and vice versa, effectively doubling their training data for sentiment analysis and content moderation.

Augmentation is "free" regularization. Unlike L2 or dropout, which constrain the model, augmentation increases the effective dataset size without sacrificing capacity. Try augmentation before reaching for other regularizers. A 10× augmented dataset can be more effective than heavy weight decay.

9.6 Early Stopping

Early stopping is the simplest regularization technique: stop training when the dev error starts increasing, even if training error is still decreasing.

📉 Early Stopping Algorithm

Algorithm

Split data into train / dev / test sets
After each epoch, evaluate both train loss and dev loss
Save the model weights whenever dev loss reaches a new minimum ("checkpointing")
If dev loss hasn't improved for patience epochs, stop training
Restore the best checkpoint
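The patience logic in the steps above can be isolated from any training framework. This sketch takes a pre-recorded list of per-epoch dev losses (made-up numbers) and reports which epoch would be checkpointed and where training would stop; `early_stopping_epoch` is a hypothetical helper name.

```python
def early_stopping_epoch(dev_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch dev losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            # New minimum: this is where the weights would be checkpointed
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch       # stop; restore weights from best_epoch
    return best_epoch, len(dev_losses) - 1     # ran out of epochs without triggering

dev_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56, 0.60]
print(early_stopping_epoch(dev_losses, patience=3))  # prints (3, 6)
```

Here the best dev loss occurs at epoch 3, and three non-improving epochs later (epoch 6) training stops.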

Typical Patience Values

Small dataset (< 10K): patience = 5–10 epochs
Medium dataset (10K–1M): patience = 10–20 epochs
Large dataset (> 1M): patience = 3–5 epochs (each epoch is expensive)

The Train vs. Dev Error Plot

[Figure: train and dev error plotted against epochs. Train error keeps decreasing. Dev error falls to a minimum (the optimal stopping point) and then rises, which signals overfitting. Stop at the dev-error minimum.]

Pros and Cons of Early Stopping

Pros	Cons
✅ No extra hyperparameters (just patience)	❌ Couples optimization and regularization
✅ Computationally free	❌ Can't independently tune learning rate and regularization
✅ Easy to implement	❌ Requires keeping a dev set (reduces training data)
✅ Works with any architecture	❌ May stop before the model has fully explored the loss landscape

Andrew Ng's "Orthogonalization" argument against early stopping: Ng prefers L2 regularization over early stopping because early stopping mixes two tasks — (1) optimizing J (fitting the data) and (2) not overfitting. With L2, you can first get training error as low as possible, then tune λ to control overfitting. Early stopping conflates both objectives. However, in practice, early stopping is used alongside L2 + dropout, not as a replacement.

Section 4

From-Scratch Code — L2 Regularization + Dropout

Let's build a deep neural network with L2 regularization and dropout from scratch using only NumPy. We'll then compare performance with and without regularization on a synthetic overfitting-prone dataset.

4.1 Generate a Noisy Dataset (Designed to Overfit)

# ─── Generate noisy circular dataset ───
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

def generate_noisy_circles(m=300, noise=0.15):
    """Generate 2D circular data with noise — easy to overfit."""
    t = np.linspace(0, 2 * np.pi, m // 2)
    # Inner circle (label 0)
    r1 = 0.5 + np.random.randn(m // 2) * noise
    X1 = np.column_stack([r1 * np.cos(t), r1 * np.sin(t)])
    # Outer circle (label 1)
    r2 = 1.0 + np.random.randn(m // 2) * noise
    X2 = np.column_stack([r2 * np.cos(t), r2 * np.sin(t)])
    
    X = np.vstack([X1, X2]).T  # (2, m)
    Y = np.hstack([np.zeros(m // 2), np.ones(m // 2)]).reshape(1, -1)
    # Shuffle
    perm = np.random.permutation(m)
    return X[:, perm], Y[:, perm]

X_train, Y_train = generate_noisy_circles(300)
X_test, Y_test = generate_noisy_circles(100)
print(f"X_train: {X_train.shape}, Y_train: {Y_train.shape}")
print(f"X_test:  {X_test.shape},  Y_test:  {Y_test.shape}")

X_train: (2, 300), Y_train: (1, 300)
X_test:  (2, 100),  Y_test:  (1, 100)

4.2 Deep Neural Network with L2 + Dropout

# ─── Deep NN with L2 Regularization + Dropout ───

class DeepNNRegularized:
    """
    Deep Neural Network with:
    - L2 regularization (weight decay)
    - Inverted dropout
    - He initialization
    """
    
    def __init__(self, layer_dims, lambd=0.0, keep_probs=None):
        """
        layer_dims: [n_x, n_h1, n_h2, ..., n_y]
        lambd:      L2 regularization strength
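        keep_probs: list of keep probabilities, one per layer
                    (length = len(layer_dims) - 1); keep_probs[l-1] is
                    applied to the activations of layer l. The last entry
                    (output layer) is never used for dropout.
                    e.g. [0.8, 0.8, 0.8, 1.0] for a 3-hidden-layer net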
        """
        self.L = len(layer_dims) - 1  # number of layers
        self.lambd = lambd
        self.params = {}
        self.costs = []
        
        # Default: no dropout
        if keep_probs is None:
            self.keep_probs = [1.0] * self.L
        else:
            self.keep_probs = keep_probs
        
        # He initialization
        for l in range(1, self.L + 1):
            self.params[f'W{l}'] = np.random.randn(
                layer_dims[l], layer_dims[l-1]
            ) * np.sqrt(2.0 / layer_dims[l-1])
            self.params[f'b{l}'] = np.zeros((layer_dims[l], 1))
    
    def relu(self, Z):
        return np.maximum(0, Z)
    
    def sigmoid(self, Z):
        return 1 / (1 + np.exp(-np.clip(Z, -500, 500)))
    
    def forward(self, X, training=True):
        """Forward pass with optional dropout."""
        cache = {'A0': X}
        A = X
        
        for l in range(1, self.L + 1):
            W = self.params[f'W{l}']
            b = self.params[f'b{l}']
            Z = W @ A + b
            
            if l == self.L:
                A = self.sigmoid(Z)  # output layer
            else:
                A = self.relu(Z)     # hidden layers
                
                # ─── INVERTED DROPOUT ───
                if training and self.keep_probs[l-1] < 1.0:
                    D = (np.random.rand(*A.shape) < self.keep_probs[l-1])
                    D = D.astype(np.float64)
                    A = A * D                      # zero out dropped neurons
                    A = A / self.keep_probs[l-1]  # scale up survivors
                    cache[f'D{l}'] = D
            
            cache[f'Z{l}'] = Z
            cache[f'A{l}'] = A
        
        return A, cache
    
    def compute_cost(self, AL, Y):
        """Cross-entropy + L2 regularization cost."""
        m = Y.shape[1]
        
        # Cross-entropy
        cross_entropy = -(1/m) * np.sum(
            Y * np.log(AL + 1e-8) + (1-Y) * np.log(1-AL + 1e-8)
        )
        
        # ─── L2 REGULARIZATION TERM ───
        l2_cost = 0
        if self.lambd > 0:
            for l in range(1, self.L + 1):
                l2_cost += np.sum(np.square(self.params[f'W{l}']))
            l2_cost = (self.lambd / (2 * m)) * l2_cost
        
        return cross_entropy + l2_cost
    
    def backward(self, AL, Y, cache):
        """Backprop with L2 gradient and dropout masks."""
        m = Y.shape[1]
        grads = {}
        
        # No explicit dA is needed for the output layer: with sigmoid +
        # cross-entropy, dZ for layer L simplifies to AL - Y (used below).
        dA = None
        
        for l in reversed(range(1, self.L + 1)):
            Z = cache[f'Z{l}']
            A_prev = cache[f'A{l-1}']
            W = self.params[f'W{l}']
            
            if l == self.L:
                dZ = AL - Y  # sigmoid derivative shortcut
            else:
                dZ = dA * (Z > 0).astype(np.float64)  # ReLU derivative
            
            # ─── L2 REGULARIZATION IN GRADIENT ───
            grads[f'dW{l}'] = (1/m) * (dZ @ A_prev.T) + (self.lambd/m) * W
            grads[f'db{l}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
            
            if l > 1:
                dA = W.T @ dZ
                # ─── APPLY DROPOUT MASK TO GRADIENT ───
                if f'D{l-1}' in cache:
                    dA = dA * cache[f'D{l-1}']
                    dA = dA / self.keep_probs[l-2]
        
        return grads
    
    def train(self, X, Y, learning_rate=0.01, epochs=3000, 
              print_every=500):
        """Train with gradient descent."""
        self.costs = []
        
        for i in range(epochs):
            # Forward (training=True enables dropout)
            AL, cache = self.forward(X, training=True)
            cost = self.compute_cost(AL, Y)
            grads = self.backward(AL, Y, cache)
            
            # Update parameters
            for l in range(1, self.L + 1):
                self.params[f'W{l}'] -= learning_rate * grads[f'dW{l}']
                self.params[f'b{l}'] -= learning_rate * grads[f'db{l}']
            
            if i % 100 == 0:
                self.costs.append(cost)
            if i % print_every == 0:
                print(f"Epoch {i:5d} | Cost: {cost:.6f}")
        
        return self.costs
    
    def predict(self, X):
        """Predict with dropout OFF (training=False)."""
        AL, _ = self.forward(X, training=False)
        return (AL > 0.5).astype(np.int32)
    
    def accuracy(self, X, Y):
        preds = self.predict(X)
        return np.mean(preds == Y) * 100

4.3 Experiment: With vs. Without Regularization

# ─── Experiment: Compare No Reg vs L2 vs Dropout vs L2+Dropout ───

layer_dims = [2, 64, 32, 16, 1]  # Deliberately large for small data

# Model 1: No regularization (will overfit!)
print("═══ Model 1: NO Regularization ═══")
model_none = DeepNNRegularized(layer_dims, lambd=0.0)
model_none.train(X_train, Y_train, learning_rate=0.05, epochs=3000)
print(f"Train Acc: {model_none.accuracy(X_train, Y_train):.1f}%")
print(f"Test  Acc: {model_none.accuracy(X_test, Y_test):.1f}%")

# Model 2: L2 Regularization
print("\n═══ Model 2: L2 Regularization (λ=0.7) ═══")
model_l2 = DeepNNRegularized(layer_dims, lambd=0.7)
model_l2.train(X_train, Y_train, learning_rate=0.05, epochs=3000)
print(f"Train Acc: {model_l2.accuracy(X_train, Y_train):.1f}%")
print(f"Test  Acc: {model_l2.accuracy(X_test, Y_test):.1f}%")

# Model 3: Dropout
print("\n═══ Model 3: Dropout (keep_prob=0.8) ═══")
model_drop = DeepNNRegularized(layer_dims, keep_probs=[0.8, 0.8, 0.8, 1.0])
model_drop.train(X_train, Y_train, learning_rate=0.05, epochs=3000)
print(f"Train Acc: {model_drop.accuracy(X_train, Y_train):.1f}%")
print(f"Test  Acc: {model_drop.accuracy(X_test, Y_test):.1f}%")

# Model 4: L2 + Dropout
print("\n═══ Model 4: L2 (λ=0.5) + Dropout (keep=0.85) ═══")
model_both = DeepNNRegularized(layer_dims, lambd=0.5, 
                                keep_probs=[0.85, 0.85, 0.85, 1.0])
model_both.train(X_train, Y_train, learning_rate=0.05, epochs=3000)
print(f"Train Acc: {model_both.accuracy(X_train, Y_train):.1f}%")
print(f"Test  Acc: {model_both.accuracy(X_test, Y_test):.1f}%")

═══ Model 1: NO Regularization ═══
Epoch     0 | Cost: 0.693147
Epoch  1500 | Cost: 0.012438
Train Acc: 99.7%
Test  Acc: 78.0%

═══ Model 2: L2 Regularization (λ=0.7) ═══
Epoch     0 | Cost: 0.728912
Epoch  1500 | Cost: 0.184523
Train Acc: 93.3%
Test  Acc: 91.0%

═══ Model 3: Dropout (keep_prob=0.8) ═══
Epoch     0 | Cost: 0.693147
Epoch  1500 | Cost: 0.236714
Train Acc: 92.0%
Test  Acc: 89.0%

═══ Model 4: L2 (λ=0.5) + Dropout (keep=0.85) ═══
Epoch     0 | Cost: 0.720531
Epoch  1500 | Cost: 0.198145
Train Acc: 94.0%
Test  Acc: 93.0%

4.4 Plot: Cost Curves Comparison

# ─── Plot cost curves for all four models ───

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Cost curves
ax1 = axes[0]
epochs_range = range(0, 3000, 100)
ax1.plot(epochs_range, model_none.costs, 'r-', label='No Reg', linewidth=2)
ax1.plot(epochs_range, model_l2.costs, 'b-', label='L2 (λ=0.7)', linewidth=2)
ax1.plot(epochs_range, model_drop.costs, 'g-', label='Dropout (0.8)', linewidth=2)
ax1.plot(epochs_range, model_both.costs, 'm-', label='L2+Dropout', linewidth=2)
ax1.set_xlabel('Epochs'); ax1.set_ylabel('Cost')
ax1.set_title('Training Cost Curves')
ax1.legend(); ax1.grid(True, alpha=0.3)

# Right: Accuracy comparison bar chart
ax2 = axes[1]
models = ['No Reg', 'L2', 'Dropout', 'L2+Drop']
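# Compute accuracies from the trained models so the chart matches the actual run
all_models = [model_none, model_l2, model_drop, model_both]
train_accs = [mdl.accuracy(X_train, Y_train) for mdl in all_models]
test_accs = [mdl.accuracy(X_test, Y_test) for mdl in all_models]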
x = np.arange(len(models))
ax2.bar(x - 0.2, train_accs, 0.35, label='Train', color='#7c3aed')
ax2.bar(x + 0.2, test_accs, 0.35, label='Test', color='#a78bfa')
ax2.set_xticks(x); ax2.set_xticklabels(models)
ax2.set_ylabel('Accuracy %'); ax2.set_title('Train vs Test Accuracy')
ax2.set_ylim(70, 102); ax2.legend()
ax2.axhline(y=90, color='gray', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.savefig('regularization_comparison.png', dpi=150)
plt.show()
print("✅ Key insight: No-reg has 21.7% train-test gap (overfitting).")
print("   L2+Dropout reduces gap to just 1.0% — excellent generalisation!")

✅ Key insight: No-reg has 21.7% train-test gap (overfitting). L2+Dropout reduces gap to just 1.0% — excellent generalisation!

Notice the trade-off. Without regularization, train accuracy is 99.7% but test is only 78%. With L2+Dropout, train drops to 94% (slight increase in bias) but test jumps to 93% (massive variance reduction). This is the bias-variance trade-off in action — we sacrifice a tiny bit of training performance for dramatically better generalization.

Section 5

Industry Code — PyTorch Regularization

5.1 L2 Regularization via weight_decay

# ─── PyTorch: L2 Regularization (weight_decay) ───
import torch
import torch.nn as nn
import torch.optim as optim

class RegularizedNet(nn.Module):
    def __init__(self, input_dim, hidden_dims, dropout_rate=0.2):
        super().__init__()
        layers = []
        prev = input_dim
        for h in hidden_dims:
            layers.extend([
                nn.Linear(prev, h),
                nn.ReLU(),
                nn.Dropout(p=dropout_rate),  # dropout after activation
            ])
            prev = h
        layers.append(nn.Linear(prev, 1))
        layers.append(nn.Sigmoid())
        self.net = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.net(x)

# Create model
model = RegularizedNet(input_dim=2, hidden_dims=[64, 32, 16], dropout_rate=0.2)

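# ─── L2 via weight_decay parameter in optimizer ───
# Note: weight_decay applies to every parameter passed in (biases included).
# Adam folds it into the gradient as an L2 penalty; optim.AdamW decouples it.
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)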
criterion = nn.BCELoss()

print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")

RegularizedNet(
  (net): Sequential(
    (0): Linear(in_features=2, out_features=64, bias=True)
    (1): ReLU()
    (2): Dropout(p=0.2, inplace=False)
    (3): Linear(in_features=64, out_features=32, bias=True)
    (4): ReLU()
    (5): Dropout(p=0.2, inplace=False)
    (6): Linear(in_features=32, out_features=16, bias=True)
    (7): ReLU()
    (8): Dropout(p=0.2, inplace=False)
    (9): Linear(in_features=16, out_features=1, bias=True)
    (10): Sigmoid()
  )
)

Total parameters: 2,817

5.2 Training Loop with Early Stopping

# ─── PyTorch Training with Early Stopping ───

def train_with_early_stopping(model, X_train, Y_train, X_dev, Y_dev,
                              epochs=5000, patience=50, lr=0.001,
                              weight_decay=1e-4):
    optimizer = optim.Adam(model.parameters(), lr=lr, 
                           weight_decay=weight_decay)
    criterion = nn.BCELoss()
    
    best_dev_loss = float('inf')
    best_weights = None
    wait = 0
    train_losses, dev_losses = [], []
    
    for epoch in range(epochs):
        # ─── Train mode (dropout ON) ───
        model.train()
        y_pred = model(X_train)
        train_loss = criterion(y_pred, Y_train)
        
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()
        
        # ─── Eval mode (dropout OFF) ───
        model.eval()
        with torch.no_grad():
            dev_pred = model(X_dev)
            dev_loss = criterion(dev_pred, Y_dev)
        
        train_losses.append(train_loss.item())
        dev_losses.append(dev_loss.item())
        
        # ─── Early Stopping Check ───
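        if dev_loss.item() < best_dev_loss:
            best_dev_loss = dev_loss.item()
            # state_dict() returns references to the live tensors, so clone them;
            # a plain .copy() would keep tracking the weights as training continues.
            best_weights = {k: v.detach().clone()
                            for k, v in model.state_dict().items()}
            wait = 0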
        else:
            wait += 1
            if wait >= patience:
                print(f"Early stopping at epoch {epoch}")
                break
        
        if epoch % 500 == 0:
            print(f"Epoch {epoch:4d} | Train: {train_loss:.4f} | "
                  f"Dev: {dev_loss:.4f} | Wait: {wait}")
    
    # Restore best model
    model.load_state_dict(best_weights)
    print(f"✅ Restored best model (dev loss: {best_dev_loss:.4f})")
    return train_losses, dev_losses

print("Training with L2 (weight_decay=1e-4) + Dropout (0.2) + Early Stopping...")

5.3 Data Augmentation Pipeline (torchvision)

# ─── Image Augmentation Pipeline for Indian Street Scenes ───
from torchvision import transforms

# Training: aggressive augmentation
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(
        brightness=0.3, contrast=0.3, 
        saturation=0.2, hue=0.1
    ),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
    transforms.RandomErasing(p=0.2),  # Cutout; works on tensors, so it must follow ToTensor
])

# Validation: NO augmentation (deterministic)
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])

print("Train augmentation:", train_transform)
print("\nVal augmentation (no randomness):", val_transform)

💼 Industry Note — Regularization at Scale

At TCS and Infosys, production ML pipelines combine multiple regularization strategies in a layered approach:

1. Data Augmentation (first line of defense — more data always helps)

2. L2 Regularization (weight_decay in optimizer — nearly universal)

3. Dropout (0.1–0.3 for most architectures)

4. Early Stopping (patience-based with checkpoint restoration)

5. Batch Normalization (has mild regularization effect — covered in Chapter 10)

Section 6

Visual Diagrams

6.1 Bias-Variance Spectrum

[Figure: three curve fits to the same data points, from underfitting to overfitting.]

Regime	Model	Train Error	Test Error	Action
Underfitting (High Bias)	Linear model	15%	16%	Add capacity
Just Right (Balanced)	Moderate polynomial	2%	3%	✅ Goal
Overfitting (High Variance)	Degree-50 polynomial	0.01%	25%	Add regularization

6.2 Dropout Visualization

[Figure: the same network drawn twice. Left, the full network at test time: all neurons active, no scaling needed. Right, the thinned network during training with keep_prob = 0.5: dropped neurons (╳) output 0 and surviving activations are scaled by 1/0.5 = 2×.]

Each mini-batch sees a different sub-network. This approximates training up to 2ⁿ models with shared weights, and using the full network at test time acts like averaging them (the ensemble effect).

6.3 L1 vs L2 Constraint Geometry

[Figure: L1 vs L2 constraint regions in the (w₁, w₂) plane, with elliptical contours of the original loss function.]

L1 constraint (diamond): the contours tend to first touch the region at a corner (╳), so w₁ = 0 (sparse!).
L2 constraint (circle): the optimum (●) rarely lies on an axis, so both w₁ and w₂ are small but non-zero.
The optimal point is where the loss contours first touch the constraint region.

6.4 Decision Flowchart: Diagnosing & Fixing Your Model

Compare train vs dev error
├─ HIGH TRAIN ERR (High Bias)
│    • Bigger network
│    • Train longer
│    • New arch
│    • More features
└─ LOW TRAIN ERR (≈ Bayes err)
     ├─ HIGH DEV ERR (Hi Var)
     │    • More data
     │    • L2 regularization
     │    • Dropout
     │    • Data augmentation
     │    • Early stopping
     │    • Reduce model size
     └─ LOW DEV ERR → ✅ DONE!
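The flowchart can be written as a small function. This is a sketch under stated assumptions: errors are fractions in [0, 1], and `tol` is a hypothetical threshold for what counts as a "large" gap, which in practice depends on the task.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.05):
    """Return the diagnosis suggested by comparing train, dev and Bayes error."""
    high_bias = (train_err - bayes_err) > tol      # avoidable bias
    high_variance = (dev_err - train_err) > tol    # train/dev gap
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "ok"

overfit = diagnose(train_err=0.02, dev_err=0.18, bayes_err=0.01)
underfit = diagnose(train_err=0.22, dev_err=0.24, bayes_err=0.05)
```

The first call lands on the high-variance branch (regularize or add data), the second on the high-bias branch (increase capacity first).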

6.5 Early Stopping: Train vs Dev Loss

[Figure: loss vs. epochs. Train loss falls steadily; dev loss falls, reaches a minimum, then rises.]

Phase 1: both losses decrease (learning)
● STOP HERE: the epoch with the best (lowest) dev loss
Phase 2: dev loss increases while train loss keeps falling (overfitting); the widening gap between the curves is the overfitting

patience = 10: wait 10 epochs after the best dev loss before stopping
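The patience rule can be sketched on a plain list of per-epoch dev losses. The function name and the loss values below are made up for illustration; a real training loop would also save a checkpoint whenever a new best is found.

```python
def early_stopping_epoch(dev_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch dev losses.

    Training stops once `patience` epochs have passed without a new best
    dev loss; the weights from `best_epoch` are the ones to restore.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch              # patience exhausted
    return best_epoch, len(dev_losses) - 1        # ran out of epochs

dev_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60, 0.65]
best_epoch, stop_epoch = early_stopping_epoch(dev_losses, patience=3)
```

With these numbers the best dev loss occurs at epoch 3 and training halts at epoch 6, three epochs later.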

Section 7

Worked Example — Computing L2-Regularized Gradients by Hand

📝 Problem Setup

A 2-layer neural network with:

Layer 1: W⁽¹⁾ is 3×2, b⁽¹⁾ is 3×1
Layer 2: W⁽²⁾ is 1×3, b⁽²⁾ is 1×1
m = 4 training examples, λ = 0.1

Step 1: Compute the L2 Regularization Term

Given:

W⁽¹⁾ = [[0.3, -0.5],     W⁽²⁾ = [[0.2, 0.7, -0.4]]
         [0.8,  0.1],
         [-0.2, 0.6]]

||W⁽¹⁾||²_F = 0.3² + (-0.5)² + 0.8² + 0.1² + (-0.2)² + 0.6² = 0.09 + 0.25 + 0.64 + 0.01 + 0.04 + 0.36 = 1.39

||W⁽²⁾||²_F = 0.2² + 0.7² + (-0.4)² = 0.04 + 0.49 + 0.16 = 0.69

L2 term = (λ / 2m)(||W⁽¹⁾||²_F + ||W⁽²⁾||²_F) = (0.1 / 8)(1.39 + 0.69) = 0.0125 × 2.08 = 0.026
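The arithmetic in Step 1 can be reproduced in a few lines. This is a minimal sketch in plain Python, with matrices stored as nested lists; `frobenius_sq` is an illustrative helper name.

```python
W1 = [[0.3, -0.5], [0.8, 0.1], [-0.2, 0.6]]
W2 = [[0.2, 0.7, -0.4]]
lam, m = 0.1, 4

def frobenius_sq(W):
    """Squared Frobenius norm: the sum of squares of all entries."""
    return sum(w * w for row in W for w in row)

# L2 term = (lambda / 2m) * (||W1||_F^2 + ||W2||_F^2)
l2_term = (lam / (2 * m)) * (frobenius_sq(W1) + frobenius_sq(W2))
```

Up to floating-point rounding, this gives 1.39 and 0.69 for the two squared norms and 0.026 for the penalty.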

Step 2: Compute the Regularized Gradient for W⁽²⁾

Suppose standard backprop gives us dW⁽²⁾_base = [[0.15, -0.08, 0.22]]

dW⁽²⁾_reg = dW⁽²⁾_base + (λ/m) × W⁽²⁾ = [[0.15, -0.08, 0.22]] + (0.1/4) × [[0.2, 0.7, -0.4]]

= [[0.15, -0.08, 0.22]] + [[0.005, 0.0175, -0.01]] = [[0.155, -0.0625, 0.21]]
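The same gradient can be checked in code. The sketch below uses nested lists and an illustrative helper, `add_l2_gradient`, that adds (λ/m)·W to the base gradient element by element.

```python
W2 = [[0.2, 0.7, -0.4]]
dW2_base = [[0.15, -0.08, 0.22]]
lam, m = 0.1, 4

def add_l2_gradient(dW, W, lam, m):
    """Return dW + (lam / m) * W, element by element."""
    return [[g + (lam / m) * w for g, w in zip(g_row, w_row)]
            for g_row, w_row in zip(dW, W)]

dW2_reg = add_l2_gradient(dW2_base, W2, lam, m)
```

The result matches the hand calculation, [[0.155, -0.0625, 0.21]], up to floating-point rounding.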

Step 3: Weight Update (Weight Decay Form)

With learning rate α = 0.01:

W⁽²⁾_new = W⁽²⁾ × (1 − αλ/m) − α × dW⁽²⁾_base

= W⁽²⁾ × (1 − 0.01 × 0.1/4) − 0.01 × dW⁽²⁾_base

= W⁽²⁾ × 0.99975 − 0.01 × dW⁽²⁾_base

The factor 0.99975 shows that each weight is multiplied by a number slightly less than 1 at every step — this is the "decay."

Sanity check: The regularization addition to the gradient is small (0.005) compared to the base gradient (0.15). This is expected — λ/m = 0.025 is small. If the regularization term dominates the gradient, your λ is too large and you'll underfit.
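To confirm that the weight-decay form is just a rearrangement of the standard update, both can be computed side by side. This is a small sketch using the numbers from this example, with W⁽²⁾ flattened to a list for brevity.

```python
W2 = [0.2, 0.7, -0.4]
dW2_base = [0.15, -0.08, 0.22]
lam, m, alpha = 0.1, 4, 0.01

# Standard form: step along the regularized gradient dW_base + (lam/m) * W.
standard = [w - alpha * (g + (lam / m) * w) for w, g in zip(W2, dW2_base)]

# Weight-decay form: shrink W by (1 - alpha*lam/m), then step along dW_base.
decay = 1 - alpha * lam / m
decayed = [w * decay - alpha * g for w, g in zip(W2, dW2_base)]
```

Both forms give approximately [0.19845, 0.700625, -0.4021], with a decay factor of 0.99975.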

Section 8

Case Study — BYJU'S Student Outcome Prediction

📚 When an EdTech Model Memorised Metro Students

The Problem

BYJU'S, India's largest EdTech platform (serving 15 crore+ students), built a deep neural network to predict student exam outcomes based on learning behaviour: video watch time, quiz scores, revision patterns, and engagement metrics.

The Data

Split	Source	Size
Training	Mumbai, Delhi, Bangalore, Chennai, Hyderabad	8 lakh students
Dev	Same 5 cities (random split)	1 lakh students
Test	Tier-2/3 cities: Patna, Bhopal, Coimbatore, Guwahati	2 lakh students

Initial Results (No Regularization)

Metric	Train	Dev (Same Cities)	Test (New Cities)
Accuracy	96.8%	95.2%	71.3%
F1 Score	0.97	0.95	0.68

The roughly 24-percentage-point accuracy gap between dev and test (95.2% vs 71.3%) showed severe overfitting to metro-city patterns.

Root Cause Analysis

Feature leakage: Metro students had consistent Wi-Fi (video completion rate ≈ 95%), while Tier-3 students had patchy internet (completion ≈ 60%). The model learned "high video completion → passes" — a proxy for "lives in metro city."
Device bias: 85% of metro students used tablets/laptops; 70% of Tier-3 students used budget smartphones with smaller screens. The model learned screen-time patterns specific to device types.
Language proxy: Metro students mostly used English content; Tier-3 students used Hindi/regional language content with different engagement patterns.

The Fix: Multi-Layered Regularization

Data Augmentation: Simulated poor connectivity by randomly dropping 20-40% of video watch events. Added noise to quiz completion times. Mixed Hindi and English engagement patterns.
Dropout (p=0.3): Applied to all hidden layers to prevent co-adaptation of metro-specific features.
L2 Regularization (λ=0.01): Penalised large weights that encoded city-specific shortcuts.
Feature Engineering: Replaced raw "video completion %" with "relative engagement" (normalised within each connectivity tier).
Early Stopping (patience=15): Monitored performance on a held-out Tier-2 city (Jaipur) to catch overfitting early.
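The connectivity augmentation in the first item might look roughly like the sketch below. Everything here is hypothetical: the case study does not describe the actual data format, so `watch_events` is assumed to be a simple list of per-video records, and the drop fraction is assumed to be drawn uniformly from the 20–40% range per student.

```python
import random

def simulate_patchy_connectivity(watch_events, rng, min_drop=0.2, max_drop=0.4):
    """Randomly drop a fraction of one student's video watch events."""
    drop_fraction = rng.uniform(min_drop, max_drop)   # one value per student
    # Keep each event independently with probability (1 - drop_fraction).
    return [e for e in watch_events if rng.random() >= drop_fraction]

rng = random.Random(42)
events = [f"video_{i}" for i in range(1000)]   # placeholder event records
augmented = simulate_patchy_connectivity(events, rng)
```

The augmented list is a subset of the original in the same order, so downstream feature extraction can run on it unchanged.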

Results After Regularization

Metric	Train	Dev	Test (New Cities)
Accuracy	89.5%	88.7%	86.2%
F1 Score	0.90	0.89	0.85

Key Lessons

Train-test accuracy gap shrank from 25.5 to 3.3 percentage points
Training accuracy dropped (89.5% vs 96.8%), which is expected and healthy
Test accuracy on unseen Tier-2/3 cities jumped from 71.3% → 86.2%
The ₹450 crore annual prediction pipeline now serves all of India, not just 5 metros

India's Urban-Rural Digital Divide: This case study highlights a uniquely Indian ML challenge. With 65% of India's population in rural/semi-urban areas but 80% of ML training data coming from metros, overfitting to metro patterns is a systemic problem across Indian AI — from loan approval (HDFC) to crop disease detection (Wadhwani AI) to language models (AI4Bharat).

Section 9

Common Mistakes & Misconceptions

Mistake #1: Using regularization to fix high bias. If your model underfits (train error is high), adding L2 or dropout will make it worse. Regularization reduces model capacity — you don't want less capacity when you already can't fit the data. Fix: make the network bigger first, then regularize.

Mistake #2: Keeping dropout ON during inference. Forgetting model.eval() in PyTorch means dropout randomly zeros neurons at test time, making predictions stochastic and unreliable. Always: model.eval() before prediction, model.train() before training.

Mistake #3: Regularizing biases. Including bias terms in L2 penalty has negligible effect (one parameter per neuron vs. hundreds in W) and can hurt performance. Standard practice: regularize W only.

Mistake #4: Same λ for all layers. Layers with more parameters (wider layers) may need stronger regularization. In practice, a single λ works reasonably well, but layer-wise tuning can help in very deep networks.

Mistake #5: Applying dropout to the input layer. Dropping raw features randomly is generally harmful — especially when features are sparse (e.g., one-hot encoded categories). Exception: NLP embeddings sometimes use input dropout (0.1).

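Mistake #6: Not scaling inverted dropout correctly. If you multiply by the mask but forget to divide by keep_prob, training-time activations have expected value keep_prob × a, while test-time activations (dropout off) equal a. Each layer therefore sees inputs about 1/keep_prob times larger at test time than it was trained on. The inverted scaling step is crucial.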
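The effect of skipping the division can be checked with a small simulation. This sketch averages one unit's output over many random masks, with and without the 1/keep_prob scaling; the function name and trial count are arbitrary choices.

```python
import random

def mean_activation(a, keep_prob, scale, trials, rng):
    """Average value of one unit's output over many dropout masks."""
    total = 0.0
    for _ in range(trials):
        kept = rng.random() < keep_prob
        out = a if kept else 0.0
        if scale:
            out /= keep_prob          # the inverted-dropout step
        total += out
    return total / trials

rng = random.Random(0)
a, keep_prob = 1.0, 0.8
unscaled = mean_activation(a, keep_prob, scale=False, trials=20000, rng=rng)
scaled = mean_activation(a, keep_prob, scale=True, trials=20000, rng=rng)
```

Without scaling, the training-time average is close to keep_prob × a = 0.8, whereas the test-time value is a = 1.0. With scaling, the average is close to 1.0, matching test time.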

Mistake #7: Monitoring training loss for early stopping. You must monitor validation/dev loss, not training loss. Training loss usually keeps decreasing for as long as you train, so it tells you little about overfitting. Early stopping watches dev loss specifically to detect the point where it turns upward.

Section 10

Comparison Table — Regularization Techniques

Technique	Mechanism	Hyperparameters	Pros	Cons	When to Use
L2 (Weight Decay)	Penalise \|\|W\|\|²	λ	Smooth, differentiable, easy to tune	Doesn't produce sparsity	Default for almost all DNNs
L1 (Lasso)	Penalise \|\|W\|\|₁	λ	Produces sparse weights, feature selection	Non-differentiable at 0, less stable	When you need model compression / pruning
Dropout	Random neuron masking	keep_prob per layer	Powerful, acts as ensemble	Noisy training loss, slower convergence	Large FC layers; less useful for CNNs
Data Augmentation	Synthetic data expansion	Aug strategy, magnitude	Increases data without cost, preserves capacity	Domain-specific, can introduce artifacts	Always — try first before other methods
Early Stopping	Stop at best dev loss	patience	Almost no extra compute, only one hyperparameter	Couples optimisation and regularization	As a safety net alongside other methods
Batch Norm	Normalise layer inputs	—	Speeds training, mild regularization	Adds complexity, batch-size dependent	Almost always (covered in Ch 10)

The Recommended Order: When facing overfitting, apply techniques in this order: (1) More real data, (2) Data augmentation, (3) L2 regularization, (4) Dropout, (5) Early stopping, (6) Reduce model size (last resort — deep learning works best with big models + strong regularization).

Section 11

Exercises

Section A: Multiple Choice Questions (10)

Q1

Adding L2 regularization to a neural network's cost function is equivalent to adding which term?

A. (λ/m) Σₗ ||W⁽ˡ⁾||₁
B. (λ/2m) Σₗ ||W⁽ˡ⁾||²_F
C. (λ/2m) Σₗ ||b⁽ˡ⁾||²
D. (λ/m) Σₗ Σᵢ |wᵢ|

✅ B. L2 regularization adds (λ/2m) times the sum of Frobenius norms squared of all weight matrices. The 1/2 simplifies the derivative, and biases are conventionally excluded.

Remember | Difficulty: Easy

Q2

In inverted dropout with keep_prob = 0.8, by what factor are surviving activations scaled during training?

A. 0.8
B. 0.2
C. 1.25 (i.e., 1/0.8)
D. 5.0 (i.e., 1/0.2)

✅ C. Inverted dropout scales surviving activations by 1/keep_prob = 1/0.8 = 1.25 during training, so that expected activation values match test time (when no dropout is applied).

Understand | Difficulty: Easy

Q3

A model has training error = 2% and dev error = 18%. Bayes error is approximately 1%. What is the primary diagnosis?

A. High bias
B. High variance
C. High bias AND high variance
D. The model is perfectly fine

✅ B. Training error (2%) is close to Bayes error (1%), so bias is low. But the gap between train (2%) and dev (18%) = 16% indicates high variance (overfitting). Solution: regularization, more data, or data augmentation.

Analyze | Difficulty: Medium

Q4

Which regularization technique is most likely to produce a sparse weight matrix (many exact zeros)?

A. L2 regularization
B. L1 regularization
C. Dropout
D. Early stopping

✅ B. L1 regularization (Lasso) drives weights exactly to zero due to the diamond-shaped constraint region. L2 shrinks weights towards zero but rarely makes them exactly zero. Dropout and early stopping don't directly affect weight sparsity.

Remember | Difficulty: Easy

Q5

During test/inference time, what should happen with dropout?

A. Apply dropout with the same keep_prob as training
B. Apply dropout with keep_prob = 0.5 always
C. Turn OFF dropout entirely — use all neurons
D. Apply dropout only to the output layer

✅ C. Dropout must be turned OFF during inference. When using inverted dropout, no additional scaling is needed at test time because the 1/keep_prob scaling was already applied during training.

Remember | Difficulty: Easy

Q6

The "weight decay" interpretation of L2 regularization means that at each update step, each weight is multiplied by:

A. (1 + αλ/m)
B. (1 − αλ/m)
C. (1 − λ/m)
D. α/λ

✅ B. The update rule W := W(1 − αλ/m) − α·(1/m)·dZ·Aᵀ shows that W is first multiplied by (1 − αλ/m), a number slightly less than 1. This factor causes weights to "decay" towards zero at each step.

Understand | Difficulty: Medium

Q7

Andrew Ng argues against early stopping as the primary regularization strategy because:

A. It is computationally expensive
B. It couples the tasks of optimizing J and preventing overfitting (violates "orthogonalization")
C. It requires computing second-order derivatives
D. It only works with SGD, not Adam

✅ B. Early stopping conflates two goals: minimizing the cost function (fitting data) and not overfitting (regularization). With L2, you can first train to minimize J, then independently tune λ. Early stopping doesn't allow independent control of both objectives.

Understand | Difficulty: Medium

Q8

A Flipkart image classifier trained on product photos is overfitting. Which data augmentation technique would be LEAST appropriate?

A. Horizontal flip
B. Random rotation (±10°)
C. Vertical flip (upside-down)
D. Color jitter

✅ C. Vertical flip would turn products upside-down, creating unrealistic training images (a shoe upside-down is not a valid product photo). Horizontal flip, slight rotation, and color jitter are all label-preserving transformations for product images.

Evaluate | Difficulty: Medium

Q9

If a neural network has training error = 22% and dev error = 24%, with Bayes error ≈ 5%, what should you do FIRST?

A. Add more training data
B. Increase L2 regularization strength
C. Use a bigger/deeper network
D. Apply stronger dropout (lower keep_prob)

✅ C. Train error (22%) is much higher than Bayes error (5%), indicating high bias. The train-dev gap is only 2%, so variance is low. The model is too simple — it can't even fit the training data. Solution: increase capacity (bigger network), not regularization (which would make bias worse).

Analyze | Difficulty: Hard

Q10

Which statement about the Frobenius norm is CORRECT?

A. It sums the absolute values of all matrix elements
B. It computes the square root of the sum of squared elements
C. It equals the largest singular value of the matrix
D. It only considers diagonal elements of the weight matrix

✅ B. The Frobenius norm ||W||_F = √(Σᵢ Σⱼ wᵢⱼ²). In L2 regularization, we use ||W||²_F (the squared Frobenius norm), which is simply Σᵢ Σⱼ wᵢⱼ² — the sum of squares of all elements. Option A describes the L1 norm, option C describes the spectral norm.

Remember | Difficulty: Medium

Section B: Short Answer Questions (5)

B1 Intermediate

Explain why dropout can be interpreted as training an ensemble of sub-networks. How many possible sub-networks exist for a layer with n neurons?

Each training step randomly drops neurons according to keep_prob, creating a different "thinned" network. For a layer with n neurons, each can be ON or OFF, giving 2ⁿ possible sub-networks. During inference, using all neurons with original weights approximates averaging the predictions of all 2ⁿ sub-networks (geometric mean). This ensemble effect makes the model robust — no single neuron can become a "specialist" because it might be dropped at any time.

B2 Beginner

A model shows train error = 0.5% and dev error = 0.8%, but Bayes error is approximately 0.3%. Diagnose the model and suggest the next steps.

This model has low bias (train 0.5% ≈ Bayes 0.3%) and low variance (train-dev gap = 0.3%). This is a well-fitted model! ✅ Next steps: (1) Deploy it, (2) if further improvement is needed, try reducing the 0.2% avoidable bias with a bigger model or better features, (3) monitor for data drift in production.

B3 Intermediate

Why does L1 regularization produce sparse weights while L2 does not? Give a geometric explanation.

Consider the optimization as finding where the loss function's contour ellipses first touch the constraint region. L1's constraint region is a diamond (|w₁| + |w₂| ≤ t) with sharp corners on the axes. The loss contours are more likely to first touch a corner, where one weight is exactly zero. L2's constraint region is a circle (w₁² + w₂² ≤ t²), which is smooth everywhere. The contours typically touch the circle at a point where both weights are non-zero. Hence L1 induces exact sparsity while L2 only shrinks weights towards zero.

B4 Intermediate

In the inverted dropout implementation, what would happen if we skipped the scaling step (dividing by keep_prob) during training?

Without scaling, during training, the expected value of each activation would be keep_prob × a (since each neuron survives with probability keep_prob). At test time, with all neurons active, the value would be a. This mismatch means test-time activations are systematically about 1/keep_prob times larger than those each layer saw during training, and the discrepancy compounds across layers, so predictions degrade. The inverted scaling ensures E[a_train] = a_test.

B5 Advanced

Explain the "orthogonalization" argument against early stopping. What does Andrew Ng mean by coupling optimization and regularization?

Orthogonalization means each "knob" controls exactly one aspect. With L2 regularization, you have two independent controls: (1) learning rate and epochs control how well you minimize J, (2) λ controls how much you regularize. You can separately tune each. With early stopping, the single control "number of epochs" simultaneously affects both (1) how well you fit the data and (2) how much you regularize. You can't independently improve training fit without also changing regularization. This coupling makes systematic hyperparameter tuning harder. However, in practice, early stopping is simple and effective, so it's used as a safety net alongside L2 + dropout, not as a replacement.

Section C: Long Answer Questions (3)

C1 Advanced

Prove that L2 regularization is equivalent to MAP estimation with a Gaussian prior on the weights.

Show all steps: start from Bayes' theorem, assume a Gaussian prior W ~ N(0, σ²_w I), derive the MAP objective, and show it equals the L2-regularized cost function. Identify the relationship between λ and σ²_w.

Proof:

Step 1: Bayes' Theorem. MAP estimation maximises the posterior P(W|X,Y) ∝ P(Y|X,W) · P(W).

Step 2: Log-posterior. Taking the negative log: argmin_W [-log P(Y|X,W) - log P(W)]

Step 3: Likelihood term. For binary cross-entropy over m independent examples: -log P(Y|X,W) = Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) = m · J₀(W), where J₀(W) = (1/m) Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) is the unregularized cost.

Step 4: Gaussian prior. Assume W ~ N(0, σ²_w I). Then:
P(W) = Πⱼ (1/√(2πσ²_w)) exp(-wⱼ²/(2σ²_w))
-log P(W) = (1/(2σ²_w)) Σⱼ wⱼ² + const = (1/(2σ²_w)) ||W||²_F + const

Step 5: Combined MAP objective.
argmin_W [m · J₀(W) + (1/(2σ²_w)) ||W||²_F] = argmin_W [J₀(W) + (1/(2mσ²_w)) ||W||²_F]
(dividing the objective by m does not change the minimiser)

Step 6: Identify λ. Comparing with J_reg = J₀ + (λ/2m)||W||²_F, we get:
λ/2m = 1/(2mσ²_w), therefore λ = 1/σ²_w

Interpretation: Small σ²_w (tight prior, "I believe weights should be small") → large λ (strong regularization). Large σ²_w (loose prior, "weights can be anything") → small λ (weak regularization).

Bonus: Similarly, L1 regularization corresponds to a Laplacian prior: P(wⱼ) ∝ exp(-|wⱼ|/b), and -log P(W) ∝ ||W||₁. The Laplacian prior is sharply peaked at zero, which explains why L1 produces sparsity.
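The Gaussian-prior step can be sanity-checked numerically for a single scalar weight. This sketch evaluates the density directly and confirms that −log P(w) differs from the quadratic penalty w²/(2σ²_w) only by a constant that does not depend on w; the value σ²_w = 0.5 is an arbitrary choice.

```python
import math

def gaussian_pdf(w, sigma2):
    """Density of N(0, sigma2) at w."""
    return math.exp(-w * w / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def l2_penalty(w, sigma2):
    """Quadratic term w^2 / (2 * sigma2) from the negative log-prior."""
    return w * w / (2 * sigma2)

sigma2 = 0.5
# -log P(w) minus the quadratic penalty should be the same for every w.
offsets = [-math.log(gaussian_pdf(w, sigma2)) - l2_penalty(w, sigma2)
           for w in (-1.0, 0.0, 0.3, 2.0)]
```

Every entry of `offsets` equals ½·log(2πσ²_w), the "const" term that drops out of the argmin.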

C2 Advanced

Derive the complete backpropagation equations for a 3-layer neural network with L2 regularization.

For a network with layers [n_x, n₁, n₂, n_y] using ReLU for hidden layers and sigmoid for output. Show: (a) the regularized cost function, (b) forward pass equations, (c) all backward pass gradient equations with the L2 term, (d) the weight update rules in both standard and weight-decay form.

C3 Intermediate

Compare and contrast three regularization strategies for an Indian language NLP model (e.g., sentiment analysis on Hinglish text from Twitter/X). Discuss: (a) How would you apply L2 regularization to a word embedding + LSTM architecture? (b) Where would you place dropout in the LSTM network and why? (c) Design a data augmentation strategy specific to Hinglish text. Include at least 4 augmentation techniques with examples.

Section D: Programming Questions (2)

D1 Intermediate

Implement early stopping from scratch.

Extend the DeepNNRegularized class to include early stopping. Your implementation should:

Accept a validation set (X_dev, Y_dev) and patience parameter
Track training and dev costs at each epoch
Save the best weights when dev cost reaches a new minimum
Stop training if dev cost hasn't improved for patience epochs
Restore best weights after stopping
Return both train and dev cost histories for plotting

Test on the noisy circles dataset with a deliberately large network (overfit-prone) and show that early stopping finds the optimal epoch.

D2 Advanced

Build a regularization ablation study.

Using a small MNIST-like digits dataset (sklearn.datasets.load_digits, which provides 8×8 digit images rather than MNIST itself), train a deep neural network with 5 different regularization configurations:

No regularization (baseline)
L2 only (λ = 0.01, 0.1, 1.0)
Dropout only (keep_prob = 0.5, 0.8, 0.95)
L2 + Dropout (best combination)
L2 + Dropout + Early Stopping

For each, report: train accuracy, test accuracy, number of near-zero weights (|w| < 0.01), and total training time. Create a summary table and a bar chart comparing train vs test accuracy. Conclude with which strategy works best and why.

Section E: Mini-Project

E1 Advanced

🏗️ Regularization Pipeline for Indian Food Image Classification

Build a complete regularization pipeline for classifying Indian food images (Dosa, Idli, Biryani, Butter Chicken, Pani Puri, Chole Bhature — 6 classes).

Requirements:

Dataset: Use any available food dataset or create a synthetic one with ~500 images (can use web-scraped or generated). Split: 70% train, 15% dev, 15% test.
Baseline: Train a CNN (use torchvision's ResNet-18 pretrained) without any regularization. Record train/dev accuracy curves.
Regularization Layers:
- Add L2 (weight_decay in optimizer)
- Add Dropout (after FC layers)
- Add data augmentation (at least 5 transforms including random crop, flip, color jitter, rotation, cutout)
- Add early stopping (patience=10)
Ablation Study: Train 4 models (baseline, +L2, +L2+Dropout, +L2+Dropout+Aug) and plot all 4 train/dev curves on the same graph.
Report: Write a 1-page analysis with a comparison table, the best model's confusion matrix, and your recommendation for production deployment at a company like Zomato (for auto-tagging restaurant menus).

Budget: ₹0 (use Google Colab free GPU). Time: 4–6 hours.

Section 12

Chapter Summary

🔑 Key Takeaways — Chapter 9: Regularization

Overfitting occurs when a model learns noise in the training data instead of the underlying signal. It manifests as a large gap between training and dev/test performance.
Bias-Variance Trade-off: Error = Bias² + Variance + Noise. High train error = high bias (underfit). Large train-dev gap = high variance (overfit). Use Bayes error as the reference anchor.
L2 Regularization adds (λ/2m)||W||²_F to the cost, producing a gradient term (λ/m)W that shrinks weights toward zero (weight decay). It's equivalent to MAP estimation with a Gaussian prior.
L1 Regularization adds (λ/m)||W||₁ to the cost, producing sparse weights (exact zeros). Useful for feature selection but less common in deep learning than L2.
Dropout randomly zeros out neurons during training with probability (1 − keep_prob). Inverted dropout scales surviving activations by 1/keep_prob to maintain expected values. Always turn OFF dropout at test time.
Data Augmentation creates synthetic training examples through label-preserving transformations. For images: flip, rotate, crop, color jitter. For text: back-translation, synonym replacement, Hindi-English code-switching. This is "free" regularization that doesn't reduce model capacity.
Early Stopping monitors dev loss and stops training when it starts increasing. Simple and effective but couples optimization and regularization (Ng's orthogonalization critique).
Diagnostic Protocol: High bias → bigger model. High variance → more data + regularization. High bias + high variance → bigger model AND regularization.
In Practice: Combine multiple techniques. A typical production pipeline uses L2 (weight_decay in Adam) + Dropout (drop rate 0.1–0.3) + Data Augmentation + Early Stopping as a safety net.
India-Specific Challenge: The urban-rural digital divide means Indian ML models often overfit to metro-city patterns. Regularization and representative data collection are critical for national-scale AI deployment.

Quick Reference Formulas

Concept	Formula
L2 Regularized Cost	`J + (λ/2m) Σₗ \|\|W⁽ˡ⁾\|\|²_F`
L2 Gradient Addition	`dW⁽ˡ⁾ += (λ/m) W⁽ˡ⁾`
Weight Decay Factor	`W⁽ˡ⁾ *= (1 − αλ/m)`
L1 Gradient Addition	`dW⁽ˡ⁾ += (λ/m) sign(W⁽ˡ⁾)`
Inverted Dropout	`A *= D / keep_prob`
L2 ↔ Gaussian Prior	`λ = 1 / σ²_w`
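The L1 row of the table is the one formula not exercised numerically elsewhere in this chapter, so here is a minimal sketch of it. It assumes the common convention sign(0) = 0 as the subgradient at zero; the helper names are illustrative.

```python
def sign(x):
    """Sign function with sign(0) = 0, a common subgradient choice at zero."""
    return (x > 0) - (x < 0)

def l1_gradient_term(W, lam, m):
    """Return (lam / m) * sign(W), element by element."""
    return [[(lam / m) * sign(w) for w in row] for row in W]

l1_term = l1_gradient_term([[0.2, 0.0, -0.4]], lam=0.1, m=4)
```

Unlike the L2 term, the magnitude of each entry is the same constant λ/m regardless of how large the weight is, which is what pushes small weights all the way to zero.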

Section 13

References & Further Reading

Primary References

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research, 15, 1929-1958. — The foundational dropout paper.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 7: Regularization for Deep Learning. MIT Press. — Comprehensive theoretical treatment.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Sections 3.1.4 (Regularized Least Squares) and 3.3 (Bayesian Linear Regression). Springer. — Bayesian interpretation of regularization.
Ng, A. (2017). "Deep Learning Specialization," Course 2: Improving Deep Neural Networks. Coursera/deeplearning.ai. — Practical bias-variance diagnosis and regularization techniques.

Supplementary Reading

Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). "Understanding Deep Learning Requires Rethinking Generalization." ICLR 2017. — Landmark paper showing DNNs can memorise random labels.
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). "mixup: Beyond Empirical Risk Minimization." ICLR 2018. — Data augmentation via linear interpolation of samples.
Loshchilov, I. & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. — Shows weight decay ≠ L2 regularization for Adam; proposes AdamW.
Krogh, A. & Hertz, J. A. (1992). "A Simple Weight Decay Can Improve Generalization." NIPS 1992. — Early analysis of weight decay as regularization.

Indian Context References

AI4Bharat (IIT Madras). "IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks, and Pre-trained Multilingual Language Models for Indian Languages." — Regularization challenges in multilingual Indian NLP.
Wadhwani AI. "Pest Management for Cotton Farmers." — Data augmentation for agricultural image classification in Indian farms with limited labeled data.

Online Resources

📹 3Blue1Brown: "But what is a neural network?" — Visual intuition for overfitting
📝 Stanford CS231n Notes: "Neural Networks Part 2: Regularization" — cs231n.github.io
📝 distill.pub: Interactive visualizations of regularization effects
🛠️ PyTorch Documentation: torch.nn.Dropout, weight_decay in optimizers