Neural Networks & Deep Learning
Chapter 9: Regularization
Preventing Overfitting in Neural Networks
Reading Time: ~3 hours | Part III: Training Deep Networks | Theory + Code Chapter
Prerequisites: Chapters 6–8 (Deep Networks, Backpropagation, Optimization)
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| Remember | Recall L1, L2, dropout formulas and their hyperparameters; list regularization techniques |
| Understand | Explain bias-variance trade-off, why L2 shrinks weights, and how dropout acts as an ensemble |
| Apply | Implement L2 regularization and dropout from scratch; apply data augmentation pipelines |
| Analyze | Diagnose whether a model suffers from high bias or high variance using train/dev error curves |
| Evaluate | Select the right regularization strategy given a dataset size, model complexity, and domain |
| Create | Design a complete regularization pipeline combining L2, dropout, augmentation, and early stopping |
Learning Objectives
By the end of this chapter, you will be able to:
- Define overfitting and underfitting in terms of the bias-variance decomposition and diagnose which problem your model has
- Derive the L2-regularized cost function, compute its gradient (weight decay), and explain the Frobenius norm penalty
- Compare L1 vs L2 regularization: sparsity, feature selection, and geometric interpretations
- Implement inverted dropout from scratch, explain why we scale activations by 1/keep_prob, and know to turn it off at test time
- Apply data augmentation techniques for images (flip, rotate, crop, color jitter) and text (back-translation, synonym replacement, Hindi-English code-switching)
- Use early stopping by monitoring train vs. dev loss curves and explain its relationship to L2 regularization
- Prove that L2 regularization corresponds to MAP estimation with a Gaussian prior on weights
- Build a deep neural network with L2 + dropout from scratch and compare performance with/without regularization
- Follow a systematic decision flowchart: high bias → bigger model; high variance → more data / regularization
Opening Hook: The Model That Memorised Mumbai
When Swiggy's Model Aced Mumbai But Failed Lucknow
In 2022, a Swiggy data science team built a deep neural network to predict food delivery times. The model was trained on 18 months of delivery data from Mumbai, Bangalore, and Delhi, approximately 4.2 crore orders.
The results looked spectacular:
Training accuracy: 98.2% (predicted delivery within ±3 minutes for almost every training order).
Then they deployed it to Lucknow, Jaipur, and Indore.
New-city accuracy: 62.4% (predictions were off by 15–25 minutes on average).
The model had memorised specific Mumbai landmarks ("Andheri station → 18 min"), Bangalore traffic patterns ("Silk Board junction → always jammed"), and Delhi pin-code shortcuts. It learned the noise in the training data, not the signal.
This is overfitting, the central villain of this chapter. The gap between 98% and 62% is the variance your model carries. Regularization is how we tame it.
Core Concepts
9.1 The Bias-Variance Trade-off
Before we fix overfitting, we must diagnose it. The bias-variance framework gives us a precise vocabulary.
Bias-Variance Decomposition
For any model's expected prediction error on unseen data:
Expected error = Bias² + Variance + σ² (irreducible noise)
Bias (Underfitting)
Bias measures how far the model's average prediction is from the true value. High bias means the model is too simple: it cannot capture the underlying pattern. A linear model trying to fit a sine wave has high bias.
Variance (Overfitting)
Variance measures how much predictions change when you train on different subsets of data. High variance means the model is too complex: it fits the noise in each training set. A degree-100 polynomial fit to 20 points has high variance.
Irreducible Noise
The Bayes error (σ²) is the minimum achievable error: noise inherent in the data itself. No model can beat this. In Swiggy's case, unpredictable events (customer not answering the door, sudden rain) contribute ~5-8% irreducible error.
Diagnosing Your Model
| Symptom | Train Error | Dev Error | Diagnosis | Fix |
|---|---|---|---|---|
| High Bias | 15% | 16% | Underfitting | Bigger network, train longer, new architecture |
| High Variance | 1% | 15% | Overfitting | More data, regularization, dropout |
| High Bias + High Variance | 15% | 30% | Worst case | Bigger network AND regularization |
| Low Bias + Low Variance | 0.5% | 1% | Good fit | Deploy! |
Note: These numbers assume human-level (Bayes) error ≈ 0%. Bias is judged relative to the Bayes error: if Bayes error is 10%, then a train error of 11% is actually low bias, because only 1 percentage point of it is avoidable.
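The diagnosis table can be turned into a small helper. This is a minimal sketch, not a standard API: the function name `diagnose` and the 2-percentage-point threshold `tol` are assumptions chosen for illustration, and a sensible threshold depends on your task.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.02):
    """Classify a model from its error rates (fractions, e.g. 0.15 for 15%).

    tol is an illustrative threshold, not a universal constant.
    """
    avoidable_bias = train_err - bayes_err   # gap to the best achievable error
    variance = dev_err - train_err           # gap between dev and train error
    high_bias = avoidable_bias > tol
    high_variance = variance > tol
    if high_bias and high_variance:
        return "high bias + high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "good fit"

# The four rows of the diagnosis table (Bayes error assumed to be ~0)
for train, dev in [(0.15, 0.16), (0.01, 0.15), (0.15, 0.30), (0.005, 0.01)]:
    print(f"train={train:.3f} dev={dev:.3f} -> {diagnose(train, dev)}")
```

Passing `bayes_err=0.10` with a train error of 0.11 reports a good fit, which matches the note about judging bias relative to the Bayes error.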
9.2 L2 Regularization (Weight Decay)
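L2 regularization is the single most common technique to reduce overfitting. The idea is beautifully simple: penalise large weights.
L2-Regularized Cost Function
J_reg(W, b) = (1/m) Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) + (λ/2m) Σₗ ||W⁽ˡ⁾||²_F
Here ||W⁽ˡ⁾||²_F is the squared Frobenius norm: the sum of the squared entries of W⁽ˡ⁾. The biases are not penalised.
The gradient of the regularization term with respect to W⁽ˡ⁾ is simply (λ/m) W⁽ˡ⁾. So the updated backprop becomes:
dW⁽ˡ⁾ = dW⁽ˡ⁾_base + (λ/m) W⁽ˡ⁾
W⁽ˡ⁾ := W⁽ˡ⁾ − α dW⁽ˡ⁾ = (1 − αλ/m) W⁽ˡ⁾ − α dW⁽ˡ⁾_base
Notice the factor (1 − αλ/m): it shrinks W towards zero at every step. This is why L2 is called weight decay.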
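To see the decay concretely, here is a minimal sketch on a single scalar weight. The helper names `l2_step` and `decay_step` and the values of α, λ, and m are illustrative assumptions. It checks that adding (λ/m)·w to the gradient is the same as multiplying the weight by (1 − αλ/m), and shows the weight shrinking geometrically when the data gradient is zero.

```python
def l2_step(w, dw_base, alpha, lam, m):
    """One gradient step on a single weight with the L2 term (lam/m) * w added."""
    return w - alpha * (dw_base + (lam / m) * w)

def decay_step(w, dw_base, alpha, lam, m):
    """The same step, rewritten in weight-decay form."""
    return (1 - alpha * lam / m) * w - alpha * dw_base

alpha, lam, m = 0.1, 1.0, 10      # illustrative values; decay factor = 0.99

# The two forms agree for an arbitrary weight and gradient
assert abs(l2_step(0.5, -0.3, alpha, lam, m) - decay_step(0.5, -0.3, alpha, lam, m)) < 1e-12

# With a zero data gradient, the weight shrinks geometrically towards zero
w = 1.0
for _ in range(100):
    w = l2_step(w, 0.0, alpha, lam, m)
print(f"weight after 100 steps: {w:.4f}")
```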
Why Does Shrinking Weights Help?
When λ is large, the penalty forces many weights towards zero, making the network behave as if it were simpler (fewer effective parameters). This is like turning a complex multi-layer network into something closer to a shallow, linear model, reducing variance at the cost of slightly increased bias.
Choosing λ (Regularization Strength)
| λ value | Effect | Risk / Note |
|---|---|---|
| λ = 0 | No regularization | High variance (overfitting) |
| λ = 0.01 – 0.1 | Light regularization | Good starting point |
| λ = 0.1 – 1.0 | Moderate regularization | May need larger network |
| λ = 10 – 100 | Heavy regularization | High bias (underfitting) |
Best practice: Treat λ as a hyperparameter. Tune it on a held-out dev set (or with k-fold cross-validation when data is scarce). Try powers of 10: [0.001, 0.01, 0.1, 1, 10].
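9.3 L1 Regularization (Lasso)
L1-Regularized Cost Function
J_reg(W, b) = (1/m) Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) + (λ/m) Σₗ ||W⁽ˡ⁾||₁
where ||W⁽ˡ⁾||₁ is the sum of the absolute values of the entries of W⁽ˡ⁾.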
L1 Produces Sparse Weights
The key difference: L1 drives weights exactly to zero, effectively performing feature selection. L2 shrinks weights towards zero but rarely makes them exactly zero.
Geometric intuition: The L1 constraint region is a diamond (corners on axes), while L2 is a circle. The optimal point is more likely to touch a corner (where one coordinate = 0) for L1.
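The sparsity difference is easiest to see on a one-weight problem. The sketch below is a simplification, not a full network: it assumes a quadratic loss 0.5·(w − a)² around a hypothetical unregularized weight a, for which both penalised minimisers have closed forms. The function names and sample values are illustrative.

```python
def l1_shrink(a, lam):
    """Minimiser of 0.5*(w - a)**2 + lam*abs(w): soft-thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0          # anything inside [-lam, lam] is set exactly to zero

def l2_shrink(a, lam):
    """Minimiser of 0.5*(w - a)**2 + 0.5*lam*w**2: proportional shrinkage."""
    return a / (1.0 + lam)

unregularized = [2.0, -0.8, 0.3, -0.1, 0.05]   # hypothetical unregularized weights
lam = 0.5
l1_weights = [l1_shrink(a, lam) for a in unregularized]
l2_weights = [l2_shrink(a, lam) for a in unregularized]
print("L1:", l1_weights)
print("L2:", l2_weights)
```

Under these assumptions L1 zeroes every weight whose magnitude is at most λ, while L2 only scales each weight down by the same factor, so none becomes exactly zero.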
When to Use L1 vs L2
| Criterion | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Sparsity | Yes: produces exact zeros | No: weights are small but non-zero |
| Feature selection | Yes: automatically removes features | No: keeps all features |
| Correlated features | Picks one, ignores others | Distributes weight among correlated features |
| Computational | Non-differentiable at 0 | Smooth gradient everywhere |
| Deep learning usage | Rare (used for compression) | Very common (default regularizer) |
Elastic Net: you can combine both penalties, λ₁||W||₁ + λ₂||W||²₂. This gives you sparsity from L1 and stability from L2.
9.4 Dropout Regularization
Dropout is arguably the most important regularization technique invented specifically for neural networks. Introduced by Srivastava, Hinton et al. (2014), it has a beautifully simple idea: randomly turn off neurons during training.
Dropout Algorithm
1. Generate a random binary mask D⁽ˡ⁾ where each element is 1 with probability keep_prob and 0 with probability 1 - keep_prob
2. Element-wise multiply: A⁽ˡ⁾ = A⁽ˡ⁾ * D⁽ˡ⁾ (zero out dropped neurons)
3. Inverted dropout: scale by A⁽ˡ⁾ = A⁽ˡ⁾ / keep_prob
Do NOT apply dropout at test time. Use all neurons with their full weights. Because we used inverted dropout (step 3), no scaling is needed at test time: the expected values already match.
Why Scale by 1/keep_prob?
If keep_prob = 0.8, on average 80% of neurons survive. The expected sum of activations drops by 20%. Dividing by 0.8 compensates, ensuring E[A_dropped] = A_original. This is the "inverted" part: we fix the scale at training time so test time is clean.
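The expectation argument can be checked numerically. This is a minimal sketch in plain Python, assuming a hypothetical layer whose activations are all 1.0 so the means are easy to read; `inverted_dropout` is an illustrative helper, not a library function.

```python
import random

def inverted_dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob and rescale the survivors."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)            # fixed seed so the run is repeatable
keep_prob = 0.8
activations = [1.0] * 100_000     # hypothetical layer output, all ones for clarity

dropped = inverted_dropout(activations, keep_prob, rng)
mean_scaled = sum(dropped) / len(dropped)      # stays close to the original mean of 1.0
mean_unscaled = mean_scaled * keep_prob        # the mean we would see without the 1/keep_prob step
print(f"with scaling: {mean_scaled:.3f}, without scaling: {mean_unscaled:.3f}")
```

With the scaling step the average activation stays near 1.0; without it the average would sit near keep_prob, which is the train/test mismatch that inverted dropout avoids.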
Dropout as Ensemble Learning
Each mini-batch sees a different randomly-thinned network. With n neurons, there are 2ⁿ possible sub-networks. Dropout approximately trains an exponential number of models and averages their predictions, a form of model ensembling built into training.
Typical keep_prob Values
| Layer Type | Typical keep_prob | Rationale |
|---|---|---|
| Input layer | 1.0 (no dropout) | Don't drop raw features |
| Hidden layers (small, e.g. 64 units) | 0.8 – 0.9 | Small layers: keep more |
| Hidden layers (large, e.g. 4096 units) | 0.5 | Large layers: more aggressive dropout |
| Output layer | 1.0 (no dropout) | Don't drop predictions |
Common pitfall: forgetting to call model.eval() in PyTorch or to set training=False in TensorFlow. If dropout remains active at test time, predictions become stochastic and accuracy drops unpredictably. Always switch to eval mode for inference.
9.5 Data Augmentation
The most reliable way to reduce overfitting is simple: get more data. When that's too expensive, data augmentation creates synthetic training examples from existing ones.
Image Augmentation Techniques
| Technique | Description | Example Use |
|---|---|---|
| Horizontal Flip | Mirror image left-right | Product photos on Flipkart |
| Random Rotation (±15°) | Slight tilt | Document OCR (Aadhaar card scanning) |
| Random Crop | Extract sub-regions, resize back | Wildlife detection in Jim Corbett |
| Color Jitter | Random brightness, contrast, saturation | Handles varying lighting in Indian streets |
| Cutout / Random Erasing | Mask random rectangles | Handles occlusions in traffic cameras |
| Mixup | Blend two images + labels | Zhang et al. (2018), used in medical imaging |
Text Augmentation Techniques
| Technique | Description | Indian Context |
|---|---|---|
| Back-translation | English → Hindi → English | Jio's multilingual chatbot training |
| Synonym replacement | Replace words with WordNet synonyms | Sentiment analysis on Amazon India reviews |
| Random insertion/deletion | Insert/remove random words | Robustness to typos in Hinglish text |
| Hindi-English code-switching | "yeh product bahut achha hai" ↔ "this product is very good" | Social media analysis for ShareChat, Koo |
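Two of the text techniques in the table are simple enough to sketch in a few lines of plain Python. Everything here is illustrative: the `SYNONYMS` table is a hypothetical toy lexicon (a real pipeline would draw on WordNet or a bilingual dictionary), and the function names are not from any library.

```python
import random

# Hypothetical toy synonym table, for illustration only
SYNONYMS = {"good": ["great", "nice"], "product": ["item"], "achha": ["badhiya"]}

def random_deletion(words, p, rng):
    """Drop each word with probability p, but never return an empty sentence."""
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

def synonym_replacement(words, rng):
    """Replace every word that has an entry in SYNONYMS with a random synonym."""
    return [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]

rng = random.Random(42)           # fixed seed so the augmentations are repeatable
sentence = "yeh product bahut achha hai".split()
print(random_deletion(sentence, p=0.2, rng=rng))
print(synonym_replacement(sentence, rng))
```

Each call produces a slightly different variant of the same sentence with the same label, which is what lets augmentation stand in for extra labelled data.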
9.6 Early Stopping
Early stopping is the simplest regularization technique: stop training when the dev error starts increasing, even if training error is still decreasing.
Early Stopping Algorithm
- Split data into train / dev / test sets
- After each epoch, evaluate both train loss and dev loss
- Save the model weights whenever dev loss reaches a new minimum ("checkpointing")
- If dev loss hasn't improved for patience epochs, stop training
- Restore the best checkpoint
Choosing patience (rule of thumb):
- Small dataset (< 10K): patience = 5–10 epochs
- Medium dataset (10K–1M): patience = 10–20 epochs
- Large dataset (> 1M): patience = 3–5 epochs (each epoch is expensive)
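The patience logic can be isolated from any particular model. The sketch below is a framework-free illustration: it takes a list of per-epoch dev losses (here a hypothetical curve) and reports where the best checkpoint would be saved and where training would halt. The function name is an assumption made for this example.

```python
def early_stopping_epoch(dev_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch dev losses.

    best_epoch is where the checkpoint would be saved; stop_epoch is where
    training halts (the last epoch if patience never runs out).
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:          # new minimum: checkpoint here
            best_loss, best_epoch, wait = loss, epoch, 0
        else:                         # no improvement this epoch
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(dev_losses) - 1

# Hypothetical dev-loss curve: improves, then starts to overfit
dev_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60, 0.65]
best, stopped = early_stopping_epoch(dev_losses, patience=3)
print(f"best epoch: {best}, stopped at epoch: {stopped}")
```

Note that the small dip at epoch 5 (0.51) does not reset the counter because it is not a new minimum; only a loss below the best seen so far counts as an improvement.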
The Train vs. Dev Error Plot
Pros and Cons of Early Stopping
| Pros | Cons |
|---|---|
| No extra hyperparameters (just patience) | Couples optimization and regularization |
| Computationally free | Can't independently tune learning rate and regularization |
| Easy to implement | Requires keeping a dev set (reduces training data) |
| Works with any architecture | May stop before the model has fully explored the loss landscape |
From-Scratch Code: L2 Regularization + Dropout
Let's build a deep neural network with L2 regularization and dropout from scratch using only NumPy. We'll then compare performance with and without regularization on a synthetic overfitting-prone dataset.
4.1 Generate a Noisy Dataset (Designed to Overfit)
# --- Generate noisy circular dataset ---
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

def generate_noisy_circles(m=300, noise=0.15):
    """Generate 2D circular data with noise; easy to overfit."""
    t = np.linspace(0, 2 * np.pi, m // 2)
    # Inner circle (label 0)
    r1 = 0.5 + np.random.randn(m // 2) * noise
    X1 = np.column_stack([r1 * np.cos(t), r1 * np.sin(t)])
    # Outer circle (label 1)
    r2 = 1.0 + np.random.randn(m // 2) * noise
    X2 = np.column_stack([r2 * np.cos(t), r2 * np.sin(t)])
    X = np.vstack([X1, X2]).T  # (2, m)
    Y = np.hstack([np.zeros(m // 2), np.ones(m // 2)]).reshape(1, -1)
    # Shuffle
    perm = np.random.permutation(m)
    return X[:, perm], Y[:, perm]

X_train, Y_train = generate_noisy_circles(300)
X_test, Y_test = generate_noisy_circles(100)
print(f"X_train: {X_train.shape}, Y_train: {Y_train.shape}")
print(f"X_test: {X_test.shape}, Y_test: {Y_test.shape}")
Python
4.2 Deep Neural Network with L2 + Dropout
# --- Deep NN with L2 Regularization + Dropout ---
class DeepNNRegularized:
    """
    Deep Neural Network with:
    - L2 regularization (weight decay)
    - Inverted dropout
    - He initialization
    """

    def __init__(self, layer_dims, lambd=0.0, keep_probs=None):
        """
        layer_dims: [n_x, n_h1, n_h2, ..., n_y]
        lambd: L2 regularization strength
        keep_probs: keep probability for the output of each layer 1..L
                    (length = len(layer_dims) - 1). keep_probs[l-1] is
                    applied to A[l], so the last entry (output layer)
                    should stay at 1.0,
                    e.g. [0.8, 0.8, 0.8, 1.0] for a 3-hidden-layer net
        """
        self.L = len(layer_dims) - 1  # number of layers
        self.lambd = lambd
        self.params = {}
        self.costs = []
        # Default: no dropout
        if keep_probs is None:
            self.keep_probs = [1.0] * self.L
        else:
            self.keep_probs = keep_probs
        # He initialization
        for l in range(1, self.L + 1):
            self.params[f'W{l}'] = np.random.randn(
                layer_dims[l], layer_dims[l-1]
            ) * np.sqrt(2.0 / layer_dims[l-1])
            self.params[f'b{l}'] = np.zeros((layer_dims[l], 1))

    def relu(self, Z):
        return np.maximum(0, Z)

    def sigmoid(self, Z):
        return 1 / (1 + np.exp(-np.clip(Z, -500, 500)))

    def forward(self, X, training=True):
        """Forward pass with optional dropout."""
        cache = {'A0': X}
        A = X
        for l in range(1, self.L + 1):
            W = self.params[f'W{l}']
            b = self.params[f'b{l}']
            Z = W @ A + b
            if l == self.L:
                A = self.sigmoid(Z)  # output layer
            else:
                A = self.relu(Z)  # hidden layers
            # --- INVERTED DROPOUT ---
            if training and self.keep_probs[l-1] < 1.0:
                D = (np.random.rand(*A.shape) < self.keep_probs[l-1])
                D = D.astype(np.float64)
                A = A * D  # zero out dropped neurons
                A = A / self.keep_probs[l-1]  # scale up survivors
                cache[f'D{l}'] = D
            cache[f'Z{l}'] = Z
            cache[f'A{l}'] = A
        return A, cache

    def compute_cost(self, AL, Y):
        """Cross-entropy + L2 regularization cost."""
        m = Y.shape[1]
        # Cross-entropy
        cross_entropy = -(1/m) * np.sum(
            Y * np.log(AL + 1e-8) + (1-Y) * np.log(1-AL + 1e-8)
        )
        # --- L2 REGULARIZATION TERM ---
        l2_cost = 0
        if self.lambd > 0:
            for l in range(1, self.L + 1):
                l2_cost += np.sum(np.square(self.params[f'W{l}']))
            l2_cost = (self.lambd / (2 * m)) * l2_cost
        return cross_entropy + l2_cost

    def backward(self, AL, Y, cache):
        """Backprop with L2 gradient and dropout masks."""
        m = Y.shape[1]
        grads = {}
        dA = None  # assigned inside the loop before any hidden layer uses it
        for l in reversed(range(1, self.L + 1)):
            Z = cache[f'Z{l}']
            A_prev = cache[f'A{l-1}']
            W = self.params[f'W{l}']
            if l == self.L:
                # Sigmoid + cross-entropy: dZ simplifies to AL - Y
                # (assumes no dropout on the output layer)
                dZ = AL - Y
            else:
                dZ = dA * (Z > 0).astype(np.float64)  # ReLU derivative
            # --- L2 REGULARIZATION IN GRADIENT ---
            grads[f'dW{l}'] = (1/m) * (dZ @ A_prev.T) + (self.lambd/m) * W
            grads[f'db{l}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
            if l > 1:
                dA = W.T @ dZ
                # --- APPLY DROPOUT MASK TO GRADIENT ---
                # Same mask and scale that were applied to A[l-1] in forward()
                if f'D{l-1}' in cache:
                    dA = dA * cache[f'D{l-1}']
                    dA = dA / self.keep_probs[l-2]
        return grads

    def train(self, X, Y, learning_rate=0.01, epochs=3000,
              print_every=500):
        """Train with gradient descent."""
        self.costs = []
        for i in range(epochs):
            # Forward (training=True enables dropout)
            AL, cache = self.forward(X, training=True)
            cost = self.compute_cost(AL, Y)
            grads = self.backward(AL, Y, cache)
            # Update parameters
            for l in range(1, self.L + 1):
                self.params[f'W{l}'] -= learning_rate * grads[f'dW{l}']
                self.params[f'b{l}'] -= learning_rate * grads[f'db{l}']
            if i % 100 == 0:
                self.costs.append(cost)
            if i % print_every == 0:
                print(f"Epoch {i:5d} | Cost: {cost:.6f}")
        return self.costs

    def predict(self, X):
        """Predict with dropout OFF (training=False)."""
        AL, _ = self.forward(X, training=False)
        return (AL > 0.5).astype(np.int32)

    def accuracy(self, X, Y):
        preds = self.predict(X)
        return np.mean(preds == Y) * 100
Python
4.3 Experiment: With vs. Without Regularization
# --- Experiment: Compare No Reg vs L2 vs Dropout vs L2+Dropout ---
layer_dims = [2, 64, 32, 16, 1]  # Deliberately large for small data

# Model 1: No regularization (will overfit!)
print("--- Model 1: NO Regularization ---")
model_none = DeepNNRegularized(layer_dims, lambd=0.0)
model_none.train(X_train, Y_train, learning_rate=0.05, epochs=3000)
print(f"Train Acc: {model_none.accuracy(X_train, Y_train):.1f}%")
print(f"Test Acc: {model_none.accuracy(X_test, Y_test):.1f}%")

# Model 2: L2 Regularization
print("\n--- Model 2: L2 Regularization (λ=0.7) ---")
model_l2 = DeepNNRegularized(layer_dims, lambd=0.7)
model_l2.train(X_train, Y_train, learning_rate=0.05, epochs=3000)
print(f"Train Acc: {model_l2.accuracy(X_train, Y_train):.1f}%")
print(f"Test Acc: {model_l2.accuracy(X_test, Y_test):.1f}%")

# Model 3: Dropout
print("\n--- Model 3: Dropout (keep_prob=0.8) ---")
model_drop = DeepNNRegularized(layer_dims, keep_probs=[0.8, 0.8, 0.8, 1.0])
model_drop.train(X_train, Y_train, learning_rate=0.05, epochs=3000)
print(f"Train Acc: {model_drop.accuracy(X_train, Y_train):.1f}%")
print(f"Test Acc: {model_drop.accuracy(X_test, Y_test):.1f}%")

# Model 4: L2 + Dropout
print("\n--- Model 4: L2 (λ=0.5) + Dropout (keep=0.85) ---")
model_both = DeepNNRegularized(layer_dims, lambd=0.5,
                               keep_probs=[0.85, 0.85, 0.85, 1.0])
model_both.train(X_train, Y_train, learning_rate=0.05, epochs=3000)
print(f"Train Acc: {model_both.accuracy(X_train, Y_train):.1f}%")
print(f"Test Acc: {model_both.accuracy(X_test, Y_test):.1f}%")
Python
4.4 Plot: Cost Curves Comparison
# --- Plot cost curves for all four models ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Cost curves (costs were recorded every 100 epochs)
ax1 = axes[0]
epochs_range = range(0, 3000, 100)
ax1.plot(epochs_range, model_none.costs, 'r-', label='No Reg', linewidth=2)
ax1.plot(epochs_range, model_l2.costs, 'b-', label='L2 (λ=0.7)', linewidth=2)
ax1.plot(epochs_range, model_drop.costs, 'g-', label='Dropout (0.8)', linewidth=2)
ax1.plot(epochs_range, model_both.costs, 'm-', label='L2+Dropout', linewidth=2)
ax1.set_xlabel('Epochs'); ax1.set_ylabel('Cost')
ax1.set_title('Training Cost Curves')
ax1.legend(); ax1.grid(True, alpha=0.3)

# Right: Accuracy comparison bar chart, computed from the trained models.
# Values reported in the text: train = [99.7, 93.3, 92.0, 94.0],
# test = [78.0, 91.0, 89.0, 93.0]; your run may differ.
ax2 = axes[1]
models = ['No Reg', 'L2', 'Dropout', 'L2+Drop']
trained = [model_none, model_l2, model_drop, model_both]
train_accs = [mdl.accuracy(X_train, Y_train) for mdl in trained]
test_accs = [mdl.accuracy(X_test, Y_test) for mdl in trained]
x = np.arange(len(models))
ax2.bar(x - 0.2, train_accs, 0.35, label='Train', color='#7c3aed')
ax2.bar(x + 0.2, test_accs, 0.35, label='Test', color='#a78bfa')
ax2.set_xticks(x); ax2.set_xticklabels(models)
ax2.set_ylabel('Accuracy %'); ax2.set_title('Train vs Test Accuracy')
ax2.set_ylim(0, 105); ax2.legend()
ax2.axhline(y=90, color='gray', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig('regularization_comparison.png', dpi=150)
plt.show()

# Train-test gap in percentage points (a large gap signals overfitting)
gaps = [tr - te for tr, te in zip(train_accs, test_accs)]
print(f"Key insight: No-reg train-test gap: {gaps[0]:.1f} points.")
print(f"             L2+Dropout train-test gap: {gaps[3]:.1f} points.")
Python
Industry Code: PyTorch Regularization
5.1 L2 Regularization via weight_decay
# --- PyTorch: L2 Regularization (weight_decay) ---
import torch
import torch.nn as nn
import torch.optim as optim

class RegularizedNet(nn.Module):
    def __init__(self, input_dim, hidden_dims, dropout_rate=0.2):
        super().__init__()
        layers = []
        prev = input_dim
        for h in hidden_dims:
            layers.extend([
                nn.Linear(prev, h),
                nn.ReLU(),
                # p is the DROP probability, i.e. p = 1 - keep_prob
                nn.Dropout(p=dropout_rate),  # dropout after activation
            ])
            prev = h
        layers.append(nn.Linear(prev, 1))
        layers.append(nn.Sigmoid())
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Create model
model = RegularizedNet(input_dim=2, hidden_dims=[64, 32, 16], dropout_rate=0.2)

# --- L2 via weight_decay parameter in optimizer ---
# Adam's weight_decay adds an L2 penalty term to the gradient
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
criterion = nn.BCELoss()
print(model)
print(f"\nTotal parameters: {sum(p.numel() for p in model.parameters()):,}")
Python
5.2 Training Loop with Early Stopping
# --- PyTorch Training with Early Stopping ---
import copy

def train_with_early_stopping(model, X_train, Y_train, X_dev, Y_dev,
                              epochs=5000, patience=50, lr=0.001,
                              weight_decay=1e-4):
    optimizer = optim.Adam(model.parameters(), lr=lr,
                           weight_decay=weight_decay)
    criterion = nn.BCELoss()
    best_dev_loss = float('inf')
    best_weights = None
    wait = 0
    train_losses, dev_losses = [], []
    for epoch in range(epochs):
        # --- Train mode (dropout ON) ---
        model.train()
        y_pred = model(X_train)
        train_loss = criterion(y_pred, Y_train)
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()
        # --- Eval mode (dropout OFF) ---
        model.eval()
        with torch.no_grad():
            dev_pred = model(X_dev)
            dev_loss = criterion(dev_pred, Y_dev).item()
        train_losses.append(train_loss.item())
        dev_losses.append(dev_loss)
        # --- Early Stopping Check ---
        if dev_loss < best_dev_loss:
            best_dev_loss = dev_loss
            # Deep copy: state_dict() tensors are updated in place by the
            # optimizer, so a shallow .copy() would not preserve the checkpoint
            best_weights = copy.deepcopy(model.state_dict())
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                print(f"Early stopping at epoch {epoch}")
                break
        if epoch % 500 == 0:
            print(f"Epoch {epoch:4d} | Train: {train_loss.item():.4f} | "
                  f"Dev: {dev_loss:.4f} | Wait: {wait}")
    # Restore best model
    model.load_state_dict(best_weights)
    print(f"Restored best model (dev loss: {best_dev_loss:.4f})")
    return train_losses, dev_losses

print("Training with L2 (weight_decay=1e-4) + Dropout (0.2) + Early Stopping...")
Python
5.3 Data Augmentation Pipeline (torchvision)
# --- Image Augmentation Pipeline for Indian Street Scenes ---
from torchvision import transforms

# Training: aggressive augmentation
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(
        brightness=0.3, contrast=0.3,
        saturation=0.2, hue=0.1
    ),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
    # RandomErasing works on tensors, so it must come after ToTensor()
    transforms.RandomErasing(p=0.2),  # Cutout
])

# Validation: NO augmentation (deterministic)
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])
print("Train augmentation:", train_transform)
print("\nVal augmentation (no randomness):", val_transform)
Python
Industry Note: Regularization at Scale
At TCS and Infosys, production ML pipelines combine multiple regularization strategies in a layered approach:
1. Data Augmentation (first line of defense: more data usually helps)
2. L2 Regularization (weight_decay in optimizer, nearly universal)
3. Dropout (0.1–0.3 for most architectures)
4. Early Stopping (patience-based with checkpoint restoration)
5. Batch Normalization (has mild regularization effect; covered in Chapter 10)
Visual Diagrams
6.1 Bias-Variance Spectrum
6.2 Dropout Visualization
6.3 L1 vs L2 Constraint Geometry
6.4 Decision Flowchart: Diagnosing & Fixing Your Model
6.5 Early Stopping: Train vs Dev Loss
Worked Example: Computing L2-Regularized Gradients by Hand
Problem Setup
A 2-layer neural network with:
- Layer 1: W⁽¹⁾ is 3×2, b⁽¹⁾ is 3×1
- Layer 2: W⁽²⁾ is 1×3, b⁽²⁾ is 1×1
- m = 4 training examples, λ = 0.1
Step 1: Compute the L2 Regularization Term
Given:
W⁽¹⁾ = [[0.3, -0.5], [0.8, 0.1], [-0.2, 0.6]]
W⁽²⁾ = [[0.2, 0.7, -0.4]]
Math
||W⁽¹⁾||²_F = 0.09 + 0.25 + 0.64 + 0.01 + 0.04 + 0.36 = 1.39
||W⁽²⁾||²_F = 0.04 + 0.49 + 0.16 = 0.69
L2 term = (λ/2m)(||W⁽¹⁾||²_F + ||W⁽²⁾||²_F) = (0.1/8)(2.08) = 0.026
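Step 2: Compute the Regularized Gradient for W⁽²⁾
Suppose standard backprop gives us dW⁽²⁾_base = [[0.15, -0.08, 0.22]]
dW⁽²⁾ = dW⁽²⁾_base + (λ/m) W⁽²⁾ = [[0.15, -0.08, 0.22]] + 0.025 × [[0.2, 0.7, -0.4]] = [[0.155, -0.0625, 0.21]]
Step 3: Weight Update (Weight Decay Form)
With learning rate α = 0.01:
W⁽²⁾ := (1 − αλ/m) W⁽²⁾ − α dW⁽²⁾_base = 0.99975 × [[0.2, 0.7, -0.4]] − 0.01 × [[0.15, -0.08, 0.22]] = [[0.19845, 0.700625, -0.4021]]
The factor 0.99975 shows that each weight is multiplied by a number slightly less than 1 at every step; this is the "decay."
Note: the regularization contribution to the gradient is small (0.025 × 0.2 = 0.005 for the first weight) compared to the base gradient (0.15). This is expected: λ/m = 0.025 is small. If the regularization term dominates the gradient, your λ is too large and you'll underfit.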
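The hand calculation can be verified with a few lines of plain Python. This sketch reuses the numbers from the problem setup (λ = 0.1, m = 4, α = 0.01, the given W⁽¹⁾, W⁽²⁾, and base gradient); the variable names are chosen for this example only.

```python
lam, m, alpha = 0.1, 4, 0.01
W1 = [[0.3, -0.5], [0.8, 0.1], [-0.2, 0.6]]
W2 = [0.2, 0.7, -0.4]
dW2_base = [0.15, -0.08, 0.22]

# Step 1: L2 term = (lam / 2m) * (sum of squared weights over both layers)
sq_norm = sum(w * w for row in W1 for w in row) + sum(w * w for w in W2)
l2_term = lam / (2 * m) * sq_norm

# Step 2: regularized gradient for W2 = base gradient + (lam / m) * W2
dW2 = [g + (lam / m) * w for g, w in zip(dW2_base, W2)]

# Step 3: update in weight-decay form
decay = 1 - alpha * lam / m
W2_new = [decay * w - alpha * g for w, g in zip(W2, dW2_base)]

print(f"squared norm = {sq_norm:.2f}, L2 term = {l2_term:.4f}")
print("dW2    =", [round(g, 4) for g in dW2])
print("W2_new =", [round(w, 6) for w in W2_new])
```

The decay-form update gives the same result as subtracting α times the regularized gradient from W⁽²⁾, which is a useful sanity check when implementing either form.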
Case Study: BYJU'S Student Outcome Prediction
When an EdTech Model Memorised Metro Students
The Problem
BYJU'S, India's largest EdTech platform (serving 15 crore+ students), built a deep neural network to predict student exam outcomes based on learning behaviour: video watch time, quiz scores, revision patterns, and engagement metrics.
The Data
| Split | Source | Size |
|---|---|---|
| Training | Mumbai, Delhi, Bangalore, Chennai, Hyderabad | 8 lakh students |
| Dev | Same 5 cities (random split) | 1 lakh students |
| Test | Tier-2/3 cities: Patna, Bhopal, Coimbatore, Guwahati | 2 lakh students |
Initial Results (No Regularization)
| Metric | Train | Dev (Same Cities) | Test (New Cities) |
|---|---|---|---|
| Accuracy | 96.8% | 95.2% | 71.3% |
| F1 Score | 0.97 | 0.95 | 0.68 |
The 24-percentage-point gap between dev and test showed severe overfitting to metro-city patterns.
Root Cause Analysis
- Feature leakage: Metro students had consistent Wi-Fi (video completion rate ≈ 95%), while Tier-3 students had patchy internet (completion ≈ 60%). The model learned "high video completion → passes", a proxy for "lives in metro city."
- Device bias: 85% of metro students used tablets/laptops; 70% of Tier-3 students used budget smartphones with smaller screens. The model learned screen-time patterns specific to device types.
- Language proxy: Metro students mostly used English content; Tier-3 students used Hindi/regional language content with different engagement patterns.
The Fix: Multi-Layered Regularization
- Data Augmentation: Simulated poor connectivity by randomly dropping 20-40% of video watch events. Added noise to quiz completion times. Mixed Hindi and English engagement patterns.
- Dropout (p=0.3): Applied to all hidden layers to prevent co-adaptation of metro-specific features.
- L2 Regularization (λ=0.01): Penalised large weights that encoded city-specific shortcuts.
- Feature Engineering: Replaced raw "video completion %" with "relative engagement" (normalised within each connectivity tier).
- Early Stopping (patience=15): Monitored performance on a held-out Tier-2 city (Jaipur) to catch overfitting early.
Results After Regularization
| Metric | Train | Dev | Test (New Cities) |
|---|---|---|---|
| Accuracy | 89.5% | 88.7% | 86.2% |
| F1 Score | 0.90 | 0.89 | 0.85 |
Key Lessons
- Train-test gap reduced from 25.5 to 3.3 percentage points
- Training accuracy dropped (89.5% vs 96.8%); this is expected and healthy
- Test accuracy on unseen Tier-2/3 cities jumped from 71.3% to 86.2%
- The ₹450 crore annual prediction pipeline now serves all of India, not just 5 metros
Common Mistakes & Misconceptions
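Leaving dropout on at inference: forgetting to call model.eval() in PyTorch means dropout randomly zeros neurons at test time, making predictions stochastic and unreliable. Always: model.eval() before prediction, model.train() before training.
Skipping the inverted-dropout scaling: if you do not divide by keep_prob during training, test-time activations will be systematically larger (by a factor of about 1/keep_prob) than the ones the network saw during training. The inverted scaling step is crucial.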
Comparison Table โ Regularization Techniques
| Technique | Mechanism | Hyperparameters | Pros | Cons | When to Use |
|---|---|---|---|---|---|
| L2 (Weight Decay) | Penalise ||W||² | λ | Smooth, differentiable, easy to tune | Doesn't produce sparsity | Default for almost all DNNs |
| L1 (Lasso) | Penalise ||W||₁ | λ | Produces sparse weights, feature selection | Non-differentiable at 0, less stable | When you need model compression / pruning |
| Dropout | Random neuron masking | keep_prob per layer | Powerful, acts as ensemble | Noisy training loss, slower convergence | Large FC layers; less useful for CNNs |
| Data Augmentation | Synthetic data expansion | Aug strategy, magnitude | Increases data without collection cost, preserves capacity | Domain-specific, can introduce artifacts | Always: try first before other methods |
| Early Stopping | Stop at best dev loss | patience | Zero compute cost, no extra hyperparams | Couples optimisation and regularization | As a safety net alongside other methods |
| Batch Norm | Normalise layer inputs | None | Speeds training, mild regularization | Adds complexity, batch-size dependent | Almost always (covered in Ch 10) |
Exercises
Section A: Multiple Choice Questions (10)
Adding L2 regularization to a neural network's cost function is equivalent to adding which term?
- (λ/m) Σₗ ||W⁽ˡ⁾||₁
- (λ/2m) Σₗ ||W⁽ˡ⁾||²_F
- (λ/2m) Σₗ ||b⁽ˡ⁾||²
- (λ/m) Σₗ Σᵢ |wᵢ|
In inverted dropout with keep_prob = 0.8, by what factor are surviving activations scaled during training?
- 0.8
- 0.2
- 1.25 (i.e., 1/0.8)
- 5.0 (i.e., 1/0.2)
A model has training error = 2% and dev error = 18%. Bayes error is approximately 1%. What is the primary diagnosis?
- High bias
- High variance
- High bias AND high variance
- The model is perfectly fine
Which regularization technique is most likely to produce a sparse weight matrix (many exact zeros)?
- L2 regularization
- L1 regularization
- Dropout
- Early stopping
During test/inference time, what should happen with dropout?
- Apply dropout with the same keep_prob as training
- Apply dropout with keep_prob = 0.5 always
- Turn OFF dropout entirely: use all neurons
- Apply dropout only to the output layer
The "weight decay" interpretation of L2 regularization means that at each update step, each weight is multiplied by:
- (1 + αλ/m)
- (1 − αλ/m)
- (1 − λ/m)
- α/λ
Andrew Ng argues against early stopping as the primary regularization strategy because:
- It is computationally expensive
- It couples the tasks of optimizing J and preventing overfitting (violates "orthogonalization")
- It requires computing second-order derivatives
- It only works with SGD, not Adam
A Flipkart image classifier trained on product photos is overfitting. Which data augmentation technique would be LEAST appropriate?
- Horizontal flip
- Random rotation (±10°)
- Vertical flip (upside-down)
- Color jitter
If a neural network has training error = 22% and dev error = 24%, with Bayes error ≈ 5%, what should you do FIRST?
- Add more training data
- Increase L2 regularization strength
- Use a bigger/deeper network
- Apply stronger dropout (lower keep_prob)
Which statement about the Frobenius norm is CORRECT?
- It sums the absolute values of all matrix elements
- It computes the square root of the sum of squared elements
- It equals the largest singular value of the matrix
- It only considers diagonal elements of the weight matrix
Section B: Short Answer Questions (5)
Explain why dropout can be interpreted as training an ensemble of sub-networks. How many possible sub-networks exist for a layer with n neurons?
A model shows train error = 0.5% and dev error = 0.8%, but Bayes error is approximately 0.3%. Diagnose the model and suggest the next steps.
Why does L1 regularization produce sparse weights while L2 does not? Give a geometric explanation.
In the inverted dropout implementation, what would happen if we skipped the scaling step (dividing by keep_prob) during training?
Answer: Without the scaling step, the expected value of each activation during training would be keep_prob × a (since each neuron survives with probability keep_prob). At test time, with all neurons active, the expected value would be a. This mismatch means test-time activations are systematically ~1/keep_prob times larger than in training, and the error compounds through the layers, leading to poor predictions. The inverted scaling ensures E[a_train] = a_test.
Explain the "orthogonalization" argument against early stopping. What does Andrew Ng mean by coupling optimization and regularization?
Section C: Long Answer Questions (3)
Prove that L2 regularization is equivalent to MAP estimation with a Gaussian prior on the weights.
Show all steps: start from Bayes' theorem, assume a Gaussian prior W ~ N(0, ฯยฒ_w I), derive the MAP objective, and show it equals the L2-regularized cost function. Identify the relationship between ฮป and ฯยฒ_w.
Step 1: Bayes' Theorem. MAP estimation maximises the posterior P(W|X,Y) โ P(Y|X,W) ยท P(W).
Step 2: Log-posterior. Taking the negative log: argmin_W [-log P(Y|X,W) - log P(W)]
Step 3: Likelihood term. For binary cross-entropy: -log P(Y|X,W) = (1/m) ฮฃแตข L(ลทโฝโฑโพ, yโฝโฑโพ) = Jโ(W) (the unregularized cost).
Step 4: Gaussian prior. Assume W ~ N(0, σ²_w I). Then:
P(W) = Πⱼ (1/√(2πσ²_w)) exp(-wⱼ²/(2σ²_w))
-log P(W) = (1/(2σ²_w)) Σⱼ wⱼ² + const = (1/(2σ²_w)) ||W||²_F + const
Step 5: Combined MAP objective.
argmin_W [J₀(W) + (1/(2σ²_w)) ||W||²_F]
Step 6: Identify λ. Comparing with J_reg = J₀ + (λ/2m)||W||²_F, we get:
λ/2m = 1/(2σ²_w), therefore λ = m/σ²_w
Interpretation: Small σ²_w (tight prior, "I believe weights should be small") → large λ (strong regularization). Large σ²_w (loose prior, "weights can be anything") → small λ (weak regularization).
Bonus: Similarly, L1 regularization corresponds to a Laplacian prior: P(wⱼ) ∝ exp(-|wⱼ|/b), and -log P(W) ∝ ||W||₁. The Laplacian prior is sharply peaked at zero, which explains why L1 produces sparsity.
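As a numerical sanity check on Steps 4 to 6, the sketch below evaluates the negative log of the Gaussian prior directly and compares it with the chapter's L2 penalty once the additive constant is removed. The weight values, the prior variance, and m are made-up illustrative numbers, and both helper names are hypothetical.

```python
import math


def neg_log_gaussian_prior(weights, sigma2):
    """-log P(W) for independent N(0, sigma2) priors on every weight."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma2) + w * w / (2 * sigma2)
        for w in weights
    )


def l2_penalty(weights, lam, m):
    """The chapter's L2 term: (lambda / 2m) * ||W||_F^2."""
    return lam / (2 * m) * sum(w * w for w in weights)


weights = [0.5, -1.2, 0.03, 2.0]   # illustrative flattened weights
sigma2 = 4.0                       # assumed prior variance sigma_w^2
m = 200                            # assumed number of training examples

lam = m / sigma2                   # Step 6: lambda = m / sigma_w^2
# The "const" of Step 4: one normalisation term per weight.
const = len(weights) * 0.5 * math.log(2 * math.pi * sigma2)

lhs = neg_log_gaussian_prior(weights, sigma2) - const
rhs = l2_penalty(weights, lam, m)
print(f"lambda = {lam}")
print(f"-log prior minus const = {lhs:.6f}, L2 penalty = {rhs:.6f}")
```

The two quantities agree up to floating-point error, which is the content of Step 6 for this particular choice of σ²_w and m.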
Derive the complete backpropagation equations for a 3-layer neural network with L2 regularization.
Consider a network with layers [n_x, n₁, n₂, n_y] that uses ReLU for the hidden layers and sigmoid for the output. Show: (a) the regularized cost function, (b) forward pass equations, (c) all backward pass gradient equations with the L2 term, (d) the weight update rules in both standard and weight-decay form.
Compare and contrast three regularization strategies for an Indian language NLP model (e.g., sentiment analysis on Hinglish text from Twitter/X). Discuss: (a) How would you apply L2 regularization to a word embedding + LSTM architecture? (b) Where would you place dropout in the LSTM network and why? (c) Design a data augmentation strategy specific to Hinglish text. Include at least 4 augmentation techniques with examples.
Section D: Programming Questions (2)
Implement early stopping from scratch.
Extend the DeepNNRegularized class to include early stopping. Your implementation should:
- Accept a validation set (X_dev, Y_dev) and patience parameter
- Track training and dev costs at each epoch
- Save the best weights when dev cost reaches a new minimum
- Stop training if dev cost hasn't improved for patience epochs
- Restore best weights after stopping
- Return both train and dev cost histories for plotting
Test on the noisy circles dataset with a deliberately large network (overfit-prone) and show that early stopping finds the optimal epoch.
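One possible starting point for the patience bookkeeping, kept separate from the network itself, is sketched below. It scans a precomputed list of dev costs rather than training a model. The function run_early_stopping, its return dictionary, and the sample costs are hypothetical, and wiring the logic into DeepNNRegularized (including copying and restoring the weights) is left to the exercise.

```python
def run_early_stopping(dev_costs, patience):
    """Scan per-epoch dev costs and report where training would stop.

    Returns the epoch with the lowest dev cost seen before stopping and the
    epoch at which training stops (both 0-indexed).
    """
    best_cost = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    stop_epoch = len(dev_costs) - 1  # default: training ran to the end

    for epoch, cost in enumerate(dev_costs):
        if cost < best_cost:
            # New minimum: this is where the best weights would be saved.
            best_cost, best_epoch = cost, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                stop_epoch = epoch
                break  # the saved best weights would be restored here

    return {"best_epoch": best_epoch, "best_cost": best_cost,
            "stop_epoch": stop_epoch}


# Illustrative dev costs: improve until epoch 3, then drift upward.
dev_costs = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60, 0.65]
result = run_early_stopping(dev_costs, patience=3)
print(result)
```

Note that epoch 5 (0.51) is lower than epoch 4 but still above the best cost of 0.50, so it counts as "no improvement"; only a new minimum resets the counter in this sketch.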
Build a regularization ablation study.
Using the scikit-learn digits dataset (sklearn.datasets.load_digits, a small MNIST-like set of 8×8-pixel digit images, chosen here for simplicity), train a deep neural network with 5 different regularization configurations:
- No regularization (baseline)
- L2 only (λ = 0.01, 0.1, 1.0)
- Dropout only (keep_prob = 0.5, 0.8, 0.95)
- L2 + Dropout (best combination)
- L2 + Dropout + Early Stopping
For each, report: train accuracy, test accuracy, number of near-zero weights (|w| < 0.01), and total training time. Create a summary table and a bar chart comparing train vs test accuracy. Conclude with which strategy works best and why.
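A small sketch of the bookkeeping this study needs: a list of run configurations and a helper that counts near-zero weights. The configuration keys (name, lambd, keep_prob) and the function names are hypothetical, the weight matrices are plain nested lists for illustration, and the training loop itself is omitted.

```python
def count_near_zero(weight_matrices, threshold=0.01):
    """Count weights with |w| < threshold across all layers."""
    return sum(
        1
        for matrix in weight_matrices
        for row in matrix
        for w in row
        if abs(w) < threshold
    )


def build_configs(lambdas=(0.01, 0.1, 1.0), keep_probs=(0.5, 0.8, 0.95)):
    """Enumerate the baseline, L2-only, and dropout-only runs."""
    configs = [{"name": "baseline", "lambd": 0.0, "keep_prob": 1.0}]
    configs += [
        {"name": f"L2 (lambda={lam})", "lambd": lam, "keep_prob": 1.0}
        for lam in lambdas
    ]
    configs += [
        {"name": f"dropout (keep_prob={kp})", "lambd": 0.0, "keep_prob": kp}
        for kp in keep_probs
    ]
    return configs


configs = build_configs()
for cfg in configs:
    print(cfg["name"])

# Two toy layers: three of the six weights fall below the 0.01 threshold.
toy_weights = [[[0.5, -0.004], [0.0, 1.2]], [[0.009, -0.3]]]
print(count_near_zero(toy_weights))
```

The combined runs (L2 + Dropout, with and without early stopping) would be appended once the best λ and keep_prob have been chosen from the single-technique runs.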
Section E: Mini-Project
Regularization Pipeline for Indian Food Image Classification
Build a complete regularization pipeline for classifying Indian food images (Dosa, Idli, Biryani, Butter Chicken, Pani Puri, Chole Bhature; 6 classes).
Requirements:
- Dataset: Use any available food dataset or create a synthetic one with ~500 images (web-scraped or generated images are acceptable). Split: 70% train, 15% dev, 15% test.
- Baseline: Train a CNN (use torchvision's ResNet-18 pretrained) without any regularization. Record train/dev accuracy curves.
- Regularization Layers:
- Add L2 (weight_decay in optimizer)
- Add Dropout (after FC layers)
- Add data augmentation (at least 5 transforms including random crop, flip, color jitter, rotation, cutout)
- Add early stopping (patience=10)
- Ablation Study: Train 4 models (baseline, +L2, +L2+Dropout, +L2+Dropout+Aug) and plot all 4 train/dev curves on the same graph.
- Report: Write a 1-page analysis with a comparison table, the best model's confusion matrix, and your recommendation for production deployment at a company like Zomato (for auto-tagging restaurant menus).
Budget: ₹0 (use Google Colab free GPU). Time: 4–6 hours.
Chapter Summary
Key Takeaways, Chapter 9: Regularization
- Overfitting occurs when a model learns noise in the training data instead of the underlying signal. It manifests as a large gap between training and dev/test performance.
- Bias-Variance Trade-off: Error = Bias² + Variance + Noise. High train error = high bias (underfit). Large train-dev gap = high variance (overfit). Use Bayes error as the reference anchor.
- L2 Regularization adds (λ/2m)||W||²_F to the cost, producing a gradient term (λ/m)W that shrinks weights toward zero (weight decay). It's equivalent to MAP estimation with a Gaussian prior.
- L1 Regularization adds (λ/m)||W||₁ to the cost, producing sparse weights (exact zeros). Useful for feature selection but less common in deep learning than L2.
- Dropout randomly zeros out neurons during training with probability (1 − keep_prob). Inverted dropout scales surviving activations by 1/keep_prob to maintain expected values. Always turn OFF dropout at test time.
- Data Augmentation creates synthetic training examples through label-preserving transformations. For images: flip, rotate, crop, color jitter. For text: back-translation, synonym replacement, Hindi-English code-switching. This is "free" regularization that doesn't reduce model capacity.
- Early Stopping monitors dev loss and stops training when it starts increasing. Simple and effective but couples optimization and regularization (Ng's orthogonalization critique).
- Diagnostic Protocol: High bias → bigger model. High variance → more data + regularization. High bias + high variance → bigger model AND regularization.
- In Practice: Combine multiple techniques. A typical production pipeline uses L2 (weight_decay in Adam, or decoupled weight decay via AdamW) + Dropout (drop probability 0.1–0.3, i.e. keep_prob 0.7–0.9) + Data Augmentation + Early Stopping as a safety net.
- India-Specific Challenge: The urban-rural digital divide means Indian ML models often overfit to metro-city patterns. Regularization and representative data collection are critical for national-scale AI deployment.
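The diagnostic protocol in the takeaways above can be sketched as a small function. The name diagnose, the 1-percentage-point tolerance, and the example error rates are illustrative assumptions; in practice the cut-off for calling a gap "large" depends on the task.

```python
def diagnose(train_err, dev_err, bayes_err, tol=1.0):
    """Suggest next steps from error rates given in percent.

    Avoidable bias = train error - Bayes error.
    Variance       = dev error - train error.
    `tol` (percentage points) is an assumed cut-off for calling a gap large.
    """
    avoidable_bias = train_err - bayes_err
    variance = dev_err - train_err

    high_bias = avoidable_bias > tol
    high_variance = variance > tol

    if high_bias and high_variance:
        advice = "bigger model AND regularization"
    elif high_bias:
        advice = "bigger model"
    elif high_variance:
        advice = "more data + regularization"
    else:
        advice = "no large gap at this tolerance"
    return avoidable_bias, variance, advice


# Illustrative numbers, not taken from the exercises.
print(diagnose(train_err=1.0, dev_err=11.0, bayes_err=0.5))   # large train-dev gap
print(diagnose(train_err=15.0, dev_err=16.0, bayes_err=0.5))  # large train-Bayes gap
print(diagnose(train_err=15.0, dev_err=30.0, bayes_err=0.5))  # both gaps large
```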
Quick Reference Formulas
| Concept | Formula |
|---|---|
| L2 Regularized Cost | J + (λ/2m) Σₗ ‖W⁽ˡ⁾‖²_F |
| L2 Gradient Addition | dW⁽ˡ⁾ += (λ/m) W⁽ˡ⁾ |
| Weight Decay Factor | W⁽ˡ⁾ *= (1 − αλ/m) |
| L1 Gradient Addition | dW⁽ˡ⁾ += (λ/m) sign(W⁽ˡ⁾) |
| Inverted Dropout | A *= D / keep_prob |
| L2 ↔ Gaussian Prior | λ = m / σ²_w |
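To see that the "L2 Gradient Addition" and "Weight Decay Factor" rows describe the same plain gradient-descent step, the sketch below applies both forms to one flattened weight vector and checks that they agree. The function names and all numeric values are illustrative assumptions.

```python
def l2_update(W, dW, alpha, lam, m):
    """Standard form: add (lambda/m) * W to the gradient, then step."""
    return [w - alpha * (dw + (lam / m) * w) for w, dw in zip(W, dW)]


def weight_decay_update(W, dW, alpha, lam, m):
    """Weight-decay form: shrink W by (1 - alpha*lambda/m), then step."""
    decay = 1.0 - alpha * lam / m
    return [decay * w - alpha * dw for w, dw in zip(W, dW)]


# Illustrative values (flattened weights and unregularized gradients).
W = [0.8, -1.5, 0.05, 2.0]
dW = [0.1, -0.2, 0.3, 0.0]
alpha, lam, m = 0.1, 0.7, 50

standard = l2_update(W, dW, alpha, lam, m)
decayed = weight_decay_update(W, dW, alpha, lam, m)
max_diff = max(abs(a - b) for a, b in zip(standard, decayed))
print(f"decay factor = {1.0 - alpha * lam / m:.4f}")
print(f"largest difference between the two forms = {max_diff:.2e}")
```

This equivalence is specific to plain gradient descent. With adaptive optimizers such as Adam the two forms differ, which is the point of the Loshchilov and Hutter reference listed below.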
References & Further Reading
Primary References
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research, 15, 1929-1958. The foundational dropout paper.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 7: Regularization for Deep Learning. MIT Press. Comprehensive theoretical treatment.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Sections 3.1.4 (Regularized Least Squares) and 3.3 (Bayesian Linear Regression). Springer. Bayesian interpretation of regularization.
- Ng, A. (2017). "Deep Learning Specialization," Course 2: Improving Deep Neural Networks. Coursera/deeplearning.ai. Practical bias-variance diagnosis and regularization techniques.
Supplementary Reading
- Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). "Understanding Deep Learning Requires Rethinking Generalization." ICLR 2017. Landmark paper showing DNNs can memorise random labels.
- Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). "mixup: Beyond Empirical Risk Minimization." ICLR 2018. Data augmentation via linear interpolation of samples.
- Loshchilov, I. & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. Shows weight decay ≠ L2 regularization for Adam; proposes AdamW.
- Krogh, A. & Hertz, J. A. (1992). "A Simple Weight Decay Can Improve Generalization." NIPS 1992. Early analysis of weight decay as regularization.
Indian Context References
- AI4Bharat (IIT Madras). "IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks, and Pre-trained Multilingual Language Models for Indian Languages." Regularization challenges in multilingual Indian NLP.
- Wadhwani AI. "Pest Management for Cotton Farmers." Data augmentation for agricultural image classification in Indian farms with limited labeled data.
Online Resources
- 3Blue1Brown: "But what is a neural network?" Visual intuition for overfitting
- Stanford CS231n Notes: "Neural Networks Part 2: Regularization" (cs231n.github.io)
- distill.pub: Interactive visualizations of regularization effects
- PyTorch Documentation: torch.nn.Dropout, weight_decay in optimizers