Neural Networks & Deep Learning
Chapter 8: Optimization
Training Neural Networks Efficiently
โฑ๏ธ Reading Time: ~4 hours | ๐ Part III: Training Deep Networks | ๐ง Theory + Code Chapter
๐ Prerequisites: Chapters 4โ7 (Backpropagation, Gradient Descent basics)
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| ๐ต Remember | Recall the update rules for SGD, Momentum, RMSprop, Adam, and their hyperparameter defaults |
| ๐ต Understand | Explain exponentially weighted averages, bias correction, and why adaptive learning rates help |
| ๐ข Apply | Implement all five optimizers from scratch in Python and use PyTorch's built-in versions |
| ๐ก Analyze | Compare optimizer convergence curves on the same loss landscape and diagnose training issues |
| ๐ Evaluate | Choose the right optimizer and learning rate schedule for a given problem (CV, NLP, tabular) |
| ๐ด Create | Design a custom training pipeline with warm-up, cosine annealing, and distributed mini-batch SGD |
Learning Objectives
By the end of this chapter, you will be able to:
- Distinguish between Batch Gradient Descent, Stochastic Gradient Descent, and Mini-batch SGD โ and quantify their trade-offs in speed, memory, and convergence stability
- Derive the exponentially weighted moving average formula and explain bias correction with numerical examples
- Explain how Momentum accelerates gradient descent using a rolling-ball analogy and implement it from scratch
- Describe how RMSprop adapts learning rates per-parameter to handle ill-conditioned loss surfaces
- State the complete Adam update equations with bias correction and justify the default hyperparameters (ฮฒโ=0.9, ฮฒโ=0.999, ฮต=10โปโธ)
- Implement all five optimizers (SGD, Momentum, RMSprop, Adam, AdaGrad) as Python classes from scratch
- Compare optimizer loss curves on a common benchmark to evaluate convergence speed and stability
- Select appropriate learning rate schedules (step decay, exponential, cosine annealing, warm-up) for production training
- Design a training pipeline for distributed mini-batch SGD with learning rate warm-up, as used at Flipkart and similar Indian tech companies
Opening Hook โ When 6 Hours Became 12 Minutes
๐ฅ๏ธ The Infosys Server Cluster That Changed Everything
In early 2023, a deep learning team at Infosys Mysore was training a fraud detection model for a major Indian bank. Their dataset: 48 million transactions spread across 186 features. Using vanilla Batch Gradient Descent โ computing the gradient over all 48M samples before every single weight update โ each epoch took 6 hours and 12 minutes on an 8-GPU NVIDIA A100 cluster.
The model needed at least 50 epochs to converge. That's 310 hours = 13 days of non-stop GPU time, costing roughly โน18.6 lakh in cloud compute alone.
Then a senior ML engineer made one change: switch from Batch GD to Mini-batch SGD with Adam optimizer, batch size 512, cosine annealing learning rate schedule. The result? Each epoch now took 12 minutes. The model converged in 35 epochs โ total: 7 hours. Cost: โน42,000.
Same data. Same model. Same GPUs. The only difference? How the optimizer navigated the loss landscape. This chapter teaches you exactly how.
Core Concepts โ The Optimization Toolkit
Training a neural network means finding parameters W and b that minimize the loss function J(W, b). Gradient descent is the engine that drives this search, but how you compute and apply those gradients makes all the difference between a model that trains in minutes versus days. This section builds your toolkit from the ground up.
8.1 Gradient Descent Variants
All gradient descent algorithms share the same fundamental update rule:
ฮธ = ฮธ โ ฮฑ ยท โJ(ฮธ)
where ฮฑ = learning rate, โJ(ฮธ) = gradient of loss w.r.t. parameters
The key difference between variants lies in how many samples you use to compute โJ(ฮธ) before each update.
๐ Batch Gradient Descent (Full-Batch GD)
Compute the gradient using the entire training set before making a single parameter update. If you have m = 48 million samples, you process all 48 million, average the gradients, then update once.
Update Ruleโ
Gradient direction is exact (no noise) โ guaranteed to descend for convex functions
โ
Smooth convergence curve โ easy to debug
โ
Deterministic: same data โ same result
โ Extremely slow on large datasets โ must process ALL samples before a single update
โ Requires entire dataset in memory
โ Can get stuck in sharp local minima (no noise to escape)
Only practical when m < 2,000. Common in classical optimization, rare in deep learning.
โก Stochastic Gradient Descent (SGD)
Compute the gradient using a single sample at a time. Update parameters after every single training example.
Update Ruleฮธ = ฮธ โ ฮฑ ยท โJ(ฮธ; xโฝโฑโพ, yโฝโฑโพ)
โ
Very fast updates โ starts learning immediately
โ
Can escape local minima due to noisy gradients
โ
Memory-efficient: only one sample at a time
โ Very noisy gradient โ oscillates heavily, may never fully converge
โ Cannot leverage vectorized (GPU) computation โ processes one sample at a time
โ Loses vectorization speed advantage of NumPy/PyTorch
๐ฏ Mini-batch Gradient Descent (The Sweet Spot)
Split the training set into mini-batches of size B. Compute the gradient on each mini-batch and update parameters. This is the standard in all modern deep learning.
Update Ruleฮธ = ฮธ โ ฮฑ ยท (1/B) ยท ฮฃแตขโโแดฎ โJ(ฮธ; xโฝโฑโพ, yโฝโฑโพ)
โ
Vectorization: GPU processes B samples in parallel โ massive speedup
โ
Moderate noise: enough randomness to escape local minima, smooth enough to converge
โ
Frequent updates: m/B updates per epoch (not just 1 like batch GD)
โ
Memory-friendly: only B samples in GPU memory at once
Typical values: 32, 64, 128, 256, 512, 1024. Always powers of 2 โ this aligns with GPU memory architecture (CUDA cores work in warps of 32).
Batch Size Trade-offs: A Complete View
| Aspect | Small Batch (32โ64) | Medium Batch (128โ512) | Large Batch (1024โ8192) |
|---|---|---|---|
| Gradient Noise | High โ acts as regularizer | Moderate โ balanced | Low โ may overfit |
| Convergence Speed | More epochs to converge | Good balance | Fewer epochs but each is slower |
| GPU Utilization | Underutilizes GPU | Good utilization | Maxes out GPU |
| Generalization | Better โ finds flat minima | Moderate | Worse โ sharp minima |
| Memory Needed | Low | Moderate | May OOM on large models |
| Learning Rate | Smaller ฮฑ needed | Standard ฮฑ | Scale ฮฑ linearly with batch size |
8.2 Exponentially Weighted Averages (EWA)
Before we dive into Momentum and Adam, we need to understand the mathematical primitive they're built on: exponentially weighted moving averages. This is the single most important building block for all advanced optimizers.
Intuition: Smoothing Noisy Data
Imagine you're tracking daily temperatures in Delhi over a year. The raw data is noisy โ 32ยฐC one day, 28ยฐC the next, 35ยฐC the day after. To see the underlying trend, you compute a running average that gives more weight to recent values and exponentially less weight to older values.
Vt = ฮฒ ยท Vt-1 + (1 โ ฮฒ) ยท ฮธt
Vt = smoothed value at time t
ฮธt = actual value at time t (e.g., today's temperature)
ฮฒ = weighting factor (0 < ฮฒ < 1), typically 0.9
V0 = 0 (initialize to zero)
What Does ฮฒ Control?
The parameter ฮฒ determines how many past values effectively contribute to the average. A rough approximation: Vt averages over approximately 1/(1โฮฒ) previous values.
| ฮฒ Value | โ Averaging Over | Behavior | Analogy |
|---|---|---|---|
| ฮฒ = 0.9 | ~10 values | Smooth, adapts at moderate speed | Weekly weather average |
| ฮฒ = 0.98 | ~50 values | Very smooth, slow to adapt | Monthly moving average |
| ฮฒ = 0.5 | ~2 values | Noisy, responds very fast | Yesterday + today average |
Numerical Example
Let's trace through with ฮฒ = 0.9 and daily temperatures ฮธ = [35, 33, 36, 34, 38]:
# Tracing EWA step-by-step
Vโ = 0
Vโ = 0.9 ร 0 + 0.1 ร 35 = 3.5 # Way too low! (bias problem)
Vโ = 0.9 ร 3.5 + 0.1 ร 33 = 6.45
Vโ = 0.9 ร 6.45+ 0.1 ร 36 = 9.41
Vโ = 0.9 ร 9.41+ 0.1 ร 34 = 11.87
Vโ
= 0.9 ร 11.87+0.1 ร 38 = 14.48
# After ~10 steps, V converges to the true range
The Bias Problem & Correction
Notice Vโ = 3.5 when the actual temperature is 35ยฐC! Since Vโ = 0, the early estimates are biased toward zero. The fix is bias correction:
Vtcorrected = Vt / (1 โ ฮฒt)
At t=1: Vโcorrected = 3.5 / (1 โ 0.9ยน) = 3.5 / 0.1 = 35.0 โ
At t=2: Vโcorrected = 6.45 / (1 โ 0.9ยฒ) = 6.45 / 0.19 = 33.9 โ
As t โ โ: (1 โ ฮฒt) โ 1, so correction vanishes
8.3 Gradient Descent with Momentum
The Rolling Ball Analogy
Imagine placing a ball at the top of a hilly landscape (the loss surface). Vanilla gradient descent is like the ball moving only according to the local slope โ at every point it forgets its previous direction. Momentum is like giving the ball mass โ it accumulates velocity in directions of consistent gradient, and resists changing direction for oscillatory gradients.
๐ Momentum โ Accumulate Velocity, Dampen Oscillations
Instead of using the raw gradient directly, maintain a velocity vector V that is an exponentially weighted average of past gradients. Update parameters using this smoothed velocity.
Update RulesVdW = ฮฒ ยท VdW + (1 โ ฮฒ) ยท dW
Vdb = ฮฒ ยท Vdb + (1 โ ฮฒ) ยท db
W = W โ ฮฑ ยท VdW
b = b โ ฮฑ ยท Vdb
Typical: ฮฒ = 0.9 (averages over ~10 gradients)
Dampens oscillations: In directions where gradients alternate sign (oscillate), the positive and negative gradients cancel out in V โ smaller effective step.
Accelerates consistent direction: In the direction of the minimum, gradients consistently point the same way โ they accumulate in V โ larger effective step.
ฮฒ = 0.9 โ the standard choice. Rarely needs tuning.
ฮฑ โ learning rate. You may need a slightly smaller ฮฑ than vanilla SGD since momentum effectively amplifies the step size.
8.4 RMSprop โ Root Mean Square Propagation
Momentum addresses the direction problem (dampening oscillations). But what about the magnitude problem? In many loss surfaces, some parameters have very large gradients while others have tiny ones. A single global learning rate is a poor fit for all of them.
๐ RMSprop โ Adaptive Per-Parameter Learning Rates
Track the exponentially weighted average of squared gradients for each parameter. Divide the gradient by the square root of this average. Parameters with historically large gradients get effectively smaller learning rates; parameters with small gradients get effectively larger ones.
Update RulesSdW = ฮฒ ยท SdW + (1 โ ฮฒ) ยท dWยฒ
Sdb = ฮฒ ยท Sdb + (1 โ ฮฒ) ยท dbยฒ
W = W โ ฮฑ ยท dW / (โSdW + ฮต)
b = b โ ฮฑ ยท db / (โSdb + ฮต)
ฮต = 10โปโธ (prevents division by zero)
ฮฒ = 0.999 (or 0.99) โ the original lecture used 0.999
If parameter wโ has consistently large gradients (say |dwโ| โ 10), then Sdwโ โ 100, so the update is divided by โ100 = 10 โ effective learning rate is ฮฑ/10.
If parameter wโ has small gradients (|dwโ| โ 0.01), then Sdwโ โ 0.0001, so the update is divided by โ0.0001 = 0.01 โ effective learning rate is ฮฑ/0.01 = 100ฮฑ.
All parameters receive appropriately scaled updates regardless of gradient magnitude. This eliminates the need to manually set different learning rates for different layers.
8.5 Adam โ Adaptive Moment Estimation
Adam combines the best of both worlds: Momentum's velocity (first moment of gradients) + RMSprop's adaptive scaling (second moment of gradients), with bias correction for both.
๐ Adam โ The King of Optimizers
On each iteration:
t = t + 1
Step 1: Compute gradients
gt = โJ(ฮธt-1)
Step 2: Update first moment (mean of gradients โ Momentum)
mt = ฮฒโ ยท mt-1 + (1 โ ฮฒโ) ยท gt
Step 3: Update second moment (mean of squared gradients โ RMSprop)
vt = ฮฒโ ยท vt-1 + (1 โ ฮฒโ) ยท gtยฒ
Step 4: Bias correction
mฬt = mt / (1 โ ฮฒโt)
vฬt = vt / (1 โ ฮฒโt)
Step 5: Update parameters
ฮธt = ฮธt-1 โ ฮฑ ยท mฬt / (โvฬt + ฮต)
ฮฑ = 0.001 โ learning rate
ฮฒโ = 0.9 โ first moment decay (momentum term)
ฮฒโ = 0.999 โ second moment decay (RMSprop term)
ฮต = 10โปโธ โ numerical stability constant
ฮฒโ = 0.9 averages over ~10 recent gradients โ fast adaptation to recent loss landscape.
ฮฒโ = 0.999 averages over ~1000 squared gradients โ slow, stable estimate of gradient variance.
This asymmetry is deliberate: you want the direction to adapt quickly but the scale to change slowly.
ฮปยทฮธ directly from parameters, not from gradients.
8.6 Learning Rate Schedules
A fixed learning rate is rarely optimal throughout training. Early on, you want large steps to explore quickly. Later, you want small steps to fine-tune near the minimum. Learning rate schedules systematically reduce ฮฑ during training.
๐ Four Major Learning Rate Schedules
Example: ฮฑโ=0.01, drop by 10ร every 30 epochs
Simple and effective. Used in the original ResNet paper (He et al., 2015). Reduce LR by 10ร at epoch 30, 60, 90.
2. Exponential Decayk = decay rate, t = epoch number
Smooth, continuous decay. No sudden jumps. Common in TensorFlow pipelines.
3. Cosine AnnealingT = total epochs, t = current epoch
Follows a cosine curve from ฮฑmax to ฮฑmin. Popular in modern training (used in training GPT models). Smoothly reduces LR and can be combined with warm restarts (SGDR) for even better results.
4. Linear Warm-up + Decayฮฑt = ฮฑmax ยท (t / W)
Phase 2 (decay, epochs W+1 to T):
ฮฑt follows cosine or step decay
Critical for large-batch training! When using batch sizes โฅ 1024, the gradients in the first few iterations are unreliable (model hasn't seen much data yet). Starting with a large LR causes divergence. Warm-up linearly increases LR from near-zero to ฮฑmax over W epochs, then switches to a decay schedule.
From-Scratch Code โ Implementing Every Optimizer in Python
Let's implement all five optimizers as Python classes. Each takes parameters and gradients and applies the update rule. We'll then compare them on a common problem.
4a. Vanilla SGD
Python
import numpy as np
class SGD:
"""Vanilla Stochastic Gradient Descent."""
def __init__(self, lr=0.01):
self.lr = lr
def update(self, params, grads):
"""
params: dict of parameter arrays {'W1': ..., 'b1': ..., ...}
grads: dict of gradient arrays {'dW1': ..., 'db1': ..., ...}
"""
for key in params:
params[key] -= self.lr * grads['d' + key]
return params
4b. SGD with Momentum
Python
class MomentumSGD:
"""SGD with Momentum โ rolling ball optimization."""
def __init__(self, lr=0.01, beta=0.9):
self.lr = lr
self.beta = beta
self.velocity = {}
def update(self, params, grads):
for key in params:
if key not in self.velocity:
self.velocity[key] = np.zeros_like(params[key])
# Update velocity: V = ฮฒยทV + (1-ฮฒ)ยทdW
self.velocity[key] = (self.beta * self.velocity[key]
+ (1 - self.beta) * grads['d' + key])
# Update parameter: W = W - ฮฑยทV
params[key] -= self.lr * self.velocity[key]
return params
4c. RMSprop
Python
class RMSprop:
"""RMSprop โ adaptive per-parameter learning rates."""
def __init__(self, lr=0.001, beta=0.999, epsilon=1e-8):
self.lr = lr
self.beta = beta
self.epsilon = epsilon
self.cache = {} # Squared gradient accumulator
def update(self, params, grads):
for key in params:
if key not in self.cache:
self.cache[key] = np.zeros_like(params[key])
grad = grads['d' + key]
# Update cache: S = ฮฒยทS + (1-ฮฒ)ยทdWยฒ
self.cache[key] = (self.beta * self.cache[key]
+ (1 - self.beta) * grad ** 2)
# Update parameter: W = W - ฮฑยทdW / (โS + ฮต)
params[key] -= (self.lr * grad
/ (np.sqrt(self.cache[key]) + self.epsilon))
return params
4d. Adam โ Full Implementation with Bias Correction
Python
class Adam:
"""
Adam optimizer โ Adaptive Moment Estimation.
Combines Momentum (first moment) + RMSprop (second moment)
with bias correction for both.
"""
def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.epsilon = epsilon
self.m = {} # First moment (mean of gradients)
self.v = {} # Second moment (mean of squared gradients)
self.t = 0 # Time step counter
def update(self, params, grads):
self.t += 1
for key in params:
if key not in self.m:
self.m[key] = np.zeros_like(params[key])
self.v[key] = np.zeros_like(params[key])
grad = grads['d' + key]
# Step 1: Update first moment estimate (Momentum)
self.m[key] = (self.beta1 * self.m[key]
+ (1 - self.beta1) * grad)
# Step 2: Update second moment estimate (RMSprop)
self.v[key] = (self.beta2 * self.v[key]
+ (1 - self.beta2) * grad ** 2)
# Step 3: Bias correction
m_hat = self.m[key] / (1 - self.beta1 ** self.t)
v_hat = self.v[key] / (1 - self.beta2 ** self.t)
# Step 4: Update parameters
params[key] -= (self.lr * m_hat
/ (np.sqrt(v_hat) + self.epsilon))
return params
4e. AdaGrad (Bonus โ Historical Importance)
Python
class AdaGrad:
"""
AdaGrad โ Adaptive Gradient Algorithm (Duchi et al., 2011).
Accumulates ALL past squared gradients (no decay).
Problem: learning rate monotonically decreases โ may stop learning.
"""
def __init__(self, lr=0.01, epsilon=1e-8):
self.lr = lr
self.epsilon = epsilon
self.cache = {}
def update(self, params, grads):
for key in params:
if key not in self.cache:
self.cache[key] = np.zeros_like(params[key])
grad = grads['d' + key]
# Accumulate squared gradients (NO decay โ key difference)
self.cache[key] += grad ** 2
# Update: divide by sqrt of total accumulated squared grads
params[key] -= (self.lr * grad
/ (np.sqrt(self.cache[key]) + self.epsilon))
return params
4f. Comparing All Optimizers on the Same Problem
Let's train a simple 2-layer neural network on a synthetic classification task and overlay the loss curves.
Python
import numpy as np
import matplotlib.pyplot as plt
# โโโ Synthetic Dataset โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
np.random.seed(42)
N = 1000 # samples
D = 10 # features
X = np.random.randn(D, N)
W_true = np.random.randn(D, 1) * 0.5
y = (X.T @ W_true + np.random.randn(N, 1) * 0.1).T # shape (1, N)
# โโโ Simple Linear Model: y_hat = W.T @ x + b โโโโโ
def init_params():
return {
'W': np.random.randn(D, 1) * 0.01,
'b': np.zeros((1, 1))
}
def compute_loss_and_grads(params, X, y):
# Forward pass
y_hat = params['W'].T @ X + params['b'] # (1, N)
m = X.shape[1]
loss = np.mean((y_hat - y) ** 2)
# Backward pass
diff = y_hat - y # (1, N)
grads = {
'dW': (2 / m) * (X @ diff.T), # (D, 1)
'db': (2 / m) * np.sum(diff, axis=1, keepdims=True)
}
return loss, grads
# โโโ Train with each optimizer โโโโโโโโโโโโโโโโโโโโโ
optimizers = {
'Vanilla SGD': SGD(lr=0.01),
'Momentum': MomentumSGD(lr=0.01, beta=0.9),
'RMSprop': RMSprop(lr=0.001),
'Adam': Adam(lr=0.001),
'AdaGrad': AdaGrad(lr=0.1),
}
epochs = 200
results = {}
for name, optimizer in optimizers.items():
params = init_params()
losses = []
for epoch in range(epochs):
loss, grads = compute_loss_and_grads(params, X, y)
losses.append(loss)
params = optimizer.update(params, grads)
results[name] = losses
print(f"{name:15s} โ Final loss: {losses[-1]:.6f}")
# โโโ Plot comparison โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
plt.figure(figsize=(10, 6))
colors = ['#ef4444', '#f59e0b', '#22c55e', '#7c3aed', '#64748b']
for (name, losses), color in zip(results.items(), colors):
plt.plot(losses, label=name, linewidth=2, color=color)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('MSE Loss', fontsize=12)
plt.title('Optimizer Comparison: Loss Curves', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('optimizer_comparison.png', dpi=150)
plt.show()
4g. Learning Rate Schedules โ Implementation
Python
import math
class StepDecaySchedule:
"""Reduce LR by factor every N epochs."""
def __init__(self, initial_lr, drop_rate=0.1, step_size=30):
self.initial_lr = initial_lr
self.drop_rate = drop_rate
self.step_size = step_size
def get_lr(self, epoch):
return self.initial_lr * (self.drop_rate ** (epoch // self.step_size))
class ExponentialDecaySchedule:
"""Smooth exponential decay."""
def __init__(self, initial_lr, decay_rate=0.96):
self.initial_lr = initial_lr
self.decay_rate = decay_rate
def get_lr(self, epoch):
return self.initial_lr * (self.decay_rate ** epoch)
class CosineAnnealingSchedule:
"""Cosine annealing from lr_max to lr_min."""
def __init__(self, lr_max, lr_min=1e-6, total_epochs=100):
self.lr_max = lr_max
self.lr_min = lr_min
self.total_epochs = total_epochs
def get_lr(self, epoch):
return (self.lr_min + 0.5 * (self.lr_max - self.lr_min)
* (1 + math.cos(math.pi * epoch / self.total_epochs)))
class WarmupCosineSchedule:
"""Linear warm-up followed by cosine annealing."""
def __init__(self, lr_max, warmup_epochs=5,
total_epochs=100, lr_min=1e-6):
self.lr_max = lr_max
self.warmup_epochs = warmup_epochs
self.total_epochs = total_epochs
self.lr_min = lr_min
def get_lr(self, epoch):
if epoch < self.warmup_epochs:
# Linear warm-up
return self.lr_max * (epoch + 1) / self.warmup_epochs
else:
# Cosine annealing
progress = (epoch - self.warmup_epochs) / (
self.total_epochs - self.warmup_epochs)
return (self.lr_min + 0.5 * (self.lr_max - self.lr_min)
* (1 + math.cos(math.pi * progress)))
# โโโ Visualize all schedules โโโโโโโโโโโโโโโโโโโโโโโ
epochs = 100
schedules = {
'Step Decay (รท10 @ 30,60,90)': StepDecaySchedule(0.01),
'Exponential (ฮณ=0.96)': ExponentialDecaySchedule(0.01),
'Cosine Annealing': CosineAnnealingSchedule(0.01),
'Warmup + Cosine': WarmupCosineSchedule(0.01, warmup_epochs=5),
}
plt.figure(figsize=(10, 5))
for name, schedule in schedules.items():
lrs = [schedule.get_lr(e) for e in range(epochs)]
plt.plot(lrs, label=name, linewidth=2)
plt.xlabel('Epoch'); plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules Comparison')
plt.legend(); plt.grid(True, alpha=0.3)
plt.tight_layout(); plt.show()
Industry Code โ PyTorch Optimizers in Production
In real projects, you never implement optimizers from scratch. PyTorch provides battle-tested, GPU-accelerated implementations. Here's how to use them:
5a. Using PyTorch Optimizers
Python
import torch
import torch.nn as nn
import torch.optim as optim
# โโโ Define a simple model โโโโโโโโโโโโโโโโโโโโโโโโโโ
model = nn.Sequential(
nn.Linear(10, 64),
nn.ReLU(),
nn.Linear(64, 32),
nn.ReLU(),
nn.Linear(32, 1)
)
# โโโ Pick your optimizer โโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# Option 1: Vanilla SGD
optimizer = optim.SGD(model.parameters(), lr=0.01)
# Option 2: SGD with Momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Option 3: RMSprop
optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)
# Option 4: Adam (most common default)
optimizer = optim.Adam(model.parameters(), lr=0.001,
betas=(0.9, 0.999), eps=1e-8)
# Option 5: AdamW (recommended for production)
optimizer = optim.AdamW(model.parameters(), lr=0.001,
betas=(0.9, 0.999), weight_decay=0.01)
# โโโ Training loop โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
criterion = nn.MSELoss()
for epoch in range(100):
for X_batch, y_batch in dataloader:
# Forward pass
y_pred = model(X_batch)
loss = criterion(y_pred, y_batch)
# Backward pass
optimizer.zero_grad() # CRITICAL: reset gradients
loss.backward() # Compute gradients
optimizer.step() # Apply optimizer update
if (epoch + 1) % 10 == 0:
print(f"Epoch {epoch+1:3d} | Loss: {loss.item():.6f}")
optimizer.zero_grad() is the #1 PyTorch beginner bug! Without it, gradients from the previous batch accumulate (are added to, not replaced). Your model will diverge or train on incorrect gradients. Always call zero_grad() at the start of each iteration.
5b. PyTorch Learning Rate Schedulers
Python
from torch.optim.lr_scheduler import (
StepLR, ExponentialLR, CosineAnnealingLR,
CosineAnnealingWarmRestarts, OneCycleLR
)
optimizer = optim.AdamW(model.parameters(), lr=0.001)
# Step decay: reduce by 10ร every 30 epochs
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)
# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)
# Cosine annealing with warm restarts
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)
# One-cycle policy (super-convergence) โ popular for fast training
scheduler = OneCycleLR(optimizer, max_lr=0.01,
total_steps=1000, pct_start=0.3)
# โโโ Use in training loop โโโโโโโโโโโโโโโโโโโโโโโโโโโ
for epoch in range(100):
for X_batch, y_batch in dataloader:
optimizer.zero_grad()
loss = criterion(model(X_batch), y_batch)
loss.backward()
optimizer.step()
scheduler.step() # Update LR after each epoch
print(f"Epoch {epoch+1} | LR: {scheduler.get_last_lr()[0]:.6f}")
OneCycleLR implements it in one line. Use pct_start=0.3 (warm-up for 30% of training, decay for 70%).
Visual Diagrams โ Understanding Optimizer Behavior
6a. Contour Plot: Optimizer Paths on a 2D Loss Surface
Imagine the loss function as a bowl-shaped valley (contour lines show equal-loss regions). Each optimizer takes a different path to the minimum:
6b. The Optimizer Family Tree
6c. Bias Correction Visualization
Worked Example โ Adam Step-by-Step on a 2D Problem
Let's manually trace 3 iterations of Adam on a simple problem to build deep intuition. We'll minimize f(w) = wยฒ starting from w = 5.0.
Setup
Hyperparameters: ฮฑ = 0.1, ฮฒโ = 0.9, ฮฒโ = 0.999, ฮต = 10โปโธ
Initialize: mโ = 0, vโ = 0, t = 0
Iteration 1 (t = 1)
| Step | Computation | Result |
|---|---|---|
| Gradient | gโ = df/dw = 2w = 2(5.0) | gโ = 10.0 |
| First moment | mโ = 0.9(0) + 0.1(10.0) | mโ = 1.0 |
| Second moment | vโ = 0.999(0) + 0.001(10.0ยฒ) | vโ = 0.1 |
| Bias correct m | mฬโ = 1.0 / (1 โ 0.9ยน) = 1.0 / 0.1 | mฬโ = 10.0 |
| Bias correct v | vฬโ = 0.1 / (1 โ 0.999ยน) = 0.1 / 0.001 | vฬโ = 100.0 |
| Update | w = 5.0 โ 0.1 ร 10.0 / (โ100.0 + 10โปโธ) | w = 4.9 |
Iteration 2 (t = 2)
| Step | Computation | Result |
|---|---|---|
| Gradient | gโ = 2(4.9) | gโ = 9.8 |
| First moment | mโ = 0.9(1.0) + 0.1(9.8) | mโ = 1.88 |
| Second moment | vโ = 0.999(0.1) + 0.001(9.8ยฒ) | vโ = 0.1960 |
| Bias correct m | mฬโ = 1.88 / (1 โ 0.9ยฒ) = 1.88 / 0.19 | mฬโ = 9.895 |
| Bias correct v | vฬโ = 0.196 / (1 โ 0.999ยฒ) = 0.196 / 0.001999 | vฬโ = 98.05 |
| Update | w = 4.9 โ 0.1 ร 9.895 / (โ98.05 + 10โปโธ) | w = 4.8 |
Iteration 3 (t = 3)
| Step | Computation | Result |
|---|---|---|
| Gradient | gโ = 2(4.8) | gโ = 9.6 |
| First moment | mโ = 0.9(1.88) + 0.1(9.6) | mโ = 2.652 |
| Second moment | vโ = 0.999(0.196) + 0.001(9.6ยฒ) | vโ = 0.288 |
| Bias correct m | mฬโ = 2.652 / (1 โ 0.9ยณ) = 2.652 / 0.271 | mฬโ = 9.786 |
| Bias correct v | vฬโ = 0.288 / (1 โ 0.999ยณ) = 0.288 / 0.002997 | vฬโ = 96.10 |
| Update | w = 4.8 โ 0.1 ร 9.786 / (โ96.10 + 10โปโธ) | w = 4.7 |
Case Study โ Flipkart Search Ranking: Distributed Optimization at Scale
๐ How Flipkart Trains Search Ranking Models on 1.5 Billion Query Logs
The Problem
Flipkart processes over 15 million search queries per day. Their search ranking model must predict which products to show for each query, in what order. The model โ a deep cross-network with 12 layers โ trains on 1.5 billion historical query-click pairs. Training this on a single GPU would take over 3 weeks.
The Optimization Pipeline
| Component | Choice | Reasoning |
|---|---|---|
| Optimizer | AdamW | Adaptive LR handles 200+ features of different scales (price in โน, click rate 0-1, text embeddings) |
| Batch Size | 4096 (per GPU) ร 32 GPUs = 131,072 effective | Large effective batch for stable distributed training |
| LR Schedule | Linear warm-up (2000 steps) โ Cosine annealing | Large batch requires warm-up to prevent divergence |
| LR Scaling | ฮฑ = base_lr ร โ(num_gpus) = 0.001 ร โ32 โ 0.0057 | Square-root scaling rule for large-batch training |
| Gradient Clipping | Max norm = 1.0 | Prevents gradient explosion from outlier query-product pairs |
| Weight Decay | 0.01 (in AdamW) | Regularization for the large model (~50M parameters) |
Results
| Metric | Before (SGD, 1 GPU) | After (AdamW, 32 GPUs) |
|---|---|---|
| Training Time | 21 days | 16 hours |
| NDCG@10 | 0.72 | 0.79 (+9.7%) |
| Compute Cost | โน6.3 lakh | โน1.1 lakh |
| Click-Through Rate | 4.2% | 5.1% (+21.4%) |
Key Lessons
- Warm-up is non-negotiable for large-batch distributed training. Without it, the model diverged in 50 steps.
- AdamW > Adam for models with weight decay โ the decoupled implementation prevents the adaptive LR from conflicting with regularization.
- Square-root LR scaling (not linear!) worked better than the linear scaling rule from Goyal et al. (2017) for their specific architecture.
- Gradient clipping was essential โ 0.3% of training samples had extreme gradients (โน1 items with 90% click rate) that could destabilize training.
Common Mistakes & Misconceptions
Reality: Adam converges faster in training loss, but SGD with momentum often achieves better test accuracy in computer vision tasks (ResNet, EfficientNet). The reason: Adam's per-parameter learning rates can lead to sharp minima that don't generalize. Use Adam for NLP/transformers, SGD+momentum for vision, and always validate on a test set.
Reality: If ฮฑ is too large, the optimizer overshoots the minimum and diverges. The loss goes to infinity or NaN. Rule of thumb: if your loss increases or oscillates wildly after a few hundred steps, your LR is too high. Halve it and try again. Start with Adam's default ฮฑ = 0.001.
Reality: In 99% of cases, the defaults ฮฒโ = 0.9 and ฮฒโ = 0.999 work perfectly. Kingma and Ba's paper showed these values are robust across diverse tasks. The only hyperparameter you should tune first is the learning rate ฮฑ. If you must tune ฮฒโ, try 0.99 for very noisy problems.
Reality: This is actually correct, but the mistake is using these terms interchangeably. In practice, when people say "SGD" in deep learning, they almost always mean mini-batch SGD. PyTorch's
optim.SGD is actually mini-batch SGD โ the batch size comes from the DataLoader, not the optimizer.
Reality: For large batch sizes (โฅ 1024), warm-up is essential. Without it, large-batch training frequently diverges because the initial random gradients are unreliable, and scaling them by a large LR amplifies noise catastrophically. Every major model โ BERT, GPT, ViT โ uses warm-up.
Comprehensive Optimizer Comparison
| Feature | SGD | Momentum | AdaGrad | RMSprop | Adam |
|---|---|---|---|---|---|
| Update Rule | ฮธ -= ฮฑยทg | v = ฮฒv+(1-ฮฒ)g ฮธ -= ฮฑยทv |
c += gยฒ ฮธ -= ฮฑยทg/โc |
s = ฮฒs+(1-ฮฒ)gยฒ ฮธ -= ฮฑยทg/โs |
m,v moments bias correct ฮธ -= ฮฑยทmฬ/โvฬ |
| Adaptive LR | โ No | โ No | โ Yes | โ Yes | โ Yes |
| Momentum | โ No | โ Yes | โ No | โ No | โ Yes |
| Bias Correction | N/A | โ No | N/A | โ No | โ Yes |
| Memory (per param) | 0 extra | 1 buffer | 1 buffer | 1 buffer | 2 buffers |
| Key Hyperparams | ฮฑ | ฮฑ, ฮฒ | ฮฑ | ฮฑ, ฮฒ | ฮฑ, ฮฒโ, ฮฒโ |
| Default LR | 0.01 | 0.01 | 0.01 | 0.001 | 0.001 |
| Convergence | Slow but steady | Fast, smooth | Slows over time | Fast | Fastest |
| Generalization | โญโญโญ Best | โญโญโญ Very good | โญโญ Okay | โญโญ Good | โญโญ Good |
| Best For | Vision (final) | Vision, general | Sparse data, NLP | RNNs | Default choice |
| Weakness | Slow, oscillates | Can overshoot | LR โ 0 over time | No bias correct | May generalize worse |
| Year | 1951 | 1964 | 2011 | 2012 | 2014 |
๐น Computer Vision (ResNet, EfficientNet): SGD + Momentum (ฮฒ=0.9), cosine annealing, weight decay 1e-4
๐น NLP / Transformers (BERT, GPT): AdamW (ฮฒโ=0.9, ฮฒโ=0.999), linear warm-up + cosine decay, weight decay 0.01
๐น Prototyping / Quick experiments: Adam with default hyperparameters
๐น Sparse data (recommender systems): AdaGrad or SparseAdam
Exercises
Section A โ Multiple Choice Questions (10)
In mini-batch gradient descent with batch size B=64 and a dataset of m=6400 samples, how many parameter updates occur per epoch?
- 1
- 64
- 100
- 6400
In the exponentially weighted average formula Vt = ฮฒยทVt-1 + (1โฮฒ)ยทฮธt, if ฮฒ = 0.98, the average approximately considers the last _____ values.
- 2
- 10
- 50
- 98
Which problem does bias correction in Adam specifically solve?
- Gradient vanishing in deep networks
- Inaccurate moment estimates during early training steps
- Overfitting on small datasets
- Learning rate being too high
Why are batch sizes typically chosen as powers of 2 (32, 64, 128, 256)?
- It makes the math simpler for computing averages
- GPU memory banks are organized in powers of 2, optimizing memory access patterns
- Python's NumPy library requires it
- It ensures the dataset divides evenly
What is the key limitation of AdaGrad that RMSprop fixes?
- AdaGrad doesn't use gradient information
- AdaGrad's accumulated squared gradients grow monotonically, causing the effective learning rate to shrink to zero
- AdaGrad requires too much memory
- AdaGrad only works with convex loss functions
Adam can be viewed as a combination of which two optimization techniques?
- Batch GD + Stochastic GD
- Momentum + RMSprop
- AdaGrad + Newton's Method
- L1 Regularization + L2 Regularization
During training with a large batch size of 4096, the model diverges in the first 100 steps. What is the most likely fix?
- Switch from Adam to SGD
- Add learning rate warm-up for the first few hundred steps
- Increase the batch size to 8192
- Remove all regularization
How much extra memory per parameter does Adam require compared to vanilla SGD?
- None โ same memory
- 1ร extra (one buffer for velocity)
- 2ร extra (first moment m + second moment v)
- 3ร extra (m + v + bias correction terms)
In cosine annealing, the learning rate follows which pattern over training?
- Linearly decreases from max to min
- Drops sharply at fixed intervals
- Follows a half-cosine curve from max to min, with a smooth gradual decrease
- Increases exponentially
Research by Wilson et al. (2017) showed that for image classification tasks, which optimizer often achieves better test accuracy despite slower training convergence?
- Adam
- RMSprop
- SGD with Momentum
- AdaGrad
Section B โ Short Answer Questions (5)
Explain the difference between Batch Gradient Descent and Mini-batch Gradient Descent in exactly 3 sentences. Include the impact on training speed and gradient noise.
Why does Momentum help with oscillating gradients? Use the rolling ball analogy to explain.
Write the complete Adam update equations for a single parameter w, clearly labeling each step. What is the purpose of each step?
Explain why learning rate warm-up is critical for large-batch training. What happens without it?
In 3โ4 sentences, explain how AdamW differs from Adam and why this matters for regularized models.
Section C โ Long Answer Questions (3)
(15 marks) Trace through 3 complete iterations of the Momentum optimizer on the function f(w) = (w โ 3)ยฒ starting from wโ = 10. Use ฮฑ = 0.1, ฮฒ = 0.9. Show all intermediate computations for V and w at each step. Then explain: (a) Is Vโ larger or smaller than the raw gradient at step 1? Why? (b) What would happen if ฮฒ = 0.99?
(20 marks) Compare Adam and SGD with Momentum across five dimensions: (1) convergence speed on training loss, (2) final test accuracy, (3) sensitivity to learning rate, (4) memory overhead, (5) suitability for different tasks (vision vs. NLP). For each dimension, provide a concrete example or numerical evidence from published research. Conclude with a recommendation for when to use each.
(15 marks) A team at an Indian fintech company is training a fraud detection model on 10 million transactions. They're using Adam with ฮฑ=0.01, batch size 32, no LR schedule, and no warm-up. Training loss oscillates wildly and never converges. Diagnose at least 3 possible issues with their setup and propose specific fixes with justification for each.
Section D โ Programming Exercises (2)
Implement Adam from Scratch with Logging
Implement the Adam optimizer class (with full bias correction) and use it to minimize the Rosenbrock function: f(x,y) = (1โx)ยฒ + 100(yโxยฒ)ยฒ. Start from (xโ, yโ) = (โ1, 1). Log the values of x, y, f(x,y), mฬx, mฬy, vฬx, vฬy at each iteration. Run for 10,000 iterations with ฮฑ=0.001. Plot the optimization trajectory on a contour plot of the Rosenbrock function.
Batch Size Experiment
Using PyTorch, train a 3-layer neural network (784โ256โ128โ10) on MNIST with batch sizes [16, 32, 64, 128, 256, 512, 1024, 4096]. For each batch size: (a) Record training loss per epoch, (b) Record test accuracy after 20 epochs, (c) Record wall-clock time per epoch. Use Adam with ฮฑ=0.001 for all experiments. Create three plots: (1) loss curves overlaid, (2) test accuracy vs. batch size, (3) time per epoch vs. batch size. Write a 200-word analysis of the trade-offs you observe.
Section E โ Mini-Project
Build an "Optimizer Playground" Web Dashboard
Create a Python script using Streamlit or Gradio that lets users:
- Select an optimizer (SGD, Momentum, RMSprop, Adam)
- Set hyperparameters (ฮฑ, ฮฒ, ฮฒโ, ฮฒโ) via sliders
- Choose a 2D loss surface (quadratic, Rosenbrock, Rastrigin, Beale)
- Visualize the optimizer's trajectory on a contour plot in real-time
- See side-by-side loss curves comparing all optimizers on the same surface
- Display a table of parameter values at each iteration
Deliverables: Working Streamlit app, README with screenshots, analysis report (500 words) comparing optimizer behavior on different surfaces.
Chapter Summary
๐ง Key Takeaways from Chapter 8
- Three GD variants: Batch GD (all samples, 1 update/epoch), SGD (1 sample, m updates/epoch), Mini-batch GD (B samples, m/B updates/epoch). Mini-batch is the standard โ it combines GPU parallelism with sufficient gradient noise.
- Batch size: Use powers of 2 (32โ512 typical). Small batches โ better generalization, more noise. Large batches โ better GPU utilization, need LR warm-up. Scale LR with batch size (linear or square-root rule).
- Exponentially Weighted Averages: Vt = ฮฒยทVt-1 + (1โฮฒ)ยทฮธt averages over ~1/(1โฮฒ) values. Bias correction Vฬt = Vt/(1โฮฒt) is critical for accurate early estimates.
- Momentum (ฮฒ=0.9): Maintains velocity V that accumulates past gradients. Dampens oscillations, accelerates consistent-direction gradients. The "rolling ball" effect.
- RMSprop: Adapts learning rate per-parameter using running average of squared gradients. Parameters with large gradients get smaller effective LR. Fixes AdaGrad's monotonically decreasing LR problem.
- Adam (ฮฑ=0.001, ฮฒโ=0.9, ฮฒโ=0.999): Combines Momentum + RMSprop + bias correction. The default optimizer for most deep learning. Uses 2ร extra memory per parameter.
- AdamW > Adam when using weight decay. Decouples weight decay from adaptive gradient scaling.
- Learning rate schedules: Step decay (simple, effective), exponential (smooth), cosine annealing (modern standard), warm-up + cosine (essential for large-batch and transformer training).
- Optimizer selection: Adam/AdamW for NLP and default prototyping. SGD + Momentum for computer vision when best test accuracy matters. AdaGrad for sparse data.
- The Flipkart lesson: Production training at scale requires distributed mini-batch, AdamW, warm-up, gradient clipping, and proper LR scaling โ optimization is engineering, not just math.
SGD โ +Momentum โ +Adaptive LR (RMSprop) โ +Both+Bias Correction = Adam
Each step adds a mechanism: direction smoothing, scale adaptation, initialization correction.
References & Further Reading
Core Papers
- Kingma, D.P. & Ba, J. (2014). "Adam: A Method for Stochastic Optimization." ICLR 2015. arXiv:1412.6980. The original Adam paper โ 180,000+ citations.
- Loshchilov, I. & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. arXiv:1711.05101. The AdamW paper fixing Adam's weight decay bug.
- Duchi, J., Hazan, E. & Singer, Y. (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." JMLR, 12, 2121โ2159. The AdaGrad paper.
- Polyak, B.T. (1964). "Some methods of speeding up the convergence of iterative methods." USSR Computational Mathematics and Mathematical Physics, 4(5), 1โ17. The original momentum paper.
- Hinton, G. (2012). "Lecture 6e: RMSprop." Coursera Neural Networks for Machine Learning. Unpublished โ introduced in a lecture.
Important Related Work
- Keskar, N.S. et al. (2017). "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." ICLR 2017. Why large batches hurt generalization.
- Goyal, P. et al. (2017). "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv:1706.02677. Linear LR scaling rule for distributed training.
- Smith, L.N. & Topin, N. (2018). "Super-Convergence: Very Fast Training Using Large Learning Rates." arXiv:1708.07120. The 1-cycle policy.
- Loshchilov, I. & Hutter, F. (2017). "SGDR: Stochastic Gradient Descent with Warm Restarts." ICLR 2017. Cosine annealing with restarts.
- Wilson, A.C. et al. (2017). "The Marginal Value of Adaptive Gradient Methods in Machine Learning." NeurIPS 2017. SGD can outperform Adam on generalization.
Textbooks
- Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8: Optimization for Training Deep Models.
- Ruder, S. (2016). "An overview of gradient descent optimization algorithms." arXiv:1609.04747. Excellent survey comparing all optimizers.
Indian Industry References
- Flipkart Tech Blog. "Scaling Deep Learning for Search Ranking." tech.flipkart.com
- Infosys Research. "AI-Powered Fraud Detection for Indian Banking." Infosys Knowledge Institute, 2023.
- TCS Research. "Multilingual NLP for Indian Languages: Training Strategies." ACL 2023 Workshop on Language Technology for Equality.