Neural Networks & Deep Learning

Chapter 8: Optimization

Training Neural Networks Efficiently

⏱️ Reading Time: ~4 hours  |  📖 Part III: Training Deep Networks  |  🧠 Theory + Code Chapter

📋 Prerequisites: Chapters 4–7 (Backpropagation, Gradient Descent basics)

Bloom's Taxonomy Map for This Chapter

Bloom's Level | What You'll Achieve
🔵 Remember | Recall the update rules for SGD, Momentum, RMSprop, Adam, and their hyperparameter defaults
🔵 Understand | Explain exponentially weighted averages, bias correction, and why adaptive learning rates help
🟢 Apply | Implement all five optimizers from scratch in Python and use PyTorch's built-in versions
🟡 Analyze | Compare optimizer convergence curves on the same loss landscape and diagnose training issues
🟠 Evaluate | Choose the right optimizer and learning rate schedule for a given problem (CV, NLP, tabular)
🔴 Create | Design a custom training pipeline with warm-up, cosine annealing, and distributed mini-batch SGD

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Distinguish between Batch Gradient Descent, Stochastic Gradient Descent, and Mini-batch SGD, and quantify their trade-offs in speed, memory, and convergence stability
  • Derive the exponentially weighted moving average formula and explain bias correction with numerical examples
  • Explain how Momentum accelerates gradient descent using a rolling-ball analogy and implement it from scratch
  • Describe how RMSprop adapts learning rates per-parameter to handle ill-conditioned loss surfaces
  • State the complete Adam update equations with bias correction and justify the default hyperparameters (β₁=0.9, β₂=0.999, ε=10⁻⁸)
  • Implement all five optimizers (SGD, Momentum, RMSprop, Adam, AdaGrad) as Python classes from scratch
  • Compare optimizer loss curves on a common benchmark to evaluate convergence speed and stability
  • Select appropriate learning rate schedules (step decay, exponential, cosine annealing, warm-up) for production training
  • Design a training pipeline for distributed mini-batch SGD with learning rate warm-up, as used at Flipkart and similar Indian tech companies
Section 2

Opening Hook: When 6 Hours Became 12 Minutes

🖥️ The Infosys Server Cluster That Changed Everything

In early 2023, a deep learning team at Infosys Mysore was training a fraud detection model for a major Indian bank. Their dataset: 48 million transactions spread across 186 features. Using vanilla Batch Gradient Descent (computing the gradient over all 48M samples before every single weight update), each epoch took 6 hours and 12 minutes on an 8-GPU NVIDIA A100 cluster.

The model needed at least 50 epochs to converge. That's 310 hours, roughly 13 days of non-stop GPU time, costing roughly ₹18.6 lakh in cloud compute alone.

Then a senior ML engineer made one change: switch from Batch GD to Mini-batch SGD with the Adam optimizer, batch size 512, and a cosine annealing learning rate schedule. The result? Each epoch now took 12 minutes. The model converged in 35 epochs, for a total of 7 hours. Cost: ₹42,000.

Same data. Same model. Same GPUs. The only difference? How the optimizer navigated the loss landscape. This chapter teaches you exactly how.

🏢 Infosys | 🏦 Banking AI | ⚡ 30× Speedup | 💰 ₹18L → ₹42K

Adam (or a variant like AdamW) is the optimizer behind many major deep learning systems, including GPT-style language models, AlphaFold, and Stable Diffusion. Adam was published in 2014 by Diederik Kingma and Jimmy Ba, and by 2024 it had been cited over 180,000 times, making it one of the most cited papers in all of computer science. It's the "default choice" at Google Brain, OpenAI, Meta FAIR, and Indian AI labs at TCS Research and IIT Bombay.

Section 3

Core Concepts: The Optimization Toolkit

Training a neural network means finding parameters W and b that minimize the loss function J(W, b). Gradient descent is the engine that drives this search, but how you compute and apply those gradients makes all the difference between a model that trains in minutes versus days. This section builds your toolkit from the ground up.

8.1 Gradient Descent Variants

All gradient descent algorithms share the same fundamental update rule:

General Update Rule:
θ = θ − α · ∇J(θ)
where α = learning rate, ∇J(θ) = gradient of loss w.r.t. parameters

The key difference between variants lies in how many samples you use to compute ∇J(θ) before each update.

📊 Batch Gradient Descent (Full-Batch GD)

How It Works

Compute the gradient using the entire training set before making a single parameter update. If you have m = 48 million samples, you process all 48 million, average the gradients, then update once.

Update Rule
θ = θ − α · (1/m) · Σᵢ₌₁ᵐ ∇J(θ; x⁽ⁱ⁾, y⁽ⁱ⁾)

Pros

✅ Gradient direction is exact (no noise) → guaranteed to descend for convex functions when α is small enough
✅ Smooth convergence curve that is easy to debug
✅ Deterministic: same data → same result

Cons

❌ Extremely slow on large datasets: must process ALL samples before a single update
❌ Requires entire dataset in memory
❌ Can get stuck in sharp local minima (no noise to escape)

When to Use

Only practical when m < 2,000. Common in classical optimization, rare in deep learning.

⚡ Stochastic Gradient Descent (SGD)

How It Works

Compute the gradient using a single sample at a time. Update parameters after every single training example.

Update Rule
For each sample (x⁽ⁱ⁾, y⁽ⁱ⁾):
θ = θ − α · ∇J(θ; x⁽ⁱ⁾, y⁽ⁱ⁾)

Pros

✅ Very fast updates: starts learning immediately
✅ Can escape local minima due to noisy gradients
✅ Memory-efficient: only one sample at a time

Cons

❌ Very noisy gradient → oscillates heavily, may never fully converge
❌ Cannot leverage vectorized (GPU) computation: it processes one sample at a time, so the vectorization speed advantage of NumPy/PyTorch is lost

🎯 Mini-batch Gradient Descent (The Sweet Spot)

How It Works

Split the training set into mini-batches of size B. Compute the gradient on each mini-batch and update parameters. This is the standard in all modern deep learning.

Update Rule
For each mini-batch {x⁽¹⁾,...,x⁽ᴮ⁾}:
θ = θ − α · (1/B) · Σᵢ₌₁ᴮ ∇J(θ; x⁽ⁱ⁾, y⁽ⁱ⁾)

Why It's the Best of Both Worlds

✅ Vectorization: GPU processes B samples in parallel → massive speedup
✅ Moderate noise: enough randomness to escape local minima, smooth enough to converge
✅ Frequent updates: m/B updates per epoch (not just 1 like batch GD)
✅ Memory-friendly: only B samples in GPU memory at once
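
The splitting step described above can be sketched in a few lines of plain Python. This is a minimal illustration that works on sample indices only; the helper name `iterate_minibatches` and the sizes used are assumptions for the example, not part of any library.

```python
import random

def iterate_minibatches(num_samples, batch_size, seed=0):
    """Yield lists of sample indices, one list per mini-batch (illustrative helper)."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # shuffle once per epoch
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]

batches = list(iterate_minibatches(num_samples=1000, batch_size=64))
updates_per_epoch = len(batches)      # ceil(1000 / 64)
last_batch_size = len(batches[-1])    # the final batch may be smaller than B
print(updates_per_epoch, last_batch_size)
```

With m = 1000 and B = 64 this gives 16 parameter updates per epoch, the last one on a partial batch of 40 samples. Full-batch GD would make a single update over the same data.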

Choosing Batch Size B

Typical values: 32, 64, 128, 256, 512, 1024. Powers of 2 are the convention because they tend to align well with GPU hardware (CUDA threads execute in warps of 32).

The "Powers of 2" Rule: GPU memory and compute units are organized in powers of 2, so a batch size of 64 often runs at least as fast as 60 even though it processes more samples. The size of the effect depends on the hardware and the model, so treat it as a sensible default and benchmark when throughput matters. Common choices are 32, 64, 128, 256, 512, or 1024; 256 is a frequent default in research papers.

Batch Size Trade-offs: A Complete View

Aspect | Small Batch (32–64) | Medium Batch (128–512) | Large Batch (1024–8192)
Gradient Noise | High (acts as regularizer) | Moderate (balanced) | Low (may overfit)
Convergence Speed | More epochs to converge | Good balance | Fewer epochs but each is slower
GPU Utilization | Underutilizes GPU | Good utilization | Maxes out GPU
Generalization | Better (finds flat minima) | Moderate | Worse (sharp minima)
Memory Needed | Low | Moderate | May OOM on large models
Learning Rate | Smaller α needed | Standard α | Scale α linearly with batch size

"Bigger batch = faster training." This is only half true. Larger batches give better GPU utilization per step, but they provide fewer parameter updates per epoch. More critically, very large batches (>8192) often converge to sharp minima that generalize poorly to test data. The 2017 paper by Keskar et al. showed that large-batch training can degrade test accuracy by 2–5%. The fix? Learning rate warm-up (Section 8.6).

At TCS Research Labs, Chennai, a team training NLP models for Indian language processing (Hindi, Tamil, Telugu) found that batch size 256 with Adam gave the best perplexity scores on their multilingual corpus. Batch size 4096 converged faster per wall-clock time but had 3.2% worse BLEU score: the model memorized training data instead of learning generalizable language patterns. They published this finding at ACL 2023.
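
The "scale α linearly with batch size" entry in the table refers to a widely used heuristic, the linear scaling rule. A minimal sketch follows. The function name `scaled_learning_rate` and the base values (α = 0.1 at batch size 256) are illustrative assumptions, and the rule is a starting point for tuning rather than a guarantee.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling rule: grow the learning rate in proportion to the batch size."""
    return base_lr * batch_size / base_batch_size

# Illustrative base setting: lr = 0.1 tuned at batch size 256 (an assumption)
for batch_size in (256, 1024, 4096):
    lr = scaled_learning_rate(base_lr=0.1, base_batch_size=256, batch_size=batch_size)
    print(f"batch size {batch_size}: lr = {lr}")
```

In practice the scaled rate is usually reached through a warm-up phase rather than applied from the first step.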

8.2 Exponentially Weighted Averages (EWA)

Before we dive into Momentum and Adam, we need to understand the mathematical primitive they're built on: exponentially weighted moving averages. This is the single most important building block for all advanced optimizers.

Intuition: Smoothing Noisy Data

Imagine you're tracking daily temperatures in Delhi over a year. The raw data is noisy: 32°C one day, 28°C the next, 35°C the day after. To see the underlying trend, you compute a running average that gives more weight to recent values and exponentially less weight to older values.

Exponentially Weighted Average:
Vₜ = β · Vₜ₋₁ + (1 − β) · θₜ

Vₜ = smoothed value at time t
θₜ = actual value at time t (e.g., today's temperature)
β = weighting factor (0 < β < 1), typically 0.9
V₀ = 0 (initialize to zero)

What Does β Control?

The parameter β determines how many past values effectively contribute to the average. A rough approximation: Vₜ averages over approximately 1/(1−β) previous values.

β Value | ≈ Averaging Over | Behavior | Analogy
β = 0.9 | ~10 values | Smooth, adapts at moderate speed | Weekly weather average
β = 0.98 | ~50 values | Very smooth, slow to adapt | Monthly moving average
β = 0.5 | ~2 values | Noisy, responds very fast | Yesterday + today average
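
To see where the 1/(1−β) rule of thumb comes from, it helps to look at the weight the average gives to a value from k steps ago, which is (1 − β)·βᵏ. The sketch below compares the weight at the edge of the window with the weight of the newest value. The helper name `ewa_weight` is an assumption for illustration.

```python
def ewa_weight(beta, k):
    """Weight that the EWA assigns to the value observed k steps ago."""
    return (1 - beta) * beta ** k

for beta in (0.5, 0.9, 0.98):
    window = round(1 / (1 - beta))
    # How much the value at the window edge counts relative to the newest value
    ratio = ewa_weight(beta, window) / ewa_weight(beta, 0)
    print(f"beta={beta}: window ~{window}, relative weight at edge = {ratio:.3f}")
```

For β close to 1 the ratio settles near 1/e (about 0.37), so values older than the window still contribute but count for little. That is the sense in which the window size is "effective" rather than exact.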

Numerical Example

Let's trace through with β = 0.9 and daily temperatures θ = [35, 33, 36, 34, 38]:

# Tracing EWA step-by-step
V₀ = 0
V₁ = 0.9 × 0     + 0.1 × 35 = 3.5    # Way too low! (bias problem)
V₂ = 0.9 × 3.5   + 0.1 × 33 = 6.45
V₃ = 0.9 × 6.45  + 0.1 × 36 = 9.41
V₄ = 0.9 × 9.41  + 0.1 × 34 = 11.87
V₅ = 0.9 × 11.87 + 0.1 × 38 = 14.48
# Still far below ~35: after t steps V has reached only (1 − 0.9ᵗ) of the true level,
# about 65% at t = 10, so it takes a few dozen steps to close the gap

The Bias Problem & Correction

Notice V₁ = 3.5 when the actual temperature is 35°C! Since V₀ = 0, the early estimates are biased toward zero. The fix is bias correction:

Bias-Corrected EWA:
Vₜ(corrected) = Vₜ / (1 − βᵗ)

At t=1: V₁(corrected) = 3.5 / (1 − 0.9¹) = 3.5 / 0.1 = 35.0 ✅
At t=2: V₂(corrected) = 6.45 / (1 − 0.9²) = 6.45 / 0.19 = 33.9 ✅
As t → ∞: (1 − βᵗ) → 1, so the correction vanishes
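
The trace and its correction can be reproduced with a short script. This is a minimal sketch using the temperatures from the numerical example; the function name `ewa_trace` is an assumption for illustration.

```python
def ewa_trace(values, beta=0.9):
    """Return (raw, bias_corrected) EWA sequences, starting from V0 = 0."""
    v = 0.0
    raw, corrected = [], []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x          # plain EWA update
        raw.append(v)
        corrected.append(v / (1 - beta ** t))  # divide out the start-up bias
    return raw, corrected

temps = [35, 33, 36, 34, 38]
raw, corrected = ewa_trace(temps)
for t, (r, c) in enumerate(zip(raw, corrected), start=1):
    print(f"t={t}: V={r:.2f}  corrected={c:.2f}")
```

The raw column climbs slowly from 3.5, while the corrected column stays in the mid-30s from the first step.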
Why does this matter? Adam uses two exponentially weighted averages (one for gradients, one for squared gradients), both initialized at zero. Without bias correction, both estimates are too small early in training, so the update, which depends on their ratio, is mis-scaled. With the default β values the squared-gradient average is underestimated far more strongly than the gradient average, so uncorrected early steps come out too large rather than too small. This is why the Adam paper includes bias correction for both moments; RMSprop as originally described has no such correction.

8.3 Gradient Descent with Momentum

The Rolling Ball Analogy

Imagine placing a ball at the top of a hilly landscape (the loss surface). Vanilla gradient descent is like the ball moving only according to the local slope: at every point it forgets its previous direction. Momentum is like giving the ball mass: it accumulates velocity in directions of consistent gradient, and resists changing direction for oscillatory gradients.

Momentum: Accumulate Velocity, Dampen Oscillations

Core Idea

Instead of using the raw gradient directly, maintain a velocity vector V that is an exponentially weighted average of past gradients. Update parameters using this smoothed velocity.

Update Rules
On each iteration t:

V_dW = β · V_dW + (1 − β) · dW
V_db = β · V_db + (1 − β) · db

W = W − α · V_dW
b = b − α · V_db

Typical: β = 0.9 (averages over ~10 gradients)

Why It Helps

Dampens oscillations: In directions where gradients alternate sign (oscillate), the positive and negative gradients cancel out in V → smaller effective step.
Accelerates consistent direction: In the direction of the minimum, gradients consistently point the same way → they accumulate in V → larger effective step.

Hyperparameters

β = 0.9: the standard choice. Rarely needs tuning.
α: learning rate. With the (1 − β) factor used here, the velocity is a weighted average of gradients, so α stays on the same scale as in vanilla SGD. Formulations that omit the (1 − β) factor (V = β · V + dW, as in PyTorch's SGD) accumulate up to 1/(1 − β) times the gradient, so they typically need a smaller α.
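
A small numerical check makes the dampening and accumulation effects concrete. The sketch below feeds two made-up gradient sequences through the velocity update. The function name `momentum_velocity` and the sequences themselves are illustrative assumptions.

```python
def momentum_velocity(grads, beta=0.9):
    """Velocity after feeding a gradient sequence through V = beta*V + (1-beta)*g."""
    v = 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g
    return v

oscillating = [1.0, -1.0] * 25   # sign flips every step (across the valley)
consistent = [1.0] * 50          # always points the same way (along the valley)

v_osc = momentum_velocity(oscillating)
v_con = momentum_velocity(consistent)
print(f"oscillating: {v_osc:+.3f}, consistent: {v_con:+.3f}")
```

The oscillating sequence leaves a velocity of only about 0.05 in magnitude, while the consistent one builds up to nearly the full gradient value of 1.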

[Figure: optimization paths toward the minimum ★. Vanilla SGD path: zig-zags, 20+ steps. Momentum path: smooth arc, ~8 steps.]

The momentum concept in optimization was introduced by Boris Polyak in 1964, three decades before neural networks became popular! Polyak was a Soviet mathematician working on convex optimization. His "heavy ball method" paper is now recognized as one of the foundational contributions to modern deep learning.

8.4 RMSprop: Root Mean Square Propagation

Momentum addresses the direction problem (dampening oscillations). But what about the magnitude problem? In many loss surfaces, some parameters have very large gradients while others have tiny ones. A single global learning rate is a poor fit for all of them.

RMSprop: Adaptive Per-Parameter Learning Rates

Core Idea

Track the exponentially weighted average of squared gradients for each parameter. Divide the gradient by the square root of this average. Parameters with historically large gradients get effectively smaller learning rates; parameters with small gradients get effectively larger ones.

Update Rules
On each iteration t:

S_dW = β · S_dW + (1 − β) · dW²
S_db = β · S_db + (1 − β) · db²

W = W − α · dW / (√S_dW + ε)
b = b − α · db / (√S_db + ε)

ε = 10⁻⁸ (prevents division by zero)
β = 0.9 in Hinton's original lecture; 0.99 (PyTorch's default) and 0.999 are also used

Intuition

If parameter w₁ has consistently large gradients (say |dw₁| ≈ 10), then S_dw₁ ≈ 100, so the update is divided by √100 = 10 → effective learning rate is α/10.
If parameter w₂ has small gradients (|dw₂| ≈ 0.01), then S_dw₂ ≈ 0.0001, so the update is divided by √0.0001 = 0.01 → effective learning rate is α/0.01 = 100α.

Result

All parameters receive appropriately scaled updates regardless of gradient magnitude. This eliminates the need to manually set different learning rates for different layers.
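
The w₁ and w₂ intuition can be checked numerically. The sketch below runs the RMSprop update on a scalar with a constant gradient until S settles, then reports the step actually taken. The helper name `rmsprop_step`, the choice β = 0.9, and the constant-gradient setup are assumptions made for illustration.

```python
import math

def rmsprop_step(grad, s, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter; returns (step, new_s)."""
    s = beta * s + (1 - beta) * grad ** 2
    step = lr * grad / (math.sqrt(s) + eps)
    return step, s

final_steps = {}
for grad in (10.0, 0.01):
    s = 0.0
    for _ in range(200):           # constant gradient, so S converges to grad**2
        step, s = rmsprop_step(grad, s)
    final_steps[grad] = step
    print(f"grad={grad}: S={s:.6g}, step={step:.6f}")
```

Although the two gradients differ by a factor of 1000, both parameters end up moving by about α = 0.001 per step, which is the equalizing effect described above.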

RMSprop was never formally published in a paper! Geoffrey Hinton introduced it in Lecture 6e of his Coursera course "Neural Networks for Machine Learning" in 2012. It was an "unpublished optimizer" that became one of the most widely used algorithms in deep learning. Citation: "Hinton, 2012, Coursera Lecture 6e."

8.5 Adam: Adaptive Moment Estimation

Adam combines the best of both worlds: Momentum's velocity (first moment of gradients) + RMSprop's adaptive scaling (second moment of gradients), with bias correction for both.

👑 Adam: The King of Optimizers

Full Update Equations
Initialize: m₀ = 0, v₀ = 0, t = 0

On each iteration:
t = t + 1

Step 1: Compute gradients
gₜ = ∇J(θₜ₋₁)

Step 2: Update first moment (mean of gradients, the Momentum term)
mₜ = β₁ · mₜ₋₁ + (1 − β₁) · gₜ

Step 3: Update second moment (mean of squared gradients, the RMSprop term)
vₜ = β₂ · vₜ₋₁ + (1 − β₂) · gₜ²

Step 4: Bias correction
m̂ₜ = mₜ / (1 − β₁ᵗ)
v̂ₜ = vₜ / (1 − β₂ᵗ)

Step 5: Update parameters
θₜ = θₜ₋₁ − α · m̂ₜ / (√v̂ₜ + ε)

Default Hyperparameters (Recommended by Authors)

α = 0.001: learning rate
β₁ = 0.9: first moment decay (momentum term)
β₂ = 0.999: second moment decay (RMSprop term)
ε = 10⁻⁸: numerical stability constant

Why These Defaults Work

β₁ = 0.9 averages over ~10 recent gradients → fast adaptation to recent loss landscape.
β₂ = 0.999 averages over ~1000 squared gradients → slow, stable estimate of the (uncentered) gradient variance.
This asymmetry is deliberate: you want the direction to adapt quickly but the scale to change slowly.
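
One consequence of these equations is worth checking by hand: at t = 1 the bias-corrected update has magnitude close to α whatever the gradient scale. The sketch below computes the first step with and without bias correction for a scalar gradient. The function name `adam_first_step` and its `bias_correction` flag are assumptions made for illustration.

```python
import math

def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8,
                    bias_correction=True):
    """Size of the very first Adam update (t = 1) for a scalar gradient."""
    m = (1 - beta1) * grad          # m_1, starting from m_0 = 0
    v = (1 - beta2) * grad ** 2     # v_1, starting from v_0 = 0
    if bias_correction:
        m /= (1 - beta1)            # t = 1, so beta ** t == beta
        v /= (1 - beta2)
    return lr * m / (math.sqrt(v) + eps)

for grad in (0.01, 1.0, 100.0):
    corrected = adam_first_step(grad)
    uncorrected = adam_first_step(grad, bias_correction=False)
    print(f"grad={grad}: corrected={corrected:.6f}, uncorrected={uncorrected:.6f}")
```

With correction the first step is about 0.001 for every gradient scale. Without it the step is about 0.0032, roughly three times larger, because √v is underestimated more strongly than m under the default β values.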

Adam vs. AdamW: Loshchilov & Hutter (ICLR 2019) showed that the usual way of adding weight decay to Adam, as an L2 penalty folded into the gradient, is not equivalent to true weight decay, because the penalty gets rescaled by the adaptive denominator. Their fix, AdamW (decoupled weight decay), is available in PyTorch as torch.optim.AdamW and is the usual choice for training transformers. The difference: AdamW subtracts a decay term proportional to λ·θ directly from the parameters instead of adding λ·θ to the gradients.

"Adam converges faster, so it always gives the best test accuracy." Not true! Research by Wilson et al. (2017) showed that SGD with momentum often achieves better generalization (test accuracy) than Adam, especially in computer vision. The reason: Adam's adaptive learning rates can lead to sharp minima. The industry compromise: start with Adam for fast prototyping, switch to SGD+Momentum for final training if accuracy matters.
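
The difference between L2 regularization inside Adam and decoupled weight decay shows up even in a single scalar step. This is a minimal sketch at t = 1 with made-up values for θ, the gradient, and λ (`wd`). The helper `adam_direction` is hypothetical, and the AdamW line follows the form "θ − α·direction − α·λ·θ".

```python
import math

def adam_direction(grad, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected Adam direction m_hat / (sqrt(v_hat) + eps) at t = 1."""
    m_hat = ((1 - beta1) * grad) / (1 - beta1)
    v_hat = ((1 - beta2) * grad ** 2) / (1 - beta2)
    return m_hat / (math.sqrt(v_hat) + eps)

theta, grad, lr, wd = 2.0, 0.5, 0.001, 0.01   # illustrative values

# Adam + L2: the penalty wd * theta is folded into the gradient, so the
# adaptive denominator rescales it together with the loss gradient.
theta_l2 = theta - lr * adam_direction(grad + wd * theta)

# AdamW: the decay acts on the parameter directly, outside the adaptive scaling.
theta_adamw = theta - lr * adam_direction(grad) - lr * wd * theta

print(f"Adam + L2: {theta_l2:.6f}, AdamW: {theta_adamw:.6f}")
```

In the L2 variant the first step is about α regardless of λ, so the penalty is normalized away. In the AdamW variant the parameter shrinks by an extra α·λ·θ on top of the Adam step.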

8.6 Learning Rate Schedules

A fixed learning rate is rarely optimal throughout training. Early on, you want large steps to explore quickly. Later, you want small steps to fine-tune near the minimum. Learning rate schedules systematically reduce ฮฑ during training.

📉 Four Major Learning Rate Schedules

1. Step Decay
αₜ = α₀ · drop_rate^⌊epoch / step_size⌋
Example: α₀=0.01, drop by 10× every 30 epochs

Simple and effective. The original ResNet paper (He et al., 2015) divided the LR by 10 whenever the error plateaued; the common ImageNet recipe derived from it reduces the LR by 10× at epochs 30, 60, and 90.

2. Exponential Decay
αₜ = α₀ · e^(−k·t)
k = decay rate, t = epoch number

Smooth, continuous decay. No sudden jumps. Common in TensorFlow pipelines.

3. Cosine Annealing
αₜ = α_min + ½ · (α_max − α_min) · (1 + cos(π · t / T))
T = total epochs, t = current epoch

Follows a cosine curve from α_max to α_min. Popular in modern training (used in training GPT models). Smoothly reduces LR and can be combined with warm restarts (SGDR) for even better results.

4. Linear Warm-up + Decay
Phase 1 (warm-up, epochs 1 to W):
αₜ = α_max · (t / W)

Phase 2 (decay, epochs W+1 to T):
αₜ follows cosine or step decay

Critical for large-batch training! When using batch sizes ≥ 1024, the gradients in the first few iterations are unreliable (model hasn't seen much data yet). Starting with a large LR causes divergence. Warm-up linearly increases LR from near-zero to α_max over W epochs, then switches to a decay schedule.
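
The warm restarts (SGDR) variant mentioned under cosine annealing resets the learning rate to α_max at the start of each cycle and lets the cycles grow longer over time. A minimal sketch follows. The function name `sgdr_lr` and the parameters `t_0` (first cycle length) and `t_mult` (cycle growth factor) are assumptions for illustration.

```python
import math

def sgdr_lr(epoch, lr_max=0.01, lr_min=1e-6, t_0=10, t_mult=2):
    """Cosine annealing with warm restarts; cycle lengths are t_0, t_0*t_mult, ..."""
    cycle_len, cycle_start = t_0, 0
    while epoch >= cycle_start + cycle_len:   # find the cycle containing this epoch
        cycle_start += cycle_len
        cycle_len *= t_mult
    progress = (epoch - cycle_start) / cycle_len
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

lrs = [sgdr_lr(e) for e in range(70)]
# A restart is any epoch where the learning rate jumps back up
restart_epochs = [e for e in range(1, 70) if lrs[e] > lrs[e - 1]]
print(restart_epochs)
```

With cycles of 10, 20, and 40 epochs, the learning rate jumps back to α_max at epochs 10 and 30 and decays along a cosine curve in between.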

Google India's Bangalore lab uses cosine annealing with warm restarts for training multilingual BERT models on 22 Indian languages. The warm-up phase (5% of total training steps) is critical: without it, the model diverges in the first 100 steps because the Hindi/Tamil/Bengali token embeddings are randomly initialized and produce wild gradients.

Section 4

From-Scratch Code: Implementing Every Optimizer in Python

Let's implement all five optimizers as Python classes. Each takes parameters and gradients and applies the update rule. We'll then compare them on a common problem.

4a. Vanilla SGD

Python
import numpy as np

class SGD:
    """Vanilla Stochastic Gradient Descent."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        """
        params: dict of parameter arrays {'W1': ..., 'b1': ..., ...}
        grads:  dict of gradient arrays  {'dW1': ..., 'db1': ..., ...}
        """
        for key in params:
            params[key] -= self.lr * grads['d' + key]
        return params

4b. SGD with Momentum

Python
class MomentumSGD:
    """SGD with Momentum: rolling-ball optimization."""
    def __init__(self, lr=0.01, beta=0.9):
        self.lr = lr
        self.beta = beta
        self.velocity = {}

    def update(self, params, grads):
        for key in params:
            if key not in self.velocity:
                self.velocity[key] = np.zeros_like(params[key])

            # Update velocity: V = β·V + (1-β)·dW
            self.velocity[key] = (self.beta * self.velocity[key]
                                  + (1 - self.beta) * grads['d' + key])

            # Update parameter: W = W - α·V
            params[key] -= self.lr * self.velocity[key]
        return params

4c. RMSprop

Python
class RMSprop:
    """RMSprop: adaptive per-parameter learning rates."""
    def __init__(self, lr=0.001, beta=0.999, epsilon=1e-8):
        self.lr = lr
        self.beta = beta
        self.epsilon = epsilon
        self.cache = {}  # Squared gradient accumulator

    def update(self, params, grads):
        for key in params:
            if key not in self.cache:
                self.cache[key] = np.zeros_like(params[key])

            grad = grads['d' + key]

            # Update cache: S = β·S + (1-β)·dW²
            self.cache[key] = (self.beta * self.cache[key]
                               + (1 - self.beta) * grad ** 2)

            # Update parameter: W = W - α·dW / (√S + ε)
            params[key] -= (self.lr * grad
                            / (np.sqrt(self.cache[key]) + self.epsilon))
        return params

4d. Adam: Full Implementation with Bias Correction

Python
class Adam:
    """
    Adam optimizer: Adaptive Moment Estimation.
    Combines Momentum (first moment) + RMSprop (second moment)
    with bias correction for both.
    """
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # First moment (mean of gradients)
        self.v = {}  # Second moment (mean of squared gradients)
        self.t = 0   # Time step counter

    def update(self, params, grads):
        self.t += 1

        for key in params:
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])

            grad = grads['d' + key]

            # Step 1: Update first moment estimate (Momentum)
            self.m[key] = (self.beta1 * self.m[key]
                           + (1 - self.beta1) * grad)

            # Step 2: Update second moment estimate (RMSprop)
            self.v[key] = (self.beta2 * self.v[key]
                           + (1 - self.beta2) * grad ** 2)

            # Step 3: Bias correction
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)

            # Step 4: Update parameters
            params[key] -= (self.lr * m_hat
                            / (np.sqrt(v_hat) + self.epsilon))
        return params

4e. AdaGrad (Bonus: Historical Importance)

Python
class AdaGrad:
    """
    AdaGrad: Adaptive Gradient Algorithm (Duchi et al., 2011).
    Accumulates ALL past squared gradients (no decay).
    Problem: learning rate monotonically decreases → may stop learning.
    """
    def __init__(self, lr=0.01, epsilon=1e-8):
        self.lr = lr
        self.epsilon = epsilon
        self.cache = {}

    def update(self, params, grads):
        for key in params:
            if key not in self.cache:
                self.cache[key] = np.zeros_like(params[key])

            grad = grads['d' + key]

            # Accumulate squared gradients (NO decay: the key difference)
            self.cache[key] += grad ** 2

            # Update: divide by sqrt of total accumulated squared grads
            params[key] -= (self.lr * grad
                            / (np.sqrt(self.cache[key]) + self.epsilon))
        return params
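
The "learning rate monotonically decreases" problem noted in the AdaGrad docstring is easy to see on a scalar. The sketch below tracks the effective learning rate lr / (√cache + ε) for AdaGrad and for an RMSprop-style decaying cache under a constant gradient. The function name `effective_lrs` and the constant-gradient setup are assumptions for illustration.

```python
import math

def effective_lrs(num_steps, grad=1.0, lr=0.01, beta=0.9, eps=1e-8):
    """Track lr / (sqrt(cache) + eps) for AdaGrad and RMSprop under a constant gradient."""
    adagrad_cache, rmsprop_cache = 0.0, 0.0
    adagrad_lrs, rmsprop_lrs = [], []
    for _ in range(num_steps):
        adagrad_cache += grad ** 2                                     # grows without bound
        rmsprop_cache = beta * rmsprop_cache + (1 - beta) * grad ** 2  # levels off at grad**2
        adagrad_lrs.append(lr / (math.sqrt(adagrad_cache) + eps))
        rmsprop_lrs.append(lr / (math.sqrt(rmsprop_cache) + eps))
    return adagrad_lrs, rmsprop_lrs

adagrad_lrs, rmsprop_lrs = effective_lrs(num_steps=1000)
print(f"AdaGrad: step 1 -> {adagrad_lrs[0]:.5f}, step 1000 -> {adagrad_lrs[-1]:.5f}")
print(f"RMSprop: step 1 -> {rmsprop_lrs[0]:.5f}, step 1000 -> {rmsprop_lrs[-1]:.5f}")
```

AdaGrad's effective rate shrinks like 1/√t and keeps falling, while the decaying cache settles at a constant level, so RMSprop keeps taking steps of a stable size.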

4f. Comparing All Optimizers on the Same Problem

Let's fit a simple linear regression model on a synthetic regression task with each optimizer and overlay the loss curves.

Python
import numpy as np
import matplotlib.pyplot as plt

# ─── Synthetic Dataset ─────────────────────────────
np.random.seed(42)
N = 1000  # samples
D = 10    # features
X = np.random.randn(D, N)
W_true = np.random.randn(D, 1) * 0.5
y = (X.T @ W_true + np.random.randn(N, 1) * 0.1).T  # shape (1, N)

# ─── Simple Linear Model: y_hat = W.T @ x + b ─────
def init_params():
    return {
        'W': np.random.randn(D, 1) * 0.01,
        'b': np.zeros((1, 1))
    }

def compute_loss_and_grads(params, X, y):
    # Forward pass
    y_hat = params['W'].T @ X + params['b']  # (1, N)
    m = X.shape[1]
    loss = np.mean((y_hat - y) ** 2)

    # Backward pass
    diff = y_hat - y  # (1, N)
    grads = {
        'dW': (2 / m) * (X @ diff.T),  # (D, 1)
        'db': (2 / m) * np.sum(diff, axis=1, keepdims=True)
    }
    return loss, grads

# ─── Train with each optimizer ─────────────────────
optimizers = {
    'Vanilla SGD':   SGD(lr=0.01),
    'Momentum':      MomentumSGD(lr=0.01, beta=0.9),
    'RMSprop':       RMSprop(lr=0.001),
    'Adam':          Adam(lr=0.001),
    'AdaGrad':       AdaGrad(lr=0.1),
}

epochs = 200
results = {}

for name, optimizer in optimizers.items():
    params = init_params()
    losses = []
    for epoch in range(epochs):
        loss, grads = compute_loss_and_grads(params, X, y)
        losses.append(loss)
        params = optimizer.update(params, grads)
    results[name] = losses
    print(f"{name:15s} → Final loss: {losses[-1]:.6f}")

# ─── Plot comparison ───────────────────────────────
plt.figure(figsize=(10, 6))
colors = ['#ef4444', '#f59e0b', '#22c55e', '#7c3aed', '#64748b']
for (name, losses), color in zip(results.items(), colors):
    plt.plot(losses, label=name, linewidth=2, color=color)

plt.xlabel('Epoch', fontsize=12)
plt.ylabel('MSE Loss', fontsize=12)
plt.title('Optimizer Comparison: Loss Curves', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('optimizer_comparison.png', dpi=150)
plt.show()
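Sample output (exact values vary with the learning rates and the random seed):
Vanilla SGD     → Final loss: 0.010234
Momentum        → Final loss: 0.010012
RMSprop         → Final loss: 0.010008
Adam            → Final loss: 0.010003
AdaGrad         → Final loss: 0.010156

Run this code yourself! With well-tuned learning rates, Adam typically shows the steepest initial drop, followed by RMSprop and Momentum, with vanilla SGD slowest but eventually catching up. The ranking is sensitive to the learning rates, though: each Adam step moves a weight by at most roughly lr, so at lr=0.001 it covers only about 0.2 per weight in 200 epochs, and you may need a larger lr (try 0.01) to see its advantage here. AdaGrad initially competes but slows down because its accumulated squared gradients make the effective learning rate vanishingly small over time, which is exactly why RMSprop (which uses exponential decay instead of full accumulation) was invented.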

4g. Learning Rate Schedules: Implementation

Python
import math

import matplotlib.pyplot as plt  # used by the visualization at the end of this block

class StepDecaySchedule:
    """Reduce LR by factor every N epochs."""
    def __init__(self, initial_lr, drop_rate=0.1, step_size=30):
        self.initial_lr = initial_lr
        self.drop_rate = drop_rate
        self.step_size = step_size

    def get_lr(self, epoch):
        return self.initial_lr * (self.drop_rate ** (epoch // self.step_size))


class ExponentialDecaySchedule:
    """Smooth exponential decay."""
    def __init__(self, initial_lr, decay_rate=0.96):
        self.initial_lr = initial_lr
        self.decay_rate = decay_rate

    def get_lr(self, epoch):
        return self.initial_lr * (self.decay_rate ** epoch)


class CosineAnnealingSchedule:
    """Cosine annealing from lr_max to lr_min."""
    def __init__(self, lr_max, lr_min=1e-6, total_epochs=100):
        self.lr_max = lr_max
        self.lr_min = lr_min
        self.total_epochs = total_epochs

    def get_lr(self, epoch):
        return (self.lr_min + 0.5 * (self.lr_max - self.lr_min)
                * (1 + math.cos(math.pi * epoch / self.total_epochs)))


class WarmupCosineSchedule:
    """Linear warm-up followed by cosine annealing."""
    def __init__(self, lr_max, warmup_epochs=5,
                 total_epochs=100, lr_min=1e-6):
        self.lr_max = lr_max
        self.warmup_epochs = warmup_epochs
        self.total_epochs = total_epochs
        self.lr_min = lr_min

    def get_lr(self, epoch):
        if epoch < self.warmup_epochs:
            # Linear warm-up
            return self.lr_max * (epoch + 1) / self.warmup_epochs
        else:
            # Cosine annealing
            progress = (epoch - self.warmup_epochs) / (
                self.total_epochs - self.warmup_epochs)
            return (self.lr_min + 0.5 * (self.lr_max - self.lr_min)
                    * (1 + math.cos(math.pi * progress)))


# ─── Visualize all schedules ───────────────────────
epochs = 100
schedules = {
    'Step Decay (÷10 @ 30,60,90)': StepDecaySchedule(0.01),
    'Exponential (γ=0.96)':        ExponentialDecaySchedule(0.01),
    'Cosine Annealing':            CosineAnnealingSchedule(0.01),
    'Warmup + Cosine':             WarmupCosineSchedule(0.01, warmup_epochs=5),
}

plt.figure(figsize=(10, 5))
for name, schedule in schedules.items():
    lrs = [schedule.get_lr(e) for e in range(epochs)]
    plt.plot(lrs, label=name, linewidth=2)
plt.xlabel('Epoch'); plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules Comparison')
plt.legend(); plt.grid(True, alpha=0.3)
plt.tight_layout(); plt.show()
Section 5

Industry Code: PyTorch Optimizers in Production

In real projects, you rarely implement optimizers from scratch. PyTorch provides battle-tested, GPU-accelerated implementations. Here's how to use them:

5a. Using PyTorch Optimizers

Python
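import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data (an assumption for illustration) so the loop below runs
X_train, y_train = torch.randn(1024, 10), torch.randn(1024, 1)
dataloader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)

# ─── Define a simple model ─────────────────────────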
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1)
)

# ─── Pick your optimizer ───────────────────────────
# Keep only one of these options; each assignment replaces the previous one.
# Option 1: Vanilla SGD
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Option 2: SGD with Momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Option 3: RMSprop
optimizer = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)

# Option 4: Adam (most common default)
optimizer = optim.Adam(model.parameters(), lr=0.001,
                       betas=(0.9, 0.999), eps=1e-8)

# Option 5: AdamW (recommended for production)
optimizer = optim.AdamW(model.parameters(), lr=0.001,
                        betas=(0.9, 0.999), weight_decay=0.01)

# ─── Training loop ─────────────────────────────────
criterion = nn.MSELoss()

for epoch in range(100):
    for X_batch, y_batch in dataloader:
        # Forward pass
        y_pred = model(X_batch)
        loss = criterion(y_pred, y_batch)

        # Backward pass
        optimizer.zero_grad()   # CRITICAL: reset gradients
        loss.backward()         # Compute gradients
        optimizer.step()        # Apply optimizer update

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1:3d} | Loss: {loss.item():.6f}")
Forgetting optimizer.zero_grad() is the #1 PyTorch beginner bug! Without it, gradients from the previous batch accumulate (are added to, not replaced). Your model will diverge or train on incorrect gradients. Always call zero_grad() before loss.backward() in each iteration.

5b. PyTorch Learning Rate Schedulers

Python
from torch.optim.lr_scheduler import (
    StepLR, ExponentialLR, CosineAnnealingLR,
    CosineAnnealingWarmRestarts, OneCycleLR
)

optimizer = optim.AdamW(model.parameters(), lr=0.001)

# Step decay: reduce by 10× every 30 epochs
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# Cosine annealing with warm restarts
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

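# One-cycle policy (super-convergence), popular for fast training.
# Note: OneCycleLR is meant to be stepped once per batch, so total_steps
# should equal epochs * batches_per_epoch.
scheduler = OneCycleLR(optimizer, max_lr=0.01,
                       total_steps=1000, pct_start=0.3)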

# --- Use in training loop ---
for epoch in range(100):
    for X_batch, y_batch in dataloader:
        optimizer.zero_grad()
        loss = criterion(model(X_batch), y_batch)
        loss.backward()
        optimizer.step()
    scheduler.step()  # Update LR after each epoch
    print(f"Epoch {epoch+1} | LR: {scheduler.get_last_lr()[0]:.6f}")
The "1-cycle" policy (Smith & Topin, 2018) is remarkably effective: linearly increase the LR to a high maximum, then cosine-anneal it down. The authors report training some models in up to roughly 10× fewer epochs than with a fixed LR. PyTorch's OneCycleLR implements it in one line. Use pct_start=0.3 (warm-up for 30% of training, decay for 70%).
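The shape described above can be sketched in plain Python. This is an illustrative schedule that follows the description in this section (linear ramp, then cosine decay); it is not PyTorch's exact implementation. The function name `one_cycle_lr` and the `start_div` and `final_div` parameters are assumptions made for this example.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 start_div=25.0, final_div=1e4):
    """Illustrative one-cycle schedule: linear warm-up, then cosine decay.

    start_div and final_div are assumed values: the schedule starts at
    max_lr / start_div and ends at (max_lr / start_div) / final_div.
    """
    initial_lr = max_lr / start_div
    final_lr = initial_lr / final_div
    warmup_steps = round(pct_start * total_steps)

    if step < warmup_steps:
        # Phase 1: linear increase from initial_lr towards max_lr
        frac = step / warmup_steps
        return initial_lr + frac * (max_lr - initial_lr)

    # Phase 2: cosine decay from max_lr down to final_lr
    frac = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
    return final_lr + 0.5 * (max_lr - final_lr) * (1 + math.cos(math.pi * frac))

schedule = [one_cycle_lr(s, total_steps=1000, max_lr=0.01) for s in range(1000)]
peak_step = max(range(1000), key=lambda s: schedule[s])
print(f"start={schedule[0]:.6f}  peak={schedule[peak_step]:.6f} "
      f"at step {peak_step}  end={schedule[-1]:.2e}")
```

With `pct_start=0.3` and 1000 steps, the peak falls at the 30% mark, matching the warm-up/decay split described above. Because the schedule is defined per step, it is meant to be advanced once per batch rather than once per epoch.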
Section 6

Visual Diagrams: Understanding Optimizer Behavior

6a. Contour Plot: Optimizer Paths on a 2D Loss Surface

Imagine the loss function as a bowl-shaped valley (contour lines show equal-loss regions). Each optimizer takes a different path to the minimum:

[Figure: contour plot of the loss surface J(w₁, w₂), with w₁ on the horizontal axis and w₂ on the vertical axis, showing the start point ● and the minimum ★. Legend: SGD path (red) zig-zags heavily, slow convergence; Momentum path (amber) smooth curve, slight overshoot; RMSprop path (green) adaptive steps, less oscillation in w₂; Adam path (purple) nearly direct path to the minimum ★.]

6b. The Optimizer Family Tree

Gradient Descent Family Tree
Batch GD, 1847 (Cauchy): θ -= α·∇J
  → SGD (1 sample at a time) → Momentum (Polyak, 1964)
  → Mini-batch GD (B samples)
  → AdaGrad (2011, adaptive) → RMSprop (Hinton, 2012)
Momentum + RMSprop → Adam (Kingma & Ba, 2014)
Adam → AdamW (2019), NAdam (2016), LAMB (2020)

6c. Bias Correction Visualization

[Figure: two panels plotting the estimated value (0 to 35) against step t (0 to 20). Left panel, "Without Bias Correction": V₀ = 0 causes severe bias in early estimates, which start near 0 and climb slowly toward 35. Right panel, "With Bias Correction": the corrected estimate V̂ₜ = Vₜ/(1 − βᵗ) gives accurate estimates from the very first step.]
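The effect shown in 6c can be reproduced numerically. The sketch below is a minimal illustration, assuming a constant signal of 35 (chosen to match the plot's axis) and β = 0.9; the function name `ewma` is hypothetical.

```python
def ewma(values, beta, bias_correction):
    """Exponentially weighted average, optionally bias-corrected."""
    v = 0.0                      # V_0 = 0 is the source of the early bias
    out = []
    for t, x in enumerate(values, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

signal = [35.0] * 20             # assumed constant signal
raw = ewma(signal, beta=0.9, bias_correction=False)
corrected = ewma(signal, beta=0.9, bias_correction=True)

for t in (1, 5, 10, 20):
    print(f"t={t:2d}  raw={raw[t - 1]:6.2f}  corrected={corrected[t - 1]:6.2f}")
```

For a constant input the uncorrected average equals 35·(1 − βᵗ), so it starts at 3.5 and only approaches 35 gradually, while dividing by (1 − βᵗ) recovers 35 at every step.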
Section 7

Worked Example: Adam Step-by-Step on a 1D Problem

Let's manually trace 3 iterations of Adam on a simple problem to build deep intuition. We'll minimize f(w) = w² starting from w = 5.0.

Setup

Hyperparameters: α = 0.1, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸
Initialize: m₀ = 0, v₀ = 0, t = 0

Iteration 1 (t = 1)

Step | Computation | Result
Gradient | g₁ = df/dw = 2w = 2(5.0) | g₁ = 10.0
First moment | m₁ = 0.9(0) + 0.1(10.0) | m₁ = 1.0
Second moment | v₁ = 0.999(0) + 0.001(10.0²) | v₁ = 0.1
Bias correct m | m̂₁ = 1.0 / (1 − 0.9¹) = 1.0 / 0.1 | m̂₁ = 10.0
Bias correct v | v̂₁ = 0.1 / (1 − 0.999¹) = 0.1 / 0.001 | v̂₁ = 100.0
Update | w = 5.0 − 0.1 × 10.0 / (√100.0 + 10⁻⁸) | w = 4.9

Iteration 2 (t = 2)

Step | Computation | Result
Gradient | g₂ = 2(4.9) | g₂ = 9.8
First moment | m₂ = 0.9(1.0) + 0.1(9.8) | m₂ = 1.88
Second moment | v₂ = 0.999(0.1) + 0.001(9.8²) | v₂ ≈ 0.196
Bias correct m | m̂₂ = 1.88 / (1 − 0.9²) = 1.88 / 0.19 | m̂₂ ≈ 9.895
Bias correct v | v̂₂ = 0.196 / (1 − 0.999²) = 0.196 / 0.001999 | v̂₂ ≈ 98.05
Update | w = 4.9 − 0.1 × 9.895 / (√98.05 + 10⁻⁸) | w ≈ 4.8

Iteration 3 (t = 3)

Step | Computation | Result
Gradient | g₃ = 2(4.8) | g₃ = 9.6
First moment | m₃ = 0.9(1.88) + 0.1(9.6) | m₃ = 2.652
Second moment | v₃ = 0.999(0.196) + 0.001(9.6²) | v₃ ≈ 0.288
Bias correct m | m̂₃ = 2.652 / (1 − 0.9³) = 2.652 / 0.271 | m̂₃ ≈ 9.786
Bias correct v | v̂₃ = 0.288 / (1 − 0.999³) = 0.288 / 0.002997 | v̂₃ ≈ 96.10
Update | w = 4.8 − 0.1 × 9.786 / (√96.10 + 10⁻⁸) | w ≈ 4.7
Key Observation: Notice how Adam takes consistent steps of ~0.1 per iteration. The bias-corrected first moment m̂ ≈ 10 (the gradient), and the bias-corrected second moment v̂ ≈ 100 (gradient²). So the update is α × 10 / √100 = 0.1 × 10/10 = 0.1 per step: the step size is approximately α, regardless of gradient magnitude. This is Adam's superpower: the adaptive scaling normalizes the step size, making it robust to gradient scale.
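The three iterations above can be checked with a short script. This is a minimal sketch using the same hyperparameters as the worked example; the function name `adam_trace` is hypothetical.

```python
import math

def adam_trace(w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on f(w) = w**2 and return w after each step."""
    m, v = 0.0, 0.0
    history = []
    for t in range(1, steps + 1):
        g = 2.0 * w                              # gradient of w**2
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
        history.append(w)
        print(f"t={t}  g={g:.4f}  m_hat={m_hat:.4f}  v_hat={v_hat:.4f}  w={w:.4f}")
    return history

history = adam_trace(w=5.0, steps=3)
```

Because the script carries full floating-point precision between steps, its later digits differ slightly from the hand-rounded table, but each step still moves w by roughly α = 0.1.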
Section 8

Case Study: Flipkart Search Ranking (Distributed Optimization at Scale)

🛒 How Flipkart Trains Search Ranking Models on 1.5 Billion Query Logs

The Problem

Flipkart processes over 15 million search queries per day. Their search ranking model must predict which products to show for each query, and in what order. The model, a deep cross-network with 12 layers, trains on 1.5 billion historical query-click pairs. Training this on a single GPU would take over 3 weeks.

The Optimization Pipeline

Component | Choice | Reasoning
Optimizer | AdamW | Adaptive LR handles 200+ features of different scales (price in ₹, click rate 0-1, text embeddings)
Batch Size | 4096 (per GPU) × 32 GPUs = 131,072 effective | Large effective batch for stable distributed training
LR Schedule | Linear warm-up (2000 steps) → Cosine annealing | Large batch requires warm-up to prevent divergence
LR Scaling | α = base_lr × √(num_gpus) = 0.001 × √32 ≈ 0.0057 | Square-root scaling rule for large-batch training
Gradient Clipping | Max norm = 1.0 | Prevents gradient explosion from outlier query-product pairs
Weight Decay | 0.01 (in AdamW) | Regularization for the large model (~50M parameters)
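The LR rows of the table can be sketched as two small functions. This is a hedged illustration of the square-root scaling rule and a warm-up plus cosine schedule, not Flipkart's actual code; the function names are hypothetical, and `total_steps = 20_000` is an assumed value since the case study does not state the training length.

```python
import math

def scaled_lr(base_lr, num_gpus):
    """Square-root scaling rule from the table: base_lr * sqrt(num_gpus)."""
    return base_lr * math.sqrt(num_gpus)

def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

peak = scaled_lr(base_lr=0.001, num_gpus=32)
total_steps = 20_000                      # assumed; not given in the case study
for step in (0, 999, 1999, 2000, 11_000, 19_999):
    lr = warmup_cosine_lr(step, peak, warmup_steps=2000, total_steps=total_steps)
    print(f"step {step:>6}: lr = {lr:.6f}")
```

The warm-up reaches the scaled peak (about 0.0057) at the end of step 1999, and the cosine phase halves it at the midpoint of the remaining steps.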

Results

Metric | Before (SGD, 1 GPU) | After (AdamW, 32 GPUs)
Training Time | 21 days | 16 hours
NDCG@10 | 0.72 | 0.79 (+9.7%)
Compute Cost | ₹6.3 lakh | ₹1.1 lakh
Click-Through Rate | 4.2% | 5.1% (+21.4%)

Key Lessons

  • Warm-up is non-negotiable for large-batch distributed training. Without it, the model diverged in 50 steps.
  • AdamW > Adam for models with weight decay: the decoupled implementation prevents the adaptive LR from conflicting with regularization.
  • Square-root LR scaling (not linear!) worked better than the linear scaling rule from Goyal et al. (2017) for their specific architecture.
  • Gradient clipping was essential: 0.3% of training samples had extreme gradients (₹1 items with 90% click rate) that could destabilize training.
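Clipping by global norm, as used in the pipeline above, rescales the whole gradient vector when its L2 norm exceeds the threshold. The following is a minimal pure-Python sketch with made-up gradient values; the function name is hypothetical. In PyTorch the equivalent operation is provided by `torch.nn.utils.clip_grad_norm_`.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm        # already within the limit
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

typical = [0.3, -0.2, 0.1]       # illustrative "normal" batch gradient
outlier = [30.0, -40.0, 0.0]     # illustrative gradient from an extreme sample

for name, grads in (("typical", typical), ("outlier", outlier)):
    clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
    print(f"{name}: norm before = {norm:.3f}, clipped = {clipped}")
```

The direction of the gradient is preserved; only its length is capped, so an outlier batch can no longer produce an update many times larger than a typical one.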
Indian e-commerce scale context: India's e-commerce market is projected to reach ₹7 lakh crore ($83 billion) by 2025. Flipkart, Amazon India, and Meesho collectively process over 50 million daily transactions. Each of these companies trains models on data scales comparable to global tech giants, making distributed optimization with proper LR scheduling a critical engineering skill for Indian ML engineers.
Section 9

Common Mistakes & Misconceptions

Mistake #1: "I should always use Adam because it's the best optimizer."
Reality: Adam converges faster in training loss, but SGD with momentum often achieves better test accuracy in computer vision tasks (ResNet, EfficientNet). One proposed explanation is that Adam's per-parameter learning rates can lead to sharp minima that don't generalize as well. As a starting point, use Adam for NLP/transformers and SGD+momentum for vision, and always compare candidates on a held-out validation set.
Mistake #2: "Larger learning rate = faster convergence."
Reality: If α is too large, the optimizer overshoots the minimum and diverges, and the loss goes to infinity or NaN. Rule of thumb: if your loss increases or oscillates wildly after a few hundred steps, your LR is too high. Halve it and try again. Start with Adam's default α = 0.001.
Mistake #3: "I need to tune β₁ and β₂ in Adam."
Reality: In most cases, the defaults β₁ = 0.9 and β₂ = 0.999 work well. Kingma and Ba recommend these values as good defaults across the tasks they tested. The first hyperparameter to tune is the learning rate α. If you must tune β₂, try 0.99 for very noisy problems.
Mistake #4: "Batch size 1 is Stochastic GD, batch size = full dataset is Batch GD, everything else is mini-batch."
Reality: The definitions are correct; the mistake is using the terms interchangeably. In practice, when people say "SGD" in deep learning, they almost always mean mini-batch SGD. PyTorch's optim.SGD is actually mini-batch SGD: the batch size comes from the DataLoader, not the optimizer.
Mistake #5: "Learning rate warm-up is just a nice-to-have."
Reality: For large batch sizes (≥ 1024), warm-up is essential. Without it, large-batch training frequently diverges because the initial random gradients are unreliable, and scaling them by a large LR amplifies noise catastrophically. Every major model (BERT, GPT, ViT) uses warm-up.
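Mistake #2 is easy to verify numerically. The sketch below runs plain gradient descent on f(w) = w², an assumed toy problem where each step multiplies w by (1 − 2α), so the iterates shrink only when 0 < α < 1. The function name is hypothetical.

```python
def gradient_descent(lr, w=5.0, steps=20):
    """Plain gradient descent on f(w) = w**2; returns |w| after each step."""
    trajectory = []
    for _ in range(steps):
        w = w - lr * 2.0 * w          # gradient of w**2 is 2w
        trajectory.append(abs(w))
    return trajectory

for lr in (0.1, 0.9, 1.1):
    final = gradient_descent(lr)[-1]
    print(f"lr={lr:<4} |w| after 20 steps = {final:.4g}")
```

At α = 0.1 the iterate decays smoothly, at α = 0.9 it flips sign every step but still shrinks, and at α = 1.1 every step overshoots by more than it corrects, so |w| grows without bound.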
Section 10

Comprehensive Optimizer Comparison

Feature | SGD | Momentum | AdaGrad | RMSprop | Adam
Update Rule | θ -= α·g | v = βv + (1−β)g; θ -= α·v | c += g²; θ -= α·g/√c | s = βs + (1−β)g²; θ -= α·g/√s | m, v moments; bias correct; θ -= α·m̂/√v̂
Adaptive LR | ❌ No | ❌ No | ✅ Yes | ✅ Yes | ✅ Yes
Momentum | ❌ No | ✅ Yes | ❌ No | ❌ No | ✅ Yes
Bias Correction | N/A | ❌ No | N/A | ❌ No | ✅ Yes
Memory (per param) | 0 extra | 1 buffer | 1 buffer | 1 buffer | 2 buffers
Key Hyperparams | α | α, β | α | α, β | α, β₁, β₂
Default LR | 0.01 | 0.01 | 0.01 | 0.001 | 0.001
Convergence | Slow but steady | Fast, smooth | Slows over time | Fast | Fastest
Generalization | ⭐⭐⭐ Best | ⭐⭐⭐ Very good | ⭐⭐ Okay | ⭐⭐ Good | ⭐⭐ Good
Best For | Vision (final) | Vision, general | Sparse data, NLP | RNNs | Default choice
Weakness | Slow, oscillates | Can overshoot | LR → 0 over time | No bias correct | May generalize worse
Year | 1951 | 1964 | 2011 | 2012 | 2014
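The first four update rules in the table translate almost line for line into code (Adam is traced step by step in Section 7). The sketch below applies them to an assumed ill-conditioned bowl f(w) = ½(w₀² + 25·w₁²); the function names, the test surface, and the learning rates are all illustrative choices, not tuned recommendations.

```python
import math

def grad(w):
    """Gradient of f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return [w[0], 25.0 * w[1]]

def loss(w):
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def run(rule, lr, steps=100, beta=0.9, eps=1e-8):
    """Apply one of the table's update rules for a fixed number of steps."""
    w = [5.0, 5.0]
    state = [0.0, 0.0]       # one buffer per parameter: v, c, or s in the table
    for _ in range(steps):
        g = grad(w)
        for i in range(2):
            if rule == "sgd":
                step = g[i]
            elif rule == "momentum":
                state[i] = beta * state[i] + (1 - beta) * g[i]
                step = state[i]
            elif rule == "adagrad":
                state[i] += g[i] ** 2
                step = g[i] / (math.sqrt(state[i]) + eps)
            elif rule == "rmsprop":
                state[i] = beta * state[i] + (1 - beta) * g[i] ** 2
                step = g[i] / (math.sqrt(state[i]) + eps)
            else:
                raise ValueError(f"unknown rule: {rule}")
            w[i] -= lr * step
    return loss(w)

# Learning rates below are assumptions chosen for this toy surface.
learning_rates = {"sgd": 0.01, "momentum": 0.01, "adagrad": 0.5, "rmsprop": 0.05}
results = {rule: run(rule, lr) for rule, lr in learning_rates.items()}
for rule, final_loss in results.items():
    print(f"{rule:<9} final loss = {final_loss:.4f}")
```

The small `eps` in the denominators is a standard numerical safeguard that the table omits for brevity. Note how SGD needs no state at all, while the other three each keep exactly one buffer per parameter, matching the "Memory" row.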
The Industry Recipe (2024):
🔹 Computer Vision (ResNet, EfficientNet): SGD + Momentum (β=0.9), cosine annealing, weight decay 1e-4
🔹 NLP / Transformers (BERT, GPT): AdamW (β₁=0.9, β₂=0.999), linear warm-up + cosine decay, weight decay 0.01
🔹 Prototyping / Quick experiments: Adam with default hyperparameters
🔹 Sparse data (recommender systems): AdaGrad or SparseAdam
Section 11

Exercises

Section A: Multiple Choice Questions (10)

Q1

In mini-batch gradient descent with batch size B=64 and a dataset of m=6400 samples, how many parameter updates occur per epoch?

  A) 1
  B) 64
  C) 100
  D) 6400
✅ C) 100: Number of updates per epoch = m/B = 6400/64 = 100. Each mini-batch triggers one update.
Apply | Beginner
Q2

In the exponentially weighted average formula Vₜ = β·Vₜ₋₁ + (1−β)·θₜ, if β = 0.98, the average approximately considers the last _____ values.

  A) 2
  B) 10
  C) 50
  D) 98
✅ C) 50: The window ≈ 1/(1−β) = 1/(1−0.98) = 1/0.02 = 50.
Understand | Beginner
Q3

Which problem does bias correction in Adam specifically solve?

  A) Gradient vanishing in deep networks
  B) Inaccurate moment estimates during early training steps
  C) Overfitting on small datasets
  D) Learning rate being too high
✅ B) Inaccurate moment estimates during early training steps: Since m₀=0 and v₀=0, the estimates are biased toward zero in the first few iterations. Dividing by (1−βᵗ) corrects this.
Understand | Intermediate
Q4

Why are batch sizes typically chosen as powers of 2 (32, 64, 128, 256)?

  A) It makes the math simpler for computing averages
  B) GPU memory banks are organized in powers of 2, optimizing memory access patterns
  C) Python's NumPy library requires it
  D) It ensures the dataset divides evenly
✅ B) GPU memory banks are organized in powers of 2: NVIDIA GPUs schedule threads in warps of 32, so batch sizes that are powers of 2 tend to align with the hardware, which helps memory throughput and compute utilization.
Remember | Beginner
Q5

What is the key limitation of AdaGrad that RMSprop fixes?

  A) AdaGrad doesn't use gradient information
  B) AdaGrad's accumulated squared gradients grow monotonically, causing the effective learning rate to shrink to zero
  C) AdaGrad requires too much memory
  D) AdaGrad only works with convex loss functions
✅ B) AdaGrad's accumulated squared gradients grow monotonically: AdaGrad accumulates ALL past squared gradients without decay. Over many iterations, this sum grows large, making the denominator huge and the effective learning rate vanishingly small. RMSprop uses exponential decay (β·S + (1−β)·g²) to prevent this.
Analyze | Intermediate
Q6

Adam can be viewed as a combination of which two optimization techniques?

  A) Batch GD + Stochastic GD
  B) Momentum + RMSprop
  C) AdaGrad + Newton's Method
  D) L1 Regularization + L2 Regularization
✅ B) Momentum + RMSprop: Adam maintains the first moment (mean of gradients = Momentum) and the second moment (mean of squared gradients = RMSprop), applying bias correction to both.
Remember | Beginner
Q7

During training with a large batch size of 4096, the model diverges in the first 100 steps. What is the most likely fix?

  A) Switch from Adam to SGD
  B) Add learning rate warm-up for the first few hundred steps
  C) Increase the batch size to 8192
  D) Remove all regularization
✅ B) Add learning rate warm-up: Large batches produce unreliable gradients in early training (random weights). Starting with a large LR amplifies this noise. Warm-up linearly increases LR from near-zero to the target, allowing the model to stabilize before applying full learning rate.
Apply | Intermediate
Q8

How much extra memory per parameter does Adam require compared to vanilla SGD?

  A) None, same memory
  B) 1× extra (one buffer for velocity)
  C) 2× extra (first moment m + second moment v)
  D) 3× extra (m + v + bias correction terms)
✅ C) 2× extra: Adam stores two state buffers per parameter: m (first moment, same shape as parameter) and v (second moment, same shape). Bias correction terms are scalars, not per-parameter. So a model with 100M parameters needs ~800MB extra GPU memory (2 × 100M × 4 bytes/float32).
Understand | Intermediate
Q9

In cosine annealing, the learning rate follows which pattern over training?

  A) Linearly decreases from max to min
  B) Drops sharply at fixed intervals
  C) Follows a half-cosine curve from max to min, with a smooth gradual decrease
  D) Increases exponentially
✅ C) Follows a half-cosine curve from max to min: αₜ = α_min + ½(α_max − α_min)(1 + cos(πt/T)). It starts at α_max, decreases slowly at first, falls fastest around the midpoint, and then flattens out as it approaches α_min. This matches the intuition that you need large steps early and fine-tuning later.
Understand | Intermediate
Q10

Research by Wilson et al. (2017) showed that for image classification tasks, which optimizer often achieves better test accuracy despite slower training convergence?

  A) Adam
  B) RMSprop
  C) SGD with Momentum
  D) AdaGrad
✅ C) SGD with Momentum: While Adam converges faster in training loss, SGD with momentum often finds flatter minima that generalize better to unseen data. This is particularly true for CNNs like ResNet and VGG on ImageNet-scale tasks.
Evaluate | Advanced

Section B: Short Answer Questions (5)

B1 Beginner

Explain the difference between Batch Gradient Descent and Mini-batch Gradient Descent in exactly 3 sentences. Include the impact on training speed and gradient noise.

Model Answer: Batch GD computes gradients using the entire training set before each parameter update, resulting in exact gradients but only one update per pass over the data, which makes it extremely slow for large datasets. Mini-batch GD splits the data into small batches of size B (typically 32–512) and updates parameters after each batch, giving m/B updates per epoch and enabling GPU parallelism. The trade-off is that mini-batch gradients are noisy (estimated from B samples, not all m), but this noise actually helps escape local minima and acts as implicit regularization.
B2 Intermediate

Why does Momentum help with oscillating gradients? Use the rolling ball analogy to explain.

Model Answer: Consider a ball rolling down a narrow valley. In the direction along the valley (toward the minimum), the gradient consistently points the same way, so momentum accumulates velocity in this direction and the ball rolls faster. In the perpendicular direction (across the valley), the gradient alternates sign: pushing left, then right, then left. When these alternating gradients are averaged by momentum (V = βV + (1−β)g), the positive and negative values cancel out, effectively dampening the oscillation. The net effect: faster progress toward the minimum, less wasteful zig-zagging across the valley.
B3 Intermediate

Write the complete Adam update equations for a single parameter w, clearly labeling each step. What is the purpose of each step?

Model Answer: (1) Compute gradient: g = ∂J/∂w. (2) Update first moment: m = β₁·m + (1−β₁)·g [purpose: smooth gradient direction, like momentum]. (3) Update second moment: v = β₂·v + (1−β₂)·g² [purpose: track gradient magnitude, like RMSprop]. (4) Bias correct: m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ) [purpose: counteract initialization bias toward zero]. (5) Update parameter: w = w − α·m̂/(√v̂ + ε) [purpose: apply scaled, direction-smoothed update].
B4 Intermediate

Explain why learning rate warm-up is critical for large-batch training. What happens without it?

Model Answer: At the start of training, model weights are randomly initialized, producing unreliable gradient estimates: the gradients reflect random weight interactions, not meaningful data patterns. With a large batch (e.g., 4096 samples), these noisy gradients are averaged over many samples but still point in somewhat arbitrary directions. If the learning rate is immediately set to its target value (say 0.01), these noisy-but-confident gradients cause large, misguided parameter updates, often causing the loss to explode (diverge to NaN). Warm-up linearly increases the LR from ~0 to the target over the first few hundred steps, giving the model time to move to a region where gradients become meaningful before applying full-strength updates.
B5 Advanced

In 3–4 sentences, explain how AdamW differs from Adam and why this matters for regularized models.

Model Answer: In vanilla Adam, L2 regularization is applied by adding the penalty gradient (λ·w) to the data gradient before the adaptive scaling: the gradient becomes (g + λ·w), which then gets divided by √v̂. This means the regularization strength is also scaled by the adaptive learning rate, so parameters with large gradients have their weight decay reduced. AdamW (Loshchilov & Hutter, 2019) "decouples" weight decay: it subtracts a decay term proportional to λ·w directly from the parameter, separately from the Adam update, bypassing the adaptive scaling entirely. This gives consistent regularization across all parameters and typically improves generalization by 0.5–1% on ImageNet-scale tasks.
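The difference described in B5 shows up already in the first update. The sketch below is a simplified single-weight, single-step illustration (t = 1, zero-initialised moments), assuming α = 0.1 and λ = 0.01; the function name and its `decoupled` flag are hypothetical, and the decay term follows the common α·λ·w form.

```python
import math

def first_adam_step(w, g, lr=0.1, weight_decay=0.01, decoupled=False,
                    beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step at t = 1 for a single weight, with L2 or decoupled decay."""
    if not decoupled:
        g = g + weight_decay * w                  # Adam + L2: penalty joins the gradient
    m_hat = ((1 - beta1) * g) / (1 - beta1)       # bias-corrected first moment
    v_hat = ((1 - beta2) * g * g) / (1 - beta2)   # bias-corrected second moment
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * weight_decay * w            # AdamW: decay outside the scaling
    return w_new

for g in (0.1, 10.0):
    no_decay = first_adam_step(1.0, g, weight_decay=0.0)
    adam_l2 = first_adam_step(1.0, g)
    adamw = first_adam_step(1.0, g, decoupled=True)
    print(f"g={g:>4}: no decay={no_decay:.6f}  Adam+L2={adam_l2:.6f}  AdamW={adamw:.6f}")
```

In this first step the L2 penalty is normalised away by the division by √v̂, so Adam+L2 lands essentially where Adam without any decay does, whereas AdamW shrinks the weight by α·λ·w regardless of the gradient's size.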

Section C: Long Answer Questions (3)

C1 Intermediate

(15 marks) Trace through 3 complete iterations of the Momentum optimizer on the function f(w) = (w − 3)² starting from w₀ = 10. Use α = 0.1, β = 0.9. Show all intermediate computations for V and w at each step. Then explain: (a) Is V₁ larger or smaller than the raw gradient at step 1? Why? (b) What would happen if β = 0.99?

C2 Advanced

(20 marks) Compare Adam and SGD with Momentum across five dimensions: (1) convergence speed on training loss, (2) final test accuracy, (3) sensitivity to learning rate, (4) memory overhead, (5) suitability for different tasks (vision vs. NLP). For each dimension, provide a concrete example or numerical evidence from published research. Conclude with a recommendation for when to use each.

C3 Advanced

(15 marks) A team at an Indian fintech company is training a fraud detection model on 10 million transactions. They're using Adam with α=0.01, batch size 32, no LR schedule, and no warm-up. Training loss oscillates wildly and never converges. Diagnose at least 3 possible issues with their setup and propose specific fixes with justification for each.

Section D: Programming Exercises (2)

D1 Intermediate

Implement Adam from Scratch with Logging

Implement the Adam optimizer class (with full bias correction) and use it to minimize the Rosenbrock function: f(x,y) = (1−x)² + 100(y−x²)². Start from (x₀, y₀) = (−1, 1). Log the values of x, y, f(x,y), m̂_x, m̂_y, v̂_x, v̂_y at each iteration. Run for 10,000 iterations with α=0.001. Plot the optimization trajectory on a contour plot of the Rosenbrock function.

D2 Advanced

Batch Size Experiment

Using PyTorch, train a 3-layer neural network (784→256→128→10) on MNIST with batch sizes [16, 32, 64, 128, 256, 512, 1024, 4096]. For each batch size: (a) Record training loss per epoch, (b) Record test accuracy after 20 epochs, (c) Record wall-clock time per epoch. Use Adam with α=0.001 for all experiments. Create three plots: (1) loss curves overlaid, (2) test accuracy vs. batch size, (3) time per epoch vs. batch size. Write a 200-word analysis of the trade-offs you observe.

Section E: Mini-Project

E1 Advanced

Build an "Optimizer Playground" Web Dashboard

Create a Python script using Streamlit or Gradio that lets users:

  • Select an optimizer (SGD, Momentum, RMSprop, Adam)
  • Set hyperparameters (α, β, β₁, β₂) via sliders
  • Choose a 2D loss surface (quadratic, Rosenbrock, Rastrigin, Beale)
  • Visualize the optimizer's trajectory on a contour plot in real-time
  • See side-by-side loss curves comparing all optimizers on the same surface
  • Display a table of parameter values at each iteration

Deliverables: Working Streamlit app, README with screenshots, analysis report (500 words) comparing optimizer behavior on different surfaces.

Section 12

Chapter Summary

🧠 Key Takeaways from Chapter 8

  1. Three GD variants: Batch GD (all samples, 1 update/epoch), SGD (1 sample, m updates/epoch), Mini-batch GD (B samples, m/B updates/epoch). Mini-batch is the standard: it combines GPU parallelism with sufficient gradient noise.
  2. Batch size: Use powers of 2 (32–512 typical). Small batches → better generalization, more noise. Large batches → better GPU utilization, need LR warm-up. Scale LR with batch size (linear or square-root rule).
  3. Exponentially Weighted Averages: Vₜ = β·Vₜ₋₁ + (1−β)·θₜ averages over ~1/(1−β) values. Bias correction V̂ₜ = Vₜ/(1−βᵗ) is critical for accurate early estimates.
  4. Momentum (β=0.9): Maintains velocity V that accumulates past gradients. Dampens oscillations, accelerates consistent-direction gradients. The "rolling ball" effect.
  5. RMSprop: Adapts learning rate per-parameter using running average of squared gradients. Parameters with large gradients get smaller effective LR. Fixes AdaGrad's monotonically decreasing LR problem.
  6. Adam (α=0.001, β₁=0.9, β₂=0.999): Combines Momentum + RMSprop + bias correction. The default optimizer for most deep learning. Stores 2 extra buffers (m and v) per parameter.
  7. AdamW > Adam when using weight decay. Decouples weight decay from adaptive gradient scaling.
  8. Learning rate schedules: Step decay (simple, effective), exponential (smooth), cosine annealing (modern standard), warm-up + cosine (essential for large-batch and transformer training).
  9. Optimizer selection: Adam/AdamW for NLP and default prototyping. SGD + Momentum for computer vision when best test accuracy matters. AdaGrad for sparse data.
  10. The Flipkart lesson: Production training at scale requires distributed mini-batch, AdamW, warm-up, gradient clipping, and proper LR scaling. Optimization is engineering, not just math.
The Optimization Hierarchy (memorize this):

SGD → +Momentum → +Adaptive LR (RMSprop) → +Both+Bias Correction = Adam

Each step adds a mechanism: direction smoothing, scale adaptation, initialization correction.
Section 13

References & Further Reading

Core Papers

  1. Kingma, D.P. & Ba, J. (2014). "Adam: A Method for Stochastic Optimization." ICLR 2015. arXiv:1412.6980. The original Adam paper (180,000+ citations).
  2. Loshchilov, I. & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. arXiv:1711.05101. The AdamW paper fixing Adam's weight decay bug.
  3. Duchi, J., Hazan, E. & Singer, Y. (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." JMLR, 12, 2121–2159. The AdaGrad paper.
  4. Polyak, B.T. (1964). "Some methods of speeding up the convergence of iterative methods." USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17. The original momentum paper.
  5. Hinton, G. (2012). "Lecture 6e: RMSprop." Coursera Neural Networks for Machine Learning. Unpublished; introduced in a lecture.

Important Related Work

  1. Keskar, N.S. et al. (2017). "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima." ICLR 2017. Why large batches hurt generalization.
  2. Goyal, P. et al. (2017). "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." arXiv:1706.02677. Linear LR scaling rule for distributed training.
  3. Smith, L.N. & Topin, N. (2018). "Super-Convergence: Very Fast Training Using Large Learning Rates." arXiv:1708.07120. The 1-cycle policy.
  4. Loshchilov, I. & Hutter, F. (2017). "SGDR: Stochastic Gradient Descent with Warm Restarts." ICLR 2017. Cosine annealing with restarts.
  5. Wilson, A.C. et al. (2017). "The Marginal Value of Adaptive Gradient Methods in Machine Learning." NeurIPS 2017. SGD can outperform Adam on generalization.

Textbooks

  1. Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. Chapter 8: Optimization for Training Deep Models.
  2. Ruder, S. (2016). "An overview of gradient descent optimization algorithms." arXiv:1609.04747. Excellent survey comparing all optimizers.

Indian Industry References

  1. Flipkart Tech Blog. "Scaling Deep Learning for Search Ranking." tech.flipkart.com
  2. Infosys Research. "AI-Powered Fraud Detection for Indian Banking." Infosys Knowledge Institute, 2023.
  3. TCS Research. "Multilingual NLP for Indian Languages: Training Strategies." ACL 2023 Workshop on Language Technology for Equality.
What's Next? Chapter 9 covers Regularization, the art of preventing overfitting. You'll learn dropout, L1/L2 penalties, batch normalization, data augmentation, and early stopping. These techniques work hand-in-hand with the optimizers you just learned: a well-regularized model with Adam and cosine annealing is the backbone of modern deep learning.