Backpropagation &
Training Deep Networks
Master the algorithm that powers all of deep learning โ from computing gradients to tuning optimizers for billion-parameter models.
12.1 โ Learning Objectives
After completing this chapter, you will be able to:
Explain why backpropagation is necessary and how it computes gradients efficiently using the chain rule
Draw computation graphs for neural network forward passes and trace gradients backward through them
Derive the complete backpropagation equations for a 2-layer and L-layer network from first principles
Implement Batch GD, SGD, Mini-batch GD, Momentum, RMSprop, and Adam optimizer from scratch
Apply learning rate schedules: step decay, exponential decay, cosine annealing, and warmup
Implement and explain Batch Normalization (forward and backward pass) and Dropout regularization
Diagnose and fix vanishing/exploding gradient problems using proper initialization and gradient clipping
Verify backpropagation correctness using numerical gradient checking
Use TensorFlow GradientTape for custom training loops and compare optimizer performance
Relate training techniques to real-world systems like GPT-3 training and Tesla FSD
12.2 โ Introduction
In Chapter 11, we built a neural network and computed its forward pass โ given inputs, we produced predictions. But here's the crucial question: how does the network learn? How does it adjust millions (or billions) of parameters to reduce its errors?
The answer is backpropagation โ short for "backward propagation of errors." It is the single most important algorithm in deep learning. Without it, training neural networks of any meaningful depth would be computationally impossible.
The Core Problem
Consider a network with 1 million weights. To minimize the loss function, we need the gradient โL/โw for every single weight. Computing each gradient numerically (perturbing each weight and measuring the loss change) would require 2 million forward passes per training step. For a dataset with 1 million examples and 1000 training steps, that's 2 ร 10ยนโต forward passes โ utterly impossible.
Backpropagation computes all gradients in a single backward pass, making it only about 2-3ร the cost of a single forward pass. This is the miracle that makes deep learning practical.
This chapter is the engine room of deep learning. We will:
- Understand computation graphs and the chain rule
- Derive every backpropagation equation step by step
- Master gradient descent variants and modern optimizers
- Learn essential training techniques: Batch Normalization, Dropout
- Diagnose gradient pathologies and verify our implementations
12.3 โ Historical Background
Backpropagation has a surprisingly long and winding history, with multiple independent discoveries across decades:
| Year | Milestone | Contributor |
|---|---|---|
| 1960 | Automatic differentiation concepts for control theory | Henry J. Kelley |
| 1970 | Reverse-mode automatic differentiation (the mathematical foundation of backprop) | Seppo Linnainmaa |
| 1974 | Application of backpropagation to neural networks in PhD thesis | Paul Werbos |
| 1986 | Landmark paper demonstrating backprop enables learning internal representations โ popularized the algorithm | Rumelhart, Hinton & Williams |
| 1994 | Identified vanishing gradient problem in deep networks and RNNs | Yoshua Bengio et al. |
| 2012 | AlexNet trained with backprop + SGD wins ImageNet โ deep learning revolution begins | Krizhevsky, Sutskever, Hinton |
| 2014 | Adam optimizer published โ becomes the default optimizer | Kingma & Ba |
| 2015 | Batch Normalization enables training much deeper networks | Ioffe & Szegedy |
| 2017 | Warmup + cosine annealing schedules for Transformer training | Vaswani et al. / Loshchilov & Hutter |
| 2020 | GPT-3 trained with Adam optimizer on 300B tokens โ 175B parameters | OpenAI |
12.4 โ Conceptual Explanation
12.4.1 โ Why Do We Need Gradients?
Training a neural network means finding the weights W and biases b that minimize a loss function L. The loss function is a landscape in a very high-dimensional space (one dimension per parameter). We need to find the lowest valley in this landscape.
Gradient descent is our strategy: at any point on the loss surface, the gradient โL points in the direction of steepest ascent. By taking steps in the opposite direction, we move downhill toward lower loss.
ฮฑ = learning rate (step size), โL/โw = gradient of loss with respect to weight w
12.4.2 โ Computation Graphs
A computation graph represents a neural network's forward pass as a directed acyclic graph (DAG). Each node represents an operation (matrix multiply, addition, activation function), and edges represent the flow of tensors.
Input X W[1] b[1] W[2] b[2]
โ โ โ โ โ
โผ โผ โผ โผ โผ
โโโโโโโโโโโโโโโโ โโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโ
โ Z[1] = W[1]ยทX + b[1] โ โ Z[2] = W[2]ยทA[1] + b[2] โ
โโโโโโโโโโโโโโโโ โโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโ
โ โ
โผ โผ
โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ
โ A[1] = ฯ(Z[1]) โ A[2] = ฯ(Z[2])
โ (ReLU / Sigmoid) โ (Sigmoid)
โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ
โ โ
โโโโโโโโโโโโโโโโ โผ
โ โโโโโโโโโโโโโ
โโโโโโโโโบโ Loss = L(A[2], Y) โ
โโโโโโโโโโโโโ
FORWARD: X โโโบ Z[1] โโโบ A[1] โโโบ Z[2] โโโบ A[2] โโโบ L
BACKWARD: X โโโ Z[1] โโโ A[1] โโโ Z[2] โโโ A[2] โโโ L
(Gradients flow backward via chain rule)
12.4.3 โ The Chain Rule โ Heart of Backpropagation
The chain rule from calculus lets us compute the derivative of a composite function. If y = f(g(x)), then:
For neural networks, we have deeply nested compositions. The loss is a function of the output, which is a function of the pre-activation, which is a function of the weights and previous activations. The chain rule lets us compute the gradient with respect to any parameter by multiplying local gradients along the path from the loss to that parameter.
Each factor is a "local gradient" โ easy to compute at each node
12.4.4 โ The Backpropagation Algorithm
Backpropagation works in two phases:
- Forward Pass: Compute all intermediate values (Z[l], A[l]) from input to output, and compute the loss L.
- Backward Pass: Starting from โL/โA[L] at the output, propagate gradients backward through each layer using the chain rule. At each layer, compute โL/โW[l] and โL/โb[l] for parameter updates, and โL/โA[l-1] to continue backward.
12.4.5 โ Gradient Descent Variants
Once we have gradients, there are different strategies for updating weights:
| Variant | Batch Size | Update Frequency | Pros | Cons |
|---|---|---|---|---|
| Batch GD | Entire dataset | Once per epoch | Stable, true gradient | Slow, memory-heavy |
| SGD | 1 sample | Every sample | Fast updates, can escape local minima | Very noisy, unstable |
| Mini-batch GD | 32-512 | Every batch | Best of both, GPU-efficient | Needs batch size tuning |
12.4.6 โ Modern Optimizers
Plain SGD can be slow and oscillate. Modern optimizers address this with adaptive learning rates:
- Momentum: Accumulates past gradients to build "velocity" โ helps push through narrow valleys
- RMSprop: Adapts learning rate per-parameter based on recent gradient magnitudes โ prevents divergence in steep dimensions
- Adam: Combines Momentum + RMSprop with bias correction โ the most widely used optimizer in deep learning
12.4.7 โ Training Techniques
Batch Normalization
Normalizes each layer's inputs to have zero mean and unit variance, then applies learnable scale (ฮณ) and shift (ฮฒ). Benefits: faster training, higher learning rates, mild regularization.
Dropout
Randomly sets a fraction of neurons to zero during training. Forces the network to learn redundant representations โ acts as a powerful regularizer. At inference, all neurons are active but outputs are scaled.
Gradient Problems
Vanishing gradients: Gradients shrink exponentially through layers (sigmoid's max derivative is 0.25 โ after 10 layers: 0.25ยนโฐ โ 10โปโถ). Solution: ReLU, skip connections, better initialization.
Exploding gradients: Gradients grow exponentially. Solution: Gradient clipping, proper initialization.
12.5 โ Mathematical Foundation
12.5.1 โ Notation and Setup
Consider an L-layer neural network with the following notation:
12.5.2 โ Loss Functions
Binary Cross-Entropy Loss (Single Example)
Cost Over m Examples
12.5.3 โ Key Derivatives We Will Need
| Function | Derivative | Notes |
|---|---|---|
| Sigmoid: ฯ(z) = 1/(1+eโปแถป) | ฯ'(z) = ฯ(z)(1 โ ฯ(z)) | Max value 0.25 at z=0 |
| ReLU: g(z) = max(0, z) | g'(z) = 1 if z > 0, else 0 | No saturation for z > 0 |
| Tanh: g(z) = tanh(z) | g'(z) = 1 โ tanhยฒ(z) | Max value 1 at z=0 |
| Linear: Z = WยทA + b | โZ/โW = Aแต, โZ/โA = Wแต, โZ/โb = I | Matrix calculus |
| BCE Loss: โ = โ[y log ลท + (1โy)log(1โลท)] | โโ/โลท = โy/ลท + (1โy)/(1โลท) | Simplifies beautifully with sigmoid |
12.5.4 โ Matrix Calculus Conventions
In neural network backpropagation, we adopt the convention that dZ[l] has the same shape as Z[l], and similarly for all other gradient tensors. This makes implementation straightforward:
Notation: dX means โL/โX throughout this chapter
12.6 โ Formula Derivations
12.6.1 โ Full Backpropagation for a 2-Layer Network
Let's derive every gradient for a 2-layer network (input โ hidden โ output). Architecture: n[0] inputs, n[1] hidden neurons (ReLU), n[2] = 1 output neuron (sigmoid), binary cross-entropy loss.
Layer 1 pre-activation:
Shape: (n[1] ร m) = (n[1] ร n[0]) ยท (n[0] ร m) + (n[1] ร 1)
Layer 1 activation:
Shape: (n[1] ร m)
Layer 2 pre-activation:
Shape: (1 ร m) = (1 ร n[1]) ยท (n[1] ร m) + (1 ร 1)
Layer 2 activation (output):
Shape: (1 ร m)
Loss function:
Compute dA[2] = โL/โA[2]:
From the loss โ = โ[yยทlog(a) + (1โy)ยทlog(1โa)]:
Shape: (1 ร m). This is the starting point of backpropagation.
Compute dZ[2] = โL/โZ[2]:
By chain rule: dZ[2] = dA[2] โ ฯ'(Z[2]), where ฯ'(z) = ฯ(z)(1โฯ(z)):
Expanding: = [โY/A[2] + (1โY)/(1โA[2])] โ A[2] โ (1โA[2])
Simplifying: = โYยท(1โA[2]) + (1โY)ยทA[2] = A[2]โYยทA[2]โY+YยทA[2] + A[2]โYยทA[2]
โจ This beautiful simplification is why sigmoid + cross-entropy is a natural pair! Shape: (1 ร m).
Compute dW[2] = โL/โW[2]:
Since Z[2] = W[2]ยทA[1] + b[2], by the chain rule:
โL/โW[2] = โL/โZ[2] ยท โZ[2]/โW[2] = dZ[2] ยท A[1]แต (averaged over m examples):
Shape: (1 ร n[1]) = (1 ร m) ยท (m ร n[1]). โ Same shape as W[2].
Compute db[2] = โL/โb[2]:
Since โZ[2]/โb[2] = 1, we simply sum dZ[2] across examples:
Shape: (1 ร 1). โ Same shape as b[2].
Compute dA[1] = โL/โA[1]:
Since Z[2] = W[2]ยทA[1] + b[2], we need โZ[2]/โA[1] = W[2]แต:
Shape: (n[1] ร m) = (n[1] ร 1) ยท (1 ร m). โ Same shape as A[1].
This is where gradients "propagate backward" through the network!
Compute dZ[1] = โL/โZ[1]:
By chain rule: dZ[1] = dA[1] โ g'[1](Z[1]). For ReLU, g'(z) = 1 if z > 0, else 0:
Shape: (n[1] ร m). The indicator function (Z[1] > 0) creates a binary mask.
Compute dW[1] = โL/โW[1]:
Same pattern as dW[2], but using dZ[1] and A[0] = X:
Shape: (n[1] ร n[0]) = (n[1] ร m) ยท (m ร n[0]). โ Same shape as W[1].
Compute db[1] = โL/โb[1]:
Shape: (n[1] ร 1). โ Same shape as b[1].
12.6.2 โ General Backpropagation for L-Layer Network
๐ฎ The General Backpropagation Formula Sheet
For any layer l in an L-layer network, the backward pass follows this pattern:
This set of 5 equations is all you need to implement backpropagation for any feedforward network, regardless of depth. Each layer reuses the same formulas with different weight matrices and activation functions.
12.6.3 โ Deriving Optimizers from First Principles
Momentum
Problem with vanilla SGD: In loss surfaces with narrow valleys, SGD oscillates perpendicular to the optimal direction, making slow progress along the valley. We want to dampen oscillations and amplify consistent directions.
Inspiration: A ball rolling downhill accumulates velocity. Past gradients in the same direction make it go faster; contradictory gradients slow it down.
Define a velocity vector v that is an exponential moving average (EMA) of past gradients:
ฮฒ โ [0, 1) controls how much history we remember. Typical: ฮฒ = 0.9 (averages ~10 steps).
Expand the EMA to see what v_t represents:
Recent gradients have exponentially more weight than older ones.
Update parameters using the velocity:
If gradients consistently point the same direction โ v grows โ faster progress.
If gradients oscillate โ v stays small โ damped oscillations.
RMSprop (Root Mean Square Propagation)
Problem: Some parameters need large updates (flat loss surface) while others need small updates (steep surface). We need per-parameter adaptive learning rates.
Track the EMA of squared gradients (measures gradient magnitude per parameter):
Here (โJ)ยฒ is element-wise squaring. ฮฒโ typically = 0.999.
โs_t estimates the RMS of recent gradients. Divide the gradient by this to normalize:
ฮต โ 10โปโธ prevents division by zero.
Effect: Parameters with large gradients โ large s โ smaller effective learning rate.
Parameters with small gradients โ small s โ larger effective learning rate.
This automatically adapts the learning rate per-parameter!
Adam (Adaptive Moment Estimation)
Adam combines the best of Momentum (first moment) and RMSprop (second moment), with bias correction to handle the initialization bias.
First moment (mean): EMA of gradients (like Momentum):
Initialize: mโ = 0. Typical ฮฒโ = 0.9.
Second moment (variance): EMA of squared gradients (like RMSprop):
Initialize: vโ = 0. Typical ฮฒโ = 0.999.
Bias correction: Since mโ = vโ = 0, the early estimates are biased toward zero. Correct this:
At t=1: mฬโ = mโ/(1โ0.9) = 10ยทmโ โ compensates for the zero initialization.
As t โ โ: (1 โ ฮฒแต) โ 1 โ correction vanishes, as the EMA has warmed up.
Parameter update:
This gives us adaptive learning rates (from vฬ) with momentum (from mฬ).
12.6.4 โ Batch Normalization Derivation
Compute batch mean:
Compute batch variance:
Normalize:
Scale and shift (learnable parameters ฮณ and ฮฒ):
ฮณ and ฮฒ are learned during training. They allow the network to undo the normalization if needed (when ฮณ = ฯ_B and ฮฒ = ฮผ_B, it recovers the original activations).
12.6.5 โ Learning Rate Schedules
12.7 โ Worked Numerical Examples
Example 1: Full Backpropagation for a 2-Layer Network
Architecture: 4 inputs โ 3 hidden (ReLU) โ 1 output (sigmoid). Single training example.
Given Values
Input: X = [1.0, 0.5, โ1.0, 2.0]แต (4ร1)
True label: Y = 1
Weights (initialized):
W[1] = [[ 0.2, -0.1, 0.4, 0.1], # 3ร4
[ 0.3, 0.2, -0.3, 0.2],
[-0.1, 0.5, 0.1, -0.2]]
b[1] = [[ 0.1], # 3ร1
[-0.1],
[ 0.0]]
W[2] = [[ 0.5, -0.3, 0.8]] # 1ร3
b[2] = [[ 0.1]] # 1ร1
Forward Pass
Step F1: Z[1] = W[1]ยทX + b[1]
Z[1] = [[0.2ร1.0 + (โ0.1)ร0.5 + 0.4ร(โ1.0) + 0.1ร2.0], = [[ 0.2 โ 0.05 โ 0.4 + 0.2 + 0.1],
[0.3ร1.0 + 0.2ร0.5 + (โ0.3)ร(โ1.0) + 0.2ร2.0], = [[ 0.05],
[(โ0.1)ร1.0 + 0.5ร0.5 + 0.1ร(โ1.0) + (โ0.2)ร2.0]] [ 1.1 โ 0.1],
[โ0.1 + 0.25 โ 0.1 โ 0.4 + 0.0]]
Z[1] = [[ 0.05], # 3ร1
[ 1.00],
[-0.35]]
Step F2: A[1] = ReLU(Z[1])
A[1] = [[max(0, 0.05)], = [[0.05],
[max(0, 1.00)], [1.00],
[max(0, -0.35)]] [0.00]] # This neuron is "dead" (ReLU killed it)
Step F3: Z[2] = W[2]ยทA[1] + b[2]
Z[2] = 0.5ร0.05 + (โ0.3)ร1.00 + 0.8ร0.00 + 0.1
= 0.025 โ 0.3 + 0 + 0.1
= โ0.175
Step F4: A[2] = ฯ(Z[2]) = ฯ(โ0.175)
A[2] = 1 / (1 + exp(0.175)) = 1 / (1 + 1.1912) = 1 / 2.1912 = 0.4564
Step F5: Loss
L = โ[1ยทlog(0.4564) + 0ยทlog(0.5436)]
= โlog(0.4564)
= โ(โ0.7847) = 0.7847
Backward Pass
Step B1-B2: dZ[2] = A[2] โ Y
dZ[2] = 0.4564 โ 1 = โ0.5436
Step B3: dW[2] = (1/m)ยทdZ[2]ยทA[1]แต (m=1)
dW[2] = (โ0.5436) ร [0.05, 1.00, 0.00]
= [โ0.0272, โ0.5436, 0.0000] # Shape: 1ร3 โ
Step B4: db[2] = dZ[2]
db[2] = โ0.5436
Step B5: dA[1] = W[2]แต ยท dZ[2]
dA[1] = [[ 0.5], ร (โ0.5436) = [[-0.2718],
[-0.3], [ 0.1631],
[ 0.8]] [-0.4349]]
Step B6: dZ[1] = dA[1] โ (Z[1] > 0)
ReLU mask = [[1], (Z[1]=0.05 > 0 โ)
[1], (Z[1]=1.00 > 0 โ)
[0]] (Z[1]=โ0.35 โค 0 โ)
dZ[1] = [[-0.2718], โ [[1], = [[-0.2718],
[ 0.1631], [1], [ 0.1631],
[-0.4349]] [0]] [ 0.0000]]
Step B7: dW[1] = (1/m)ยทdZ[1]ยทXแต
dW[1] = [[-0.2718], ร [1.0, 0.5, -1.0, 2.0]
[ 0.1631],
[ 0.0000]]
= [[-0.2718, -0.1359, 0.2718, -0.5436], # 3ร4 โ
[ 0.1631, 0.0816, -0.1631, 0.3262],
[ 0.0000, 0.0000, 0.0000, 0.0000]]
Step B8: db[1] = dZ[1]
db[1] = [[-0.2718],
[ 0.1631],
[ 0.0000]]
Weight Update (ฮฑ = 0.1)
W[2]_new = W[2] โ 0.1 ร dW[2]
= [0.5, โ0.3, 0.8] โ 0.1 ร [โ0.0272, โ0.5436, 0.0]
= [0.5027, โ0.2456, 0.8000]
b[2]_new = 0.1 โ 0.1 ร (โ0.5436) = 0.1544
(Similarly for W[1] and b[1])
Example 2: Adam Optimizer Update Step
Given: Gradient โJ = [0.1, โ0.3], step t=2, ฮฑ=0.001, ฮฒโ=0.9, ฮฒโ=0.999, ฮต=10โปโธ
Previous: mโ = [0.02, โ0.06], vโ = [0.0001, 0.0009]
Step 1: Update first moment (mโ)
mโ = 0.9 ร [0.02, โ0.06] + 0.1 ร [0.1, โ0.3]
= [0.018, โ0.054] + [0.01, โ0.03]
= [0.028, โ0.084]
Step 2: Update second moment (vโ)
vโ = 0.999 ร [0.0001, 0.0009] + 0.001 ร [0.01, 0.09]
= [0.0000999, 0.0008991] + [0.00001, 0.00009]
= [0.0001099, 0.0009891]
Step 3: Bias correction
mฬโ = mโ / (1 โ 0.9ยฒ) = [0.028, โ0.084] / 0.19 = [0.1474, โ0.4421]
vฬโ = vโ / (1 โ 0.999ยฒ) = [0.0001099, 0.0009891] / 0.001999
= [0.05498, 0.49480]
Step 4: Update
ฮธโ = ฮธโ โ 0.001 ร mฬโ / (โvฬโ + ฮต)
= ฮธโ โ 0.001 ร [0.1474, โ0.4421] / [0.2345, 0.7034]
= ฮธโ โ 0.001 ร [0.6287, โ0.6286]
= ฮธโ โ [0.000629, โ0.000629]
Notice: Both parameters get updates of similar magnitude (~0.001) despite having
very different gradient magnitudes (0.1 vs โ0.3). This is Adam's adaptive property!
Example 3: Batch Normalization Forward Pass
Given: Mini-batch of 4 pre-activations, ฮณ = 1.5, ฮฒ = 0.3, ฮต = 10โปโต
z = [2.0, 4.0, 6.0, 8.0]
Step 1: Batch mean
ฮผ_B = (2 + 4 + 6 + 8) / 4 = 5.0
Step 2: Batch variance
ฯยฒ_B = [(2โ5)ยฒ + (4โ5)ยฒ + (6โ5)ยฒ + (8โ5)ยฒ] / 4
= [9 + 1 + 1 + 9] / 4 = 5.0
Step 3: Normalize
แบ = (z โ 5.0) / โ(5.0 + 0.00001) = (z โ 5.0) / 2.23607
แบ = [โ1.3416, โ0.4472, 0.4472, 1.3416]
Step 4: Scale and shift
y = 1.5 ร แบ + 0.3
y = [โ1.7124, โ0.3708, 0.9708, 2.3124]
Verification: mean(แบ) = 0 โ, var(แบ) โ 1 โ
12.8 โ Visual Diagrams
FORWARD PASS (left to right) โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโบ
โโโโโโโโโ โโโโโโโโโโโโโโโ โโโโโโโโโโโโโ โโโโโโโโโโโโโโโ โโโโโโโโ
โ X โโโโโบโ Z[1]=W[1]X โโโโโบโA[1]=g(Z[1])โโโโบโZ[2]=W[2]A[1]โโโโบโA[2]=ฯโโโโบLoss
โ(input)โ โ +b[1] โ โ (ReLU) โ โ +b[2] โ โ โ
โโโโโโโโโ โโโโโโโโโโโโโโโ โโโโโโโโโโโโโ โโโโโโโโโโโโโโโ โโโโโโโโ
โ โ โ โ
โ Caches: โ Caches: โ Caches: โ
โ W[1],X,b[1] โ Z[1] โ W[2],A[1],b[2] โ
โ โ โ โ
โโโโโโโโโโโโโโโ BACKWARD PASS (right to left) โโโโโโโโโโโโโโโโโโโโโโโโโโโโ
dA[0]โโโdZ[1]โโโdA[1]โโโโโโโโโโโโdZ[2]โโโโโโโโโโโdA[2]โโโโโdL/dA[2]
(not โฒโโฒ โฒ โฒโโฒ โฒ
needed) โโ โ โโ โ
dW[1] Uses dW[2] dZ[2]=
db[1] ReLU' db[2] A[2]โY
mask
BATCH GRADIENT DESCENT STOCHASTIC GD (SGD)
โโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโ
โ Process ALL examplesโ โ Process ONE example โ
โ โโโโฌโโโฌโโโฌโโโฌโโโฌโโโโ โ โโโโ โ
โ โx1โx2โx3โx4โx5โx6โโ โ โx1โ โ update โ
โ โโโโดโโโดโโโดโโโดโโโดโโโโ โ โโโโ โ
โ โ โ โ โโโโ โ
โ Compute average โ โ โx2โ โ update โ
โ gradient โ โ โโโโ โ
โ โ โ โ โโโโ โ
โ ONE update/epoch โ โ โx3โ โ update โ
โโโโโโโโโโโโโโโโโโโโโโโ โ โโโโ โ
โ 6 updates/epoch โ
โโโโโโโโโโโโโโโโโโโโโโโ
MINI-BATCH GD (batch_size=2)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โโโโฌโโโ โโโโฌโโโ โโโโฌโโโ โ
โ โx1โx2โ โx3โx4โ โx5โx6โ โ
โ โโโโดโโโ โโโโดโโโ โโโโดโโโ โ
โ โ โ โ โ
โ update update update โ
โ โ
โ 3 updates/epoch โ BEST OF BOTH WORLDS! โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Loss contours (top view of a valley):
โญโโโโโโโโโโโโโโโโโโโโโโโโโโฎ
โฑ โญโโโโโโโโโโโโโโโโฎ โฒ
โฑ โฑ โญโโโโโโโโโโฎ โฒ โฒ
โ โ โฑ โญโโโโโโฎ โฒ โ โ
โ โ โ โ MIN โ
โ โ โ โ
โ โ โฒ โฐโโโโโโฏ โฑ โ โ
โฒ โฒ โฐโโโโโโโโโโฏ โฑ โฑ
โฒ โฐโโโโโโโโโโโโโโโโฏ โฑ
โฐโโโโโโโโโโโโโโโโโโโโโโโโโโฏ
SGD path: โโโฒโโฑโโฒโโฑโโฒโโฑโโ
(oscillates side to side)
Momentum path: โโโโโโโโฒโโโฑโโโ
(smoother, builds velocity)
Adam path: โโโโโโโโโโโโโโ
(adapts, goes nearly straight)
VANISHING GRADIENTS (sigmoid activation): โโโโโโโ โโโโโโโ โโโโโโโ โโโโโโโ โโโโโโโ โ L5 โโโโบโ L4 โโโโบโ L3 โโโโบโ L2 โโโโบโ L1 โ โ โ โ โ โ โ โ โ โ โ โgrad โ โgrad โ โgrad โ โgrad โ โgrad โ โ=1.0 โ โร0.25โ โร0.25โ โร0.25โ โร0.25โ โ โ โ=0.25โ โ=0.06โ โ=0.02โ โ=0.004โ โ Nearly zero! โโโโโโโ โโโโโโโ โโโโโโโ โโโโโโโ โโโโโโโ EXPLODING GRADIENTS (large weights): โโโโโโโ โโโโโโโ โโโโโโโ โโโโโโโ โโโโโโโ โ L5 โโโโบโ L4 โโโโบโ L3 โโโโบโ L2 โโโโบโ L1 โ โ โ โ โ โ โ โ โ โ โ โgrad โ โgrad โ โgrad โ โgrad โ โgrad โ โ=1.0 โ โ ร3 โ โ ร3 โ โ ร3 โ โ ร3 โ โ โ โ=3 โ โ=9 โ โ=27 โ โ=81 โ โ Explodes! โโโโโโโ โโโโโโโ โโโโโโโ โโโโโโโ โโโโโโโ
TRAINING (dropout rate = 0.5): INFERENCE (no dropout):
โโโ โโโ โโโ โโโ โโโ โโโ โโโ โโโ โโโ โโโ
โโโ โโณโ โโโ โโณโ โโโ โโโ โโโ โโโ โโโ โโโ
โโโ โโโ โโโ โโโ โโโ โโโ โโโ โโโ โโโ โโโ
โ โ โ โ โ โ โ โ
โโโ โโโ โโโ โโโ โโโ โโโ โโโ โโโ โโโ โโโ
โโณโ โโโ โโณโ โโโ โโณโ โโโ โโโ โโโ โโโ โโโ
โโโ โโโ โโโ โโโ โโโ โโโ โโโ โโโ โโโ โโโ
โณ = dropped (zeroed out, scaled) All neurons active
โ = kept (multiplied by 1/keep_prob) (no scaling needed with
inverted dropout)
Without Batch Norm: With Batch Norm:
โโโโโโโโโโโโ โโโโโโโโโโโโ
โ Z = WยทA โ โ Z = WยทA โ
โ + b โ โ + b โ
โโโโโโฌโโโโโโ โโโโโโฌโโโโโโ
โ โ
โผ โโโโโโผโโโโโโ
โโโโโโโโโโโโ โBatchNorm โ
โA = g(Z) โ โแบ=(Zโฮผ)/ฯ โ
โ(activate)โ โy=ฮณแบ+ฮฒ โ
โโโโโโโโโโโโ โโโโโโฌโโโโโโ
โ
โโโโโโผโโโโโโ
โA = g(y) โ
โ(activate)โ
โโโโโโโโโโโโ
12.9 โ Flowcharts
โโโโโโโโโโโโโโโโโโโโโโโ
โ Initialize โ
โ weights & biases โ
โ (He/Xavier init) โ
โโโโโโโโโโโโฌโโโโโโโโโโโ
โ
โโโโโโโโโโโโผโโโโโโโโโโโ
โโโโโโบโ for epoch in โ
โ โ range(num_epochs): โ
โ โโโโโโโโโโโโฌโโโโโโโโโโโ
โ โ
โ โโโโโโโโโโโโผโโโโโโโโโโโ
โ โ Shuffle data โ
โ โ Create mini-batchesโ
โ โโโโโโโโโโโโฌโโโโโโโโโโโ
โ โ
โ โโโโโโโโโโโโผโโโโโโโโโโโโโโโ
โ โโโบโ for batch in batches: โ
โ โ โโโโโโโโโโโโฌโโโโโโโโโโโโโโโ
โ โ โ
โ โ โโโโโโโโโโโโผโโโโโโโโโโโ
โ โ โ FORWARD PASS โ
โ โ โ Z[l] = W[l]ยทA +b[l]โ
โ โ โ A[l] = g(Z[l]) โ
โ โ โ Compute Loss โ
โ โ โโโโโโโโโโโโฌโโโโโโโโโโโ
โ โ โ
โ โ โโโโโโโโโโโโผโโโโโโโโโโโ
โ โ โ BACKWARD PASS โ
โ โ โ dZ, dW, db for โ
โ โ โ each layer โ
โ โ โโโโโโโโโโโโฌโโโโโโโโโโโ
โ โ โ
โ โ โโโโโโโโโโโโผโโโโโโโโโโโ
โ โ โ UPDATE WEIGHTS โ
โ โ โ (SGD/Adam/etc.) โ
โ โ โ Update LR schedule โ
โ โ โโโโโโโโโโโโฌโโโโโโโโโโโ
โ โ โ
โ โ โโโโโโโโโโโโผโโโโโโโโโโโ
โ โ โ More batches? โ
โ โ โโโโโโฌโโโโโโโโโโโโฌโโโโโ
โ โ โ Yes โ No
โ โโโโโโโโโ โ
โ โโโโโโโโโโโโผโโโโโโโโโโโ
โ โ Log metrics โ
โ โ Validate on dev setโ
โ โ Check early stop โ
โ โโโโโโโโโโโโฌโโโโโโโโโโโ
โ โ
โ โโโโโโโโโโโโผโโโโโโโโโโโ
โ โ Converged? โ
โ โโโโโโฌโโโโโโโโโโโโฌโโโโโ
โ โ No โ Yes
โโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโผโโโโโโโโโโโ
โ Save model โ
โ Report results โ
โโโโโโโโโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโโโโโโโ
โ Choose โ
โ Optimizer โ
โโโโโโโโโโฌโโโโโโโโโ
โ
โโโโโโโโโโผโโโโโโโโโ
โ Is model large โ
โ (>1M params)? โ
โโโโฌโโโโโโโโโโโฌโโโโ
Yes โ โ No
โโโโโโโโโโผโโโโโโโ โ
โ Use Adam or โ โ
โ AdamW โ โผ
โ (default for โ โโโโโโโโโโโโโโโโโ
โ most DL) โ โ Is data small? โ
โโโโโโโโโโโโโโโโโ โโโโฌโโโโโโโโโโฌโโโ
Yes โ โ No
โโโโโโโโโผโโโโโโโ โโผโโโโโโโโโโโโโโโ
โ SGD + โ โ Mini-batch โ
โ Momentum โ โ SGD with โ
โ (often best โ โ Momentum or โ
โ generalize) โ โ Adam โ
โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโ
Rules of thumb:
โข Adam: Default for most deep learning tasks
โข SGD+Mom: Often better final accuracy (with careful tuning)
โข AdamW: Adam + weight decay (better for transformers)
โข LAMB: For very large batch sizes (distributed training)
For each parameter ฮธแตข:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ 1. Compute analytic gradient โ
โ g_analytic = backprop(ฮธ) โ
โโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโ
โ 2. Compute numerical gradient โ
โ ฮธโบ = ฮธ; ฮธโบ[i] += ฮต โ
โ ฮธโป = ฮธ; ฮธโป[i] -= ฮต โ
โ g_numerical = (J(ฮธโบ) - J(ฮธโป)) โ
โ / (2ฮต) โ
โโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโ
โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโ
โ 3. Check difference โ
โ diff = โg_a โ g_nโโ โ
โ / (โg_aโโ + โg_nโโ) โ
โ โ
โ diff < 10โปโท โ CORRECT โ โ
โ diff < 10โปโต โ SUSPICIOUS โ โ
โ diff > 10โปยณ โ BUG! โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
12.10 โ Python Implementation from Scratch
12.10.1 โ Complete Neural Network with Backpropagation
import numpy as np
# ==========================================
# COMPLETE NEURAL NETWORK WITH BACKPROPAGATION
# ==========================================
class DeepNeuralNetwork:
"""
L-layer neural network with backpropagation.
Supports: ReLU, Sigmoid, Tanh activations
Optimizers: SGD, Momentum, RMSprop, Adam
Regularization: L2, Dropout, Batch Normalization
"""
def __init__(self, layer_dims, activations=None):
"""
layer_dims: list of integers [n_x, n_h1, n_h2, ..., n_y]
activations: list of strings ['relu', 'relu', ..., 'sigmoid']
"""
self.L = len(layer_dims) - 1 # Number of layers
self.layer_dims = layer_dims
self.parameters = {}
self.caches = {}
self.grads = {}
# Default activations: ReLU hidden, Sigmoid output
if activations is None:
self.activations = ['relu'] * (self.L - 1) + ['sigmoid']
else:
self.activations = activations
self._initialize_parameters()
def _initialize_parameters(self):
"""He initialization for ReLU, Xavier for Sigmoid/Tanh"""
np.random.seed(42)
for l in range(1, self.L + 1):
if self.activations[l-1] == 'relu':
# He initialization: scale by sqrt(2/n_prev)
scale = np.sqrt(2.0 / self.layer_dims[l-1])
else:
# Xavier initialization: scale by sqrt(1/n_prev)
scale = np.sqrt(1.0 / self.layer_dims[l-1])
self.parameters[f'W{l}'] = np.random.randn(
self.layer_dims[l], self.layer_dims[l-1]
) * scale
self.parameters[f'b{l}'] = np.zeros((self.layer_dims[l], 1))
# ---- Activation Functions ----
def _sigmoid(self, Z):
A = 1.0 / (1.0 + np.exp(-np.clip(Z, -500, 500)))
return A
def _relu(self, Z):
return np.maximum(0, Z)
def _tanh(self, Z):
return np.tanh(Z)
def _activation_forward(self, Z, activation):
if activation == 'sigmoid':
return self._sigmoid(Z)
elif activation == 'relu':
return self._relu(Z)
elif activation == 'tanh':
return self._tanh(Z)
def _activation_backward(self, dA, Z, activation):
if activation == 'sigmoid':
s = self._sigmoid(Z)
return dA * s * (1 - s)
elif activation == 'relu':
return dA * (Z > 0).astype(float)
elif activation == 'tanh':
return dA * (1 - np.tanh(Z) ** 2)
# ---- Forward Pass ----
def forward(self, X, keep_prob=1.0):
"""
Full forward pass through L layers.
Stores caches for backpropagation.
"""
self.caches = {}
A = X
self.caches['A0'] = X
for l in range(1, self.L + 1):
A_prev = A
W = self.parameters[f'W{l}']
b = self.parameters[f'b{l}']
# Linear step: Z = WยทA_prev + b
Z = np.dot(W, A_prev) + b
self.caches[f'Z{l}'] = Z
# Activation step
A = self._activation_forward(Z, self.activations[l-1])
# Dropout (not on output layer)
if keep_prob < 1.0 and l < self.L:
D = (np.random.rand(*A.shape) < keep_prob).astype(float)
A = A * D / keep_prob # Inverted dropout
self.caches[f'D{l}'] = D
self.caches[f'A{l}'] = A
return A
# ---- Loss Computation ----
def compute_cost(self, AL, Y, lambd=0):
"""Binary cross-entropy with optional L2 regularization"""
m = Y.shape[1]
# Cross-entropy loss
cost = -(1/m) * np.sum(
Y * np.log(AL + 1e-8) + (1 - Y) * np.log(1 - AL + 1e-8)
)
# L2 regularization
if lambd > 0:
l2_cost = 0
for l in range(1, self.L + 1):
l2_cost += np.sum(np.square(self.parameters[f'W{l}']))
cost += (lambd / (2 * m)) * l2_cost
return np.squeeze(cost)
# ---- Backward Pass ----
def backward(self, Y, lambd=0, keep_prob=1.0):
"""
Full backward pass โ computes all gradients.
This is the CORE of backpropagation.
"""
m = Y.shape[1]
AL = self.caches[f'A{self.L}']
# Output layer: dZ[L] = A[L] - Y (sigmoid + BCE)
dZ = AL - Y # Shape: (n[L] ร m)
for l in range(self.L, 0, -1):
A_prev = self.caches[f'A{l-1}']
# Gradient of weights: dW = (1/m) ยท dZ ยท A_prev^T
dW = (1/m) * np.dot(dZ, A_prev.T)
# Add L2 regularization gradient
if lambd > 0:
dW += (lambd / m) * self.parameters[f'W{l}']
# Gradient of biases: db = (1/m) ยท sum(dZ, axis=1)
db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
# Store gradients
self.grads[f'dW{l}'] = dW
self.grads[f'db{l}'] = db
# Propagate gradient to previous layer (if not at layer 1)
if l > 1:
# dA_prev = W^T ยท dZ
dA_prev = np.dot(self.parameters[f'W{l}'].T, dZ)
# Apply dropout mask
if keep_prob < 1.0 and f'D{l-1}' in self.caches:
dA_prev = dA_prev * self.caches[f'D{l-1}'] / keep_prob
# dZ_prev = dA_prev โ g'(Z_prev)
dZ = self._activation_backward(
dA_prev,
self.caches[f'Z{l-1}'],
self.activations[l-2]
)
# ---- Optimizers ----
def update_parameters_sgd(self, learning_rate):
"""Vanilla SGD"""
for l in range(1, self.L + 1):
self.parameters[f'W{l}'] -= learning_rate * self.grads[f'dW{l}']
self.parameters[f'b{l}'] -= learning_rate * self.grads[f'db{l}']
def initialize_velocity(self):
"""Initialize velocity for Momentum"""
self.v = {}
for l in range(1, self.L + 1):
self.v[f'dW{l}'] = np.zeros_like(self.parameters[f'W{l}'])
self.v[f'db{l}'] = np.zeros_like(self.parameters[f'b{l}'])
def update_parameters_momentum(self, learning_rate, beta=0.9):
"""SGD with Momentum"""
for l in range(1, self.L + 1):
# Update velocity
self.v[f'dW{l}'] = beta * self.v[f'dW{l}'] + (1 - beta) * self.grads[f'dW{l}']
self.v[f'db{l}'] = beta * self.v[f'db{l}'] + (1 - beta) * self.grads[f'db{l}']
# Update parameters
self.parameters[f'W{l}'] -= learning_rate * self.v[f'dW{l}']
self.parameters[f'b{l}'] -= learning_rate * self.v[f'db{l}']
def initialize_adam(self):
"""Initialize Adam moment estimates"""
self.m = {} # First moment
self.v_adam = {} # Second moment
for l in range(1, self.L + 1):
self.m[f'dW{l}'] = np.zeros_like(self.parameters[f'W{l}'])
self.m[f'db{l}'] = np.zeros_like(self.parameters[f'b{l}'])
self.v_adam[f'dW{l}'] = np.zeros_like(self.parameters[f'W{l}'])
self.v_adam[f'db{l}'] = np.zeros_like(self.parameters[f'b{l}'])
def update_parameters_adam(self, lr, t, beta1=0.9, beta2=0.999, eps=1e-8):
"""Adam optimizer with bias correction"""
for l in range(1, self.L + 1):
for param in ['dW', 'db']:
key = f'{param}{l}'
# Update first moment (mean of gradients)
self.m[key] = beta1 * self.m[key] + (1-beta1) * self.grads[key]
# Update second moment (mean of squared gradients)
self.v_adam[key] = beta2 * self.v_adam[key] + (1-beta2) * (self.grads[key]**2)
# Bias correction
m_corrected = self.m[key] / (1 - beta1**t)
v_corrected = self.v_adam[key] / (1 - beta2**t)
# Update parameter
p_key = f'W{l}' if 'W' in param else f'b{l}'
self.parameters[p_key] -= lr * m_corrected / (np.sqrt(v_corrected) + eps)
# ---- Gradient Checking ----
def gradient_check(self, X, Y, epsilon=1e-7):
"""Verify backprop using numerical gradients"""
# Compute analytic gradients
self.forward(X)
self.backward(Y)
# Flatten all parameters and gradients
params_values = []
grads_values = []
for l in range(1, self.L + 1):
params_values.append(self.parameters[f'W{l}'].flatten())
params_values.append(self.parameters[f'b{l}'].flatten())
grads_values.append(self.grads[f'dW{l}'].flatten())
grads_values.append(self.grads[f'db{l}'].flatten())
theta = np.concatenate(params_values)
d_theta = np.concatenate(grads_values)
# Compute numerical gradients
num_grads = np.zeros_like(theta)
for i in range(len(theta)):
# ฮธ+
theta_plus = theta.copy()
theta_plus[i] += epsilon
self._set_params_from_vector(theta_plus)
AL_plus = self.forward(X)
cost_plus = self.compute_cost(AL_plus, Y)
# ฮธ-
theta_minus = theta.copy()
theta_minus[i] -= epsilon
self._set_params_from_vector(theta_minus)
AL_minus = self.forward(X)
cost_minus = self.compute_cost(AL_minus, Y)
num_grads[i] = (cost_plus - cost_minus) / (2 * epsilon)
# Restore original parameters
self._set_params_from_vector(theta)
# Compute difference
diff = np.linalg.norm(d_theta - num_grads) / \
(np.linalg.norm(d_theta) + np.linalg.norm(num_grads) + 1e-8)
if diff < 1e-7:
print(f"โ
Gradient check PASSED! Difference: {diff:.2e}")
elif diff < 1e-5:
print(f"โ ๏ธ Gradient check SUSPICIOUS. Difference: {diff:.2e}")
else:
print(f"โ Gradient check FAILED! Difference: {diff:.2e}")
return diff
def _set_params_from_vector(self, theta):
"""Helper: set parameters from flat vector"""
idx = 0
for l in range(1, self.L + 1):
w_shape = self.parameters[f'W{l}'].shape
w_size = np.prod(w_shape)
self.parameters[f'W{l}'] = theta[idx:idx+w_size].reshape(w_shape)
idx += w_size
b_shape = self.parameters[f'b{l}'].shape
b_size = np.prod(b_shape)
self.parameters[f'b{l}'] = theta[idx:idx+b_size].reshape(b_shape)
idx += b_size
# ---- Full Training Loop ----
def train(self, X, Y, epochs=1000, lr=0.01, optimizer='adam',
batch_size=32, lambd=0, keep_prob=1.0,
lr_decay=None, print_cost=True):
"""Complete training loop with mini-batches"""
m = X.shape[1]
costs = []
t = 0 # Adam counter
# Initialize optimizer state
if optimizer == 'momentum':
self.initialize_velocity()
elif optimizer == 'adam':
self.initialize_adam()
for epoch in range(epochs):
# Shuffle data
permutation = np.random.permutation(m)
X_shuffled = X[:, permutation]
Y_shuffled = Y[:, permutation]
# Create mini-batches
num_batches = m // batch_size
epoch_cost = 0
for k in range(num_batches + (1 if m % batch_size != 0 else 0)):
X_batch = X_shuffled[:, k*batch_size : (k+1)*batch_size]
Y_batch = Y_shuffled[:, k*batch_size : (k+1)*batch_size]
if X_batch.shape[1] == 0:
continue
# Forward pass
AL = self.forward(X_batch, keep_prob)
# Compute cost
batch_cost = self.compute_cost(AL, Y_batch, lambd)
epoch_cost += batch_cost
# Backward pass
self.backward(Y_batch, lambd, keep_prob)
# Update parameters
t += 1
current_lr = self._get_lr(lr, epoch, epochs, lr_decay)
if optimizer == 'sgd':
self.update_parameters_sgd(current_lr)
elif optimizer == 'momentum':
self.update_parameters_momentum(current_lr)
elif optimizer == 'adam':
self.update_parameters_adam(current_lr, t)
avg_cost = epoch_cost / max(num_batches, 1)
costs.append(avg_cost)
if print_cost and epoch % 100 == 0:
print(f"Epoch {epoch:4d} | Cost: {avg_cost:.6f} | LR: {current_lr:.6f}")
return costs
def _get_lr(self, lr_init, epoch, total_epochs, schedule):
"""Learning rate scheduler"""
if schedule is None:
return lr_init
elif schedule == 'step':
return lr_init * (0.5 ** (epoch // 200))
elif schedule == 'exponential':
return lr_init * np.exp(-0.01 * epoch)
elif schedule == 'cosine':
return lr_init * 0.5 * (1 + np.cos(np.pi * epoch / total_epochs))
elif schedule == 'warmup_cosine':
warmup = total_epochs // 10
if epoch < warmup:
return lr_init * (epoch / warmup)
return lr_init * 0.5 * (1 + np.cos(np.pi * (epoch - warmup) / (total_epochs - warmup)))
def predict(self, X, threshold=0.5):
"""Make predictions"""
AL = self.forward(X, keep_prob=1.0)
return (AL > threshold).astype(int)
def accuracy(self, X, Y):
preds = self.predict(X)
return np.mean(preds == Y) * 100
# ==========================================
# DEMO: Train on synthetic data
# ==========================================
if __name__ == "__main__":
# Generate spiral dataset (non-linearly separable)
np.random.seed(1)
N = 200 # points per class
D = 2 # dimensions
K = 2 # classes
X = np.zeros((N*K, D))
Y = np.zeros((N*K, 1))
for j in range(K):
ix = range(N*j, N*(j+1))
r = np.linspace(0.0, 1, N)
t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
Y[ix] = j
X = X.T # (2, 400)
Y = Y.T # (1, 400)
# Create network: 2 inputs โ 16 โ 8 โ 1 output
nn = DeepNeuralNetwork([2, 16, 8, 1])
# Verify backpropagation
print("=== Gradient Check ===")
nn.gradient_check(X[:, :5], Y[:, :5])
# Train with Adam
print("\n=== Training with Adam ===")
costs = nn.train(X, Y, epochs=2000, lr=0.001,
optimizer='adam', batch_size=64,
lr_decay='cosine')
print(f"\nFinal accuracy: {nn.accuracy(X, Y):.1f}%")
12.10.2 โ Batch Normalization from Scratch
import numpy as np
class BatchNorm:
"""Batch Normalization layer โ forward and backward"""
def __init__(self, num_features, momentum=0.9, eps=1e-5):
self.gamma = np.ones(num_features) # Learnable scale
self.beta = np.zeros(num_features) # Learnable shift
self.eps = eps
self.momentum = momentum
# Running stats for inference
self.running_mean = np.zeros(num_features)
self.running_var = np.ones(num_features)
def forward(self, Z, training=True):
"""
Forward pass of batch normalization.
Z: (n, m) โ n features, m examples
"""
if training:
# Step 1: Batch mean
self.mu = np.mean(Z, axis=1, keepdims=True)
# Step 2: Batch variance
self.var = np.var(Z, axis=1, keepdims=True)
# Step 3: Normalize
self.Z_centered = Z - self.mu
self.std_inv = 1.0 / np.sqrt(self.var + self.eps)
self.Z_norm = self.Z_centered * self.std_inv
# Step 4: Scale and shift
out = self.gamma.reshape(-1, 1) * self.Z_norm + self.beta.reshape(-1, 1)
# Update running statistics
self.running_mean = (self.momentum * self.running_mean +
(1 - self.momentum) * self.mu.flatten())
self.running_var = (self.momentum * self.running_var +
(1 - self.momentum) * self.var.flatten())
else:
# Inference: use running statistics
Z_norm = (Z - self.running_mean.reshape(-1, 1)) / \
np.sqrt(self.running_var.reshape(-1, 1) + self.eps)
out = self.gamma.reshape(-1, 1) * Z_norm + self.beta.reshape(-1, 1)
return out
def backward(self, dout):
"""
Backward pass of batch normalization.
dout: gradient from upstream, same shape as Z
"""
m = dout.shape[1]
gamma = self.gamma.reshape(-1, 1)
# Gradient of gamma and beta
self.dgamma = np.sum(dout * self.Z_norm, axis=1)
self.dbeta = np.sum(dout, axis=1)
# Gradient through normalization
dZ_norm = dout * gamma
dvar = np.sum(dZ_norm * self.Z_centered * (-0.5) *
(self.var + self.eps)**(-1.5), axis=1, keepdims=True)
dmu = (np.sum(dZ_norm * (-self.std_inv), axis=1, keepdims=True) +
dvar * np.mean(-2 * self.Z_centered, axis=1, keepdims=True))
dZ = dZ_norm * self.std_inv + dvar * 2 * self.Z_centered / m + dmu / m
return dZ
# Demo
bn = BatchNorm(num_features=3)
Z = np.array([[2.0, 4.0, 6.0, 8.0],
[1.0, 3.0, 5.0, 7.0],
[0.5, 1.5, 2.5, 3.5]])
out = bn.forward(Z, training=True)
print("BN output:\n", out)
print("Mean (should be ~0):", np.mean(out, axis=1))
print("Var (should be ~1):", np.var(out, axis=1))
12.11 โ TensorFlow Implementation
12.11.1 โ Custom Training Loop with GradientTape
import tensorflow as tf
import numpy as np
# ==========================================
# CUSTOM TRAINING LOOP WITH GRADIENTTAPE
# ==========================================
# 1. Build model
model = tf.keras.Sequential([
tf.keras.layers.Dense(64, activation='relu', input_shape=(10,),
kernel_initializer='he_normal'),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Dropout(0.3),
tf.keras.layers.Dense(32, activation='relu',
kernel_initializer='he_normal'),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Dropout(0.3),
tf.keras.layers.Dense(1, activation='sigmoid')
])
# 2. Setup optimizer with learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
initial_learning_rate=0.001,
decay_steps=1000,
alpha=0.0001 # minimum learning rate
)
optimizer = tf.keras.optimizers.Adam(
learning_rate=lr_schedule,
beta_1=0.9,
beta_2=0.999,
epsilon=1e-8
)
loss_fn = tf.keras.losses.BinaryCrossentropy()
# 3. Generate synthetic data
np.random.seed(42)
X_train = np.random.randn(1000, 10).astype(np.float32)
Y_train = (np.sum(X_train[:, :3], axis=1) > 0).astype(np.float32).reshape(-1, 1)
train_dataset = tf.data.Dataset.from_tensor_slices((X_train, Y_train))
train_dataset = train_dataset.shuffle(1000).batch(32)
# 4. Custom training step
@tf.function
def train_step(x_batch, y_batch):
"""One training step with explicit gradient computation"""
with tf.GradientTape() as tape:
# Forward pass
predictions = model(x_batch, training=True)
loss = loss_fn(y_batch, predictions)
# Add L2 regularization manually
l2_loss = tf.add_n([tf.nn.l2_loss(v)
for v in model.trainable_variables
if 'kernel' in v.name])
total_loss = loss + 1e-4 * l2_loss
# Backward pass โ THIS IS BACKPROPAGATION!
gradients = tape.gradient(total_loss, model.trainable_variables)
# Gradient clipping (prevent exploding gradients)
gradients, global_norm = tf.clip_by_global_norm(gradients, 1.0)
# Update weights (Adam optimizer handles momentum + RMSprop internally)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
return total_loss, predictions
# 5. Training loop
for epoch in range(50):
epoch_loss = 0
num_batches = 0
for x_batch, y_batch in train_dataset:
loss, preds = train_step(x_batch, y_batch)
epoch_loss += loss.numpy()
num_batches += 1
if epoch % 10 == 0:
avg_loss = epoch_loss / num_batches
current_lr = optimizer.learning_rate(optimizer.iterations).numpy()
print(f"Epoch {epoch:3d} | Loss: {avg_loss:.4f} | LR: {current_lr:.6f}")
# 6. Inspect gradients for diagnosis
print("\n=== Gradient Analysis ===")
with tf.GradientTape() as tape:
preds = model(X_train[:32], training=False)
loss = loss_fn(Y_train[:32], preds)
grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
if grad is not None:
print(f"{var.name:30s} | grad mean: {tf.reduce_mean(grad):.6f} | "
f"grad std: {tf.math.reduce_std(grad):.6f}")
12.11.2 โ Comparing Optimizers in TensorFlow
import tensorflow as tf
import numpy as np
def create_model():
return tf.keras.Sequential([
tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
tf.keras.layers.Dense(16, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
optimizers = {
'SGD': tf.keras.optimizers.SGD(learning_rate=0.01),
'SGD+Momentum': tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
'RMSprop': tf.keras.optimizers.RMSprop(learning_rate=0.001),
'Adam': tf.keras.optimizers.Adam(learning_rate=0.001),
}
results = {}
for name, opt in optimizers.items():
model = create_model()
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, Y_train, epochs=50, batch_size=32, verbose=0)
results[name] = history.history['loss']
print(f"{name:15s} | Final loss: {history.history['loss'][-1]:.4f} | "
f"Final acc: {history.history['accuracy'][-1]:.4f}")
12.12 โ Scikit-Learn Implementation
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
# Generate dataset
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Compare different solvers (optimizers) in scikit-learn
solvers = {
'sgd': {'solver': 'sgd', 'learning_rate_init': 0.01,
'momentum': 0.9, 'nesterovs_momentum': True},
'adam': {'solver': 'adam', 'learning_rate_init': 0.001,
'beta_1': 0.9, 'beta_2': 0.999},
'lbfgs': {'solver': 'lbfgs'}, # Quasi-Newton method
}
print("Scikit-Learn MLPClassifier โ Optimizer Comparison")
print("=" * 55)
for name, params in solvers.items():
mlp = MLPClassifier(
hidden_layer_sizes=(64, 32), # Two hidden layers
activation='relu',
max_iter=500,
random_state=42,
alpha=0.0001, # L2 regularization
batch_size=32,
**params
)
mlp.fit(X_train, y_train)
train_acc = mlp.score(X_train, y_train) * 100
test_acc = mlp.score(X_test, y_test) * 100
print(f"{name:8s} | Train: {train_acc:.1f}% | Test: {test_acc:.1f}% | "
f"Iters: {mlp.n_iter_}")
# Access the loss curve
print(f"\nAdam loss curve (first 10 epochs): {mlp.loss_curve_[:10]}")
print(f"Final loss: {mlp.loss_curve_[-1]:.6f}")
12.13 โ Indian Case Studies
TCS AutoML Training Pipelines
Tata Consultancy Services (TCS) developed AutoML frameworks for their enterprise clients that automatically tune training hyperparameters including optimizer choice, learning rate schedules, and regularization strength.
- Challenge: TCS serves 1000+ enterprise clients, each with unique data characteristics. Manually tuning learning rates and optimizers for each client was unsustainable.
- Solution: Built an automated training pipeline that starts with Adam (safe default), monitors loss curves for signs of divergence, and automatically switches to SGD+Momentum with warmup if loss plateaus.
- Training Strategy: Warmup + cosine annealing schedule with automatic learning rate range finding (similar to Leslie Smith's LR finder).
- Results: Reduced ML model deployment time from 6 weeks to 5 days. Achieved 15-20% better final accuracy compared to using default hyperparameters.
- Scale: The pipeline processes 50+ TB of client data daily, training models for fraud detection, demand forecasting, and customer churn prediction.
Infosys Nia โ Deep Learning Training Optimization
Infosys Nia is an AI platform that uses deep learning for knowledge management, predictive analytics, and process automation across Infosys's global operations.
- Training Infrastructure: Uses distributed training across multiple NVIDIA DGX systems with gradient accumulation for effective batch sizes of 4096+.
- Optimizer Choice: LAMB optimizer (Layer-wise Adaptive Moments for Batch training) โ a variant of Adam designed for large batch training, crucial for distributed setups.
- Batch Normalization: Synchronized Batch Normalization across GPUs to maintain consistent statistics in distributed training.
- Gradient Challenges: Encountered exploding gradients in RNN-based document classification models. Solved with gradient clipping (max norm = 1.0) and LSTM/GRU architectures.
- Impact: Reduced document processing time by 70% for 200+ enterprise clients; handles 100M+ documents annually.
ISRO โ Satellite Image Neural Network Training
The Indian Space Research Organisation (ISRO) trains deep neural networks for satellite image analysis, including crop classification, flood detection, and urban planning.
- Challenge: Satellite images are enormous (10,000 ร 10,000 pixels). Training on full images causes memory overflow. Vanishing gradients in deep segmentation networks (U-Net with 50+ layers).
- Solution: Patch-based training with mini-batch SGD + Nesterov momentum. Skip connections (from U-Net architecture) to combat vanishing gradients. Mixed precision training (FP16 forward pass, FP32 gradients) to save memory.
- Results: 94% accuracy in crop type classification across India's diverse agricultural landscape. 3ร faster training with mixed precision.
12.14 โ Global Case Studies
GPT-3 Training Setup โ OpenAI (2020)
Training GPT-3 with 175 billion parameters was one of the most ambitious training runs in history. Every aspect of training optimization discussed in this chapter was critical.
- Optimizer: Adam with ฮฒโ = 0.9, ฮฒโ = 0.95, ฮต = 10โปโธ (note: ฮฒโ = 0.95, not the typical 0.999 โ provides stronger gradient adaptation)
- Learning Rate: Linear warmup over 375M tokens, then cosine decay to 10% of peak LR. Peak LR varied by model size (0.6ร10โปโด for the largest model).
- Batch Size: Gradually increased from 32K tokens to 3.2M tokens during training (large batch training).
- Gradient Clipping: Global norm clipping at 1.0 to prevent exploding gradients.
- Training Data: 300 billion tokens, 45 TB of text data.
- Compute: Estimated 3,640 petaflop-days. Cost: $4.6M on V100 GPUs (estimated at 2020 cloud prices).
- Weight Decay: 0.1 (applied to all parameters except biases and layer norm parameters).
Tesla Full Self-Driving (FSD) Training
Tesla trains neural networks for autonomous driving using data from millions of vehicles.
- Architecture: Multi-task learning with shared backbone (HydraNet). Backpropagation must handle gradients from 50+ output tasks simultaneously.
- Training Scale: Custom Dojo supercomputer with ExaPOD training tiles. Processes 1.5 billion labeled video frames.
- Optimizer: AdamW (Adam with decoupled weight decay) โ the standard for transformer-based vision models.
- Key Challenge: Multi-task gradient conflicts โ some tasks' gradients can cancel out others. Tesla uses gradient surgery techniques (PCGrad) to handle this.
- Batch Norm Alternative: Uses Group Normalization and Layer Normalization since batch statistics are unreliable in variable-length video sequences.
Netflix โ Recommendation Model Training at Scale
Netflix trains deep learning recommendation models that serve 230+ million subscribers.
- Challenge: Models must be retrained daily on billions of user interactions. Training speed is critical โ a model that takes 48 hours to train is useless if user preferences change daily.
- Solution: Highly optimized SGD with momentum on custom hardware. Sparse gradient updates for embedding layers (only update embeddings for users/items in the batch).
- Dropout: Applied heavily (0.5) in upper layers to prevent overfitting on popular items, ensuring diverse recommendations.
- Impact: The recommendation system saves Netflix an estimated $1B/year in customer retention.
12.15 โ Startup Applications
๐ How Startups Leverage Training Optimization
Razorpay (India): Uses optimized training pipelines with Adam + cosine annealing for their fraud detection models. Mini-batch size tuned for their GPU cluster (A100s). Processes 1B+ transactions/month and retrains models every 4 hours for fresh fraud patterns.
Postman (India): API testing platform uses NLP models trained with backpropagation for API documentation generation and error detection. Training uses warmup schedules to avoid early divergence on small, domain-specific datasets.
Hugging Face (Global): Built the Accelerate library that automatically handles distributed training, mixed precision, and gradient accumulation โ abstracting away the low-level backpropagation details we studied. Their Trainer class implements warmup + linear/cosine schedules by default.
Weights & Biases (Global): Founded specifically to help ML teams track training runs โ monitoring loss curves, gradient norms, learning rates, and optimizer states. Their experiment tracking tools are used by 70,000+ ML practitioners worldwide.
12.16 โ Government Applications
๐๏ธ Government & Public Sector Deep Learning Training
UIDAI / Aadhaar (India): The biometric matching system uses deep neural networks trained with carefully tuned backpropagation. The fingerprint and iris matching networks use Adam optimizer with learning rate warmup. Batch Normalization is critical for handling the massive diversity in biometric quality across 1.3B+ entries. Dropout (0.4) is used to prevent overfitting on dominant demographic patterns.
Indian Railways: Predictive maintenance models for rolling stock are trained using optimized SGD + Momentum. The models process vibration sensor data from 12,000+ locomotives. Learning rate schedules (step decay every 50 epochs) ensure stable training on noisy sensor data.
DRDO (Defence Research): Trains computer vision models for surveillance and target recognition on custom PARAM supercomputers. Uses gradient clipping and careful initialization to train very deep networks (100+ layers) for object detection in challenging conditions.
US Department of Energy: Uses distributed backpropagation across supercomputers (Summit, Frontier) for training physics-informed neural networks. LAMB optimizer enables effective batch sizes of 65,536+ across thousands of GPUs.
12.17 โ Industry Applications
| Industry | Application | Training Setup | Key Technique |
|---|---|---|---|
| Healthcare | Medical image diagnosis | Adam, LR warmup, BatchNorm | Transfer learning + fine-tuning with low LR |
| Finance | Algorithmic trading | SGD + Momentum, gradient clipping | Online learning with mini-batch updates |
| E-commerce | Product recommendations | Adam, sparse gradients | Embedding-specific optimizers |
| Manufacturing | Quality inspection | SGD + cosine annealing | BatchNorm for batch-to-batch variation |
| Telecommunications | Network optimization | RMSprop, dropout | Recurrent networks with gradient clipping |
| Agriculture | Crop disease detection | Adam + weight decay | Heavy dropout (0.5) for small datasets |
| Energy | Demand forecasting | Adam, LR scheduling | Sequence models with layered dropout |
| Automotive | Autonomous driving | AdamW + warmup + cosine | Multi-task gradient balancing |
12.18 โ Mini Projects
๐ฌ Backpropagation Visualizer
Build an interactive visualization that shows gradients flowing backward through a neural network.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
class BackpropVisualizer:
"""Visualize forward and backward pass through a network"""
def __init__(self, layer_sizes):
self.layer_sizes = layer_sizes
self.L = len(layer_sizes) - 1
self.params = {}
self.cache = {}
self.grads = {}
# Initialize small network
np.random.seed(42)
for l in range(1, self.L + 1):
self.params[f'W{l}'] = np.random.randn(
layer_sizes[l], layer_sizes[l-1]) * 0.5
self.params[f'b{l}'] = np.zeros((layer_sizes[l], 1))
def forward_and_backward(self, X, Y):
"""Run forward and backward, store everything"""
# Forward
A = X
self.cache['A0'] = A
for l in range(1, self.L + 1):
Z = self.params[f'W{l}'] @ A + self.params[f'b{l}']
self.cache[f'Z{l}'] = Z
A = 1 / (1 + np.exp(-Z)) if l == self.L else np.maximum(0, Z)
self.cache[f'A{l}'] = A
# Backward
m = Y.shape[1]
dZ = A - Y
for l in range(self.L, 0, -1):
A_prev = self.cache[f'A{l-1}']
self.grads[f'dW{l}'] = (1/m) * dZ @ A_prev.T
self.grads[f'db{l}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
self.grads[f'dZ{l}'] = dZ
if l > 1:
dA = self.params[f'W{l}'].T @ dZ
dZ = dA * (self.cache[f'Z{l-1}'] > 0)
def visualize(self):
"""Draw the network with gradient magnitudes"""
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
# Plot 1: Network with gradient colors
ax = axes[0]
ax.set_title('Network with Gradient Magnitudes', fontweight='bold')
max_neurons = max(self.layer_sizes)
for l, n in enumerate(self.layer_sizes):
x = l * 2
for i in range(n):
y = (max_neurons - n) / 2 + i
if l > 0:
grad_mag = np.abs(self.grads[f'dZ{l}''][i, 0])
color = plt.cm.RdYlGn_r(min(grad_mag * 5, 1))
else:
color = 'lightblue'
circle = plt.Circle((x, y), 0.3, color=color, ec='black')
ax.add_patch(circle)
# Draw connections to next layer
if l < len(self.layer_sizes) - 1:
n_next = self.layer_sizes[l+1]
for j in range(n_next):
y_next = (max_neurons - n_next) / 2 + j
w = self.params[f'W{l+1}''][j, i]
ax.plot([x+0.3, (l+1)*2-0.3], [y, y_next],
color='blue' if w > 0 else 'red',
alpha=min(abs(w), 1), linewidth=abs(w)*2)
ax.set_xlim(-1, len(self.layer_sizes) * 2)
ax.set_ylim(-1, max_neurons)
ax.set_aspect('equal')
ax.axis('off')
# Plot 2: Gradient norms per layer
ax = axes[1]
layers = []
grad_norms = []
for l in range(1, self.L + 1):
layers.append(f'Layer {l}')
grad_norms.append(np.linalg.norm(self.grads[f'dW{l}'']))
ax.bar(layers, grad_norms, color=['#059669', '#0891b2', '#8b5cf6'][:len(layers)])
ax.set_title('Gradient Norms by Layer', fontweight='bold')
ax.set_ylabel('โdWโโ')
plt.tight_layout()
plt.savefig('backprop_visualization.png', dpi=150, bbox_inches='tight')
plt.show()
# Run the visualizer
viz = BackpropVisualizer([4, 6, 4, 1])
X = np.random.randn(4, 1)
Y = np.array([[1]])
viz.forward_and_backward(X, Y)
viz.visualize()
โก Optimizer Comparison Dashboard
Build a comprehensive comparison of SGD, Momentum, RMSprop, and Adam on multiple datasets, plotting convergence curves, final accuracy, and training time.
import numpy as np
import matplotlib.pyplot as plt
import time
# Use our DeepNeuralNetwork class from Section 12.10
# (assumes it's already defined/imported)
def generate_spiral_data(N=200, K=2):
np.random.seed(1)
X = np.zeros((N*K, 2))
Y = np.zeros((N*K, 1))
for j in range(K):
ix = range(N*j, N*(j+1))
r = np.linspace(0.0, 1, N)
t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2
X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
Y[ix] = j
return X.T, Y.T
X, Y = generate_spiral_data()
optimizers = ['sgd', 'momentum', 'adam']
colors = ['#ef4444', '#f97316', '#059669']
results = {}
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for i, opt in enumerate(optimizers):
nn = DeepNeuralNetwork([2, 16, 8, 1])
start = time.time()
lr = 0.01 if opt == 'sgd' else 0.001
costs = nn.train(X, Y, epochs=2000, lr=lr,
optimizer=opt, batch_size=64, print_cost=False)
elapsed = time.time() - start
acc = nn.accuracy(X, Y)
results[opt] = {'costs': costs, 'time': elapsed, 'acc': acc}
# Plot convergence
axes[0].plot(costs, color=colors[i], label=f'{opt} ({acc:.0f}%)', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Cost')
axes[0].set_title('Convergence Comparison')
axes[0].legend()
axes[0].set_yscale('log')
# Bar chart: final accuracy
axes[1].bar(optimizers, [results[o]['acc'] for o in optimizers], color=colors)
axes[1].set_title('Final Accuracy (%)')
axes[1].set_ylim(80, 100)
# Bar chart: training time
axes[2].bar(optimizers, [results[o]['time'] for o in optimizers], color=colors)
axes[2].set_title('Training Time (seconds)')
plt.suptitle('Optimizer Comparison on Spiral Dataset', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('optimizer_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
๐ Learning Rate Schedule Explorer
Visualize and compare all learning rate schedules (step decay, exponential decay, cosine annealing, warmup + cosine) and observe their effect on training.
import numpy as np
import matplotlib.pyplot as plt
total_epochs = 200
lr_init = 0.01
epochs = np.arange(total_epochs)
schedules = {
'Constant': np.ones(total_epochs) * lr_init,
'Step Decay (รท2 every 50)': lr_init * (0.5 ** (epochs // 50)),
'Exponential (k=0.02)': lr_init * np.exp(-0.02 * epochs),
'Cosine Annealing': lr_init * 0.5 * (1 + np.cos(np.pi * epochs / total_epochs)),
'Warmup + Cosine': np.where(
epochs < 20,
lr_init * epochs / 20,
lr_init * 0.5 * (1 + np.cos(np.pi * (epochs - 20) / (total_epochs - 20)))
),
}
fig, ax = plt.subplots(figsize=(12, 6))
colors = ['#64748b', '#ef4444', '#f97316', '#059669', '#0891b2']
for (name, lrs), color in zip(schedules.items(), colors):
ax.plot(epochs, lrs, label=name, linewidth=2.5, color=color)
ax.set_xlabel('Epoch', fontsize=13)
ax.set_ylabel('Learning Rate', fontsize=13)
ax.set_title('Learning Rate Schedule Comparison', fontsize=15, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lr_schedules.png', dpi=150)
plt.show()
12.19 โ End-of-Chapter Exercises
Conceptual Questions
Explain why computing gradients numerically (finite differences) is impractical for large networks. Calculate the number of forward passes needed for a network with 10 million parameters.
Draw the complete computation graph for a 3-layer network (input โ hidden1 โ hidden2 โ output) with ReLU activations and BCE loss. Label all nodes and edges.
Explain the chain rule of calculus for composite functions. If f(x) = sin(xยฒ+3x), compute df/dx using the chain rule. Then explain how this extends to backpropagation.
Why does sigmoid + cross-entropy loss give the clean gradient dZ[L] = A[L] โ Y? Derive this simplification step by step. What analogous simplification occurs for softmax + categorical cross-entropy?
Compare and contrast Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent. Under what conditions is each most appropriate?
Mathematical Problems
For a 2-layer network with architecture [3, 4, 1], write out the shapes of W[1], b[1], Z[1], A[1], W[2], b[2], Z[2], A[2], and all their gradient counterparts (dW, db, dZ, dA) for a batch of m=5 examples.
Given gradients โJโ = [0.5, โ0.2], โJโ = [0.3, โ0.4], โJโ = [0.4, โ0.3], compute the Momentum velocity vโ with ฮฒ = 0.9 and vโ = [0, 0]. Then compute the parameter update with ฮฑ = 0.01.
Compute one step of Adam for ฮธ = [1.0, 2.0], โJ = [0.2, โ0.5], at t=1 with ฮฑ=0.001, ฮฒโ=0.9, ฮฒโ=0.999, ฮต=10โปโธ. Show the bias correction step explicitly and explain why it matters at t=1.
For a batch of values Z = [1.0, 3.0, 5.0, 7.0], compute the Batch Normalization forward pass with ฮณ = 2.0, ฮฒ = 1.0, ฮต = 10โปโต. Verify that the normalized values have zero mean and unit variance.
Prove that after k layers with sigmoid activations, the maximum gradient is (0.25)แต. For k = 20, what is this value? How does this illustrate the vanishing gradient problem?
Programming Exercises
Implement the RMSprop optimizer from scratch (add it to the DeepNeuralNetwork class). Test it on the spiral dataset and compare convergence with SGD and Adam.
Implement inverted dropout from scratch. Add a dropout layer between each hidden layer of a 4-layer network. Compare training and test accuracy with dropout rates of 0, 0.2, 0.5, and 0.8.
Implement numerical gradient checking and use it to verify your backpropagation implementation. Test with a 3-layer network and report the difference metric.
Write a TensorFlow custom training loop that implements warmup + cosine annealing learning rate schedule. Train a model on MNIST and plot the learning rate and loss curves.
Implement He initialization and Xavier initialization from scratch. Train the same network architecture with both initializations (and random initialization) and compare convergence.
Advanced Exercises
Implement gradient clipping (both by value and by norm) from scratch. Create a scenario where exploding gradients occur, then show that clipping fixes the issue.
Derive the backward pass equations for Batch Normalization from first principles (compute โL/โฮณ, โL/โฮฒ, and โL/โz_i). Implement it and verify with gradient checking.
Compare the generalization performance of Adam vs SGD+Momentum on CIFAR-10 (using TensorFlow). Is it true that SGD generalizes better? Explain your findings.
Implement a learning rate finder that sweeps learning rate from 10โปโท to 10, trains for one epoch, and plots loss vs learning rate. Find the optimal learning rate for a given dataset.
Implement a training loop that uses gradient accumulation to simulate a larger batch size. Compare effective batch sizes of 32, 128, 512, and 2048 (using gradient accumulation over 1, 4, 16, and 64 mini-batches).
Implement Nesterov Accelerated Gradient (NAG) from scratch. NAG computes the gradient at the "look-ahead" position: v_t = ฮฒยทv_{t-1} + ฮฑยทโJ(ฮธ โ ฮฒยทv_{t-1}). Compare with standard momentum.
Build a "training dashboard" that logs and plots in real-time: (1) training loss, (2) validation loss, (3) gradient norms per layer, (4) learning rate, (5) weight histograms. Use matplotlib animation.
12.20 โ Multiple Choice Questions
What is the computational complexity of backpropagation relative to the forward pass?
For a sigmoid output with binary cross-entropy loss, dZ[L] simplifies to:
In the Adam optimizer, what is the purpose of bias correction?
Which initialization is recommended for ReLU activations?
Batch Normalization normalizes which of the following?
In inverted dropout, why do we divide by keep_prob during training?
Which gradient problem is addressed by using ReLU instead of sigmoid?
The formula dW[l] = (1/m) ยท dZ[l] ยท A[l-1]แต computes the gradient of the cost w.r.t. weights. The A[l-1]แต term comes from:
For gradient checking, a relative difference of less than ______ indicates correct backpropagation:
GPT-3 was trained using which optimizer and learning rate schedule?
What is the typical mini-batch size used in practice for deep learning?
12.21 โ Interview Questions
Explain backpropagation to someone who knows calculus but not machine learning. What is it computing and why is it efficient?
Why does Adam work better than SGD for most tasks? When might SGD+Momentum be preferred?
What is the vanishing gradient problem? Name 3 techniques to mitigate it.
Explain Batch Normalization. Where is it placed? What are ฮณ and ฮฒ? What happens at inference time?
What is gradient clipping? When do you need it? What's the difference between clipping by value and clipping by norm?
What is a learning rate warmup and why is it used for training large models like GPT?
How does mixed precision training work? What role does gradient scaling play?
Explain dropout as a regularizer. How does inverted dropout simplify inference?
In practice, how do you choose between Adam and SGD? What learning rate would you start with for each?
Your model's training loss is decreasing but validation loss plateaus. Your gradient norms are healthy (not vanishing/exploding). What's happening and how do you fix it?
12.22 โ Research Problems
Research Problem 1: Optimizer Design for Sparse Gradients
Background: In recommendation systems and NLP, most gradients are zero (sparse embeddings). Standard Adam wastes computation updating second moment estimates for zero gradients.
Question: Design an optimizer variant that efficiently handles sparse gradients by only updating moment estimates for non-zero gradient entries. Prove mathematically that your optimizer has the same convergence guarantee as Adam. Implement it and benchmark on a recommendation system with 100K item embeddings.
Hint: Look into SparseAdam and Lazy Adam implementations. Consider how bias correction must be modified for sparse updates.
Research Problem 2: Automatic Learning Rate Schedule Discovery
Background: The optimal learning rate schedule depends on the dataset, model architecture, and optimization landscape โ yet practitioners often use hand-crafted schedules.
Question: Develop a meta-learning approach that automatically discovers the optimal learning rate schedule for a given task. The meta-learner should train on a family of tasks and learn to predict good LR schedules for unseen tasks based on early training metrics (loss curve shape, gradient statistics). Compare your approach to cosine annealing and warmup + linear decay baselines.
Research Problem 3: Gradient Conflict in Multi-Task Learning
Background: In multi-task learning (like Tesla's FSD), gradients from different tasks can conflict โ task A wants to increase a weight while task B wants to decrease it.
Question: Implement and compare three gradient conflict resolution methods: (1) PCGrad (project conflicting gradients), (2) GradNorm (dynamically weight task losses), (3) Your own method. Evaluate on a multi-task benchmark (e.g., NYUv2 with segmentation, depth, and normals). Analyze which layers experience the most conflict and why.
Research Problem 4: Backpropagation Alternatives
Background: Backpropagation is biologically implausible (the brain doesn't have symmetric feedback connections). This limits our understanding of biological learning.
Question: Implement and compare backpropagation with two alternative credit assignment methods: (1) Feedback Alignment (random feedback weights), (2) Direct Feedback Alignment (random connections from output to each layer). Compare convergence speed and final accuracy on MNIST and CIFAR-10. Does scaling to deeper networks work for these alternatives?
12.23 โ Key Takeaways
Backpropagation = chain rule + dynamic programming. It computes all gradients in one backward pass (O(N) time), making deep learning computationally feasible. Without it, training would require O(Nยฒ) or worse.
Five equations govern all backpropagation: dZ[L] = A[L]โY, dW[l] = (1/m)ยทdZ[l]ยทA[l-1]แต, db[l] = (1/m)ยทsum(dZ[l]), dA[l-1] = W[l]แตยทdZ[l], dZ[l] = dA[l] โ g'(Z[l]). These five formulas, applied layer by layer, train any feedforward network.
Adam is the default optimizer for most deep learning tasks. It combines momentum (first moment) and adaptive learning rates (second moment) with bias correction. Typical settings: ฮฒโ=0.9, ฮฒโ=0.999, ฮต=10โปโธ, lr=0.001.
Learning rate is the most important hyperparameter. Use warmup + cosine annealing for large models. The learning rate finder technique can help choose the starting value. Too high โ divergence, too low โ slow convergence.
Batch Normalization accelerates training by normalizing layer inputs, allowing higher learning rates and reducing sensitivity to initialization. Use running statistics at inference time.
Dropout is the simplest effective regularizer. Inverted dropout (divide by keep_prob during training) avoids any test-time modification. Typical rates: 0.2-0.5 for hidden layers, never on the output layer.
Vanishing gradients โ use ReLU + He init + skip connections. Exploding gradients โ use gradient clipping + proper initialization. Always monitor gradient norms during training to detect these problems early.
Gradient checking verifies backprop correctness. Use it during development (never in production โ it's O(Nยฒ)). A relative difference below 10โปโท means your implementation is correct.
Real-world training is an engineering challenge. GPT-3 required custom optimizer settings (ฮฒโ=0.95), warmup schedules, gradient clipping, and millions of dollars of compute. Mastering these techniques is a valuable skill in industry.
12.24 โ References
Foundational Papers
- Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). "Learning representations by back-propagating errors." Nature, 323(6088), 533-536.
- Kingma, D.P., & Ba, J. (2014). "Adam: A Method for Stochastic Optimization." arXiv:1412.6980.
- Ioffe, S., & Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." ICML.
- Srivastava, N., et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." JMLR, 15, 1929-1958.
- He, K., et al. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." ICCV.
- Glorot, X., & Bengio, Y. (2010). "Understanding the difficulty of training deep feedforward neural networks." AISTATS.
- Loshchilov, I., & Hutter, F. (2016). "SGDR: Stochastic Gradient Descent with Warm Restarts." arXiv:1608.03983.
Textbooks
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapters 6-8.
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 5.
- Nielsen, M. (2015). Neural Networks and Deep Learning. Online textbook (free).
Industry & Applied References
- Brown, T., et al. (2020). "Language Models are Few-Shot Learners" (GPT-3). NeurIPS.
- You, Y., et al. (2019). "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes." arXiv:1904.00962 (LAMB optimizer).
- Smith, L.N. (2017). "Cyclical Learning Rates for Training Neural Networks." WACV.
- TCS Research. "Automated Machine Learning for Enterprise Applications." TCS white paper.
- Tesla AI Day presentations (2021, 2022). Training infrastructure and optimization details.
Chapter Navigation