Chapter 12: Backpropagation & Training Deep Networks

12.1 — Learning Objectives

After completing this chapter, you will be able to:

1. Explain why backpropagation is necessary and how it computes gradients efficiently using the chain rule
2. Draw computation graphs for neural network forward passes and trace gradients backward through them
3. Derive the complete backpropagation equations for a 2-layer and L-layer network from first principles
4. Implement Batch GD, SGD, Mini-batch GD, Momentum, RMSprop, and Adam optimizer from scratch
5. Apply learning rate schedules: step decay, exponential decay, cosine annealing, and warmup
6. Implement and explain Batch Normalization (forward and backward pass) and Dropout regularization
7. Diagnose and fix vanishing/exploding gradient problems using proper initialization and gradient clipping
8. Verify backpropagation correctness using numerical gradient checking
9. Use TensorFlow GradientTape for custom training loops and compare optimizer performance
10. Relate training techniques to real-world systems like GPT-3 training and Tesla FSD

12.2 — Introduction

In Chapter 11, we built a neural network and computed its forward pass — given inputs, we produced predictions. But here's the crucial question: how does the network learn? How does it adjust millions (or billions) of parameters to reduce its errors?

The answer is backpropagation — short for "backward propagation of errors." It is the single most important algorithm in deep learning. Without it, training neural networks of any meaningful depth would be computationally impossible.

The Core Problem

Consider a network with 1 million weights. To minimize the loss function, we need the gradient ∂L/∂w for every single weight. Computing each gradient numerically (perturbing each weight and measuring the loss change) would require 2 million forward passes per training step. For a dataset with 1 million examples and 1000 training steps, that's 2 × 10¹⁵ forward passes — utterly impossible.

Backpropagation computes all gradients in a single backward pass, making it only about 2-3× the cost of a single forward pass. This is the miracle that makes deep learning practical.

This chapter is the engine room of deep learning. We will:

Understand computation graphs and the chain rule
Derive every backpropagation equation step by step
Master gradient descent variants and modern optimizers
Learn essential training techniques: Batch Normalization, Dropout
Diagnose gradient pathologies and verify our implementations

12.3 — Historical Background

Backpropagation has a surprisingly long and winding history, with multiple independent discoveries across decades:

Year	Milestone	Contributor
1960	Automatic differentiation concepts for control theory	Henry J. Kelley
1970	Reverse-mode automatic differentiation (the mathematical foundation of backprop)	Seppo Linnainmaa
1974	Application of backpropagation to neural networks in PhD thesis	Paul Werbos
1986	Landmark paper demonstrating backprop enables learning internal representations — popularized the algorithm	Rumelhart, Hinton & Williams
1994	Identified vanishing gradient problem in deep networks and RNNs	Yoshua Bengio et al.
2012	AlexNet trained with backprop + SGD wins ImageNet — deep learning revolution begins	Krizhevsky, Sutskever, Hinton
2014	Adam optimizer published — becomes the default optimizer	Kingma & Ba
2015	Batch Normalization enables training much deeper networks	Ioffe & Szegedy
2017	Warmup + cosine annealing schedules for Transformer training	Vaswani et al. / Loshchilov & Hutter
2020	GPT-3 trained with Adam optimizer on 300B tokens — 175B parameters	OpenAI

12.4 — Conceptual Explanation

12.4.1 — Why Do We Need Gradients?

Training a neural network means finding the weights W and biases b that minimize a loss function L. The loss function is a landscape in a very high-dimensional space (one dimension per parameter). We need to find the lowest valley in this landscape.

Gradient descent is our strategy: at any point on the loss surface, the gradient ∇L points in the direction of steepest ascent. By taking steps in the opposite direction, we move downhill toward lower loss.

Gradient Descent Update Rule
w_new = w_old − α · ∂L/∂w
α = learning rate (step size), ∂L/∂w = gradient of the loss with respect to weight w
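To make the update rule concrete, here is a minimal sketch that applies it to a toy one-dimensional loss. The loss L(w) = (w − 3)², the starting point, and the helper name gradient_descent are assumptions chosen for illustration; they do not come from the chapter.

```python
def gradient_descent(grad_fn, w0, alpha=0.1, steps=50):
    """Repeatedly apply w <- w - alpha * dL/dw and return the final w."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad_fn(w)
    return w

# Toy loss L(w) = (w - 3)^2, so dL/dw = 2 * (w - 3); the minimum is at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(round(w_final, 4))  # very close to 3
```

With α = 0.1 each step shrinks the distance to the minimum by a factor of 0.8, which is why 50 steps are enough here. A much larger α would overshoot, and a much smaller one would converge slowly.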

12.4.2 — Computation Graphs

A computation graph represents a neural network's forward pass as a directed acyclic graph (DAG). Each node represents an operation (matrix multiply, addition, activation function), and edges represent the flow of tensors.

Computation Graph: 2-Layer Network Forward Pass

      X        W[1]      b[1]
      │         │         │
      ▼         ▼         ▼
  ╔═══════════════════════════╗
  ║  Z[1] = W[1]·X + b[1]     ║
  ╚═════════════╤═════════════╝
                ▼
  ╔═══════════════════════════╗
  ║  A[1] = g(Z[1])   (ReLU)  ║
  ╚═════════════╤═════════════╝
                │        W[2]      b[2]
                ▼         ▼         ▼
  ╔═════════════════════════════════════╗
  ║  Z[2] = W[2]·A[1] + b[2]            ║
  ╚═════════════╤═══════════════════════╝
                ▼
  ╔═════════════════════════════════════╗
  ║  A[2] = σ(Z[2])   (Sigmoid)         ║
  ╚═════════════╤═══════════════════════╝
                │         Y
                ▼         ▼
  ╔═════════════════════════════════════╗
  ║  Loss = L(A[2], Y)                  ║
  ╚═════════════════════════════════════╝

FORWARD:  X ──► Z[1] ──► A[1] ──► Z[2] ──► A[2] ──► L
BACKWARD: X ◄── Z[1] ◄── A[1] ◄── Z[2] ◄── A[2] ◄── L
                (Gradients flow backward via chain rule)

12.4.3 — The Chain Rule — Heart of Backpropagation

The chain rule from calculus lets us compute the derivative of a composite function. If y = f(g(x)), then:

Chain Rule (Single Variable)
dy/dx = (dy/dg) · (dg/dx)

For neural networks, we have deeply nested compositions. The loss is a function of the output, which is a function of the pre-activation, which is a function of the weights and previous activations. The chain rule lets us compute the gradient with respect to any parameter by multiplying local gradients along the path from the loss to that parameter.

Multivariate Chain Rule
∂L/∂W[l] = (∂L/∂A[L]) · (∂A[L]/∂Z[L]) · (∂Z[L]/∂A[L-1]) · ... · (∂A[l]/∂Z[l]) · (∂Z[l]/∂W[l])
Each factor is a "local gradient" that is easy to compute at its node.
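The idea of multiplying local gradients along a path can be checked on a tiny example. The sketch below uses a single sigmoid neuron with a squared-error loss; the specific values of w, x, b, and y are arbitrary assumptions, and the squared-error loss is used only to keep the example short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Composite function: z = w*x + b, a = sigmoid(z), loss = (a - y)^2
w, x, b, y = 0.5, 2.0, -0.3, 1.0
z = w * x + b
a = sigmoid(z)

# Local gradients at each node
dloss_da = 2.0 * (a - y)
da_dz = a * (1.0 - a)
dz_dw = x

# Chain rule: multiply the local gradients along the path loss -> a -> z -> w
dloss_dw = dloss_da * da_dz * dz_dw

# Independent check with a central difference
def loss_at(w_val):
    return (sigmoid(w_val * x + b) - y) ** 2

eps = 1e-6
numeric = (loss_at(w + eps) - loss_at(w - eps)) / (2 * eps)
print(dloss_dw, numeric)  # the two values should agree to many decimal places
```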

12.4.4 — The Backpropagation Algorithm

Backpropagation works in two phases:

Forward Pass: Compute all intermediate values (Z[l], A[l]) from input to output, and compute the loss L.
Backward Pass: Starting from ∂L/∂A[L] at the output, propagate gradients backward through each layer using the chain rule. At each layer, compute ∂L/∂W[l] and ∂L/∂b[l] for parameter updates, and ∂L/∂A[l-1] to continue backward.

12.4.5 — Gradient Descent Variants

Once we have gradients, there are different strategies for updating weights:

Variant	Batch Size	Update Frequency	Pros	Cons
Batch GD	Entire dataset	Once per epoch	Stable, true gradient	Slow, memory-heavy
SGD	1 sample	Every sample	Fast updates, can escape local minima	Very noisy, unstable
Mini-batch GD	32-512	Every batch	Best of both, GPU-efficient	Needs batch size tuning
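The three variants differ only in how many examples feed each update. A minimal sketch of a shuffled mini-batch iterator follows; the function name iterate_minibatches and the toy data are assumptions for illustration, and examples are stored as columns to match the chapter's notation.

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs; examples are columns."""
    m = X.shape[1]
    order = rng.permutation(m)
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        yield X[:, idx], Y[:, idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))            # 4 features, 10 examples
Y = rng.integers(0, 2, size=(1, 10))

# batch_size = m gives Batch GD, batch_size = 1 gives SGD,
# anything in between gives Mini-batch GD.
batches = list(iterate_minibatches(X, Y, batch_size=4, rng=rng))
print([xb.shape for xb, _ in batches])  # [(4, 4), (4, 4), (4, 2)]
```

Note that the last batch is smaller when the batch size does not divide m evenly; a training loop should either accept that or drop the remainder.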

12.4.6 — Modern Optimizers

Plain SGD can be slow and oscillate. Modern optimizers address this by accumulating gradient history (momentum), by adapting the learning rate per parameter, or both:

Momentum: Accumulates past gradients to build "velocity" — helps push through narrow valleys
RMSprop: Adapts learning rate per-parameter based on recent gradient magnitudes — prevents divergence in steep dimensions
Adam: Combines Momentum + RMSprop with bias correction — the most widely used optimizer in deep learning

12.4.7 — Training Techniques

Batch Normalization

Normalizes each layer's inputs to have zero mean and unit variance, then applies learnable scale (γ) and shift (β). Benefits: faster training, higher learning rates, mild regularization.

Dropout

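Randomly sets a fraction of neurons to zero during training. This forces the network to learn redundant representations and acts as a powerful regularizer. At inference, all neurons are active. With inverted dropout (the variant shown in the dropout diagram in Section 12.8), kept activations are scaled by 1/keep_prob during training, so no scaling is needed at inference; in the original formulation, outputs are instead scaled by keep_prob at inference.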
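As a rough illustration, the following sketch implements the inverted-dropout variant, in which the kept activations are rescaled during training so that inference needs no adjustment. The function names and the keep_prob value are assumptions for this example, not part of the chapter's own implementation.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng, training=True):
    """Inverted dropout: zero units with prob 1 - keep_prob, rescale the rest."""
    if not training:
        return A, None                      # inference: identity, no scaling
    mask = rng.random(A.shape) < keep_prob  # True for units that are kept
    return A * mask / keep_prob, mask

def dropout_backward(dA, mask, keep_prob):
    """Route gradients only through the units that were kept."""
    return dA * mask / keep_prob

rng = np.random.default_rng(0)
A = np.ones((1000, 100))
A_drop, mask = dropout_forward(A, keep_prob=0.8, rng=rng)
print(A_drop.mean())  # close to 1.0: the expected activation is preserved
```

Dividing by keep_prob keeps the expected value of each activation unchanged, which is the reason the same weights can be used unmodified at inference.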

Gradient Problems

Vanishing gradients: Gradients shrink exponentially through layers (sigmoid's max derivative is 0.25 — after 10 layers: 0.25¹⁰ ≈ 10⁻⁶). Solution: ReLU, skip connections, better initialization.

Exploding gradients: Gradients grow exponentially. Solution: Gradient clipping, proper initialization.
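Gradient clipping can be done in several ways. The sketch below shows one common choice, clipping by the global L2 norm of all gradients together; the helper name clip_by_global_norm, the dictionary layout, and the threshold are assumptions for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return {name: g * scale for name, g in grads.items()}, total_norm

grads = {"dW1": np.array([[3.0, 4.0]]), "db1": np.array([[12.0]])}
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)  # 13.0
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped. Gradients whose norm is already below the threshold pass through unchanged.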

12.5 — Mathematical Foundation

12.5.1 — Notation and Setup

Consider an L-layer neural network with the following notation:

Standard Notation
n[l] = number of neurons in layer l (l = 0 is the input layer)
W[l] ∈ ℝ^(n[l] × n[l-1]) = weight matrix for layer l
b[l] ∈ ℝ^(n[l] × 1) = bias vector for layer l
Z[l] = W[l] · A[l-1] + b[l] (pre-activation)
A[l] = g[l](Z[l]) (activation, where g[l] is the activation function of layer l)
A[0] = X (input data)
m = number of training examples (columns of X)

12.5.2 — Loss Functions

Binary Cross-Entropy Loss (Single Example)

Binary Cross-Entropy
ℒ(ŷ, y) = −[y · log(ŷ) + (1 − y) · log(1 − ŷ)]
where ŷ = A[L] (final layer output, after sigmoid)

Cost Over m Examples

Average Cost Function
J(W, b) = (1/m) · Σᵢ₌₁ᵐ ℒ(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)

12.5.3 — Key Derivatives We Will Need

Function	Derivative	Notes
Sigmoid: σ(z) = 1/(1+e⁻ᶻ)	σ'(z) = σ(z)(1 − σ(z))	Max value 0.25 at z=0
ReLU: g(z) = max(0, z)	g'(z) = 1 if z > 0, else 0	No saturation for z > 0
Tanh: g(z) = tanh(z)	g'(z) = 1 − tanh²(z)	Max value 1 at z=0
Linear: Z = W·A + b	∂Z/∂W = Aᵀ, ∂Z/∂A = Wᵀ, ∂Z/∂b = I	Matrix calculus
BCE Loss: ℒ = −[y log ŷ + (1−y)log(1−ŷ)]	∂ℒ/∂ŷ = −y/ŷ + (1−y)/(1−ŷ)	Simplifies beautifully with sigmoid

12.5.4 — Matrix Calculus Conventions

In neural network backpropagation, we adopt the convention that dZ[l] has the same shape as Z[l], and similarly for all other gradient tensors. This makes implementation straightforward:

Shape Conventions
dW[l] has the same shape as W[l]: (n[l] × n[l-1])
db[l] has the same shape as b[l]: (n[l] × 1)
dA[l] has the same shape as A[l]: (n[l] × m)
dZ[l] has the same shape as Z[l]: (n[l] × m)
Notation: dX means ∂L/∂X throughout this chapter

12.6 — Formula Derivations

12.6.1 — Full Backpropagation for a 2-Layer Network

Let's derive every gradient for a 2-layer network (input → hidden → output). Architecture: n[0] inputs, n[1] hidden neurons (ReLU), n[2] = 1 output neuron (sigmoid), binary cross-entropy loss.

Forward Pass Equations

F1

Layer 1 pre-activation:

Z[1] = W[1] · X + b[1]

Shape: (n[1] × m) = (n[1] × n[0]) · (n[0] × m) + (n[1] × 1)

F2

Layer 1 activation:

A[1] = ReLU(Z[1]) = max(0, Z[1])

Shape: (n[1] × m)

F3

Layer 2 pre-activation:

Z[2] = W[2] · A[1] + b[2]

Shape: (1 × m) = (1 × n[1]) · (n[1] × m) + (1 × 1)

F4

Layer 2 activation (output):

A[2] = σ(Z[2]) = 1 / (1 + exp(−Z[2]))

Shape: (1 × m)

F5

Loss function:

J = −(1/m) · Σ [Y · log(A[2]) + (1−Y) · log(1−A[2])]

Backward Pass — Output Layer (Layer 2)

B1

Compute dA[2] = ∂L/∂A[2]:

From the loss ℒ = −[y·log(a) + (1−y)·log(1−a)]:

dA[2] = ∂ℒ/∂A[2] = −Y/A[2] + (1−Y)/(1−A[2])

Shape: (1 × m). This is the starting point of backpropagation.

B2

Compute dZ[2] = ∂L/∂Z[2]:

By chain rule: dZ[2] = dA[2] ⊙ σ'(Z[2]), where σ'(z) = σ(z)(1−σ(z)):

dZ[2] = dA[2] ⊙ A[2] ⊙ (1 − A[2])

Expanding: = [−Y/A[2] + (1−Y)/(1−A[2])] ⊙ A[2] ⊙ (1−A[2])

Simplifying: = −Y·(1−A[2]) + (1−Y)·A[2] = −Y + Y·A[2] + A[2] − Y·A[2]

dZ[2] = A[2] − Y

✨ This beautiful simplification is why sigmoid + cross-entropy is a natural pair! Shape: (1 × m).
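The simplification can be confirmed numerically. In this sketch the loss is written as a function of the pre-activation z, and its derivative is estimated with central differences; the test values of z and y are arbitrary assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(z, y):
    """Binary cross-entropy as a function of the pre-activation z."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z = np.array([-2.0, -0.175, 0.5, 3.0])
y = np.array([1.0, 1.0, 0.0, 0.0])

analytic = sigmoid(z) - y                      # dZ = A - Y
eps = 1e-6
numeric = (bce(z + eps, y) - bce(z - eps, y)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))      # should be tiny
```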

B3

Compute dW[2] = ∂L/∂W[2]:

Since Z[2] = W[2]·A[1] + b[2], by the chain rule:

∂L/∂W[2] = ∂L/∂Z[2] · ∂Z[2]/∂W[2] = dZ[2] · A[1]ᵀ (averaged over m examples):

dW[2] = (1/m) · dZ[2] · A[1]ᵀ

Shape: (1 × n[1]) = (1 × m) · (m × n[1]). ✓ Same shape as W[2].

B4

Compute db[2] = ∂L/∂b[2]:

Since ∂Z[2]/∂b[2] = 1, we simply sum dZ[2] across examples:

db[2] = (1/m) · Σⱼ dZ[2][:,j] = (1/m) · np.sum(dZ[2], axis=1, keepdims=True)

Shape: (1 × 1). ✓ Same shape as b[2].

Backward Pass — Hidden Layer (Layer 1)

B5

Compute dA[1] = ∂L/∂A[1]:

Since Z[2] = W[2]·A[1] + b[2], we need ∂Z[2]/∂A[1] = W[2]ᵀ:

dA[1] = W[2]ᵀ · dZ[2]

Shape: (n[1] × m) = (n[1] × 1) · (1 × m). ✓ Same shape as A[1].

This is where gradients "propagate backward" through the network!

B6

Compute dZ[1] = ∂L/∂Z[1]:

By chain rule: dZ[1] = dA[1] ⊙ g'[1](Z[1]). For ReLU, g'(z) = 1 if z > 0, else 0:

dZ[1] = dA[1] ⊙ (Z[1] > 0)

Shape: (n[1] × m). The indicator function (Z[1] > 0) creates a binary mask.

B7

Compute dW[1] = ∂L/∂W[1]:

Same pattern as dW[2], but using dZ[1] and A[0] = X:

dW[1] = (1/m) · dZ[1] · Xᵀ

Shape: (n[1] × n[0]) = (n[1] × m) · (m × n[0]). ✓ Same shape as W[1].

B8

Compute db[1] = ∂L/∂b[1]:

db[1] = (1/m) · np.sum(dZ[1], axis=1, keepdims=True)

Shape: (n[1] × 1). ✓ Same shape as b[1].

12.6.2 — General Backpropagation for L-Layer Network

🔮 The General Backpropagation Formula Sheet

For any layer l in an L-layer network, the backward pass follows this pattern:

General Formulas (Layer l, from l = L down to l = 1)
dZ[L] = A[L] − Y (output layer with sigmoid + BCE)
dZ[l] = dA[l] ⊙ g'[l](Z[l]) (for hidden layers)
dW[l] = (1/m) · dZ[l] · A[l-1]ᵀ
db[l] = (1/m) · np.sum(dZ[l], axis=1, keepdims=True)
dA[l-1] = W[l]ᵀ · dZ[l]

This set of 5 equations is all you need to implement backpropagation for any feedforward network, regardless of depth. Each layer reuses the same formulas with different weight matrices and activation functions.

12.6.3 — Deriving Optimizers from First Principles

Momentum

Problem with vanilla SGD: In loss surfaces with narrow valleys, SGD oscillates perpendicular to the optimal direction, making slow progress along the valley. We want to dampen oscillations and amplify consistent directions.

Inspiration: A ball rolling downhill accumulates velocity. Past gradients in the same direction make it go faster; contradictory gradients slow it down.

Deriving Momentum from Exponential Moving Average

Step 1: Define a velocity vector v that is an exponential moving average (EMA) of past gradients:

v_t = β · v_{t-1} + (1 − β) · ∇J(θ_t)

β ∈ [0, 1) controls how much history we remember. Typical: β = 0.9 (averages ~10 steps).

Step 2: Expand the EMA to see what v_t represents:

v_t = (1−β)[∇J_t + β·∇J_{t-1} + β²·∇J_{t-2} + ... + βᵗ·∇J_0]

Recent gradients have exponentially more weight than older ones.

Step 3: Update parameters using the velocity:

θ_{t+1} = θ_t − α · v_t

If gradients consistently point the same direction → v grows → faster progress.
If gradients oscillate → v stays small → damped oscillations.
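The damping effect is easy to see in a small experiment. This sketch implements the EMA form of momentum derived above and feeds it first a constant gradient, then an alternating one; the function name momentum_step and the toy gradients are assumptions for illustration.

```python
import numpy as np

def momentum_step(theta, grad, v, alpha=0.1, beta=0.9):
    """v <- beta*v + (1-beta)*grad, then theta <- theta - alpha*v."""
    v = beta * v + (1.0 - beta) * grad
    return theta - alpha * v, v

# Consistent gradients: the velocity builds up toward the gradient value.
theta, v = np.array([0.0]), np.array([0.0])
for _ in range(50):
    theta, v = momentum_step(theta, np.array([1.0]), v)
v_consistent = v[0]

# Alternating gradients: contributions largely cancel, velocity stays small.
theta, v = np.array([0.0]), np.array([0.0])
for t in range(50):
    g = np.array([1.0 if t % 2 == 0 else -1.0])
    theta, v = momentum_step(theta, g, v)
v_oscillating = v[0]

print(v_consistent, v_oscillating)
```

With the (1 − β) factor included, the velocity is bounded by the gradient magnitude, so "v grows" here means it approaches the size of the consistent gradient rather than exceeding it.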

RMSprop (Root Mean Square Propagation)

Problem: Some parameters need large updates (flat loss surface) while others need small updates (steep surface). We need per-parameter adaptive learning rates.

Deriving RMSprop

Step 1: Track the EMA of squared gradients (measures gradient magnitude per parameter):

s_t = β₂ · s_{t-1} + (1 − β₂) · (∇J)²

Here (∇J)² is element-wise squaring. The decay rate β₂ is commonly set to 0.9 for standalone RMSprop; Adam uses 0.999 for the same quantity.

Step 2: √s_t estimates the RMS of recent gradients. Divide the gradient by this to normalize:

θ_{t+1} = θ_t − α · ∇J / (√s_t + ε)

ε ≈ 10⁻⁸ prevents division by zero.

Step 3: Observe the effect. Parameters with large gradients → large s → smaller effective learning rate.
Parameters with small gradients → small s → larger effective learning rate.

This automatically adapts the learning rate per-parameter!
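A short sketch shows the normalization at work on two parameters whose gradients differ by a factor of 1000. The function name rmsprop_step and the gradient values are assumptions; the decay rate of 0.9 is also an assumption, chosen so that the moving average warms up within a few dozen steps.

```python
import numpy as np

def rmsprop_step(theta, grad, s, alpha=0.01, beta2=0.9, eps=1e-8):
    """One RMSprop update: normalize the gradient by its running RMS."""
    s = beta2 * s + (1.0 - beta2) * grad ** 2
    theta = theta - alpha * grad / (np.sqrt(s) + eps)
    return theta, s

theta = np.array([1.0, 1.0])
s = np.zeros(2)
grad = np.array([10.0, 0.01])      # very different gradient magnitudes

for _ in range(100):
    new_theta, s = rmsprop_step(theta, grad, s)
    step = theta - new_theta       # size of the most recent update
    theta = new_theta

print(step)  # both entries are close to alpha = 0.01
```

Once the running average has warmed up, the step size depends on the learning rate rather than on the raw gradient scale, which is exactly the per-parameter adaptation described above.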

Adam (Adaptive Moment Estimation)

Adam combines the best of Momentum (first moment) and RMSprop (second moment), with bias correction to handle the initialization bias.

Deriving Adam

Step 1: First moment (mean): EMA of gradients (like Momentum):

m_t = β₁ · m_{t-1} + (1 − β₁) · ∇J_t

Initialize: m₀ = 0. Typical β₁ = 0.9.

Step 2: Second moment (uncentered variance): EMA of squared gradients (like RMSprop):

v_t = β₂ · v_{t-1} + (1 − β₂) · (∇J_t)²

Initialize: v₀ = 0. Typical β₂ = 0.999.

Step 3: Bias correction: Since m₀ = v₀ = 0, the early estimates are biased toward zero. Correct this:

m̂_t = m_t / (1 − β₁ᵗ)

v̂_t = v_t / (1 − β₂ᵗ)

At t=1: m̂₁ = m₁/(1−0.9) = 10·m₁ — compensates for the zero initialization.

As t → ∞: (1 − βᵗ) → 1 — correction vanishes, as the EMA has warmed up.

Step 4: Parameter update:

θ_{t+1} = θ_t − α · m̂_t / (√v̂_t + ε)

This gives us adaptive learning rates (from v̂) with momentum (from m̂).
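Put together, the four steps fit in a few lines. The sketch below is one straightforward way to write a single Adam update; the function name adam_step and its argument order are assumptions for illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1). Returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second moment
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([0.0, 0.0])
m, v = np.zeros(2), np.zeros(2)
theta, m, v = adam_step(theta, np.array([0.1, -0.3]), m, v, t=1)
print(theta)  # each parameter moves by about alpha, opposite its gradient sign
```

At t = 1 the bias-corrected moments reduce to the gradient and its square, so the first update has magnitude close to α for every parameter regardless of gradient scale. Without bias correction the first step would be much smaller or larger than intended, depending on β₁ and β₂.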

12.6.4 — Batch Normalization Derivation

Batch Normalization Forward Pass

Step 1: Compute batch mean:

μ_B = (1/m) · Σᵢ₌₁ᵐ z_i

Step 2: Compute batch variance:

σ²_B = (1/m) · Σᵢ₌₁ᵐ (z_i − μ_B)²

Step 3: Normalize:

ẑ_i = (z_i − μ_B) / √(σ²_B + ε)

Step 4: Scale and shift (learnable parameters γ and β):

y_i = γ · ẑ_i + β

γ and β are learned during training. They allow the network to undo the normalization if needed (when γ = √(σ²_B + ε) and β = μ_B, it recovers the original activations).
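The four forward-pass steps map directly onto NumPy operations. This is a minimal training-mode sketch only: it normalizes over the batch axis and does not track the running statistics that would be needed at inference. The function name batchnorm_forward is an assumption, and the input values mirror Example 3 in Section 12.7.

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """Normalize over the batch axis (last axis), then scale and shift."""
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)        # biased variance: divides by m
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta, z_hat

z = np.array([[2.0, 4.0, 6.0, 8.0]])           # one feature, batch of 4
y, z_hat = batchnorm_forward(z, gamma=1.5, beta=0.3)
print(np.round(z_hat, 4))
print(np.round(y, 4))
print(z_hat.mean(), z_hat.var())               # approximately 0 and 1
```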

12.6.5 — Learning Rate Schedules

Learning Rate Schedule Formulas
Step Decay: α_t = α₀ · factor^⌊epoch/step_size⌋
Exponential Decay: α_t = α₀ · e^(−k·t)
Cosine Annealing: α_t = α_min + ½(α_max − α_min)(1 + cos(π·t/T))
Linear Warmup: α_t = α_max · (t / t_warmup) for t ≤ t_warmup
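Each schedule is a one-line function of the step or epoch counter. The sketch below writes them out directly from the formulas; the function names and default constants are assumptions, and holding the rate at α_max after the warmup period ends is also an assumption, since the formula only covers t ≤ t_warmup.

```python
import math

def step_decay(alpha0, epoch, factor=0.5, step_size=10):
    return alpha0 * factor ** (epoch // step_size)

def exponential_decay(alpha0, t, k=0.05):
    return alpha0 * math.exp(-k * t)

def cosine_annealing(t, T, alpha_max, alpha_min=0.0):
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1.0 + math.cos(math.pi * t / T))

def linear_warmup(t, t_warmup, alpha_max):
    # Assumption: hold at alpha_max once warmup is over.
    return alpha_max * min(1.0, t / t_warmup)

print(step_decay(0.1, epoch=25))           # 0.1 * 0.5**2 = 0.025
print(cosine_annealing(0, 100, 0.1))       # starts at alpha_max
print(cosine_annealing(100, 100, 0.1))     # ends at alpha_min
print(linear_warmup(50, 100, 0.1))         # halfway through warmup
```

In practice warmup is usually chained with one of the decay schedules: warm up for the first t_warmup steps, then hand over to cosine annealing or step decay for the remainder of training.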

12.7 — Worked Numerical Examples

Example 1: Full Backpropagation for a 2-Layer Network

Architecture: 4 inputs → 3 hidden (ReLU) → 1 output (sigmoid). Single training example.

Given Values

Input: X = [1.0, 0.5, −1.0, 2.0]ᵀ (4×1)

True label: Y = 1

Weights (initialized):

W[1] = [[ 0.2, -0.1,  0.4,  0.1],     # 3×4
        [ 0.3,  0.2, -0.3,  0.2],
        [-0.1,  0.5,  0.1, -0.2]]

b[1] = [[ 0.1],                         # 3×1
        [-0.1],
        [ 0.0]]

W[2] = [[ 0.5, -0.3,  0.8]]            # 1×3

b[2] = [[ 0.1]]                          # 1×1

Forward Pass

Step F1: Z[1] = W[1]·X + b[1]

Z[1] = [[0.2×1.0 + (−0.1)×0.5 + 0.4×(−1.0) + 0.1×2.0 + 0.1],
        [0.3×1.0 + 0.2×0.5 + (−0.3)×(−1.0) + 0.2×2.0 + (−0.1)],
        [(−0.1)×1.0 + 0.5×0.5 + 0.1×(−1.0) + (−0.2)×2.0 + 0.0]]

     = [[ 0.2 − 0.05 − 0.4 + 0.2 + 0.1],
        [ 0.3 + 0.1 + 0.3 + 0.4 − 0.1],
        [−0.1 + 0.25 − 0.1 − 0.4 + 0.0]]

Z[1] = [[ 0.05],     # 3×1
        [ 1.00],
        [-0.35]]

Step F2: A[1] = ReLU(Z[1])

A[1] = [[max(0, 0.05)],    = [[0.05],
        [max(0, 1.00)],       [1.00],
        [max(0, -0.35)]]      [0.00]]     # This neuron is "dead" (ReLU killed it)

Step F3: Z[2] = W[2]·A[1] + b[2]

Z[2] = 0.5×0.05 + (−0.3)×1.00 + 0.8×0.00 + 0.1
     = 0.025 − 0.3 + 0 + 0.1
     = −0.175

Step F4: A[2] = σ(Z[2]) = σ(−0.175)

A[2] = 1 / (1 + exp(0.175)) = 1 / (1 + 1.1912) = 1 / 2.1912 = 0.4564

Step F5: Loss

L = −[1·log(0.4564) + 0·log(0.5436)]
  = −log(0.4564)                       (log is the natural logarithm)
  = −(−0.784) = 0.784

Backward Pass

Step B1-B2: dZ[2] = A[2] − Y

dZ[2] = 0.4564 − 1 = −0.5436

Step B3: dW[2] = (1/m)·dZ[2]·A[1]ᵀ (m=1)

dW[2] = (−0.5436) × [0.05, 1.00, 0.00]
      = [−0.0272, −0.5436, 0.0000]    # Shape: 1×3 ✓

Step B4: db[2] = dZ[2]

db[2] = −0.5436

Step B5: dA[1] = W[2]ᵀ · dZ[2]

dA[1] = [[ 0.5],     × (−0.5436) = [[-0.2718],
         [-0.3],                      [ 0.1631],
         [ 0.8]]                      [-0.4349]]

Step B6: dZ[1] = dA[1] ⊙ (Z[1] > 0)

ReLU mask = [[1],    (Z[1]=0.05 > 0 ✓)
             [1],    (Z[1]=1.00 > 0 ✓)
             [0]]    (Z[1]=−0.35 ≤ 0 ✗)

dZ[1] = [[-0.2718],    ⊙  [[1],     = [[-0.2718],
         [ 0.1631],        [1],        [ 0.1631],
         [-0.4349]]        [0]]        [ 0.0000]]

Step B7: dW[1] = (1/m)·dZ[1]·Xᵀ

dW[1] = [[-0.2718],   × [1.0, 0.5, -1.0, 2.0]
         [ 0.1631],
         [ 0.0000]]

     = [[-0.2718, -0.1359,  0.2718, -0.5436],     # 3×4 ✓
        [ 0.1631,  0.0816, -0.1631,  0.3262],
        [ 0.0000,  0.0000,  0.0000,  0.0000]]

Step B8: db[1] = dZ[1]

db[1] = [[-0.2718],
         [ 0.1631],
         [ 0.0000]]

Weight Update (α = 0.1)

W[2]_new = W[2] − 0.1 × dW[2]
         = [0.5, −0.3, 0.8] − 0.1 × [−0.0272, −0.5436, 0.0]
         = [0.5027, −0.2456, 0.8000]

b[2]_new = 0.1 − 0.1 × (−0.5436) = 0.1544

(Similarly for W[1] and b[1])
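The whole of Example 1 can be reproduced in a few lines of NumPy, which is a convenient way to check hand calculations. The sketch below uses the given values and the F1-F5 and B1-B8 formulas; the variable names are chosen for this example.

```python
import numpy as np

X = np.array([[1.0], [0.5], [-1.0], [2.0]])
Y = np.array([[1.0]])
W1 = np.array([[0.2, -0.1, 0.4, 0.1],
               [0.3, 0.2, -0.3, 0.2],
               [-0.1, 0.5, 0.1, -0.2]])
b1 = np.array([[0.1], [-0.1], [0.0]])
W2 = np.array([[0.5, -0.3, 0.8]])
b2 = np.array([[0.1]])
m = X.shape[1]

# Forward pass (F1-F5)
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)
Z2 = W2 @ A1 + b2
A2 = 1 / (1 + np.exp(-Z2))
loss = float(np.mean(-(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))))

# Backward pass (B1-B8)
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dA1 = W2.T @ dZ2
dZ1 = dA1 * (Z1 > 0)
dW1 = dZ1 @ X.T / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m

# One gradient descent step with alpha = 0.1
alpha = 0.1
W2_new = W2 - alpha * dW2
b2_new = b2 - alpha * db2

print(np.round(Z1.ravel(), 4), np.round(A2, 4), round(loss, 4))
print(np.round(dW2, 4), np.round(W2_new, 4), np.round(b2_new, 4))
```

Small differences in the last decimal place relative to the hand calculation are expected, because the worked example rounds intermediate values to four decimals while the code carries full precision.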

Example 2: Adam Optimizer Update Step

Given: Gradient ∇J = [0.1, −0.3], step t=2, α=0.001, β₁=0.9, β₂=0.999, ε=10⁻⁸

Previous: m₁ = [0.02, −0.06], v₁ = [0.0001, 0.0009]

Step 1: Update first moment (m₂)

m₂ = 0.9 × [0.02, −0.06] + 0.1 × [0.1, −0.3]
   = [0.018, −0.054] + [0.01, −0.03]
   = [0.028, −0.084]

Step 2: Update second moment (v₂)

v₂ = 0.999 × [0.0001, 0.0009] + 0.001 × [0.01, 0.09]
   = [0.0000999, 0.0008991] + [0.00001, 0.00009]
   = [0.0001099, 0.0009891]

Step 3: Bias correction

m̂₂ = m₂ / (1 − 0.9²) = [0.028, −0.084] / 0.19 = [0.1474, −0.4421]
v̂₂ = v₂ / (1 − 0.999²) = [0.0001099, 0.0009891] / 0.001999
   = [0.05498, 0.49480]

Step 4: Update

θ₃ = θ₂ − 0.001 × m̂₂ / (√v̂₂ + ε)
   = θ₂ − 0.001 × [0.1474, −0.4421] / [0.2345, 0.7034]
   = θ₂ − 0.001 × [0.6285, −0.6285]
   = θ₂ − [0.000629, −0.000629]

Notice: Both parameters get updates of nearly identical magnitude (~0.0006, on the order of α) despite having
gradients that differ by a factor of 3 (0.1 vs −0.3). This is Adam's adaptive property!

Example 3: Batch Normalization Forward Pass

Given: Mini-batch of 4 pre-activations, γ = 1.5, β = 0.3, ε = 10⁻⁵

z = [2.0, 4.0, 6.0, 8.0]

Step 1: Batch mean
μ_B = (2 + 4 + 6 + 8) / 4 = 5.0

Step 2: Batch variance
σ²_B = [(2−5)² + (4−5)² + (6−5)² + (8−5)²] / 4
     = [9 + 1 + 1 + 9] / 4 = 5.0

Step 3: Normalize
ẑ = (z − 5.0) / √(5.0 + 0.00001) = (z − 5.0) / 2.23607
ẑ = [−1.3416, −0.4472, 0.4472, 1.3416]

Step 4: Scale and shift
y = 1.5 × ẑ + 0.3
y = [−1.7124, −0.3708, 0.9708, 2.3124]

Verification: mean(ẑ) = 0 ✓, var(ẑ) ≈ 1 ✓

12.8 — Visual Diagrams

Backpropagation Gradient Flow

  FORWARD PASS (left to right) ──────────────────────────────────►

  ┌───────┐    ┌─────────────┐    ┌───────────┐    ┌─────────────┐    ┌──────┐
  │   X   │───►│ Z[1]=W[1]X  │───►│A[1]=g(Z[1])│──►│Z[2]=W[2]A[1]│──►│A[2]=σ│──►Loss
  │(input)│    │    +b[1]    │    │   (ReLU)   │    │    +b[2]    │    │      │
  └───────┘    └─────────────┘    └───────────┘    └─────────────┘    └──────┘
                     │                  │                 │                │
                     │ Caches:          │ Caches:         │ Caches:        │
                     │ W[1],X,b[1]      │ Z[1]           │ W[2],A[1],b[2] │
                     │                  │                 │                │
  ◄────────────── BACKWARD PASS (right to left) ────────────────────────────

  dA[0]←──dZ[1]←──dA[1]←───────────dZ[2]←──────────dA[2]←────dL/dA[2]
  (not    ▲│▲      ▲                ▲│▲             ▲
  needed) ││       │                ││              │
         dW[1]    Uses             dW[2]           dZ[2]=
         db[1]    ReLU'            db[2]           A[2]−Y
                  mask

Gradient Descent Variants Comparison

  BATCH GRADIENT DESCENT               STOCHASTIC GD (SGD)
  ┌─────────────────────┐              ┌─────────────────────┐
  │ Process ALL examples│              │ Process ONE example  │
  │ ┌──┬──┬──┬──┬──┬──┐│              │ ┌──┐                 │
  │ │x1│x2│x3│x4│x5│x6││              │ │x1│ → update        │
  │ └──┴──┴──┴──┴──┴──┘│              │ └──┘                 │
  │        │            │              │ ┌──┐                 │
  │  Compute average    │              │ │x2│ → update        │
  │     gradient        │              │ └──┘                 │
  │        │            │              │ ┌──┐                 │
  │  ONE update/epoch   │              │ │x3│ → update        │
  └─────────────────────┘              │ └──┘                 │
                                       │  6 updates/epoch     │
                                       └─────────────────────┘

  MINI-BATCH GD (batch_size=2)
  ┌──────────────────────────────────────────────┐
  │ ┌──┬──┐    ┌──┬──┐    ┌──┬──┐               │
  │ │x1│x2│    │x3│x4│    │x5│x6│               │
  │ └──┴──┘    └──┴──┘    └──┴──┘               │
  │    │          │          │                    │
  │ update     update     update                  │
  │                                               │
  │ 3 updates/epoch — BEST OF BOTH WORLDS!       │
  └──────────────────────────────────────────────┘

Optimizer Behavior in a Loss Landscape

  Loss contours (top view of a valley):

       ╭─────────────────────────╮
      ╱   ╭───────────────╮       ╲
     ╱   ╱  ╭─────────╮    ╲       ╲
    │   │  ╱  ╭─────╮  ╲    │       │
    │   │ │  │ MIN ★│   │   │       │
    │   │  ╲  ╰─────╯  ╱    │       │
     ╲   ╲  ╰─────────╯    ╱       ╱
      ╲   ╰───────────────╯       ╱
       ╰─────────────────────────╯

  SGD path:        ○─╲─╱─╲─╱─╲─╱─★   (oscillates side to side)
  Momentum path:   ○──────╲──╱──★      (smoother, builds velocity)
  Adam path:       ○────────────★      (adapts, goes nearly straight)

Vanishing vs Exploding Gradients

  VANISHING GRADIENTS (sigmoid activation):
  ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
  │  L5 │──►│  L4 │──►│  L3 │──►│  L2 │──►│  L1 │
  │     │   │     │   │     │   │     │   │     │
  │grad │   │grad │   │grad │   │grad │   │grad │
  │=1.0 │   │×0.25│   │×0.25│   │×0.25│   │×0.25│
  │     │   │=0.25│   │=0.06│   │=0.02│   │=0.004│  ← Nearly zero!
  └─────┘   └─────┘   └─────┘   └─────┘   └─────┘

  EXPLODING GRADIENTS (large weights):
  ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐   ┌─────┐
  │  L5 │──►│  L4 │──►│  L3 │──►│  L2 │──►│  L1 │
  │     │   │     │   │     │   │     │   │     │
  │grad │   │grad │   │grad │   │grad │   │grad │
  │=1.0 │   │ ×3  │   │ ×3  │   │ ×3  │   │ ×3  │
  │     │   │=3   │   │=9   │   │=27  │   │=81  │  ← Explodes!
  └─────┘   └─────┘   └─────┘   └─────┘   └─────┘

Dropout — Training vs Inference

  TRAINING (dropout rate = 0.5):           INFERENCE (no dropout):
  ┌─┐  ┌─┐  ┌─┐  ┌─┐  ┌─┐                ┌─┐  ┌─┐  ┌─┐  ┌─┐  ┌─┐
  │●│  │╳│  │●│  │╳│  │●│                │●│  │●│  │●│  │●│  │●│
  └─┘  └─┘  └─┘  └─┘  └─┘                └─┘  └─┘  └─┘  └─┘  └─┘
   │         │         │                    │    │    │    │    │
  ┌─┐  ┌─┐  ┌─┐  ┌─┐  ┌─┐                ┌─┐  ┌─┐  ┌─┐  ┌─┐  ┌─┐
  │╳│  │●│  │╳│  │●│  │╳│                │●│  │●│  │●│  │●│  │●│
  └─┘  └─┘  └─┘  └─┘  └─┘                └─┘  └─┘  └─┘  └─┘  └─┘
   ╳ = dropped (zeroed out)                 All neurons active
   ● = kept (multiplied by 1/keep_prob)     (no scaling needed with
                                             inverted dropout)

Batch Normalization — Where It Goes

  Without Batch Norm:              With Batch Norm:
  ┌──────────┐                     ┌──────────┐
  │ Z = W·A  │                     │ Z = W·A  │
  │   + b    │                     │   + b    │
  └────┬─────┘                     └────┬─────┘
       │                                │
       ▼                           ┌────▼─────┐
  ┌──────────┐                     │BatchNorm │
  │A = g(Z)  │                     │ẑ=(Z−μ)/σ │
  │(activate)│                     │y=γẑ+β    │
  └──────────┘                     └────┬─────┘
                                        │
                                   ┌────▼─────┐
                                   │A = g(y)  │
                                   │(activate)│
                                   └──────────┘

12.9 — Flowcharts

Complete Training Loop Flowchart

                    ┌─────────────────────┐
                    │   Initialize        │
                    │   weights & biases  │
                    │   (He/Xavier init)  │
                    └──────────┬──────────┘
                               │
                    ┌──────────▼──────────┐
              ┌────►│  for epoch in       │
              │     │  range(num_epochs): │
              │     └──────────┬──────────┘
              │                │
              │     ┌──────────▼──────────┐
              │     │  Shuffle data       │
              │     │  Create mini-batches│
              │     └──────────┬──────────┘
              │                │
              │     ┌──────────▼──────────────┐
              │  ┌─►│  for batch in batches:  │
              │  │  └──────────┬──────────────┘
              │  │             │
              │  │  ┌──────────▼──────────┐
              │  │  │  FORWARD PASS       │
              │  │  │Z[l]=W[l]·A[l-1]+b[l]│
              │  │  │  A[l] = g(Z[l])     │
              │  │  │  Compute Loss       │
              │  │  └──────────┬──────────┘
              │  │             │
              │  │  ┌──────────▼──────────┐
              │  │  │  BACKWARD PASS      │
              │  │  │  dZ, dW, db for     │
              │  │  │  each layer         │
              │  │  └──────────┬──────────┘
              │  │             │
              │  │  ┌──────────▼──────────┐
              │  │  │  UPDATE WEIGHTS     │
              │  │  │  (SGD/Adam/etc.)    │
              │  │  │  Update LR schedule │
              │  │  └──────────┬──────────┘
              │  │             │
              │  │  ┌──────────▼──────────┐
              │  │  │  More batches?      │
              │  │  └────┬───────────┬────┘
              │  │       │ Yes       │ No
              │  └───────┘           │
              │           ┌──────────▼──────────┐
              │           │  Log metrics        │
              │           │  Validate on dev set│
              │           │  Check early stop   │
              │           └──────────┬──────────┘
              │                      │
              │           ┌──────────▼──────────┐
              │           │  Converged?         │
              │           └────┬───────────┬────┘
              │                │ No        │ Yes
              └────────────────┘           │
                                ┌──────────▼──────────┐
                                │  Save model         │
                                │  Report results     │
                                └─────────────────────┘

Optimizer Selection Decision Flowchart

                    ┌─────────────────┐
                    │  Choose         │
                    │  Optimizer      │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ Is model large  │
                    │ (>1M params)?   │
                    └──┬──────────┬───┘
                   Yes │          │ No
              ┌────────▼──────┐   │
              │ Use Adam or   │   │
              │ AdamW         │   ▼
              │ (default for  │  ┌───────────────┐
              │  most DL)     │  │ Is data small? │
              └───────────────┘  └──┬─────────┬──┘
                                Yes │         │ No
                           ┌───────▼──────┐  ┌▼──────────────┐
                           │ SGD +        │  │ Mini-batch    │
                           │ Momentum     │  │ SGD with      │
                           │ (often best  │  │ Momentum or   │
                           │ generalize)  │  │ Adam          │
                           └──────────────┘  └───────────────┘

  Rules of thumb:
  • Adam:     Default for most deep learning tasks
  • SGD+Mom:  Often better final accuracy (with careful tuning)
  • AdamW:    Adam + weight decay (better for transformers)
  • LAMB:     For very large batch sizes (distributed training)

Gradient Checking Algorithm

  For each parameter θᵢ:

  ┌────────────────────────────────────┐
  │  1. Compute analytic gradient      │
  │     g_analytic = backprop(θ)       │
  └──────────────┬─────────────────────┘
                 │
  ┌──────────────▼─────────────────────┐
  │  2. Compute numerical gradient     │
  │     θ⁺ = θ; θ⁺[i] += ε           │
  │     θ⁻ = θ; θ⁻[i] -= ε           │
  │     g_numerical = (J(θ⁺) - J(θ⁻)) │
  │                  / (2ε)            │
  └──────────────┬─────────────────────┘
                 │
  ┌──────────────▼─────────────────────┐
  │  3. Check difference               │
  │     diff = ‖g_a − g_n‖₂           │
  │          / (‖g_a‖₂ + ‖g_n‖₂)      │
  │                                     │
  │  diff < 10⁻⁷  → CORRECT ✓         │
  │  diff < 10⁻⁵  → SUSPICIOUS ⚠      │
  │  diff > 10⁻³  → BUG! ✗            │
  └─────────────────────────────────────┘
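
The three steps above can be tried on a model small enough to differentiate by hand. This is a minimal sketch under stated assumptions: a logistic model with binary cross-entropy, random data, and helper names (`loss`, `analytic_grad`, `numerical_grad`, `relative_diff`) chosen for this example only. A deliberately wrong gradient is included to show what a failed check looks like.

```python
import numpy as np

def loss(theta, X, y):
    """Mean binary cross-entropy of a logistic model with weights theta."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(theta, X, y):
    """Gradient derived by hand: X^T (p - y) / m."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ (p - y) / len(y)

def numerical_grad(f, theta, eps=1e-6):
    """Central differences, one parameter at a time."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        g[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return g

def relative_diff(g_a, g_n):
    """Step 3 of the algorithm: normalized distance between the two gradients."""
    return np.linalg.norm(g_a - g_n) / (np.linalg.norm(g_a) + np.linalg.norm(g_n))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
theta = rng.normal(size=3)

g_n = numerical_grad(lambda th: loss(th, X, y), theta)
good = relative_diff(analytic_grad(theta, X, y), g_n)
bad = relative_diff(2.0 * analytic_grad(theta, X, y), g_n)  # deliberate bug: wrong factor

print(f"correct gradient: diff = {good:.2e}")
print(f"buggy gradient:   diff = {bad:.2e}")
```

A gradient that is off by a constant factor of 2 gives a relative difference near 1/3, far above any of the thresholds, while the correct gradient lands several orders of magnitude below them.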

12.10 — Python Implementation from Scratch

12.10.1 — Complete Neural Network with Backpropagation

Python — Backpropagation from Scratch

import numpy as np

# ==========================================
# COMPLETE NEURAL NETWORK WITH BACKPROPAGATION
# ==========================================

class DeepNeuralNetwork:
    """
    L-layer neural network with backpropagation.
    Supports: ReLU, Sigmoid, Tanh activations
    Optimizers: SGD, Momentum, Adam
    Regularization: L2, Dropout
    Output: sigmoid + binary cross-entropy (assumed by backward())
    Batch Normalization is implemented separately in Section 12.10.2.
    """

    def __init__(self, layer_dims, activations=None):
        """
        layer_dims: list of integers [n_x, n_h1, n_h2, ..., n_y]
        activations: list of strings ['relu', 'relu', ..., 'sigmoid']
        """
        self.L = len(layer_dims) - 1  # Number of layers
        self.layer_dims = layer_dims
        self.parameters = {}
        self.caches = {}
        self.grads = {}

        # Default activations: ReLU hidden, Sigmoid output
        if activations is None:
            self.activations = ['relu'] * (self.L - 1) + ['sigmoid']
        else:
            self.activations = activations

        self._initialize_parameters()

    def _initialize_parameters(self):
        """He initialization for ReLU, Xavier for Sigmoid/Tanh"""
        np.random.seed(42)
        for l in range(1, self.L + 1):
            if self.activations[l-1] == 'relu':
                # He initialization: scale by sqrt(2/n_prev)
                scale = np.sqrt(2.0 / self.layer_dims[l-1])
            else:
                # Xavier initialization: scale by sqrt(1/n_prev)
                scale = np.sqrt(1.0 / self.layer_dims[l-1])

            self.parameters[f'W{l}'] = np.random.randn(
                self.layer_dims[l], self.layer_dims[l-1]
            ) * scale
            self.parameters[f'b{l}'] = np.zeros((self.layer_dims[l], 1))

    # ---- Activation Functions ----

    def _sigmoid(self, Z):
        A = 1.0 / (1.0 + np.exp(-np.clip(Z, -500, 500)))
        return A

    def _relu(self, Z):
        return np.maximum(0, Z)

    def _tanh(self, Z):
        return np.tanh(Z)

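    def _activation_forward(self, Z, activation):
        if activation == 'sigmoid':
            return self._sigmoid(Z)
        elif activation == 'relu':
            return self._relu(Z)
        elif activation == 'tanh':
            return self._tanh(Z)
        else:
            # Fail loudly instead of silently returning None
            raise ValueError(f"Unknown activation: {activation!r}")

    def _activation_backward(self, dA, Z, activation):
        """Return dZ = dA * g'(Z) for the given activation g."""
        if activation == 'sigmoid':
            s = self._sigmoid(Z)
            return dA * s * (1 - s)
        elif activation == 'relu':
            return dA * (Z > 0).astype(float)
        elif activation == 'tanh':
            return dA * (1 - np.tanh(Z) ** 2)
        else:
            raise ValueError(f"Unknown activation: {activation!r}")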

    # ---- Forward Pass ----

    def forward(self, X, keep_prob=1.0):
        """
        Full forward pass through L layers.
        Stores caches for backpropagation.
        """
        self.caches = {}
        A = X
        self.caches['A0'] = X

        for l in range(1, self.L + 1):
            A_prev = A
            W = self.parameters[f'W{l}']
            b = self.parameters[f'b{l}']

            # Linear step: Z = W·A_prev + b
            Z = np.dot(W, A_prev) + b
            self.caches[f'Z{l}'] = Z

            # Activation step
            A = self._activation_forward(Z, self.activations[l-1])

            # Dropout (not on output layer)
            if keep_prob < 1.0 and l < self.L:
                D = (np.random.rand(*A.shape) < keep_prob).astype(float)
                A = A * D / keep_prob  # Inverted dropout
                self.caches[f'D{l}'] = D

            self.caches[f'A{l}'] = A

        return A

    # ---- Loss Computation ----

    def compute_cost(self, AL, Y, lambd=0):
        """Binary cross-entropy with optional L2 regularization"""
        m = Y.shape[1]
        # Cross-entropy loss
        cost = -(1/m) * np.sum(
            Y * np.log(AL + 1e-8) + (1 - Y) * np.log(1 - AL + 1e-8)
        )

        # L2 regularization
        if lambd > 0:
            l2_cost = 0
            for l in range(1, self.L + 1):
                l2_cost += np.sum(np.square(self.parameters[f'W{l}']))
            cost += (lambd / (2 * m)) * l2_cost

        return np.squeeze(cost)

    # ---- Backward Pass ----

    def backward(self, Y, lambd=0, keep_prob=1.0):
        """
        Full backward pass — computes all gradients.
        This is the CORE of backpropagation.
        """
        m = Y.shape[1]
        AL = self.caches[f'A{self.L}']

        # Output layer: dZ[L] = A[L] - Y (sigmoid + BCE)
        dZ = AL - Y  # Shape: (n[L] × m)

        for l in range(self.L, 0, -1):
            A_prev = self.caches[f'A{l-1}']

            # Gradient of weights: dW = (1/m) · dZ · A_prev^T
            dW = (1/m) * np.dot(dZ, A_prev.T)

            # Add L2 regularization gradient
            if lambd > 0:
                dW += (lambd / m) * self.parameters[f'W{l}']

            # Gradient of biases: db = (1/m) · sum(dZ, axis=1)
            db = (1/m) * np.sum(dZ, axis=1, keepdims=True)

            # Store gradients
            self.grads[f'dW{l}'] = dW
            self.grads[f'db{l}'] = db

            # Propagate gradient to previous layer (if not at layer 1)
            if l > 1:
                # dA_prev = W^T · dZ
                dA_prev = np.dot(self.parameters[f'W{l}'].T, dZ)

                # Apply dropout mask
                if keep_prob < 1.0 and f'D{l-1}' in self.caches:
                    dA_prev = dA_prev * self.caches[f'D{l-1}'] / keep_prob

                # dZ_prev = dA_prev ⊙ g'(Z_prev)
                dZ = self._activation_backward(
                    dA_prev,
                    self.caches[f'Z{l-1}'],
                    self.activations[l-2]
                )

    # ---- Optimizers ----

    def update_parameters_sgd(self, learning_rate):
        """Vanilla SGD"""
        for l in range(1, self.L + 1):
            self.parameters[f'W{l}'] -= learning_rate * self.grads[f'dW{l}']
            self.parameters[f'b{l}'] -= learning_rate * self.grads[f'db{l}']

    def initialize_velocity(self):
        """Initialize velocity for Momentum"""
        self.v = {}
        for l in range(1, self.L + 1):
            self.v[f'dW{l}'] = np.zeros_like(self.parameters[f'W{l}'])
            self.v[f'db{l}'] = np.zeros_like(self.parameters[f'b{l}'])

    def update_parameters_momentum(self, learning_rate, beta=0.9):
        """SGD with Momentum"""
        for l in range(1, self.L + 1):
            # Update velocity
            self.v[f'dW{l}'] = beta * self.v[f'dW{l}'] + (1 - beta) * self.grads[f'dW{l}']
            self.v[f'db{l}'] = beta * self.v[f'db{l}'] + (1 - beta) * self.grads[f'db{l}']
            # Update parameters
            self.parameters[f'W{l}'] -= learning_rate * self.v[f'dW{l}']
            self.parameters[f'b{l}'] -= learning_rate * self.v[f'db{l}']

    def initialize_adam(self):
        """Initialize Adam moment estimates"""
        self.m = {}  # First moment
        self.v_adam = {}  # Second moment
        for l in range(1, self.L + 1):
            self.m[f'dW{l}'] = np.zeros_like(self.parameters[f'W{l}'])
            self.m[f'db{l}'] = np.zeros_like(self.parameters[f'b{l}'])
            self.v_adam[f'dW{l}'] = np.zeros_like(self.parameters[f'W{l}'])
            self.v_adam[f'db{l}'] = np.zeros_like(self.parameters[f'b{l}'])

    def update_parameters_adam(self, lr, t, beta1=0.9, beta2=0.999, eps=1e-8):
        """Adam optimizer with bias correction"""
        for l in range(1, self.L + 1):
            for param in ['dW', 'db']:
                key = f'{param}{l}'
                # Update first moment (mean of gradients)
                self.m[key] = beta1 * self.m[key] + (1-beta1) * self.grads[key]
                # Update second moment (mean of squared gradients)
                self.v_adam[key] = beta2 * self.v_adam[key] + (1-beta2) * (self.grads[key]**2)
                # Bias correction
                m_corrected = self.m[key] / (1 - beta1**t)
                v_corrected = self.v_adam[key] / (1 - beta2**t)
                # Update parameter
                p_key = f'W{l}' if 'W' in param else f'b{l}'
                self.parameters[p_key] -= lr * m_corrected / (np.sqrt(v_corrected) + eps)

    # ---- Gradient Checking ----

    def gradient_check(self, X, Y, epsilon=1e-7):
        """Verify backprop using numerical gradients"""
        # Compute analytic gradients
        self.forward(X)
        self.backward(Y)

        # Flatten all parameters and gradients
        params_values = []
        grads_values = []
        for l in range(1, self.L + 1):
            params_values.append(self.parameters[f'W{l}'].flatten())
            params_values.append(self.parameters[f'b{l}'].flatten())
            grads_values.append(self.grads[f'dW{l}'].flatten())
            grads_values.append(self.grads[f'db{l}'].flatten())

        theta = np.concatenate(params_values)
        d_theta = np.concatenate(grads_values)

        # Compute numerical gradients
        num_grads = np.zeros_like(theta)
        for i in range(len(theta)):
            # θ+
            theta_plus = theta.copy()
            theta_plus[i] += epsilon
            self._set_params_from_vector(theta_plus)
            AL_plus = self.forward(X)
            cost_plus = self.compute_cost(AL_plus, Y)

            # θ-
            theta_minus = theta.copy()
            theta_minus[i] -= epsilon
            self._set_params_from_vector(theta_minus)
            AL_minus = self.forward(X)
            cost_minus = self.compute_cost(AL_minus, Y)

            num_grads[i] = (cost_plus - cost_minus) / (2 * epsilon)

        # Restore original parameters
        self._set_params_from_vector(theta)

        # Compute difference
        diff = np.linalg.norm(d_theta - num_grads) / \
               (np.linalg.norm(d_theta) + np.linalg.norm(num_grads) + 1e-8)

        if diff < 1e-7:
            print(f"✅ Gradient check PASSED! Difference: {diff:.2e}")
        elif diff < 1e-5:
            print(f"⚠️  Gradient check SUSPICIOUS. Difference: {diff:.2e}")
        else:
            print(f"❌ Gradient check FAILED! Difference: {diff:.2e}")

        return diff

    def _set_params_from_vector(self, theta):
        """Helper: set parameters from flat vector"""
        idx = 0
        for l in range(1, self.L + 1):
            w_shape = self.parameters[f'W{l}'].shape
            w_size = np.prod(w_shape)
            self.parameters[f'W{l}'] = theta[idx:idx+w_size].reshape(w_shape)
            idx += w_size
            b_shape = self.parameters[f'b{l}'].shape
            b_size = np.prod(b_shape)
            self.parameters[f'b{l}'] = theta[idx:idx+b_size].reshape(b_shape)
            idx += b_size

    # ---- Full Training Loop ----

    def train(self, X, Y, epochs=1000, lr=0.01, optimizer='adam',
              batch_size=32, lambd=0, keep_prob=1.0,
              lr_decay=None, print_cost=True):
        """Complete training loop with mini-batches"""
        m = X.shape[1]
        costs = []
        t = 0  # Adam counter

        # Initialize optimizer state
        if optimizer == 'momentum':
            self.initialize_velocity()
        elif optimizer == 'adam':
            self.initialize_adam()

        for epoch in range(epochs):
            # Shuffle data
            permutation = np.random.permutation(m)
            X_shuffled = X[:, permutation]
            Y_shuffled = Y[:, permutation]

            # Create mini-batches
            num_batches = m // batch_size
            epoch_cost = 0

            for k in range(num_batches + (1 if m % batch_size != 0 else 0)):
                X_batch = X_shuffled[:, k*batch_size : (k+1)*batch_size]
                Y_batch = Y_shuffled[:, k*batch_size : (k+1)*batch_size]

                if X_batch.shape[1] == 0:
                    continue

                # Forward pass
                AL = self.forward(X_batch, keep_prob)

                # Compute cost
                batch_cost = self.compute_cost(AL, Y_batch, lambd)
                epoch_cost += batch_cost

                # Backward pass
                self.backward(Y_batch, lambd, keep_prob)

                # Update parameters
                t += 1
                current_lr = self._get_lr(lr, epoch, epochs, lr_decay)

                if optimizer == 'sgd':
                    self.update_parameters_sgd(current_lr)
                elif optimizer == 'momentum':
                    self.update_parameters_momentum(current_lr)
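                elif optimizer == 'adam':
                    self.update_parameters_adam(current_lr, t)
                else:
                    # Otherwise an unrecognized name would train nothing, silently
                    raise ValueError(f"Unknown optimizer: {optimizer!r}")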

            # Average over every batch processed, including a final partial batch
            avg_cost = epoch_cost / int(np.ceil(m / batch_size))
            costs.append(avg_cost)

            if print_cost and epoch % 100 == 0:
                print(f"Epoch {epoch:4d} | Cost: {avg_cost:.6f} | LR: {current_lr:.6f}")

        return costs

    def _get_lr(self, lr_init, epoch, total_epochs, schedule):
        """Learning rate scheduler"""
        if schedule is None:
            return lr_init
        elif schedule == 'step':
            return lr_init * (0.5 ** (epoch // 200))
        elif schedule == 'exponential':
            return lr_init * np.exp(-0.01 * epoch)
        elif schedule == 'cosine':
            return lr_init * 0.5 * (1 + np.cos(np.pi * epoch / total_epochs))
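        elif schedule == 'warmup_cosine':
            warmup = max(1, total_epochs // 10)
            if epoch < warmup:
                # (epoch + 1) keeps the first epoch from running at lr = 0
                return lr_init * (epoch + 1) / warmup
            progress = (epoch - warmup) / max(1, total_epochs - warmup)
            return lr_init * 0.5 * (1 + np.cos(np.pi * progress))
        else:
            # Without this branch an unrecognized schedule would return None
            raise ValueError(f"Unknown schedule: {schedule!r}")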

    def predict(self, X, threshold=0.5):
        """Make predictions"""
        AL = self.forward(X, keep_prob=1.0)
        return (AL > threshold).astype(int)

    def accuracy(self, X, Y):
        preds = self.predict(X)
        return np.mean(preds == Y) * 100


# ==========================================
# DEMO: Train on synthetic data
# ==========================================
if __name__ == "__main__":
    # Generate spiral dataset (non-linearly separable)
    np.random.seed(1)
    N = 200  # points per class
    D = 2   # dimensions
    K = 2   # classes

    X = np.zeros((N*K, D))
    Y = np.zeros((N*K, 1))
    for j in range(K):
        ix = range(N*j, N*(j+1))
        r = np.linspace(0.0, 1, N)
        t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2
        X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
        Y[ix] = j

    X = X.T  # (2, 400)
    Y = Y.T  # (1, 400)

    # Create network: 2 inputs → 16 → 8 → 1 output
    nn = DeepNeuralNetwork([2, 16, 8, 1])

    # Verify backpropagation
    print("=== Gradient Check ===")
    nn.gradient_check(X[:, :5], Y[:, :5])

    # Train with Adam
    print("\n=== Training with Adam ===")
    costs = nn.train(X, Y, epochs=2000, lr=0.001,
                     optimizer='adam', batch_size=64,
                     lr_decay='cosine')

    print(f"\nFinal accuracy: {nn.accuracy(X, Y):.1f}%")
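
The class above implements SGD, Momentum, and Adam. RMSprop, which this chapter also covers, follows the same pattern with a single running average of squared gradients. The sketch below is a standalone illustration, not a method of `DeepNeuralNetwork`: the function name `rmsprop_step`, its defaults, and the toy quadratic are assumptions chosen for this example.

```python
import numpy as np

def rmsprop_step(param, grad, s, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update. `s` is the running average of squared gradients."""
    s = beta * s + (1 - beta) * grad**2
    param = param - lr * grad / (np.sqrt(s) + eps)
    return param, s

# Minimize f(w) = 0.5 * (100 * w0^2 + w1^2), a badly scaled quadratic
w = np.array([1.0, 1.0])
s = np.zeros_like(w)
for _ in range(500):
    grad = np.array([100.0 * w[0], w[1]])   # gradient of f at the current w
    w, s = rmsprop_step(w, grad, s, lr=0.01)

print("w after 500 steps:", w)
```

Because each coordinate is divided by its own gradient scale, the steep direction and the shallow direction make nearly identical progress here, which plain gradient descent with one learning rate cannot do on this function.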

12.10.2 — Batch Normalization from Scratch

Python — Batch Normalization

import numpy as np

class BatchNorm:
    """Batch Normalization layer — forward and backward"""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)       # Learnable scale
        self.beta = np.zeros(num_features)        # Learnable shift
        self.eps = eps
        self.momentum = momentum
        # Running stats for inference
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, Z, training=True):
        """
        Forward pass of batch normalization.
        Z: (n, m) — n features, m examples
        """
        if training:
            # Step 1: Batch mean
            self.mu = np.mean(Z, axis=1, keepdims=True)

            # Step 2: Batch variance
            self.var = np.var(Z, axis=1, keepdims=True)

            # Step 3: Normalize
            self.Z_centered = Z - self.mu
            self.std_inv = 1.0 / np.sqrt(self.var + self.eps)
            self.Z_norm = self.Z_centered * self.std_inv

            # Step 4: Scale and shift
            out = self.gamma.reshape(-1, 1) * self.Z_norm + self.beta.reshape(-1, 1)

            # Update running statistics
            self.running_mean = (self.momentum * self.running_mean +
                                (1 - self.momentum) * self.mu.flatten())
            self.running_var = (self.momentum * self.running_var +
                               (1 - self.momentum) * self.var.flatten())
        else:
            # Inference: use running statistics
            Z_norm = (Z - self.running_mean.reshape(-1, 1)) / \
                     np.sqrt(self.running_var.reshape(-1, 1) + self.eps)
            out = self.gamma.reshape(-1, 1) * Z_norm + self.beta.reshape(-1, 1)

        return out

    def backward(self, dout):
        """
        Backward pass of batch normalization.
        dout: gradient from upstream, same shape as Z
        """
        m = dout.shape[1]
        gamma = self.gamma.reshape(-1, 1)

        # Gradient of gamma and beta
        self.dgamma = np.sum(dout * self.Z_norm, axis=1)
        self.dbeta = np.sum(dout, axis=1)

        # Gradient through normalization
        dZ_norm = dout * gamma
        dvar = np.sum(dZ_norm * self.Z_centered * (-0.5) *
                      (self.var + self.eps)**(-1.5), axis=1, keepdims=True)
        dmu = (np.sum(dZ_norm * (-self.std_inv), axis=1, keepdims=True) +
               dvar * np.mean(-2 * self.Z_centered, axis=1, keepdims=True))
        dZ = dZ_norm * self.std_inv + dvar * 2 * self.Z_centered / m + dmu / m

        return dZ


# Demo
bn = BatchNorm(num_features=3)
Z = np.array([[2.0, 4.0, 6.0, 8.0],
              [1.0, 3.0, 5.0, 7.0],
              [0.5, 1.5, 2.5, 3.5]])

out = bn.forward(Z, training=True)
print("BN output:\n", out)
print("Mean (should be ~0):", np.mean(out, axis=1))
print("Var (should be ~1):", np.var(out, axis=1))
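
The backward-pass formulas are easy to get subtly wrong, so it is worth confirming them with the numerical gradient check from earlier in the chapter. The sketch below restates the same step-by-step formulas as standalone functions so that it runs on its own. The function names `bn_forward` and `bn_backward`, the random test data, and the weighted-sum loss are assumptions made for this check.

```python
import numpy as np

def bn_forward(Z, gamma, beta, eps=1e-5):
    """Training-mode batch norm for Z of shape (n features, m examples)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_centered = Z - mu
    std_inv = 1.0 / np.sqrt(var + eps)
    out = gamma * Z_centered * std_inv + beta
    return out, (Z_centered, std_inv, var, eps)

def bn_backward(dout, gamma, cache):
    """Gradient with respect to Z, using the same steps as the class above."""
    Z_centered, std_inv, var, eps = cache
    m = dout.shape[1]
    dZ_norm = dout * gamma
    dvar = np.sum(dZ_norm * Z_centered * -0.5 * (var + eps) ** -1.5,
                  axis=1, keepdims=True)
    dmu = (np.sum(-dZ_norm * std_inv, axis=1, keepdims=True)
           + dvar * np.mean(-2 * Z_centered, axis=1, keepdims=True))
    return dZ_norm * std_inv + dvar * 2 * Z_centered / m + dmu / m

rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 8))
gamma = rng.normal(size=(3, 1))
beta = rng.normal(size=(3, 1))
R = rng.normal(size=(3, 8))          # random weights make the scalar loss non-trivial

def scalar_loss(Z_in):
    out, _ = bn_forward(Z_in, gamma, beta)
    return np.sum(out * R)

_, cache = bn_forward(Z, gamma, beta)
dZ = bn_backward(R, gamma, cache)    # d(loss)/d(out) equals R for this loss

# Central-difference gradient, one entry of Z at a time
num = np.zeros_like(Z)
h = 1e-6
for idx in np.ndindex(*Z.shape):
    Zp, Zm = Z.copy(), Z.copy()
    Zp[idx] += h
    Zm[idx] -= h
    num[idx] = (scalar_loss(Zp) - scalar_loss(Zm)) / (2 * h)

diff = np.linalg.norm(dZ - num) / (np.linalg.norm(dZ) + np.linalg.norm(num))
print(f"relative difference: {diff:.2e}")
```

A plain `np.sum(out)` would be a poor test loss here, because its gradient with respect to `Z` is zero after normalization. The random weights `R` avoid that degenerate case.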

12.11 — TensorFlow Implementation

12.11.1 — Custom Training Loop with GradientTape

Python — TensorFlow Custom Training Loop

import tensorflow as tf
import numpy as np

# ==========================================
# CUSTOM TRAINING LOOP WITH GRADIENTTAPE
# ==========================================

# 1. Build model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(10,),
                          kernel_initializer='he_normal'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(32, activation='relu',
                          kernel_initializer='he_normal'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# 2. Setup optimizer with learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.001,
    decay_steps=1000,  # counted in optimizer steps (batches), not epochs
    alpha=0.0001  # floor as a fraction of the initial LR (minimum LR = 1e-7)
)
optimizer = tf.keras.optimizers.Adam(
    learning_rate=lr_schedule,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8
)

loss_fn = tf.keras.losses.BinaryCrossentropy()

# 3. Generate synthetic data
np.random.seed(42)
X_train = np.random.randn(1000, 10).astype(np.float32)
Y_train = (np.sum(X_train[:, :3], axis=1) > 0).astype(np.float32).reshape(-1, 1)

train_dataset = tf.data.Dataset.from_tensor_slices((X_train, Y_train))
train_dataset = train_dataset.shuffle(1000).batch(32)

# 4. Custom training step
@tf.function
def train_step(x_batch, y_batch):
    """One training step with explicit gradient computation"""
    with tf.GradientTape() as tape:
        # Forward pass
        predictions = model(x_batch, training=True)
        loss = loss_fn(y_batch, predictions)

        # Add L2 regularization manually
        l2_loss = tf.add_n([tf.nn.l2_loss(v)
                           for v in model.trainable_variables
                           if 'kernel' in v.name])
        total_loss = loss + 1e-4 * l2_loss

    # Backward pass — THIS IS BACKPROPAGATION!
    gradients = tape.gradient(total_loss, model.trainable_variables)

    # Gradient clipping (prevent exploding gradients)
    gradients, global_norm = tf.clip_by_global_norm(gradients, 1.0)

    # Update weights (Adam optimizer handles momentum + RMSprop internally)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    return total_loss, predictions

# 5. Training loop
for epoch in range(50):
    epoch_loss = 0
    num_batches = 0

    for x_batch, y_batch in train_dataset:
        loss, preds = train_step(x_batch, y_batch)
        epoch_loss += loss.numpy()
        num_batches += 1

    if epoch % 10 == 0:
        avg_loss = epoch_loss / num_batches
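        # Query the schedule object directly; optimizer.learning_rate is not
        # callable in every TensorFlow/Keras version
        current_lr = float(lr_schedule(optimizer.iterations))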
        print(f"Epoch {epoch:3d} | Loss: {avg_loss:.4f} | LR: {current_lr:.6f}")

# 6. Inspect gradients for diagnosis
print("\n=== Gradient Analysis ===")
with tf.GradientTape() as tape:
    preds = model(X_train[:32], training=False)
    loss = loss_fn(Y_train[:32], preds)

grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    if grad is not None:
        print(f"{var.name:30s} | grad mean: {tf.reduce_mean(grad):.6f} | "
              f"grad std: {tf.math.reduce_std(grad):.6f}")
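
The training step above relies on global-norm clipping. As a rough NumPy sketch of the same idea (the function below is written for illustration and is not TensorFlow's implementation), all gradient tensors are treated as one long vector and rescaled together, so their relative directions are preserved:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))   # never scales gradients up
    return [g * scale for g in grads], global_norm

# Two "layers" whose joint norm is sqrt(3^2 + 4^2) = 5
grads = [np.array([[3.0, 0.0], [0.0, 0.0]]), np.array([4.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print("norm before:", norm_before)
print("norm after: ", norm_after)
```

Clipping each tensor separately would change the ratio between layers. Scaling by one shared factor keeps the update direction and only shortens the step.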

12.11.2 — Comparing Optimizers in TensorFlow

Python — Optimizer Comparison

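import tensorflow as tf
import numpy as np

# Same synthetic data as in 12.11.1, regenerated so this listing runs on its own
np.random.seed(42)
X_train = np.random.randn(1000, 10).astype(np.float32)
Y_train = (np.sum(X_train[:, :3], axis=1) > 0).astype(np.float32).reshape(-1, 1)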

def create_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(16, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

optimizers = {
    'SGD':          tf.keras.optimizers.SGD(learning_rate=0.01),
    'SGD+Momentum': tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    'RMSprop':      tf.keras.optimizers.RMSprop(learning_rate=0.001),
    'Adam':         tf.keras.optimizers.Adam(learning_rate=0.001),
}

results = {}
for name, opt in optimizers.items():
    model = create_model()
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    history = model.fit(X_train, Y_train, epochs=50, batch_size=32, verbose=0)
    results[name] = history.history['loss']
    print(f"{name:15s} | Final loss: {history.history['loss'][-1]:.4f} | "
          f"Final acc: {history.history['accuracy'][-1]:.4f}")

12.12 — Scikit-Learn Implementation

Python — Scikit-Learn MLPClassifier

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# Generate dataset
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # fixed seed for a reproducible split
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Compare different solvers (optimizers) in scikit-learn
solvers = {
    'sgd': {'solver': 'sgd', 'learning_rate_init': 0.01,
            'momentum': 0.9, 'nesterovs_momentum': True},
    'adam': {'solver': 'adam', 'learning_rate_init': 0.001,
            'beta_1': 0.9, 'beta_2': 0.999},
    'lbfgs': {'solver': 'lbfgs'},  # Quasi-Newton method
}

print("Scikit-Learn MLPClassifier — Optimizer Comparison")
print("=" * 55)

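models = {}  # keep every fitted model so it can be inspected after the loop
for name, params in solvers.items():
    mlp = MLPClassifier(
        hidden_layer_sizes=(64, 32),  # Two hidden layers
        activation='relu',
        max_iter=500,
        random_state=42,
        alpha=0.0001,       # L2 regularization
        batch_size=32,      # ignored by 'lbfgs', which does not use mini-batches
        **params
    )
    mlp.fit(X_train, y_train)
    models[name] = mlp
    train_acc = mlp.score(X_train, y_train) * 100
    test_acc = mlp.score(X_test, y_test) * 100
    print(f"{name:8s} | Train: {train_acc:.1f}% | Test: {test_acc:.1f}% | "
          f"Iters: {mlp.n_iter_}")

# Access the loss curve. Only the stochastic solvers ('sgd', 'adam') record
# loss_curve_; a model fitted with 'lbfgs' does not have this attribute.
adam_mlp = models['adam']
print(f"\nAdam loss curve (first 10 epochs): {adam_mlp.loss_curve_[:10]}")
print(f"Final loss: {adam_mlp.loss_curve_[-1]:.6f}")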

12.13 — Indian Case Studies

🇮🇳 India

TCS AutoML Training Pipelines

Tata Consultancy Services (TCS) developed AutoML frameworks for their enterprise clients that automatically tune training hyperparameters including optimizer choice, learning rate schedules, and regularization strength.

Challenge: TCS serves 1000+ enterprise clients, each with unique data characteristics. Manually tuning learning rates and optimizers for each client was unsustainable.
Solution: Built an automated training pipeline that starts with Adam (safe default), monitors loss curves for signs of divergence, and automatically switches to SGD+Momentum with warmup if loss plateaus.
Training Strategy: Warmup + cosine annealing schedule with automatic learning rate range finding (similar to Leslie Smith's LR finder).
Results: Reduced ML model deployment time from 6 weeks to 5 days. Achieved 15-20% better final accuracy compared to using default hyperparameters.
Scale: The pipeline processes 50+ TB of client data daily, training models for fraud detection, demand forecasting, and customer churn prediction.

🇮🇳 India

Infosys Nia — Deep Learning Training Optimization

Infosys Nia is an AI platform that uses deep learning for knowledge management, predictive analytics, and process automation across Infosys's global operations.

Training Infrastructure: Uses distributed training across multiple NVIDIA DGX systems with gradient accumulation for effective batch sizes of 4096+.
Optimizer Choice: LAMB optimizer (Layer-wise Adaptive Moments optimizer for Batch training), a variant of Adam that rescales each layer's update for large-batch training, which is crucial for distributed setups.
Batch Normalization: Synchronized Batch Normalization across GPUs to maintain consistent statistics in distributed training.
Gradient Challenges: Encountered exploding gradients in RNN-based document classification models. Mitigated them with gradient clipping (max norm = 1.0) and by moving to gated LSTM/GRU architectures, whose gating also reduces vanishing gradients.
Impact: Reduced document processing time by 70% for 200+ enterprise clients; handles 100M+ documents annually.
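
Gradient accumulation, mentioned above as the route to large effective batch sizes, can be illustrated on a toy problem. The sketch below is not Infosys's pipeline. It assumes a linear model with mean squared error and equal-sized micro-batches, and shows that averaging micro-batch gradients before the update reproduces the gradient of the full batch.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((Xw - y)^2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = rng.normal(size=64)
w = rng.normal(size=5)

# One large batch of 64 examples
full = mse_grad(w, X, y)

# The same 64 examples as 4 micro-batches of 16, accumulated before any update
micro_batches = 4
accum = np.zeros_like(w)
for Xb, yb in zip(np.split(X, micro_batches), np.split(y, micro_batches)):
    accum += mse_grad(w, Xb, yb) / micro_batches   # average of per-micro-batch means

print("matches full-batch gradient:", np.allclose(full, accum))
```

The equivalence holds for losses that are plain averages over examples. Layers that depend on batch statistics, such as Batch Normalization, still see only the micro-batch, which is one reason synchronized statistics matter in distributed training.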

🇮🇳 India

ISRO — Satellite Image Neural Network Training

The Indian Space Research Organisation (ISRO) trains deep neural networks for satellite image analysis, including crop classification, flood detection, and urban planning.

Challenge: Satellite images are enormous (10,000 × 10,000 pixels). Training on full images causes memory overflow. Vanishing gradients in deep segmentation networks (U-Net with 50+ layers).
Solution: Patch-based training with mini-batch SGD + Nesterov momentum. Skip connections (from the U-Net architecture) to combat vanishing gradients. Mixed precision training (FP16 forward and backward computation, with an FP32 master copy of the weights) to save memory.
Results: 94% accuracy in crop type classification across India's diverse agricultural landscape. 3× faster training with mixed precision.
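
Patch-based training can be sketched in a few lines. The example below is an illustration under assumptions, not ISRO's code: the helper `random_patches`, the 4-band random "scene", and the patch size are all hypothetical. The key point is that image and label crops share the same window, so each mini-batch fits in memory regardless of the full scene size.

```python
import numpy as np

def random_patches(image, mask, patch_size, n_patches, rng):
    """Sample aligned (image, mask) crops from one large scene."""
    H, W = mask.shape
    xs, ys = [], []
    for _ in range(n_patches):
        top = rng.integers(0, H - patch_size + 1)
        left = rng.integers(0, W - patch_size + 1)
        xs.append(image[top:top + patch_size, left:left + patch_size])
        ys.append(mask[top:top + patch_size, left:left + patch_size])
    return np.stack(xs), np.stack(ys)

rng = np.random.default_rng(0)
scene = rng.random((1000, 1000, 4), dtype=np.float32)   # stand-in for a 4-band tile
labels = rng.integers(0, 5, size=(1000, 1000))          # stand-in per-pixel classes

x_batch, y_batch = random_patches(scene, labels, patch_size=256, n_patches=8, rng=rng)
print(x_batch.shape, y_batch.shape)
```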

12.14 — Global Case Studies

🌍 Global

GPT-3 Training Setup — OpenAI (2020)

Training GPT-3 with 175 billion parameters was one of the most ambitious training runs of its time. Its published recipe combines many of the techniques discussed in this chapter: Adam, learning rate warmup, cosine decay, gradient clipping, and weight decay.

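Optimizer: Adam with β₁ = 0.9, β₂ = 0.95, ε = 10⁻⁸. Note that β₂ = 0.95 is lower than the typical 0.999: the second-moment estimate averages over roughly the last 20 steps instead of the last 1,000, so it adapts faster to changes in gradient scale.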
Learning Rate: Linear warmup over 375M tokens, then cosine decay to 10% of peak LR. Peak LR varied by model size (0.6×10⁻⁴ for the largest model).
Batch Size: Gradually increased from 32K tokens to 3.2M tokens during training (large batch training).
Gradient Clipping: Global norm clipping at 1.0 to prevent exploding gradients.
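Training Data: 300 billion training tokens. The Common Crawl portion was filtered from about 45 TB of compressed raw text down to about 570 GB.
Compute: An estimated 3,640 petaflop/s-days. Third-party estimates put the cost at roughly $4.6M on V100 GPUs at 2020 cloud prices.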
Weight Decay: 0.1, used to provide a small amount of regularization.
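
The learning rate recipe in this list (linear warmup, then cosine decay to 10% of the peak) can be written as a function of tokens processed. This is a sketch, not OpenAI's code. The function name `gpt3_style_lr` is hypothetical, and the 260-billion-token decay horizon and the choice to start the cosine after warmup are assumptions not stated in the list above.

```python
import math

def gpt3_style_lr(tokens_seen, peak_lr=0.6e-4, warmup_tokens=375e6,
                  decay_tokens=260e9, floor_frac=0.10):
    """Linear warmup, then cosine decay to a fraction of the peak rate."""
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    # Fraction of the decay phase completed, capped at 1 once the horizon is passed
    progress = min(1.0, (tokens_seen - warmup_tokens) / (decay_tokens - warmup_tokens))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (floor_frac + (1.0 - floor_frac) * cosine)

for tokens in [0, 187.5e6, 375e6, 130e9, 260e9, 300e9]:
    print(f"{tokens:>12.3e} tokens -> lr = {gpt3_style_lr(tokens):.3e}")
```

Expressing the schedule in tokens rather than steps keeps it meaningful while the batch size is being ramped up, since the number of tokens per step changes during training.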

🌍 Global

Tesla Full Self-Driving (FSD) Training

Tesla trains neural networks for autonomous driving using data from millions of vehicles.

Architecture: Multi-task learning with shared backbone (HydraNet). Backpropagation must handle gradients from 50+ output tasks simultaneously.
Training Scale: Custom Dojo supercomputer with ExaPOD training tiles. Processes 1.5 billion labeled video frames.
Optimizer: AdamW (Adam with decoupled weight decay) — the standard for transformer-based vision models.
Key Challenge: Multi-task gradient conflicts — some tasks' gradients can cancel out others. Tesla uses gradient surgery techniques (PCGrad) to handle this.
Batch Norm Alternative: Uses Group Normalization and Layer Normalization since batch statistics are unreliable in variable-length video sequences.
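
The gradient surgery idea named above (PCGrad) can be shown on two toy task gradients. This is a minimal sketch of the published projection rule, not Tesla's implementation. The function name `pcgrad` and the two-dimensional gradients are illustrative assumptions.

```python
import numpy as np

def pcgrad(task_grads, rng):
    """Project each task gradient away from the task gradients it conflicts with."""
    projected = []
    for i, g in enumerate(task_grads):
        g = g.copy()
        others = [j for j in range(len(task_grads)) if j != i]
        rng.shuffle(others)                      # other tasks are visited in random order
        for j in others:
            g_j = task_grads[j]
            dot = g @ g_j
            if dot < 0:                          # conflict: negative inner product
                g -= dot / (g_j @ g_j) * g_j     # remove the conflicting component
        projected.append(g)
    return np.sum(projected, axis=0)             # combined update direction

rng = np.random.default_rng(0)
g_a = np.array([1.0, 0.0])
g_b = np.array([-1.0, 1.0])                      # conflicts with g_a (dot product = -1)

naive = g_a + g_b
combined = pcgrad([g_a, g_b], rng)
print("naive sum:", naive)
print("PCGrad:   ", combined)
```

In the naive sum, task A's preferred direction is cancelled completely. After projection, the combined update no longer opposes either task's original gradient.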

🌍 Global

Netflix — Recommendation Model Training at Scale

Netflix trains deep learning recommendation models that serve 230+ million subscribers.

Challenge: Models must be retrained daily on billions of user interactions. Training speed is critical — a model that takes 48 hours to train is useless if user preferences change daily.
Solution: Highly optimized SGD with momentum on custom hardware. Sparse gradient updates for embedding layers (only update embeddings for users/items in the batch).
Dropout: Applied heavily (0.5) in upper layers to prevent overfitting on popular items, ensuring diverse recommendations.
Impact: The recommendation system saves Netflix an estimated $1B/year in customer retention.
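
Sparse embedding updates, as described above, touch only the rows that appear in the current batch. The NumPy sketch below illustrates the idea and is not Netflix's system: the table size, the item ids, and the random stand-in gradients are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, dim = 10_000, 16
embeddings = rng.normal(scale=0.1, size=(n_items, dim))
before = embeddings.copy()

# A mini-batch touches only a handful of rows (note the repeated item 42)
batch_ids = np.array([42, 7, 42, 9001])
batch_grads = rng.normal(size=(len(batch_ids), dim))   # stand-in for dL/d(embedding row)

lr = 0.1
# np.subtract.at accumulates correctly when an id appears more than once;
# plain fancy-index assignment (embeddings[batch_ids] -= ...) would keep only
# one of the two updates for item 42
np.subtract.at(embeddings, batch_ids, lr * batch_grads)

changed_rows = np.flatnonzero(np.any(embeddings != before, axis=1))
print("rows updated:", changed_rows)
```

A dense update would read and write all 10,000 rows every step, even though the gradient for almost all of them is exactly zero.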

12.15 — Startup Applications

🚀 How Startups Leverage Training Optimization

Razorpay (India): Uses optimized training pipelines with Adam + cosine annealing for their fraud detection models. Mini-batch size tuned for their GPU cluster (A100s). Processes 1B+ transactions/month and retrains models every 4 hours for fresh fraud patterns.

Postman (India): API testing platform uses NLP models trained with backpropagation for API documentation generation and error detection. Training uses warmup schedules to avoid early divergence on small, domain-specific datasets.

Hugging Face (Global): Built the Accelerate library, which handles distributed training, mixed precision, and gradient accumulation, abstracting away the low-level backpropagation details we studied. Their Trainer class supports warmup combined with linear or cosine schedules (linear decay is the default).

Weights & Biases (Global): Founded specifically to help ML teams track training runs — monitoring loss curves, gradient norms, learning rates, and optimizer states. Their experiment tracking tools are used by 70,000+ ML practitioners worldwide.

12.16 — Government Applications

🏛️ Government & Public Sector Deep Learning Training

UIDAI / Aadhaar (India): The biometric matching system uses deep neural networks trained with backpropagation and carefully tuned hyperparameters. The fingerprint and iris matching networks use the Adam optimizer with learning rate warmup. Batch Normalization is critical for handling the massive diversity in biometric quality across 1.3B+ entries. Dropout (0.4) is used to prevent overfitting on dominant demographic patterns.

Indian Railways: Predictive maintenance models for rolling stock are trained using optimized SGD + Momentum. The models process vibration sensor data from 12,000+ locomotives. Learning rate schedules (step decay every 50 epochs) ensure stable training on noisy sensor data.

DRDO (Defence Research): Trains computer vision models for surveillance and target recognition on custom PARAM supercomputers. Uses gradient clipping and careful initialization to train very deep networks (100+ layers) for object detection in challenging conditions.

US Department of Energy: Uses distributed backpropagation across supercomputers (Summit, Frontier) for training physics-informed neural networks. LAMB optimizer enables effective batch sizes of 65,536+ across thousands of GPUs.
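
LAMB's role in large-batch training comes from its layer-wise trust ratio. The sketch below shows one LAMB-style update for a single layer. It is a simplified illustration of the published update rule, not any agency's code, and the function name `lamb_layer_step` and the hyperparameter defaults are assumptions.

```python
import numpy as np

def lamb_layer_step(w, grad, m, v, t, lr=1e-3, wd=0.01,
                    beta1=0.9, beta2=0.999, eps=1e-6):
    """One LAMB-style update for a single layer's weights."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    update = m_hat / (np.sqrt(v_hat) + eps) + wd * w     # Adam direction + weight decay
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return w - lr * trust * update, m, v

rng = np.random.default_rng(0)
for scale in (0.01, 1.0, 100.0):                          # layers with very different weight scales
    w = scale * rng.normal(size=50)
    g = rng.normal(size=50)
    w_new, _, _ = lamb_layer_step(w, g, np.zeros(50), np.zeros(50), t=1)
    rel = np.linalg.norm(w_new - w) / np.linalg.norm(w)
    print(f"weight scale {scale:>6}: relative step = {rel:.4f}")
```

The trust ratio makes every layer move by the same relative amount (here, the learning rate), whatever the magnitude of its weights. That per-layer normalization is what allows the learning rate to be raised along with the batch size.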

12.17 — Industry Applications

Industry	Application	Training Setup	Key Technique
Healthcare	Medical image diagnosis	Adam, LR warmup, BatchNorm	Transfer learning + fine-tuning with low LR
Finance	Algorithmic trading	SGD + Momentum, gradient clipping	Online learning with mini-batch updates
E-commerce	Product recommendations	Adam, sparse gradients	Embedding-specific optimizers
Manufacturing	Quality inspection	SGD + cosine annealing	BatchNorm for batch-to-batch variation
Telecommunications	Network optimization	RMSprop, dropout	Recurrent networks with gradient clipping
Agriculture	Crop disease detection	Adam + weight decay	Heavy dropout (0.5) for small datasets
Energy	Demand forecasting	Adam, LR scheduling	Sequence models with layered dropout
Automotive	Autonomous driving	AdamW + warmup + cosine	Multi-task gradient balancing

18

12.18 — Mini Projects

Mini Project 1

🔬 Backpropagation Visualizer

Build a visualization that shows gradient magnitudes flowing backward through a small neural network.

Python — Backprop Visualizer

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches

class BackpropVisualizer:
    """Visualize forward and backward pass through a network"""

    def __init__(self, layer_sizes):
        self.layer_sizes = layer_sizes
        self.L = len(layer_sizes) - 1
        self.params = {}
        self.cache = {}
        self.grads = {}

        # Initialize small network
        np.random.seed(42)
        for l in range(1, self.L + 1):
            self.params[f'W{l}'] = np.random.randn(
                layer_sizes[l], layer_sizes[l-1]) * 0.5
            self.params[f'b{l}'] = np.zeros((layer_sizes[l], 1))

    def forward_and_backward(self, X, Y):
        """Run forward and backward, store everything"""
        # Forward
        A = X
        self.cache['A0'] = A
        for l in range(1, self.L + 1):
            Z = self.params[f'W{l}'] @ A + self.params[f'b{l}']
            self.cache[f'Z{l}'] = Z
            A = 1 / (1 + np.exp(-Z)) if l == self.L else np.maximum(0, Z)
            self.cache[f'A{l}'] = A

        # Backward
        m = Y.shape[1]
        dZ = A - Y
        for l in range(self.L, 0, -1):
            A_prev = self.cache[f'A{l-1}']
            self.grads[f'dW{l}'] = (1/m) * dZ @ A_prev.T
            self.grads[f'db{l}'] = (1/m) * np.sum(dZ, axis=1, keepdims=True)
            self.grads[f'dZ{l}'] = dZ
            if l > 1:
                dA = self.params[f'W{l}'].T @ dZ
                dZ = dA * (self.cache[f'Z{l-1}'] > 0)

    def visualize(self):
        """Draw the network with gradient magnitudes"""
        fig, axes = plt.subplots(1, 2, figsize=(16, 8))

        # Plot 1: Network with gradient colors
        ax = axes[0]
        ax.set_title('Network with Gradient Magnitudes', fontweight='bold')
        max_neurons = max(self.layer_sizes)
        for l, n in enumerate(self.layer_sizes):
            x = l * 2
            for i in range(n):
                y = (max_neurons - n) / 2 + i
                if l > 0:
                    grad_mag = np.abs(self.grads[f'dZ{l}'][i, 0])
                    color = plt.cm.RdYlGn_r(min(grad_mag * 5, 1))
                else:
                    color = 'lightblue'
                # facecolor/edgecolor are set separately so the black outline is kept
                circle = plt.Circle((x, y), 0.3, facecolor=color, edgecolor='black')
                ax.add_patch(circle)

                # Draw connections to next layer
                if l < len(self.layer_sizes) - 1:
                    n_next = self.layer_sizes[l+1]
                    for j in range(n_next):
                        y_next = (max_neurons - n_next) / 2 + j
                        w = self.params[f'W{l+1}'][j, i]
                        ax.plot([x+0.3, (l+1)*2-0.3], [y, y_next],
                               color='blue' if w > 0 else 'red',
                               alpha=min(abs(w), 1), linewidth=abs(w)*2)
        ax.set_xlim(-1, len(self.layer_sizes) * 2)
        ax.set_ylim(-1, max_neurons)
        ax.set_aspect('equal')
        ax.axis('off')

        # Plot 2: Gradient norms per layer
        ax = axes[1]
        layers = []
        grad_norms = []
        for l in range(1, self.L + 1):
            layers.append(f'Layer {l}')
            grad_norms.append(np.linalg.norm(self.grads[f'dW{l}']))
        ax.bar(layers, grad_norms, color=['#059669', '#0891b2', '#8b5cf6'][:len(layers)])
        ax.set_title('Gradient Norms by Layer', fontweight='bold')
        ax.set_ylabel('‖dW‖₂')

        plt.tight_layout()
        plt.savefig('backprop_visualization.png', dpi=150, bbox_inches='tight')
        plt.show()

# Run the visualizer
viz = BackpropVisualizer([4, 6, 4, 1])
X = np.random.randn(4, 1)
Y = np.array([[1]])
viz.forward_and_backward(X, Y)
viz.visualize()

Mini Project 2

⚡ Optimizer Comparison Dashboard

Compare SGD, Momentum, and Adam on the spiral dataset, plotting convergence curves, final accuracy, and training time. Adding RMSprop (see Exercise 12.11) and further datasets is a natural extension.

Python — Optimizer Comparison

import numpy as np
import matplotlib.pyplot as plt
import time

# Use our DeepNeuralNetwork class from Section 12.10
# (assumes it's already defined/imported)

def generate_spiral_data(N=200, K=2):
    np.random.seed(1)
    X = np.zeros((N*K, 2))
    Y = np.zeros((N*K, 1))
    for j in range(K):
        ix = range(N*j, N*(j+1))
        r = np.linspace(0.0, 1, N)
        t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2
        X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
        Y[ix] = j
    return X.T, Y.T

X, Y = generate_spiral_data()
optimizers = ['sgd', 'momentum', 'adam']
colors = ['#ef4444', '#f97316', '#059669']
results = {}

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i, opt in enumerate(optimizers):
    nn = DeepNeuralNetwork([2, 16, 8, 1])
    start = time.time()
    lr = 0.01 if opt == 'sgd' else 0.001
    costs = nn.train(X, Y, epochs=2000, lr=lr,
                     optimizer=opt, batch_size=64, print_cost=False)
    elapsed = time.time() - start
    acc = nn.accuracy(X, Y)
    results[opt] = {'costs': costs, 'time': elapsed, 'acc': acc}

    # Plot convergence
    axes[0].plot(costs, color=colors[i], label=f'{opt} ({acc:.0f}%)', linewidth=2)

axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Cost')
axes[0].set_title('Convergence Comparison')
axes[0].legend()
axes[0].set_yscale('log')

# Bar chart: final accuracy
axes[1].bar(optimizers, [results[o]['acc'] for o in optimizers], color=colors)
axes[1].set_title('Final Accuracy (%)')
axes[1].set_ylim(80, 100)

# Bar chart: training time
axes[2].bar(optimizers, [results[o]['time'] for o in optimizers], color=colors)
axes[2].set_title('Training Time (seconds)')

plt.suptitle('Optimizer Comparison on Spiral Dataset', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('optimizer_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

Mini Project 3

📊 Learning Rate Schedule Explorer

Visualize and compare all learning rate schedules (step decay, exponential decay, cosine annealing, warmup + cosine) and observe their effect on training.

Python — LR Schedule Visualization

import numpy as np
import matplotlib.pyplot as plt

total_epochs = 200
lr_init = 0.01
epochs = np.arange(total_epochs)

schedules = {
    'Constant': np.ones(total_epochs) * lr_init,
    'Step Decay (÷2 every 50)': lr_init * (0.5 ** (epochs // 50)),
    'Exponential (k=0.02)': lr_init * np.exp(-0.02 * epochs),
    'Cosine Annealing': lr_init * 0.5 * (1 + np.cos(np.pi * epochs / total_epochs)),
    'Warmup + Cosine': np.where(
        epochs < 20,
        lr_init * epochs / 20,
        lr_init * 0.5 * (1 + np.cos(np.pi * (epochs - 20) / (total_epochs - 20)))
    ),
}

fig, ax = plt.subplots(figsize=(12, 6))
colors = ['#64748b', '#ef4444', '#f97316', '#059669', '#0891b2']
for (name, lrs), color in zip(schedules.items(), colors):
    ax.plot(epochs, lrs, label=name, linewidth=2.5, color=color)

ax.set_xlabel('Epoch', fontsize=13)
ax.set_ylabel('Learning Rate', fontsize=13)
ax.set_title('Learning Rate Schedule Comparison', fontsize=15, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('lr_schedules.png', dpi=150)
plt.show()

19

12.19 — End-of-Chapter Exercises

Conceptual Questions

Exercise 12.1

Explain why computing gradients numerically (finite differences) is impractical for large networks. Calculate the number of forward passes needed for a network with 10 million parameters.

Exercise 12.2

Draw the complete computation graph for a 3-layer network (input → hidden1 → hidden2 → output) with ReLU activations and BCE loss. Label all nodes and edges.

Exercise 12.3

Explain the chain rule of calculus for composite functions. If f(x) = sin(x²+3x), compute df/dx using the chain rule. Then explain how this extends to backpropagation.

Exercise 12.4

Why does sigmoid + cross-entropy loss give the clean gradient dZ[L] = A[L] − Y? Derive this simplification step by step. What analogous simplification occurs for softmax + categorical cross-entropy?

Exercise 12.5

Compare and contrast Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent. Under what conditions is each most appropriate?

Mathematical Problems

Exercise 12.6

For a 2-layer network with architecture [3, 4, 1], write out the shapes of W[1], b[1], Z[1], A[1], W[2], b[2], Z[2], A[2], and all their gradient counterparts (dW, db, dZ, dA) for a batch of m=5 examples.

Exercise 12.7

Given gradients ∇J₁ = [0.5, −0.2], ∇J₂ = [0.3, −0.4], ∇J₃ = [0.4, −0.3], compute the Momentum velocity v₃ with β = 0.9 and v₀ = [0, 0]. Then compute the parameter update with α = 0.01.

Exercise 12.8

Compute one step of Adam for θ = [1.0, 2.0], ∇J = [0.2, −0.5], at t=1 with α=0.001, β₁=0.9, β₂=0.999, ε=10⁻⁸. Show the bias correction step explicitly and explain why it matters at t=1.

Exercise 12.9

For a batch of values Z = [1.0, 3.0, 5.0, 7.0], compute the Batch Normalization forward pass with γ = 2.0, β = 1.0, ε = 10⁻⁵. Verify that the normalized values have zero mean and unit variance.

Exercise 12.10

Show that when a gradient is backpropagated through k layers with sigmoid activations, the product of the activation derivatives is at most (0.25)ᵏ. For k = 20, what is this value? How does this illustrate the vanishing gradient problem?

Programming Exercises

Exercise 12.11

Implement the RMSprop optimizer from scratch (add it to the DeepNeuralNetwork class). Test it on the spiral dataset and compare convergence with SGD and Adam.

Exercise 12.12

Implement inverted dropout from scratch. Add a dropout layer between each hidden layer of a 4-layer network. Compare training and test accuracy with dropout rates of 0, 0.2, 0.5, and 0.8.

Exercise 12.13

Implement numerical gradient checking and use it to verify your backpropagation implementation. Test with a 3-layer network and report the difference metric.

Exercise 12.14

Write a TensorFlow custom training loop that implements warmup + cosine annealing learning rate schedule. Train a model on MNIST and plot the learning rate and loss curves.

Exercise 12.15

Implement He initialization and Xavier initialization from scratch. Train the same network architecture with both initializations (and random initialization) and compare convergence.

Advanced Exercises

Exercise 12.16

Implement gradient clipping (both by value and by norm) from scratch. Create a scenario where exploding gradients occur, then show that clipping fixes the issue.

Exercise 12.17

Derive the backward pass equations for Batch Normalization from first principles (compute ∂L/∂γ, ∂L/∂β, and ∂L/∂z_i). Implement it and verify with gradient checking.

Exercise 12.18

Compare the generalization performance of Adam vs SGD+Momentum on CIFAR-10 (using TensorFlow). Is it true that SGD generalizes better? Explain your findings.

Exercise 12.19

Implement a learning rate finder that sweeps learning rate from 10⁻⁷ to 10, trains for one epoch, and plots loss vs learning rate. Find the optimal learning rate for a given dataset.

Exercise 12.20

Implement a training loop that uses gradient accumulation to simulate a larger batch size. Compare effective batch sizes of 32, 128, 512, and 2048 (using gradient accumulation over 1, 4, 16, and 64 mini-batches).

Exercise 12.21

Implement Nesterov Accelerated Gradient (NAG) from scratch. NAG computes the gradient at the "look-ahead" position: v_t = β·v_{t-1} + α·∇J(θ − β·v_{t-1}). Compare with standard momentum.

Exercise 12.22

Build a "training dashboard" that logs and plots in real-time: (1) training loss, (2) validation loss, (3) gradient norms per layer, (4) learning rate, (5) weight histograms. Use matplotlib animation.

20

12.20 — Multiple Choice Questions

MCQ 1

What is the computational complexity of backpropagation relative to the forward pass?

(a) Same as forward pass — O(N) (b) About 2-3× the forward pass cost (c) Quadratic in the number of parameters — O(N²) (d) Exponential in the number of layers

(b) About 2-3× the forward pass cost. Backpropagation has the same asymptotic complexity as the forward pass but requires additional memory for cached activations and has a slightly higher constant factor.

MCQ 2

For a sigmoid output with binary cross-entropy loss, dZ[L] simplifies to:

(a) σ(Z[L]) · (1 − σ(Z[L])) (b) A[L] − Y (c) Y − A[L] (d) −Y/A[L]

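(b) A[L] − Y. This clean simplification occurs because sigmoid and cross-entropy form a natural pair from the exponential family of distributions: the loss derivative ∂L/∂A = (A − Y)/(A(1 − A)) is multiplied by the sigmoid derivative A(1 − A), and the A(1 − A) factors cancel.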
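
The simplification can be checked numerically. The sketch below is an illustration only: the values z = 0.7 and y = 1 and the helper names sigmoid and bce_loss are assumptions chosen for this example, not part of the chapter's code. It compares a central-difference estimate of ∂L/∂z with the closed form A − Y.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(z, y):
    """Binary cross-entropy of a sigmoid output, as a function of the logit z."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.7, 1.0   # illustrative logit and label (assumed values)
eps = 1e-6

# Central-difference estimate of dL/dz
numeric_dz = (bce_loss(z + eps, y) - bce_loss(z - eps, y)) / (2 * eps)

# Closed form: dZ = A - Y
analytic_dz = sigmoid(z) - y

print(numeric_dz, analytic_dz)
```

The two printed numbers should agree to many decimal places, which is the same idea gradient checking applies to a whole network.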

MCQ 3

In the Adam optimizer, what is the purpose of bias correction?

(a) To prevent division by zero (b) To compensate for the zero initialization of moment estimates in early steps (c) To add regularization to the weight updates (d) To normalize the learning rate

(b) Since m₀ = v₀ = 0, the EMA estimates are biased toward zero in early iterations. Dividing by (1 − βᵗ) removes this bias, so the estimates have the correct scale from the first step.
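
A one-step sketch makes the effect concrete. The gradient value g = 0.8 is an arbitrary illustration (an assumption, not a value from the chapter); the β values are the typical defaults quoted in this chapter.

```python
beta1, beta2 = 0.9, 0.999
g = 0.8            # illustrative gradient (assumed value)
m, v = 0.0, 0.0    # moment estimates start at zero
t = 1

# Raw exponential moving averages after the first step:
# each is only (1 - beta) times the quantity it is meant to estimate.
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g ** 2

# Bias correction rescales them back to the right magnitude.
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)

print("raw m:", m, "corrected m_hat:", m_hat)
print("raw v:", v, "corrected v_hat:", v_hat)
```

Without the correction, the first moment is 10× too small and the second moment 1000× too small at t = 1. After correction they equal g and g² up to floating-point rounding.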

MCQ 4

Which initialization is recommended for ReLU activations?

(a) Xavier initialization: W ~ N(0, 1/n_prev) (b) He initialization: W ~ N(0, 2/n_prev) (c) Zero initialization: W = 0 (d) Uniform initialization: W ~ U(-1, 1)

(b) He initialization. The factor of 2 compensates for the fact that ReLU kills half the activations (those below zero), so the variance needs to be doubled compared to Xavier to maintain signal propagation.
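
The factor of 2 can be seen in a small simulation. This is a sketch under stated assumptions: a 20-layer, 512-unit ReLU network with no biases, standard normal inputs, and a hypothetical helper final_second_moment that reports the mean squared activation after the last layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 20                      # assumed width and depth
x = rng.standard_normal((n, 200))       # 200 random input columns

def final_second_moment(std_fn):
    """Propagate x through `depth` ReLU layers; return mean(a**2) at the end."""
    a = x
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * std_fn(n)
        a = np.maximum(0, W @ a)
    return np.mean(a ** 2)

he = final_second_moment(lambda n_prev: np.sqrt(2.0 / n_prev))      # He
xavier = final_second_moment(lambda n_prev: np.sqrt(1.0 / n_prev))  # Xavier

print("He:", he)
print("Xavier:", xavier)
```

With He scaling the signal magnitude stays on the order of 1, while with Xavier scaling it roughly halves at every ReLU layer and has almost vanished after 20 layers.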

MCQ 5

Batch Normalization normalizes which of the following?

(a) The input data X before entering the network (b) The pre-activations Z (or activations A) within each layer (c) The weight matrices W (d) The gradients during backpropagation

(b) BatchNorm normalizes the pre-activations Z (typically) or activations A within each layer to have zero mean and unit variance across the mini-batch, then applies learnable scale (γ) and shift (β).

MCQ 6

In inverted dropout, why do we divide by keep_prob during training?

(a) To increase the learning rate (b) To compensate so that expected values match at test time (no scaling needed at inference) (c) To regularize the weights more strongly (d) To prevent vanishing gradients

(b) Dividing by keep_prob during training ensures that the expected value of each neuron's output remains the same during training and inference. This eliminates the need for any special scaling at test time.
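
A quick check of the expected-value argument, as a minimal sketch: the activations are set to 1 and keep_prob to 0.8 purely for illustration (both are assumptions for this example).

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
a = np.ones((1000, 1000))   # activations, all 1 so the mean is easy to read

# Training: drop units at random, then rescale the survivors
mask = rng.random(a.shape) < keep_prob
a_train = a * mask / keep_prob

# Inference: use all units, no scaling
a_test = a

print(a_train.mean(), a_test.mean())
```

The training-time mean stays close to the inference-time mean even though about 20% of the units are zeroed.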

MCQ 7

Which gradient problem is addressed by using ReLU instead of sigmoid?

(a) Exploding gradients (b) Vanishing gradients (c) Oscillating gradients (d) Noisy gradients

(b) Vanishing gradients. Sigmoid's derivative maxes out at 0.25, so gradients shrink exponentially through layers. ReLU's derivative is 1 for positive inputs, preventing this shrinkage.

MCQ 8

The formula dW[l] = (1/m) · dZ[l] · A[l-1]ᵀ computes the gradient of the cost w.r.t. weights. The A[l-1]ᵀ term comes from:

(a) The activation function derivative (b) The chain rule applied to Z[l] = W[l]·A[l-1] + b[l], specifically ∂Z/∂W = A[l-1]ᵀ (c) The loss function derivative (d) The bias gradient

(b) Since Z[l] = W[l]·A[l-1] + b[l], the derivative ∂Z[l]/∂W[l] = A[l-1]ᵀ by matrix calculus. Combined with dZ[l] from the chain rule and the 1/m averaging, we get the complete dW formula.

MCQ 9

For gradient checking, a relative difference of less than ______ indicates correct backpropagation:

(a) 10⁻¹ (b) 10⁻³ (c) 10⁻⁵ (d) 10⁻⁷

(d) 10⁻⁷. A difference below 10⁻⁷ indicates the gradients match very well. Between 10⁻⁷ and 10⁻⁵ is acceptable. Above 10⁻³ indicates a bug in backpropagation.
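
The thresholds can be tried on a toy function. The sketch below assumes a commonly used definition of the relative difference, ‖g₁ − g₂‖ / (‖g₁‖ + ‖g₂‖); the function f(θ) = Σθ³ and the helper names are hypothetical and chosen only for illustration.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-5):
    """Central-difference gradient, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

def relative_difference(g1, g2):
    return np.linalg.norm(g1 - g2) / (np.linalg.norm(g1) + np.linalg.norm(g2))

f = lambda theta: np.sum(theta ** 3)
theta = np.array([1.0, -2.0, 0.5])

numeric = numerical_gradient(f, theta)
correct = 3 * theta ** 2    # true analytic gradient
buggy = 2 * theta ** 2      # deliberately wrong coefficient

diff_ok = relative_difference(correct, numeric)
diff_bug = relative_difference(buggy, numeric)
print(diff_ok, diff_bug)
```

The correct gradient lands far below the 10⁻⁷ threshold, while the deliberately wrong one lands far above 10⁻³.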

MCQ 10

GPT-3 was trained using which optimizer and learning rate schedule?

(a) SGD with step decay (b) Adam with linear warmup + cosine decay (c) RMSprop with exponential decay (d) AdaGrad with constant learning rate

(b) GPT-3 used Adam (β₁=0.9, β₂=0.95) with linear warmup over 375M tokens followed by cosine decay to 10% of the peak learning rate.

MCQ 11

What is the typical mini-batch size used in practice for deep learning?

(a) 1 (pure SGD) (b) 32 to 512 (c) Equal to dataset size (full batch) (d) Always exactly 256

(b) 32 to 512. Mini-batch sizes are typically powers of 2 between 32 and 512, chosen to balance gradient quality with GPU utilization. Batch sizes should fit in GPU memory and are often tuned for specific hardware.

21

12.21 — Interview Questions

Interview Q1 — Google / DeepMind

Explain backpropagation to someone who knows calculus but not machine learning. What is it computing and why is it efficient?

Backpropagation computes the gradient of a loss function with respect to all neural network parameters. It uses the chain rule of calculus, applied layer by layer from output to input. Its efficiency comes from dynamic programming: intermediate gradients (∂L/∂A[l]) computed for layer l+1 are reused when computing gradients for layer l. This gives O(N) complexity instead of O(N²) for computing all N gradients independently.

Interview Q2 — Meta / FAIR

Why does Adam work better than SGD for most tasks? When might SGD+Momentum be preferred?

Adam works well because it (1) builds momentum to dampen oscillations, (2) adapts learning rate per-parameter, and (3) uses bias correction. However, SGD+Momentum often achieves better generalization (lower test loss) because Adam's adaptive learning rates can converge to sharp minima, while SGD's noise helps find flatter minima that generalize better. SGD+Momentum is often preferred for image classification with careful tuning (schedule, warmup).

Interview Q3 — Amazon / AWS

What is the vanishing gradient problem? Name 3 techniques to mitigate it.

Gradients shrink exponentially when propagated through layers with saturating activations (sigmoid: max derivative 0.25). Solutions: (1) ReLU activations (derivative = 1 for z > 0), (2) Skip/residual connections (provide direct gradient paths), (3) Proper initialization (He for ReLU, Xavier for sigmoid), (4) Batch Normalization (keeps activations in non-saturating range), (5) LSTM/GRU for recurrent networks.

Interview Q4 — Microsoft Research

Explain Batch Normalization. Where is it placed? What are γ and β? What happens at inference time?

BatchNorm normalizes pre-activations to zero mean and unit variance across a mini-batch. Placed between linear transform and activation. γ (scale) and β (shift) are learnable — they let the network undo normalization if it hurts. At inference, use running statistics (EMA of training batch statistics) since there's no "batch" during single-sample inference. This ensures deterministic behavior.
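
The training/inference split can be sketched as follows. This is a minimal forward-only illustration, not the chapter's implementation: the class name BatchNorm1D, the momentum value 0.9, and the (features, batch) layout are assumptions for this example.

```python
import numpy as np

class BatchNorm1D:
    """Forward pass only; Z has shape (n_features, m)."""

    def __init__(self, n_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones((n_features, 1))    # learnable scale
        self.beta = np.zeros((n_features, 1))    # learnable shift
        self.running_mean = np.zeros((n_features, 1))
        self.running_var = np.ones((n_features, 1))
        self.momentum, self.eps = momentum, eps

    def forward(self, Z, training):
        if training:
            mu = Z.mean(axis=1, keepdims=True)
            var = Z.var(axis=1, keepdims=True)
            # Update the EMA of the batch statistics for later inference
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mean, self.running_var
        Z_hat = (Z - mu) / np.sqrt(var + self.eps)
        return self.gamma * Z_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(4)
Z = 3.0 + 2.0 * rng.standard_normal((4, 64))

out_train = bn.forward(Z, training=True)     # uses batch statistics
x = Z[:, :1]                                 # a single sample
y1 = bn.forward(x, training=False)           # uses running statistics
y2 = bn.forward(x, training=False)           # same input, same output
```

In training mode each feature of out_train has mean close to 0 and variance close to 1. In inference mode a single sample can be processed, and repeated calls return identical results because no batch statistics are involved.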

Interview Q5 — Apple ML

What is gradient clipping? When do you need it? What's the difference between clipping by value and clipping by norm?

Gradient clipping limits gradient magnitudes to prevent exploding gradients (common in RNNs and deep networks). Clip by value: clamp each gradient element to [-threshold, threshold] — changes gradient direction. Clip by norm: if ||∇J||₂ > threshold, scale all gradients by threshold/||∇J||₂ — preserves direction. Clip by norm is preferred because it maintains the relative gradient magnitudes between parameters.
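
The difference between the two modes shows up on a small example. This is a sketch with hypothetical helper names (clip_by_value, clip_by_norm, global_norm) and an arbitrary threshold of 1.0; the gradients are chosen so that the global norm is exactly 13.

```python
import numpy as np

def global_norm(grads):
    return np.sqrt(sum(np.sum(g ** 2) for g in grads))

def clip_by_value(grads, threshold):
    # Clamp every element independently: can change the direction
    return [np.clip(g, -threshold, threshold) for g in grads]

def clip_by_norm(grads, threshold):
    # Rescale all gradients by one common factor: direction is preserved
    norm = global_norm(grads)
    if norm <= threshold:
        return [g.copy() for g in grads]
    return [g * (threshold / norm) for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13

by_value = clip_by_value(grads, 1.0)
by_norm = clip_by_norm(grads, 1.0)

print(by_value)
print(by_norm, global_norm(by_norm))
```

Clipping by value flattens every component to 1, losing the 3 : 4 : 12 ratio, whereas clipping by norm keeps the ratio and only shrinks the overall length to the threshold.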

Interview Q6 — OpenAI

What is a learning rate warmup and why is it used for training large models like GPT?

Warmup gradually increases the learning rate from near-zero to the target value over the first few thousand steps. It's crucial for large models because: (1) Adam's second moment estimates are unreliable early on, since they are built from only a handful of gradient samples, and this can produce very large effective step sizes that destabilize training; (2) the model's early loss landscape is highly non-smooth, so small steps help navigate the initial chaotic region before the optimizer "warms up" its statistics.

Interview Q7 — NVIDIA

How does mixed precision training work? What role does gradient scaling play?

Mixed precision uses FP16 for the forward/backward pass (half the memory, and roughly 2× faster on hardware with FP16 support) and FP32 for parameter updates (to maintain precision). Loss scaling multiplies the loss by a large factor (e.g., 1024) before backprop to prevent small gradients from underflowing in FP16, whose smallest normal positive value is about 6 × 10⁻⁵ (about 6 × 10⁻⁸ including subnormals). Gradients are then divided by the scale factor before the optimizer step.
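
The underflow and its fix can be reproduced with NumPy's float16 type. This is a sketch only: the gradient value 1e-8 and the scale 1024 are illustrative assumptions, and real frameworks apply the scaling to the loss rather than to a single number.

```python
import numpy as np

grad = 1e-8       # a small gradient, below FP16's smallest subnormal (about 6e-8)
scale = 1024.0    # illustrative loss-scale factor

# Stored directly in FP16, the gradient is flushed to zero
unscaled_fp16 = np.float16(grad)

# Scaled first, it is large enough to survive the FP16 representation
scaled_fp16 = np.float16(grad * scale)

# Unscale in FP32 before the optimizer step
recovered = np.float32(scaled_fp16) / np.float32(scale)

print(unscaled_fp16, scaled_fp16, recovered)
```

The recovered value is not exact, because FP16 has limited precision, but it is close to the original gradient instead of being lost entirely.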

Interview Q8 — TCS / Infosys

Explain dropout as a regularizer. How does inverted dropout simplify inference?

Dropout randomly zeroes out neurons during training, forcing the network to learn redundant representations (ensemble interpretation: it's like averaging exponentially many sub-networks). Inverted dropout: divide surviving neuron outputs by keep_prob during training. This ensures E[output_train] = E[output_test], so inference requires zero modification — just use all neurons normally.

Interview Q9 — Flipkart / Myntra

In practice, how do you choose between Adam and SGD? What learning rate would you start with for each?

Start with Adam (lr=0.001) for initial experiments — it converges fast with default hyperparameters. If maximum accuracy matters (e.g., competition), switch to SGD+Momentum (lr=0.1, momentum=0.9) with cosine annealing. Adam is default for NLP/transformers, SGD is often better for computer vision (ResNets). Use learning rate finder to pick optimal starting LR for your specific problem.

Interview Q10 — Startup CTO

Your model's training loss is decreasing but validation loss plateaus. Your gradient norms are healthy (not vanishing/exploding). What's happening and how do you fix it?

Classic overfitting — the model memorizes training data but doesn't generalize. Fixes: (1) Add dropout (start with 0.3-0.5), (2) Increase L2 regularization, (3) Add data augmentation, (4) Use early stopping (stop training when val loss starts increasing), (5) Reduce model capacity, (6) Add Batch Normalization, (7) Collect more data.
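
Fix (4), early stopping, is the easiest to sketch. The function below is a hypothetical helper, not part of any library: it takes a recorded list of validation losses and a patience value (both assumptions for this example) and reports which epoch's weights to keep and when training would have stopped.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stopped_at) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    stopped_at = len(val_losses) - 1   # default: ran to the end
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:     # no improvement for `patience` epochs
                stopped_at = epoch
                break
    return best_epoch, stopped_at

# Illustrative validation curve: improves, then starts to rise
val_losses = [0.90, 0.70, 0.60, 0.61, 0.63, 0.65, 0.50]
best_epoch, stopped_at = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stopped_at)
```

In a real training loop the same counter runs online, and the weights from best_epoch are restored from a checkpoint. Note that the late improvement at the end of this illustrative curve is never seen, which is the trade-off a small patience makes.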

22

12.22 — Research Problems

Research Problem 1: Optimizer Design for Sparse Gradients

Background: In recommendation systems and NLP, most gradients are zero (sparse embeddings). Standard Adam wastes computation updating second moment estimates for zero gradients.

Question: Design an optimizer variant that efficiently handles sparse gradients by only updating moment estimates for non-zero gradient entries. Prove mathematically that your optimizer has the same convergence guarantee as Adam. Implement it and benchmark on a recommendation system with 100K item embeddings.

Hint: Look into SparseAdam and Lazy Adam implementations. Consider how bias correction must be modified for sparse updates.

Research Problem 2: Automatic Learning Rate Schedule Discovery

Background: The optimal learning rate schedule depends on the dataset, model architecture, and optimization landscape — yet practitioners often use hand-crafted schedules.

Question: Develop a meta-learning approach that automatically discovers the optimal learning rate schedule for a given task. The meta-learner should train on a family of tasks and learn to predict good LR schedules for unseen tasks based on early training metrics (loss curve shape, gradient statistics). Compare your approach to cosine annealing and warmup + linear decay baselines.

Research Problem 3: Gradient Conflict in Multi-Task Learning

Background: In multi-task learning (like Tesla's FSD), gradients from different tasks can conflict — task A wants to increase a weight while task B wants to decrease it.

Question: Implement and compare three gradient conflict resolution methods: (1) PCGrad (project conflicting gradients), (2) GradNorm (dynamically weight task losses), (3) Your own method. Evaluate on a multi-task benchmark (e.g., NYUv2 with segmentation, depth, and normals). Analyze which layers experience the most conflict and why.
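
As a starting point, the core projection step of PCGrad can be sketched for two tasks. This covers only the pairwise projection on flattened gradient vectors; the helper name project_if_conflicting and the example gradients are assumptions, and the full method also handles more than two tasks and combines the projected gradients.

```python
import numpy as np

def project_if_conflicting(g_i, g_j):
    """If g_i conflicts with g_j (negative dot product), remove from g_i
    its component along g_j; otherwise return g_i unchanged."""
    dot = np.dot(g_i, g_j)
    if dot < 0:
        return g_i - (dot / np.dot(g_j, g_j)) * g_j
    return g_i

g_a = np.array([1.0, 0.0])    # illustrative gradient for task A
g_b = np.array([-1.0, 1.0])   # illustrative gradient for task B (conflicts with A)

g_a_proj = project_if_conflicting(g_a, g_b)
print(g_a_proj, np.dot(g_a_proj, g_b))
```

After projection, task A's gradient no longer has a component that opposes task B's gradient: their dot product is zero.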

Research Problem 4: Backpropagation Alternatives

Background: Backpropagation is biologically implausible (the brain doesn't have symmetric feedback connections). This limits our understanding of biological learning.

Question: Implement and compare backpropagation with two alternative credit assignment methods: (1) Feedback Alignment (random feedback weights), (2) Direct Feedback Alignment (random connections from output to each layer). Compare convergence speed and final accuracy on MNIST and CIFAR-10. Does scaling to deeper networks work for these alternatives?

23

12.23 — Key Takeaways

1

Backpropagation = chain rule + dynamic programming. It computes all gradients in one backward pass (O(N) time), making deep learning computationally feasible. Without it, training would require O(N²) or worse.

2

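Five equations cover backpropagation in a standard feedforward network: dZ[L] = A[L]−Y (for a sigmoid or softmax output paired with cross-entropy loss), dW[l] = (1/m)·dZ[l]·A[l-1]ᵀ, db[l] = (1/m)·sum(dZ[l]), dA[l-1] = W[l]ᵀ·dZ[l], dZ[l] = dA[l] ⊙ g'(Z[l]). Applied layer by layer from output to input, these formulas yield every gradient needed for training.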

3

Adam is the default optimizer for most deep learning tasks. It combines momentum (first moment) and adaptive learning rates (second moment) with bias correction. Typical settings: β₁=0.9, β₂=0.999, ε=10⁻⁸, lr=0.001.

4

Learning rate is the most important hyperparameter. Use warmup + cosine annealing for large models. The learning rate finder technique can help choose the starting value. Too high → divergence, too low → slow convergence.

5

Batch Normalization accelerates training by normalizing layer inputs, allowing higher learning rates and reducing sensitivity to initialization. Use running statistics at inference time.

6

Dropout is the simplest effective regularizer. Inverted dropout (divide by keep_prob during training) avoids any test-time modification. Typical rates: 0.2-0.5 for hidden layers, never on the output layer.

7

Vanishing gradients → use ReLU + He init + skip connections. Exploding gradients → use gradient clipping + proper initialization. Always monitor gradient norms during training to detect these problems early.

8

Gradient checking verifies backprop correctness. Use it during development (never in production, since it costs O(N²)). A relative difference below 10⁻⁷ is strong evidence that your implementation is correct.

9

Real-world training is an engineering challenge. GPT-3 required custom optimizer settings (β₂=0.95), warmup schedules, gradient clipping, and millions of dollars of compute. Mastering these techniques is a valuable skill in industry.

24

12.24 — References

Foundational Papers

Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). "Learning representations by back-propagating errors." Nature, 323(6088), 533-536.
Kingma, D.P., & Ba, J. (2014). "Adam: A Method for Stochastic Optimization." arXiv:1412.6980.
Ioffe, S., & Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." ICML.
Srivastava, N., et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." JMLR, 15, 1929-1958.
He, K., et al. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." ICCV.
Glorot, X., & Bengio, Y. (2010). "Understanding the difficulty of training deep feedforward neural networks." AISTATS.
Loshchilov, I., & Hutter, F. (2016). "SGDR: Stochastic Gradient Descent with Warm Restarts." arXiv:1608.03983.

Textbooks

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapters 6-8.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 5.
Nielsen, M. (2015). Neural Networks and Deep Learning. Online textbook (free).

Industry & Applied References

Brown, T., et al. (2020). "Language Models are Few-Shot Learners" (GPT-3). NeurIPS.
You, Y., et al. (2019). "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes." arXiv:1904.00962 (LAMB optimizer).
Smith, L.N. (2017). "Cyclical Learning Rates for Training Neural Networks." WACV.
TCS Research. "Automated Machine Learning for Enterprise Applications." TCS white paper.
Tesla AI Day presentations (2021, 2022). Training infrastructure and optimization details.