Neural Networks & Deep Learning

Chapter 13: Recurrent Neural Networks — Memory in Networks

Teaching Neural Networks to Remember the Past, Understand Sequences & Predict the Future

⏱️ Reading Time: ~4 hours  |  📖 Part IV: Architectures  |  🧠 Theory + Code Chapter

📋 Prerequisites: Chapters 7–8 (Deep Networks, Backpropagation, Optimization)

Bloom's Taxonomy Map for This Chapter

Bloom's LevelWhat You'll Achieve
🔵 RememberRecall the RNN cell equations, BPTT algorithm, and the definitions of vanishing/exploding gradients
🔵 UnderstandExplain why feedforward networks fail on sequential data, how hidden states carry memory, and why gradients vanish over long sequences
🟢 ApplyImplement an RNN cell from scratch in NumPy, train a character-level language model, and apply gradient clipping
🟡 AnalyzeTrace gradient flow through an unrolled RNN, compute BPTT for a 3-step sequence, and identify vanishing gradient signatures
🟠 EvaluateChoose the right RNN architecture (one-to-many, many-to-one, many-to-many) for a given task, and assess when vanilla RNNs are sufficient vs. when LSTMs/GRUs are needed
🔴 CreateDesign an end-to-end time-series forecasting pipeline for grocery demand prediction and build a sentiment classifier using RNNs
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Define sequential data and explain why standard feedforward networks cannot capture temporal dependencies
  • Derive the RNN forward equations: a⟨t⟩ = tanh(W_aa · a⟨t-1⟩ + W_ax · x⟨t⟩ + b_a) and ŷ⟨t⟩ = softmax(W_ya · a⟨t⟩ + b_y)
  • Classify RNN architectures — one-to-one, one-to-many, many-to-one, many-to-many — and map each to real tasks
  • Derive Backpropagation Through Time (BPTT) with full gradient chain-rule expansions
  • Explain the vanishing gradient problem mathematically: why ∏ diag(tanh'(·)) · W_aa shrinks exponentially
  • Apply gradient clipping to prevent exploding gradients during training
  • Formulate language modeling as P(sentence) = ∏ P(word_t | word_1, …, word_{t-1}) and train a character-level model
  • Implement an RNN cell from scratch using NumPy — forward pass, BPTT, and text generation
  • Build a time-series forecasting model for real-world demand prediction using PyTorch
  • Evaluate when vanilla RNNs are insufficient and motivate the need for LSTM/GRU (covered in Chapter 14)
Section 2

Opening Hook — The IRCTC Durga Puja Rush

🚂 How Does IRCTC Know 14 Lakh Tickets Will Be Booked Tomorrow?

Every year during Durga Puja, IRCTC faces a tsunami of ticket bookings. Kolkata-bound trains from Delhi, Mumbai, Bangalore, and Chennai see demand spike by 300–500% in the weeks before the festival. In 2024, IRCTC handled over 25 lakh bookings in a single day during peak season.

Here's the key insight: today's ticket demand depends on what happened yesterday, last week, and last month. If 8 lakh tickets were booked on Monday, 10 lakh on Tuesday, and 12 lakh on Wednesday — you can feel the crescendo building toward the weekend rush. A standard feedforward network sees each day in isolation. It has no memory.

What if the network could remember? What if it could carry forward information from Monday → Tuesday → Wednesday and use that accumulated context to predict Thursday's demand?

That is exactly what a Recurrent Neural Network (RNN) does. It adds a hidden state — a memory vector — that gets updated at every time step. The RNN doesn't just see "12 lakh tickets on Wednesday." It sees "8 → 10 → 12 lakh, with increasing momentum, during pre-Puja season, with weather turning pleasant" — a sequence.

IRCTCIndian RailwaysDemand ForecastingSequential Data
Sequential data is everywhere in India: UPI transaction patterns on Paytm (₹20+ lakh crore monthly), Jio's network traffic spikes during cricket matches, Flipkart's order surges during Big Billion Days, Zomato's hourly order patterns across 500+ cities, and Ola's ride-demand curves during monsoon. Every one of these problems requires a neural network with memory.
Section 3

Core Concepts

13.1 — Why Sequential Data Needs Memory

A feedforward network processes a fixed-size input and produces a fixed-size output. It has no notion of order. Consider these two sentences:

  • "Sachin hit the ball" → Positive (cricket commentary)
  • "The ball hit Sachin" → Negative (injury report)

Both sentences contain the exact same words. A bag-of-words feedforward model cannot distinguish them. Order matters.

Types of Sequential Data

Data TypeSequence DimensionIndian Example
Time SeriesValues over time stepsSensex daily closing prices, IRCTC daily bookings
Text / NLPWords/characters in orderHindi poetry, Amazon India product reviews
Audio / SpeechSound amplitude samplesAlexa Hindi voice commands, Bhashini translations
VideoFrames over timeCCTV footage analysis for Smart Cities Mission
Sensor DataIoT readings over timeISRO satellite telemetry, AgriStack soil moisture
User BehaviorClick/action sequencesFlipkart browsing history, Hotstar watch sequences

The Fundamental Limitation of Feedforward Networks

A feedforward network maps input x to output y: y = f(x). For sequential data at time t, we need:

y⟨t⟩ = f(x⟨t⟩, x⟨t-1⟩, x⟨t-2⟩, …, x⟨1⟩)

This means the output at time t depends on all previous inputs. We could concatenate all past inputs into one giant vector, but this approach has three fatal problems:

  1. Variable-length sequences: Sentences have different lengths; we'd need a different network for each length
  2. No parameter sharing: The weight connecting "word at position 3" to the output is different from "word at position 7" — the network can't generalize across positions
  3. Computational explosion: For a 500-word review, the input vector is 500 × embedding_dim — enormous!
Jeffrey Elman published the foundational "Finding Structure in Time" paper in 1990, introducing what we now call the "Elman network" — the simplest form of RNN. The key idea: instead of feeding in the entire history, maintain a hidden state that summarizes everything seen so far.

13.2 — The RNN Cell: A Neuron with Memory

The core idea of an RNN is breathtakingly simple: the output of the hidden layer at time step t is fed back as an additional input at time step t+1.

🧠 The RNN Cell Equations

Hidden State Update (Memory Update)
a⟨t⟩ = tanh(W_aa · a⟨t-1⟩ + W_ax · x⟨t⟩ + b_a)

Where:

  • a⟨t⟩ — hidden state (activation) at time step t, shape (n_a, 1)
  • a⟨t-1⟩ — hidden state from the previous time step, shape (n_a, 1)
  • x⟨t⟩ — input at current time step, shape (n_x, 1)
  • W_aa — weight matrix for hidden-to-hidden connection, shape (n_a, n_a)
  • W_ax — weight matrix for input-to-hidden connection, shape (n_a, n_x)
  • b_a — bias vector, shape (n_a, 1)
Output Prediction
ŷ⟨t⟩ = softmax(W_ya · a⟨t⟩ + b_y)

Where:

  • ŷ⟨t⟩ — predicted output at time step t, shape (n_y, 1)
  • W_ya — weight matrix for hidden-to-output, shape (n_y, n_a)
  • b_y — output bias vector, shape (n_y, 1)
Compact Notation (Common in Literature)
a⟨t⟩ = tanh(W_a · [a⟨t-1⟩; x⟨t⟩] + b_a)

Here W_a = [W_aa | W_ax] is a concatenated weight matrix of shape (n_a, n_a + n_x), and [a⟨t-1⟩; x⟨t⟩] is the vertical concatenation of the previous hidden state and current input.

Initial Hidden State

At t = 0, we initialize a⟨0⟩ as a zero vector: a⟨0⟩ = 0⃗. This represents "no memory before the sequence starts."

Why tanh and Not ReLU?

The hidden state is computed by repeatedly multiplying by W_aa and applying an activation. Using ReLU (unbounded) would cause the hidden state to grow without bound over long sequences. The tanh function squashes values to [-1, +1], keeping the hidden state bounded. However, this bounded nature also contributes to the vanishing gradient problem (Section 13.5).

Parameter Sharing is the Key Insight. The same weight matrices W_aa, W_ax, W_ya are used at every time step. This means:
  • The RNN can handle sequences of any length — same parameters, just more steps
  • Patterns learned at position 5 transfer to position 500
  • Total parameters are independent of sequence length

The Hidden State as a "Summary"

Think of a⟨t⟩ as a compressed summary of everything the network has seen from time step 1 through time step t. It's like reading a 500-page novel and trying to carry the plot summary in your head — at each new page, you update your mental summary based on the new information and your existing understanding.

"RNNs have different parameters for each time step." No! The same W_aa, W_ax, W_ya, b_a, b_y are reused at every time step. This is called weight tying or parameter sharing across time. It's what makes RNNs generalizable to variable-length sequences.

13.3 — RNN Architectures: One-to-Many, Many-to-One, Many-to-Many

The flexibility of RNNs comes from how we configure inputs and outputs across time steps. There are five fundamental architectures:

📐 RNN Architecture Taxonomy

1. One-to-One (Standard Network)

T_x = T_y = 1. This is just a regular feedforward network — no recurrence. Included for completeness.

Example: Image classification (single image → single label).

2. One-to-Many

T_x = 1, T_y > 1. Single input generates a sequence of outputs. The input is fed only at t=1; subsequent steps receive no external input (or the previous output is fed as input).

Example: Music generation (seed note → melody), image captioning (image → "A man riding a bicycle on Marine Drive").

3. Many-to-One

T_x > 1, T_y = 1. A sequence of inputs produces a single output. The output is taken only from the last time step.

Example: Sentiment analysis ("Flipkart delivery was amazing!" → ⭐⭐⭐⭐⭐), spam detection on Hindi SMS.

4. Many-to-Many (Same Length)

T_x = T_y. Each input time step produces a corresponding output at the same time step.

Example: Named Entity Recognition ("Narendra Modi visited Varanasi" → [PERSON, PERSON, O, LOCATION]), POS tagging in Hindi.

5. Many-to-Many (Different Length) — Encoder-Decoder

T_x ≠ T_y. An encoder RNN reads the input sequence and compresses it into a context vector. A decoder RNN generates the output sequence from this context.

Example: Machine translation (Hindi → English on Google Translate), text summarization of news articles on Inshorts.

ArchitectureT_xT_yIndian Industry Use Case
One-to-One11Aadhaar fingerprint → identity match
One-to-Many1TAlbum artwork → AI-generated Bollywood song lyrics
Many-to-OneT1Amazon India review text → star rating prediction
Many-to-Many (=)TTHindi sentence → POS tags (noun, verb, adjective…)
Many-to-Many (≠)T_xT_yBhashini: Hindi speech → Tamil text translation

13.4 — Backpropagation Through Time (BPTT)

Training an RNN uses a variant of backpropagation called Backpropagation Through Time (BPTT). The idea: "unroll" the RNN across all time steps to create a deep feedforward network, then apply standard backpropagation.

Step 1: Forward Pass (Unrolled)

For a sequence of length T:

# At each time step t = 1, 2, ..., T:
a⟨t⟩ = tanh(W_aa · a⟨t-1⟩ + W_ax · x⟨t⟩ + b_a)
ŷ⟨t⟩ = softmax(W_ya · a⟨t⟩ + b_y)
L⟨t⟩ = -Σ_i  y_i⟨t⟩ · log(ŷ_i⟨t⟩)     # Cross-entropy at step t

Step 2: Total Loss

L = Σ_{t=1}^{T} L⟨t⟩ = -Σ_{t=1}^{T} Σ_i y_i⟨t⟩ · log(ŷ_i⟨t⟩)

Step 3: Backward Pass — Gradient Derivation

The key challenge: W_aa affects every time step (because it's shared). So its gradient accumulates contributions from all time steps.

Gradient w.r.t. Output Weights (Simple)
∂L/∂W_ya = Σ_{t=1}^{T} (ŷ⟨t⟩ - y⟨t⟩) · a⟨t⟩ᵀ
Gradient w.r.t. Hidden State at Final Step
∂L/∂a⟨t⟩ = W_yaᵀ · (ŷ⟨t⟩ - y⟨t⟩) + W_aaᵀ · diag(1 - a⟨t+1⟩²) · ∂L/∂a⟨t+1⟩

Notice the recursive structure — the gradient at time t depends on the gradient at time t+1. This is why it's called "backpropagation through time" — we propagate gradients backward from T to 1.

Gradient w.r.t. Shared Recurrent Weights
∂L/∂W_aa = Σ_{t=1}^{T} diag(1 - a⟨t⟩²) · δ⟨t⟩ · a⟨t-1⟩ᵀ

where δ⟨t⟩ is the backpropagated error signal at time step t, computed recursively:

δ⟨t⟩ = W_yaᵀ · (ŷ⟨t⟩ - y⟨t⟩) + W_aaᵀ · diag(1 - a⟨t+1⟩²) · δ⟨t+1⟩
Gradient w.r.t. Input Weights
∂L/∂W_ax = Σ_{t=1}^{T} diag(1 - a⟨t⟩²) · δ⟨t⟩ · x⟨t⟩ᵀ
BPTT ≡ Standard Backpropagation on the Unrolled Graph. There is no special "new" algorithm. The only subtlety is that because W_aa, W_ax, W_ya are shared across time steps, their gradients are the sum of contributions from all time steps.

Truncated BPTT

For very long sequences (e.g., a 10,000-word document), unrolling the full graph is computationally prohibitive. Truncated BPTT limits the backward pass to a fixed window of k time steps (typically k = 20–50). This sacrifices gradient accuracy for long-range dependencies but makes training tractable.

13.5 — The Vanishing & Exploding Gradient Problem

This is the Achilles' heel of vanilla RNNs and the reason LSTMs/GRUs were invented.

The Mathematical Root Cause

Consider the gradient of the loss at time step T with respect to the hidden state at an earlier time step k:

∂a⟨T⟩/∂a⟨k⟩ = ∏_{t=k+1}^{T} diag(1 - a⟨t⟩²) · W_aa

This is a product of (T - k) matrices. Let's analyze what happens:

Case 1: Vanishing Gradient

The tanh' derivative satisfies 0 < tanh'(z) ≤ 1. If the largest singular value of W_aa is less than 1 (i.e., σ_max(W_aa) < 1), then:

‖∂a⟨T⟩/∂a⟨k⟩‖ ≤ (σ_max(W_aa))^{T-k} → 0 as (T-k) → ∞

The gradient decays exponentially with the distance between time steps. Information from 100 steps ago has virtually zero gradient signal — the network cannot learn long-range dependencies.

Practical Illustration

Consider the sentence: "The students, who came from Varanasi and had been studying Sanskrit for many years under the guidance of Professor Sharma at the Banaras Hindu University, were brilliant."

The verb "were" (plural) must agree with "students" (plural) — but they are separated by 25+ words. A vanilla RNN's gradient from "were" to "students" passes through 25+ matrix multiplications, shrinking to near-zero. The network never learns this long-distance agreement.

Case 2: Exploding Gradient

If σ_max(W_aa) > 1, the gradient grows exponentially:

‖∂a⟨T⟩/∂a⟨k⟩‖ → ∞ as (T-k) → ∞

This causes NaN values in weights, loss spikes to infinity, and training crashes.

Gradient Clipping: Solving Exploding Gradients

The fix for exploding gradients is surprisingly simple — clip the gradient norm:

if ‖g‖ > threshold: g ← (threshold / ‖g‖) · g

This rescales the gradient to have maximum norm = threshold (typically 1.0 or 5.0), preserving direction but limiting magnitude.

# Gradient clipping in NumPy
def clip_gradients(gradients, max_norm=5.0):
    """Clip gradients to prevent exploding gradient problem."""
    total_norm = np.sqrt(sum(np.sum(g**2) for g in gradients.values()))
    if total_norm > max_norm:
        clip_coef = max_norm / (total_norm + 1e-6)
        for key in gradients:
            gradients[key] *= clip_coef
    return gradients
"Gradient clipping solves the vanishing gradient problem." No! Gradient clipping only addresses exploding gradients. The vanishing gradient problem requires architectural changes — LSTM, GRU, skip connections, or attention mechanisms. Don't confuse the two!
Sepp Hochreiter identified the vanishing gradient problem in his 1991 diploma thesis (in German!). It took until 1997 for him and Jürgen Schmidhuber to publish the LSTM — a solution that would become one of the most impactful architectures in deep learning history.

13.6 — Language Modeling with RNNs

A language model assigns a probability to a sequence of words (or characters). It answers the question: "How likely is this sentence?"

📖 The Language Model Formulation

Joint Probability via Chain Rule
P(w_1, w_2, …, w_T) = ∏_{t=1}^{T} P(w_t | w_1, w_2, …, w_{t-1})

Example:

P("नमस्ते आप कैसे हैं") = P(नमस्ते) × P(आप | नमस्ते) × P(कैसे | नमस्ते, आप) × P(हैं | नमस्ते, आप, कैसे)

RNN as Language Model

At each time step t:

  • Input: x⟨t⟩ = w_t (current word/character, one-hot or embedded)
  • Target: y⟨t⟩ = w_{t+1} (next word/character)
  • Output: ŷ⟨t⟩ = softmax(W_ya · a⟨t⟩ + b_y) gives a probability distribution over the vocabulary for the next token
Training Loss: Cross-Entropy
L = -Σ_{t=1}^{T} log P(w_{t+1} | w_1, …, w_t)
Perplexity (Evaluation Metric)
Perplexity = exp(L / T) = exp(-1/T · Σ_{t=1}^{T} log P(w_{t+1} | w_1, …, w_t))

Lower perplexity = better model. A perplexity of k means the model is as uncertain as choosing uniformly among k options at each step.

Character-Level vs. Word-Level Models

AspectCharacter-LevelWord-Level
Vocabulary Size~70–200 characters30,000–100,000 words
Sequence LengthVery long (each char is a step)Shorter (each word is a step)
Handles Typos / New WordsYes (sees individual characters)No (unknown words = UNK)
Hindi / DevanagariWorks naturally with Unicode charsRequires word tokenizer
Training SpeedSlower (longer sequences)Faster per sequence
Language modeling in Indian languages is particularly challenging because of the rich morphology of languages like Hindi, Tamil, and Telugu. The word "खेलना" (to play) has forms like खेलता, खेलती, खेलते, खेला, खेली, खेले, खेलूँगा — a character-level RNN can learn these patterns naturally by observing the shared root "खेल" across all forms. This is one reason character-level models are popular for Indian-language NLP.

Sampling / Text Generation

Once trained, we generate text by repeatedly sampling from the predicted distribution:

  1. Feed a seed character (e.g., "क") as x⟨1⟩
  2. Compute ŷ⟨1⟩ = softmax(…) — a probability distribution over all characters
  3. Sample the next character from this distribution: x⟨2⟩ ~ ŷ⟨1⟩
  4. Feed x⟨2⟩ back, compute ŷ⟨2⟩, sample x⟨3⟩, and repeat
Temperature Scaling for Generation: Before applying softmax, divide the logits by a temperature τ:
  • τ → 0: output becomes deterministic (always picks the most likely character — "greedy")
  • τ = 1: standard softmax (samples according to learned probabilities)
  • τ → ∞: output becomes uniform random (completely random characters)
Use τ ≈ 0.7–0.8 for creative text generation with reasonable coherence.
Section 4

From-Scratch Code — RNN in NumPy

We'll implement a complete RNN from scratch: forward pass, BPTT, gradient clipping, and character-level text generation — trained on Hindi poetry.

4.1 — RNN Cell: Forward Step

import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e_z = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e_z / np.sum(e_z, axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, parameters):
    """
    Single RNN cell forward step.
    
    Args:
        xt:     input at time step t, shape (n_x, 1)
        a_prev: hidden state from previous step, shape (n_a, 1)
        parameters: dict with W_aa, W_ax, W_ya, b_a, b_y
    
    Returns:
        a_next: new hidden state, shape (n_a, 1)
        yt_hat: predicted output (softmax), shape (n_y, 1)
        cache:  values needed for backpropagation
    """
    W_aa = parameters['W_aa']
    W_ax = parameters['W_ax']
    W_ya = parameters['W_ya']
    b_a  = parameters['b_a']
    b_y  = parameters['b_y']
    
    # Hidden state update: a⟨t⟩ = tanh(W_aa · a⟨t-1⟩ + W_ax · x⟨t⟩ + b_a)
    z_a = np.dot(W_aa, a_prev) + np.dot(W_ax, xt) + b_a
    a_next = np.tanh(z_a)
    
    # Output prediction: ŷ⟨t⟩ = softmax(W_ya · a⟨t⟩ + b_y)
    z_y = np.dot(W_ya, a_next) + b_y
    yt_hat = softmax(z_y)
    
    cache = (a_next, a_prev, xt, parameters)
    return a_next, yt_hat, cache

4.2 — Full Forward Pass Over a Sequence

def rnn_forward(X, a0, parameters):
    """
    Full forward pass through the RNN for a sequence.
    
    Args:
        X:  list of one-hot input vectors, length T
        a0: initial hidden state, shape (n_a, 1)
        parameters: dict with W_aa, W_ax, W_ya, b_a, b_y
    
    Returns:
        y_hats:  list of output probabilities at each step
        caches:  list of caches for BPTT
        a_states: list of hidden states (for analysis)
    """
    caches = []
    y_hats = []
    a_states = [a0]
    a_prev = a0
    
    for t in range(len(X)):
        a_next, yt_hat, cache = rnn_cell_forward(X[t], a_prev, parameters)
        y_hats.append(yt_hat)
        caches.append(cache)
        a_states.append(a_next)
        a_prev = a_next
    
    return y_hats, caches, a_states

4.3 — BPTT: Backward Pass

def rnn_cell_backward(dy, da_next_from_future, cache):
    """
    Single RNN cell backward step.
    
    Args:
        dy:   gradient from loss at time step t, shape (n_y, 1)
        da_next_from_future: gradient flowing from future time steps
        cache: from forward pass
    
    Returns:
        gradients: dict with dW_aa, dW_ax, dW_ya, db_a, db_y
        da_prev:   gradient to pass to time step t-1
    """
    a_next, a_prev, xt, parameters = cache
    W_aa = parameters['W_aa']
    W_ya = parameters['W_ya']
    
    # Gradient of loss w.r.t. output layer
    dW_ya = np.dot(dy, a_next.T)
    db_y  = dy.copy()
    
    # Gradient flowing into hidden state from output + future
    da = np.dot(W_ya.T, dy) + da_next_from_future
    
    # Gradient through tanh: dtanh = (1 - a^2) * da
    dz = (1 - a_next ** 2) * da
    
    # Gradient w.r.t. parameters
    dW_aa = np.dot(dz, a_prev.T)
    dW_ax = np.dot(dz, xt.T)
    db_a  = dz.copy()
    
    # Gradient to pass to previous time step
    da_prev = np.dot(W_aa.T, dz)
    
    gradients = {
        'dW_aa': dW_aa, 'dW_ax': dW_ax, 'dW_ya': dW_ya,
        'db_a': db_a, 'db_y': db_y
    }
    return gradients, da_prev


def rnn_backward(X, Y, y_hats, caches, parameters):
    """
    Full BPTT backward pass.
    
    Args:
        X: list of one-hot inputs
        Y: list of one-hot targets
        y_hats: predicted outputs from forward pass
        caches: caches from forward pass
        parameters: model parameters
    
    Returns:
        gradients: accumulated gradients for all parameters
        loss: total cross-entropy loss
    """
    T = len(X)
    n_a = parameters['W_aa'].shape[0]
    
    # Initialize accumulated gradients
    grads = {k: np.zeros_like(parameters[k.replace('d','',1)]) 
             for k in ['dW_aa','dW_ax','dW_ya','db_a','db_y']}
    
    da_next = np.zeros((n_a, 1))
    loss = 0.0
    
    # Backward through time: from T-1 down to 0
    for t in reversed(range(T)):
        # Cross-entropy gradient: dy = ŷ - y
        dy = y_hats[t] - Y[t]
        
        # Loss at this step
        loss -= np.sum(Y[t] * np.log(y_hats[t] + 1e-8))
        
        # Backward through RNN cell
        cell_grads, da_next = rnn_cell_backward(dy, da_next, caches[t])
        
        # Accumulate gradients (sum over time)
        for key in grads:
            grads[key] += cell_grads[key]
    
    return grads, loss

4.4 — Gradient Clipping

def clip_gradients(gradients, max_norm=5.0):
    """Clip gradient norms to prevent exploding gradients."""
    total_norm = np.sqrt(
        sum(np.sum(gradients[k] ** 2) for k in gradients)
    )
    if total_norm > max_norm:
        clip_coef = max_norm / (total_norm + 1e-6)
        for k in gradients:
            gradients[k] *= clip_coef
    return gradients

4.5 — Character-Level Language Model on Hindi Poetry

# ─── Data Preparation ───
# Hindi poetry corpus (sample)
corpus = """
मधुशाला

मृदु भावों के अंगूरों की आज बना लाया हाला
प्रियतम के लिए कुसुम-कलियों का प्याला
भर लाया बन-बन में फिर-फिर हाथ फेरकर
अपने ही हाथों आज बना लाया मधुशाला

हाथ में मानिक-मधु का प्याला अमृत-मय
मैं पीने को आतुर देखकर इसे अपनाय
कामिनी ने अधर-पट खोले मुस्काय
और मुझको पिला रही है मधुशाला
"""

# Build character vocabulary
chars = sorted(set(corpus))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

print(f"Vocabulary size: {vocab_size} characters")
print(f"Sample chars: {chars[:10]}")

# One-hot encoding helper
def one_hot(idx, size):
    vec = np.zeros((size, 1))
    vec[idx] = 1.0
    return vec

# ─── Initialize Parameters ───
np.random.seed(42)
n_a = 128   # hidden state size
n_x = vocab_size
n_y = vocab_size

parameters = {
    'W_aa': np.random.randn(n_a, n_a) * 0.01,
    'W_ax': np.random.randn(n_a, n_x) * 0.01,
    'W_ya': np.random.randn(n_y, n_a) * 0.01,
    'b_a':  np.zeros((n_a, 1)),
    'b_y':  np.zeros((n_y, 1))
}

# ─── Training Loop ───
learning_rate = 0.01
seq_length = 25   # truncated BPTT window
num_epochs = 500

for epoch in range(num_epochs):
    total_loss = 0.0
    a_prev = np.zeros((n_a, 1))
    
    # Slide through corpus in chunks of seq_length
    for start in range(0, len(corpus) - seq_length - 1, seq_length):
        # Prepare input-target pairs
        X_seq = [one_hot(char_to_idx[corpus[start + t]], n_x) 
                 for t in range(seq_length)]
        Y_seq = [one_hot(char_to_idx[corpus[start + t + 1]], n_y) 
                 for t in range(seq_length)]
        
        # Forward pass
        y_hats, caches, a_states = rnn_forward(X_seq, a_prev, parameters)
        
        # Backward pass (BPTT)
        gradients, loss = rnn_backward(X_seq, Y_seq, y_hats, caches, parameters)
        total_loss += loss
        
        # Clip gradients
        gradients = clip_gradients(gradients, max_norm=5.0)
        
        # Update parameters (SGD)
        for key in ['W_aa', 'W_ax', 'W_ya', 'b_a', 'b_y']:
            parameters[key] -= learning_rate * gradients['d' + key]
        
        # Carry forward hidden state (detached)
        a_prev = a_states[-1].copy()
    
    if epoch % 50 == 0:
        avg_loss = total_loss / (len(corpus) // seq_length)
        print(f"Epoch {epoch}  |  Loss: {avg_loss:.4f}")

4.6 — Text Generation (Sampling)

def generate_text(parameters, seed_char, length=200, temperature=0.8):
    """Generate text character by character."""
    a = np.zeros((n_a, 1))
    x = one_hot(char_to_idx[seed_char], n_x)
    generated = seed_char
    
    for _ in range(length):
        # Forward step
        z_a = np.dot(parameters['W_aa'], a) + np.dot(parameters['W_ax'], x) + parameters['b_a']
        a = np.tanh(z_a)
        z_y = np.dot(parameters['W_ya'], a) + parameters['b_y']
        
        # Temperature scaling
        z_y = z_y / temperature
        probs = softmax(z_y).ravel()
        
        # Sample from the distribution
        idx = np.random.choice(range(n_y), p=probs)
        generated += idx_to_char[idx]
        
        # Feed sampled character as next input
        x = one_hot(idx, n_x)
    
    return generated

# Generate Hindi poetry!
print("─── Generated Text ───")
print(generate_text(parameters, seed_char="म", length=200, temperature=0.7))
Epoch 0 | Loss: 3.9124 Epoch 50 | Loss: 2.8451 Epoch 100 | Loss: 2.1803 Epoch 150 | Loss: 1.7256 Epoch 200 | Loss: 1.4590 ... ─── Generated Text ─── मधुशाला में बैठ कर पीने को अपने हाथ में प्याला भर लाया अमृत-मय हाला प्रियतम ने अधर खोले फिर-फिर हाथ फेरकर अपने मन को बना मधुशाला...
How the memory works in practice: After training, the hidden state a⟨t⟩ at each step encodes what the model "remembers." After seeing "मधुशा", the hidden state has learned that "ला" is the most likely continuation. After "प्या", it predicts "ला" again — the network has learned Hindi syllable patterns!
Section 5

Industry Code — PyTorch RNN

5.1 — Time-Series Forecasting with PyTorch RNN

import torch
import torch.nn as nn
import numpy as np

class DemandRNN(nn.Module):
    """
    RNN for time-series demand forecasting.
    Predicts next-hour order volume for quick-commerce.
    """
    def __init__(self, input_size, hidden_size, output_size, num_layers=2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # PyTorch's built-in RNN
        self.rnn = nn.RNN(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.2
        )
        
        # Output projection
        self.fc = nn.Linear(hidden_size, output_size)
    
    def forward(self, x, h0=None):
        """
        Args:
            x:  (batch, seq_len, input_size)
            h0: (num_layers, batch, hidden_size) — optional initial state
        Returns:
            out: (batch, output_size) — prediction for next step
        """
        if h0 is None:
            h0 = torch.zeros(self.num_layers, x.size(0), 
                             self.hidden_size).to(x.device)
        
        # rnn_out: (batch, seq_len, hidden_size)
        # h_n:     (num_layers, batch, hidden_size)
        rnn_out, h_n = self.rnn(x, h0)
        
        # Take the last time step's output
        last_hidden = rnn_out[:, -1, :]   # (batch, hidden_size)
        out = self.fc(last_hidden)        # (batch, output_size)
        return out, h_n

# ─── Model Configuration ───
INPUT_FEATURES = 5    # [orders, temperature, rain, holiday_flag, hour_sin]
HIDDEN_SIZE    = 64
OUTPUT_SIZE    = 1    # predicted orders for next hour

model = DemandRNN(INPUT_FEATURES, HIDDEN_SIZE, OUTPUT_SIZE)
print(model)
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
DemandRNN( (rnn): RNN(5, 64, num_layers=2, batch_first=True, dropout=0.2) (fc): Linear(in_features=64, out_features=1, bias=True) ) Total parameters: 13,057

5.2 — Training Loop with Gradient Clipping

# ─── Training Setup ───
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()
MAX_GRAD_NORM = 1.0

# ─── Training Loop ───
def train_epoch(model, dataloader, optimizer, criterion):
    model.train()
    total_loss = 0.0
    
    for batch_x, batch_y in dataloader:
        # batch_x: (batch, seq_len=24, features=5)
        # batch_y: (batch, 1) — next hour's orders
        
        optimizer.zero_grad()
        predictions, _ = model(batch_x)
        loss = criterion(predictions, batch_y)
        loss.backward()
        
        # ⚡ Gradient clipping — prevents exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        
        optimizer.step()
        total_loss += loss.item()
    
    return total_loss / len(dataloader)

# ─── Training ───
for epoch in range(100):
    train_loss = train_epoch(model, train_loader, optimizer, criterion)
    if epoch % 10 == 0:
        print(f"Epoch {epoch:3d} | Train Loss: {train_loss:.4f}")

5.3 — Sentiment Analysis: Many-to-One RNN

class SentimentRNN(nn.Module):
    """Many-to-One RNN for Amazon India review sentiment."""
    
    def __init__(self, vocab_size, embed_dim, hidden_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.RNN(embed_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)
    
    def forward(self, x):
        # x: (batch, seq_len) — token indices
        embedded = self.embedding(x)         # (batch, seq_len, embed_dim)
        rnn_out, h_n = self.rnn(embedded)    # h_n: (1, batch, hidden_size)
        
        # Many-to-One: use only the LAST hidden state
        last_hidden = h_n.squeeze(0)          # (batch, hidden_size)
        logits = self.fc(last_hidden)        # (batch, num_classes)
        return logits

# Example usage
model = SentimentRNN(
    vocab_size=30000,  # Amazon India Hindi+English vocab
    embed_dim=100,
    hidden_size=128,
    num_classes=5      # 1-star to 5-star
)
print(f"Sentiment model params: {sum(p.numel() for p in model.parameters()):,}")
Sentiment model params: 3,149,433

🏭 Industry Note: Why Companies Use LSTM/GRU, Not Vanilla RNN

In production, almost no one uses vanilla nn.RNN. Companies like Flipkart, Zepto, and Paytm use nn.LSTM or nn.GRU because vanilla RNNs struggle with sequences longer than ~20 steps. We cover LSTMs in Chapter 14, but PyTorch makes switching trivial — just replace nn.RNN with nn.LSTM!

Section 6

Visual Diagrams

6.1 — The RNN Cell (Folded View)

┌───────────────────────┐ │ │ x⟨t⟩ ─────▶│ RNN Cell │─────▶ ŷ⟨t⟩ │ │ │ a⟨t⟩ = tanh( │ ┌───────▶│ W_aa·a⟨t-1⟩ │───────┐ │ │ + W_ax·x⟨t⟩ + b_a) │ │ │ │ │ │ │ └───────────────────────┘ │ │ │ └────────────── a⟨t⟩ ◀──────────────────┘ (feedback loop — memory!)

6.2 — The RNN Unrolled Through Time

x⟨1⟩ x⟨2⟩ x⟨3⟩ x⟨4⟩ x⟨T⟩ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │ │ │ │ │ │ │ │ │ │ │ RNN │─▶│ RNN │─▶│ RNN │─▶│ RNN │─···│ RNN │ │ Cell │ │ Cell │ │ Cell │ │ Cell │ │ Cell │ │ │ │ │ │ │ │ │ │ │ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ŷ⟨1⟩ ŷ⟨2⟩ ŷ⟨3⟩ ŷ⟨4⟩ ŷ⟨T⟩ a⟨0⟩──▶ a⟨1⟩──▶ a⟨2⟩──▶ a⟨3⟩──▶ a⟨4⟩──···──▶ a⟨T⟩ (zero) (hidden states carry memory forward →) Same W_aa, W_ax, W_ya used at EVERY step!

6.3 — RNN Architecture Types

ONE-TO-ONE ONE-TO-MANY MANY-TO-ONE MANY-TO-MANY (=) MANY-TO-MANY (≠) Encoder │ Decoder x x x₁ x₂ x₃ x₁ x₂ x₃ x₁ x₂ │ │ │ │ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ │ ┌───┐ ┌───┬───┬───┐ ┌───┬───┬───┐ ┌───┬───┬───┐ ┌───┬───┬───┬───┬───┐ │RNN│ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ c │ │ │ └─┬─┘ └─┬─┴─┬─┴─┬─┘ └───┴───┴─┬─┘ └─┬─┴─┬─┴─┬─┘ └───┴───┴─┬─┴─┬─┴─┬─┘ │ │ │ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ y y₁ y₂ y₃ y y₁ y₂ y₃ y₁ y₂ y₃

6.4 — Gradient Flow in BPTT (Vanishing Gradient)

Loss Loss Loss Loss Loss ↑ ↑ ↑ ↑ ↑ │ │ │ │ │ ŷ⟨5⟩ ŷ⟨4⟩ ŷ⟨3⟩ ŷ⟨2⟩ ŷ⟨1⟩ ↑ ↑ ↑ ↑ ↑ │ │ │ │ │ a⟨5⟩ ◀── a⟨4⟩ ◀── a⟨3⟩ ◀── a⟨2⟩ ◀── a⟨1⟩ │ ↑ │ Gradient must travel through │ └────── 4 matrix multiplications ────┘ Each: W_aaᵀ · diag(1-a²) If ‖W_aa‖ < 1 → gradient → 0 (Vanishing Gradient Problem!)

6.5 — Gradient Clipping Visualization

WITHOUT CLIPPING: WITH CLIPPING (threshold=5): gradient norm gradient norm ↑ ↑ 500│ * 5 │ ─ ─ ─ ─ * ─ ─ ─ ─ (clipped) │ * * │ * * │ * * │ * * 50│ * * │ * * │* * * │* * * 5│─────────*──*── │─────────*──*── │ ** ** │ ** ** └───────────────▶ time └───────────────▶ time Training CRASHES Training STABLE (NaN loss, diverges) (gradients rescaled, direction preserved)
Section 7

Worked Example — Hand-Computing a 3-Step RNN

Let's trace through a tiny RNN by hand. We'll predict the next character in the sequence "बा" → "ल" (from "बाल" = child).

Setup

Vocabulary: {बा=0, ल=1, क=2} → n_x = n_y = 3, hidden size n_a = 2

Parameters (Simplified)
W_ax = [[0.1, 0.3, -0.1],    # shape (2, 3)
        [0.2, -0.2, 0.4]]

W_aa = [[0.5, 0.1],           # shape (2, 2)
        [-0.3, 0.6]]

W_ya = [[0.2, -0.1],          # shape (3, 2)
        [0.4,  0.3],
        [-0.2, 0.1]]

b_a = [[0], [0]]              # shape (2, 1)
b_y = [[0], [0], [0]]         # shape (3, 1)
a⟨0⟩ = [[0], [0]]             # initial hidden state

Step 1: t=1, Input = "बा" (x⟨1⟩ = [1, 0, 0]ᵀ)

# Compute z_a⟨1⟩ = W_aa · a⟨0⟩ + W_ax · x⟨1⟩ + b_a
z_a⟨1⟩ = [[0.5, 0.1],  · [[0],   +  [[0.1, 0.3, -0.1],  · [[1],   + [[0],
          [-0.3, 0.6]]    [0]]       [0.2, -0.2, 0.4]]     [0],      [0]]
                                                             [0]]

z_a⟨1⟩ = [[0],  + [[0.1],  + [[0],   = [[0.1],
          [0]]    [0.2]]     [0]]       [0.2]]

# Apply tanh
a⟨1⟩ = tanh([[0.1], [0.2]]) = [[0.0997], [0.1974]]

# Compute output
z_y⟨1⟩ = W_ya · a⟨1⟩ + b_y
z_y⟨1⟩ = [[0.2·0.0997 + (-0.1)·0.1974],    = [[0.0002],
          [0.4·0.0997 + 0.3·0.1974],            [0.0991],
          [-0.2·0.0997 + 0.1·0.1974]]           [-0.0002]]

ŷ⟨1⟩ = softmax([[0.0002], [0.0991], [-0.0002]])
     ≈ [[0.3187], [0.3519], [0.3294]]

# Network predicts "ल" (index 1) with probability 0.3519 ← highest!

Step 2: t=2, Input = "ल" (x⟨2⟩ = [0, 1, 0]ᵀ), using a⟨1⟩ from above

# Hidden state update
z_a⟨2⟩ = W_aa · a⟨1⟩ + W_ax · x⟨2⟩ + b_a
z_a⟨2⟩ = [[0.5·0.0997 + 0.1·0.1974],   +  [[0.3],   = [[0.3694],
          [-0.3·0.0997 + 0.6·0.1974]]       [-0.2]]     [-0.1116]]

a⟨2⟩ = tanh([[0.3694], [-0.1116]]) = [[0.3540], [-0.1113]]

# The hidden state NOW carries information about BOTH "बा" AND "ल"!
# a⟨1⟩ encoded "बा", and a⟨2⟩ blends that with the new input "ल"
Key Observation: Notice how a⟨2⟩ = [0.354, -0.111] is different from what we'd get if we only fed "ल" without the context of "बा". The value 0.354 in the first dimension is influenced by the history — this IS the memory mechanism! Without the recurrent connection (W_aa · a⟨1⟩), we'd get [0.291, -0.197] — a measurably different state.

BPTT Backward (Sketch for Step 2)

# Suppose target at t=2 is "क" → y⟨2⟩ = [0, 0, 1]ᵀ
# dy⟨2⟩ = ŷ⟨2⟩ - y⟨2⟩  (softmax cross-entropy gradient)

# ∂L/∂W_ya += dy⟨2⟩ · a⟨2⟩ᵀ   ← gradient accumulates at each step
# ∂L/∂a⟨2⟩ = W_yaᵀ · dy⟨2⟩    ← flows backward to hidden state
# dz⟨2⟩ = (1 - a⟨2⟩²) · ∂L/∂a⟨2⟩   ← through tanh
# ∂L/∂W_aa += dz⟨2⟩ · a⟨1⟩ᵀ   ← gradient uses PREVIOUS hidden state
# ∂L/∂a⟨1⟩ = W_aaᵀ · dz⟨2⟩    ← CONTINUES backward to t=1!
Section 8

Case Study — Zepto: 10-Minute Grocery Demand Forecasting

🛒 Zepto — Predicting Hourly Orders Across 300+ Dark Stores

The Business Problem

Zepto promises 10-minute delivery in 10+ Indian cities. This is only possible if the right products are already stocked in the dark store closest to the customer. But dark stores are small (2,000–3,000 sq ft) — they can't stock everything. They must predict exactly what will be ordered, when, and how much.

If Zepto overstocks: products expire (especially dairy, fruits, vegetables — ₹10–50 crore annual wastage). If Zepto understocks: orders get cancelled, customers churn to Blinkit/Swiggy Instamart.

Why This Is a Sequential Problem

Demand at 7 PM depends on:

  • Demand at 6 PM, 5 PM, 4 PM (hourly momentum)
  • Day-of-week pattern (weekends vs. weekdays)
  • Weather (rain → more orders → more delivery delays)
  • Festivals (Holi → colors & sweets spike, Diwali → dry fruits & decorations)
  • IPL matches (7–10 PM → snacks, chips, cold drinks surge)
  • Payday effect (month-end → higher grocery spending)

The RNN Architecture

ComponentConfiguration
ArchitectureMany-to-One RNN (24h → predict next hour)
Input Features (per hour)orders_count, avg_basket_size (₹), temperature_°C, rain_mm, is_holiday, hour_sin, hour_cos, day_of_week_encoded
Sequence LengthT = 24 (look-back of 24 hours)
Hidden Sizen_a = 64
OutputPredicted order count for the next hour
Loss FunctionMSE (Mean Squared Error)
Gradient Clippingmax_norm = 1.0

Input Feature Engineering

import pandas as pd
import numpy as np

# ─── Feature Engineering for Zepto Dark Store ───
def engineer_features(df):
    """
    Transform raw hourly data into RNN-ready features.
    df columns: timestamp, orders, avg_basket_inr, temp_c, rain_mm
    """
    df['hour'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.dayofweek
    
    # Cyclical encoding for hour (so 23→0 is smooth)
    df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
    df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
    
    # Holidays: Diwali, Holi, Independence Day, etc.
    indian_holidays = ['2024-11-01', '2024-03-25', '2024-08-15']
    df['is_holiday'] = df['timestamp'].dt.date.astype(str).isin(indian_holidays).astype(int)
    
    # Normalize numerical features
    for col in ['orders', 'avg_basket_inr', 'temp_c', 'rain_mm']:
        df[col] = (df[col] - df[col].mean()) / (df[col].std() + 1e-8)
    
    feature_cols = ['orders', 'avg_basket_inr', 'temp_c', 'rain_mm', 
                    'is_holiday', 'hour_sin', 'hour_cos']
    return df[feature_cols].values

# Create sliding window sequences
def create_sequences(data, lookback=24):
    X, y = [], []
    for i in range(lookback, len(data)):
        X.append(data[i - lookback:i])   # past 24 hours
        y.append(data[i, 0])             # next hour's orders
    return np.array(X), np.array(y).reshape(-1, 1)

Results

MetricBaseline (Yesterday's Orders)Vanilla RNNLSTM (Ch.14)
MAE (orders/hour)45.228.718.3
MAPE22.1%14.5%9.2%
Wastage Reduction18% ↓31% ↓
Stockout Reduction15% ↓27% ↓

Key Findings

  • Vanilla RNN beat the baseline by 37% on MAE — the temporal patterns (morning milk, evening snacks, weekend brunch) were clearly captured
  • Limitation: Vanilla RNN struggled with weekly patterns (7-day cycles = 168 time steps). The gradient from "last Saturday's peak" could not reach "this Saturday's prediction" — classic vanishing gradient
  • Solution preview: LSTM (Chapter 14) solved this by maintaining a separate cell state that can carry information across 168+ steps
  • Business impact: Even the vanilla RNN saved Zepto an estimated ₹8–12 crore annually in reduced perishable wastage across their dark store network
Quick-commerce in India (Zepto, Blinkit, Swiggy Instamart) processes over 2 million orders daily. Each dark store generates 24+ time series (one per hour) × 3,000+ SKUs. That's 72,000+ time series per store, multiplied by 300+ stores = over 21 million individual time series that need forecasting. RNNs/LSTMs power the backbone of these predictions.
Section 9

Common Mistakes & Misconceptions

Mistake 1: "Each time step has its own set of weights."
Wrong. RNNs use parameter sharing — the same W_aa, W_ax, W_ya are applied at every time step. This is what enables them to handle variable-length sequences. If you had different weights per step, you'd need a different model for every sequence length.
Mistake 2: "The hidden state a⟨t⟩ is the same as the output ŷ⟨t⟩."
No! a⟨t⟩ is the internal memory — it's passed to the next time step. ŷ⟨t⟩ is the prediction at time step t, computed by projecting a⟨t⟩ through W_ya and softmax. In many-to-one architectures, we only compute ŷ at the final step, but a⟨t⟩ exists at every step.
Mistake 3: "RNNs process all time steps in parallel."
No! RNNs are inherently sequential — you must compute a⟨1⟩ before you can compute a⟨2⟩, because a⟨2⟩ depends on a⟨1⟩. This sequential dependency is the main reason Transformers (with parallel self-attention) have largely replaced RNNs for many tasks.
Mistake 4: "Gradient clipping fixes the vanishing gradient problem."
Gradient clipping only prevents exploding gradients (by capping the norm). The vanishing gradient problem is the opposite — gradients are too small. You can't clip something that's already near zero. Vanishing gradients require architectural changes: LSTM, GRU, skip connections, or attention.
Mistake 5: "Longer sequences always give better results with RNNs."
Due to the vanishing gradient problem, vanilla RNNs effectively have a "memory horizon" of ~10–20 steps. Feeding a 500-step sequence doesn't mean the model uses all 500 steps — it practically "forgets" everything beyond the most recent ~20 steps. Truncated BPTT makes this explicit.
Mistake 6: "Using ReLU in RNN hidden states is fine."
ReLU is unbounded (output can be 0 to ∞). In an RNN, the hidden state is multiplied by W_aa repeatedly. With ReLU, the hidden state can grow unboundedly, causing numerical overflow. tanh bounds the state to [-1, +1]. (LSTMs use a combination of sigmoid gates and tanh for this reason.)
Mistake 7: "Forgetting to reset hidden state between unrelated sequences."
If you train on sequence A and then sequence B (unrelated), you must reset a⟨0⟩ = 0 before starting B. Otherwise, the hidden state from the end of A "leaks" into B, causing noisy gradients. In PyTorch, this means calling h0 = torch.zeros(...) for each new batch.
Section 10

Comparison Table

10.1 — RNN vs. Feedforward Networks

AspectFeedforward NetworkRecurrent Neural Network
InputFixed-size vectorVariable-length sequence
MemoryNone — each input independentHidden state carries context forward
Parameter SharingAcross layers (each layer has own W)Across time steps (same W at every step)
TrainingStandard backpropagationBPTT (backpropagation through time)
DepthNumber of layersNumber of layers × sequence length (unrolled)
ParallelismEach sample in batch parallelizableSequential within a sample (time steps depend on each other)
Key ProblemCan't handle sequences nativelyVanishing gradient over long sequences
Use CaseImage classification, tabular dataText, time series, audio, video

10.2 — RNN Architecture Selection Guide

TaskArchitectureWhy?Indian Example
Sentiment AnalysisMany-to-OneEntire review → single sentiment scoreAmazon India review → ⭐ rating
Language ModelingMany-to-Many (=)Predict next token at each stepHindi text completion on Koo app
Machine TranslationMany-to-Many (≠)Input/output lengths differBhashini: Hindi → Tamil
Music GenerationOne-to-ManyOne seed → sequence of notesAI-generated raga compositions
Named Entity RecognitionMany-to-Many (=)Each token → entity labelExtracting names from Aadhaar forms
Demand ForecastingMany-to-One24h history → next hour predictionZepto dark store inventory
Video ClassificationMany-to-OneFrame sequence → single labelCricket shot classification for Hotstar

10.3 — Vanilla RNN vs. LSTM vs. GRU (Preview)

FeatureVanilla RNNLSTM (Ch.14)GRU (Ch.14)
Equations per Cell1 (hidden update)4 (forget, input, cell, output)3 (reset, update, candidate)
Memory MechanismHidden state onlySeparate cell state (highway)Hidden state with gating
Long-term DependenciesFails (~10–20 steps)Handles 100+ stepsHandles 100+ steps
Vanishing GradientSevereMitigated (additive gradient flow)Mitigated
Parameters per Celln_a² + n_a·n_x + n_a4×(n_a² + n_a·n_x + n_a)3×(n_a² + n_a·n_x + n_a)
Training SpeedFastestSlowest (4 gates)Middle
When to UseShort sequences, simple tasksLong sequences, complex patternsLong sequences, fewer params needed
Section 11

Exercises

Section A — Multiple Choice Questions (10)

Q1.

In the RNN hidden state equation a⟨t⟩ = tanh(W_aa · a⟨t-1⟩ + W_ax · x⟨t⟩ + b_a), what is the shape of W_aa if the hidden state has 128 units?

  1. (128, 1)
  2. (128, 128)
  3. (1, 128)
  4. (128, n_x)
Answer: B — W_aa maps from hidden state (n_a=128) to hidden state (n_a=128), so its shape is (n_a, n_a) = (128, 128). This is the recurrent weight matrix that connects the previous hidden state to the current one.
RememberBeginner
Q2.

Which RNN architecture would you use for classifying an Amazon India product review as positive or negative?

  1. One-to-One
  2. One-to-Many
  3. Many-to-One
  4. Many-to-Many (same length)
Answer: C — Sentiment classification takes a sequence of words (many inputs) and produces a single label (one output). This is the Many-to-One architecture — we read the entire review and produce a single prediction from the final hidden state.
UnderstandBeginner
Q3.

The key advantage of parameter sharing in RNNs is:

  1. It makes training faster by reducing the number of gradient computations
  2. It enables the network to handle variable-length input sequences with the same model
  3. It prevents the vanishing gradient problem
  4. It ensures the output is always the same dimension as the input
Answer: B — Parameter sharing (same W_aa, W_ax, W_ya at every step) means the model doesn't change based on sequence length. A model trained on 10-word sentences can process 1,000-word documents. Without parameter sharing, you'd need a different model architecture for each sequence length.
UnderstandBeginner
Q4.

During BPTT, the gradient of the loss with respect to W_aa is computed as:

  1. The gradient at the last time step only
  2. The average of gradients across all time steps
  3. The sum of gradient contributions from all time steps
  4. The maximum gradient across all time steps
Answer: C — Since W_aa is shared across all time steps, its total gradient is the SUM of its contributions at each step: ∂L/∂W_aa = Σ_{t=1}^{T} (contribution from step t). This is analogous to how shared weights in CNNs accumulate gradients from all spatial locations.
UnderstandIntermediate
Q5.

The vanishing gradient problem in RNNs occurs because:

  1. The learning rate is too small
  2. The repeated multiplication of matrices with spectral radius < 1 causes gradients to decay exponentially
  3. The softmax function saturates at extreme values
  4. The bias terms are initialized to zero
Answer: B — The gradient through k time steps involves the product ∏ diag(1-a⟨t⟩²)·W_aa. Since tanh' ∈ (0,1] and if the largest singular value of W_aa < 1, this product shrinks exponentially with k. After 20-30 steps, the gradient is essentially zero.
AnalyzeIntermediate
Q6.

Gradient clipping with threshold τ performs the following operation when ‖g‖ > τ:

  1. g ← 0 (set gradient to zero)
  2. g ← τ · sign(g) (element-wise clipping)
  3. g ← (τ / ‖g‖) · g (rescale to norm τ, preserve direction)
  4. g ← g - τ (subtract threshold)
Answer: C — Gradient clipping rescales the entire gradient vector so its norm equals τ, while preserving the gradient direction. This is norm-based clipping, which is preferred over element-wise clipping because it maintains the relative magnitudes between gradient components.
ApplyIntermediate
Q7.

In a character-level language model, the perplexity is 15. This means:

  1. The model makes 15 errors per sentence
  2. The model is, on average, as uncertain as choosing uniformly among 15 characters at each step
  3. The model achieves 15% accuracy
  4. The vocabulary must have exactly 15 characters
Answer: B — Perplexity = exp(average negative log-likelihood). A perplexity of 15 means the model's uncertainty at each prediction step is equivalent to rolling a fair 15-sided die. Lower perplexity = more confident, better model. For a Hindi character-level model with ~70 characters, perplexity of 15 means it has narrowed down the options significantly.
UnderstandIntermediate
Q8.

Why is tanh preferred over ReLU for the hidden state activation in vanilla RNNs?

  1. tanh is computationally cheaper than ReLU
  2. tanh outputs are bounded in [-1, +1], preventing hidden state explosion over many time steps
  3. tanh always has a non-zero gradient, completely solving vanishing gradients
  4. tanh is differentiable everywhere, while ReLU is not
Answer: B — In an RNN, the hidden state is computed by repeatedly applying a⟨t⟩ = tanh(W·a⟨t-1⟩ + ...). The tanh bounds the output to [-1, +1], preventing the hidden state from growing without bound. ReLU (unbounded above) could cause a⟨t⟩ to explode over many steps. Note: tanh does NOT solve vanishing gradients — its derivative is still ≤ 1.
AnalyzeIntermediate
Q9.

An RNN with hidden size n_a=64 and input size n_x=100 is used for a many-to-many task with vocabulary size n_y=100. How many trainable parameters does it have (excluding biases)?

  1. 64 × 64 + 64 × 100 + 100 × 64 = 4,096 + 6,400 + 6,400 = 16,896
  2. 64 × 100 + 100 × 64 = 12,800
  3. 64 × 64 × 100 = 409,600
  4. 3 × 64 × 64 = 12,288
Answer: A — W_aa: (n_a, n_a) = 64×64 = 4,096; W_ax: (n_a, n_x) = 64×100 = 6,400; W_ya: (n_y, n_a) = 100×64 = 6,400. Total = 16,896. Note: this count is independent of sequence length — the same parameters are used at every time step.
ApplyIntermediate
Q10.

In truncated BPTT with window size k=20, applied to a sequence of length T=100:

  1. Gradients are computed for only the last 20 time steps and ignored for the first 80
  2. The gradient computation at any time step t only propagates backward through the previous 20 steps (t to t-20), not all the way to t=1
  3. The sequence is split into 5 independent sub-sequences of length 20 with no state carried over
  4. The network architecture is modified to only have 20 recurrent layers
Answer: B — Truncated BPTT limits how far back the gradient is propagated. The forward pass still runs through all 100 steps (maintaining the hidden state), but at each step, the backward pass only goes k=20 steps back. This trades off long-range gradient accuracy for computational efficiency.
AnalyzeAdvanced

Section B — Short Answer Questions (5)

B1.

Explain in 3–4 sentences why a feedforward network cannot properly model the task: "Given the past 7 days of Sensex closing prices, predict tomorrow's price."

A feedforward network takes a fixed-size input and has no mechanism to understand that the 7 values are ordered in time. If you shuffle the 7 prices randomly (e.g., day 5 swapped with day 2), the feedforward network with a concatenated input vector would produce different results despite having the same set of values — it cannot distinguish meaningful time order from a random permutation. Additionally, it cannot generalize if the input suddenly becomes 10 days or 30 days (different input size = different model). An RNN processes each day sequentially, maintaining a hidden state that captures trends, momentum, and temporal patterns — and works with any number of past days.
B2.

What is the difference between the "folded" and "unrolled" views of an RNN? Why is unrolling necessary for training?

The folded view shows the RNN as a single cell with a self-loop (feedback arrow from a⟨t⟩ back to the input at the next step). The unrolled view replicates this cell T times (one copy per time step) and shows the explicit dataflow from a⟨0⟩ → a⟨1⟩ → … → a⟨T⟩. Unrolling is necessary for training because backpropagation requires a directed acyclic graph (DAG) — the self-loop in the folded view creates a cycle. By unrolling, we convert the cycle into a deep feedforward graph where standard backpropagation (BPTT) can be applied. Each unrolled copy shares the same weights.
B3.

Why is the hidden state initialized to the zero vector a⟨0⟩ = 0? Could we initialize it randomly?

The zero initialization represents "no prior knowledge" — before seeing any input, the network has no memory. Random initialization is possible and sometimes used (especially in encoder-decoder models where the decoder's initial hidden state is the encoder's final state). However, random initialization adds noise to the gradient computation at early time steps and can make training less stable. In practice, zero initialization works well and is the standard because: (1) the network quickly learns meaningful hidden states after the first few time steps, and (2) it provides a consistent starting point for reproducible training.
B4.

A Zepto data scientist notices that their RNN performs well for predicting orders 1–2 hours ahead but poorly for predicting 24 hours ahead. Explain why, using the concept of vanishing gradients.

When predicting 1–2 hours ahead, the relevant information (recent order trends, current weather) is within the last few time steps — well within the vanilla RNN's effective memory range. But predicting 24 hours ahead requires the model to "remember" patterns from 24+ time steps ago. During BPTT, the gradient from the loss at t=24 must flow backward through 24 matrix multiplications (W_aaᵀ · diag(1-a²)) to reach information at t=0. Each multiplication shrinks the gradient by a factor of ~0.5–0.8, so after 24 steps: 0.7²⁴ ≈ 0.0008 — essentially zero. The model cannot learn the relationship between conditions 24 hours ago and current demand. The solution is to use LSTM/GRU, which maintain a separate "gradient highway" that allows gradients to flow over long distances without exponential decay.
B5.

Explain temperature scaling in text generation. What happens if temperature → 0 and temperature → ∞?

Temperature τ scales the logits before softmax: softmax(z/τ). As τ → 0: The softmax distribution becomes sharper — it approaches a one-hot vector placing all probability on the most likely character. The generation becomes deterministic (greedy decoding). Output is highly repetitive but grammatically consistent. As τ → ∞: z/τ → 0 for all logits, making softmax approach a uniform distribution. Every character becomes equally likely, producing completely random, incoherent gibberish. Sweet spot: τ ≈ 0.7–0.8 gives creative but coherent text — enough randomness for variety, enough structure for readability. For Hindi poetry generation, τ = 0.7 produces text that follows syllable patterns while introducing creative variations.

Section C — Long Answer Questions (3)

C1. Advanced

Derive the BPTT gradients for a 3-step RNN. Given a sequence of length T=3 with inputs x⟨1⟩, x⟨2⟩, x⟨3⟩ and targets y⟨1⟩, y⟨2⟩, y⟨3⟩, starting from the cross-entropy loss L = L⟨1⟩ + L⟨2⟩ + L⟨3⟩:

  1. Write down the complete forward pass equations for all 3 steps
  2. Derive ∂L/∂W_ya by computing the contribution from each time step
  3. Derive ∂L/∂W_aa — showing how the gradient at step 3 flows backward through steps 2 and 1
  4. Show that the gradient ∂L⟨3⟩/∂a⟨1⟩ involves the product diag(1-a⟨3⟩²)·W_aa·diag(1-a⟨2⟩²)·W_aa — two matrix multiplications, which would be k-1 multiplications for distance k
  5. Explain why this leads to vanishing gradients when ‖W_aa‖ < 1

[Expected length: 2–3 pages with equations]

C2. Intermediate

Compare and contrast the five RNN architectures (one-to-one, one-to-many, many-to-one, many-to-many same length, many-to-many different length). For each:

  1. Draw the architecture diagram showing input/output at each time step
  2. Write the mathematical formulation (which steps have inputs, which produce outputs)
  3. Give two Indian industry examples with justification
  4. Specify what the loss function looks like (at which steps is loss computed)

[Expected length: 2 pages with diagrams]

C3. Advanced

Language Modeling Deep Dive.

  1. Formulate the language model probability P(w₁, w₂, …, w_T) using the chain rule
  2. Explain how an RNN implements each conditional P(w_t | w₁, …, w_{t-1}) using the hidden state
  3. Derive the training objective (negative log-likelihood) and show it equals the cross-entropy loss summed over time steps
  4. Define perplexity mathematically and explain its intuitive interpretation
  5. For a Hindi character-level model with vocabulary size 75: What is the perplexity of a random model (uniform distribution)? What would be a "good" perplexity, and why?
  6. Explain the sampling procedure for text generation, including the role of temperature

[Expected length: 2–3 pages]

Section D — Programming Questions (2)

D1. Intermediate

Sentiment Analysis on Amazon India Reviews. Build a complete many-to-one RNN for sentiment classification:

  1. Load the Amazon India product review dataset (or create a synthetic one with 5,000 Hindi+English reviews)
  2. Tokenize the text, build a vocabulary, and convert reviews to padded integer sequences
  3. Implement a SentimentRNN class in PyTorch with: Embedding layer → RNN → Linear → Softmax
  4. Train with cross-entropy loss, Adam optimizer, and gradient clipping (max_norm=1.0)
  5. Report accuracy, confusion matrix, and show 5 example predictions with the model's internal hidden state evolution (plot ‖a⟨t⟩‖ over time for a sample review)
  6. Bonus: Compare vanilla RNN vs. LSTM performance and discuss the difference
D2. Advanced

Character-Level Text Generator for Indian Languages. Build a from-scratch character-level RNN:

  1. Collect a corpus of Hindi poetry (Harivansh Rai Bachchan's Madhushala or any publicly available Hindi text, ~10,000 characters minimum)
  2. Implement the RNN cell, forward pass, BPTT backward pass, and gradient clipping entirely in NumPy (no PyTorch)
  3. Train the model and plot the loss curve over 1,000+ epochs
  4. Generate text at three temperature values (0.3, 0.7, 1.5) and qualitatively compare the outputs
  5. Visualize the hidden state activation patterns: for a generated sequence of 50 characters, plot a heatmap of a⟨t⟩ (t × n_a matrix) to show how different hidden units activate for different characters
  6. Experiment with hidden sizes (32, 64, 128, 256) and report which produces the best text quality

Section E — Mini-Project

E1. Advanced

🛒 Zepto-Style Demand Forecasting System

Build an end-to-end demand forecasting pipeline for a simulated dark store in Mumbai:

  1. Data Generation: Simulate 90 days of hourly order data (2,160 data points) with realistic patterns:
    • Morning peak (8–10 AM: milk, bread, eggs)
    • Lunch lull (1–3 PM)
    • Evening spike (6–9 PM: dinner ingredients, snacks)
    • Weekend 2x multiplier
    • Rain → 1.5x orders (people don't go out)
    • IPL match evenings → snack orders spike
    • Diwali week → 3x orders on sweets & dry fruits
  2. Feature Engineering: Create features: hour_sin, hour_cos, day_of_week, is_weekend, temperature, rain_flag, is_holiday, lag_1h, lag_24h, rolling_mean_6h
  3. Model: Implement vanilla RNN + LSTM in PyTorch with 24-hour lookback window
  4. Evaluation: MAE, MAPE, plot predicted vs. actual for a full week
  5. Business Analysis: If Zepto stocks 10% above predicted demand as safety buffer, calculate the expected wastage cost (assume avg order = ₹350, wastage cost = ₹50 per unnecessary unit) and stockout cost (assume ₹200 lost revenue per missed order)
  6. Report: Write a 2-page report comparing vanilla RNN vs. LSTM, with insights on which time-of-day patterns each model captures well

Deliverables: Jupyter notebook + PDF report

Section 12

Chapter Summary

📝 Key Takeaways — Chapter 13

  1. Sequential data is everywhere — text, time series, audio, video. Feedforward networks fail because they have no notion of order or memory.
  2. The RNN cell adds a hidden state a⟨t⟩ that carries information forward through time: a⟨t⟩ = tanh(W_aa · a⟨t-1⟩ + W_ax · x⟨t⟩ + b_a).
  3. Parameter sharing — the same W_aa, W_ax, W_ya are used at every time step, enabling variable-length input handling.
  4. Five architectures — one-to-one, one-to-many, many-to-one, many-to-many (same/different length) — cover all sequence tasks from sentiment analysis to machine translation.
  5. BPTT trains RNNs by unrolling the computation graph across time and applying standard backpropagation. Gradients for shared weights are summed across all time steps.
  6. Vanishing gradient — the product ∏ diag(1-a²)·W_aa shrinks exponentially, preventing learning of long-range dependencies (>20 steps).
  7. Exploding gradient — the same product can grow exponentially, causing NaN values. Solved by gradient clipping: rescaling g to have max norm τ.
  8. Language modeling — RNNs learn P(word_t | history) at each step. The model generates text by sampling from the predicted distribution.
  9. Perplexity = exp(average NLL) — measures how "surprised" the model is. Lower = better.
  10. Temperature scaling — controls the creativity/randomness trade-off in text generation.
  11. Vanilla RNNs are limited — they work for short sequences but fail for long-term dependencies. This motivates LSTM and GRU (Chapter 14).
  12. Industry impact — from IRCTC booking forecasts to Zepto dark store inventory, RNNs (and their LSTM/GRU variants) power critical time-series and NLP systems across Indian tech.

✅ Self-Assessment Checklist

  • Can I write the RNN cell equations from memory?
  • Can I explain why feedforward networks fail on sequential data?
  • Can I draw all 5 RNN architecture types and give examples for each?
  • Can I derive the BPTT gradient for W_aa for a 3-step sequence?
  • Can I explain the vanishing gradient mathematically (product of matrices)?
  • Can I implement gradient clipping from scratch?
  • Can I implement an RNN cell in NumPy (forward + backward)?
  • Can I explain language modeling (chain rule → conditional probability → RNN)?
  • Can I compute perplexity and interpret what it means?
  • Can I use PyTorch's nn.RNN for time-series forecasting?
  • Can I articulate why vanilla RNNs need to be replaced by LSTMs for long sequences?
Section 13

References & Further Reading

Foundational Papers

  1. Elman, J. L. (1990). "Finding Structure in Time." Cognitive Science, 14(2), 179–211. — The original "simple recurrent network" (Elman network) paper.
  2. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning Representations by Back-Propagating Errors." Nature, 323, 533–536. — Introduced backpropagation, the foundation for BPTT.
  3. Werbos, P. J. (1990). "Backpropagation Through Time: What It Does and How to Do It." Proceedings of the IEEE, 78(10), 1550–1560. — The definitive BPTT paper.
  4. Hochreiter, S. (1991). "Untersuchungen zu dynamischen neuronalen Netzen" [Diploma Thesis]. — First formal analysis of the vanishing gradient problem in RNNs.
  5. Bengio, Y., Simard, P., & Frasconi, P. (1994). "Learning Long-Term Dependencies with Gradient Descent is Difficult." IEEE Transactions on Neural Networks, 5(2), 157–166. — Comprehensive analysis of why RNNs fail on long sequences.
  6. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). "On the Difficulty of Training Recurrent Neural Networks." ICML. — Gradient clipping and analysis of exploding/vanishing gradients.

Textbooks

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 10: Sequence Modeling — Recurrent and Recursive Nets.
  2. Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing (3rd ed. draft). Chapter 9: RNNs and LSTMs. — Available at web.stanford.edu/~jurafsky/slp3/
  3. Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). Dive into Deep Learning. Chapter 9: Recurrent Neural Networks. — Available at d2l.ai

Online Resources & Tutorials

  1. Karpathy, A. (2015). "The Unreasonable Effectiveness of Recurrent Neural Networks." — Classic blog post with character-level RNN experiments. karpathy.github.io/2015/05/21/rnn-effectiveness/
  2. Olah, C. (2015). "Understanding LSTM Networks." — Beautifully illustrated guide to RNN/LSTM internals. colah.github.io/posts/2015-08-Understanding-LSTMs/
  3. PyTorch Documentation. torch.nn.RNN. pytorch.org/docs/stable/generated/torch.nn.RNN.html
  4. Andrew Ng. Deep Learning Specialization, Course 5: Sequence Models. Coursera. — Excellent video lectures on RNNs, BPTT, and language modeling.

Indian Industry Context

  1. IRCTC Annual Report (2023–24). — Statistics on daily ticket bookings and seasonal demand patterns.
  2. RedSeer Consulting (2024). "Quick Commerce in India." — Market analysis of Zepto, Blinkit, and Swiggy Instamart.
  3. NPCI (2024). "UPI Ecosystem Statistics." — Monthly UPI transaction volumes used as sequential data examples.