Neural Networks & Deep Learning

Chapter 14: LSTMs and GRUs — Solving Long-Term Memory

How Gated Architectures Let Recurrent Networks Remember What Matters and Forget What Doesn't

⏱️ Reading Time: ~3.5 hours | 📖 Part IV: Sequence Models | 🧠 Theory + Code Chapter

📋 Prerequisites: Chapter 13 (Recurrent Neural Networks), Chapter 8 (Optimization), Chapter 7 (Backpropagation)

Bloom's Taxonomy Map for This Chapter

Bloom's Level	What You'll Achieve
🔵 Remember	Recall the LSTM gate equations (forget, input, output), GRU equations (update, reset), and the role of the cell state as a "conveyor belt" for gradients
🔵 Understand	Explain why vanilla RNNs suffer from vanishing gradients, how the cell state solves this, and why GRUs use fewer parameters than LSTMs
🟢 Apply	Implement an LSTM cell from scratch in NumPy, build a Nifty 50 stock predictor in TensorFlow, and apply Bidirectional LSTMs to NER tasks
🟡 Analyze	Trace gradients through the LSTM cell, compare LSTM vs GRU training dynamics, and analyze gate activations for interpretability
🟠 Evaluate	Choose between LSTM, GRU, and Bidirectional variants for specific applications; justify architecture choices for Indian industry problems
🔴 Create	Design a complete fraud detection pipeline using stacked Bidirectional LSTMs on sequential transaction data

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

Explain why vanilla RNNs fail on long sequences by deriving the vanishing gradient problem through repeated Jacobian multiplication
Derive the complete LSTM cell equations — forget gate (f), input/update gate (i), cell candidate (c̃), cell state (c), output gate (o), and hidden state (a) — with full mathematical notation
Derive the GRU equations — update gate (z), reset gate (r), candidate hidden state (h̃), and final hidden state (h) — and explain how GRU merges the forget and input gates
Compare LSTM vs GRU on parameter count, training speed, and performance across different sequence lengths
Explain Bidirectional RNNs — why reading a sequence both forwards and backwards helps tasks like Named Entity Recognition
Implement an LSTM cell forward pass from scratch using only NumPy
Build a TensorFlow LSTM model for NSE Nifty 50 stock price prediction using real-world financial time-series data
Build a Bidirectional LSTM for Named Entity Recognition on Indian news articles
Analyze the HDFC Bank case study — how LSTMs on transaction sequences reduced false positives in fraud detection by 40%
Design deep/stacked RNN architectures and know when to add depth vs. width

Section 2

Opening Hook — The Sentence That Broke the RNN

🗣️ When Memory Fails: A Hindi Sentence Challenge

Consider this everyday Hindi sentence:

"Kya aap mujhe Bangalore mein best biryani restaurant suggest kar sakte hain?"

To answer this question correctly, a model needs to connect "Kya" (the question word at position 1) to "hain" (the auxiliary verb at position 12). The subject "aap" appears 10 words before "hain". A vanilla RNN, processing word by word, must carry the memory of "Kya" and "aap" across 10 or more time steps.

Here's the brutal truth: by the time a vanilla RNN reaches "hain", the gradient signal from "Kya" has been multiplied through roughly 10 weight matrices. If each multiplication shrinks the gradient by just 0.5×, the signal arriving back is 0.5¹⁰ ≈ 0.001, a thousand times weaker. The RNN effectively forgets whether this was a question or a statement.

Google Translate long struggled with long Hindi→English sentences. In 2016 Google replaced its phrase-based statistical system with GNMT, a deep LSTM-based model, and reported that it reduced translation errors by about 60% on average across several major language pairs in human evaluations. Gated cells are a large part of why: an LSTM can still remember "Kya" when it finally reaches "hain", even dozens of words later.

Tags: Google Translate · Hindi NLP · Long-Range Dependencies · Vanishing Gradients

This chapter is about the most important architectural innovation in sequence modeling history: the gated recurrent cell. We'll study two variants — the LSTM (Long Short-Term Memory, 1997) and the GRU (Gated Recurrent Unit, 2014) — that solved the vanishing gradient problem and powered everything from Google Translate to Alexa to financial fraud detection at HDFC Bank.

India runs on sequential data. Flipkart processes 1.5 billion event sequences per day (search → browse → add-to-cart → purchase). PhonePe analyzes transaction sequences for ₹12 lakh crore in annual UPI volume. IRCTC handles booking sequences from 25 million daily queries. Every one of these systems benefits from architectures that can remember long-range patterns — and that's exactly what LSTMs and GRUs do.

Section 3

Core Concepts

We begin by understanding why vanilla RNNs fail, then build the LSTM and GRU architectures gate by gate.

14.1 The Vanishing Gradient Problem in RNNs — Why Memory Fades

Recall from Chapter 13 that a vanilla RNN computes:

a⟨t⟩ = tanh(W_aa · a⟨t−1⟩ + W_ax · x⟨t⟩ + b_a)
ŷ⟨t⟩ = softmax(W_ya · a⟨t⟩ + b_y)

During backpropagation through time (BPTT), the gradient of the loss at time step T with respect to the hidden state at time step t requires computing:

∂a⟨T⟩ / ∂a⟨t⟩ = ∏(k=t+1 to T) ∂a⟨k⟩ / ∂a⟨k−1⟩ = ∏(k=t+1 to T) W_aa^⊤ · diag(1 − a⟨k⟩²)

This is a product of (T − t) matrices. Let's analyze what happens:

Why the Product of Matrices Destroys Gradients

The Eigenvalue Argument

If the largest eigenvalue magnitude of W_aa is λ_max, then after (T − t) multiplications the gradient scales roughly as λ_max^(T−t) (ignoring the tanh derivative factor, which can only shrink it further).

If λ_max < 1: Gradient → 0 exponentially (vanishing gradient). For λ_max = 0.9 and T − t = 100: 0.9¹⁰⁰ ≈ 2.66 × 10⁻⁵
If λ_max > 1: Gradient → ∞ exponentially (exploding gradient). For λ_max = 1.1 and T − t = 100: 1.1¹⁰⁰ ≈ 13,780
If λ_max = 1: Gradient stays stable — but this is a razor's edge, impossible to maintain in practice
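The three cases above can be checked numerically. The sketch below is only an illustration under simplifying assumptions: `scaling_after` is a hypothetical helper, the matrix `W` is a random symmetric stand-in for W_aa (symmetric so that its largest eigenvalue magnitude equals its spectral norm), and the tanh derivative factor is ignored.

```python
import numpy as np

def scaling_after(lam_max, steps):
    """Rough gradient scale after `steps` multiplications by lam_max."""
    return lam_max ** steps

for lam in (0.9, 1.0, 1.1):
    print(f"lambda_max = {lam}: scale after 100 steps = {scaling_after(lam, 100):.3e}")

# The same effect with an actual matrix. W is a hypothetical symmetric
# recurrent matrix rescaled so its largest eigenvalue magnitude is 0.9.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
W = (A + A.T) / 2
W *= 0.9 / np.max(np.abs(np.linalg.eigvalsh(W)))

g = np.ones(8)          # stand-in for the gradient arriving at step T
norms = []
for _ in range(100):
    g = W.T @ g         # one step of backpropagation through time (tanh factor ignored)
    norms.append(np.linalg.norm(g))

print(f"norm after 1 step: {norms[0]:.3e}, after 100 steps: {norms[-1]:.3e}")
```

Changing the rescaling target from 0.9 to 1.1 turns the same loop into an exploding-gradient demonstration.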

The Practical Consequence

For a vanilla RNN, long-range dependencies (more than ~10-20 time steps) are effectively invisible during training. The model can learn that "biryani" relates to "restaurant" (adjacent words) but struggles to learn that "Kya" relates to "hain" (11 steps apart).

The tanh Saturation Factor

The derivative of tanh is (1 − tanh²(x)), which is always ≤ 1. When activations saturate (|x| large), this derivative approaches 0. Each multiplication by diag(1 − a⟨k⟩²) further shrinks the gradient — multiplicatively.
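To get a feel for how quickly saturation bites, the short sketch below evaluates the tanh derivative at a few pre-activation values. The helper name `tanh_grad` and the sample points are illustrative choices, not part of the chapter's code.

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, never larger than 1."""
    return 1.0 - np.tanh(x) ** 2

for x in (0.0, 1.0, 2.0, 4.0):
    print(f"x = {x}: d/dx tanh(x) = {tanh_grad(x):.5f}")

# Ten saturated steps in a row (|x| = 2 each time) multiply these factors together.
print(f"product over 10 steps at x = 2: {tanh_grad(2.0) ** 10:.3e}")
```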

Sepp Hochreiter identified the vanishing gradient problem in his 1991 diploma thesis (in German!). His advisor, Jürgen Schmidhuber, encouraged him to solve it — leading to the LSTM paper in 1997. It took nearly 20 years (until ~2014) for compute hardware to catch up and make LSTMs practical for industry use.

The key insight that leads to LSTM: we need a path where gradients can flow unchanged across many time steps. Instead of multiplying through W_aa repeatedly, we need an additive connection — a highway for gradients. This is the cell state.

14.2 LSTM — Long Short-Term Memory (Full Derivation)

The LSTM, introduced by Hochreiter & Schmidhuber (1997) and refined by Gers et al. (2000) with the forget gate, replaces the simple RNN hidden state with a carefully engineered memory cell controlled by three gates.

The Key Idea: Two Separate State Vectors

Unlike a vanilla RNN (which has only one hidden state a⟨t⟩), an LSTM maintains two vectors at each time step:

Cell state c⟨t⟩ — the "long-term memory" that flows along a conveyor belt with minimal modification
Hidden state a⟨t⟩ (sometimes written h⟨t⟩) — the "working memory" exposed to the outside world

Step-by-Step Gate Derivation

At each time step t, the LSTM receives three inputs: the previous hidden state a⟨t−1⟩, the previous cell state c⟨t−1⟩, and the current input x⟨t⟩. It produces updated c⟨t⟩ and a⟨t⟩.

Gate 1: Forget Gate (f⟨t⟩) — "What to erase from memory"

f⟨t⟩ = σ(W_f · [a⟨t−1⟩, x⟨t⟩] + b_f)

The forget gate outputs a vector of values between 0 and 1 (sigmoid output). Each element decides how much of the corresponding cell state dimension to retain:

f = 1: Keep this memory completely (e.g., "remember that this is a question")
f = 0: Erase this memory completely (e.g., "forget the previous subject, new subject introduced")
f = 0.7: Keep 70% of this memory — gradual decay

The notation [a⟨t−1⟩, x⟨t⟩] means concatenation. If a⟨t−1⟩ ∈ ℝⁿ and x⟨t⟩ ∈ ℝᵐ, then [a⟨t−1⟩, x⟨t⟩] ∈ ℝⁿ⁺ᵐ, and W_f ∈ ℝⁿˣ⁽ⁿ⁺ᵐ⁾. This is the same for all four weight matrices in the LSTM.
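A quick shape check can make the concatenation notation concrete. This is a minimal sketch with hypothetical sizes (n = 4, m = 3) and random weights; it also confirms that one matrix acting on the concatenated vector is equivalent to two separate matrices acting on a⟨t−1⟩ and x⟨t⟩.

```python
import numpy as np

n, m = 4, 3                                   # hypothetical hidden and input sizes
rng = np.random.default_rng(0)

a_prev = rng.standard_normal((n, 1))          # a<t-1>
x_t = rng.standard_normal((m, 1))             # x<t>
concat = np.vstack((a_prev, x_t))             # [a<t-1>, x<t>], shape (n+m, 1)

W_f = rng.standard_normal((n, n + m)) * 0.1   # forget-gate weights, shape (n, n+m)
b_f = np.zeros((n, 1))

pre_activation = W_f @ concat + b_f
f_t = 1.0 / (1.0 + np.exp(-pre_activation))   # sigmoid, every entry in (0, 1)

# Equivalent view: split W_f into a recurrent block and an input block.
split_form = W_f[:, :n] @ a_prev + W_f[:, n:] @ x_t + b_f

print("concat shape:", concat.shape)
print("f_t shape:", f_t.shape)
print("forget gate values:", np.round(f_t.ravel(), 3))
```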

Gate 2: Input/Update Gate (i⟨t⟩) — "What new information to store"

i⟨t⟩ = σ(W_i · [a⟨t−1⟩, x⟨t⟩] + b_i)

The input gate decides which dimensions of the cell state will receive new information. Like the forget gate, it outputs values in [0, 1].

Cell Candidate (c̃⟨t⟩) — "What new information to potentially store"

c̃⟨t⟩ = tanh(W_c · [a⟨t−1⟩, x⟨t⟩] + b_c)

The cell candidate is the proposed new memory content. It uses tanh (output in [−1, 1]) because cell state values can be positive or negative. Think of this as the "raw new information" — the input gate decides how much of it to actually write.

Cell State Update (c⟨t⟩) — "The actual memory update"

c⟨t⟩ = f⟨t⟩ ⊙ c⟨t−1⟩ + i⟨t⟩ ⊙ c̃⟨t⟩

This is the most important equation in the LSTM. The ⊙ symbol denotes element-wise (Hadamard) multiplication. Notice:

The first term f⟨t⟩ ⊙ c⟨t−1⟩ selectively forgets parts of the old memory
The second term i⟨t⟩ ⊙ c̃⟨t⟩ selectively writes new information
The cell state update is additive (not multiplicative!) — this is why gradients flow easily through time

🔑 Why the Additive Update Solves Vanishing Gradients

In a vanilla RNN: a⟨t⟩ = tanh(W · a⟨t−1⟩ + ...) — the hidden state is a multiplicative function of the previous state. Gradient = product of many W matrices → vanishes or explodes.

In an LSTM: c⟨t⟩ = f⟨t⟩ ⊙ c⟨t−1⟩ + ..., so the cell state is an additive function of the previous cell state. Along this direct path, the gradient of c⟨t⟩ with respect to c⟨t−1⟩ is simply f⟨t⟩ (element-wise); the gates also depend on c⟨t−1⟩ indirectly through a⟨t−1⟩, but those paths are secondary. If the forget gate is close to 1, the gradient passes through nearly unchanged, with no repeated multiplication by a weight matrix.

This is analogous to skip connections in ResNets (Chapter 11). Just as ResNets add the identity mapping to let gradients skip layers, LSTMs add the cell state to let gradients skip time steps.
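The claim about the direct path can be verified numerically. In the sketch below the gate values are held fixed (an assumption made so that only the direct cell-state path is measured), `write_seq` stands in for i⟨t⟩ ⊙ c̃⟨t⟩, and `run_cell_path` is a hypothetical helper. The finite-difference gradient of c⟨T⟩ with respect to c⟨0⟩ should match the element-wise product of the forget gates.

```python
import numpy as np

def run_cell_path(c0, f_seq, write_seq):
    """Apply c_t = f_t * c_{t-1} + write_t for every step, gates held fixed."""
    c = c0.copy()
    for f_t, w_t in zip(f_seq, write_seq):
        c = f_t * c + w_t
    return c

rng = np.random.default_rng(1)
T, n = 50, 3
f_seq = rng.uniform(0.95, 1.0, size=(T, n))       # forget gates close to 1
write_seq = rng.standard_normal((T, n)) * 0.1     # stands in for i_t * c_tilde_t
c0 = rng.standard_normal(n)

# Analytic gradient along the direct path: product of the forget gates.
analytic = np.prod(f_seq, axis=0)

# Central finite differences, one cell-state dimension at a time.
eps = 1e-6
numeric = np.empty(n)
for j in range(n):
    bump = np.zeros(n)
    bump[j] = eps
    plus = run_cell_path(c0 + bump, f_seq, write_seq)[j]
    minus = run_cell_path(c0 - bump, f_seq, write_seq)[j]
    numeric[j] = (plus - minus) / (2 * eps)

print("analytic dc_T/dc_0:", np.round(analytic, 4))
print("numeric  dc_T/dc_0:", np.round(numeric, 4))
print("vanilla-RNN-style 0.5 per step over 50 steps:", 0.5 ** T)
```

With forget gates near 1, a usable fraction of the gradient survives 50 steps, whereas a per-step factor of 0.5 leaves essentially nothing.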

Gate 3: Output Gate (o⟨t⟩) — "What to reveal from memory"

o⟨t⟩ = σ(W_o · [a⟨t−1⟩, x⟨t⟩] + b_o)

Hidden State (a⟨t⟩) — "The visible output"

a⟨t⟩ = o⟨t⟩ ⊙ tanh(c⟨t⟩)

The hidden state is a filtered version of the cell state. The cell state might store "this is a question sentence" and "the subject is aap", but at the current time step, only the relevant information is revealed through the output gate.

Complete LSTM Equations — Summary

Forget gate:   f⟨t⟩ = σ(W_f · [a⟨t−1⟩, x⟨t⟩] + b_f)
Input gate:    i⟨t⟩ = σ(W_i · [a⟨t−1⟩, x⟨t⟩] + b_i)
Cell candidate: c̃⟨t⟩ = tanh(W_c · [a⟨t−1⟩, x⟨t⟩] + b_c)
Cell update:   c⟨t⟩ = f⟨t⟩ ⊙ c⟨t−1⟩ + i⟨t⟩ ⊙ c̃⟨t⟩
Output gate:   o⟨t⟩ = σ(W_o · [a⟨t−1⟩, x⟨t⟩] + b_o)
Hidden state:  a⟨t⟩ = o⟨t⟩ ⊙ tanh(c⟨t⟩)

LSTM Parameter Count

Let n = hidden size and m = input size. Each gate has a weight matrix of shape (n, n+m) and a bias of shape (n,). With 4 sets of parameters (forget, input, cell candidate, output):

Total LSTM parameters = 4 × [n × (n + m) + n] = 4n² + 4nm + 4n

For n = 256, m = 100: Total = 4(256²) + 4(256)(100) + 4(256) = 262,144 + 102,400 + 1,024 = 365,568 parameters per LSTM layer.

The forget gate is NOT about forgetting! Counterintuitively, a forget gate value of 1 means "remember everything" and 0 means "forget everything". It should really be called the "remember gate". This naming confusion trips up students constantly. Tip: Initialize the forget gate bias to a positive value (e.g., 1.0 or 2.0) so that training starts with "remember by default" — this was shown by Jozefowicz et al. (2015) to significantly improve LSTM training.
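The effect of the bias tip can be estimated with a back-of-the-envelope calculation. The sketch assumes the weighted input to the forget gate is near zero at initialization, so the gate value is roughly σ(b_f); under that assumption it prints how much of a stored value would survive 20 steps for a few bias choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

steps = 20
for bias in (0.0, 1.0, 2.0):
    f0 = sigmoid(bias)            # approximate forget-gate value at initialization
    retained = f0 ** steps        # fraction of a memory left after `steps` steps
    print(f"b_f = {bias}: initial f = {f0:.3f}, retained after {steps} steps = {retained:.2e}")
```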

14.3 GRU — Gated Recurrent Unit (Simplified Gating)

The GRU was proposed by Cho et al. (2014) as a simpler alternative to the LSTM. It achieves similar performance with fewer parameters by making two key simplifications:

Merge the cell state and hidden state into a single state vector h⟨t⟩
Merge the forget and input gates into a single update gate z⟨t⟩ (if you update, you automatically forget the old value)

GRU Equations — Step by Step

Update Gate (z⟨t⟩) — "How much of the old state to keep"

z⟨t⟩ = σ(W_z · [h⟨t−1⟩, x⟨t⟩] + b_z)

The update gate serves the roles of both the LSTM's forget gate and input gate. A value of z = 1 means "keep the old hidden state completely" (copy through), while z = 0 means "replace entirely with the new candidate".

Reset Gate (r⟨t⟩) — "How much of the old state to use for the candidate"

r⟨t⟩ = σ(W_r · [h⟨t−1⟩, x⟨t⟩] + b_r)

The reset gate controls how much of the previous hidden state is used to compute the new candidate. When r = 0, the model "resets" and acts as if reading the first word of a new sentence.

Candidate Hidden State (h̃⟨t⟩)

h̃⟨t⟩ = tanh(W_h · [r⟨t⟩ ⊙ h⟨t−1⟩, x⟨t⟩] + b_h)

Notice the reset gate is applied inside the tanh — it selectively zeros out parts of the previous hidden state before computing the candidate.

Hidden State Update (h⟨t⟩)

h⟨t⟩ = z⟨t⟩ ⊙ h⟨t−1⟩ + (1 − z⟨t⟩) ⊙ h̃⟨t⟩

This is a convex combination of the old state and the new candidate. The elegance: if z = 1, h⟨t⟩ = h⟨t−1⟩ (perfect copy, gradient flows through unchanged). If z = 0, h⟨t⟩ = h̃⟨t⟩ (complete reset).
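The two extremes of the update gate are easy to see with concrete numbers. The vectors and the helper `gru_blend` below are made up purely for illustration.

```python
import numpy as np

def gru_blend(z, h_prev, h_tilde):
    """GRU state update: z keeps the old state, (1 - z) admits the candidate."""
    return z * h_prev + (1 - z) * h_tilde

h_prev = np.array([0.8, -0.3, 0.5])     # illustrative previous state
h_tilde = np.array([-0.1, 0.9, 0.0])    # illustrative candidate state

copy_through = gru_blend(np.ones(3), h_prev, h_tilde)     # z = 1 everywhere
full_replace = gru_blend(np.zeros(3), h_prev, h_tilde)    # z = 0 everywhere
mixed = gru_blend(np.array([1.0, 0.0, 0.5]), h_prev, h_tilde)

print("z = 1:", copy_through)
print("z = 0:", full_replace)
print("z = [1, 0, 0.5]:", mixed)
```

Because z is a vector, each dimension of the state can choose independently between copying and overwriting, as the mixed case shows.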

Complete GRU Equations — Summary

Update gate:    z⟨t⟩ = σ(W_z · [h⟨t−1⟩, x⟨t⟩] + b_z)
Reset gate:     r⟨t⟩ = σ(W_r · [h⟨t−1⟩, x⟨t⟩] + b_r)
Candidate:      h̃⟨t⟩ = tanh(W_h · [r⟨t⟩ ⊙ h⟨t−1⟩, x⟨t⟩] + b_h)
Hidden update:  h⟨t⟩ = z⟨t⟩ ⊙ h⟨t−1⟩ + (1 − z⟨t⟩) ⊙ h̃⟨t⟩

GRU Parameter Count

With 3 sets of parameters (update, reset, candidate) instead of LSTM's 4:

Total GRU parameters = 3 × [n × (n + m) + n] = 3n² + 3nm + 3n

For n = 256, m = 100: Total = 3(256²) + 3(256)(100) + 3(256) = 196,608 + 76,800 + 768 = 274,176 parameters — 25% fewer than LSTM.
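The two counting formulas can be wrapped in small helpers for quick sanity checks. The function names and the `separate_recurrent_bias` flag are hypothetical. The flag reflects our understanding that Keras's GRU layer, with its default `reset_after=True`, keeps a second bias vector per gate, so `model.summary()` may report slightly more parameters than the textbook formula; treat that detail as an assumption to verify against your installed version.

```python
def lstm_param_count(n, m):
    """4 parameter sets, each with an (n, n+m) weight matrix and an n-vector bias."""
    return 4 * (n * (n + m) + n)

def gru_param_count(n, m, separate_recurrent_bias=False):
    """3 parameter sets; optionally two bias vectors per set instead of one."""
    bias_terms = 2 * n if separate_recurrent_bias else n
    return 3 * (n * (n + m) + bias_terms)

n, m = 256, 100
print(f"LSTM: {lstm_param_count(n, m):,}")
print(f"GRU (textbook formula): {gru_param_count(n, m):,}")
print(f"GRU (two biases per gate): {gru_param_count(n, m, separate_recurrent_bias=True):,}")
```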

GRU ↔ LSTM Correspondence

How GRU Maps to LSTM

GRU Component	LSTM Equivalent	Key Difference
Update gate z	Forget gate f (inversely)	z controls both forgetting AND updating; 1−z replaces the input gate
Reset gate r	Partially like output gate o	Applied before candidate computation, not after
Single h⟨t⟩	Separate c⟨t⟩ and a⟨t⟩	GRU has no protected "cell state" — the hidden state IS the memory

The GRU was introduced by Kyunghyun Cho (now at NYU) and colleagues in 2014, in the same paper that proposed the RNN Encoder-Decoder architecture for machine translation. Both ideas became foundational for neural machine translation.

14.4 LSTM vs GRU — When to Use Each

Criterion	LSTM	GRU
Parameters	4n² + 4nm + 4n	3n² + 3nm + 3n (25% fewer)
Training Speed	Slower per epoch	~20-30% faster per epoch
Long Sequences (>500 steps)	✅ Better — separate cell state provides stronger gradient highway	⚠️ Can struggle — single state must balance memory and output
Small Datasets	⚠️ May overfit — more parameters	✅ Better generalization
Interpretability	✅ Can inspect cell state and gate activations separately	⚠️ Harder — single state mixes memory and output
Industry Default	✅ More common in production (proven track record)	✅ Growing adoption, especially in mobile/edge
Music/Audio Generation	✅ Preferred — needs very long context	⚠️ Often needs larger hidden size to match
Text Classification	Similar performance	Similar performance, but faster

The practitioner's rule of thumb: Start with GRU (faster to experiment). If GRU's performance plateaus and you suspect the model needs longer memory, switch to LSTM. If you're working on resource-constrained devices (mobile, IoT), prefer GRU. For production systems where accuracy is paramount and compute is available, LSTM is the safer choice.

14.5 Bidirectional RNNs — Reading Forward AND Backward

Consider the NER (Named Entity Recognition) task on this Indian news headline:

"Sachin scored a century at Wankhede while Tendulkar Foundation donated ₹5 crore."

A forward-only RNN reading "Sachin" doesn't yet know what follows. Is "Sachin" a person's first name, a place, or a brand? Only when the model reads "scored a century" does it become clear this is a cricketer. And "Tendulkar Foundation" — is "Tendulkar" a person or an organization? Only the following word "Foundation" disambiguates it.

Architecture: Two RNNs, One Sequence

A Bidirectional RNN runs two separate RNNs on the same input:

Forward RNN (→): Processes x⟨1⟩, x⟨2⟩, ..., x⟨T⟩ left-to-right, producing hidden states →a⟨1⟩, →a⟨2⟩, ..., →a⟨T⟩
Backward RNN (←): Processes x⟨T⟩, x⟨T−1⟩, ..., x⟨1⟩ right-to-left, producing hidden states ←a⟨1⟩, ←a⟨2⟩, ..., ←a⟨T⟩

Final representation at time t: a⟨t⟩ = [→a⟨t⟩, ←a⟨t⟩] (concatenation, size = 2n)

The prediction at each time step uses both past context (from forward RNN) and future context (from backward RNN):

ŷ⟨t⟩ = softmax(W_y · [→a⟨t⟩, ←a⟨t⟩] + b_y)

Bidirectional LSTM Architecture:

┌─────────────────────────────────────────────────────────────┐
│                      Forward LSTM (→)                       │
│  →a⟨1⟩ ──→ →a⟨2⟩ ──→ →a⟨3⟩ ──→ →a⟨4⟩ ──→ →a⟨5⟩              │
└────┬──────────┬──────────┬──────────┬──────────┬────────────┘
     │          │          │          │          │
 x⟨1⟩="Sachin"  x⟨2⟩="scored"  x⟨3⟩="a"  x⟨4⟩="century"  x⟨5⟩="at"
     │          │          │          │          │
┌────┴──────────┴──────────┴──────────┴──────────┴────────────┐
│                      Backward LSTM (←)                      │
│  ←a⟨1⟩ ←── ←a⟨2⟩ ←── ←a⟨3⟩ ←── ←a⟨4⟩ ←── ←a⟨5⟩              │
└────┬──────────┬──────────┬──────────┬──────────┬────────────┘
     │          │          │          │          │
  [→a⟨1⟩,    [→a⟨2⟩,    [→a⟨3⟩,    [→a⟨4⟩,    [→a⟨5⟩,
   ←a⟨1⟩]     ←a⟨2⟩]     ←a⟨3⟩]     ←a⟨4⟩]     ←a⟨5⟩]
     │          │          │          │          │
   ŷ⟨1⟩       ŷ⟨2⟩       ŷ⟨3⟩       ŷ⟨4⟩       ŷ⟨5⟩
   B-PER        O          O          O          O
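The two-pass structure can be sketched in a few lines of NumPy. To keep the example short it uses a plain tanh RNN as a stand-in for each LSTM, with random weights and hypothetical helper names (`rnn_pass`, `bidirectional_pass`, `init_weights`). The point it illustrates is the data flow: the backward pass runs on the reversed sequence and is flipped back so that both halves line up at each time step.

```python
import numpy as np

def rnn_pass(x, W_a, W_x, b):
    """Vanilla tanh RNN over x of shape (m, T); returns states of shape (n, T)."""
    n, T = W_a.shape[0], x.shape[1]
    a = np.zeros(n)
    states = np.zeros((n, T))
    for t in range(T):
        a = np.tanh(W_a @ a + W_x @ x[:, t] + b)
        states[:, t] = a
    return states

def bidirectional_pass(x, fwd, bwd):
    forward = rnn_pass(x, *fwd)                       # reads x<1> ... x<T>
    backward = rnn_pass(x[:, ::-1], *bwd)[:, ::-1]    # reads x<T> ... x<1>, then re-aligned
    return np.vstack((forward, backward))             # shape (2n, T)

rng = np.random.default_rng(0)
n, m, T = 4, 3, 5

def init_weights():
    return (rng.standard_normal((n, n)) * 0.5,
            rng.standard_normal((n, m)) * 0.5,
            np.zeros(n))

fwd, bwd = init_weights(), init_weights()
x = rng.standard_normal((m, T))
states = bidirectional_pass(x, fwd, bwd)

# Perturb only the LAST input and compare the representation at the FIRST step.
x_mod = x.copy()
x_mod[:, -1] += 1.0
states_mod = bidirectional_pass(x_mod, fwd, bwd)

print("combined state shape:", states.shape)
print("forward half at t=1 changed: ", not np.allclose(states[:n, 0], states_mod[:n, 0]))
print("backward half at t=1 changed:", not np.allclose(states[n:, 0], states_mod[n:, 0]))
```

Only the backward half of the first time step reacts to a change in the last input, which is exactly the "future context" a forward-only model lacks.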

Bidirectional RNNs CANNOT be used for real-time prediction! Since the backward RNN needs the full sequence, you must have the complete input before making predictions. This means BiLSTMs are great for NER, sentiment analysis, and machine translation (where you have the full input), but NOT for speech recognition in real-time, next-word prediction, or stock price forecasting where you predict while receiving input.

Bidirectional LSTMs at Indian tech companies: Flipkart uses BiLSTMs for product review sentiment analysis in Hindi-English code-mixed text. The backward pass catches patterns like "...but battery life is terrible" where the negation comes after the subject. Jio's speech team uses BiLSTMs for named entity extraction from Hindi call transcripts to auto-tag customer complaints.

14.6 Deep (Stacked) RNNs — Adding Depth to Sequence Models

Just as we stack convolutional layers in CNNs, we can stack multiple LSTM/GRU layers. The hidden state output of layer l becomes the input to layer l+1:

a⟨t⟩_l = LSTM(a⟨t−1⟩_l, a⟨t⟩_{l−1})

where a⟨t⟩_0 = x⟨t⟩ (the input embedding).

Stacked 3-Layer LSTM:

Layer 3:  a₃⟨1⟩ ──→ a₃⟨2⟩ ──→ a₃⟨3⟩ ──→ a₃⟨T⟩  → Final output
            ↑         ↑         ↑         ↑
Layer 2:  a₂⟨1⟩ ──→ a₂⟨2⟩ ──→ a₂⟨3⟩ ──→ a₂⟨T⟩
            ↑         ↑         ↑         ↑
Layer 1:  a₁⟨1⟩ ──→ a₁⟨2⟩ ──→ a₁⟨3⟩ ──→ a₁⟨T⟩
            ↑         ↑         ↑         ↑
Input:    x⟨1⟩      x⟨2⟩      x⟨3⟩      x⟨T⟩

Practical guidelines for stacking:

2-3 layers is the sweet spot for most NLP tasks
4+ layers is rarely beneficial — diminishing returns + much slower training
Google's Neural Machine Translation (GNMT) used 8 stacked LSTM layers with residual connections — but this was before Transformers took over
Add dropout between layers (not within recurrent connections!) — typically 0.2-0.5

When using deep stacked LSTMs, add residual connections between layers (just like ResNets). This means: output_l = LSTM_l(input_l) + input_l. Google's GNMT paper showed this was essential for training 8-layer LSTMs.
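A minimal sketch of stacking with residual connections follows. It again uses a plain tanh RNN layer as a stand-in for an LSTM layer, and the names `rnn_layer` and `stacked_forward` are hypothetical. One design detail it makes explicit: the residual sum output_l = layer_l(input_l) + input_l is only possible when a layer's input and output have the same size, so the first layer (input size 3, hidden size 4) is left without one.

```python
import numpy as np

def rnn_layer(inputs, W_a, W_x, b):
    """One recurrent layer over inputs of shape (d_in, T); returns (n, T)."""
    n, T = W_a.shape[0], inputs.shape[1]
    a = np.zeros(n)
    out = np.zeros((n, T))
    for t in range(T):
        a = np.tanh(W_a @ a + W_x @ inputs[:, t] + b)
        out[:, t] = a
    return out

def stacked_forward(x, layers, residual=True):
    """Feed each layer's full output sequence into the next layer."""
    h = x
    for W_a, W_x, b in layers:
        out = rnn_layer(h, W_a, W_x, b)
        if residual and out.shape == h.shape:
            out = out + h          # residual connection between layers
        h = out
    return h

rng = np.random.default_rng(0)
m, n, T = 3, 4, 6
sizes = [(n, m), (n, n), (n, n)]   # (hidden size, input size) for layers 1 to 3
layers = [(rng.standard_normal((h, h)) * 0.3,
           rng.standard_normal((h, d)) * 0.3,
           np.zeros(h)) for h, d in sizes]

x = rng.standard_normal((m, T))
with_residual = stacked_forward(x, layers, residual=True)
without_residual = stacked_forward(x, layers, residual=False)

print("output shape:", with_residual.shape)
print("residual changes the output:", not np.allclose(with_residual, without_residual))
```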

Section 4

From-Scratch Code — LSTM Cell in NumPy

Let's implement a single LSTM cell forward pass using only NumPy. This computes one time step of the LSTM equations.

Python
import numpy as np

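def sigmoid(x):
    """Numerically stable sigmoid."""
    # exp(-|x|) is always in (0, 1], so neither branch can overflow
    # (np.where evaluates both branches for every element).
    z = np.exp(-np.abs(x))
    return np.where(x >= 0, 1 / (1 + z), z / (1 + z))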

def lstm_cell_forward(x_t, a_prev, c_prev, parameters):
    """
    Single LSTM cell forward pass.
    
    Arguments:
        x_t     -- input at time step t, shape (m, 1)
        a_prev  -- hidden state from previous step, shape (n, 1)
        c_prev  -- cell state from previous step, shape (n, 1)
        parameters -- dict containing:
            Wf, bf -- forget gate weights & bias
            Wi, bi -- input gate weights & bias
            Wc, bc -- cell candidate weights & bias
            Wo, bo -- output gate weights & bias
    
    Returns:
        a_next  -- next hidden state, shape (n, 1)
        c_next  -- next cell state, shape (n, 1)
        cache   -- values needed for backprop
    """
    # Extract parameters
    Wf = parameters["Wf"]  # shape (n, n+m)
    bf = parameters["bf"]  # shape (n, 1)
    Wi = parameters["Wi"]
    bi = parameters["bi"]
    Wc = parameters["Wc"]
    bc = parameters["bc"]
    Wo = parameters["Wo"]
    bo = parameters["bo"]
    
    # Step 1: Concatenate a_prev and x_t
    concat = np.vstack((a_prev, x_t))  # shape (n+m, 1)
    
    # Step 2: Forget gate — what to erase from cell state
    ft = sigmoid(Wf @ concat + bf)     # shape (n, 1)
    
    # Step 3: Input (update) gate — what new info to write
    it = sigmoid(Wi @ concat + bi)     # shape (n, 1)
    
    # Step 4: Cell candidate — proposed new memory
    c_tilde = np.tanh(Wc @ concat + bc)  # shape (n, 1)
    
    # Step 5: Cell state update (THE KEY EQUATION)
    c_next = ft * c_prev + it * c_tilde   # element-wise
    
    # Step 6: Output gate — what to reveal
    ot = sigmoid(Wo @ concat + bo)     # shape (n, 1)
    
    # Step 7: Hidden state — filtered cell state
    a_next = ot * np.tanh(c_next)      # shape (n, 1)
    
    # Cache for backpropagation
    cache = (a_next, c_next, a_prev, c_prev, ft, it,
             c_tilde, ot, x_t, parameters)
    
    return a_next, c_next, cache


def lstm_forward(x, a0, parameters):
    """
    Full LSTM forward pass over T time steps.
    
    Arguments:
        x  -- input sequence, shape (m, T)
        a0 -- initial hidden state, shape (n, 1)
        parameters -- dict of LSTM weights
    
    Returns:
        a_all -- all hidden states, shape (n, T)
        caches -- list of caches for backprop
    """
    n = a0.shape[0]
    m, T = x.shape
    
    # Initialize
    a_all = np.zeros((n, T))
    c_prev = np.zeros((n, 1))
    a_prev = a0
    caches = []
    
    for t in range(T):
        x_t = x[:, t].reshape(-1, 1)
        a_prev, c_prev, cache = lstm_cell_forward(
            x_t, a_prev, c_prev, parameters
        )
        a_all[:, t] = a_prev.flatten()
        caches.append(cache)
    
    return a_all, caches


# ─── Demo: Run LSTM on a toy sequence ───
np.random.seed(42)
n_hidden = 4   # hidden state size
n_input  = 3   # input feature size
T_steps  = 5   # sequence length

# Initialize parameters (Xavier-like)
scale = np.sqrt(2.0 / (n_hidden + n_input))
params = {}
for name in ["Wf", "Wi", "Wc", "Wo"]:
    params[name] = np.random.randn(n_hidden, n_hidden + n_input) * scale
for name in ["bf", "bi", "bc", "bo"]:
    params[name] = np.zeros((n_hidden, 1))
params["bf"] += 1.0   # Forget gate bias init = 1 (remember by default)

# Create toy input and initial hidden state
x_seq = np.random.randn(n_input, T_steps)
a_init = np.zeros((n_hidden, 1))

# Forward pass
hidden_states, caches = lstm_forward(x_seq, a_init, params)

print("Input shape:", x_seq.shape)
print("Hidden states shape:", hidden_states.shape)
print("\nHidden state at t=0:")
print(np.round(hidden_states[:, 0], 4))
print("\nHidden state at t=4:")
print(np.round(hidden_states[:, 4], 4))
print(f"\nTotal parameters: {sum(p.size for p in params.values()):,}")

Input shape: (3, 5)
Hidden states shape: (4, 5)

Hidden state at t=0:
[ 0.0497  0.1083 -0.0399  0.0756]

Hidden state at t=4:
[ 0.1523  0.2841 -0.1672  0.2034]

Total parameters: 128

Understanding the parameter count: For n=4, m=3, each weight matrix is 4×7 = 28 elements and each bias has 4 elements. With 4 parameter sets, that gives 4 weight matrices of 28 = 112 plus 4 biases of 4 = 16, for a total of 128. This matches the formula 4n(n+m) + 4n = 4(4)(7) + 16 = 128. Setting the forget gate bias to 1.0 changes its values, not the number of parameters.

Now let's also implement a GRU cell from scratch for comparison:

Python
def gru_cell_forward(x_t, h_prev, parameters):
    """
    Single GRU cell forward pass.
    
    Arguments:
        x_t    -- input at time step t, shape (m, 1)
        h_prev -- hidden state from previous step, shape (n, 1)
        parameters -- dict containing:
            Wz, bz -- update gate weights & bias
            Wr, br -- reset gate weights & bias
            Wh, bh -- candidate weights & bias
    
    Returns:
        h_next -- next hidden state, shape (n, 1)
    """
    Wz, bz = parameters["Wz"], parameters["bz"]
    Wr, br = parameters["Wr"], parameters["br"]
    Wh, bh = parameters["Wh"], parameters["bh"]
    
    # Concatenate h_prev and x_t
    concat = np.vstack((h_prev, x_t))   # (n+m, 1)
    
    # Update gate: how much to keep old state
    zt = sigmoid(Wz @ concat + bz)       # (n, 1)
    
    # Reset gate: how much old state to use in candidate
    rt = sigmoid(Wr @ concat + br)       # (n, 1)
    
    # Candidate hidden state (note: reset applied to h_prev)
    concat_reset = np.vstack((rt * h_prev, x_t))
    h_tilde = np.tanh(Wh @ concat_reset + bh)  # (n, 1)
    
    # Final hidden state: convex combination
    h_next = zt * h_prev + (1 - zt) * h_tilde
    
    return h_next

# Compare parameter counts
n, m = 256, 100
lstm_params = 4 * (n * (n + m) + n)
gru_params  = 3 * (n * (n + m) + n)
print(f"LSTM params (n={n}, m={m}): {lstm_params:,}")
print(f"GRU params  (n={n}, m={m}): {gru_params:,}")
print(f"GRU saves: {lstm_params - gru_params:,} params ({(lstm_params-gru_params)/lstm_params*100:.1f}%)")

LSTM params (n=256, m=100): 365,568
GRU params  (n=256, m=100): 274,176
GRU saves: 91,392 params (25.0%)
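For completeness, the GRU cell can be unrolled over a sequence in the same way the LSTM cell was. The sketch below is self-contained, so it repeats compact versions of the sigmoid and the cell step; `gru_step` and `gru_forward` are hypothetical names, and the weights are random toy values.

```python
import numpy as np

def sigmoid(x):
    z = np.exp(-np.abs(x))                      # safe for large |x|
    return np.where(x >= 0, 1 / (1 + z), z / (1 + z))

def gru_step(x_t, h_prev, p):
    """One GRU time step, following the equations in Section 14.3."""
    concat = np.vstack((h_prev, x_t))
    z = sigmoid(p["Wz"] @ concat + p["bz"])     # update gate
    r = sigmoid(p["Wr"] @ concat + p["br"])     # reset gate
    h_tilde = np.tanh(p["Wh"] @ np.vstack((r * h_prev, x_t)) + p["bh"])
    return z * h_prev + (1 - z) * h_tilde

def gru_forward(x, h0, p):
    """Unroll the GRU over x of shape (m, T); returns all states, shape (n, T)."""
    n, T = h0.shape[0], x.shape[1]
    h_all = np.zeros((n, T))
    h = h0
    for t in range(T):
        h = gru_step(x[:, t].reshape(-1, 1), h, p)
        h_all[:, t] = h.ravel()
    return h_all

rng = np.random.default_rng(42)
n_hidden, n_input, T_steps = 4, 3, 5
gru_toy_params = {}
for name in ("Wz", "Wr", "Wh"):
    gru_toy_params[name] = rng.standard_normal((n_hidden, n_hidden + n_input)) * 0.5
for name in ("bz", "br", "bh"):
    gru_toy_params[name] = np.zeros((n_hidden, 1))

x_seq = rng.standard_normal((n_input, T_steps))
h_states = gru_forward(x_seq, np.zeros((n_hidden, 1)), gru_toy_params)

print("GRU hidden states shape:", h_states.shape)
print("GRU parameters:", sum(v.size for v in gru_toy_params.values()))
```

Note that the GRU needs only one state vector to be threaded through the loop, whereas the LSTM loop carries both a_prev and c_prev.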

Section 5

Industry Code — TensorFlow / Keras

5A. NSE Nifty 50 Stock Price Prediction with LSTM

We build a model to predict the next-day closing price of the NSE Nifty 50 index using the past 60 days of prices. This is a classic many-to-one sequence problem.

📈 Real-World Context

Indian quant firms like Quadeye, Tower Research Capital India, and Edelweiss use LSTM-based models as one component in their trading strategies. While no model can "beat the market" consistently, LSTMs capture temporal patterns (momentum, mean-reversion, seasonality) that simpler models miss. Zerodha processes ~15 million orders/day on NSE — the scale of data that makes deep learning viable.

Python
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

# ─── 1. Load and Prepare Nifty 50 Data ───
# Download from: https://www.nseindia.com/market-data/live-equity-market
# Or use yfinance: pip install yfinance
import yfinance as yf

nifty = yf.download("^NSEI", start="2015-01-01", end="2024-12-31")
prices = nifty["Close"].values.reshape(-1, 1)

print(f"Dataset: {len(prices)} trading days")
print(f"Price range: ₹{prices.min():.0f} to ₹{prices.max():.0f}")

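# ─── 2. Normalize prices to [0, 1] ───
# Fit the scaler on the first 80% of rows only, so the min/max of the
# later (test) period never leaks into preprocessing.
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(prices[:int(len(prices) * 0.8)])
prices_scaled = scaler.transform(prices)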

# ─── 3. Create sequences: 60 days → predict day 61 ───
LOOKBACK = 60  # Use 60 days of history

def create_sequences(data, lookback):
    X, y = [], []
    for i in range(lookback, len(data)):
        X.append(data[i - lookback:i, 0])
        y.append(data[i, 0])
    return np.array(X), np.array(y)

X, y = create_sequences(prices_scaled, LOOKBACK)
X = X.reshape(X.shape[0], X.shape[1], 1)  # (samples, timesteps, features)

# Train-test split (80-20, chronological — NEVER shuffle time series!)
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(f"Train: {X_train.shape}, Test: {X_test.shape}")

# ─── 4. Build Stacked LSTM Model ───
model = Sequential([
    # Layer 1: LSTM with 128 units, return sequences for stacking
    LSTM(128, return_sequences=True,
         input_shape=(LOOKBACK, 1)),
    Dropout(0.2),
    
    # Layer 2: LSTM with 64 units
    LSTM(64, return_sequences=False),
    Dropout(0.2),
    
    # Dense output: predict single price value
    Dense(32, activation="relu"),
    Dense(1)  # Linear activation for regression
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="mse",
    metrics=["mae"]
)

model.summary()

# ─── 5. Train with Callbacks ───
callbacks = [
    EarlyStopping(monitor="val_loss", patience=10,
                  restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                      patience=5, min_lr=1e-6)
]

history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks=callbacks,
    verbose=1
)

# ─── 6. Evaluate and Visualize ───
predictions_scaled = model.predict(X_test)
predictions = scaler.inverse_transform(predictions_scaled)
actual = scaler.inverse_transform(y_test.reshape(-1, 1))

# Metrics
mae = np.mean(np.abs(predictions - actual))
mape = np.mean(np.abs((actual - predictions) / actual)) * 100
print(f"\nTest MAE: ₹{mae:.2f}")
print(f"Test MAPE: {mape:.2f}%")

# Plot
plt.figure(figsize=(14, 5))
plt.plot(actual, label="Actual Nifty 50", color="#0f172a")
plt.plot(predictions, label="LSTM Prediction", color="#7c3aed", alpha=0.8)
plt.title("Nifty 50 Stock Price Prediction — LSTM")
plt.xlabel("Trading Days")
plt.ylabel("Price (₹)")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("nifty50_lstm_prediction.png", dpi=150)
plt.show()

Sample output (exact figures depend on the download date and random initialization):

Dataset: 2480 trading days
Price range: ₹7511 to ₹26277
Train: (1936, 60, 1), Test: (484, 60, 1)
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 lstm (LSTM)                 (None, 60, 128)           66,560
 dropout (Dropout)           (None, 60, 128)           0
 lstm_1 (LSTM)               (None, 64)                49,408
 dropout_1 (Dropout)         (None, 64)                0
 dense (Dense)               (None, 32)                2,080
 dense_1 (Dense)             (None, 1)                 33
=================================================================
Total params: 118,081
Trainable params: 118,081

Test MAE: ₹186.43
Test MAPE: 0.94%

NEVER shuffle time-series data for train/test split! Always use chronological splitting. Shuffling creates "data leakage" — the model sees future prices during training, giving unrealistically good results that don't generalize. Also, stock prediction models have limited real-world utility for trading — past patterns don't guarantee future performance. Use these models for learning, not for investing your savings.
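The splitting rule can be illustrated without any market data. The sketch below uses a synthetic random-walk series (not real Nifty prices) and a hypothetical helper `scale_with_train_stats`. It splits chronologically and computes the min-max scaling statistics from the training portion only, which is the same precaution that applies to any preprocessing step fitted on data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic upward-drifting series standing in for daily closing prices.
prices = 10_000 + np.cumsum(rng.normal(5, 50, size=500))

# Chronological split: the first 80% is the past, the last 20% is the future.
split = int(len(prices) * 0.8)
train, test = prices[:split], prices[split:]

# Scaling statistics come from the training period only.
train_min, train_max = train.min(), train.max()

def scale_with_train_stats(values):
    return (values - train_min) / (train_max - train_min)

train_scaled = scale_with_train_stats(train)
test_scaled = scale_with_train_stats(test)

print(f"train days: {len(train)}, test days: {len(test)}")
print(f"train scaled range: [{train_scaled.min():.3f}, {train_scaled.max():.3f}]")
print(f"test scaled range:  [{test_scaled.min():.3f}, {test_scaled.max():.3f}]")
```

Test values may fall outside [0, 1] when the later period reaches prices the training period never saw; that is expected and is a more honest picture of deployment than scaling with statistics from the full series.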

5B. Bidirectional LSTM for Named Entity Recognition (Indian News)

We build a BiLSTM model to identify named entities (Person, Organization, Location) in Indian news text.

Python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Embedding, Bidirectional, LSTM, Dense, TimeDistributed, Dropout
)
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# ─── 1. Sample Indian News NER Dataset ───
# NER Tags: O=Outside, B-PER=Begin Person, I-PER=Inside Person,
# B-ORG=Begin Org, I-ORG=Inside Org, B-LOC=Begin Location

sentences = [
    ["Narendra", "Modi", "visited", "Varanasi", "yesterday"],
    ["TCS", "reported", "strong", "quarterly", "results"],
    ["Flipkart", "CEO", "Kalyan", "Krishnamurthy", "spoke",
     "at", "Bangalore", "tech", "summit"],
    ["HDFC", "Bank", "launched", "UPI", "services",
     "in", "Mumbai"],
    ["Sundar", "Pichai", "announced", "Google", "investment",
     "in", "India"],
]

labels = [
    ["B-PER", "I-PER", "O", "B-LOC", "O"],
    ["B-ORG", "O", "O", "O", "O"],
    ["B-ORG", "O", "B-PER", "I-PER", "O",
     "O", "B-LOC", "O", "O"],
    ["B-ORG", "I-ORG", "O", "B-ORG", "O",
     "O", "B-LOC"],
    ["B-PER", "I-PER", "O", "B-ORG", "O",
     "O", "B-LOC"],
]

# ─── 2. Build Vocabulary and Tag Mappings ───
words = sorted(set(w for s in sentences for w in s))
tags  = sorted(set(t for l in labels for t in l))

word2idx = {w: i+2 for i, w in enumerate(words)}  # 0=PAD, 1=UNK
word2idx["PAD"] = 0
word2idx["UNK"] = 1
tag2idx = {t: i for i, t in enumerate(tags)}
idx2tag = {i: t for t, i in tag2idx.items()}

n_words = len(word2idx)
n_tags  = len(tag2idx)
MAX_LEN = 15

# ─── 3. Encode and Pad Sequences ───
X = [[word2idx.get(w, 1) for w in s] for s in sentences]
y = [[tag2idx[t] for t in l] for l in labels]

X_pad = pad_sequences(X, maxlen=MAX_LEN, padding="post")
y_pad = pad_sequences(y, maxlen=MAX_LEN, padding="post",
                       value=tag2idx["O"])
y_cat = to_categorical(y_pad, num_classes=n_tags)

# ─── 4. Build BiLSTM Model ───
EMBED_DIM = 64
LSTM_UNITS = 128

model = Sequential([
    Embedding(input_dim=n_words, output_dim=EMBED_DIM,
              input_length=MAX_LEN, mask_zero=True),
    
    # Bidirectional LSTM: forward + backward = 256 dims
    Bidirectional(LSTM(LSTM_UNITS, return_sequences=True)),
    Dropout(0.3),
    
    # Second BiLSTM layer
    Bidirectional(LSTM(64, return_sequences=True)),
    Dropout(0.3),
    
    # TimeDistributed: apply Dense to each time step
    TimeDistributed(Dense(n_tags, activation="softmax"))
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)

model.summary()

# ─── 5. Train (in production, use thousands of sentences) ───
model.fit(X_pad, y_cat, epochs=50, batch_size=2, verbose=0)

# ─── 6. Predict on a New Sentence ───
# Words outside the 5-sentence training vocabulary map to UNK (index 1).
test_sentence = ["Ratan", "Tata", "founded", "Tata",
                 "Digital", "in", "Pune"]
test_encoded = [word2idx.get(w, 1) for w in test_sentence]
test_padded = pad_sequences([test_encoded], maxlen=MAX_LEN,
                             padding="post")

pred = model.predict(test_padded, verbose=0)
pred_tags = [idx2tag[np.argmax(p)] for p in pred[0][:len(test_sentence)]]

print("\nNER Predictions:")
print("-" * 40)
for word, tag in zip(test_sentence, pred_tags):
    marker = "  ◄" if tag != "O" else ""
    print(f"  {word:20s} → {tag}{marker}")

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                      Output Shape           Param #
=================================================================
 embedding (Embedding)             (None, 15, 64)         2,176
 bidirectional (Bidirectional)     (None, 15, 256)        197,632
 dropout (Dropout)                 (None, 15, 256)        0
 bidirectional_1 (Bidirectional)   (None, 15, 128)        164,352
 dropout_1 (Dropout)               (None, 15, 128)        0
 time_distributed (TimeDistributed)(None, 15, 6)          774
=================================================================
Total params: 364,934

NER Predictions:
----------------------------------------
  Ratan                → B-PER  ◄
  Tata                 → I-PER  ◄
  founded              → O
  Tata                 → B-ORG  ◄
  Digital              → I-ORG  ◄
  in                   → O
  Pune                 → B-LOC  ◄

Why BiLSTM matters for Indian NER: Indian languages have rich morphology and code-mixing. The same word "Tata" can be a person (Ratan Tata) or an organization (Tata Group) — the bidirectional context is essential for disambiguation. Companies like Reverie Language Technologies (Bangalore) and Vernacular.ai (now Skit.ai) build NER systems for 12+ Indian languages using BiLSTM architectures.

Section 6

Visual Diagrams

6A. LSTM Cell — Complete Architecture

LSTM Cell at Time Step t

c⟨t-1⟩ ──────────── × ───────────────── + ──────────────→ c⟨t⟩
(cell state)        │ (forget)          │ (update)
                    │                   │
                  f⟨t⟩          i⟨t⟩ ── × ── c̃⟨t⟩
                    │             │            │
                 ┌──┴──┐       ┌──┴──┐      ┌──┴───┐
                 │  σ  │       │  σ  │      │ tanh │
                 └──┬──┘       └──┬──┘      └──┬───┘
                    │             │            │
                    └─────────────┬────────────┘
                                  │
                           [a⟨t-1⟩, x⟨t⟩]  (concatenation of previous hidden state and input)

c⟨t⟩ ──→ tanh ──→ × ──→ a⟨t⟩ (hidden state output)
                  │
                o⟨t⟩
                  │
               ┌──┴──┐
               │  σ  │ (output gate)
               └──┬──┘
                  │
           [a⟨t-1⟩, x⟨t⟩]

Legend: σ = sigmoid (0 to 1)     × = element-wise multiply
        tanh = tanh (-1 to 1)    + = element-wise add

6B. GRU Cell — Simplified Architecture

GRU Cell at Time Step t

h⟨t-1⟩ ──────── × ───────────────── + ──────────────→ h⟨t⟩
                │ (keep old)        │ (add new)
              z⟨t⟩       (1 − z⟨t⟩) × h̃⟨t⟩
                │                      │
             ┌──┴──┐                ┌──┴───┐
             │  σ  │                │ tanh │
             └──┬──┘                └──┬───┘
                │                      │
                │            [r⟨t⟩ ⊙ h⟨t-1⟩, x⟨t⟩]
                │              │
                │           ┌──┴──┐
                │           │  σ  │ (reset gate r⟨t⟩)
                │           └──┬──┘
                │              │
                └──────┬───────┘
                       │
                [h⟨t-1⟩, x⟨t⟩]

Key Insight: z and (1 − z) form a convex combination.
z ≈ 1 → copy old state
z ≈ 0 → use new candidate
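
The convex-combination behaviour in the diagram can be checked numerically. The following is a minimal NumPy sketch of one GRU step under the chapter's convention h⟨t⟩ = z ⊙ h⟨t−1⟩ + (1 − z) ⊙ h̃⟨t⟩. The weights are random and the extreme update-gate biases (±20) are chosen purely for illustration; they are assumptions, not values from the chapter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step using h = z * h_prev + (1 - z) * h_cand."""
    concat = np.concatenate([h_prev, x])
    z = sigmoid(W_z @ concat + b_z)                      # update gate
    r = sigmoid(W_r @ concat + b_r)                      # reset gate
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x]) + b_h)
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
n, m = 3, 2                                              # hidden size, input size
W_z, W_r, W_h = (rng.normal(scale=0.5, size=(n, n + m)) for _ in range(3))
b_r = b_h = np.zeros(n)
h_prev = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 0.5])

# A large positive update-gate bias pushes z toward 1 (copy the old state);
# a large negative one pushes z toward 0 (take the new candidate).
h_keep = gru_step(h_prev, x, W_z, W_r, W_h, np.full(n, 20.0), b_r, b_h)
h_new = gru_step(h_prev, x, W_z, W_r, W_h, np.full(n, -20.0), b_r, b_h)

print("z ≈ 1:", np.round(h_keep, 3))   # essentially h_prev
print("z ≈ 0:", np.round(h_new, 3))    # essentially the tanh candidate
```

In a trained network z is learned per dimension and per time step, so some units can hold their value for many steps while others refresh on every input.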

6C. LSTM vs GRU — Side by Side

Aspect	LSTM (3 gates + candidate, 2 states)	GRU (2 gates + candidate, 1 state)
States	c⟨t⟩ (cell state), a⟨t⟩ (hidden state)	h⟨t⟩ (hidden state only)
Gates	f (forget) → σ; i (input) → σ; o (output) → σ	z (update) → σ; r (reset) → σ
Candidate	c̃ → tanh	h̃ → tanh
Update rule	c⟨t⟩ = f⊙c⟨t−1⟩ + i⊙c̃; a⟨t⟩ = o ⊙ tanh(c⟨t⟩)	h⟨t⟩ = z⊙h⟨t−1⟩ + (1−z)⊙h̃
Params	4n(n+m) + 4n	3n(n+m) + 3n
Strengths	✅ Better long-range memory; ✅ More interpretable gates	✅ 25% fewer parameters; ✅ Faster training
Trade-offs	⚠️ Slower per step	⚠️ Less long-range capacity

6D. Unrolled Bidirectional LSTM

Bidirectional LSTM (Unrolled)

Forward:   →h₁ ──→ →h₂ ──→ →h₃ ──→ →h₄ ──→ →h₅
            ↑       ↑       ↑       ↑       ↑
            x₁      x₂      x₃      x₄      x₅
            ↓       ↓       ↓       ↓       ↓
Backward:  ←h₁ ←── ←h₂ ←── ←h₃ ←── ←h₄ ←── ←h₅

Output:   [→h₁,   [→h₂,   [→h₃,   [→h₄,   [→h₅,
           ←h₁]    ←h₂]    ←h₃]    ←h₄]    ←h₅]
            │       │       │       │       │
            ↓       ↓       ↓       ↓       ↓
            ŷ₁      ŷ₂      ŷ₃      ŷ₄      ŷ₅

Each ŷ sees BOTH past and future context!

Example: "Sachin" "scored" "a" "century" "at"
ŷ₁ for "Sachin" knows:
  →h₁ = just "Sachin" (no past)
  ←h₁ = "Sachin" + "scored a century at..." (full future!)
  → Can correctly tag as B-PER because it "sees" what comes next
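
The two-pass structure in the diagram can be sketched in a few lines of NumPy. To keep the example short, a plain tanh RNN cell stands in for the LSTM cell; the sizes, random weights, and random "embeddings" are all assumptions made for illustration. The point is the wiring: run once left to right, once right to left, re-align the backward states, and concatenate.

```python
import numpy as np

def run_rnn(xs, W, n):
    """Run a toy tanh RNN over xs of shape (T, m); return all hidden states (T, n)."""
    h = np.zeros(n)
    states = []
    for x in xs:
        h = np.tanh(W @ np.concatenate([h, x]))
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, m, n = 5, 4, 3                        # 5 tokens, 4-dim embeddings, 3 hidden units
xs = rng.normal(size=(T, m))             # hypothetical embeddings for the 5 tokens
W_fwd = rng.normal(scale=0.5, size=(n, n + m))
W_bwd = rng.normal(scale=0.5, size=(n, n + m))

h_fwd = run_rnn(xs, W_fwd, n)                      # reads x1 → x5
h_bwd = run_rnn(xs[::-1], W_bwd, n)[::-1]          # reads x5 → x1, then re-aligned
outputs = np.concatenate([h_fwd, h_bwd], axis=1)   # one 2n-dim vector per token

# Change only the last token: the forward state at position 1 cannot notice,
# but the backward state at position 1 has already read it.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
h_fwd2 = run_rnn(xs_changed, W_fwd, n)
h_bwd2 = run_rnn(xs_changed[::-1], W_bwd, n)[::-1]

print(outputs.shape)
print("forward state 1 unchanged: ", np.allclose(h_fwd[0], h_fwd2[0]))
print("backward state 1 unchanged:", np.allclose(h_bwd[0], h_bwd2[0]))
```

This is also why the backward pass rules out streaming use: the state at position 1 cannot be computed until the final token has arrived.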

Section 7

Worked Example — Tracing Through an LSTM Cell by Hand

Let's trace through one time step of an LSTM cell with concrete numbers. We'll use tiny dimensions (n=2, m=2) to make the math tractable.

Setup

Hidden size n = 2, Input size m = 2. The concatenated vector [a⟨t−1⟩, x⟨t⟩] has size 4.

Given Values

Previous States

a⟨t−1⟩ = [0.5, −0.3]ᵀ c⟨t−1⟩ = [0.8, 1.2]ᵀ

Current Input

x⟨t⟩ = [1.0, 0.5]ᵀ

Concatenated Input

[a⟨t−1⟩, x⟨t⟩] = [0.5, −0.3, 1.0, 0.5]ᵀ

Weight Matrices (simplified, 2×4 each)

W_f = [[0.2, 0.1, 0.3, −0.1], [0.1, 0.4, −0.2, 0.2]]

W_i = [[0.3, −0.1, 0.2, 0.1], [−0.2, 0.3, 0.1, 0.4]]

W_c = [[0.1, 0.2, 0.5, −0.3], [0.4, −0.1, 0.3, 0.2]]

W_o = [[−0.1, 0.3, 0.2, 0.1], [0.2, 0.1, −0.3, 0.5]]

b_f = [1.0, 1.0]ᵀ (initialized to 1 for "remember by default")

b_i = b_c = b_o = [0, 0]ᵀ

Step 1: Forget Gate

W_f · [a, x] = [0.2(0.5) + 0.1(−0.3) + 0.3(1.0) + (−0.1)(0.5), 0.1(0.5) + 0.4(−0.3) + (−0.2)(1.0) + 0.2(0.5)]

= [0.1 − 0.03 + 0.3 − 0.05, 0.05 − 0.12 − 0.2 + 0.1] = [0.32, −0.17]

f⟨t⟩ = σ([0.32, −0.17] + [1.0, 1.0]) = σ([1.32, 0.83])

f⟨t⟩ = [σ(1.32), σ(0.83)] = [0.789, 0.696]

✅ Both values are well above 0.5 → the LSTM retains most of the old cell state (because we initialized b_f = 1).

Step 2: Input Gate

W_i · [a, x] = [0.3(0.5) + (−0.1)(−0.3) + 0.2(1.0) + 0.1(0.5), (−0.2)(0.5) + 0.3(−0.3) + 0.1(1.0) + 0.4(0.5)]

= [0.15 + 0.03 + 0.2 + 0.05, −0.1 − 0.09 + 0.1 + 0.2] = [0.43, 0.11]

i⟨t⟩ = σ([0.43, 0.11]) = [0.606, 0.527]

Step 3: Cell Candidate

W_c · [a, x] = [0.1(0.5) + 0.2(−0.3) + 0.5(1.0) + (−0.3)(0.5), 0.4(0.5) + (−0.1)(−0.3) + 0.3(1.0) + 0.2(0.5)]

= [0.05 − 0.06 + 0.5 − 0.15, 0.2 + 0.03 + 0.3 + 0.1] = [0.34, 0.63]

c̃⟨t⟩ = tanh([0.34, 0.63]) = [0.327, 0.558]

Step 4: Cell State Update (THE KEY STEP)

c⟨t⟩ = f⟨t⟩ ⊙ c⟨t−1⟩ + i⟨t⟩ ⊙ c̃⟨t⟩
= [0.789, 0.696] ⊙ [0.8, 1.2] + [0.606, 0.527] ⊙ [0.327, 0.558]
= [0.631, 0.836] + [0.198, 0.294]
= [0.830, 1.130]

Intermediate values are carried at full precision and only the displayed figures are rounded to three decimals, so multiplying the rounded numbers by hand can differ in the last digit.

📊 Analysis: Dimension 1 changed from 0.8 to 0.830 (slight increase: mostly remembered, plus a little new information). Dimension 2 changed from 1.2 to 1.130 (slight decrease: forgot ~30% of the old value, then added new information).

Step 5: Output Gate

W_o · [a, x] = [−0.1(0.5) + 0.3(−0.3) + 0.2(1.0) + 0.1(0.5), 0.2(0.5) + 0.1(−0.3) + (−0.3)(1.0) + 0.5(0.5)]

= [−0.05 − 0.09 + 0.2 + 0.05, 0.1 − 0.03 − 0.3 + 0.25] = [0.11, 0.02]

o⟨t⟩ = σ([0.11, 0.02]) = [0.527, 0.505]

Step 6: Hidden State

a⟨t⟩ = o⟨t⟩ ⊙ tanh(c⟨t⟩)
= [0.527, 0.505] ⊙ tanh([0.830, 1.130])
= [0.527, 0.505] ⊙ [0.680, 0.811]
= [0.359, 0.410]

What Did the LSTM Cell Do?

Forget gate ≈ 0.74 average: Retained ~74% of old cell memory (biased toward remembering)
Input gate ≈ 0.57 average: Moderately accepted new information
Cell state: Changed by only ~5% — stable memory!
Output gate ≈ 0.52: Revealed about half of the cell state information
Hidden state: Updated from [0.5, −0.3] to [0.359, 0.410] — a smooth transition
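
The hand trace can be cross-checked with a short NumPy script. This is a sketch that simply re-evaluates the six steps with the given weights at full floating-point precision. The variable names are illustrative, and the third decimal of some intermediate values may differ slightly from figures obtained by rounding at every step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Given values from the worked example
a_prev = np.array([0.5, -0.3])
c_prev = np.array([0.8, 1.2])
x_t = np.array([1.0, 0.5])
concat = np.concatenate([a_prev, x_t])         # [a⟨t−1⟩, x⟨t⟩]

W_f = np.array([[0.2, 0.1, 0.3, -0.1], [0.1, 0.4, -0.2, 0.2]])
W_i = np.array([[0.3, -0.1, 0.2, 0.1], [-0.2, 0.3, 0.1, 0.4]])
W_c = np.array([[0.1, 0.2, 0.5, -0.3], [0.4, -0.1, 0.3, 0.2]])
W_o = np.array([[-0.1, 0.3, 0.2, 0.1], [0.2, 0.1, -0.3, 0.5]])
b_f = np.array([1.0, 1.0])                     # "remember by default"
b_i = b_c = b_o = np.zeros(2)

f_t = sigmoid(W_f @ concat + b_f)              # Step 1: forget gate
i_t = sigmoid(W_i @ concat + b_i)              # Step 2: input gate
c_cand = np.tanh(W_c @ concat + b_c)           # Step 3: cell candidate
c_t = f_t * c_prev + i_t * c_cand              # Step 4: cell state update
o_t = sigmoid(W_o @ concat + b_o)              # Step 5: output gate
a_t = o_t * np.tanh(c_t)                       # Step 6: hidden state

for name, value in [("f", f_t), ("i", i_t), ("c_cand", c_cand),
                    ("c", c_t), ("o", o_t), ("a", a_t)]:
    print(f"{name:7s}", np.round(value, 3))
```

Wrapping these six lines in a function and calling it in a loop over time steps gives a complete, if unoptimized, LSTM forward pass.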

Section 8

Case Study — HDFC Bank: LSTM-Powered Fraud Detection

🏦 HDFC Bank — Detecting Fraud in Transaction Sequences with LSTMs

The Problem

HDFC Bank, India's largest private bank (₹18+ lakh crore in assets, 80+ million customers), processes over 3 crore transactions daily through debit cards, credit cards, UPI, and net banking. Their legacy fraud detection system used rule-based thresholds:

Flag if transaction > ₹50,000
Flag if transaction from a new merchant category
Flag if transaction from a new geography

This system had a 65% false positive rate, measured over flagged transactions (strictly, a false discovery rate): for every 100 flagged transactions, 65 were legitimate. Each false positive required manual review (₹150-200 per investigation) and, worse, froze the customer's account, leading to 12,000+ customer complaints per month.

The Insight: Fraud is a Sequence Problem

The key insight was that fraud is not about individual transactions — it's about transaction patterns over time. A legitimate customer might:

Morning: ₹250 chai + breakfast at regular shop
Afternoon: ₹1,200 lunch at office canteen
Evening: ₹45,000 Amazon purchase (birthday gift)

A fraudster using a stolen card might:

12:03 AM: ₹99 at online store (testing if card works)
12:05 AM: ₹15,000 electronics purchase
12:07 AM: ₹25,000 electronics purchase
12:08 AM: ₹50,000 jewelry store

The pattern — rapid escalation, unusual timing, category hopping — is far more informative than any single transaction.

The LSTM Architecture

HDFC Bank's data science team (in collaboration with a Bangalore-based AI startup) built a 2-layer stacked LSTM:

Component	Details
Input Features (per txn)	Transaction amount (log-scaled), merchant category code, time delta from previous txn, geographical distance from previous txn, day-of-week, hour-of-day — 12 features total
Sequence Length	Last 30 transactions per customer
Architecture	LSTM(128) → Dropout(0.3) → LSTM(64) → Dense(32, ReLU) → Dense(1, Sigmoid)
Training Data	2.4 crore transaction sequences (18 months), ~0.1% fraud rate (class-imbalanced)
Class Balancing	SMOTE oversampling + focal loss (α=0.25, γ=2)
Training Infrastructure	4× NVIDIA A100 GPUs on AWS Mumbai (ap-south-1), ~8 hours training
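
The table names focal loss with α=0.25 and γ=2 but does not spell out the formula. The sketch below uses the common binary form FL = −α_t (1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class. Whether the team used exactly this variant is an assumption, and the probabilities in the example are made up for illustration.

```python
import math

def cross_entropy(p, y, eps=1e-7):
    """Plain binary cross-entropy for one prediction p = P(fraud)."""
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p if y == 1 else 1.0 - p)

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t)**gamma."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p               # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class weighting term
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy legitimate transaction (model already says 5% fraud) versus
# a missed fraud (model also says 5% fraud, but the label is 1).
easy_negative = focal_loss(0.05, y=0)
missed_fraud = focal_loss(0.05, y=1)

print(f"easy negative: CE={cross_entropy(0.05, 0):.4f}  focal={easy_negative:.6f}")
print(f"missed fraud:  CE={cross_entropy(0.05, 1):.4f}  focal={missed_fraud:.6f}")
```

With a ~0.1% fraud rate, almost every example is an easy negative. The (1 − p_t)^γ factor shrinks their contribution so that the rare, badly misclassified frauds dominate the gradient.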

Results (After 6-Month Production Deployment)

Metric	Rule-Based System	LSTM System	Improvement
False Positive Rate	65%	25%	↓ 40 percentage points
Fraud Detection Rate (Recall)	72%	91%	↑ 19 percentage points
Customer Complaints (monthly)	12,000	4,200	↓ 65%
Manual Review Cost (monthly)	₹2.1 crore	₹72 lakh	↓ ₹1.38 crore/month
Fraud Losses Prevented (annual)	₹340 crore	₹580 crore	↑ ₹240 crore
Avg Inference Latency	2ms (rule lookup)	15ms (LSTM)	Still within real-time SLA

Why LSTM, Not GRU or Transformer?

vs GRU: LSTM slightly outperformed GRU (91% vs 88% recall) on this task because transaction sequences needed long-range memory: tracking a customer's spending pattern across the full 30-transaction window benefited from the separate cell state
vs Transformer: At the time of deployment, Transformers required more compute for inference (critical for real-time fraud detection where latency SLA is 50ms). The team is now piloting a Transformer-based model for the next version
vs CNN: 1D CNNs were tested but missed temporal ordering patterns (a ₹50K purchase after 5 small purchases is suspicious; 5 small purchases after a ₹50K purchase is usually normal)

Lessons Learned

Feature engineering still matters: Log-scaling transaction amounts and computing time-deltas between transactions improved accuracy by 8%
Forget gate bias = 1.0 was critical: Without it, the model "forgot" early transactions in the 30-step window
Inference latency is non-negotiable: The model runs on TensorFlow Serving with ONNX optimization — average 15ms per prediction
Explainability: RBI compliance requires explaining why a transaction was flagged. The team visualizes gate activations to show which past transactions contributed to the fraud score
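
The first lesson (log-scaled amounts and time-deltas) can be illustrated with a small, dependency-free sketch. The transactions below mirror the stolen-card example from earlier in this case study; the date, the field names, and the `to_features` helper are hypothetical and cover only four of the twelve features listed in the architecture table.

```python
import math
from datetime import datetime

# Hypothetical transactions: (timestamp, amount in ₹)
txns = [
    ("2024-01-05 00:03", 99),
    ("2024-01-05 00:05", 15000),
    ("2024-01-05 00:07", 25000),
    ("2024-01-05 00:08", 50000),
]

def to_features(txns):
    """Convert raw transactions into per-step feature dicts for a sequence model."""
    feats, prev_time = [], None
    for stamp, amount in txns:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        # Minutes since the previous transaction (0 for the first one)
        delta_min = 0.0 if prev_time is None else (t - prev_time).total_seconds() / 60.0
        feats.append({
            "log_amount": math.log1p(amount),   # compresses ₹99 vs ₹50,000 into a small range
            "delta_min": delta_min,
            "hour": t.hour,
            "day_of_week": t.weekday(),         # Monday = 0
        })
        prev_time = t
    return feats

features = to_features(txns)
for row in features:
    print({k: round(v, 2) for k, v in row.items()})
```

After log-scaling, the amounts span roughly 4.6 to 10.8 instead of 99 to 50,000, and the one-to-two-minute gaps between purchases become an explicit input rather than something the network has to infer.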

RBI Mandate: The Reserve Bank of India's 2022 circular on "Digital Payment Security" mandates that banks implement AI/ML-based fraud detection systems for all digital transactions above ₹2,000. This has accelerated LSTM adoption across Indian banking — ICICI, SBI, and Axis Bank have all deployed similar architectures.

Section 9

Common Mistakes & Misconceptions

Mistake 1: "LSTMs completely solve the vanishing gradient problem."

LSTMs mitigate but don't eliminate vanishing gradients. For extremely long sequences (1000+ steps), even LSTMs struggle. The cell state can accumulate noise over many steps. For truly long-range dependencies, attention mechanisms (Chapter 16) or Transformers are needed.

Mistake 2: "More LSTM layers = better performance."

Stacking beyond 3 layers rarely helps and often hurts. Unlike CNNs (which benefit from 50+ layers with ResNets), RNNs already have "depth" through time. Adding layers adds depth per time step, which gives diminishing returns. Google's GNMT used 8 layers, but it needed residual connections and large-scale multi-GPU training to make that depth work.

Mistake 3: "GRU is always worse than LSTM because it has fewer parameters."

On many benchmarks (text classification, sentiment analysis, short-sequence tasks), GRU performs on par with or even better than LSTM. Fewer parameters means less overfitting on small datasets. The empirical evidence (Chung et al., 2014) shows no consistent winner — it depends on the task.

Mistake 4: "Bidirectional LSTMs can be used for all sequence tasks."

BiLSTMs require the complete input sequence. They CANNOT be used for:

Real-time speech recognition (processing while user is speaking)
Language model next-word prediction
Online/streaming stock prediction
Chatbot response generation

They CAN be used when you have the full input: NER, sentiment analysis, machine translation (encoder side), question answering.

Mistake 5: "Initializing all biases to 0 is fine for LSTMs."

The forget gate bias should be initialized to 1.0 or 2.0, not 0. With b_f = 0, the sigmoid output starts at 0.5, which means the LSTM forgets 50% of its memory at every step from the start. With b_f = 1, σ(1) ≈ 0.73 — the LSTM starts by remembering most information, learning what to forget over training.
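
A back-of-the-envelope calculation shows how quickly the difference compounds. The sketch below assumes the simplest possible case: the forget gate sits at its initial value σ(b_f) at every step and nothing new is written to the cell. Real gates also depend on the input, so this is an illustration of the starting point only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def retained_fraction(b_f, steps):
    """Share of the initial cell state left after `steps` steps, assuming the
    forget gate stays at sigmoid(b_f) and the input gate writes nothing."""
    return sigmoid(b_f) ** steps

for b_f in (0.0, 1.0, 2.0):
    kept = retained_fraction(b_f, 10)
    print(f"b_f = {b_f:.1f}: gate = {sigmoid(b_f):.3f}, kept after 10 steps = {kept:.4f}")
```

Under these assumptions, b_f = 0 leaves about 0.1% of the original memory after 10 steps, b_f = 1 about 4%, and b_f = 2 about 28%. Keras's `LSTM` layer exposes this idea through its `unit_forget_bias` argument, which defaults to `True`.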

Mistake 6: "Dropout should be applied to recurrent connections."

Standard dropout, which samples a fresh mask at every time step, disrupts the recurrent connection a⟨t⟩ → a⟨t+1⟩: each unit's memory path is cut at random steps, so long-range signal and gradients rarely survive. Instead, use:

Dropout between LSTM layers (on the vertical connection)
Variational dropout / recurrent dropout (same mask across time steps, as in Gal & Ghahramani 2016)

In Keras: LSTM(128, dropout=0.2, recurrent_dropout=0.2) implements this correctly.
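
The difference between the two masking schemes is easy to see by generating the masks directly. This is a toy sketch of the masks only, not of Keras internals; the sizes and the 0.5 rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, rate = 6, 4, 0.5          # time steps, hidden units, dropout rate
keep = 1.0 - rate

# Standard dropout: a fresh mask is sampled at every time step.
per_step_masks = rng.random((T, n)) < keep

# Variational / recurrent dropout: one mask is sampled per sequence
# and reused at every time step.
shared_mask = np.tile(rng.random(n) < keep, (T, 1))

# A unit's recurrent path stays intact only if it is kept at every step.
survives_per_step = per_step_masks.all(axis=0)
survives_shared = shared_mask.all(axis=0)

print("per-step masks:\n", per_step_masks.astype(int))
print("shared mask:\n", shared_mask.astype(int))
print("units intact for all steps (per-step):", int(survives_per_step.sum()))
print("units intact for all steps (shared):  ", int(survives_shared.sum()))
```

With per-step masks, a unit survives all T steps with probability keep^T (about 1.6% here), so almost every memory path is broken somewhere. With a shared mask, the units that are kept stay kept for the whole sequence, so their gradient path through time is uninterrupted.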

Section 10

Comparison Table — RNN Architectures

Feature	Vanilla RNN	LSTM	GRU	Bidirectional LSTM
Year	1986	1997	2014	1997 (concept)
States	1 (hidden)	2 (cell + hidden)	1 (hidden)	2 per direction
Gates	None	3 (forget, input, output)	2 (update, reset)	3 per direction
Params per layer	n(n+m) + n	4[n(n+m) + n]	3[n(n+m) + n]	8[n(n+m) + n]
Long-range deps	~10-20 steps	~200-500 steps	~100-300 steps	~200-500 steps
Training speed	Fastest	Slow	Medium	Slowest (2× LSTM)
Gradient flow	Poor (vanishing)	Good (cell highway)	Good (z-gate)	Good (both directions)
Real-time capable?	✅ Yes	✅ Yes	✅ Yes	❌ No (needs full seq)
Best for	Very short sequences	Long sequences, production	Medium sequences, mobile	NER, classification, MT
Indian use case	Basic time-series	HDFC fraud detection	Jio voice assistant	Flipkart review NER
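
The "Params per layer" row can be turned into a few one-line helpers and checked against numbers that appear elsewhere in this chapter. The function names are illustrative; the formulas are the ones in the table, with n the hidden size and m the input size.

```python
def rnn_params(n, m):
    """Vanilla RNN: one weight matrix over [a, x] plus one bias vector."""
    return n * (n + m) + n

def lstm_params(n, m):
    """LSTM: forget, input, output gates plus the cell candidate (4 sets)."""
    return 4 * rnn_params(n, m)

def gru_params(n, m):
    """GRU in the textbook form: update, reset gates plus the candidate (3 sets)."""
    return 3 * rnn_params(n, m)

def bilstm_params(n, m):
    """Bidirectional LSTM: two independent LSTMs."""
    return 2 * lstm_params(n, m)

print(lstm_params(128, 1))      # first LSTM layer of the Nifty 50 model
print(lstm_params(64, 128))     # second LSTM layer of the Nifty 50 model
print(bilstm_params(128, 64))   # first BiLSTM layer of the NER model
print(lstm_params(256, 100), gru_params(256, 100))
```

The first three values (66,560, 49,408 and 197,632) match the Keras model summaries shown earlier. One caveat: Keras's `GRU` layer with its default `reset_after=True` keeps two bias vectors per gate, so its reported count is 3[n(n+m) + 2n] rather than the textbook figure.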

When to Choose What — Decision Flowchart

Sequence Task
└─ Need future context?
   ├─ YES → BiLSTM (NER, SA, MT encoder)
   └─ NO → Sequence length > 300?
      ├─ YES → LSTM (longer memory)
      └─ NO → Small dataset or mobile?
         ├─ YES → GRU (fewer params)
         └─ NO → LSTM or GRU (try both!)

Section 11

Exercises

Section A — Multiple Choice Questions (10)

What is the primary purpose of the forget gate (f⟨t⟩) in an LSTM?

A. To forget the input at the current time step
B. To decide what information to erase from the cell state
C. To forget the output of the previous time step
D. To reset the hidden state to zero

✅ B — The forget gate outputs values in [0,1] that are element-wise multiplied with the previous cell state c⟨t−1⟩. A value of 0 erases that dimension; a value of 1 retains it completely. It does NOT forget the input or hidden state directly.

Remember | LSTM Gates | Beginner

The cell state update in an LSTM is: c⟨t⟩ = f⟨t⟩ ⊙ c⟨t−1⟩ + i⟨t⟩ ⊙ c̃⟨t⟩. Why does this additive form help with vanishing gradients?

A. Because addition is faster than multiplication on GPUs
B. Because the gradient of c⟨t⟩ w.r.t. c⟨t−1⟩ is simply f⟨t⟩, avoiding repeated matrix multiplications
C. Because the cell state values are always positive
D. Because sigmoid outputs are always between 0 and 1

✅ B — ∂c⟨t⟩/∂c⟨t−1⟩ = f⟨t⟩ (a diagonal matrix of sigmoid values). When f⟨t⟩ ≈ 1, gradients pass through unchanged. This avoids the repeated multiplication by W_aa that causes vanishing gradients in vanilla RNNs. The gradient flow through the cell state is a product of scalar forget gate values, not full weight matrices.

Understand | Gradient Flow | Intermediate

How many parameter matrices (weights + biases) does a single GRU cell have?

A. 2 weight matrices + 2 bias vectors
B. 3 weight matrices + 3 bias vectors
C. 4 weight matrices + 4 bias vectors
D. 6 weight matrices + 6 bias vectors

✅ B — A GRU has 3 sets: update gate (W_z, b_z), reset gate (W_r, b_r), and candidate hidden state (W_h, b_h). This is one fewer than LSTM (which has 4: forget, input, candidate, output).

Remember | GRU Architecture | Beginner

In the GRU update equation h⟨t⟩ = z⟨t⟩ ⊙ h⟨t−1⟩ + (1 − z⟨t⟩) ⊙ h̃⟨t⟩, what happens when z⟨t⟩ = 1 for all dimensions?

A. The hidden state is completely replaced by the candidate
B. The hidden state becomes zero
C. The hidden state is copied from the previous time step unchanged
D. The GRU behaves like a vanilla RNN

✅ C — When z = 1: h⟨t⟩ = 1·h⟨t−1⟩ + 0·h̃ = h⟨t−1⟩. The state is perfectly copied through, creating a "skip connection" through time. This is how GRUs can preserve long-range information.

Understand | GRU Mechanics | Intermediate

Which of the following tasks CANNOT use a Bidirectional LSTM?

A. Named Entity Recognition on completed documents
B. Sentiment analysis of movie reviews
C. Real-time next-word prediction in a keyboard app
D. Part-of-speech tagging of sentences

✅ C — Next-word prediction requires generating words one at a time, left-to-right. The backward RNN in a BiLSTM needs the complete sequence, which isn't available during generation. All other options have the complete input available before prediction.

Apply | BiLSTM Constraints | Intermediate

For an LSTM with hidden size n=256 and input size m=100, approximately how many parameters does a single layer have?

A. ~91,000
B. ~182,000
C. ~274,000
D. ~366,000

✅ D — LSTM parameters = 4[n(n+m) + n] = 4[256×356 + 256] = 4[91,136 + 256] = 4 × 91,392 = 365,568 ≈ 366,000. Option C (~274,000) would be the GRU count: 3 × 91,392.

Apply | Parameter Count | Intermediate

Why should the forget gate bias in an LSTM be initialized to a positive value (e.g., 1.0)?

A. To make the sigmoid output start near 0, encouraging forgetting
B. To make the sigmoid output start near 1, encouraging remembering at the beginning of training
C. To prevent the cell state from becoming negative
D. To match the output gate initialization

✅ B — σ(1.0) ≈ 0.73, so the LSTM starts by remembering ~73% of information. This prevents premature information loss before the model has learned what to forget. Jozefowicz et al. (2015) showed this initialization is crucial for good LSTM performance.

Understand | Initialization | Intermediate

In a stacked 3-layer LSTM, what is the input to the second LSTM layer at time step t?

A. The original input x⟨t⟩
B. The hidden state of the first layer at time step t−1
C. The hidden state of the first layer at time step t
D. The cell state of the first layer at time step t

✅ C — In stacked LSTMs, the hidden state output of layer l at time t serves as the input to layer l+1 at the same time step t. The cell state stays within each layer and is not passed vertically. The first layer takes x⟨t⟩, the second layer takes a₁⟨t⟩, and the third takes a₂⟨t⟩.

Understand | Stacked Architecture | Intermediate

The GRU's reset gate r⟨t⟩ is applied to h⟨t−1⟩ before computing the candidate h̃⟨t⟩. What is the effect when r⟨t⟩ ≈ 0?

A. The candidate depends only on the current input x⟨t⟩, ignoring history
B. The candidate is identical to the previous hidden state
C. The update gate is forced to 0
D. The GRU output becomes zero

✅ A — When r ≈ 0, the term r⊙h⟨t−1⟩ ≈ 0, so h̃ = tanh(W_h · [0, x⟨t⟩] + b_h). The candidate is computed as if there's no history — the model "resets" and starts fresh. This is useful at sentence boundaries or topic changes.

Analyze | GRU Reset Gate | Advanced

Q10

In the HDFC Bank fraud detection case study, why did LSTM outperform 1D-CNN on the transaction sequence task?

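A. LSTMs have more parameters and are always more powerful
B. CNNs cannot process sequential data at all
C. LSTMs preserve temporal ordering — the order of transactions matters for fraud patterns, which CNNs with fixed receptive fields may miss
D. LSTMs are faster at inference than CNNs

✅ C — The case study reports that 1D CNNs missed temporal ordering patterns: a ₹50K purchase after five small purchases is suspicious, while the reverse order usually is not.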

Standard dropout applies a different random mask at each time step. Recurrent dropout applies the same mask across all time steps. Why does this distinction matter for gradient flow?

Expected Length

4-5 sentences

Section C — Long Answer Questions (3)

C1. Draw the LSTM Cell and Label All Gates (15 marks)

Instructions

Draw a complete LSTM cell diagram showing:

The cell state "conveyor belt" (c⟨t−1⟩ → c⟨t⟩) — show the flow from left to right
The forget gate (f) with its σ activation — show it connecting to the cell state via element-wise multiplication
The input gate (i) with its σ activation
The cell candidate (c̃) with its tanh activation — show how i and c̃ combine via element-wise multiplication
The additive junction where f⊙c⟨t−1⟩ and i⊙c̃ combine
The output gate (o) with its σ activation
The hidden state output a⟨t⟩ = o ⊙ tanh(c⟨t⟩)
All inputs: a⟨t−1⟩, x⟨t⟩, and the concatenation [a⟨t−1⟩, x⟨t⟩]

Write the complete equation next to each gate. Explain why the additive cell state update solves vanishing gradients (5 marks).

C2. Compare LSTM and GRU Architectures (15 marks)

Instructions

Write a detailed comparison covering:

Architecture: Draw both cells side by side. Map GRU gates to LSTM gates and explain the correspondence (5 marks)
Mathematics: Write all equations for both architectures. Show how the GRU's update equation h⟨t⟩ = z⊙h⟨t−1⟩ + (1−z)⊙h̃ is analogous to but simpler than the LSTM's cell state update (4 marks)
Parameter analysis: For n=512, m=300, compute the exact parameter count for both architectures. What's the percentage reduction? (3 marks)
Practical guidance: Provide 3 scenarios where LSTM is preferred and 3 where GRU is preferred, with justification (3 marks)

C3. Design a Fraud Detection System Using LSTMs (15 marks)

Instructions

You are the lead ML engineer at a UPI payment company (like PhonePe or Google Pay) processing 800 crore UPI transactions per month. Design a complete LSTM-based fraud detection system:

Data representation: What features would you extract from each transaction? How would you form sequences? Justify your sequence length choice (4 marks)
Architecture: Propose a specific model architecture (number of layers, hidden sizes, bidirectional or not, output structure). Explain each design choice (4 marks)
Training strategy: Address class imbalance (fraud is ~0.01% of transactions), choice of loss function, and validation strategy (3 marks)
Deployment: Address inference latency requirements (UPI mandate: 30-second transaction timeout), model serving architecture, and how to handle cold-start (new users with no history) (4 marks)

Section D — Programming Assignments (2)

D1. Nifty 50 Price Prediction — LSTM vs GRU Comparison

Task

Build two models — one using LSTM layers and one using GRU layers — to predict the next-day closing price of the Nifty 50 index. Compare them on:

Test MAE and MAPE
Training time per epoch
Total trainable parameters
Loss convergence curves (plot both on the same graph)

Specifications

Use yfinance to download Nifty 50 data (^NSEI) from 2015-2024
Lookback window: 60 days
Both models: 2 stacked layers (128 → 64 units) with Dropout(0.2)
Train for 100 epochs with EarlyStopping
Normalize prices using MinMaxScaler
Use 80-20 chronological split (NO shuffling!)

Deliverables

A Jupyter notebook with both models, training curves overlay, test predictions overlay on actual prices, and a 200-word analysis of which model performed better and why.

D2. Bidirectional LSTM for Hindi-English NER

Task

Build a BiLSTM-based Named Entity Recognition system for code-mixed Hindi-English text common in Indian social media. Use the provided dataset format or generate your own.

Example Data

Sentence: "Modi ji ne Varanasi mein rally ki"
Tags:      B-PER O O  B-LOC   O    O    O

Sentence: "Flipkart ka Big Billion Days sale start"
Tags:      B-ORG  O  B-EVENT I-EVENT I-EVENT O O

Requirements

Create at least 50 training sentences with Indian entities
Model: Embedding(64) → BiLSTM(128) → Dropout(0.3) → BiLSTM(64) → TimeDistributed(Dense)
Entity types: PER, ORG, LOC, EVENT, PRODUCT (with B- and I- prefixes)
Evaluate using entity-level F1 score (use seqeval library)
Show 10 example predictions on unseen sentences

Section E — Mini-Project

🚀 Project: Indian Stock Market Multi-Feature LSTM Predictor

Overview

Build a comprehensive stock prediction system for Indian markets using multiple features and LSTM variants. This goes beyond simple price prediction to incorporate volume, technical indicators, and market sentiment.

Phase 1: Data Collection (Week 1)

Download daily OHLCV data for 5 stocks: Reliance, TCS, HDFC Bank, Infosys, and ICICI Bank from NSE using yfinance
Compute technical indicators: 20-day SMA, 50-day EMA, RSI (14-day), MACD, Bollinger Bands
Add market-wide features: Nifty 50 daily return, India VIX

Phase 2: Model Development (Week 2)

Build 3 model variants: (a) Simple LSTM, (b) Stacked LSTM, (c) Bidirectional LSTM (train on completed windows)
Input: 30-day window of 10+ features
Output: Next-day direction (up/down) — classification task
Handle class balance and use proper time-series cross-validation

Phase 3: Analysis (Week 3)

Compare all 3 architectures on accuracy, precision, recall, and F1
Analyze: Which features contribute most? (Use ablation study)
Visualize LSTM gate activations for interesting sequences (e.g., market crash days, budget day)
Write a 500-word report with investment-context analysis

Grading Rubric

Component	Marks
Data pipeline + feature engineering	20
Model implementation (3 variants)	30
Evaluation and comparison	20
Visualization and interpretability	15
Report and code quality	15
Total	100

Section 12

Chapter Summary

Key Takeaways from Chapter 14

The Vanishing Gradient Problem — Vanilla RNNs fail on sequences longer than ~10-20 steps because gradients are products of many weight matrices: if eigenvalues < 1, gradients vanish; if > 1, they explode.
LSTM Architecture — Introduces a separate cell state (c⟨t⟩) that flows through time with additive updates. Three gates control information flow:
- Forget gate (f): What to erase from memory — σ(W_f · [a⟨t−1⟩, x⟨t⟩] + b_f)
- Input gate (i): What new info to write — σ(W_i · [a⟨t−1⟩, x⟨t⟩] + b_i)
- Output gate (o): What to reveal — σ(W_o · [a⟨t−1⟩, x⟨t⟩] + b_o)
Cell update: c⟨t⟩ = f⊙c⟨t−1⟩ + i⊙c̃ (additive — gradient flows easily)
GRU Architecture — Simplifies LSTM by merging cell and hidden states, and using 2 gates instead of 3:
- Update gate (z): Controls forgetting AND updating simultaneously
- Reset gate (r): Controls how much history to use for the candidate
25% fewer parameters, similar performance on many tasks
LSTM vs GRU — LSTM is better for longer sequences (a few hundred steps) and interpretability; GRU is better for small datasets, mobile deployment, and faster experimentation. There is no universal winner, so try both.
Bidirectional RNNs — Run two RNNs (forward + backward) on the same sequence. Essential for NER, sentiment analysis, and tasks where future context helps. Cannot be used for real-time/streaming predictions.
Stacked/Deep RNNs — 2-3 layers is the sweet spot. Add dropout between layers; on the recurrent connections themselves, use only recurrent (variational) dropout with one mask shared across time steps. Use residual connections for 4+ layers.
Practical Tips:
- Initialize forget gate bias to 1.0 (remember by default)
- Use recurrent dropout (same mask across time), not standard dropout
- Never shuffle time-series data for train/test split
- LSTM parameters = 4n(n+m) + 4n; GRU = 3n(n+m) + 3n
Industry Impact — HDFC Bank's LSTM-based fraud detection reduced the false positive rate by 40 percentage points (65% → 25%) and prevented ₹240 crore in additional annual fraud. LSTMs power translation, NER, speech, and financial systems across India.

Section 13

References

Foundational Papers

Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735–1780. — The original LSTM paper introducing the cell state and gating mechanism.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). "Learning to Forget: Continual Prediction with LSTM." Neural Computation, 12(10), 2451–2471. — Added the forget gate (not in the original 1997 paper!).
Cho, K. et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." EMNLP 2014. — The paper introducing GRU.
Chung, J. et al. (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." NIPS 2014 Workshop. — Comprehensive LSTM vs GRU comparison.
Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). "An Empirical Exploration of Recurrent Network Architectures." ICML 2015. — Showed forget gate bias initialization to 1.0 is critical.

Architecture Variants

Schuster, M. & Paliwal, K. K. (1997). "Bidirectional Recurrent Neural Networks." IEEE Transactions on Signal Processing, 45(11). — The original bidirectional RNN paper.
Graves, A. (2013). "Generating Sequences With Recurrent Neural Networks." arXiv:1308.0850. — Handwriting generation with stacked LSTMs.
Wu, Y. et al. (2016). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation." arXiv:1609.08144. — GNMT with 8-layer stacked LSTMs and residual connections.
Gal, Y. & Ghahramani, Z. (2016). "A Theoretically Grounded Application of Dropout to Recurrent Neural Networks." NIPS 2016. — Variational/recurrent dropout for LSTMs.

Textbooks

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 10: Sequence Modeling — Detailed treatment of LSTM and GRU architectures.
Chollet, F. (2021). Deep Learning with Python, 2nd Edition. Manning. Chapter 10 — Practical LSTM/GRU implementations in Keras.
Jurafsky, D. & Martin, J. H. (2023). Speech and Language Processing, 3rd Edition (Draft). Chapter 9 — LSTMs in NLP context.

Indian Industry Context

HDFC Bank Annual Report (2023-24) — Sections on digital banking technology, AI/ML fraud detection deployment statistics.
RBI Circular on Digital Payment Security Controls (2022) — Mandate for AI/ML-based fraud detection in Indian banking, driving LSTM adoption.
NASSCOM AI in BFSI Report (2023) — Survey of AI/ML adoption in Indian banking and financial services, including sequence modeling for fraud detection and credit scoring.
