Learning Objectives
Understand why feedforward networks fail on sequential data and how time dependencies arise.
Derive the RNN recurrence relation h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b).
Derive BPTT from first principles and understand its computational cost.
Prove mathematically why gradients vanish/explode via repeated matrix multiplication.
Master all four gates (forget, input, candidate, output) with complete equations.
Understand the update and reset gates and when GRU is preferred.
Learn to process sequences in both directions and stack multiple layers.
Build encoder-decoder architectures for tasks like machine translation.
Get a preview of how attention solves the information bottleneck in seq2seq.
Implement NLP, time series forecasting, and generative models using RNNs/LSTMs.
Introduction
Consider reading this sentence. You do not understand each word in isolation: the meaning of "bank" changes depending on whether the preceding words were "river" or "investment." Your brain processes text sequentially, carrying a running summary of what came before. Feedforward neural networks (Chapter 12) cannot do this: each input is processed independently, with no memory of the past.
Recurrent Neural Networks (RNNs) solve this by introducing loops in the network, allowing information to persist from one time step to the next. They were the first neural architecture capable of handling variable-length sequences (sentences, audio waveforms, stock prices, DNA sequences) by maintaining a hidden state that encodes a compressed history of everything seen so far.
However, vanilla RNNs have a critical flaw: they struggle to learn long-range dependencies. The meaning of a word at position 1 might be crucial for understanding word 50, but the gradient signal carrying this information vanishes exponentially during training. Long Short-Term Memory (LSTM) networks, invented by Hochreiter & Schmidhuber in 1997, solve this with a gating mechanism that controls what information to remember, what to forget, and what to output.
In this chapter, we will build RNNs and LSTMs from first principles: starting with the math, proving why gradients vanish, deriving every LSTM gate equation, and then implementing everything in both raw NumPy and TensorFlow. Along the way, we will forecast Nifty50 stock prices, generate Hindi text, and understand the systems behind Google Translate and voice assistants.
RNNs are foundational. Even though Transformers (Chapter 22) have replaced them in many NLP tasks, understanding RNNs is essential because: (a) LSTMs still dominate time-series forecasting, (b) the concepts of hidden states and gating appear throughout modern deep learning, and (c) many interview questions still test RNN knowledge.
Historical Background
Timeline of Sequence Modeling
| Year | Milestone | Significance |
|---|---|---|
| 1982 | Hopfield Networks | John Hopfield introduces associative memory networks with recurrence |
| 1986 | Simple RNN (Jordan) | Michael Jordan proposes recurrent connections for sequence processing |
| 1990 | Elman Network | Jeffrey Elman introduces the "simple recurrent network" with hidden-to-hidden connections |
| 1990 | BPTT Formalized | Werbos formalizes backpropagation through time |
| 1991 | Vanishing Gradient | Hochreiter identifies the vanishing gradient problem in his diploma thesis |
| 1997 | LSTM Invented | Hochreiter & Schmidhuber publish LSTM, the breakthrough |
| 2000 | Forget Gate Added | Gers, Schmidhuber & Cummins add the forget gate to LSTM |
| 2005 | Bidirectional LSTM | Graves & Schmidhuber apply BiLSTMs to phoneme classification |
| 2014 | GRU Proposed | Cho et al. introduce the Gated Recurrent Unit, a simpler alternative |
| 2014 | Seq2Seq | Sutskever, Vinyals & Le introduce encoder-decoder for machine translation |
| 2015 | Attention Mechanism | Bahdanau et al. add attention to seq2seq, removing the bottleneck |
| 2017 | Transformer | Vaswani et al. replace recurrence entirely with self-attention |
Indian railways (IRCTC) began experimenting with LSTM-based models for ticket demand forecasting around 2018, aiming to optimize dynamic pricing on Rajdhani and Shatabdi routes. Indian fintech firms like Zerodha and Groww also use LSTM models for real-time stock signal generation on BSE/NSE data.
Conceptual Explanation
4.1 Why Sequences Need Special Networks
Consider three challenges that feedforward networks cannot handle:
- Variable length: Sentences have different lengths. A feedforward net needs a fixed-size input.
- Order matters: "Dog bites man" ≠ "Man bites dog", but a bag-of-words feedforward net treats them identically.
- Long-range dependencies: in "The cat, which sat on the mat and played with the yarn, was happy", the verb "was" must agree with "cat" many words earlier.
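The "order matters" point can be checked directly. The sketch below is only an illustration; `bag_of_words` is a hypothetical helper introduced here, not part of any library. It shows that a bag-of-words representation cannot tell the two sentences apart, while the ordered token lists differ.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count words while discarding their order (hypothetical helper)."""
    return Counter(sentence.lower().split())

a = "Dog bites man"
b = "Man bites dog"

# Identical word counts: a model fed only these counts sees the same input.
same_bag = bag_of_words(a) == bag_of_words(b)

# The ordered token sequences are different, which is what an RNN consumes.
same_sequence = a.lower().split() == b.lower().split()

print(same_bag, same_sequence)
```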
4.2 The Core Idea: Hidden State as Memory
An RNN processes one element of the sequence at a time. At each time step t, it receives input x_t and combines it with the hidden state from the previous time step h_{t-1} to produce a new hidden state h_t. Think of h_t as a "compressed summary" of everything the network has seen from time step 1 through time step t.
4.3 Vanilla RNN
The simplest RNN uses a single recurrence equation, followed by an output projection:
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
y_t = W_hy · h_t + b_y
Where: W_hh = hidden-to-hidden weights, W_xh = input-to-hidden weights, W_hy = hidden-to-output weights. The same weights are shared across all time steps; this is called weight sharing and is what makes RNNs parameter-efficient.
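A minimal sketch of weight sharing, assuming small illustrative sizes and random untrained weights (the names `run_rnn`, `short_seq`, and `long_seq` are introduced here for illustration). The same three parameter arrays are applied at every step of the recurrence h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h), so sequences of any length yield a hidden state of the same size.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4                                # illustrative input and hidden sizes
W_xh = rng.normal(scale=0.5, size=(n, d))  # input-to-hidden weights
W_hh = rng.normal(scale=0.5, size=(n, n))  # hidden-to-hidden weights
b_h = np.zeros(n)

def run_rnn(xs):
    """Apply the recurrence over a sequence and return the final hidden state."""
    h = np.zeros(n)
    for x_t in xs:                         # the same weights are reused at every step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
    return h

short_seq = rng.normal(size=(2, d))        # 2 time steps
long_seq = rng.normal(size=(7, d))         # 7 time steps
h_short = run_rnn(short_seq)
h_long = run_rnn(long_seq)
print(h_short.shape, h_long.shape)
```

The parameter count does not depend on the sequence length, only on d and n.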
4.4 The Vanishing Gradient Problem
During backpropagation through time (BPTT), gradients must flow backward through many time steps. Each step multiplies by the weight matrix W_hh and the derivative of tanh. Since |tanh'(x)| ≤ 1, the gradient shrinks exponentially whenever the largest singular value of W_hh is below 1: this is the vanishing gradient. Conversely, if W_hh has singular values above 1 and the tanh units are not saturated, gradients can grow exponentially: this is the exploding gradient.
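The effect of repeated multiplication can be seen in a deliberately simplified setting. The sketch below assumes zero input and zero bias, so the hidden state stays at 0, tanh' equals 1, and the Jacobian product reduces to a power of W_hh; W_hh is taken to be a scaled orthogonal matrix so that every singular value equals the chosen scale. These are assumptions made for the demonstration, not properties of a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 4, 20
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))  # orthogonal: all singular values are 1

def jacobian_norm(scale):
    """Spectral norm of the product of `steps` Jacobians diag(1 - h^2) @ W_hh."""
    W_hh = scale * Q
    h = np.zeros(n)                           # zero input and bias keep h at 0
    J = np.eye(n)
    for _ in range(steps):
        h = np.tanh(W_hh @ h)
        J = np.diag(1 - h**2) @ W_hh @ J      # chain rule: one more factor per step
    return np.linalg.norm(J, 2)

shrinking = jacobian_norm(0.5)                # equals 0.5**20 up to rounding
growing = jacobian_norm(1.5)                  # equals 1.5**20 up to rounding
print(f"{shrinking:.3e}  {growing:.3e}")
```

With non-zero inputs the tanh derivative falls below 1, which shrinks the product further; that is why vanishing is the more common failure in practice.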
4.5 LSTM: The Solution
LSTM introduces a separate cell state C_t that runs like a conveyor belt through time, with gates that control information flow:
- Forget gate (f_t): Decides what to erase from cell state
- Input gate (i_t): Decides what new info to write
- Candidate gate (C̃_t): Creates candidate values to add
- Output gate (o_t): Decides what to output from cell state
4.6 GRU: Simplified LSTM
The Gated Recurrent Unit merges the forget and input gates into a single update gate, and uses a reset gate to control how much past information to forget. It has fewer parameters and trains faster, with comparable performance on many tasks.
4.7 Bidirectional RNNs
In many tasks (like named entity recognition), the meaning of a word depends on both preceding and following context. A Bidirectional RNN runs two separate RNNs, one forward and one backward, and concatenates their hidden states.
4.8 Sequence-to-Sequence (Seq2Seq)
For tasks where input and output sequences have different lengths (e.g., translation from English to Hindi), we use an encoder-decoder architecture: the encoder RNN compresses the input into a fixed-size context vector, and the decoder RNN generates the output sequence from this vector.
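The encoder-decoder idea can be sketched with two tiny untrained RNNs. Everything below is an assumption for illustration: the sizes, the random weights, the all-zeros start token, and greedy argmax decoding. The point is only that inputs of different lengths are compressed into a context vector of the same fixed size, and that the decoder output length is independent of the input length.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, n, vocab_out = 5, 6, 4                  # illustrative sizes

enc_W_xh = rng.normal(scale=0.3, size=(n, d_in))
enc_W_hh = rng.normal(scale=0.3, size=(n, n))
dec_W_xh = rng.normal(scale=0.3, size=(n, vocab_out))
dec_W_hh = rng.normal(scale=0.3, size=(n, n))
dec_W_hy = rng.normal(scale=0.3, size=(vocab_out, n))

def encode(xs):
    """Run the encoder RNN; its final hidden state is the context vector."""
    h = np.zeros(n)
    for x_t in xs:
        h = np.tanh(enc_W_hh @ h + enc_W_xh @ x_t)
    return h

def decode(context, max_len):
    """Greedy decoding: start from the context and feed each prediction back in."""
    h = context
    y_prev = np.zeros(vocab_out)              # stand-in for a start-of-sequence token
    tokens = []
    for _ in range(max_len):
        h = np.tanh(dec_W_hh @ h + dec_W_xh @ y_prev)
        idx = int(np.argmax(dec_W_hy @ h))    # pick the highest-scoring output token
        tokens.append(idx)
        y_prev = np.eye(vocab_out)[idx]       # one-hot of the chosen token
    return tokens

context_short = encode(rng.normal(size=(3, d_in)))    # 3-step input
context_long = encode(rng.normal(size=(12, d_in)))    # 12-step input
output_tokens = decode(context_long, max_len=5)
print(context_short.shape, context_long.shape, output_tokens)
```

Because every input, however long, must pass through the same n-dimensional vector, this context is the information bottleneck that attention was later introduced to relieve.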
While Transformers have largely replaced RNNs in NLP (2020+), LSTMs remain the go-to choice for many time-series applications (financial forecasting, IoT sensor prediction, weather modeling, and demand estimation) due to lower data requirements and faster inference on edge devices.
Mathematical Foundation
5.1 Vanilla RNN Equations
Given an input sequence (x_1, x_2, ..., x_T) where x_t ∈ ℝ^d, hidden state h_t ∈ ℝ^n:
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
W_hh ∈ ℝ^{n×n}, W_xh ∈ ℝ^{n×d}, b_h ∈ ℝ^n
y_t = softmax(W_hy · h_t + b_y)
W_hy ∈ ℝ^{k×n}, b_y ∈ ℝ^k (k = output classes)
5.2 LSTM Full Equations
Let [h_{t-1}, x_t] denote the concatenation of previous hidden state and current input. The LSTM computes:
Forget Gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Input Gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
Candidate: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
Cell State: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
Output Gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Hidden State: h_t = o_t ⊙ tanh(C_t)
Where σ is the sigmoid function and ⊙ is the element-wise (Hadamard) product.
5.3 GRU Equations
Update Gate: z_t = σ(W_z · [h_{t-1}, x_t] + b_z)
Reset Gate: r_t = σ(W_r · [h_{t-1}, x_t] + b_r)
Candidate: h̃_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)
Hidden State: h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
5.4 Parameter Count Comparison
| Architecture | Parameters (hidden=n, input=d) | Example (n=128, d=64) |
|---|---|---|
| Vanilla RNN | n² + n·d + n (+ output) | 24,704 |
| LSTM | 4 × (n² + n·d + n) | 98,816 |
| GRU | 3 × (n² + n·d + n) | 74,112 |
5.5 Bidirectional RNN
Forward: h_t→ = RNN(x_t, h_{t-1}→)
Backward: h_t← = RNN(x_t, h_{t+1}←)
Combined: h_t = [h_t→ ; h_t←] ∈ ℝ^{2n}
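A minimal sketch of the bidirectional combination, assuming plain tanh RNNs with random untrained weights (`make_rnn` and `run` are hypothetical helpers introduced here). The backward pass reads the reversed sequence, and its states are flipped back so that position t of both lists refers to the same input before concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T = 3, 4, 5                            # illustrative sizes

def make_rnn():
    """Return random (W_xh, W_hh) for one direction; values are illustrative."""
    return rng.normal(scale=0.3, size=(n, d)), rng.normal(scale=0.3, size=(n, n))

def run(xs, W_xh, W_hh):
    """Return the hidden states for xs, in the order the inputs were processed."""
    h, states = np.zeros(n), []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        states.append(h)
    return states

xs = rng.normal(size=(T, d))
fwd = run(xs, *make_rnn())                   # reads x_1 ... x_T
bwd = run(xs[::-1], *make_rnn())[::-1]       # reads x_T ... x_1, then re-aligned to time order
combined = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(combined), combined[0].shape)
```

Each combined state has dimension 2n, so any layer stacked on top must expect twice the hidden size of a single direction.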
GATE/NET exams frequently ask you to calculate the parameter count of an LSTM layer. Remember: LSTM has 4× the parameters of a vanilla RNN because it has 4 gate weight matrices (forget, input, candidate, output), each of the same size as the vanilla RNN's single weight matrix.
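The counting rule can be written as a one-line function. This is a sketch of the formula n² + n·d + n per gate for the recurrent layer only (no output layer); `rnn_params` is a hypothetical helper name.

```python
def rnn_params(n, d, gates=1):
    """Recurrent-layer parameters: each gate has n*n + n*d weights and n biases."""
    return gates * (n * n + n * d + n)

n, d = 128, 64
vanilla = rnn_params(n, d, gates=1)   # single tanh layer
lstm = rnn_params(n, d, gates=4)      # forget, input, candidate, output
gru = rnn_params(n, d, gates=3)       # update, reset, candidate
print(vanilla, lstm, gru)
```

Framework summaries can differ slightly from this formula. For example, the default Keras GRU (`reset_after=True`) keeps two bias vectors per gate, so it reports 3 × (n² + n·d + 2n) parameters.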
Formula Derivations
6.1 Deriving Backpropagation Through Time (BPTT)
We derive BPTT from first principles. The total loss across all T time steps is:
L = Σ_{t=1}^{T} L_t
The hidden state at time t is:
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
To compute ∂L/∂W_hh, we apply the chain rule. Since W_hh affects the loss at step t through every earlier hidden state h_1, ..., h_t:
∂L_t/∂W_hh = Σ_{k=1}^{t} (∂L_t/∂h_t) · (∂h_t/∂h_k) · (∂h_k/∂W_hh)
The key term is the Jacobian product:
∂h_t/∂h_k = Π_{i=k+1}^{t} ∂h_i/∂h_{i-1}
where ∂h_i/∂h_{i-1} = diag(1 - h_i²) · W_hh
Here, diag(1 - h_i²) is the diagonal matrix of tanh derivatives. This product of matrices is the source of the vanishing/exploding gradient.
6.2 Proving the Vanishing Gradient
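Theorem: For a vanilla RNN with hidden state dimension n and ‖W_hh‖ < 1, the gradient ∂h_t/∂h_k decays exponentially as (t-k) increases.
Proof:
‖∂h_t/∂h_k‖ = ‖Π_{i=k+1}^{t} diag(1 - h_i²) · W_hh‖
≤ Π_{i=k+1}^{t} ‖diag(1 - h_i²)‖ · ‖W_hh‖
Since |tanh'(x)| = |1 - tanh²(x)| ≤ 1, we have ‖diag(1 - h_i²)‖ ≤ 1
Therefore: ‖∂h_t/∂h_k‖ ≤ ‖W_hh‖^{t-k}
If ‖W_hh‖ < 1 → ‖∂h_t/∂h_k‖ → 0 (vanishing) □
If ‖W_hh‖ > 1 → the bound no longer forces decay, and ‖∂h_t/∂h_k‖ can grow exponentially (exploding). Note that ‖W_hh‖ > 1 is necessary for explosion but not sufficient, because saturated tanh units still shrink the product.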
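The bound ‖∂h_t/∂h_k‖ ≤ ‖W_hh‖^{t-k} can be checked numerically. The sketch below assumes random inputs and a random W_hh rescaled so that its spectral norm is exactly 0.9; it accumulates the Jacobian product step by step and records both the observed norm and the bound. It is an illustration of the inequality, not a proof.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d, T = 6, 3, 25                         # illustrative sizes
W_hh = rng.normal(size=(n, n))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)      # force the spectral norm of W_hh to 0.9
W_xh = rng.normal(size=(n, d))

h = np.zeros(n)
J = np.eye(n)                              # J accumulates dh_t/dh_0
observed, bound = [], []
for t in range(1, T + 1):
    h = np.tanh(W_hh @ h + W_xh @ rng.normal(size=d))
    J = np.diag(1 - h**2) @ W_hh @ J       # multiply in dh_t/dh_{t-1}
    observed.append(np.linalg.norm(J, 2))
    bound.append(0.9 ** t)

print(f"after {T} steps: observed {observed[-1]:.2e}, bound {bound[-1]:.2e}")
```

The observed norm is typically far below the bound, because the tanh derivatives are strictly less than 1 once the hidden units move away from zero.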
6.3 Why LSTM Solves Vanishing Gradients
In LSTM, the cell state gradient along the direct path C_k → C_t (treating the gate values as constants) flows through:
∂C_t/∂C_k = Π_{i=k+1}^{t} f_i
Crucially, the forget gate f_t ∈ (0,1) is learned. When the network learns to set f_t ≈ 1, gradients flow along this path with almost no decay. This is the "constant error carousel": the cell state acts as a highway for gradient flow. Unlike vanilla RNN where gradients must pass through tanh and a fixed weight matrix, LSTM gradients pass through learned gate values that can be close to 1.
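To get a feel for the product of forget gates, the sketch below compares two constant gate values over 100 steps. The values 0.99 and 0.5 are chosen for illustration only; in a trained LSTM the gates vary per step and per unit.

```python
import numpy as np

T = 100
f_open = np.full(T, 0.99)     # forget gate that stays close to 1
f_leaky = np.full(T, 0.50)    # forget gate that halves the cell state each step

grad_open = np.prod(f_open)   # direct-path dC_T/dC_0 with f = 0.99
grad_leaky = np.prod(f_leaky) # direct-path dC_T/dC_0 with f = 0.5
print(f"{grad_open:.3f} vs {grad_leaky:.3e}")
```

Even a gate of 0.99 attenuates the signal to roughly a third after 100 steps, so "no decay" is an idealization; the practical gain is that the decay rate is learned rather than fixed by W_hh and tanh.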
6.4 Deriving GRU from LSTM
GRU simplifies LSTM by:
- Merging forget gate and input gate into one update gate: z_t (where input = z_t, forget = 1 - z_t)
- Removing the separate cell state, so the hidden state serves both roles
- Adding a reset gate to control past hidden state influence on candidates
LSTM: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t (independent gates f_t and i_t)
GRU: h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t (coupled via z_t)
The "constant error carousel" is the key insight of LSTM. It's analogous to ResNet's skip connections (Chapter 18). Both solve vanishing gradients by providing a shortcut path for gradient flow. In LSTM, this path is the cell state; in ResNet, it's the identity mapping. This deep connection shows up frequently in research interviews.
Worked Numerical Examples
Example 1: RNN Forward Pass (3 Time Steps)
Input dimension d=2, hidden dimension n=2. Inputs: x₁=[1,0], x₂=[0,1], x₃=[1,1]. Initial h₀=[0,0].
Weights (simplified):
W_xh = [[0.5, 0.3], [0.1, 0.4]], W_hh = [[0.2, 0.1], [0.3, 0.2]], b_h = [0, 0]
Time Step 1: x₁ = [1, 0], h₀ = [0, 0]
z₁ = W_hh · h₀ + W_xh · x₁ + b_h
= [0, 0] + [0.5·1 + 0.3·0, 0.1·1 + 0.4·0] + [0, 0]
= [0.5, 0.1]
h₁ = tanh([0.5, 0.1]) = [0.4621, 0.0997]
Time Step 2: x₂ = [0, 1], h₁ = [0.4621, 0.0997]
W_hh · h₁ = [0.2·0.4621 + 0.1·0.0997, 0.3·0.4621 + 0.2·0.0997]
= [0.1024, 0.1586]
W_xh · x₂ = [0.5·0 + 0.3·1, 0.1·0 + 0.4·1] = [0.3, 0.4]
z₂ = [0.4024, 0.5586]
h₂ = tanh([0.4024, 0.5586]) = [0.3820, 0.5069]
Time Step 3: x₃ = [1, 1], h₂ = [0.3820, 0.5069]
W_hh · h₂ = [0.2·0.3820 + 0.1·0.5069, 0.3·0.3820 + 0.2·0.5069]
= [0.1271, 0.2160]
W_xh · x₃ = [0.5 + 0.3, 0.1 + 0.4] = [0.8, 0.5]
z₃ = [0.9271, 0.7160]
h₃ = tanh([0.9271, 0.7160]) = [0.7292, 0.6144]
Observation: h₃ = [0.7292, 0.6144] encodes information from all three inputs!
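The hand calculation can be replayed in NumPy. The weight matrices below are read off from the products written out in the steps (for example, W_xh · x₁ = [0.5·1 + 0.3·0, 0.1·1 + 0.4·0]); the zero bias matches the "+ [0, 0]" term. Agreement with the hand-computed values should hold to roughly three decimal places, since hand calculations round intermediate results.

```python
import numpy as np

W_xh = np.array([[0.5, 0.3], [0.1, 0.4]])   # rows read off from W_xh · x_t above
W_hh = np.array([[0.2, 0.1], [0.3, 0.2]])   # rows read off from W_hh · h_t above
b_h = np.zeros(2)
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]

h = np.zeros(2)                              # h_0
history = []
for t, x_t in enumerate(xs, start=1):
    z = W_hh @ h + W_xh @ x_t + b_h          # pre-activation z_t
    h = np.tanh(z)                           # new hidden state h_t
    history.append(h)
    print(f"t={t}: z={z.round(4)}, h={h.round(4)}")
```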
Example 2: LSTM Gate Computation (1 Time Step)
Hidden dim n=2, input dim d=2. h₀=[0,0], C₀=[0,0], x₁=[1,0.5]
Concatenated [h₀, x₁] = [0, 0, 1, 0.5] (dim 4). We use simplified weight matrices W ∈ ℝ^{2×4}. Because h₀ = [0, 0], only the last two columns of each matrix and the bias contribute to the sums below:
W_i = [[0.3, 0.1, 0.2, 0.2], [0.1, 0.3, 0.3, 0.1]] b_i = [0, 0]
W_C = [[0.2, 0.3, 0.4, 0.1], [0.3, 0.2, 0.1, 0.4]] b_C = [0, 0]
W_o = [[0.1, 0.1, 0.5, 0.2], [0.2, 0.2, 0.2, 0.5]] b_o = [0, 0]
Step 1: Forget Gate
f₁ = σ(W_f · [h₀, x₁] + b_f)
= σ([0.3+0.05+0.5, 0.1+0.15+0.5])
= σ([0.85, 0.75]) = [0.7006, 0.6792]
Step 2: Input Gate
i₁ = σ(W_i · [h₀, x₁] + b_i)
= σ([0.2+0.1, 0.3+0.05]) = σ([0.3, 0.35]) = [0.5744, 0.5866]
Step 3: Candidate Cell
C̃₁ = tanh(W_C · [h₀, x₁] + b_C)
= tanh([0.4+0.05, 0.1+0.2]) = tanh([0.45, 0.3]) = [0.4219, 0.2913]
Step 4: Cell State Update
C₁ = f₁ ⊙ C₀ + i₁ ⊙ C̃₁
= [0.7006, 0.6792] ⊙ [0, 0] + [0.5744, 0.5866] ⊙ [0.4219, 0.2913]
= [0, 0] + [0.2424, 0.1709] = [0.2424, 0.1709]
Step 5: Output Gate & Hidden State
o₁ = σ(W_o · [h₀, x₁] + b_o)
= σ([0.5+0.1, 0.2+0.25]) = σ([0.6, 0.45]) = [0.6457, 0.6106]
h₁ = o₁ ⊙ tanh(C₁) = [0.6457, 0.6106] ⊙ tanh([0.2424, 0.1709])
= [0.6457, 0.6106] ⊙ [0.2377, 0.1692] = [0.1535, 0.1033]
Result: h₁ = [0.1535, 0.1033], C₁ = [0.2424, 0.1709]. The forget gate values (~0.7) mean we'd retain about 70% of previous cell state if it were non-zero.
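The gate computations can be replayed in NumPy as a check. One assumption is needed: the first two columns of W_f are not given in the example, so they are set to 0 here as placeholders. This does not affect the result, because those columns multiply h₀ = [0, 0]; the last two columns and b_f = [0.5, 0.5] are read off from the sums in Step 1. Agreement with the hand-computed values should hold to roughly three decimal places.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h0, C0 = np.zeros(2), np.zeros(2)
x1 = np.array([1.0, 0.5])
concat = np.concatenate([h0, x1])            # [0, 0, 1, 0.5]

# ASSUMPTION: the h-columns of W_f (first two columns) are placeholders set to 0;
# they multiply h0 = 0, so any values would give the same result.
W_f = np.array([[0.0, 0.0, 0.3, 0.1], [0.0, 0.0, 0.1, 0.3]])
b_f = np.array([0.5, 0.5])
W_i = np.array([[0.3, 0.1, 0.2, 0.2], [0.1, 0.3, 0.3, 0.1]])
W_C = np.array([[0.2, 0.3, 0.4, 0.1], [0.3, 0.2, 0.1, 0.4]])
W_o = np.array([[0.1, 0.1, 0.5, 0.2], [0.2, 0.2, 0.2, 0.5]])

f1 = sigmoid(W_f @ concat + b_f)             # forget gate
i1 = sigmoid(W_i @ concat)                   # input gate (zero bias)
C_tilde = np.tanh(W_C @ concat)              # candidate values (zero bias)
C1 = f1 * C0 + i1 * C_tilde                  # cell state update
o1 = sigmoid(W_o @ concat)                   # output gate (zero bias)
h1 = o1 * np.tanh(C1)                        # new hidden state

print("f1 =", f1.round(4), "C1 =", C1.round(4), "h1 =", h1.round(4))
```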
Visual Diagrams
Diagram 1: Vanilla RNN Unrolled
Diagram 2: LSTM Cell Architecture
Diagram 3: GRU Cell
Diagram 4: Seq2Seq Encoder-Decoder
Flowcharts
Flowchart 1: Choosing RNN Architecture
Flowchart 2: BPTT Training Algorithm
Python Implementation (From Scratch)
10.1 Vanilla RNN in NumPy
import numpy as np

class VanillaRNN:
    """
    Vanilla RNN implementation from scratch using NumPy.
    h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y
    """
    def __init__(self, input_dim, hidden_dim, output_dim):
        self.hidden_dim = hidden_dim
        # Xavier initialization
        scale_xh = np.sqrt(2.0 / (input_dim + hidden_dim))
        scale_hh = np.sqrt(2.0 / (hidden_dim + hidden_dim))
        scale_hy = np.sqrt(2.0 / (hidden_dim + output_dim))
        self.W_xh = np.random.randn(hidden_dim, input_dim) * scale_xh
        self.W_hh = np.random.randn(hidden_dim, hidden_dim) * scale_hh
        self.b_h = np.zeros((hidden_dim, 1))
        self.W_hy = np.random.randn(output_dim, hidden_dim) * scale_hy
        self.b_y = np.zeros((output_dim, 1))

    def forward(self, inputs, h_prev=None):
        """
        Forward pass through the entire sequence.
        inputs: list of column vectors [x_1, x_2, ..., x_T]
        Returns: outputs, hidden_states
        """
        if h_prev is None:
            h_prev = np.zeros((self.hidden_dim, 1))
        self.inputs = inputs
        self.hidden_states = {0: h_prev}
        outputs = []
        for t in range(1, len(inputs) + 1):
            x_t = inputs[t - 1]
            # Core RNN equation
            z_t = self.W_hh @ self.hidden_states[t-1] + self.W_xh @ x_t + self.b_h
            h_t = np.tanh(z_t)
            y_t = self.W_hy @ h_t + self.b_y
            self.hidden_states[t] = h_t
            outputs.append(y_t)
        return outputs, self.hidden_states

    def backward(self, d_outputs, learning_rate=0.001):
        """
        Backpropagation Through Time (BPTT).
        d_outputs: list of gradients dL/dy_t for each time step
        """
        T = len(d_outputs)
        dW_xh = np.zeros_like(self.W_xh)
        dW_hh = np.zeros_like(self.W_hh)
        db_h = np.zeros_like(self.b_h)
        dW_hy = np.zeros_like(self.W_hy)
        db_y = np.zeros_like(self.b_y)
        dh_next = np.zeros((self.hidden_dim, 1))
        for t in reversed(range(1, T + 1)):
            dy = d_outputs[t - 1]
            # Gradient from output layer
            dW_hy += dy @ self.hidden_states[t].T
            db_y += dy
            # Gradient into hidden state
            dh = self.W_hy.T @ dy + dh_next
            # Backprop through tanh: dtanh/dz = 1 - tanh^2
            dz = dh * (1 - self.hidden_states[t] ** 2)
            # Accumulate gradients
            dW_xh += dz @ self.inputs[t - 1].T
            dW_hh += dz @ self.hidden_states[t - 1].T
            db_h += dz
            # Gradient flowing to previous time step
            dh_next = self.W_hh.T @ dz
        # Gradient clipping to prevent explosion
        for grad in [dW_xh, dW_hh, db_h, dW_hy, db_y]:
            np.clip(grad, -5, 5, out=grad)
        # Update weights
        self.W_xh -= learning_rate * dW_xh
        self.W_hh -= learning_rate * dW_hh
        self.b_h -= learning_rate * db_h
        self.W_hy -= learning_rate * dW_hy
        self.b_y -= learning_rate * db_y
# ========== Demo: Character-level language model ==========
# Simple example with tiny vocab
text = "hello world hello"
chars = sorted(list(set(text)))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}
vocab_size = len(chars)
# Create training pairs
inputs_idx = [char_to_idx[c] for c in text[:-1]]
targets_idx = [char_to_idx[c] for c in text[1:]]
rnn = VanillaRNN(input_dim=vocab_size, hidden_dim=16, output_dim=vocab_size)
# Training loop
for epoch in range(200):
    # One-hot encode
    xs = [np.eye(vocab_size)[:, [i]] for i in inputs_idx]
    ys_true = targets_idx
    # Forward pass
    outputs, _ = rnn.forward(xs)
    # Compute softmax + cross-entropy loss
    loss = 0
    d_outputs = []
    for t in range(len(outputs)):
        # Softmax
        exp_y = np.exp(outputs[t] - np.max(outputs[t]))
        probs = exp_y / np.sum(exp_y)
        loss -= np.log(probs[ys_true[t], 0] + 1e-8)
        # Gradient of cross-entropy + softmax
        dy = probs.copy()
        dy[ys_true[t]] -= 1
        d_outputs.append(dy)
    # Backward pass
    rnn.backward(d_outputs, learning_rate=0.01)
    if epoch % 50 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
print("Training complete!")
10.2 LSTM in NumPy
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(s):
    # s is the sigmoid output, not the pre-activation
    return s * (1 - s)

def tanh_derivative(t):
    # t is the tanh output, not the pre-activation
    return 1 - t ** 2

class LSTMCell:
    """
    Single LSTM Cell implementation from scratch.
    Implements all 4 gates: forget, input, candidate, output.
    """
    def __init__(self, input_dim, hidden_dim):
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        concat_dim = input_dim + hidden_dim
        scale = np.sqrt(2.0 / concat_dim)
        # Forget gate parameters
        self.W_f = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_f = np.ones((hidden_dim, 1))  # bias=1 for forget gate (important!)
        # Input gate parameters
        self.W_i = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_i = np.zeros((hidden_dim, 1))
        # Candidate parameters
        self.W_c = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_c = np.zeros((hidden_dim, 1))
        # Output gate parameters
        self.W_o = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_o = np.zeros((hidden_dim, 1))

    def forward(self, x_t, h_prev, c_prev):
        """Single time step forward pass."""
        # Concatenate [h_{t-1}, x_t]
        concat = np.vstack([h_prev, x_t])
        # Forget gate: what to erase from cell state
        f_t = sigmoid(self.W_f @ concat + self.b_f)
        # Input gate: what new info to write
        i_t = sigmoid(self.W_i @ concat + self.b_i)
        # Candidate cell state
        c_tilde = np.tanh(self.W_c @ concat + self.b_c)
        # New cell state
        c_t = f_t * c_prev + i_t * c_tilde
        # Output gate: what to output
        o_t = sigmoid(self.W_o @ concat + self.b_o)
        # New hidden state
        h_t = o_t * np.tanh(c_t)
        # Cache for backward pass
        cache = (concat, f_t, i_t, c_tilde, c_t, o_t, h_prev, c_prev, x_t)
        return h_t, c_t, cache

    def backward(self, dh_t, dc_t, cache):
        """Single time step backward pass."""
        concat, f_t, i_t, c_tilde, c_t, o_t, h_prev, c_prev, x_t = cache
        # Gradient through output gate
        tanh_c_t = np.tanh(c_t)
        do_t = dh_t * tanh_c_t
        # Add the path through h_t (out of place, so the caller's dc_t array is not modified)
        dc_t = dc_t + dh_t * o_t * tanh_derivative(tanh_c_t)
        # Gradient through cell state update
        df_t = dc_t * c_prev
        di_t = dc_t * c_tilde
        dc_tilde = dc_t * i_t
        dc_prev = dc_t * f_t
        # Gradient through activations
        df_raw = df_t * sigmoid_derivative(f_t)
        di_raw = di_t * sigmoid_derivative(i_t)
        dc_raw = dc_tilde * tanh_derivative(c_tilde)
        do_raw = do_t * sigmoid_derivative(o_t)
        # Weight gradients
        dW_f = df_raw @ concat.T
        dW_i = di_raw @ concat.T
        dW_c = dc_raw @ concat.T
        dW_o = do_raw @ concat.T
        db_f = df_raw
        db_i = di_raw
        db_c = dc_raw
        db_o = do_raw
        # Gradient to concat = [h_prev, x_t]
        d_concat = (self.W_f.T @ df_raw + self.W_i.T @ di_raw +
                    self.W_c.T @ dc_raw + self.W_o.T @ do_raw)
        dh_prev = d_concat[:self.hidden_dim]
        dx_t = d_concat[self.hidden_dim:]
        grads = {
            'dW_f': dW_f, 'dW_i': dW_i, 'dW_c': dW_c, 'dW_o': dW_o,
            'db_f': db_f, 'db_i': db_i, 'db_c': db_c, 'db_o': db_o
        }
        return dh_prev, dc_prev, dx_t, grads
class LSTM:
    """Full LSTM for sequence processing."""
    def __init__(self, input_dim, hidden_dim, output_dim):
        self.cell = LSTMCell(input_dim, hidden_dim)
        self.hidden_dim = hidden_dim
        scale = np.sqrt(2.0 / (hidden_dim + output_dim))
        self.W_y = np.random.randn(output_dim, hidden_dim) * scale
        self.b_y = np.zeros((output_dim, 1))

    def forward(self, inputs):
        """Process entire sequence."""
        T = len(inputs)
        h = np.zeros((self.hidden_dim, 1))
        c = np.zeros((self.hidden_dim, 1))
        self.caches = []
        self.h_states = [h]
        outputs = []
        for t in range(T):
            h, c, cache = self.cell.forward(inputs[t], h, c)
            self.caches.append(cache)
            self.h_states.append(h)
            y = self.W_y @ h + self.b_y
            outputs.append(y)
        return outputs

    def predict_sequence(self, seed_input, length, temperature=1.0):
        """Generate a sequence given a seed."""
        h = np.zeros((self.hidden_dim, 1))
        c = np.zeros((self.hidden_dim, 1))
        x = seed_input
        generated = []
        for _ in range(length):
            h, c, _ = self.cell.forward(x, h, c)
            y = self.W_y @ h + self.b_y
            # Temperature-scaled softmax
            y = y / temperature
            exp_y = np.exp(y - np.max(y))
            probs = exp_y / np.sum(exp_y)
            idx = np.random.choice(len(probs.flatten()), p=probs.flatten())
            generated.append(idx)
            # Next input is one-hot of predicted char
            x = np.zeros_like(seed_input)
            x[idx] = 1
        return generated
# ========== Demo: LSTM on simple sequence ==========
print("=== LSTM Forward Pass Demo ===")
lstm_cell = LSTMCell(input_dim=3, hidden_dim=4)
h = np.zeros((4, 1))
c = np.zeros((4, 1))
# Process 3 time steps
for t in range(3):
    x = np.random.randn(3, 1)
    h, c, cache = lstm_cell.forward(x, h, c)
    print(f"t={t+1}: h={h.flatten()[:3].round(4)}... c={c.flatten()[:3].round(4)}...")
print("\nLSTM cell maintains separate h and c states!")
10.3 GRU in NumPy
class GRUCell:
    """GRU Cell: simplified LSTM with update + reset gates."""
    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        concat_dim = input_dim + hidden_dim
        scale = np.sqrt(2.0 / concat_dim)
        # Update gate (merges forget + input)
        self.W_z = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_z = np.zeros((hidden_dim, 1))
        # Reset gate
        self.W_r = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_r = np.zeros((hidden_dim, 1))
        # Candidate hidden state
        self.W_h = np.random.randn(hidden_dim, concat_dim) * scale
        self.b_h = np.zeros((hidden_dim, 1))

    def forward(self, x_t, h_prev):
        # Uses sigmoid() as defined in the LSTM listing (section 10.2)
        concat = np.vstack([h_prev, x_t])
        # Update gate: how much of the candidate to blend in (1 - z_t is kept from the old state)
        z_t = sigmoid(self.W_z @ concat + self.b_z)
        # Reset gate: how much past to use for candidate
        r_t = sigmoid(self.W_r @ concat + self.b_r)
        # Candidate with reset applied
        concat_reset = np.vstack([r_t * h_prev, x_t])
        h_tilde = np.tanh(self.W_h @ concat_reset + self.b_h)
        # Final hidden state: interpolation
        h_t = (1 - z_t) * h_prev + z_t * h_tilde
        return h_t

# Demo
print("\n=== GRU Forward Pass Demo ===")
gru = GRUCell(input_dim=3, hidden_dim=4)
h = np.zeros((4, 1))
for t in range(3):
    x = np.random.randn(3, 1)
    h = gru.forward(x, h)
    print(f"t={t+1}: h={h.flatten().round(4)}")
Modify the LSTM class to implement Peephole connections, where the gates also look at the cell state C_{t-1} directly. Add C_{t-1} to the forget and input gate computations, and C_t to the output gate computation. Compare training convergence with the standard LSTM.
TensorFlow Implementation
11.1 Text Generation with LSTM
import tensorflow as tf
import numpy as np
# ========== Text Generation with LSTM ==========
# Sample text (use a larger corpus in practice)
text = """India is a land of diversity. From the Himalayas in the north
to the beaches of Kerala in the south, every region has its own culture.
The country is home to over a billion people speaking hundreds of languages."""
# Character-level tokenization
chars = sorted(list(set(text)))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}
vocab_size = len(chars)
print(f"Vocabulary size: {vocab_size} unique characters")
# Create training sequences
seq_length = 40
X_data, y_data = [], []
for i in range(len(text) - seq_length):
    X_data.append([char_to_idx[c] for c in text[i:i+seq_length]])
    y_data.append(char_to_idx[text[i+seq_length]])
X = tf.keras.utils.to_categorical(X_data, num_classes=vocab_size)
y = tf.keras.utils.to_categorical(y_data, num_classes=vocab_size)
print(f"Training samples: {len(X_data)}")
# Build LSTM model
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, input_shape=(seq_length, vocab_size),
                         return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.summary()
# Train
history = model.fit(X, y, epochs=50, batch_size=32, verbose=1)
# Text generation function
def generate_text(model, seed_text, length=200, temperature=0.8):
    """Generate text character by character."""
    generated = seed_text
    for _ in range(length):
        # Encode the last seq_length characters
        x_pred = [char_to_idx.get(c, 0) for c in generated[-seq_length:]]
        x_pred = tf.keras.utils.to_categorical([x_pred], num_classes=vocab_size)
        # Predict next character
        probs = model.predict(x_pred, verbose=0)[0]
        # Temperature sampling
        probs = np.log(probs + 1e-8) / temperature
        exp_probs = np.exp(probs)
        probs = exp_probs / np.sum(exp_probs)
        next_idx = np.random.choice(len(probs), p=probs)
        generated += idx_to_char[next_idx]
    return generated
# Generate sample text
seed = text[:seq_length]
print("\n=== Generated Text ===")
print(generate_text(model, seed, length=200))
11.2 Stock Price Prediction with LSTM
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
# ========== Nifty50 Stock Price Prediction ==========
# In production, load from NSE API or CSV
# Here we simulate realistic Nifty50 data
np.random.seed(42)
dates = pd.date_range('2020-01-01', periods=1000, freq='B')  # Business days
# Simulate with trend + seasonality + noise
trend = np.linspace(11000, 22000, 1000)
seasonal = 500 * np.sin(np.linspace(0, 8*np.pi, 1000))
noise = np.random.randn(1000) * 200
nifty_data = trend + seasonal + noise
df = pd.DataFrame({'Date': dates, 'Close': nifty_data})
print(f"Dataset: {len(df)} trading days")
print(f"Price range: {df['Close'].min():.0f} - {df['Close'].max():.0f}")
# Normalize
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(df['Close'].values.reshape(-1, 1))
# Create sequences
def create_sequences(data, lookback=60):
    X, y = [], []
    for i in range(lookback, len(data)):
        X.append(data[i-lookback:i, 0])
        y.append(data[i, 0])
    return np.array(X), np.array(y)
lookback = 60
X, y = create_sequences(scaled_data, lookback)
X = X.reshape(X.shape[0], X.shape[1], 1)  # Add feature dimension
# Train/test split (80/20)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
print(f"Train: {len(X_train)}, Test: {len(X_test)}")
# Build Stacked LSTM model
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True,
                         input_shape=(lookback, 1)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='mse',
    metrics=['mae']
)
model.summary()
# Train with early stopping
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=10, restore_best_weights=True
)
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks=[early_stop],
    verbose=1
)
# Predict
y_pred = model.predict(X_test)
# Inverse transform
y_test_actual = scaler.inverse_transform(y_test.reshape(-1, 1))
y_pred_actual = scaler.inverse_transform(y_pred)
# Metrics
mae = mean_absolute_error(y_test_actual, y_pred_actual)
rmse = np.sqrt(mean_squared_error(y_test_actual, y_pred_actual))
mape = np.mean(np.abs((y_test_actual - y_pred_actual) / y_test_actual)) * 100
print(f"\n=== Results ===")
print(f"MAE: ₹{mae:.2f}")
print(f"RMSE: ₹{rmse:.2f}")
print(f"MAPE: {mape:.2f}%")
11.3 Bidirectional LSTM for Sentiment Analysis
# Bidirectional LSTM for text classification
model_bilstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=128,
                              input_length=200),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    ),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(32)
    ),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model_bilstm.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
model_bilstm.summary()
# Train on IMDB or your own Hindi/English sentiment dataset
11.4 GRU Comparison
# GRU: fewer parameters, often comparable performance
model_gru = tf.keras.Sequential([
    tf.keras.layers.GRU(64, return_sequences=True,
                        input_shape=(lookback, 1)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.GRU(32),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1)
])
model_gru.compile(optimizer='adam', loss='mse', metrics=['mae'])
model_gru.summary()
# Compare parameter counts (model is the stacked LSTM from section 11.2)
print(f"\nLSTM params: {model.count_params():,}")
print(f"GRU params: {model_gru.count_params():,}")
print(f"GRU saves {(1 - model_gru.count_params()/model.count_params())*100:.1f}% parameters")
Scikit-Learn Integration
While scikit-learn doesn't natively support RNNs, we can wrap TensorFlow/Keras models in a scikit-learn compatible interface for use in pipelines, cross-validation, and hyperparameter tuning.
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
import numpy as np

class LSTMRegressor(BaseEstimator, RegressorMixin):
    """Scikit-learn compatible LSTM wrapper for time series."""
    def __init__(self, lookback=60, units=64, epochs=50,
                 batch_size=32, learning_rate=0.001):
        self.lookback = lookback
        self.units = units
        self.epochs = epochs
        self.batch_size = batch_size
        self.learning_rate = learning_rate

    def _build_model(self, input_shape):
        import tensorflow as tf
        model = tf.keras.Sequential([
            tf.keras.layers.LSTM(self.units, input_shape=input_shape),
            tf.keras.layers.Dense(1)
        ])
        model.compile(
            # `learning_rate` replaces the deprecated `lr` argument
            optimizer=tf.keras.optimizers.Adam(learning_rate=self.learning_rate),
            loss='mse'
        )
        return model

    def fit(self, X, y):
        self.model_ = self._build_model((X.shape[1], X.shape[2]))
        self.model_.fit(X, y, epochs=self.epochs,
                        batch_size=self.batch_size, verbose=0)
        return self

    def predict(self, X):
        return self.model_.predict(X, verbose=0).flatten()

    def score(self, X, y):
        y_pred = self.predict(X)
        return -mean_squared_error(y, y_pred)  # Negative MSE for sklearn

# Time Series Cross-Validation (X and y are the sequences built in section 11.2)
tscv = TimeSeriesSplit(n_splits=5)
lstm_reg = LSTMRegressor(lookback=60, units=32, epochs=20)
scores = []
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]
    lstm_reg.fit(X_tr, y_tr)
    score = lstm_reg.score(X_val, y_val)
    scores.append(-score)  # Convert back to positive MSE
    print(f"Fold {fold+1}: MSE = {-score:.6f}")
print(f"\nMean MSE: {np.mean(scores):.6f} ± {np.std(scores):.6f}")
Indian Case Studies
Problem
Indian Railways handles 23+ million passengers daily across 12,000+ trains. Predicting ticket demand is crucial for dynamic pricing (Flexi-fare on Rajdhani/Shatabdi), overbooking management, and resource planning.
Solution Architecture
- Input features: Historical booking data (90 days), day of week, festivals (Diwali/Holi/Eid), season, route popularity, wait-list trends
- Model: Stacked LSTM (3 layers, 128/64/32 units) with attention
- Output: Predicted demand for next 7/15/30 days per route
Results
Reduced overbooking complaints by ~18%. Improved revenue on flexi-fare routes by ₹800+ crore annually. MAPE of 8.3% on high-demand routes.
Key Insight
Festival-aware features were critical: demand spikes 10x during Chhath Puja on Bihar routes. The LSTM learned these recurring seasonal patterns without explicit programming.
Problem
Quantitative trading firms on NSE need short-term (1-5 day) price movement predictions for algorithmic trading strategies.
Approach
- Data: 10+ years of Nifty50 OHLCV data, plus FII/DII flows, India VIX, US market correlation
- Feature engineering: RSI, MACD, Bollinger Bands, moving averages (20/50/200 day), volume profile
- Model: Bidirectional LSTM with 60-day lookback window
- Training: Walk-forward validation (no data leakage)
Results
Directional accuracy: 58-62% (significantly above random 50%). Sharpe ratio: 1.8 vs. buy-and-hold 1.2. Best performance during trending markets; struggled in sideways/choppy markets.
Caution
Stock prediction is inherently uncertain. LSTMs capture patterns but cannot predict black swan events (COVID crash, demonetization). Always combine with risk management.
NPCI processes 10+ billion UPI transactions monthly. LSTM-based sequence models analyze user transaction patterns (timing, amounts, merchants) to flag fraudulent transactions in real-time. The model treats each user's transaction history as a time series and detects deviations from learned behavioral patterns. False positive rate reduced from 3.2% to 1.1%.
ISRO's MOSDAC (Meteorological and Oceanographic Satellite Data Archive) uses LSTM networks to predict cyclone trajectories in the Indian Ocean. By processing sequential satellite imagery features (cloud patterns, sea surface temperatures, wind shear), LSTMs predict cyclone paths 48-72 hours ahead with 15-20% improvement over statistical models.
Flipkart's demand prediction engine uses LSTM-based models to forecast product demand across 27,000+ pin codes. The model handles festival-driven demand spikes (Big Billion Days), regional variations, and new product cold-start, helping optimize warehouse inventory and reduce delivery times from days to hours.
Global Case Studies
The Problem
Before 2016, Google Translate used phrase-based statistical machine translation (SMT), which was clunky, inaccurate, and poor at capturing context.
The LSTM Solution (2016-2017)
Google's Neural Machine Translation (GNMT) system used an 8-layer encoder + 8-layer decoder LSTM architecture with attention. Key innovations:
- Residual connections between LSTM layers to enable training 8 layers deep
- Attention mechanism connecting decoder to all encoder states
- Wordpiece tokenization for handling rare words
- Quantization for serving at scale (100B+ translations/day)
Impact
In side-by-side human evaluation, GNMT reduced translation errors by about 60% on average relative to the phrase-based system, closing much of the gap between SMT and human translation. This was the state of the art until Transformers (2017).
Voice Recognition Pipeline
Both Siri and Alexa used deep bidirectional LSTMs as core components of their Automatic Speech Recognition (ASR) systems:
- Acoustic model: BiLSTM processing mel-spectrogram features frame-by-frame
- Language model: LSTM predicting next word probabilities
- End-to-end: Listen-Attend-Spell (LAS) architecture using encoder LSTM + decoder LSTM with attention
Alexa processes 100M+ voice requests daily. The LSTM-based system reduced word error rate (WER) from 8.5% to 5.1% between 2015 and 2018.
Spotify uses LSTM-based session models to predict the next song a user will enjoy based on their listening sequence. The model processes the sequence of recently played tracks (encoded as embeddings) and predicts engagement probability for candidate songs. This powers the "autoplay" feature and contributes to 30%+ of total streams.
DeepMind used LSTM networks to predict Acute Kidney Injury (AKI) up to 48 hours before it happens by analyzing sequential electronic health records (lab results, vital signs, medications). Published in Nature (2019), the system correctly predicted 55.8% of AKI events, at a cost of two false alerts for every true alert, potentially saving thousands of lives.
Startup Applications
Indian startup Yellow.ai uses LSTM-based intent classification and entity extraction for building multilingual chatbots. Their platform serves 1000+ enterprises across 135+ languages, using BiLSTMs to understand customer queries in Hindi, Tamil, Bengali, and other Indian languages with 90%+ accuracy.
Startups like QuantConnect and Alpaca provide LSTM-based trading signal generators. Features include multi-timeframe OHLCV data, order book imbalance sequences, and news sentiment sequences. GRU models are preferred for high-frequency trading due to faster inference (~25% fewer parameters than an LSTM of the same size, since a GRU has three weight blocks to the LSTM's four).
Niramai (Bangalore) combines LSTM sequence models with thermal imaging for breast cancer screening. The temporal analysis of thermal patterns across sequential scans helps detect anomalies earlier than single-snapshot analysis. FDA and CE certified.
AIVA (Luxembourg) uses deep LSTM networks trained on 30,000+ classical music scores to compose original symphonies. Their model processes note sequences (pitch, duration, velocity) and generates coherent musical compositions used in films, ads, and games.
Government Applications
The Central Water Commission uses LSTM models fed with sequential river gauge data (water levels, rainfall, upstream discharge) to predict flood levels 24-72 hours ahead for major rivers like Ganga, Brahmaputra, and Godavari. The LSTM outperforms traditional hydrological models by 25% in RMSE during extreme events.
India's Computer Emergency Response Team uses LSTM-based intrusion detection systems that process network traffic sequences to identify anomalous patterns. The model learns normal traffic flow patterns and flags deviations, detecting DDoS attacks, data exfiltration, and lateral movement within government networks.
The Department of Telecommunications uses GRU models for radio spectrum usage prediction, helping optimize frequency allocation across telecom operators. The model predicts spectrum demand patterns 30 days ahead with 92% accuracy.
ICMR used LSTM models during COVID-19 to predict case trajectories for Indian states, incorporating mobility data, vaccination rates, and past wave patterns as sequential features. These predictions informed lockdown decisions and resource allocation.
Industry Applications
| Industry | Application | RNN Variant | Key Feature |
|---|---|---|---|
| Finance | Fraud detection in transaction sequences | LSTM | Behavioral anomaly detection |
| Healthcare | ICU patient deterioration prediction | BiLSTM | Vital signs time series |
| Manufacturing | Predictive maintenance (vibration data) | GRU | Sensor sequence anomalies |
| Energy | Solar/wind power output forecasting | LSTM | Weather sequence data |
| Telecom | Network traffic prediction | Stacked LSTM | Load balancing optimization |
| Agriculture | Crop yield prediction from weather sequences | LSTM | Multi-season patterns |
| Retail | Customer purchase sequence modeling | GRU | Next-purchase prediction |
| Automotive | Driver behavior prediction | BiLSTM | Sensor fusion sequences |
| Gaming | Player churn prediction | GRU | Session activity patterns |
| Legal | Contract clause sequence analysis | BiLSTM | Document understanding |
RNN/LSTM expertise opens doors to: NLP Engineer (₹12-30 LPA), Quantitative Analyst (₹20-50 LPA), Time Series Specialist (₹15-35 LPA), Speech Recognition Engineer (₹18-40 LPA at Google/Amazon), Autonomous Driving Engineer (sensor sequence processing). Strong LSTM skills + domain expertise (finance/healthcare) is particularly valuable.
Mini Projects
Objective
Build a character-level LSTM that generates Hindi text trained on Hindi Wikipedia or news articles.
import tensorflow as tf
import numpy as np
# ========== Hindi Text Generator ==========
# Sample Hindi text (use larger corpus in production)
hindi_text = """भारत एक विशाल देश है। यहाँ अनेक भाषाएँ बोली जाती हैं।
हिंदी भारत की राजभाषा है। भारत की संस्कृति बहुत प्राचीन है।
यहाँ के लोग मेहनती और दयालु हैं। भारत में अनेक त्योहार मनाए जाते हैं।
दीपावली, होली, ईद, क्रिसमस सभी धर्मों के त्योहार मनाए जाते हैं।
भारत की अर्थव्यवस्था तेजी से बढ़ रही है। प्रौद्योगिकी क्षेत्र में भारत अग्रणी है।"""

# Character-level tokenization for Hindi
chars = sorted(list(set(hindi_text)))
char_to_idx = {c: i for i, c in enumerate(chars)}
idx_to_char = {i: c for i, c in enumerate(chars)}
vocab_size = len(chars)
print(f"Hindi vocab size: {vocab_size} characters")
print(f"Sample chars: {chars[:20]}")

# Prepare training data: each sample is seq_length characters, the label is the next one
seq_length = 30  # Shorter for Hindi due to character density
X_data, y_data = [], []
for i in range(len(hindi_text) - seq_length):
    seq_in = hindi_text[i:i + seq_length]
    seq_out = hindi_text[i + seq_length]
    X_data.append([char_to_idx[c] for c in seq_in])
    y_data.append(char_to_idx[seq_out])
X = np.array(X_data)
y = tf.keras.utils.to_categorical(y_data, num_classes=vocab_size)

# Reshape for LSTM: (samples, timesteps, features)
X = X.reshape(X.shape[0], X.shape[1], 1) / float(vocab_size)

# Build model
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(256, input_shape=(seq_length, 1),
                         return_sequences=True),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])

# Train
model.fit(X, y, epochs=100, batch_size=64, verbose=1)

# Generate Hindi text
def generate_hindi(model, seed_text, length=200, temperature=0.7):
    generated = seed_text
    pattern = [char_to_idx[c] for c in seed_text[-seq_length:]]
    for _ in range(length):
        x = np.array(pattern).reshape(1, seq_length, 1) / float(vocab_size)
        probs = model.predict(x, verbose=0)[0]
        # Temperature sampling. Cast to float64 so the renormalized
        # probabilities sum to 1 within np.random.choice's tolerance.
        logits = np.log(probs.astype('float64') + 1e-8) / temperature
        exp_probs = np.exp(logits)
        probs = exp_probs / np.sum(exp_probs)
        next_idx = np.random.choice(vocab_size, p=probs)
        generated += idx_to_char[next_idx]
        pattern = pattern[1:] + [next_idx]
    return generated

# Generate
seed = hindi_text[:seq_length]
print("\n=== Generated Hindi Text ===")
print(generate_hindi(model, seed, length=300))
Evaluation Criteria
- Does the generated text form valid Hindi words? (character coherence)
- Are Devanagari matras (vowel signs) placed correctly?
- Does the text maintain grammatical structure?
- Experiment with temperatures: 0.3 (conservative), 0.7 (balanced), 1.2 (creative)
Objective
Build an end-to-end stock prediction system for NSE stocks with walk-forward validation and a simple prediction dashboard.
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler


class StockPredictor:
    """End-to-end LSTM stock prediction system."""

    def __init__(self, lookback=60, units=64, epochs=50):
        self.lookback = lookback
        self.units = units
        self.epochs = epochs
        self.scaler = MinMaxScaler()
        self.model = None

    def prepare_data(self, prices, n_fit=None):
        """Scale and create sequences.

        The scaler is fit on the first n_fit prices only (all prices if None),
        so a validation fold's price range does not leak into training.
        """
        prices = np.asarray(prices, dtype=float).reshape(-1, 1)
        self.scaler.fit(prices[:n_fit])
        scaled = self.scaler.transform(prices)
        X, y = [], []
        for i in range(self.lookback, len(scaled)):
            X.append(scaled[i-self.lookback:i, 0])
            y.append(scaled[i, 0])
        X = np.array(X).reshape(-1, self.lookback, 1)
        y = np.array(y)
        return X, y

    def build_model(self):
        """Build stacked LSTM."""
        self.model = tf.keras.Sequential([
            tf.keras.layers.LSTM(self.units, return_sequences=True,
                                 input_shape=(self.lookback, 1)),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.LSTM(self.units // 2),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(1)
        ])
        self.model.compile(optimizer='adam', loss='mse')

    def walk_forward_validate(self, prices, n_splits=5):
        """Walk-forward validation: expanding training window, test on the next block."""
        n_sequences = len(prices) - self.lookback
        fold_size = n_sequences // (n_splits + 1)
        results = []
        for fold in range(n_splits):
            # Train on the first (fold + 1) blocks, test on the block that follows,
            # so every fold (including the last) has a full test block.
            train_end = fold_size * (fold + 1)
            test_end = train_end + fold_size
            # Sequence j targets prices[j + lookback], so the training span
            # covers the first train_end + lookback prices.
            X, y = self.prepare_data(prices, n_fit=train_end + self.lookback)
            X_train, y_train = X[:train_end], y[:train_end]
            X_test, y_test = X[train_end:test_end], y[train_end:test_end]
            self.build_model()
            self.model.fit(X_train, y_train, epochs=self.epochs,
                           batch_size=32, verbose=0)
            y_pred = self.model.predict(X_test, verbose=0).flatten()
            # Directional accuracy
            actual_dir = np.sign(np.diff(y_test))
            pred_dir = np.sign(np.diff(y_pred))
            dir_acc = np.mean(actual_dir == pred_dir)
            mse = np.mean((y_test - y_pred) ** 2)
            results.append({'fold': fold+1, 'mse': mse,
                            'dir_accuracy': dir_acc})
            print(f"Fold {fold+1}: MSE={mse:.6f}, "
                  f"Direction Accuracy={dir_acc:.2%}")
        return results

    def predict_next(self, prices, n_days=5):
        """Predict next n days."""
        X, y = self.prepare_data(prices)
        self.build_model()
        self.model.fit(X, y, epochs=self.epochs, batch_size=32, verbose=0)
        # Recursive prediction, starting from the most recent `lookback` prices
        # (X[-1] would stop one day short of the latest price)
        scaled = self.scaler.transform(
            np.asarray(prices, dtype=float).reshape(-1, 1))
        last_seq = scaled[-self.lookback:].reshape(1, self.lookback, 1)
        predictions = []
        for _ in range(n_days):
            pred = self.model.predict(last_seq, verbose=0)[0, 0]
            predictions.append(pred)
            # Shift window
            last_seq = np.roll(last_seq, -1, axis=1)
            last_seq[0, -1, 0] = pred
        # Inverse scale
        pred_prices = self.scaler.inverse_transform(
            np.array(predictions).reshape(-1, 1)
        ).flatten()
        return pred_prices


# ========== Usage ==========
# Simulate Nifty50 data
np.random.seed(42)
prices = np.cumsum(np.random.randn(500)) + 18000
prices = np.abs(prices)  # Ensure positive
predictor = StockPredictor(lookback=30, units=32, epochs=30)

# Walk-forward validation
print("=== Walk-Forward Validation ===")
results = predictor.walk_forward_validate(prices, n_splits=3)

# Predict next 5 days
print("\n=== 5-Day Forecast ===")
next_prices = predictor.predict_next(prices, n_days=5)
for i, p in enumerate(next_prices):
    print(f"Day {i+1}: ₹{p:,.2f}")
Objective
Build a seq2seq model to transliterate English names to Hindi (Devanagari script). E.g., "Rahul" → "राहुल".
import tensorflow as tf
import numpy as np
# Sample transliteration pairs
pairs = [
    ("rahul", "राहुल"), ("priya", "प्रिया"), ("amit", "अमित"),
    ("neha", "नेहा"), ("vijay", "विजय"), ("sunita", "सुनीता"),
    ("deepak", "दीपक"), ("anita", "अनिता"), ("suresh", "सुरेश"),
    ("kavita", "कविता"), ("rajesh", "राजेश"), ("pooja", "पूजा"),
]

# Build character vocabularies.
# NOTE: the original special-token strings were lost in extraction;
# '<start>', '<end>' and '<pad>' are assumed names.
START, END, PAD = '<start>', '<end>', '<pad>'
eng_chars = sorted(set(''.join([p[0] for p in pairs]))) + [START, END, PAD]
hin_chars = sorted(set(''.join([p[1] for p in pairs]))) + [START, END, PAD]
eng_to_idx = {c: i for i, c in enumerate(eng_chars)}
hin_to_idx = {c: i for i, c in enumerate(hin_chars)}
idx_to_hin = {i: c for c, i in hin_to_idx.items()}

# Encode sequences as one-hot tensors (padding positions stay all-zero)
max_eng = max(len(p[0]) for p in pairs) + 2
max_hin = max(len(p[1]) for p in pairs) + 2
encoder_input = np.zeros((len(pairs), max_eng, len(eng_chars)))
decoder_input = np.zeros((len(pairs), max_hin, len(hin_chars)))
decoder_target = np.zeros((len(pairs), max_hin, len(hin_chars)))
for i, (eng, hin) in enumerate(pairs):
    for t, c in enumerate(eng):
        encoder_input[i, t, eng_to_idx[c]] = 1
    # Use a token list (not a string) so multi-character tokens stay intact
    hin_seq = [START] + list(hin) + [END]
    for t, ch in enumerate(hin_seq):
        decoder_input[i, t, hin_to_idx[ch]] = 1
        if t > 0:
            # The target is the decoder input shifted one step ahead
            decoder_target[i, t - 1, hin_to_idx[ch]] = 1

# Encoder
encoder_inputs = tf.keras.Input(shape=(max_eng, len(eng_chars)))
encoder_lstm = tf.keras.layers.LSTM(64, return_state=True)
_, state_h, state_c = encoder_lstm(encoder_inputs)
# Decoder
decoder_inputs = tf.keras.Input(shape=(max_hin, len(hin_chars)))
decoder_lstm = tf.keras.layers.LSTM(64, return_sequences=True, return_state=True)
decoder_out, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_dense = tf.keras.layers.Dense(len(hin_chars), activation='softmax')
decoder_outputs = decoder_dense(decoder_out)
# Model
model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit([encoder_input, decoder_input], decoder_target,
          epochs=200, batch_size=4, verbose=1)
print("Seq2Seq transliterator trained!")
print("This demonstrates encoder-decoder architecture for")
print("mapping English character sequences to Hindi characters.")
End-of-Chapter Exercises (25 Questions)
Multiple Choice Questions (12 MCQs)
Interview Questions (12 Questions)
Expected Answer: During BPTT, gradients are multiplied by the Jacobian ∂h_i/∂h_{i-1} = diag(tanh'(z_i)) · W_hh at each step. Since |tanh'| ≤ 1, the product of these Jacobians shrinks exponentially over long sequences unless W_hh is large enough to compensate (in which case gradients can explode instead). LSTM introduces a cell state C_t that is updated additively (C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t). When f_t ≈ 1, gradients flow through C without decay: the "constant error carousel."
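The decay described in this answer can be checked numerically. The sketch below assumes a hypothetical one-unit RNN with zero input, so the Jacobian product reduces to a product of scalars. The weight 0.9 and the forget-gate value 0.99 are illustrative choices, and the cell-state path ignores the indirect gradient routes through the gates.

```python
import math

def rnn_gradient_through_time(w_hh, h0, steps):
    """Return |dh_T/dh_0| for a scalar RNN h_t = tanh(w_hh * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w_hh * h)
        # Chain rule: dh_t/dh_{t-1} = tanh'(z_t) * w_hh = (1 - h_t^2) * w_hh
        grad *= (1.0 - h * h) * w_hh
    return abs(grad)

def cell_state_gradient(forget_gate, steps):
    """Return dC_T/dC_0 along the additive cell-state path, assuming a
    constant forget gate (C_t = f * C_{t-1} + ...)."""
    return forget_gate ** steps

for T in (10, 50, 100):
    print(T, rnn_gradient_through_time(0.9, 0.5, T), cell_state_gradient(0.99, T))
```

The vanilla RNN gradient is bounded by 0.9^T and collapses quickly, while the cell-state path with a forget gate near 1 retains most of the signal over the same horizon.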
Expected Answer: GRU when: (a) training data is limited (fewer params = less overfitting), (b) inference speed matters (25% fewer ops), (c) sequences are moderate length. LSTM when: (a) very long sequences need precise memory control, (b) computational budget allows it, (c) task requires independent control of forgetting and input. Empirically, performance is often comparable, so try both and validate.
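The 25% figure follows from counting weight blocks. A minimal sketch, assuming the textbook formulation in which each gate or candidate has an input matrix, a recurrent matrix, and one bias vector (the function names and the sizes 64 and 128 are illustrative):

```python
def lstm_params(input_dim, units):
    """Four weight blocks (forget, input, candidate, output), each with W, U and b."""
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    """Three weight blocks (update, reset, candidate), each with W, U and b."""
    return 3 * (units * (input_dim + units) + units)

d, h = 64, 128
lstm, gru = lstm_params(d, h), gru_params(d, h)
print(f"LSTM: {lstm:,}  GRU: {gru:,}  reduction: {1 - gru / lstm:.0%}")
```

Framework counts can differ slightly. For example, the default Keras GRU keeps a second bias vector per block, so its total is a little higher than this formula gives.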
Expected Answer: Teacher forcing feeds ground-truth tokens as decoder input during training (instead of model predictions). Problem: "exposure bias". During inference, the model uses its own (possibly wrong) predictions, but it never saw such errors during training. Solutions: scheduled sampling (gradually shifting from teacher forcing to model predictions), or reinforcement learning-based training.
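Scheduled sampling can be sketched as a coin flip whose bias decays during training. The example below uses the inverse-sigmoid decay schedule. The function names, the constant k=1000, and the token arguments are illustrative assumptions, not part of the chapter's code.

```python
import math
import random

def teacher_forcing_prob(step, k=1000.0):
    """Inverse-sigmoid decay: close to 1 early in training, approaching 0 later."""
    # Clamp the exponent so very large step counts cannot overflow math.exp
    return k / (k + math.exp(min(step / k, 700.0)))

def next_decoder_input(ground_truth_token, predicted_token, step, rng=random):
    """Feed the ground-truth token with probability p, else the model's own prediction."""
    p = teacher_forcing_prob(step)
    return ground_truth_token if rng.random() < p else predicted_token

for step in (0, 5000, 10000):
    print(step, round(teacher_forcing_prob(step), 3))
```

Early in training the decoder almost always sees the ground truth. As `step` grows it increasingly sees its own outputs, which exposes it to the kind of errors it will meet at inference time.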
Expected Answer: (1) Padding: pad shorter sequences with zeros to max length in batch, use masking to ignore padded positions. (2) Bucketing: group sequences of similar length into the same batch to minimize padding. (3) Pack sequences (PyTorch pack_padded_sequence): skip computation on padded timesteps. (4) Dynamic batching: adjust batch size based on sequence length.
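Padding with a mask and bucketing by length can be illustrated in a few lines of plain Python. This is a sketch with hypothetical helper names; real pipelines would normally rely on the framework's own utilities (for example a masking layer or packed sequences).

```python
def pad_and_mask(sequences, pad_value=0):
    """Right-pad each sequence to the batch maximum and return a parallel 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask

def bucket_by_length(sequences, batch_size):
    """Sort by length so each batch holds similar-length sequences (less padding)."""
    ordered = sorted(sequences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

batch = [[5, 2, 9], [7], [4, 4, 1, 8, 3]]
padded, mask = pad_and_mask(batch)
print(padded)
print(mask)
print(bucket_by_length(batch, batch_size=2))
```

The mask is what lets the loss and the recurrent updates skip padded positions, and bucketing keeps the amount of padding per batch small.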
Expected Answer: Many-to-one: sentiment analysis (sequence → single label). Many-to-many (same length): POS tagging, NER (label per token). Many-to-many (different length): machine translation (seq2seq). One-to-many: image captioning (single image → sequence of words). One-to-one: essentially a feedforward network (not useful as RNN).
Expected Answer: With b_f=1, the forget gate starts at σ(1) ≈ 0.73, so the LSTM initially retains most of its cell state at each step. This prevents premature information loss before the model has learned what to forget. With b_f=0, the forget gate starts at σ(0)=0.5, discarding 50% of the cell state at every step, which is harmful for long-range dependencies. This was recommended by Gers et al. (2000) and Jozefowicz et al. (2015).
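The effect of the initial forget bias compounds over time. The sketch below assumes the gate's weighted inputs are roughly zero at initialization, so the gate value is simply the sigmoid of the bias. The helper names and the 10-step horizon are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def retained_fraction(forget_bias, steps):
    """Fraction of the initial cell state left after `steps` updates,
    assuming the forget gate stays at its initial value sigmoid(forget_bias)."""
    return sigmoid(forget_bias) ** steps

for b_f in (0.0, 1.0, 3.0):
    print(f"b_f={b_f}: f={sigmoid(b_f):.3f}, "
          f"after 10 steps={retained_fraction(b_f, 10):.4f}")
```

Even the modest shift from a bias of 0 to a bias of 1 leaves tens of times more of the original cell state after ten steps, which is why the initialization matters for long-range dependencies.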
Expected Answer: (1) Dropout between LSTM layers (not within recurrence). (2) Recurrent dropout: same dropout mask across time steps (Gal & Ghahramani, 2016). (3) L2 regularization on weights. (4) Early stopping with validation loss. (5) Reduce model complexity (fewer units/layers). (6) Data augmentation for sequences (noise injection, time warping).
Expected Answer: Red flags: (1) Data leakage โ using future information in features. (2) Wrong split โ random instead of temporal. (3) Accuracy metric is meaningless for regression โ should use MAE, RMSE, MAPE. (4) Directional accuracy might be a better metric. (5) Need walk-forward validation, not single train/test split. (6) Overfitting to training period. (7) Transaction costs not considered. 95% in stock prediction is almost certainly a bug.
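The metrics and the temporal split mentioned in this answer are short to write down. The sketch below uses hypothetical helper names and assumes strictly positive targets (MAPE divides by the true value). Directional accuracy is defined here as whether the predicted move from the last actual value has the right sign, which is one of several reasonable definitions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, MAPE in percent) for two equal-length sequences."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return mae, rmse, mape

def sign(x):
    return (x > 0) - (x < 0)

def directional_accuracy(y_true, y_pred):
    """Share of steps where the predicted move from the last actual value
    has the same sign as the actual move."""
    hits = [
        sign(y_true[t] - y_true[t - 1]) == sign(y_pred[t] - y_true[t - 1])
        for t in range(1, len(y_true))
    ]
    return sum(hits) / len(hits)

def temporal_split(series, train_fraction=0.8):
    """Split in time order: the validation block always follows the training block."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

y_true = [100.0, 102.0, 101.0, 105.0]
y_pred = [101.0, 101.0, 102.0, 104.0]
print(regression_metrics(y_true, y_pred))
print(directional_accuracy(y_true, y_pred))
```

A random shuffle before splitting would put future observations in the training set, which is the leakage the answer warns about.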
Expected Answer: Transformers: better for long sequences (parallel processing), state-of-the-art for NLP, need more data and compute. LSTM still preferred for: (1) small datasets, (2) online/streaming applications (process one step at a time), (3) edge devices (fewer parameters), (4) time series with strong autoregressive patterns, (5) tasks where sequential inductive bias helps. Transformers are O(n²) in sequence length; LSTM is O(n).
Expected Answer: Instead of compressing the entire input into a single context vector, attention allows the decoder to "look back" at all encoder hidden states at each generation step. It computes alignment scores between decoder state s_t and each encoder state h_i, converts them to weights via softmax, and creates a weighted sum (context vector). This solves the information bottleneck: the decoder can access any part of the input directly.
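The score, softmax, and weighted-sum steps can be sketched without any framework. For brevity this example uses dot-product scoring rather than the additive scoring of Bahdanau et al.; the function names and the toy vectors are illustrative.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Dot-product attention: score each encoder state against the decoder state,
    normalise with softmax, and return (weights, weighted sum of encoder states)."""
    scores = [sum(s * h for s, h in zip(decoder_state, h_i)) for h_i in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h_i[d] for w, h_i in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attention_context(decoder_state, encoder_states)
print(weights)
print(context)
```

The decoder state here aligns with the first and third encoder states, so they receive equal and larger weights than the second, and the context vector is pulled toward them.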
Expected Answer: Gradient clipping rescales the gradient if its norm exceeds a threshold: g ← g × (threshold/‖g‖). Essential because RNNs suffer from exploding gradients (a spectral radius of W_hh above 1 can cause exponential gradient growth). Without clipping, a single step with exploding gradients can ruin all learned weights. Typical threshold: 1.0-5.0. Two variants: norm clipping (scale entire gradient vector) and value clipping (clip each element independently).
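The two clipping variants behave differently on the same gradient, as the following sketch shows (plain Python lists stand in for gradient tensors; the helper names are illustrative):

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale the whole vector if its L2 norm exceeds the threshold (direction preserved)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]

def clip_by_value(grad, limit):
    """Clamp each element to [-limit, limit] independently (direction may change)."""
    return [max(-limit, min(limit, g)) for g in grad]

grad = [3.0, 4.0]  # L2 norm = 5
print(clip_by_norm(grad, 1.0))
print(clip_by_value(grad, 1.0))
```

Norm clipping keeps the 3:4 ratio between components, while value clipping flattens both to the limit. In Keras, the optimizer arguments `clipnorm` and `clipvalue` select between these two behaviours.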
Expected Answer: Architecture: (1) Feature extraction: encode each transaction as a vector (amount, merchant category, time delta, location, device). (2) User-level LSTM: maintain per-user hidden state updated with each transaction. (3) Anomaly scoring: LSTM output → dense → sigmoid for fraud probability. (4) Online learning: update model with confirmed labels. Key challenges: class imbalance (99.9% legitimate), latency requirements (<100ms), cold start for new users. Use GRU for faster inference. Deployment: model serving with TF Serving or ONNX Runtime.
Research Problems
Question: For languages with limited training data (Konkani, Dogri, Bodo: scheduled languages with < 1M text corpus), do LSTM-based models outperform Transformers for tasks like NER, POS tagging, and text classification?
Hypothesis: LSTMs' stronger inductive bias (sequential processing) may compensate for data scarcity where Transformers' flexibility leads to overfitting.
Methodology: Compare BiLSTM-CRF vs. small Transformer models across 5+ Indian languages at various data sizes (1K, 10K, 100K, 1M sentences). Use cross-lingual transfer from Hindi as baseline.
Expected Contribution: Guidelines for choosing architectures based on data availability in multilingual Indian NLP applications.
Question: How can LSTM models adapt to distributional shift in financial time series (e.g., regime changes in Nifty50 due to policy changes, pandemics) without catastrophic forgetting?
Approach: Investigate elastic weight consolidation (EWC), progressive neural networks, and online LSTM updating strategies. Test on Indian market data across regime changes: demonetization (Nov 2016), GST implementation (Jul 2017), COVID crash (Mar 2020), and rate hike cycles.
Question: Can we design pruned/quantized LSTM models that run on Indian IoT devices (Raspberry Pi, ESP32) for real-time agricultural sensor prediction while maintaining > 95% of full-precision performance?
Techniques to Explore: Knowledge distillation from large LSTM to small GRU, structured pruning of LSTM gates, INT8 quantization, and architecture search for optimal hidden dimension on constrained hardware. Target: < 1MB model size, < 10ms inference latency.
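Of the techniques listed, INT8 quantization is the simplest to illustrate. The sketch below shows symmetric per-tensor quantization of a weight list, under the assumption that a single scale factor per tensor is acceptable; production toolchains typically add calibration and per-channel scales. The helper names and sample weights are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale)
print([round(abs(w - r), 5) for w, r in zip(weights, restored)])
```

Each weight is stored in one byte instead of four, and the rounding error per weight is at most half the scale factor, which is the trade-off the accuracy target has to absorb.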
Question: Can we combine the sequential inductive bias of LSTMs with the parallel attention of Transformers to get the best of both worlds for sequence modeling?
Ideas: (1) LSTM encoder + Transformer decoder, (2) Transformer with LSTM positional encoding replacing sinusoidal, (3) Gated Transformer blocks using LSTM-style forget/update mechanisms. Benchmark on time series (ETTh, Weather), machine translation (FLORES for Indian languages), and speech recognition (CommonVoice Hindi).
Key Takeaways
References & Further Reading
- Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735-1780. The original LSTM paper.
- Gers, F.A., Schmidhuber, J., & Cummins, F. (2000). "Learning to Forget: Continual Prediction with LSTM." Neural Computation, 12(10), 2451-2471. Introduces the forget gate.
- Cho, K., et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder." EMNLP. Introduces GRU.
- Sutskever, I., Vinyals, O., & Le, Q.V. (2014). "Sequence to Sequence Learning with Neural Networks." NeurIPS. Foundational seq2seq paper.
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR. Introduces attention mechanism.
- Graves, A. (2013). "Generating Sequences With Recurrent Neural Networks." arXiv:1308.0850. Text and handwriting generation.
- Pascanu, R., Mikolov, T., & Bengio, Y. (2013). "On the Difficulty of Training Recurrent Neural Networks." ICML. Vanishing/exploding gradient analysis.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 10: Sequence Modeling. MIT Press.
- Jurafsky, D. & Martin, J.H. (2023). Speech and Language Processing, 3rd ed. Chapters on RNNs and seq2seq.
- Chollet, F. (2021). Deep Learning with Python, 2nd ed. Chapter 10: Timeseries. Manning.
- Géron, A. (2022). Hands-On Machine Learning, 3rd ed. Chapter 15: Processing Sequences. O'Reilly.
- Olah, C. (2015). "Understanding LSTM Networks." colah.github.io. Best visual explanation of LSTMs.
- Karpathy, A. (2015). "The Unreasonable Effectiveness of Recurrent Neural Networks." karpathy.github.io.
- TensorFlow RNN Tutorial: tensorflow.org/guide/keras/rnn
- PyTorch Seq2Seq Tutorial: pytorch.org/tutorials
- CS231n Lecture 10: Recurrent Neural Networks. Stanford (YouTube).
- NPCI Annual Reports (2020-2024). UPI transaction statistics and fraud prevention.
- NSE India Historical Data: nseindia.com. Nifty50 OHLCV data for stock prediction projects.
- IRCTC Open Data. Passenger traffic and booking patterns.
- ISRO MOSDAC: mosdac.gov.in. Meteorological data for weather prediction.
- IIT Bombay Hindi-English Parallel Corpus โ For seq2seq translation projects.
You've mastered sequence modeling from vanilla RNNs to LSTMs. Next up: Chapter 20 explores Generative Adversarial Networks (GANs), which teach networks to create.