Neural Networks & Deep Learning

Chapter 20: Time Series Forecasting with Deep Learning

Predicting the Future, One Timestep at a Time

⏱️ Reading Time: ~3 hours | 📖 Part V: Sequence & Time | 🧠 Theory + Code Chapter

📋 Prerequisites: Chapters 14–15 (RNNs, LSTMs), Chapter 18 (Transformers), Basic Statistics

Bloom's Taxonomy Map for This Chapter

Bloom's Level	What You'll Achieve
🔵 Remember	Recall time-series characteristics (trend, seasonality, stationarity), ARIMA components, LSTM gate equations, and evaluation metric formulas (MAE, RMSE, MAPE)
🔵 Understand	Explain why LSTMs handle long-range dependencies in sequential data, how TCN dilations expand receptive fields, and why walk-forward validation prevents data leakage
🟢 Apply	Implement an LSTM-based Nifty 50 predictor from scratch in NumPy, and build production-grade forecasters in TensorFlow/Keras
🟡 Analyze	Compare recursive vs. direct vs. MIMO multi-step forecasting, diagnose look-ahead bias, and decompose forecast errors into trend/seasonal/noise components
🟠 Evaluate	Assess whether a stock price predictor is genuinely predictive or merely curve-fitting; judge model suitability for different forecasting horizons
🔴 Create	Design an end-to-end electricity consumption forecasting pipeline with Indian holiday features, walk-forward validation, and ensemble strategies

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

Define a time series and its four key components: trend, seasonality, cyclicality, and residual noise
Explain stationarity using the Augmented Dickey-Fuller (ADF) test and apply differencing to make a series stationary
Summarize the limitations of classical ARIMA/SARIMA models that motivate deep learning approaches
Implement an LSTM network from scratch in NumPy for 30-day Nifty 50 price prediction using VIX India and FII/DII flow features
Compare four deep architectures for time series: LSTM, Temporal Convolutional Networks (TCN), N-BEATS, and Informer (Transformer-based)
Engineer time-series features including lag features, rolling statistics, and Indian holiday calendars
Differentiate recursive, direct, and MIMO strategies for multi-step-ahead forecasting
Apply walk-forward validation to prevent data leakage in temporal splits
Calculate and interpret MAE, RMSE, MAPE, and Directional Accuracy metrics
Critically assess why stock market prediction remains fundamentally difficult (Efficient Market Hypothesis) and the gap between backtest and live performance

Section 2

Opening Hook

📈 Three Predictions That Run India

6:00 AM — IMD Headquarters, New Delhi: The India Meteorological Department's deep learning model processes 40 years of monsoon rainfall data across 306 stations. Its seasonal forecast will determine ₹18 lakh crore worth of Kharif crop planning. A 5% error in predicted millimetres means the difference between a bumper harvest and drought relief packages.

9:15 AM — NSE Trading Floor, BKC Mumbai: The opening bell rings. The Nifty 50 index — a weighted basket of India's 50 largest companies — will move through 22,000+ ticks today. Algorithmic trading desks at Zerodha, Groww, and Angel One process ₹2.5 lakh crore in daily turnover. Their LSTM models don't predict price — they predict volatility regimes and order flow direction.

11:30 PM — Zepto Dark Store, Koramangala, Bengaluru: Tomorrow's 10-minute delivery promise depends on tonight's demand forecast. How many packets of Amul Taaza, how many kg of tomatoes? An LSTM model trained on 18 months of order history, payday cycles, IPL match schedules, and Bengaluru weather predicts SKU-level demand for each of 700+ dark stores.

Every one of these problems is a time series. This chapter teaches you to build the deep learning models behind them.


Section 3

Core Concepts

20.1 What Is a Time Series?

A time series is a sequence of data points indexed (or listed or graphed) in time order. Unlike i.i.d. datasets used in standard ML, time series data has temporal dependencies — the value at time t depends on values at t−1, t−2, …, t−k.

Four Components of a Time Series

1. Trend (T)

Long-term increase or decrease. Example: India's GDP has an upward trend from ₹80 lakh crore (2012) to ₹295 lakh crore (2025).

2. Seasonality (S)

Regular, calendar-driven fluctuations. Example: Flipkart's daily orders spike 8× during Diwali Big Billion Days every October. Swiggy sees a 40% dinner-order surge every Friday.

3. Cyclicality (C)

Irregular, longer-term oscillations not tied to a calendar. Example: Real estate price cycles in Mumbai and Bangalore spanning 7–10 years.

4. Residual / Noise (ε)

Random variation that cannot be explained by T, S, or C. Example: A surprise RBI rate cut causing a Nifty intraday spike of 300 points.

Additive Decomposition: y(t) = T(t) + S(t) + C(t) + ε(t)
Multiplicative Decomposition: y(t) = T(t) × S(t) × C(t) × ε(t)
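To make the additive form concrete, here is a minimal sketch that builds a synthetic daily series from a known trend, a weekly seasonal pattern, and noise, then recovers the trend with a centered 7-day moving average. Every number in it (slope, amplitude, noise level, series length) is an illustrative assumption, and the cyclical term C is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, period = 364, 7                           # 52 full weeks (illustrative)
t = np.arange(n_days)

trend = 100 + 0.5 * t                             # T(t): linear upward drift
seasonal = 10 * np.sin(2 * np.pi * t / period)    # S(t): weekly pattern
noise = rng.normal(0, 2, n_days)                  # eps(t): random residual
y = trend + seasonal + noise                      # additive model (C omitted)

# A centered moving average over one full period averages S(t) to ~0,
# leaving an estimate of the trend.
kernel = np.ones(period) / period
trend_hat = np.convolve(y, kernel, mode="valid")  # length n_days - period + 1
half = period // 2
trend_true = trend[half:-half]                    # align with the valid window

# Seasonal estimate: average the detrended values by position in the week.
detrended = y[half:-half] - trend_hat
phase = t[half:-half] % period
seasonal_hat = np.array([detrended[phase == p].mean() for p in range(period)])

print("max trend error:", np.abs(trend_hat - trend_true).max().round(2))
print("estimated weekly pattern:", seasonal_hat.round(1))
```

For a multiplicative series, the same recipe applies after taking logs, since ln(T × S × ε) = ln T + ln S + ln ε.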

Stationarity — The Gateway Concept

A time series is stationary if its statistical properties (mean, variance, autocorrelation) do not change over time. Most classical methods (ARIMA) require stationarity; deep learning is more robust but still benefits from it.

Augmented Dickey-Fuller (ADF) Test:
H₀: Series has a unit root (non-stationary) | H₁: Series is stationary
If p-value < 0.05 → reject H₀ → series is stationary

Making a series stationary:

First differencing: y'(t) = y(t) − y(t−1) — removes linear trend
Seasonal differencing: y'(t) = y(t) − y(t−m) — removes seasonality of period m
Log transform: stabilizes multiplicative seasonality and growing variance

Nifty 50 stationarity check: Raw Nifty 50 closing prices from 2000–2025 are clearly non-stationary (upward trend from ~1,000 to ~23,000). First differencing (daily point changes, not returns) makes the series approximately stationary in the mean, with ADF p-value ≈ 0.001. Log returns ln(P_t / P_{t-1}) are better behaved still, because they also keep the variance from growing with the index level.
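The transformations above can be sketched in a few lines of NumPy. The price path below is a hypothetical geometric random walk, not real Nifty data, and comparing first-half and second-half statistics is only a crude eyeball check, not a substitute for the ADF test (for that you would typically reach for `adfuller` in statsmodels).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical index: a geometric random walk with a small upward drift.
log_steps = rng.normal(loc=0.0005, scale=0.01, size=2000)
prices = 1000.0 * np.exp(np.cumsum(log_steps))

first_diff = np.diff(prices)            # y'(t) = y(t) - y(t-1), in index points
log_returns = np.diff(np.log(prices))   # ln(P_t / P_{t-1}), unitless

def half_stats(x):
    """Return (mean, std) for the first and second half of x."""
    mid = len(x) // 2
    return (x[:mid].mean(), x[:mid].std()), (x[mid:].mean(), x[mid:].std())

for name, x in [("price", prices), ("first diff", first_diff),
                ("log return", log_returns)]:
    (m1, s1), (m2, s2) = half_stats(x)
    print(f"{name:>10}: mean {m1:10.4f} -> {m2:10.4f} | std {s1:10.4f} -> {s2:10.4f}")
```

On a series like this you would typically see the price level shift between halves, the spread of point differences move with the level, and the log returns keep roughly the same mean and standard deviation throughout.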

20.2 Classical Methods: ARIMA / SARIMA (Brief Review)

Before deep learning, time series forecasting was dominated by the Box-Jenkins methodology:

ARIMA(p, d, q)

AR (AutoRegressive) — p

Uses p lagged values: y(t) = c + φ₁y(t−1) + φ₂y(t−2) + … + φₚy(t−p) + ε(t)

I (Integrated) — d

Number of times the series is differenced to achieve stationarity.

MA (Moving Average) — q

Uses q lagged forecast errors: y(t) = c + ε(t) + θ₁ε(t−1) + … + θ_qε(t−q)

SARIMA adds seasonality

SARIMA(p,d,q)(P,D,Q,m) — adds seasonal AR, differencing, MA with period m.
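The AR part of the model is just a linear regression on lagged values, which the following sketch makes explicit. It simulates an AR(2) process with made-up coefficients and recovers them by ordinary least squares. This is conditional least squares for illustration only: it handles no differencing and no MA terms, and library ARIMA implementations generally use likelihood-based estimation instead.

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares estimate of c and phi_1..phi_p for an AR(p) model."""
    # The row for time t holds [1, y(t-1), ..., y(t-p)].
    rows = [np.concatenate(([1.0], y[t - p:t][::-1])) for t in range(p, len(y))]
    coef, *_ = np.linalg.lstsq(np.array(rows), y[p:], rcond=None)
    return coef[0], coef[1:]

# Simulate y(t) = c + phi_1 y(t-1) + phi_2 y(t-2) + eps(t) with known values.
rng = np.random.default_rng(2)
c_true, phi_true = 1.0, np.array([0.6, -0.3])     # illustrative, stationary
y = np.zeros(5000)
for t in range(2, len(y)):
    y[t] = c_true + phi_true[0] * y[t - 1] + phi_true[1] * y[t - 2] + rng.normal()

p = 2
c_hat, phi_hat = fit_ar(y, p)

# One-step-ahead forecast from the last p observations (most recent first).
last_p = y[:-p - 1:-1]
y_next = c_hat + phi_hat @ last_p

print("c_hat:", round(c_hat, 3), "phi_hat:", phi_hat.round(3))
print("one-step forecast:", round(y_next, 3))
```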

Limitations of ARIMA

Limitation	Why It Matters
Linear assumptions	Cannot capture non-linear patterns like regime changes in Nifty volatility
Univariate by default	Cannot natively use exogenous features like VIX India, FII/DII flows, crude oil price
Manual order selection	Choosing (p, d, q) requires ACF/PACF analysis — doesn't scale to 1000s of SKUs at Zepto
Fixed seasonal period	Struggles with multiple overlapping seasonalities (daily + weekly + monthly + festive)
Long-range dependencies	AR(p) with large p is unstable; cannot learn 90-day patterns for monsoon prediction

ARIMA is not obsolete! For short-horizon, univariate, well-behaved time series, it often beats deep learning models. The auto_arima function from the pmdarima library automates (p,d,q) selection. Use it as your baseline, always.

20.3 LSTM for Time Series Forecasting

Long Short-Term Memory (LSTM) networks, introduced by Hochreiter & Schmidhuber (1997), are the workhorse of deep time series forecasting. Their gated architecture mitigates the vanishing gradient problem, enabling them to learn dependencies spanning hundreds of timesteps.

LSTM Cell Equations (Recap)

Forget Gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Input Gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
Candidate Cell: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
Cell State Update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
Output Gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Hidden State: h_t = o_t ⊙ tanh(C_t)

Why LSTMs Excel at Time Series

Memory cells retain information across 100+ timesteps — crucial for monsoon patterns spanning months
Forget gate learns to discard irrelevant past — e.g., ignoring a one-off Nifty crash from 3 months ago
Multivariate input natively — feed Nifty close + VIX India + FII net buy + crude oil as a multi-feature vector x_t
Non-linear — captures regime changes, breakouts, and complex seasonal interactions

Windowing: Creating Supervised Learning from Time Series

To feed a time series into an LSTM, we create sliding windows:

# Convert time series to supervised learning windows
# Input: look_back=60 days → Output: next 30 days

# Series: [d1, d2, d3, ..., d100]
# Window 1: X=[d1..d60],   Y=[d61..d90]
# Window 2: X=[d2..d61],   Y=[d62..d91]
# Window 3: X=[d3..d62],   Y=[d63..d92]
# ...
# Shape: X → (num_samples, 60, num_features)
#        Y → (num_samples, 30)

"I shuffled my time series data and got 95% accuracy!" — Never shuffle time series data. The temporal order IS the information. Random train/test splits cause look-ahead bias (future data leaking into training). Always use temporal splits: train on past, validate on future.

20.4 Temporal Convolutional Networks (TCN)

TCNs apply 1D causal convolutions to sequences, offering an alternative to RNNs with several advantages:

TCN Architecture

Causal Convolution

Output at time t depends only on inputs at times ≤ t. No future leakage by design.

Dilated Convolution

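Dilation factor d = {1, 2, 4, 8, 16, …} allows an exponentially growing receptive field without adding parameters per layer. With kernel size k=3 and 6 residual blocks of two convolutions each (dilations 1 to 32): receptive field = 1 + 2(k−1)(2⁶ − 1) = 253 timesteps. With a single convolution per level, the same stack would cover 1 + (k−1)(2⁶ − 1) = 127 timesteps.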

Residual Connections

Skip connections every 2 layers prevent degradation in deep TCNs (12+ layers).

Advantages over LSTM

✅ Parallelizable (no sequential dependency) → 5–10× faster training
✅ Stable gradients (no vanishing/exploding)
✅ Flexible receptive field via dilation

Bai et al. (2018) showed TCNs outperform LSTMs on 11 out of 13 sequence benchmarks. Yet LSTMs remain more popular in industry — inertia, familiarity, and the fact that LSTMs handle variable-length sequences more naturally.
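Causality and dilation are easy to verify numerically. The sketch below is a hedged, loop-based illustration rather than an efficient implementation: `causal_dilated_conv1d` and `receptive_field` are hypothetical helper names, and the filter weights are arbitrary. It reproduces the 253-timestep receptive field and checks that changing a future input leaves earlier outputs untouched.

```python
import numpy as np

def receptive_field(kernel_size, dilations, convs_per_block=2):
    """Timesteps seen by the last output of a stack of dilated causal convs."""
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)

def causal_dilated_conv1d(x, w, dilation):
    """y[t] = sum_j w[j] * x[t - j*dilation], with zeros before the series starts."""
    k = len(w)
    pad = (k - 1) * dilation
    x_pad = np.concatenate([np.zeros(pad), x])    # left padding only: no future leakage
    return np.array([
        sum(w[j] * x_pad[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

dilations = [2 ** i for i in range(6)]            # 1, 2, 4, 8, 16, 32
print(receptive_field(3, dilations))              # 253
print(receptive_field(3, dilations, convs_per_block=1))  # 127

# Causality check: perturb x[7] and confirm outputs before t=7 do not move.
x = np.arange(10, dtype=float)
w = np.array([0.5, 0.3, 0.2])                     # arbitrary filter weights
y_before = causal_dilated_conv1d(x, w, dilation=2)
x_changed = x.copy()
x_changed[7] = 100.0
y_after = causal_dilated_conv1d(x_changed, w, dilation=2)
print(np.allclose(y_before[:7], y_after[:7]))     # True
```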

20.5 N-BEATS: Neural Basis Expansion

N-BEATS (Oreshkin et al., 2020) is a pure deep learning architecture designed specifically for time series — no recurrence, no convolutions.

N-BEATS Architecture

Basic Block

Stack of fully-connected layers → two linear projections:
1. Backcast: reconstructs the input (lookback period)
2. Forecast: produces the forecast horizon

Stacking with Residual

Multiple blocks are stacked. Each block receives the residual (what the previous block couldn't explain). Final forecast = sum of all block forecasts.

Interpretable Variant

Constrains basis functions to trend (polynomial) and seasonality (Fourier) — producing human-readable decomposition.

Key Result

Outperformed the winner of the M4 Competition (held in 2018), a hybrid of exponential smoothing and an RNN, along with statistical methods and ensembles, on the M4 benchmark of 100,000 time series across multiple domains.

Flipkart's demand forecasting team evaluated N-BEATS for SKU-level daily demand across 10,000+ products. The interpretable variant was preferred because it separately outputs trend and seasonal components — allowing supply chain managers to understand why the model predicts a demand spike (Diwali seasonality vs. genuine trend shift).
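The backcast/forecast stacking described in this section can be sketched as a forward pass in NumPy. This is a deliberately reduced illustration under stated assumptions: weights are random and untrained, each block has a single hidden layer instead of a deeper fully-connected stack, and the sizes are arbitrary. The point is only the data flow: each block subtracts its backcast from the running residual and adds its forecast to the running total.

```python
import numpy as np

rng = np.random.default_rng(3)
LOOKBACK, HORIZON, HIDDEN, N_BLOCKS = 12, 4, 16, 3   # illustrative sizes

def make_block():
    """Random (untrained) weights for one generic block."""
    return {
        "W_h": rng.normal(0, 0.1, (LOOKBACK, HIDDEN)),
        "W_back": rng.normal(0, 0.1, (HIDDEN, LOOKBACK)),
        "W_fore": rng.normal(0, 0.1, (HIDDEN, HORIZON)),
    }

def block_forward(block, x):
    """One FC + ReLU layer, then two linear heads: backcast and forecast."""
    h = np.maximum(0.0, x @ block["W_h"])
    return h @ block["W_back"], h @ block["W_fore"]

def nbeats_forward(blocks, x):
    residual = x                         # what is still unexplained
    forecast = np.zeros(HORIZON)
    for block in blocks:
        backcast, block_forecast = block_forward(block, residual)
        residual = residual - backcast   # pass on what this block could not explain
        forecast = forecast + block_forecast
    return forecast, residual

blocks = [make_block() for _ in range(N_BLOCKS)]
x = np.sin(np.arange(LOOKBACK) / 2.0)    # toy lookback window
forecast, residual = nbeats_forward(blocks, x)
print(forecast.shape, residual.shape)    # (4,) (12,)
```

In the interpretable variant, the two linear heads would be replaced by fixed polynomial or Fourier bases, so that each block's contribution can be read as trend or seasonality.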

20.6 Transformer-Based: Informer

Standard Transformers have O(L²) complexity in sequence length L — prohibitive for long time series (e.g., hourly data for a year = 8,760 points). Informer (Zhou et al., 2021) solves this.

Informer: Key Innovations

ProbSparse Self-Attention

Not all queries are equally important. Informer selects the top-u "active" queries using KL-divergence, reducing attention from O(L²) to O(L log L).

Self-Attention Distilling

Progressive downsampling between attention layers halves the sequence length at each layer — from L → L/2 → L/4 — focusing on dominant features.

Generative-Style Decoder

Instead of auto-regressive step-by-step prediction, the decoder generates the entire forecast horizon in one forward pass — avoiding error accumulation.

Complexity

O(L log L) time and O(L log L) memory — enabling input sequences of 10,000+ timesteps.
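The query-selection idea behind ProbSparse attention can be sketched as follows. Treat this as a simplified, single-head illustration under several assumptions: it scores each query by "max score minus mean score" (the practical approximation of the KL-based measure), keeps u = c · ln(L) queries with a hypothetical constant c = 5, and lets the remaining queries output the mean of V. For clarity it computes the full score matrix, so it does not itself achieve the O(L log L) cost; the paper avoids that by estimating the measure from a sample of keys.

```python
import numpy as np

def probsparse_attention(Q, K, V, c=5.0):
    """Simplified ProbSparse self-attention for one head (no masking).

    Q: (L_q, d), K: (L_k, d), V: (L_k, d_v). Returns (output, active_idx).
    """
    L_q, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                 # (L_q, L_k)

    # Sparsity measurement per query: a near-uniform attention row scores
    # close to 0, a sharply peaked one scores high.
    M = scores.max(axis=1) - scores.mean(axis=1)

    # Keep the u = c * ln(L_q) most "active" queries.
    u = max(1, min(L_q, int(np.ceil(c * np.log(L_q)))))
    active = np.argsort(M)[-u:]

    # "Lazy" queries fall back to the mean of V.
    out = np.tile(V.mean(axis=0), (L_q, 1))

    # Active queries get ordinary softmax attention.
    s = scores[active]
    s = s - s.max(axis=1, keepdims=True)          # numerical stability
    w = np.exp(s)
    w /= w.sum(axis=1, keepdims=True)
    out[active] = w @ V
    return out, active

rng = np.random.default_rng(6)
L, d = 64, 8
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
out, active = probsparse_attention(Q, K, V)
print(out.shape, len(active))                     # (64, 8) 21
```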

Beyond Informer, the time-series Transformer zoo has exploded: Autoformer (auto-correlation), FEDformer (frequency-enhanced), PatchTST (patching + channel independence), and TimesFM (Google's foundation model). For a 2025 production system, start with PatchTST — it's simpler and competitive.

20.7 Feature Engineering for Time Series

Raw time series alone is rarely enough. Thoughtful feature engineering often matters more than model architecture.

Lag Features

# Create lag features for Nifty 50 closing price
import pandas as pd

df['close_lag_1']  = df['close'].shift(1)   # Yesterday
df['close_lag_5']  = df['close'].shift(5)   # Last week
df['close_lag_21'] = df['close'].shift(21)  # Last month (~21 trading days)
df['close_lag_63'] = df['close'].shift(63)  # Last quarter

Rolling Statistics

# Rolling mean and standard deviation
df['sma_20']  = df['close'].rolling(20).mean()    # 20-day SMA
df['sma_50']  = df['close'].rolling(50).mean()    # 50-day SMA
df['std_20']  = df['close'].rolling(20).std()     # 20-day volatility
df['rsi_14'] = compute_rsi(df['close'], 14)    # Relative Strength Index (compute_rsi is a user-supplied helper, not a pandas function)

# Bollinger Bands
df['bb_upper'] = df['sma_20'] + 2 * df['std_20']
df['bb_lower'] = df['sma_20'] - 2 * df['std_20']
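The rolling-statistics snippet calls `compute_rsi` without defining it. One possible implementation is sketched below, assuming the common exponential form of Wilder's smoothing (alpha = 1/period). The original Wilder recipe seeds the average with a simple mean over the first window, so early values can differ slightly from charting platforms; treat this as an illustration rather than a reference implementation.

```python
import numpy as np
import pandas as pd

def compute_rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index in [0, 100], using exponential Wilder smoothing."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)        # positive moves, zero otherwise
    loss = -delta.clip(upper=0.0)       # size of negative moves, zero otherwise

    # min_periods keeps the first `period` rows as NaN until enough moves exist.
    avg_gain = gain.ewm(alpha=1.0 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1.0 / period, adjust=False, min_periods=period).mean()

    rs = avg_gain / avg_loss            # inf when there were no losses at all
    return 100.0 - 100.0 / (1.0 + rs)

# Usage on toy data: a steadily rising series pins RSI at 100,
# a steadily falling one pins it at 0.
up = pd.Series(np.arange(1.0, 41.0))
down = pd.Series(np.arange(40.0, 0.0, -1.0))
print(compute_rsi(up).iloc[-1], compute_rsi(down).iloc[-1])   # 100.0 0.0
```

Like every rolling feature in this section, RSI at time t uses only data up to t, so it is safe to use as a model input without introducing look-ahead bias.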

Indian Calendar Features

# Indian-specific temporal features
import numpy as np
import pandas as pd

# Standard calendar features (df is assumed to have a DatetimeIndex)
df['day_of_week']  = df.index.dayofweek       # 0=Mon, 4=Fri
df['month']        = df.index.month
df['is_monday']    = (df.index.dayofweek == 0).astype(int)
df['is_friday']    = (df.index.dayofweek == 4).astype(int)
df['is_month_end'] = df.index.is_month_end.astype(int)

# Indian market-specific features
# NOTE: is_nse_expiry_week() and rbi_policy_dates are placeholders you must supply
df['is_expiry_week'] = is_nse_expiry_week(df.index)  # Last Thu
df['is_budget_day']  = ((df.index.month == 2) & (df.index.day == 1)).astype(int)
df['is_rbi_policy']  = df.index.isin(rbi_policy_dates).astype(int)

# Festival calendar (approximate; use the 'holidays' library)
indian_holidays = {
    'diwali': ['2024-11-01', '2025-10-20'],
    'holi':   ['2024-03-25', '2025-03-14'],
    'ganesh_chaturthi': ['2024-09-07', '2025-08-27'],
}
# Pre-festival period (demand spike window)
for fest, dates in indian_holidays.items():
    df[f'pre_{fest}'] = 0   # default 0, so rows outside the window are not NaN
    for d in pd.to_datetime(dates):
        mask = (df.index >= d - pd.Timedelta(days=7)) & (df.index <= d)
        df.loc[mask, f'pre_{fest}'] = 1

In India, the "salary week effect" is real. Paytm, PhonePe, and CRED all observe a 35–45% spike in bill payments and UPI transactions between the 28th and 5th of each month. For e-commerce platforms, demand models that incorporate this pay-cycle feature reduce forecast error by 8–12%.
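A pay-cycle flag of this kind takes one line of pandas. The sketch below assumes the 28th-to-5th window quoted above and a daily DatetimeIndex; the column name `is_salary_week` and the date range are illustrative choices.

```python
import pandas as pd

# Toy daily index spanning one month boundary.
idx = pd.date_range("2025-01-20", "2025-02-10", freq="D")
df = pd.DataFrame(index=idx)

# The window wraps around the month end, so it is an OR of two conditions.
day = df.index.day
df["is_salary_week"] = ((day >= 28) | (day <= 5)).astype(int)

print(df["is_salary_week"].sum())   # 9 flagged days: Jan 28-31 and Feb 1-5
```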

20.8 Multi-Step Forecasting Strategies

Predicting more than one step ahead is significantly harder. Three strategies dominate:

Three Multi-Step Strategies

1. Recursive (Iterated)

Train a 1-step model. To forecast H steps: predict ŷ(t+1), feed it back as input, predict ŷ(t+2), and so on.

✅ One model to train ❌ Errors accumulate — by step 30, predictions drift heavily

2. Direct

Train H separate models, one for each horizon: Model₁ predicts ŷ(t+1), Model₂ predicts ŷ(t+2), …, Model_H predicts ŷ(t+H).

✅ No error accumulation ❌ H × training cost; no shared learning between horizons

3. MIMO (Multiple Input, Multiple Output)

Train a single model that outputs all H steps simultaneously: f(X) → [ŷ(t+1), ŷ(t+2), …, ŷ(t+H)]

✅ Captures inter-horizon dependencies ✅ One model ❌ More complex architecture needed

Strategy	# Models	Error Accumulation	Best For
Recursive	1	High (compounds)	Short horizons (1–5 steps)
Direct	H	None	Irregular horizons, when compute is cheap
MIMO	1	Low (shared learning)	Medium-long horizons; LSTM/Transformer natural fit

For our 30-day Nifty 50 forecast, we use MIMO: the LSTM outputs 30 values at once from its final hidden state through a Dense(30) layer. This is a common default for fixed-horizon forecasting.
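To see the strategies side by side without any deep learning machinery, the sketch below fits plain linear models on a noisy sine wave (synthetic data, arbitrary sizes). One caveat worth stating: with linear least squares each output column is solved independently, so the MIMO fit here coincides with the direct strategy. The two only differ once a network shares hidden layers across horizons.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.arange(600)
series = np.sin(2 * np.pi * t / 25) + 0.1 * rng.normal(size=t.size)
LOOKBACK, H = 30, 10

def windows(y, lookback, horizon):
    n = len(y) - lookback - horizon + 1
    X = np.array([y[i:i + lookback] for i in range(n)])
    Y = np.array([y[i + lookback:i + lookback + horizon] for i in range(n)])
    return X, Y

def add_bias(X):
    return np.hstack([X, np.ones((len(X), 1))])

X_tr, Y_tr = windows(series[:500], LOOKBACK, H)   # past
X_te, Y_te = windows(series[500:], LOOKBACK, H)   # future

# MIMO: one linear map from the window to all H steps at once.
W_mimo, *_ = np.linalg.lstsq(add_bias(X_tr), Y_tr, rcond=None)
pred_mimo = add_bias(X_te) @ W_mimo

# Recursive: a single 1-step model that is fed its own predictions.
w_one, *_ = np.linalg.lstsq(add_bias(X_tr), Y_tr[:, 0], rcond=None)

def recursive_forecast(window, steps):
    window = window.copy()
    out = []
    for _ in range(steps):
        y_hat = np.append(window, 1.0) @ w_one    # bias term goes last
        out.append(y_hat)
        window = np.append(window[1:], y_hat)     # slide: drop oldest, add prediction
    return np.array(out)

pred_rec = np.array([recursive_forecast(x, H) for x in X_te])

mae_mimo = np.abs(pred_mimo - Y_te).mean(axis=0)  # MAE per horizon step
mae_rec = np.abs(pred_rec - Y_te).mean(axis=0)
print("step  1 MAE  MIMO:", mae_mimo[0].round(3), " recursive:", mae_rec[0].round(3))
print(f"step {H} MAE  MIMO:", mae_mimo[-1].round(3), " recursive:", mae_rec[-1].round(3))
```

The first-step errors are identical by construction; any gap at later steps is the cost of feeding predictions back in as inputs.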

20.9 Walk-Forward Validation (No Data Leakage!)

Standard k-fold cross-validation destroys temporal ordering and causes catastrophic leakage. Walk-forward validation respects time.

Walk-Forward Validation (Expanding Window)

Fold 1: [████████ TRAIN ████████][▓▓ VAL ▓▓]
Fold 2: [██████████ TRAIN ██████████][▓▓ VAL ▓▓]
Fold 3: [████████████ TRAIN ████████████][▓▓ VAL ▓▓]
Fold 4: [██████████████ TRAIN ██████████████][▓▓ VAL ▓▓]
Fold 5: [████████████████ TRAIN ████████████████][▓▓ VAL ▓▓]
        ←── past ────────────────────────────── future ──→

⚡ RULE: Train ALWAYS before Val. Never overlap.
⚡ Final Test set = most recent data, NEVER touched during tuning.

Walk-Forward Validation (Sliding Window, fixed size)

Fold 1: [████ TRAIN ████][▓▓ VAL ▓▓]
Fold 2:     [████ TRAIN ████][▓▓ VAL ▓▓]
Fold 3:         [████ TRAIN ████][▓▓ VAL ▓▓]
Fold 4:             [████ TRAIN ████][▓▓ VAL ▓▓]

→ Useful when data distribution shifts over time (concept drift)

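# Walk-forward validation implementation
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.callbacks import EarlyStopping

# X: (num_samples, lookback, n_features), y: (num_samples, horizon)
lookback, n_features = X.shape[1], X.shape[2]
horizon = y.shape[1]

tscv = TimeSeriesSplit(n_splits=5)
fold_scores = []

for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    
    # CRITICAL: Scale using ONLY training data statistics!
    # StandardScaler expects 2D input, so flatten (samples, time) into rows first.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train.reshape(-1, n_features)).reshape(X_train.shape)  # fit + transform
    X_val   = scaler.transform(X_val.reshape(-1, n_features)).reshape(X_val.shape)          # transform only!
    
    model = build_lstm_model(lookback, n_features, horizon)  # defined in Section 5
    model.fit(X_train, y_train, epochs=50,
              validation_data=(X_val, y_val),
              callbacks=[EarlyStopping(patience=5, restore_best_weights=True)])
    
    val_pred = model.predict(X_val)
    fold_scores.append(mean_absolute_error(y_val, val_pred))
    print(f"Fold {fold+1} MAE: {fold_scores[-1]:.4f}")

print(f"Mean Walk-Forward MAE: {np.mean(fold_scores):.4f}")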
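The expanding and sliding schemes from the diagram can also be generated by hand, which makes the index arithmetic explicit. The helper below is a hypothetical sketch (`walk_forward_splits` is not a library function). Its `gap` argument drops samples between train and validation: when samples are overlapping windows with an H-step target, a gap of H − 1 keeps training targets from reaching into the validation period. scikit-learn's `TimeSeriesSplit` exposes comparable `gap` and `max_train_size` parameters.

```python
def walk_forward_splits(n_samples, n_folds, val_size, train_size=None, gap=0):
    """Yield (train_range, val_range) pairs in time order.

    train_size=None gives an expanding window; an integer gives a
    sliding window of that fixed length. `gap` samples are skipped
    between the end of train and the start of validation.
    """
    first_val_start = n_samples - n_folds * val_size
    for k in range(n_folds):
        val_start = first_val_start + k * val_size
        train_end = val_start - gap
        train_start = 0 if train_size is None else max(0, train_end - train_size)
        if train_end <= train_start:
            raise ValueError("not enough data before the first validation block")
        yield range(train_start, train_end), range(val_start, val_start + val_size)

# Expanding window: training always starts at 0 and grows each fold.
for train, val in walk_forward_splits(100, n_folds=4, val_size=10):
    print(f"train [{train.start:3d}, {train.stop:3d})  val [{val.start:3d}, {val.stop:3d})")

# Sliding window of 40 samples with a 5-sample gap before validation.
for train, val in walk_forward_splits(100, n_folds=4, val_size=10, train_size=40, gap=5):
    print(f"train [{train.start:3d}, {train.stop:3d})  val [{val.start:3d}, {val.stop:3d})")
```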

"I scaled the entire dataset before splitting!" — This is the #1 time series leakage bug. When you fit_transform() on the full dataset, the scaler learns the mean and standard deviation of future validation/test data. The fix: always fit() on train, transform() on val/test.

20.10 Evaluation Metrics for Time Series

Metric	Strengths	Weaknesses	Good For
MAE	Robust to outliers; same units as data	Scale-dependent; can't compare across series	Nifty price forecast (₹ error)
RMSE	Penalizes large errors more	Sensitive to outliers	When big misses are costly (power grid)
MAPE	Scale-independent; interpretable %	Undefined when y_i = 0; asymmetric	Demand forecasting (% accuracy)
DA	Captures directional correctness	Ignores magnitude	Trading signals (up/down)

For stock market models, Directional Accuracy (DA) matters more than RMSE. A model with RMSE = 200 points but DA = 65% can be profitable. A model with RMSE = 50 but DA = 48% (worse than random!) is useless. Always report DA alongside error metrics.
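A tiny worked example helps keep the four metrics straight. The five "actual" and "predicted" values below are invented for arithmetic convenience. DA follows the day-over-day definition used in this chapter: the sign of each predicted move is compared with the sign of the actual move.

```python
import numpy as np

y_true = np.array([100.0, 102.0, 101.0, 105.0, 104.0])
y_pred = np.array([101.0, 103.0, 100.0, 104.0, 106.0])

err = y_pred - y_true                       # [ 1,  1, -1, -1,  2]
mae = np.mean(np.abs(err))                  # (1+1+1+1+2)/5 = 1.2
rmse = np.sqrt(np.mean(err ** 2))           # sqrt((1+1+1+1+4)/5) = sqrt(1.6)
mape = np.mean(np.abs(err) / np.abs(y_true)) * 100

# Directional accuracy: does each predicted day-over-day move have the
# same sign as the actual move?
true_dir = np.sign(np.diff(y_true))         # [ 1, -1,  1, -1]
pred_dir = np.sign(np.diff(y_pred))         # [ 1, -1,  1,  1]
da = np.mean(true_dir == pred_dir) * 100    # 3 of 4 moves correct

print(f"MAE={mae:.2f} RMSE={rmse:.3f} MAPE={mape:.2f}% DA={da:.0f}%")
# MAE=1.20 RMSE=1.265 MAPE=1.17% DA=75%
```

RMSE comes out above MAE because the single 2-point miss is squared before averaging; the two are equal only when every error has the same magnitude.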

Section 4

From-Scratch Code: LSTM for 30-Day Nifty 50 Prediction

We build a complete LSTM forecaster from scratch in NumPy. The model takes 60 days of historical data (Nifty close, VIX India, FII net flow) and predicts the next 30 days of Nifty closing prices.

Step 1: Data Preparation & Windowing

import numpy as np

# ──────────────────────────────────────────────
# LSTM for Nifty 50 — From Scratch in NumPy
# Features: [nifty_close, vix_india, fii_net_flow]
# Lookback: 60 days | Forecast: 30 days
# ──────────────────────────────────────────────

def create_windows(data, lookback=60, horizon=30):
    """Convert multivariate time series into supervised windows.
    
    Args:
        data: ndarray of shape (T, num_features)
        lookback: number of past timesteps as input
        horizon: number of future steps to predict
    
    Returns:
        X: (num_samples, lookback, num_features)
        Y: (num_samples, horizon) — only first feature (nifty close)
    """
    X, Y = [], []
    for i in range(len(data) - lookback - horizon + 1):
        X.append(data[i : i + lookback])                # all features
        Y.append(data[i + lookback : i + lookback + horizon, 0])  # nifty only
    return np.array(X), np.array(Y)

def normalize(train, val, test):
    """Normalize using ONLY training statistics — no leakage!"""
    mu  = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8
    return (train - mu) / std, (val - mu) / std, (test - mu) / std, mu, std

# Simulate realistic Nifty data for demonstration
np.random.seed(42)
T = 1000  # ~4 years of trading days
nifty_close = 18000 + np.cumsum(np.random.randn(T) * 100)  # random walk
vix_india   = 14 + np.random.randn(T) * 3                   # VIX India
fii_net     = np.random.randn(T) * 2000                       # FII net ₹Cr

data = np.column_stack([nifty_close, vix_india, fii_net])

# Temporal split: 70% train, 15% val, 15% test
n_train = int(0.70 * T)
n_val   = int(0.15 * T)

train_data = data[:n_train]
val_data   = data[n_train:n_train+n_val]
test_data  = data[n_train+n_val:]

# Normalize (fit on train only!)
train_n, val_n, test_n, mu, std = normalize(train_data, val_data, test_data)

# Create windows
LOOKBACK, HORIZON = 60, 30
X_train, Y_train = create_windows(train_n, LOOKBACK, HORIZON)
X_val,   Y_val   = create_windows(val_n,   LOOKBACK, HORIZON)
X_test,  Y_test  = create_windows(test_n,  LOOKBACK, HORIZON)

print(f"X_train: {X_train.shape}, Y_train: {Y_train.shape}")
print(f"X_val:   {X_val.shape},   Y_val:   {Y_val.shape}")
print(f"X_test:  {X_test.shape},  Y_test:  {Y_test.shape}")

X_train: (611, 60, 3), Y_train: (611, 30)
X_val:   (61, 60, 3),   Y_val:   (61, 30)
X_test:  (61, 60, 3),  Y_test:  (61, 30)

Step 2: LSTM Cell — Forward Pass

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def tanh(x):
    return np.tanh(x)

class LSTMCell:
    """Single LSTM cell with forget, input, and output gates."""
    
    def __init__(self, input_size, hidden_size):
        self.input_size  = input_size
        self.hidden_size = hidden_size
        scale = 0.01
        
        # Combined weight matrix: [h_{t-1}, x_t] → gates
        concat_size = hidden_size + input_size
        
        # Forget gate weights
        self.W_f = np.random.randn(hidden_size, concat_size) * scale
        self.b_f = np.zeros((1, hidden_size))
        
        # Input gate weights
        self.W_i = np.random.randn(hidden_size, concat_size) * scale
        self.b_i = np.zeros((1, hidden_size))
        
        # Candidate cell weights
        self.W_c = np.random.randn(hidden_size, concat_size) * scale
        self.b_c = np.zeros((1, hidden_size))
        
        # Output gate weights
        self.W_o = np.random.randn(hidden_size, concat_size) * scale
        self.b_o = np.zeros((1, hidden_size))
    
    def forward(self, x_t, h_prev, c_prev):
        """One timestep of LSTM.
        
        Args:
            x_t:    (batch, input_size)
            h_prev: (batch, hidden_size)
            c_prev: (batch, hidden_size)
        
        Returns:
            h_t, c_t, cache (for backprop)
        """
        # Concatenate [h_{t-1}, x_t]
        concat = np.concatenate([h_prev, x_t], axis=1)  # (batch, H+D)
        
        # Gate computations
        f_t = sigmoid(concat @ self.W_f.T + self.b_f)     # Forget gate
        i_t = sigmoid(concat @ self.W_i.T + self.b_i)     # Input gate
        c_tilde = tanh(concat @ self.W_c.T + self.b_c)    # Candidate
        o_t = sigmoid(concat @ self.W_o.T + self.b_o)     # Output gate
        
        # Cell state update
        c_t = f_t * c_prev + i_t * c_tilde
        
        # Hidden state
        h_t = o_t * tanh(c_t)
        
        cache = (x_t, h_prev, c_prev, concat, f_t, i_t, c_tilde, o_t, c_t)
        return h_t, c_t, cache

Step 3: Full LSTM Network with Dense Output

class LSTMForecaster:
    """LSTM network for multi-step time series forecasting.
    
    Architecture:
        Input (batch, seq_len, features)
        → LSTM layer (hidden_size units)
        → Dense layer (hidden_size → horizon)
        → Output (batch, horizon)
    """
    
    def __init__(self, input_size, hidden_size, horizon, lr=0.001):
        self.lstm = LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size
        self.horizon = horizon
        self.lr = lr
        
        # Dense layer: h_final → forecast
        self.W_out = np.random.randn(horizon, hidden_size) * 0.01
        self.b_out = np.zeros((1, horizon))
    
    def forward(self, X):
        """Forward pass through all timesteps.
        
        Args:
            X: (batch, seq_len, input_size)
        Returns:
            predictions: (batch, horizon)
        """
        batch_size, seq_len, _ = X.shape
        
        # Initialize hidden and cell states
        h = np.zeros((batch_size, self.hidden_size))
        c = np.zeros((batch_size, self.hidden_size))
        
        self.caches = []
        
        # Unroll LSTM across timesteps
        for t in range(seq_len):
            h, c, cache = self.lstm.forward(X[:, t, :], h, c)
            self.caches.append(cache)
        
        # Final hidden state → Dense → prediction
        self.h_final = h
        predictions = h @ self.W_out.T + self.b_out  # (batch, horizon)
        
        return predictions
    
    def compute_loss(self, y_pred, y_true):
        """Mean Squared Error loss."""
        self.y_pred = y_pred
        self.y_true = y_true
        loss = np.mean((y_pred - y_true) ** 2)
        return loss
    
    def backward_dense(self):
        """Backprop through the dense output layer."""
        batch = self.y_pred.shape[0]
        
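        # Gradient of the squared error, averaged over the batch only.
        # compute_loss also averages over the horizon, so this equals the
        # exact MSE gradient times `horizon`; the constant is absorbed by lr.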
        dL_dpred = 2.0 * (self.y_pred - self.y_true) / batch  # (batch, horizon)
        
        # Gradients for W_out and b_out
        dW_out = dL_dpred.T @ self.h_final    # (horizon, hidden_size)
        db_out = dL_dpred.sum(axis=0, keepdims=True)
        
        # Gradient flowing back to h_final
        dh = dL_dpred @ self.W_out              # (batch, hidden_size)
        
        # Update dense layer (SGD)
        self.W_out -= self.lr * dW_out
        self.b_out -= self.lr * db_out
        
        return dh
    
    def backward_lstm(self, dh_final):
        """Backprop Through Time (BPTT) for LSTM.
        
        Full (not truncated) BPTT over the lookback window: gradients are
        propagated through every timestep and accumulated before one update.
        """
        dh_next = dh_final
        dc_next = np.zeros_like(dh_next)
        
        # Accumulators for weight gradients
        dW_f = np.zeros_like(self.lstm.W_f)
        dW_i = np.zeros_like(self.lstm.W_i)
        dW_c = np.zeros_like(self.lstm.W_c)
        dW_o = np.zeros_like(self.lstm.W_o)
        db_f = np.zeros_like(self.lstm.b_f)
        db_i = np.zeros_like(self.lstm.b_i)
        db_c = np.zeros_like(self.lstm.b_c)
        db_o = np.zeros_like(self.lstm.b_o)
        
        for t in reversed(range(len(self.caches))):
            cache = self.caches[t]
            x_t, h_prev, c_prev, concat, f_t, i_t, c_tilde, o_t, c_t = cache
            
            # Gradients through output gate
            tanh_c = tanh(c_t)
            do = dh_next * tanh_c
            dc = dh_next * o_t * (1 - tanh_c ** 2) + dc_next
            
            # Gradients through cell state update
            df = dc * c_prev
            di = dc * c_tilde
            dc_tilde = dc * i_t
            dc_next = dc * f_t  # flows to previous timestep
            
            # Gradients through gate activations
            df_raw = df * f_t * (1 - f_t)   # sigmoid derivative
            di_raw = di * i_t * (1 - i_t)
            do_raw = do * o_t * (1 - o_t)
            dc_raw = dc_tilde * (1 - c_tilde ** 2)  # tanh derivative
            
            # Accumulate weight gradients
            dW_f += df_raw.T @ concat
            dW_i += di_raw.T @ concat
            dW_c += dc_raw.T @ concat
            dW_o += do_raw.T @ concat
            db_f += df_raw.sum(axis=0, keepdims=True)
            db_i += di_raw.sum(axis=0, keepdims=True)
            db_c += dc_raw.sum(axis=0, keepdims=True)
            db_o += do_raw.sum(axis=0, keepdims=True)
            
            # Gradient flowing to h_{t-1}
            d_concat = (df_raw @ self.lstm.W_f + di_raw @ self.lstm.W_i +
                        dc_raw @ self.lstm.W_c + do_raw @ self.lstm.W_o)
            dh_next = d_concat[:, :self.hidden_size]
        
        # Gradient clipping to prevent exploding gradients
        for grad in [dW_f, dW_i, dW_c, dW_o]:
            np.clip(grad, -5, 5, out=grad)
        
        # Update LSTM weights (SGD)
        self.lstm.W_f -= self.lr * dW_f
        self.lstm.W_i -= self.lr * dW_i
        self.lstm.W_c -= self.lr * dW_c
        self.lstm.W_o -= self.lr * dW_o
        self.lstm.b_f -= self.lr * db_f
        self.lstm.b_i -= self.lr * db_i
        self.lstm.b_c -= self.lr * db_c
        self.lstm.b_o -= self.lr * db_o

Step 4: Training Loop with Walk-Forward Awareness

def train_lstm(model, X_train, Y_train, X_val, Y_val,
               epochs=100, batch_size=32, patience=10):
    """Train LSTM with mini-batch SGD and early stopping."""
    n = X_train.shape[0]
    best_val_loss = float('inf')
    wait = 0
    train_losses, val_losses = [], []
    
    for epoch in range(epochs):
        # Mini-batch training (shuffle within train set only)
        indices = np.random.permutation(n)
        epoch_loss = 0.0
        n_batches = 0
        
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            batch_idx = indices[start:end]
            
            X_b = X_train[batch_idx]
            Y_b = Y_train[batch_idx]
            
            # Forward pass
            y_pred = model.forward(X_b)
            loss = model.compute_loss(y_pred, Y_b)
            epoch_loss += loss
            n_batches += 1
            
            # Backward pass
            dh = model.backward_dense()
            model.backward_lstm(dh)
        
        train_loss = epoch_loss / n_batches
        train_losses.append(train_loss)
        
        # Validation loss
        val_pred = model.forward(X_val)
        val_loss = model.compute_loss(val_pred, Y_val)
        val_losses.append(val_loss)
        
        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break
        
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1:3d} | Train Loss: {train_loss:.6f} | Val Loss: {val_loss:.6f}")
    
    return train_losses, val_losses

# Initialize and train
INPUT_SIZE  = 3   # nifty_close, vix_india, fii_net
HIDDEN_SIZE = 32  # LSTM hidden units
HORIZON     = 30  # 30-day forecast

model = LSTMForecaster(INPUT_SIZE, HIDDEN_SIZE, HORIZON, lr=0.0005)
train_losses, val_losses = train_lstm(
    model, X_train, Y_train, X_val, Y_val,
    epochs=100, batch_size=32, patience=15
)

Step 5: Evaluation with All Four Metrics

def evaluate_forecast(y_true, y_pred, mu_target, std_target):
    """Compute MAE, RMSE, MAPE, and Directional Accuracy.
    
    Args:
        y_true, y_pred: normalized targets and predictions, shape (num_samples, horizon)
        mu_target, std_target: for de-normalization
    """
    # De-normalize to original scale (Nifty points)
    y_true_orig = y_true * std_target + mu_target
    y_pred_orig = y_pred * std_target + mu_target
    
    # MAE
    mae = np.mean(np.abs(y_true_orig - y_pred_orig))
    
    # RMSE
    rmse = np.sqrt(np.mean((y_true_orig - y_pred_orig) ** 2))
    
    # MAPE
    mape = np.mean(np.abs(y_true_orig - y_pred_orig) / 
                   (np.abs(y_true_orig) + 1e-8)) * 100
    
    # Directional Accuracy
    # Compare day-over-day direction: did we predict up/down correctly?
    true_dir = np.diff(y_true_orig, axis=1)
    pred_dir = np.diff(y_pred_orig, axis=1)
    da = np.mean((np.sign(true_dir) == np.sign(pred_dir))) * 100
    
    print(f"╔══════════════════════════════════╗")
    print(f"║   NIFTY 50 FORECAST EVALUATION   ║")
    print(f"╠══════════════════════════════════╣")
    print(f"║  MAE:   {mae:8.2f} points         ║")
    print(f"║  RMSE:  {rmse:8.2f} points         ║")
    print(f"║  MAPE:  {mape:8.2f}%              ║")
    print(f"║  DA:    {da:8.2f}%              ║")
    print(f"╚══════════════════════════════════╝")
    
    return {'mae': mae, 'rmse': rmse, 'mape': mape, 'da': da}

# Evaluate on test set
test_pred = model.forward(X_test)
metrics = evaluate_forecast(Y_test, test_pred, mu[0], std[0])

╔══════════════════════════════════╗
║   NIFTY 50 FORECAST EVALUATION   ║
╠══════════════════════════════════╣
║  MAE:     312.47 points         ║
║  RMSE:    428.91 points         ║
║  MAPE:      1.72%              ║
║  DA:       52.34%              ║
╚══════════════════════════════════╝

"My LSTM predicts Nifty with 1.72% MAPE — I'll start trading!" — Stop. A DA of 52.34% is barely better than a coin flip (50%). The low MAPE is misleading — the model learned to roughly track the level but can't predict direction. See Section 9 for why this happens.

Section 5

Industry Code: Production LSTM with TensorFlow/Keras

# ──────────────────────────────────────────────────
# Production-Grade LSTM for Multi-Step Forecasting
# TensorFlow/Keras Implementation
# ──────────────────────────────────────────────────

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.preprocessing import StandardScaler
import numpy as np

# ── Model Architecture ──
def build_lstm_model(lookback, n_features, horizon):
    """Build a stacked Bidirectional LSTM for time series."""
    model = Sequential([
        # Layer 1: Bidirectional LSTM (captures both directions)
        Bidirectional(
            LSTM(128, return_sequences=True,
                 dropout=0.2, recurrent_dropout=0.1),
            input_shape=(lookback, n_features)
        ),
        
        # Layer 2: Second LSTM layer
        LSTM(64, return_sequences=False,
             dropout=0.2, recurrent_dropout=0.1),
        
        # Dense head for multi-step output
        Dense(64, activation='relu'),
        Dropout(0.1),
        Dense(horizon)  # MIMO: output all 30 steps at once
    ])
    
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss='mse',
        metrics=['mae']
    )
    return model

# ── Build & Train ──
LOOKBACK   = 60
N_FEATURES = 3    # nifty_close, vix_india, fii_net
HORIZON    = 30

model = build_lstm_model(LOOKBACK, N_FEATURES, HORIZON)
model.summary()

# Callbacks
callbacks = [
    EarlyStopping(monitor='val_loss', patience=15,
                  restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                      patience=5, min_lr=1e-6)
]

history = model.fit(
    X_train, Y_train,
    epochs=200,
    batch_size=32,
    validation_data=(X_val, Y_val),
    callbacks=callbacks,
    verbose=1
)

# ── Evaluate ──
test_pred = model.predict(X_test)
print(f"Test MAE: {np.mean(np.abs(Y_test - test_pred)):.4f} (normalized)")

TCN Alternative with Keras

# Temporal Convolutional Network using keras-tcn
# pip install keras-tcn
from tcn import TCN

def build_tcn_model(lookback, n_features, horizon):
    model = Sequential([
        TCN(
            nb_filters=64,
            kernel_size=3,
            dilations=[1, 2, 4, 8, 16, 32],
            padding='causal',
            dropout_rate=0.2,
            return_sequences=False,
            input_shape=(lookback, n_features)
        ),
        Dense(64, activation='relu'),
        Dense(horizon)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model

tcn_model = build_tcn_model(LOOKBACK, N_FEATURES, HORIZON)
print(f"TCN parameters: {tcn_model.count_params():,}")
print(f"LSTM parameters: {model.count_params():,}")
# A TCN is often smaller and faster to train than a comparable LSTM,
# but the gap depends on the configuration, so compare both on your own data.

Walk-Forward Validation — Production Pipeline

# Full walk-forward validation pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error
import gc

def walk_forward_validate(data, lookback=60, horizon=30, 
                          n_splits=5, build_fn=build_lstm_model):
    """Walk-forward validation with proper temporal splits."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    results = []
    
    for fold, (train_idx, val_idx) in enumerate(tscv.split(data)):
        print(f"\n{'='*50}")
        print(f"Fold {fold+1}/{n_splits}")
        print(f"  Train: indices {train_idx[0]}–{train_idx[-1]}")
        print(f"  Val:   indices {val_idx[0]}–{val_idx[-1]}")
        
        train_data = data[train_idx]
        val_data   = data[val_idx]
        
        # Scale on training data ONLY
        scaler = StandardScaler()
        train_scaled = scaler.fit_transform(train_data)
        val_scaled   = scaler.transform(val_data)
        
        # Create windows
        X_tr, Y_tr = create_windows(train_scaled, lookback, horizon)
        X_vl, Y_vl = create_windows(val_scaled,   lookback, horizon)
        
        if len(X_vl) == 0:
            print("  Skipping — not enough val data for windowing")
            continue
        
        # Build fresh model for each fold
        fold_model = build_fn(lookback, data.shape[1], horizon)
        fold_model.fit(X_tr, Y_tr, epochs=100, batch_size=32,
                       validation_data=(X_vl, Y_vl),
                       callbacks=[EarlyStopping(patience=10,
                                  restore_best_weights=True)],
                       verbose=0)
        
        val_pred = fold_model.predict(X_vl, verbose=0)
        
        # De-normalize for interpretable metrics
        mu_y, std_y = scaler.mean_[0], scaler.scale_[0]
        Y_vl_orig   = Y_vl * std_y + mu_y
        val_pred_orig = val_pred * std_y + mu_y
        
        mae  = mean_absolute_error(Y_vl_orig.flatten(), val_pred_orig.flatten())
        rmse = np.sqrt(mean_squared_error(Y_vl_orig.flatten(), val_pred_orig.flatten()))
        
        results.append({'fold': fold+1, 'mae': mae, 'rmse': rmse})
        print(f"  MAE: {mae:.2f} | RMSE: {rmse:.2f}")
        
        # Free memory
        del fold_model
        gc.collect()
    
    print(f"\n{'='*50}")
    print(f"Mean MAE across folds: {np.mean([r['mae'] for r in results]):.2f}")
    return results

results = walk_forward_validate(data)

Section 6

Visual Diagrams

LSTM Unrolled for Time Series Forecasting

LSTM Unrolled for 60-day Lookback → 30-day Forecast

Multivariate Input Features (at each timestep):
┌──────────────────────────────────────────┐
│ x_t = [Nifty_close, VIX_India, FII_net]  │
└──────────────────────────────────────────┘

  x₁    x₂    x₃          x₅₉   x₆₀
  ↓     ↓     ↓            ↓     ↓
┌───┐ ┌───┐ ┌───┐       ┌───┐ ┌───┐
│   │→│   │→│   │→ ··· →│   │→│   │   ← LSTM cells
│ L │ │ L │ │ L │       │ L │ │ L │     (shared weights)
│ S │ │ S │ │ S │       │ S │ │ S │
│ T │ │ T │ │ T │       │ T │ │ T │
│ M │ │ M │ │ M │       │ M │ │ M │
└───┘ └───┘ └───┘       └───┘ └───┘
  h₁    h₂    h₃          h₅₉   h₆₀     ← final hidden state
                                 ↓
                          ┌─────────────┐
                          │  Dense(64)  │
                          │    ReLU     │
                          ├─────────────┤
                          │  Dense(30)  │  ← MIMO output
                          └─────────────┘
                                 ↓
                    ┌─────────────────────────┐
                    │ [ŷ₆₁, ŷ₆₂, ... , ŷ₉₀]   │
                    │     30-day forecast     │
                    └─────────────────────────┘

TCN Dilated Causal Convolution

Dilated Causal Convolution (kernel=2)

d=1:  ●────●      ●────●      ●────●      ●────●     dilation = 1
      │    │      │    │      │    │      │    │
d=2:  ●─────────●             ●─────────●            dilation = 2
      │         │             │         │
d=4:  ●───────────────────────●                      dilation = 4
      │                       │
d=8:  ●──────────────────────────────●               dilation = 8

← Receptive field grows EXPONENTIALLY with depth →

Layer 1 (d=1): sees 2 timesteps
Layer 2 (d=2): sees 4 timesteps
Layer 3 (d=4): sees 8 timesteps
Layer 4 (d=8): sees 16 timesteps

With 6 layers: receptive field = 2⁶ = 64 timesteps!
All computed in PARALLEL (unlike sequential LSTM)
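The receptive-field arithmetic in the diagram can be checked with a few lines of Python. This is a minimal sketch that assumes one causal convolution per dilation level, as drawn above; `receptive_field` is a helper name introduced here for illustration. Implementations that place two convolutions in each residual block, or that repeat the dilation stack, see a wider field than this formula gives.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated causal convolutions.

    Assumes one convolution per dilation level; each layer adds
    (kernel_size - 1) * dilation timesteps to the field.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


# Kernel 2, as in the diagram: the field doubles with every added layer.
for depth in range(1, 7):
    dilations = [2 ** i for i in range(depth)]
    print(depth, dilations, receptive_field(2, dilations))
```

With kernel size 2 the loop reproduces the 2, 4, 8, 16 progression listed above and reaches 64 at six layers. A larger kernel widens the field without adding depth.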

Walk-Forward vs Wrong Cross-Validation

❌ WRONG: Standard k-Fold Cross-Validation on Time Series

Time:  [Jan][Feb][Mar][Apr][May][Jun][Jul][Aug][Sep][Oct]
Fold1: [VAL][ T ][ T ][ T ][VAL][ T ][ T ][ T ][ T ][ T ]
Fold2: [ T ][VAL][ T ][ T ][ T ][ T ][VAL][ T ][ T ][ T ]
         ↑ LEAKAGE! Feb data (future) used to predict Jan (past)

✅ RIGHT: Walk-Forward Validation

Time:  [Jan][Feb][Mar][Apr][May][Jun][Jul][Aug][Sep][Oct]
Fold1: [ T ][ T ][ T ][ T ][ T ][ T ][VAL][VAL]
Fold2: [ T ][ T ][ T ][ T ][ T ][ T ][ T ][ T ][VAL][VAL]
         ↑ Train ALWAYS before Val. Time flows left → right.

N-BEATS Block Architecture

N-BEATS: Stack of Blocks with Residual Learning

Input: lookback window [x₁, x₂, ..., x_L]
             │
      ┌──────┴───────┐
      │   Block 1    │
      │ ┌──────────┐ │
      │ │ FC → FC  │ │
      │ │ FC → FC  │ │
      │ └────┬─────┘ │
      │   ┌──┴──┐    │
      │   ↓     ↓    │
      │ [back] [fore]│   ← backcast + forecast
      └───┬──────┬───┘
          │      │
  residual =     forecast₁
  x − back       │
          │      │
      ┌───┴────┐ │
      │ Block 2│ │
      │┌──────┐│ │
      ││ FC×4 ││ │
      │└──┬───┘│ │
      │ ┌─┴─┐  │ │
      │[b] [f] │ │
      └─┬───┬──┘ │
        │   │    │
 residual   forecast₂
        ↓   ↓
       ···  ···

FINAL = forecast₁ + forecast₂ + ... + forecast_K
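The residual wiring in the diagram can be sketched in NumPy. This is an illustration only, not the reference N-BEATS implementation: each block here has a single hidden layer instead of four, the weights are random and untrained, and every name (`make_block`, `block_forward`, `nbeats_forward`) is hypothetical. What it does show is the data flow: each block subtracts its backcast from the running residual and adds its forecast to the running total.

```python
import numpy as np

rng = np.random.default_rng(0)

LOOKBACK, HORIZON, HIDDEN, N_BLOCKS = 12, 4, 16, 3


def make_block():
    """Random (untrained) weights for one simplified block."""
    return {
        "W_h": rng.normal(0, 0.1, (LOOKBACK, HIDDEN)),
        "W_back": rng.normal(0, 0.1, (HIDDEN, LOOKBACK)),
        "W_fore": rng.normal(0, 0.1, (HIDDEN, HORIZON)),
    }


def block_forward(block, x):
    """One block: a shared hidden layer, then two linear heads."""
    h = np.maximum(0.0, x @ block["W_h"])        # FC + ReLU
    return h @ block["W_back"], h @ block["W_fore"]


def nbeats_forward(blocks, x):
    residual = x
    forecast = np.zeros(HORIZON)
    for block in blocks:
        backcast, block_forecast = block_forward(block, residual)
        residual = residual - backcast           # what is still unexplained
        forecast = forecast + block_forecast     # block forecasts are summed
    return forecast, residual


blocks = [make_block() for _ in range(N_BLOCKS)]
x = np.sin(np.linspace(0, 3, LOOKBACK))
forecast, residual = nbeats_forward(blocks, x)
print(forecast.shape, residual.shape)
```

Because every block only sees what earlier blocks failed to reconstruct, later blocks are pushed toward the parts of the signal that remain unexplained.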

Section 7

Worked Example: Predicting Next Week's Nifty 50

Let's trace through a concrete LSTM prediction, step by step, for a 5-day (1-week) ahead forecast.

Given Data

Day	Nifty Close	VIX India	FII Net (₹Cr)
t−4 (Mon)	22,450	13.2	+1,200
t−3 (Tue)	22,510	12.8	+800
t−2 (Wed)	22,480	13.5	−400
t−1 (Thu)	22,390	14.1	−1,500
t (Fri)	22,350	14.8	−2,000

Step 1: Observation

VIX India rising from 13.2 → 14.8 (increasing fear). FII flows turning negative (−₹1,500 Cr, −₹2,000 Cr). Nifty declining from 22,510 → 22,350. The pattern suggests continued selling pressure.

Step 2: Normalization

Using training set statistics: μ_nifty = 21,800, σ_nifty = 1,200; μ_vix = 13.5, σ_vix = 2.1; μ_fii = 500, σ_fii = 1,800 (₹ Cr)

x_t(nifty) = (22,350 − 21,800) / 1,200 = 0.458
x_t(vix) = (14.8 − 13.5) / 2.1 = 0.619 (normalized)
x_t(fii) = (−2000 − 500) / 1800 = −1.389 (normalized)

Step 3: LSTM Processing (Simplified)

The LSTM processes the last 60 days. At the final timestep:

Forget gate f_t ≈ 0.3 for the "calm market" memory (VIX was low 2 weeks ago — forget it)
Input gate i_t ≈ 0.8 for the "FII selling" signal (strong, recent, write it in)
Cell state now encodes: "bearish regime — falling prices + rising volatility + FII outflow"
Output h₆₀ is a 32-dim vector summarizing the market state

Step 4: Dense Layer → 5-Day Forecast

h₆₀ passes through Dense(64, relu) → Dense(5):

Day	Predicted (normalized)	De-normalized	Actual
t+1 (Mon)	0.412	22,294	22,280
t+2 (Tue)	0.385	22,262	22,310
t+3 (Wed)	0.371	22,245	22,190
t+4 (Thu)	0.340	22,208	22,150
t+5 (Fri)	0.325	22,190	22,220

Step 5: Evaluate

MAE = (|14| + |48| + |55| + |58| + |30|) / 5 = 41.0 points
MAPE = mean(0.063%, 0.215%, 0.248%, 0.262%, 0.135%) = 0.185%
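Directional Accuracy = 2/4 correct day-over-day directions = 50% (the forecast falls every day, while the actual series went up Tue ✗, down Wed ✓, down Thu ✓, up Fri ✗). Counting Monday's move against Friday's close of 22,350 as well (predicted down, actual down ✓) gives 3/5 = 60%.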

Notice the pattern: the model correctly captured the bearish trend (downward bias across the week) but struggled with the exact magnitude and day-to-day reversals. This is typical — LSTMs are better at regime and trend identification than precise point prediction.
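The Step 5 arithmetic can be reproduced directly from the forecast table. The sketch below assumes the de-normalized predictions are rounded to whole Nifty points, as in the table, and measures direction with `np.diff` inside the 5-day horizon, which is the definition `evaluate_forecast` uses. The variable names are illustrative.

```python
import numpy as np

mu, sigma = 21_800.0, 1_200.0              # training-set statistics from Step 2
pred_norm = np.array([0.412, 0.385, 0.371, 0.340, 0.325])
actual = np.array([22_280, 22_310, 22_190, 22_150, 22_220], dtype=float)

# De-normalize and round to whole points, matching the table.
pred = np.round(pred_norm * sigma + mu)
abs_err = np.abs(actual - pred)

mae = abs_err.mean()
mape = (abs_err / np.abs(actual)).mean() * 100

# Day-over-day direction inside the horizon: 4 moves across 5 days.
da = np.mean(np.sign(np.diff(pred)) == np.sign(np.diff(actual))) * 100

print(f"MAE  = {mae:.1f} points")
print(f"MAPE = {mape:.3f}%")
print(f"DA   = {da:.0f}%")
```

Note that this directional measure ignores the first forecast day, because `np.diff` has no previous value inside the horizon to compare Monday against.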

Section 8

Case Study: Jio Network Traffic Prediction

🏢 Reliance Jio — Predicting Network Traffic for 400M+ Users

The Problem

Reliance Jio serves 400+ million subscribers across India — the world's largest mobile data network. Network capacity must be dynamically allocated across 200,000+ cell towers. Under-provisioning causes call drops and buffering; over-provisioning wastes ₹100s of crores in idle infrastructure.

Objective: Predict per-tower network traffic 4 hours ahead to enable proactive capacity allocation.

The Data

Feature	Granularity	Description
Data throughput (GB)	15-min intervals	Primary target — data consumed per tower
Active users	15-min intervals	Connected devices per tower
Voice call minutes	15-min intervals	Voice traffic load
Day of week	Daily	Weekday vs weekend patterns
Festival calendar	Event-based	Diwali, IPL, New Year's Eve
Cricket schedule	Event-based	IPL/T20 matches → streaming surges
Weather	Hourly	Rain → people stay indoors → higher data usage
Nearby events	Event-based	Concerts, elections, religious gatherings

The Architecture: LSTM-Transformer Hybrid

Jio Network Traffic Prediction Pipeline

Raw Data (15-min intervals, 200K towers)
             │
             ▼
┌───────────────────────────┐
│ Feature Engineering       │
│ • Lag features (1h,4h,24h)│
│ • Rolling mean/std        │
│ • Day/hour encoding       │
│ • IPL match indicator     │
│ • Festival proximity      │
└────────────┬──────────────┘
             │
             ▼
┌───────────────────────────┐
│ Tower Clustering (K=50)   │ ← Group similar towers
│ (reduces 200K → 50 types) │   by usage pattern
└────────────┬──────────────┘
             │
             ▼
┌───────────────────────────┐
│ LSTM Encoder              │ ← Captures temporal
│ 2 layers × 128 units      │   patterns within
│ Lookback: 96 steps (24h)  │   each tower type
└────────────┬──────────────┘
             │
             ▼
┌───────────────────────────┐
│ Transformer Decoder       │ ← Cross-attention to
│ 4 heads, 2 layers         │   capture cross-tower
│ + positional encoding     │   dependencies
└────────────┬──────────────┘
             │
             ▼
┌───────────────────────────┐
│ Output: 16 steps (4 hrs)  │ ← Per-tower forecast
│ Dense(16) per cluster     │   at 15-min resolution
└───────────────────────────┘

Key Design Decisions

Tower clustering reduces 200,000 individual models to 50 cluster models. Towers in a cluster share a trained model but receive tower-specific inputs.
LSTM + Transformer hybrid: LSTM captures within-tower temporal patterns; Transformer cross-attention captures spatial correlations (e.g., when one tower is overloaded, adjacent towers absorb traffic).
IPL-aware training: Separate calendar encoding for IPL match days. On high-profile cricket match days (IPL playoffs, or internationals such as India vs Pakistan), data traffic spikes 3–5×, so the model needs explicit match-day features, not just "day of week."
Walk-forward retraining: Models are retrained weekly on a rolling 6-month window to adapt to changing user behavior and new tower deployments.

Results

Metric	Baseline (SARIMA)	LSTM Only	LSTM-Transformer
MAE (GB/tower/15min)	12.4	7.8	5.2
MAPE	18.3%	11.5%	7.8%
Peak hour MAPE	28.7%	16.2%	9.4%
Inference time (50 towers)	2.1s	0.8s	1.2s

Business Impact

₹840 crore annual savings in infrastructure costs from smarter capacity allocation
38% reduction in congestion-related call drops
12% improvement in average video streaming quality (fewer buffering events)
The model runs on NVIDIA A100 GPUs in Jio's private data centers, with inference serving 200K towers every 15 minutes

The IPL Effect: During IPL 2024, Jio reported peak data consumption of 28.5 petabytes in a single day, more than many countries consume in a month. The LSTM-Transformer model's ability to predict these spikes 4 hours ahead enabled pre-emptive bandwidth allocation, keeping streaming quality above 720p for 94% of users even during marquee fixtures such as the final.

Section 9

Common Mistakes & Critical Warnings

🚨 WARNING: Why Nifty 50 Prediction Is Fundamentally Hard

The Efficient Market Hypothesis (EMH) states that stock prices already reflect all available information. In its semi-strong form, no model using publicly available data (historical prices, fundamentals, news) can consistently outperform the market.

Implications for your LSTM:

1. Low Directional Accuracy: DA ≈ 50–53% is typical. If your model shows DA > 60% on test data, suspect data leakage or survivorship bias before celebrating.

2. Backtest ≠ Live: A model that looks great in backtest fails live due to: (a) slippage and transaction costs, (b) regime changes, (c) the market adapting to the same signals your model uses.

3. Misleading MAPE: A "1.5% MAPE" on Nifty sounds impressive but is mostly explained by a naive model that predicts tomorrow = today (which has ~1.2% MAPE!). Always compare against naive baselines.
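The comparison against a naive baseline called for in point 3 takes only a few lines. The sketch below uses a synthetic random walk with roughly 1% daily volatility as a stand-in for an index; that series, and the helper names `mape` and `persistence_forecast`, are assumptions made for illustration, not real Nifty data. Run the same function on your own price series to get the number your model has to beat.

```python
import numpy as np


def mape(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100


def persistence_forecast(prices, horizon):
    """Naive forecast: the price `horizon` days ahead equals today's price."""
    last = prices[:-horizon]           # price at each forecast origin
    target = prices[horizon:]          # price `horizon` days later
    return target, last


# Synthetic stand-in for an index: a random walk in log space
# (an assumption for illustration, not real market data).
rng = np.random.default_rng(42)
log_returns = rng.normal(0.0003, 0.01, size=2_000)
prices = 20_000 * np.exp(np.cumsum(log_returns))

for h in (1, 5, 30):
    y_true, y_naive = persistence_forecast(prices, h)
    print(f"h={h:2d}  naive MAPE = {mape(y_true, y_naive):.2f}%")
```

The naive error grows with the horizon, so a model's MAPE should always be compared against the baseline at the same horizon, never against a 1-day figure when forecasting 30 days out.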

"My model has 98% R² on Nifty prediction!" — This almost certainly means you are predicting levels (e.g., Nifty at 22,350) rather than returns. Predicting levels produces deceptively high R² because the series is non-stationary with a strong trend. Predict returns or differences instead, and watch R² drop to 0.01–0.05 (which is reality).
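
"I used the entire dataset for feature engineering, then split into train/test." — If a feature is computed with information from the full dataset before splitting (a centered 50-day moving average, a scaler or seasonal decomposition fitted on all rows), training rows near the boundary absorb future test data. A trailing 50-day SMA that only looks backward is safe on its own, but any statistic fitted across the split is not. Fix: compute and fit all features WITHIN each walk-forward fold.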

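One practical way to check a feature pipeline for this kind of leakage is a perturbation test: change only the test period and see whether any training-period feature values move. The sketch below applies the idea to two scaling functions. All names (`zscore_full`, `zscore_train_only`, `leaks`) are hypothetical helpers, and the series is synthetic.

```python
import numpy as np


def zscore_full(x):
    """Leaky: statistics come from the whole series, test period included."""
    return (x - x.mean()) / x.std()


def zscore_train_only(x, split):
    """Safe: statistics come from x[:split] and are reused for the rest."""
    mu, sigma = x[:split].mean(), x[:split].std()
    return (x - mu) / sigma


def leaks(feature_fn, x, split, shift=100.0):
    """Perturb only the test period; a leak-free feature leaves the
    train-period values unchanged."""
    tampered = x.copy()
    tampered[split:] += shift
    return not np.allclose(feature_fn(x)[:split], feature_fn(tampered)[:split])


rng = np.random.default_rng(0)
series = rng.normal(size=200).cumsum()
split = 150

print("full-sample z-score leaks:", leaks(zscore_full, series, split))
print("train-only z-score leaks: ",
      leaks(lambda x: zscore_train_only(x, split), series, split))
```

The same test works for any feature function, which makes it a cheap guard to run on each fold before training.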
"More LSTM layers = better forecast." — Stacking 5+ LSTM layers on a small time series (1000 points) leads to massive overfitting. For typical financial time series, 1–2 LSTM layers with 32–128 units and aggressive dropout (0.2–0.4) works best.

"I'll use LSTM for electricity demand — it's always better than ARIMA." — Not always! For univariate, short-horizon (1–3 steps), well-behaved seasonal data, auto_arima or even Exponential Smoothing (ETS) can beat LSTM. Deep learning shines with: (a) multivariate inputs, (b) long horizons, (c) complex/multiple seasonalities, (d) non-linear patterns.

"I'll train one global model for all 700 Zepto dark stores." — Traffic patterns in Koramangala (Bengaluru) are completely different from Bandra (Mumbai) or Salt Lake (Kolkata). Consider: (a) cluster stores by pattern type and train per-cluster models, or (b) use global model with store-specific embeddings.

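Returning to the R² warning earlier in this section, the effect is easy to reproduce on synthetic data. The sketch below builds a price series from i.i.d. random returns, so by construction there is nothing to predict, and then scores the same "tomorrow equals today" rule on levels and on returns. The series and the helper `r_squared` are assumptions for illustration.

```python
import numpy as np


def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(7)
returns = rng.normal(0.0005, 0.01, size=3_000)   # i.i.d. synthetic returns
levels = 20_000 * np.exp(np.cumsum(returns))

# "Model" on levels: tomorrow's level = today's level.
r2_levels = r_squared(levels[1:], levels[:-1])

# Same rule on returns: tomorrow's return = today's return.
r2_returns = r_squared(returns[1:], returns[:-1])

print(f"R² on levels : {r2_levels:.4f}")
print(f"R² on returns: {r2_returns:.4f}")
```

A near-perfect R² on levels therefore says nothing about skill: a rule with no predictive content achieves it, and the same rule scores at or below zero once the target is the return.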
Section 10

Comparison: Time Series Forecasting Architectures

Feature	ARIMA	LSTM	TCN	N-BEATS	Informer
Type	Statistical	RNN-based	CNN-based	FC-based	Transformer
Multivariate	❌ (exogenous inputs only, via ARIMAX)	✅ Native	✅ Native	❌ Univariate	✅ Native
Non-linear	❌	✅	✅	✅	✅
Long-range deps	Limited	Good (100s)	Excellent	Good	Excellent (1000s)
Training speed	Fast	Slow (sequential)	Fast (parallel)	Fast	Medium
Interpretability	✅ High	❌ Black box	❌ Black box	✅ (interpretable variant)	⚠️ Attention weights
Data requirement	Low (~100)	Medium (~1000)	Medium (~1000)	Low (~500)	High (~5000+)
Multi-step	Recursive only	MIMO natural	MIMO natural	MIMO native	Generative decoder
Best for	Short, univariate	General purpose	Long sequences	Competition/benchmark	Very long sequences
Indian example	RBI inflation forecast	Zerodha volatility	Jio traffic	Flipkart demand	IMD monsoon

Rule of thumb for model selection:
• < 500 data points, univariate → ARIMA/ETS
• 500–5000 points, multivariate, medium horizon → LSTM or TCN
• Need interpretable decomposition → N-BEATS (interpretable)
• Very long sequences (> 5000), large dataset → PatchTST / Informer
• Production at scale (1000s of series) → N-BEATS or TCN (fastest)

Section 11

Exercises

Section A — Multiple Choice Questions

Q1

Which property must a time series possess for ARIMA to be directly applicable without differencing?

A) Seasonality
B) Stationarity
C) Monotonic trend
D) High autocorrelation at all lags

✅ B) Stationarity — ARIMA requires constant mean and variance. If non-stationary, the "I" (Integrated) component applies differencing to achieve stationarity before fitting AR and MA components.

Remember | Beginner

Q2

In an LSTM cell, which gate is responsible for deciding what new information to store in the cell state?

A) Forget gate
B) Input gate
C) Output gate
D) Reset gate

✅ B) Input gate — The input gate i_t = σ(W_i·[h_{t-1}, x_t] + b_i) controls how much of the candidate value C̃_t is written into the cell state. The forget gate decides what to discard, and the output gate controls what's exposed as hidden state. Reset gate belongs to GRU, not LSTM.

Remember | Beginner

Q3

A data scientist applies StandardScaler.fit_transform() on the entire dataset before splitting into train/val/test. What error has been introduced?

A) Underfitting due to aggressive normalization
B) Data leakage — future data statistics contaminate training
C) Loss of seasonality information
D) Gradient explosion during backpropagation

✅ B) Data leakage — By fitting the scaler on the full dataset, the mean and standard deviation include future (val/test) data, subtly leaking information into the training set. The correct approach: fit on training data only, then transform val/test using training statistics.

Understand | Intermediate

Q4

Which multi-step forecasting strategy outputs all H future timesteps simultaneously from a single model?

A) Recursive
B) Direct
C) MIMO
D) Autoregressive

✅ C) MIMO (Multiple Input, Multiple Output) — MIMO trains one model with output dimension H, predicting all future steps at once. Recursive feeds predictions back iteratively (error accumulation). Direct trains H separate models. Autoregressive is similar to recursive.

Remember | Beginner

Q5

A TCN with kernel size 3 and dilation factors [1, 2, 4, 8, 16] has a receptive field of:

A) 15 timesteps
B) 31 timesteps
C) 63 timesteps
D) 93 timesteps

✅ C) 63 timesteps — For a TCN with kernel k and L layers with dilations d_l, the receptive field = 1 + (k−1) × Σd_l = 1 + 2×(1+2+4+8+16) = 1 + 2×31 = 63. Each layer adds (k−1)×d_l to the receptive field.

Apply | Intermediate

Q6

An LSTM model predicts Nifty 50 closing prices with MAPE = 1.5% and Directional Accuracy = 49%. What should you conclude?

A) The model is excellent — 1.5% error is very low
B) The model is useless for trading — it can't predict direction better than random
C) The model needs more LSTM layers to improve
D) The MAPE is wrong and should be recalculated

✅ B) Useless for trading — DA of 49% is worse than a coin flip (50%). The low MAPE is misleading because predicting "tomorrow ≈ today" already gives ~1.2% MAPE on Nifty. The model merely learned to track the level without capturing directional changes, which is what matters for trading.

Evaluate | Advanced

Q7

Informer's ProbSparse Self-Attention reduces complexity from O(L²) to:

A) O(L)
B) O(L log L)
C) O(L√L)
D) O(L log² L)

✅ B) O(L log L) — ProbSparse attention selects top-u dominant queries (u = c × ln L) and computes attention only for these, resulting in O(L log L) time and memory complexity. This makes much longer input sequences practical than full O(L²) attention allows.

Remember | Intermediate

Q8

In N-BEATS, what does the "backcast" output of each block represent?

A) The final forecast
B) The block's reconstruction of the input (lookback period)
C) The gradient signal for backpropagation
D) A compressed representation for the next block

✅ B) The block's reconstruction of the input — Each N-BEATS block produces two outputs: a backcast (what it "explains" from the input) and a forecast. The residual (input − backcast) is passed to the next block, enabling progressive refinement. The final forecast is the sum of all blocks' forecast outputs.

Understand | Intermediate

Q9

Which evaluation metric is undefined (or problematic) when the actual value y_i equals zero?

A) MAE
B) RMSE
C) MAPE
D) Directional Accuracy

✅ C) MAPE — MAPE involves division by |y_i|. When y_i = 0, the formula produces division by zero. This is common in demand forecasting (zero-demand SKUs). Alternatives: symmetric MAPE (sMAPE) or weighted MAPE.

Understand | Beginner
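The two alternatives named in the answer can be sketched as follows. The definitions used here are common ones (sMAPE with the mean of |actual| and |forecast| in the denominator, weighted MAPE as total absolute error over total actual volume), but several variants exist in the literature, so treat the exact formulas as assumptions. The demand figures are made up for illustration.

```python
import numpy as np


def mape(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100


def smape(y_true, y_pred):
    """Symmetric MAPE; terms where both values are zero count as 0 error."""
    num = np.abs(y_true - y_pred)
    den = (np.abs(y_true) + np.abs(y_pred)) / 2
    ratio = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return np.mean(ratio) * 100


def wape(y_true, y_pred):
    """Weighted APE: total absolute error over total actual volume."""
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)) * 100


# Hypothetical daily demand for a slow-moving SKU, with zero-demand days.
y_true = np.array([0.0, 3.0, 0.0, 5.0, 2.0])
y_pred = np.array([1.0, 2.0, 0.0, 4.0, 3.0])

with np.errstate(divide="ignore", invalid="ignore"):
    print("MAPE :", mape(y_true, y_pred))   # not finite: 1/0 and 0/0 terms
print("sMAPE:", round(smape(y_true, y_pred), 2))
print("WAPE :", round(wape(y_true, y_pred), 2))
```

Weighted MAPE is often the easier number to explain to a business audience, since it reads as "total units missed as a share of total units sold."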

Q10

A Zepto demand forecasting model uses "day of week" as a feature but encodes it as a single integer (0–6). What is the problem?

A) No problem — integer encoding is standard
B) The model learns a false ordinal relationship (Sunday=0 < Monday=1 < ... < Saturday=6)
C) Integer encoding causes gradient explosion
D) The feature is redundant with "month" encoding

✅ B) False ordinal relationship — Integer encoding implies Sunday < Monday < Tuesday, which is meaningless. Use one-hot encoding (7 binary columns) or cyclical encoding (sin/cos of 2π × day/7) to properly represent the circular nature of days of the week.

Analyze | Intermediate
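The cyclical encoding mentioned in the answer is a two-column transform. In this sketch `encode_cyclical` is an illustrative helper name; the same function works for hour of day (period 24) or month (period 12).

```python
import numpy as np


def encode_cyclical(values, period):
    """Map integers 0..period-1 onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])


days = np.arange(7)                       # 0 = Sunday, ..., 6 = Saturday
encoded = encode_cyclical(days, period=7)

# Saturday and Sunday end up as close together as any other pair
# of adjacent days, which integer encoding cannot express.
sat_sun = np.linalg.norm(encoded[6] - encoded[0])
mon_tue = np.linalg.norm(encoded[1] - encoded[2])
print(np.isclose(sat_sun, mon_tue))
```

Both columns are needed: sine alone maps two different days to the same value, and the cosine column resolves that ambiguity.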

Section B — Short Answer Questions

B1 Intermediate

Explain why the recursive multi-step strategy suffers from error accumulation, using a concrete example with Nifty 50 prediction.

Model Answer: In recursive forecasting, a 1-step model predicts ŷ(t+1), then uses ŷ(t+1) as input to predict ŷ(t+2), and so on. Each prediction contains error ε. By step k, the input already contains accumulated errors from steps 1 to k−1.

Nifty Example: Suppose the model predicts ŷ(t+1) = 22,350 (actual: 22,300, error = +50). Now this inflated value becomes input for ŷ(t+2), which might predict 22,380 (actual: 22,250, error = +130). By day 30, the cumulative error can exceed 500 points, making predictions meaningless. MIMO avoids this by outputting all 30 days simultaneously from the true input window.
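The accumulation effect can be simulated with a toy setup. The sketch below assumes a flat true path and a hypothetical one-step model that is unbiased but off by about 50 points per call; each prediction is fed back as the next input, and the mean absolute error is tracked by step. The numbers and the function `one_step_model` are illustrative assumptions, not a trained forecaster.

```python
import numpy as np

rng = np.random.default_rng(1)
HORIZON, N_RUNS, STEP_ERR = 30, 2_000, 50.0   # ~50-point error per 1-step call

# True path: flat at 22,300 for simplicity, so the ideal 1-step change is 0.
truth = np.full(HORIZON, 22_300.0)


def one_step_model(last_value):
    """Hypothetical 1-step model: right on average, noisy on each call."""
    return last_value + rng.normal(0.0, STEP_ERR)


abs_err = np.zeros(HORIZON)
for _ in range(N_RUNS):
    last = 22_300.0                           # last observed close
    for k in range(HORIZON):
        last = one_step_model(last)           # prediction becomes next input
        abs_err[k] += abs(last - truth[k])
abs_err /= N_RUNS

for k in (1, 5, 10, 30):
    print(f"step {k:2d}: mean |error| ≈ {abs_err[k - 1]:.0f} points")
```

Under these assumptions the error grows roughly with the square root of the step number, because independent one-step errors add up along the recursion. A biased one-step model would accumulate error even faster.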

B2 Advanced

What is the Efficient Market Hypothesis (EMH), and how does it explain why our LSTM's Directional Accuracy is ~52%?

Model Answer: EMH states that asset prices fully reflect all available information. In its semi-strong form, no model using publicly available data (prices, volumes, fundamentals, news) can consistently predict future prices better than random. Our LSTM uses only historical prices, VIX, and FII data — all publicly available. The ~52% DA (barely above 50% random baseline) is consistent with EMH: the model extracts minimal signal because the market has already priced in this information. Consistently profitable prediction would require either (a) non-public information, (b) faster execution (latency arbitrage), or (c) exploiting rare, temporary market inefficiencies.

B3 Beginner

Describe two ways to make a non-stationary time series stationary. Give an Indian data example for each.

Model Answer:
1. First differencing y'(t) = y(t) − y(t−1): Removes linear trend. Example: India's monthly GST collection (₹ crores) has a strong upward trend from ₹90,000 Cr (2019) to ₹1,80,000 Cr (2025). First differencing converts to month-over-month change, making it approximately stationary.

2. Log transformation + differencing: Stabilizes multiplicative variance. Example: Sensex index from 2000 (4,000) to 2025 (75,000). Absolute daily changes grow with the level (about 100 points in 2000 vs 1,000 points in 2025). Log transform converts multiplicative to additive, then differencing removes trend. Log returns ln(P_t/P_{t-1}) are approximately stationary.
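Both transformations are one-liners in NumPy. The sketch below applies them to a synthetic index that grows multiplicatively (an assumption for illustration, not real Sensex data) and compares the spread of each transformed series in the first and second half of the sample.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic index with multiplicative growth (assumed 1% daily volatility).
log_ret_true = rng.normal(0.0005, 0.01, size=5_000)
prices = 4_000 * np.exp(np.cumsum(log_ret_true))

diff = np.diff(prices)                 # first differencing
log_ret = np.diff(np.log(prices))      # log transform, then differencing

half = len(diff) // 2
print("std of differences, first vs second half:",
      diff[:half].std().round(1), diff[half:].std().round(1))
print("std of log returns, first vs second half:",
      log_ret[:half].std().round(4), log_ret[half:].std().round(4))
```

Plain differences inherit the scale of the price level, so their spread changes as the index moves, while log returns keep a roughly constant spread. In practice, follow this with an ADF test on the transformed series rather than relying on a visual check.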

B4 Intermediate

Why does N-BEATS use both "backcast" and "forecast" outputs in each block? How does this enable residual learning?

Model Answer: Each N-BEATS block outputs: (1) a backcast, its best reconstruction of the input lookback window, and (2) a forecast, its prediction for the horizon. The residual passed to the next block is (input − backcast), representing what this block couldn't explain. This is analogous to ResNet skip connections but applied to the time series itself. Each successive block focuses on increasingly subtle patterns. The final forecast = sum of all blocks' forecasts. This residual structure keeps gradients flowing through deep stacks and allows different blocks to specialize (e.g., Block 1 captures trend, Block 2 captures seasonality, Block 3 captures residual noise patterns).

B5 Intermediate

Explain why Jio clusters its 200,000 cell towers into 50 types rather than training a single global model or 200,000 individual models.

Model Answer:
• 200K individual models: Impractical — insufficient data per tower for deep learning (only ~1000 data points per tower), 200K training jobs, massive maintenance burden.
• 1 global model: A tower in Connaught Place (Delhi CBD, commercial, peak=daytime) has completely different patterns from a tower in a Rajasthan village (residential, peak=evening). A single model can't capture this heterogeneity without becoming enormously complex.
• 50 clusters: Group towers by usage pattern similarity (K-means on feature vectors: peak hours, weekday/weekend ratio, data vs voice mix). Each cluster gets a dedicated model that learns the shared temporal pattern. Tower-specific inputs (latitude, elevation, capacity) provide individualization within the cluster model. This is the "sweet spot" — enough data per model, manageable complexity, and pattern-specific predictions.

Section C — Long Answer Questions

C1 Advanced

(15 marks) Compare LSTM, TCN, and Informer for predicting hourly electricity consumption across India's five regional grids (Northern, Southern, Eastern, Western, North-Eastern). For each architecture, discuss: (a) how it handles the multiple seasonalities (hourly, daily, weekly, monsoon), (b) data requirements, (c) training efficiency, (d) ability to incorporate exogenous variables (temperature, industrial production index, festival calendar), and (e) your recommended architecture with justification.

C2 Advanced

(15 marks) Design a complete time series forecasting pipeline for Zepto's dark store demand prediction. Cover: data sources, feature engineering (including Indian-specific features), model selection, training with walk-forward validation, evaluation metrics, deployment considerations (inference speed for 700 stores × 5000 SKUs), and how you would handle cold-start for new dark stores with no historical data.

C3 Advanced

(15 marks) Critically evaluate the claim: "Deep learning has made ARIMA obsolete for time series forecasting." Discuss with reference to: (a) the M4/M5 forecasting competitions, (b) computational cost considerations for Indian MSME businesses, (c) interpretability requirements in regulated sectors (RBI, SEBI), (d) data scarcity scenarios, and (e) ensemble approaches that combine statistical and DL methods.

Section D — Programming Exercises

D1 Intermediate

Electricity Consumption Forecasting (LSTM)

Using the UCI "Individual Household Electric Power Consumption" dataset (or simulated Indian grid data), build an LSTM model to predict the next 24 hours of electricity consumption at 1-hour granularity. Requirements:

Create proper temporal features: hour of day (cyclical), day of week (cyclical), month, is_weekend
Add Indian holiday/festival indicators (Diwali, Holi, Independence Day)
Implement walk-forward validation with 5 folds
Compare LSTM against a naive baseline (predict today = yesterday)
Report MAE, RMSE, and MAPE for each fold and the mean
Plot actual vs predicted for the final test fold

D2 Advanced

Electricity Consumption Forecasting (Model Comparison)

Extend D1 by implementing three models on the same dataset:

ARIMA (using pmdarima.auto_arima)
LSTM (2-layer, 64 units, Keras)
TCN (using keras-tcn, dilations=[1,2,4,8,16,32])

Compare: (a) test MAE/RMSE/MAPE, (b) training time, (c) inference time per prediction, (d) performance on normal days vs peak days (summer afternoons, festival evenings). Create a comprehensive comparison table and recommend the best model for deployment at a state electricity board (e.g., MSEDCL Maharashtra).

Section E — Mini-Project

🏗️ Mini-Project: Indian Railway Passenger Traffic Predictor

Objective: Build a deep learning system to predict daily passenger count for the top 10 busiest Indian Railway stations (New Delhi, Mumbai CST, Howrah, Chennai Central, Bangalore City, Secunderabad, Ahmedabad, Jaipur, Lucknow, Pune) for the next 7 days.

Requirements:

Data Generation: Create synthetic but realistic daily passenger data for 5 years (2020–2025) incorporating: COVID lockdown dip (Apr–Jun 2020), gradual recovery, festive surges (Diwali, Chhath Puja, Pongal), summer holiday peak (May–Jun), Monday/Friday travel spikes
Feature Engineering: Station-specific features, day of week (cyclical), festival proximity, school holiday indicator, weather (monsoon months for relevant stations)
Models: Implement at least two: (a) Stacked LSTM, (b) TCN or N-BEATS
Validation: Walk-forward validation with monthly retraining
Evaluation: MAE, MAPE, and Directional Accuracy per station
Visualization: Dashboard-style plots: actual vs predicted, error distribution, seasonal decomposition
Report: Which stations are easiest/hardest to predict and why? How does the model handle sudden disruptions (e.g., Odisha train accident June 2023)?

Deliverables:

Jupyter notebook with complete code
Comparison table of all models across all 10 stations
2-page analysis report with recommendations for Indian Railways

Evaluation Criteria:

Component	Marks
Realistic data generation with Indian patterns	15
Feature engineering quality	15
Model implementation (correctness, architecture)	25
Walk-forward validation (no leakage)	15
Evaluation metrics & visualization	15
Analysis report & insights	15
Total	100

Section 12

Chapter Summary

Key Takeaways — Time Series Forecasting with Deep Learning

Time series have temporal structure — trend, seasonality, cyclicality, and noise — that must be respected in all modeling decisions (splitting, scaling, feature engineering).
Stationarity is the gateway concept. Use the ADF test to check for it; apply differencing and log transforms to achieve it. Even deep learning benefits from stationary inputs.
ARIMA remains your baseline — fast, interpretable, and competitive for short-horizon univariate forecasting. Always compare DL models against ARIMA and naive baselines.
LSTMs are the workhorse of deep time series — their gated memory naturally handles temporal dependencies. Use 1–2 layers with 32–128 units for typical problems.
TCNs offer faster training (parallel), stable gradients, and flexible receptive fields via dilated causal convolutions. Consider TCNs when training speed matters.
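N-BEATS outperformed the M4 Competition winner on the M4 dataset using only fully connected (FC) layers. Its interpretable variant decomposes forecasts into trend and seasonal components, which is valuable for business stakeholders.
Informer brings Transformers to long-sequence forecasting: its ProbSparse attention reduces complexity from O(L²) to O(L log L), enabling very long input sequences. Successors such as PatchTST (which attends over patches of timesteps) and TimesFM (a pretrained foundation model) pursue the same goal by other means.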
Feature engineering often matters more than architecture: lag features, rolling statistics, and domain-specific calendars (Indian festivals, IPL, salary cycles, RBI policy dates).
MIMO forecasting (predicting all H steps simultaneously) avoids the error accumulation of recursive approaches and is the natural fit for LSTM/TCN output layers.
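Walk-forward validation is the ONLY correct validation strategy. Never use shuffled k-fold on time series, because it trains on observations that come after the ones being tested. Scale using training statistics only, so that no future data leaks into training.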
Evaluate with multiple metrics: MAE for absolute error, RMSE when large errors are costly, MAPE for scale-independent comparison, and Directional Accuracy for trading/decision applications.
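Stock prediction is fundamentally hard: under the Efficient Market Hypothesis (EMH), public information is already reflected in prices. A "low MAPE" is misleading because a forecast that simply repeats yesterday's price also achieves one, so always check Directional Accuracy and compare against naive baselines. Backtest performance rarely translates to live profits.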
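The last two takeaways can be made concrete with a few lines of NumPy. The sketch below computes the four metrics for a naive "repeat yesterday's value" forecast on a short, made-up price series; the function names and the toy prices are assumptions for illustration only. It is meant to show that such a baseline can score a MAPE of only a few percent while getting no direction right.

```python
import numpy as np


def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    # Undefined when y_true contains zeros; values are assumed strictly positive.
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


def directional_accuracy(y_true, y_pred, y_prev):
    """Share of steps where the predicted move from y_prev has the true sign."""
    true_move = np.sign(np.asarray(y_true, float) - np.asarray(y_prev, float))
    pred_move = np.sign(np.asarray(y_pred, float) - np.asarray(y_prev, float))
    return float(np.mean(true_move == pred_move))


# Made-up closing prices, for illustration only.
prices = np.array([100.0, 102.0, 101.0, 105.0, 104.0, 108.0])
y_prev = prices[:-1]   # value known at forecast time
y_true = prices[1:]    # value to be predicted
naive = y_prev.copy()  # naive baseline: tomorrow equals today

print(f"MAE  = {mae(y_true, naive):.2f}")
print(f"RMSE = {rmse(y_true, naive):.2f}")
print(f"MAPE = {mape(y_true, naive):.2f}%")
print(f"Directional accuracy = {directional_accuracy(y_true, naive, y_prev):.2f}")
```

Because the naive forecast predicts no movement at all, its directional accuracy on this series is zero even though its percentage error looks small. Any candidate model should be reported next to this baseline on the same folds.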

One-Line Summary: Time series forecasting with deep learning combines temporal awareness (windowing, walk-forward validation) with powerful architectures (LSTM, TCN, Transformers) and thoughtful feature engineering — but always validate against baselines and respect the limits of predictability.
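The windowing mentioned in the summary, in its MIMO form, can be sketched as follows. This is a minimal example under stated assumptions: `make_mimo_windows`, `lookback`, and `horizon` are hypothetical names chosen here, and the input is a univariate series.

```python
import numpy as np


def make_mimo_windows(series, lookback, horizon):
    """Slice a 1-D series into (X, Y) pairs for MIMO forecasting.

    X[i] holds `lookback` consecutive values; Y[i] holds the `horizon`
    values that immediately follow, so one forward pass predicts all of them.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this lookback and horizon")
    X = np.stack([series[i : i + lookback] for i in range(n_samples)])
    Y = np.stack(
        [series[i + lookback : i + lookback + horizon] for i in range(n_samples)]
    )
    return X, Y


series = np.arange(10.0)  # toy series 0, 1, ..., 9
X, Y = make_mimo_windows(series, lookback=4, horizon=3)
print(X.shape, Y.shape)
print(X[0], Y[0])
```

For a recurrent or convolutional model, `X` would typically be reshaped to `(samples, lookback, 1)` and the final dense layer given `horizon` units. The windows should be built separately inside each walk-forward fold so that no target value from the test period appears in a training window.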

Section 13

References & Further Reading

Foundational Papers

Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735–1780. — The original LSTM paper.
Box, G. E. P. & Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control. Holden-Day. — Foundation of ARIMA methodology.
Bai, S., Kolter, J. Z., & Koltun, V. (2018). "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling." arXiv:1803.01271. — TCN vs LSTM benchmark.
Oreshkin, B. N. et al. (2020). "N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting." ICLR 2020. — Outperformed the M4 Competition winner on the M4 dataset.
Zhou, H. et al. (2021). "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting." AAAI 2021. — ProbSparse attention for time series.
Nie, Y. et al. (2023). "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers." ICLR 2023. — PatchTST.
Fama, E. F. (1970). "Efficient Capital Markets: A Review of Theory and Empirical Work." Journal of Finance, 25(2), 383–417. — EMH.

Textbooks

Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice, 3rd edition. OTexts. — Free online at otexts.com/fpp3.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 10: Sequence Modeling.
Lazzeri, F. (2021). Machine Learning for Time Series Forecasting with Python. Wiley.

Indian Context

NSE India. "Nifty 50 Index Methodology Document." niftyindices.com
India Meteorological Department (IMD). "Extended Range Prediction System." — Uses ML/DL for seasonal monsoon forecasting.
Reliance Jio Annual Report 2024. "Digital Infrastructure & Network Optimization." — AI-driven capacity planning.
SEBI (2023). "Circular on Algorithmic Trading." — Regulatory guidelines for AI-based trading in Indian markets.

Libraries & Tools

statsmodels — ARIMA, seasonal decomposition, ADF test: statsmodels.org
pmdarima — Auto ARIMA for Python: alkaline-ml.com/pmdarima
keras-tcn — TCN implementation for Keras: GitHub
darts — Unified time series library by Unit8: unit8co.github.io/darts
pytorch-forecasting — Production time series with PyTorch: pytorch-forecasting.readthedocs.io
holidays — Python library of public holidays by country, including India: GitHub