Neural Networks & Deep Learning

Chapter 20: Time Series Forecasting with Deep Learning

Predicting the Future, One Timestep at a Time

⏱️ Reading Time: ~3 hours  |  📖 Part V: Sequence & Time  |  🧠 Theory + Code Chapter

📋 Prerequisites: Chapters 14–15 (RNNs, LSTMs), Chapter 18 (Transformers), Basic Statistics

Bloom's Taxonomy Map for This Chapter

Bloom's Level | What You'll Achieve
🔵 Remember | Recall time-series characteristics (trend, seasonality, stationarity), ARIMA components, LSTM gate equations, and evaluation metric formulas (MAE, RMSE, MAPE)
🔵 Understand | Explain why LSTMs handle long-range dependencies in sequential data, how TCN dilations expand receptive fields, and why walk-forward validation prevents data leakage
🟢 Apply | Implement an LSTM-based Nifty 50 predictor from scratch in NumPy, and build production-grade forecasters in TensorFlow/Keras
🟡 Analyze | Compare recursive vs. direct vs. MIMO multi-step forecasting, diagnose look-ahead bias, and decompose forecast errors into trend/seasonal/noise components
🟠 Evaluate | Assess whether a stock price predictor is genuinely predictive or merely curve-fitting; judge model suitability for different forecasting horizons
🔴 Create | Design an end-to-end electricity consumption forecasting pipeline with Indian holiday features, walk-forward validation, and ensemble strategies
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Define a time series and its four key components: trend, seasonality, cyclicality, and residual noise
  • Explain stationarity using the Augmented Dickey-Fuller (ADF) test and apply differencing to make a series stationary
  • Summarize the limitations of classical ARIMA/SARIMA models that motivate deep learning approaches
  • Implement an LSTM network from scratch in NumPy for 30-day Nifty 50 price prediction using VIX India and FII/DII flow features
  • Compare four deep architectures for time series: LSTM, Temporal Convolutional Networks (TCN), N-BEATS, and Informer (Transformer-based)
  • Engineer time-series features including lag features, rolling statistics, and Indian holiday calendars
  • Differentiate recursive, direct, and MIMO strategies for multi-step-ahead forecasting
  • Apply walk-forward validation to prevent data leakage in temporal splits
  • Calculate and interpret MAE, RMSE, MAPE, and Directional Accuracy metrics
  • Critically assess why stock market prediction remains fundamentally difficult (Efficient Market Hypothesis) and the gap between backtest and live performance
Section 2

Opening Hook

📈 Three Predictions That Run India

6:00 AM, IMD Headquarters, New Delhi: The India Meteorological Department's deep learning model processes 40 years of monsoon rainfall data across 306 stations. Its seasonal forecast will determine ₹18 lakh crore worth of Kharif crop planning. A 5% error in predicted millimetres means the difference between a bumper harvest and drought relief packages.

9:15 AM, NSE Trading Floor, BKC Mumbai: The opening bell rings. The Nifty 50 index (a weighted basket of India's 50 largest companies) will move through 22,000+ ticks today. Algorithmic trading desks at Zerodha, Groww, and Angel One process ₹2.5 lakh crore in daily turnover. Their LSTM models don't predict price; they predict volatility regimes and order flow direction.

11:30 PM, Zepto Dark Store, Koramangala, Bengaluru: Tomorrow's 10-minute delivery promise depends on tonight's demand forecast. How many packets of Amul Taaza, how many kg of tomatoes? An LSTM model trained on 18 months of order history, payday cycles, IPL match schedules, and Bengaluru weather predicts SKU-level demand for each of 700+ dark stores.

Every one of these problems is a time series. This chapter teaches you to build the deep learning models behind them.

Section 3

Core Concepts

20.1 What Is a Time Series?

A time series is a sequence of data points indexed in time order. Unlike the i.i.d. datasets used in standard ML, time series data has temporal dependencies: the value at time t depends on values at t−1, t−2, …, t−k.

Four Components of a Time Series

1. Trend (T)

Long-term increase or decrease. Example: India's GDP has an upward trend from ₹80 lakh crore (2012) to ₹295 lakh crore (2025).

2. Seasonality (S)

Regular, calendar-driven fluctuations. Example: Flipkart's daily orders spike 8× during Diwali Big Billion Days every October. Swiggy sees a 40% dinner-order surge every Friday.

3. Cyclicality (C)

Irregular, longer-term oscillations not tied to a calendar. Example: Real estate price cycles in Mumbai and Bangalore spanning 7–10 years.

4. Residual / Noise (ε)

Random variation that cannot be explained by T, S, or C. Example: A surprise RBI rate cut causing a Nifty intraday spike of 300 points.

Additive Decomposition: y(t) = T(t) + S(t) + C(t) + ε(t)
Multiplicative Decomposition: y(t) = T(t) × S(t) × C(t) × ε(t)
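
As a rough illustration of the additive form, the sketch below separates a synthetic series into trend, seasonal, and residual parts. It is a minimal NumPy version under stated assumptions: the period m is odd and known, the trend is estimated with a centered moving average, the cyclical term C(t) is not estimated separately, and the function name `additive_decompose` and the synthetic data are hypothetical.

```python
import numpy as np

def additive_decompose(y, m):
    """Split y into trend + seasonal + residual for an odd period m (illustrative)."""
    y = np.asarray(y, dtype=float)
    half = m // 2
    # Trend: centered moving average over one full period (NaN at the edges).
    trend = np.full_like(y, np.nan)
    kernel = np.ones(m) / m
    trend[half:len(y) - half] = np.convolve(y, kernel, mode="valid")
    # Seasonal: mean detrended value at each phase of the cycle, centered on zero.
    detrended = y - trend
    phase_means = np.array([np.nanmean(detrended[p::m]) for p in range(m)])
    phase_means -= phase_means.mean()
    seasonal = phase_means[np.arange(len(y)) % m]
    # Residual: whatever trend and seasonality leave unexplained.
    residual = y - trend - seasonal
    return trend, seasonal, residual

# Synthetic daily series: linear trend + weekly pattern + small noise.
rng = np.random.default_rng(0)
t = np.arange(140)
weekly = np.array([0.0, 1.0, 2.0, 3.0, 5.0, -4.0, -7.0])   # sums to zero
y = 100 + 0.5 * t + weekly[t % 7] + rng.normal(0, 0.1, size=t.size)

trend, seasonal, residual = additive_decompose(y, m=7)
```

For a multiplicative series, the same routine can be applied to log(y), since taking logs turns the product into a sum.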

Stationarity: The Gateway Concept

A time series is stationary if its statistical properties (mean, variance, autocorrelation) do not change over time. Most classical methods (ARIMA) require stationarity; deep learning is more robust but still benefits from it.

Augmented Dickey-Fuller (ADF) Test:
H₀: Series has a unit root (non-stationary)   |   H₁: Series is stationary
If p-value < 0.05 → reject H₀ → series is stationary

Making a series stationary:

  • First differencing: y'(t) = y(t) − y(t−1), which removes a linear trend
  • Seasonal differencing: y'(t) = y(t) − y(t−m), which removes seasonality of period m
  • Log transform: stabilizes multiplicative seasonality and growing variance

Nifty 50 stationarity check: Raw Nifty 50 closing prices from 2000–2025 are clearly non-stationary (upward trend from ~1,000 to ~23,000). First differencing (daily point changes) makes the series approximately stationary with ADF p-value ≈ 0.001. Log returns ln(P_t / P_{t-1}) are better behaved still, because they also stabilize the variance as the index level grows.
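
The three transformations can be sketched in a few lines of NumPy. The price path here is simulated (a geometric random walk starting near 1,000), not real Nifty data, and `seasonal_diff` and `half_means` are hypothetical helper names used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical price path: geometric random walk starting near 1,000.
log_returns_true = rng.normal(0.0005, 0.01, size=2000)
prices = 1000.0 * np.exp(np.cumsum(log_returns_true))

first_diff = np.diff(prices)             # y'(t) = y(t) - y(t-1), in index points
log_returns = np.diff(np.log(prices))    # ln(P_t / P_{t-1}), unitless

def seasonal_diff(y, m):
    """y'(t) = y(t) - y(t-m); the output is m samples shorter than the input."""
    y = np.asarray(y, dtype=float)
    return y[m:] - y[:-m]

def half_means(x):
    """Crude stationarity check: compare the mean of each half of the series."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

print("raw prices, half means:", half_means(prices))
print("log returns, half means:", half_means(log_returns))
```

Comparing half means is only a sanity check. If statsmodels is installed, `statsmodels.tsa.stattools.adfuller(log_returns)[1]` returns the ADF p-value for a formal test.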

20.2 Classical Methods: ARIMA / SARIMA (Brief Review)

Before deep learning, time series forecasting was dominated by the Box-Jenkins methodology:

ARIMA(p, d, q)

AR (AutoRegressive): p

Uses p lagged values: y(t) = c + φ₁y(t−1) + φ₂y(t−2) + … + φₚy(t−p) + ε(t)

I (Integrated): d

Number of times the series is differenced to achieve stationarity.

MA (Moving Average): q

Uses q lagged forecast errors: y(t) = c + ε(t) + θ₁ε(t−1) + … + θ_qε(t−q)

SARIMA adds seasonality

SARIMA(p,d,q)(P,D,Q,m): adds seasonal AR, differencing, and MA terms with period m.

Limitations of ARIMA

Limitation | Why It Matters
Linear assumptions | Cannot capture non-linear patterns like regime changes in Nifty volatility
Univariate by default | Cannot natively use exogenous features like VIX India, FII/DII flows, crude oil price
Manual order selection | Choosing (p, d, q) requires ACF/PACF analysis, which doesn't scale to 1000s of SKUs at Zepto
Fixed seasonal period | Struggles with multiple overlapping seasonalities (daily + weekly + monthly + festive)
Long-range dependencies | AR(p) with large p is unstable; cannot learn 90-day patterns for monsoon prediction

ARIMA is not obsolete! For short-horizon, univariate, well-behaved time series, it often beats deep learning models. The auto_arima function from the pmdarima library automates (p,d,q) selection. Always use it as your baseline.
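
To make the AR component concrete, here is a minimal sketch that estimates the AR(p) equation by ordinary least squares and then iterates it forward. This is not how ARIMA libraries fit models (they typically use maximum likelihood and also handle the I and MA parts); the function names `fit_ar` and `forecast_ar` and the simulated AR(2) data are assumptions made for illustration.

```python
import numpy as np

def fit_ar(y, p):
    """Estimate c and phi_1..phi_p of an AR(p) model by ordinary least squares."""
    y = np.asarray(y, dtype=float)
    # The row for time t holds [1, y(t-1), ..., y(t-p)].
    lags = np.column_stack([y[p - k : len(y) - k] for k in range(1, p + 1)])
    design = np.column_stack([np.ones(len(y) - p), lags])
    coef, *_ = np.linalg.lstsq(design, y[p:], rcond=None)
    return coef[0], coef[1:]

def forecast_ar(history, c, phi, steps):
    """Iterate the fitted recursion forward, feeding predictions back in."""
    buf = list(history[-len(phi):])
    out = []
    for _ in range(steps):
        nxt = c + sum(phi[k] * buf[-1 - k] for k in range(len(phi)))
        out.append(nxt)
        buf.append(nxt)
    return np.array(out)

# Simulate a stationary AR(2) process with known coefficients.
rng = np.random.default_rng(2)
true_c, true_phi = 0.5, np.array([0.6, -0.3])
y = np.zeros(5000)
for t in range(2, len(y)):
    y[t] = true_c + true_phi[0] * y[t - 1] + true_phi[1] * y[t - 2] + rng.normal(0, 1)

c_hat, phi_hat = fit_ar(y, p=2)
next_five = forecast_ar(y, c_hat, phi_hat, steps=5)
print("estimated phi:", phi_hat)
```

A baseline this simple is useful precisely because it is cheap: if a deep model cannot beat it on walk-forward validation, the extra complexity is not paying for itself.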

20.3 LSTM for Time Series Forecasting

Long Short-Term Memory (LSTM) networks, introduced by Hochreiter & Schmidhuber (1997), are the workhorse of deep time series forecasting. Their gated architecture mitigates the vanishing gradient problem, enabling them to learn dependencies spanning hundreds of timesteps.

LSTM Cell Equations (Recap)

Forget Gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Input Gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
Candidate Cell: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
Cell State Update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
Output Gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Hidden State: h_t = o_t ⊙ tanh(C_t)

Why LSTMs Excel at Time Series

  • Memory cells retain information across 100+ timesteps, which is crucial for monsoon patterns spanning months
  • Forget gate learns to discard irrelevant past, e.g., ignoring a one-off Nifty crash from 3 months ago
  • Multivariate input natively: feed Nifty close + VIX India + FII net buy + crude oil as a multi-feature vector x_t
  • Non-linear: captures regime changes, breakouts, and complex seasonal interactions

Windowing: Creating Supervised Learning from Time Series

To feed a time series into an LSTM, we create sliding windows:

# Convert time series to supervised learning windows
# Input: look_back=60 days → Output: next 30 days

# Series: [d1, d2, d3, ..., d100]
# Window 1: X=[d1..d60],   Y=[d61..d90]
# Window 2: X=[d2..d61],   Y=[d62..d91]
# Window 3: X=[d3..d62],   Y=[d63..d92]
# ...
# Shape: X → (num_samples, 60, num_features)
#        Y → (num_samples, 30)

"I shuffled my time series data and got 95% accuracy!" Never shuffle time series data before splitting it. The temporal order is the information. Random train/test splits cause look-ahead bias (future data leaking into training). Always use temporal splits: train on the past, validate on the future.

20.4 Temporal Convolutional Networks (TCN)

TCNs apply 1D causal convolutions to sequences, offering an alternative to RNNs with several advantages:

TCN Architecture

Causal Convolution

Output at time t depends only on inputs at times ≤ t. No future leakage by design.

Dilated Convolution

Dilation factor d = {1, 2, 4, 8, 16, …} allows an exponentially growing receptive field without adding parameters per layer. With kernel size k=3 and 6 residual blocks of two convolutions each (dilations 1 through 32): receptive field = 1 + 2(k−1)(2⁶ − 1) = 253 timesteps.

Residual Connections

Skip connections every 2 layers prevent degradation in deep TCNs (12+ layers).

Advantages over LSTM

✅ Parallelizable (no sequential dependency) → 5–10× faster training
✅ Stable gradients (no vanishing/exploding)
✅ Flexible receptive field via dilation

Bai et al. (2018) reported that a generic TCN matched or outperformed LSTM and GRU baselines on most of the sequence modeling benchmarks they tested. Yet LSTMs remain more popular in industry, owing to inertia, familiarity, and the fact that LSTMs handle variable-length sequences more naturally.
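
The receptive-field arithmetic and the causality property can both be checked with a short NumPy sketch. It assumes dilations double per residual block (1, 2, 4, …) and two convolutions per block; `receptive_field` and `causal_dilated_conv1d` are hypothetical names, and the convolution is written as an explicit loop for clarity rather than speed.

```python
import numpy as np

def receptive_field(kernel_size, num_blocks, convs_per_block=2):
    """Receptive field when block b uses dilation 2**b (b = 0..num_blocks-1)."""
    dilations = [2 ** b for b in range(num_blocks)]
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)

def causal_dilated_conv1d(x, w, dilation):
    """y[t] = sum_j w[j] * x[t - j*dilation], treating x before time 0 as zero."""
    k = len(w)
    pad = (k - 1) * dilation
    # Left-pad with zeros so every output only looks backwards in time.
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    return np.array([
        sum(w[j] * xp[pad + t - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

print(receptive_field(kernel_size=3, num_blocks=6))   # 253

# Causality check: changing a future input must not change earlier outputs.
x = np.arange(10, dtype=float)
w = np.array([0.5, 0.3, 0.2])
y1 = causal_dilated_conv1d(x, w, dilation=2)
x_future = x.copy()
x_future[7] = 100.0
y2 = causal_dilated_conv1d(x_future, w, dilation=2)
print(np.allclose(y1[:7], y2[:7]))   # outputs before t=7 are unaffected
```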

20.5 N-BEATS: Neural Basis Expansion

N-BEATS (Oreshkin et al., 2020) is a pure deep learning architecture designed specifically for time series, with no recurrence and no convolutions.

N-BEATS Architecture

Basic Block

Stack of fully-connected layers → two linear projections:
1. Backcast: reconstructs the input (lookback period)
2. Forecast: produces the forecast horizon

Stacking with Residual

Multiple blocks are stacked. Each block receives the residual (what the previous block couldn't explain). Final forecast = sum of all block forecasts.

Interpretable Variant

Constrains basis functions to trend (polynomial) and seasonality (Fourier), producing a human-readable decomposition.

Key Result

Outperformed the winning entry of the M4 Competition (a hybrid statistical and neural model) on the M4 benchmark of 100,000 time series across multiple domains, as reported in the 2020 paper.

Flipkart's demand forecasting team evaluated N-BEATS for SKU-level daily demand across 10,000+ products. The interpretable variant was preferred because it separately outputs trend and seasonal components, allowing supply chain managers to understand why the model predicts a demand spike (Diwali seasonality vs. genuine trend shift).
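
The residual stacking described above is easier to see as data flow. The sketch below wires three generic blocks together with random, untrained weights and no bias terms, so the numbers are meaningless; it only shows how each block consumes the residual of the previous one and how the forecasts are summed. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
LOOKBACK, HORIZON, HIDDEN = 12, 4, 16

def make_block():
    """One generic block: a small FC stack plus backcast and forecast heads."""
    return {
        "W1": rng.normal(0, 0.1, (LOOKBACK, HIDDEN)),
        "W2": rng.normal(0, 0.1, (HIDDEN, HIDDEN)),
        "W_back": rng.normal(0, 0.1, (HIDDEN, LOOKBACK)),
        "W_fore": rng.normal(0, 0.1, (HIDDEN, HORIZON)),
    }

def block_forward(block, x):
    h = np.maximum(0.0, x @ block["W1"])               # ReLU
    h = np.maximum(0.0, h @ block["W2"])
    return h @ block["W_back"], h @ block["W_fore"]    # (backcast, forecast)

def nbeats_forward(blocks, x):
    """Each block sees what earlier blocks left unexplained; forecasts are summed."""
    residual = x
    total_forecast = np.zeros((x.shape[0], HORIZON))
    for block in blocks:
        backcast, forecast = block_forward(block, residual)
        residual = residual - backcast                 # pass on the unexplained part
        total_forecast = total_forecast + forecast
    return total_forecast, residual

blocks = [make_block() for _ in range(3)]
x = rng.normal(size=(5, LOOKBACK))                     # batch of 5 lookback windows
forecast, leftover = nbeats_forward(blocks, x)
print(forecast.shape, leftover.shape)
```

In the interpretable variant, the two output heads would be replaced by fixed polynomial or Fourier bases whose coefficients the block predicts.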

20.6 Transformer-Based: Informer

Standard Transformers have O(L²) complexity in sequence length L, which is prohibitive for long time series (e.g., hourly data for a year = 8,760 points). Informer (Zhou et al., 2021) addresses this.

Informer: Key Innovations

ProbSparse Self-Attention

Not all queries are equally important. Informer selects the top-u "active" queries using KL-divergence, reducing attention from O(L²) to O(L log L).

Self-Attention Distilling

Progressive downsampling between attention layers halves the sequence length at each layer (L → L/2 → L/4), focusing on dominant features.

Generative-Style Decoder

Instead of auto-regressive step-by-step prediction, the decoder generates the entire forecast horizon in one forward pass, avoiding error accumulation.

Complexity

O(L log L) time and O(L log L) memory, enabling input sequences of 10,000+ timesteps.

Beyond Informer, the time-series Transformer zoo has exploded: Autoformer (auto-correlation), FEDformer (frequency-enhanced), PatchTST (patching + channel independence), and TimesFM (Google's foundation model). For a 2025 production system, start with PatchTST: it's simpler and competitive.
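
The query-selection idea behind ProbSparse attention can be sketched in NumPy. This is a simplified illustration, not the reference implementation: it scores every query against every key (so it is still O(L²)), whereas Informer samples keys to keep the scoring cheap. It uses the max-minus-mean score as the sparsity measurement and fills the non-selected queries with the mean of V. The function name and sizes are hypothetical.

```python
import numpy as np

def probsparse_attention(Q, K, V, u):
    """Full attention for the u most 'active' queries; mean(V) for the rest.

    Simplification: the sparsity score uses all keys, so this version is O(L^2).
    """
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)                        # (L_q, L_k)
    # Sparsity measurement: max minus mean of each query's scores.
    sparsity = scores.max(axis=1) - scores.mean(axis=1)
    top = np.argsort(sparsity)[-u:]                      # indices of the u active queries

    out = np.tile(V.mean(axis=0), (Q.shape[0], 1))       # lazy queries get the average value
    s = scores[top]
    weights = np.exp(s - s.max(axis=1, keepdims=True))   # numerically stable softmax
    weights /= weights.sum(axis=1, keepdims=True)
    out[top] = weights @ V
    return out, top

rng = np.random.default_rng(5)
L, d = 32, 8
Q, K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d)), rng.normal(size=(L, d))
u = int(np.ceil(np.log(L)))      # u grows like ln(L)
out, active = probsparse_attention(Q, K, V, u)
print(out.shape, sorted(active.tolist()))
```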

20.7 Feature Engineering for Time Series

Raw time series alone is rarely enough. Thoughtful feature engineering often matters more than model architecture.

Lag Features

# Create lag features for Nifty 50 closing price
import pandas as pd

df['close_lag_1']  = df['close'].shift(1)   # Yesterday
df['close_lag_5']  = df['close'].shift(5)   # Last week
df['close_lag_21'] = df['close'].shift(21)  # Last month (~21 trading days)
df['close_lag_63'] = df['close'].shift(63)  # Last quarter

Rolling Statistics

# Rolling mean and standard deviation
df['sma_20']  = df['close'].rolling(20).mean()    # 20-day SMA
df['sma_50']  = df['close'].rolling(50).mean()    # 50-day SMA
df['std_20']  = df['close'].rolling(20).std()     # 20-day volatility
df['rsi_14'] = compute_rsi(df['close'], 14)    # Relative Strength Index (compute_rsi is a user-defined helper)

# Bollinger Bands
df['bb_upper'] = df['sma_20'] + 2 * df['std_20']
df['bb_lower'] = df['sma_20'] - 2 * df['std_20']
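
The snippet above calls compute_rsi without defining it. One possible implementation is sketched below, assuming Wilder-style exponential smoothing (alpha = 1/period); other RSI variants use a simple moving average instead, so treat this as one reasonable choice rather than the definition the chapter intends.

```python
import numpy as np
import pandas as pd

def compute_rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index with Wilder-style exponential smoothing."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)          # positive moves, zero otherwise
    loss = (-delta).clip(lower=0.0)       # size of negative moves, zero otherwise
    # Wilder's smoothing is an exponential average with alpha = 1 / period.
    avg_gain = gain.ewm(alpha=1.0 / period, min_periods=period, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1.0 / period, min_periods=period, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)

# Simulated closing prices, for demonstration only.
rng = np.random.default_rng(6)
close = pd.Series(18000 + np.cumsum(rng.normal(0, 100, size=200)))
rsi_14 = compute_rsi(close, 14)
print(rsi_14.dropna().describe())
```

The first few values are NaN because the smoothed averages need at least `period` observed price changes.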

Indian Calendar Features

# Indian-specific temporal features
import numpy as np

# Standard calendar features
df['day_of_week']  = df.index.dayofweek       # 0=Mon, 4=Fri
df['month']        = df.index.month
df['is_monday']    = (df.index.dayofweek == 0).astype(int)
df['is_friday']    = (df.index.dayofweek == 4).astype(int)
df['is_month_end'] = df.index.is_month_end.astype(int)

# Indian market-specific features
# is_nse_expiry_week() and rbi_policy_dates are user-supplied (not defined in this snippet)
df['is_expiry_week'] = is_nse_expiry_week(df.index)  # Last Thu
df['is_budget_day']  = ((df.index.month == 2) & (df.index.day == 1)).astype(int)
df['is_rbi_policy']  = df.index.isin(rbi_policy_dates).astype(int)

# Festival calendar (approximate; use the 'holidays' library)
indian_holidays = {
    'diwali': ['2024-11-01', '2025-10-20'],
    'holi':   ['2024-03-25', '2025-03-14'],
    'ganesh_chaturthi': ['2024-09-07', '2025-08-27'],
}
# Pre-festival period (demand spike window)
for fest, dates in indian_holidays.items():
    df[f'pre_{fest}'] = 0  # default 0, so days outside the window are not left as NaN
    fest_dates = pd.to_datetime(dates)
    for d in fest_dates:
        mask = (df.index >= d - pd.Timedelta(days=7)) & (df.index <= d)
        df.loc[mask, f'pre_{fest}'] = 1

In India, the "salary week effect" is real. Paytm, PhonePe, and CRED all observe a 35–45% spike in bill payments and UPI transactions between the 28th and 5th of each month. For e-commerce platforms, demand models that incorporate this pay-cycle feature reduce forecast error by 8–12%.
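
A pay-cycle feature of this kind is a few lines of pandas. The sketch below assumes a DataFrame with a DatetimeIndex and takes the 28th-to-5th window from the paragraph above; the function and column names are hypothetical.

```python
import pandas as pd

def add_pay_cycle_features(df: pd.DataFrame) -> pd.DataFrame:
    """Flag days in the 28th-to-5th window and add days remaining in the month."""
    out = df.copy()
    day = out.index.day
    # The window wraps around the month boundary, hence the OR.
    out["is_salary_window"] = ((day >= 28) | (day <= 5)).astype(int)
    out["days_to_month_end"] = out.index.days_in_month - day
    return out

idx = pd.date_range("2025-01-25", "2025-02-08", freq="D")
demo = add_pay_cycle_features(pd.DataFrame({"orders": range(len(idx))}, index=idx))
print(demo)
```

Because February is shorter, a fixed "day >= 28" rule covers only one day there in non-leap years; a window defined relative to month end (using days_to_month_end) may be a better fit for some datasets.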

20.8 Multi-Step Forecasting Strategies

Predicting more than one step ahead is significantly harder. Three strategies dominate:

Three Multi-Step Strategies

1. Recursive (Iterated)

Train a 1-step model. To forecast H steps: predict ŷ(t+1), feed it back as input, predict ŷ(t+2), and so on.

✅ One model to train   ❌ Errors accumulate: by step 30, predictions drift heavily

2. Direct

Train H separate models, one for each horizon: Model₁ predicts ŷ(t+1), Model₂ predicts ŷ(t+2), …, Model_H predicts ŷ(t+H).

✅ No error accumulation   ❌ H × training cost; no shared learning between horizons

3. MIMO (Multiple Input, Multiple Output)

Train a single model that outputs all H steps simultaneously: f(X) → [ŷ(t+1), ŷ(t+2), …, ŷ(t+H)]

✅ Captures inter-horizon dependencies   ✅ One model   ❌ More complex architecture needed

Strategy | # Models | Error Accumulation | Best For
Recursive | 1 | High (compounds) | Short horizons (1–5 steps)
Direct | H | None | Irregular horizons, when compute is cheap
MIMO | 1 | Low (shared learning) | Medium-long horizons; LSTM/Transformer natural fit

For our 30-day Nifty 50 forecast, we use MIMO: the LSTM outputs 30 values at once from its final hidden state through a Dense(30) layer. This is a common choice for fixed-horizon forecasting.
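
The three strategies can be compared side by side with a plain linear model standing in for the network. This is a sketch on a synthetic noisy sine wave; all function names are hypothetical, and a linear least-squares fit is used only because it keeps the example short.

```python
import numpy as np

def make_windows(y, lookback, horizon):
    n = len(y) - lookback - horizon + 1
    X = np.array([y[i : i + lookback] for i in range(n)])
    Y = np.array([y[i + lookback : i + lookback + horizon] for i in range(n)])
    return X, Y

def fit_linear(X, Y):
    """Least-squares weights (with bias) mapping a window to one or more targets."""
    A = np.column_stack([X, np.ones(len(X))])
    W, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return W

def predict_linear(W, X):
    return np.column_stack([X, np.ones(len(X))]) @ W

rng = np.random.default_rng(4)
t = np.arange(600)
y = np.sin(2 * np.pi * t / 25) + rng.normal(0, 0.1, t.size)
LOOKBACK, H = 30, 10
X, Y = make_windows(y, LOOKBACK, H)

# Recursive: one 1-step model, applied H times on its own outputs.
W_one = fit_linear(X, Y[:, :1])
def recursive_forecast(window):
    buf = list(window)
    for _ in range(H):
        nxt = predict_linear(W_one, np.array([buf[-LOOKBACK:]]))[0, 0]
        buf.append(nxt)
    return np.array(buf[-H:])

# Direct: H separate models, one per horizon step.
W_direct = [fit_linear(X, Y[:, h]) for h in range(H)]
def direct_forecast(window):
    return np.array([predict_linear(W, window[None, :])[0] for W in W_direct])

# MIMO: one model with H outputs.
W_mimo = fit_linear(X, Y)
def mimo_forecast(window):
    return predict_linear(W_mimo, window[None, :])[0]

last = X[-1]
print(recursive_forecast(last).shape, direct_forecast(last).shape, mimo_forecast(last).shape)
```

For a purely linear model the direct and MIMO fits coincide, because least squares solves each output column independently. The strategies only diverge once the outputs share hidden layers, as they do in the LSTM with a Dense(30) head.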

20.9 Walk-Forward Validation (No Data Leakage!)

Standard k-fold cross-validation destroys temporal ordering and causes catastrophic leakage. Walk-forward validation respects time.

Walk-Forward Validation (Expanding Window)

Fold 1: [████████ TRAIN ████████][▓▓ VAL ▓▓]
Fold 2: [██████████ TRAIN ██████████][▓▓ VAL ▓▓]
Fold 3: [████████████ TRAIN ████████████][▓▓ VAL ▓▓]
Fold 4: [██████████████ TRAIN ██████████████][▓▓ VAL ▓▓]
Fold 5: [████████████████ TRAIN ████████████████][▓▓ VAL ▓▓]
        ←── past ────────────────────────────── future ──→

⚡ RULE: Train ALWAYS before Val. Never overlap.
⚡ Final Test set = most recent data, NEVER touched during tuning.

Walk-Forward Validation (Sliding Window, fixed size)

Fold 1: [████ TRAIN ████][▓▓ VAL ▓▓]
Fold 2:     [████ TRAIN ████][▓▓ VAL ▓▓]
Fold 3:         [████ TRAIN ████][▓▓ VAL ▓▓]
Fold 4:             [████ TRAIN ████][▓▓ VAL ▓▓]

→ Useful when data distribution shifts over time (concept drift)

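
Both schemes reduce to choosing index ranges. The generator below is a minimal sketch (the name `walk_forward_splits` and its parameters are hypothetical) that yields expanding-window folds by default and sliding-window folds when a fixed training size is given.

```python
import numpy as np

def walk_forward_splits(n_samples, n_folds, val_size, train_size=None):
    """Yield (train_idx, val_idx) pairs in time order.

    train_size=None  -> expanding window (train always starts at index 0)
    train_size=k     -> sliding window of the k most recent training samples
    """
    first_val_start = n_samples - n_folds * val_size
    if first_val_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        val_start = first_val_start + fold * val_size
        train_start = 0 if train_size is None else max(0, val_start - train_size)
        yield np.arange(train_start, val_start), np.arange(val_start, val_start + val_size)

expanding = list(walk_forward_splits(100, n_folds=4, val_size=10))
sliding = list(walk_forward_splits(100, n_folds=4, val_size=10, train_size=30))

for name, folds in [("expanding", expanding), ("sliding", sliding)]:
    for tr, va in folds:
        print(f"{name}: train {tr[0]}-{tr[-1]}, val {va[0]}-{va[-1]}")
```

When each sample is itself a window that overlaps its neighbours, consider leaving a gap of at least the forecast horizon between train and validation indices; scikit-learn's TimeSeriesSplit exposes a `gap` parameter for this.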
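# Walk-forward validation implementation
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.callbacks import EarlyStopping

# Assumes X: (num_samples, lookback, num_features) and y: (num_samples, horizon).
# build_lstm_model() is a user-supplied function returning a compiled Keras model.
tscv = TimeSeriesSplit(n_splits=5)
fold_scores = []
n_features = X.shape[-1]

for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    
    # CRITICAL: Scale using ONLY training data statistics!
    # StandardScaler expects 2D input, so flatten the time axis, scale, then restore the shape.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train.reshape(-1, n_features)).reshape(X_train.shape)  # fit + transform
    X_val   = scaler.transform(X_val.reshape(-1, n_features)).reshape(X_val.shape)          # transform only!
    
    model = build_lstm_model()
    model.fit(X_train, y_train, epochs=50,
              validation_data=(X_val, y_val),
              callbacks=[EarlyStopping(patience=5)])
    
    val_pred = model.predict(X_val)
    fold_scores.append(mean_absolute_error(y_val, val_pred))
    print(f"Fold {fold+1} MAE: {fold_scores[-1]:.4f}")

print(f"Mean Walk-Forward MAE: {np.mean(fold_scores):.4f}")

"I scaled the entire dataset before splitting!" This is one of the most common time series leakage bugs. When you fit_transform() on the full dataset, the scaler learns the mean and standard deviation of future validation/test data. The fix: always fit() on train, transform() on val/test.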

20.10 Evaluation Metrics for Time Series

Mean Absolute Error (MAE): MAE = (1/n) Σ|y_i − ŷ_i|

Root Mean Squared Error (RMSE): RMSE = √[(1/n) Σ(y_i − ŷ_i)²]

Mean Absolute Percentage Error (MAPE): MAPE = (100/n) Σ|y_i − ŷ_i| / |y_i|

Directional Accuracy (DA): DA = (1/n) Σ 𝟙(sign(Δy_i) = sign(Δŷ_i)) × 100%

Metric | Strengths | Weaknesses | Good For
MAE | Robust to outliers; same units as data | Scale-dependent; can't compare across series | Nifty price forecast (₹ error)
RMSE | Penalizes large errors more | Sensitive to outliers | When big misses are costly (power grid)
MAPE | Scale-independent; interpretable % | Undefined when y_i = 0; asymmetric | Demand forecasting (% accuracy)
DA | Captures directional correctness | Ignores magnitude | Trading signals (up/down)

For stock market models, Directional Accuracy (DA) matters more than RMSE. A model with RMSE = 200 points but DA = 65% can be profitable. A model with RMSE = 50 but DA = 48% (worse than random!) is useless. Always report DA alongside error metrics.
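
A toy example makes the RMSE-versus-DA tension concrete. The two forecasts below are made up for illustration: one is offset by a constant but always moves in the right direction, the other stays close in level but always moves the wrong way.

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100

def directional_accuracy(y_true, y_pred):
    """Share of steps where the predicted change has the same sign as the actual change."""
    return np.mean(np.sign(np.diff(y_true)) == np.sign(np.diff(y_pred))) * 100

y_true = np.array([100.0, 102.0, 101.0, 104.0, 103.0, 106.0])

# Forecast A: right direction every day, but offset by a constant 5 points.
forecast_a = y_true + 5.0
# Forecast B: close in level, but moves the wrong way every day.
forecast_b = np.array([101.5, 101.0, 102.0, 101.5, 104.0, 103.5])

for name, f in [("A", forecast_a), ("B", forecast_b)]:
    print(f"{name}: MAE={mae(y_true, f):.2f}  RMSE={rmse(y_true, f):.2f}  "
          f"MAPE={mape(y_true, f):.2f}%  DA={directional_accuracy(y_true, f):.0f}%")
```

Forecast B wins on every error metric yet gets the direction wrong on all five steps, which is why a trading model should never be judged on RMSE alone.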
Section 4

From-Scratch Code: LSTM for 30-Day Nifty 50 Prediction

We build a complete LSTM forecaster from scratch in NumPy. The model takes 60 days of historical data (Nifty close, VIX India, FII net flow) and predicts the next 30 days of Nifty closing prices.

Step 1: Data Preparation & Windowing

import numpy as np

# ──────────────────────────────────────────────
# LSTM for Nifty 50: From Scratch in NumPy
# Features: [nifty_close, vix_india, fii_net_flow]
# Lookback: 60 days | Forecast: 30 days
# ──────────────────────────────────────────────

def create_windows(data, lookback=60, horizon=30):
    """Convert multivariate time series into supervised windows.
    
    Args:
        data: ndarray of shape (T, num_features)
        lookback: number of past timesteps as input
        horizon: number of future steps to predict
    
    Returns:
        X: (num_samples, lookback, num_features)
        Y: (num_samples, horizon), first feature only (nifty close)
    """
    X, Y = [], []
    for i in range(len(data) - lookback - horizon + 1):
        X.append(data[i : i + lookback])                # all features
        Y.append(data[i + lookback : i + lookback + horizon, 0])  # nifty only
    return np.array(X), np.array(Y)

def normalize(train, val, test):
    """Normalize using ONLY training statistics (no leakage)."""
    mu  = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8
    return (train - mu) / std, (val - mu) / std, (test - mu) / std, mu, std

# Simulate realistic Nifty data for demonstration
np.random.seed(42)
T = 1000  # ~4 years of trading days
nifty_close = 18000 + np.cumsum(np.random.randn(T) * 100)  # random walk
vix_india   = 14 + np.random.randn(T) * 3                   # VIX India
fii_net     = np.random.randn(T) * 2000                       # FII net ₹Cr

data = np.column_stack([nifty_close, vix_india, fii_net])

# Temporal split: 70% train, 15% val, 15% test
n_train = int(0.70 * T)
n_val   = int(0.15 * T)

train_data = data[:n_train]
val_data   = data[n_train:n_train+n_val]
test_data  = data[n_train+n_val:]

# Normalize (fit on train only!)
train_n, val_n, test_n, mu, std = normalize(train_data, val_data, test_data)

# Create windows
LOOKBACK, HORIZON = 60, 30
X_train, Y_train = create_windows(train_n, LOOKBACK, HORIZON)
X_val,   Y_val   = create_windows(val_n,   LOOKBACK, HORIZON)
X_test,  Y_test  = create_windows(test_n,  LOOKBACK, HORIZON)

print(f"X_train: {X_train.shape}, Y_train: {Y_train.shape}")
print(f"X_val:   {X_val.shape},   Y_val:   {Y_val.shape}")
print(f"X_test:  {X_test.shape},  Y_test:  {Y_test.shape}")
X_train: (611, 60, 3), Y_train: (611, 30)
X_val:   (61, 60, 3),   Y_val:   (61, 30)
X_test:  (61, 60, 3),  Y_test:  (61, 30)

Step 2: LSTM Cell Forward Pass

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def tanh(x):
    return np.tanh(x)

class LSTMCell:
    """Single LSTM cell with forget, input, and output gates."""
    
    def __init__(self, input_size, hidden_size):
        self.input_size  = input_size
        self.hidden_size = hidden_size
        scale = 0.01
        
        # Combined weight matrix: [h_{t-1}, x_t] → gates
        concat_size = hidden_size + input_size
        
        # Forget gate weights
        self.W_f = np.random.randn(hidden_size, concat_size) * scale
        self.b_f = np.zeros((1, hidden_size))
        
        # Input gate weights
        self.W_i = np.random.randn(hidden_size, concat_size) * scale
        self.b_i = np.zeros((1, hidden_size))
        
        # Candidate cell weights
        self.W_c = np.random.randn(hidden_size, concat_size) * scale
        self.b_c = np.zeros((1, hidden_size))
        
        # Output gate weights
        self.W_o = np.random.randn(hidden_size, concat_size) * scale
        self.b_o = np.zeros((1, hidden_size))
    
    def forward(self, x_t, h_prev, c_prev):
        """One timestep of LSTM.
        
        Args:
            x_t:    (batch, input_size)
            h_prev: (batch, hidden_size)
            c_prev: (batch, hidden_size)
        
        Returns:
            h_t, c_t, cache (for backprop)
        """
        # Concatenate [h_{t-1}, x_t]
        concat = np.concatenate([h_prev, x_t], axis=1)  # (batch, H+D)
        
        # Gate computations
        f_t = sigmoid(concat @ self.W_f.T + self.b_f)     # Forget gate
        i_t = sigmoid(concat @ self.W_i.T + self.b_i)     # Input gate
        c_tilde = tanh(concat @ self.W_c.T + self.b_c)    # Candidate
        o_t = sigmoid(concat @ self.W_o.T + self.b_o)     # Output gate
        
        # Cell state update
        c_t = f_t * c_prev + i_t * c_tilde
        
        # Hidden state
        h_t = o_t * tanh(c_t)
        
        cache = (x_t, h_prev, c_prev, concat, f_t, i_t, c_tilde, o_t, c_t)
        return h_t, c_t, cache

Step 3: Full LSTM Network with Dense Output

class LSTMForecaster:
    """LSTM network for multi-step time series forecasting.
    
    Architecture:
        Input (batch, seq_len, features)
        → LSTM layer (hidden_size units)
        → Dense layer (hidden_size → horizon)
        → Output (batch, horizon)
    """
    
    def __init__(self, input_size, hidden_size, horizon, lr=0.001):
        self.lstm = LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size
        self.horizon = horizon
        self.lr = lr
        
        # Dense layer: h_final → forecast
        self.W_out = np.random.randn(horizon, hidden_size) * 0.01
        self.b_out = np.zeros((1, horizon))
    
    def forward(self, X):
        """Forward pass through all timesteps.
        
        Args:
            X: (batch, seq_len, input_size)
        Returns:
            predictions: (batch, horizon)
        """
        batch_size, seq_len, _ = X.shape
        
        # Initialize hidden and cell states
        h = np.zeros((batch_size, self.hidden_size))
        c = np.zeros((batch_size, self.hidden_size))
        
        self.caches = []
        
        # Unroll LSTM across timesteps
        for t in range(seq_len):
            h, c, cache = self.lstm.forward(X[:, t, :], h, c)
            self.caches.append(cache)
        
        # Final hidden state → Dense → prediction
        self.h_final = h
        predictions = h @ self.W_out.T + self.b_out  # (batch, horizon)
        
        return predictions
    
    def compute_loss(self, y_pred, y_true):
        """Mean Squared Error loss."""
        self.y_pred = y_pred
        self.y_true = y_true
        loss = np.mean((y_pred - y_true) ** 2)
        return loss
    
    def backward_dense(self):
        """Backprop through the dense output layer."""
        batch = self.y_pred.shape[0]
        
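        # Gradient of the squared error, averaged over the batch only.
        # compute_loss() averages over batch * horizon, so this is horizon times
        # the exact gradient; the constant factor is absorbed by the learning rate.
        dL_dpred = 2.0 * (self.y_pred - self.y_true) / batch  # (batch, horizon)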
        
        # Gradients for W_out and b_out
        dW_out = dL_dpred.T @ self.h_final    # (horizon, hidden_size)
        db_out = dL_dpred.sum(axis=0, keepdims=True)
        
        # Gradient flowing back to h_final
        dh = dL_dpred @ self.W_out              # (batch, hidden_size)
        
        # Update dense layer (SGD)
        self.W_out -= self.lr * dW_out
        self.b_out -= self.lr * db_out
        
        return dh
    
    def backward_lstm(self, dh_final):
        """Backprop Through Time (BPTT) for LSTM.
        
        Simplified: gradients are accumulated across all timesteps of the
        lookback window (full BPTT over the window, no truncation), then
        the weight gradients are clipped and applied in one SGD step.
        """
        dh_next = dh_final
        dc_next = np.zeros_like(dh_next)
        
        # Accumulators for weight gradients
        dW_f = np.zeros_like(self.lstm.W_f)
        dW_i = np.zeros_like(self.lstm.W_i)
        dW_c = np.zeros_like(self.lstm.W_c)
        dW_o = np.zeros_like(self.lstm.W_o)
        db_f = np.zeros_like(self.lstm.b_f)
        db_i = np.zeros_like(self.lstm.b_i)
        db_c = np.zeros_like(self.lstm.b_c)
        db_o = np.zeros_like(self.lstm.b_o)
        
        for t in reversed(range(len(self.caches))):
            cache = self.caches[t]
            x_t, h_prev, c_prev, concat, f_t, i_t, c_tilde, o_t, c_t = cache
            
            # Gradients through output gate
            tanh_c = tanh(c_t)
            do = dh_next * tanh_c
            dc = dh_next * o_t * (1 - tanh_c ** 2) + dc_next
            
            # Gradients through cell state update
            df = dc * c_prev
            di = dc * c_tilde
            dc_tilde = dc * i_t
            dc_next = dc * f_t  # flows to previous timestep
            
            # Gradients through gate activations
            df_raw = df * f_t * (1 - f_t)   # sigmoid derivative
            di_raw = di * i_t * (1 - i_t)
            do_raw = do * o_t * (1 - o_t)
            dc_raw = dc_tilde * (1 - c_tilde ** 2)  # tanh derivative
            
            # Accumulate weight gradients
            dW_f += df_raw.T @ concat
            dW_i += di_raw.T @ concat
            dW_c += dc_raw.T @ concat
            dW_o += do_raw.T @ concat
            db_f += df_raw.sum(axis=0, keepdims=True)
            db_i += di_raw.sum(axis=0, keepdims=True)
            db_c += dc_raw.sum(axis=0, keepdims=True)
            db_o += do_raw.sum(axis=0, keepdims=True)
            
            # Gradient flowing to h_{t-1}
            d_concat = (df_raw @ self.lstm.W_f + di_raw @ self.lstm.W_i +
                        dc_raw @ self.lstm.W_c + do_raw @ self.lstm.W_o)
            dh_next = d_concat[:, :self.hidden_size]
        
        # Gradient clipping to prevent exploding gradients
        for grad in [dW_f, dW_i, dW_c, dW_o]:
            np.clip(grad, -5, 5, out=grad)
        
        # Update LSTM weights (SGD)
        self.lstm.W_f -= self.lr * dW_f
        self.lstm.W_i -= self.lr * dW_i
        self.lstm.W_c -= self.lr * dW_c
        self.lstm.W_o -= self.lr * dW_o
        self.lstm.b_f -= self.lr * db_f
        self.lstm.b_i -= self.lr * db_i
        self.lstm.b_c -= self.lr * db_c
        self.lstm.b_o -= self.lr * db_o

Step 4: Training Loop with Walk-Forward Awareness

def train_lstm(model, X_train, Y_train, X_val, Y_val,
               epochs=100, batch_size=32, patience=10):
    """Train LSTM with mini-batch SGD and early stopping."""
    n = X_train.shape[0]
    best_val_loss = float('inf')
    wait = 0
    train_losses, val_losses = [], []
    
    for epoch in range(epochs):
        # Mini-batch training (shuffle within train set only)
        indices = np.random.permutation(n)
        epoch_loss = 0.0
        n_batches = 0
        
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            batch_idx = indices[start:end]
            
            X_b = X_train[batch_idx]
            Y_b = Y_train[batch_idx]
            
            # Forward pass
            y_pred = model.forward(X_b)
            loss = model.compute_loss(y_pred, Y_b)
            epoch_loss += loss
            n_batches += 1
            
            # Backward pass
            dh = model.backward_dense()
            model.backward_lstm(dh)
        
        train_loss = epoch_loss / n_batches
        train_losses.append(train_loss)
        
        # Validation loss
        val_pred = model.forward(X_val)
        val_loss = model.compute_loss(val_pred, Y_val)
        val_losses.append(val_loss)
        
        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break
        
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1:3d} | Train Loss: {train_loss:.6f} | Val Loss: {val_loss:.6f}")
    
    return train_losses, val_losses

# Initialize and train
INPUT_SIZE  = 3   # nifty_close, vix_india, fii_net
HIDDEN_SIZE = 32  # LSTM hidden units
HORIZON     = 30  # 30-day forecast

model = LSTMForecaster(INPUT_SIZE, HIDDEN_SIZE, HORIZON, lr=0.0005)
train_losses, val_losses = train_lstm(
    model, X_train, Y_train, X_val, Y_val,
    epochs=100, batch_size=32, patience=15
)

Epoch  10 | Train Loss: 0.847231 | Val Loss: 0.912445
Epoch  20 | Train Loss: 0.631204 | Val Loss: 0.698112
Epoch  30 | Train Loss: 0.485922 | Val Loss: 0.554318
Epoch  40 | Train Loss: 0.391004 | Val Loss: 0.478221
Epoch  50 | Train Loss: 0.328775 | Val Loss: 0.432109
Epoch  60 | Train Loss: 0.285410 | Val Loss: 0.419873
Early stopping at epoch 68

Step 5: Evaluation with All Four Metrics

def evaluate_forecast(y_true, y_pred, mu_target, std_target):
    """Compute MAE, RMSE, MAPE, and Directional Accuracy.
    
    Args:
        y_true, y_pred: normalized targets and predictions, shape (n_samples, horizon)
        mu_target, std_target: for de-normalization
    """
    # De-normalize to original scale (Nifty points)
    y_true_orig = y_true * std_target + mu_target
    y_pred_orig = y_pred * std_target + mu_target
    
    # MAE
    mae = np.mean(np.abs(y_true_orig - y_pred_orig))
    
    # RMSE
    rmse = np.sqrt(np.mean((y_true_orig - y_pred_orig) ** 2))
    
    # MAPE
    mape = np.mean(np.abs(y_true_orig - y_pred_orig) / 
                   (np.abs(y_true_orig) + 1e-8)) * 100
    
    # Directional Accuracy
    # Compare day-over-day direction: did we predict up/down correctly?
    true_dir = np.diff(y_true_orig, axis=1)
    pred_dir = np.diff(y_pred_orig, axis=1)
    da = np.mean((np.sign(true_dir) == np.sign(pred_dir))) * 100
    
    print(f"╔══════════════════════════════════╗")
    print(f"║   NIFTY 50 FORECAST EVALUATION   ║")
    print(f"╠══════════════════════════════════╣")
    print(f"║  MAE:   {mae:8.2f} points          ║")
    print(f"║  RMSE:  {rmse:8.2f} points          ║")
    print(f"║  MAPE:  {mape:8.2f}%                ║")
    print(f"║  DA:    {da:8.2f}%                ║")
    print(f"╚══════════════════════════════════╝")
    
    return {'mae': mae, 'rmse': rmse, 'mape': mape, 'da': da}

# Evaluate on test set
test_pred = model.forward(X_test)
metrics = evaluate_forecast(Y_test, test_pred, mu[0], std[0])

╔══════════════════════════════════╗
║   NIFTY 50 FORECAST EVALUATION   ║
╠══════════════════════════════════╣
║  MAE:     312.47 points          ║
║  RMSE:    428.91 points          ║
║  MAPE:      1.72%                ║
║  DA:       52.34%                ║
╚══════════════════════════════════╝

"My LSTM predicts Nifty with 1.72% MAPE, so I'll start trading!" Stop. A DA of 52.34% is barely better than a coin flip (50%). The low MAPE is misleading: the model learned to roughly track the level but can't predict direction. See Section 9 for why this happens.
Section 5

Industry Code: Production LSTM with TensorFlow/Keras

# ──────────────────────────────────────────────────
# Production-Grade LSTM for Multi-Step Forecasting
# TensorFlow/Keras Implementation
# ──────────────────────────────────────────────────

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.preprocessing import StandardScaler
import numpy as np

# ── Model Architecture ──
def build_lstm_model(lookback, n_features, horizon):
    """Build a stacked Bidirectional LSTM for time series."""
    model = Sequential([
        # Layer 1: Bidirectional LSTM (captures both directions)
        Bidirectional(
            LSTM(128, return_sequences=True,
                 dropout=0.2, recurrent_dropout=0.1),
            input_shape=(lookback, n_features)
        ),
        
        # Layer 2: Second LSTM layer
        LSTM(64, return_sequences=False,
             dropout=0.2, recurrent_dropout=0.1),
        
        # Dense head for multi-step output
        Dense(64, activation='relu'),
        Dropout(0.1),
        Dense(horizon)  # MIMO: output all 30 steps at once
    ])
    
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss='mse',
        metrics=['mae']
    )
    return model

# ── Build & Train ──
LOOKBACK   = 60
N_FEATURES = 3    # nifty_close, vix_india, fii_net
HORIZON    = 30

model = build_lstm_model(LOOKBACK, N_FEATURES, HORIZON)
model.summary()

# Callbacks
callbacks = [
    EarlyStopping(monitor='val_loss', patience=15,
                  restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                      patience=5, min_lr=1e-6)
]

history = model.fit(
    X_train, Y_train,
    epochs=200,
    batch_size=32,
    validation_data=(X_val, Y_val),
    callbacks=callbacks,
    verbose=1
)

# ── Evaluate ──
test_pred = model.predict(X_test)
print(f"Test MAE: {np.mean(np.abs(Y_test - test_pred)):.4f} (normalized)")

TCN Alternative with Keras

# Temporal Convolutional Network using keras-tcn
# pip install keras-tcn
from tcn import TCN

def build_tcn_model(lookback, n_features, horizon):
    model = Sequential([
        TCN(
            nb_filters=64,
            kernel_size=3,
            dilations=[1, 2, 4, 8, 16, 32],
            padding='causal',
            dropout_rate=0.2,
            return_sequences=False,
            input_shape=(lookback, n_features)
        ),
        Dense(64, activation='relu'),
        Dense(horizon)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model

tcn_model = build_tcn_model(LOOKBACK, N_FEATURES, HORIZON)
print(f"TCN parameters: {tcn_model.count_params():,}")
print(f"LSTM parameters: {model.count_params():,}")
# TCNs often need fewer parameters than stacked LSTMs and train faster because
# convolutions run in parallel over time; compare the two counts printed above.

Walk-Forward Validation: Production Pipeline

# Full walk-forward validation pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error
import gc

def walk_forward_validate(data, lookback=60, horizon=30, 
                          n_splits=5, build_fn=build_lstm_model):
    """Walk-forward validation with proper temporal splits."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    results = []
    
    for fold, (train_idx, val_idx) in enumerate(tscv.split(data)):
        print(f"\n{'='*50}")
        print(f"Fold {fold+1}/{n_splits}")
        print(f"  Train: indices {train_idx[0]}–{train_idx[-1]}")
        print(f"  Val:   indices {val_idx[0]}–{val_idx[-1]}")
        
        train_data = data[train_idx]
        val_data   = data[val_idx]
        
        # Scale on training data ONLY
        scaler = StandardScaler()
        train_scaled = scaler.fit_transform(train_data)
        val_scaled   = scaler.transform(val_data)
        
        # Create windows
        X_tr, Y_tr = create_windows(train_scaled, lookback, horizon)
        X_vl, Y_vl = create_windows(val_scaled,   lookback, horizon)
        
        if len(X_vl) == 0:
            print("  Skipping: not enough val data for windowing")
            continue
        
        # Build fresh model for each fold
        fold_model = build_fn(lookback, data.shape[1], horizon)
        fold_model.fit(X_tr, Y_tr, epochs=100, batch_size=32,
                       validation_data=(X_vl, Y_vl),
                       callbacks=[EarlyStopping(patience=10,
                                  restore_best_weights=True)],
                       verbose=0)
        
        val_pred = fold_model.predict(X_vl, verbose=0)
        
        # De-normalize for interpretable metrics
        mu_y, std_y = scaler.mean_[0], scaler.scale_[0]
        Y_vl_orig   = Y_vl * std_y + mu_y
        val_pred_orig = val_pred * std_y + mu_y
        
        mae  = mean_absolute_error(Y_vl_orig.flatten(), val_pred_orig.flatten())
        rmse = np.sqrt(mean_squared_error(Y_vl_orig.flatten(), val_pred_orig.flatten()))
        
        results.append({'fold': fold+1, 'mae': mae, 'rmse': rmse})
        print(f"  MAE: {mae:.2f} | RMSE: {rmse:.2f}")
        
        # Free memory
        del fold_model
        gc.collect()
    
    print(f"\n{'='*50}")
    print(f"Mean MAE across folds: {np.mean([r['mae'] for r in results]):.2f}")
    return results

results = walk_forward_validate(data)
Section 6

Visual Diagrams

LSTM Unrolled for Time Series Forecasting

LSTM Unrolled for 60-day Lookback → 30-day Forecast

Multivariate Input Features (at each timestep):
┌──────────────────────────────────────────┐
│ x_t = [Nifty_close, VIX_India, FII_net]  │
└──────────────────────────────────────────┘

  x₁     x₂     x₃            x₅₉    x₆₀
  ↓      ↓      ↓             ↓      ↓
┌───┐  ┌───┐  ┌───┐         ┌───┐  ┌───┐
│   │→ │   │→ │   │→  ···  →│   │→ │   │   ← LSTM cells
│ L │  │ L │  │ L │         │ L │  │ L │     (shared weights)
│ S │  │ S │  │ S │         │ S │  │ S │
│ T │  │ T │  │ T │         │ T │  │ T │
│ M │  │ M │  │ M │         │ M │  │ M │
└───┘  └───┘  └───┘         └───┘  └───┘
  h₁     h₂     h₃            h₅₉    h₆₀   ← final hidden state
                                     ↓
                              ┌─────────────┐
                              │  Dense(64)  │
                              │    ReLU     │
                              ├─────────────┤
                              │  Dense(30)  │  ← MIMO output
                              └─────────────┘
                                     ↓
                        ┌─────────────────────────┐
                        │  [ŷ₆₁, ŷ₆₂, ... , ŷ₉₀]  │
                        │     30-day forecast     │
                        └─────────────────────────┘

TCN Dilated Causal Convolution

Dilated Causal Convolution (kernel=2)

d=1:  ●────●    ●────●    ●────●    ●────●     dilation = 1
      │    │    │    │    │    │    │    │
d=2:  ●─────────●         ●─────────●          dilation = 2
      │         │         │         │
d=4:  ●───────────────────────●                dilation = 4
      │                       │
d=8:  ●──────────────────────────────●         dilation = 8

← Receptive field grows EXPONENTIALLY with depth →

Layer 1 (d=1): sees 2 timesteps
Layer 2 (d=2): sees 4 timesteps
Layer 3 (d=4): sees 8 timesteps
Layer 4 (d=8): sees 16 timesteps

With 6 layers: receptive field = 2⁶ = 64 timesteps!
All computed in PARALLEL (unlike sequential LSTM)
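
The receptive-field arithmetic in this diagram can be checked in a few lines. This is a minimal sketch that assumes one causal convolution per dilation level, so each level adds (kernel_size − 1) × dilation timesteps. The function name `tcn_receptive_field` is hypothetical, and libraries that stack two convolutions inside each residual block will report a larger field.

```python
def tcn_receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated causal convolutions.

    Assumes ONE convolution per dilation level (the simplified view in
    the diagram): each level adds (kernel_size - 1) * dilation timesteps.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


# kernel=2 with doubling dilations: the field doubles with every layer
for depth in range(1, 7):
    dilations = [2 ** i for i in range(depth)]
    print(f"{depth} layer(s), dilations={dilations}: "
          f"{tcn_receptive_field(2, dilations)} timesteps")
```

With a 60-day lookback, six such layers (64 timesteps) are already enough to cover the whole input window.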

Walk-Forward vs Wrong Cross-Validation

❌ WRONG: Standard k-Fold Cross-Validation on Time Series

Time:   [Jan][Feb][Mar][Apr][May][Jun][Jul][Aug][Sep][Oct]
Fold1:  [VAL][ T ][ T ][ T ][VAL][ T ][ T ][ T ][ T ][ T ]
Fold2:  [ T ][VAL][ T ][ T ][ T ][ T ][VAL][ T ][ T ][ T ]
        ↑ LEAKAGE! Feb data (future) used to predict Jan (past)

✅ RIGHT: Walk-Forward Validation

Time:   [Jan][Feb][Mar][Apr][May][Jun][Jul][Aug][Sep][Oct]
Fold1:  [ T ][ T ][ T ][ T ][ T ][ T ][VAL][VAL]
Fold2:  [ T ][ T ][ T ][ T ][ T ][ T ][ T ][ T ][VAL][VAL]
        ↑ Train ALWAYS before Val. Time flows left → right.

N-BEATS Block Architecture

N-BEATS: Stack of Blocks with Residual Learning

Input: lookback window [x₁, x₂, ..., x_L]
              │
       ┌──────┴───────┐
       │   Block 1    │
       │ ┌──────────┐ │
       │ │ FC → FC  │ │
       │ │ FC → FC  │ │
       │ └────┬─────┘ │
       │   ┌──┴───┐   │
       │   ↓      ↓   │
       │ [back] [fore]│   ← backcast + forecast
       └───┬──────┬───┘
           │      └──→ forecast₁
  residual = x − back
           │
       ┌───┴──────────┐
       │   Block 2    │
       │ ┌──────────┐ │
       │ │  FC × 4  │ │
       │ └────┬─────┘ │
       │   ┌──┴───┐   │
       │  [b]    [f]  │
       └───┬──────┬───┘
           │      └──→ forecast₂
       residual
           ↓
          ···

FINAL = forecast₁ + forecast₂ + ... + forecast_K
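
The residual flow in this diagram can be sketched as a NumPy forward pass. This is an illustration only: the weights are random and untrained, each block uses two FC layers rather than four to keep the code short, and the helper names (`make_block`, `block_forward`, `nbeats_forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_block(lookback, horizon, hidden=16):
    """Random (untrained) weights for one generic N-BEATS-style block."""
    return {
        "W1": rng.normal(0, 0.1, (lookback, hidden)),
        "W2": rng.normal(0, 0.1, (hidden, hidden)),
        "Wb": rng.normal(0, 0.1, (hidden, lookback)),  # backcast head
        "Wf": rng.normal(0, 0.1, (hidden, horizon)),   # forecast head
    }

def block_forward(block, x):
    """FC -> ReLU -> FC -> ReLU, then two linear heads."""
    h = np.maximum(0.0, x @ block["W1"])
    h = np.maximum(0.0, h @ block["W2"])
    return h @ block["Wb"], h @ block["Wf"]            # (backcast, forecast)

def nbeats_forward(blocks, x):
    """Doubly residual stacking: subtract backcasts, sum forecasts."""
    residual = x.copy()
    forecasts = []
    for block in blocks:
        backcast, forecast = block_forward(block, residual)
        residual = residual - backcast   # what this block could not explain
        forecasts.append(forecast)
    return np.sum(forecasts, axis=0), forecasts, residual

LOOKBACK, HORIZON = 60, 30
blocks = [make_block(LOOKBACK, HORIZON) for _ in range(3)]
x = rng.normal(size=(4, LOOKBACK))       # batch of 4 lookback windows
final, per_block, last_residual = nbeats_forward(blocks, x)
print(final.shape)                       # (4, 30)
```

Each block only ever sees the part of the input that earlier blocks failed to reconstruct, while the output is the plain sum of the per-block forecasts.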
Section 7

Worked Example: Predicting Next Week's Nifty 50

Let's trace through a concrete LSTM prediction, step by step, for a 5-day (1-week) ahead forecast.

Given Data

| Day | Nifty Close | VIX India | FII Net (₹ Cr) |
|---|---|---|---|
| t−4 (Mon) | 22,450 | 13.2 | +1,200 |
| t−3 (Tue) | 22,510 | 12.8 | +800 |
| t−2 (Wed) | 22,480 | 13.5 | −400 |
| t−1 (Thu) | 22,390 | 14.1 | −1,500 |
| t (Fri) | 22,350 | 14.8 | −2,000 |

Step 1: Observation

VIX India is rising from 13.2 → 14.8 (increasing fear). FII flows have turned negative (−₹1,500 Cr, −₹2,000 Cr). Nifty is declining from 22,510 → 22,350. The pattern suggests continued selling pressure.

Step 2: Normalization

Using training set statistics: μ_nifty = 21,800, σ_nifty = 1,200; μ_vix = 13.5, σ_vix = 2.1; μ_fii = 500, σ_fii = 1,800 (₹ Cr)

x_t(nifty) = (22,350 − 21,800) / 1,200 = 0.458
x_t(vix) = (14.8 − 13.5) / 2.1 = 0.619
x_t(fii) = (−2,000 − 500) / 1,800 = −1.389
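
The three z-scores can be reproduced in NumPy. This is a small sketch: the VIX and FII means and standard deviations are the values implied by the equations above, and the array names are illustrative.

```python
import numpy as np

# Friday's raw features: [Nifty close, VIX India, FII net (Rs Cr)]
x_raw = np.array([22350.0, 14.8, -2000.0])

# Training-set statistics used in the worked example
mu = np.array([21800.0, 13.5, 500.0])
sigma = np.array([1200.0, 2.1, 1800.0])

x_norm = (x_raw - mu) / sigma
print(np.round(x_norm, 3))        # approximately 0.458, 0.619, -1.389

# Inverting the transform recovers the original values
x_back = x_norm * sigma + mu
print(x_back)
```

The same `mu` and `sigma` must be reused at inference time; recomputing them on new data would silently change the scale the network was trained on.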

Step 3: LSTM Processing (Simplified)

The LSTM processes the last 60 days. At the final timestep:

  • Forget gate f_t ≈ 0.3 for the "calm market" memory (VIX was low 2 weeks ago, so forget it)
  • Input gate i_t ≈ 0.8 for the "FII selling" signal (strong, recent, write it in)
  • Cell state now encodes: "bearish regime: falling prices + rising volatility + FII outflow"
  • Output h₆₀ is a 32-dim vector summarizing the market state
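
To make the gate values concrete, the standard cell update c_t = f_t · c_{t−1} + i_t · c̃_t can be evaluated for a single cell-state dimension. Only f_t = 0.3 and i_t = 0.8 come from the example; the old cell state, the candidate value, and the output gate below are hypothetical numbers chosen for illustration.

```python
import numpy as np

f_t = 0.3        # forget gate from the example: keep 30% of the old memory
i_t = 0.8        # input gate from the example: write 80% of the candidate

# Hypothetical values, chosen only for illustration
c_prev = 0.9     # old cell state, positive = "calm market"
c_tilde = -0.7   # candidate value, negative = "FII selling / bearish"
o_t = 0.6        # output gate

c_t = f_t * c_prev + i_t * c_tilde    # 0.27 - 0.56 = -0.29
h_t = o_t * np.tanh(c_t)              # hidden state passed to the Dense head
print(f"c_t = {c_t:.2f}, h_t = {h_t:.3f}")
```

The sign flip from +0.9 to −0.29 is the numerical version of the regime change: most of the calm-market memory is discarded and the bearish signal dominates the new state.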

Step 4: Dense Layer → 5-Day Forecast

h₆₀ passes through Dense(64, relu) → Dense(5):

| Day | Predicted (normalized) | De-normalized | Actual |
|---|---|---|---|
| t+1 (Mon) | 0.412 | 22,294 | 22,280 |
| t+2 (Tue) | 0.385 | 22,262 | 22,310 |
| t+3 (Wed) | 0.371 | 22,245 | 22,190 |
| t+4 (Thu) | 0.340 | 22,208 | 22,150 |
| t+5 (Fri) | 0.325 | 22,190 | 22,220 |

Step 5: Evaluate

MAE = (|14| + |48| + |55| + |58| + |30|) / 5 = 41.0 points
MAPE = mean(0.063%, 0.215%, 0.248%, 0.262%, 0.135%) = 0.185%
Directional Accuracy = 2/4 correct day-over-day directions = 50% (actual moves: Tue up, Wed down, Thu down, Fri up; the forecast falls every day, so Tue ✗, Wed ✓, Thu ✓, Fri ✗)
Notice the pattern: the model correctly captured the bearish trend (downward bias across the week) but missed both day-to-day reversals and the exact magnitudes. This is typical: LSTMs are better at regime and trend identification than precise point prediction.
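
The Step 5 numbers can be recomputed directly from the forecast table. This sketch uses the same day-over-day convention as `evaluate_forecast` (differences taken inside the forecast horizon), so five forecast days yield four direction comparisons.

```python
import numpy as np

pred = np.array([22294, 22262, 22245, 22208, 22190], dtype=float)
actual = np.array([22280, 22310, 22190, 22150, 22220], dtype=float)

abs_err = np.abs(actual - pred)
mae = abs_err.mean()
mape = (abs_err / np.abs(actual)).mean() * 100

# Direction of each day-over-day move within the 5-day horizon
true_dir = np.sign(np.diff(actual))
pred_dir = np.sign(np.diff(pred))
da = np.mean(true_dir == pred_dir) * 100

print(f"MAE  = {mae:.1f} points")
print(f"MAPE = {mape:.3f}%")
print(f"DA   = {da:.0f}%")
```

Because the forecast declines monotonically, it can only match the days on which the index actually fell.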
Section 8

Case Study: Jio Network Traffic Prediction

🏢 Reliance Jio: Predicting Network Traffic for 400M+ Users

The Problem

Reliance Jio serves 400+ million subscribers across India, making it the world's largest mobile data network. Network capacity must be dynamically allocated across 200,000+ cell towers. Under-provisioning causes call drops and buffering; over-provisioning wastes hundreds of crores of rupees in idle infrastructure.

Objective: Predict per-tower network traffic 4 hours ahead to enable proactive capacity allocation.

The Data

| Feature | Granularity | Description |
|---|---|---|
| Data throughput (GB) | 15-min intervals | Primary target: data consumed per tower |
| Active users | 15-min intervals | Connected devices per tower |
| Voice call minutes | 15-min intervals | Voice traffic load |
| Day of week | Daily | Weekday vs weekend patterns |
| Festival calendar | Event-based | Diwali, IPL, New Year's Eve |
| Cricket schedule | Event-based | IPL/T20 matches → streaming surges |
| Weather | Hourly | Rain → people stay indoors → higher data usage |
| Nearby events | Event-based | Concerts, elections, religious gatherings |

The Architecture: LSTM-Transformer Hybrid

Jio Network Traffic Prediction Pipeline

Raw Data (15-min intervals, 200K towers)
           │
           ▼
┌────────────────────────────┐
│ Feature Engineering        │
│  • Lag features (1h,4h,24h)│
│  • Rolling mean/std        │
│  • Day/hour encoding       │
│  • IPL match indicator     │
│  • Festival proximity      │
└──────────┬─────────────────┘
           │
           ▼
┌────────────────────────────┐
│ Tower Clustering (K=50)    │  ← Group similar towers
│ (reduces 200K → 50 types)  │    by usage pattern
└──────────┬─────────────────┘
           │
           ▼
┌────────────────────────────┐
│ LSTM Encoder               │  ← Captures temporal
│ 2 layers × 128 units       │    patterns within
│ Lookback: 96 steps (24h)   │    each tower type
└──────────┬─────────────────┘
           │
           ▼
┌────────────────────────────┐
│ Transformer Decoder        │  ← Cross-attention to
│ 4 heads, 2 layers          │    capture cross-tower
│ + positional encoding      │    dependencies
└──────────┬─────────────────┘
           │
           ▼
┌────────────────────────────┐
│ Output: 16 steps (4 hrs)   │  ← Per-tower forecast
│ Dense(16) per cluster      │    at 15-min resolution
└────────────────────────────┘
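
The feature-engineering box can be sketched for a single tower in NumPy. This is an assumption-laden illustration, not Jio's pipeline: the traffic series is synthetic, the function name `build_features` is hypothetical, and only the lag, rolling-mean, and hour-encoding features are shown.

```python
import numpy as np

STEPS_PER_HOUR = 4                       # 15-minute intervals
LAGS = {"lag_1h": 4, "lag_4h": 16, "lag_24h": 96}

def build_features(traffic, hour_of_day):
    """Lag, trailing-mean and cyclical-hour features for one tower.

    traffic:     1-D array of throughput per 15-min step
    hour_of_day: 1-D array of the same length, values in [0, 24)
    Rows without a full 24h history are dropped.
    """
    max_lag = max(LAGS.values())
    t = np.arange(max_lag, len(traffic))          # steps with full history
    cols = [traffic[t - lag] for lag in LAGS.values()]

    # Trailing 1-hour mean ending at t-1, so the target step is excluded
    csum = np.concatenate([[0.0], np.cumsum(traffic)])
    cols.append((csum[t] - csum[t - STEPS_PER_HOUR]) / STEPS_PER_HOUR)

    # Cyclical encoding keeps 23:45 close to 00:00
    angle = 2 * np.pi * hour_of_day[t] / 24.0
    cols += [np.sin(angle), np.cos(angle)]

    return np.column_stack(cols), traffic[t]

# Synthetic week of traffic with a daily cycle (NOT real tower data)
rng = np.random.default_rng(0)
n = 96 * 7
hours = (np.arange(n) % 96) / STEPS_PER_HOUR      # 0.0, 0.25, ..., 23.75
traffic = 50 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, n)

X, y = build_features(traffic, hours)
print(X.shape, y.shape)                           # (576, 6) (576,)
```

Every feature at step t uses only values from before t, which is what keeps the 4-hour-ahead forecast free of look-ahead bias.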

Key Design Decisions

  1. Tower clustering reduces 200,000 individual models to 50 cluster models. Towers in a cluster share a trained model but receive tower-specific inputs.
  2. LSTM + Transformer hybrid: LSTM captures within-tower temporal patterns; Transformer cross-attention captures spatial correlations (e.g., when one tower is overloaded, adjacent towers absorb traffic).
  3. Cricket-aware training: Separate calendar encoding for match days. During high-profile matches (IPL fixtures, or internationals such as India vs Pakistan), data traffic spikes 3–5×, so the model needs explicit match-day features, not just "day of week."
  4. Walk-forward retraining: Models are retrained weekly on a rolling 6-month window to adapt to changing user behavior and new tower deployments.
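
The clustering step in decision 1 can be sketched with a bare-bones k-means in NumPy. Everything here is illustrative: the two tower features (peak hour and weekend-to-weekday traffic ratio) are hypothetical, the data is synthetic, and K = 2 stands in for the K = 50 used in the case study. In practice a library implementation such as scikit-learn's `KMeans` would be the usual choice.

```python
import numpy as np

def assign(X, centroids):
    """Index of the nearest centroid for every row of X."""
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Bare-bones Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:     # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return assign(X, centroids), centroids

# Synthetic towers: [peak hour, weekend/weekday traffic ratio]
rng = np.random.default_rng(42)
commercial = rng.normal([13.0, 0.5], [1.0, 0.05], size=(100, 2))
residential = rng.normal([21.0, 1.2], [1.0, 0.05], size=(100, 2))
towers = np.vstack([commercial, residential])

# Standardize so that both features contribute equally to the distance
Z = (towers - towers.mean(axis=0)) / towers.std(axis=0)
labels, centroids = kmeans(Z, k=2)
print(np.bincount(labels))
```

Each resulting cluster would then get its own forecasting model, trained on the pooled history of all towers carrying that label.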

Results

| Metric | Baseline (SARIMA) | LSTM Only | LSTM-Transformer |
|---|---|---|---|
| MAE (GB/tower/15min) | 12.4 | 7.8 | 5.2 |
| MAPE | 18.3% | 11.5% | 7.8% |
| Peak hour MAPE | 28.7% | 16.2% | 9.4% |
| Inference time (50 towers) | 2.1s | 0.8s | 1.2s |

Business Impact

  • ₹840 crore annual savings in infrastructure costs from smarter capacity allocation
  • 38% reduction in congestion-related call drops
  • 12% improvement in average video streaming quality (fewer buffering events)
  • The model runs on NVIDIA A100 GPUs in Jio's private data centers, with inference serving 200K towers every 15 minutes
The IPL Effect: During IPL 2024, Jio reported peak data consumption of 28.5 petabytes in a single day, more than many countries consume in a month. The LSTM-Transformer model's ability to predict these spikes 4 hours ahead enabled pre-emptive bandwidth allocation, keeping streaming quality above 720p for 94% of users even during India vs Australia finals.
Section 9

Common Mistakes & Critical Warnings

🚨 WARNING: Why Nifty 50 Prediction Is Fundamentally Hard

The Efficient Market Hypothesis (EMH) states that stock prices already reflect all available information. In its semi-strong form, no model using publicly available data (historical prices, fundamentals, news) can consistently outperform the market.

Implications for your LSTM:

1. Low Directional Accuracy: DA ≈ 50–53% is typical. If your model shows DA > 60% on test data, suspect data leakage or survivorship bias before celebrating.

2. Backtest ≠ Live: A model that looks great in backtest often fails live because of (a) slippage and transaction costs, (b) regime changes, and (c) the market adapting to the same signals your model uses.

3. Misleading MAPE: A "1.5% MAPE" on Nifty sounds impressive, but a naive model that predicts tomorrow = today already achieves ~1.2% MAPE. Always compare against naive baselines.

"My model has 98% R² on Nifty prediction!" This almost certainly means you are predicting levels (e.g., Nifty at 22,350) rather than returns. Predicting levels produces deceptively high R² because the series is non-stationary with a strong trend. Predict returns or differences instead, and watch R² drop to 0.01–0.05 (which is reality).
"I used the entire dataset for feature engineering, then split into train/test." If a feature looks at the whole series before the split (for example a centered moving average or a full-sample z-score), rows near the train/test boundary absorb test-period information. A strictly trailing window such as a 50-day SMA only looks backward, but the safest habit is still to compute all rolling features WITHIN each walk-forward fold, so that nothing from the validation period can reach the training rows.
"More LSTM layers = better forecast." Stacking 5+ LSTM layers on a small time series (1000 points) leads to massive overfitting. For typical financial time series, 1–2 LSTM layers with 32–128 units and aggressive dropout (0.2–0.4) usually work best.
"I'll use LSTM for electricity demand because it's always better than ARIMA." Not always! For univariate, short-horizon (1–3 steps), well-behaved seasonal data, auto_arima or even Exponential Smoothing (ETS) can beat LSTM. Deep learning shines with: (a) multivariate inputs, (b) long horizons, (c) complex/multiple seasonalities, (d) non-linear patterns.
"I'll train one global model for all 700 Zepto dark stores." Traffic patterns in Koramangala (Bengaluru) are completely different from Bandra (Mumbai) or Salt Lake (Kolkata). Consider: (a) clustering stores by pattern type and training per-cluster models, or (b) using a global model with store-specific embeddings.
Section 10

Comparison: Time Series Forecasting Architectures

| Feature | ARIMA | LSTM | TCN | N-BEATS | Informer |
|---|---|---|---|---|---|
| Type | Statistical | RNN-based | CNN-based | FC-based | Transformer |
| Multivariate | ❌ (≈ via ARIMAX) | ✅ Native | ✅ Native | ❌ Univariate | ✅ Native |
| Non-linear | ❌ | ✅ | ✅ | ✅ | ✅ |
| Long-range deps | Limited | Good (100s) | Excellent | Good | Excellent (1000s) |
| Training speed | Fast | Slow (sequential) | Fast (parallel) | Fast | Medium |
| Interpretability | ✅ High | ❌ Black box | ❌ Black box | ✅ (interpretable variant) | ⚠️ Attention weights |
| Data requirement | Low (~100) | Medium (~1000) | Medium (~1000) | Low (~500) | High (~5000+) |
| Multi-step | Recursive only | MIMO natural | MIMO natural | MIMO native | Generative decoder |
| Best for | Short, univariate | General purpose | Long sequences | Competition/benchmark | Very long sequences |
| Indian example | RBI inflation forecast | Zerodha volatility | Jio traffic | Flipkart demand | IMD monsoon |
Rule of thumb for model selection:
  • < 500 data points, univariate → ARIMA/ETS
  • 500–5000 points, multivariate, medium horizon → LSTM or TCN
  • Need interpretable decomposition → N-BEATS (interpretable)
  • Very long sequences (> 5000), large dataset → PatchTST / Informer
  • Production at scale (1000s of series) → N-BEATS or TCN (fastest)
Section 11

Exercises

Section A: Multiple Choice Questions

Q1

Which property must a time series possess for ARIMA to be directly applicable without differencing?

  A) Seasonality
  B) Stationarity
  C) Monotonic trend
  D) High autocorrelation at all lags
✅ B) Stationarity: ARIMA requires constant mean and variance. If the series is non-stationary, the "I" (Integrated) component applies differencing to achieve stationarity before the AR and MA components are fitted.
Remember | Beginner
Q2

In an LSTM cell, which gate is responsible for deciding what new information to store in the cell state?

  A) Forget gate
  B) Input gate
  C) Output gate
  D) Reset gate
✅ B) Input gate: The input gate i_t = σ(W_i·[h_{t-1}, x_t] + b_i) controls how much of the candidate value C̃_t is written into the cell state. The forget gate decides what to discard, and the output gate controls what is exposed as the hidden state. The reset gate belongs to the GRU, not the LSTM.
Remember | Beginner
Q3

A data scientist applies StandardScaler.fit_transform() on the entire dataset before splitting into train/val/test. What error has been introduced?

  A) Underfitting due to aggressive normalization
  B) Data leakage: future data statistics contaminate training
  C) Loss of seasonality information
  D) Gradient explosion during backpropagation
✅ B) Data leakage: By fitting the scaler on the full dataset, the mean and standard deviation include future (val/test) data, subtly leaking information into the training set. The correct approach: fit on training data only, then transform val/test using training statistics.
Understand | Intermediate
Q4

Which multi-step forecasting strategy outputs all H future timesteps simultaneously from a single model?

  A) Recursive
  B) Direct
  C) MIMO
  D) Autoregressive
✅ C) MIMO (Multiple Input, Multiple Output): MIMO trains one model with output dimension H, predicting all future steps at once. Recursive feeds predictions back iteratively (error accumulation). Direct trains H separate models. Autoregressive is similar to recursive.
Remember | Beginner
Q5

A TCN with kernel size 3 and dilation factors [1, 2, 4, 8, 16] has a receptive field of:

  A) 15 timesteps
  B) 31 timesteps
  C) 63 timesteps
  D) 93 timesteps
✅ C) 63 timesteps: For a TCN with kernel k and L layers with dilations d_l, the receptive field = 1 + (k−1) × Σd_l = 1 + 2×(1+2+4+8+16) = 1 + 2×31 = 63. Each layer adds (k−1)×d_l to the receptive field.
Apply | Intermediate
Q6

An LSTM model predicts Nifty 50 closing prices with MAPE = 1.5% and Directional Accuracy = 49%. What should you conclude?

  A) The model is excellent: 1.5% error is very low
  B) The model is useless for trading: it can't predict direction better than random
  C) The model needs more LSTM layers to improve
  D) The MAPE is wrong and should be recalculated
✅ B) Useless for trading: DA of 49% is worse than a coin flip (50%). The low MAPE is misleading because predicting "tomorrow ≈ today" already gives ~1.2% MAPE on Nifty. The model merely learned to track the level without capturing directional changes, which is what matters for trading.
Evaluate | Advanced
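
The reasoning in this answer can be checked on a synthetic random walk, where by construction no model has any directional skill. This sketch is illustrative only: the series is simulated (not real Nifty data), and the volatility and drift values are assumptions. The same series also illustrates the R² warning from Section 9.

```python
import numpy as np

def mape(actual, pred):
    return np.mean(np.abs(actual - pred) / np.abs(actual)) * 100

def r2(actual, pred):
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic drifting random walk with ~1% daily volatility
rng = np.random.default_rng(7)
log_ret = rng.normal(0.0005, 0.01, size=2000)
levels = 20000 * np.exp(np.cumsum(log_ret))

# Zero-skill "model": tomorrow's close = today's close
naive_pred = levels[:-1]
actual = levels[1:]

naive_mape = mape(actual, naive_pred)        # small despite zero skill
r2_levels = r2(actual, naive_pred)           # close to 1 on levels
# The same forecast, expressed as returns, predicts 0% every day
r2_returns = r2(log_ret[1:], np.zeros(len(log_ret) - 1))

print(f"Persistence MAPE : {naive_mape:.2f}%")
print(f"R2 on levels     : {r2_levels:.4f}")
print(f"R2 on returns    : {r2_returns:.4f}")
```

A forecaster deserves attention only if it beats the persistence numbers at the same horizon; matching them means it has learned "tomorrow ≈ today" and nothing more.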
Q7

Informer's ProbSparse Self-Attention reduces complexity from O(Lยฒ) to:

  A) O(L)
  B) O(L log L)
  C) O(L√L)
  D) O(L log² L)
✅ B) O(L log L): ProbSparse attention selects top-u dominant queries (u = c × ln L) and computes attention only for these, resulting in O(L log L) time and memory complexity. This enables processing sequences of 10,000+ timesteps.
Remember | Intermediate
Q8

In N-BEATS, what does the "backcast" output of each block represent?

  A) The final forecast
  B) The block's reconstruction of the input (lookback period)
  C) The gradient signal for backpropagation
  D) A compressed representation for the next block
✅ B) The block's reconstruction of the input: Each N-BEATS block produces two outputs: a backcast (what it "explains" from the input) and a forecast. The residual (input − backcast) is passed to the next block, enabling progressive refinement. The final forecast is the sum of all blocks' forecast outputs.
Understand | Intermediate
Q9

Which evaluation metric is undefined (or problematic) when the actual value y_i equals zero?

  A) MAE
  B) RMSE
  C) MAPE
  D) Directional Accuracy
✅ C) MAPE: MAPE involves division by |y_i|. When y_i = 0, the formula produces division by zero. This is common in demand forecasting (zero-demand SKUs). Alternatives: symmetric MAPE (sMAPE) or weighted MAPE.
Understand | Beginner
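
The two alternatives named in this answer can be sketched as follows. Several sMAPE variants exist; this one divides by the mean of |actual| and |forecast| and treats a 0/0 term as zero error, which is an assumption rather than a universal convention. The demand values are made up.

```python
import numpy as np

def mape(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100

def smape(y_true, y_pred):
    """Symmetric MAPE; terms where both values are 0 count as 0 error."""
    num = np.abs(y_true - y_pred)
    den = (np.abs(y_true) + np.abs(y_pred)) / 2
    ratio = np.divide(num, den, out=np.zeros_like(num), where=den != 0)
    return ratio.mean() * 100

def wape(y_true, y_pred):
    """Weighted MAPE: total absolute error over total actual volume."""
    return np.abs(y_true - y_pred).sum() / np.abs(y_true).sum() * 100

# Hypothetical daily demand for one SKU, including zero-demand days
y_true = np.array([0.0, 4.0, 10.0, 0.0])
y_pred = np.array([1.0, 5.0, 8.0, 0.0])

with np.errstate(divide="ignore", invalid="ignore"):
    print(f"MAPE : {mape(y_true, y_pred)}")        # not finite: division by zero
print(f"sMAPE: {smape(y_true, y_pred):.2f}%")
print(f"WAPE : {wape(y_true, y_pred):.2f}%")
```

WAPE is often the more stable choice for intermittent demand because a single zero-demand day cannot dominate the score.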
Q10

A Zepto demand forecasting model uses "day of week" as a feature but encodes it as a single integer (0โ€“6). What is the problem?

  A) No problem: integer encoding is standard
  B) The model learns a false ordinal relationship (Sunday=0 < Monday=1 < ... < Saturday=6)
  C) Integer encoding causes gradient explosion
  D) The feature is redundant with "month" encoding
✅ B) False ordinal relationship: Integer encoding implies Sunday < Monday < Tuesday, which is meaningless. Use one-hot encoding (7 binary columns) or cyclical encoding (sin/cos of 2π × day/7) to properly represent the circular nature of days of the week.
Analyze | Intermediate
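
The cyclical encoding mentioned in this answer takes only a few lines. This is a minimal sketch; the function name `encode_day_of_week` and the 0 = Sunday convention are assumptions for the example.

```python
import numpy as np

def encode_day_of_week(day):
    """Map day index 0..6 onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * np.asarray(day) / 7
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

days = np.arange(7)                  # 0 = Sunday, ..., 6 = Saturday
xy = encode_day_of_week(days)

def gap(a, b):
    """Euclidean distance between two encoded days."""
    return np.linalg.norm(xy[a] - xy[b])

# Integer encoding puts Saturday (6) six units away from Sunday (0);
# on the circle they are as close as any other pair of neighbours.
print(f"Sat-Sun: {gap(6, 0):.3f}  Sun-Mon: {gap(0, 1):.3f}  Sun-Wed: {gap(0, 3):.3f}")
```

The same transformation applies to hour of day (divide by 24) and month (divide by 12).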

Section B: Short Answer Questions

B1 Intermediate

Explain why the recursive multi-step strategy suffers from error accumulation, using a concrete example with Nifty 50 prediction.

Model Answer: In recursive forecasting, a 1-step model predicts ŷ(t+1), then uses ŷ(t+1) as input to predict ŷ(t+2), and so on. Each prediction contains some error ε. By step k, the input window already contains the accumulated errors from steps 1 to k−1.

Nifty Example: Suppose the model predicts ŷ(t+1) = 22,350 (actual: 22,300, error = +50). This inflated value now becomes an input for ŷ(t+2), which might come out at 22,380 (actual: 22,250, error = +130). By day 30, the cumulative error can exceed 500 points, making the predictions meaningless. MIMO avoids this by outputting all 30 days simultaneously from the true input window.
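
The accumulation mechanism can be demonstrated with a deliberately simple toy. Everything here is hypothetical: the "model" is a one-line function with a constant +5 point bias, and the series is flat, so any forecast error is purely the result of feeding predictions back in.

```python
import numpy as np

def recursive_forecast(one_step_model, window, horizon):
    """Roll a 1-step model forward, feeding each prediction back as input."""
    window = list(window)
    preds = []
    for _ in range(horizon):
        y_hat = one_step_model(np.array(window))
        preds.append(y_hat)
        window = window[1:] + [y_hat]    # the prediction replaces real data
    return np.array(preds)

# Hypothetical 1-step model with a small systematic bias of +5 points/day
def biased_model(window):
    return window[-1] + 5.0

history = np.full(60, 22300.0)           # flat synthetic history
truth = np.full(30, 22300.0)             # the series stays flat
preds = recursive_forecast(biased_model, history, horizon=30)
errors = preds - truth

# Errors after 1, 10 and 30 steps: 5, 50 and 150 points
print(errors[[0, 9, 29]])
```

A 5-point one-step error looks harmless in isolation, yet it compounds linearly here and reaches 150 points by day 30; with a real non-linear model the growth can be faster.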
B2 Advanced

What is the Efficient Market Hypothesis (EMH), and how does it explain why our LSTM's Directional Accuracy is ~52%?

Model Answer: EMH states that asset prices fully reflect all available information. In its semi-strong form, no model using publicly available data (prices, volumes, fundamentals, news) can consistently predict future prices better than random. Our LSTM uses only historical prices, VIX, and FII data, all of which are publicly available. The ~52% DA (barely above the 50% random baseline) is consistent with EMH: the model extracts minimal signal because the market has already priced in this information. Consistently profitable prediction would require either (a) non-public information, (b) faster execution (latency arbitrage), or (c) exploiting rare, temporary market inefficiencies.
B3 Beginner

Describe two ways to make a non-stationary time series stationary. Give an Indian data example for each.

Model Answer:
1. First differencing y'(t) = y(t) − y(t−1): Removes linear trend. Example: India's monthly GST collection (₹ crores) has a strong upward trend from ₹90,000 Cr (2019) to ₹1,80,000 Cr (2025). First differencing converts it to month-over-month change, making it approximately stationary.

2. Log transformation + differencing: Stabilizes multiplicative variance. Example: the Sensex index from 2000 (4,000) to 2025 (75,000), where absolute daily changes grow with the level (about 100 points in 2000 vs about 1,000 points in 2025). The log transform converts multiplicative changes to additive ones, then differencing removes the trend. Log returns ln(P_t/P_{t-1}) are approximately stationary.
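
Both transformations are one-liners in NumPy. The sketch below applies them to a synthetic exponentially growing index (simulated, not real Sensex data) and uses a crude stationarity check, comparing the standard deviation of the second half of the sample with the first. The helper `half_std_ratio` is hypothetical and is not a substitute for a formal test such as ADF.

```python
import numpy as np

# Synthetic index with multiplicative growth
rng = np.random.default_rng(3)
log_ret_true = rng.normal(0.0005, 0.01, size=6000)
prices = 4000 * np.exp(np.cumsum(log_ret_true))

first_diff = np.diff(prices)              # y'(t) = y(t) - y(t-1)
log_returns = np.diff(np.log(prices))     # ln(P_t / P_{t-1})

def half_std_ratio(x):
    """Std of the second half divided by std of the first half."""
    mid = len(x) // 2
    return x[mid:].std() / x[:mid].std()

print(f"first difference : {half_std_ratio(first_diff):.2f}")
print(f"log returns      : {half_std_ratio(log_returns):.2f}")
```

A ratio near 1 for log returns, against a clearly larger ratio for plain differences, is the variance-stabilizing effect described in the second method.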
B4 Intermediate

Why does N-BEATS use both "backcast" and "forecast" outputs in each block? How does this enable residual learning?

Model Answer: Each N-BEATS block outputs: (1) a backcast, its best reconstruction of the input lookback window, and (2) a forecast, its prediction for the horizon. The residual passed to the next block is (input − backcast), representing what this block couldn't explain. This is analogous to ResNet skip connections but applied to the time series itself. Each successive block focuses on increasingly subtle patterns. The final forecast = sum of all blocks' forecasts. This residual structure keeps deep stacks trainable and allows different blocks to specialize (e.g., Block 1 captures trend, Block 2 captures seasonality, Block 3 captures residual noise patterns).
B5 Intermediate

Explain why Jio clusters its 200,000 cell towers into 50 types rather than training a single global model or 200,000 individual models.

Model Answer:
  • 200K individual models: Impractical. There is insufficient data per tower for deep learning (only ~1000 data points per tower), plus 200K training jobs and a massive maintenance burden.
  • 1 global model: A tower in Connaught Place (Delhi CBD, commercial, peak=daytime) has completely different patterns from a tower in a Rajasthan village (residential, peak=evening). A single model can't capture this heterogeneity without becoming enormously complex.
  • 50 clusters: Group towers by usage pattern similarity (K-means on feature vectors: peak hours, weekday/weekend ratio, data vs voice mix). Each cluster gets a dedicated model that learns the shared temporal pattern. Tower-specific inputs (latitude, elevation, capacity) provide individualization within the cluster model. This is the "sweet spot": enough data per model, manageable complexity, and pattern-specific predictions.

Section C: Long Answer Questions

C1 Advanced

(15 marks) Compare LSTM, TCN, and Informer for predicting hourly electricity consumption across India's five regional grids (Northern, Southern, Eastern, Western, North-Eastern). For each architecture, discuss: (a) how it handles the multiple seasonalities (hourly, daily, weekly, monsoon), (b) data requirements, (c) training efficiency, (d) ability to incorporate exogenous variables (temperature, industrial production index, festival calendar), and (e) your recommended architecture with justification.

C2 Advanced

(15 marks) Design a complete time series forecasting pipeline for Zepto's dark store demand prediction. Cover: data sources, feature engineering (including Indian-specific features), model selection, training with walk-forward validation, evaluation metrics, deployment considerations (inference speed for 700 stores × 5,000 SKUs), and how you would handle cold-start for new dark stores with no historical data.

C3 Advanced

(15 marks) Critically evaluate the claim: "Deep learning has made ARIMA obsolete for time series forecasting." Discuss with reference to: (a) the M4/M5 forecasting competitions, (b) computational cost considerations for Indian MSME businesses, (c) interpretability requirements in regulated sectors (RBI, SEBI), (d) data scarcity scenarios, and (e) ensemble approaches that combine statistical and DL methods.

Section D: Programming Exercises

D1 Intermediate

Electricity Consumption Forecasting (LSTM)

Using the UCI "Individual Household Electric Power Consumption" dataset (or simulated Indian grid data), build an LSTM model to predict the next 24 hours of electricity consumption at 1-hour granularity. Requirements:

  • Create proper temporal features: hour of day (cyclical), day of week (cyclical), month, is_weekend
  • Add Indian holiday/festival indicators (Diwali, Holi, Independence Day)
  • Implement walk-forward validation with 5 folds
  • Compare LSTM against a naive baseline (predict today = yesterday)
  • Report MAE, RMSE, and MAPE for each fold and the mean
  • Plot actual vs predicted for the final test fold
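
A possible starting point for three of the requirements above (cyclical features, walk-forward folds, naive baseline) is sketched below in plain Python. The helper names, the expanding-window design, and the sizes in the example (2,400 hourly samples, one-week test windows) are assumptions for illustration, not part of the exercise statement.

```python
import math

def cyclical(value, period):
    """Map a periodic value (e.g. hour 0-23) to a point on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def walk_forward_folds(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) with an expanding training window.

    Each training window ends right before its test window starts, so no
    future observation is available during training.
    """
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        yield range(0, test_start), range(test_start, test_start + test_size)

def seasonal_naive(history, horizon, season=24):
    """Naive baseline for hourly data: repeat the values observed one day earlier."""
    return [history[-season + (h % season)] for h in range(horizon)]

# Hour 23 and hour 0 end up close together, unlike with the raw integer encoding.
print(cyclical(23, 24), cyclical(0, 24))

# Five folds over 2,400 hourly samples, each tested on the following week (168 hours).
for train_idx, test_idx in walk_forward_folds(2400, n_folds=5, test_size=168):
    print(len(train_idx), test_idx.start, test_idx.stop)
```

Any scaler should be fitted on `train_idx` only inside each fold and then applied to `test_idx`, otherwise test statistics leak into training.
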
D2 Advanced

Electricity Consumption Forecasting (Model Comparison)

Extend D1 by implementing three models on the same dataset:

  • ARIMA (using pmdarima.auto_arima)
  • LSTM (2-layer, 64 units, Keras)
  • TCN (using keras-tcn, dilations=[1,2,4,8,16,32])

Compare: (a) test MAE/RMSE/MAPE, (b) training time, (c) inference time per prediction, (d) performance on normal days vs peak days (summer afternoons, festival evenings). Create a comprehensive comparison table and recommend the best model for deployment at a state electricity board (e.g., MSEDCL Maharashtra).
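
Before training the TCN, it helps to check whether the dilation list above gives a receptive field long enough for the seasonality of interest (a week is 168 hours). The sketch below uses the standard formula for a stack of causal dilated convolutions. The kernel sizes and the number of convolutions per dilation level are assumptions; check the documentation of the TCN library you use for its actual defaults.

```python
def receptive_field(kernel_size, dilations, convs_per_level=1, stacks=1):
    """Timesteps (including the current one) visible to a stack of causal dilated convolutions.

    Each convolution with kernel size k and dilation d adds (k - 1) * d steps.
    """
    return 1 + convs_per_level * stacks * (kernel_size - 1) * sum(dilations)

dilations = [1, 2, 4, 8, 16, 32]

# One convolution per level, kernel size 2 (assumed): 64 hours, shorter than one week.
print(receptive_field(2, dilations))

# Two convolutions per residual block, kernel size 3 (assumed): 253 hours, longer than one week.
print(receptive_field(3, dilations, convs_per_level=2))
```

If the receptive field is shorter than the longest pattern the model is expected to learn, either extend the dilation list or supply that pattern through explicit features such as day-of-week encodings.
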

Section E: Mini-Project

🏗️ Mini-Project: Indian Railway Passenger Traffic Predictor

Objective: Build a deep learning system to predict daily passenger count for the top 10 busiest Indian Railway stations (New Delhi, Mumbai CST, Howrah, Chennai Central, Bangalore City, Secunderabad, Ahmedabad, Jaipur, Lucknow, Pune) for the next 7 days.

Requirements:

  1. Data Generation: Create synthetic but realistic daily passenger data for 5 years (2020–2025) incorporating: COVID lockdown dip (Apr–Jun 2020), gradual recovery, festive surges (Diwali, Chhath Puja, Pongal), summer holiday peak (May–Jun), Monday/Friday travel spikes
  2. Feature Engineering: Station-specific features, day of week (cyclical), festival proximity, school holiday indicator, weather (monsoon months for relevant stations)
  3. Models: Implement at least two: (a) Stacked LSTM, (b) TCN or N-BEATS
  4. Validation: Walk-forward validation with monthly retraining
  5. Evaluation: MAE, MAPE, and Directional Accuracy per station
  6. Visualization: Dashboard-style plots: actual vs predicted, error distribution, seasonal decomposition
  7. Report: Which stations are easiest/hardest to predict and why? How does the model handle sudden disruptions (e.g., Odisha train accident June 2023)?
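
For the data generation requirement, one way to begin is a multiplicative generator like the sketch below. Every number in it (base level, multipliers, noise range) is an arbitrary assumption chosen for illustration. Festival surges, gradual recovery, and station-specific effects are deliberately left out and would be added in the same style.

```python
import datetime as dt
import random

def synthetic_daily_passengers(start, end, base=100_000, seed=0):
    """Generate (date, count) pairs; all multipliers are illustrative assumptions."""
    rng = random.Random(seed)   # fixed seed so the series is reproducible
    series = []
    day = start
    while day <= end:
        level = float(base)
        if day.weekday() in (0, 4):            # Monday or Friday travel spike
            level *= 1.15
        if day.month in (5, 6):                # summer holiday peak
            level *= 1.20
        if dt.date(2020, 4, 1) <= day <= dt.date(2020, 6, 30):
            level *= 0.05                      # lockdown dip
        level *= rng.uniform(0.95, 1.05)       # residual noise
        series.append((day, round(level)))
        day += dt.timedelta(days=1)
    return series

data_2020 = synthetic_daily_passengers(dt.date(2020, 1, 1), dt.date(2020, 12, 31))
print(len(data_2020))   # 366, since 2020 is a leap year
print(data_2020[:3])
```
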

Deliverables:

  • Jupyter notebook with complete code
  • Comparison table of all models across all 10 stations
  • 2-page analysis report with recommendations for Indian Railways

Evaluation Criteria:

Component | Marks
Realistic data generation with Indian patterns | 15
Feature engineering quality | 15
Model implementation (correctness, architecture) | 25
Walk-forward validation (no leakage) | 15
Evaluation metrics & visualization | 15
Analysis report & insights | 15
Total | 100
Section 12

Chapter Summary

Key Takeaways: Time Series Forecasting with Deep Learning

  1. Time series have temporal structure (trend, seasonality, cyclicality, and noise) that must be respected in all modeling decisions (splitting, scaling, feature engineering).
  2. Stationarity is the gateway concept. Use the ADF test to check; apply differencing and log transforms to achieve it. Even deep learning benefits from stationary inputs.
  3. ARIMA remains your baseline: fast, interpretable, and competitive for short-horizon univariate forecasting. Always compare DL models against ARIMA and naive baselines.
  4. LSTMs are the workhorse of deep time series: their gated memory naturally handles temporal dependencies. Use 1–2 layers with 32–128 units for typical problems.
  5. TCNs offer faster training (parallel), stable gradients, and flexible receptive fields via dilated causal convolutions. Consider TCNs when training speed matters.
  6. N-BEATS outperformed the winning entry of the M4 Competition using only fully connected layers. Its interpretable variant decomposes forecasts into trend and seasonal components, which is valuable for business stakeholders.
  7. Informer brings Transformers to very long input sequences by reducing attention cost to O(L log L) with ProbSparse attention. Later models such as PatchTST and TimesFM handle long inputs by other means, such as patching.
  8. Feature engineering often matters more than architecture: lag features, rolling statistics, and domain-specific calendars (Indian festivals, IPL, salary cycles, RBI policy dates).
  9. MIMO forecasting (predicting all H steps simultaneously) avoids the error accumulation of recursive approaches and is the natural fit for LSTM/TCN output layers.
  10. Walk-forward validation is the ONLY correct validation strategy. Never use k-fold on time series. Scale using training statistics only, so that no future data leaks into training.
  11. Evaluate with multiple metrics: MAE for absolute error, RMSE when large errors are costly, MAPE for scale-independent comparison, and Directional Accuracy for trading/decision applications.
  12. Stock prediction is fundamentally hard (EMH). A "low MAPE" is misleading: always check Directional Accuracy and compare against naive baselines. Backtest performance rarely translates to live profits.
One-Line Summary: Time series forecasting with deep learning combines temporal awareness (windowing, walk-forward validation) with powerful architectures (LSTM, TCN, Transformers) and thoughtful feature engineering, but always validate against baselines and respect the limits of predictability.
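
Takeaways 11 and 12 can be made concrete with a few lines of plain Python. The price series is made up, and the Directional Accuracy function follows one common definition (the sign of the predicted move from the previous actual value versus the sign of the real move); if the chapter's formula differs, that formula takes precedence.

```python
import math

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error; undefined if any actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def directional_accuracy(actual, predicted):
    """Share of steps where the predicted and the real move have the same sign."""
    hits = 0
    for prev, a, p in zip(actual, actual[1:], predicted[1:]):
        if (a - prev) * (p - prev) > 0:
            hits += 1
    return hits / (len(actual) - 1)

# Made-up prices and a naive forecast that repeats the previous value.
actual = [100.0, 101.0, 102.0, 101.0, 103.0]
naive = [100.0, 100.0, 101.0, 102.0, 101.0]

print(mae(actual, naive))                    # 1.0
print(rmse(actual, naive))                   # sqrt(1.4), about 1.18
print(mape(actual, naive))                   # just under 1 percent
print(directional_accuracy(actual, naive))   # 0.0: the naive forecast never predicts a move
```

The naive forecast reaches a MAPE below 1% while getting the direction right zero times, which is why a low MAPE alone says little about a price predictor.
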
Section 13

References & Further Reading

Foundational Papers

  1. Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735–1780. The original LSTM paper.
  2. Box, G. E. P. & Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control. Holden-Day. Foundation of ARIMA methodology.
  3. Bai, S., Kolter, J. Z., & Koltun, V. (2018). "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling." arXiv:1803.01271. TCN vs LSTM benchmark.
  4. Oreshkin, B. N. et al. (2020). "N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting." ICLR 2020. Reports results that outperform the M4 Competition winner.
  5. Zhou, H. et al. (2021). "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting." AAAI 2021. ProbSparse attention for time series.
  6. Nie, Y. et al. (2023). "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers." ICLR 2023. PatchTST.
  7. Fama, E. F. (1970). "Efficient Capital Markets: A Review of Theory and Empirical Work." Journal of Finance, 25(2), 383–417. EMH.

Textbooks

  1. Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice, 3rd edition. OTexts. Free online at otexts.com/fpp3.
  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 10: Sequence Modeling.
  3. Lazzeri, F. (2021). Machine Learning for Time Series Forecasting with Python. Wiley.

Indian Context

  1. NSE India. "Nifty 50 Index Methodology Document." niftyindices.com
  2. India Meteorological Department (IMD). "Extended Range Prediction System." Uses ML/DL for seasonal monsoon forecasting.
  3. Reliance Jio Annual Report 2024. "Digital Infrastructure & Network Optimization." AI-driven capacity planning.
  4. SEBI (2023). "Circular on Algorithmic Trading." Regulatory guidelines for AI-based trading in Indian markets.

Libraries & Tools

  1. statsmodels (ARIMA, seasonal decomposition, ADF test): statsmodels.org
  2. pmdarima (Auto ARIMA for Python): alkaline-ml.com/pmdarima
  3. keras-tcn (TCN implementation for Keras): GitHub
  4. darts (unified time series library by Unit8): unit8co.github.io/darts
  5. pytorch-forecasting (production time series with PyTorch): pytorch-forecasting.readthedocs.io
  6. holidays (public holiday library with support for Indian holidays): GitHub