Neural Networks & Deep Learning
Chapter 20: Time Series Forecasting with Deep Learning
Predicting the Future, One Timestep at a Time
Reading Time: ~3 hours | Part V: Sequence & Time | Theory + Code Chapter
Prerequisites: Chapters 14-15 (RNNs, LSTMs), Chapter 18 (Transformers), Basic Statistics
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| Remember | Recall time-series characteristics (trend, seasonality, stationarity), ARIMA components, LSTM gate equations, and evaluation metric formulas (MAE, RMSE, MAPE) |
| Understand | Explain why LSTMs handle long-range dependencies in sequential data, how TCN dilations expand receptive fields, and why walk-forward validation prevents data leakage |
| Apply | Implement an LSTM-based Nifty 50 predictor from scratch in NumPy, and build production-grade forecasters in TensorFlow/Keras |
| Analyze | Compare recursive vs. direct vs. MIMO multi-step forecasting, diagnose look-ahead bias, and decompose forecast errors into trend/seasonal/noise components |
| Evaluate | Assess whether a stock price predictor is genuinely predictive or merely curve-fitting; judge model suitability for different forecasting horizons |
| Create | Design an end-to-end electricity consumption forecasting pipeline with Indian holiday features, walk-forward validation, and ensemble strategies |
Learning Objectives
By the end of this chapter, you will be able to:
- Define a time series and its four key components: trend, seasonality, cyclicality, and residual noise
- Explain stationarity using the Augmented Dickey-Fuller (ADF) test and apply differencing to make a series stationary
- Summarize the limitations of classical ARIMA/SARIMA models that motivate deep learning approaches
- Implement an LSTM network from scratch in NumPy for 30-day Nifty 50 price prediction using VIX India and FII/DII flow features
- Compare four deep architectures for time series: LSTM, Temporal Convolutional Networks (TCN), N-BEATS, and Informer (Transformer-based)
- Engineer time-series features including lag features, rolling statistics, and Indian holiday calendars
- Differentiate recursive, direct, and MIMO strategies for multi-step-ahead forecasting
- Apply walk-forward validation to prevent data leakage in temporal splits
- Calculate and interpret MAE, RMSE, MAPE, and Directional Accuracy metrics
- Critically assess why stock market prediction remains fundamentally difficult (Efficient Market Hypothesis) and the gap between backtest and live performance
Opening Hook
Three Predictions That Run India
6:00 AM, IMD Headquarters, New Delhi: The India Meteorological Department's deep learning model processes 40 years of monsoon rainfall data across 306 stations. Its seasonal forecast will determine ₹18 lakh crore worth of Kharif crop planning. A 5% error in predicted millimetres means the difference between a bumper harvest and drought relief packages.
9:15 AM, NSE Trading Floor, BKC Mumbai: The opening bell rings. The Nifty 50 index, a weighted basket of India's 50 largest companies, will move through 22,000+ ticks today. Algorithmic trading desks at Zerodha, Groww, and Angel One process ₹2.5 lakh crore in daily turnover. Their LSTM models don't predict price; they predict volatility regimes and order flow direction.
11:30 PM, Zepto Dark Store, Koramangala, Bengaluru: Tomorrow's 10-minute delivery promise depends on tonight's demand forecast. How many packets of Amul Taaza, how many kg of tomatoes? An LSTM model trained on 18 months of order history, payday cycles, IPL match schedules, and Bengaluru weather predicts SKU-level demand for each of 700+ dark stores.
Every one of these problems is a time series. This chapter teaches you to build the deep learning models behind them.
Core Concepts
20.1 What Is a Time Series?
A time series is a sequence of data points indexed in time order. Unlike the i.i.d. datasets used in standard ML, time series data has temporal dependencies: the value at time t depends on values at t−1, t−2, …, t−k.
Four Components of a Time Series
1. Trend (T): Long-term increase or decrease. Example: India's GDP has an upward trend from ₹80 lakh crore (2012) to ₹295 lakh crore (2025).
2. Seasonality (S): Regular, calendar-driven fluctuations. Example: Flipkart's daily orders spike 8× during Diwali Big Billion Days every October. Swiggy sees a 40% dinner-order surge every Friday.
3. Cyclicality (C): Irregular, longer-term oscillations not tied to a calendar. Example: Real estate price cycles in Mumbai and Bangalore spanning 7-10 years.
4. Residual / Noise (ε): Random variation that cannot be explained by T, S, or C. Example: A surprise RBI rate cut causing a Nifty intraday spike of 300 points.
Multiplicative Decomposition: y(t) = T(t) × S(t) × C(t) × ε(t)
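One practical consequence of the multiplicative form is that taking logarithms turns the product into a sum, which is why log transforms show up so often in preprocessing. The sketch below checks this on a synthetic series; the trend, seasonal pattern, and noise level are made-up illustration values, and the cyclical term C(t) is set to 1 for simplicity.

```python
import numpy as np

# Synthetic monthly series (all values are illustrative assumptions)
t = np.arange(48)
trend = 100.0 + 2.0 * t                               # T(t): linear growth
seasonal = 1.0 + 0.2 * np.sin(2 * np.pi * t / 12)     # S(t): 12-month cycle around 1
rng = np.random.default_rng(0)
noise = np.exp(rng.normal(0.0, 0.01, size=t.size))    # eps(t): positive multiplicative noise

# Multiplicative composition with C(t) = 1
y = trend * seasonal * noise

# log y(t) = log T(t) + log S(t) + log eps(t): the model becomes additive
log_y = np.log(y)
additive = np.log(trend) + np.log(seasonal) + np.log(noise)
print(np.allclose(log_y, additive))
```

After the log transform, additive tools (differencing, linear models) can be applied to the components directly.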
Stationarity: The Gateway Concept
A time series is stationary if its statistical properties (mean, variance, autocorrelation) do not change over time. Most classical methods (ARIMA) require stationarity; deep learning is more robust but still benefits from it.
Augmented Dickey-Fuller (ADF) test:
H₀: Series has a unit root (non-stationary) | H₁: Series is stationary
If p-value < 0.05 → reject H₀ → series is stationary
Making a series stationary:
- First differencing: y'(t) = y(t) − y(t−1), removes linear trend
- Seasonal differencing: y'(t) = y(t) − y(t−m), removes seasonality of period m
- Log transform: stabilizes multiplicative seasonality and growing variance
For price series such as the Nifty 50, log returns ln(P_t / P_{t-1}) are even more well-behaved.
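The three transforms above are short enough to sketch directly in NumPy. The helper names `difference` and `log_returns` are illustrative, not from any library, and the toy inputs are chosen so the effect of each transform is easy to verify by eye.

```python
import numpy as np

def difference(y, lag=1):
    """Return y(t) - y(t - lag); the result is `lag` samples shorter than y."""
    y = np.asarray(y, dtype=float)
    return y[lag:] - y[:-lag]

def log_returns(prices):
    """Return ln(P_t / P_{t-1}) for a strictly positive price series."""
    prices = np.asarray(prices, dtype=float)
    return np.log(prices[1:] / prices[:-1])

# First differencing turns a linear trend into a constant
t = np.arange(24)
linear = 5.0 + 3.0 * t
first_diff = difference(linear, lag=1)

# Seasonal differencing with m = 4 removes an exact period-4 pattern
seasonal = np.tile([10.0, 20.0, 30.0, 40.0], 6)
seasonal_diff = difference(seasonal, lag=4)

# Log returns add up across time: their sum equals ln(P_last / P_first)
prices = np.array([100.0, 110.0, 99.0])
r = log_returns(prices)
```

Each differencing step shortens the series, so windows and targets must be built after the transform, not before.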
20.2 Classical Methods: ARIMA / SARIMA (Brief Review)
Before deep learning, time series forecasting was dominated by the Box-Jenkins methodology:
ARIMA(p, d, q)
AR (AutoRegressive), order p: Uses p lagged values: y(t) = c + φ₁y(t−1) + φ₂y(t−2) + … + φₚy(t−p) + ε(t)
I (Integrated), order d: Number of times the series is differenced to achieve stationarity.
MA (Moving Average), order q: Uses q lagged forecast errors: y(t) = c + ε(t) + θ₁ε(t−1) + … + θ_qε(t−q)
SARIMA adds seasonality: SARIMA(p,d,q)(P,D,Q,m) adds seasonal AR, differencing, and MA terms with period m.
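To make the AR part concrete, here is a minimal sketch that simulates an AR(2) process and recovers c and the φ coefficients by ordinary least squares. This is not how production ARIMA libraries estimate parameters (they typically use maximum likelihood and also handle the MA terms); `fit_ar` is a hypothetical helper and the true coefficients are arbitrary illustration values.

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares fit of y(t) = c + sum_k phi_k * y(t-k) + eps(t).

    Returns (c, phi) where phi[k-1] multiplies the lag-k value.
    """
    y = np.asarray(y, dtype=float)
    # Column k holds the lag-(k+1) values aligned with the targets y[p:]
    lags = np.column_stack([y[p - k - 1 : len(y) - k - 1] for k in range(p)])
    X = np.column_stack([np.ones(len(y) - p), lags])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef[0], coef[1:]

# Simulate a stationary AR(2) process with known (made-up) parameters
rng = np.random.default_rng(42)
n, c_true = 5000, 1.0
phi_true = np.array([0.6, -0.3])
y = np.zeros(n)
for i in range(2, n):
    y[i] = c_true + phi_true[0] * y[i - 1] + phi_true[1] * y[i - 2] + rng.normal(0.0, 0.5)

c_hat, phi_hat = fit_ar(y, p=2)
print("estimated c:", round(float(c_hat), 3))
print("estimated phi:", np.round(phi_hat, 3))
```

With a few thousand samples the estimates land close to the true values, which is exactly the linear structure ARIMA assumes and the limitations below push against.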
Limitations of ARIMA
| Limitation | Why It Matters |
|---|---|
| Linear assumptions | Cannot capture non-linear patterns like regime changes in Nifty volatility |
| Univariate by default | Cannot natively use exogenous features like VIX India, FII/DII flows, crude oil price |
| Manual order selection | Choosing (p, d, q) requires ACF/PACF analysis, which doesn't scale to 1000s of SKUs at Zepto |
| Fixed seasonal period | Struggles with multiple overlapping seasonalities (daily + weekly + monthly + festive) |
| Long-range dependencies | AR(p) with large p is unstable; cannot learn 90-day patterns for monsoon prediction |
Tip: auto_arima from the pmdarima library automates (p, d, q) selection. Use it as your baseline, always.
20.3 LSTM for Time Series Forecasting
Long Short-Term Memory (LSTM) networks, introduced by Hochreiter & Schmidhuber (1997), are the workhorse of deep time series forecasting. Their gated architecture mitigates the vanishing gradient problem, enabling them to learn dependencies spanning hundreds of timesteps.
LSTM Cell Equations (Recap)
Forget Gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Input Gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
Candidate Cell: C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
Cell State Update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
Output Gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Hidden State: h_t = o_t ⊙ tanh(C_t)
Why LSTMs Excel at Time Series
- Memory cells retain information across 100+ timesteps, which is crucial for monsoon patterns spanning months
- Forget gate learns to discard irrelevant past, e.g., ignoring a one-off Nifty crash from 3 months ago
- Multivariate input natively: feed Nifty close + VIX India + FII net buy + crude oil as a multi-feature vector x_t
- Non-linear: captures regime changes, breakouts, and complex seasonal interactions
Windowing: Creating Supervised Learning from Time Series
To feed a time series into an LSTM, we create sliding windows:
# Convert time series to supervised learning windows
# Input: look_back=60 days → Output: next 30 days
# Series: [d1, d2, d3, ..., d100]
# Window 1: X=[d1..d60], Y=[d61..d90]
# Window 2: X=[d2..d61], Y=[d62..d91]
# Window 3: X=[d3..d62], Y=[d63..d92]
# ...
# Shape: X → (num_samples, 60, num_features)
#        Y → (num_samples, 30)
20.4 Temporal Convolutional Networks (TCN)
TCNs apply 1D causal convolutions to sequences, offering an alternative to RNNs with several advantages:
TCN Architecture
Causal Convolution: Output at time t depends only on inputs at times ≤ t. No future leakage by design.
Dilated Convolution: Dilation factors d = {1, 2, 4, 8, 16, …} give an exponentially growing receptive field without increasing the parameters per layer. With kernel size k=3 and 6 residual layers of two convolutions each: receptive field = 1 + 2(k−1)(2⁶ − 1) = 253 timesteps.
Residual Connections: Skip connections every 2 layers prevent degradation in deep TCNs (12+ layers).
Advantages over LSTM
- Parallelizable (no sequential dependency) → roughly 5-10× faster training
- More stable gradients (no backpropagation through time, so less vanishing/exploding)
- Flexible receptive field via dilation
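The receptive-field arithmetic and the "no future leakage" property can both be checked with a few lines of NumPy. This is a single-channel sketch for intuition, not a full TCN layer; `receptive_field` and `causal_dilated_conv1d` are illustrative helpers, and the two-convolutions-per-layer assumption mirrors the formula in the text.

```python
import numpy as np

def receptive_field(kernel_size, dilations, convs_per_block=2):
    """Receptive field of stacked dilated causal convolutions.

    Assumes `convs_per_block` convolutions share each dilation factor.
    """
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)

def causal_dilated_conv1d(x, w, dilation):
    """y[t] = sum_j w[j] * x[t - j*dilation], with zeros assumed before t = 0."""
    x = np.asarray(x, dtype=float)
    k = len(w)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])   # left padding only: causal
    y = np.zeros(len(x))
    for j in range(k):
        start = pad - j * dilation                  # shift j*dilation steps into the past
        y += w[j] * x_padded[start : start + len(x)]
    return y

# Kernel size 3 with dilations 1..32 reproduces the 253-step figure
rf = receptive_field(3, [1, 2, 4, 8, 16, 32])

# Causality check: changing the input from t = 10 onward must not alter y[:10]
x = np.arange(16, dtype=float)
w = np.array([0.5, 0.3, 0.2])
y = causal_dilated_conv1d(x, w, dilation=2)
x_future = x.copy()
x_future[10:] += 100.0
y_future = causal_dilated_conv1d(x_future, w, dilation=2)
print(rf, np.allclose(y[:10], y_future[:10]))
```

Left-only padding is what makes the convolution causal; symmetric padding would let each output see future inputs.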
20.5 N-BEATS: Neural Basis Expansion
N-BEATS (Oreshkin et al., 2020) is a pure deep learning architecture designed specifically for time series: no recurrence, no convolutions.
N-BEATS Architecture
Block Structure: Stack of fully-connected layers → two linear projections:
1. Backcast: reconstructs the input (lookback period)
2. Forecast: produces the forecast horizon
Multiple blocks are stacked. Each block receives the residual (what the previous block couldn't explain). Final forecast = sum of all block forecasts.
Interpretable Variant: Constrains basis functions to trend (polynomial) and seasonality (Fourier), producing a human-readable decomposition.
Key Result: On the M4 benchmark (100,000 time series across multiple domains), N-BEATS outperformed the winner of the 2018 M4 Competition, a hybrid of exponential smoothing and an RNN, as well as the statistical and ensemble entries.
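The residual stacking described above is easier to see in code. The sketch below uses random, untrained weights and a single hidden layer per block (the published architecture uses several), so it only illustrates the data flow: each block subtracts its backcast from the running residual and adds its forecast to the running total. All names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
LOOKBACK, HORIZON, HIDDEN, N_BLOCKS = 12, 4, 16, 3

def make_block():
    """Random (untrained) weights for one simplified generic block."""
    return {
        "W_h": rng.normal(0.0, 0.1, (LOOKBACK, HIDDEN)),     # FC layer
        "W_back": rng.normal(0.0, 0.1, (HIDDEN, LOOKBACK)),  # backcast head
        "W_fore": rng.normal(0.0, 0.1, (HIDDEN, HORIZON)),   # forecast head
    }

def block_forward(block, x):
    """FC + ReLU, then two linear projections: (backcast, forecast)."""
    h = np.maximum(0.0, x @ block["W_h"])
    return h @ block["W_back"], h @ block["W_fore"]

def nbeats_forward(blocks, x):
    """Doubly residual stacking over a list of blocks."""
    residual = x.copy()
    forecast = np.zeros(HORIZON)
    per_block = []
    for block in blocks:
        backcast, block_forecast = block_forward(block, residual)
        residual = residual - backcast          # pass on what this block could not explain
        forecast = forecast + block_forecast    # final forecast is the sum over blocks
        per_block.append(block_forecast)
    return forecast, residual, per_block

blocks = [make_block() for _ in range(N_BLOCKS)]
x = rng.normal(size=LOOKBACK)
forecast, residual, per_block = nbeats_forward(blocks, x)
print(forecast.shape, residual.shape)
```

Because the final forecast is a plain sum, each block's contribution can be inspected separately, which is what the interpretable variant exploits.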
20.6 Transformer-Based: Informer
Standard Transformers have O(L²) complexity in sequence length L, which is prohibitive for long time series (e.g., hourly data for a year = 8,760 points). Informer (Zhou et al., 2021) addresses this.
Informer: Key Innovations
ProbSparse Self-Attention: Not all queries are equally important. Informer selects the top-u "active" queries using a KL-divergence-based sparsity measure, reducing attention from O(L²) to O(L log L).
Self-Attention Distilling: Progressive downsampling between attention layers halves the sequence length at each layer (L → L/2 → L/4), focusing on dominant features.
Generative-Style Decoder: Instead of auto-regressive step-by-step prediction, the decoder generates the entire forecast horizon in one forward pass, avoiding error accumulation.
Complexity: O(L log L) time and O(L log L) memory, enabling input sequences of 10,000+ timesteps.
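A rough sketch of the query-selection idea follows. It scores each query with a max-minus-mean measure of its attention logits (the practical approximation of the KL-based measure), gives full softmax attention only to the top-u queries, and lets the remaining "lazy" queries output the mean of V. For clarity it computes the full score matrix, so this version is still O(L²); the published method samples keys to estimate the measure. Function and variable names are assumptions made for illustration.

```python
import numpy as np

def probsparse_attention(Q, K, V, u):
    """Simplified ProbSparse-style attention for a single head.

    Q, K: (L, d); V: (L, d_v); u: number of active queries.
    """
    L_q, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                         # (L_q, L_k) attention logits
    sparsity = scores.max(axis=1) - scores.mean(axis=1)   # max-mean measure per query
    active = np.argsort(sparsity)[-u:]                    # indices of the top-u queries

    # Lazy queries fall back to the mean of V
    out = np.tile(V.mean(axis=0), (L_q, 1))

    # Active queries get ordinary softmax attention
    s = scores[active]
    s = s - s.max(axis=1, keepdims=True)                  # numerical stability
    weights = np.exp(s)
    weights /= weights.sum(axis=1, keepdims=True)
    out[active] = weights @ V
    return out, active

rng = np.random.default_rng(0)
L, d = 32, 8
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
u = int(np.ceil(np.log(L)))                               # on the order of ln(L) active queries
out, active = probsparse_attention(Q, K, V, u)
print(out.shape, sorted(active.tolist()))
```

A query whose logits are nearly uniform has a small max-minus-mean gap and contributes little beyond averaging V, which is why skipping it costs little accuracy.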
20.7 Feature Engineering for Time Series
Raw time series alone is rarely enough. Thoughtful feature engineering often matters more than model architecture.
Lag Features
# Create lag features for Nifty 50 closing price
import pandas as pd
df['close_lag_1'] = df['close'].shift(1) # Yesterday
df['close_lag_5'] = df['close'].shift(5) # Last week
df['close_lag_21'] = df['close'].shift(21) # Last month (~21 trading days)
df['close_lag_63'] = df['close'].shift(63) # Last quarter
Rolling Statistics
# Rolling mean and standard deviation
df['sma_20'] = df['close'].rolling(20).mean() # 20-day SMA
df['sma_50'] = df['close'].rolling(50).mean() # 50-day SMA
df['std_20'] = df['close'].rolling(20).std() # 20-day volatility
df['rsi_14'] = compute_rsi(df['close'], 14) # Relative Strength Index (compute_rsi is a user-supplied helper)
# Bollinger Bands
df['bb_upper'] = df['sma_20'] + 2 * df['std_20']
df['bb_lower'] = df['sma_20'] - 2 * df['std_20']
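The snippet above calls `compute_rsi` without defining it. One possible implementation is sketched below, using simple (unweighted) averages of gains and losses over the last `period` changes; Wilder's original RSI uses exponential smoothing instead, so values will differ slightly from charting platforms. The neutral value of 50 for a completely flat window is a convention chosen here, not a standard.

```python
import numpy as np

def compute_rsi(close, period=14):
    """Relative Strength Index with simple-average smoothing.

    Accepts any array-like (including a pandas Series) and returns an
    ndarray of the same length; the first `period` entries are NaN.
    """
    close = np.asarray(close, dtype=float)
    delta = np.diff(close)                        # delta[i] = close[i+1] - close[i]
    gains = np.where(delta > 0, delta, 0.0)
    losses = np.where(delta < 0, -delta, 0.0)
    rsi = np.full(close.shape, np.nan)
    for t in range(period, len(close)):
        # Only the `period` changes ending at time t: no look-ahead
        avg_gain = gains[t - period : t].mean()
        avg_loss = losses[t - period : t].mean()
        if avg_loss == 0.0:
            rsi[t] = 100.0 if avg_gain > 0 else 50.0   # flat window: neutral (assumption)
        else:
            rs = avg_gain / avg_loss
            rsi[t] = 100.0 - 100.0 / (1.0 + rs)
    return rsi

rising = compute_rsi(np.arange(30, dtype=float))
falling = compute_rsi(np.arange(30, 0, -1, dtype=float))
zigzag = compute_rsi(np.tile([0.0, 1.0], 15))
```

Because the function returns a plain array of the same length as its input, the result can be assigned straight to a DataFrame column.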
Indian Calendar Features
# Indian-specific temporal features (assumes df has a DatetimeIndex)
import numpy as np
# Standard calendar features
df['day_of_week'] = df.index.dayofweek # 0=Mon, 4=Fri
df['month'] = df.index.month
df['is_monday'] = (df.index.dayofweek == 0).astype(int)
df['is_friday'] = (df.index.dayofweek == 4).astype(int)
df['is_month_end'] = df.index.is_month_end.astype(int)
# Indian market-specific features
# is_nse_expiry_week and rbi_policy_dates are user-supplied (not defined here)
df['is_expiry_week'] = is_nse_expiry_week(df.index) # Last Thu
df['is_budget_day'] = ((df.index.month == 2) & (df.index.day == 1)).astype(int)
df['is_rbi_policy'] = df.index.isin(rbi_policy_dates).astype(int)
# Festival calendar (approximate; use 'holidays' library)
indian_holidays = {
    'diwali': ['2024-11-01', '2025-10-20'],
    'holi': ['2024-03-25', '2025-03-14'],
    'ganesh_chaturthi': ['2024-09-07', '2025-08-27'],
}
# Pre-festival period (demand spike window)
for fest, dates in indian_holidays.items():
    df[f'pre_{fest}'] = 0  # default 0 so rows outside the window are not NaN
    fest_dates = pd.to_datetime(dates)
    for d in fest_dates:
        mask = (df.index >= d - pd.Timedelta(days=7)) & (df.index <= d)
        df.loc[mask, f'pre_{fest}'] = 1
20.8 Multi-Step Forecasting Strategies
Predicting more than one step ahead is significantly harder. Three strategies dominate:
Three Multi-Step Strategies
1. Recursive: Train a 1-step model. To forecast H steps: predict ŷ(t+1), feed it back as input, predict ŷ(t+2), and so on.
✓ One model to train ✗ Errors accumulate; by step 30, predictions drift heavily
2. Direct: Train H separate models, one for each horizon: Model₁ predicts ŷ(t+1), Model₂ predicts ŷ(t+2), …, Model_H predicts ŷ(t+H).
✓ No error accumulation ✗ H × training cost; no shared learning between horizons
3. MIMO (Multiple Input, Multiple Output): Train a single model that outputs all H steps simultaneously: f(X) → [ŷ(t+1), ŷ(t+2), …, ŷ(t+H)]
✓ Captures inter-horizon dependencies ✓ One model ✗ More complex architecture needed
| Strategy | # Models | Error Accumulation | Best For |
|---|---|---|---|
| Recursive | 1 | High (compounds) | Short horizons (1-5 steps) |
| Direct | H | None | Irregular horizons, when compute is cheap |
| MIMO | 1 | Low (shared learning) | Medium-long horizons; LSTM/Transformer natural fit |
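The mechanics of the three strategies can be sketched with plain linear models in place of neural networks. The example below fits them on a noiseless sine wave, a series that a 2-lag linear model can reproduce exactly, so all three forecasts should agree with the truth; it shows how the strategies are wired, not how their errors differ on real data. With a purely linear model, MIMO reduces to H independent regressions, so the shared-representation benefit only appears with a nonlinear network. All function names are illustrative.

```python
import numpy as np

def design(y, p, n):
    """Rows [1, y(r), ..., y(r+p-1)] for r = 0..n-1 (oldest lag first)."""
    return np.column_stack([np.ones(n)] + [y[i : i + n] for i in range(p)])

def fit_linear(y, p, h):
    """Fit a model that maps p past values to the value h steps ahead."""
    n = len(y) - p - h + 1
    target = y[p + h - 1 : p + h - 1 + n]
    coef, *_ = np.linalg.lstsq(design(y, p, n), target, rcond=None)
    return coef

def forecast_recursive(y, p, H):
    """One 1-step model; each prediction is fed back as an input."""
    coef = fit_linear(y, p, 1)
    window = list(y[-p:])
    preds = []
    for _ in range(H):
        nxt = coef[0] + np.dot(coef[1:], window[-p:])
        preds.append(nxt)
        window.append(nxt)
    return np.array(preds)

def forecast_direct(y, p, H):
    """H separate models, one per horizon step."""
    last = np.concatenate([[1.0], y[-p:]])
    return np.array([last @ fit_linear(y, p, h) for h in range(1, H + 1)])

def forecast_mimo(y, p, H):
    """One model with an H-dimensional output, fitted jointly."""
    n = len(y) - p - H + 1
    Y = np.column_stack([y[p + h : p + h + n] for h in range(H)])
    W, *_ = np.linalg.lstsq(design(y, p, n), Y, rcond=None)   # shape (p+1, H)
    return np.concatenate([[1.0], y[-p:]]) @ W

t = np.arange(200)
y = np.sin(2 * np.pi * t / 20)
truth = np.sin(2 * np.pi * np.arange(200, 205) / 20)

rec = forecast_recursive(y, p=2, H=5)
direct = forecast_direct(y, p=2, H=5)
mimo = forecast_mimo(y, p=2, H=5)
print(np.round(rec, 4), np.round(direct, 4), np.round(mimo, 4))
```

On noisy data the recursive forecast degrades fastest, because each step consumes its own earlier errors as inputs.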
20.9 Walk-Forward Validation (No Data Leakage!)
Standard k-fold cross-validation destroys temporal ordering and causes catastrophic leakage. Walk-forward validation respects time.
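# Walk-forward validation implementation
# Assumes X: (num_samples, lookback, n_features), y: (num_samples, horizon),
# and build_lstm_model(lookback, n_features, horizon) returning a compiled Keras model
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
from tensorflow.keras.callbacks import EarlyStopping
tscv = TimeSeriesSplit(n_splits=5)
fold_scores = []
n_features = X.shape[-1]
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # CRITICAL: Scale using ONLY training data statistics!
    # StandardScaler expects 2D input, so flatten the time axis before scaling
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train.reshape(-1, n_features)).reshape(X_train.shape) # fit + transform
    X_val = scaler.transform(X_val.reshape(-1, n_features)).reshape(X_val.shape) # transform only!
    model = build_lstm_model(X.shape[1], n_features, y.shape[1])
    model.fit(X_train, y_train, epochs=50,
              validation_data=(X_val, y_val),
              callbacks=[EarlyStopping(patience=5)])
    val_pred = model.predict(X_val)
    fold_scores.append(mean_absolute_error(y_val, val_pred))
    print(f"Fold {fold+1} MAE: {fold_scores[-1]:.4f}")
print(f"Mean Walk-Forward MAE: {np.mean(fold_scores):.4f}")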
Common leakage bug: if you call fit_transform() on the full dataset, the scaler learns the mean and standard deviation of future validation/test data. The fix: always fit() on train, then transform() on val/test.
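To see what an expanding-window split looks like without any library, here is a small NumPy sketch. It also adds a `gap` between the end of training and the start of validation: when samples are overlapping windows, the targets of the last training windows reach forward in time, so dropping at least `horizon` samples at the boundary keeps them from overlapping the validation period. The function name and the toy sizes are assumptions for illustration.

```python
import numpy as np

def walk_forward_splits(n_samples, n_splits, val_size, gap=0):
    """Yield (train_idx, val_idx) pairs with an expanding training window.

    `gap` samples between train and validation are discarded.
    """
    for k in range(n_splits):
        val_end = n_samples - (n_splits - 1 - k) * val_size
        val_start = val_end - val_size
        train_end = val_start - gap
        if train_end <= 0:
            raise ValueError("not enough samples for this many splits")
        yield np.arange(0, train_end), np.arange(val_start, val_end)

series = np.arange(100, dtype=float)
folds = list(walk_forward_splits(len(series), n_splits=3, val_size=20, gap=5))

scaled_vals = []
for train_idx, val_idx in folds:
    # Statistics come from the training slice only, then are applied to validation
    mu, sigma = series[train_idx].mean(), series[train_idx].std()
    scaled_vals.append((series[val_idx] - mu) / sigma)
    print(f"train 0..{train_idx[-1]} | val {val_idx[0]}..{val_idx[-1]}")
```

Each fold trains on everything before its validation block, so later folds see more history, which mirrors how the model would be retrained in production.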
20.10 Evaluation Metrics for Time Series
Mean Absolute Error (MAE): MAE = (1/n) Σ|y_i − ŷ_i|
Root Mean Squared Error (RMSE): RMSE = √[(1/n) Σ(y_i − ŷ_i)²]
Mean Absolute Percentage Error (MAPE): MAPE = (100/n) Σ|y_i − ŷ_i| / |y_i|
Directional Accuracy (DA): DA = (1/n) Σ 𝟙(sign(Δy_i) = sign(Δŷ_i)) × 100%
| Metric | Strengths | Weaknesses | Good For |
|---|---|---|---|
| MAE | Less sensitive to outliers than RMSE; same units as data | Scale-dependent; can't compare across series | Nifty price forecast (error in index points) |
| RMSE | Penalizes large errors more | Sensitive to outliers | When big misses are costly (power grid) |
| MAPE | Scale-independent; interpretable % | Undefined when y_i = 0; asymmetric | Demand forecasting (% accuracy) |
| DA | Captures directional correctness | Ignores magnitude | Trading signals (up/down) |
From-Scratch Code: LSTM for 30-Day Nifty 50 Prediction
We build a complete LSTM forecaster from scratch in NumPy. The model takes 60 days of historical data (Nifty close, VIX India, FII net flow) and predicts the next 30 days of Nifty closing prices.
Step 1: Data Preparation & Windowing
import numpy as np

# ==============================================
# LSTM for Nifty 50: From Scratch in NumPy
# Features: [nifty_close, vix_india, fii_net_flow]
# Lookback: 60 days | Forecast: 30 days
# ==============================================

def create_windows(data, lookback=60, horizon=30):
    """Convert multivariate time series into supervised windows.

    Args:
        data: ndarray of shape (T, num_features)
        lookback: number of past timesteps as input
        horizon: number of future steps to predict
    Returns:
        X: (num_samples, lookback, num_features)
        Y: (num_samples, horizon), only first feature (nifty close)
    """
    X, Y = [], []
    for i in range(len(data) - lookback - horizon + 1):
        X.append(data[i : i + lookback])  # all features
        Y.append(data[i + lookback : i + lookback + horizon, 0])  # nifty only
    return np.array(X), np.array(Y)

def normalize(train, val, test):
    """Normalize using ONLY training statistics (no leakage)."""
    mu = train.mean(axis=0)
    std = train.std(axis=0) + 1e-8
    return (train - mu) / std, (val - mu) / std, (test - mu) / std, mu, std

# Simulate realistic Nifty data for demonstration
np.random.seed(42)
T = 1000  # ~4 years of trading days
nifty_close = 18000 + np.cumsum(np.random.randn(T) * 100)  # random walk
vix_india = 14 + np.random.randn(T) * 3  # VIX India
fii_net = np.random.randn(T) * 2000  # FII net ₹Cr
data = np.column_stack([nifty_close, vix_india, fii_net])

# Temporal split: 70% train, 15% val, 15% test
n_train = int(0.70 * T)
n_val = int(0.15 * T)
train_data = data[:n_train]
val_data = data[n_train:n_train+n_val]
test_data = data[n_train+n_val:]

# Normalize (fit on train only!)
train_n, val_n, test_n, mu, std = normalize(train_data, val_data, test_data)

# Create windows
LOOKBACK, HORIZON = 60, 30
X_train, Y_train = create_windows(train_n, LOOKBACK, HORIZON)
X_val, Y_val = create_windows(val_n, LOOKBACK, HORIZON)
X_test, Y_test = create_windows(test_n, LOOKBACK, HORIZON)
print(f"X_train: {X_train.shape}, Y_train: {Y_train.shape}")
print(f"X_val: {X_val.shape}, Y_val: {Y_val.shape}")
print(f"X_test: {X_test.shape}, Y_test: {Y_test.shape}")
Step 2: LSTM Cell โ Forward Pass
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

def tanh(x):
    return np.tanh(x)

class LSTMCell:
    """Single LSTM cell with forget, input, and output gates."""

    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        scale = 0.01
        # Combined weight matrix: [h_{t-1}, x_t] → gates
        concat_size = hidden_size + input_size
        # Forget gate weights
        self.W_f = np.random.randn(hidden_size, concat_size) * scale
        self.b_f = np.zeros((1, hidden_size))
        # Input gate weights
        self.W_i = np.random.randn(hidden_size, concat_size) * scale
        self.b_i = np.zeros((1, hidden_size))
        # Candidate cell weights
        self.W_c = np.random.randn(hidden_size, concat_size) * scale
        self.b_c = np.zeros((1, hidden_size))
        # Output gate weights
        self.W_o = np.random.randn(hidden_size, concat_size) * scale
        self.b_o = np.zeros((1, hidden_size))

    def forward(self, x_t, h_prev, c_prev):
        """One timestep of LSTM.

        Args:
            x_t: (batch, input_size)
            h_prev: (batch, hidden_size)
            c_prev: (batch, hidden_size)
        Returns:
            h_t, c_t, cache (for backprop)
        """
        # Concatenate [h_{t-1}, x_t]
        concat = np.concatenate([h_prev, x_t], axis=1)  # (batch, H+D)
        # Gate computations
        f_t = sigmoid(concat @ self.W_f.T + self.b_f)  # Forget gate
        i_t = sigmoid(concat @ self.W_i.T + self.b_i)  # Input gate
        c_tilde = tanh(concat @ self.W_c.T + self.b_c)  # Candidate
        o_t = sigmoid(concat @ self.W_o.T + self.b_o)  # Output gate
        # Cell state update
        c_t = f_t * c_prev + i_t * c_tilde
        # Hidden state
        h_t = o_t * tanh(c_t)
        cache = (x_t, h_prev, c_prev, concat, f_t, i_t, c_tilde, o_t, c_t)
        return h_t, c_t, cache
Step 3: Full LSTM Network with Dense Output
class LSTMForecaster:
    """LSTM network for multi-step time series forecasting.

    Architecture:
        Input (batch, seq_len, features)
        → LSTM layer (hidden_size units)
        → Dense layer (hidden_size → horizon)
        → Output (batch, horizon)
    """

    def __init__(self, input_size, hidden_size, horizon, lr=0.001):
        self.lstm = LSTMCell(input_size, hidden_size)
        self.hidden_size = hidden_size
        self.horizon = horizon
        self.lr = lr
        # Dense layer: h_final → forecast
        self.W_out = np.random.randn(horizon, hidden_size) * 0.01
        self.b_out = np.zeros((1, horizon))

    def forward(self, X):
        """Forward pass through all timesteps.

        Args:
            X: (batch, seq_len, input_size)
        Returns:
            predictions: (batch, horizon)
        """
        batch_size, seq_len, _ = X.shape
        # Initialize hidden and cell states
        h = np.zeros((batch_size, self.hidden_size))
        c = np.zeros((batch_size, self.hidden_size))
        self.caches = []
        # Unroll LSTM across timesteps
        for t in range(seq_len):
            h, c, cache = self.lstm.forward(X[:, t, :], h, c)
            self.caches.append(cache)
        # Final hidden state → Dense → prediction
        self.h_final = h
        predictions = h @ self.W_out.T + self.b_out  # (batch, horizon)
        return predictions

    def compute_loss(self, y_pred, y_true):
        """Mean Squared Error loss."""
        self.y_pred = y_pred
        self.y_true = y_true
        loss = np.mean((y_pred - y_true) ** 2)
        return loss

    def backward_dense(self):
        """Backprop through the dense output layer."""
        batch = self.y_pred.shape[0]
        # Gradient of the squared error averaged over the batch only.
        # This is horizon × the gradient of the reported mean loss,
        # which simply rescales the effective learning rate.
        dL_dpred = 2.0 * (self.y_pred - self.y_true) / batch  # (batch, horizon)
        # Gradients for W_out and b_out
        dW_out = dL_dpred.T @ self.h_final  # (horizon, hidden_size)
        db_out = dL_dpred.sum(axis=0, keepdims=True)
        # Gradient flowing back to h_final (computed before the update)
        dh = dL_dpred @ self.W_out  # (batch, hidden_size)
        # Update dense layer (SGD)
        self.W_out -= self.lr * dW_out
        self.b_out -= self.lr * db_out
        return dh

    def backward_lstm(self, dh_final):
        """Backprop Through Time (BPTT) for LSTM.

        Gradients are accumulated across every timestep of the lookback
        window (full BPTT over the window), then applied in one SGD step.
        """
        dh_next = dh_final
        dc_next = np.zeros_like(dh_next)
        # Accumulators for weight gradients
        dW_f = np.zeros_like(self.lstm.W_f)
        dW_i = np.zeros_like(self.lstm.W_i)
        dW_c = np.zeros_like(self.lstm.W_c)
        dW_o = np.zeros_like(self.lstm.W_o)
        db_f = np.zeros_like(self.lstm.b_f)
        db_i = np.zeros_like(self.lstm.b_i)
        db_c = np.zeros_like(self.lstm.b_c)
        db_o = np.zeros_like(self.lstm.b_o)
        for t in reversed(range(len(self.caches))):
            cache = self.caches[t]
            x_t, h_prev, c_prev, concat, f_t, i_t, c_tilde, o_t, c_t = cache
            # Gradients through output gate
            tanh_c = tanh(c_t)
            do = dh_next * tanh_c
            dc = dh_next * o_t * (1 - tanh_c ** 2) + dc_next
            # Gradients through cell state update
            df = dc * c_prev
            di = dc * c_tilde
            dc_tilde = dc * i_t
            dc_next = dc * f_t  # flows to previous timestep
            # Gradients through gate activations
            df_raw = df * f_t * (1 - f_t)  # sigmoid derivative
            di_raw = di * i_t * (1 - i_t)
            do_raw = do * o_t * (1 - o_t)
            dc_raw = dc_tilde * (1 - c_tilde ** 2)  # tanh derivative
            # Accumulate weight gradients
            dW_f += df_raw.T @ concat
            dW_i += di_raw.T @ concat
            dW_c += dc_raw.T @ concat
            dW_o += do_raw.T @ concat
            db_f += df_raw.sum(axis=0, keepdims=True)
            db_i += di_raw.sum(axis=0, keepdims=True)
            db_c += dc_raw.sum(axis=0, keepdims=True)
            db_o += do_raw.sum(axis=0, keepdims=True)
            # Gradient flowing to h_{t-1}
            d_concat = (df_raw @ self.lstm.W_f + di_raw @ self.lstm.W_i +
                        dc_raw @ self.lstm.W_c + do_raw @ self.lstm.W_o)
            dh_next = d_concat[:, :self.hidden_size]
        # Gradient clipping to prevent exploding gradients (weights only)
        for grad in [dW_f, dW_i, dW_c, dW_o]:
            np.clip(grad, -5, 5, out=grad)
        # Update LSTM weights (SGD)
        self.lstm.W_f -= self.lr * dW_f
        self.lstm.W_i -= self.lr * dW_i
        self.lstm.W_c -= self.lr * dW_c
        self.lstm.W_o -= self.lr * dW_o
        self.lstm.b_f -= self.lr * db_f
        self.lstm.b_i -= self.lr * db_i
        self.lstm.b_c -= self.lr * db_c
        self.lstm.b_o -= self.lr * db_o
Step 4: Training Loop with Walk-Forward Awareness
def train_lstm(model, X_train, Y_train, X_val, Y_val,
               epochs=100, batch_size=32, patience=10):
    """Train LSTM with mini-batch SGD and early stopping."""
    n = X_train.shape[0]
    best_val_loss = float('inf')
    wait = 0
    train_losses, val_losses = [], []
    for epoch in range(epochs):
        # Mini-batch training (shuffle within train set only)
        indices = np.random.permutation(n)
        epoch_loss = 0.0
        n_batches = 0
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            batch_idx = indices[start:end]
            X_b = X_train[batch_idx]
            Y_b = Y_train[batch_idx]
            # Forward pass
            y_pred = model.forward(X_b)
            loss = model.compute_loss(y_pred, Y_b)
            epoch_loss += loss
            n_batches += 1
            # Backward pass
            dh = model.backward_dense()
            model.backward_lstm(dh)
        train_loss = epoch_loss / n_batches
        train_losses.append(train_loss)
        # Validation loss
        val_pred = model.forward(X_val)
        val_loss = model.compute_loss(val_pred, Y_val)
        val_losses.append(val_loss)
        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1:3d} | Train Loss: {train_loss:.6f} | Val Loss: {val_loss:.6f}")
    return train_losses, val_losses

# Initialize and train
INPUT_SIZE = 3  # nifty_close, vix_india, fii_net
HIDDEN_SIZE = 32  # LSTM hidden units
HORIZON = 30  # 30-day forecast
model = LSTMForecaster(INPUT_SIZE, HIDDEN_SIZE, HORIZON, lr=0.0005)
train_losses, val_losses = train_lstm(
    model, X_train, Y_train, X_val, Y_val,
    epochs=100, batch_size=32, patience=15
)
Step 5: Evaluation with All Four Metrics
def evaluate_forecast(y_true, y_pred, mu_target, std_target):
    """Compute MAE, RMSE, MAPE, and Directional Accuracy.

    Args:
        y_true, y_pred: normalized targets and predictions, shape (n, horizon)
        mu_target, std_target: for de-normalization
    """
    # De-normalize to original scale (Nifty points)
    y_true_orig = y_true * std_target + mu_target
    y_pred_orig = y_pred * std_target + mu_target
    # MAE
    mae = np.mean(np.abs(y_true_orig - y_pred_orig))
    # RMSE
    rmse = np.sqrt(np.mean((y_true_orig - y_pred_orig) ** 2))
    # MAPE
    mape = np.mean(np.abs(y_true_orig - y_pred_orig) /
                   (np.abs(y_true_orig) + 1e-8)) * 100
    # Directional Accuracy
    # Compare day-over-day direction inside each forecast window:
    # did we predict up/down correctly?
    true_dir = np.diff(y_true_orig, axis=1)
    pred_dir = np.diff(y_pred_orig, axis=1)
    da = np.mean((np.sign(true_dir) == np.sign(pred_dir))) * 100
    print("=" * 36)
    print("  NIFTY 50 FORECAST EVALUATION")
    print("=" * 36)
    print(f"  MAE:  {mae:8.2f} points")
    print(f"  RMSE: {rmse:8.2f} points")
    print(f"  MAPE: {mape:8.2f}%")
    print(f"  DA:   {da:8.2f}%")
    print("=" * 36)
    return {'mae': mae, 'rmse': rmse, 'mape': mape, 'da': da}

# Evaluate on test set
test_pred = model.forward(X_test)
metrics = evaluate_forecast(Y_test, test_pred, mu[0], std[0])
Industry Code: Production LSTM with TensorFlow/Keras
# ==================================================
# Production-Grade LSTM for Multi-Step Forecasting
# TensorFlow/Keras Implementation
# ==================================================
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.preprocessing import StandardScaler
import numpy as np

# -- Model Architecture --
def build_lstm_model(lookback, n_features, horizon):
    """Build a stacked Bidirectional LSTM for time series."""
    model = Sequential([
        # Layer 1: Bidirectional LSTM (reads the lookback window in both directions)
        Bidirectional(
            LSTM(128, return_sequences=True,
                 dropout=0.2, recurrent_dropout=0.1),
            input_shape=(lookback, n_features)
        ),
        # Layer 2: Second LSTM layer
        LSTM(64, return_sequences=False,
             dropout=0.2, recurrent_dropout=0.1),
        # Dense head for multi-step output
        Dense(64, activation='relu'),
        Dropout(0.1),
        Dense(horizon)  # MIMO: output all 30 steps at once
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss='mse',
        metrics=['mae']
    )
    return model

# -- Build & Train --
LOOKBACK = 60
N_FEATURES = 3  # nifty_close, vix_india, fii_net
HORIZON = 30
model = build_lstm_model(LOOKBACK, N_FEATURES, HORIZON)
model.summary()

# Callbacks
callbacks = [
    EarlyStopping(monitor='val_loss', patience=15,
                  restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                      patience=5, min_lr=1e-6)
]
history = model.fit(
    X_train, Y_train,
    epochs=200,
    batch_size=32,
    validation_data=(X_val, Y_val),
    callbacks=callbacks,
    verbose=1
)

# -- Evaluate --
test_pred = model.predict(X_test)
print(f"Test MAE: {np.mean(np.abs(Y_test - test_pred)):.4f} (normalized)")
TCN Alternative with Keras
# Temporal Convolutional Network using keras-tcn
# pip install keras-tcn
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tcn import TCN

def build_tcn_model(lookback, n_features, horizon):
    model = Sequential([
        TCN(
            nb_filters=64,
            kernel_size=3,
            dilations=[1, 2, 4, 8, 16, 32],
            padding='causal',
            dropout_rate=0.2,
            return_sequences=False,
            input_shape=(lookback, n_features)
        ),
        Dense(64, activation='relu'),
        Dense(horizon)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return model

tcn_model = build_tcn_model(LOOKBACK, N_FEATURES, HORIZON)
print(f"TCN parameters: {tcn_model.count_params():,}")
print(f"LSTM parameters: {model.count_params():,}")
# TCNs often need fewer parameters and train faster than a comparable LSTM;
# compare the two counts above and time both models on your own data.
Walk-Forward Validation: Production Pipeline
# Full walk-forward validation pipeline
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error
import gc

def walk_forward_validate(data, lookback=60, horizon=30,
                          n_splits=5, build_fn=build_lstm_model):
    """Walk-forward validation with proper temporal splits."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    results = []
    for fold, (train_idx, val_idx) in enumerate(tscv.split(data)):
        print(f"\n{'='*50}")
        print(f"Fold {fold+1}/{n_splits}")
        print(f" Train: indices {train_idx[0]}-{train_idx[-1]}")
        print(f" Val: indices {val_idx[0]}-{val_idx[-1]}")
        train_data = data[train_idx]
        val_data = data[val_idx]
        # Scale on training data ONLY
        scaler = StandardScaler()
        train_scaled = scaler.fit_transform(train_data)
        val_scaled = scaler.transform(val_data)
        # Create windows
        X_tr, Y_tr = create_windows(train_scaled, lookback, horizon)
        X_vl, Y_vl = create_windows(val_scaled, lookback, horizon)
        if len(X_vl) == 0:
            print(" Skipping: not enough val data for windowing")
            continue
        # Build fresh model for each fold
        fold_model = build_fn(lookback, data.shape[1], horizon)
        fold_model.fit(X_tr, Y_tr, epochs=100, batch_size=32,
                       validation_data=(X_vl, Y_vl),
                       callbacks=[EarlyStopping(patience=10,
                                                restore_best_weights=True)],
                       verbose=0)
        val_pred = fold_model.predict(X_vl, verbose=0)
        # De-normalize for interpretable metrics
        mu_y, std_y = scaler.mean_[0], scaler.scale_[0]
        Y_vl_orig = Y_vl * std_y + mu_y
        val_pred_orig = val_pred * std_y + mu_y
        mae = mean_absolute_error(Y_vl_orig.flatten(), val_pred_orig.flatten())
        rmse = np.sqrt(mean_squared_error(Y_vl_orig.flatten(), val_pred_orig.flatten()))
        results.append({'fold': fold+1, 'mae': mae, 'rmse': rmse})
        print(f" MAE: {mae:.2f} | RMSE: {rmse:.2f}")
        # Free memory
        del fold_model
        gc.collect()
    print(f"\n{'='*50}")
    print(f"Mean MAE across folds: {np.mean([r['mae'] for r in results]):.2f}")
    return results

results = walk_forward_validate(data)
Visual Diagrams
LSTM Unrolled for Time Series Forecasting
TCN Dilated Causal Convolution
Walk-Forward vs Wrong Cross-Validation
N-BEATS Block Architecture
Worked Example: Predicting Next Week's Nifty 50
Let's trace through a concrete LSTM prediction, step by step, for a 5-day (1-week) ahead forecast.
Given Data
| Day | Nifty Close | VIX India | FII Net (₹ Cr) |
|---|---|---|---|
| t−4 (Mon) | 22,450 | 13.2 | +1,200 |
| t−3 (Tue) | 22,510 | 12.8 | +800 |
| t−2 (Wed) | 22,480 | 13.5 | −400 |
| t−1 (Thu) | 22,390 | 14.1 | −1,500 |
| t (Fri) | 22,350 | 14.8 | −2,000 |
Step 1: Observation
VIX India rising from 13.2 → 14.8 (increasing fear). FII flows turning negative (−₹1,500 Cr, −₹2,000 Cr). Nifty declining from 22,510 → 22,350. The pattern suggests continued selling pressure.
Step 2: Normalization
Using training set statistics: μ_nifty = 21,800, σ_nifty = 1,200
x_t(nifty) = (22,350 − 21,800) / 1,200 = 0.458 (normalized)
x_t(vix) = (14.8 − 13.5) / 2.1 = 0.619 (normalized)
x_t(fii) = (−2000 − 500) / 1800 = −1.389 (normalized)
Step 3: LSTM Processing (Simplified)
The LSTM processes the last 60 days. At the final timestep:
- Forget gate f_t ≈ 0.3 for the "calm market" memory (VIX was low 2 weeks ago, so forget it)
- Input gate i_t ≈ 0.8 for the "FII selling" signal (strong, recent, write it in)
- Cell state now encodes: "bearish regime: falling prices + rising volatility + FII outflow"
- Output h₆₀ is a 32-dim vector summarizing the market state
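How those gate values act on a single memory cell can be sketched with the standard LSTM cell-state update, c_t = f_t · c_{t−1} + i_t · c̃_t, followed by h_t = o_t · tanh(c_t). The snippet is a one-unit illustration, not the chapter's NumPy implementation: the previous cell value (`c_prev`), the candidate value (`c_candidate`), and the output gate (`o_t`) are made-up numbers chosen only to show the mechanics.

```python
import math

f_t = 0.3           # forget gate: keep 30% of the old memory
i_t = 0.8           # input gate: write 80% of the new candidate
o_t = 0.7           # output gate (hypothetical value)
c_prev = 1.0        # old "calm market" memory (hypothetical value)
c_candidate = -0.9  # new "FII selling" candidate (hypothetical value)

# Standard LSTM updates for one scalar unit.
c_t = f_t * c_prev + i_t * c_candidate
h_t = o_t * math.tanh(c_t)

print(f"c_t = {c_t:.2f}, h_t = {h_t:.3f}")
```

The cell value flips from positive to negative: most of the old "calm" memory is discarded and the bearish signal dominates what the unit reports downstream.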
Step 4: Dense Layer → 5-Day Forecast
h₆₀ passes through Dense(64, relu) → Dense(5):
| Day | Predicted (normalized) | De-normalized | Actual |
|---|---|---|---|
| t+1 (Mon) | 0.412 | 22,294 | 22,280 |
| t+2 (Tue) | 0.385 | 22,262 | 22,310 |
| t+3 (Wed) | 0.371 | 22,245 | 22,190 |
| t+4 (Thu) | 0.340 | 22,208 | 22,150 |
| t+5 (Fri) | 0.325 | 22,190 | 22,220 |
Step 5: Evaluate
MAPE = mean(0.063%, 0.215%, 0.248%, 0.262%, 0.135%) = 0.185%
Directional Accuracy = 3/5 correct directions = 60%. The model predicted a decline on every day; the actual moves were down Mon ✓, up Tue ✗, down Wed ✓, down Thu ✓, up Fri ✗.
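The evaluation can be checked with a short script. This is a sketch under one stated convention: direction is measured day over day across all five forecast days, starting from Friday's close of 22,350, and the model's direction is taken from its own predicted path. The variable names are illustrative.

```python
MU_NIFTY, SIGMA_NIFTY = 21_800.0, 1_200.0  # training statistics from Step 2
LAST_CLOSE = 22_350.0                       # Friday's close at time t

pred_norm = [0.412, 0.385, 0.371, 0.340, 0.325]
actual = [22_280, 22_310, 22_190, 22_150, 22_220]

# Undo the z-score scaling and round to whole index points, as in the table.
pred = [round(z * SIGMA_NIFTY + MU_NIFTY) for z in pred_norm]

# Mean absolute percentage error over the 5-day horizon.
ape = [abs(p - a) / a * 100 for p, a in zip(pred, actual)]
mape = sum(ape) / len(ape)


def directions(series, start):
    """Sign of each day-over-day move, starting from the last known close."""
    prev, signs = start, []
    for value in series:
        signs.append(1 if value > prev else -1)
        prev = value
    return signs


hits = [p == a for p, a in zip(directions(pred, LAST_CLOSE),
                               directions(actual, LAST_CLOSE))]
da = sum(hits) / len(hits) * 100

print(f"MAPE = {mape:.3f}%")
print(f"Directional accuracy = {sum(hits)}/{len(hits)} = {da:.0f}%")
```

Under this convention the predicted path falls every day, so the two up-days (Tuesday and Friday) are the misses.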
Case Study: Jio Network Traffic Prediction
🏢 Reliance Jio: Predicting Network Traffic for 400M+ Users
The Problem
Reliance Jio serves 400+ million subscribers across India, operating the world's largest mobile data network. Network capacity must be dynamically allocated across 200,000+ cell towers. Under-provisioning causes call drops and buffering; over-provisioning wastes ₹100s of crores in idle infrastructure.
Objective: Predict per-tower network traffic 4 hours ahead to enable proactive capacity allocation.
The Data
| Feature | Granularity | Description |
|---|---|---|
| Data throughput (GB) | 15-min intervals | Primary target: data consumed per tower |
| Active users | 15-min intervals | Connected devices per tower |
| Voice call minutes | 15-min intervals | Voice traffic load |
| Day of week | Daily | Weekday vs weekend patterns |
| Festival calendar | Event-based | Diwali, IPL, New Year's Eve |
| Cricket schedule | Event-based | IPL/T20 matches → streaming surges |
| Weather | Hourly | Rain → people stay indoors → higher data usage |
| Nearby events | Event-based | Concerts, elections, religious gatherings |
The Architecture: LSTM-Transformer Hybrid
Key Design Decisions
- Tower clustering reduces 200,000 individual models to 50 cluster models. Towers in a cluster share a trained model but receive tower-specific inputs.
- LSTM + Transformer hybrid: LSTM captures within-tower temporal patterns; Transformer cross-attention captures spatial correlations (e.g., when one tower is overloaded, adjacent towers absorb traffic).
- IPL-aware training: Separate calendar encoding for IPL match days. During marquee cricket fixtures (IPL games, India vs Pakistan internationals), data traffic spikes 3–5×, so the model needs explicit match-day features, not just "day of week."
- Walk-forward retraining: Models are retrained weekly on a rolling 6-month window to adapt to changing user behavior and new tower deployments.
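The tower-clustering idea can be illustrated with a tiny k-means run. This is a toy sketch, not Jio's pipeline: the two features per tower (peak hour and weekend-to-weekday traffic ratio), the six made-up towers, and the helper name `kmeans` are all assumptions for illustration. A production system would more likely call a library implementation on standardized features.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Plain k-means (Lloyd's algorithm) on a list of feature tuples."""
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return labels, centroids


# Hypothetical towers: (peak hour / 24, weekend-to-weekday traffic ratio).
towers = [
    (13 / 24, 0.60), (14 / 24, 0.55), (12 / 24, 0.65),  # business-district style
    (21 / 24, 1.20), (20 / 24, 1.10), (22 / 24, 1.30),  # residential style
]
# Deterministic start: one seed tower from each end of the list.
labels, centroids = kmeans(towers, centroids=[towers[0], towers[-1]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each cluster label would then select which shared forecasting model serves a given tower, while the tower's own inputs are still fed to that model.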
Results
| Metric | Baseline (SARIMA) | LSTM Only | LSTM-Transformer |
|---|---|---|---|
| MAE (GB/tower/15min) | 12.4 | 7.8 | 5.2 |
| MAPE | 18.3% | 11.5% | 7.8% |
| Peak hour MAPE | 28.7% | 16.2% | 9.4% |
| Inference time (50 towers) | 2.1s | 0.8s | 1.2s |
Business Impact
- ₹840 crore annual savings in infrastructure costs from smarter capacity allocation
- 38% reduction in congestion-related call drops
- 12% improvement in average video streaming quality (fewer buffering events)
- The model runs on NVIDIA A100 GPUs in Jio's private data centers, with inference serving 200K towers every 15 minutes
Common Mistakes & Critical Warnings
🚨 WARNING: Why Nifty 50 Prediction Is Fundamentally Hard
The Efficient Market Hypothesis (EMH) states that stock prices already reflect all available information. In its semi-strong form, no model using publicly available data (historical prices, fundamentals, news) can consistently outperform the market.
Implications for your LSTM:
1. Low Directional Accuracy: DA ≈ 50–53% is typical. If your model shows DA > 60% on test data, suspect data leakage or survivorship bias before celebrating.
2. Backtest ≠ Live: A model that looks great in backtest can still fail live due to: (a) slippage and transaction costs, (b) regime changes, (c) the market adapting to the same signals your model uses.
3. Misleading MAPE: A "1.5% MAPE" on Nifty sounds impressive, but a naive model that predicts tomorrow = today already achieves ~1.2% MAPE. Always compare against naive baselines.
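A naive-baseline comparison takes only a few lines. The sketch below uses made-up closing prices and made-up model forecasts, not real Nifty data, and the `skill` score (one minus the ratio of model MAPE to naive MAPE) is one common way to express the comparison rather than a metric defined in this chapter.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return total / len(actual) * 100


# Hypothetical daily closes, for illustration only (not real Nifty data).
closes = [22_000, 22_150, 22_050, 22_300, 22_250, 22_400, 22_350, 22_500]

# Made-up one-day-ahead model forecasts for days 1..7.
model_pred = [22_020, 22_140, 22_080, 22_280, 22_270, 22_380, 22_370]

actual = closes[1:]        # days 1..7
naive_pred = closes[:-1]   # persistence: tomorrow = today

model_mape = mape(actual, model_pred)
naive_mape = mape(actual, naive_pred)
# Skill > 0 means the model beats persistence; <= 0 means it adds nothing.
skill = 1 - model_mape / naive_mape

print(f"model MAPE = {model_mape:.3f}%, naive MAPE = {naive_mape:.3f}%, "
      f"skill = {skill:.2f}")
```

Both MAPE values look small in absolute terms here; only the skill score shows how little the model adds over simply repeating yesterday's close.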
On short univariate series, auto_arima or even Exponential Smoothing (ETS) can beat an LSTM. Deep learning shines with: (a) multivariate inputs, (b) long horizons, (c) complex/multiple seasonalities, (d) non-linear patterns.
Comparison: Time Series Forecasting Architectures
| Feature | ARIMA | LSTM | TCN | N-BEATS | Informer |
|---|---|---|---|---|---|
| Type | Statistical | RNN-based | CNN-based | FC-based | Transformer |
| Multivariate | ✗ (ARIMAX ✓) | ✓ Native | ✓ Native | ✗ Univariate | ✓ Native |
| Non-linear | ✗ | ✓ | ✓ | ✓ | ✓ |
| Long-range deps | Limited | Good (100s) | Excellent | Good | Excellent (1000s) |
| Training speed | Fast | Slow (sequential) | Fast (parallel) | Fast | Medium |
| Interpretability | ✓ High | ✗ Black box | ✗ Black box | ✓ (interpretable variant) | ⚠️ Attention weights |
| Data requirement | Low (~100) | Medium (~1000) | Medium (~1000) | Low (~500) | High (~5000+) |
| Multi-step | Recursive only | MIMO natural | MIMO natural | MIMO native | Generative decoder |
| Best for | Short, univariate | General purpose | Long sequences | Competition/benchmark | Very long sequences |
| Indian example | RBI inflation forecast | Zerodha volatility | Jio traffic | Flipkart demand | IMD monsoon |
• < 500 data points, univariate → ARIMA/ETS
• 500–5000 points, multivariate, medium horizon → LSTM or TCN
• Need interpretable decomposition → N-BEATS (interpretable)
• Very long sequences (> 5000), large dataset → PatchTST / Informer
• Production at scale (1000s of series) → N-BEATS or TCN (fastest)
Exercises
Section A โ Multiple Choice Questions
Which property must a time series possess for ARIMA to be directly applicable without differencing?
- Seasonality
- Stationarity
- Monotonic trend
- High autocorrelation at all lags
In an LSTM cell, which gate is responsible for deciding what new information to store in the cell state?
- Forget gate
- Input gate
- Output gate
- Reset gate
A data scientist applies StandardScaler.fit_transform() on the entire dataset before splitting into train/val/test. What error has been introduced?
- Underfitting due to aggressive normalization
- Data leakage: future data statistics contaminate training
- Loss of seasonality information
- Gradient explosion during backpropagation
Which multi-step forecasting strategy outputs all H future timesteps simultaneously from a single model?
- Recursive
- Direct
- MIMO
- Autoregressive
A TCN with kernel size 3 and dilation factors [1, 2, 4, 8, 16] has a receptive field of:
- 15 timesteps
- 31 timesteps
- 63 timesteps
- 93 timesteps
An LSTM model predicts Nifty 50 closing prices with MAPE = 1.5% and Directional Accuracy = 49%. What should you conclude?
- The model is excellent: 1.5% error is very low
- The model is useless for trading: it can't predict direction better than random
- The model needs more LSTM layers to improve
- The MAPE is wrong and should be recalculated
Informer's ProbSparse Self-Attention reduces complexity from O(L²) to:
- O(L)
- O(L log L)
- O(L√L)
- O(L log² L)
In N-BEATS, what does the "backcast" output of each block represent?
- The final forecast
- The block's reconstruction of the input (lookback period)
- The gradient signal for backpropagation
- A compressed representation for the next block
Which evaluation metric is undefined (or problematic) when the actual value y_i equals zero?
- MAE
- RMSE
- MAPE
- Directional Accuracy
A Zepto demand forecasting model uses "day of week" as a feature but encodes it as a single integer (0–6). What is the problem?
- No problem โ integer encoding is standard
- The model learns a false ordinal relationship (Sunday=0 < Monday=1 < ... < Saturday=6)
- Integer encoding causes gradient explosion
- The feature is redundant with "month" encoding
Section B โ Short Answer Questions
Explain why the recursive multi-step strategy suffers from error accumulation, using a concrete example with Nifty 50 prediction.
Nifty Example: Suppose the model predicts ŷ(t+1) = 22,350 (actual: 22,300, error = +50). This inflated value then becomes the input for ŷ(t+2), which might predict 22,380 (actual: 22,250, error = +130). By day 30, the cumulative error can exceed 500 points, making predictions meaningless. MIMO avoids this by outputting all 30 days simultaneously from the true input window.
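The compounding effect can be simulated with a toy one-step model. Everything here is hypothetical: the index is assumed to stay flat at 22,300, and `one_step_model` is a stand-in forecaster with a small +0.2% bias, not the chapter's LSTM. The point is only to show how feeding predictions back in lets a small one-step error grow.

```python
TRUE_LEVEL = 22_300.0   # hypothetical: the index stays flat at this level
MODEL_GAIN = 1.002      # hypothetical one-step model with a +0.2% bias


def one_step_model(last_value):
    """Toy one-step-ahead forecaster that slightly overshoots."""
    return MODEL_GAIN * last_value


horizon = 30
recursive_errors = []
value = TRUE_LEVEL
for step in range(1, horizon + 1):
    # Recursive strategy: the previous *prediction* is the next input.
    value = one_step_model(value)
    recursive_errors.append(value - TRUE_LEVEL)

# One-step error when the input is always the true value, for comparison.
one_step_error = one_step_model(TRUE_LEVEL) - TRUE_LEVEL

print(f"one-step error:            {one_step_error:7.1f} points")
print(f"recursive error at day 5:  {recursive_errors[4]:7.1f} points")
print(f"recursive error at day 30: {recursive_errors[29]:7.1f} points")
```

The error at day 30 is many times the one-step error even though the model itself never changes; a MIMO head avoids this feedback loop because every horizon is predicted from observed inputs.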
What is the Efficient Market Hypothesis (EMH), and how does it explain why our LSTM's Directional Accuracy is ~52%?
Describe two ways to make a non-stationary time series stationary. Give an Indian data example for each.
1. First differencing y'(t) = y(t) − y(t−1): Removes linear trend. Example: India's monthly GST collection (₹ crores) has a strong upward trend from ₹90,000 Cr (2019) to ₹1,80,000 Cr (2025). First differencing converts the series to month-over-month changes, making it approximately stationary.
2. Log transformation + differencing: Stabilizes multiplicative variance. Example: the Sensex rose from about 4,000 (2000) to about 75,000 (2025), so absolute daily changes grow with the level (roughly 100 points in 2000 vs 1,000 points in 2025). The log transform converts multiplicative changes to additive ones, and differencing then removes the trend. Log returns ln(P_t/P_{t-1}) are approximately stationary.
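Both transformations can be tried on synthetic data. The two series below are hypothetical (a +500-per-step linear trend and a 1%-per-step exponential trend), chosen so the effect is easy to read; they are not GST or Sensex figures.

```python
import math


def first_difference(series):
    """y'(t) = y(t) - y(t-1)."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def log_returns(prices):
    """ln(P_t / P_{t-1}), i.e. the first difference of log prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


# Hypothetical series with a linear trend: +500 per step.
linear = [90_000 + 500 * t for t in range(6)]
print(first_difference(linear))

# Hypothetical series growing 1% per step (multiplicative trend).
exponential = [4_000 * 1.01 ** t for t in range(6)]
print([round(d, 1) for d in first_difference(exponential)])
print([round(r, 5) for r in log_returns(exponential)])
```

Plain differencing flattens the linear series but leaves the exponential one with steadily growing increments, whereas its log returns are constant, which is why index levels are usually modeled in log-return form.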
Why does N-BEATS use both "backcast" and "forecast" outputs in each block? How does this enable residual learning?
Explain why Jio clusters its 200,000 cell towers into 50 types rather than training a single global model or 200,000 individual models.
• 200K individual models: Impractical. There is insufficient data per tower for deep learning (only ~1000 data points per tower), plus 200K training jobs and a massive maintenance burden.
• 1 global model: A tower in Connaught Place (Delhi CBD, commercial, peak=daytime) has completely different patterns from a tower in a Rajasthan village (residential, peak=evening). A single model can't capture this heterogeneity without becoming enormously complex.
• 50 clusters: Group towers by usage pattern similarity (K-means on feature vectors: peak hours, weekday/weekend ratio, data vs voice mix). Each cluster gets a dedicated model that learns the shared temporal pattern. Tower-specific inputs (latitude, elevation, capacity) provide individualization within the cluster model. This is the "sweet spot": enough data per model, manageable complexity, and pattern-specific predictions.
Section C โ Long Answer Questions
(15 marks) Compare LSTM, TCN, and Informer for predicting hourly electricity consumption across India's five regional grids (Northern, Southern, Eastern, Western, North-Eastern). For each architecture, discuss: (a) how it handles the multiple seasonalities (hourly, daily, weekly, monsoon), (b) data requirements, (c) training efficiency, (d) ability to incorporate exogenous variables (temperature, industrial production index, festival calendar), and (e) your recommended architecture with justification.
(15 marks) Design a complete time series forecasting pipeline for Zepto's dark store demand prediction. Cover: data sources, feature engineering (including Indian-specific features), model selection, training with walk-forward validation, evaluation metrics, deployment considerations (inference speed for 700 stores × 5000 SKUs), and how you would handle cold-start for new dark stores with no historical data.
(15 marks) Critically evaluate the claim: "Deep learning has made ARIMA obsolete for time series forecasting." Discuss with reference to: (a) the M4/M5 forecasting competitions, (b) computational cost considerations for Indian MSME businesses, (c) interpretability requirements in regulated sectors (RBI, SEBI), (d) data scarcity scenarios, and (e) ensemble approaches that combine statistical and DL methods.
Section D โ Programming Exercises
Electricity Consumption Forecasting (LSTM)
Using the UCI "Individual Household Electric Power Consumption" dataset (or simulated Indian grid data), build an LSTM model to predict the next 24 hours of electricity consumption at 1-hour granularity. Requirements:
- Create proper temporal features: hour of day (cyclical), day of week (cyclical), month, is_weekend
- Add Indian holiday/festival indicators (Diwali, Holi, Independence Day)
- Implement walk-forward validation with 5 folds
- Compare LSTM against a naive baseline (predict today = yesterday)
- Report MAE, RMSE, and MAPE for each fold and the mean
- Plot actual vs predicted for the final test fold
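As a starting point for the cyclical features in the first requirement, the usual sine/cosine encoding can be sketched as follows. The helper names `cyclical_encode` and `circle_distance` are illustrative, not part of any library.

```python
import math


def cyclical_encode(value, period):
    """Map a periodic integer (e.g. hour 0-23) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def circle_distance(a, b, period):
    """Euclidean distance between two encoded values."""
    return math.dist(cyclical_encode(a, period), cyclical_encode(b, period))


# Hour 23 and hour 0 are neighbours on the clock, so they end up close together,
# while hour 0 and hour 12 sit on opposite sides of the circle.
print(round(circle_distance(23, 0, 24), 3))
print(round(circle_distance(0, 12, 24), 3))
```

The same function works for day of week with `period=7`, where it places day 6 and day 0 exactly as far apart as day 0 and day 1.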
Electricity Consumption Forecasting (Model Comparison)
Extend D1 by implementing three models on the same dataset:
- ARIMA (using pmdarima.auto_arima)
- LSTM (2-layer, 64 units, Keras)
- TCN (using keras-tcn, dilations=[1,2,4,8,16,32])
Compare: (a) test MAE/RMSE/MAPE, (b) training time, (c) inference time per prediction, (d) performance on normal days vs peak days (summer afternoons, festival evenings). Create a comprehensive comparison table and recommend the best model for deployment at a state electricity board (e.g., MSEDCL Maharashtra).
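Before training the TCN, it helps to check how far back its dilation schedule can see. The sketch below uses the textbook receptive-field formula for stacked dilated causal convolutions, 1 + (k − 1) · Σd. The function name, the kernel sizes, and the `convs_per_level` knob are assumptions for illustration; the exact count for a given library depends on how its residual blocks are built, so check that library's documentation.

```python
def tcn_receptive_field(kernel_size, dilations, convs_per_level=1):
    """Timesteps visible to the last output of a stack of dilated causal convs.

    Each causal convolution with dilation d widens the window by
    (kernel_size - 1) * d; `convs_per_level` is an assumed knob for
    residual blocks that apply more than one convolution per dilation.
    """
    return 1 + convs_per_level * (kernel_size - 1) * sum(dilations)


dilations = [1, 2, 4, 8, 16, 32]   # the dilation schedule from the exercise
for k in (2, 3):                   # kernel sizes are assumptions
    rf = tcn_receptive_field(k, dilations)
    print(f"kernel_size={k}: receptive field = {rf} hourly steps")
```

With hourly data, a receptive field of 64 steps covers under three days, so a TCN that should see a full week (168 hours) needs a larger kernel, more dilation levels, or more convolutions per level.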
Section E โ Mini-Project
Mini-Project: Indian Railway Passenger Traffic Predictor
Objective: Build a deep learning system to predict daily passenger count for the top 10 busiest Indian Railway stations (New Delhi, Mumbai CST, Howrah, Chennai Central, Bangalore City, Secunderabad, Ahmedabad, Jaipur, Lucknow, Pune) for the next 7 days.
Requirements:
- Data Generation: Create synthetic but realistic daily passenger data for 5 years (2020–2025) incorporating: COVID lockdown dip (Apr–Jun 2020), gradual recovery, festive surges (Diwali, Chhath Puja, Pongal), summer holiday peak (May–Jun), Monday/Friday travel spikes
- Feature Engineering: Station-specific features, day of week (cyclical), festival proximity, school holiday indicator, weather (monsoon months for relevant stations)
- Models: Implement at least two: (a) Stacked LSTM, (b) TCN or N-BEATS
- Validation: Walk-forward validation with monthly retraining
- Evaluation: MAE, MAPE, and Directional Accuracy per station
- Visualization: Dashboard-style plots: actual vs predicted, error distribution, seasonal decomposition
- Report: Which stations are easiest/hardest to predict and why? How does the model handle sudden disruptions (e.g., Odisha train accident June 2023)?
Deliverables:
- Jupyter notebook with complete code
- Comparison table of all models across all 10 stations
- 2-page analysis report with recommendations for Indian Railways
Evaluation Criteria:
| Component | Marks |
|---|---|
| Realistic data generation with Indian patterns | 15 |
| Feature engineering quality | 15 |
| Model implementation (correctness, architecture) | 25 |
| Walk-forward validation (no leakage) | 15 |
| Evaluation metrics & visualization | 15 |
| Analysis report & insights | 15 |
| Total | 100 |
Chapter Summary
Key Takeaways: Time Series Forecasting with Deep Learning
- Time series have temporal structure (trend, seasonality, cyclicality, and noise) that must be respected in all modeling decisions (splitting, scaling, feature engineering).
- Stationarity is the gateway concept. Use ADF test to check; apply differencing and log transforms to achieve it. Even deep learning benefits from stationary inputs.
- ARIMA remains your baseline: fast, interpretable, and competitive for short-horizon univariate forecasting. Always compare DL models against ARIMA and naive baselines.
- LSTMs are the workhorse of deep time series: their gated memory naturally handles temporal dependencies. Use 1–2 layers with 32–128 units for typical problems.
- TCNs offer faster training (parallel), stable gradients, and flexible receptive fields via dilated causal convolutions. Consider TCNs when training speed matters.
- N-BEATS outperformed the M4 Competition winner on the M4 benchmark using pure FC layers. Its interpretable variant decomposes forecasts into trend and seasonal components, which is valuable for business stakeholders.
- Informer brings Transformers to long time series with O(L log L) ProbSparse attention, enabling very long input sequences; successors such as PatchTST and TimesFM continue this line of work.
- Feature engineering often matters more than architecture: lag features, rolling statistics, and domain-specific calendars (Indian festivals, IPL, salary cycles, RBI policy dates).
- MIMO forecasting (predicting all H steps simultaneously) avoids the error accumulation of recursive approaches and is the natural fit for LSTM/TCN output layers.
- Walk-forward validation is the ONLY correct validation strategy. Never use k-fold on time series. Scale using training statistics only, so that no future data leaks into training.
- Evaluate with multiple metrics: MAE for absolute error, RMSE when large errors are costly, MAPE for scale-independent comparison, and Directional Accuracy for trading/decision applications.
- Stock prediction is fundamentally hard (EMH). A "low MAPE" is misleading, so always check Directional Accuracy and compare against naive baselines. Backtest performance rarely translates to live profits.
References & Further Reading
Foundational Papers
- Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735–1780. The original LSTM paper.
- Box, G. E. P. & Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control. Holden-Day. Foundation of ARIMA methodology.
- Bai, S., Kolter, J. Z., & Koltun, V. (2018). "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling." arXiv:1803.01271. TCN vs LSTM benchmark.
- Oreshkin, B. N. et al. (2020). "N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting." ICLR 2020. Outperformed the M4 Competition winner on the M4 dataset.
- Zhou, H. et al. (2021). "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting." AAAI 2021. ProbSparse attention for time series.
- Nie, Y. et al. (2023). "A Time Series is Worth 64 Words: Long-term Forecasting with Transformers." ICLR 2023. PatchTST.
- Fama, E. F. (1970). "Efficient Capital Markets: A Review of Theory and Empirical Work." Journal of Finance, 25(2), 383–417. EMH.
Textbooks
- Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice, 3rd edition. OTexts. Free online at otexts.com/fpp3.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 10: Sequence Modeling.
- Lazzeri, F. (2021). Machine Learning for Time Series Forecasting with Python. Wiley.
Indian Context
- NSE India. "Nifty 50 Index Methodology Document." niftyindices.com
- India Meteorological Department (IMD). "Extended Range Prediction System." Uses ML/DL for seasonal monsoon forecasting.
- Reliance Jio Annual Report 2024. "Digital Infrastructure & Network Optimization." AI-driven capacity planning.
- SEBI (2023). "Circular on Algorithmic Trading." Regulatory guidelines for AI-based trading in Indian markets.
Libraries & Tools
- statsmodels: ARIMA, seasonal decomposition, ADF test (statsmodels.org)
- pmdarima: Auto ARIMA for Python (alkaline-ml.com/pmdarima)
- keras-tcn: TCN implementation for Keras (GitHub)
- darts: Unified time series library by Unit8 (unit8co.github.io/darts)
- pytorch-forecasting: Production time series with PyTorch (pytorch-forecasting.readthedocs.io)
- holidays: Indian public holiday library (GitHub)