Mathematical Toolkit for Deep Learning
Every neural network is a composition of mathematical functions. Before you write a single line of PyTorch, you must own the math that powers it: linear algebra for data flow, calculus for learning, probability for uncertainty, and information theory for loss functions.
Prerequisites
- Class 12 Mathematics (CBSE/ISC): matrices, determinants, basic calculus
- Chapter 1 of this textbook (Python & NumPy basics)
- Comfort with Σ (summation) and basic algebraic manipulation
Learning Objectives
By the end of this chapter, you will be able to:
- Represent datasets as matrices/tensors and perform core linear algebra operations in NumPy
- Compute dot products, transposes, inverses, and Hadamard products by hand and in code
- Derive derivatives from first principles, apply the chain rule, and compute gradients
- Derive the sigmoid derivative step-by-step and explain why it matters for backpropagation
- Define Bernoulli and Gaussian distributions and compute Maximum Likelihood Estimates
- Derive cross-entropy loss from negative log-likelihood (MLE), the "why" behind the loss function
- Explain entropy, KL divergence, and cross-entropy and their roles in neural network training
- Implement all key operations from scratch in NumPy
The Hook: Why Math Matters
🎯 The Challenge
Paytm has 350 million registered users. Their data science team needs to predict which users will stop using Paytm Wallet next month. They have 50 features per user: transaction frequency, average spend (₹), last login days, UPI usage ratio, cashback redeemed, and more.
Here's the secret: every prediction is a math equation. The neural network takes a 50-dimensional vector (one user's features), multiplies it by weight matrices, applies nonlinear functions, and outputs a probability between 0 and 1. That probability is the churn score.
To understand how the network learns the right weights, you need linear algebra (matrix multiplication), calculus (gradient descent), probability (loss function), and information theory (cross-entropy). This chapter gives you every tool.
India Connect
India's UPI processed 14.04 billion transactions worth ₹20.64 lakh crore in a single month (March 2024). Behind every fraud detection model, recommendation engine, and credit scoring system at PhonePe, Google Pay, and Paytm, the same mathematical toolkit powers the neural networks. Master this chapter and you speak the language of Indian fintech AI.
2.1 Linear Algebra for Deep Learning
Linear algebra is the language of data in deep learning. Every input, every weight, every output is a multi-dimensional array. Let's build the vocabulary.
2.1.1 Scalars, Vectors, Matrices, and Tensors
Scalar (0-D Tensor)
A single number. We denote scalars with lowercase italic letters: x, y, α, λ.
Vector (1-D Tensor)
An ordered list of numbers. We denote vectors with bold lowercase: x. Each element is x₁, x₂, ..., xₙ.
Example: A Paytm user's feature vector with 5 features:
x = [25, 12500, 3, 0.85, 450]ᵀ
(age, avg_spend_₹, days_since_login, upi_ratio, cashback_₹)
Matrix (2-D Tensor)
A 2-D array of numbers. Denoted with bold uppercase: A. Shape is m × n (m rows, n columns).
Example: 100 IRCTC passengers × 4 features = X ∈ ℝ¹⁰⁰ˣ⁴
Tensor (n-D Array)
An array with more than 2 axes. A color image is a 3-D tensor (height × width × channels). A batch of images is a 4-D tensor.
Color Image: (H, W, 3) | Batch of Images: (B, H, W, 3)
Example: 32 images of size 224×224 RGB → shape (32, 224, 224, 3)
Shape Notation is Everything
In deep learning, shape errors are the #1 bug. Always track tensor shapes. A weight matrix connecting a layer of 784 neurons to 256 neurons has shape (784, 256). The input batch has shape (64, 784). The output is (64, 256). Always verify: inner dimensions must match for matrix multiplication.
2.1.2 Matrix Multiplication: The Dot Product
Matrix multiplication is the single most important operation in neural networks. Every forward pass is a series of matrix multiplications.
For A ∈ ℝᵐˣⁿ and B ∈ ℝⁿˣᵖ:
C = A · B where C_{ij} = Σₖ A_{ik} · B_{kj} (k = 1 to n)
Result shape: C ∈ ℝᵐˣᵖ
Rule: Inner dimensions must match: (m×n) · (n×p) = (m×p)
Intuition: Each element C_{ij} is the dot product of row i of A with column j of B. Think of it as: "How much does row i of A align with column j of B?"
In neural networks: If x is a single input vector (1×n) and W is a weight matrix (n×h), then x·W gives the output (1×h): one number per hidden neuron. Each output neuron computes a weighted sum of the inputs.
2.1.3 Transpose
If A ∈ ℝᵐˣⁿ, then Aᵀ ∈ ℝⁿˣᵐ
Rows become columns, columns become rows.
Properties:
(Aᵀ)ᵀ = A | (A + B)ᵀ = Aᵀ + Bᵀ | (AB)ᵀ = BᵀAᵀ
2.1.4 Matrix Inverse
A · A⁻¹ = A⁻¹ · A = I (Identity matrix; A must be square and non-singular)
Used in: Normal equation for linear regression: w = (XᵀX)⁻¹ Xᵀy
Pro Tip
In practice, we almost never compute matrix inverses explicitly in deep learning. Inversion costs O(n³) and can be numerically unstable. We use gradient descent instead. But understanding inverses helps you read research papers and derive closed-form solutions.
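As a rough illustration of the point about explicit inverses, the sketch below solves the normal equation on a small synthetic dataset three ways. The data, the noise level, and the names `X_demo`, `y_demo`, and `w_true` are made up for illustration; only the NumPy calls are real.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples, 3 features, known weights plus small noise.
X_demo = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ w_true + 0.01 * rng.normal(size=50)

# Textbook form: w = (X^T X)^{-1} X^T y, with an explicit inverse.
w_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Usually preferred: solve the linear system (X^T X) w = X^T y directly.
w_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Least-squares solver that works on X itself and never forms X^T X.
w_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(w_inv, w_solve, w_lstsq)
```

On a well-conditioned problem like this one all three agree; the difference shows up when XᵀX is close to singular, where the explicit inverse tends to lose accuracy first.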
2.1.5 Element-wise (Hadamard) Product
(A ⊙ B)_{ij} = A_{ij} · B_{ij}
Both matrices must have the same shape.
Used in: Gating mechanisms (LSTM, GRU), attention masks, dropout
Don't confuse this with matrix multiplication! The Hadamard product multiplies corresponding elements, while matrix multiplication computes dot products of rows and columns.
Worked Example: IRCTC Passenger Data as a Matrix
Consider 5 IRCTC passengers with 4 features each: Age, Fare (₹), Distance (km), Class (1/2/3).
Data
Passenger Matrix X ∈ ℝ⁵ˣ⁴:
       Age   Fare(₹)   Distance(km)   Class
P1  [  28,   1450,      850,          3  ]
P2  [  45,   3200,     1200,          2  ]
P3  [  32,   5800,     2100,          1  ]
P4  [  56,   1800,      950,          3  ]
P5  [  23,   4500,     1800,          1  ]
Shape: (5, 4)
X[2, 1] = 5800 (3rd passenger, 2nd feature = Fare)
X[:, 0] = [28, 45, 32, 56, 23] (all ages: the first column of X)
Weight vector for predicting "will book again" (4 features → 1 output):
w = [0.01, 0.0002, 0.0003, -0.5]ᵀ
Score for P1 = 28×0.01 + 1450×0.0002 + 850×0.0003 + 3×(-0.5) = 0.28 + 0.29 + 0.255 - 1.5 = -0.675
Score for P3 = 32×0.01 + 5800×0.0002 + 2100×0.0003 + 1×(-0.5) = 0.32 + 1.16 + 0.63 - 0.5 = 1.61
P3 (1AC traveler, long distance) has a much higher rebooking score. The entire batch computation is simply X · w.
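The batch computation X · w from this worked example can be sketched in NumPy as follows. The matrix and weights are copied from the example; the variable name `scores` is just a label chosen here.

```python
import numpy as np

# Passenger matrix from the worked example: Age, Fare (INR), Distance (km), Class.
X = np.array([
    [28, 1450,  850, 3],
    [45, 3200, 1200, 2],
    [32, 5800, 2100, 1],
    [56, 1800,  950, 3],
    [23, 4500, 1800, 1],
], dtype=float)
w = np.array([0.01, 0.0002, 0.0003, -0.5])

scores = X @ w              # (5, 4) @ (4,) -> (5,): one score per passenger

print(X[2, 1])              # fare of the 3rd passenger
print(X[:, 0])              # all ages (NumPy returns a 1-D array of length 5)
print(np.round(scores, 3))  # rebooking scores for P1..P5
```

One matrix-vector product replaces five separate weighted sums, which is the same pattern a dense layer applies to a whole batch.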
2.2 Calculus for Deep Learning
Calculus tells us how to learn. A neural network improves by computing how much each weight contributed to the error, then adjusting it. That computation is a derivative.
2.2.1 Derivatives from First Principles
f'(x) = lim_{h→0} [f(x + h) − f(x)] / h
The derivative tells you: if x changes by a tiny amount, how much does f(x) change?
It is the slope of the tangent line at point x.
Example: Deriving d/dx [x²] from first principles
d/dx [x²] = lim_{h→0} [(x + h)² − x²] / h
= lim_{h→0} [x² + 2xh + h² − x²] / h
= lim_{h→0} (2x + h)
= 2x
So d/dx [x²] = 2x. At x = 3, the slope is 6, meaning a tiny increase in x increases x² by approximately 6 times that increase.
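The limit in the first-principles definition can be watched numerically. This is a minimal sketch for f(x) = x² at x = 3; the step sizes are arbitrary choices for illustration.

```python
def f(x):
    return x ** 2

x0 = 3.0
steps = (1.0, 0.1, 0.01, 0.001)

# Difference quotient [f(x0 + h) - f(x0)] / h for shrinking h.
slopes = [(f(x0 + h) - f(x0)) / h for h in steps]

for h, slope in zip(steps, slopes):
    print(f"h = {h:<6} slope = {slope:.4f}")
```

For x² the quotient equals 2x + h exactly (up to floating-point rounding), so the printed slopes approach 6 as h shrinks.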
Key Derivative Rules
| Rule | Formula | Example |
|---|---|---|
| Power Rule | d/dx [xⁿ] = nxⁿ⁻¹ | d/dx [x³] = 3x² |
| Constant Multiple | d/dx [cf(x)] = c·f'(x) | d/dx [5x²] = 10x |
| Sum Rule | d/dx [f+g] = f'+g' | d/dx [x²+3x] = 2x+3 |
| Product Rule | d/dx [fg] = f'g + fg' | d/dx [x·eˣ] = eˣ + xeˣ |
| Exponential | d/dx [eˣ] = eˣ | d/dx [e³ˣ] = 3e³ˣ |
| Logarithm | d/dx [ln x] = 1/x | d/dx [ln(2x)] = 1/x |
2.2.2 The Chain Rule: Heart of Backpropagation
dy/dx = (dy/du) · (du/dx) where u = g(x)
"The derivative of the outer function × the derivative of the inner function"
Why this matters: A neural network is a composition of functions: output = f₃(f₂(f₁(x))). To compute how the loss changes w.r.t. weight w₁ in the first layer, we chain derivatives through every subsequent layer. This is backpropagation.
Example: Chain Rule
Let y = (3x + 2)⁵. Find dy/dx.
Let u = 3x + 2, so y = u⁵. Then dy/du = 5u⁴ and du/dx = 3, giving
dy/dx = 5(3x + 2)⁴ · 3 = 15(3x + 2)⁴
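For y = (3x + 2)⁵ the chain rule gives dy/dx = 5(3x + 2)⁴ · 3. A quick way to gain confidence in a hand derivation like this is to compare it with a central-difference estimate; the evaluation point x = 1 and step h below are arbitrary choices.

```python
def y(x):
    return (3 * x + 2) ** 5

def dy_dx(x):
    # Outer derivative 5u^4 evaluated at u = 3x + 2, times inner derivative 3.
    return 5 * (3 * x + 2) ** 4 * 3

x0 = 1.0
h = 1e-6
numeric = (y(x0 + h) - y(x0 - h)) / (2 * h)   # central difference

print(dy_dx(x0), numeric)
```

The two numbers should agree to several decimal places; a large gap usually means a factor (often the inner derivative) was dropped.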
2.2.3 Partial Derivatives and Gradients
When a function has multiple inputs (like a loss function with many weights), we take partial derivatives: differentiate w.r.t. one variable while holding the others constant.
Example: f(x, y) = 3x²y + 2xy³
∂f/∂x = 6xy + 2y³ (treat y as constant)
∂f/∂y = 3x² + 6xy² (treat x as constant)
The Gradient Vector
∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]ᵀ
The gradient is a vector of all partial derivatives.
It points in the direction of steepest ascent.
To minimize loss: move in the opposite direction: w = w - α∇L
Gradient Descent Intuition
Imagine you're blindfolded on a hilly terrain (the loss surface). The gradient tells you which direction is uphill. You take a step in the opposite direction (downhill). The learning rate α controls how big each step is. Too big and you overshoot. Too small and you crawl forever. This is gradient descent: the fundamental learning algorithm of deep learning.
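The effect of the learning rate can be seen on a one-dimensional toy problem. The sketch below assumes a made-up quadratic loss L(w) = (w − 3)² with its minimum at w = 3; the function name `run_gd` and the three learning rates are illustrative choices.

```python
def run_gd(lr, steps=20, w0=0.0):
    """Gradient descent on the toy loss L(w) = (w - 3)^2."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)      # dL/dw
        w = w - lr * grad       # step against the gradient
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr = {lr:<5} -> w after 20 steps = {run_gd(lr):.4f}")
```

With the tiny rate, w has barely left its starting point after 20 steps; with 0.1 it is close to 3; with 1.1 every step overshoots by more than the previous error, so w moves further from the minimum each time.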
2.2.4 Jacobian and Hessian (Brief Introduction)
Jacobian: for a vector-valued function f: ℝⁿ → ℝᵐ,
J_{ij} = ∂fᵢ/∂xⱼ (shape: m × n)
The Jacobian generalizes the gradient to vector-valued functions.
Used in: Backpropagation through layers with vector outputs.
Hessian: for a scalar function f: ℝⁿ → ℝ,
H_{ij} = ∂²f/(∂xᵢ ∂xⱼ) (shape: n × n)
Matrix of second-order partial derivatives.
Tells you about curvature of the loss surface.
Used in: Second-order optimization (Newton's method, L-BFGS).
Pro Tip
You won't compute Jacobians or Hessians by hand in practice; frameworks like PyTorch handle this with automatic differentiation. But understanding them conceptually helps you debug training issues and read advanced papers on optimization.
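To make the Jacobian's shape and entries concrete, here is a small finite-difference sketch in plain Python. The function `f` is a hypothetical example (it does not come from any network), and `numerical_jacobian` is a helper written for this illustration, not a library routine.

```python
def f(x):
    """Hypothetical vector-valued function f: R^2 -> R^2."""
    x0, x1 = x
    return [x0 * x1, x0 + x1 ** 2]

def numerical_jacobian(func, x, h=1e-6):
    """Approximate J[i][j] = d f_i / d x_j with central differences."""
    m = len(func(x))
    n = len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        x_plus = list(x)
        x_plus[j] += h
        x_minus = list(x)
        x_minus[j] -= h
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J

J = numerical_jacobian(f, [2.0, 3.0])
print(J)   # analytic Jacobian at (2, 3) is [[3, 2], [1, 6]]
```

Row i holds the gradient of output fᵢ, and column j shows how every output reacts to input xⱼ.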
2.2.5 Worked Example: Derivative of the Sigmoid Function
Deriving ฯ'(z) Step by Step
The sigmoid function is one of the most important activation functions:
σ(z) = 1 / (1 + e⁻ᶻ)
Let's derive its derivative from scratch. Write σ = 1/u with u = 1 + e⁻ᶻ, so dσ/du = −1/u² and du/dz = −e⁻ᶻ:
dσ/dz = dσ/du · du/dz
= [−1/(1 + e⁻ᶻ)²] · [−e⁻ᶻ]
= e⁻ᶻ / (1 + e⁻ᶻ)²
= [1/(1 + e⁻ᶻ)] · [e⁻ᶻ/(1 + e⁻ᶻ)]
= [1/(1 + e⁻ᶻ)] · [(1 + e⁻ᶻ − 1)/(1 + e⁻ᶻ)]
= [1/(1 + e⁻ᶻ)] · [1 − 1/(1 + e⁻ᶻ)]
= σ(z) · (1 − σ(z))
Maximum value: σ'(0) = 0.5 × 0.5 = 0.25
As |z| → ∞, σ'(z) → 0 (vanishing gradient!)
Why this matters: The sigmoid derivative is at most 0.25. When you chain many sigmoid layers, these factors multiply: 0.25 × 0.25 × 0.25 ≈ 0.016. After 10 layers: 0.25¹⁰ ≈ 0.000001. The gradient vanishes and the early layers stop learning. This is the vanishing gradient problem, and it's a major reason ReLU replaced sigmoid in hidden layers.
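The arithmetic behind the vanishing gradient is easy to reproduce. This sketch only tracks the best-case sigmoid factor of 0.25 per layer and ignores the weight matrices, which is a simplification made here for illustration.

```python
max_sigmoid_grad = 0.25   # upper bound on sigma'(z), attained at z = 0

depths = (1, 3, 5, 10, 20)
bounds = {n: max_sigmoid_grad ** n for n in depths}

for n_layers, bound in bounds.items():
    print(f"{n_layers:>2} sigmoid layers -> gradient factor <= {bound:.2e}")
```

Even in the best case the factor shrinks geometrically with depth, so the first layers of a deep sigmoid stack receive almost no learning signal.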
2.3 Probability & Statistics for Deep Learning
Neural networks don't output certainties; they output probabilities. Understanding probability is essential for designing loss functions, interpreting outputs, and reasoning about uncertainty.
2.3.1 Bernoulli Distribution
P(X = 1) = p P(X = 0) = 1 - p
Compact form: P(X = x) = pˣ(1−p)¹⁻ˣ for x ∈ {0, 1}
Mean: E[X] = p | Variance: Var(X) = p(1-p)
Deep learning connection: Binary classification models each label as a Bernoulli random variable. When your model outputs P(churn) = 0.82 for a Paytm user, it's saying: "This user's churn follows Bernoulli(0.82)."
2.3.2 Gaussian (Normal) Distribution
p(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))
Mean: μ | Variance: σ² | 68-95-99.7 rule
Deep learning connection: Weight initialization (Xavier, He) draws from Gaussians. Noise in VAEs is Gaussian. Regression targets are often modeled as Gaussian with learned mean and variance.
2.3.3 Conditional Probability & Bayes' Theorem
P(A | B) = P(B | A) · P(A) / P(B)
posterior = (likelihood × prior) / evidence
India Connect: Bayes in Action
Flipkart's search engine uses Bayesian reasoning: P(user wants "iPhone" | typed "i phone") is high because P(typed "i phone" | wants "iPhone") is very high (likelihood), and iPhones are frequently searched (prior). This is how spelling correction and query understanding work.
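A small numeric sketch shows how the posterior follows from the likelihood, prior, and evidence. Every probability below is a made-up illustrative value, not Flipkart data, and the variable names are chosen only for this example.

```python
# All probabilities below are made-up illustrative values, not Flipkart data.
p_iphone = 0.05              # prior: P(wants "iPhone")
p_typed_given_iphone = 0.30  # likelihood: P(typed "i phone" | wants "iPhone")
p_typed_given_other = 0.001  # P(typed "i phone" | wants something else)

# Evidence via the law of total probability.
p_typed = (p_typed_given_iphone * p_iphone
           + p_typed_given_other * (1 - p_iphone))

# Bayes' theorem: posterior = likelihood * prior / evidence.
posterior = p_typed_given_iphone * p_iphone / p_typed
print(f"P(wants iPhone | typed 'i phone') = {posterior:.3f}")
```

Under these assumed numbers, a 5% prior becomes a posterior above 90%, because the query is far more likely to come from someone who wants an iPhone than from anyone else.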
2.3.4 Maximum Likelihood Estimation (MLE)
MLE answers: "Given the data we observed, what parameter values make the data most probable?"
θ̂_MLE = argmax_θ P(D | θ)
= argmax_θ Π P(xᵢ | θ) (assuming i.i.d. samples)
= argmax_θ Σ log P(xᵢ | θ) (log-likelihood: easier to optimize)
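One practical reason for switching from the product to the sum of logs is floating-point underflow. The sketch below assumes a hypothetical dataset of 5000 i.i.d. samples, each assigned probability 0.5.

```python
import math

probs = [0.5] * 5000          # hypothetical: 5000 i.i.d. samples, each with P = 0.5

likelihood = math.prod(probs)                      # product of 5000 numbers below 1
log_likelihood = sum(math.log(p) for p in probs)   # sum of their logs

print(likelihood)       # underflows to 0.0 in double precision
print(log_likelihood)   # about -3465.7, easily representable
```

Because log is monotonic, the θ that maximizes the sum of logs is the same θ that maximizes the product, so nothing is lost by the switch.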
2.3.5 MLE for Bernoulli: Full Derivation
Worked Example: MLE for PhonePe Fraud Probability
PhonePe observes 1000 UPI transactions. 47 are fraudulent (y=1), 953 are legitimate (y=0). What is the MLE estimate of the fraud probability p?
L(p) = Π_{i=1}^{1000} p^{yᵢ} (1−p)^{1−yᵢ}
= p^{Σyᵢ} (1−p)^{n − Σyᵢ}
= p⁴⁷ (1−p)⁹⁵³
ℓ(p) = log L(p) = 47·log(p) + 953·log(1−p)
dℓ/dp = 47/p − 953/(1−p) = 0
47(1−p) = 953p
47 − 47p = 953p
47 = 1000p
p̂ = 47/1000 = 0.047
d²ℓ/dp² = −47/p² − 953/(1−p)² < 0 ✓ (concave → maximum)
Result: The MLE estimate of fraud probability is 4.7%, exactly the observed proportion! For Bernoulli, MLE always gives p̂ = (number of successes) / (total trials).
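The closed-form answer can be cross-checked by brute force: evaluate the log-likelihood 47·log(p) + 953·log(1−p) on a grid of candidate values and pick the largest. The grid spacing of 0.001 is an arbitrary choice for this sketch.

```python
import math

def log_likelihood(p, successes=47, n=1000):
    """Bernoulli log-likelihood for `successes` ones out of n trials."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

# Candidate values 0.001, 0.002, ..., 0.999 (a coarse grid, chosen for illustration).
grid = [k / 1000 for k in range(1, 1000)]
p_best = max(grid, key=log_likelihood)

print(p_best, log_likelihood(p_best))
```

The grid maximum lands on the observed fraud proportion, matching the calculus result.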
2.3.6 From MLE to Cross-Entropy Loss
Here's the most important derivation in this chapter: why we use cross-entropy as a loss function.
The Bridge: Negative Log-Likelihood = Cross-Entropy
When our neural network outputs probability ŷᵢ for each sample, and the true label is yᵢ ∈ {0,1}, we want to maximize the likelihood. Equivalently, we minimize the negative log-likelihood:
Log-likelihood: ℓ = Σ [yᵢ log(ŷᵢ) + (1−yᵢ) log(1−ŷᵢ)]
Negative log-likelihood, averaged over the n samples (loss to minimize):
L_BCE = −(1/n) Σ [yᵢ log(ŷᵢ) + (1−yᵢ) log(1−ŷᵢ)]
This IS Binary Cross-Entropy Loss!
Cross-entropy is not an arbitrary choice; it emerges naturally from maximum likelihood estimation under a Bernoulli model. When someone says "we use cross-entropy loss for classification," they're saying "we're doing MLE."
Fun Fact
MSE (Mean Squared Error) is also an MLE result, but for a Gaussian model. If you assume your regression targets follow y ~ N(ŷ, σ²) with fixed σ², then maximizing the log-likelihood is equivalent to minimizing the MSE loss. Many standard loss functions have a probabilistic interpretation like this.
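The Gaussian claim can be spelled out in two lines. This is a sketch of the standard argument, assuming i.i.d. targets and a noise variance σ² that does not depend on the model parameters:

```latex
\begin{aligned}
p(y_i \mid \hat{y}_i) &= \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(y_i - \hat{y}_i)^2}{2\sigma^2}\right) \\
-\sum_{i=1}^{n} \log p(y_i \mid \hat{y}_i) &= \frac{n}{2}\log(2\pi\sigma^2)
    + \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2
\end{aligned}
```

The first term is constant in the model parameters and 1/(2σ²) is a positive scale factor, so minimizing the negative log-likelihood is the same as minimizing Σ(yᵢ − ŷᵢ)², which is n times the MSE.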
2.4 Information Theory for Deep Learning
Information theory, pioneered by Claude Shannon (1948), gives us a mathematical framework for quantifying uncertainty and information. It provides the theoretical foundation for why cross-entropy is the natural loss function for classification.
2.4.1 Entropy: Measuring Uncertainty
H(p) = −Σ p(x) log₂ p(x) (measured in bits)
(In deep learning, we use natural log: H(p) = −Σ p(x) ln p(x), measured in nats)
Intuition: Entropy measures how "surprised" you are on average. A fair coin (p=0.5) has maximum entropy: you're maximally uncertain. A biased coin (p=0.99) has low entropy: you know it'll almost always be heads.
Examples
| Distribution | p(heads) | H (bits) | Interpretation |
|---|---|---|---|
| Fair coin | 0.5 | 1.0 | Maximum uncertainty |
| Biased coin | 0.9 | 0.47 | Fairly predictable |
| Certain coin | 1.0 | 0.0 | No uncertainty |
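The table's entropy values can be reproduced with a few lines of plain Python. The helper name `binary_entropy_bits` is chosen here for illustration; it treats 0 · log 0 as 0, the usual convention.

```python
import math

def binary_entropy_bits(p):
    """Entropy of a coin with P(heads) = p, in bits (0 * log 0 is taken as 0)."""
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.5, 0.9, 1.0):
    print(f"p(heads) = {p}: H = {binary_entropy_bits(p):.2f} bits")
```

Swapping `math.log2` for `math.log` gives the same quantities in nats, the unit used elsewhere in this chapter.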
2.4.2 KL Divergence: Measuring Distribution Difference
D_KL(p ‖ q) = Σ p(x) log [p(x) / q(x)]
= Σ p(x) log p(x) − Σ p(x) log q(x)
= −H(p) + H(p, q)
Properties:
• D_KL ≥ 0 (always non-negative: Gibbs' inequality)
• D_KL = 0 iff p = q
• NOT symmetric: D_KL(p‖q) ≠ D_KL(q‖p) in general
Intuition: KL divergence measures how much "extra information" you need when you use distribution q to approximate distribution p. It's the "penalty" for using the wrong distribution.
2.4.3 Cross-Entropy: The Natural Loss
H(p, q) = −Σ p(x) log q(x)
= H(p) + D_KL(p ‖ q)
Since H(p) is constant w.r.t. model parameters:
Minimizing Cross-Entropy = Minimizing KL Divergence
Why Cross-Entropy is the Natural Classification Loss
Three perspectives converge to the same answer:
- MLE perspective: Cross-entropy = negative log-likelihood of a Bernoulli/Categorical model
- Information theory perspective: Minimizing cross-entropy = minimizing KL divergence between true and predicted distributions
- Practical perspective: Cross-entropy produces stronger gradients than MSE for wrong predictions (no gradient saturation)
All three say: cross-entropy is the right loss for classification.
Numerical Example: Why Cross-Entropy > MSE for Classification
True label: y = 1. Model predicts ŷ = 0.01 (confidently wrong!).
| Loss Function | Value | Gradient w.r.t. ŷ |
|---|---|---|
| MSE: (y − ŷ)² | (1 − 0.01)² = 0.98 | −2(1 − 0.01) = −1.98 |
| Cross-Entropy: −y log(ŷ) | −log(0.01) = 4.61 | −1/0.01 = −100 |
Cross-entropy gives a gradient of −100 vs MSE's −1.98. When the model is confidently wrong, cross-entropy's gradient is about 50× larger, so it pushes much stronger corrections. That is a key reason training usually converges faster with cross-entropy.
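The table's numbers, and what happens once a sigmoid sits in front of the loss, can be checked directly. The sketch assumes ŷ = σ(z) and applies the chain rule with σ'(z) = ŷ(1 − ŷ); the variable names are illustrative.

```python
y = 1.0
y_hat = 0.01                      # confidently wrong prediction from the example

# Gradients with respect to the prediction y_hat (the table's last column).
mse_grad_yhat = -2 * (y - y_hat)  # d/dy_hat of (y - y_hat)^2
ce_grad_yhat = -y / y_hat         # d/dy_hat of -y * log(y_hat)

# If y_hat = sigmoid(z), the chain rule multiplies by sigma'(z) = y_hat * (1 - y_hat).
sigmoid_grad = y_hat * (1 - y_hat)
mse_grad_z = mse_grad_yhat * sigmoid_grad
ce_grad_z = ce_grad_yhat * sigmoid_grad   # equals y_hat - y when y = 1

print(mse_grad_yhat, ce_grad_yhat)
print(mse_grad_z, ce_grad_z)
```

With respect to the logit z, the MSE gradient is squashed to roughly −0.02 by the saturated sigmoid, while the cross-entropy gradient stays near −0.99: the 1/ŷ factor cancels the saturation.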
Common Mistake
Students often confuse entropy H(p), cross-entropy H(p,q), and KL divergence D_KL(p‖q). Remember: H(p,q) = H(p) + D_KL(p‖q). Cross-entropy decomposes into "irreducible uncertainty" (entropy) plus "extra cost from using the wrong model" (KL divergence).
NumPy Code Lab: Math from Scratch
Let's implement every key operation in NumPy. This is where formulas become executable code.
4.1 Linear Algebra Operations
Python
import numpy as np
# ══════════ SCALARS, VECTORS, MATRICES, TENSORS ══════════
# Scalar
learning_rate = 0.01
# Vector: 5 features of a Paytm user
user = np.array([25, 12500, 3, 0.85, 450])
print(f"User vector shape: {user.shape}") # (5,)
# Matrix: 100 users × 5 features
np.random.seed(42)
X = np.random.randn(100, 5)
print(f"Data matrix shape: {X.shape}") # (100, 5)
# Tensor: batch of 32 color images (28×28)
images = np.random.randn(32, 28, 28, 3)
print(f"Image batch shape: {images.shape}") # (32, 28, 28, 3)
# ══════════ DOT PRODUCT ══════════
# Vector dot product: weighted sum of features
weights = np.array([0.3, 0.0001, -0.5, 2.0, 0.002])
score = np.dot(user, weights)
print(f"Churn score: {score:.4f}")
# Matrix multiplication: all users at once
W = np.random.randn(5, 3) # 5 features → 3 hidden neurons
H = X @ W # (100,5) @ (5,3) = (100,3)
print(f"Hidden layer shape: {H.shape}") # (100, 3)
# ══════════ TRANSPOSE ══════════
A = np.array([[1, 2, 3],
              [4, 5, 6]])
print(f"A shape: {A.shape}") # (2, 3)
print(f"A^T shape: {A.T.shape}") # (3, 2)
# Verify (AB)^T = B^T A^T
B = np.random.randn(3, 4)
lhs = (A @ B).T
rhs = B.T @ A.T
print(f"(AB)^T == B^T A^T: {np.allclose(lhs, rhs)}") # True
# ══════════ INVERSE ══════════
M = np.array([[4, 7], [2, 6]])
M_inv = np.linalg.inv(M)
print(f"M · M⁻¹ =\n{M @ M_inv}") # Identity matrix (up to rounding)
# ══════════ HADAMARD (ELEMENT-WISE) PRODUCT ══════════
gate = np.array([[1, 0, 1],
                 [0, 1, 0]]) # Binary mask (like dropout)
data = np.array([[5, 3, 8],
                 [2, 7, 4]])
masked = gate * data # Hadamard product: * in NumPy
print(f"Hadamard:\n{masked}") # [[5, 0, 8], [0, 7, 0]]
4.2 Calculus: Numerical Gradient & Sigmoid
Python
import numpy as np
# ══════════ SIGMOID AND ITS DERIVATIVE ══════════
def sigmoid(z):
    """σ(z) = 1 / (1 + e^(-z))"""
    return 1 / (1 + np.exp(-z))
def sigmoid_derivative(z):
    """σ'(z) = σ(z)(1 - σ(z))"""
    s = sigmoid(z)
    return s * (1 - s)
# Verify: maximum of sigmoid derivative is at z=0
z_values = np.linspace(-6, 6, 1001) # odd count so that z = 0 is on the grid
derivs = sigmoid_derivative(z_values)
print(f"Max σ'(z) = {derivs.max():.4f} at z = {z_values[derivs.argmax()]:.2f}")
# Max σ'(z) = 0.2500 at z = 0.00
# ══════════ NUMERICAL GRADIENT ══════════
def numerical_gradient(f, x, h=1e-7):
    """Compute gradient numerically (central difference)"""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy(); x_plus[i] += h
        x_minus = x.copy(); x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad
# Test: f(x,y) = x² + 3xy → ∂f/∂x = 2x+3y, ∂f/∂y = 3x
def test_func(v):
    x, y = v[0], v[1]
    return x**2 + 3*x*y
point = np.array([2.0, 5.0])
num_grad = numerical_gradient(test_func, point)
analytical_grad = np.array([2*2 + 3*5, 3*2]) # [19, 6]
print(f"Numerical gradient: {num_grad}")
print(f"Analytical gradient: {analytical_grad}")
print(f"Match: {np.allclose(num_grad, analytical_grad)}") # True
# ══════════ GRADIENT DESCENT DEMO ══════════
def loss(w):
    """Simple quadratic loss: L = (w - 3)²"""
    return (w - 3)**2
w = 0.0 # Start at w=0
lr = 0.1 # Learning rate
print(f"{'Step':>4} {'w':>8} {'Loss':>10}")
for step in range(20):
    grad = 2 * (w - 3) # dL/dw = 2(w-3)
    w = w - lr * grad # Gradient descent update
    print(f"{step+1:>4} {w:>8.4f} {loss(w):>10.6f}")
# w converges toward 3.0 (the minimum)
4.3 Probability: Distributions, MLE, Cross-Entropy
Python
import numpy as np
# ══════════ BERNOULLI DISTRIBUTION ══════════
def bernoulli_pmf(x, p):
    """P(X=x) = p^x * (1-p)^(1-x)"""
    return (p ** x) * ((1 - p) ** (1 - x))
# PhonePe fraud example: p = 0.047
p_fraud = 0.047
print(f"P(fraud) = {bernoulli_pmf(1, p_fraud):.4f}") # 0.047
print(f"P(not fraud) = {bernoulli_pmf(0, p_fraud):.4f}") # 0.953
# ══════════ GAUSSIAN DISTRIBUTION ══════════
def gaussian_pdf(x, mu, sigma):
    """p(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))"""
    coeff = 1 / np.sqrt(2 * np.pi * sigma**2)
    exponent = -((x - mu)**2) / (2 * sigma**2)
    return coeff * np.exp(exponent)
# Zomato delivery time: mean 35 min, std 8 min
x = np.array([25, 30, 35, 40, 50])
densities = gaussian_pdf(x, mu=35, sigma=8)
for xi, di in zip(x, densities):
    print(f"p(delivery={xi} min) = {di:.4f}") # density values, not probabilities
# ══════════ MLE FOR BERNOULLI ══════════
# Simulate 1000 transactions, 47 fraudulent
np.random.seed(42)
transactions = np.zeros(1000)
transactions[:47] = 1
np.random.shuffle(transactions)
# MLE estimate: p_hat = sum(x) / n
p_hat = transactions.mean()
print(f"\nMLE estimate of fraud probability: {p_hat:.4f}") # 0.047
# Log-likelihood at p_hat
log_lik = np.sum(transactions * np.log(p_hat) +
                 (1 - transactions) * np.log(1 - p_hat))
print(f"Log-likelihood at p_hat: {log_lik:.2f}")
# ══════════ CROSS-ENTROPY LOSS ══════════
def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    """BCE = -(1/n) Σ [y log(ŷ) + (1-y) log(1-ŷ)]"""
    y_pred = np.clip(y_pred, eps, 1 - eps) # Numerical stability
    return -np.mean(
        y_true * np.log(y_pred) +
        (1 - y_true) * np.log(1 - y_pred)
    )
# Test: true labels vs model predictions
y_true = np.array([1, 0, 1, 1, 0])
y_good = np.array([0.9, 0.1, 0.85, 0.95, 0.05]) # Good model
y_bad = np.array([0.3, 0.8, 0.4, 0.2, 0.7]) # Bad model
print(f"\nGood model BCE: {binary_cross_entropy(y_true, y_good):.4f}")
print(f"Bad model BCE: {binary_cross_entropy(y_true, y_bad):.4f}")
# Good model has LOWER loss: cross-entropy works!
# ══════════ ENTROPY & KL DIVERGENCE ══════════
def entropy(p):
    """H(p) = -Σ p(x) log p(x)"""
    p = p[p > 0] # Avoid log(0)
    return -np.sum(p * np.log(p))
def kl_divergence(p, q, eps=1e-15):
    """D_KL(p || q) = Σ p(x) log(p(x)/q(x))"""
    mask = p > 0 # Terms with p(x) = 0 contribute 0
    q = np.clip(q, eps, 1.0)
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))
def cross_entropy(p, q, eps=1e-15):
    """H(p, q) = -Σ p(x) log q(x)"""
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))
# True distribution vs two models
p_true = np.array([0.7, 0.2, 0.1]) # 3-class problem
q_good = np.array([0.65, 0.25, 0.1])
q_bad = np.array([0.2, 0.3, 0.5])
print(f"\nEntropy H(p): {entropy(p_true):.4f}")
print(f"KL(p||q_good): {kl_divergence(p_true, q_good):.4f}")
print(f"KL(p||q_bad): {kl_divergence(p_true, q_bad):.4f}")
print(f"CrossEnt(p, q_good): {cross_entropy(p_true, q_good):.4f}")
print(f"CrossEnt(p, q_bad): {cross_entropy(p_true, q_bad):.4f}")
print(f"Verify: H(p) + KL(p||q_good) = {entropy(p_true) + kl_divergence(p_true, q_good):.4f}")
# Should equal cross_entropy(p_true, q_good)
Visual Walkthrough: Matrix Multiplication
Let's trace through a 3×3 matrix multiplication by hand, step by step.
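One way to trace a 3×3 product step by step is to write the triple loop out explicitly and print every dot product. The two matrices below are made-up examples chosen for easy mental arithmetic.

```python
# Two small 3x3 matrices with made-up entries, chosen only for illustration.
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
B = [[1, 0, 2],
     [0, 1, 0],
     [3, 0, 1]]

n = 3
C = [[0] * n for _ in range(n)]
for i in range(n):            # row i of A
    for j in range(n):        # column j of B
        terms = [A[i][k] * B[k][j] for k in range(n)]
        C[i][j] = sum(terms)  # C[i][j] = sum over k of A[i][k] * B[k][j]
        print(f"C[{i}][{j}] = {' + '.join(map(str, terms))} = {C[i][j]}")
```

Each printed line is one row-times-column dot product; nine of them fill the 3×3 result. In practice `A @ B` in NumPy performs the same computation in optimized code.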
Worked Examples: Hand Calculations
Example 1: Matrix Operations for a Jio User Dataset
A Jio dataset has 3 users × 3 features: [data_usage_GB, recharge_₹, calls_min]
Calculation
X = [ 12  399  45 ]
    [  5  199  80 ]
    [ 25  599  20 ]
W = [0.1, 0.005, −0.02]ᵀ (churn weight vector)
Compute X·W (churn scores):
User 1: 12(0.1) + 399(0.005) + 45(−0.02) = 1.2 + 1.995 − 0.9 = 2.295
User 2: 5(0.1) + 199(0.005) + 80(−0.02) = 0.5 + 0.995 − 1.6 = −0.105
User 3: 25(0.1) + 599(0.005) + 20(−0.02) = 2.5 + 2.995 − 0.4 = 5.095
After sigmoid: σ(2.295) = 0.908, σ(−0.105) = 0.474, σ(5.095) = 0.994
Interpretation: User 3 (high data, high recharge, few calls) has a 99.4% predicted churn probability, likely a heavy user switching to another carrier. User 2, at 47.4%, is borderline.
Example 2: Cross-Entropy Loss Calculation
3 Flipkart customers. True labels (will return product): y = [1, 0, 1]
Model predictions: ŷ = [0.8, 0.3, 0.6]
L₁ = −[1·log(0.8) + 0·log(0.2)] = −log(0.8) = −(−0.2231) = 0.2231
L₂ = −[0·log(0.3) + 1·log(0.7)] = −log(0.7) = −(−0.3567) = 0.3567
L₃ = −[1·log(0.6) + 0·log(0.4)] = −log(0.6) = −(−0.5108) = 0.5108
BCE = (0.2231 + 0.3567 + 0.5108) / 3 = 0.3635
Interpretation: Sample 3 (y=1, ŷ=0.6) contributes the most loss because the model is least confident about the correct answer. Cross-entropy penalizes under-confidence: the lower the probability assigned to the true class, the larger the loss.
Example 3: Computing Entropy and KL Divergence
Swiggy food category distribution in Bangalore:
True: p = [Biryani: 0.4, Pizza: 0.3, Dosa: 0.2, Other: 0.1]
Model A: q_A = [0.35, 0.30, 0.25, 0.10]
Model B: q_B = [0.10, 0.10, 0.10, 0.70]
Entropy of the true distribution:
H(p) = −[0.4 ln(0.4) + 0.3 ln(0.3) + 0.2 ln(0.2) + 0.1 ln(0.1)]
= −[0.4(−0.9163) + 0.3(−1.2040) + 0.2(−1.6094) + 0.1(−2.3026)]
= −[−0.3665 − 0.3612 − 0.3219 − 0.2303]
≈ 1.280 nats
KL divergence for Model A:
D_KL(p ‖ q_A) = 0.4 ln(0.4/0.35) + 0.3 ln(0.3/0.30) + 0.2 ln(0.2/0.25) + 0.1 ln(0.1/0.10)
= 0.4(0.134) + 0.3(0) + 0.2(−0.223) + 0.1(0)
= 0.054 − 0.045 = 0.009 nats (very close!)
KL divergence for Model B:
D_KL(p ‖ q_B) = 0.4 ln(0.4/0.10) + 0.3 ln(0.3/0.10) + 0.2 ln(0.2/0.10) + 0.1 ln(0.1/0.70)
= 0.4(1.386) + 0.3(1.099) + 0.2(0.693) + 0.1(−1.946)
= 0.554 + 0.330 + 0.139 − 0.195 = 0.828 nats (very far!)
Conclusion: Model A (KL = 0.009) has roughly 90× lower divergence from the true distribution than Model B (KL = 0.828). KL divergence correctly captures that Model B's prediction of 70% "Other" is absurdly wrong for Bangalore food orders.
Common Mistakes & Pitfalls
Mistake 1: Confusing Matrix Multiply and Element-wise Multiply
Wrong: Using A * B (Hadamard) when you mean A @ B (matrix multiply) in NumPy. These are completely different operations! A * B requires same shapes; A @ B requires inner dimensions to match.
Fix: Always use @ or np.dot() for matrix multiplication. Reserve * for element-wise operations.
Mistake 2: Shape Mismatches in Matrix Multiplication
Wrong: Trying to multiply (100, 5) × (100, 3). Inner dimensions don't match (5 ≠ 100).
Fix: Always write shapes side by side: (m×n) × (n×p). The two inner dimensions (both n) must be equal. If not, check whether one matrix needs to be transposed.
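The mismatch from this pitfall looks like this in NumPy. The arrays are placeholders filled with ones; whether transposing is the right fix depends on what the matrices mean, so treat the last lines as one possible repair rather than a general rule.

```python
import numpy as np

A = np.ones((100, 5))
B = np.ones((100, 3))

try:
    A @ B                     # (100, 5) @ (100, 3): inner dimensions 5 and 100 differ
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("Shape error:", err)

C = A.T @ B                   # (5, 100) @ (100, 3) -> (5, 3)
print(C.shape)
```

Printing `.shape` before every `@` while debugging is usually faster than reading the traceback.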
Mistake 3: Using MSE for Classification
Wrong: MSE as loss for binary classification. Gradients saturate when sigmoid output is near 0 or 1.
Fix: Always use cross-entropy for classification. It's the MLE-optimal loss and produces stronger gradients for wrong predictions.
Mistake 4: Forgetting log(0) is Undefined
Wrong: Computing np.log(y_pred) when y_pred contains 0. Result: -inf or NaN.
Fix: Always clip predictions: y_pred = np.clip(y_pred, 1e-15, 1-1e-15) before computing log.
Mistake 5: Thinking KL Divergence is Symmetric
Wrong: Assuming D_KL(p‖q) = D_KL(q‖p). It's NOT: KL divergence is not a true "distance."
Fix: Always specify direction. In training, we minimize D_KL(p_true ‖ q_model), which equals minimizing cross-entropy H(p, q).
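The asymmetry is easy to see numerically. This sketch reuses the true distribution and Model B from Example 3 and computes the divergence in both directions; the helper `kl` is written for this illustration and assumes all probabilities are strictly positive.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats, assuming all entries of p and q are positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.4, 0.3, 0.2, 0.1]     # true distribution from Example 3
q_b = [0.1, 0.1, 0.1, 0.7]   # Model B from Example 3

print(f"KL(p || q_B) = {kl(p, q_b):.3f}")
print(f"KL(q_B || p) = {kl(q_b, p):.3f}")
```

The two directions give different values (about 0.828 and 1.044 nats), which is why the order of the arguments always has to be stated.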
Mistake 6: Ignoring the Vanishing Gradient of Sigmoid
Wrong: Stacking many sigmoid layers and wondering why the network doesn't learn.
Fix: Use ReLU for hidden layers. Reserve sigmoid only for the final output layer of binary classification. We derived that σ'(z) ≤ 0.25, so chaining many of these kills the gradient.
Exercises
Section A: Multiple Choice Questions
- What is the shape of the result when you multiply a matrix of shape (64, 128) with a matrix of shape (128, 10)?
(a) (128, 128) (b) (64, 10) (c) (10, 64) (d) (64, 128, 10)
Answer: (b). (64×128) × (128×10) = (64×10)
- Which operation is used in LSTM gating mechanisms?
(a) Matrix multiplication (b) Hadamard (element-wise) product (c) Matrix inverse (d) Eigenvalue decomposition
Answer: (b). Gates multiply element-wise with the cell state
- What is the derivative of σ(z) = 1/(1+e⁻ᶻ)?
(a) σ(z)² (b) σ(z)(1−σ(z)) (c) 1−σ(z)² (d) e⁻ᶻ/(1+e⁻ᶻ)
Answer: (b). σ'(z) = σ(z)(1 − σ(z)), derived in Section 2.2.5
- The maximum value of σ'(z) is:
(a) 1.0 (b) 0.5 (c) 0.25 (d) 0.1
Answer: (c). At z=0: σ(0)=0.5, σ'(0)=0.5×0.5=0.25
- Which of the following is the correct chain rule?
(a) d/dx[f(g(x))] = f'(x)·g'(x) (b) d/dx[f(g(x))] = f'(g(x))·g'(x) (c) d/dx[f(g(x))] = f(g'(x)) (d) d/dx[f(g(x))] = f'(g(x))+g'(x)
Answer: (b). Derivative of outer evaluated at inner × derivative of inner
- Cross-entropy loss for binary classification is derived from:
(a) Mean Squared Error (b) Maximum A Posteriori (c) Maximum Likelihood Estimation (d) Least Absolute Deviation
Answer: (c). BCE = negative log-likelihood of Bernoulli MLE
- If P(fraud) = 0.03 and the model predicts P̂(fraud) = 0.01, the cross-entropy loss −[y·log(ŷ)] for a fraud sample (y=1) is:
(a) −log(0.03) (b) −log(0.01) = 4.61 (c) −log(0.99) = 0.01 (d) −log(0.97)
Answer: (b). For y=1, loss = −log(ŷ) = −log(0.01) = 4.61
- KL divergence D_KL(p‖q) is always:
(a) Negative (b) Zero (c) Non-negative (≥ 0) (d) Symmetric
Answer: (c). Gibbs' inequality guarantees D_KL ≥ 0, with equality iff p=q
- The gradient vector ∇f points in the direction of:
(a) Steepest descent (b) Steepest ascent (c) Zero change (d) Random direction
Answer: (b). Gradient points to steepest ascent; we move opposite for descent
- Which is the correct relationship between entropy, cross-entropy, and KL divergence?
(a) H(p,q) = H(p) − D_KL(p‖q) (b) H(p,q) = H(p) + D_KL(p‖q) (c) H(p,q) = D_KL(p‖q) − H(p) (d) H(p,q) = H(p) × D_KL(p‖q)
Answer: (b). Cross-entropy = entropy + KL divergence
Section B: Hand Calculation Problems
- Matrix Multiplication. Compute C = A × B by hand:
A = [ 2   3 ]    B = [ 1 ]
    [ 1  -1 ]        [ 4 ]
Solution:
C = A × B = [(2×1)+(3×4), (1×1)+(−1×4)]ᵀ = [14, −3]ᵀ
This is a (2×2) × (2×1) = (2×1) result. Each element is a dot product of a row of A with the column of B.
- Chain Rule. Find dy/dx for y = ln(sin(3x²)).
Solution:
1. Let u = 3x², v = sin(u), y = ln(v)
2. dy/dv = 1/v = 1/sin(3x²)
3. dv/du = cos(u) = cos(3x²)
4. du/dx = 6x
5. dy/dx = (1/sin(3x²)) · cos(3x²) · 6x = 6x · cot(3x²)
- Gradient Computation. For L(w₁, w₂) = (2w₁ + 3w₂ − 7)², compute ∇L at w₁=1, w₂=1.
Solution:
1. At (1,1): L = (2+3−7)² = (−2)² = 4
2. ∂L/∂w₁ = 2(2w₁+3w₂−7)·2 = 4(2+3−7) = 4(−2) = −8
3. ∂L/∂w₂ = 2(2w₁+3w₂−7)·3 = 6(2+3−7) = 6(−2) = −12
4. ∇L = [−8, −12]ᵀ → Move in opposite direction: [+8, +12]
- Cross-Entropy. True labels: y = [1, 0, 1, 0]. Predictions: ŷ = [0.9, 0.2, 0.7, 0.1]. Compute the binary cross-entropy loss.
Solution:
1. L₁ = −log(0.9) = 0.1054
2. L₂ = −log(1−0.2) = −log(0.8) = 0.2231
3. L₃ = −log(0.7) = 0.3567
4. L₄ = −log(1−0.1) = −log(0.9) = 0.1054
5. BCE = (0.1054+0.2231+0.3567+0.1054)/4 = 0.7906/4 ≈ 0.1976
- MLE. A Zomato delivery model observes 200 orders. 160 arrive on time (y=1), 40 are late (y=0). Find the MLE estimate of on-time probability. Then compute the log-likelihood at that estimate.
Solution:
1. p̂ = 160/200 = 0.80
2. ℓ(p̂) = 160·ln(0.8) + 40·ln(0.2)
3. = 160(−0.2231) + 40(−1.6094)
4. = −35.70 + (−64.38) = −100.08
The negative value is normal: the log of any probability below 1 is negative, so a sum of such logs is negative too.
Section D: Programming Problems
- Implement a complete gradient descent optimizer for f(x) = x⁴ − 3x³ + 2.
Start from x=6. Use learning rate 0.01. Run 1000 steps. Print x and f(x) every 100 steps. Verify that x converges near the global minimum. Plot the loss curve.
Starter code:
Python
import numpy as np
import matplotlib.pyplot as plt
def f(x):
    return x**4 - 3*x**3 + 2
def df(x):
    # TODO: compute the derivative
    pass
x = 6.0
lr = 0.01
history = []
for step in range(1000):
    # TODO: gradient descent update
    # TODO: record history
    pass
# TODO: plot history
- Build a softmax + cross-entropy loss function from scratch.
Given logits z = [2.0, 1.0, 0.1] and true class y = 0, implement:
(a) Softmax: p_i = exp(z_i) / Σ exp(z_j)
(b) Cross-entropy: L = −log(p_y)
(c) Gradient: ∂L/∂z_i = p_i − 1{i=y}
Verify your gradient numerically.
Starter code:
Python
import numpy as np
def softmax(z):
    # TODO: implement (use max trick for stability)
    pass
def cross_entropy_loss(probs, y_true):
    # TODO: -log(probs[y_true])
    pass
def softmax_ce_gradient(probs, y_true):
    # TODO: p_i - 1{i=y}
    pass
z = np.array([2.0, 1.0, 0.1])
y = 0
# TODO: compute and print softmax, loss, gradient
# TODO: verify gradient numerically
- Implement a complete 2-layer neural network forward pass using only NumPy.
Architecture: 4 inputs → 3 hidden (ReLU) → 1 output (sigmoid).
Initialize random weights (use seed 42). Process a batch of 5 samples. Print shapes at every step. Compute binary cross-entropy loss.
Starter code:
Python
import numpy as np
np.random.seed(42)
# Data: 5 samples × 4 features
X = np.random.randn(5, 4)
y = np.array([[1], [0], [1], [0], [1]]) # shape (5,1)
# TODO: Initialize W1 (4×3), b1 (1×3), W2 (3×1), b2 (1×1)
# TODO: Forward pass: Z1 = X@W1+b1, A1 = relu(Z1), Z2 = A1@W2+b2, A2 = sigmoid(Z2)
# TODO: Compute BCE loss
# TODO: Print all intermediate shapes
Chapter Summary
Key Takeaways
- Linear Algebra: Data lives in tensors. Matrix multiplication (the dot product) is the core computation in every neural network layer: z = Wx + b.
- Shape tracking is non-negotiable: (m×n) × (n×p) = (m×p). Inner dimensions must match.
- Transpose flips rows ↔ columns. The Hadamard product (⊙) multiplies element-wise. Matrix inverse exists only for square, non-singular matrices.
- Calculus: The derivative measures sensitivity: how much output changes per unit input change. The chain rule decomposes derivatives through compositions of functions.
- The sigmoid derivative σ'(z) = σ(z)(1−σ(z)) has a maximum of 0.25, so chaining many sigmoid layers causes vanishing gradients.
- The gradient ∇L points uphill; gradient descent moves downhill: w ← w − α∇L.
- Probability: Binary classification = Bernoulli distribution. MLE for Bernoulli gives p̂ = successes/trials.
- Cross-entropy loss = negative log-likelihood: it emerges naturally from MLE, not from arbitrary choice.
- Information Theory: Entropy measures uncertainty. KL divergence measures how wrong your model distribution is. Cross-entropy = Entropy + KL Divergence.
- In the worked example (y = 1, ŷ = 0.01), cross-entropy produced a gradient about 50× larger than MSE for a confidently wrong prediction, which is why it tends to train faster for classification.
Cheat Sheet: Formulas You Must Remember
| Name | Formula | Where Used |
|---|---|---|
| Matrix Multiply | C_{ij} = Σ_k A_{ik}B_{kj} | Every forward pass |
| Sigmoid | σ(z) = 1/(1+e⁻ᶻ) | Output layer (binary) |
| Sigmoid Derivative | σ'(z) = σ(z)(1−σ(z)) | Backpropagation |
| Chain Rule | dy/dx = (dy/du)(du/dx) | Backpropagation |
| Gradient Descent | w ← w − α∇L | All training |
| Bernoulli MLE | p̂ = Σyᵢ / n | Parameter estimation |
| Binary Cross-Entropy | −(1/n)Σ[y log ŷ + (1−y)log(1−ŷ)] | Classification loss |
| Entropy | H(p) = −Σ p(x) log p(x) | Measuring uncertainty |
| KL Divergence | D_KL(p‖q) = Σ p log(p/q) | Comparing distributions |
| Cross-Entropy | H(p,q) = H(p) + D_KL(p‖q) | Classification loss |
References & Further Reading
Primary Textbooks
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 2 (Linear Algebra), Chapter 3 (Probability), Chapter 4 (Numerical Computation). MIT Press. deeplearningbook.org
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 1 (Introduction: Probability). Springer.
- Strang, G. (2019). Linear Algebra and Learning from Data. Wellesley-Cambridge Press.
Online Resources
- 3Blue1Brown: Essence of Linear Algebra (YouTube Series). Exceptional visual intuition for vectors, transformations, and eigenvalues.
- 3Blue1Brown: Essence of Calculus (YouTube Series). Derivatives and integrals explained visually.
- Khan Academy: Multivariable Calculus. Gradients, partial derivatives, Jacobians.
- Stanford CS229 Notes: Linear Algebra Review. Concise reference for ML-relevant linear algebra.
- colah's blog: Visual Information Theory (2015). Beautiful explanation of entropy, cross-entropy, and KL divergence. colah.github.io
Indian Context
- NPTEL: Mathematics for Machine Learning by IIT Madras. Free video lectures covering all topics in this chapter.
- NPTEL: Deep Learning by Prof. Mitesh Khapra, IIT Madras. Mathematical foundations in Weeks 1-3.
- UPI Transaction Statistics: NPCI. npci.org.in
What's Next?
In Chapter 3: The Perceptron & Neuron Model, we'll put this math to work. You'll see how a single neuron computes z = wᵀx + b (linear algebra), applies σ(z) (calculus), and learns by minimizing cross-entropy (probability + information theory). Every formula from this chapter will come alive.