Chapter 2 · Part I: Foundations

Mathematical Toolkit for Deep Learning

⏱ 3 hours reading 📄 ~12,000 words 🧮 Full derivations

Every neural network is a composition of mathematical functions. Before you write a single line of PyTorch, you must own the math that powers it — linear algebra for data flow, calculus for learning, probability for uncertainty, and information theory for loss functions.

Remember: Notation & Definitions | Understand: Derivations | Apply: NumPy Computation | Analyze: Why Cross-Entropy? | Evaluate: Compare Loss Functions | Create: Implement from Scratch

Prerequisites

Class 12 Mathematics (CBSE/ISC) — matrices, determinants, basic calculus
Chapter 1 of this textbook (Python & NumPy basics)
Comfort with Σ (summation) and basic algebraic manipulation

Learning Objectives

By the end of this chapter, you will be able to:

Represent datasets as matrices/tensors and perform core linear algebra operations in NumPy
Compute dot products, transposes, inverses, and Hadamard products by hand and in code
Derive derivatives from first principles, apply the chain rule, and compute gradients
Derive the sigmoid derivative step-by-step and explain why it matters for backpropagation
Define Bernoulli and Gaussian distributions and compute Maximum Likelihood Estimates
Derive cross-entropy loss from negative log-likelihood (MLE) — the "why" behind the loss function
Explain entropy, KL divergence, and cross-entropy and their roles in neural network training
Implement all key operations from scratch in NumPy

The Hook: Why Math Matters

🎯 The Challenge

Paytm has 350 million registered users. Their data science team needs to predict which users will stop using Paytm Wallet next month. They have 50 features per user: transaction frequency, average spend (₹), last login days, UPI usage ratio, cashback redeemed, and more.

Here's the secret: every prediction is a math equation. The neural network takes a 50-dimensional vector (one user's features), multiplies it by weight matrices, applies nonlinear functions, and outputs a probability between 0 and 1. That probability is the churn score.

To understand how the network learns the right weights — you need linear algebra (matrix multiplication), calculus (gradient descent), probability (loss function), and information theory (cross-entropy). This chapter gives you every tool.

India Connect

India's UPI processed 14.04 billion transactions worth ₹20.64 lakh crore in a single month (March 2024). Behind every fraud detection model, recommendation engine, and credit scoring system at PhonePe, Google Pay, and Paytm — the same mathematical toolkit powers the neural networks. Master this chapter and you speak the language of Indian fintech AI.

2.1 Linear Algebra for Deep Learning

Linear algebra is the language of data in deep learning. Every input, every weight, every output is a multi-dimensional array. Let's build the vocabulary.

2.1.1 Scalars, Vectors, Matrices, and Tensors

Scalar (0-D Tensor)

A single number. We denote scalars with lowercase italic letters: x, y, α, λ.

Scalar

x = 42 (a single number — e.g., a user's age)

Vector (1-D Tensor)

An ordered list of numbers. We denote vectors with bold lowercase: x. Each element is x₁, x₂, ..., xₙ.

Column Vector

x = [x₁, x₂, ..., xₙ]ᵀ ∈ ℝⁿ
Example: A Paytm user's feature vector with 5 features:
x = [25, 12500, 3, 0.85, 450]ᵀ
(age, avg_spend_₹, days_since_login, upi_ratio, cashback_₹)

Matrix (2-D Tensor)

A 2-D array of numbers. Denoted with bold uppercase: A. Shape is m × n (m rows, n columns).

Matrix

A ∈ ℝᵐˣⁿ where A_{ij} is the element at row i, column j

Example: 100 IRCTC passengers × 4 features = X ∈ ℝ¹⁰⁰ˣ⁴

Tensor (n-D Array)

A tensor is an array with any number of axes; in practice the word usually refers to arrays with more than 2 axes. A color image is a 3-D tensor (height × width × channels). A batch of images is a 4-D tensor.

Tensor Shapes

Scalar: () | Vector: (n,) | Matrix: (m, n)
Color Image: (H, W, 3) | Batch of Images: (B, H, W, 3)

Example: 32 images of size 224×224 RGB → shape (32, 224, 224, 3)

Shape Notation is Everything

In deep learning, shape errors are the #1 bug. Always track tensor shapes. A weight matrix connecting a layer of 784 neurons to 256 neurons has shape (784, 256). The input batch has shape (64, 784). The output is (64, 256). Always verify: inner dimensions must match for matrix multiplication.
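The shape rule above can be checked directly in NumPy. This is a minimal sketch: the names `X_batch` and `W_layer` are illustrative assumptions, and zeros stand in for real data because only the shapes matter here.

```python
import numpy as np

# Shapes from the paragraph above: a batch of 64 inputs through a 784 -> 256 layer.
X_batch = np.zeros((64, 784))   # input batch (illustrative name)
W_layer = np.zeros((784, 256))  # weight matrix (illustrative name)

# Inner dimensions (784 and 784) match, so the product is defined.
out = X_batch @ W_layer
print(out.shape)

# Swapping the operands gives (784, 256) @ (64, 784): inner dims 256 != 64.
try:
    W_layer @ X_batch
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```

Printing `.shape` after each layer is a cheap habit that catches most of these bugs early.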

2.1.2 Matrix Multiplication: The Dot Product

Matrix multiplication is the single most important operation in neural networks. Every forward pass is a series of matrix multiplications.

Matrix Multiplication

Given A ∈ ℝᵐˣⁿ and B ∈ ℝⁿˣᵖ:

C = A · B where C_{ij} = Σₖ A_{ik} · B_{kj} (k = 1 to n)

Result shape: C ∈ ℝᵐˣᵖ
Rule: Inner dimensions must match → (m×n) · (n×p) = (m×p)

Intuition: Each element C_{ij} is the dot product of row i of A with column j of B. Think of it as: "How much does row i of A align with column j of B?"

In neural networks: If x is a single input vector (1×n) and W is a weight matrix (n×h), then x·W gives the output (1×h) — one number per hidden neuron. Each output neuron computes a weighted sum of the inputs.
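To make the formula C_{ij} = Σₖ A_{ik} · B_{kj} concrete, here is a sketch that spells it out with three explicit loops and compares it against NumPy's `@` operator. The helper name `matmul_loops` and the small matrices are assumptions chosen for illustration; real code should always use `@`.

```python
import numpy as np

def matmul_loops(A, B):
    """Matrix product written out as C[i, j] = sum_k A[i, k] * B[k, j]."""
    m, n = A.shape
    n2, p = B.shape
    if n != n2:
        raise ValueError(f"inner dimensions differ: {n} vs {n2}")
    C = np.zeros((m, p))
    for i in range(m):          # each row of A
        for j in range(p):      # each column of B
            for k in range(n):  # dot product of row i with column j
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])    # shape (3, 2)
B = np.array([[7.0, 8.0, 9.0], [10.0, 11.0, 12.0]])   # shape (2, 3)
C = matmul_loops(A, B)
print(C)
print(np.allclose(C, A @ B))
```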

2.1.3 Transpose

Transpose

(Aᵀ)_{ij} = A_{ji}

If A ∈ ℝᵐˣⁿ, then Aᵀ ∈ ℝⁿˣᵐ
Rows become columns, columns become rows.

Properties:
(Aᵀ)ᵀ = A | (A + B)ᵀ = Aᵀ + Bᵀ | (AB)ᵀ = BᵀAᵀ

2.1.4 Matrix Inverse

Matrix Inverse

A⁻¹ exists only for square, non-singular matrices.

A · A⁻¹ = A⁻¹ · A = I (Identity matrix)

Used in: Normal equation for linear regression — w = (XᵀX)⁻¹ Xᵀy

Pro Tip

In practice, deep learning almost never computes matrix inverses explicitly. Inversion costs O(n³) operations and can be numerically unstable when the matrix is ill-conditioned, so we use gradient descent instead. Understanding inverses still helps you read research papers and derive closed-form solutions.
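As a hedged illustration of the normal equation, the sketch below solves w = (XᵀX)⁻¹ Xᵀy two ways on synthetic, noise-free data: once with an explicit inverse and once with `np.linalg.solve`, which solves the linear system without forming the inverse. The data, seed, and variable names are assumptions made up for this example.

```python
import numpy as np

# Synthetic regression problem (assumed data, noise-free so w is recoverable).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

# Textbook form: explicit inverse of X^T X.
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Usually preferred: solve (X^T X) w = X^T y directly.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(w_inv, w_solve))
print(np.allclose(w_solve, w_true))
```

On a small, well-conditioned problem like this one both routes agree; the difference shows up on larger or nearly singular systems.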

2.1.5 Element-wise (Hadamard) Product

Hadamard Product

C = A ⊙ B where C_{ij} = A_{ij} × B_{ij}

Both matrices must have the same shape.
Used in: Gating mechanisms (LSTM, GRU), attention masks, dropout

Don't confuse this with matrix multiplication! The Hadamard product multiplies corresponding elements, while matrix multiplication computes dot products of rows and columns.

Worked Example: IRCTC Passenger Data as a Matrix

Consider 5 IRCTC passengers with 4 features each: Age, Fare (₹), Distance (km), Class (1/2/3).

Data
Passenger Matrix X ∈ ℝ⁵ˣ⁴:

         Age    Fare(₹)  Distance(km)  Class
P1  [    28,    1450,      850,          3   ]
P2  [    45,    3200,     1200,          2   ]
P3  [    32,    5800,     2100,          1   ]
P4  [    56,    1800,      950,          3   ]
P5  [    23,    4500,     1800,          1   ]

Shape: (5, 4)
X[2, 1] = 5800   (3rd passenger, 2nd feature = Fare)
X[:, 0] = [28, 45, 32, 56, 23]  (all ages — a column vector)

Weight vector for predicting "will book again" (4 features → 1 output):

w = [0.01, 0.0002, 0.0003, -0.5]ᵀ

Score for P1 = 28×0.01 + 1450×0.0002 + 850×0.0003 + 3×(-0.5) = 0.28 + 0.29 + 0.255 - 1.5 = -0.675

Score for P3 = 32×0.01 + 5800×0.0002 + 2100×0.0003 + 1×(-0.5) = 0.32 + 1.16 + 0.63 - 0.5 = 1.61

P3 (1AC traveler, long distance) has a much higher rebooking score. The entire batch computation is simply X · w.
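The hand calculation above can be reproduced in a few lines. This sketch uses the passenger matrix and weight vector exactly as given in the worked example; only the variable names are assumptions.

```python
import numpy as np

# Rows: P1..P5. Columns: Age, Fare (INR), Distance (km), Class.
X = np.array([
    [28, 1450,  850, 3],
    [45, 3200, 1200, 2],
    [32, 5800, 2100, 1],
    [56, 1800,  950, 3],
    [23, 4500, 1800, 1],
], dtype=float)

w = np.array([0.01, 0.0002, 0.0003, -0.5])

print(X.shape)      # 5 passengers, 4 features
print(X[2, 1])      # 3rd passenger, 2nd feature (zero-based indexing)

# One matrix-vector product scores the whole batch.
scores = X @ w
for name, s in zip(["P1", "P2", "P3", "P4", "P5"], scores):
    print(f"{name}: {s:.3f}")
```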

2.2 Calculus for Deep Learning

Calculus tells us how to learn. A neural network improves by computing how much each weight contributed to the error, then adjusting it. That computation is a derivative.

2.2.1 Derivatives from First Principles

Definition of a Derivative

f'(x) = lim(h→0) [f(x + h) - f(x)] / h

The derivative tells you: if x changes by a tiny amount,
how much does f(x) change?

It is the slope of the tangent line at point x.

Example: Deriving d/dx [x²] from first principles

Step 1. f(x) = x², so f(x+h) = (x+h)² = x² + 2xh + h²

Step 2. f(x+h) - f(x) = x² + 2xh + h² - x² = 2xh + h²

Step 3. [f(x+h) - f(x)] / h = (2xh + h²) / h = 2x + h

Step 4. lim(h→0) [2x + h] = 2x

So d/dx [x²] = 2x. At x = 3, the slope is 6 — meaning a tiny increase in x increases x² by approximately 6 times that increase.
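The limit can also be watched numerically. The sketch below evaluates the difference quotient for f(x) = x² at x = 3 with shrinking h; by Step 3 each slope should equal 2x + h, so the values approach 6. The step sizes are arbitrary choices for illustration.

```python
def f(x):
    return x ** 2

x0 = 3.0
step_sizes = [1.0, 0.1, 0.01, 0.001]

# Difference quotient [f(x + h) - f(x)] / h for each h.
slopes = [(f(x0 + h) - f(x0)) / h for h in step_sizes]

for h, s in zip(step_sizes, slopes):
    print(f"h = {h:<6} slope = {s:.4f}   (2x + h = {2 * x0 + h:.4f})")
```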

Key Derivative Rules

Rule	Formula	Example
Power Rule	d/dx [xⁿ] = nxⁿ⁻¹	d/dx [x³] = 3x²
Constant Multiple	d/dx [cf(x)] = c·f'(x)	d/dx [5x²] = 10x
Sum Rule	d/dx [f+g] = f'+g'	d/dx [x²+3x] = 2x+3
Product Rule	d/dx [fg] = f'g + fg'	d/dx [x·eˣ] = eˣ + xeˣ
Exponential	d/dx [eˣ] = eˣ	d/dx [e³ˣ] = 3e³ˣ
Logarithm	d/dx [ln x] = 1/x	d/dx [ln(2x)] = 1/x

2.2.2 The Chain Rule — Heart of Backpropagation

Chain Rule

If y = f(g(x)), then:

dy/dx = (dy/du) · (du/dx) where u = g(x)

"The derivative of the outer function × the derivative of the inner function"

Why this matters: A neural network is a composition of functions: output = f₃(f₂(f₁(x))). To compute how the loss changes w.r.t. weight w₁ in the first layer, we chain derivatives through every subsequent layer. This is backpropagation.

Example: Chain Rule

Let y = (3x + 2)⁵. Find dy/dx.

Step 1. Let u = 3x + 2, so y = u⁵

Step 2. dy/du = 5u⁴, du/dx = 3

Step 3. dy/dx = 5u⁴ · 3 = 15(3x + 2)⁴
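A quick way to sanity-check a chain-rule result is to compare it with a central-difference estimate. This sketch does that for y = (3x + 2)⁵ at x = 1, where the formula gives 15 · 5⁴ = 9375. The evaluation point and step size `h` are assumptions chosen for illustration.

```python
def y(x):
    return (3 * x + 2) ** 5

def dy_dx(x):
    # Chain-rule result derived above: 15 (3x + 2)^4
    return 15 * (3 * x + 2) ** 4

x0 = 1.0
h = 1e-5

# Central difference: [y(x + h) - y(x - h)] / (2h)
numeric = (y(x0 + h) - y(x0 - h)) / (2 * h)
analytic = dy_dx(x0)

print(f"numeric  = {numeric:.4f}")
print(f"analytic = {analytic:.4f}")
```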

2.2.3 Partial Derivatives and Gradients

When a function has multiple inputs (like a loss function with many weights), we take partial derivatives — differentiate w.r.t. one variable while holding others constant.

Partial Derivative

f(x, y) = 3x²y + 2xy³

∂f/∂x = 6xy + 2y³ (treat y as constant)
∂f/∂y = 3x² + 6xy² (treat x as constant)

The Gradient Vector

Gradient

∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]ᵀ

The gradient is a vector of all partial derivatives.
It points in the direction of steepest ascent.
To minimize loss: move in the opposite direction → w = w - α∇L

Gradient Descent Intuition

Imagine you're blindfolded on a hilly terrain (the loss surface). The gradient tells you which direction is uphill. You take a step in the opposite direction (downhill). The learning rate α controls how big each step is. Too big → you overshoot. Too small → you crawl forever. This is gradient descent: the fundamental learning algorithm of deep learning.
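The effect of the learning rate is easy to see on a one-dimensional loss. The sketch below runs 20 steps of gradient descent on L(w) = (w − 3)² with three learning rates. The loss, starting point, and the specific rates are assumptions chosen so that one run crawls, one converges, and one overshoots.

```python
def run_gd(lr, steps=20, w0=0.0):
    """Gradient descent on L(w) = (w - 3)^2, whose gradient is 2 (w - 3)."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)
        w = w - lr * grad   # step against the gradient
    return w

for lr in (0.01, 0.1, 1.1):
    w_final = run_gd(lr)
    print(f"lr = {lr:<5} final w = {w_final:10.4f}   distance to minimum = {abs(w_final - 3):.4f}")
```

Each update multiplies the distance to the minimum by (1 − 2·lr), which is why a rate above 1.0 makes the distance grow on this particular loss.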

2.2.4 Jacobian and Hessian (Brief Introduction)

Jacobian Matrix

For a vector-valued function f: ℝⁿ → ℝᵐ:

J_{ij} = ∂fᵢ/∂xⱼ (shape: m × n)

The Jacobian generalizes the gradient to vector-valued functions.
Used in: Backpropagation through layers with vector outputs.

Hessian Matrix

H_{ij} = ∂²f / (∂xᵢ ∂xⱼ) (shape: n × n)

Matrix of second-order partial derivatives.
Tells you about curvature of the loss surface.
Used in: Second-order optimization (Newton's method, L-BFGS).

Pro Tip

You won't compute Jacobians or Hessians by hand in practice — frameworks like PyTorch handle this with automatic differentiation. But understanding them conceptually helps you debug training issues and read advanced papers on optimization.
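For intuition about what a Jacobian contains, here is a small finite-difference sketch. The function `f` (mapping ℝ² → ℝ²) and the helper `numerical_jacobian` are assumptions invented for this example; frameworks compute the same quantity with automatic differentiation rather than finite differences.

```python
import numpy as np

def f(x):
    """Example vector-valued function f: R^2 -> R^2."""
    return np.array([x[0] * x[1], x[0] + x[1] ** 2])

def numerical_jacobian(func, x, h=1e-6):
    """Approximate J[i, j] = d f_i / d x_j with central differences."""
    x = np.asarray(x, dtype=float)
    m = func(x).shape[0]
    J = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (func(x + step) - func(x - step)) / (2 * h)
    return J

x0 = np.array([2.0, 3.0])
J_num = numerical_jacobian(f, x0)

# Hand-derived Jacobian: [[x1, x0], [1, 2 * x1]]
J_exact = np.array([[x0[1], x0[0]], [1.0, 2 * x0[1]]])

print(J_num)
print(np.allclose(J_num, J_exact))
```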

2.2.5 Worked Example: Derivative of the Sigmoid Function

Deriving σ'(z) Step by Step

The sigmoid function is one of the most important activation functions:

Sigmoid Function

σ(z) = 1 / (1 + e⁻ᶻ)

Let's derive its derivative from scratch:

Step 1. Rewrite: σ(z) = (1 + e⁻ᶻ)⁻¹

Step 2. Apply chain rule: Let u = 1 + e⁻ᶻ, so σ = u⁻¹
dσ/dz = dσ/du · du/dz

Step 3. Compute dσ/du: d/du [u⁻¹] = -u⁻² = -1/(1 + e⁻ᶻ)²

Step 4. Compute du/dz: d/dz [1 + e⁻ᶻ] = -e⁻ᶻ

Step 5. Multiply: dσ/dz = [-1/(1 + e⁻ᶻ)²] · [-e⁻ᶻ] = e⁻ᶻ / (1 + e⁻ᶻ)²

Step 6. Simplify:
    = [1/(1 + e⁻ᶻ)] · [e⁻ᶻ/(1 + e⁻ᶻ)]
    = [1/(1 + e⁻ᶻ)] · [(1 + e⁻ᶻ - 1)/(1 + e⁻ᶻ)]
    = [1/(1 + e⁻ᶻ)] · [1 - 1/(1 + e⁻ᶻ)]
    = σ(z) · (1 - σ(z))

Sigmoid Derivative — Key Result

σ'(z) = σ(z) · (1 − σ(z))

Maximum value: σ'(0) = 0.5 × 0.5 = 0.25
As |z| → ∞, σ'(z) → 0 (vanishing gradient!)

Why this matters: The sigmoid derivative is at most 0.25. When you chain many sigmoid layers, these factors multiply, so even in the best case three layers scale the gradient by 0.25 × 0.25 × 0.25 ≈ 0.016. After 10 layers: 0.25¹⁰ ≈ 0.000001. The gradient vanishes and the early layers stop learning. This is the vanishing gradient problem, and it is a major reason ReLU replaced sigmoid in hidden layers.
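The numbers in this argument are simple to reproduce. The sketch below evaluates σ'(z) at a few points and then computes the best-case factor 0.25ⁿ for several depths. It deliberately ignores the weight matrices, which also enter the real backpropagated gradient, so treat it as an upper bound on the sigmoid's contribution only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# The derivative peaks at z = 0 and decays quickly as |z| grows.
for z in (0.0, 2.0, 5.0):
    print(f"sigma'({z}) = {sigmoid_derivative(z):.6f}")

# Best-case product of n sigmoid derivatives (every layer sitting at z = 0).
bounds = {n: 0.25 ** n for n in (1, 3, 10)}
for n, b in bounds.items():
    print(f"{n:>2} layers: at most {b:.2e}")
```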

2.3 Probability & Statistics for Deep Learning

Neural networks don't output certainties — they output probabilities. Understanding probability is essential for designing loss functions, interpreting outputs, and reasoning about uncertainty.

2.3.1 Bernoulli Distribution

Bernoulli Distribution

X ~ Bernoulli(p)

P(X = 1) = p     P(X = 0) = 1 - p

Compact form: P(X = x) = pˣ(1-p)¹⁻ˣ   for x ∈ {0, 1}

Mean: E[X] = p   |   Variance: Var(X) = p(1-p)

Deep learning connection: Binary classification IS a Bernoulli distribution. When your model outputs P(churn) = 0.82 for a Paytm user, it's saying: "This user's churn follows Bernoulli(0.82)."

2.3.2 Gaussian (Normal) Distribution

Gaussian Distribution

X ~ N(μ, σ²)

p(x) = (1 / √(2πσ²)) · exp(−(x − μ)² / (2σ²))

Mean: μ | Variance: σ² | 68-95-99.7 rule

Deep learning connection: Common weight-initialization schemes (Xavier, He) often draw from zero-mean Gaussians, although uniform variants also exist. Noise in VAEs is Gaussian. Regression targets are often modeled as Gaussian with learned mean and variance.

2.3.3 Conditional Probability & Bayes' Theorem

Conditional Probability

P(A|B) = P(A ∩ B) / P(B) (probability of A given B)

Bayes' Theorem

P(A|B) = P(B|A) · P(A) / P(B)

posterior = (likelihood × prior) / evidence

India Connect — Bayes in Action

Flipkart's search engine uses Bayesian reasoning: P(user wants "iPhone" | typed "i phone") is high because P(typed "i phone" | wants "iPhone") is very high (likelihood), and iPhones are frequently searched (prior). This is how spelling correction and query understanding work.

2.3.4 Maximum Likelihood Estimation (MLE)

MLE answers: "Given the data we observed, what parameter values make the data most probable?"

MLE Principle

θ̂_MLE = argmax_θ P(Data | θ)

= argmax_θ Π P(xᵢ | θ) (assuming i.i.d. samples)

= argmax_θ Σ log P(xᵢ | θ) (log-likelihood — easier to optimize)

2.3.5 MLE for Bernoulli: Full Derivation

Worked Example: MLE for PhonePe Fraud Probability

PhonePe observes 1000 UPI transactions. 47 are fraudulent (y=1), 953 are legitimate (y=0). What is the MLE estimate of the fraud probability p?

Step 1. Model: Each transaction yᵢ ~ Bernoulli(p). We observe y₁, y₂, ..., y₁₀₀₀.

Step 2. Likelihood:
L(p) = Π_{i=1}^{1000} p^{yᵢ} (1-p)^{1-yᵢ}
= p^{Σyᵢ} (1-p)^{n - Σyᵢ}
= p⁴⁷ (1-p)⁹⁵³

Step 3. Log-likelihood:
ℓ(p) = log L(p) = 47·log(p) + 953·log(1-p)

Step 4. Differentiate and set to zero:
dℓ/dp = 47/p − 953/(1−p) = 0
47(1−p) = 953p
47 − 47p = 953p
47 = 1000p
p̂ = 47/1000 = 0.047

Step 5. Verify it's a maximum:
d²ℓ/dp² = −47/p² − 953/(1−p)² < 0 ✓ (concave → maximum)

Result: The MLE estimate of fraud probability is 4.7% — exactly the observed proportion! For Bernoulli, MLE always gives p̂ = (number of successes) / (total trials).
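The closed-form answer can be cross-checked by brute force: evaluate the log-likelihood ℓ(p) = 47·log(p) + 953·log(1−p) on a grid of candidate values and pick the largest. The grid resolution below is an assumption; any fine grid that contains 0.047 would do.

```python
import numpy as np

k, n = 47, 1000   # fraudulent transactions, total transactions

# Candidate values of p from 0.001 to 0.999 in steps of 0.001.
p_grid = np.linspace(0.001, 0.999, 999)

# Bernoulli log-likelihood evaluated at every candidate.
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)

p_best = p_grid[np.argmax(log_lik)]
print(f"Grid-search MLE: {p_best:.3f}")
print(f"Closed form k/n: {k / n:.3f}")
```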

2.3.6 From MLE to Cross-Entropy Loss

Here's the most important derivation in this chapter — why we use cross-entropy as a loss function.

The Bridge: Negative Log-Likelihood = Cross-Entropy

When our neural network outputs probability ŷᵢ for each sample, and the true label is yᵢ ∈ {0,1}, we want to maximize the likelihood. Equivalently, we minimize the negative log-likelihood:

From MLE to Binary Cross-Entropy

Likelihood: L = Π ŷᵢ^{yᵢ} (1-ŷᵢ)^{1-yᵢ}

Log-likelihood: ℓ = Σ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]

Negative log-likelihood, averaged over the n samples (loss to minimize):
L_BCE = −(1/n) Σ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]

This IS Binary Cross-Entropy Loss!

Cross-entropy is not an arbitrary choice — it emerges naturally from maximum likelihood estimation under a Bernoulli model. When someone says "we use cross-entropy loss for classification," they're saying "we're doing MLE."

Fun Fact

MSE (Mean Squared Error) is also an MLE result — but for a Gaussian model. If you assume your regression targets follow y ~ N(ŷ, σ²), then maximizing the log-likelihood gives you exactly the MSE loss. Every standard loss function has a probabilistic interpretation!
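The MSE claim can be verified numerically. Under y ~ N(ŷ, σ²) with fixed σ, the negative log-likelihood is (n/2)·log(2πσ²) + Σ(yᵢ − ŷᵢ)² / (2σ²), which is a constant plus a positive multiple of the MSE. The sketch below checks that identity on made-up targets and predictions; all data and names are assumptions.

```python
import numpy as np

def gaussian_nll(y, y_hat, sigma):
    """Negative log-likelihood of y under N(y_hat, sigma^2), summed over samples."""
    n = y.size
    return 0.5 * n * np.log(2 * np.pi * sigma ** 2) + np.sum((y - y_hat) ** 2) / (2 * sigma ** 2)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0])
pred_a = np.array([1.1, 1.9, 3.2, 3.8])   # close to the targets
pred_b = np.array([0.0, 3.0, 2.0, 5.0])   # far from the targets
sigma = 1.0
n = y.size
const = 0.5 * n * np.log(2 * np.pi * sigma ** 2)

for name, pred in (("A", pred_a), ("B", pred_b)):
    nll = gaussian_nll(y, pred, sigma)
    rebuilt = const + n * mse(y, pred) / (2 * sigma ** 2)
    print(f"Model {name}: NLL = {nll:.4f}, const + n*MSE/(2*sigma^2) = {rebuilt:.4f}")
```

Because the constant and the scale factor do not depend on the predictions, whichever model has the lower MSE also has the lower Gaussian NLL.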

2.4 Information Theory for Deep Learning

Information theory, pioneered by Claude Shannon (1948), gives us a mathematical framework for quantifying uncertainty and information. It provides the theoretical foundation for why cross-entropy is the natural loss function for classification.

2.4.1 Entropy: Measuring Uncertainty

Shannon Entropy

H(p) = −Σ p(x) log₂ p(x)

(In deep learning, we use natural log: H(p) = −Σ p(x) ln p(x))

Intuition: Entropy measures how "surprised" you are on average. A fair coin (p=0.5) has maximum entropy — you're maximally uncertain. A biased coin (p=0.99) has low entropy — you know it'll be heads.

Examples

Distribution	p(heads)	H (bits)	Interpretation
Fair coin	0.5	1.0	Maximum uncertainty
Biased coin	0.9	0.47	Fairly predictable
Certain coin	1.0	0.0	No uncertainty
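The table values can be reproduced with a short function. This sketch computes the entropy of a coin in bits (log base 2); the helper name `binary_entropy_bits` is an assumption, and zero-probability outcomes are dropped because they contribute nothing to the sum.

```python
import numpy as np

def binary_entropy_bits(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    probs = np.array([p, 1.0 - p])
    probs = probs[probs > 0]              # 0 * log(0) is treated as 0
    h = -np.sum(probs * np.log2(probs))
    return float(h) + 0.0                 # turns a possible -0.0 into 0.0

for p in (0.5, 0.9, 1.0):
    print(f"p(heads) = {p}: H = {binary_entropy_bits(p):.3f} bits")
```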

2.4.2 KL Divergence: Measuring Distribution Difference

Kullback-Leibler Divergence

D_KL(p ‖ q) = Σ p(x) log [p(x) / q(x)]

= Σ p(x) log p(x) − Σ p(x) log q(x)

= −H(p) + H(p, q)

Properties:
• D_KL ≥ 0 (always non-negative — Gibbs' inequality)
• D_KL = 0 iff p = q
• NOT symmetric: D_KL(p‖q) ≠ D_KL(q‖p) in general

Intuition: KL divergence measures how much "extra information" you need when you use distribution q to approximate distribution p. It's the "penalty" for using the wrong distribution.
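The asymmetry property is worth seeing with numbers. The sketch below computes the divergence in both directions for two made-up three-outcome distributions; the distributions and the helper name `kl` are assumptions, and both distributions are strictly positive so no clipping is needed.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in nats, for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.3, 0.5])

kl_pq = kl(p, q)
kl_qp = kl(q, p)

print(f"D_KL(p || q) = {kl_pq:.4f}")
print(f"D_KL(q || p) = {kl_qp:.4f}")
print(f"D_KL(p || p) = {kl(p, p):.4f}")
```

Both directions are non-negative, but they differ, which is why the order of the arguments must always be stated.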

2.4.3 Cross-Entropy: The Natural Loss

Cross-Entropy

H(p, q) = −Σ p(x) log q(x)

= H(p) + D_KL(p ‖ q)

Since H(p) is constant w.r.t. model parameters:
Minimizing Cross-Entropy = Minimizing KL Divergence

Why Cross-Entropy is the Natural Classification Loss

Three perspectives converge to the same answer:

MLE perspective: Cross-entropy = negative log-likelihood of a Bernoulli/Categorical model
Information theory perspective: Minimizing cross-entropy = minimizing KL divergence between true and predicted distributions
Practical perspective: Cross-entropy produces stronger gradients than MSE for wrong predictions (no gradient saturation)

All three say: cross-entropy is the right loss for classification.

Numerical Example: Why Cross-Entropy > MSE for Classification

True label: y = 1. Model predicts ŷ = 0.01 (confidently wrong!).

Loss Function	Value	Gradient w.r.t. ŷ
MSE: (y − ŷ)²	(1 − 0.01)² = 0.98	−2(1 − 0.01) = −1.98
Cross-Entropy: −y log(ŷ)	−log(0.01) = 4.61	−1/0.01 = −100

Cross-entropy gives a gradient of −100 vs MSE's −1.98. When the model is confidently wrong, cross-entropy punishes it 50× harder and pushes stronger corrections. That's why training converges faster with cross-entropy.
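The table entries follow directly from the two loss formulas. This sketch recomputes them for y = 1 and ŷ = 0.01, using d/dŷ (y − ŷ)² = −2(y − ŷ) and d/dŷ [−y·log(ŷ)] = −y/ŷ; the variable names are assumptions.

```python
import numpy as np

y, y_hat = 1.0, 0.01   # true label and a confidently wrong prediction

# Squared error and its gradient with respect to y_hat.
mse_loss = (y - y_hat) ** 2
mse_grad = -2 * (y - y_hat)

# Cross-entropy (y = 1 term only) and its gradient with respect to y_hat.
ce_loss = -y * np.log(y_hat)
ce_grad = -y / y_hat

print(f"MSE: loss = {mse_loss:.4f}, gradient = {mse_grad:.2f}")
print(f"CE : loss = {ce_loss:.4f}, gradient = {ce_grad:.2f}")
print(f"Gradient ratio CE / MSE = {ce_grad / mse_grad:.1f}")
```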

Common Mistake

Students often confuse entropy H(p), cross-entropy H(p,q), and KL divergence D_KL(p‖q). Remember: H(p,q) = H(p) + D_KL(p‖q). Cross-entropy decomposes into "irreducible uncertainty" (entropy) plus "extra cost from using the wrong model" (KL divergence).

NumPy Code Lab: Math from Scratch

Let's implement every key operation in NumPy. This is where formulas become executable code.

4.1 Linear Algebra Operations

Python
import numpy as np

# ────────────────── SCALARS, VECTORS, MATRICES, TENSORS ──────────────────

# Scalar
learning_rate = 0.01

# Vector — 5 features of a Paytm user
user = np.array([25, 12500, 3, 0.85, 450])
print(f"User vector shape: {user.shape}")   # (5,)

# Matrix — 100 users × 5 features
np.random.seed(42)
X = np.random.randn(100, 5)
print(f"Data matrix shape: {X.shape}")    # (100, 5)

# Tensor — batch of 32 color images (28×28)
images = np.random.randn(32, 28, 28, 3)
print(f"Image batch shape: {images.shape}")  # (32, 28, 28, 3)

# ────────────────── DOT PRODUCT ──────────────────

# Vector dot product: weighted sum of features
weights = np.array([0.3, 0.0001, -0.5, 2.0, 0.002])
score = np.dot(user, weights)
print(f"Churn score: {score:.4f}")

# Matrix multiplication: all users at once
W = np.random.randn(5, 3)    # 5 features → 3 hidden neurons
H = X @ W                      # (100,5) @ (5,3) = (100,3)
print(f"Hidden layer shape: {H.shape}")  # (100, 3)

# ────────────────── TRANSPOSE ──────────────────
A = np.array([[1, 2, 3],
              [4, 5, 6]])
print(f"A shape: {A.shape}")          # (2, 3)
print(f"A^T shape: {A.T.shape}")      # (3, 2)

# Verify (AB)^T = B^T A^T
B = np.random.randn(3, 4)
lhs = (A @ B).T
rhs = B.T @ A.T
print(f"(AB)^T == B^T A^T: {np.allclose(lhs, rhs)}")  # True

# ────────────────── INVERSE ──────────────────
M = np.array([[4, 7], [2, 6]])
M_inv = np.linalg.inv(M)
print(f"M · M⁻¹ =\n{M @ M_inv}")     # Identity matrix

# ────────────────── HADAMARD (ELEMENT-WISE) PRODUCT ──────────────────
gate = np.array([[1, 0, 1],
                 [0, 1, 0]])   # Binary mask (like dropout)
data = np.array([[5, 3, 8],
                 [2, 7, 4]])
masked = gate * data            # Hadamard product: * in NumPy
print(f"Hadamard:\n{masked}")     # [[5, 0, 8], [0, 7, 0]]

4.2 Calculus: Numerical Gradient & Sigmoid

Python
import numpy as np

# ────────────────── SIGMOID AND ITS DERIVATIVE ──────────────────

def sigmoid(z):
    """σ(z) = 1 / (1 + e^(-z))"""
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    """σ'(z) = σ(z)(1 - σ(z))"""
    s = sigmoid(z)
    return s * (1 - s)

# Verify: maximum of sigmoid derivative is at z=0
z_values = np.linspace(-6, 6, 1001)   # odd count so the grid includes z = 0
derivs = sigmoid_derivative(z_values)
print(f"Max σ'(z) = {derivs.max():.4f} at z = {z_values[derivs.argmax()]:.2f}")
# Max σ'(z) = 0.2500 at z = 0.00

# ────────────────── NUMERICAL GRADIENT ──────────────────

def numerical_gradient(f, x, h=1e-7):
    """Compute gradient numerically (central difference)"""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy(); x_plus[i] += h
        x_minus = x.copy(); x_minus[i] -= h
        grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    return grad

# Test: f(x,y) = x² + 3xy → ∂f/∂x = 2x+3y, ∂f/∂y = 3x
def test_func(v):
    x, y = v[0], v[1]
    return x**2 + 3*x*y

point = np.array([2.0, 5.0])
num_grad = numerical_gradient(test_func, point)
analytical_grad = np.array([2*2 + 3*5, 3*2])  # [19, 6]

print(f"Numerical gradient:  {num_grad}")
print(f"Analytical gradient: {analytical_grad}")
print(f"Match: {np.allclose(num_grad, analytical_grad)}")  # True

# ────────────────── GRADIENT DESCENT DEMO ──────────────────

def loss(w):
    """Simple quadratic loss: L = (w - 3)²"""
    return (w - 3)**2

w = 0.0           # Start at w=0
lr = 0.1          # Learning rate
print(f"{'Step':>4} {'w':>8} {'Loss':>10}")
for step in range(20):
    grad = 2 * (w - 3)   # dL/dw = 2(w-3)
    w = w - lr * grad      # Gradient descent update
    print(f"{step+1:>4} {w:>8.4f} {loss(w):>10.6f}")
# w converges to 3.0 (the minimum)

4.3 Probability: Distributions, MLE, Cross-Entropy

Python
import numpy as np

# ────────────────── BERNOULLI DISTRIBUTION ──────────────────

def bernoulli_pmf(x, p):
    """P(X=x) = p^x * (1-p)^(1-x)"""
    return (p ** x) * ((1 - p) ** (1 - x))

# PhonePe fraud example: p = 0.047
p_fraud = 0.047
print(f"P(fraud)     = {bernoulli_pmf(1, p_fraud):.4f}")   # 0.047
print(f"P(not fraud) = {bernoulli_pmf(0, p_fraud):.4f}")   # 0.953

# ────────────────── GAUSSIAN DISTRIBUTION ──────────────────

def gaussian_pdf(x, mu, sigma):
    """p(x) = (1/√(2πσ²)) exp(-(x-μ)²/(2σ²))"""
    coeff = 1 / np.sqrt(2 * np.pi * sigma**2)
    exponent = -((x - mu)**2) / (2 * sigma**2)
    return coeff * np.exp(exponent)

# Zomato delivery time: mean 35 min, std 8 min
# Note: a PDF returns probability densities (per minute), not probabilities
x = np.array([25, 30, 35, 40, 50])
densities = gaussian_pdf(x, mu=35, sigma=8)
for xi, di in zip(x, densities):
    print(f"p(delivery={xi} min) = {di:.4f}")

# ────────────────── MLE FOR BERNOULLI ──────────────────

# Simulate 1000 transactions, 47 fraudulent
np.random.seed(42)
transactions = np.zeros(1000)
transactions[:47] = 1
np.random.shuffle(transactions)

# MLE estimate: p_hat = sum(x) / n
p_hat = transactions.mean()
print(f"\nMLE estimate of fraud probability: {p_hat:.4f}")  # 0.047

# Log-likelihood at p_hat
log_lik = np.sum(transactions * np.log(p_hat) +
                 (1 - transactions) * np.log(1 - p_hat))
print(f"Log-likelihood at p_hat: {log_lik:.2f}")

# ────────────────── CROSS-ENTROPY LOSS ──────────────────

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    """BCE = -(1/n) Σ [y log(ŷ) + (1-y) log(1-ŷ)]"""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # Numerical stability
    return -np.mean(
        y_true * np.log(y_pred) +
        (1 - y_true) * np.log(1 - y_pred)
    )

# Test: true labels vs model predictions
y_true = np.array([1, 0, 1, 1, 0])
y_good = np.array([0.9, 0.1, 0.85, 0.95, 0.05])  # Good model
y_bad  = np.array([0.3, 0.8, 0.4, 0.2, 0.7])   # Bad model

print(f"\nGood model BCE: {binary_cross_entropy(y_true, y_good):.4f}")
print(f"Bad model  BCE: {binary_cross_entropy(y_true, y_bad):.4f}")
# Good model has LOWER loss → cross-entropy works!

# ────────────────── ENTROPY & KL DIVERGENCE ──────────────────

def entropy(p):
    """H(p) = -Σ p(x) log p(x)"""
    p = p[p > 0]  # Avoid log(0)
    return -np.sum(p * np.log(p))

def kl_divergence(p, q, eps=1e-15):
    """D_KL(p || q) = Σ p(x) log(p(x)/q(x))"""
    q = np.clip(q, eps, 1.0)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def cross_entropy(p, q, eps=1e-15):
    """H(p, q) = -Σ p(x) log q(x)"""
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

# True distribution vs two models
p_true = np.array([0.7, 0.2, 0.1])   # 3-class problem
q_good = np.array([0.65, 0.25, 0.1])
q_bad  = np.array([0.2, 0.3, 0.5])

print(f"\nEntropy H(p):        {entropy(p_true):.4f}")
print(f"KL(p||q_good):       {kl_divergence(p_true, q_good):.4f}")
print(f"KL(p||q_bad):        {kl_divergence(p_true, q_bad):.4f}")
print(f"CrossEnt(p, q_good): {cross_entropy(p_true, q_good):.4f}")
print(f"CrossEnt(p, q_bad):  {cross_entropy(p_true, q_bad):.4f}")
print(f"Verify: H(p) + KL(p||q_good) = {entropy(p_true) + kl_divergence(p_true, q_good):.4f}")
# Should equal cross_entropy(p_true, q_good)

Visual Walkthrough: Matrix Multiplication

Let's trace through a (3×2) by (2×3) matrix multiplication by hand, step by step. The result is a 3×3 matrix.

MATRIX MULTIPLICATION: C = A × B (3×3 result)

  A (3×2)         B (2×3)              C (3×3)
┌      ┐     ┌            ┐     ┌               ┐
│ 1  2 │     │  7   8   9 │     │ C₁₁  C₁₂  C₁₃ │
│ 3  4 │  ×  │ 10  11  12 │  =  │ C₂₁  C₂₂  C₂₃ │
│ 5  6 │     └            ┘     │ C₃₁  C₃₂  C₃₃ │
└      ┘                        └               ┘

Step-by-step computation:

C₁₁ = Row1(A) · Col1(B) = (1×7) + (2×10) = 7 + 20 = 27
C₁₂ = Row1(A) · Col2(B) = (1×8) + (2×11) = 8 + 22 = 30
C₁₃ = Row1(A) · Col3(B) = (1×9) + (2×12) = 9 + 24 = 33
C₂₁ = Row2(A) · Col1(B) = (3×7) + (4×10) = 21 + 40 = 61
C₂₂ = Row2(A) · Col2(B) = (3×8) + (4×11) = 24 + 44 = 68
C₂₃ = Row2(A) · Col3(B) = (3×9) + (4×12) = 27 + 48 = 75
C₃₁ = Row3(A) · Col1(B) = (5×7) + (6×10) = 35 + 60 = 95
C₃₂ = Row3(A) · Col2(B) = (5×8) + (6×11) = 40 + 66 = 106
C₃₃ = Row3(A) · Col3(B) = (5×9) + (6×12) = 45 + 72 = 117

Result:

┌              ┐
│ 27   30   33 │    Shape check: (3×2) × (2×3) = (3×3) ✓
│ 61   68   75 │    Inner dims match: 2 = 2 ✓
│ 95  106  117 │    Total multiplications: 3×3×2 = 18
└              ┘

NEURAL NETWORK FORWARD PASS: z = Wx + b

Input x (3×1)     Weights W (2×3)        Bias b      Output z (2×1)
┌     ┐       ┌                 ┐      ┌      ┐      ┌    ┐
│ 0.5 │       │  0.2  0.8  -0.1 │      │  0.1 │      │ z₁ │
│ 1.0 │   →   │ -0.5  0.3   0.6 │  +   │ -0.2 │  =   │ z₂ │
│ 0.3 │       └                 ┘      └      ┘      └    ┘
└     ┘

z₁ = (0.2×0.5) + (0.8×1.0) + (-0.1×0.3) + 0.1 = 0.10 + 0.80 - 0.03 + 0.10 = 0.97
z₂ = (-0.5×0.5) + (0.3×1.0) + (0.6×0.3) + (-0.2) = -0.25 + 0.30 + 0.18 - 0.20 = 0.03

After sigmoid: σ(0.97) = 0.725    σ(0.03) = 0.507

Each neuron computes a weighted sum + bias, then passes through an activation function.
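The forward pass traced above takes two lines of NumPy. This sketch uses the same input, weights, and bias as the walkthrough; only the variable names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, 1.0, 0.3])              # input, 3 features
W = np.array([[ 0.2, 0.8, -0.1],
              [-0.5, 0.3,  0.6]])          # weights, shape (2, 3)
b = np.array([0.1, -0.2])                  # one bias per output neuron

z = W @ x + b        # weighted sums, shape (2,)
a = sigmoid(z)       # activations

print("z =", np.round(z, 2))
print("sigmoid(z) =", np.round(a, 3))
```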

Worked Examples: Hand Calculations

Example 1: Matrix Operations for a Jio User Dataset

A Jio dataset has 3 users × 3 features: [data_usage_GB, recharge_₹, calls_min]

Calculation
X = ┌             ┐       W = ┌      ┐
    │ 12   399  45│           │ 0.1  │
    │  5   199  80│           │ 0.005│    (churn weight vector)
    │ 25   599  20│           │-0.02 │
    └             ┘           └      ┘

Compute X·W (churn scores):

Step 1. User 1: (12×0.1) + (399×0.005) + (45×-0.02) = 1.2 + 1.995 - 0.9 = 2.295

Step 2. User 2: (5×0.1) + (199×0.005) + (80×-0.02) = 0.5 + 0.995 - 1.6 = −0.105

Step 3. User 3: (25×0.1) + (599×0.005) + (20×-0.02) = 2.5 + 2.995 - 0.4 = 5.095

After sigmoid: σ(2.295) = 0.908, σ(−0.105) = 0.474, σ(5.095) = 0.994

Interpretation: User 3 (high data, high recharge, few calls) has 99.4% predicted churn probability — likely a heavy user switching to another carrier. User 2 at 47.4% — borderline.

Example 2: Cross-Entropy Loss Calculation

3 Flipkart customers. True labels (will return product): y = [1, 0, 1]

Model predictions: ŷ = [0.8, 0.3, 0.6]

Step 1. Per-sample loss:
L₁ = −[1·log(0.8) + 0·log(0.2)] = −log(0.8) = −(−0.2231) = 0.2231
L₂ = −[0·log(0.3) + 1·log(0.7)] = −log(0.7) = −(−0.3567) = 0.3567
L₃ = −[1·log(0.6) + 0·log(0.4)] = −log(0.6) = −(−0.5108) = 0.5108

Step 2. Average loss:
BCE = (0.2231 + 0.3567 + 0.5108) / 3 = 0.3635

Interpretation: Sample 3 (y=1, ŷ=0.6) contributes the most loss because the model assigns the lowest probability to the correct answer. The penalty −log(ŷ) grows as the predicted probability of the true class falls: slowly near 1 and without bound near 0.

Example 3: Computing Entropy and KL Divergence

Swiggy food category distribution in Bangalore:

True: p = [Biryani: 0.4, Pizza: 0.3, Dosa: 0.2, Other: 0.1]

Model A: q_A = [0.35, 0.30, 0.25, 0.10]

Model B: q_B = [0.10, 0.10, 0.10, 0.70]

Step 1. Entropy H(p):
H = −[0.4 ln(0.4) + 0.3 ln(0.3) + 0.2 ln(0.2) + 0.1 ln(0.1)]
= −[0.4(−0.9163) + 0.3(−1.2040) + 0.2(−1.6094) + 0.1(−2.3026)]
= −[−0.3665 − 0.3612 − 0.3219 − 0.2303]
= 1.2799 ≈ 1.280 nats

Step 2. KL(p ‖ q_A):
= 0.4 ln(0.4/0.35) + 0.3 ln(0.3/0.30) + 0.2 ln(0.2/0.25) + 0.1 ln(0.1/0.10)
= 0.4(0.134) + 0.3(0) + 0.2(−0.223) + 0.1(0)
= 0.054 − 0.045 = 0.009 nats (very close!)

Step 3. KL(p ‖ q_B):
= 0.4 ln(0.4/0.10) + 0.3 ln(0.3/0.10) + 0.2 ln(0.2/0.10) + 0.1 ln(0.1/0.70)
= 0.4(1.386) + 0.3(1.099) + 0.2(0.693) + 0.1(−1.946)
= 0.554 + 0.330 + 0.139 − 0.195 = 0.828 nats (very far!)

Conclusion: Model A's KL divergence (0.009 nats) is roughly 90× smaller than Model B's (0.828 nats), so Model A approximates the true distribution far more closely. KL divergence correctly captures that Model B's prediction of 70% "Other" is badly wrong for Bangalore food orders.
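The three hand calculations in this example can be checked with a few lines of NumPy. This sketch uses the distributions exactly as given; the helper names are assumptions, and every probability is strictly positive so no clipping is required.

```python
import numpy as np

def entropy(p):
    """H(p) in nats."""
    return float(-np.sum(p * np.log(p)))

def kl(p, q):
    """D_KL(p || q) in nats."""
    return float(np.sum(p * np.log(p / q)))

# Order: Biryani, Pizza, Dosa, Other
p   = np.array([0.40, 0.30, 0.20, 0.10])
q_A = np.array([0.35, 0.30, 0.25, 0.10])
q_B = np.array([0.10, 0.10, 0.10, 0.70])

H_p  = entropy(p)
kl_A = kl(p, q_A)
kl_B = kl(p, q_B)

print(f"H(p)        = {H_p:.3f} nats")
print(f"KL(p ‖ q_A) = {kl_A:.3f} nats")
print(f"KL(p ‖ q_B) = {kl_B:.3f} nats")
```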

Common Mistakes & Pitfalls

Mistake 1: Confusing Matrix Multiply and Element-wise Multiply

Wrong: Using A * B (Hadamard) when you mean A @ B (matrix multiply) in NumPy. These are completely different operations! A * B requires same shapes; A @ B requires inner dimensions to match.

Fix: Always use @ or np.dot() for matrix multiplication. Reserve * for element-wise operations.

Mistake 2: Shape Mismatches in Matrix Multiplication

Wrong: Trying to multiply (100, 5) × (100, 3). Inner dimensions don't match (5 ≠ 100).

Fix: Always write shapes side by side: (m×n) × (n×p). The two inner dimensions (the n in each pair) must be equal. If not, transpose one matrix.

Mistake 3: Using MSE for Classification

Wrong: MSE as loss for binary classification. Gradients saturate when sigmoid output is near 0 or 1.

Fix: Always use cross-entropy for classification. It's the MLE-optimal loss and produces stronger gradients for wrong predictions.

Mistake 4: Forgetting log(0) is Undefined

Wrong: Computing np.log(y_pred) when y_pred contains 0. Result: -inf or NaN.

Fix: Always clip predictions: y_pred = np.clip(y_pred, 1e-15, 1-1e-15) before computing log.
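The sketch below shows both failure modes and the effect of clipping. The two-sample arrays are assumptions built to trigger the problem: a prediction of exactly 0 for a positive label gives an infinite loss, and a prediction of exactly 1 gives NaN through the 0 × log(0) term. `np.errstate` is used only to silence the warnings NumPy would otherwise emit.

```python
import numpy as np

def bce_terms(y_true, y_pred):
    """Per-sample binary cross-entropy, with no protection against log(0)."""
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1.0, 1.0])
y_pred = np.array([0.0, 1.0])   # exact 0 and exact 1 both cause trouble

# Without clipping: log(0) = -inf, and 0 * (-inf) = nan.
with np.errstate(divide="ignore", invalid="ignore"):
    raw = bce_terms(y_true, y_pred)
print("unclipped:", raw)

# With clipping: predictions are kept strictly inside (0, 1).
eps = 1e-15
safe = bce_terms(y_true, np.clip(y_pred, eps, 1 - eps))
print("clipped:  ", safe)
```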

Mistake 5: Thinking KL Divergence is Symmetric

Wrong: Assuming D_KL(p‖q) = D_KL(q‖p). It's NOT — KL divergence is not a true "distance."

Fix: Always specify direction. In training, we minimize D_KL(p_true ‖ q_model), which equals minimizing cross-entropy H(p, q).

Mistake 6: Ignoring the Vanishing Gradient of Sigmoid

Wrong: Stacking many sigmoid layers and wondering why the network doesn't learn.

Fix: Use ReLU for hidden layers. Reserve sigmoid only for the final output layer of binary classification. We derived that σ'(z) ≤ 0.25 — chaining many of these kills the gradient.

Exercises

Section A: Multiple Choice Questions

1. What is the shape of the result when you multiply a matrix of shape (64, 128) with a matrix of shape (128, 10)?
(a) (128, 128) (b) (64, 10) (c) (10, 64) (d) (64, 128, 10)
Answer: (b). (64×128) × (128×10) = (64×10)
2. Which operation is used in LSTM gating mechanisms?
(a) Matrix multiplication (b) Hadamard (element-wise) product (c) Matrix inverse (d) Eigenvalue decomposition
Answer: (b). Gates multiply element-wise with the cell state
3. What is the derivative of σ(z) = 1/(1+e⁻ᶻ)?
(a) σ(z)² (b) σ(z)(1−σ(z)) (c) 1−σ(z)² (d) e⁻ᶻ/(1+e⁻ᶻ)
Answer: (b). σ'(z) = σ(z)(1 − σ(z)), derived in Section 2.2.5
4. The maximum value of σ'(z) is:
(a) 1.0 (b) 0.5 (c) 0.25 (d) 0.1
Answer: (c). At z=0: σ(0)=0.5, σ'(0)=0.5×0.5=0.25
5. Which of the following is the correct chain rule?
(a) d/dx[f(g(x))] = f'(x)·g'(x) (b) d/dx[f(g(x))] = f'(g(x))·g'(x) (c) d/dx[f(g(x))] = f(g'(x)) (d) d/dx[f(g(x))] = f'(g(x))+g'(x)
Answer: (b). Derivative of outer evaluated at inner × derivative of inner
6. Cross-entropy loss for binary classification is derived from:
(a) Mean Squared Error (b) Maximum A Posteriori (c) Maximum Likelihood Estimation (d) Least Absolute Deviation
Answer: (c). BCE = negative log-likelihood of Bernoulli MLE
7. If P(fraud) = 0.03 and the model predicts P̂(fraud) = 0.01, the cross-entropy loss −[y·log(ŷ)] for a fraud sample (y=1) is:
(a) −log(0.03) (b) −log(0.01) = 4.61 (c) −log(0.99) = 0.01 (d) −log(0.97)
Answer: (b). For y=1, loss = −log(ŷ) = −log(0.01) = 4.61
8. KL divergence D_KL(p‖q) is always:
(a) Negative (b) Zero (c) Non-negative (≥ 0) (d) Symmetric
Answer: (c). Gibbs' inequality guarantees D_KL ≥ 0, with equality iff p=q
9. The gradient vector ∇f points in the direction of:
(a) Steepest descent (b) Steepest ascent (c) Zero change (d) Random direction
Answer: (b). Gradient points to steepest ascent; we move opposite for descent
10. Which is the correct relationship between entropy, cross-entropy, and KL divergence?
(a) H(p,q) = H(p) − D_KL(p‖q) (b) H(p,q) = H(p) + D_KL(p‖q) (c) H(p,q) = D_KL(p‖q) − H(p) (d) H(p,q) = H(p) × D_KL(p‖q)
Answer: (b). Cross-entropy = entropy + KL divergence

Section B: Hand Calculation Problems

Matrix Multiplication. Compute C = A × B by hand:
```
A = ┌       ┐     B = ┌    ┐
    │ 2   3 │         │ 1  │
    │ 1  -1 │         │ 4  │
    └       ┘         └    ┘
```
Solution:

C = A × B = [(2×1)+(3×4), (1×1)+(−1×4)]ᵀ = [14, −3]ᵀ

This is a (2×2) × (2×1) = (2×1) result. Each element is the dot product of one row of A with the column vector B.
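The hand calculation can be checked in NumPy with the same A and B; this is only a verification sketch.

```python
import numpy as np

A = np.array([[2, 3],
              [1, -1]])
B = np.array([[1],
              [4]])          # column vector, shape (2, 1)

C = A @ B                    # matrix product: (2x2) @ (2x1) -> (2x1)

print(C.shape)               # (2, 1)
print(C)                     # entries 14 and -3

# Each entry is the dot product of a row of A with the column B.
print(np.dot(A[0], B[:, 0]), np.dot(A[1], B[:, 0]))
```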
Chain Rule. Find dy/dx for y = ln(sin(3x²)).
Solution:

1. Let u = 3x², v = sin(u), y = ln(v)

2. dy/dv = 1/v = 1/sin(3x²)

3. dv/du = cos(u) = cos(3x²)

4. du/dx = 6x

5. dy/dx = (1/sin(3x²)) · cos(3x²) · 6x = 6x · cot(3x²)
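A quick way to gain confidence in a chain-rule result is to compare it with a central finite difference. The sketch below does this at x = 0.5, a point chosen (as an assumption for the example) so that sin(3x²) is positive and the logarithm is defined.

```python
import numpy as np

def y(x):
    return np.log(np.sin(3.0 * x**2))

def dy_dx(x):
    # Analytic result from the chain rule: 6x * cot(3x^2)
    return 6.0 * x / np.tan(3.0 * x**2)

x0 = 0.5          # sin(3 * 0.25) = sin(0.75) > 0
h = 1e-6          # step size for the finite difference

numeric = (y(x0 + h) - y(x0 - h)) / (2.0 * h)
analytic = dy_dx(x0)

print(f"numeric  = {numeric:.6f}")
print(f"analytic = {analytic:.6f}")
```

The two values should agree to several decimal places; a large gap usually means a factor was dropped somewhere in the chain.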
Gradient Computation. For L(w₁, w₂) = (2w₁ + 3w₂ − 7)², compute ∇L at w₁=1, w₂=1.
Solution:

1. At (1,1): L = (2+3−7)² = (−2)² = 4

2. ∂L/∂w₁ = 2(2w₁+3w₂−7)·2 = 4(2+3−7) = 4(−2) = −8

3. ∂L/∂w₂ = 2(2w₁+3w₂−7)·3 = 6(2+3−7) = 6(−2) = −12

4. ∇L = [−8, −12]ᵀ. Gradient descent moves in the opposite direction, −∇L = [+8, +12]ᵀ, scaled by the learning rate.
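The gradient above can be checked in code, together with one gradient descent step to confirm that moving against the gradient lowers the loss. The learning rate α = 0.01 is a hypothetical value picked for the illustration.

```python
import numpy as np

def L(w):
    return (2.0 * w[0] + 3.0 * w[1] - 7.0) ** 2

def grad_L(w):
    r = 2.0 * w[0] + 3.0 * w[1] - 7.0     # residual inside the square
    return np.array([4.0 * r, 6.0 * r])   # [dL/dw1, dL/dw2]

w = np.array([1.0, 1.0])
g = grad_L(w)

alpha = 0.01                 # hypothetical learning rate
w_new = w - alpha * g        # step in the direction of -grad

print("gradient :", g)
print("L before :", L(w))
print("w after  :", w_new)
print("L after  :", L(w_new))
```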
Cross-Entropy. True labels: y = [1, 0, 1, 0]. Predictions: ŷ = [0.9, 0.2, 0.7, 0.1]. Compute the binary cross-entropy loss.
Solution (log is the natural logarithm):

1. L₁ = −log(0.9) = 0.1054

2. L₂ = −log(1−0.2) = −log(0.8) = 0.2231

3. L₃ = −log(0.7) = 0.3567

4. L₄ = −log(1−0.1) = −log(0.9) = 0.1054

5. BCE = (0.1054+0.2231+0.3567+0.1054)/4 = 0.7906/4 ≈ 0.1976
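The same computation in NumPy, as a short verification sketch using the labels and predictions from the problem. When the logs are not rounded to four decimals first, the mean comes out at about 0.1976.

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0, 0.0])
y_hat = np.array([0.9, 0.2, 0.7, 0.1])

# Per-sample loss: -log(y_hat) when y = 1, -log(1 - y_hat) when y = 0.
per_sample = -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
bce = per_sample.mean()

print(np.round(per_sample, 4))
print(f"BCE = {bce:.4f}")
```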
MLE. A Zomato delivery model observes 200 orders. 160 arrive on time (y=1), 40 are late (y=0). Find the MLE estimate of on-time probability. Then compute the log-likelihood at that estimate.
Solution:

1. p̂ = 160/200 = 0.80

2. ℓ(p̂) = 160·ln(0.8) + 40·ln(0.2)

3. = 160(−0.2231) + 40(−1.6094)

4. = −35.70 + (−64.38) = −100.08

The negative value is normal: log-likelihoods for probabilities < 1 are always negative.
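As a sanity check, the log-likelihood can be evaluated on a grid of candidate probabilities to confirm that it peaks at p̂ = 0.80. The grid from 0.01 to 0.99 in steps of 0.01 is an arbitrary choice for this sketch.

```python
import numpy as np

n_on_time, n_late = 160, 40

def log_likelihood(p):
    # Bernoulli log-likelihood for 160 successes and 40 failures
    return n_on_time * np.log(p) + n_late * np.log(1.0 - p)

p_hat = n_on_time / (n_on_time + n_late)

grid = np.linspace(0.01, 0.99, 99)              # candidate values of p
best = grid[np.argmax(log_likelihood(grid))]    # grid point with highest likelihood

print(f"closed-form MLE : {p_hat:.2f}")
print(f"grid-search max : {best:.2f}")
print(f"log-likelihood  : {log_likelihood(p_hat):.2f}")
```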

Section D: Programming Problems

Implement a complete gradient descent optimizer for f(x) = x⁴ − 3x³ + 2.
Start from x=6. Use learning rate 0.01. Run 1000 steps. Print x and f(x) every 100 steps. Verify that x converges near the global minimum. Plot the loss curve.

Starter code:

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return x**4 - 3*x**3 + 2

def df(x):
    # TODO: compute the derivative
    pass

x = 6.0
lr = 0.01
history = []

for step in range(1000):
    # TODO: gradient descent update
    # TODO: record history
    pass

# TODO: plot history
```

Build a softmax + cross-entropy loss function from scratch.
Given logits z = [2.0, 1.0, 0.1] and true class y = 0, implement:
(a) Softmax: p_i = exp(z_i) / Σ exp(z_j)
(b) Cross-entropy: L = −log(p_y)
(c) Gradient: ∂L/∂z_i = p_i − 1{i=y}
Verify your gradient numerically.

Starter code:

```python
import numpy as np

def softmax(z):
    # TODO: implement (use max trick for stability)
    pass

def cross_entropy_loss(probs, y_true):
    # TODO: -log(probs[y_true])
    pass

def softmax_ce_gradient(probs, y_true):
    # TODO: p_i - 1{i=y}
    pass

z = np.array([2.0, 1.0, 0.1])
y = 0

# TODO: compute and print softmax, loss, gradient
# TODO: verify gradient numerically
```
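For the numerical verification step, one common approach is a central finite difference applied to one coordinate at a time. The helper below is a generic sketch with a hypothetical name, demonstrated on a simple quadratic rather than on the softmax loss, so the exercise itself is left to you.

```python
import numpy as np

def numerical_gradient(f, z, h=1e-5):
    """Approximate the gradient of a scalar function f at z, one coordinate at a time."""
    z = np.asarray(z, dtype=float)
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = h
        # Central difference: (f(z + h e_i) - f(z - h e_i)) / (2h)
        grad[i] = (f(z + step) - f(z - step)) / (2.0 * h)
    return grad

# Demonstration on f(z) = sum(z^2), whose exact gradient is 2z.
def f(z):
    return np.sum(z ** 2)

z = np.array([2.0, 1.0, 0.1])
approx = numerical_gradient(f, z)
exact = 2.0 * z

print("numerical:", approx)
print("exact    :", exact)
print("match    :", np.allclose(approx, exact, atol=1e-6))
```

To check your own gradient, pass a function that maps logits to the scalar loss and compare the result with your analytic gradient using `np.allclose`.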

Implement a complete 2-layer neural network forward pass using only NumPy.
Architecture: 4 inputs → 3 hidden (ReLU) → 1 output (sigmoid).
Initialize random weights (use seed 42). Process a batch of 5 samples. Print shapes at every step. Compute binary cross-entropy loss.

Starter code:

```python
import numpy as np
np.random.seed(42)

# Data: 5 samples × 4 features
X = np.random.randn(5, 4)
y = np.array([[1], [0], [1], [0], [1]])  # shape (5,1)

# TODO: Initialize W1 (4×3), b1 (1×3), W2 (3×1), b2 (1×1)
# TODO: Forward pass: Z1 = X@W1+b1, A1 = relu(Z1), Z2 = A1@W2+b2, A2 = sigmoid(Z2)
# TODO: Compute BCE loss
# TODO: Print all intermediate shapes
```

Chapter Summary

Key Takeaways

Linear Algebra: Data lives in tensors. Matrix multiplication, which is built from row-by-column dot products, is the core computation in a dense neural network layer: z = Wx + b.
Shape tracking is non-negotiable: (m×n) × (n×p) = (m×p). Inner dimensions must match.
Transpose flips rows ↔ columns. The Hadamard product (⊙) multiplies element-wise. Matrix inverse exists only for square, non-singular matrices.
Calculus: The derivative measures sensitivity — how much output changes per unit input change. The chain rule decomposes derivatives through compositions of functions.
The sigmoid derivative σ'(z) = σ(z)(1−σ(z)) has a maximum of 0.25 — chaining many sigmoid layers causes vanishing gradients.
The gradient ∇L points uphill; gradient descent moves downhill: w ← w − α∇L.
Probability: Binary classification = Bernoulli distribution. MLE for Bernoulli gives p̂ = successes/trials.
Cross-entropy loss = negative log-likelihood — it emerges naturally from MLE, not from arbitrary choice.
Information Theory: Entropy measures uncertainty. KL divergence measures how wrong your model distribution is. Cross-entropy = Entropy + KL Divergence.
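Cross-entropy produces much stronger gradients than MSE for confidently wrong predictions: about 50× when a sigmoid output is ŷ = 0.01 for a true label y = 1, taking squared error as (ŷ − y)². That is why it trains faster for classification.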
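The linear algebra takeaways above (shape rule, transpose, Hadamard product, inverse) can be checked in a few lines of NumPy. The matrices here are arbitrary illustrations, and the batch is stored one sample per row, so the layer is written as X @ W.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shape rule: (m x n) @ (n x p) -> (m x p); inner dimensions must match.
X = rng.standard_normal((64, 128))   # 64 samples, 128 features
W = rng.standard_normal((128, 10))   # maps 128 features to 10 outputs
Z = X @ W
print(Z.shape)                       # (64, 10)
print(X.T.shape)                     # (128, 64): transpose swaps rows and columns

# Hadamard product: element-wise, so both operands share one shape.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
H = A * A
print(H)

# Inverse: defined only for square, non-singular matrices.
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))     # True

S = np.array([[1.0, 2.0],
              [2.0, 4.0]])                   # second row is twice the first
print(np.isclose(np.linalg.det(S), 0.0))     # True: S is singular
```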

Cheat Sheet: Formulas You Must Remember

Name	Formula	Where Used
Matrix Multiply	C_{ij} = Σ_k A_{ik}B_{kj}	Every forward pass
Sigmoid	σ(z) = 1/(1+e⁻ᶻ)	Output layer (binary)
Sigmoid Derivative	σ'(z) = σ(z)(1−σ(z))	Backpropagation
Chain Rule	dy/dx = (dy/du)(du/dx)	Backpropagation
Gradient Descent	w ← w − α∇L	All training
Bernoulli MLE	p̂ = Σyᵢ / n	Parameter estimation
Binary Cross-Entropy	−(1/n)Σ[y log ŷ + (1−y)log(1−ŷ)]	Classification loss
Entropy	H(p) = −Σ p(x) log p(x)	Measuring uncertainty
KL Divergence	D_KL(p‖q) = Σ p log(p/q)	Comparing distributions
Cross-Entropy	H(p,q) = H(p) + D_KL(p‖q)	Classification loss

References & Further Reading

Primary Textbooks

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 2 (Linear Algebra), Chapter 3 (Probability), Chapter 4 (Numerical Computation). MIT Press. deeplearningbook.org
Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 1 (Introduction: Probability). Springer.
Strang, G. (2019). Linear Algebra and Learning from Data. Wellesley-Cambridge Press.

Online Resources

3Blue1Brown — Essence of Linear Algebra (YouTube Series). Exceptional visual intuition for vectors, transformations, and eigenvalues.
3Blue1Brown — Essence of Calculus (YouTube Series). Derivatives and integrals explained visually.
Khan Academy — Multivariable Calculus. Gradients, partial derivatives, Jacobians.
Stanford CS229 Notes — Linear Algebra Review. Concise reference for ML-relevant linear algebra.
colah's blog — Visual Information Theory (2015). Beautiful explanation of entropy, cross-entropy, and KL divergence. colah.github.io

Indian Context

NPTEL — Mathematics for Machine Learning by IIT Madras. Free video lectures covering all topics in this chapter.
NPTEL — Deep Learning by Prof. Mitesh Khapra, IIT Madras. Mathematical foundations in Weeks 1-3.
UPI Transaction Statistics — NPCI. npci.org.in

What's Next?

In Chapter 3: The Perceptron & Neuron Model, we'll put this math to work. You'll see how a single neuron computes z = wᵀx + b (linear algebra), applies σ(z) (calculus), and learns by minimizing cross-entropy (probability + information theory). Every formula from this chapter will come alive.