Neural Networks & Deep Learning

Chapter 3: Logistic Regression as a Neural Network

Where a single neuron learns to make decisions — and you learn to train it

⏱️ Reading Time: ~3 hours | 📖 Unit 1: The Neuron Era | 🧪 Theory + Code

📋 Prerequisites: Chapter 0 (Course Overview) & Chapter 2 (Math Toolkit)

Bloom's Taxonomy Progression

Bloom's Level	What You'll Achieve
🔵 Remember	Recall the sigmoid function σ(z) = 1/(1+e^−z), its range (0,1), key values σ(0) = 0.5, and the BCE loss formula
🔵 Understand	Explain why binary cross-entropy is derived from Maximum Likelihood and why sigmoid makes the perceptron differentiable
🟢 Apply	Implement a complete LogisticRegression class from scratch in NumPy with forward, backward, and update methods
🟡 Analyze	Trace the full computation graph — forward pass and backward pass — computing every intermediate gradient by hand
🟠 Evaluate	Compare vectorized vs loop-based gradient descent, evaluate numerical stability issues, assess convergence behavior
🔴 Create	Design and train a loan-default predictor for an Indian banking dataset and plot the decision boundary

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

Explain why the step-function perceptron needs to be replaced with a smooth, differentiable activation for learning via gradient descent
Derive the sigmoid function σ(z) = 1/(1+e^−z) and prove all five key properties: range, midpoint, symmetry, limits, and derivative σ′ = σ(1−σ)
Recognize logistic regression as a single-neuron neural network and draw its computational architecture
Derive Binary Cross-Entropy from scratch: Bernoulli → Likelihood → Log-Likelihood → Negate → BCE
Compute the cost function J(w,b) = average BCE over m training examples
Derive gradient descent update rules ∂J/∂w and ∂J/∂b using the chain rule on the computation graph
Implement a complete LogisticRegression class from scratch with sigmoid, loss, forward, backward, update, fit, predict, and accuracy methods
Train the model on a synthetic Indian bank loan dataset, plot the loss curve and decision boundary
Compare vectorized (NumPy) vs loop-based implementations with timing benchmarks
Execute a full forward + backward pass by hand on a worked example with 2 features and 3 samples

Section 2

Opening Hook — The 12-Second Loan Decision

🏦 "Approved" or "Rejected" — in the time it takes to blink twice.

It's a monsoon Tuesday in Mumbai. Anjali, a 31-year-old chartered accountant, opens the SBI YONO app and applies for a ₹10,00,000 home loan top-up. Within 12 seconds, the app responds: "Congratulations! Your pre-approved offer is ready."

In those 12 seconds, no human banker reviewed Anjali's file. A machine learning model — at its mathematical core, a logistic regression — consumed her CIBIL score (782), monthly salary (₹1,45,000), existing EMI-to-income ratio (22%), years of credit history (9), number of active loans (2), and 40+ other features. It output a single number: 0.04 — the probability that Anjali will default.

Now imagine the same system at Google's ad servers in Mountain View, California. Every time you search "best laptop under $1000," Google runs a logistic regression (among other models) on each of 500 candidate ads and predicts: what is the probability that you will click this ad? That probability — the click-through rate (CTR) — determines which ads you see. Google runs this computation 8.5 billion times per day.

Both systems — SBI's loan engine and Google's ad ranker — solve the same mathematical problem: given input features x, estimate P(y=1|x). The tool they use is the simplest possible neural network: a single neuron with a sigmoid activation. That's logistic regression. In this chapter, you will build it from first principles.

🏧 SBI · 💳 CIBIL · 🔍 Google Ads · 📱 YONO · 🧮 From Scratch

The word "logistic" has nothing to do with "logistics" (shipping). It comes from Belgian mathematician Pierre François Verhulst, who in 1845 coined courbe logistique for S-shaped population growth curves. The same S-curve now predicts your loan approval probability. Verhulst never imagined his equation would process ₹40 lakh crore in Indian lending decisions.

Section 3

The Intuition First — From Step to Smooth

The Dimmer Switch Analogy

In Chapter 2 (your math toolkit), you saw that a perceptron uses a step function: output 1 if the weighted sum exceeds a threshold, output 0 otherwise. Think of it as a light switch — it's either ON or OFF, nothing in between.

But here's the problem: you can't learn from a switch. Imagine you're adjusting the temperature dial on your geyser. If the dial only had two positions — "freezing" and "boiling" — you'd never find a comfortable temperature. You need a smooth dial that lets you make tiny adjustments and see small changes in temperature. That smooth dial is the sigmoid function.

THE SWITCH vs THE DIMMER: Why Smoothness Matters

Step Function (Perceptron)	Sigmoid Function (Logistic Regression)
Output jumps from 0 to 1 at z = 0	Output rises smoothly from 0 to 1
❌ Derivative = 0 everywhere (undefined at z = 0)	✅ Derivative σ(z)(1−σ(z)) exists everywhere
❌ Can't compute gradients	✅ Gradient descent works
❌ No notion of "confidence"	✅ Output = probability

The "Aha!" Question

Ask yourself: Why can't gradient descent work with a step function? Because the derivative of a step function is zero everywhere (except at the discontinuity where it's undefined). If the derivative is zero, the gradient is zero, and the weight update Δw = −α·∂L/∂w = −α·0 = 0. The model never learns! You need a function with a non-zero derivative at every point. Enter the sigmoid.
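To see the zero-gradient problem numerically, here is a minimal sketch that estimates both derivatives with a central finite difference. The helper name `numerical_derivative` and the sample points are illustrative choices, not part of the chapter's code.

```python
import numpy as np

def step(z):
    """Perceptron-style step: 1 where z > 0, else 0."""
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_derivative(f, z, h=1e-4):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)

# Sample points chosen to avoid the discontinuity at z = 0
z = np.array([-3.0, -1.0, 0.5, 2.0])

step_grad = numerical_derivative(step, z)
sigmoid_grad = numerical_derivative(sigmoid, z)

print("step    :", step_grad)     # zero at every sample point: no learning signal
print("sigmoid :", sigmoid_grad)  # positive at every sample point
```

With a zero gradient the update Δw is zero no matter how wrong the prediction is, which is exactly the stall described above.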

Q: Why does logistic regression use sigmoid instead of a step function?

A: The step function has zero derivatives almost everywhere, making gradient-based optimization impossible. The sigmoid σ(z) = 1/(1+e^−z) is smooth, differentiable, maps ℝ→(0,1), and has the elegant derivative σ′ = σ(1−σ), enabling gradient descent learning.

Data Scientist / ML Engineer (India: ₹8–25 LPA | US: $90K–$160K): Logistic regression is the first model you'll build in every ML interview. At companies like Flipkart, Amazon India, Google, and Meta, interview loops begin with: "Implement logistic regression from scratch." Why? Because if you understand this single neuron deeply — its loss, its gradients, its computation graph — you understand the building blocks of every deep network.

Section 4

Mathematical Foundation — Derived from First Principles

We build the entire logistic regression framework in five rigorous steps. Every equation is derived, not memorized. If you're confused at any step, you're thinking correctly — stay with it.

4a. The Sigmoid Function σ(z) — From Linear to Probability

The Problem: Unbounded Outputs

A neuron computes z = w^Tx + b, where z ∈ (−∞, +∞). But for binary classification, you need a probability — a number in [0, 1]. You need a function f: ℝ → (0, 1) that is smooth, monotonically increasing, and differentiable.

Definition

σ(z) = 1 / (1 + e^−z)

Domain: z ∈ (−∞, +∞) → Range: σ(z) ∈ (0, 1)

Deriving ALL Five Properties of Sigmoid

Property 1: σ(0) = 0.5 (The Decision Boundary)

σ(0) = 1/(1 + e⁰) = 1/(1 + 1) = 1/2 = 0.5

Interpretation: When the weighted sum z = 0, the model is maximally uncertain — equal probability for both classes.

Property 2: Symmetry — σ(−z) = 1 − σ(z)

Start: σ(−z) = 1/(1 + e^z)

Multiply numerator and denominator by e^−z:

= e^−z/(e^−z + 1) = (1 + e^−z − 1)/(1 + e^−z) = 1 − 1/(1 + e^−z) = 1 − σ(z) ✓

This means the sigmoid is symmetric about the point (0, 0.5). If P(y=1|x) = σ(z), then P(y=0|x) = σ(−z). Beautiful.

Property 3: Asymptotic Limits

As z → +∞: e^−z → 0, so σ(z) → 1/(1+0) = 1

As z → −∞: e^−z → ∞, so σ(z) → 1/∞ = 0

The sigmoid never reaches 0 or 1 exactly — outputs are strictly in the open interval (0, 1).

Property 4: The Elegant Derivative — σ′(z) = σ(z)·(1 − σ(z))

Write σ(z) = (1 + e^−z)⁻¹. Apply the chain rule:

σ′(z) = −1 · (1 + e^−z)⁻² · d/dz(1 + e^−z)

= −(1 + e^−z)⁻² · (−e^−z)

= e^−z / (1 + e^−z)²

Now verify σ(z)·(1−σ(z)):

= [1/(1+e^−z)] · [e^−z/(1+e^−z)]

= e^−z / (1+e^−z)² ✓

σ′(z) = σ(z) · (1 − σ(z)) — The derivative is expressed entirely in terms of the function itself!
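A quick numerical sanity check of this identity, as a sketch: compare a finite-difference slope of σ against σ(z)(1−σ(z)) on a grid. The grid and step size `h` are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 25)
h = 1e-5

numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # finite-difference slope
analytic = sigmoid(z) * (1 - sigmoid(z))               # sigma(z) * (1 - sigma(z))

max_gap = np.max(np.abs(numeric - analytic))
print(f"largest disagreement: {max_gap:.2e}")
```

The two agree to within finite-difference error, which is what the algebra above predicts.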

Property 5: Maximum Derivative at z = 0

σ′(0) = σ(0)·(1−σ(0)) = 0.5 × 0.5 = 0.25

The sigmoid changes fastest at z = 0. For |z| ≫ 0, σ′ ≈ 0 — these are the saturation regions where gradients vanish. This is a preview of the "vanishing gradient problem" we'll fight in later chapters.

Sigmoid Value Table — Memorize These!

z	−6	−4	−2	−1	0	1	2	4	6
σ(z)	0.0025	0.018	0.119	0.269	0.500	0.731	0.881	0.982	0.9975
σ′(z)	0.0025	0.018	0.105	0.197	0.250	0.197	0.105	0.018	0.0025
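Rather than memorizing the table blindly, you can regenerate it. The following sketch recomputes both rows for the same z values; the formatting is an arbitrary choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6, -4, -2, -1, 0, 1, 2, 4, 6], dtype=float)
s = sigmoid(z)
ds = s * (1 - s)   # derivative via sigma * (1 - sigma)

for zi, si, dsi in zip(z, s, ds):
    print(f"z = {zi:+.0f}   sigma = {si:.4f}   sigma' = {dsi:.4f}")
```

Notice that the σ(z) row mirrors around 0.5 (symmetry) while the σ′(z) row mirrors around z = 0.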

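Numerical Stability: Avoid computing 1 / (1 + np.exp(-z)) naively. When z is a large negative number (say z = −1000), np.exp(-z) = np.exp(1000) overflows to inf and NumPy emits a RuntimeWarning. Rewriting it as np.where(z >= 0, 1/(1+np.exp(-z)), np.exp(z)/(1+np.exp(z))) returns the right values, but np.where evaluates both branches for every element, so the overflow warning still fires. A fully stable version exponentiates only non-positive numbers: with e = np.exp(-np.abs(z)), use np.where(z >= 0, 1/(1+e), e/(1+e)).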
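One way to make the stability claim checkable is to run the sigmoid with NumPy's overflow warnings promoted to errors. The sketch below assumes a variant that exponentiates only −|z|; the name `stable_sigmoid` is illustrative.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))   # at most 1, so it cannot overflow
    # z >= 0: 1 / (1 + e^-z);  z < 0: e^z / (1 + e^z)
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])

# Turn overflow and invalid-value warnings into exceptions for this check
with np.errstate(over="raise", invalid="raise"):
    out = stable_sigmoid(z)

print(out)
```

Because both branches of `np.where` are built from the same bounded quantity `e`, neither branch can overflow, even for the elements where it is not selected.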

Logistic Regression = Single-Neuron Neural Network

Now here's the key insight. A logistic regression model is exactly a neural network with:

One input layer with n_x features
One output neuron with sigmoid activation
No hidden layers

LOGISTIC REGRESSION = SINGLE-NEURON NEURAL NETWORK

Input Layer (n features)        Output Layer (1 neuron)

x₁ ──w₁──┐
x₂ ──w₂──┤
x₃ ──w₃──┼──→ [Σ + b] ──→ [σ] ──→ ŷ = σ(wᵀx + b)   (prediction)
 ⋮    ⋮   │    linear       sigmoid
xₙ ──wₙ──┘    combination  activation

z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b = wᵀx + b
ŷ = σ(z) = P(y = 1 | x)
Parameters: w ∈ ℝⁿ (weights), b ∈ ℝ (bias)
Output: ŷ ∈ (0, 1), the probability of the positive class

❌ MYTH: "Logistic regression is not a neural network — it's a classical ML algorithm."
✅ TRUTH: Logistic regression IS a neural network — with exactly 1 neuron, 0 hidden layers, and sigmoid activation. It's the simplest possible neural network.
🔍 WHY IT MATTERS: Understanding logistic regression deeply means you understand forward propagation, loss computation, backpropagation, and gradient descent — the exact four steps that every deep network uses. The only difference in a 100-layer network is that these steps are repeated through more layers.
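The whole "network" fits in a few lines. This is a minimal sketch of the single neuron's forward pass; the function name `neuron_forward` and the input values are hypothetical, chosen only for illustration.

```python
import numpy as np

def neuron_forward(x, w, b):
    """One neuron: linear combination followed by a sigmoid."""
    z = np.dot(w, x) + b               # z = w^T x + b
    return 1.0 / (1.0 + np.exp(-z))    # y_hat = sigma(z) = P(y = 1 | x)

# Hypothetical 3-feature input and parameters, for illustration only
x = np.array([0.2, -1.0, 0.5])
w = np.array([1.0, 0.5, -2.0])
b = 0.3

y_hat = neuron_forward(x, w, b)
print(f"P(y=1 | x) = {y_hat:.4f}")
```

Here z = 0.2 − 0.5 − 1.0 + 0.3 = −1.0, so the output is σ(−1) ≈ 0.269, matching the value table earlier in this section.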

4b. Binary Cross-Entropy — Derived from Maximum Likelihood

We have a model that outputs ŷ = σ(w^Tx + b) ∈ (0, 1). We need a loss function — a way to quantify "how wrong" the model is. We don't pull this loss out of thin air. We derive it from the principle of Maximum Likelihood Estimation (MLE).

Step 1: Model Each Label as a Bernoulli Random Variable

Since y ∈ {0, 1}, each label follows a Bernoulli distribution. Our model estimates P(y=1|x) = ŷ. We can write both cases in a single elegant formula:

P(y | x; w, b) = ŷ^y · (1 − ŷ)^(1−y)

Verify: If y = 1: P = ŷ¹ · (1−ŷ)⁰ = ŷ ✓ | If y = 0: P = ŷ⁰ · (1−ŷ)¹ = 1−ŷ ✓

Step 2: Likelihood Function for m Training Examples

Assuming training examples are independent and identically distributed (i.i.d.), the likelihood of observing all m labels given our parameters is the product:

L(w, b) = ∏_{i=1}^{m} [ŷ⁽ⁱ⁾]^{y⁽ⁱ⁾} · [1 − ŷ⁽ⁱ⁾]^{1−y⁽ⁱ⁾}

Step 3: Log-Likelihood (Convert Product → Sum)

Products of many small numbers underflow numerically. Log converts products to sums and is monotonic (maximizing log L is equivalent to maximizing L):

log L(w, b) = ∑_{i=1}^{m} [ y⁽ⁱ⁾ log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾) ]

Step 4: Negate and Average → Binary Cross-Entropy

MLE says: maximize log L. Gradient descent minimizes. So negate it. Divide by m for the average:

J(w, b) = −(1/m) ∑_{i=1}^{m} [ y⁽ⁱ⁾ log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾) ]

This is the Binary Cross-Entropy (BCE) Loss, also called the Log Loss. Every line was derived from two ingredients: the Bernoulli model for each label and the i.i.d. assumption on the training examples.
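The four steps can be traced numerically. The sketch below uses hypothetical predictions and labels for m = 4 examples and confirms that BCE is just the negated, averaged log of the likelihood product.

```python
import numpy as np

# Hypothetical predictions and labels for m = 4 examples
y_hat = np.array([0.9, 0.2, 0.7, 0.4])
y     = np.array([1.0, 0.0, 1.0, 0.0])
m = len(y)

# Steps 1 and 2: Bernoulli probability of each label, then the product
per_example = y_hat**y * (1 - y_hat)**(1 - y)
likelihood = np.prod(per_example)

# Step 3: log-likelihood written as a sum
log_likelihood = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Step 4: negate and average
bce = -log_likelihood / m

print(f"likelihood     = {likelihood:.4f}")
print(f"log-likelihood = {log_likelihood:.4f}")
print(f"BCE            = {bce:.4f}")
```

The per-example probabilities are 0.9, 0.8, 0.7, 0.6, so the likelihood is their product, 0.3024, and the sum form gives the same number after taking the log.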

Why BCE Works: Intuitive Analysis

🔍 Loss Behavior — How BCE Penalizes Errors

When y = 1 (True label is positive)

Loss per sample = −log(ŷ)

If ŷ → 1 (correct, confident): −log(1) = 0 ← zero loss, perfect!
If ŷ → 0 (wrong, confident): −log(0) → +∞ ← infinite penalty for being confidently wrong!

When y = 0 (True label is negative)

Loss per sample = −log(1 − ŷ)

If ŷ → 0 (correct, confident): −log(1) = 0 ← zero loss, perfect!
If ŷ → 1 (wrong, confident): −log(0) → +∞ ← infinite penalty!

Key Insight

BCE penalizes confident wrong predictions far more heavily than uncertain ones, and the penalty grows without bound as a wrong prediction approaches certainty. If your model says "I'm 99% sure this person will repay" and they default, the model gave the true outcome a probability of 0.01, so the penalty is −log(0.01) ≈ 4.61. But if the model says "I'm 51% sure," the penalty is only −log(0.49) ≈ 0.71. This steep penalty on overconfidence pushes the model toward calibrated probabilities.
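The two penalties quoted above can be reproduced directly. In this sketch the label encoding (y = 1 means "repays") and the helper name `bce_single` are assumptions made for the illustration.

```python
import math

def bce_single(y, y_hat):
    """BCE for one example; y is 0 or 1, y_hat is the predicted P(y = 1)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Assumed encoding: y = 1 means "repays". The borrower defaulted, so y = 0.
confident_wrong = bce_single(0, 0.99)   # model said 99% repay
hesitant_wrong  = bce_single(0, 0.51)   # model said 51% repay

print(f"confident and wrong: {confident_wrong:.3f}")
print(f"hesitant and wrong : {hesitant_wrong:.3f}")
```

Moving from 51% to 99% confidence in the wrong answer multiplies the loss by more than six.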

❌ MYTH: "Why not use Mean Squared Error (MSE) for classification?"
✅ TRUTH: MSE composed with sigmoid gives a loss that is non-convex in the parameters. It has flat regions where a confidently wrong prediction produces almost no gradient, and it can have local minima. BCE with sigmoid is convex in (w, b), so there are no spurious local minima, and gradient descent with a suitably small learning rate heads toward the global optimum.
🔍 WHY IT MATTERS: This is a GATE favorite. The convexity of BCE is directly derived from the negative log-likelihood of the Bernoulli distribution.
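A rough way to see the difference is to look at curvature along z for a single example with y = 1. The sketch below estimates the second derivative of each loss by finite differences; a negative value anywhere means the loss is not convex in z. Since z is affine in (w, b), convexity in z carries over to the parameters for BCE. The grid and step size are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def second_derivative(f, z, h=1e-3):
    """Central finite-difference estimate of f''(z)."""
    return (f(z + h) - 2 * f(z) + f(z - h)) / h**2

z = np.linspace(-6, 6, 241)
y = 1.0  # true label for this single example

def bce(z):
    return -(y * np.log(sigmoid(z)) + (1 - y) * np.log(1 - sigmoid(z)))

def mse(z):
    return (sigmoid(z) - y) ** 2

bce_curv = second_derivative(bce, z)
mse_curv = second_derivative(mse, z)

print(f"min BCE curvature: {bce_curv.min():+.4f}")  # stays positive
print(f"min MSE curvature: {mse_curv.min():+.4f}")  # dips below zero
```

For BCE the curvature equals σ(z)(1−σ(z)), which is always positive. For MSE it turns negative once the prediction is badly wrong, which is the flat, misleading region described in the myth above.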

4c. The Cost Function J(w, b)

For a single training example (x, y), the loss is:

ℒ(ŷ, y) = −[ y·log(ŷ) + (1−y)·log(1−ŷ) ]

The cost function is the average loss over all m training examples:

J(w, b) = (1/m) ∑_{i=1}^{m} ℒ(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) = −(1/m) ∑_{i=1}^{m} [ y⁽ⁱ⁾ log(ŷ⁽ⁱ⁾) + (1−y⁽ⁱ⁾) log(1−ŷ⁽ⁱ⁾) ]

Where ŷ⁽ⁱ⁾ = σ(w^Tx⁽ⁱ⁾ + b) for each sample i. The goal of training: find w* and b* that minimize J(w, b).

4d. Gradient Computation — ∂J/∂w and ∂J/∂b via Chain Rule

To minimize J, we need its gradients with respect to w and b. We'll derive these using the chain rule, one link at a time.

Setup: The Computational Chain

For a single sample (drop the superscript for clarity):

z = w^Tx + b → ŷ = σ(z) → ℒ = −[y·log(ŷ) + (1−y)·log(1−ŷ)]

We need ∂ℒ/∂w and ∂ℒ/∂b. By the chain rule:

∂ℒ/∂w = (∂ℒ/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w)

Link 1: ∂ℒ/∂ŷ

ℒ = −y·log(ŷ) − (1−y)·log(1−ŷ)

∂ℒ/∂ŷ = −y/ŷ − (1−y)·(−1)/(1−ŷ) = −y/ŷ + (1−y)/(1−ŷ)

Link 2: ∂ŷ/∂z = σ′(z) = σ(z)(1−σ(z)) = ŷ(1−ŷ)

Link 3: ∂z/∂w = x and ∂z/∂b = 1

Combining: ∂ℒ/∂z (the key intermediate)

∂ℒ/∂z = (∂ℒ/∂ŷ) · (∂ŷ/∂z)

= [−y/ŷ + (1−y)/(1−ŷ)] · ŷ(1−ŷ)

= −y(1−ŷ) + (1−y)ŷ

= −y + yŷ + ŷ − yŷ

= ŷ − y

∂ℒ/∂z = ŷ − y — Stunningly simple! The gradient is just the prediction error.

Final Gradients for a Single Sample

∂ℒ/∂w = (ŷ − y) · x

∂ℒ/∂b = (ŷ − y) · 1 = ŷ − y

Gradients for the Full Cost (m samples)

∂J/∂w = (1/m) ∑_{i=1}^{m} (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾) · x⁽ⁱ⁾

∂J/∂b = (1/m) ∑_{i=1}^{m} (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)
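A standard way to trust a hand-derived gradient is a gradient check: compare the formulas above against finite differences of J. The sketch below does this on random data; the seed, sizes, and helper names (`cost`, `analytic_grads`) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """Average BCE; X is (n, m), Y is (m,)."""
    y_hat = sigmoid(w @ X + b)
    return -np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))

def analytic_grads(w, b, X, Y):
    dz = sigmoid(w @ X + b) - Y            # y_hat - y for every sample
    return X @ dz / X.shape[1], dz.mean()  # dJ/dw, dJ/db

rng = np.random.default_rng(0)             # arbitrary seed and sizes
X = rng.normal(size=(3, 20))
Y = rng.integers(0, 2, size=20).astype(float)
w, b = rng.normal(size=3), 0.1

dw, db = analytic_grads(w, b, X, Y)

# Finite-difference estimate of each partial derivative
h = 1e-6
dw_num = np.array([
    (cost(w + h * e, b, X, Y) - cost(w - h * e, b, X, Y)) / (2 * h)
    for e in np.eye(3)
])
db_num = (cost(w, b + h, X, Y) - cost(w, b - h, X, Y)) / (2 * h)

print("max |dw - dw_num| =", np.max(np.abs(dw - dw_num)))
print("    |db - db_num| =", abs(db - db_num))
```

If the analytic and numerical gradients disagree by more than finite-difference noise, the derivation or the code has a bug. The same check works for any model you differentiate by hand.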

Gradient Descent Update Rules

With the gradients computed, we update parameters in the direction that decreases J:

w := w − α · ∂J/∂w
b := b − α · ∂J/∂b

where α > 0 is the learning rate

Repeat for T iterations (epochs). Each iteration, the loss decreases (if α is small enough), and the model gets better at predicting.
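As a rough illustration of the update rules in action, here is a sketch of T gradient descent iterations on a tiny one-feature dataset. The data, α, and T are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-feature dataset with overlapping classes
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

w, b = 0.0, 0.0
alpha, T = 0.5, 100
losses = []

for _ in range(T):
    y_hat = sigmoid(w * x + b)
    losses.append(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))
    dz = y_hat - y
    w = w - alpha * np.mean(dz * x)   # w := w - alpha * dJ/dw
    b = b - alpha * np.mean(dz)       # b := b - alpha * dJ/db

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}, w = {w:.3f}, b = {b:.3f}")
```

Starting from w = b = 0 every prediction is 0.5, so the first loss is log 2 ≈ 0.693, and with this step size it falls on every iteration.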

Q: For logistic regression with BCE loss, what is ∂ℒ/∂z for a single sample?

A: ∂ℒ/∂z = ŷ − y, where ŷ = σ(z) and z = w^Tx + b. This elegant simplification happens because the sigmoid derivative σ′ = σ(1−σ) cancels perfectly with terms from the BCE derivative. This is NOT a coincidence — it's a consequence of the Bernoulli MLE formulation.

4e. The Computation Graph — Forward and Backward Pass

A computation graph makes the chain rule visual. Each node is an operation. Forward pass flows left→right, computing outputs. Backward pass flows right→left, computing gradients.

COMPUTATION GRAPH FOR LOGISTIC REGRESSION (single sample)

FORWARD PASS (left → right):

x ─┐
w ─┼──→ [z = wᵀx + b] ──→ [ŷ = σ(z)] ──→ [ℒ = −y·log(ŷ) − (1−y)·log(1−ŷ)]
b ─┘

BACKWARD PASS (right → left):

∂ℒ/∂w ←── ∂ℒ/∂z · x ←── ∂ℒ/∂ŷ · ∂ŷ/∂z ←── ∂ℒ/∂ℒ = 1

dŷ = −y/ŷ + (1−y)/(1−ŷ)
dσ = ŷ(1−ŷ)
dz = dŷ · dσ = ŷ − y
dw = dz · x
db = dz

KEY: Forward pass computes ℒ from inputs. Backward pass computes gradients from ℒ back to parameters. This IS backpropagation for 1 neuron.

"Automatic Differentiation in Machine Learning: a Survey" (Baydin et al., 2018) — This comprehensive survey traces how computation graphs and automatic differentiation evolved from the 1960s (Wengert's "simple automatic derivative evaluation program") to modern deep learning frameworks. The logistic regression computation graph you just learned is the seed from which PyTorch's autograd and TensorFlow's GradientTape grew. arXiv: 1502.05767

Section 5

Worked Examples — By Hand, Then by India, Then by World

Example 1: Full Forward + Backward Pass By Hand (2 features, 3 samples)

Given Data (m = 3 samples, n = 2 features)

x⁽¹⁾ = [1, 2]^T, y⁽¹⁾ = 1 (approved)

x⁽²⁾ = [−1, −1]^T, y⁽²⁾ = 0 (rejected)

x⁽³⁾ = [2, 0]^T, y⁽³⁾ = 1 (approved)

Initial parameters: w = [0.5, −0.5]^T, b = 0, learning rate α = 0.1

FORWARD PASS — Compute z, ŷ, ℒ for each sample

Sample 1: z⁽¹⁾ = 0.5·1 + (−0.5)·2 + 0 = 0.5 − 1.0 = −0.5

ŷ⁽¹⁾ = σ(−0.5) = 1/(1 + e^0.5) = 1/(1 + 1.6487) = 1/2.6487 ≈ 0.3775

ℒ⁽¹⁾ = −[1·log(0.3775) + 0·log(0.6225)] = −log(0.3775) = −(−0.9740) ≈ 0.9740

Sample 2: z⁽²⁾ = 0.5·(−1) + (−0.5)·(−1) + 0 = −0.5 + 0.5 = 0

ŷ⁽²⁾ = σ(0) = 0.5

ℒ⁽²⁾ = −[0·log(0.5) + 1·log(0.5)] = −log(0.5) = 0.6931

Sample 3: z⁽³⁾ = 0.5·2 + (−0.5)·0 + 0 = 1.0

ŷ⁽³⁾ = σ(1) = 1/(1 + e⁻¹) = 1/1.3679 ≈ 0.7311

ℒ⁽³⁾ = −[1·log(0.7311) + 0·log(0.2689)] = −(−0.3133) ≈ 0.3133

J = (1/3)(0.9740 + 0.6931 + 0.3133) = (1/3)(1.9804) ≈ 0.6601

BACKWARD PASS — Compute dz, dw, db

dz⁽ⁱ⁾ = ŷ⁽ⁱ⁾ − y⁽ⁱ⁾:

dz⁽¹⁾ = 0.3775 − 1 = −0.6225

dz⁽²⁾ = 0.5 − 0 = +0.5

dz⁽³⁾ = 0.7311 − 1 = −0.2689

∂J/∂w₁ = (1/3) ∑ dz⁽ⁱ⁾ · x₁⁽ⁱ⁾

= (1/3)[(−0.6225)(1) + (0.5)(−1) + (−0.2689)(2)]

= (1/3)[−0.6225 − 0.5 − 0.5378] = (1/3)(−1.6603) ≈ −0.5534

∂J/∂w₂ = (1/3) ∑ dz⁽ⁱ⁾ · x₂⁽ⁱ⁾

= (1/3)[(−0.6225)(2) + (0.5)(−1) + (−0.2689)(0)]

= (1/3)[−1.2450 − 0.5 + 0] = (1/3)(−1.7450) ≈ −0.5817

∂J/∂b = (1/3) ∑ dz⁽ⁱ⁾

= (1/3)[−0.6225 + 0.5 + (−0.2689)] = (1/3)(−0.3914) ≈ −0.1305

GRADIENT DESCENT UPDATE (α = 0.1)

w₁ := 0.5 − 0.1·(−0.5534) = 0.5 + 0.05534 ≈ 0.5553

w₂ := −0.5 − 0.1·(−0.5817) = −0.5 + 0.05817 ≈ −0.4418

b := 0 − 0.1·(−0.1305) ≈ 0.0130 (using the unrounded gradient −0.13047)

After 1 GD step: w = [0.5553, −0.4418]^T, b = 0.0130
Both weights moved to better separate the classes! w₁ increased (feature 1 correlates with positive class), w₂ became less negative (adjusting its influence).
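The hand calculation can be cross-checked in a few lines of NumPy. This sketch uses the same three samples, initial parameters, and learning rate as the example; the variable names are arbitrary.

```python
import numpy as np

X = np.array([[1.0, -1.0, 2.0],    # feature 1 for samples 1..3
              [2.0, -1.0, 0.0]])   # feature 2 for samples 1..3
Y = np.array([1.0, 0.0, 1.0])
w, b, alpha = np.array([0.5, -0.5]), 0.0, 0.1

# Forward pass
z = w @ X + b
y_hat = 1.0 / (1.0 + np.exp(-z))
J = -np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))

# Backward pass
dz = y_hat - Y
dw = X @ dz / 3
db = dz.mean()

# One gradient descent step
w_new = w - alpha * dw
b_new = b - alpha * db

print("z     =", z.round(4))
print("y_hat =", y_hat.round(4))
print("J     =", round(J, 4))
print("dw    =", dw.round(4), " db =", round(db, 4))
print("w_new =", w_new.round(4), " b_new =", round(b_new, 4))
```

Small differences in the last digit against the hand calculation come from rounding intermediate values to four decimals.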

Example 2: 🇮🇳 SBI CIBIL Score Loan Approval

🏦 Case Study — Loan Default Prediction at State Bank of India

Business Context

SBI processes ~1.5 lakh personal loan applications per month. The first-pass model is a logistic regression that predicts P(default) using features from CIBIL (Credit Information Bureau India Limited) and internal banking data.

Feature Engineering (simplified to 5 key features)

Feature	Variable	Example Values
CIBIL Score (normalized)	x₁	0.782 (original: 782/1000)
Monthly Income (log-scaled, lakhs)	x₂	log(1.45) = 0.372
EMI-to-Income Ratio	x₃	0.22 (22%)
Years of Credit History	x₄	0.9 (9 years / 10 max)
Number of Active Loans (normalized)	x₅	0.4 (2 loans / 5 max)

Model Computation for Anjali's Application

Trained weights: w = [−3.2, −1.5, +2.8, −0.9, +1.4], b = 1.2

Note: negative weights for CIBIL score, income, credit history — higher values reduce default risk. Positive weights for EMI ratio and number of loans — higher values increase default risk.

z = (−3.2)(0.782) + (−1.5)(0.372) + (2.8)(0.22) + (−0.9)(0.9) + (1.4)(0.4) + 1.2

= −2.502 − 0.558 + 0.616 − 0.810 + 0.560 + 1.200

= −1.494

ŷ = σ(−1.494) = 1/(1 + e^1.494) = 1/(1 + 4.456) ≈ 0.183

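Interpretation: P(default) = 18.3%. In this illustration, the threshold is 15% for auto-approval and 40% for auto-rejection. Anjali falls in the 15%–40% band → sent to a human underwriter for review. But given her strong CIBIL score and income, the underwriter approves. (This five-feature model is a simplification; the 0.04 in the opening story came from a model with 40+ features.)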
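The scoring and routing logic above can be sketched as follows. The weights, features, and thresholds are the ones quoted in this example; the `route` helper and its parameter names are hypothetical.

```python
import numpy as np

x = np.array([0.782, 0.372, 0.22, 0.9, 0.4])   # Anjali's five features
w = np.array([-3.2, -1.5, 2.8, -0.9, 1.4])     # weights from the example
b = 1.2

z = w @ x + b
p_default = 1.0 / (1.0 + np.exp(-z))

def route(p, approve_below=0.15, reject_above=0.40):
    """Three-way routing using the thresholds quoted in the example."""
    if p < approve_below:
        return "auto-approve"
    if p > reject_above:
        return "auto-reject"
    return "human review"

decision = route(p_default)
print(f"z = {z:.3f}, P(default) = {p_default:.3f}, decision = {decision}")
```

Using two thresholds instead of one is a design choice: the model decides only the clear cases and hands the ambiguous middle band to a person.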

Weight Interpretability — The Regulator's Favorite

RBI requires explainable credit models. With logistic regression, each weight has a clear meaning:

w₁ = −3.2 for CIBIL → A 0.1 increase in normalized CIBIL (i.e., 100 points) decreases z by 0.32, reducing default probability
w₃ = +2.8 for EMI ratio → Higher EMI burden increases default risk
The odds ratio e^{w_j} gives the multiplicative change in odds per unit change in x_j
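To make the odds-ratio reading concrete, the sketch below converts two of the example weights into multiplicative changes in the odds of default. A 0.1 step in a normalized feature scales the odds by e^(0.1·w). The variable names are illustrative.

```python
import math

w_cibil, w_emi = -3.2, 2.8   # weights from the example

# Odds multiplier for a 0.1 increase in normalized CIBIL (100 points)
odds_ratio_cibil_100 = math.exp(0.1 * w_cibil)
# Odds multiplier for a 0.1 increase in EMI-to-income ratio (10 percentage points)
odds_ratio_emi_10pts = math.exp(0.1 * w_emi)

print(f"+100 CIBIL points : odds of default x {odds_ratio_cibil_100:.3f}")
print(f"+10 pts EMI ratio : odds of default x {odds_ratio_emi_10pts:.3f}")
```

So under these illustrative weights, 100 extra CIBIL points cut the odds of default by roughly 27%, while 10 extra points of EMI burden raise them by roughly 32%. Note that this is a change in odds, not in probability.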

Example 3: 🌍 Google Ads Click-Through Rate (CTR) Prediction

🔍 Case Study — CTR Prediction at Google (Mountain View, CA)

Business Context

Google Search Ads generates $175 billion annually (2024). Every time a user types a query, ~500 candidate ads are scored. The core ranking uses a logistic regression variant (FTRL-Proximal) to predict P(click | user, query, ad). This runs 8.5 billion times per day.

Feature Engineering (simplified)

Feature	Variable	Example
Query-Ad relevance score	x₁	0.85 (cosine similarity of embeddings)
Ad historical CTR	x₂	0.032 (3.2% past click rate)
User engagement score	x₃	0.67 (based on past sessions)
Position bias (1/position)	x₄	0.333 (position 3)
Time-of-day feature	x₅	0.75 (3 PM, peak shopping)

Model Computation

Trained weights: w = [2.1, 5.8, 1.3, 3.5, 0.6], b = −4.2

z = (2.1)(0.85) + (5.8)(0.032) + (1.3)(0.67) + (3.5)(0.333) + (0.6)(0.75) + (−4.2)

= 1.785 + 0.186 + 0.871 + 1.166 + 0.450 − 4.200

= 0.257

ŷ = σ(0.257) = 1/(1 + e^−0.257) ≈ 0.564

Wait — a 56.4% CTR? That seems high, but this is a simplified example. In practice, Google uses feature crossing, hashing tricks, and the FTRL optimizer with L1 regularization on billions of sparse features. Real CTRs are typically 1–5%.

Revenue Impact: Expected CPM

Google's ad auction uses a "second-price" mechanism. The ad's rank = bid × P(click). Because every ranking depends on P(click), small gains matter: if a better P(click) estimate lifted revenue by just 0.1%, that would be about $175 million a year on a $175 billion base.
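The rank = bid × P(click) rule can be sketched on a toy auction. Only the first ad's features and the weights come from the example above; the other two ads and all three bids are made-up values, and this is not a description of Google's production system.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.1, 5.8, 1.3, 3.5, 0.6])   # weights from the example
b = -4.2

# Rows are ads. Row 0 is the ad from the example; rows 1 and 2 are hypothetical.
features = np.array([
    [0.85, 0.032, 0.67, 0.333, 0.75],
    [0.40, 0.010, 0.67, 0.333, 0.75],
    [0.95, 0.050, 0.67, 0.333, 0.75],
])
bids = np.array([1.00, 3.00, 0.50])       # hypothetical bids per click

p_click = sigmoid(features @ w + b)
rank_score = bids * p_click               # ad rank = bid * P(click)
order = np.argsort(-rank_score)           # best ad first

for i in order:
    print(f"ad {i}: P(click) = {p_click[i]:.3f}, bid = {bids[i]:.2f}, rank = {rank_score[i]:.3f}")
```

The ad with the highest click probability does not win here: a less relevant ad with a larger bid outranks it, which is why an accurate P(click) matters for both relevance and revenue.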

🇮🇳 SBI LOAN APPROVAL

Scale: ~1.5L applications/month
Features: CIBIL, income, EMI ratio
Output: P(default) ∈ (0,1)
Constraint: RBI-mandated explainability
Loss if wrong: NPA (Non-Performing Asset)
Model type: Classic logistic regression

🇺🇸 GOOGLE ADS CTR

Scale: 8.5B predictions/day
Features: Query-ad similarity, user history
Output: P(click) ∈ (0,1)
Constraint: <5ms latency per prediction
Loss if wrong: Lost ad revenue
Model type: FTRL-Proximal (LR variant)

Section 6

Python Implementation — From Scratch + Library

6a. From-Scratch NumPy Implementation

Python
import numpy as np
import matplotlib.pyplot as plt
import time

class LogisticRegression:
    """
    Logistic Regression from scratch — a single-neuron neural network.
    Implements: sigmoid, BCE loss, forward, backward, update, fit, predict.
    """

    def __init__(self, n_features, learning_rate=0.01):
        # Initialize weights and bias to zero (safe for a single neuron: there is no symmetry to break)
        self.w = np.zeros((n_features, 1))   # shape (n, 1)
        self.b = 0.0
        self.lr = learning_rate
        self.losses = []

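    def sigmoid(self, z):
        """Numerically stable sigmoid."""
        # np.where evaluates both branches, so exponentiate only -|z|,
        # which is never positive and therefore cannot overflow.
        e = np.exp(-np.abs(z))
        return np.where(z >= 0, 1 / (1 + e), e / (1 + e))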

    def forward(self, X):
        """
        Forward pass: X → z → ŷ
        X: shape (n, m) — each column is a sample
        Returns: ŷ of shape (1, m)
        """
        self.z = np.dot(self.w.T, X) + self.b   # (1, m)
        self.y_hat = self.sigmoid(self.z)         # (1, m)
        return self.y_hat

    def compute_loss(self, Y):
        """
        Binary Cross-Entropy: J = -(1/m) Σ [y·log(ŷ) + (1-y)·log(1-ŷ)]
        Y: shape (1, m)
        """
        m = Y.shape[1]
        epsilon = 1e-8  # prevent log(0)
        cost = -(1/m) * np.sum(
            Y * np.log(self.y_hat + epsilon) +
            (1 - Y) * np.log(1 - self.y_hat + epsilon)
        )
        return np.squeeze(cost)

    def backward(self, X, Y):
        """
        Backward pass: compute gradients ∂J/∂w and ∂J/∂b
        The beautiful result: dz = ŷ - y
        """
        m = Y.shape[1]
        dz = self.y_hat - Y                        # (1, m)
        self.dw = (1/m) * np.dot(X, dz.T)           # (n, 1)
        self.db = (1/m) * np.sum(dz)                # scalar

    def update(self):
        """Gradient descent step: w := w - α·dw, b := b - α·db"""
        self.w -= self.lr * self.dw
        self.b -= self.lr * self.db

    def fit(self, X, Y, epochs=1000, print_every=100):
        """
        Full training loop.
        X: (n, m), Y: (1, m)
        """
        for i in range(epochs):
            # Forward
            self.forward(X)
            # Loss
            loss = self.compute_loss(Y)
            self.losses.append(loss)
            # Backward
            self.backward(X, Y)
            # Update
            self.update()
            # Print
            if i % print_every == 0:
                print(f"Epoch {i:4d} | Loss: {loss:.6f}")

    def predict(self, X, threshold=0.5):
        """Return binary predictions."""
        y_hat = self.forward(X)
        return (y_hat >= threshold).astype(int)

    def accuracy(self, X, Y):
        """Compute classification accuracy."""
        preds = self.predict(X)
        return np.mean(preds == Y) * 100

6b. Synthetic Indian Bank Loan Dataset + Training

Python
# ── Generate Synthetic SBI Loan Dataset ──
np.random.seed(42)
m = 500  # 500 loan applications

# Feature 1: CIBIL score (normalized to 0-1)
cibil = np.random.beta(5, 2, m)  # skewed toward higher scores
# Feature 2: EMI-to-income ratio (0-0.8)
emi_ratio = np.random.beta(2, 5, m) * 0.8

# Labels: P(default) is LOW when CIBIL is HIGH and EMI is LOW
z_true = -4.0 * cibil + 5.0 * emi_ratio + 0.5
prob_default = 1 / (1 + np.exp(-z_true))
y = (np.random.rand(m) < prob_default).astype(float)

# Stack into (n, m) format
X = np.vstack([cibil, emi_ratio])   # shape (2, 500)
Y = y.reshape(1, -1)                # shape (1, 500)

print(f"Dataset: {X.shape[1]} samples, {X.shape[0]} features")
print(f"Default rate: {Y.mean()*100:.1f}%")

# ── Train the model ──
model = LogisticRegression(n_features=2, learning_rate=1.0)
model.fit(X, Y, epochs=2000, print_every=200)

print(f"\nFinal Loss: {model.losses[-1]:.6f}")
print(f"Training Accuracy: {model.accuracy(X, Y):.2f}%")
print(f"Learned weights: w1={model.w[0,0]:.4f}, w2={model.w[1,0]:.4f}, b={model.b:.4f}")

6c. Plot Loss Curve + Decision Boundary

Python
# ── Plot 1: Loss Curve ──
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(model.losses, color='#7c3aed', linewidth=2)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('BCE Loss', fontsize=12)
ax1.set_title('Training Loss Curve', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# ── Plot 2: Decision Boundary ──
# Create mesh grid
x1_range = np.linspace(0, 1, 200)
x2_range = np.linspace(0, 0.8, 200)
X1, X2 = np.meshgrid(x1_range, x2_range)
X_mesh = np.vstack([X1.ravel(), X2.ravel()])

# Predict on mesh
Z_mesh = model.forward(X_mesh).reshape(X1.shape)

# Plot
cf = ax2.contourf(X1, X2, Z_mesh, levels=50, cmap='RdYlGn_r', alpha=0.8)
ax2.contour(X1, X2, Z_mesh, levels=[0.5], colors='black', linewidths=2)
ax2.scatter(X[0], X[1], c=Y.ravel(), cmap='RdYlGn_r',
            edgecolors='black', s=20, alpha=0.6)
ax2.set_xlabel('CIBIL Score (normalized)', fontsize=12)
ax2.set_ylabel('EMI-to-Income Ratio', fontsize=12)
ax2.set_title('Decision Boundary: SBI Loan Default', fontsize=14, fontweight='bold')
# Attach the colorbar to the filled contours, which show P(default);
# the scatter colors are the 0/1 labels, not probabilities.
plt.colorbar(cf, ax=ax2, label='P(Default)')

plt.tight_layout()
plt.savefig('ch03_logistic_regression_plots.png', dpi=150)
plt.show()

6d. Vectorized vs Non-Vectorized — Timing Comparison

Python
# ── Non-vectorized (loop-based) gradient computation ──
def compute_gradients_loop(X, Y, w, b):
    """Compute gradients using explicit Python loops — SLOW!"""
    n, m = X.shape
    dw = np.zeros((n, 1))
    db = 0.0

    for i in range(m):
        x_i = X[:, i].reshape(-1, 1)    # (n, 1)
        z_i = (w.T @ x_i).item() + b     # scalar (.item() unwraps the 1x1 array)
        y_hat_i = 1 / (1 + np.exp(-z_i)) # scalar
        dz_i = y_hat_i - Y[0, i]         # scalar
        dw += x_i * dz_i
        db += dz_i

    dw /= m
    db /= m
    return dw, db

# ── Vectorized gradient computation ──
def compute_gradients_vectorized(X, Y, w, b):
    """Compute gradients using NumPy broadcasting — FAST!"""
    m = X.shape[1]
    z = np.dot(w.T, X) + b                # (1, m)
    y_hat = 1 / (1 + np.exp(-z))          # (1, m)
    dz = y_hat - Y                         # (1, m)
    dw = (1/m) * np.dot(X, dz.T)            # (n, 1)
    db = (1/m) * np.sum(dz)                 # scalar
    return dw, db

# ── Timing Benchmark ──
m_test = 50000
n_test = 100
X_test = np.random.randn(n_test, m_test)
Y_test = np.random.randint(0, 2, (1, m_test)).astype(float)
w_test = np.random.randn(n_test, 1) * 0.01
b_test = 0.0

# Time the loop version (perf_counter is the clock intended for measuring intervals)
t1 = time.perf_counter()
dw_loop, db_loop = compute_gradients_loop(X_test, Y_test, w_test, b_test)
t_loop = time.perf_counter() - t1

# Time the vectorized version
t2 = time.perf_counter()
dw_vec, db_vec = compute_gradients_vectorized(X_test, Y_test, w_test, b_test)
t_vec = time.perf_counter() - t2

print(f"Loop-based:  {t_loop*1000:.1f} ms")
print(f"Vectorized:  {t_vec*1000:.1f} ms")
print(f"Speedup:     {t_loop/t_vec:.0f}x faster!")
print(f"Results match: {np.allclose(dw_loop, dw_vec)}")

Loop-based:  4823.7 ms
Vectorized:  3.2 ms
Speedup:     1507x faster!
Results match: True

A speedup of this magnitude is real, though the exact figure depends on your hardware and NumPy build. NumPy operations call BLAS (Basic Linear Algebra Subprograms), optimized C/Fortran routines that exploit CPU SIMD instructions and the cache hierarchy. Python loops pay interpreter overhead on every iteration. In production ML, vectorize wherever you can: a for-loop over samples in training code is almost always a performance bug.

6e. PyTorch Library Version (for comparison)

Python
import torch
import torch.nn as nn

# Convert data to PyTorch tensors
X_pt = torch.tensor(X.T, dtype=torch.float32)  # (m, n)
Y_pt = torch.tensor(Y.T, dtype=torch.float32)  # (m, 1)

# Define the model: logistic regression = 1 Linear layer + Sigmoid
model_pt = nn.Sequential(
    nn.Linear(2, 1),   # z = wᵀx + b
    nn.Sigmoid()          # ŷ = σ(z)
)

# BCE Loss + SGD optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model_pt.parameters(), lr=1.0)

# Training loop
for epoch in range(2000):
    y_pred = model_pt(X_pt)
    loss = criterion(y_pred, Y_pt)

    optimizer.zero_grad()
    loss.backward()       # autograd computes gradients automatically!
    optimizer.step()

    if epoch % 500 == 0:
        print(f"Epoch {epoch:4d} | Loss: {loss.item():.6f}")

# Compare: PyTorch did the same forward-backward-update loop,
# but computed gradients automatically via loss.backward().
# Our from-scratch version computed them manually — same math!
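PyTorch also provides `nn.BCEWithLogitsLoss`, which fuses the sigmoid and the BCE into one numerically safer operation that takes z directly. The NumPy sketch below shows the idea behind that fusion using the identity ℒ = max(z, 0) − z·y + log(1 + e^(−|z|)). The function names are illustrative, and this is not PyTorch's actual implementation.

```python
import numpy as np

def bce_from_logits(z, y):
    """Mean BCE computed directly from logits z, without forming log(sigmoid(z))."""
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

def bce_naive(z, y):
    """Mean BCE via an explicit sigmoid, as in the from-scratch class."""
    y_hat = 1.0 / (1.0 + np.exp(-z))
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z = np.array([-3.0, -0.5, 0.0, 2.0])
y = np.array([0.0, 1.0, 0.0, 1.0])
fused, naive = bce_from_logits(z, y), bce_naive(z, y)
print(fused, naive)   # the two agree on moderate logits

# With an extreme logit the naive form would need log(0); the fused form stays finite
extreme = bce_from_logits(np.array([800.0]), np.array([0.0]))
print(extreme)
```

This removes the need for the epsilon inside the log: the loss is written so that no intermediate value ever reaches exactly 0 or 1.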

Find the bug in this logistic regression code:

def train_step(X, Y, w, b, lr):
    m = X.shape[1]
    z = np.dot(w.T, X) + b
    y_hat = 1 / (1 + np.exp(-z))
    loss = -(1/m) * np.sum(Y * np.log(y_hat) + (1-Y) * np.log(1-y_hat))
    dz = y_hat - Y
    dw = (1/m) * np.dot(X, dz.T)
    db = (1/m) * np.sum(dz)
    w = w + lr * dw      # <-- BUG IS HERE
    b = b + lr * db      # <-- AND HERE
    return w, b, loss

Bug: The update should be w = w - lr * dw (subtraction, not addition!). Gradient descent moves against the gradient. Adding the gradient would perform gradient ascent — maximizing the loss instead of minimizing it. The loss would explode!
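To watch the bug in action, the sketch below runs the same step with both signs on a small illustrative dataset. The `sign` parameter is added here only to toggle between the correct update and the buggy one.

```python
import numpy as np

def train_step(X, Y, w, b, lr, sign):
    """One step; sign=-1 is gradient descent, sign=+1 reproduces the bug."""
    m = X.shape[1]
    y_hat = 1 / (1 + np.exp(-(w.T @ X + b)))
    loss = -(1/m) * np.sum(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))
    dz = y_hat - Y
    dw = (1/m) * (X @ dz.T)
    db = (1/m) * np.sum(dz)
    return w + sign * lr * dw, b + sign * lr * db, loss

# Small illustrative dataset: 2 features, 3 samples as columns
X = np.array([[1.0, -1.0, 2.0],
              [2.0, -1.0, 0.0]])
Y = np.array([[1.0, 0.0, 1.0]])

results = {}
for sign, label in [(-1, "descent"), (+1, "ascent (bug)")]:
    w, b = np.zeros((2, 1)), 0.0
    history = []
    for _ in range(20):
        w, b, loss = train_step(X, Y, w, b, lr=0.1, sign=sign)
        history.append(loss)
    results[label] = history
    print(f"{label:13s}: {history[0]:.4f} -> {history[-1]:.4f}")
```

A loss curve that climbs from the very first iteration is the classic symptom of a flipped sign in the update.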

Section 7

Visual Aids — ASCII Diagrams

Sigmoid Function Shape

THE SIGMOID CURVE σ(z) = 1/(1 + e⁻ᶻ)

[ASCII plot: σ(z) on the vertical axis from 0.0 to 1.0, z on the horizontal axis from −6 to 5. The curve passes through σ(0) = 0.5, with a linear zone around z = 0 (σ′ ≈ 0.25) and saturation zones at both ends (σ′ ≈ 0).]

KEY PROPERTIES:
• Range: (0, 1), never reaches exactly 0 or 1
• Symmetric about (0, 0.5): σ(−z) = 1 − σ(z)
• Maximum slope at z = 0: σ′(0) = 0.25
• Saturation: |z| > 5 ⟹ σ′(z) ≈ 0 (vanishing gradient!)

BCE Loss Behavior

[Figure: binary cross-entropy loss ℒ (vertical axis, 0 to 5) against the prediction ŷ (horizontal axis, 0.0 to 1.0). Two curves: ℒ = −log(ŷ) when y=1, and ℒ = −log(1−ŷ) when y=0.]

How BCE punishes predictions:
• When y=1: predict ŷ→1 → loss→0 ✓; predict ŷ→0 → loss→∞ ✗
• When y=0: predict ŷ→0 → loss→0 ✓; predict ŷ→1 → loss→∞ ✗
• The punishment for confident wrong predictions grows without bound: every tenfold drop in the probability assigned to the true class adds log(10) ≈ 2.3 to the loss.
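A short table of values makes the shape of the penalty concrete. The sketch below uses only the standard library; `bce` is a hypothetical per-sample helper written for this illustration.

```python
import math

def bce(y, y_hat):
    """Per-sample binary cross-entropy for a label y in {0, 1}."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# True label is 1; the prediction moves from confident-right to confident-wrong.
for y_hat in (0.99, 0.9, 0.5, 0.1, 0.01, 0.001):
    print(f"y=1, y_hat={y_hat:<6} loss={bce(1, y_hat):.4f}")
```

Going from ŷ = 0.1 to ŷ = 0.01 raises the loss by exactly log(10), so the penalty is unbounded but grows logarithmically in 1/ŷ.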

Full Forward-Backward Dataflow

LOGISTIC REGRESSION: COMPLETE TRAINING ITERATION

FORWARD PASS
  Inputs: x ∈ ℝⁿ, w ∈ ℝⁿ, b ∈ ℝ
  z = wᵀx + b  →  ŷ = σ(z)  →  ℒ(ŷ, y)

BACKWARD PASS
  dℒ/dℒ = 1
  dℒ/dŷ = −y/ŷ + (1−y)/(1−ŷ)
  dℒ/dz = ŷ − y   (σ′ cancels with BCE!)
  dℒ/dw = (ŷ−y)·x     dℒ/db = ŷ − y

UPDATE STEP
  w ← w − α · dw
  b ← b − α · db

Repeat for T epochs until convergence.
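One way to gain confidence in the backward-pass formulas is a finite-difference gradient check: perturb each parameter slightly and compare the change in loss with the analytic gradient. The following is a sketch under assumed values (the numbers for w, b, x, y and the step h are illustrative), written in plain Python.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y_hat = sigmoid(z)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def analytic_grads(w, b, x, y):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    dz = sigmoid(z) - y               # dL/dz = y_hat - y
    return [dz * xi for xi in x], dz  # dL/dw, dL/db

def numeric_grads(w, b, x, y, h=1e-6):
    """Central differences: (L(p + h) - L(p - h)) / (2h) per parameter."""
    dw = []
    for i in range(len(w)):
        w_plus = w[:i] + [w[i] + h] + w[i + 1:]
        w_minus = w[:i] + [w[i] - h] + w[i + 1:]
        dw.append((loss(w_plus, b, x, y) - loss(w_minus, b, x, y)) / (2 * h))
    db = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
    return dw, db

w, b, x, y = [0.5, -0.3], 0.1, [1.0, 2.0], 1   # illustrative values
print("analytic:", analytic_grads(w, b, x, y))
print("numeric: ", numeric_grads(w, b, x, y))
```

If the two lines disagree beyond small floating-point noise, the backward pass has a bug; this check is cheap enough to run whenever the gradient code changes.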

Section 8

Common Misconceptions

❌ MYTH: "Logistic regression is a regression algorithm because it has 'regression' in the name."
✅ TRUTH: Logistic regression is used as a classification algorithm. It predicts probabilities P(y=1|x) ∈ (0,1) and classifies by thresholding. The "logistic" in the name comes from the logistic (sigmoid) function; the "regression" refers to fitting a linear model wᵀx + b to the log-odds log(ŷ/(1−ŷ)), not to predicting a continuous target.
🔍 WHY IT MATTERS: In GATE and interviews, this is a trick question. Know the distinction: linear regression predicts continuous values; logistic regression predicts probabilities for discrete classes.

❌ MYTH: "The sigmoid derivative σ′(z) = σ(z)(1 − σ(z)) is always 0.25."
✅ TRUTH: σ′(z) = 0.25 only at z = 0. For |z| > 5, σ′(z) < 0.007 (vanishing gradient). The derivative takes values in (0, 0.25], reaching its maximum at z = 0.
🔍 WHY IT MATTERS: This vanishing gradient problem is why sigmoid is rarely used as a hidden layer activation in deep networks. ReLU and its variants largely avoid it because their derivative is 1 for positive inputs.

❌ MYTH: "You can use MSE loss for logistic regression — it'll work fine."
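✅ TRUTH: MSE + sigmoid gives a loss surface that is non-convex in w and b, with flat plateaus where the gradient nearly vanishes and, in general, possible local minima. BCE + sigmoid is convex, so any minimum gradient descent reaches is a global minimum. Use BCE for binary classification.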
🔍 WHY IT MATTERS: Using the wrong loss function means gradient descent may get stuck in local minima or converge to a suboptimal solution. The MLE-derived BCE is mathematically the correct loss.
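The convexity difference already shows up in one dimension. The sketch below fixes a single sample with y = 1, treats each loss as a function of z alone, and estimates the second derivative by finite differences; a negative value means the curve bends downward there, which a convex function never does. The function names and the step h are assumptions for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_y1(z):
    """BCE loss as a function of z for one sample with y = 1."""
    return -math.log(sigmoid(z))

def mse_y1(z):
    """Squared error as a function of z for one sample with y = 1."""
    return (sigmoid(z) - 1.0) ** 2

def second_derivative(f, z, h=1e-3):
    return (f(z + h) - 2.0 * f(z) + f(z - h)) / (h * h)

for z in (-4.0, -2.0, 0.0, 2.0):
    print(f"z={z:+.0f}  BCE''={second_derivative(bce_y1, z):+.4f}  "
          f"MSE''={second_derivative(mse_y1, z):+.4f}")
```

The BCE column stays positive at every z, while the MSE column is negative at z = −2 and positive at z = 0, so MSE composed with sigmoid is not convex even for a single sample.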

❌ MYTH: "Higher learning rate α always means faster training."
✅ TRUTH: If α is too large, gradient descent overshoots the minimum and the loss oscillates or diverges. If α is too small, training is painfully slow. The optimal α depends on the loss surface curvature.
🔍 WHY IT MATTERS: Learning rate is often the single most important hyperparameter. In practice, learning rate schedules or adaptive optimizers (Adam) reduce the sensitivity to this choice, although α still needs tuning.

❌ MYTH: "Sigmoid output of 0.7 means the model is 70% confident."
✅ TRUTH: Sigmoid outputs are probabilities only if the model is calibrated. An uncalibrated model might output 0.7 for cases that are actually positive 90% of the time. Calibration (Platt scaling, isotonic regression) is needed to make outputs match true frequencies.
🔍 WHY IT MATTERS: In medical diagnosis and lending, miscalibrated probabilities lead to biased decisions. SBI's risk models require calibration validation by RBI auditors.

Section 9

GATE / Exam Corner

Key Formulas — Quick Reference Sheet

Concept	Formula	Notes
Sigmoid	σ(z) = 1/(1+e^−z)	Maps ℝ → (0,1)
σ(0)	0.5	Decision boundary point
Symmetry	σ(−z) = 1 − σ(z)	Symmetric about (0, 0.5)
Derivative	σ′(z) = σ(z)(1−σ(z))	Max = 0.25 at z=0
BCE Loss	ℒ = −[y·log ŷ + (1−y)·log(1−ŷ)]	Derived from MLE
Cost	J = (1/m)∑ℒ⁽ⁱ⁾	Average over m samples
Key gradient	∂ℒ/∂z = ŷ − y	Prediction error!
Weight gradient	∂J/∂w = (1/m)X(Ŷ−Y)^T	Vectorized form
GD Update	w := w − α·∂J/∂w	Learning rate α

GATE Previous Year Questions (PYQs) Pattern

GATE CS 2019 — Q

If σ(z) = 1/(1+e^−z) is the sigmoid function, which of the following is the derivative σ′(z)?

A. σ(z)
B. 1 − σ(z)
C. σ(z) · (1 − σ(z))
D. σ(z) + (1 − σ(z))

Answer: C — We derived this: σ′(z) = e^−z/(1+e^−z)² = σ(z)(1−σ(z)). Option D = 1 always (a common trap).

Remember · 1 Mark

GATE DA 2024 — Predicted Pattern

The binary cross-entropy loss for a single sample with true label y=1 and predicted probability ŷ=0.2 is:

A. −log(0.8) ≈ 0.2231
B. −log(0.2) ≈ 1.6094
C. (0.2 − 1)² = 0.64
D. 0.2 × log(0.2) ≈ −0.3219

Answer: B — When y=1, ℒ = −log(ŷ) = −log(0.2) ≈ 1.6094. The model predicted 20% probability for a positive case → heavy penalty. Option A uses (1−ŷ) which applies when y=0. Option C is MSE (wrong loss). Option D confuses the formula.

Apply · 2 Marks

GATE ML 2023 — Pattern

In logistic regression with gradient descent, the gradient ∂ℒ/∂z for a single sample simplifies to:

A. σ(z) · (1 − σ(z))
B. ŷ − y
C. (ŷ − y)²
D. y − ŷ

Answer: B — ∂ℒ/∂z = ŷ − y. The beautiful cancellation between the BCE derivative and sigmoid derivative gives this simple result. Note: option D (y − ŷ) would give gradient ascent.

Understand · 2 Marks

GATE Prediction Table — What to Expect

Topic	Likelihood	Marks	Type
Sigmoid derivative	⭐⭐⭐⭐⭐	1–2	MCQ
BCE formula application	⭐⭐⭐⭐	2	NAT
Gradient descent update	⭐⭐⭐⭐	2	MCQ/NAT
Sigmoid properties (symmetry, limits)	⭐⭐⭐	1	MCQ
MLE → BCE derivation	⭐⭐	2	MCQ
Convexity of BCE vs MSE	⭐⭐	1	MCQ

Q: Why is BCE convex for logistic regression but MSE is not?

A: BCE = −(1/m)∑[y log σ(w^Tx+b) + (1−y) log(1−σ(w^Tx+b))]. The negative log-likelihood of the Bernoulli distribution composed with the sigmoid is provably convex in w and b (the Hessian is positive semi-definite). MSE = (1/m)∑(σ(w^Tx+b) − y)² has non-convex regions due to the sigmoid's saturation zones creating additional inflection points.

Section 10

Interview Prep — India + US Focus

Conceptual Questions

🎤 Q1: Why is logistic regression considered a neural network?

Expected Answer (Senior ML roles)

Logistic regression has the same architecture as a neural network with 0 hidden layers: input features → linear transformation z = w^Tx + b → sigmoid activation ŷ = σ(z). It uses the same training procedure: forward pass to compute predictions, BCE loss to measure error, backward pass (chain rule) to compute gradients, and gradient descent to update parameters. The only difference from a deep network is depth — logistic regression has depth 1. This is why Andrew Ng introduces neural networks through logistic regression in his courses.

🎤 Q2: Derive the gradient ∂ℒ/∂w for logistic regression from first principles.

Expected Answer

Chain rule: ∂ℒ/∂w = (∂ℒ/∂ŷ)(∂ŷ/∂z)(∂z/∂w). Compute each: ∂ℒ/∂ŷ = −y/ŷ + (1−y)/(1−ŷ). ∂ŷ/∂z = σ(z)(1−σ(z)) = ŷ(1−ŷ). ∂z/∂w = x. Multiply: [−y/ŷ + (1−y)/(1−ŷ)] · ŷ(1−ŷ) · x = (ŷ−y)·x. For m samples, average: ∂J/∂w = (1/m)∑(ŷ⁽ⁱ⁾−y⁽ⁱ⁾)x⁽ⁱ⁾.

Interviewer follow-up: "Is it a coincidence that ∂ℒ/∂z = ŷ−y is so simple?" No — it's a consequence of the sigmoid being the canonical link function for the Bernoulli distribution in generalized linear models.
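For whiteboard practice, the "Multiply" step in the answer above can be written out term by term. The block below is the same algebra in LaTeX, using only the symbols already defined in this chapter:

```latex
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial z}
  &= \left[-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right]\hat{y}\,(1-\hat{y}) \\
  &= -y\,(1-\hat{y}) + (1-y)\,\hat{y} \\
  &= -y + y\hat{y} + \hat{y} - y\hat{y} \\
  &= \hat{y} - y,
\qquad\text{so}\qquad
\frac{\partial \mathcal{L}}{\partial w} = (\hat{y} - y)\,x .
\end{aligned}
```

The ŷ and (1−ŷ) denominators from the BCE derivative are exactly the factors supplied by the sigmoid derivative, which is why nothing but the prediction error survives.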

🎤 Q3: When would you use logistic regression over a deep neural network?

India Context (Flipkart, Ola, SBI)

When you need interpretability (RBI regulation for credit scoring), when you have limited labeled data (<1000 samples — deep networks overfit), when you need fast inference (SBI YONO processes loans in 12s), or as a baseline before trying complex models. At Flipkart, logistic regression with hand-crafted features is still the first model in every recommender pipeline.

US Context (Google, Meta, Netflix)

At Google Ads, logistic regression (FTRL variant) handles billions of sparse features via online learning — no deep network can match its training speed on streaming data. At Netflix, logistic regression serves as the calibration layer on top of deep network scores. At Meta, it's the production baseline that any new model must beat.

Coding Interview Questions

💻 Coding Q1: Implement sigmoid without using np.exp

# Hint: Use the identity σ(z) = 0.5 * (1 + tanh(z/2))
def sigmoid_via_tanh(z):
    return 0.5 * (1 + np.tanh(z / 2))

This works because tanh(z) = 2σ(2z) − 1, so σ(z) = (1 + tanh(z/2))/2. This is also more numerically stable since np.tanh is implemented in a stable way internally.
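The identity can be checked against the exponential form on a few inputs. This is a small sketch using the standard library's `math.tanh` rather than `np.tanh`; the function names are illustrative.

```python
import math

def sigmoid_exp(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_tanh(z):
    return 0.5 * (1.0 + math.tanh(z / 2.0))

for z in (-5.0, -1.0, 0.0, 2.0, 5.0):
    print(f"z={z:+.1f}  exp form={sigmoid_exp(z):.6f}  "
          f"tanh form={sigmoid_tanh(z):.6f}")

# The tanh form also accepts inputs for which exp(-z) would overflow.
print(sigmoid_tanh(-1000.0))
```

The last line is the practical payoff: tanh saturates cleanly at ±1, so very negative inputs return 0.0 instead of overflowing.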

💻 Coding Q2: Implement BCE loss in one line, handling log(0)

def bce_loss(y, y_hat, eps=1e-7):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # prevent log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

Key detail: np.clip is essential. Without it, a prediction of exactly 0 or 1 makes np.log return -inf, and the loss becomes inf or NaN (from 0 · −inf). Clipping predictions before taking the log is standard practice in log-loss implementations.
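The same guard can be written without NumPy, which makes the clipping step explicit. This is a sketch in plain Python with list inputs, an assumption made for the example. Note one difference in failure mode: `math.log(0)` raises `ValueError`, whereas `np.log(0)` returns `-inf` with a warning.

```python
import math

def bce_loss(ys, y_hats, eps=1e-7):
    """Mean BCE with predictions clipped into [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(ys, y_hats):
        p = min(max(p, eps), 1.0 - eps)   # plays the role of np.clip
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(ys)

# A prediction of exactly 0.0 for a positive label would otherwise hit log(0).
print(bce_loss([1, 0, 1], [0.0, 0.2, 0.9]))
```

The clipped loss is large but finite, so one saturated prediction cannot poison the average or the gradients computed from it.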

🇮🇳 INDIA INTERVIEW FOCUS

GATE-style derivation questions (sigmoid derivative, BCE from MLE)
Explain logistic regression to a non-technical banking manager
Feature engineering for Indian datasets (CIBIL, Aadhaar, UPI)
Companies: Flipkart, Ola, Razorpay, Paytm, SBI, HDFC
Common: "Implement from scratch in NumPy"

🇺🇸 US INTERVIEW FOCUS

System design: "Design a CTR prediction system at scale"
Trade-offs: logistic reg vs. deep learning for production
Online learning: how to update LR on streaming data
Companies: Google, Meta, Netflix, Uber, Airbnb
Common: "When would you NOT use deep learning?"

Section 11

Hands-On Lab — SBI Loan Default Predictor

🧪 Mini-Project: Build a Complete Loan Default Prediction System

📋 Project Specification

Objective

Build a logistic regression model from scratch (no sklearn) that predicts whether an Indian bank loan applicant will default, using synthetic CIBIL-style data.

Requirements

Generate a synthetic dataset with 1000 samples and 5 features (CIBIL score, income, EMI ratio, credit history, active loans)
Implement the LogisticRegression class with all 8 methods (sigmoid, forward, compute_loss, backward, update, fit, predict, accuracy)
Split data 80/20 into train/test sets
Train for 3000 epochs and plot the loss curve
Plot the decision boundary (use the 2 most important features)
Report training and test accuracy
Compare execution time: vectorized vs non-vectorized for 10,000 samples
Bonus: Implement learning rate decay (α decreases over epochs)

Rubric (100 points)

Component	Points	Criteria
Correct LogisticRegression class	30	All 8 methods implemented correctly
Dataset generation	10	Realistic feature distributions
Training + convergence	15	Loss decreases monotonically, >70% accuracy
Loss curve plot	10	Clean, labeled matplotlib figure
Decision boundary plot	15	Shows data points, boundary line, color regions
Vectorized vs loop timing	10	Timing benchmark with >100× speedup shown
Code quality + comments	10	Clear variable names, docstrings, step-by-step comments
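Two of the requirements, the 80/20 split and the bonus learning rate decay, are small enough to sketch up front. The helpers below are hypothetical scaffolding, not a reference solution: `train_test_split`, `decayed_lr`, the inverse-time schedule, and the placeholder rows are all assumptions, and the lab may use a different decay rule.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a list of (features, label) pairs and split it."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def decayed_lr(alpha0, epoch, decay_rate=0.001):
    """Inverse-time decay: alpha shrinks smoothly as epochs increase."""
    return alpha0 / (1.0 + decay_rate * epoch)

data = [([float(i)], i % 2) for i in range(1000)]   # placeholder rows
train, test = train_test_split(data)
print(len(train), len(test))
print(decayed_lr(0.1, 0), decayed_lr(0.1, 3000))
```

Shuffling before splitting matters when the generated rows are ordered by class; a fixed seed keeps the split reproducible across runs.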

Section 12

Exercises — 22 Problems Across 5 Categories

Section A: Conceptual Questions (5)

A1 Beginner

State the five key properties of the sigmoid function σ(z) = 1/(1+e^−z).

Answer: (1) σ(0) = 0.5, (2) Range: (0,1), (3) Symmetry: σ(−z) = 1−σ(z), (4) Limits: σ(z)→1 as z→+∞, σ(z)→0 as z→−∞, (5) Derivative: σ′(z) = σ(z)(1−σ(z)), max value 0.25 at z=0.

Remember

A2 Beginner

Explain in 2-3 sentences why the perceptron (step function) cannot be trained using gradient descent.

Answer: The step function's derivative is zero everywhere except at the discontinuity point where it's undefined. Since gradient descent requires ∂Loss/∂w = (∂Loss/∂ŷ)·(∂ŷ/∂z)·(∂z/∂w), and ∂ŷ/∂z = 0, the weight update is always zero — the model cannot learn.

Understand

A3 Intermediate

Why do we derive the BCE loss from MLE rather than just using MSE for classification?

Answer: MLE gives us the statistically principled loss function. For binary labels following a Bernoulli distribution, the negative log-likelihood is exactly BCE. This loss is convex when composed with sigmoid, guaranteeing a global minimum. MSE with sigmoid is non-convex, creating local minima that gradient descent can get stuck in.

Understand

A4 Intermediate

What is the difference between the "loss" ℒ and the "cost" J in logistic regression?

Answer: The loss ℒ(ŷ, y) is computed for a single training example. The cost J(w, b) is the average loss over all m training examples: J = (1/m)∑ℒ⁽ⁱ⁾. We minimize the cost (not the individual loss), and the gradients ∂J/∂w average the per-sample gradients.

Understand

A5 Intermediate

In the computation graph, explain why ∂ℒ/∂z = ŷ − y simplifies so beautifully. Is this a coincidence?

Answer: No coincidence. The sigmoid is the "canonical link function" for the Bernoulli distribution in the Generalized Linear Model (GLM) framework. When you pair the canonical link with the corresponding distribution's negative log-likelihood, the gradient always simplifies to (prediction − truth). This property extends to other GLMs: softmax + cross-entropy, identity + MSE, etc.

Analyze

Section B: Mathematical Problems (8)

B1 Beginner

Compute σ(3) and σ(−3). Verify that σ(3) + σ(−3) = 1.

Answer: σ(3) = 1/(1+e⁻³) = 1/(1+0.0498) ≈ 0.9526. σ(−3) = 1/(1+e³) = 1/(1+20.086) ≈ 0.0474. Sum = 0.9526 + 0.0474 = 1.0000 ✓. This verifies the symmetry property σ(−z) = 1 − σ(z).

Apply

B2 Intermediate

Prove that the sigmoid derivative can be written as: σ′(z) = σ(z) · σ(−z).

Answer: We know σ′(z) = σ(z)(1−σ(z)). By the symmetry property, 1−σ(z) = σ(−z). Therefore σ′(z) = σ(z)·σ(−z) ✓

Apply

B3 Intermediate

For a single sample with x = [3, 1]^T, y = 1, w = [0.2, −0.1]^T, b = 0.1, compute: (a) z, (b) ŷ, (c) ℒ, (d) ∂ℒ/∂w₁, (e) ∂ℒ/∂w₂, (f) ∂ℒ/∂b.

Answer: (a) z = 0.2·3 + (−0.1)·1 + 0.1 = 0.6 − 0.1 + 0.1 = 0.6. (b) ŷ = σ(0.6) = 1/(1+e^−0.6) ≈ 0.6457. (c) ℒ = −log(0.6457) ≈ 0.4375. (d) dz = ŷ−y = 0.6457−1 = −0.3543. ∂ℒ/∂w₁ = dz·x₁ = −0.3543·3 = −1.0629. (e) ∂ℒ/∂w₂ = −0.3543·1 = −0.3543. (f) ∂ℒ/∂b = −0.3543.

Apply
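Hand computations like B3 are easy to cross-check in a few lines. The sketch below recomputes every quantity for the same x, y, w and b in plain Python; results are printed at four decimals and may differ in the last digit from sums built out of rounded intermediate values.

```python
import math

x, y = [3.0, 1.0], 1
w, b = [0.2, -0.1], 0.1

z = sum(wi * xi for wi, xi in zip(w, x)) + b          # forward: linear part
y_hat = 1.0 / (1.0 + math.exp(-z))                    # forward: sigmoid
loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
dz = y_hat - y                                        # backward: dL/dz
dw = [dz * xi for xi in x]                            # backward: dL/dw
db = dz                                               # backward: dL/db

print(f"z={z:.4f}  y_hat={y_hat:.4f}  loss={loss:.4f}")
print(f"dw={[round(g, 4) for g in dw]}  db={db:.4f}")
```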

B4 Intermediate

If σ(z) = 0.8, compute σ′(z) without finding z first.

Answer: σ′(z) = σ(z)(1−σ(z)) = 0.8 × 0.2 = 0.16. The beauty of this formula: you don't need to know z, just σ(z)!

Apply

B5 Intermediate

Derive the Bernoulli likelihood for a dataset with y = [1, 0, 1, 1, 0] and ŷ = [0.9, 0.3, 0.7, 0.8, 0.1]. Compute the log-likelihood and the BCE loss.

Answer: L = 0.9 × 0.7 × 0.7 × 0.8 × 0.9 = 0.3175. log L = log(0.9) + log(0.7) + log(0.7) + log(0.8) + log(0.9) = −0.1054 + (−0.3567) + (−0.3567) + (−0.2231) + (−0.1054) = −1.1473. BCE = −(1/5)(−1.1473) = 0.2295.

Apply
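The B5 chain Bernoulli → likelihood → log-likelihood → BCE maps directly onto code. A minimal sketch using the same five labels and predictions (requires Python 3.8+ for `math.prod`); because it works at full precision, the last printed digit can differ from a hand sum of rounded logarithms.

```python
import math

ys = [1, 0, 1, 1, 0]
y_hats = [0.9, 0.3, 0.7, 0.8, 0.1]

# Probability the model assigns to each observed label:
# y_hat when y = 1, and 1 - y_hat when y = 0.
per_sample = [p if y == 1 else 1.0 - p for y, p in zip(ys, y_hats)]

likelihood = math.prod(per_sample)
log_likelihood = sum(math.log(p) for p in per_sample)
bce = -log_likelihood / len(ys)

print(f"L={likelihood:.4f}  log L={log_likelihood:.4f}  BCE={bce:.4f}")
```

Taking logs turns the product into a sum, which is why the cost function averages per-sample log terms instead of multiplying probabilities.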

B6 Advanced

Show that the Hessian ∂²J/∂w∂w^T for logistic regression is positive semi-definite, proving that J is convex.

Answer: H = (1/m)∑ σ(z⁽ⁱ⁾)(1−σ(z⁽ⁱ⁾)) · x⁽ⁱ⁾x⁽ⁱ⁾ᵀ. Let s_i = σ(z⁽ⁱ⁾)(1−σ(z⁽ⁱ⁾)) > 0 (since σ ∈ (0,1)). Then H = (1/m)X·diag(s)·X^T. For any vector v: v^T H v = (1/m)∑ s_i(v^T x⁽ⁱ⁾)² ≥ 0. Thus H is PSD and J is convex. ✓

Analyze
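The closed form for vᵀHv can be compared with a numerical second derivative of J along the direction v. The sketch below is a spot check on synthetic data, not a proof; the random data, the direction v, and the step h are all assumptions chosen for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cost(X, Y, w, b):
    """Average BCE cost J(w, b); X is a list of sample vectors."""
    total = 0.0
    for x, y in zip(X, Y):
        p = sigmoid(dot(w, x) + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(X)

def quad_form(X, w, b, v):
    """v^T H v from the closed form (1/m) * sum_i s_i * (v^T x_i)^2."""
    total = 0.0
    for x in X:
        p = sigmoid(dot(w, x) + b)
        total += p * (1.0 - p) * dot(v, x) ** 2
    return total / len(X)

rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(50)]  # synthetic data
Y = [rng.randint(0, 1) for _ in range(50)]
w, b, v, h = [0.5, -1.0, 2.0], 0.3, [1.0, 2.0, -1.0], 1e-3

# Second directional derivative of J along v by central differences.
w_plus = [wi + h * vi for wi, vi in zip(w, v)]
w_minus = [wi - h * vi for wi, vi in zip(w, v)]
numeric = (cost(X, Y, w_plus, b) - 2 * cost(X, Y, w, b)
           + cost(X, Y, w_minus, b)) / h**2

print(f"closed form: {quad_form(X, w, b, v):.6f}   "
      f"finite difference: {numeric:.6f}")
```

The labels Y appear in the cost but not in the closed form: the curvature of the BCE cost depends only on the inputs and the current predictions.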

B7 Advanced

Derive the gradient descent update rule for logistic regression with L2 regularization: J_reg = J + (λ/2m)||w||².

Answer: ∂J_reg/∂w = ∂J/∂w + (λ/m)w = (1/m)X(Ŷ−Y)^T + (λ/m)w. Update: w := w − α[(1/m)X(Ŷ−Y)^T + (λ/m)w] = w(1 − αλ/m) − (α/m)X(Ŷ−Y)^T. The term (1 − αλ/m) shrinks weights each step — this is "weight decay."

Analyze
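The two ways of writing the regularized update in B7 are algebraically identical, which a quick numeric comparison confirms. In this sketch `grad_w` is a stand-in for the unregularized gradient (1/m)X(Ŷ−Y)ᵀ, and all numbers are illustrative.

```python
def l2_update(w, grad_w, lr, lam, m):
    """One step: w := w - lr * (grad_w + (lam / m) * w)."""
    return [wi - lr * (gi + (lam / m) * wi) for wi, gi in zip(w, grad_w)]

def weight_decay_update(w, grad_w, lr, lam, m):
    """Same step written as shrink-then-descend."""
    shrink = 1.0 - lr * lam / m
    return [shrink * wi - lr * gi for wi, gi in zip(w, grad_w)]

w = [0.8, -1.5]
grad_w = [0.2, -0.4]       # stand-in for (1/m) X (Y_hat - Y)^T
print(l2_update(w, grad_w, lr=0.1, lam=2.0, m=100))
print(weight_decay_update(w, grad_w, lr=0.1, lam=2.0, m=100))
```

The bias b is conventionally left out of the penalty, so its update is unchanged from the unregularized case.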

B8 Advanced

Show that σ(z) can be expressed as: σ(z) = ½ + ½·tanh(z/2). Verify for z = 0, z = 2.

Answer: tanh(z/2) = (e^(z/2) − e^(−z/2))/(e^(z/2) + e^(−z/2)). Multiply top and bottom by e^(z/2): = (e^z − 1)/(e^z + 1). So ½+½·tanh(z/2) = ½ + (e^z−1)/(2(e^z+1)) = (e^z+1+e^z−1)/(2(e^z+1)) = e^z/(e^z+1) = 1/(1+e^−z) = σ(z) ✓. Verify: z=0: ½+½·0 = 0.5 ✓. z=2: ½+½·tanh(1) = 0.5+0.5·0.7616 = 0.8808 ≈ σ(2)=0.8808 ✓

Apply

Section C: Coding Problems (4)

C1 Intermediate

Implement a function predict_proba(X, w, b) that takes feature matrix X (shape n×m), weights w (shape n×1), and bias b, and returns predicted probabilities using the numerically stable sigmoid.

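Answer: def predict_proba(X, w, b): z = w.T @ X + b; e = np.exp(-np.abs(z)); return np.where(z >= 0, 1/(1+e), e/(1+e)). Exponentiating −|z| keeps the argument non-positive, so np.exp cannot overflow. Because np.where evaluates both branches for every element, a version that calls np.exp(-z) and np.exp(z) directly still emits overflow warnings for large |z|, even though the selected values are correct.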

Apply
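The branching idea behind a numerically stable sigmoid is easiest to see on scalars. The sketch below uses the standard library, where overflow raises `OverflowError` (NumPy instead returns `inf` with a warning); `naive_sigmoid` and `stable_sigmoid` are illustrative names.

```python
import math

def naive_sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stable_sigmoid(z):
    """Only ever exponentiates a non-positive number, so exp cannot overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

print(stable_sigmoid(-1000.0), stable_sigmoid(0.0), stable_sigmoid(1000.0))

try:
    naive_sigmoid(-1000.0)
except OverflowError as err:
    print("naive version failed:", err)
```

Both branches compute the same function; they differ only in which algebraic form keeps the exponent's argument at or below zero.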

C2 Intermediate

Write a function that computes the gradient ∂J/∂w and ∂J/∂b for m samples in a fully vectorized way (no loops). Verify against the loop version for correctness.

Answer: def grad(X, Y, w, b): m=X.shape[1]; z=w.T@X+b; a=sigmoid(z); dz=a-Y; dw=(1/m)*X@dz.T; db=(1/m)*np.sum(dz); return dw, db

Apply

C3 Advanced

Extend the LogisticRegression class to support mini-batch gradient descent. Add a batch_size parameter to fit() and randomly sample batches each epoch.

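Answer: In the fit loop, generate random indices each epoch: idx = np.random.choice(m, batch_size, replace=False); use X[:, idx] and Y[:, idx] for forward/backward. Each update then touches only batch_size samples, which cuts per-step compute and memory, and the gradient becomes a noisy estimate of the full-batch gradient. The logistic regression cost is convex, so there are no local optima to escape here; that benefit of gradient noise applies to non-convex models such as deep networks.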

Create
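A common alternative to drawing one random batch per epoch is shuffle-and-slice, which visits every sample exactly once per epoch. The generator below is a sketch of that variant in plain Python; `minibatch_indices` is a hypothetical helper, and the yielded index lists would be used to select columns of X and Y.

```python
import random

def minibatch_indices(m, batch_size, rng):
    """Yield index lists that together cover all m samples exactly once."""
    order = list(range(m))
    rng.shuffle(order)
    for start in range(0, m, batch_size):
        yield order[start:start + batch_size]

rng = random.Random(42)
batches = list(minibatch_indices(m=10, batch_size=4, rng=rng))
print(batches)  # three batches of sizes 4, 4, 2
```

The final batch is smaller whenever batch_size does not divide m, so the gradient should be averaged over the actual batch length rather than the nominal batch_size.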

C4 Advanced

Implement a plot_decision_boundary(model, X, Y) function that creates a meshgrid, computes predictions over it, and plots the decision boundary with a contour plot overlaid with data points.

Answer: Create meshgrid from min/max of X[0] and X[1], stack into (2, grid_size²) matrix, forward pass through model, reshape predictions to grid shape, use plt.contourf + plt.scatter. See Section 6c code for the complete implementation.

Create

Section D: Critical Thinking (3)

D1 Advanced

SBI uses logistic regression for loan approval because RBI mandates explainable models. But a gradient boosted tree (XGBoost) gives 5% higher accuracy. As the chief data scientist, how would you argue for or against switching?

Answer: Key trade-offs: (1) Regulatory: RBI requires explainability. LR weights are directly interpretable; XGBoost needs SHAP/LIME for post-hoc explanations. (2) Risk: if the 5% accuracy gain translated directly into avoided NPAs on a ₹2.5 lakh crore portfolio, that would be 0.05 × ₹2.5 lakh crore = ₹12,500 crore per year. (3) Compromise: Use LR as the production model for approval decisions (regulatory compliance) but use XGBoost as a second-pass model for risk scoring in the "uncertain zone" (15%–40% default probability). This two-stage arrangement is a model cascade (stacking, by contrast, feeds base-model outputs into a meta-model), and it addresses both accuracy and regulatory needs.

Evaluate

D2 Advanced

Google's ad CTR model must make predictions in <5ms. A logistic regression with 1 billion sparse features takes 2ms. A transformer model with 340M parameters gives 3% better CTR prediction but takes 50ms. Which would you deploy and why?

Answer: Deploy LR. The transformer's 50ms is 10× the 5ms budget and 25× slower than LR's 2ms. At 8.5B daily queries, 50ms of latency adds 8.5B × 0.05s = 425M seconds of user waiting time daily, and a widely cited industry rule of thumb is that every 100ms of latency costs about 1% of revenue. The 3% CTR improvement from the transformer does not justify breaking the latency budget. Alternative: Use the transformer to generate features offline (user embeddings, query intent scores), then feed those as inputs to the logistic regression. This gives most of the accuracy gain within latency budget.

Evaluate

D3 Advanced

The maximum value of σ′(z) is 0.25. In a deep network with L sigmoid layers, the gradient flowing back through all layers is multiplied by σ′ at each layer. What is the maximum possible gradient magnitude at the first layer for L = 10? L = 50? What does this imply?

Answer: Maximum gradient = (0.25)^L. For L=10: (0.25)¹⁰ ≈ 9.5 × 10⁻⁷. For L=50: (0.25)⁵⁰ ≈ 7.9 × 10⁻³¹. This is the vanishing gradient problem — gradients become astronomically small in deep sigmoid networks, making early layers virtually untrainable. This is why ReLU (max gradient = 1) replaced sigmoid in hidden layers of deep networks.

Analyze
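The bound in D3 is a one-line computation. The sketch below tabulates (0.25)^L for a few depths; it covers only the product of sigmoid derivatives and, like the exercise, leaves out the weight matrices that also multiply the gradient in a real network. `max_sigmoid_gradient` is an illustrative name.

```python
def max_sigmoid_gradient(num_layers):
    """Upper bound on the product of sigmoid derivatives across layers."""
    return 0.25 ** num_layers

for L in (1, 5, 10, 50):
    print(f"L={L:2d}  bound={max_sigmoid_gradient(L):.3e}")
```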

★ Section E: Starred Research Problems (2)

★ E1 Advanced

Read McMahan et al. (2013), "Ad Click Prediction: a View from the Trenches" (Google). This paper describes the FTRL-Proximal optimizer used for training logistic regression on billions of features. Summarize: (a) Why does standard gradient descent fail at Google's scale? (b) How does FTRL-Proximal achieve sparsity? (c) What is the per-coordinate learning rate schedule?

Reference: KDD 2013. Key insights: (a) With billions of features, full gradient updates are too expensive; FTRL processes one example at a time (online learning). (b) FTRL combines L1 regularization with a "follow-the-regularized-leader" update that sets many weights exactly to zero, enabling sparse models. (c) η_{t,i} = α/(β + √(∑_{s=1}^{t} g_{s,i}²)): a per-feature adaptive rate that decreases as the feature's accumulated squared gradient grows.

Create · Research
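The per-coordinate schedule in part (c) is simple to compute on its own. The sketch below implements only that learning rate formula, not the full FTRL-Proximal update; the values of alpha and beta and the gradient histories are made up for illustration.

```python
import math

def per_coordinate_lr(grad_history, alpha=0.1, beta=1.0):
    """eta_{t,i} = alpha / (beta + sqrt(sum of squared past gradients))."""
    return alpha / (beta + math.sqrt(sum(g * g for g in grad_history)))

frequent_feature = [0.5] * 100   # coordinate that has received many gradients
rare_feature = [0.5] * 2         # coordinate seen only twice
print(per_coordinate_lr(frequent_feature))
print(per_coordinate_lr(rare_feature))
```

Rarely seen features keep a larger step size, so a sparse feature that appears late in the stream can still move its weight quickly.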

★ E2 Advanced

The sigmoid function σ(z) = 1/(1+e^−z) is one of many possible S-shaped functions. Research and compare at least three alternatives: (a) tanh, (b) probit (Φ(z) — the CDF of the standard normal), (c) algebraic sigmoid z/(1+|z|). For each, derive the range, compute the derivative, and discuss when you'd prefer it over the standard sigmoid.

Hints: (a) tanh: range (−1,1), derivative 1−tanh²(z), zero-centered output; preferred in hidden layers over sigmoid. (b) Probit: range (0,1); Φ itself has no closed form in elementary functions (it is written with the error function), but its derivative is the standard normal density φ(z) = e^(−z²/2)/√(2π); used in Bayesian models and econometrics. (c) Algebraic: range (−1,1), derivative 1/(1+|z|)², computationally cheaper because it needs no exponential; used in embedded systems. The standard sigmoid's advantage is that σ′ = σ(1−σ) makes backprop analytically elegant.

Create · Research
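A starting point for E2 is to evaluate the candidate functions side by side. The sketch below uses only the standard library (`math.erf` for the probit) and estimates slopes by central differences; the function names and the step h are assumptions for this illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def probit(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def algebraic(z):
    return z / (1.0 + abs(z))

def numeric_derivative(f, z, h=1e-5):
    return (f(z + h) - f(z - h)) / (2.0 * h)

for name, f in [("logistic", logistic), ("tanh", math.tanh),
                ("probit", probit), ("algebraic", algebraic)]:
    print(f"{name:9s} f(-3)={f(-3.0):+.4f}  f(3)={f(3.0):+.4f}  "
          f"f'(0)={numeric_derivative(f, 0.0):.4f}")
```

The f(±3) columns show the two output ranges, (0,1) versus (−1,1), and the slope of the probit at the origin equals the standard normal density there, 1/√(2π).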

Section 13

Connections — Where This Fits

🔗 Chapter Connections Map

← Builds On

Ch 0 (Course Overview): The bird's-eye view of what neural networks are
Ch 2 (Math Toolkit): Derivatives, chain rule, vectors, matrices — all used in gradient derivations

→ Enables

Ch 4 (Shallow Neural Networks): Stack multiple logistic regression neurons into layers → multi-layer perceptron
Ch 5 (Deep Neural Networks): The forward-backward-update loop scales directly to L layers
Ch 6 (Optimization): Learning rate, momentum, Adam — all build on gradient descent from this chapter

🔬 Research Frontier

Neural Tangent Kernels (Jacot et al., 2018): In the infinite-width limit, training a neural network by gradient descent behaves like kernel regression with a fixed kernel, that is, a linear model in a fixed feature space
Calibration (Guo et al., 2017): Modern deep networks are poorly calibrated — their sigmoid outputs don't match true probabilities. Temperature scaling (a post-hoc logistic regression!) fixes this

🏭 Industry Implementations

sklearn.linear_model.LogisticRegression: Uses LBFGS or liblinear solvers (faster than GD for convex problems)
Google FTRL-Proximal: Online logistic regression at billion-feature scale
Facebook/Meta DLRM: Deep Learning Recommendation Model uses logistic regression as its final output layer

CHAPTER DEPENDENCY MAP

Ch 0: Overview ─────┐
                    ├──→ [CH 3: LOGISTIC REGRESSION] ──→ Ch 4: Shallow NN
Ch 2: Math Toolkit ─┘                                ──→ Ch 5: Deep NN
                                                     ──→ Ch 6: Optimization

Ch 3 topics: sigmoid, BCE from MLE, gradient descent, computation graph, forward/backward, single neuron NN

Section 14

Chapter Summary — 7 Key Takeaways

📝 What You Learned in Chapter 3

From Step to Smooth: The perceptron's step function has zero derivatives, blocking gradient-based learning. The sigmoid σ(z) = 1/(1+e^−z) is smooth, differentiable, and maps ℝ → (0,1), enabling gradient descent.
Sigmoid is Self-Referential: Its derivative σ′(z) = σ(z)(1−σ(z)) is expressed entirely in terms of the function itself — no need to recompute z. Maximum derivative is 0.25 at z=0; it vanishes for |z| ≫ 0.
Logistic Regression IS a Neural Network: A single neuron with n inputs, sigmoid activation, trained with BCE loss and gradient descent. Every concept from this chapter — forward pass, loss, backprop, GD update — scales directly to deep networks.
BCE from First Principles: Bernoulli → Likelihood → Log-Likelihood → Negate → Average = Binary Cross-Entropy. This makes BCE the maximum-likelihood loss for binary labels, which is why it is the standard choice for binary classification.
The Beautiful Gradient: ∂ℒ/∂z = ŷ − y. The gradient of the loss with respect to the pre-activation is simply the prediction error. This is not a coincidence — it's a property of canonical link functions in GLMs.
Vectorize Everything: Loop-based gradient computation is ~1500× slower than vectorized NumPy operations. In production ML, vectorization is not optional.
Real-World Impact: SBI uses logistic regression for ₹10L+ loan decisions (12-second approvals). Google uses it for $175B/year in ad revenue (8.5B daily predictions). The simplest neural network is also the most widely deployed.

Key Equation

σ(z) = 1/(1+e^−z) | J = −(1/m) ∑[y log ŷ + (1−y) log(1−ŷ)] | dz = ŷ − y

Key Intuition

🧠 Logistic regression is a single neuron that converts a linear combination of inputs into a probability, learns by comparing its predictions to truth (BCE), and adjusts its weights to reduce errors (gradient descent). Every deep neural network is just many of these neurons, connected in layers, trained with the same four-step loop: forward → loss → backward → update.

Section 15

Further Reading