Neural Networks & Deep Learning
Chapter 3: Logistic Regression as a Neural Network
Where a single neuron learns to make decisions, and you learn to train it
Reading Time: ~3 hours | Unit 1: The Neuron Era | Theory + Code
Prerequisites: Chapter 0 (Course Overview) & Chapter 2 (Math Toolkit)
Bloom's Taxonomy Progression
| Bloom's Level | What You'll Achieve |
|---|---|
| Remember | Recall the sigmoid function σ(z) = 1/(1+e^(−z)), its range (0,1), the key value σ(0) = 0.5, and the BCE loss formula |
| Understand | Explain why binary cross-entropy is derived from Maximum Likelihood and why sigmoid makes the perceptron differentiable |
| Apply | Implement a complete LogisticRegression class from scratch in NumPy with forward, backward, and update methods |
| Analyze | Trace the full computation graph (forward pass and backward pass), computing every intermediate gradient by hand |
| Evaluate | Compare vectorized vs loop-based gradient descent, evaluate numerical stability issues, assess convergence behavior |
| Create | Design and train a loan-default predictor for an Indian banking dataset and plot the decision boundary |
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why the step-function perceptron needs to be replaced with a smooth, differentiable activation for learning via gradient descent
- Derive the sigmoid function σ(z) = 1/(1+e^(−z)) and prove all five key properties: range, midpoint, symmetry, limits, and derivative σ′ = σ(1−σ)
- Recognize logistic regression as a single-neuron neural network and draw its computational architecture
- Derive Binary Cross-Entropy from scratch: Bernoulli → Likelihood → Log-Likelihood → Negate → BCE
- Compute the cost function J(w,b) = average BCE over m training examples
- Derive the gradients ∂J/∂w and ∂J/∂b, and the gradient descent update rules, using the chain rule on the computation graph
- Implement a complete `LogisticRegression` class from scratch with sigmoid, loss, forward, backward, update, fit, predict, and accuracy methods
- Train the model on a synthetic Indian bank loan dataset, and plot the loss curve and decision boundary
- Compare vectorized (NumPy) vs loop-based implementations with timing benchmarks
- Execute a full forward + backward pass by hand on a worked example with 2 features and 3 samples
Opening Hook: The 12-Second Loan Decision
"Approved" or "Rejected", in the time it takes to blink twice.
It's a monsoon Tuesday in Mumbai. Anjali, a 31-year-old chartered accountant, opens the SBI YONO app and applies for a ₹10,00,000 home loan top-up. Within 12 seconds, the app responds: "Congratulations! Your pre-approved offer is ready."
In those 12 seconds, no human banker reviewed Anjali's file. A machine learning model (at its mathematical core, a logistic regression) consumed her CIBIL score (782), monthly salary (₹1,45,000), existing EMI-to-income ratio (22%), years of credit history (9), number of active loans (2), and 40+ other features. It output a single number: 0.04, the probability that Anjali will default.
Now imagine the same system at Google's ad servers in Mountain View, California. Every time you search "best laptop under $1000," Google runs a logistic regression (among other models) on each of 500 candidate ads and predicts: what is the probability that you will click this ad? That probability, the predicted click-through rate (CTR), determines which ads you see. Google runs this computation 8.5 billion times per day.
Both systems, SBI's loan engine and Google's ad ranker, solve the same mathematical problem: given input features x, estimate P(y=1|x). The tool they use is the simplest possible neural network: a single neuron with a sigmoid activation. That's logistic regression. In this chapter, you will build it from first principles.
The Intuition First: From Step to Smooth
The Dimmer Switch Analogy
In Chapter 2 (your math toolkit), you saw that a perceptron uses a step function: output 1 if the weighted sum exceeds a threshold, output 0 otherwise. Think of it as a light switch: it's either ON or OFF, nothing in between.
But here's the problem: you can't learn from a switch. Imagine you're adjusting the temperature dial on your geyser. If the dial only had two positions, "freezing" and "boiling", you'd never find a comfortable temperature. You need a smooth dial that lets you make tiny adjustments and see small changes in temperature. That smooth dial is the sigmoid function.
The "Aha!" Question
Ask yourself: Why can't gradient descent work with a step function? Because the derivative of a step function is zero everywhere (except at the discontinuity, where it's undefined). If the derivative is zero, the gradient is zero, and the weight update is Δw = −α·∂L/∂w = −α·0 = 0. The model never learns! You need a function with a non-zero derivative at every point. Enter the sigmoid.
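The zero-gradient argument can be checked numerically. The following is a minimal sketch, assuming a central-difference approximation of the derivative; the helper names (`step`, `central_difference`) and the sample points are illustrative choices, not part of the chapter's code.

```python
import numpy as np

def step(z):
    """Perceptron activation: 1 if z > 0, else 0."""
    return (z > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def central_difference(f, z, h=1e-4):
    """Numerical derivative (f(z + h) - f(z - h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2 * h)

# Sample points chosen away from the jump at z = 0 (an assumption of this sketch)
z = np.array([-2.0, -0.5, 0.5, 2.0])
step_grad = central_difference(step, z)
sigmoid_grad = central_difference(sigmoid, z)

print("step'   :", step_grad)     # zero at every sample point: no learning signal
print("sigmoid':", sigmoid_grad)  # strictly positive at every sample point
```

With a zero derivative at every point it can evaluate, gradient descent has nothing to follow; the sigmoid always supplies a usable, if sometimes small, slope.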
Mathematical Foundation: Derived from First Principles
We build the entire logistic regression framework in five rigorous steps. Every equation is derived, not memorized. If you're confused at any step, you're thinking correctly; stay with it.
4a. The Sigmoid Function σ(z): From Linear to Probability
The Problem: Unbounded Outputs
A neuron computes z = wᵀx + b, where z ∈ (−∞, +∞). But for binary classification, you need a probability: a number in [0, 1]. You need a function f: ℝ → (0, 1) that is smooth, monotonically increasing, and differentiable.
Definition
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Domain: z ∈ (−∞, +∞) → Range: σ(z) ∈ (0, 1)
Property 1: Midpoint, σ(0) = 0.5
σ(0) = 1/(1 + e^0) = 1/(1 + 1) = 1/2 = 0.5
Interpretation: When the weighted sum z = 0, the model is maximally uncertain, assigning equal probability to both classes.
Property 2: Symmetry, σ(−z) = 1 − σ(z)
Start: σ(−z) = 1/(1 + e^z)
Multiply numerator and denominator by e^(−z):
= e^(−z)/(e^(−z) + 1) = (1 + e^(−z) − 1)/(1 + e^(−z)) = 1 − 1/(1 + e^(−z)) = 1 − σ(z) ✓
This means the sigmoid is symmetric about the point (0, 0.5). If P(y=1|x) = σ(z), then P(y=0|x) = σ(−z). Beautiful.
Property 3: Asymptotic Limits
As z → +∞: e^(−z) → 0, so σ(z) → 1/(1+0) = 1
As z → −∞: e^(−z) → ∞, so σ(z) → 0
The sigmoid never reaches 0 or 1 exactly; outputs are strictly in the open interval (0, 1).
Property 4: The Elegant Derivative, σ′(z) = σ(z)·(1 − σ(z))
Write σ(z) = (1 + e^(−z))^(−1). Apply the chain rule:
σ′(z) = −1 · (1 + e^(−z))^(−2) · d/dz(1 + e^(−z))
= −(1 + e^(−z))^(−2) · (−e^(−z))
= e^(−z) / (1 + e^(−z))²
Now verify σ(z)·(1−σ(z)), using 1 − σ(z) = e^(−z)/(1+e^(−z)):
= [1/(1+e^(−z))] · [e^(−z)/(1+e^(−z))]
= e^(−z) / (1+e^(−z))² ✓
σ′(0) = σ(0)·(1−σ(0)) = 0.5 × 0.5 = 0.25
The sigmoid changes fastest at z = 0. For |z| ≫ 0, σ′ ≈ 0: these are the saturation regions where gradients vanish. This is a preview of the "vanishing gradient problem" we'll fight in later chapters.
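The symmetry and derivative properties can be sanity-checked numerically. This is a rough sketch, assuming a grid of 25 points on [−6, 6] and a central finite difference; the grid and tolerance are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6.0, 6.0, 25)   # grid with step 0.5; includes z = 0
s = sigmoid(z)

# Property 2: sigma(-z) = 1 - sigma(z)
symmetry_ok = np.allclose(sigmoid(-z), 1.0 - s)

# Property 4: analytic derivative sigma(1 - sigma) vs a central finite difference
h = 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = s * (1.0 - s)
derivative_ok = np.allclose(numeric, analytic, atol=1e-8)

# The derivative peaks at z = 0 with value 0.25
peak_z = z[np.argmax(analytic)]
peak_value = analytic.max()

print(symmetry_ok, derivative_ok, peak_z, peak_value)
```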
Sigmoid Value Table: Memorize These!
| z | −6 | −4 | −2 | −1 | 0 | 1 | 2 | 4 | 6 |
|---|---|---|---|---|---|---|---|---|---|
| σ(z) | 0.0025 | 0.018 | 0.119 | 0.269 | 0.500 | 0.731 | 0.881 | 0.982 | 0.9975 |
| σ′(z) | 0.0025 | 0.018 | 0.105 | 0.197 | 0.250 | 0.197 | 0.105 | 0.018 | 0.0025 |
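Warning: do not compute 1 / (1 + np.exp(-z)) naively! When z is a large negative number (say z = −1000), np.exp(1000) overflows to inf and NumPy emits a RuntimeWarning. A common piecewise form picks an overflow-free expression for each sign of z:
np.where(z >= 0, 1/(1+np.exp(-z)), np.exp(z)/(1+np.exp(z)))
Note that np.where evaluates both branches over the whole array, so the returned values are correct but overflow warnings can still appear. To avoid them entirely, apply each expression only to the elements it is meant for.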
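One way to avoid evaluating the overflowing branch at all is to use boolean masks, so each formula only ever sees the inputs it is safe for. The function name `stable_sigmoid` and the test inputs below are illustrative assumptions; this is a sketch of the idea, not the chapter's reference implementation.

```python
import warnings
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a large positive number.

    Returns a float array (scalars come back as length-1 arrays).
    """
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so this form cannot overflow
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, exp(z) < 1, so this form cannot overflow either
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

z = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
with warnings.catch_warnings():
    warnings.simplefilter("error")   # any overflow warning would raise here
    values = stable_sigmoid(z)

print(values)
```

The extreme inputs saturate cleanly to 0 and 1 without triggering a warning, because each exponential is only computed on arguments that are zero or negative.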
Logistic Regression = Single-Neuron Neural Network
Now here's the key insight. A logistic regression model is exactly a neural network with:
- One input layer with n_x features
- One output neuron with sigmoid activation
- No hidden layers
TRUTH: Logistic regression IS a neural network, with exactly 1 neuron, 0 hidden layers, and sigmoid activation. It's the simplest possible neural network.
WHY IT MATTERS: Understanding logistic regression deeply means you understand forward propagation, loss computation, backpropagation, and gradient descent: the exact four steps that every deep network uses. The only difference in a 100-layer network is that these steps are repeated through more layers.
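4b. Binary Cross-Entropy: Derived from Maximum Likelihood
We have a model that outputs ŷ = σ(wᵀx + b) ∈ (0, 1). We need a loss function, a way to quantify "how wrong" the model is. We don't pull this loss out of thin air. We derive it from the principle of Maximum Likelihood Estimation (MLE).
Step 1: Bernoulli Distribution for a Single Label
Since y ∈ {0, 1}, each label follows a Bernoulli distribution. Our model estimates P(y=1|x) = ŷ. We can write both cases in a single elegant formula:
$$P(y \mid x) = \hat{y}^{\,y}\,(1-\hat{y})^{\,1-y}$$
Verify: If y = 1: P = ŷ¹ · (1−ŷ)⁰ = ŷ ✓ | If y = 0: P = ŷ⁰ · (1−ŷ)¹ = 1−ŷ ✓
Step 2: Likelihood Function for m Training Examples
Assuming training examples are independent and identically distributed (i.i.d.), the likelihood of observing all m labels given our parameters is the product:
$$L(w, b) = \prod_{i=1}^{m} \left(\hat{y}^{(i)}\right)^{y^{(i)}} \left(1-\hat{y}^{(i)}\right)^{1-y^{(i)}}$$
Step 3: Log-Likelihood
Products of many small numbers underflow numerically. The log converts products to sums and is monotonic (maximizing log L is equivalent to maximizing L):
$$\log L(w, b) = \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1-y^{(i)}\right) \log\left(1-\hat{y}^{(i)}\right) \right]$$
Step 4: Negate and Average
MLE says: maximize log L. Gradient descent minimizes. So negate it. Divide by m for the average:
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1-y^{(i)}\right) \log\left(1-\hat{y}^{(i)}\right) \right]$$
This is the Binary Cross-Entropy (BCE) Loss, also called the Log Loss. Every line follows from the Bernoulli model and the i.i.d. assumption; nothing was pulled out of thin air.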
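The chain Likelihood → Log-Likelihood → BCE, and the underflow that motivates the log, can be illustrated with synthetic numbers. The labels and predictions below are randomly generated placeholders (an assumption of this sketch), chosen only to make the product long enough to underflow.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2000
y = rng.integers(0, 2, size=m).astype(float)   # synthetic labels (placeholder data)
y_hat = rng.uniform(0.05, 0.95, size=m)        # synthetic predictions (placeholder data)

# Step 1: Bernoulli probability the model assigns to each observed label
per_sample = y_hat**y * (1.0 - y_hat)**(1.0 - y)

# Step 2: the likelihood is a product of m numbers below 1 and underflows to 0.0
likelihood = np.prod(per_sample)

# Step 3: the log-likelihood is a sum and stays finite
log_likelihood = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Step 4: negate and average to get the BCE cost
bce = -log_likelihood / m

print("likelihood     :", likelihood)
print("log-likelihood :", log_likelihood)
print("BCE            :", bce)
```

The raw product is useless as an optimization target in floating point, while the averaged negative log-likelihood is a well-scaled number that can be compared across dataset sizes.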
Why BCE Works: Intuitive Analysis
Loss Behavior: How BCE Penalizes Errors
When y = 1: Loss per sample = −log(ŷ)
- If ŷ → 1 (correct, confident): −log(ŷ) → 0, zero loss, perfect!
- If ŷ → 0 (wrong, confident): −log(ŷ) → +∞, an unbounded penalty for being confidently wrong!
When y = 0: Loss per sample = −log(1 − ŷ)
- If ŷ → 0 (correct, confident): −log(1 − ŷ) → 0, zero loss, perfect!
- If ŷ → 1 (wrong, confident): −log(1 − ŷ) → +∞, an unbounded penalty!
BCE penalizes confident wrong predictions far more heavily than uncertain ones, and the penalty grows without bound as the probability assigned to the true class approaches 0. If your model says "I'm 99% sure this person will repay" and they default, the penalty is −log(0.01) ≈ 4.6. But if the model says "I'm 51% sure," the penalty is only −log(0.49) ≈ 0.71. This steep penalty for overconfidence pushes the model toward calibrated probabilities.
TRUTH: MSE with sigmoid creates a non-convex loss surface that can have local minima and flat plateaus, because (ŷ − y)² composed with σ(z) is not convex in w. BCE with sigmoid gives a convex loss surface, so every local minimum is a global one and gradient descent with a suitably small learning rate converges toward it.
WHY IT MATTERS: This is a GATE favorite. The convexity of BCE follows directly from its form as the negative log-likelihood of the Bernoulli distribution with a sigmoid link.
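4c. The Cost Function J(w, b)
For a single training example (x, y), the loss is:
$$\ell(\hat{y}, y) = -\left[ y \log \hat{y} + (1-y) \log(1-\hat{y}) \right]$$
The cost function is the average loss over all m training examples:
$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \ell\left(\hat{y}^{(i)}, y^{(i)}\right) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1-y^{(i)}\right) \log\left(1-\hat{y}^{(i)}\right) \right]$$
Where ŷ⁽ⁱ⁾ = σ(wᵀx⁽ⁱ⁾ + b) for each sample i. The goal of training: find w* and b* that minimize J(w, b).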
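To get a feel for the shape of J as a function of the weights, the cost can be evaluated on a grid. The sketch below assumes the simplest possible setting (one sample with x = 1, y = 1, and b = 0, all hypothetical) and compares the BCE curve with a squared-error curve using discrete second differences, which are non-negative wherever a curve is convex.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-sample problem: x = 1, y = 1, b = 0, so z = w
w_grid = np.linspace(-6.0, 6.0, 121)   # candidate weights, step 0.1
y_hat = sigmoid(w_grid)

bce_curve = -np.log(y_hat)             # J(w) under binary cross-entropy
mse_curve = (y_hat - 1.0) ** 2         # J(w) under squared error

# Discrete second derivative along the grid
bce_second_diff = np.diff(bce_curve, n=2)
mse_second_diff = np.diff(mse_curve, n=2)

print("BCE convex on grid:", bool(np.all(bce_second_diff > 0)))
print("MSE convex on grid:", bool(np.all(mse_second_diff >= 0)))
```

On this toy problem the squared-error curve bends the wrong way where the sigmoid saturates on the wrong side, which is where its gradient becomes nearly flat; the BCE curve keeps positive curvature across the whole grid.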
4d. Gradient Computation: ∂J/∂w and ∂J/∂b via Chain Rule
To minimize J, we need its gradients with respect to w and b. We'll derive these using the chain rule, one link at a time.
For a single sample (drop the superscript for clarity):
z = wᵀx + b → ŷ = σ(z) → ℓ = −[y·log(ŷ) + (1−y)·log(1−ŷ)]
We need ∂ℓ/∂w and ∂ℓ/∂b. By the chain rule:
∂ℓ/∂w = (∂ℓ/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w)
Link 1: ∂ℓ/∂ŷ
ℓ = −y·log(ŷ) − (1−y)·log(1−ŷ)
∂ℓ/∂ŷ = −y/ŷ − (1−y)·(−1)/(1−ŷ) = −y/ŷ + (1−y)/(1−ŷ)
Link 2: ∂ŷ/∂z = σ′(z) = σ(z)(1−σ(z)) = ŷ(1−ŷ)
Link 3: ∂z/∂w = x and ∂z/∂b = 1
Combining: ∂ℓ/∂z (the key intermediate)
∂ℓ/∂z = (∂ℓ/∂ŷ) · (∂ŷ/∂z)
= [−y/ŷ + (1−y)/(1−ŷ)] · ŷ(1−ŷ)
= −y(1−ŷ) + (1−y)ŷ
= −y + yŷ + ŷ − yŷ
= ŷ − y
∂ℓ/∂w = (ŷ − y) · x
∂ℓ/∂b = (ŷ − y) · 1 = ŷ − y
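Gradients for the Full Cost (m samples)
$$\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)}$$
$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)$$
Gradient Descent Update Rules
With the gradients computed, we update parameters in the direction that decreases J:
w := w − α · ∂J/∂w
b := b − α · ∂J/∂b
where α > 0 is the learning rate
Repeat for T iterations (epochs). Each iteration, the loss decreases (if α is small enough), and the model gets better at predicting.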
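A standard way to gain confidence in a hand-derived gradient is a finite-difference check. The sketch below compares the analytic gradients (1/m)·X(ŷ − y)ᵀ and (1/m)·Σ(ŷ − y) against central differences of the cost; the random data, shapes, and function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """Average BCE; X has shape (n, m), Y has shape (1, m)."""
    y_hat = sigmoid(w.T @ X + b)
    return float(-np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat)))

def analytic_grads(w, b, X, Y):
    m = X.shape[1]
    dz = sigmoid(w.T @ X + b) - Y          # (1, m): the y_hat - y term
    return (X @ dz.T) / m, float(np.sum(dz)) / m

# Random placeholder problem: 3 features, 20 samples
rng = np.random.default_rng(1)
X = rng.normal(size=(3, 20))
Y = rng.integers(0, 2, size=(1, 20)).astype(float)
w = rng.normal(size=(3, 1))
b = 0.3

dw, db = analytic_grads(w, b, X, Y)

# Central finite differences, one parameter at a time
h = 1e-6
dw_num = np.zeros_like(w)
for j in range(w.shape[0]):
    e = np.zeros_like(w)
    e[j] = h
    dw_num[j] = (cost(w + e, b, X, Y) - cost(w - e, b, X, Y)) / (2 * h)
db_num = (cost(w, b + h, X, Y) - cost(w, b - h, X, Y)) / (2 * h)

max_err = max(np.max(np.abs(dw - dw_num)), abs(db - db_num))
print("max abs difference:", max_err)
```

If the derivation or the code contained a sign error or a missing 1/m factor, the difference would be of the same order as the gradient itself rather than near machine precision.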
4e. The Computation Graph: Forward and Backward Pass
A computation graph makes the chain rule visual. Each node is an operation. The forward pass flows left to right, computing outputs. The backward pass flows right to left, computing gradients.
Worked Examples: By Hand, Then by India, Then by World
Example 1: Full Forward + Backward Pass By Hand (2 features, 3 samples)
Dataset:
x⁽¹⁾ = [1, 2]ᵀ, y⁽¹⁾ = 1 (approved)
x⁽²⁾ = [−1, −1]ᵀ, y⁽²⁾ = 0 (rejected)
x⁽³⁾ = [2, 0]ᵀ, y⁽³⁾ = 1 (approved)
Initial parameters: w = [0.5, −0.5]ᵀ, b = 0, learning rate α = 0.1
FORWARD PASS: Compute z, ŷ, ℓ for each sample
Sample 1: z⁽¹⁾ = 0.5·1 + (−0.5)·2 + 0 = 0.5 − 1.0 = −0.5
ŷ⁽¹⁾ = σ(−0.5) = 1/(1 + e^0.5) = 1/(1 + 1.6487) = 1/2.6487 ≈ 0.3775
ℓ⁽¹⁾ = −[1·log(ŷ⁽¹⁾) + 0·log(1 − ŷ⁽¹⁾)] = −log(0.3775) ≈ 0.9741
Sample 2: z⁽²⁾ = 0.5·(−1) + (−0.5)·(−1) + 0 = −0.5 + 0.5 = 0
ŷ⁽²⁾ = σ(0) = 0.5
ℓ⁽²⁾ = −[0·log(0.5) + 1·log(0.5)] = −log(0.5) ≈ 0.6931
Sample 3: z⁽³⁾ = 0.5·2 + (−0.5)·0 + 0 = 1.0
ŷ⁽³⁾ = σ(1) = 1/(1 + e^(−1)) = 1/1.3679 ≈ 0.7311
ℓ⁽³⁾ = −[1·log(0.7311) + 0·log(0.2689)] = −(−0.3133) ≈ 0.3133
Cost: J = (0.9741 + 0.6931 + 0.3133)/3 ≈ 0.6602
BACKWARD PASS: Compute the gradients
dz⁽ⁱ⁾ = ŷ⁽ⁱ⁾ − y⁽ⁱ⁾:
dz⁽¹⁾ = 0.3775 − 1 = −0.6225
dz⁽²⁾ = 0.5 − 0 = +0.5
dz⁽³⁾ = 0.7311 − 1 = −0.2689
∂J/∂w₁ = (1/3) ∑ dz⁽ⁱ⁾ · x₁⁽ⁱ⁾
= (1/3)[(−0.6225)(1) + (0.5)(−1) + (−0.2689)(2)]
= (1/3)[−0.6225 − 0.5 − 0.5378] = (1/3)(−1.6603) ≈ −0.5534
∂J/∂w₂ = (1/3) ∑ dz⁽ⁱ⁾ · x₂⁽ⁱ⁾
= (1/3)[(−0.6225)(2) + (0.5)(−1) + (−0.2689)(0)]
= (1/3)[−1.2450 − 0.5 + 0] = (1/3)(−1.7450) ≈ −0.5817
∂J/∂b = (1/3) ∑ dz⁽ⁱ⁾
= (1/3)[−0.6225 + 0.5 + (−0.2689)] = (1/3)(−0.3914) ≈ −0.1305
GRADIENT DESCENT UPDATE (α = 0.1)
w₁ := 0.5 − 0.1·(−0.5534) = 0.5 + 0.05534 ≈ 0.5553
w₂ := −0.5 − 0.1·(−0.5817) = −0.5 + 0.05817 ≈ −0.4418
b := 0 − 0.1·(−0.1305) = 0 + 0.01305 ≈ 0.0130 (the unrounded gradient is −0.13047)
Both weights moved to better separate the classes! w₁ increased (feature 1 correlates with the positive class), and w₂ became less negative (adjusting its influence).
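The hand computation above can be reproduced in a few lines of NumPy. This is a sketch that uses the same three samples, initial parameters, and learning rate as the worked example; the variable names are illustrative.

```python
import numpy as np

# Columns are the three samples x(1), x(2), x(3) from the worked example
X = np.array([[1.0, -1.0, 2.0],
              [2.0, -1.0, 0.0]])
Y = np.array([[1.0, 0.0, 1.0]])
w = np.array([[0.5], [-0.5]])
b, alpha = 0.0, 0.1
m = X.shape[1]

# Forward pass
z = w.T @ X + b
y_hat = 1.0 / (1.0 + np.exp(-z))
losses = -(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))
J = float(np.mean(losses))

# Backward pass
dz = y_hat - Y
dw = (X @ dz.T) / m
db = float(np.sum(dz)) / m

# One gradient descent step
w_new = w - alpha * dw
b_new = b - alpha * db

print("z      :", np.round(z, 4))
print("y_hat  :", np.round(y_hat, 4))
print("losses :", np.round(losses, 4), "J =", round(J, 4))
print("dw     :", np.round(dw.ravel(), 4), "db =", round(db, 4))
print("w_new  :", np.round(w_new.ravel(), 4), "b_new =", round(b_new, 4))
```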
Example 2: SBI CIBIL Score Loan Approval
Case Study: Loan Default Prediction at State Bank of India
Business Context
SBI processes ~1.5 lakh personal loan applications per month. The first-pass model is a logistic regression that predicts P(default) using features from CIBIL (Credit Information Bureau India Limited) and internal banking data.
Feature Engineering (simplified to 5 key features)
| Feature | Variable | Example Values |
|---|---|---|
| CIBIL Score (normalized) | x₁ | 0.782 (original: 782/1000) |
| Monthly Income (log-scaled, lakhs) | x₂ | ln(1.45) = 0.372 |
| EMI-to-Income Ratio | x₃ | 0.22 (22%) |
| Years of Credit History (normalized) | x₄ | 0.9 (9 years / 10 max) |
| Number of Active Loans (normalized) | x₅ | 0.4 (2 loans / 5 max) |
Model Computation for Anjali's Application
Trained weights: w = [−3.2, −1.5, +2.8, −0.9, +1.4], b = 1.2
Note: negative weights for CIBIL score, income, and credit history mean higher values reduce default risk. Positive weights for EMI ratio and number of loans mean higher values increase default risk.
z = (−3.2)(0.782) + (−1.5)(0.372) + (2.8)(0.22) + (−0.9)(0.9) + (1.4)(0.4) + 1.2
= −2.502 − 0.558 + 0.616 − 0.810 + 0.560 + 1.200
= −1.494
ŷ = σ(−1.494) = 1/(1 + e^1.494) = 1/(1 + 4.456) ≈ 0.183
Interpretation: P(default) = 18.3%. SBI's threshold is 15% for auto-approval, 40% for auto-rejection. Anjali falls in the 15% to 40% band, so her application is sent to a human underwriter for review. Given her strong CIBIL score and income, the underwriter approves. (This simplified five-feature score differs from the 0.04 in the opening story, which drew on 40+ features.)
Weight Interpretability: The Regulator's Favorite
RBI requires explainable credit models. With logistic regression, each weight has a clear meaning:
- w₁ = −3.2 for CIBIL → A 0.1 increase in normalized CIBIL (i.e., 100 points) decreases z by 0.32, reducing default probability
- w₃ = +2.8 for EMI ratio → Higher EMI burden increases default risk
- The odds ratio e^(w_j) gives the multiplicative change in odds per unit change in x_j
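The score, the per-feature contributions, and the odds-ratio reading of a weight can be reproduced as follows. The weights, feature values, and 15%/40% thresholds are the illustrative ones from this example; the `route` helper and feature labels are hypothetical names introduced for this sketch.

```python
import numpy as np

features = ["cibil", "log_income", "emi_ratio", "credit_history", "active_loans"]
x = np.array([0.782, 0.372, 0.22, 0.9, 0.4])
w = np.array([-3.2, -1.5, 2.8, -0.9, 1.4])
b = 1.2

contributions = w * x                  # how much each feature adds to z
z = contributions.sum() + b
p_default = 1.0 / (1.0 + np.exp(-z))

def route(p, approve_below=0.15, reject_from=0.40):
    """Hypothetical routing rule using the thresholds quoted in the example."""
    if p < approve_below:
        return "auto-approve"
    if p < reject_from:
        return "manual review"
    return "auto-reject"

# Odds multiplier for a 0.1 increase in normalized CIBIL: e^(w1 * 0.1) = e^(-0.32)
odds_multiplier = np.exp(w[0] * 0.1)

for name, c in zip(features, contributions):
    print(f"{name:15s} {c:+.3f}")
print(f"z = {z:.3f}, P(default) = {p_default:.3f}, decision = {route(p_default)}")
print(f"odds multiplier per +0.1 CIBIL: {odds_multiplier:.3f}")
```

Because z is a plain sum, each line of the printout is a direct explanation of how one feature pushed the decision, which is the property that makes the model easy to audit.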
Example 3: Google Ads Click-Through Rate (CTR) Prediction
Case Study: CTR Prediction at Google (Mountain View, CA)
Business Context
Google Search Ads generates $175 billion annually (2024). Every time a user types a query, ~500 candidate ads are scored. The core ranking uses a logistic regression variant (FTRL-Proximal) to predict P(click | user, query, ad). This runs 8.5 billion times per day.
Feature Engineering (simplified)
| Feature | Variable | Example |
|---|---|---|
| Query-Ad relevance score | x₁ | 0.85 (cosine similarity of embeddings) |
| Ad historical CTR | x₂ | 0.032 (3.2% past click rate) |
| User engagement score | x₃ | 0.67 (based on past sessions) |
| Position bias (1/position) | x₄ | 0.333 (position 3) |
| Time-of-day feature | x₅ | 0.75 (3 PM, peak shopping) |
Model Computation
Trained weights: w = [2.1, 5.8, 1.3, 3.5, 0.6], b = −4.2
z = (2.1)(0.85) + (5.8)(0.032) + (1.3)(0.67) + (3.5)(0.333) + (0.6)(0.75) + (−4.2)
= 1.785 + 0.186 + 0.871 + 1.166 + 0.450 − 4.200
= 0.257
ŷ = σ(0.257) = 1/(1 + e^(−0.257)) ≈ 0.564
Wait, a 56.4% CTR? That seems high, but this is a simplified example. In practice, Google uses feature crossing, hashing tricks, and the FTRL optimizer with L1 regularization on billions of sparse features. Real CTRs are typically 1 to 5%.
Revenue Impact: Expected CPM
Google's ad auction uses a "second-price" mechanism. The ad's rank = bid × P(click). If a logistic regression that improves P(click) estimation by just 0.1% across billions of impressions lifted revenue by the same 0.1%, that would be about $175 million per year (0.1% of $175 billion).
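The rank = bid × P(click) rule can be sketched with the example's weights for one ad and made-up numbers for two competitors. The bids and the two extra click probabilities below are hypothetical values chosen only to show that the highest bid does not automatically win.

```python
import numpy as np

# Ad A: P(click) from the example's weights and features
w = np.array([2.1, 5.8, 1.3, 3.5, 0.6])
b = -4.2
x = np.array([0.85, 0.032, 0.67, 0.333, 0.75])
z = float(w @ x + b)
p_click_a = 1.0 / (1.0 + np.exp(-z))

# Hypothetical auction: bids in dollars and P(click) for ads A, B, C
ads = ["A", "B", "C"]
bids = np.array([2.00, 1.20, 3.50])
p_clicks = np.array([p_click_a, 0.80, 0.20])   # B and C are made-up values

scores = bids * p_clicks            # rank score = bid x P(click)
ranking = np.argsort(-scores)       # indices sorted from best to worst

for i in ranking:
    print(f"ad {ads[i]}: bid={bids[i]:.2f}, P(click)={p_clicks[i]:.3f}, score={scores[i]:.3f}")
```

Ad C bids the most but ranks last because its predicted click probability is low, which is why the accuracy of the probability estimate matters as much as the bid.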
SBI loan default model:
- Scale: ~1.5L applications/month
- Features: CIBIL, income, EMI ratio
- Output: P(default) ∈ (0,1)
- Constraint: RBI-mandated explainability
- Loss if wrong: NPA (Non-Performing Asset)
- Model type: Classic logistic regression

Google Ads CTR model:
- Scale: 8.5B predictions/day
- Features: Query-ad similarity, user history
- Output: P(click) ∈ (0,1)
- Constraint: <5ms latency per prediction
- Loss if wrong: Lost ad revenue
- Model type: FTRL-Proximal (LR variant)
Python Implementation: From Scratch + Library
6a. From-Scratch NumPy Implementation

    import numpy as np
    import matplotlib.pyplot as plt
    import time


    class LogisticRegression:
        """
        Logistic Regression from scratch: a single-neuron neural network.
        Implements: sigmoid, BCE loss, forward, backward, update, fit, predict.
        """

        def __init__(self, n_features, learning_rate=0.01):
            # Initialize weights and bias to zero (fine here: the BCE cost is convex)
            self.w = np.zeros((n_features, 1))  # shape (n, 1)
            self.b = 0.0
            self.lr = learning_rate
            self.losses = []

        def sigmoid(self, z):
            """Piecewise sigmoid; values are correct for large |z|."""
            return np.where(
                z >= 0,
                1 / (1 + np.exp(-z)),
                np.exp(z) / (1 + np.exp(z))
            )

        def forward(self, X):
            """
            Forward pass: X -> z -> y_hat
            X: shape (n, m), each column is a sample
            Returns: y_hat of shape (1, m)
            """
            self.z = np.dot(self.w.T, X) + self.b   # (1, m)
            self.y_hat = self.sigmoid(self.z)       # (1, m)
            return self.y_hat

        def compute_loss(self, Y):
            """
            Binary Cross-Entropy:
            J = -(1/m) Σ [y·log(ŷ) + (1-y)·log(1-ŷ)]
            Y: shape (1, m)
            """
            m = Y.shape[1]
            epsilon = 1e-8  # prevent log(0)
            cost = -(1/m) * np.sum(
                Y * np.log(self.y_hat + epsilon) +
                (1 - Y) * np.log(1 - self.y_hat + epsilon)
            )
            return np.squeeze(cost)

        def backward(self, X, Y):
            """
            Backward pass: compute gradients ∂J/∂w and ∂J/∂b
            The beautiful result: dz = ŷ - y
            """
            m = Y.shape[1]
            dz = self.y_hat - Y                  # (1, m)
            self.dw = (1/m) * np.dot(X, dz.T)    # (n, 1)
            self.db = (1/m) * np.sum(dz)         # scalar

        def update(self):
            """Gradient descent step: w := w - α·dw, b := b - α·db"""
            self.w -= self.lr * self.dw
            self.b -= self.lr * self.db

        def fit(self, X, Y, epochs=1000, print_every=100):
            """
            Full training loop.
            X: (n, m), Y: (1, m)
            """
            for i in range(epochs):
                # Forward
                self.forward(X)
                # Loss
                loss = self.compute_loss(Y)
                self.losses.append(loss)
                # Backward
                self.backward(X, Y)
                # Update
                self.update()
                # Print
                if i % print_every == 0:
                    print(f"Epoch {i:4d} | Loss: {loss:.6f}")

        def predict(self, X, threshold=0.5):
            """Return binary predictions."""
            y_hat = self.forward(X)
            return (y_hat >= threshold).astype(int)

        def accuracy(self, X, Y):
            """Compute classification accuracy (in percent)."""
            preds = self.predict(X)
            return np.mean(preds == Y) * 100
6b. Synthetic Indian Bank Loan Dataset + Training

    # --- Generate Synthetic SBI Loan Dataset ---
    np.random.seed(42)
    m = 500  # 500 loan applications

    # Feature 1: CIBIL score (normalized to 0-1)
    cibil = np.random.beta(5, 2, m)  # skewed toward higher scores

    # Feature 2: EMI-to-income ratio (0-0.8)
    emi_ratio = np.random.beta(2, 5, m) * 0.8

    # Labels: P(default) is LOW when CIBIL is HIGH and EMI is LOW
    z_true = -4.0 * cibil + 5.0 * emi_ratio + 0.5
    prob_default = 1 / (1 + np.exp(-z_true))
    y = (np.random.rand(m) < prob_default).astype(float)

    # Stack into (n, m) format
    X = np.vstack([cibil, emi_ratio])  # shape (2, 500)
    Y = y.reshape(1, -1)               # shape (1, 500)

    print(f"Dataset: {X.shape[1]} samples, {X.shape[0]} features")
    print(f"Default rate: {Y.mean()*100:.1f}%")

    # --- Train the model ---
    model = LogisticRegression(n_features=2, learning_rate=1.0)
    model.fit(X, Y, epochs=2000, print_every=200)

    print(f"\nFinal Loss: {model.losses[-1]:.6f}")
    print(f"Training Accuracy: {model.accuracy(X, Y):.2f}%")
    print(f"Learned weights: w1={model.w[0,0]:.4f}, w2={model.w[1,0]:.4f}, b={model.b:.4f}")
6c. Plot Loss Curve + Decision Boundary

    # --- Plot 1: Loss Curve ---
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    ax1.plot(model.losses, color='#7c3aed', linewidth=2)
    ax1.set_xlabel('Epoch', fontsize=12)
    ax1.set_ylabel('BCE Loss', fontsize=12)
    ax1.set_title('Training Loss Curve', fontsize=14, fontweight='bold')
    ax1.grid(True, alpha=0.3)

    # --- Plot 2: Decision Boundary ---
    # Create mesh grid
    x1_range = np.linspace(0, 1, 200)
    x2_range = np.linspace(0, 0.8, 200)
    X1, X2 = np.meshgrid(x1_range, x2_range)
    X_mesh = np.vstack([X1.ravel(), X2.ravel()])

    # Predict on mesh
    Z_mesh = model.forward(X_mesh).reshape(X1.shape)

    # Plot predicted P(default) as filled contours, with the 0.5 boundary in black
    contours = ax2.contourf(X1, X2, Z_mesh, levels=50, cmap='RdYlGn_r', alpha=0.8)
    ax2.contour(X1, X2, Z_mesh, levels=[0.5], colors='black', linewidths=2)
    ax2.scatter(X[0], X[1], c=Y.ravel(), cmap='RdYlGn_r',
                edgecolors='black', s=20, alpha=0.6)
    ax2.set_xlabel('CIBIL Score (normalized)', fontsize=12)
    ax2.set_ylabel('EMI-to-Income Ratio', fontsize=12)
    ax2.set_title('Decision Boundary: SBI Loan Default', fontsize=14, fontweight='bold')
    # Attach the colorbar to the contour plot so it really shows P(default)
    plt.colorbar(contours, ax=ax2, label='P(Default)')

    plt.tight_layout()
    plt.savefig('ch03_logistic_regression_plots.png', dpi=150)
    plt.show()
6d. Vectorized vs Non-Vectorized: Timing Comparison

    # --- Non-vectorized (loop-based) gradient computation ---
    def compute_gradients_loop(X, Y, w, b):
        """Compute gradients using explicit Python loops. SLOW!"""
        n, m = X.shape
        dw = np.zeros((n, 1))
        db = 0.0
        for i in range(m):
            x_i = X[:, i].reshape(-1, 1)          # (n, 1)
            z_i = np.dot(w.T, x_i).item() + b     # scalar
            y_hat_i = 1 / (1 + np.exp(-z_i))      # scalar
            dz_i = y_hat_i - Y[0, i]              # scalar
            dw += x_i * dz_i
            db += dz_i
        dw /= m
        db /= m
        return dw, db

    # --- Vectorized gradient computation ---
    def compute_gradients_vectorized(X, Y, w, b):
        """Compute gradients using NumPy matrix operations. FAST!"""
        m = X.shape[1]
        z = np.dot(w.T, X) + b           # (1, m)
        y_hat = 1 / (1 + np.exp(-z))     # (1, m)
        dz = y_hat - Y                   # (1, m)
        dw = (1/m) * np.dot(X, dz.T)     # (n, 1)
        db = (1/m) * np.sum(dz)          # scalar
        return dw, db

    # --- Timing Benchmark ---
    m_test = 50000
    n_test = 100
    X_test = np.random.randn(n_test, m_test)
    Y_test = np.random.randint(0, 2, (1, m_test)).astype(float)
    w_test = np.random.randn(n_test, 1) * 0.01
    b_test = 0.0

    # Time the loop version (perf_counter is the recommended clock for benchmarks)
    t1 = time.perf_counter()
    dw_loop, db_loop = compute_gradients_loop(X_test, Y_test, w_test, b_test)
    t_loop = time.perf_counter() - t1

    # Time the vectorized version
    t2 = time.perf_counter()
    dw_vec, db_vec = compute_gradients_vectorized(X_test, Y_test, w_test, b_test)
    t_vec = time.perf_counter() - t2

    print(f"Loop-based: {t_loop*1000:.1f} ms")
    print(f"Vectorized: {t_vec*1000:.1f} ms")
    print(f"Speedup: {t_loop/t_vec:.0f}x faster!")
    print(f"Results match: {np.allclose(dw_loop, dw_vec)}")
6e. PyTorch Library Version (for comparison)

    import torch
    import torch.nn as nn

    # Convert data to PyTorch tensors
    X_pt = torch.tensor(X.T, dtype=torch.float32)  # (m, n)
    Y_pt = torch.tensor(Y.T, dtype=torch.float32)  # (m, 1)

    # Define the model: logistic regression = 1 Linear layer + Sigmoid
    model_pt = nn.Sequential(
        nn.Linear(2, 1),   # z = wᵀx + b
        nn.Sigmoid()       # ŷ = σ(z)
    )

    # BCE Loss + SGD optimizer
    criterion = nn.BCELoss()
    optimizer = torch.optim.SGD(model_pt.parameters(), lr=1.0)

    # Training loop
    for epoch in range(2000):
        y_pred = model_pt(X_pt)
        loss = criterion(y_pred, Y_pt)
        optimizer.zero_grad()
        loss.backward()    # autograd computes gradients automatically!
        optimizer.step()
        if epoch % 500 == 0:
            print(f"Epoch {epoch:4d} | Loss: {loss.item():.6f}")

    # Compare: PyTorch did the same forward-backward-update loop,
    # but computed gradients automatically via loss.backward().
    # Our from-scratch version computed them manually. Same math!
Spot the bug in this training step:

    def train_step(X, Y, w, b, lr):
        m = X.shape[1]
        z = np.dot(w.T, X) + b
        y_hat = 1 / (1 + np.exp(-z))
        loss = -(1/m) * np.sum(Y * np.log(y_hat) + (1-Y) * np.log(1-y_hat))
        dz = y_hat - Y
        dw = (1/m) * np.dot(X, dz.T)
        db = (1/m) * np.sum(dz)
        w = w + lr * dw  # <-- BUG IS HERE
        b = b + lr * db  # <-- AND HERE
        return w, b, loss

Fix: the updates must be w = w - lr * dw and b = b - lr * db (subtraction, not addition!). Gradient descent moves against the gradient. Adding the gradient performs gradient ascent, which maximizes the loss instead of minimizing it, so the loss grows with every step.
Visual Aids: ASCII Diagrams
Sigmoid Function Shape
BCE Loss Behavior
Full Forward-Backward Dataflow
Common Misconceptions
TRUTH: Logistic regression is a classification algorithm. It predicts probabilities P(y=1|x) ∈ (0,1) and classifies by thresholding. The "logistic" in the name comes from the logistic (sigmoid) function; the "regression" refers to fitting a linear model to the log-odds, not to predicting a continuous target.
WHY IT MATTERS: In GATE and interviews, this is a trick question. Know the distinction: linear regression predicts continuous values; logistic regression predicts probabilities for discrete classes.
TRUTH: σ′(z) = 0.25 only at z = 0. For |z| > 5, σ′(z) ≈ 0 (vanishing gradient). The derivative lies in (0, 0.25], reaching its maximum at z = 0.
WHY IT MATTERS: This vanishing gradient problem is why sigmoid is rarely used as a hidden layer activation in deep networks. ReLU and its variants mitigate it.
TRUTH: MSE + sigmoid creates a non-convex loss surface that can have local minima and flat plateaus. BCE + sigmoid is convex, so every local minimum is global. Prefer BCE for binary classification.
WHY IT MATTERS: Using the wrong loss function means gradient descent may get stuck in local minima or converge to a suboptimal solution. The MLE-derived BCE is the principled choice of loss.
TRUTH: If α is too large, gradient descent overshoots the minimum and the loss oscillates or diverges. If α is too small, training is painfully slow. The optimal α depends on the loss surface curvature.
WHY IT MATTERS: Learning rate is often the most important hyperparameter. In practice, use learning rate schedules or adaptive optimizers (Adam) to reduce the sensitivity to this choice.
TRUTH: Sigmoid outputs are probabilities only if the model is calibrated. An uncalibrated model might output 0.7 for cases that are actually positive 90% of the time. Calibration (Platt scaling, isotonic regression) is needed to make outputs match true frequencies.
WHY IT MATTERS: In medical diagnosis and lending, miscalibrated probabilities lead to biased decisions. SBI's risk models require calibration validation by RBI auditors.
GATE / Exam Corner
Key Formulas: Quick Reference Sheet
| Concept | Formula | Notes |
|---|---|---|
| Sigmoid | σ(z) = 1/(1+e^(−z)) | Maps ℝ → (0,1) |
| σ(0) | 0.5 | Decision boundary point |
| Symmetry | σ(−z) = 1 − σ(z) | Symmetric about (0, 0.5) |
| Derivative | σ′(z) = σ(z)(1−σ(z)) | Max = 0.25 at z=0 |
| BCE Loss | ℓ = −[y·log ŷ + (1−y)·log(1−ŷ)] | Derived from MLE |
| Cost | J = (1/m)∑ℓ⁽ⁱ⁾ | Average over m samples |
| Key gradient | ∂ℓ/∂z = ŷ − y | Prediction error! |
| Weight gradient | ∂J/∂w = (1/m)X(Ŷ−Y)ᵀ | Vectorized form |
| GD Update | w := w − α·∂J/∂w | Learning rate α |
GATE Previous Year Questions (PYQs) Pattern
Q1. If σ(z) = 1/(1+e^(−z)) is the sigmoid function, which of the following is the derivative σ′(z)?
- σ(z)
- 1 − σ(z)
- σ(z) · (1 − σ(z))
- σ(z) + (1 − σ(z))
Q2. The binary cross-entropy loss for a single sample with true label y=1 and predicted probability ŷ=0.2 is:
- −log(0.8) ≈ 0.2231
- −log(0.2) ≈ 1.6094
- (0.2 − 1)² = 0.64
- 0.2 × log(0.2) ≈ −0.3219
Q3. In logistic regression with gradient descent, the gradient ∂ℓ/∂z for a single sample simplifies to:
- σ(z) · (1 − σ(z))
- ŷ − y
- (ŷ − y)²
- y − ŷ
GATE Prediction Table: What to Expect
| Topic | Likelihood | Marks | Type |
|---|---|---|---|
| Sigmoid derivative | ⭐⭐⭐⭐⭐ | 1-2 | MCQ |
| BCE formula application | ⭐⭐⭐⭐ | 2 | NAT |
| Gradient descent update | ⭐⭐⭐⭐ | 2 | MCQ/NAT |
| Sigmoid properties (symmetry, limits) | ⭐⭐⭐ | 1 | MCQ |
| MLE → BCE derivation | ⭐⭐ | 2 | MCQ |
| Convexity of BCE vs MSE | ⭐⭐ | 1 | MCQ |
Interview Prep: India + US Focus
Conceptual Questions
Q1: Why is logistic regression considered a neural network?
Logistic regression has the same architecture as a neural network with 0 hidden layers: input features → linear transformation z = wᵀx + b → sigmoid activation ŷ = σ(z). It uses the same training procedure: forward pass to compute predictions, BCE loss to measure error, backward pass (chain rule) to compute gradients, and gradient descent to update parameters. The only difference from a deep network is depth: logistic regression has depth 1. This is why Andrew Ng introduces neural networks through logistic regression in his courses.
Q2: Derive the gradient ∂ℓ/∂w for logistic regression from first principles.
Chain rule: ∂ℓ/∂w = (∂ℓ/∂ŷ)(∂ŷ/∂z)(∂z/∂w). Compute each: ∂ℓ/∂ŷ = −y/ŷ + (1−y)/(1−ŷ). ∂ŷ/∂z = σ(z)(1−σ(z)) = ŷ(1−ŷ). ∂z/∂w = x. Multiply: [−y/ŷ + (1−y)/(1−ŷ)] · ŷ(1−ŷ) · x = (ŷ−y)·x. For m samples, average: ∂J/∂w = (1/m)∑(ŷ⁽ⁱ⁾−y⁽ⁱ⁾)x⁽ⁱ⁾.
Interviewer follow-up: "Is it a coincidence that ∂ℓ/∂z = ŷ−y is so simple?" No. It's a consequence of the logit (whose inverse is the sigmoid) being the canonical link function for the Bernoulli distribution in generalized linear models.
Q3: When would you use logistic regression over a deep neural network?
When you need interpretability (RBI regulation for credit scoring), when you have limited labeled data (<1000 samples, where deep networks tend to overfit), when you need fast inference (SBI YONO processes loans in 12s), or as a baseline before trying complex models. At Flipkart, logistic regression with hand-crafted features is still the first model in every recommender pipeline.
US Context (Google, Meta, Netflix)
At Google Ads, logistic regression (FTRL variant) handles billions of sparse features via online learning, at a training cost on streaming data that deep networks struggle to match. At Netflix, logistic regression serves as the calibration layer on top of deep network scores. At Meta, it's the production baseline that any new model must beat.
Coding Interview Questions
Coding Q1: Implement sigmoid without using np.exp

    # Hint: Use the identity σ(z) = 0.5 * (1 + tanh(z/2))
    def sigmoid_via_tanh(z):
        return 0.5 * (1 + np.tanh(z / 2))

This works because tanh(z) = 2σ(2z) − 1, so σ(z) = (1 + tanh(z/2))/2. It is also numerically stable, since np.tanh saturates to ±1 for large |z| instead of overflowing.
Coding Q2: Implement BCE loss in one line, handling log(0)

    def bce_loss(y, y_hat, eps=1e-7):
        y_hat = np.clip(y_hat, eps, 1 - eps)  # prevent log(0)
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

Key detail: np.clip is essential. Without it, np.log(0) = -inf, and the loss becomes inf or NaN (for example, 0 * -inf is NaN). Production implementations typically either clip predictions before taking the log or compute the loss directly from the logits.
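Both interview answers can be cross-checked against alternative formulations. The sketch below verifies the tanh identity against the direct sigmoid and compares the clipped BCE with a logit-space version that uses the identity ℓ = log(1 + e^z) − y·z; the function name `bce_from_logits` and the test inputs are assumptions made for this illustration.

```python
import numpy as np

def sigmoid_via_tanh(z):
    return 0.5 * (1.0 + np.tanh(z / 2.0))

def bce_clipped(y, y_hat, eps=1e-7):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # prevent log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def bce_from_logits(y, z):
    """BCE computed from z directly: log(1 + e^z) - y*z.

    np.logaddexp(0, z) evaluates log(e^0 + e^z) without overflow.
    """
    return np.mean(np.logaddexp(0.0, z) - y * z)

# Illustrative inputs, including logits large enough to saturate the sigmoid
z = np.array([-30.0, -2.0, 0.0, 1.5, 30.0])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0])

direct = 1.0 / (1.0 + np.exp(-z))
identity_ok = np.allclose(sigmoid_via_tanh(z), direct)

loss_clipped = bce_clipped(y, sigmoid_via_tanh(z))
loss_logits = bce_from_logits(y, z)

print("tanh identity holds:", identity_ok)
print("clipped BCE :", loss_clipped)
print("logit BCE   :", loss_logits)
```

The two losses agree to within the small bias introduced by clipping. The logit-space form avoids that bias and never takes the log of a saturated probability, which is one reason libraries often expose a "with logits" variant of the loss.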
India focus:
- GATE-style derivation questions (sigmoid derivative, BCE from MLE)
- Explain logistic regression to a non-technical banking manager
- Feature engineering for Indian datasets (CIBIL, Aadhaar, UPI)
- Companies: Flipkart, Ola, Razorpay, Paytm, SBI, HDFC
- Common: "Implement from scratch in NumPy"

US focus:
- System design: "Design a CTR prediction system at scale"
- Trade-offs: logistic reg vs. deep learning for production
- Online learning: how to update LR on streaming data
- Companies: Google, Meta, Netflix, Uber, Airbnb
- Common: "When would you NOT use deep learning?"
Hands-On Lab: SBI Loan Default Predictor
🧪 Mini-Project: Build a Complete Loan Default Prediction System
Project Specification
Build a logistic regression model from scratch (no sklearn) that predicts whether an Indian bank loan applicant will default, using synthetic CIBIL-style data.
Requirements
- Generate a synthetic dataset with 1000 samples and 5 features (CIBIL score, income, EMI ratio, credit history, active loans)
- Implement the LogisticRegression class with all 8 methods (sigmoid, forward, compute_loss, backward, update, fit, predict, accuracy)
- Split data 80/20 into train/test sets
- Train for 3000 epochs and plot the loss curve
- Plot the decision boundary (use the 2 most important features)
- Report training and test accuracy
- Compare execution time: vectorized vs non-vectorized for 10,000 samples
- Bonus: Implement learning rate decay (α decreases over epochs)
| Component | Points | Criteria |
|---|---|---|
| Correct LogisticRegression class | 30 | All 8 methods implemented correctly |
| Dataset generation | 10 | Realistic feature distributions |
| Training + convergence | 15 | Loss decreases monotonically, >70% accuracy |
| Loss curve plot | 10 | Clean, labeled matplotlib figure |
| Decision boundary plot | 15 | Shows data points, boundary line, color regions |
| Vectorized vs loop timing | 10 | Timing benchmark with >100× speedup shown |
| Code quality + comments | 10 | Clear variable names, docstrings, step-by-step comments |
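For the timing requirement, one possible shape for the benchmark is sketched below. It is a starting point under stated assumptions, not a reference solution: the helper names (`sigmoid_scalar`, `grad_loop`, `grad_vectorized`), the random data, and the seed are all invented here, and the measured speedup will vary by machine.

```python
import math
import time
import numpy as np

def sigmoid_scalar(z):
    """Numerically stable sigmoid for a single float."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def grad_loop(X, Y, w, b):
    """Gradient of the average BCE using explicit Python loops."""
    n, m = X.shape
    dw = np.zeros((n, 1))
    db = 0.0
    for i in range(m):
        z = b
        for j in range(n):
            z += w[j, 0] * X[j, i]
        dz = sigmoid_scalar(z) - Y[0, i]
        for j in range(n):
            dw[j, 0] += X[j, i] * dz
        db += dz
    return dw / m, db / m

def grad_vectorized(X, Y, w, b):
    """Same gradient with no Python loops."""
    m = X.shape[1]
    z = w.T @ X + b
    e = np.exp(-np.abs(z))  # exponent is never positive, so no overflow
    a = np.where(z >= 0, 1 / (1 + e), e / (1 + e))
    dz = a - Y
    return (X @ dz.T) / m, float(np.sum(dz)) / m

rng = np.random.default_rng(0)
n, m = 5, 10_000                       # sizes taken from the lab specification
X = rng.normal(size=(n, m))
Y = (rng.random((1, m)) < 0.5).astype(float)
w = 0.01 * rng.normal(size=(n, 1))
b = 0.0

t0 = time.perf_counter()
dw_loop, db_loop = grad_loop(X, Y, w, b)
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
dw_vec, db_vec = grad_vectorized(X, Y, w, b)
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.4f}s, vectorized: {t_vec:.6f}s")
print("gradients match:", np.allclose(dw_loop, dw_vec))
```

Checking that both versions return the same gradient before comparing times guards against benchmarking a fast but wrong implementation.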
Exercises: 22 Problems Across 5 Categories
Section A: Conceptual Questions (5)
State the five key properties of the sigmoid function σ(z) = 1/(1+e^(−z)).
Explain in 2-3 sentences why the perceptron (step function) cannot be trained using gradient descent.
Why do we derive the BCE loss from MLE rather than just using MSE for classification?
What is the difference between the "loss" ℓ and the "cost" J in logistic regression?
In the computation graph, explain why ∂ℓ/∂z = ŷ − y simplifies so beautifully. Is this a coincidence?
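Before arguing why the last identity holds, it can help to confirm numerically that it does. By the chain rule, ∂ℓ/∂z is the product of ∂ℓ/∂ŷ = −y/ŷ + (1−y)/(1−ŷ) and ∂ŷ/∂z = ŷ(1−ŷ), which multiplies out to ŷ − y. The sketch below compares that closed form against a central finite difference. The function names, test points, and step size h are arbitrary choices made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_from_z(z, y):
    """BCE loss for one sample, written as a function of the pre-activation z."""
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def numeric_dl_dz(z, y, h=1e-5):
    """Central finite-difference approximation of dl/dz."""
    return (loss_from_z(z + h, y) - loss_from_z(z - h, y)) / (2 * h)

gaps = []
for z, y in [(-2.0, 1.0), (0.3, 0.0), (1.5, 1.0)]:
    analytic = sigmoid(z) - y       # the claimed closed form: y_hat - y
    numeric = numeric_dl_dz(z, y)
    gaps.append(abs(analytic - numeric))
    print(f"z={z:+.1f}, y={y:.0f}: analytic={analytic:+.6f}, numeric={numeric:+.6f}")
```

The same finite-difference trick works as a general gradient check for any backward pass you write by hand.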
Section B: Mathematical Problems (8)
Compute σ(3) and σ(−3). Verify that σ(3) + σ(−3) = 1.
Prove that the sigmoid derivative can be written as: σ′(z) = σ(z) · σ(−z).
For a single sample with x = [3, 1]ᵀ, y = 1, w = [0.2, −0.1]ᵀ, b = 0.1, compute: (a) z, (b) ŷ, (c) ℓ, (d) ∂ℓ/∂w₁, (e) ∂ℓ/∂w₂, (f) ∂ℓ/∂b.
If σ(z) = 0.8, compute σ′(z) without finding z first.
Derive the Bernoulli likelihood for a dataset with y = [1, 0, 1, 1, 0] and ŷ = [0.9, 0.3, 0.7, 0.8, 0.1]. Compute the log-likelihood and the BCE loss.
Show that the Hessian ∂²J/∂w∂wᵀ for logistic regression is positive semi-definite, proving that J is convex.
Derive the gradient descent update rule for logistic regression with L2 regularization: J_reg = J + (λ/2m)||w||².
Show that σ(z) can be expressed as: σ(z) = ½ + ½·tanh(z/2). Verify for z = 0, z = 2.
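Once you have worked these by hand, a few lines of NumPy can confirm the arithmetic. The sketch below is only a checker: it covers the single-sample problem (x = [3, 1], w = [0.2, −0.1], b = 0.1, y = 1) and three of the identities, using variable names chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single-sample forward and backward pass
x = np.array([3.0, 1.0])
w = np.array([0.2, -0.1])
b, y = 0.1, 1.0

z = w @ x + b                                            # (a)
y_hat = sigmoid(z)                                       # (b)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # (c)
dz = y_hat - y               # dl/dz
dw = dz * x                  # (d), (e): dl/dw1 and dl/dw2
db = dz                      # (f)
print(f"z={z:.4f}  y_hat={y_hat:.6f}  loss={loss:.6f}")
print(f"dw={dw}  db={db:.6f}")

# Identity checks
symmetry_sum = sigmoid(3.0) + sigmoid(-3.0)              # should equal 1
deriv_at_08 = 0.8 * (1 - 0.8)                            # sigma' when sigma = 0.8
tanh_gap = abs(sigmoid(2.0) - (0.5 + 0.5 * np.tanh(1.0)))  # tanh identity at z = 2
print(symmetry_sum, deriv_at_08, tanh_gap)
```

Compare the printed values with your hand-computed answers; a mismatch usually points to a sign error in the chain rule.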
Section C: Coding Problems (4)
Implement a function predict_proba(X, w, b) that takes feature matrix X (shape n×m), weights w (shape n×1), and bias b, and returns predicted probabilities using the numerically stable sigmoid.

    def predict_proba(X, w, b):
        z = w.T @ X + b
        # np.where evaluates both branches, so exponentiate only -|z| to avoid overflow
        e = np.exp(-np.abs(z))
        return np.where(z >= 0, 1 / (1 + e), e / (1 + e))

Write a function that computes the gradient ∂J/∂w and ∂J/∂b for m samples in a fully vectorized way (no loops). Verify against the loop version for correctness.

    def grad(X, Y, w, b):
        m = X.shape[1]
        z = w.T @ X + b
        a = sigmoid(z)              # assumes a sigmoid function is in scope
        dz = a - Y
        dw = (1 / m) * X @ dz.T
        db = (1 / m) * np.sum(dz)
        return dw, db

Extend the LogisticRegression class to support mini-batch gradient descent. Add a batch_size parameter to fit() and randomly sample batches each epoch.
Hint: idx = np.random.choice(m, batch_size, replace=False); use X[:, idx] and Y[:, idx] for forward/backward. Each update then costs less compute and memory than a full-batch step. The added stochasticity can help non-convex models escape poor local optima; for logistic regression, where J is convex, there are no such optima to escape, so the benefit is cheaper updates.
Implement a plot_decision_boundary(model, X, Y) function that creates a meshgrid, computes predictions over it, and plots the decision boundary with a contour plot overlaid with data points.
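As a rough illustration of the mini-batch idea, here is a standalone sketch (a function rather than a method of the chapter's class). It uses a common variation of the hint: instead of drawing one random batch, it reshuffles the indices each epoch and walks through them in slices. The name `fit_minibatch`, the hyperparameters, and the toy data are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    e = np.exp(-np.abs(z))  # exponent is never positive, so no overflow
    return np.where(z >= 0, 1 / (1 + e), e / (1 + e))

def fit_minibatch(X, Y, lr=0.1, epochs=200, batch_size=32, seed=0):
    """Mini-batch gradient descent for logistic regression.

    X has shape (n, m) and Y has shape (1, m), matching the chapter's layout.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(m)                 # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]  # last batch may be smaller
            Xb, Yb = X[:, idx], Y[:, idx]
            dz = sigmoid(w.T @ Xb + b) - Yb        # shape (1, batch)
            w -= lr * (Xb @ dz.T) / idx.size
            b -= lr * float(np.sum(dz)) / idx.size
    return w, b

# Toy, linearly separable data just to exercise the function
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 400))
Y = (X[0:1] + X[1:2] > 0).astype(float)

w, b = fit_minibatch(X, Y)
acc = np.mean((sigmoid(w.T @ X + b) >= 0.5) == (Y == 1))
print(f"training accuracy: {acc:.3f}")
```

Dividing by `idx.size` rather than `batch_size` keeps the gradient an honest average when the final batch of an epoch is short.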
Section D: Critical Thinking (3)
SBI uses logistic regression for loan approval because RBI mandates explainable models. But a gradient boosted tree (XGBoost) gives 5% higher accuracy. As the chief data scientist, how would you argue for or against switching?
Google's ad CTR model must make predictions in <5ms. A logistic regression with 1 billion sparse features takes 2ms. A transformer model with 340M parameters gives 3% better CTR prediction but takes 50ms. Which would you deploy and why?
The maximum value of σ′(z) is 0.25. In a deep network with L sigmoid layers, the gradient flowing back through all layers is multiplied by σ′ at each layer. What is the maximum possible gradient magnitude at the first layer for L = 10? L = 50? What does this imply?
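The arithmetic behind this question takes two lines to check. The sketch assumes, as the question does, that only the σ′ factors matter; in a real network the weight matrices also scale the gradient, so treat the numbers as a bound on the activation contribution alone.

```python
MAX_SIGMOID_GRAD = 0.25  # maximum of sigma'(z), attained at z = 0

# Upper bound on the product of L sigmoid derivatives
bounds = {L: MAX_SIGMOID_GRAD ** L for L in (10, 50)}

for L, bound in bounds.items():
    print(f"L = {L:2d}: product of sigmoid derivatives <= {bound:.3e}")
```

The bounds come out near 9.5e-7 for 10 layers and 7.9e-31 for 50, which is the vanishing-gradient problem in miniature.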
★ Section E: Starred Research Problems (2)
Read McMahan et al. (2013), "Ad Click Prediction: a View from the Trenches" (Google). This paper describes the FTRL-Proximal optimizer used for training logistic regression on billions of features. Summarize: (a) Why does standard gradient descent fail at Google's scale? (b) How does FTRL-Proximal achieve sparsity? (c) What is the per-coordinate learning rate schedule?
The sigmoid function σ(z) = 1/(1+e^(−z)) is one of many possible S-shaped functions. Research and compare at least three alternatives: (a) tanh, (b) probit (Φ(z), the CDF of the standard normal), (c) algebraic sigmoid z/(1+|z|). For each, derive the range, compute the derivative, and discuss when you'd prefer it over the standard sigmoid.
Connections: Where This Fits
Chapter Connections Map
- Ch 0 (Course Overview): The bird's-eye view of what neural networks are
- Ch 2 (Math Toolkit): Derivatives, chain rule, vectors, matrices, all used in gradient derivations
- Ch 4 (Shallow Neural Networks): Stack multiple logistic regression neurons into layers → multi-layer perceptron
- Ch 5 (Deep Neural Networks): The forward-backward-update loop scales directly to L layers
- Ch 6 (Optimization): Learning rate, momentum, Adam all build on gradient descent from this chapter
- Neural Tangent Kernels (Jacot et al., 2018): At infinite width, neural networks trained by gradient descent behave like linear models (such as logistic regression) in a kernel feature space
- Calibration (Guo et al., 2017): Modern deep networks are poorly calibrated: their output probabilities don't match true probabilities. Temperature scaling (a one-parameter variant of Platt scaling, itself a post-hoc logistic regression!) largely fixes this
- sklearn.linear_model.LogisticRegression: Uses LBFGS or liblinear solvers (faster than GD for convex problems)
- Google FTRL-Proximal: Online logistic regression at billion-feature scale
- Facebook/Meta DLRM: Deep Learning Recommendation Model uses logistic regression as its final output layer
Chapter Summary: 7 Key Takeaways
What You Learned in Chapter 3
- From Step to Smooth: The perceptron's step function has zero derivative everywhere except at the jump (where it is undefined), blocking gradient-based learning. The sigmoid σ(z) = 1/(1+e^(−z)) is smooth, differentiable, and maps ℝ → (0,1), enabling gradient descent.
- Sigmoid is Self-Referential: Its derivative σ′(z) = σ(z)(1−σ(z)) is expressed entirely in terms of the function itself, so the backward pass can reuse the forward-pass output. Maximum derivative is 0.25 at z=0; it vanishes for |z| ≫ 0.
- Logistic Regression IS a Neural Network: A single neuron with n inputs, sigmoid activation, trained with BCE loss and gradient descent. Every concept from this chapter (forward pass, loss, backprop, GD update) scales directly to deep networks.
- BCE from First Principles: Bernoulli → Likelihood → Log-Likelihood → Negate → Average = Binary Cross-Entropy. From a statistical standpoint this is the natural loss for binary classification: minimizing it is exactly maximum likelihood under a Bernoulli model.
- The Beautiful Gradient: ∂ℓ/∂z = ŷ − y. The gradient of the loss with respect to the pre-activation is simply the prediction error. This is not a coincidence; it's a property of canonical link functions in GLMs.
- Vectorize Everything: Loop-based gradient computation is ~1500× slower than vectorized NumPy operations. In production ML, vectorization is not optional.
- Real-World Impact: SBI uses logistic regression for ₹10L+ loan decisions (12-second approvals). Google uses it for $175B/year in ad revenue (8.5B daily predictions). The simplest neural network is also the most widely deployed.
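The takeaways above compress into one short training loop: forward, loss, backward, update. The sketch below is a bare-bones version on made-up data, not the chapter's LogisticRegression class. The function name `train`, the learning rate, the epoch count, and the synthetic dataset are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    e = np.exp(-np.abs(z))  # exponent is never positive, so no overflow
    return np.where(z >= 0, 1 / (1 + e), e / (1 + e))

def train(X, Y, lr=0.5, epochs=500, eps=1e-7):
    """Full-batch gradient descent. X is (n, m), Y is (1, m)."""
    n, m = X.shape
    w, b = np.zeros((n, 1)), 0.0
    losses = []
    for _ in range(epochs):
        # 1. Forward: linear combination, then sigmoid
        A = sigmoid(w.T @ X + b)
        # 2. Loss: average BCE, with clipping to keep the logs finite
        A_safe = np.clip(A, eps, 1 - eps)
        bce = -np.mean(Y * np.log(A_safe) + (1 - Y) * np.log(1 - A_safe))
        losses.append(float(bce))
        # 3. Backward: dl/dz = y_hat - y, then average over samples
        dz = A - Y
        dw = (X @ dz.T) / m
        db = float(np.sum(dz)) / m
        # 4. Update: step against the gradient
        w -= lr * dw
        b -= lr * db
    return w, b, losses

# Synthetic, noisy two-feature data purely for demonstration
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 300))
noise = 0.5 * rng.normal(size=(1, 300))
Y = (2.0 * X[0:1] - X[1:2] + noise > 0).astype(float)

w, b, losses = train(X, Y)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

With zero initial weights every prediction starts at 0.5, so the first recorded loss is ln 2 ≈ 0.693; anything the loop achieves below that is what the model has learned.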
Key Equation
Key Intuition
🧠 Logistic regression is a single neuron that converts a linear combination of inputs into a probability, learns by comparing its predictions to truth (BCE), and adjusts its weights to reduce errors (gradient descent). Every deep neural network is just many of these neurons, connected in layers, trained with the same four-step loop: forward → loss → backward → update.
Further Reading
- NPTEL: "Deep Learning" by Prof. Mitesh Khapra (IIT Madras). Lectures 5-7 cover logistic regression and gradient descent with excellent visualizations
- NPTEL: "Machine Learning" by Prof. Balaji Srinivasan (IIT Madras). Statistical perspective on logistic regression and MLE
- GATE PYQ Book: Made Easy / ACE Academy ML sections. Sigmoid and BCE derivation questions from 2018-2025
- IISc Bangalore: PRML course notes by Prof. Chiranjib Bhattacharyya. Rigorous mathematical treatment
- Andrew Ng (Coursera): "Neural Networks and Deep Learning" Week 2. The course that inspired this chapter's structure
- 3Blue1Brown: "But what is a Neural Network?" Outstanding visualization of the computation graph
- Distill.pub: "A Visual Introduction to Machine Learning". Interactive sigmoid and loss visualizations
- Paper: McMahan et al. (2013), "Ad Click Prediction: a View from the Trenches". Google's FTRL-Proximal paper
- Book: Bishop, "Pattern Recognition and Machine Learning". Chapter 4 on linear models for classification
- Paper: Guo et al. (2017), "On Calibration of Modern Neural Networks", ICML. Why sigmoid outputs ≠ calibrated probabilities
End of Chapter 3
Next: Chapter 4, Shallow Neural Networks. What happens when you stack multiple neurons into layers?