Neural Networks & Deep Learning

Chapter 3: Logistic Regression as a Neural Network

Where a single neuron learns to make decisions, and you learn to train it

Reading Time: ~3 hours  |  Unit 1: The Neuron Era  |  Theory + Code

Prerequisites: Chapter 0 (Course Overview) & Chapter 2 (Math Toolkit)

Bloom's Taxonomy Progression

Bloom's Level | What You'll Achieve
Remember | Recall the sigmoid function σ(z) = 1/(1 + e^(−z)), its range (0, 1), the key value σ(0) = 0.5, and the BCE loss formula
Understand | Explain why binary cross-entropy is derived from Maximum Likelihood and why sigmoid makes the perceptron differentiable
Apply | Implement a complete LogisticRegression class from scratch in NumPy with forward, backward, and update methods
Analyze | Trace the full computation graph (forward pass and backward pass), computing every intermediate gradient by hand
Evaluate | Compare vectorized vs loop-based gradient descent, evaluate numerical stability issues, assess convergence behavior
Create | Design and train a loan-default predictor for an Indian banking dataset and plot the decision boundary
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Explain why the step-function perceptron needs to be replaced with a smooth, differentiable activation for learning via gradient descent
  • Derive the sigmoid function σ(z) = 1/(1 + e^(−z)) and prove all five key properties: range, midpoint, symmetry, limits, and derivative σ′ = σ(1 − σ)
  • Recognize logistic regression as a single-neuron neural network and draw its computational architecture
  • Derive Binary Cross-Entropy from scratch: Bernoulli → Likelihood → Log-Likelihood → Negate → BCE
  • Compute the cost function J(w,b) = average BCE over m training examples
  • Derive gradient descent update rules ∂J/∂w and ∂J/∂b using the chain rule on the computation graph
  • Implement a complete LogisticRegression class from scratch with sigmoid, loss, forward, backward, update, fit, predict, and accuracy methods
  • Train the model on a synthetic Indian bank loan dataset, plot the loss curve and decision boundary
  • Compare vectorized (NumPy) vs loop-based implementations with timing benchmarks
  • Execute a full forward + backward pass by hand on a worked example with 2 features and 3 samples
Section 2

Opening Hook: The 12-Second Loan Decision

"Approved" or "Rejected": in the time it takes to blink twice.

It's a monsoon Tuesday in Mumbai. Anjali, a 31-year-old chartered accountant, opens the SBI YONO app and applies for a ₹10,00,000 home loan top-up. Within 12 seconds, the app responds: "Congratulations! Your pre-approved offer is ready."

In those 12 seconds, no human banker reviewed Anjali's file. A machine learning model (at its mathematical core, a logistic regression) consumed her CIBIL score (782), monthly salary (₹1,45,000), existing EMI-to-income ratio (22%), years of credit history (9), number of active loans (2), and 40+ other features. It output a single number: 0.04, the probability that Anjali will default.

Now imagine the same system at Google's ad servers in Mountain View, California. Every time you search "best laptop under $1000," Google runs a logistic regression (among other models) on each of 500 candidate ads and predicts: what is the probability that you will click this ad? That probability, the click-through rate (CTR), determines which ads you see. Google runs this computation 8.5 billion times per day.

Both systems, SBI's loan engine and Google's ad ranker, solve the same mathematical problem: given input features x, estimate P(y=1|x). The tool they use is the simplest possible neural network: a single neuron with a sigmoid activation. That's logistic regression. In this chapter, you will build it from first principles.

SBI · CIBIL · Google Ads · YONO · From Scratch
The word "logistic" has nothing to do with "logistics" (shipping). It comes from Belgian mathematician Pierre François Verhulst, who in 1845 coined courbe logistique for S-shaped population growth curves. The same S-curve now predicts your loan approval probability. Verhulst never imagined his equation would process ₹40 lakh crore in Indian lending decisions.
Section 3

The Intuition First: From Step to Smooth

The Dimmer Switch Analogy

In Chapter 2 (your math toolkit), you saw that a perceptron uses a step function: output 1 if the weighted sum exceeds a threshold, output 0 otherwise. Think of it as a light switch: it's either ON or OFF, nothing in between.

But here's the problem: you can't learn from a switch. Imagine you're adjusting the temperature dial on your geyser. If the dial only had two positions, "freezing" and "boiling", you'd never find a comfortable temperature. You need a smooth dial that lets you make tiny adjustments and see small changes in temperature. That smooth dial is the sigmoid function.

THE SWITCH vs THE DIMMER: Why Smoothness Matters

STEP FUNCTION (Perceptron) | SIGMOID FUNCTION (Logistic Reg.)
Output jumps from 0 to 1 at z = 0 | Output rises smoothly from 0 to 1 as z increases
Derivative = 0 everywhere (except at z = 0: undefined) | Derivative σ(z)(1 − σ(z)) exists everywhere
Can't compute gradients | Gradient descent works
No notion of "confidence" | Output = probability

The "Aha!" Question

Ask yourself: Why can't gradient descent work with a step function? Because the derivative of a step function is zero everywhere (except at the discontinuity, where it's undefined). If the derivative is zero, the gradient is zero, and the weight update Δw = −α·∂L/∂w = −α·0 = 0. The model never learns! You need a function with a non-zero derivative at every point. Enter the sigmoid.
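To make the zero-gradient argument concrete, here is a minimal sketch that estimates slopes with a central finite difference. The helper name `numerical_slope` and the sample points are illustrative assumptions, not part of the chapter's code.

```python
import numpy as np

def step(z):
    """Perceptron-style step function: 1 if z > 0, else 0."""
    return (z > 0).astype(float)

def sigmoid(z):
    """Plain sigmoid; fine here because |z| stays small."""
    return 1.0 / (1.0 + np.exp(-z))

def numerical_slope(f, z, h=1e-4):
    """Central finite-difference estimate of df/dz (hypothetical helper)."""
    return (f(z + h) - f(z - h)) / (2 * h)

# Sample points away from the discontinuity at z = 0
z = np.array([-2.0, -0.5, 0.5, 2.0])
step_slopes = numerical_slope(step, z)
sigmoid_slopes = numerical_slope(sigmoid, z)

print("step slopes:   ", step_slopes)      # all zero: no learning signal
print("sigmoid slopes:", sigmoid_slopes)   # all positive: usable gradient
```

With a zero slope the update −α·slope is zero no matter how wrong the prediction is, while the sigmoid always passes some signal back to the weights.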

Q: Why does logistic regression use sigmoid instead of a step function?
A: The step function has zero derivative almost everywhere, making gradient-based optimization impossible. The sigmoid σ(z) = 1/(1 + e^(−z)) is smooth, differentiable, maps ℝ → (0, 1), and has the elegant derivative σ′ = σ(1 − σ), enabling gradient descent learning.
Data Scientist / ML Engineer (India: ₹8–25 LPA | US: $90K–$160K): Logistic regression is often the first model you'll build in an ML interview. At companies like Flipkart, Amazon India, Google, and Meta, interview loops frequently begin with: "Implement logistic regression from scratch." Why? Because if you understand this single neuron deeply (its loss, its gradients, its computation graph), you understand the building blocks of every deep network.
Section 4

Mathematical Foundation: Derived from First Principles

We build the entire logistic regression framework in five rigorous steps. Every equation is derived, not memorized. If you're confused at any step, you're thinking correctly; stay with it.

4a. The Sigmoid Function σ(z): From Linear to Probability

The Problem: Unbounded Outputs

A neuron computes z = wᵀx + b, where z ∈ (−∞, +∞). But for binary classification, you need a probability: a number in [0, 1]. You need a function f: ℝ → (0, 1) that is smooth, monotonically increasing, and differentiable.

Definition

σ(z) = 1 / (1 + e^(−z))

Domain: z ∈ (−∞, +∞)   →   Range: σ(z) ∈ (0, 1)

Deriving ALL Five Properties of Sigmoid

Property 1: σ(0) = 0.5 (The Decision Boundary)

σ(0) = 1/(1 + e^0) = 1/(1 + 1) = 1/2 = 0.5

Interpretation: When the weighted sum z = 0, the model is maximally uncertain: equal probability for both classes.

Property 2: Symmetry, σ(−z) = 1 − σ(z)

Start: σ(−z) = 1/(1 + e^z)

Multiply numerator and denominator by e^(−z):

= e^(−z)/(e^(−z) + 1) = (1 + e^(−z) − 1)/(1 + e^(−z)) = 1 − 1/(1 + e^(−z)) = 1 − σ(z) ✓

This means the sigmoid is symmetric about the point (0, 0.5). If P(y=1|x) = σ(z), then P(y=0|x) = σ(−z). Beautiful.

Property 3: Asymptotic Limits

As z → +∞: e^(−z) → 0, so σ(z) → 1/(1 + 0) = 1

As z → −∞: e^(−z) → ∞, so σ(z) → 0

The sigmoid never reaches 0 or 1 exactly: outputs are strictly in the open interval (0, 1).

Property 4: The Elegant Derivative, σ′(z) = σ(z)·(1 − σ(z))

Write σ(z) = (1 + e^(−z))^(−1). Apply the chain rule:

σ′(z) = −1 · (1 + e^(−z))^(−2) · d/dz(1 + e^(−z))

= −(1 + e^(−z))^(−2) · (−e^(−z))

= e^(−z) / (1 + e^(−z))²

Now verify σ(z)·(1 − σ(z)), using 1 − σ(z) = e^(−z)/(1 + e^(−z)):

= [1/(1 + e^(−z))] · [e^(−z)/(1 + e^(−z))]

= e^(−z) / (1 + e^(−z))² ✓

σ′(z) = σ(z) · (1 − σ(z)): the derivative is expressed entirely in terms of the function itself!

Property 5: Maximum Derivative at z = 0

σ′(0) = σ(0)·(1 − σ(0)) = 0.5 × 0.5 = 0.25

The sigmoid changes fastest at z = 0. For large |z|, σ′ ≈ 0: these are the saturation regions where gradients vanish. This is a preview of the "vanishing gradient problem" we'll fight in later chapters.
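The derivative identity and its peak at z = 0 can be checked numerically. The sketch below compares σ(z)(1 − σ(z)) against a central finite difference on a grid; the grid, the step size `h`, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Closed form from Property 4: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-6.0, 6.0, 121)   # grid with spacing 0.1 that includes z = 0
h = 1e-5

# Central finite difference as an independent estimate of the derivative
finite_diff = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
max_abs_error = np.max(np.abs(finite_diff - sigmoid_prime(z)))

# Location and value of the largest slope on the grid
z_at_peak = z[np.argmax(sigmoid_prime(z))]
peak_value = sigmoid_prime(0.0)

print(f"max |finite diff - closed form| = {max_abs_error:.2e}")
print(f"slope peaks at z = {z_at_peak:.2f} with value {peak_value:.2f}")
```

The two derivative estimates should agree to many decimal places, and the peak should sit at z = 0 with value 0.25, matching Property 5.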

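Sigmoid Value Table: Memorize These!

z | −6 | −4 | −2 | −1 | 0 | 1 | 2 | 4 | 6
σ(z) | 0.0025 | 0.018 | 0.119 | 0.269 | 0.500 | 0.731 | 0.881 | 0.982 | 0.9975
σ′(z) | 0.0025 | 0.018 | 0.105 | 0.197 | 0.250 | 0.197 | 0.105 | 0.018 | 0.0025

Numerical Stability: Avoid computing 1 / (1 + np.exp(-z)) naively. When z is a large negative number (say z = −1000), np.exp(-z) = np.exp(1000) overflows to inf and NumPy emits a RuntimeWarning. A stable version only ever exponentiates non-positive values: with e = np.exp(-np.abs(z)), use np.where(z >= 0, 1/(1+e), e/(1+e)). Note that np.where evaluates both of its branch arguments in full, so the frequently quoted form np.where(z >= 0, 1/(1+np.exp(-z)), np.exp(z)/(1+np.exp(z))) returns correct values but still triggers overflow warnings for large |z|.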
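A minimal sketch of an overflow-free sigmoid follows. It relies on the identity σ(z) = e^z/(1 + e^z) for negative z, and it is one common approach rather than the only one; the function name `stable_sigmoid` is an assumption for this example.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that never exponentiates a positive number."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))            # argument <= 0, so e lies in [0, 1]
    # z >= 0: 1 / (1 + e^-z);  z < 0: e^z / (1 + e^z), both written with e
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])

# Turn overflow and invalid-value warnings into errors to show none occur
with np.errstate(over="raise", invalid="raise"):
    probs = stable_sigmoid(z)

print(probs)
```

If SciPy is available, `scipy.special.expit` computes the same function and is a reasonable alternative to a hand-written version.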

Logistic Regression = Single-Neuron Neural Network

Now here's the key insight. A logistic regression model is exactly a neural network with:

  • One input layer with n features
  • One output neuron with sigmoid activation
  • No hidden layers

LOGISTIC REGRESSION = SINGLE-NEURON NEURAL NETWORK

Input Layer (n features)          Output Layer (1 neuron)

x₁ ──w₁──┐
x₂ ──w₂──┤
x₃ ──w₃──┼──→ [Σ + b] ──→ [σ] ──→ ŷ = σ(wᵀx + b)   (prediction)
 ⋮       │    linear      sigmoid
xₙ ──wₙ──┘    combination activation

z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b = wᵀx + b
ŷ = σ(z) = P(y = 1 | x)
Parameters: w ∈ ℝⁿ (weights), b ∈ ℝ (bias)
Output: ŷ ∈ (0, 1), the probability of the positive class

MYTH: "Logistic regression is not a neural network; it's a classical ML algorithm."
TRUTH: Logistic regression IS a neural network, with exactly 1 neuron, 0 hidden layers, and sigmoid activation. It's the simplest possible neural network.
WHY IT MATTERS: Understanding logistic regression deeply means you understand forward propagation, loss computation, backpropagation, and gradient descent: the exact four steps that every deep network uses. The only difference in a 100-layer network is that these steps are repeated through more layers.

4b. Binary Cross-Entropy: Derived from Maximum Likelihood

We have a model that outputs ŷ = σ(wᵀx + b) ∈ (0, 1). We need a loss function: a way to quantify "how wrong" the model is. We don't pull this loss out of thin air. We derive it from the principle of Maximum Likelihood Estimation (MLE).

Step 1: Model Each Label as a Bernoulli Random Variable

Since y ∈ {0, 1}, each label follows a Bernoulli distribution. Our model estimates P(y=1|x) = ŷ. We can write both cases in a single elegant formula:

P(y | x; w, b) = ŷ^y · (1 − ŷ)^(1−y)

Verify: If y = 1: P = ŷ¹ · (1 − ŷ)⁰ = ŷ ✓  |  If y = 0: P = ŷ⁰ · (1 − ŷ)¹ = 1 − ŷ ✓

Step 2: Likelihood Function for m Training Examples

Assuming training examples are independent and identically distributed (i.i.d.), the likelihood of observing all m labels given our parameters is the product:

L(w, b) = ∏_{i=1}^{m} [ŷ^(i)]^(y^(i)) · [1 − ŷ^(i)]^(1 − y^(i))

Step 3: Log-Likelihood (Convert Product → Sum)

Products of many small numbers underflow numerically. Log converts products to sums and is monotonic (maximizing log L is equivalent to maximizing L):

log L(w, b) = ∑_{i=1}^{m} [ y^(i) log(ŷ^(i)) + (1 − y^(i)) log(1 − ŷ^(i)) ]

Step 4: Negate and Average → Binary Cross-Entropy

MLE says: maximize log L. Gradient descent minimizes. So negate it. Divide by m for the average:

J(w, b) = −(1/m) ∑_{i=1}^{m} [ y^(i) log(ŷ^(i)) + (1 − y^(i)) log(1 − ŷ^(i)) ]

This is the Binary Cross-Entropy (BCE) Loss, also called the Log Loss. Every line was derived from the Bernoulli model and the i.i.d. assumption; nothing was pulled out of thin air.
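The chain Bernoulli → likelihood → log-likelihood → BCE can be verified on a handful of numbers. In this sketch the labels and predicted probabilities are made-up values used only to show that the averaged negative log-likelihood and the BCE formula give the same number.

```python
import numpy as np

# Hypothetical labels and model outputs, for illustration only
y = np.array([1.0, 0.0, 1.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6, 0.7])

# Step 1: Bernoulli probability of each observed label
per_sample_prob = y_hat**y * (1 - y_hat)**(1 - y)

# Step 2: likelihood of the whole dataset under the i.i.d. assumption
likelihood = np.prod(per_sample_prob)

# Steps 3-4: negate the log-likelihood and average over m samples
neg_mean_log_likelihood = -np.log(likelihood) / len(y)

# BCE written directly from the formula for J(w, b)
bce = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(f"likelihood = {likelihood:.4f}")
print(f"-log(L)/m  = {neg_mean_log_likelihood:.6f}")
print(f"BCE        = {bce:.6f}")
```

Both routes give the same value, which is the point of the derivation: minimizing BCE is the same as maximizing the likelihood.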

Why BCE Works: Intuitive Analysis

Loss Behavior: How BCE Penalizes Errors

When y = 1 (True label is positive)

Loss per sample = −log(ŷ)

  • If ŷ → 1 (correct, confident): −log(1) = 0 ← zero loss, perfect!
  • If ŷ → 0 (wrong, confident): −log(0) → +∞ ← infinite penalty for being confidently wrong!

When y = 0 (True label is negative)

Loss per sample = −log(1 − ŷ)

  • If ŷ → 0 (correct, confident): −log(1) = 0 ← zero loss, perfect!
  • If ŷ → 1 (wrong, confident): −log(0) → +∞ ← infinite penalty!
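The limiting cases in the bullets above can be tabulated for a positive example (y = 1). This is a small sketch; the function name and the chosen prediction values are assumptions for illustration.

```python
import math

def bce_penalty_when_positive(y_hat):
    """Per-sample BCE loss when the true label is y = 1: -log(y_hat)."""
    return -math.log(y_hat)

# From confidently right (0.99) to confidently wrong (0.01)
predictions = [0.99, 0.9, 0.5, 0.49, 0.1, 0.01]
penalties = {p: bce_penalty_when_positive(p) for p in predictions}

for p, loss in penalties.items():
    print(f"y_hat = {p:<5} -> loss = {loss:.3f}")
```

The loss stays small while the prediction is on the right side of 0.5 and climbs steeply as the predicted probability of the true class approaches 0.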
Key Insight

BCE penalizes confident wrong predictions far more heavily than uncertain ones, and the penalty grows without bound as the predicted probability of the true class approaches 0. If your model says "I'm 99% sure this person will repay" and they default, the penalty is −log(0.01) = 4.6. But if the model says "I'm 51% sure," the penalty is only −log(0.49) = 0.71. This steep penalty on overconfidence pushes the model toward calibrated probabilities.

MYTH: "Why not use Mean Squared Error (MSE) for classification?"
TRUTH: MSE with sigmoid creates a non-convex loss surface, because (ŷ − y)² composed with σ(z) has multiple inflection points; it can have local minima and flat regions where a confidently wrong prediction produces almost no gradient. BCE with sigmoid gives a convex loss surface, so any minimum that gradient descent reaches (with a suitably small learning rate) is a global one.
WHY IT MATTERS: This is a GATE favorite. The convexity of BCE follows directly from the negative log-likelihood of the Bernoulli distribution.

4c. The Cost Function J(w, b)

For a single training example (x, y), the loss is:

ℒ(ŷ, y) = −[ y·log(ŷ) + (1 − y)·log(1 − ŷ) ]

The cost function is the average loss over all m training examples:

J(w, b) = (1/m) ∑_{i=1}^{m} ℒ(ŷ^(i), y^(i)) = −(1/m) ∑_{i=1}^{m} [ y^(i) log(ŷ^(i)) + (1 − y^(i)) log(1 − ŷ^(i)) ]

Where ŷ^(i) = σ(wᵀx^(i) + b) for each sample i. The goal of training: find w* and b* that minimize J(w, b).

4d. Gradient Computation: ∂J/∂w and ∂J/∂b via Chain Rule

To minimize J, we need its gradients with respect to w and b. We'll derive these using the chain rule, one link at a time.

Setup: The Computational Chain

For a single sample (drop the superscript for clarity):

z = wᵀx + b → ŷ = σ(z) → ℒ = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)]

We need ∂ℒ/∂w and ∂ℒ/∂b. By the chain rule:

∂ℒ/∂w = (∂ℒ/∂ŷ) · (∂ŷ/∂z) · (∂z/∂w)

Link 1: ∂ℒ/∂ŷ

ℒ = −y·log(ŷ) − (1 − y)·log(1 − ŷ)

∂ℒ/∂ŷ = −y/ŷ − (1 − y)·(−1)/(1 − ŷ) = −y/ŷ + (1 − y)/(1 − ŷ)

Link 2: ∂ŷ/∂z = σ′(z) = σ(z)(1 − σ(z)) = ŷ(1 − ŷ)

Link 3: ∂z/∂w = x   and   ∂z/∂b = 1

Combining: ∂ℒ/∂z (the key intermediate)

∂ℒ/∂z = (∂ℒ/∂ŷ) · (∂ŷ/∂z)

= [−y/ŷ + (1 − y)/(1 − ŷ)] · ŷ(1 − ŷ)

= −y(1 − ŷ) + (1 − y)ŷ

= −y + yŷ + ŷ − yŷ

= ŷ − y

∂ℒ/∂z = ŷ − y. Stunningly simple! The gradient is just the prediction error.

Final Gradients for a Single Sample

∂ℒ/∂w = (ŷ − y) · x

∂ℒ/∂b = (ŷ − y) · 1 = ŷ − y

Gradients for the Full Cost (m samples)

∂J/∂w = (1/m) ∑_{i=1}^{m} (ŷ^(i) − y^(i)) · x^(i)

∂J/∂b = (1/m) ∑_{i=1}^{m} (ŷ^(i) − y^(i))

Gradient Descent Update Rules

With the gradients computed, we update parameters in the direction that decreases J:

w := w − α · ∂J/∂w
b := b − α · ∂J/∂b

where α > 0 is the learning rate

Repeat for T iterations (epochs). Each iteration, the loss decreases (if α is small enough), and the model gets better at predicting.
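A standard way to gain confidence in hand-derived gradients is a finite-difference gradient check. The sketch below compares the analytic ∂J/∂w and ∂J/∂b from this section against numerical estimates on random data; the data, seed, and function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """Average BCE. X has shape (n, m), Y has shape (1, m)."""
    y_hat = sigmoid(w.T @ X + b)
    return float(-np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat)))

def analytic_gradients(w, b, X, Y):
    """Gradients from the chain-rule derivation: dz = y_hat - y."""
    m = X.shape[1]
    dz = sigmoid(w.T @ X + b) - Y
    return (X @ dz.T) / m, float(np.sum(dz) / m)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))                       # 3 features, 20 samples
Y = (rng.random((1, 20)) < 0.5).astype(float)      # random binary labels
w = rng.normal(size=(3, 1)) * 0.1
b = 0.05

dw, db = analytic_gradients(w, b, X, Y)

# Central differences: nudge one parameter at a time and re-evaluate J
h = 1e-6
dw_num = np.zeros_like(w)
for j in range(w.shape[0]):
    bump = np.zeros_like(w)
    bump[j, 0] = h
    dw_num[j, 0] = (cost(w + bump, b, X, Y) - cost(w - bump, b, X, Y)) / (2 * h)
db_num = (cost(w, b + h, X, Y) - cost(w, b - h, X, Y)) / (2 * h)

print("max |dw - dw_num| =", np.max(np.abs(dw - dw_num)))
print("|db - db_num|     =", abs(db - db_num))
```

If a sign or a 1/m factor were wrong in the derivation, the two estimates would disagree visibly, which makes this check useful before trusting a training loop.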

Q: For logistic regression with BCE loss, what is ∂ℒ/∂z for a single sample?
A: ∂ℒ/∂z = ŷ − y, where ŷ = σ(z) and z = wᵀx + b. This elegant simplification happens because the sigmoid derivative σ′ = σ(1 − σ) cancels perfectly with terms from the BCE derivative. This is NOT a coincidence: it's a consequence of the Bernoulli MLE formulation.

4e. The Computation Graph: Forward and Backward Pass

A computation graph makes the chain rule visual. Each node is an operation. Forward pass flows left→right, computing outputs. Backward pass flows right→left, computing gradients.

COMPUTATION GRAPH FOR LOGISTIC REGRESSION (single sample)

FORWARD PASS (left → right):

x ─┐
w ─┼──→ [z = wᵀx + b] ──→ [ŷ = σ(z)] ──→ [ℒ = −y·log(ŷ) − (1 − y)·log(1 − ŷ)]
b ─┘

BACKWARD PASS (right → left):

∂ℒ/∂w ←── ∂ℒ/∂z · x ←── ∂ℒ/∂ŷ · ∂ŷ/∂z ←── ∂ℒ/∂ℒ = 1

dŷ = −y/ŷ + (1 − y)/(1 − ŷ)
dσ = ŷ(1 − ŷ)
dz = dŷ · dσ = ŷ − y
dw = dz · x
db = dz

KEY: Forward pass computes ℒ from inputs. Backward pass computes gradients from ℒ back to parameters. This IS backpropagation for 1 neuron.

"Automatic Differentiation in Machine Learning: a Survey" (Baydin et al., 2018): This comprehensive survey traces how computation graphs and automatic differentiation evolved from the 1960s (Wengert's "simple automatic derivative evaluation program") to modern deep learning frameworks. The logistic regression computation graph you just learned is the seed from which PyTorch's autograd and TensorFlow's GradientTape grew. arXiv: 1502.05767
Section 5

Worked Examples: By Hand, Then by India, Then by World

Example 1: Full Forward + Backward Pass By Hand (2 features, 3 samples)

Given Data (m = 3 samples, n = 2 features)

x^(1) = [1, 2]ᵀ, y^(1) = 1  (approved)

x^(2) = [−1, −1]ᵀ, y^(2) = 0  (rejected)

x^(3) = [2, 0]ᵀ, y^(3) = 1  (approved)

Initial parameters: w = [0.5, −0.5]ᵀ, b = 0, learning rate α = 0.1. Throughout, log denotes the natural logarithm.

FORWARD PASS: Compute z, ŷ, ℒ for each sample

Sample 1: z^(1) = 0.5·1 + (−0.5)·2 + 0 = 0.5 − 1.0 = −0.5

ŷ^(1) = σ(−0.5) = 1/(1 + e^0.5) = 1/(1 + 1.6487) = 1/2.6487 ≈ 0.3775

ℒ^(1) = −[1·log(0.3775) + 0·log(0.6225)] = −log(0.3775) = −(−0.9740) ≈ 0.9740

Sample 2: z^(2) = 0.5·(−1) + (−0.5)·(−1) + 0 = −0.5 + 0.5 = 0

ŷ^(2) = σ(0) = 0.5

ℒ^(2) = −[0·log(0.5) + 1·log(0.5)] = −log(0.5) = 0.6931

Sample 3: z^(3) = 0.5·2 + (−0.5)·0 + 0 = 1.0

ŷ^(3) = σ(1) = 1/(1 + e^(−1)) = 1/1.3679 ≈ 0.7311

ℒ^(3) = −[1·log(0.7311) + 0·log(0.2689)] = −(−0.3133) ≈ 0.3133

J = (1/3)(0.9740 + 0.6931 + 0.3133) = (1/3)(1.9804) ≈ 0.6601

BACKWARD PASS: Compute dz, dw, db

dz^(i) = ŷ^(i) − y^(i):

dz^(1) = 0.3775 − 1 = −0.6225

dz^(2) = 0.5 − 0 = +0.5

dz^(3) = 0.7311 − 1 = −0.2689

∂J/∂w₁ = (1/3) ∑ dz^(i) · x₁^(i)

= (1/3)[(−0.6225)(1) + (0.5)(−1) + (−0.2689)(2)]

= (1/3)[−0.6225 − 0.5 − 0.5378] = (1/3)(−1.6603) ≈ −0.5534

∂J/∂w₂ = (1/3) ∑ dz^(i) · x₂^(i)

= (1/3)[(−0.6225)(2) + (0.5)(−1) + (−0.2689)(0)]

= (1/3)[−1.2450 − 0.5 + 0] = (1/3)(−1.7450) ≈ −0.5817

∂J/∂b = (1/3) ∑ dz^(i)

= (1/3)[−0.6225 + 0.5 + (−0.2689)] = (1/3)(−0.3914) ≈ −0.1305

GRADIENT DESCENT UPDATE (α = 0.1)

w₁ := 0.5 − 0.1·(−0.5534) = 0.5 + 0.05534 ≈ 0.5553

w₂ := −0.5 − 0.1·(−0.5817) = −0.5 + 0.05817 ≈ −0.4418

b := 0 − 0.1·(−0.1305) = 0 + 0.01305 ≈ 0.013

After 1 GD step: w = [0.5553, −0.4418]ᵀ, b ≈ 0.013

Both weights moved to better separate the classes! w₁ increased (feature 1 correlates with the positive class), w₂ became less negative (adjusting its influence).
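The hand calculation can be reproduced in a few lines of NumPy, which is a useful habit for checking arithmetic. This sketch uses the data and initial parameters from Example 1 with samples stored as columns; the variable names are assumptions.

```python
import numpy as np

# Example 1 data: columns are samples, rows are features -> shape (2, 3)
X = np.array([[1.0, -1.0, 2.0],
              [2.0, -1.0, 0.0]])
Y = np.array([[1.0, 0.0, 1.0]])
w = np.array([[0.5], [-0.5]])
b = 0.0
alpha = 0.1
m = X.shape[1]

# Forward pass
z = w.T @ X + b                      # [[-0.5, 0.0, 1.0]]
y_hat = 1.0 / (1.0 + np.exp(-z))
J = -np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))

# Backward pass
dz = y_hat - Y
dw = (X @ dz.T) / m
db = np.sum(dz) / m

# One gradient descent step
w_new = w - alpha * dw
b_new = b - alpha * db

print("y_hat:", np.round(y_hat, 4))
print("J    :", round(float(J), 4))
print("dw   :", np.round(dw.ravel(), 4), " db:", round(float(db), 4))
print("w_new:", np.round(w_new.ravel(), 4), " b_new:", round(float(b_new), 4))
```

Small differences in the last decimal place relative to the hand calculation come from rounding intermediate values by hand.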

Example 2: India: SBI CIBIL Score Loan Approval

Case Study: Loan Default Prediction at State Bank of India

Business Context

SBI processes ~1.5 lakh personal loan applications per month. The first-pass model is a logistic regression that predicts P(default) using features from CIBIL (Credit Information Bureau India Limited) and internal banking data.

Feature Engineering (simplified to 5 key features)

Feature | Variable | Example Value
CIBIL Score (normalized) | x₁ | 0.782 (original: 782/1000)
Monthly Income (log-scaled, lakhs) | x₂ | log(1.45) = 0.372
EMI-to-Income Ratio | x₃ | 0.22 (22%)
Years of Credit History | x₄ | 0.9 (9 years / 10 max)
Number of Active Loans (normalized) | x₅ | 0.4 (2 loans / 5 max)

Model Computation for Anjali's Application

Trained weights: w = [−3.2, −1.5, +2.8, −0.9, +1.4], b = 1.2

Note: negative weights for CIBIL score, income, credit history mean higher values reduce default risk. Positive weights for EMI ratio and number of loans mean higher values increase default risk.

z = (−3.2)(0.782) + (−1.5)(0.372) + (2.8)(0.22) + (−0.9)(0.9) + (1.4)(0.4) + 1.2

= −2.502 − 0.558 + 0.616 − 0.810 + 0.560 + 1.200

= −1.494

ŷ = σ(−1.494) = 1/(1 + e^1.494) = 1/(1 + 4.456) ≈ 0.183

Interpretation: P(default) = 18.3%. (This simplified five-feature model is less confident than the 0.04 quoted in the opening story, which used 40+ features.) In this illustration, the threshold is 15% for auto-approval and 40% for auto-rejection. Anjali falls in the 15%–40% band → sent to a human underwriter for review. But given her strong CIBIL score and income, the underwriter approves.
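The scoring and routing logic in this case study can be sketched as follows. The weights, bias, features, and the 15%/40% thresholds are the illustrative values from the example above; the function name `triage` and its return labels are hypothetical.

```python
import math

# Illustrative feature values and weights from the worked example
features = [0.782, 0.372, 0.22, 0.9, 0.4]   # CIBIL, log-income, EMI ratio, history, loans
weights = [-3.2, -1.5, 2.8, -0.9, 1.4]
bias = 1.2

# Linear combination followed by the sigmoid
z = sum(w_j * x_j for w_j, x_j in zip(weights, features)) + bias
p_default = 1.0 / (1.0 + math.exp(-z))

def triage(p, approve_below=0.15, reject_above=0.40):
    """Route an application by predicted default probability (hypothetical policy)."""
    if p < approve_below:
        return "auto-approve"
    if p > reject_above:
        return "auto-reject"
    return "manual review"

decision = triage(p_default)
print(f"z = {z:.3f}, P(default) = {p_default:.3f}, decision = {decision}")
```

Keeping the thresholds as parameters separates the probability model from the business policy, so the policy can change without retraining.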

Weight Interpretability: The Regulator's Favorite

RBI requires explainable credit models. With logistic regression, each weight has a clear meaning:

  • w₁ = −3.2 for CIBIL → A 0.1 increase in normalized CIBIL (i.e., 100 points) decreases z by 0.32, reducing default probability
  • w₃ = +2.8 for EMI ratio → Higher EMI burden increases default risk
  • The odds ratio e^(w_j) gives the multiplicative change in odds per unit change in x_j
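The odds-ratio reading of a weight can be checked directly. The sketch below takes the CIBIL weight and the score z = −1.494 from the example and shows that multiplying the odds by e^(w·Δx) gives the same probability as recomputing the sigmoid; the variable names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w_cibil = -3.2      # weight on normalized CIBIL score (from the example)
z_before = -1.494   # Anjali's score from the worked computation
delta_x = 0.1       # +0.1 normalized, i.e. +100 CIBIL points

# Multiplicative change in odds for this change in the feature
odds_multiplier = math.exp(w_cibil * delta_x)

p_before = sigmoid(z_before)
odds_before = p_before / (1 - p_before)
odds_after = odds_before * odds_multiplier
p_after_from_odds = odds_after / (1 + odds_after)

# Same result by shifting z and applying the sigmoid again
p_after_direct = sigmoid(z_before + w_cibil * delta_x)

print(f"odds multiplier   = {odds_multiplier:.3f}")
print(f"P(default) before = {p_before:.3f}")
print(f"P(default) after  = {p_after_direct:.3f}")
```

The weight acts additively on z and multiplicatively on the odds, which is why a single coefficient can be explained to a regulator in one sentence.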

Example 3: World: Google Ads Click-Through Rate (CTR) Prediction

Case Study: CTR Prediction at Google (Mountain View, CA)

Business Context

Google Search Ads generates $175 billion annually (2024). Every time a user types a query, ~500 candidate ads are scored. The core ranking uses a logistic regression variant (FTRL-Proximal) to predict P(click | user, query, ad). This runs 8.5 billion times per day.

Feature Engineering (simplified)

Feature | Variable | Example
Query-Ad relevance score | x₁ | 0.85 (cosine similarity of embeddings)
Ad historical CTR | x₂ | 0.032 (3.2% past click rate)
User engagement score | x₃ | 0.67 (based on past sessions)
Position bias (1/position) | x₄ | 0.333 (position 3)
Time-of-day feature | x₅ | 0.75 (3 PM, peak shopping)

Model Computation

Trained weights: w = [2.1, 5.8, 1.3, 3.5, 0.6], b = −4.2

z = (2.1)(0.85) + (5.8)(0.032) + (1.3)(0.67) + (3.5)(0.333) + (0.6)(0.75) + (−4.2)

= 1.785 + 0.186 + 0.871 + 1.166 + 0.450 − 4.200

= 0.257

ŷ = σ(0.257) = 1/(1 + e^(−0.257)) ≈ 0.564

Wait, a 56.4% CTR? That seems high, but this is a simplified example. In practice, Google uses feature crossing, hashing tricks, and the FTRL optimizer with L1 regularization on billions of sparse features. Real CTRs are typically 1–5%.

Revenue Impact: Expected CPM

Google's ad auction uses a "second-price" mechanism. The ad's rank = bid × P(click). If a 0.1% improvement in P(click) estimation across billions of impressions lifted revenue by the same 0.1%, that would be about $175 million per year.

INDIA: SBI LOAN APPROVAL
  • Scale: ~1.5L applications/month
  • Features: CIBIL, income, EMI ratio
  • Output: P(default) ∈ (0,1)
  • Constraint: RBI-mandated explainability
  • Loss if wrong: NPA (Non-Performing Asset)
  • Model type: Classic logistic regression
US: GOOGLE ADS CTR
  • Scale: 8.5B predictions/day
  • Features: Query-ad similarity, user history
  • Output: P(click) ∈ (0,1)
  • Constraint: <5ms latency per prediction
  • Loss if wrong: Lost ad revenue
  • Model type: FTRL-Proximal (LR variant)
Section 6

Python Implementation: From Scratch + Library

6a. From-Scratch NumPy Implementation

Python
import numpy as np
import matplotlib.pyplot as plt
import time

class LogisticRegression:
    """
    Logistic Regression from scratch: a single-neuron neural network.
    Implements: sigmoid, BCE loss, forward, backward, update, fit, predict.
    """

    def __init__(self, n_features, learning_rate=0.01):
        # Initialize weights and bias to zero (safe for a single neuron:
        # there is no hidden-layer symmetry to break)
        self.w = np.zeros((n_features, 1))   # shape (n, 1)
        self.b = 0.0
        self.lr = learning_rate
        self.losses = []

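    def sigmoid(self, z):
        """Numerically stable sigmoid: only exponentiates non-positive values."""
        e = np.exp(-np.abs(z))   # always in [0, 1], so it cannot overflow
        # z >= 0: 1 / (1 + e^-z);  z < 0: e^z / (1 + e^z)
        return np.where(z >= 0, 1 / (1 + e), e / (1 + e))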

    def forward(self, X):
        """
        Forward pass: X → z → ŷ
        X: shape (n, m), each column is a sample
        Returns: ŷ of shape (1, m)
        """
        self.z = np.dot(self.w.T, X) + self.b   # (1, m)
        self.y_hat = self.sigmoid(self.z)         # (1, m)
        return self.y_hat

    def compute_loss(self, Y):
        """
        Binary Cross-Entropy: J = -(1/m) Σ [y·log(ŷ) + (1-y)·log(1-ŷ)]
        Y: shape (1, m)
        """
        m = Y.shape[1]
        epsilon = 1e-8  # prevent log(0)
        cost = -(1/m) * np.sum(
            Y * np.log(self.y_hat + epsilon) +
            (1 - Y) * np.log(1 - self.y_hat + epsilon)
        )
        return np.squeeze(cost)

    def backward(self, X, Y):
        """
        Backward pass: compute gradients ∂J/∂w and ∂J/∂b
        The beautiful result: dz = ŷ - y
        """
        m = Y.shape[1]
        dz = self.y_hat - Y                        # (1, m)
        self.dw = (1/m) * np.dot(X, dz.T)           # (n, 1)
        self.db = (1/m) * np.sum(dz)                # scalar

    def update(self):
        """Gradient descent step: w := w - α·dw, b := b - α·db"""
        self.w -= self.lr * self.dw
        self.b -= self.lr * self.db

    def fit(self, X, Y, epochs=1000, print_every=100):
        """
        Full training loop.
        X: (n, m), Y: (1, m)
        """
        for i in range(epochs):
            # Forward
            self.forward(X)
            # Loss
            loss = self.compute_loss(Y)
            self.losses.append(loss)
            # Backward
            self.backward(X, Y)
            # Update
            self.update()
            # Print
            if i % print_every == 0:
                print(f"Epoch {i:4d} | Loss: {loss:.6f}")

    def predict(self, X, threshold=0.5):
        """Return binary predictions."""
        y_hat = self.forward(X)
        return (y_hat >= threshold).astype(int)

    def accuracy(self, X, Y):
        """Compute classification accuracy."""
        preds = self.predict(X)
        return np.mean(preds == Y) * 100

6b. Synthetic Indian Bank Loan Dataset + Training

Python
# -- Generate Synthetic SBI Loan Dataset --
np.random.seed(42)
m = 500  # 500 loan applications

# Feature 1: CIBIL score (normalized to 0-1)
cibil = np.random.beta(5, 2, m)  # skewed toward higher scores
# Feature 2: EMI-to-income ratio (0-0.8)
emi_ratio = np.random.beta(2, 5, m) * 0.8

# Labels: P(default) is LOW when CIBIL is HIGH and EMI is LOW
z_true = -4.0 * cibil + 5.0 * emi_ratio + 0.5
prob_default = 1 / (1 + np.exp(-z_true))
y = (np.random.rand(m) < prob_default).astype(float)

# Stack into (n, m) format
X = np.vstack([cibil, emi_ratio])   # shape (2, 500)
Y = y.reshape(1, -1)                # shape (1, 500)

print(f"Dataset: {X.shape[1]} samples, {X.shape[0]} features")
print(f"Default rate: {Y.mean()*100:.1f}%")

# -- Train the model --
model = LogisticRegression(n_features=2, learning_rate=1.0)
model.fit(X, Y, epochs=2000, print_every=200)

print(f"\nFinal Loss: {model.losses[-1]:.6f}")
print(f"Training Accuracy: {model.accuracy(X, Y):.2f}%")
print(f"Learned weights: w1={model.w[0,0]:.4f}, w2={model.w[1,0]:.4f}, b={model.b:.4f}")
Dataset: 500 samples, 2 features
Default rate: 28.4%
Epoch    0 | Loss: 0.693147
Epoch  200 | Loss: 0.518321
Epoch  400 | Loss: 0.478226
Epoch  600 | Loss: 0.461582
Epoch  800 | Loss: 0.453174
Epoch 1000 | Loss: 0.448474
Epoch 1200 | Loss: 0.445567
Epoch 1400 | Loss: 0.443685
Epoch 1600 | Loss: 0.442401
Epoch 1800 | Loss: 0.441487

Final Loss: 0.440828
Training Accuracy: 79.40%
Learned weights: w1=-3.5192, w2=4.3856, b=0.4291

6c. Plot Loss Curve + Decision Boundary

Python
# -- Plot 1: Loss Curve --
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(model.losses, color='#7c3aed', linewidth=2)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('BCE Loss', fontsize=12)
ax1.set_title('Training Loss Curve', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)

# -- Plot 2: Decision Boundary --
# Create mesh grid
x1_range = np.linspace(0, 1, 200)
x2_range = np.linspace(0, 0.8, 200)
X1, X2 = np.meshgrid(x1_range, x2_range)
X_mesh = np.vstack([X1.ravel(), X2.ravel()])

# Predict on mesh
Z_mesh = model.forward(X_mesh).reshape(X1.shape)

# Plot
cf = ax2.contourf(X1, X2, Z_mesh, levels=50, cmap='RdYlGn_r', alpha=0.8)
ax2.contour(X1, X2, Z_mesh, levels=[0.5], colors='black', linewidths=2)
ax2.scatter(X[0], X[1], c=Y.ravel(), cmap='RdYlGn_r',
            edgecolors='black', s=20, alpha=0.6)
ax2.set_xlabel('CIBIL Score (normalized)', fontsize=12)
ax2.set_ylabel('EMI-to-Income Ratio', fontsize=12)
ax2.set_title('Decision Boundary: SBI Loan Default', fontsize=14, fontweight='bold')
# Tie the colorbar to the filled contours, which encode the predicted P(default);
# the scatter points are colored by the 0/1 labels, not by probability
plt.colorbar(cf, ax=ax2, label='P(Default)')

plt.tight_layout()
plt.savefig('ch03_logistic_regression_plots.png', dpi=150)
plt.show()

6d. Vectorized vs Non-Vectorized: Timing Comparison

Python
# -- Non-vectorized (loop-based) gradient computation --
def compute_gradients_loop(X, Y, w, b):
    """Compute gradients using explicit Python loops (slow)."""
    n, m = X.shape
    dw = np.zeros((n, 1))
    db = 0.0

    for i in range(m):
        x_i = X[:, i].reshape(-1, 1)        # (n, 1)
        z_i = (w.T @ x_i).item() + b        # scalar
        y_hat_i = 1 / (1 + np.exp(-z_i))    # scalar
        dz_i = y_hat_i - Y[0, i]            # scalar
        dw += x_i * dz_i
        db += dz_i

    dw /= m
    db /= m
    return dw, db

# -- Vectorized gradient computation --
def compute_gradients_vectorized(X, Y, w, b):
    """Compute gradients using NumPy matrix operations (fast)."""
    m = X.shape[1]
    z = np.dot(w.T, X) + b                # (1, m)
    y_hat = 1 / (1 + np.exp(-z))          # (1, m)
    dz = y_hat - Y                         # (1, m)
    dw = (1/m) * np.dot(X, dz.T)            # (n, 1)
    db = (1/m) * np.sum(dz)                 # scalar
    return dw, db

# -- Timing Benchmark --
m_test = 50000
n_test = 100
X_test = np.random.randn(n_test, m_test)
Y_test = np.random.randint(0, 2, (1, m_test)).astype(float)
w_test = np.random.randn(n_test, 1) * 0.01
b_test = 0.0

# Time the loop version
t1 = time.time()
dw_loop, db_loop = compute_gradients_loop(X_test, Y_test, w_test, b_test)
t_loop = time.time() - t1

# Time the vectorized version
t2 = time.time()
dw_vec, db_vec = compute_gradients_vectorized(X_test, Y_test, w_test, b_test)
t_vec = time.time() - t2

print(f"Loop-based:  {t_loop*1000:.1f} ms")
print(f"Vectorized:  {t_vec*1000:.1f} ms")
print(f"Speedup:     {t_loop/t_vec:.0f}x faster!")
print(f"Results match: {np.allclose(dw_loop, dw_vec)}")
Loop-based: 4823.7 ms Vectorized: 3.2 ms Speedup: 1507x faster! Results match: True
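The roughly 1500× speedup in this run is real, although the exact figure depends on hardware, array sizes, and the NumPy build. NumPy dispatches the matrix products to BLAS (Basic Linear Algebra Subprograms), optimized C/Fortran routines that exploit CPU SIMD instructions and the cache hierarchy, and element-wise operations such as np.exp run in compiled loops. A Python loop, by contrast, pays interpreter overhead on every iteration. In production ML, vectorize over samples wherever you can: a Python for-loop over individual samples in training code is almost always a performance bug.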

6e. PyTorch Library Version (for comparison)

Python
import torch
import torch.nn as nn

# Convert data to PyTorch tensors
X_pt = torch.tensor(X.T, dtype=torch.float32)  # (m, n)
Y_pt = torch.tensor(Y.T, dtype=torch.float32)  # (m, 1)

# Define the model: logistic regression = 1 Linear layer + Sigmoid
model_pt = nn.Sequential(
    nn.Linear(2, 1),   # z = wᵀx + b
    nn.Sigmoid()       # ŷ = σ(z)
)

# BCE Loss + SGD optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model_pt.parameters(), lr=1.0)

# Training loop
for epoch in range(2000):
    y_pred = model_pt(X_pt)
    loss = criterion(y_pred, Y_pt)

    optimizer.zero_grad()
    loss.backward()       # autograd computes gradients automatically!
    optimizer.step()

    if epoch % 500 == 0:
        print(f"Epoch {epoch:4d} | Loss: {loss.item():.6f}")

# Compare: PyTorch did the same forward-backward-update loop,
# but computed gradients automatically via loss.backward().
# Our from-scratch version computed them manually: same math!
Find the bug in this logistic regression code:
def train_step(X, Y, w, b, lr):
    m = X.shape[1]
    z = np.dot(w.T, X) + b
    y_hat = 1 / (1 + np.exp(-z))
    loss = -(1/m) * np.sum(Y * np.log(y_hat) + (1-Y) * np.log(1-y_hat))
    dz = y_hat - Y
    dw = (1/m) * np.dot(X, dz.T)
    db = (1/m) * np.sum(dz)
    w = w + lr * dw      # <-- BUG IS HERE
    b = b + lr * db      # <-- AND HERE
    return w, b, loss
Bug: The updates should be w = w - lr * dw and b = b - lr * db (subtraction, not addition!). Gradient descent moves against the gradient. Adding the gradient performs gradient ascent, which increases the loss instead of decreasing it, so the loss grows step after step.
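A corrected version of the step might look like the sketch below. The toy data, the seed, and the `eps` clipping constant are assumptions added here so the example runs on its own; they are not part of the exercise.

```python
import numpy as np

def train_step(X, Y, w, b, lr, eps=1e-12):
    """One gradient-descent step for logistic regression (X: (n, m), Y: (1, m))."""
    m = X.shape[1]
    z = np.dot(w.T, X) + b
    y_hat = 1 / (1 + np.exp(-z))
    y_safe = np.clip(y_hat, eps, 1 - eps)      # avoid log(0)
    loss = -np.mean(Y * np.log(y_safe) + (1 - Y) * np.log(1 - y_safe))
    dz = y_hat - Y
    dw = (1 / m) * np.dot(X, dz.T)
    db = (1 / m) * np.sum(dz)
    w = w - lr * dw      # move AGAINST the gradient
    b = b - lr * db
    return w, b, loss

# Toy data (illustrative only): label is 1 when the two features sum to a positive number
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)

w, b = np.zeros((2, 1)), 0.0
losses = []
for _ in range(100):
    w, b, loss = train_step(X, Y, w, b, lr=0.5)
    losses.append(loss)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

With the sign fixed, the recorded loss should fall from its starting value of log 2 (the loss at w = 0, b = 0) instead of growing.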
Section 7

Visual Aids: ASCII Diagrams

Sigmoid Function Shape

[Figure: ASCII plot of the sigmoid curve σ(z) = 1/(1 + e⁻ᶻ). Horizontal axis: z from −6 to 5. Vertical axis: σ(z) from 0.0 to 1.0. The curve passes through σ(0) = 0.5, with a near-linear zone around z = 0 (σ′ ≈ 0.25) and saturation zones at large |z| (σ′ ≈ 0).]

KEY PROPERTIES:
• Range: (0, 1); never reaches exactly 0 or 1
• Symmetric about (0, 0.5): σ(−z) = 1 − σ(z)
• Maximum slope at z = 0: σ′(0) = 0.25
• Saturation: |z| > 5 ⟹ σ′(z) ≈ 0 (vanishing gradient!)
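The properties listed above can be checked numerically. This is a minimal sketch using only the standard library; the helper names `sigmoid` and `sigmoid_prime` and the sample z values are chosen here for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid for a scalar z."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative written in terms of the function itself."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    s = sigmoid(z)
    # Symmetry: sigmoid(-z) should equal 1 - sigmoid(z)
    assert abs(sigmoid(-z) - (1.0 - s)) < 1e-12
    print(f"z={z:+.1f}  sigmoid={s:.4f}  slope={sigmoid_prime(z):.4f}")
```

The printed slope peaks at z = 0 and shrinks toward zero at both ends, which is the saturation behaviour described in the last bullet.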

BCE Loss Behavior

[Figure: ASCII plot of the binary cross-entropy loss ℒ (vertical axis, 0.0 to 5.0) against the prediction ŷ (horizontal axis, 0.0 to 1.0). Two curves: ℒ = −log(ŷ) when y = 1, and ℒ = −log(1 − ŷ) when y = 0.]

BINARY CROSS-ENTROPY LOSS: How It Punishes Predictions
When y = 1: predict ŷ → 1 → loss → 0 ✓
            predict ŷ → 0 → loss → ∞ ✗
When y = 0: predict ŷ → 0 → loss → 0 ✓
            predict ŷ → 1 → loss → ∞ ✗
The punishment for confident wrong predictions grows without bound: each tenfold drop in the probability assigned to the true class adds about 2.3 (= ln 10) to the loss.
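To make the penalty concrete, the per-sample loss can be tabulated for a few predictions. This is a small sketch; the function name `bce` and the chosen ŷ values are illustrative.

```python
import math

def bce(y, y_hat):
    """Binary cross-entropy for one sample (natural log)."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# True label y = 1: the loss rises steeply as the prediction approaches 0
for y_hat in (0.99, 0.9, 0.5, 0.1, 0.01, 0.001):
    print(f"y=1, y_hat={y_hat:<6} loss={bce(1, y_hat):.4f}")
```

Going from ŷ = 0.01 to ŷ = 0.001 raises the loss by ln 10, so the penalty keeps growing as the model becomes more confidently wrong.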

Full Forward-Backward Dataflow

LOGISTIC REGRESSION: COMPLETE TRAINING ITERATION

FORWARD PASS
  Inputs: x ∈ ℝⁿ, w ∈ ℝⁿ, b ∈ ℝ
  z = wᵀx + b  →  ŷ = σ(z)  →  ℒ(ŷ, y)

BACKWARD PASS
  dℒ/dℒ = 1
  dℒ/dŷ = −y/ŷ + (1−y)/(1−ŷ)
  dℒ/dz = ŷ − y        (σ′ cancels with BCE!)
  dℒ/dw = (ŷ−y)·x      dℒ/db = ŷ − y

UPDATE STEP
  w ← w − α · dw
  b ← b − α · db

Repeat for T epochs until convergence.
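One way to convince yourself that the backward-pass formulas are right is to compare them with finite differences of the forward pass. The sketch below does this for a single sample; the numeric values of w, b, x, and y are assumptions picked for illustration.

```python
import math

def forward(w, b, x, y):
    """Forward pass for one sample: returns (y_hat, loss)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    y_hat = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))
    return y_hat, loss

def backward(y_hat, x, y):
    """Analytic gradients from the backward pass: dz = y_hat - y."""
    dz = y_hat - y
    return [dz * xj for xj in x], dz

# Illustrative values (assumed for this sketch)
w, b, x, y = [0.5, -0.3], 0.1, [1.0, 2.0], 1

y_hat, loss = forward(w, b, x, y)
dw, db = backward(y_hat, x, y)

# Central finite differences as an independent check
h = 1e-6
dw_num = []
for j in range(len(w)):
    w_plus = list(w); w_plus[j] += h
    w_minus = list(w); w_minus[j] -= h
    dw_num.append((forward(w_plus, b, x, y)[1] - forward(w_minus, b, x, y)[1]) / (2 * h))
db_num = (forward(w, b + h, x, y)[1] - forward(w, b - h, x, y)[1]) / (2 * h)

print("analytic:", dw, db)
print("numeric: ", dw_num, db_num)
```

The two lines should agree to several decimal places, which confirms dℒ/dw = (ŷ−y)·x and dℒ/db = ŷ−y for this sample.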
Section 8

Common Misconceptions

❌ MYTH: "Logistic regression is a regression algorithm because it has 'regression' in the name."
✅ TRUTH: Logistic regression is used as a classification algorithm. It predicts probabilities P(y=1|x) ∈ (0,1) and classifies by thresholding. The "regression" in the name refers to fitting a linear model to the log-odds, log(ŷ/(1−ŷ)) = wᵀx + b, through the logistic (sigmoid) function; it does not mean the model predicts a continuous target.
🔍 WHY IT MATTERS: In GATE and interviews, this is a trick question. Know the distinction: linear regression predicts continuous values; logistic regression predicts probabilities for discrete classes.
❌ MYTH: "The sigmoid derivative σ′(z) = σ(z)(1 − σ(z)) is always 0.25."
✅ TRUTH: σ′(z) = 0.25 only at z = 0. For |z| > 5, σ′(z) ≈ 0 (vanishing gradient). The derivative takes values in (0, 0.25], reaching its maximum at z = 0.
🔍 WHY IT MATTERS: This vanishing gradient problem is why sigmoid is rarely used as a hidden layer activation in deep networks. ReLU and its variants mitigate it.
❌ MYTH: "You can use MSE loss for logistic regression; it'll work fine."
✅ TRUTH: MSE + sigmoid creates a non-convex loss surface that can have local minima and flat plateaus. BCE + sigmoid is convex, so any minimum that gradient descent reaches is a global one. Use BCE for binary classification.
🔍 WHY IT MATTERS: Using the wrong loss function means gradient descent may get stuck in local minima or converge to a suboptimal solution. The MLE-derived BCE is the statistically principled loss.
❌ MYTH: "Higher learning rate α always means faster training."
✅ TRUTH: If α is too large, gradient descent overshoots the minimum and the loss oscillates or diverges. If α is too small, training is painfully slow. The optimal α depends on the loss surface curvature.
🔍 WHY IT MATTERS: Learning rate is often the most important hyperparameter. In practice, use learning rate schedules or adaptive optimizers (Adam) to reduce the tuning burden.
❌ MYTH: "Sigmoid output of 0.7 means the model is 70% confident."
✅ TRUTH: Sigmoid outputs can be read as probabilities only if the model is calibrated. An uncalibrated model might output 0.7 for cases that are actually positive 90% of the time. Calibration (Platt scaling, isotonic regression) is needed to make outputs match true frequencies.
🔍 WHY IT MATTERS: In medical diagnosis and lending, miscalibrated probabilities lead to biased decisions. SBI's risk models require calibration validation by RBI auditors.
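The learning-rate myth above is easy to see on a toy problem. This is a minimal sketch, assuming the one-dimensional loss f(w) = w² (not the chapter's BCE cost) so that the effect of α is easy to follow; `run_gd` and the α values are illustrative choices.

```python
def run_gd(alpha, steps=20, w0=1.0):
    """Gradient descent on the toy loss f(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

for alpha in (0.01, 0.1, 0.5, 1.1):
    print(f"alpha={alpha:<5} final w = {run_gd(alpha):.6g}")
```

Each step multiplies w by (1 − 2α), so a small α creeps toward the minimum at w = 0, α = 0.5 lands on it in one step, and α = 1.1 overshoots further on every step and diverges.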
Section 9

GATE / Exam Corner

Key Formulas: Quick Reference Sheet

Concept | Formula | Notes
Sigmoid | σ(z) = 1/(1 + e^(−z)) | Maps ℝ → (0,1)
σ(0) | 0.5 | Decision boundary point
Symmetry | σ(−z) = 1 − σ(z) | Symmetric about (0, 0.5)
Derivative | σ′(z) = σ(z)(1 − σ(z)) | Max = 0.25 at z = 0
BCE Loss | ℒ = −[y·log ŷ + (1−y)·log(1−ŷ)] | Derived from MLE
Cost | J = (1/m) ∑ ℒ^(i) | Average over m samples
Key gradient | ∂ℒ/∂z = ŷ − y | Prediction error!
Weight gradient | ∂J/∂w = (1/m) X(Ŷ − Y)ᵀ | Vectorized form
GD Update | w := w − α·∂J/∂w | Learning rate α

GATE Previous Year Questions (PYQs) Pattern

GATE CS 2019: Q

If σ(z) = 1/(1 + e^(−z)) is the sigmoid function, which of the following is the derivative σ′(z)?

  A. σ(z)
  B. 1 − σ(z)
  C. σ(z) · (1 − σ(z))
  D. σ(z) + (1 − σ(z))
Answer: C. We derived this: σ′(z) = e^(−z)/(1 + e^(−z))² = σ(z)(1 − σ(z)). Option D always equals 1 (a common trap).
Remember | 1 Mark
GATE DA 2024: Predicted Pattern

The binary cross-entropy loss for a single sample with true label y = 1 and predicted probability ŷ = 0.2 is:

  A. −log(0.8) ≈ 0.2231
  B. −log(0.2) ≈ 1.6094
  C. (0.2 − 1)² = 0.64
  D. 0.2 × log(0.2) ≈ −0.3219
Answer: B. When y = 1, ℒ = −log(ŷ) = −log(0.2) ≈ 1.6094 (natural log). The model assigned only 20% probability to a positive case, so the penalty is heavy. Option A uses (1 − ŷ), which applies when y = 0. Option C is the squared error (the wrong loss). Option D confuses the formula.
Apply | 2 Marks
GATE ML 2023: Pattern

In logistic regression with gradient descent, the gradient ∂ℒ/∂z for a single sample simplifies to:

  A. σ(z) · (1 − σ(z))
  B. ŷ − y
  C. (ŷ − y)²
  D. y − ŷ
Answer: B. ∂ℒ/∂z = ŷ − y. The cancellation between the BCE derivative and the sigmoid derivative gives this simple result. Note: option D (y − ŷ) has the wrong sign and would turn the update into gradient ascent.
Understand | 2 Marks

GATE Prediction Table: What to Expect

Topic | Likelihood | Marks | Type
Sigmoid derivative | ⭐⭐⭐⭐⭐ | 1-2 | MCQ
BCE formula application | ⭐⭐⭐⭐ | 2 | NAT
Gradient descent update | ⭐⭐⭐⭐ | 2 | MCQ/NAT
Sigmoid properties (symmetry, limits) | ⭐⭐⭐ | 1 | MCQ
MLE → BCE derivation | ⭐⭐ | 2 | MCQ
Convexity of BCE vs MSE | ⭐⭐ | 1 | MCQ
Q: Why is BCE convex for logistic regression but MSE is not?
A: BCE = −(1/m) ∑ [y log σ(wᵀx+b) + (1−y) log(1−σ(wᵀx+b))]. The negative log-likelihood of the Bernoulli distribution composed with the sigmoid is provably convex in w and b (the Hessian is positive semi-definite). MSE = (1/m) ∑ (σ(wᵀx+b) − y)² is not convex: for a sample with y = 1, the second derivative of the per-sample loss with respect to z is 2ŷ(1−ŷ)²(3ŷ−1), which is negative whenever ŷ < 1/3. The sigmoid's saturation zones therefore create regions of negative curvature.
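The convexity claim can be probed numerically by estimating the second derivative of each per-sample loss along z. This is a sketch for the y = 1 case only; the helper names and the step size h are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_y1(z):
    """Per-sample BCE as a function of z when y = 1."""
    return -math.log(sigmoid(z))

def mse_y1(z):
    """Per-sample squared error as a function of z when y = 1."""
    return (sigmoid(z) - 1.0) ** 2

def second_diff(f, z, h=1e-3):
    """Central second difference, an estimate of f''(z)."""
    return (f(z + h) - 2.0 * f(z) + f(z - h)) / (h * h)

for z in (-4.0, -2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  BCE''={second_diff(bce_y1, z):+.5f}  MSE''={second_diff(mse_y1, z):+.5f}")
```

The BCE column stays positive everywhere, while the MSE column turns negative for negative z. Because z is linear in w and b, negative curvature in z carries over to the parameters, so the MSE cost cannot be convex.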
Section 10

Interview Prep: India + US Focus

Conceptual Questions

🎤 Q1: Why is logistic regression considered a neural network?

Expected Answer (Senior ML roles)

Logistic regression has the same architecture as a neural network with 0 hidden layers: input features → linear transformation z = wᵀx + b → sigmoid activation ŷ = σ(z). It uses the same training procedure: forward pass to compute predictions, BCE loss to measure error, backward pass (chain rule) to compute gradients, and gradient descent to update parameters. The only difference from a deep network is depth: logistic regression has depth 1. This is why Andrew Ng introduces neural networks through logistic regression in his courses.

🎤 Q2: Derive the gradient ∂ℒ/∂w for logistic regression from first principles.

Expected Answer

Chain rule: ∂ℒ/∂w = (∂ℒ/∂ŷ)(∂ŷ/∂z)(∂z/∂w). Compute each factor: ∂ℒ/∂ŷ = −y/ŷ + (1−y)/(1−ŷ). ∂ŷ/∂z = σ(z)(1−σ(z)) = ŷ(1−ŷ). ∂z/∂w = x. Multiply: [−y/ŷ + (1−y)/(1−ŷ)] · ŷ(1−ŷ) · x = [−y(1−ŷ) + (1−y)ŷ] · x = (ŷ−y)·x. For m samples, average: ∂J/∂w = (1/m) ∑ (ŷ^(i) − y^(i)) x^(i).

Interviewer follow-up: "Is it a coincidence that ∂ℒ/∂z = ŷ−y is so simple?" No. It is a consequence of the sigmoid being the inverse of the logit, the canonical link function for the Bernoulli distribution in generalized linear models.

🎤 Q3: When would you use logistic regression over a deep neural network?

India Context (Flipkart, Ola, SBI)

When you need interpretability (RBI regulation for credit scoring), when you have limited labeled data (<1000 samples, where deep networks tend to overfit), when you need fast inference (SBI YONO processes loans in 12s), or as a baseline before trying complex models. At Flipkart, logistic regression with hand-crafted features is still the first model in every recommender pipeline.

US Context (Google, Meta, Netflix)

At Google Ads, logistic regression (FTRL variant) handles billions of sparse features via online learning; deep networks struggle to match its training speed on streaming data. At Netflix, logistic regression serves as the calibration layer on top of deep network scores. At Meta, it's the production baseline that any new model must beat.

Coding Interview Questions

💻 Coding Q1: Implement sigmoid without using np.exp

# Hint: Use the identity ฯƒ(z) = 0.5 * (1 + tanh(z/2))
def sigmoid_via_tanh(z):
    return 0.5 * (1 + np.tanh(z / 2))

This works because tanh(z) = 2σ(2z) − 1, so σ(z) = (1 + tanh(z/2))/2. It also avoids the overflow that np.exp(-z) can hit for large negative z, because np.tanh saturates cleanly at ±1.

💻 Coding Q2: Implement BCE loss in one line, handling log(0)

def bce_loss(y, y_hat, eps=1e-7):
    y_hat = np.clip(y_hat, eps, 1 - eps)  # prevent log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

Key detail: np.clip is essential here. Without it, np.log(0) returns -inf, so the loss becomes inf (or NaN when -inf is multiplied by 0). Production code typically either clips predictions like this or computes the loss directly from the logits.
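An alternative to clipping is to compute the loss from the logits z rather than from ŷ. The sketch below relies on the identity ℒ = log(1 + e^z) − y·z, which follows from log σ(z) = z − log(1 + e^z) and log(1 − σ(z)) = −log(1 + e^z). The function name `bce_from_logits` and the sample values are assumptions for illustration.

```python
import numpy as np

def bce_from_logits(y, z):
    """Mean BCE computed from logits z = w.T @ X + b, without forming sigmoid(z).

    Uses the identity  loss = log(1 + exp(z)) - y * z,
    where np.logaddexp(0, z) evaluates log(1 + exp(z)) without overflow.
    """
    return np.mean(np.logaddexp(0.0, z) - y * z)

y = np.array([1.0, 0.0, 1.0, 0.0])
z = np.array([2.0, -1.0, -800.0, 800.0])   # the last two are extreme, confidently wrong logits

print(bce_from_logits(y, z))   # finite, no clipping constant needed
```

The trade-off is that this version needs access to z, so it fits a model that exposes its pre-activation; the clipped version works when only ŷ is available.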

🇮🇳 INDIA INTERVIEW FOCUS
  • GATE-style derivation questions (sigmoid derivative, BCE from MLE)
  • Explain logistic regression to a non-technical banking manager
  • Feature engineering for Indian datasets (CIBIL, Aadhaar, UPI)
  • Companies: Flipkart, Ola, Razorpay, Paytm, SBI, HDFC
  • Common: "Implement from scratch in NumPy"
🇺🇸 US INTERVIEW FOCUS
  • System design: "Design a CTR prediction system at scale"
  • Trade-offs: logistic reg vs. deep learning for production
  • Online learning: how to update LR on streaming data
  • Companies: Google, Meta, Netflix, Uber, Airbnb
  • Common: "When would you NOT use deep learning?"
Section 11

Hands-On Lab: SBI Loan Default Predictor

🧪 Mini-Project: Build a Complete Loan Default Prediction System

📋 Project Specification

Objective

Build a logistic regression model from scratch (no sklearn) that predicts whether an Indian bank loan applicant will default, using synthetic CIBIL-style data.

Requirements
  1. Generate a synthetic dataset with 1000 samples and 5 features (CIBIL score, income, EMI ratio, credit history, active loans)
  2. Implement the LogisticRegression class with all 8 methods (sigmoid, forward, compute_loss, backward, update, fit, predict, accuracy)
  3. Split data 80/20 into train/test sets
  4. Train for 3000 epochs and plot the loss curve
  5. Plot the decision boundary (use the 2 most important features)
  6. Report training and test accuracy
  7. Compare execution time: vectorized vs non-vectorized for 10,000 samples
  8. Bonus: Implement learning rate decay (α decreases over epochs)
Rubric (100 points)
Component | Points | Criteria
Correct LogisticRegression class | 30 | All 8 methods implemented correctly
Dataset generation | 10 | Realistic feature distributions
Training + convergence | 15 | Loss decreases monotonically, >70% accuracy
Loss curve plot | 10 | Clean, labeled matplotlib figure
Decision boundary plot | 15 | Shows data points, boundary line, color regions
Vectorized vs loop timing | 10 | Timing benchmark with >100× speedup shown
Code quality + comments | 10 | Clear variable names, docstrings, step-by-step comments
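As a possible starting point for requirements 1, 3, and 8, the sketch below generates synthetic data, splits it 80/20, and defines a decay schedule. Every distribution, coefficient, and name in it (including the ground-truth rule and `decayed_lr`) is a hypothetical choice made for illustration, not a description of real CIBIL or SBI data.

```python
import numpy as np

rng = np.random.default_rng(42)
m = 1000

# Hypothetical feature distributions, chosen only for illustration
cibil = rng.normal(700, 80, m).clip(300, 900)          # credit score
income = rng.lognormal(mean=13.0, sigma=0.5, size=m)   # annual income in rupees
emi_ratio = rng.uniform(0.05, 0.7, m)                  # EMI / monthly income
history_years = rng.uniform(0, 20, m)                  # length of credit history
active_loans = rng.poisson(1.5, m).astype(float)       # number of open loans

X_raw = np.vstack([cibil, income, emi_ratio, history_years, active_loans])  # (5, m)

# Assumed ground-truth rule: lower score, higher EMI ratio, more loans raise default risk
logit = -0.02 * (cibil - 700) + 4.0 * (emi_ratio - 0.35) + 0.4 * (active_loans - 1.5)
p_default = 1 / (1 + np.exp(-logit))
Y = (rng.uniform(size=m) < p_default).astype(float).reshape(1, m)

# 80/20 split on shuffled column indices
perm = rng.permutation(m)
train_idx, test_idx = perm[:800], perm[800:]

# Standardize with TRAINING statistics only, to avoid leaking test information
mu = X_raw[:, train_idx].mean(axis=1, keepdims=True)
sd = X_raw[:, train_idx].std(axis=1, keepdims=True)
X_train, Y_train = (X_raw[:, train_idx] - mu) / sd, Y[:, train_idx]
X_test, Y_test = (X_raw[:, test_idx] - mu) / sd, Y[:, test_idx]

def decayed_lr(alpha0, epoch, decay=0.001):
    """Inverse-time decay, one common schedule for the bonus task."""
    return alpha0 / (1 + decay * epoch)

print(X_train.shape, Y_train.shape, X_test.shape, Y_test.shape)
```

Standardizing matters here because income is several orders of magnitude larger than the other features; without it, a single learning rate would suit some weights and not others.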
Section 12

Exercises: 22 Problems Across 5 Categories

Section A: Conceptual Questions (5)

A1 Beginner

State the five key properties of the sigmoid function σ(z) = 1/(1 + e^(−z)).

Answer: (1) σ(0) = 0.5, (2) Range: (0,1), (3) Symmetry: σ(−z) = 1−σ(z), (4) Limits: σ(z)→1 as z→+∞, σ(z)→0 as z→−∞, (5) Derivative: σ′(z) = σ(z)(1−σ(z)), max value 0.25 at z=0.
Remember
A2 Beginner

Explain in 2-3 sentences why the perceptron (step function) cannot be trained using gradient descent.

Answer: The step function's derivative is zero everywhere except at the discontinuity point, where it's undefined. Since gradient descent requires ∂Loss/∂w = (∂Loss/∂ŷ)·(∂ŷ/∂z)·(∂z/∂w), and ∂ŷ/∂z = 0, the weight update is always zero, so the model cannot learn.
Understand
A3 Intermediate

Why do we derive the BCE loss from MLE rather than just using MSE for classification?

Answer: MLE gives us the statistically principled loss function. For binary labels following a Bernoulli distribution, the negative log-likelihood is exactly BCE. This loss is convex when composed with sigmoid, guaranteeing a global minimum. MSE with sigmoid is non-convex, creating local minima that gradient descent can get stuck in.
Understand
A4 Intermediate

What is the difference between the "loss" ℒ and the "cost" J in logistic regression?

Answer: The loss ℒ(ŷ, y) is computed for a single training example. The cost J(w, b) is the average loss over all m training examples: J = (1/m) ∑ ℒ^(i). We minimize the cost (not the individual loss), and the gradients ∂J/∂w average the per-sample gradients.
Understand
A5 Intermediate

In the computation graph, explain why ∂ℒ/∂z = ŷ − y simplifies so beautifully. Is this a coincidence?

Answer: No coincidence. The sigmoid is the inverse of the logit, the "canonical link function" for the Bernoulli distribution in the Generalized Linear Model (GLM) framework. When you pair the canonical link with the corresponding distribution's negative log-likelihood, the gradient with respect to z always simplifies to (prediction − truth). The same pattern appears in other pairings: softmax + cross-entropy, identity + MSE, etc.
Analyze

Section B: Mathematical Problems (8)

B1 Beginner

Compute σ(3) and σ(−3). Verify that σ(3) + σ(−3) = 1.

Answer: σ(3) = 1/(1+e^(−3)) = 1/(1+0.0498) ≈ 0.9526. σ(−3) = 1/(1+e^3) = 1/(1+20.086) ≈ 0.0474. Sum = 0.9526 + 0.0474 = 1.0000 ✓. This verifies the symmetry property σ(−z) = 1 − σ(z).
Apply
B2 Intermediate

Prove that the sigmoid derivative can be written as: σ′(z) = σ(z) · σ(−z).

Answer: We know σ′(z) = σ(z)(1−σ(z)). By the symmetry property, 1−σ(z) = σ(−z). Therefore σ′(z) = σ(z)·σ(−z) ✓
Apply
B3 Intermediate

For a single sample with x = [3, 1]ᵀ, y = 1, w = [0.2, −0.1]ᵀ, b = 0.1, compute: (a) z, (b) ŷ, (c) ℒ, (d) ∂ℒ/∂w₁, (e) ∂ℒ/∂w₂, (f) ∂ℒ/∂b.

Answer: (a) z = 0.2·3 + (−0.1)·1 + 0.1 = 0.6 − 0.1 + 0.1 = 0.6. (b) ŷ = σ(0.6) = 1/(1+e^(−0.6)) ≈ 0.6457. (c) ℒ = −log(0.6457) ≈ 0.4375. (d) dz = ŷ−y = 0.6457−1 = −0.3543. ∂ℒ/∂w₁ = dz·x₁ = −0.3543·3 ≈ −1.063. (e) ∂ℒ/∂w₂ = −0.3543·1 = −0.3543. (f) ∂ℒ/∂b = −0.3543.
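The hand computation in B3 can be reproduced in a few lines. This is only a checking aid that uses the values from the problem statement and the standard library.

```python
import math

x, y = [3.0, 1.0], 1
w, b = [0.2, -0.1], 0.1

z = w[0] * x[0] + w[1] * x[1] + b            # (a)
y_hat = 1.0 / (1.0 + math.exp(-z))           # (b)
loss = -math.log(y_hat)                      # (c), since y = 1
dz = y_hat - y
dw1, dw2, db = dz * x[0], dz * x[1], dz      # (d), (e), (f)

print(f"z={z:.4f} y_hat={y_hat:.4f} loss={loss:.4f}")
print(f"dw1={dw1:.4f} dw2={dw2:.4f} db={db:.4f}")
```

All three gradients are negative, so a gradient-descent step increases w₁, w₂, and b, pushing ŷ toward the true label y = 1.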
Apply
B4 Intermediate

If σ(z) = 0.8, compute σ′(z) without finding z first.

Answer: σ′(z) = σ(z)(1−σ(z)) = 0.8 × 0.2 = 0.16. The beauty of this formula: you don't need to know z, just σ(z)!
Apply
B5 Intermediate

Derive the Bernoulli likelihood for a dataset with y = [1, 0, 1, 1, 0] and ŷ = [0.9, 0.3, 0.7, 0.8, 0.1]. Compute the log-likelihood and the BCE loss.

Answer: Each sample contributes ŷ when y = 1 and (1 − ŷ) when y = 0, so the factors are 0.9, 0.7, 0.7, 0.8, 0.9. L = 0.9 × 0.7 × 0.7 × 0.8 × 0.9 = 0.3175. log L = log(0.9) + log(0.7) + log(0.7) + log(0.8) + log(0.9) = −0.1054 + (−0.3567) + (−0.3567) + (−0.2231) + (−0.1054) = −1.1473. BCE = −(1/5)(−1.1473) = 0.2295.
Apply
B6 Advanced

Show that the Hessian ∂²J/∂w∂wᵀ for logistic regression is positive semi-definite, proving that J is convex.

Answer: H = (1/m) ∑ σ(z^(i))(1−σ(z^(i))) · x^(i) x^(i)ᵀ. Let s_i = σ(z^(i))(1−σ(z^(i))) > 0 (since σ ∈ (0,1)). Then H = (1/m) X·diag(s)·Xᵀ. For any vector v: vᵀHv = (1/m) ∑ s_i (vᵀx^(i))² ≥ 0. Thus H is PSD and J is convex. ✓
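The proof in B6 can be spot-checked numerically by building H for random data and inspecting its eigenvalues. This is a sketch under assumed shapes (3 features, 50 samples) and an arbitrary seed; it illustrates the result and does not replace the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 50
X = rng.normal(size=(n, m))        # columns are samples
w = rng.normal(size=(n, 1))
b = 0.2

z = w.T @ X + b                    # (1, m)
s = 1 / (1 + np.exp(-z))
s = (s * (1 - s)).ravel()          # s_i = sigma(z_i) * (1 - sigma(z_i)), all positive

H = (X * s) @ X.T / m              # same as (1/m) X diag(s) X^T, without building diag(s)
eigvals = np.linalg.eigvalsh(H)    # H is symmetric, so eigvalsh applies

print("smallest eigenvalue:", eigvals.min())
```

A non-negative smallest eigenvalue (up to rounding) matches the conclusion that H is positive semi-definite at this particular w and b.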
Analyze
B7 Advanced

Derive the gradient descent update rule for logistic regression with L2 regularization: J_reg = J + (λ/2m)‖w‖².

Answer: ∂J_reg/∂w = ∂J/∂w + (λ/m)w = (1/m)X(Ŷ−Y)ᵀ + (λ/m)w. Update: w := w − α[(1/m)X(Ŷ−Y)ᵀ + (λ/m)w] = w(1 − αλ/m) − (α/m)X(Ŷ−Y)ᵀ. The factor (1 − αλ/m) shrinks the weights at each step; this is "weight decay." The penalty involves only w, so the update for b is unchanged.
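The two forms of the regularized update in B7 can be compared directly. This is a minimal sketch; the function name `l2_step`, the random data, and the values of α and λ are assumptions for illustration.

```python
import numpy as np

def l2_step(X, Y, w, b, alpha, lam):
    """One gradient-descent step on J_reg = J + (lam / (2m)) * ||w||^2."""
    m = X.shape[1]
    y_hat = 1 / (1 + np.exp(-(w.T @ X + b)))
    dz = y_hat - Y
    dw = (X @ dz.T) / m + (lam / m) * w    # data gradient + penalty gradient
    db = np.sum(dz) / m                    # bias is not penalized
    return w - alpha * dw, b - alpha * db

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 40))
Y = rng.integers(0, 2, size=(1, 40)).astype(float)
w, b = rng.normal(size=(3, 1)), 0.0
alpha, lam = 0.1, 2.0

w_new, b_new = l2_step(X, Y, w, b, alpha, lam)

# Equivalent "weight decay" form: shrink w first, then apply the unregularized gradient
m = X.shape[1]
dz = 1 / (1 + np.exp(-(w.T @ X + b))) - Y
w_decay = w * (1 - alpha * lam / m) - (alpha / m) * (X @ dz.T)

print(np.allclose(w_new, w_decay))
```

Both routes give the same weights, which is the algebraic identity shown in the answer.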
Analyze
B8 Advanced

Show that σ(z) can be expressed as: σ(z) = ½ + ½·tanh(z/2). Verify for z = 0, z = 2.

Answer: tanh(z/2) = (e^(z/2) − e^(−z/2))/(e^(z/2) + e^(−z/2)). Multiply top and bottom by e^(z/2): = (e^z − 1)/(e^z + 1). So ½ + ½·tanh(z/2) = ½ + (e^z − 1)/(2(e^z + 1)) = (e^z + 1 + e^z − 1)/(2(e^z + 1)) = e^z/(e^z + 1) = 1/(1 + e^(−z)) = σ(z) ✓. Verify: z=0: ½ + ½·0 = 0.5 ✓. z=2: ½ + ½·tanh(1) = 0.5 + 0.5·0.7616 = 0.8808 = σ(2) ✓
Apply

Section C: Coding Problems (4)

C1 Intermediate

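Implement a function predict_proba(X, w, b) that takes feature matrix X (shape n×m), weights w (shape n×1), and bias b, and returns predicted probabilities using the numerically stable sigmoid.

Answer: def predict_proba(X, w, b): z = w.T @ X + b; return np.where(z >= 0, 1/(1+np.exp(-z)), np.exp(z)/(1+np.exp(z))). Note that np.where evaluates both branches for every element, so NumPy can still emit overflow warnings for large |z| even though the selected values are correct.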
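A variant that avoids the overflow altogether applies each formula only to the elements it is safe for, using boolean masks. This is a sketch of that approach; the helper name `stable_sigmoid` is introduced here and is not part of the chapter's class.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers, so exp cannot overflow."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # here -z <= 0
    ez = np.exp(z[~pos])                       # here z < 0
    out[~pos] = ez / (1.0 + ez)
    return out

def predict_proba(X, w, b):
    """X: (n, m), w: (n, 1), b: scalar -> probabilities of shape (1, m)."""
    return stable_sigmoid(w.T @ X + b)

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

The masked version costs a little extra indexing, but it never computes a value it is going to discard.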
Apply
C2 Intermediate

Write a function that computes the gradient ∂J/∂w and ∂J/∂b for m samples in a fully vectorized way (no loops). Verify against the loop version for correctness.

Answer: def grad(X, Y, w, b): m=X.shape[1]; z=w.T@X+b; a=sigmoid(z); dz=a-Y; dw=(1/m)*X@dz.T; db=(1/m)*np.sum(dz); return dw, db
Apply
C3 Advanced

Extend the LogisticRegression class to support mini-batch gradient descent. Add a batch_size parameter to fit() and randomly sample batches each epoch.

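Answer: In the fit loop, draw random indices for each update: idx = np.random.choice(m, batch_size, replace=False); use X[:, idx] and Y[:, idx] for forward/backward. Each update then touches only batch_size samples, which reduces memory usage and per-step cost. The added stochasticity can help escape local optima in non-convex models; for logistic regression the cost is convex, so the main benefit is cheaper updates.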
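A standalone sketch of the idea follows. It uses a common variant of the sampling described above: the data is reshuffled once per epoch and then walked through in consecutive batches, so every sample is used exactly once per epoch. The function name `fit_minibatch`, the hyperparameters, and the toy data are assumptions for illustration.

```python
import numpy as np

def fit_minibatch(X, Y, lr=0.1, epochs=50, batch_size=32, seed=0):
    """Mini-batch gradient descent for logistic regression (X: (n, m), Y: (1, m))."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w, b = np.zeros((n, 1)), 0.0
    for _ in range(epochs):
        perm = rng.permutation(m)                  # reshuffle once per epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]   # last batch may be smaller
            Xb, Yb = X[:, idx], Y[:, idx]
            y_hat = 1 / (1 + np.exp(-(w.T @ Xb + b)))
            dz = y_hat - Yb
            w -= lr * (Xb @ dz.T) / idx.size
            b -= lr * np.sum(dz) / idx.size
    return w, b

# Toy, linearly separable data (illustrative only)
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 500))
Y = (X[0] - X[1] > 0).astype(float).reshape(1, -1)

w, b = fit_minibatch(X, Y)
acc = np.mean(((w.T @ X + b) > 0) == (Y == 1))
print(f"training accuracy: {acc:.3f}")
```

Dividing by idx.size rather than batch_size keeps the gradient an average even when the final batch of an epoch is smaller.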
Create
C4 Advanced

Implement a plot_decision_boundary(model, X, Y) function that creates a meshgrid, computes predictions over it, and plots the decision boundary with a contour plot overlaid with data points.

Answer: Create meshgrid from min/max of X[0] and X[1], stack into a (2, grid_size²) matrix, forward pass through model, reshape predictions to grid shape, use plt.contourf + plt.scatter. See Section 6c code for the complete implementation.
Create

Section D: Critical Thinking (3)

D1 Advanced

SBI uses logistic regression for loan approval because RBI mandates explainable models. But a gradient boosted tree (XGBoost) gives 5% higher accuracy. As the chief data scientist, how would you argue for or against switching?

Answer: Key trade-offs: (1) Regulatory: RBI requires explainability; LR weights are directly interpretable, while XGBoost needs SHAP/LIME for post-hoc explanations. (2) Risk: as a rough upper bound, if a 5% accuracy gain translated one-for-one into avoided losses on a ₹2.5 lakh crore portfolio, that would be ₹12,500 crore saved from NPAs per year; the real saving depends on how accuracy maps to avoided defaults. (3) Compromise: Use LR as the production model for approval decisions (regulatory compliance) but use XGBoost as a second-pass model for risk scoring in the "uncertain zone" (15%-40% default probability). This two-stage arrangement (a model cascade, sometimes loosely called "model stacking") addresses both accuracy and regulatory needs.
Evaluate
D2 Advanced

Google's ad CTR model must make predictions in <5ms. A logistic regression with 1 billion sparse features takes 2ms. A transformer model with 340M parameters gives 3% better CTR prediction but takes 50ms. Which would you deploy and why?

Answer: Deploy LR. The transformer's 50ms is 10× over the 5ms budget and 25× slower than LR's 2ms. At 8.5B daily queries, 50ms per query adds up to 425M seconds of user waiting time daily. A widely cited industry rule of thumb holds that every 100ms of added latency costs about 1% of revenue, so the 3% CTR improvement from the transformer is unlikely to pay for the slowdown. Alternative: Use the transformer to generate features offline (user embeddings, query intent scores), then feed those as inputs to the logistic regression. This can capture much of the accuracy gain within the latency budget.
Evaluate
D3 Advanced

The maximum value of σ′(z) is 0.25. In a deep network with L sigmoid layers, the gradient flowing back through all layers is multiplied by σ′ at each layer. What is the maximum possible gradient magnitude at the first layer for L = 10? L = 50? What does this imply?

Answer: Ignoring the weight factors, the maximum gradient is (0.25)^L. For L=10: (0.25)^10 ≈ 9.5 × 10^−7. For L=50: (0.25)^50 ≈ 7.9 × 10^−31. This is the vanishing gradient problem: gradients become astronomically small in deep sigmoid networks, making early layers virtually untrainable. This is why ReLU (max gradient = 1) replaced sigmoid in hidden layers of deep networks.
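The bound is quick to tabulate. The helper name `max_sigmoid_gradient` is introduced here for illustration, and the bound covers only the product of activation derivatives, with weights left out.

```python
def max_sigmoid_gradient(L):
    """Upper bound on the product of L sigmoid derivatives (weights ignored)."""
    return 0.25 ** L

for L in (1, 5, 10, 50):
    print(f"L={L:>2}: bound = {max_sigmoid_gradient(L):.3e}")
```

Each extra sigmoid layer divides the best-case gradient by four, so the decay is geometric in depth.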
Analyze

★ Section E: Starred Research Problems (2)

★ E1 Advanced

Read McMahan et al. (2013), "Ad Click Prediction: a View from the Trenches" (Google). This paper describes the FTRL-Proximal optimizer used for training logistic regression on billions of features. Summarize: (a) Why does standard gradient descent fail at Google's scale? (b) How does FTRL-Proximal achieve sparsity? (c) What is the per-coordinate learning rate schedule?

Reference: KDD 2013. Key insights: (a) With billions of features, full gradient updates are too expensive; FTRL processes one example at a time (online learning). (b) FTRL combines L1 regularization with a "follow-the-regularized-leader" update that sets many weights exactly to zero, enabling sparse models. (c) η_{t,i} = α/(β + √(∑_{τ=1..t} g_{τ,i}²)), a per-feature adaptive rate that decreases as the feature's accumulated squared gradient grows.
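The schedule in part (c) can be sketched on its own. The class below tracks only the per-coordinate rate; it is not the full FTRL-Proximal update, and the class name, the feature names, and the α and β values are hypothetical.

```python
import math

class PerCoordinateRate:
    """Tracks eta_i = alpha / (beta + sqrt(sum of squared gradients for feature i))."""

    def __init__(self, alpha=0.1, beta=1.0):
        self.alpha, self.beta = alpha, beta
        self.sq_sum = {}                      # feature id -> accumulated g^2 (sparse)

    def update(self, grads):
        """grads: dict {feature_id: gradient} for the features active in one example."""
        rates = {}
        for i, g in grads.items():
            self.sq_sum[i] = self.sq_sum.get(i, 0.0) + g * g
            rates[i] = self.alpha / (self.beta + math.sqrt(self.sq_sum[i]))
        return rates

sched = PerCoordinateRate(alpha=0.1, beta=1.0)
print(sched.update({"city=Pune": 0.5}))
print(sched.update({"city=Pune": 0.5, "device=mobile": -0.2}))
```

A feature seen often accumulates gradient and gets a smaller step, while a rare feature keeps a larger one. Storing the sums in a dictionary means only features that have actually appeared use memory.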
Create | Research
★ E2 Advanced

The sigmoid function σ(z) = 1/(1 + e^(−z)) is one of many possible S-shaped functions. Research and compare at least three alternatives: (a) tanh, (b) probit (Φ(z), the CDF of the standard normal), (c) algebraic sigmoid z/(1+|z|). For each, derive the range, compute the derivative, and discuss when you'd prefer it over the standard sigmoid.

Hints: (a) tanh: range (−1,1), derivative 1−tanh²(z), zero-centered output; preferred in hidden layers over sigmoid. (b) Probit: range (0,1), derivative φ(z) = e^(−z²/2)/√(2π) (the standard normal density); it is the CDF Φ itself that has no closed form. Used in Bayesian models and econometrics. (c) Algebraic: range (−1,1), derivative 1/(1+|z|)², computationally cheaper; used in embedded systems. The standard sigmoid's advantage is that σ′ = σ(1−σ) makes backprop analytically elegant.
Create | Research
Section 13

Connections: Where This Fits

🔗 Chapter Connections Map

← Builds On
  • Ch 0 (Course Overview): The bird's-eye view of what neural networks are
  • Ch 2 (Math Toolkit): Derivatives, chain rule, vectors, matrices; all used in gradient derivations
→ Enables
  • Ch 4 (Shallow Neural Networks): Stack multiple logistic regression neurons into layers → multi-layer perceptron
  • Ch 5 (Deep Neural Networks): The forward-backward-update loop scales directly to L layers
  • Ch 6 (Optimization): Learning rate, momentum, Adam; all build on gradient descent from this chapter
🔬 Research Frontier
  • Neural Tangent Kernels (Jacot et al., 2018): At infinite width, training a neural network by gradient descent behaves like training a linear model in a fixed kernel feature space
  • Calibration (Guo et al., 2017): Modern deep networks are often poorly calibrated: their output probabilities don't match true frequencies. Temperature scaling (a one-parameter relative of post-hoc logistic regression) corrects much of this
🏭 Industry Implementations
  • sklearn.linear_model.LogisticRegression: Uses LBFGS or liblinear solvers (faster than GD for convex problems)
  • Google FTRL-Proximal: Online logistic regression at billion-feature scale
  • Facebook/Meta DLRM: Deep Learning Recommendation Model uses logistic regression as its final output layer
CHAPTER DEPENDENCY MAP
Ch 0: Overview, Ch 2: Math Toolkit → [CH 3: LOGISTIC REGRESSION] → Ch 4: Shallow NN, Ch 5: Deep NN, Ch 6: Optimization
Ch 3 topics: sigmoid, BCE from MLE, gradient descent, computation graph, forward/backward, single neuron NN
Section 14

Chapter Summary: 7 Key Takeaways

📝 What You Learned in Chapter 3

  1. From Step to Smooth: The perceptron's step function has zero derivative almost everywhere, blocking gradient-based learning. The sigmoid σ(z) = 1/(1 + e^(−z)) is smooth, differentiable, and maps ℝ → (0,1), enabling gradient descent.
  2. Sigmoid is Self-Referential: Its derivative σ′(z) = σ(z)(1−σ(z)) is expressed entirely in terms of the function itself, so there is no need to recompute z. Maximum derivative is 0.25 at z=0; it vanishes for |z| ≫ 0.
  3. Logistic Regression IS a Neural Network: A single neuron with n inputs, sigmoid activation, trained with BCE loss and gradient descent. Every concept from this chapter (forward pass, loss, backprop, GD update) scales directly to deep networks.
  4. BCE from First Principles: Bernoulli → Likelihood → Log-Likelihood → Negate → Average = Binary Cross-Entropy. This is the loss that maximum likelihood prescribes for Bernoulli-distributed labels.
  5. The Beautiful Gradient: ∂ℒ/∂z = ŷ − y. The gradient of the loss with respect to the pre-activation is simply the prediction error. This is not a coincidence: it is a property of canonical link functions in GLMs.
  6. Vectorize Everything: Loop-based gradient computation was ~1500× slower than vectorized NumPy operations in this chapter's benchmark. In production ML, vectorization is not optional.
  7. Real-World Impact: SBI uses logistic regression for ₹10L+ loan decisions (12-second approvals). Google uses it for $175B/year in ad revenue (8.5B daily predictions). The simplest neural network is also the most widely deployed.

Key Equation

σ(z) = 1/(1 + e^(−z))   |   J = −(1/m) ∑[y log ŷ + (1−y) log(1−ŷ)]   |   dz = ŷ − y

Key Intuition

🧠 Logistic regression is a single neuron that converts a linear combination of inputs into a probability, learns by comparing its predictions to truth (BCE), and adjusts its weights to reduce errors (gradient descent). Every deep neural network is just many of these neurons, connected in layers, trained with the same four-step loop: forward → loss → backward → update.

Section 15

Further Reading

🇮🇳 INDIAN RESOURCES
  • NPTEL: "Deep Learning" by Prof. Mitesh Khapra (IIT Madras): Lectures 5-7 cover logistic regression and gradient descent with excellent visualizations
  • NPTEL: "Machine Learning" by Prof. Balaji Srinivasan (IIT Madras): statistical perspective on logistic regression and MLE
  • GATE PYQ Book: Made Easy / ACE Academy ML sections, with sigmoid and BCE derivation questions from 2018-2025
  • IISc Bangalore: PRML course notes by Prof. Chiranjib Bhattacharyya, a rigorous mathematical treatment
🌍 GLOBAL RESOURCES
  • Andrew Ng (Coursera): "Neural Networks and Deep Learning" Week 2, the course that inspired this chapter's structure
  • 3Blue1Brown: "But what is a Neural Network?", an outstanding visualization of the computation graph
  • Distill.pub: "A Visual Introduction to Machine Learning", with interactive sigmoid and loss visualizations
  • Paper: McMahan et al. (2013), "Ad Click Prediction: a View from the Trenches", Google's FTRL-Proximal paper
  • Book: Bishop, "Pattern Recognition and Machine Learning", Chapter 4 on linear models for classification
  • Paper: Guo et al. (2017), "On Calibration of Modern Neural Networks", ICML, on why sigmoid outputs ≠ calibrated probabilities
"Scaling Laws for Neural Language Models" (Kaplan et al., 2020, OpenAI): In a fascinating connection to this chapter, the cross-entropy loss of large language models follows predictable power-law scaling with model size, data size, and compute. The BCE loss you derived for a single neuron is the two-class case of the same cross-entropy objective used to train large language models such as GPT-4. The journey from one neuron to models with billions of parameters is a journey of scale, not of fundamentally different mathematics. arXiv: 2001.08361

End of Chapter 3

Next: Chapter 4 → Shallow Neural Networks: What happens when you stack multiple neurons into layers?