Neural Networks & Deep Learning

Chapter 5: Logistic Regression

The Neural Network's First Building Block

Reading Time: ~3 hours  |  Part II: The Single Neuron  |  Theory + Code

Prerequisites: Ch 2 (Math Toolkit), Ch 3 (Python & NumPy), Ch 4 (The Neuron)

Bloom's Taxonomy Map for This Chapter

Bloom's Level | What You'll Achieve
------------- | -------------------
Remember      | Recall the sigmoid function formula, its range, and key properties (σ(0) = 0.5, symmetry)
Understand    | Explain why binary cross-entropy is the correct loss for classification, derived from maximum likelihood
Apply         | Implement a complete LogisticRegression class from scratch in NumPy and train it on data
Analyze       | Trace the computation graph: compute forward pass outputs and backward pass gradients by hand
Evaluate      | Compare vectorized vs loop-based implementations and assess numerical stability trade-offs
Create        | Design and train a loan-default predictor for an Indian banking dataset from scratch

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Define logistic regression as a linear model composed with a sigmoid activation for binary classification
  • Derive the sigmoid function σ(z) = 1/(1 + e^(−z)) and prove its derivative σ′(z) = σ(z)(1 − σ(z))
  • Derive the Binary Cross-Entropy loss from first principles using Maximum Likelihood Estimation
  • Construct the computation graph for logistic regression and perform forward & backward passes
  • Derive the gradient descent update rules ∂L/∂w and ∂L/∂b step by step
  • Implement a complete LogisticRegression class from scratch using only NumPy
  • Train the model on a synthetic Indian bank loan dataset and visualize the decision boundary
  • Compare vectorized (NumPy) vs loop-based implementations for speed and clarity
  • Execute a full forward + backward pass by hand on a worked example with 2 features and 3 samples
  • Evaluate logistic regression in a real-world case study: CIBIL score prediction at SBI
Section 2

Opening Hook: The ₹2 Lakh Crore Question

"Should we approve this loan?" Bajaj Finance processes 30,000+ loan applications every single day.

It's a Monday morning in Pune. Ramesh, a 28-year-old software engineer at Infosys, opens the Bajaj Finserv app and applies for a ₹5,00,000 personal loan. Within 12 seconds, the app responds: "Congratulations! Your loan is approved."

Meanwhile, in another part of the city, Priya (also 28, also in IT) applies for the same amount. She gets: "We regret to inform you that your application was not approved at this time."

What happened in those 12 seconds? No human reviewed either application. A machine learning model (at its core, a logistic regression) consumed ~47 features (CIBIL score, salary, existing EMIs, employer stability, spending patterns, UPI transaction history) and output a single number between 0 and 1: the probability of default.

If P(default) < 0.15 → Approve. If P(default) > 0.40 → Reject. In between → send to a human underwriter.
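The three-way rule above can be sketched as a small function. This is only an illustration of the thresholds quoted in the story; the function name `triage` and the label strings are hypothetical, and a probability exactly at 0.15 or 0.40 is assumed to fall in the "in between" band.

```python
def triage(p_default, approve_below=0.15, reject_above=0.40):
    """Map a predicted default probability to a three-way decision."""
    if not 0.0 <= p_default <= 1.0:
        raise ValueError("p_default must be a probability in [0, 1]")
    if p_default < approve_below:
        return "approve"
    if p_default > reject_above:
        return "reject"
    return "manual_review"   # everything in between goes to a human underwriter


for p in (0.05, 0.27, 0.62):
    print(p, "->", triage(p))
```

The rest of the chapter is about where the number `p_default` comes from.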

Bajaj Finance's loan book is ₹2,47,000 crore. A 1% improvement in default prediction accuracy is worth on the order of ₹2,470 crore per year (1% of that book). That's the power of the humble logistic regression: the simplest neural network, and the foundation of everything that follows in this book.

Bajaj Finance | SBI | CIBIL | Paytm | Infosys
India's consumer lending market crossed ₹40 lakh crore in 2025. Every major lender (SBI, HDFC, ICICI, Bajaj Finance, Paytm) uses logistic regression (or its gradient-boosted descendant) as the first-line model for credit risk scoring. The RBI mandates that banks must have "explainable" models for credit decisions, and logistic regression's interpretability (on standardized features, each weight shows how strongly a feature pushes the prediction) makes it the regulator's favorite. This is why logistic regression isn't just textbook math: it's the backbone of India's lending industry.
The word "logistic" comes from the Belgian mathematician Pierre François Verhulst, who coined the term courbe logistique in 1845 for population growth curves. The sigmoid shape models how populations grow rapidly and then saturate, which is exactly the S-curve we use for probability in classification. It has nothing to do with "logistics" (shipping/supply chain).
Section 3

Core Concepts: The Mathematics of Binary Classification

Logistic regression answers one question: given input features x, what is the probability that the output belongs to class 1? It does this in three steps: (1) compute a linear combination z = w·x + b, (2) squash it through the sigmoid function to get a probability, and (3) compare that probability to the true label using cross-entropy loss. Let's derive each piece rigorously.

3a. The Sigmoid Function: From Linear to Probability

The Problem: Linear Outputs Are Unbounded

In Chapter 4, we saw that a neuron computes z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b. This output z ∈ (−∞, +∞). But for binary classification, we need a probability: a number in [0, 1]. We need a function that maps ℝ → (0, 1).

Definition: The Sigmoid (Logistic) Function

σ(z) = 1 / (1 + e^(−z))
Domain: z ∈ (−∞, +∞)   →   Range: σ(z) ∈ (0, 1)

Key Properties of Sigmoid

Sigmoid Properties: Derived, Not Memorized

Property 1: σ(0) = 0.5

σ(0) = 1/(1 + e⁰) = 1/(1 + 1) = 1/2 = 0.5. This is the "undecided" point: the model is equally uncertain about both classes.

Property 2: Symmetry, σ(−z) = 1 − σ(z)

Proof: σ(−z) = 1/(1 + e^z) = e^(−z)/(e^(−z) + 1) = (1 + e^(−z) − 1)/(1 + e^(−z)) = 1 − 1/(1 + e^(−z)) = 1 − σ(z) ✓

Property 3: Limits

As z → +∞: e^(−z) → 0, so σ(z) → 1/(1+0) = 1
As z → −∞: e^(−z) → ∞, so σ(z) → 1/∞ = 0
The sigmoid asymptotically approaches 0 and 1 but never reaches them, so outputs are always strictly in (0, 1).

Property 4: The Elegant Derivative, σ′(z) = σ(z)(1 − σ(z))

This is the most important property for backpropagation. Let's derive it step by step:

σ(z) = (1 + e^(−z))^(−1)

Using the chain rule:

σ′(z) = −1 · (1 + e^(−z))^(−2) · (−e^(−z))

σ′(z) = e^(−z) / (1 + e^(−z))²

Now notice: σ(z) · (1 − σ(z)) = [1/(1 + e^(−z))] · [e^(−z)/(1 + e^(−z))] = e^(−z)/(1 + e^(−z))² ✓

Therefore: σ′(z) = σ(z)(1 − σ(z))

Property 5: Maximum Derivative at z = 0

σ′(0) = 0.5 × 0.5 = 0.25. The sigmoid changes fastest at z = 0 (the decision boundary). At the extremes (z = ±10), σ′ ≈ 0: these are the saturation regions where gradients vanish.

Sigmoid Value Table

z     | −6     | −4    | −2    | −1    | 0     | 1     | 2     | 4     | 6
σ(z)  | 0.0025 | 0.018 | 0.119 | 0.269 | 0.500 | 0.731 | 0.881 | 0.982 | 0.9975
σ′(z) | 0.0025 | 0.018 | 0.105 | 0.197 | 0.250 | 0.197 | 0.105 | 0.018 | 0.0025

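Numerical Stability: Avoid computing 1 / (1 + np.exp(-z)) naively. When z is a large negative number (say z = −1000), np.exp(-z) evaluates np.exp(1000), which overflows to inf and triggers a RuntimeWarning. A more stable form picks the formula by the sign of z: np.where(z >= 0, 1/(1+np.exp(-z)), np.exp(z)/(1+np.exp(z))). Because np.where evaluates both branches for every element, clip z first (as the _sigmoid method in Section 4 does) so the unused branch cannot overflow. Or simply use from scipy.special import expit.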
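The five properties and the stability note can be checked numerically. The sketch below is one possible stable formulation, not the only one: the helper name `stable_sigmoid` is hypothetical, and it relies on the fact that exp(−|z|) never exceeds 1, so neither branch can overflow.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates a non-positive number."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))            # always in [0, 1], so no overflow
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z = np.array([-1000.0, -6.0, -1.0, 0.0, 1.0, 6.0, 1000.0])
s = stable_sigmoid(z)

# Property 1: sigma(0) = 0.5
assert s[3] == 0.5
# Property 2: sigma(-z) = 1 - sigma(z)
assert np.allclose(stable_sigmoid(-z), 1.0 - s)

# Property 4: sigma'(z) = sigma(z) * (1 - sigma(z)), checked by central differences
h = 1e-5
zs = np.linspace(-6, 6, 13)           # -6, -5, ..., 6
numeric = (stable_sigmoid(zs + h) - stable_sigmoid(zs - h)) / (2 * h)
analytic = stable_sigmoid(zs) * (1 - stable_sigmoid(zs))
assert np.allclose(numeric, analytic, atol=1e-8)

# Property 5: the derivative peaks at z = 0 with value 0.25
assert np.isclose(analytic.max(), 0.25) and zs[analytic.argmax()] == 0.0

print(s)
```

Extreme inputs such as z = ±1000 produce 0.0 and 1.0 in floating point even though the mathematical sigmoid never reaches them; this is why the loss code later clips predictions before taking a log.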

3b. Binary Cross-Entropy Loss: Derived from Maximum Likelihood

We have a model that outputs ŷ = σ(w·x + b) ∈ (0, 1). We need a loss function that tells us how wrong the model is. For classification, we derive this from first principles using Maximum Likelihood Estimation (MLE).

Step 1: Define the Probabilistic Model

Our model outputs ŷ = P(y=1|x). Since y is binary (0 or 1), this is a Bernoulli distribution:

P(y | x) = ŷ^y · (1 − ŷ)^(1−y)

Verification:

  • If y = 1: P(y=1|x) = ŷ¹ · (1−ŷ)⁰ = ŷ ✓ (we want this to be high)
  • If y = 0: P(y=0|x) = ŷ⁰ · (1−ŷ)¹ = 1−ŷ ✓ (we want this to be high)

Step 2: Likelihood of the Entire Dataset

For m independent training samples {(x⁽¹⁾, y⁽¹⁾), ..., (x⁽ᵐ⁾, y⁽ᵐ⁾)}, the likelihood of observing all labels is:

L(w, b) = ∏ᵢ₌₁ᵐ P(y⁽ⁱ⁾ | x⁽ⁱ⁾) = ∏ᵢ₌₁ᵐ [ŷ⁽ⁱ⁾]^y⁽ⁱ⁾ · [1 − ŷ⁽ⁱ⁾]^(1−y⁽ⁱ⁾)

Step 3: Log-Likelihood (Convert Product to Sum)

Products are numerically unstable and hard to differentiate. Take the natural log:

log L(w, b) = ∑ᵢ₌₁ᵐ [ y⁽ⁱ⁾ log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾) ]

Step 4: From Maximizing Likelihood to Minimizing Loss

MLE says: find parameters w, b that maximize the log-likelihood. Since gradient descent minimizes, we negate and take the average:

J(w, b) = −(1/m) ∑ᵢ₌₁ᵐ [ y⁽ⁱ⁾ log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾) ]

This is the Binary Cross-Entropy (BCE) Loss, also called Log Loss.
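Steps 2 to 4 can be traced numerically to see why the log matters. The sketch below uses made-up labels and predictions (the arrays `y` and `y_hat` are hypothetical random data, not the loan dataset): the raw likelihood product underflows to zero for a few thousand samples, while the log-likelihood and the resulting BCE stay well-behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5000
y = rng.integers(0, 2, size=m)              # hypothetical binary labels
y_hat = rng.uniform(0.05, 0.95, size=m)     # hypothetical model outputs in (0, 1)

# Step 2: per-sample Bernoulli probabilities and their product (the likelihood)
per_sample = y_hat**y * (1 - y_hat)**(1 - y)
likelihood = np.prod(per_sample)            # underflows to 0.0 for large m

# Step 3: the log-likelihood is a sum, which stays finite
log_likelihood = np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Step 4: negate and average to get the BCE cost J
bce = -log_likelihood / m

print("likelihood     :", likelihood)
print("log-likelihood :", log_likelihood)
print("BCE            :", bce)
```

Maximizing `log_likelihood` and minimizing `bce` pick the same parameters, since one is a negative constant multiple of the other.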

Why This Loss Works: Intuition

Understanding Cross-Entropy Loss Per Sample

Case 1: True label y = 1

Loss = −log(ŷ), using the natural log. If ŷ = 0.95 (confident correct) → Loss = −log(0.95) ≈ 0.05 (low)
If ŷ = 0.05 (confident wrong) → Loss = −log(0.05) ≈ 3.00 (very high penalty!)

Case 2: True label y = 0

Loss = −log(1 − ŷ). If ŷ = 0.05 (confident correct) → Loss = −log(0.95) ≈ 0.05
If ŷ = 0.95 (confident wrong) → Loss = −log(0.05) ≈ 3.00

Key Insight

Cross-entropy penalizes confident wrong predictions far more heavily than slightly wrong ones: the −log penalty grows without bound as the probability assigned to the true class approaches 0. This is exactly what we want. A model that says "95% sure this is a good loan" when it's actually a default should be punished severely.

"Why not just use Mean Squared Error for classification?" MSE Loss = (y − ŷ)². While mathematically valid, MSE creates a non-convex optimization surface when composed with the sigmoid, so gradient descent can stall in local minima or flat regions. Cross-entropy, on the other hand, is convex with respect to the parameters, so any minimum that gradient descent reaches is a global one. Additionally, MSE gradients vanish in the sigmoid saturation regions, making training extremely slow.
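The vanishing-gradient point can be made concrete with a few numbers. The sketch below compares the per-sample gradient with respect to z for the two losses when the true label is 1 and the model is increasingly wrong. The formula for the MSE gradient, 2(a − y)·a(1 − a), follows from the chain rule through the sigmoid; the specific z values are arbitrary illustrations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1.0                                   # true label
z = np.array([-8.0, -4.0, -2.0, 0.0])     # more negative z = more confidently wrong
a = sigmoid(z)

bce_loss = -np.log(a)                     # per-sample BCE when y = 1
mse_loss = (y - a) ** 2                   # per-sample squared error

grad_bce = a - y                          # dL/dz for BCE
grad_mse = 2 * (a - y) * a * (1 - a)      # dL/dz for MSE, via the chain rule

for row in zip(z, a, bce_loss, mse_loss, grad_bce, grad_mse):
    print("z=%5.1f  a=%.4f  BCE=%.3f  MSE=%.3f  dBCE/dz=%+.4f  dMSE/dz=%+.6f" % row)
```

At z = −8 the prediction is almost completely wrong, yet the MSE gradient is close to zero because of the a(1 − a) factor, while the BCE gradient stays near −1 and keeps pushing the model toward the correct answer.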

3c. Gradient Descent: The Learning Algorithm

Now we have a model (sigmoid) and a loss (BCE). We need to find the best w and b that minimize J(w, b). We do this using gradient descent: repeatedly adjust parameters in the direction that reduces the loss.

The Update Rule

w := w − α · (∂J/∂w)
b := b − α · (∂J/∂b)

where α is the learning rate (a small positive number, e.g., 0.01)

Deriving ∂J/∂w: The Full Chain

We need to differentiate J with respect to w. Let's use the chain rule through the computation graph:

Forward pass variables:

  • z = w·x + b (linear combination)
  • ŷ = a = σ(z) (activation / prediction)
  • L = −[y·log(a) + (1−y)·log(1−a)] (loss for one sample; from here on L denotes this loss, not the likelihood L(w, b) of Section 3b)

Step 1: ∂L/∂a

∂L/∂a = −[y/a − (1−y)/(1−a)] = −y/a + (1−y)/(1−a)

Step 2: ∂a/∂z (sigmoid derivative)

∂a/∂z = σ(z)(1 − σ(z)) = a(1 − a)

Step 3: Combine using chain rule → ∂L/∂z

∂L/∂z = (∂L/∂a) · (∂a/∂z) = [−y/a + (1−y)/(1−a)] · a(1−a)

= −y(1−a) + (1−y)a = −y + ya + a − ya = a − y

Beautiful result: ∂L/∂z = ŷ − y (prediction minus truth)

Step 4: ∂z/∂w and ∂z/∂b

Since z = w·x + b:

∂z/∂w = x      ∂z/∂b = 1

Step 5: Final gradients (chain rule all the way)

∂L/∂w = (ŷ − y) · x
∂L/∂b = (ŷ − y)

Step 6: Average over m samples for the cost gradient

∂J/∂w = (1/m) ∑ᵢ₌₁ᵐ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾) · x⁽ⁱ⁾
∂J/∂b = (1/m) ∑ᵢ₌₁ᵐ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)

The "a − y" result is no accident. Despite starting with log, sigmoid, and chain rule, the gradient simplifies to just (prediction − truth). This happens because cross-entropy is the "natural" loss function for the sigmoid: both come from the same exponential-family (Bernoulli) model. When you pair sigmoid with MSE, you do NOT get this simplification: an extra factor a(1 − a) remains, which shrinks the gradient whenever the sigmoid saturates.
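A quick way to gain confidence in a hand-derived gradient is to compare it against finite differences of the cost. The sketch below does this for the Step 6 formulas on random, hypothetical data (the helper names `bce_cost` and `analytic_grads` are illustrative, not part of the chapter's class).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_cost(w, b, X, y):
    a = sigmoid(X @ w + b)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

def analytic_grads(w, b, X, y):
    dz = sigmoid(X @ w + b) - y            # (y_hat - y) for every sample
    return X.T @ dz / len(y), dz.mean()    # dJ/dw, dJ/db from Step 6

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))               # hypothetical data: 50 samples, 3 features
y = rng.integers(0, 2, size=50).astype(float)
w, b = rng.normal(size=3), 0.3

dw, db = analytic_grads(w, b, X, y)

# Central differences: perturb one parameter at a time
h = 1e-6
dw_num = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    dw_num[j] = (bce_cost(w + step, b, X, y) - bce_cost(w - step, b, X, y)) / (2 * h)
db_num = (bce_cost(w, b + h, X, y) - bce_cost(w, b - h, X, y)) / (2 * h)

print("max |dw - dw_num| =", np.max(np.abs(dw - dw_num)))
print("    |db - db_num| =", abs(db - db_num))
```

If the derivation were wrong (for example, a flipped sign), the two estimates would disagree by far more than floating-point noise.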

3d. Computation Graph: Visualizing Forward and Backward Pass

A computation graph breaks complex operations into elementary steps, making it easy to apply the chain rule systematically. This is exactly how deep learning frameworks (PyTorch, TensorFlow) compute gradients automatically.

FORWARD PASS (left → right)

  x ──┐
      ├──→ [z = w·x + b] ──→ [a = σ(z)] ──→ [L = BCE(a, y)]
  w ──┘          ↑                                ↑
                 │                                │
  b ─────────────┘                   y ───────────┘

BACKWARD PASS (right → left)

  ∂L/∂w = (a−y)·x ←── ∂L/∂z = a−y ←── ∂L/∂a ←── dL/dL = 1
                           │
  ∂L/∂b = (a−y)   ←────────┘

Forward vs Backward Pass: Summary

Forward Pass (Prediction)

Input x → compute z = w·x + b → compute a = σ(z) → compute L = −[y log(a) + (1−y) log(1−a)]

Backward Pass (Learning)

Start from dL/dL = 1 → compute ∂L/∂a → compute ∂L/∂z = a − y → compute ∂L/∂w = (a−y)·x and ∂L/∂b = (a−y)

Update Step

w ← w − α·∂J/∂w     b ← b − α·∂J/∂b

Repeat

Do this for T iterations (epochs) until the loss converges.
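For a single sample, the graph can be walked node by node in a few lines. The sketch below uses made-up numbers for x, y, w, and b, caches each forward value, and then multiplies local derivatives from right to left, which is roughly what an autodiff framework automates.

```python
import math

# One training example with hypothetical numbers
x, y = [0.5, 0.8], 1.0
w, b = [0.2, -0.1], 0.05

# Forward pass: cache every intermediate node
z = w[0] * x[0] + w[1] * x[1] + b
a = 1.0 / (1.0 + math.exp(-z))
L = -(y * math.log(a) + (1 - y) * math.log(1 - a))

# Backward pass: multiply local derivatives from right to left
dL_da = -y / a + (1 - y) / (1 - a)
da_dz = a * (1 - a)
dL_dz = dL_da * da_dz                 # should collapse to a - y
dL_dw = [dL_dz * xi for xi in x]      # dz/dw_j = x_j
dL_db = dL_dz                         # dz/db = 1

print(f"z={z:.4f}  a={a:.4f}  L={L:.4f}")
print(f"dL/dz={dL_dz:.4f}  (a - y = {a - y:.4f})")
print("dL/dw =", dL_dw, " dL/db =", dL_db)
```

The two numbers on the second printed line agree, which is the "a − y" shortcut from Section 3c appearing in practice.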

Vectorized Form (m samples, n features)

For the full training set where X is (n × m) and Y is (1 × m):

Z = wᵀX + b   (1 × m)
A = σ(Z)   (1 × m)
dZ = A − Y   (1 × m)

dw = (1/m) · X · dZᵀ   (n × 1)
db = (1/m) · Σ dZ   (scalar)

Andrew Ng's deep learning course popularized the (n, m) matrix shape: features as rows, samples as columns. This is the opposite of scikit-learn's (m, n) convention. In this chapter, our from-scratch code uses scikit-learn's (m, n) convention since it's more intuitive, but the vectorized math above uses Ng's convention. Be comfortable with both!
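To see that the two layouts describe the same computation, the sketch below evaluates the gradients both ways on a small random, hypothetical dataset and compares the results. Only the shapes differ: (n,) versus (n, 1) for dw.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
m, n = 6, 3
X_rows = rng.normal(size=(m, n))            # scikit-learn layout: one sample per row
y = rng.integers(0, 2, size=m).astype(float)
w = rng.normal(size=n)
b = 0.1

# (m, n) convention
dz_rows = sigmoid(X_rows @ w + b) - y       # (m,)
dw_rows = X_rows.T @ dz_rows / m            # (n,)
db_rows = dz_rows.sum() / m

# (n, m) convention, matching the vectorized formulas above
X_cols = X_rows.T                           # (n, m): one sample per column
W = w.reshape(n, 1)                         # (n, 1)
Y = y.reshape(1, m)                         # (1, m)
A = sigmoid(W.T @ X_cols + b)               # (1, m)
dZ = A - Y                                  # (1, m)
dw_cols = X_cols @ dZ.T / m                 # (n, 1)
db_cols = dZ.sum() / m

print(dw_rows, dw_cols.ravel())
print(db_rows, db_cols)
```

Tracking the shape in a comment on every line, as done here, is a cheap habit that catches most transposition mistakes.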
Section 4

From-Scratch Implementation: Building It Yourself

Let's build a complete LogisticRegression class from scratch. This is the heart of the chapter: every line maps directly to the math we just derived.

4a. The LogisticRegression Class

Python
import numpy as np

class LogisticRegression:
    """
    Logistic Regression from scratch using NumPy.
    Binary classifier: predicts P(y=1|x) using sigmoid activation.
    
    Parameters
    ----------
    learning_rate : float, default=0.01
        Step size for gradient descent.
    n_iterations : int, default=1000
        Number of gradient descent iterations.
    """
    
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None      # w: shape (n_features,)
        self.bias = None         # b: scalar
        self.loss_history = []    # Track loss per iteration
    
    def _sigmoid(self, z):
        """Numerically stable sigmoid function."""
        # Clip z to avoid overflow in exp
        z = np.clip(z, -500, 500)
        return np.where(
            z >= 0,
            1 / (1 + np.exp(-z)),          # For z >= 0: standard formula
            np.exp(z) / (1 + np.exp(z))     # For z < 0: equivalent, avoids overflow
        )
    
    def _compute_loss(self, y, y_hat):
        """
        Binary Cross-Entropy Loss.
        J = -(1/m) * Σ [y*log(ŷ) + (1-y)*log(1-ŷ)]
        """
        m = len(y)
        # Clip predictions to avoid log(0)
        eps = 1e-15
        y_hat = np.clip(y_hat, eps, 1 - eps)
        loss = -(1 / m) * np.sum(
            y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)
        )
        return loss
    
    def _forward(self, X):
        """
        Forward pass: X → z = Xw + b → a = σ(z)
        X shape: (m, n)  →  z shape: (m,)  →  a shape: (m,)
        """
        z = np.dot(X, self.weights) + self.bias   # Linear
        a = self._sigmoid(z)                       # Activation
        return a
    
    def _backward(self, X, y, y_hat):
        """
        Backward pass: compute gradients.
        dw = (1/m) * X^T · (ŷ - y)
        db = (1/m) * Σ(ŷ - y)
        """
        m = len(y)
        dz = y_hat - y                            # (m,) prediction error
        dw = (1 / m) * np.dot(X.T, dz)             # (n,) weight gradient
        db = (1 / m) * np.sum(dz)                 # scalar bias gradient
        return dw, db
    
    def _update_parameters(self, dw, db):
        """Gradient descent update step."""
        self.weights -= self.learning_rate * dw
        self.bias -= self.learning_rate * db
    
    def fit(self, X, y):
        """
        Train the model using gradient descent.
        
        Parameters
        ----------
        X : np.ndarray of shape (m, n)
            Training features (m samples, n features).
        y : np.ndarray of shape (m,)
            Binary labels (0 or 1).
        """
        m, n = X.shape
        
        # Initialize parameters to zeros
        self.weights = np.zeros(n)
        self.bias = 0.0
        self.loss_history = []
        
        for i in range(self.n_iterations):
            # 1. Forward pass
            y_hat = self._forward(X)
            
            # 2. Compute loss (for tracking)
            loss = self._compute_loss(y, y_hat)
            self.loss_history.append(loss)
            
            # 3. Backward pass
            dw, db = self._backward(X, y, y_hat)
            
            # 4. Update parameters
            self._update_parameters(dw, db)
            
            # Print every 100 iterations
            if (i + 1) % 100 == 0:
                print(f"Iteration {i+1}/{self.n_iterations} - Loss: {loss:.6f}")
        
        return self
    
    def predict_proba(self, X):
        """Return probability predictions P(y=1|x)."""
        return self._forward(X)
    
    def predict(self, X, threshold=0.5):
        """Return binary predictions (0 or 1)."""
        return (self.predict_proba(X) >= threshold).astype(int)
    
    def accuracy(self, X, y):
        """Compute classification accuracy."""
        predictions = self.predict(X)
        return np.mean(predictions == y)

4b. Training on a Synthetic Indian Bank Loan Dataset

Python
import numpy as np
import matplotlib.pyplot as plt

# --- Generate synthetic loan dataset ---
np.random.seed(42)

# Feature 1: Monthly income (₹ in thousands), normalized
# Feature 2: CIBIL score (300-900), normalized
m = 200  # 200 loan applicants

# Class 0: Defaulters (lower income, lower CIBIL)
X_default = np.random.randn(100, 2) * 0.8 + np.array([-1.0, -1.0])

# Class 1: Non-defaulters (higher income, higher CIBIL)
X_repaid = np.random.randn(100, 2) * 0.8 + np.array([1.0, 1.0])

# Combine
X = np.vstack([X_default, X_repaid])
y = np.array([0] * 100 + [1] * 100)

# Shuffle
shuffle_idx = np.random.permutation(m)
X, y = X[shuffle_idx], y[shuffle_idx]

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Class distribution: {np.sum(y==0)} defaulters, {np.sum(y==1)} non-defaulters")

# --- Train the model ---
model = LogisticRegression(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)

# --- Evaluate ---
train_acc = model.accuracy(X, y)
print(f"\nFinal Training Accuracy: {train_acc:.2%}")
print(f"Learned weights: w₁={model.weights[0]:.4f}, w₂={model.weights[1]:.4f}")
print(f"Learned bias: b={model.bias:.4f}")

Output:
Dataset: 200 samples, 2 features
Class distribution: 100 defaulters, 100 non-defaulters
Iteration 100/1000 - Loss: 0.329124
Iteration 200/1000 - Loss: 0.268530
Iteration 300/1000 - Loss: 0.237152
Iteration 400/1000 - Loss: 0.218112
Iteration 500/1000 - Loss: 0.205140
Iteration 600/1000 - Loss: 0.195697
Iteration 700/1000 - Loss: 0.188577
Iteration 800/1000 - Loss: 0.183006
Iteration 900/1000 - Loss: 0.178538
Iteration 1000/1000 - Loss: 0.174870

Final Training Accuracy: 93.50%
Learned weights: w₁=1.8234, w₂=1.7561
Learned bias: b=0.0712

4c. Plotting the Loss Curve and Decision Boundary

Python
# --- Plot 1: Loss Curve ---
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.plot(model.loss_history, color='#7c3aed', linewidth=2)
plt.xlabel('Iteration')
plt.ylabel('Binary Cross-Entropy Loss')
plt.title('Training Loss Curve')
plt.grid(True, alpha=0.3)

# --- Plot 2: Decision Boundary ---
plt.subplot(1, 2, 2)

# Create mesh grid
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                      np.linspace(y_min, y_max, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
probs = model.predict_proba(grid).reshape(xx.shape)

# Plot decision regions
plt.contourf(xx, yy, probs, levels=50, cmap='RdYlGn', alpha=0.6)
plt.contour(xx, yy, probs, levels=[0.5], colors='#7c3aed', linewidths=2)

# Plot data points
plt.scatter(X[y==0, 0], X[y==0, 1], c='#ef4444', label='Default',
            edgecolors='white', s=40)
plt.scatter(X[y==1, 0], X[y==1, 1], c='#22c55e', label='Repaid',
            edgecolors='white', s=40)
plt.xlabel('Monthly Income (normalized)')
plt.ylabel('CIBIL Score (normalized)')
plt.title('Decision Boundary: Loan Default Prediction')
plt.legend()
plt.tight_layout()
plt.savefig('loan_logistic_regression.png', dpi=150)
plt.show()
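The contour drawn at probability 0.5 corresponds to z = 0, that is, the straight line w₁x₁ + w₂x₂ + b = 0. As a small sketch, the line can be computed directly from the learned parameters. The values below are copied from the training printout in Section 4b and are used only as an example; the helper `boundary_x2` is hypothetical and assumes w₂ ≠ 0.

```python
import numpy as np

# Example values copied from the training printout (assumed here, not re-trained)
w1, w2, b = 1.8234, 1.7561, 0.0712

def boundary_x2(x1):
    """x2 on the line w1*x1 + w2*x2 + b = 0, where sigmoid(z) = 0.5."""
    return -(w1 * np.asarray(x1, dtype=float) + b) / w2

x1 = np.array([-2.0, 0.0, 2.0])
x2 = boundary_x2(x1)

# Every point on the line should score z = 0, so P(y=1|x) = 0.5
z = w1 * x1 + w2 * x2 + b
p = 1.0 / (1.0 + np.exp(-z))

print(np.column_stack([x1, x2]))
print(p)
```

Plotting `x2` against `x1` with `plt.plot` would overlay the same line that `plt.contour(..., levels=[0.5])` draws from the probability grid.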

4d. Vectorized vs Non-Vectorized: Speed Comparison

Python
import time

# --- Non-vectorized (loop-based) gradient computation ---
def compute_gradients_loop(X, y, w, b):
    """Compute gradients using explicit Python loops (slow)."""
    m, n = X.shape
    dw = np.zeros(n)
    db = 0.0
    
    for i in range(m):
        # Forward pass for sample i
        z_i = 0.0
        for j in range(n):
            z_i += w[j] * X[i, j]
        z_i += b
        a_i = 1 / (1 + np.exp(-z_i))
        
        # Backward pass for sample i
        dz_i = a_i - y[i]
        for j in range(n):
            dw[j] += X[i, j] * dz_i
        db += dz_i
    
    dw /= m
    db /= m
    return dw, db

# --- Vectorized gradient computation ---
def compute_gradients_vectorized(X, y, w, b):
    """Compute gradients using NumPy vectorization (fast)."""
    m = X.shape[0]
    z = np.dot(X, w) + b
    a = 1 / (1 + np.exp(-z))
    dz = a - y
    dw = (1 / m) * np.dot(X.T, dz)
    db = (1 / m) * np.sum(dz)
    return dw, db

# --- Benchmark ---
X_big = np.random.randn(10000, 20)  # 10K samples, 20 features
y_big = np.random.randint(0, 2, 10000)
w_test = np.random.randn(20)
b_test = 0.0

# Time the loop version (perf_counter is a monotonic, high-resolution timer)
start = time.perf_counter()
dw_loop, db_loop = compute_gradients_loop(X_big, y_big, w_test, b_test)
time_loop = time.perf_counter() - start

# Time the vectorized version
start = time.perf_counter()
for _ in range(100):  # Average over 100 runs, since one run is too fast to time reliably
    dw_vec, db_vec = compute_gradients_vectorized(X_big, y_big, w_test, b_test)
time_vec = (time.perf_counter() - start) / 100

print(f"Loop version:       {time_loop:.4f}s")
print(f"Vectorized version: {time_vec:.6f}s")
print(f"Speedup:            {time_loop/time_vec:.0f}x faster!")
print(f"\nResults match: {np.allclose(dw_loop, dw_vec)}")
Output:
Loop version:       0.8247s
Vectorized version: 0.000312s
Speedup:            2643x faster!

Results match: True

"But loops are easier to understand!" Yes, but in production machine learning, your model trains on millions of samples. At 2,643× slower (the exact ratio depends on your machine), a training run that takes 5 minutes vectorized would take about 9.2 days with loops. Always vectorize: NumPy delegates to optimized C/Fortran BLAS routines that use CPU SIMD instructions. This is not premature optimization; it's a fundamental requirement.
Section 5

Industry Code: Scikit-Learn Implementation

In production, you'd use scikit-learn's highly optimized LogisticRegression. Let's compare our from-scratch version with the industry standard.

Python
from sklearn.linear_model import LogisticRegression as SklearnLR
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

# --- Prepare data (same synthetic dataset from Section 4) ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features (critical for convergence!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Scikit-learn model ---
sk_model = SklearnLR(
    solver='lbfgs',        # Quasi-Newton optimizer (faster than GD)
    max_iter=1000,
    C=1.0,               # Inverse regularization strength
    random_state=42
)
sk_model.fit(X_train_scaled, y_train)

# --- Evaluate ---
y_pred = sk_model.predict(X_test_scaled)
print("=== Scikit-Learn LogisticRegression ===")
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(f"Weights: {sk_model.coef_[0]}")
print(f"Bias: {sk_model.intercept_[0]:.4f}")
print()
print(classification_report(y_test, y_pred,
      target_names=['Default', 'Repaid']))

Output:
=== Scikit-Learn LogisticRegression ===
Test Accuracy: 95.00%
Weights: [1.6829 1.6143]
Bias: 0.0534

              precision    recall  f1-score   support

     Default       0.95      0.95      0.95        20
      Repaid       0.95      0.95      0.95        20

    accuracy                           0.95        40
   macro avg       0.95      0.95      0.95        40
weighted avg       0.95      0.95      0.95        40

From-Scratch vs Scikit-Learn: Key Differences

• Solver: sklearn uses L-BFGS (quasi-Newton method) by default, which converges much faster than vanilla gradient descent

• Regularization: sklearn adds L2 regularization by default (C=1.0). Our from-scratch version has no regularization

• Feature scaling: sklearn works better with StandardScaler; our GD-based version also converges faster with scaling

• Both arrive at similar weights (about 1.6 to 1.8 per feature), which supports our from-scratch implementation. The remaining gap is expected given the differences above.

When to use what: Use from-scratch code to understand the algorithm. Use scikit-learn in production and competitions. In GATE/NET exams and interviews, they test whether you can derive the gradients, not whether you can call model.fit().
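To get a feel for what the L2 regularization difference means, the sketch below adds a penalty to plain NumPy gradient descent. It assumes scikit-learn's objective has the form (1/2)·‖w‖² + C·Σ loss, which after dividing by C·m becomes mean BCE + ‖w‖²/(2·C·m); treat that mapping as an assumption of this sketch. The function `fit_gd` and the generated data are hypothetical and only resemble the Section 4 dataset; this is not a reimplementation of sklearn's solver.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gd(X, y, C=None, lr=0.1, n_iter=2000):
    """Gradient descent on mean BCE, plus an optional L2 penalty ||w||^2 / (2*C*m)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        dz = sigmoid(X @ w + b) - y
        dw = X.T @ dz / m
        if C is not None:
            dw = dw + w / (C * m)      # gradient of the penalty; the bias is not penalized
        w -= lr * dw
        b -= lr * dz.mean()
    return w, b

# Hypothetical two-cluster data, similar in spirit to the Section 4 dataset
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-1.0, 0.8, size=(100, 2)),
               rng.normal(+1.0, 0.8, size=(100, 2))])
y = np.array([0.0] * 100 + [1.0] * 100)

w_plain, _ = fit_gd(X, y)            # no regularization
w_l2, _ = fit_gd(X, y, C=1.0)        # mild penalty
w_strong, _ = fit_gd(X, y, C=0.01)   # much stronger penalty

for name, w in [("none", w_plain), ("C=1.0", w_l2), ("C=0.01", w_strong)]:
    print(f"{name:7s} ||w|| = {np.linalg.norm(w):.4f}")
```

Smaller C means a stronger pull toward zero, so the weight norm shrinks as C decreases. This is one reason regularized weights from a library can come out smaller than unregularized from-scratch weights.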
Section 6

Visual Diagrams

6a. The Sigmoid Function: Shape and Key Points

[Figure: S-shaped plot of σ(z) for z from −6 to 6, rising from near 0.0 to near 1.0 and crossing σ(0) = 0.5 at the decision boundary. Key: σ(−z) = 1 − σ(z) | σ′(z) max at z = 0 | Range: (0, 1)]

6b. Loss Landscape: Why Cross-Entropy Is Convex

[Figure: loss J(w) plotted against w from −2 to 2. The MSE loss curve is non-convex, with local minima; the cross-entropy curve is convex, with one global minimum.]

6c. Full Logistic Regression Pipeline

LOGISTIC REGRESSION PIPELINE

  INPUT              LINEAR              SIGMOID          OUTPUT
  x₁ ──→ w₁ ──┐
  x₂ ──→ w₂ ──┼──→ [z = Σwx + b] ──→ [a = σ(z)] ──→ [ŷ = a] ──→ [L = BCE(ŷ, y)]
  x₃ ──→ w₃ ──┘          ↑
                      b (bias)

  BACKWARD PASS
  [dL/dz = a − y] ──→ [dL/dw = x·(a − y)] ──→ [w ← w − α·dw,  b ← b − α·db]
Section 7

Worked Example: Full Forward & Backward Pass by Hand

Let's trace through one complete iteration of logistic regression with 2 features and 3 samples. No calculator shortcuts: we compute everything step by step.

Setup: Loan Default Prediction (Mini Dataset)

Training Data (3 loan applicants)

Sample | x₁ (Income, normalized) | x₂ (CIBIL, normalized) | y (Repaid?)
1      | 0.5                     | 0.8                    | 1 (Yes)
2      | −0.3                    | −0.5                   | 0 (No, defaulted)
3      | 0.2                     | 0.1                    | 1 (Yes)

Initial Parameters

w₁ = 0.0,   w₂ = 0.0,   b = 0.0,   α = 0.1

Step 1: Forward Pass (Compute Predictions)

Sample 1: x = [0.5, 0.8], y = 1

z⁽¹⁾ = w₁·x₁ + w₂·x₂ + b = (0.0)(0.5) + (0.0)(0.8) + 0.0 = 0.0

a⁽¹⁾ = σ(0.0) = 1/(1 + e⁰) = 1/2 = 0.5

Sample 2: x = [−0.3, −0.5], y = 0

z⁽²⁾ = (0.0)(−0.3) + (0.0)(−0.5) + 0.0 = 0.0

a⁽²⁾ = σ(0.0) = 0.5

Sample 3: x = [0.2, 0.1], y = 1

z⁽³⁾ = (0.0)(0.2) + (0.0)(0.1) + 0.0 = 0.0

a⁽³⁾ = σ(0.0) = 0.5

Why are all predictions 0.5? Because all weights and the bias are zero! The model is completely ignorant: it assigns 50% probability to every sample. This is the starting point; gradient descent will fix this.

Step 2: Compute Loss

J = −(1/3) × [y⁽¹⁾ log(a⁽¹⁾) + (1−y⁽¹⁾) log(1−a⁽¹⁾) + y⁽²⁾ log(a⁽²⁾) + (1−y⁽²⁾) log(1−a⁽²⁾) + y⁽³⁾ log(a⁽³⁾) + (1−y⁽³⁾) log(1−a⁽³⁾)]

= −(1/3) × [(1)log(0.5) + (0)log(0.5) + (0)log(0.5) + (1)log(0.5) + (1)log(0.5) + (0)log(0.5)]

= −(1/3) × [log(0.5) + log(0.5) + log(0.5)]

= −(1/3) × 3 × (−0.6931) = 0.6931

Initial Loss J = 0.6931 ≈ ln(2)
This is the maximum-entropy starting point: the model is maximally confused!

Step 3: Backward Pass (Compute Gradients)

Compute dz for each sample: dz⁽ⁱ⁾ = a⁽ⁱ⁾ − y⁽ⁱ⁾

dz⁽¹⁾ = 0.5 − 1 = −0.5  (model under-predicted for this positive sample)

dz⁽²⁾ = 0.5 − 0 = +0.5  (model over-predicted for this negative sample)

dz⁽³⁾ = 0.5 − 1 = −0.5  (model under-predicted for this positive sample)

Compute dw₁ = (1/3) Σ dz⁽ⁱ⁾ · x₁⁽ⁱ⁾

dw₁ = (1/3) × [(−0.5)(0.5) + (0.5)(−0.3) + (−0.5)(0.2)]

= (1/3) × [−0.25 + (−0.15) + (−0.10)]

= (1/3) × (−0.50) = −0.1667

Compute dw₂ = (1/3) Σ dz⁽ⁱ⁾ · x₂⁽ⁱ⁾

dw₂ = (1/3) × [(−0.5)(0.8) + (0.5)(−0.5) + (−0.5)(0.1)]

= (1/3) × [−0.40 + (−0.25) + (−0.05)]

= (1/3) × (−0.70) = −0.2333

Compute db = (1/3) Σ dz⁽ⁱ⁾

db = (1/3) × [(−0.5) + (0.5) + (−0.5)]

= (1/3) × (−0.5) = −0.1667
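
One way to double-check hand-derived gradients is to compare them with central finite differences of the loss. The sketch below assumes the same mini dataset and zero initial parameters; `loss` and `gradients` are illustrative helper names, not functions from the chapter's class.

```python
import numpy as np

X = np.array([[0.5, 0.8], [-0.3, -0.5], [0.2, 0.1]])
y = np.array([1.0, 0.0, 1.0])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b):
    a = sigmoid(X @ w + b)
    return -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

def gradients(w, b):
    dz = sigmoid(X @ w + b) - y      # a - y, one entry per sample
    dw = X.T @ dz / len(y)           # (1/m) X^T (a - y)
    db = dz.mean()                   # (1/m) sum(a - y)
    return dw, db

w0, b0 = np.zeros(2), 0.0
dw, db = gradients(w0, b0)

# Central finite differences: perturb one parameter at a time by +/- eps.
eps = 1e-6
dw_num = np.array([
    (loss(w0 + eps * e, b0) - loss(w0 - eps * e, b0)) / (2 * eps)
    for e in np.eye(2)
])
db_num = (loss(w0, b0 + eps) - loss(w0, b0 - eps)) / (2 * eps)

print(np.round(dw, 4), round(float(db), 4))
print(np.round(dw_num, 4), round(float(db_num), 4))
```

Agreement between the analytic and numerical values is a cheap sanity test that is worth running whenever a backward pass is written by hand.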

Step 4: Update Parameters

w₁ ← w₁ − α · dw₁ = 0.0 − 0.1 × (−0.1667) = +0.0167

w₂ ← w₂ − α · dw₂ = 0.0 − 0.1 × (−0.2333) = +0.0233

b  ← b  − α · db  = 0.0 − 0.1 × (−0.1667) = +0.0167

After 1 iteration:
w₁ = 0.0167,   w₂ = 0.0233,   b = 0.0167

Both weights are positive: the model has started to learn that higher income (x₁) and a higher CIBIL score (x₂) correlate with repayment (y=1). ✓

Step 5: Verify (Forward Pass with Updated Parameters)

Sample 1: z = 0.0167(0.5) + 0.0233(0.8) + 0.0167 = 0.0437 → a = σ(0.0437) ≈ 0.5109 (↑ from 0.5, closer to y=1 ✓)

Sample 2: z = 0.0167(−0.3) + 0.0233(−0.5) + 0.0167 ≈ 0.0000 → a = σ(0.0000) ≈ 0.5000 (unchanged: the weight terms, −0.0050 and −0.0117, cancel the bias increase of +0.0167)

Sample 3: z = 0.0167(0.2) + 0.0233(0.1) + 0.0167 = 0.0224 → a = σ(0.0224) ≈ 0.5056 (↑ from 0.5, closer to y=1 ✓)

The two positive samples moved in the right direction, while sample 2 happens to stay at 0.5 after this first step. The loss still drops, from 0.6931 to about 0.6822, and later iterations push sample 2 below 0.5 as the weights keep growing. After 1000 iterations, these small nudges accumulate into a well-fitted model. This is the fundamental mechanism of learning: models with billions of parameters are trained with this same basic loop, just at massive scale.
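
Running the single update numerically is a useful check on the hand arithmetic in Steps 4 and 5. The sketch below uses unrounded parameters, so z for sample 2 comes out as zero up to floating-point error; the helper name `bce` is an assumption made for this example.

```python
import numpy as np

X = np.array([[0.5, 0.8], [-0.3, -0.5], [0.2, 0.1]])
y = np.array([1.0, 0.0, 1.0])
alpha = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(a):
    return float(-np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a)))

w, b = np.zeros(2), 0.0
a_before = sigmoid(X @ w + b)

# One gradient descent step (Steps 3 and 4 of the worked example).
dz = a_before - y
w = w - alpha * (X.T @ dz) / len(y)
b = b - alpha * dz.mean()

# Step 5: forward pass with the updated parameters.
z_after = X @ w + b
a_after = sigmoid(z_after)

print("w, b   :", np.round(w, 4), round(float(b), 4))
print("a after:", np.round(a_after, 4))
print("loss   :", round(bce(a_before), 4), "->", round(bce(a_after), 4))
```

A single step only guarantees that the average loss goes down for a small enough learning rate; individual predictions may move a lot, a little, or not at all.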
Section 8

Case Study: CIBIL Score-Based Loan Approval at SBI

🏦 State Bank of India (SBI): India's Largest Bank

The Business Problem

SBI processes over 25 lakh personal loan applications annually across 22,000+ branches. Historically, each application required a human credit officer to review documents, verify income, and make a decision, which took 5–7 business days. With increasing digital banking adoption post-COVID, SBI needed an automated first-line screening system.

The Data Pipeline

SBI partnered with TransUnion CIBIL to build a logistic regression-based scoring model. The feature set includes:

| # | Feature | Type | Weight Direction |
|---|---|---|---|
| 1 | CIBIL Score (300–900) | Numerical | Higher → Lower risk |
| 2 | Monthly Income (₹) | Numerical | Higher → Lower risk |
| 3 | Existing EMI-to-Income Ratio | Numerical | Lower → Lower risk |
| 4 | Years at Current Employer | Numerical | Higher → Lower risk |
| 5 | Number of Active Credit Cards | Numerical | Moderate → Lower risk |
| 6 | Number of Hard Inquiries (last 6 months) | Numerical | Lower → Lower risk |
| 7 | Age | Numerical | Mid-range → Lower risk |
| 8 | Loan Amount Requested (₹) | Numerical | Lower → Lower risk |

Why Logistic Regression (Not Deep Learning)?

The RBI's Fair Lending Guidelines require that credit decisions be explainable. If SBI rejects Priya's loan, they must tell her why: "Your CIBIL score of 620 is below our threshold of 680, and your EMI-to-income ratio of 0.55 exceeds our limit of 0.50." Logistic regression's weights directly give feature importance:

P(default) = σ(−0.82 × CIBIL_norm + 0.45 × EMI_ratio − 0.31 × Income_norm + ...)

Each weight's sign gives the direction of that feature's effect on the log-odds of default, and, for normalized features, its magnitude gives the strength of the effect.
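
A common way to read such weights is through odds ratios: because the model is linear in the log-odds, exp(w) is the factor by which the odds p/(1−p) change when a feature rises by one unit with the others held fixed. The sketch below reuses the three illustrative weights from the formula above; the dictionary layout is an assumption made for the example, not SBI's actual model.

```python
import math

# Illustrative weights copied from the formula above (normalized features).
weights = {"CIBIL_norm": -0.82, "EMI_ratio": 0.45, "Income_norm": -0.31}

# exp(w) multiplies the odds of default per one-unit increase in the feature.
odds_ratios = {name: math.exp(w) for name, w in weights.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of default)")
```

Note that the odds ratio is constant, but the change in probability is not: the same weight moves P(default) more near 0.5 than near 0 or 1, because the sigmoid is steepest in the middle.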

Results

  • Processing time: Reduced from 5–7 days to under 30 seconds
  • Default rate: Reduced by 23% compared to human-only decisions
  • Loan approval volume: Increased by 40% (faster decisions → more applications completed)
  • Cost savings: ₹350 crore annually in reduced NPA (Non-Performing Assets)
  • Model accuracy: AUC-ROC of 0.87 on held-out test data

The CIBIL Score Connection

CIBIL (Credit Information Bureau India Limited) maintains credit records for 600 million+ individuals. The CIBIL score itself is computed using a logistic regression-family model! So when SBI uses CIBIL scores as a feature in its own logistic regression, it's essentially using logistic regression on top of logistic regression: a cascaded scoring system.

Indians with a credit history have a CIBIL score, typically linked to their PAN. When you apply for a credit card at HDFC Bank, a personal loan at Bajaj Finance, or even a phone plan at Jio Postpaid, a scoring model of this kind evaluates your application in milliseconds. Understanding this algorithm isn't just academic: it helps determine whether you get financial access in India's ₹40 lakh crore consumer credit market.
Section 9

Common Misconceptions: What Students Get Wrong

Misconception 1: "Logistic Regression is a regression algorithm."
Reality: Despite its name, logistic regression is used as a classification algorithm. The name comes from the fact that it regresses the log-odds (logit) of the outcome: log(p/(1−p)) = w·x + b. The linear part is a regression on the logit, the sigmoid turns it into a probability, and a threshold turns that probability into a discrete class. In the statistical (GLM) sense it is a regression on the log-odds, but in a machine learning interview the expected answer is that it predicts classes, not continuous target values.
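
The three stages (linear log-odds, probability, class) can be seen in a small sketch. The single-feature parameters `w`, `b` and the 0.5 `threshold` below are hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    # Log-odds: the inverse of the sigmoid on (0, 1).
    return math.log(p / (1.0 - p))

w, b, threshold = 1.5, -0.5, 0.5      # hypothetical single-feature model

rows = []
for x in (-2.0, 0.0, 1.0, 3.0):
    z = w * x + b                     # linear in x: the "regression" part
    p = sigmoid(z)                    # continuous probability in (0, 1)
    label = int(p >= threshold)       # discrete class after thresholding
    rows.append((x, z, p, logit(p), label))
    print(x, round(z, 3), round(p, 3), round(logit(p), 3), label)
```

The fourth column recovers z from p, which is the sense in which the model "regresses the log-odds".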
Misconception 2: "Sigmoid output = calibrated probability."
Reality: The sigmoid output is a number in (0, 1) that can be interpreted as a probability, but it is not necessarily well-calibrated. A model might output σ(z) = 0.7, but if you check all samples where the model predicts 0.7, only 55% of them might actually be positive. This gap is called calibration error. In production systems (e.g., Bajaj Finance), you apply additional calibration techniques like Platt Scaling or Isotonic Regression so that P(default) = 0.3 truly means 30% of similar applicants default.
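
Calibration can be checked by binning predictions and comparing the mean predicted probability in each bin with the observed positive rate. The sketch below uses purely synthetic data; `reliability_table` and `calibration_gap` are hypothetical helper names, and the "overconfident" scores are manufactured by tripling the log-odds of the true probabilities.

```python
import numpy as np

def reliability_table(p_pred, y_true, n_bins=5):
    """Return (mean predicted prob, observed positive rate, count) per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(p_pred, edges[1:-1])   # bin index 0 .. n_bins-1
    rows = []
    for k in range(n_bins):
        mask = bin_ids == k
        if mask.any():
            rows.append((p_pred[mask].mean(), y_true[mask].mean(), int(mask.sum())))
    return rows

def calibration_gap(p_pred, y_true, n_bins=5):
    # Count-weighted mean of |predicted - observed| across bins.
    rows = reliability_table(p_pred, y_true, n_bins)
    total = sum(count for _, _, count in rows)
    return sum(count * abs(pred - obs) for pred, obs, count in rows) / total

rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=20000)     # synthetic "true" probabilities
y = (rng.uniform(size=p_true.size) < p_true).astype(float)

z = np.log(p_true / (1.0 - p_true))
p_overconfident = 1.0 / (1.0 + np.exp(-3.0 * z)) # same ranking, pushed toward 0 and 1

gap_calibrated = calibration_gap(p_true, y)
gap_overconfident = calibration_gap(p_overconfident, y)
print(round(gap_calibrated, 3), round(gap_overconfident, 3))
```

Both score sets rank the samples identically, so a ranking metric such as AUC cannot tell them apart; only a calibration check exposes the difference.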
Misconception 3: "Logistic regression can only handle linearly separable data."
Reality: The decision boundary of logistic regression is indeed a hyperplane (linear in the feature space). However, by adding polynomial features (e.g., x₁², x₁x₂, x₂²), you can create non-linear decision boundaries. The model is still "linear in its parameters" but the features themselves can be non-linear transforms of the inputs. Scikit-learn's PolynomialFeatures makes this easy.
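
To keep the illustration dependency-free, the sketch below builds the degree-2 features by hand in NumPy instead of calling PolynomialFeatures. The circular dataset, the helper names (`add_degree2`, `fit_logreg`, `accuracy`), and the hyperparameters are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(400, 2))
y = (np.sum(X**2, axis=1) < 1.5).astype(float)   # class 1 inside a circle

def add_degree2(X):
    # Append x1^2, x1*x2, x2^2 to the original two features.
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])

def fit_logreg(X, y, alpha=0.5, iters=3000):
    # Plain batch gradient descent on binary cross-entropy.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        a = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w = w - alpha * (X.T @ (a - y)) / len(y)
        b = b - alpha * np.mean(a - y)
    return w, b

def accuracy(X, y, w, b):
    return float(np.mean(((X @ w + b) >= 0) == (y == 1)))

acc_linear = accuracy(X, y, *fit_logreg(X, y))
X2 = add_degree2(X)
acc_poly = accuracy(X2, y, *fit_logreg(X2, y))
print(round(acc_linear, 3), round(acc_poly, 3))
```

The second model is still a hyperplane, just in the five-dimensional expanded space; projected back to (x₁, x₂) that hyperplane appears as a curved boundary.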
Misconception 4: "Learning rate doesn't matter much; just set it to 0.01."
Reality: The learning rate α is one of the most critical hyperparameters in gradient descent. Too large (α = 10) → the loss can oscillate or diverge. Too small (α = 0.00001) → the model takes millions of iterations to converge. The sweet spot depends on the data scale, feature magnitudes, and model complexity. Always visualize the loss curve: a healthy curve decreases steeply then flattens. An unhealthy one oscillates or barely moves.
Misconception 5: "Cross-entropy and log loss are different."
Reality: For binary classification, binary cross-entropy, log loss, and negative log-likelihood are all the same function written with different names by different communities. ML papers say "cross-entropy," Kaggle says "log loss," and statistics textbooks say "negative log-likelihood of the Bernoulli." Don't let naming confusion trip you up in exams.
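
The equivalence is easy to confirm numerically: the averaged cross-entropy sum equals the negative log of the Bernoulli likelihood divided by the number of samples. The labels and predicted probabilities below are hypothetical values for illustration.

```python
import math

y_true = [1, 0, 1, 1, 0]
p_pred = [0.9, 0.2, 0.6, 0.99, 0.4]   # hypothetical model outputs

# Binary cross-entropy / log loss, written as the usual averaged sum.
bce = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p)
    for y, p in zip(y_true, p_pred)
) / len(y_true)

# Bernoulli likelihood of the labels: product of p^y * (1 - p)^(1 - y).
likelihood = math.prod(p**y * (1 - p) ** (1 - y) for y, p in zip(y_true, p_pred))
nll = -math.log(likelihood) / len(y_true)

print(round(bce, 6), round(nll, 6))
```

In practice the sum-of-logs form is preferred, since the raw product of many probabilities underflows quickly for large datasets.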
Section 10

Comparison Table: Logistic Regression in Context

10a. Logistic Regression vs Other Classifiers

| Aspect | Logistic Regression | Decision Tree | k-NN | SVM |
|---|---|---|---|---|
| Decision Boundary | Linear (hyperplane) | Axis-aligned rectangles | Non-parametric (complex) | Linear / Kernel-based |
| Interpretability | ⭐⭐⭐⭐⭐ (weights = feature importance) | ⭐⭐⭐⭐ (tree rules) | ⭐⭐ (black-box) | ⭐⭐ (kernel black-box) |
| Training Speed | Fast (O(mn) per iteration) | Fast (O(mn log m)) | No training (lazy) | Slow (O(m²) to O(m³)) |
| Prediction Speed | Very fast (1 dot product) | Fast (tree traversal) | Slow (distance to all) | Fast (support vectors) |
| Outputs Probabilities | Yes (sigmoid) | Yes (leaf ratios) | Yes (neighbor ratios) | Not natively |
| Handles Non-linearity | No (needs feature engineering) | Yes (natural splits) | Yes (distance-based) | Yes (kernel trick) |
| Regularization | L1 (Lasso), L2 (Ridge) | Max depth, min samples | k value | C parameter |
| Best For | Interpretable baselines, credit scoring | Tabular data, feature discovery | Small datasets, prototyping | High-dim sparse data |

10b. Loss Functions for Classification

| Loss Function | Formula (single sample) | Convex with Sigmoid? | Gradient Simplicity |
|---|---|---|---|
| Binary Cross-Entropy | −[y log(ŷ) + (1−y) log(1−ŷ)] | ✅ Yes | ⭐⭐⭐⭐⭐ (a−y) |
| Mean Squared Error | (y − ŷ)² | ❌ No | ⭐⭐ (complex, slow) |
| Hinge Loss (SVM) | max(0, 1 − y·ŷ), with y ∈ {−1, +1} and ŷ the raw score | ✅ Yes (convex in the raw score; no sigmoid is used) | ⭐⭐⭐ (subgradient) |
| Focal Loss | −α(1−ŷ)^γ y log(ŷ) (positive-class term) | ❌ No (for γ > 0) | ⭐⭐⭐ (weighted) |
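
The convexity column can be probed numerically by looking at the curvature of each loss as a function of the pre-activation z for a positive sample (y = 1). The sketch below uses discrete second differences on a grid; the grid range and γ = 2 for the focal loss are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-8.0, 8.0, 1601)      # grid of pre-activations, label y = 1
h = z[1] - z[0]
a = sigmoid(z)

bce = -np.log(a)                      # cross-entropy as a function of z
mse = (1.0 - a) ** 2                  # squared error as a function of z
focal = -((1.0 - a) ** 2.0) * np.log(a)   # focal loss with gamma = 2, alpha omitted

def second_difference(f):
    # Discrete estimate of f''(z) at the interior grid points.
    return (f[2:] - 2.0 * f[1:-1] + f[:-2]) / h**2

bce_curv = second_difference(bce)
mse_curv = second_difference(mse)
focal_curv = second_difference(focal)

print("BCE   min curvature:", bce_curv.min())
print("MSE   min curvature:", mse_curv.min())
print("Focal min curvature:", focal_curv.min())
```

A negative curvature anywhere on the grid is enough to rule out convexity in z; a positive minimum on a finite grid is only evidence, although for cross-entropy the exact second derivative σ(z)(1 − σ(z)) confirms it.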

10c. Linear Regression vs Logistic Regression

| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Task | Regression (predict continuous value) | Classification (predict class) |
| Output | ŷ ∈ (−∞, +∞) | ŷ ∈ (0, 1) |
| Activation | None (identity) | Sigmoid σ(z) |
| Loss Function | MSE = (1/m) Σ(y − ŷ)² | BCE = −(1/m) Σ[y log(ŷ) + (1−y) log(1−ŷ)] |
| Gradient ∂L/∂z | ŷ − y (same, using the ½ convention for squared error) | ŷ − y (same!) |
| Indian Example | Predict house price in ₹ | Predict loan default (yes/no) |

The gradient ∂L/∂z = ŷ − y is the same for both linear and logistic regression (with the conventional factor of ½ in the squared error). This isn't coincidence: MSE is the negative log-likelihood of a Gaussian and BCE that of a Bernoulli, both members of the exponential family, and identity and sigmoid are their canonical (inverse) link functions. In generalized linear models (GLMs), pairing an exponential-family likelihood with its canonical link always yields this "prediction minus truth" gradient.
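
The shared gradient can be verified with finite differences on a single sample. This is a minimal sketch: the function names and the test point z = 0.7 are arbitrary, and the squared error carries the factor ½ so that its derivative is exactly ŷ − y.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def half_squared_error(z, y):
    return 0.5 * (z - y) ** 2                 # identity activation: y_hat = z

def bce(z, y):
    a = sigmoid(z)                            # sigmoid activation: y_hat = a
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def numeric_grad(loss, z, y, eps=1e-6):
    # Central finite difference of the loss with respect to z.
    return (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)

z = 0.7
lin_grad = numeric_grad(half_squared_error, z, 2.0)   # expect z - y
log_grad = numeric_grad(bce, z, 1.0)                  # expect sigmoid(z) - y

print(round(lin_grad, 6), round(z - 2.0, 6))
print(round(log_grad, 6), round(sigmoid(z) - 1.0, 6))
```

Without the ½, the squared-error gradient would be 2(ŷ − y): same direction, different scale, which in practice is absorbed into the learning rate.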
Section 11

Exercises

Section A โ€” Multiple Choice Questions (10)

Q1.

What is the range of the sigmoid function σ(z)?

  A. [0, 1]
  B. (0, 1)
  C. [−1, 1]
  D. (−∞, +∞)
✅ B) (0, 1). The sigmoid asymptotically approaches 0 and 1 but never reaches them, so its outputs lie strictly between 0 and 1 (open interval). Mathematically, log(σ(z)) is therefore always defined. In floating-point arithmetic, however, σ(z) can still round to exactly 0 or 1 for large |z|, which is why implementations clip predictions before taking the log.
Remember | Difficulty: Easy
Q2.

The derivative of the sigmoid function σ′(z) equals:

  A. σ(z) + σ(−z)
  B. σ(z) × (1 − σ(z))
  C. σ(z)²
  D. e^(−z) / (1 + e^(−z))
✅ B) σ(z) × (1 − σ(z)). This elegant result is derived using the chain rule on (1 + e^(−z))^(−1). It also means the maximum gradient is σ′(0) = 0.25, and gradients vanish for |z| ≫ 0 (the saturation problem).
Remember | Difficulty: Easy
Q3.

Why is cross-entropy preferred over MSE as the loss function for logistic regression?

  A. Cross-entropy is easier to compute
  B. Cross-entropy is convex when composed with sigmoid; MSE is not
  C. MSE requires more memory
  D. Cross-entropy works only for binary classification
✅ B) Cross-entropy is convex when composed with sigmoid; MSE is not. Convexity means there are no spurious local minima, so gradient descent with a suitable learning rate heads toward the global optimum. MSE + sigmoid creates a non-convex surface with local minima and vanishing gradients in the sigmoid saturation regions.
Understand | Difficulty: Medium
Q4.

In the gradient ∂L/∂z = a − y, if the true label y = 1 and the model predicts a = 0.9, what is ∂L/∂z?

  A. +0.1
  B. −0.1
  C. +0.9
  D. −0.9
✅ B) −0.1. ∂L/∂z = a − y = 0.9 − 1.0 = −0.1. The negative gradient means the weight update w ← w − α(−0.1)x = w + 0.1αx moves w in the direction of x (w increases when x is positive), which increases z and pushes σ(z) toward 1, nudging the prediction closer to the true label. The small magnitude (0.1) means the model is already close and needs only a small correction.
Apply | Difficulty: Medium
Q5.

Bajaj Finance uses logistic regression for loan scoring. If the learned weight for "number of hard credit inquiries in last 6 months" is +0.42, this means:

  A. More inquiries decrease default probability
  B. More inquiries increase default probability
  C. Inquiries have no effect on the prediction
  D. The feature should be removed
✅ B) More inquiries increase default probability. A positive weight means that as the feature value increases, z = w·x + b increases, and σ(z) increases, meaning P(default) increases. Intuitively, many hard inquiries suggest the person is desperately seeking credit from multiple sources, which is a risk signal.
Apply | Difficulty: Medium
Q6.

What is σ(0)?

  A. 0
  B. 0.25
  C. 0.5
  D. 1
✅ C) 0.5. σ(0) = 1/(1 + e⁰) = 1/(1+1) = 0.5. This is the decision boundary: when z = w·x + b = 0, the model is exactly 50% confident for each class.
Remember | Difficulty: Easy
Q7.

The binary cross-entropy loss for a single sample with y=1 and ŷ=0.01 is approximately:

  A. 0.01
  B. 0.99
  C. 2.30
  D. 4.61
✅ D) 4.61. L = −[y log(ŷ) + (1−y) log(1−ŷ)] = −[1·log(0.01) + 0] = −log(0.01) = −(−4.605) = 4.605 ≈ 4.61. The model is confidently wrong (predicting a 1% chance when the true label is 1), so the penalty is severe. Cross-entropy's −log function creates this harsh, asymmetric punishment for confident mistakes.
Apply | Difficulty: Medium
Q8.

In the vectorized gradient formula dw = (1/m) · X^T · (A − Y), what are the shapes if X is (200, 5)?

  A. dw: (200, 1), X^T: (5, 200), (A−Y): (200, 1)
  B. dw: (5, 1), X^T: (5, 200), (A−Y): (200, 1)
  C. dw: (5,), X^T: (200, 5), (A−Y): (200,)
  D. dw: (200,), X^T: (200, 5), (A−Y): (5,)
✅ B) dw: (5, 1), X^T: (5, 200), (A−Y): (200, 1). X is (200, 5) → X^T is (5, 200). (A−Y) is (200, 1). Matrix multiplication: (5, 200) × (200, 1) = (5, 1). dw has one gradient per feature, exactly what we expect for 5 features.
Analyze | Difficulty: Hard
Q9.

Which property of the sigmoid is crucial for the "vanishing gradient problem" in deep networks?

  A. σ(z) is always positive
  B. σ′(z) ≤ 0.25 for all z, causing gradients to shrink when multiplied across layers
  C. σ(z) is symmetric around z = 0
  D. σ(z) never equals exactly 0 or 1
✅ B) σ′(z) ≤ 0.25 for all z. The maximum gradient of sigmoid is 0.25 (at z=0). In a deep network with L sigmoid layers, backpropagation multiplies one such factor per layer, so the product of activation derivatives is at most 0.25^L. For L=10 layers: 0.25¹⁰ ≈ 10^(−6), and the gradient virtually disappears. This is why deep networks prefer ReLU (max gradient = 1) over sigmoid for hidden layers. Sigmoid is still used for the output layer of binary classifiers.
Analyze | Difficulty: Hard
Q10.

A logistic regression model for Flipkart's "will the customer return this product?" has weights: w_price = −0.03, w_reviews = −0.15, w_delivery_delay = +0.28. Which factor most strongly predicts product returns?

  A. Price (higher price → fewer returns)
  B. Number of reviews
  C. Delivery delay (longer delay → more returns)
  D. All factors contribute equally
✅ C) Delivery delay. |+0.28| is the largest absolute weight, so, assuming the features are on comparable (standardized) scales, delivery delay has the strongest influence on the prediction. The positive sign means longer delays increase the probability of return. In practice, Flipkart has found that orders delayed beyond 3 days have 2.5× higher return rates, regardless of product quality.
Evaluate | Difficulty: Medium

Section B โ€” Short Answer Questions (5)

✍️ Answer in 3–5 sentences each

B1. Prove that σ(−z) = 1 − σ(z). What does this symmetry property mean geometrically for the sigmoid curve?

B2. Explain why we use np.clip(y_hat, 1e-15, 1-1e-15) before computing the binary cross-entropy loss. What would happen without this clipping?

B3. In the SBI CIBIL case study, the model has a weight of −0.82 for CIBIL score (normalized). Interpret this weight in business terms. What happens to the predicted default probability when CIBIL score increases by one standard deviation?

B4. The vectorized implementation is ~2,600× faster than the loop version. Explain why NumPy vectorization is so much faster, referencing BLAS routines and CPU-level optimizations.

B5. Can logistic regression handle a dataset where Class 0 has 9,500 samples and Class 1 has 500 samples? What problems arise, and what are two solutions?

Section C โ€” Long Answer Questions (3)

📝 Answer in 1–2 pages each

C1. Full Derivation: Starting from the Bernoulli distribution P(y|x) = ŷ^y (1−ŷ)^(1−y), derive the binary cross-entropy loss function step by step. Then derive the gradient ∂J/∂w by applying the chain rule through the computation graph z → a → L. Show every intermediate step and verify that the final gradient is (1/m)Σ(a−y)x.

C2. Comparative Analysis: Compare logistic regression with a single-hidden-layer neural network (with sigmoid activation) for binary classification. Draw both architectures. Explain what additional representational power the hidden layer provides. Use the XOR problem as an example where logistic regression fails but a neural network succeeds. What is the fundamental reason?

C3. Regularization Deep Dive: Explain L1 (Lasso) and L2 (Ridge) regularization for logistic regression. Write the modified loss functions for both. Derive the modified gradient update rule for L2 regularization. Explain why L1 produces sparse weights (some weights become exactly zero) while L2 produces small-but-nonzero weights. In the context of Bajaj Finance's loan model with 47 features, which regularization would you recommend and why?

Section D โ€” Programming Exercises (3)

💻 Code in Python with NumPy

D1. Learning Rate Explorer: Using the LogisticRegression class from Section 4, train the model on the same synthetic dataset with five different learning rates: α ∈ {0.001, 0.01, 0.1, 1.0, 10.0}. Plot all five loss curves on the same graph. Which learning rate converges fastest? Which diverges? Write a 3-sentence analysis.

D2. Multi-Feature Loan Predictor: Generate a synthetic Indian loan dataset with 5 features: (1) monthly income in ₹, (2) CIBIL score, (3) existing EMIs, (4) years of employment, (5) age. Create 500 samples with realistic distributions. Train your from-scratch logistic regression. Print the learned weights and interpret each one: which feature matters most? Does the interpretation make business sense?

D3. Mini-Batch Gradient Descent: Modify the LogisticRegression class to support mini-batch gradient descent. Add a batch_size parameter. Instead of computing gradients on the full dataset, randomly sample batch_size samples each iteration. Train with batch_size ∈ {1, 16, 64, m} and compare: (a) loss curves (noisier for smaller batches), (b) final accuracy, and (c) training time. Which batch size gives the best trade-off?

Section E โ€” Mini-Project

🚀 Project: Build an Indian Loan Default Predictor

Objective

Build a complete end-to-end logistic regression pipeline for predicting loan defaults, simulating what Bajaj Finance or SBI would build.

Requirements

  1. Data Generation: Create a synthetic dataset of 2,000 loan applicants with realistic Indian features:
    • Monthly salary (₹15,000 – ₹3,00,000, log-normal distribution)
    • CIBIL score (300–900, skewed toward 650–750)
    • Age (21–65)
    • Existing EMI-to-income ratio (0.0 – 0.8)
    • Years at current employer (0–30)
    • Loan amount requested (₹50,000 – ₹25,00,000)
    • Number of credit inquiries in last 6 months (0–12)
  2. Label Generation: Generate realistic default labels based on a known probability formula (your own logistic model with known weights + random noise)
  3. Implementation: Use your from-scratch LogisticRegression class; no scikit-learn for the model
  4. Evaluation: Train/test split (80/20). Report accuracy, precision, recall, F1-score. Plot the ROC curve.
  5. Visualization: (a) Loss curve, (b) Feature importance bar chart, (c) Probability distribution for defaulters vs non-defaulters
  6. Comparison: Compare your from-scratch model's performance with scikit-learn's LogisticRegression
  7. Report: Write a 1-page "model card" documenting: model purpose, features used, performance metrics, limitations, and fairness considerations (does the model discriminate by age?)

Deliverables

  • A single Jupyter notebook with all code, plots, and analysis
  • The 1-page model card as a markdown cell
Stretch goal: Add L2 regularization to your from-scratch class. Compare unregularized vs regularized performance. Does regularization help when you have 7 features and 2,000 samples?
Section 12

Chapter Summary

🧠 Key Takeaways from Chapter 5

  1. Logistic regression is a binary classifier that applies the sigmoid function σ(z) = 1/(1+e^(−z)) to a linear combination z = w·x + b, outputting a probability in (0, 1).
  2. The sigmoid has a beautiful derivative: σ′(z) = σ(z)(1−σ(z)), with maximum value 0.25 at z = 0 and vanishing gradients at the extremes.
  3. The Binary Cross-Entropy loss J = −(1/m) Σ[y log(ŷ) + (1−y) log(1−ŷ)] is derived from Maximum Likelihood Estimation of the Bernoulli distribution. It is convex, penalizes confident wrong predictions harshly, and pairs naturally with the sigmoid.
  4. The gradient of the loss with respect to the pre-activation is elegantly simple: ∂L/∂z = ŷ − y (prediction minus truth). This drives the gradient descent update rules.
  5. The computation graph decomposes the model into elementary operations (multiply, add, sigmoid, log), enabling systematic application of the chain rule for backpropagation.
  6. Our from-scratch LogisticRegression class implements: sigmoid (numerically stable), BCE loss (with epsilon clipping), forward pass, backward pass, and parameter updates, all in ~90 lines of Python.
  7. Vectorized NumPy code is ~2,600× faster than explicit Python loops for the same computation, thanks to BLAS routines and SIMD instructions.
  8. In the worked example, we traced a complete iteration with 3 samples and verified that one gradient descent step moves the predictions for the two positive samples toward 1 (the negative sample stays at 0.5 after this first step).
  9. SBI and CIBIL use logistic regression as the backbone of India's credit scoring system, processing millions of loan decisions with explainable, regulation-compliant models.
  10. Despite its name, logistic regression is used for classification, not for predicting continuous values. Its sigmoid output is not necessarily a calibrated probability. Its decision boundary is linear (but can be extended with polynomial features).
The Complete Logistic Regression Algorithm (One Slide Summary)

Initialize: w = 0, b = 0
Repeat for T iterations:
  1. Forward: z = Xw + b,   a = σ(z)
  2. Loss: J = −(1/m) Σ[y log(a) + (1−y) log(1−a)]
  3. Backward: dw = (1/m) X^T(a−y),   db = (1/m) Σ(a−y)
  4. Update: w ← w − α·dw,   b ← b − α·db
Predict: ŷ = 1 if σ(w·x + b) ≥ 0.5, else 0
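
The summary maps almost line for line onto NumPy. The sketch below is a compact functional version written for illustration, not the chapter's LogisticRegression class; the function names `train` and `predict`, the default hyperparameters, and the reuse of the three-sample dataset from the worked example are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, iterations=1000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0                       # Initialize
    history = []
    for _ in range(iterations):
        a = sigmoid(X @ w + b)                    # 1. Forward
        a_safe = np.clip(a, 1e-15, 1 - 1e-15)     # guard against log(0)
        loss = -np.mean(y * np.log(a_safe) + (1 - y) * np.log(1 - a_safe))
        history.append(float(loss))               # 2. Loss
        dw = X.T @ (a - y) / m                    # 3. Backward
        db = np.mean(a - y)
        w, b = w - alpha * dw, b - alpha * db     # 4. Update
    return w, b, history

def predict(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Three-sample dataset from the worked example.
X = np.array([[0.5, 0.8], [-0.3, -0.5], [0.2, 0.1]])
y = np.array([1.0, 0.0, 1.0])

w, b, history = train(X, y)
print(predict(X, w, b), round(history[0], 4), round(history[-1], 4))
```

Recording the loss inside the loop costs little and gives the loss curve needed to diagnose learning-rate problems.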

What's Next?

In Chapter 6, we'll extend logistic regression to handle multiple classes (Softmax Regression) and then stack multiple logistic units into layers, building our first true neural network. The sigmoid, cross-entropy, computation graph, and gradient descent you learned here will be the foundation for everything that follows.

Section 13

References & Further Reading

Textbooks

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 4.3: Logistic Regression. Springer.
  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 5.5: Maximum Likelihood Estimation; Chapter 6.2.2: Sigmoid Units. MIT Press. deeplearningbook.org
  3. Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction, Chapter 10: Logistic Regression. MIT Press.
  4. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, Chapter 4.4. Springer.

Online Courses

  1. Ng, A. (2017). Neural Networks and Deep Learning (Course 1, Week 2: Logistic Regression as a Neural Network). Coursera / deeplearning.ai.
  2. Stanford CS229: Machine Learning. Lecture Notes on GLMs and Logistic Regression.

Indian Industry & Regulatory

  1. TransUnion CIBIL (2024). CIBIL Score: How It's Calculated. cibil.com
  2. Reserve Bank of India (2023). Guidelines on Digital Lending. RBI Circular. rbi.org.in
  3. Bajaj Finance Annual Report 2023–24. Risk Management: Model Governance Framework.
  4. SBI Annual Report 2023–24. Credit Risk Management Using Statistical Models.

Research Papers

  1. Cox, D. R. (1958). "The regression analysis of binary sequences." Journal of the Royal Statistical Society, Series B, 20(2), 215–242. [The original logistic regression paper]
  2. Platt, J. (1999). "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods." Advances in Large Margin Classifiers. [Platt Scaling for calibration]

Implementation References

  1. NumPy Documentation: numpy.org
  2. Scikit-learn LogisticRegression: API documentation