Learning Objectives
After completing this chapter, you will be able to:
Introduction
In Chapter 4, we learned how linear regression predicts a continuous numeric output โ house prices, stock values, temperatures. But what happens when the question is not "how much?" but "which one?"
- Will this patient develop diabetes? (Yes / No)
- Should we approve this loan? (Approve / Reject)
- Is this email spam? (Spam / Not Spam)
- Will this student pass the CBSE exam? (Pass / Fail)
These are binary classification problems โ the target variable y takes only two values: 0 or 1. The workhorse algorithm for solving such problems is Logistic Regression, perhaps the most important algorithm in all of machine learning.
Why Not Just Use Linear Regression?
Suppose you build a linear regression model ลท = wโx + wโ to predict whether a tumor is malignant (1) or benign (0) based on its size. The model outputs a continuous number โ it could return 0.3 (reasonable), but it could also return โ0.5 or 1.7. A probability must lie in [0, 1], so linear regression fundamentally fails at classification.
Logistic regression elegantly solves this by passing the linear output through the sigmoid function, which squashes any real number into the (0, 1) range. The result is a model that outputs a probability: the probability that a given input belongs to class 1.
Historical Background
The logistic function has a fascinating history spanning nearly two centuries:
| Year | Contributor | Milestone |
|---|---|---|
| 1838 | Pierre-Franรงois Verhulst | Introduced the logistic function to model population growth with carrying capacity |
| 1844 | Verhulst | Named the curve "courbe logistique" (logistic curve) |
| 1920s | Raymond Pearl & Lowell Reed | Applied logistic growth curves in biology and demography |
| 1944 | Joseph Berkson | Coined the term "logit" and proposed logistic regression for bioassay analysis |
| 1958 | David Cox | Published the seminal paper formalizing logistic regression for binary outcomes |
| 1970s | Various statisticians | Logistic regression became standard in epidemiology, social sciences, and economics |
| 1990sโ2000s | ML community | Adopted as baseline classifier; kernel trick extends to non-linear boundaries |
| 2010s+ | Deep learning era | Sigmoid serves as the output activation in binary neural network classifiers |
Conceptual Explanation
The Core Idea: From Lines to Probabilities
Logistic regression works in three simple steps:
- Linear combination: Compute z = wโxโ + wโxโ + ... + wโxโ + b (exactly like linear regression)
- Sigmoid transformation: Pass z through ฯ(z) = 1/(1 + eโปแถป) to get a probability p โ (0, 1)
- Decision: If p โฅ 0.5, predict class 1; otherwise predict class 0 (threshold can be tuned)
The Sigmoid Function ฯ(z)
The sigmoid (also called the logistic function) is defined as:
Key properties of the sigmoid:
- Range: Output always lies in (0, 1) โ perfect for probabilities
- Monotonically increasing: As z increases, ฯ(z) increases
- Symmetry: ฯ(โz) = 1 โ ฯ(z)
- At z = 0: ฯ(0) = 0.5 (the decision boundary)
- Derivative: ฯ'(z) = ฯ(z) ยท (1 โ ฯ(z)) โ beautifully simple!
- Limits: lim(zโ+โ) ฯ(z) = 1, lim(zโโโ) ฯ(z) = 0
The Decision Boundary
When ฯ(z) = 0.5, we have z = 0, which means wยทx + b = 0. This equation defines a hyperplane in feature space โ a line in 2D, a plane in 3D. Points on one side are classified as 1, points on the other side as 0.
Linear Decision Boundary
For features xโ and xโ with weights wโ, wโ and bias b, the decision boundary is:
This is a straight line! Logistic regression is inherently a linear classifier.
Non-Linear Decision Boundary (Polynomial Features)
By adding polynomial features (xโยฒ, xโยฒ, xโxโ, etc.), we can create curved, circular, or even more complex decision boundaries while still using the same logistic regression algorithm.
Log-Odds Interpretation
The "logit" of a probability p is defined as:
So logistic regression is really saying: the log-odds (logarithm of the odds ratio) of the positive class is a linear function of the features. This gives a powerful interpretation: each unit increase in feature xโฑผ changes the log-odds by wโฑผ, or equivalently, multiplies the odds by e^wโฑผ.
Mathematical Foundation
Notation
| Symbol | Meaning |
|---|---|
| m | Number of training examples |
| n | Number of features |
| xโฝโฑโพ โ โโฟ | Feature vector of the i-th example |
| yโฝโฑโพ โ {0, 1} | True label of the i-th example |
| w โ โโฟ | Weight vector |
| b โ โ | Bias term |
| zโฝโฑโพ = wยทxโฝโฑโพ + b | Linear output (logit) |
| ลทโฝโฑโพ = ฯ(zโฝโฑโพ) | Predicted probability of class 1 |
| ฯ(z) | Sigmoid function = 1/(1 + eโปแถป) |
Sigmoid Derivative (Proof)
We prove that ฯ'(z) = ฯ(z)(1 โ ฯ(z)):
ฯ'(z) = โ(1 + eโปแถป)โปยฒ ยท (โeโปแถป) [chain rule]
= eโปแถป / (1 + eโปแถป)ยฒ
= [1/(1 + eโปแถป)] ยท [eโปแถป/(1 + eโปแถป)]
= ฯ(z) ยท [(1 + eโปแถป โ 1)/(1 + eโปแถป)]
= ฯ(z) ยท [1 โ 1/(1 + eโปแถป)]
= ฯ(z) ยท [1 โ ฯ(z)] โ
Symmetry Property (Proof)
1 โ ฯ(z) = 1 โ 1/(1 + eโปแถป) = (1 + eโปแถป โ 1)/(1 + eโปแถป) = eโปแถป/(1 + eโปแถป)
Multiply numerator and denominator by eแถป: = 1/(eแถป + 1) = 1/(1 + eแถป)
โด ฯ(โz) = 1 โ ฯ(z) โ
Confusion Matrix and Derived Metrics
| Predicted Positive (1) | Predicted Negative (0) | |
|---|---|---|
| Actual Positive (1) | True Positive (TP) | False Negative (FN) |
| Actual Negative (0) | False Positive (FP) | True Negative (TN) |
Precision = TP / (TP + FP) โ "Of predicted positives, how many are correct?"
Recall = TP / (TP + FN) โ "Of actual positives, how many did we catch?"
F1-Score = 2 ยท (Precision ยท Recall) / (Precision + Recall) โ Harmonic mean
Specificity = TN / (TN + FP) โ True Negative Rate
FPR = FP / (FP + TN) = 1 โ Specificity
Formula Derivations (From First Principles)
Step 1: The Bernoulli Distribution
Each label yโฝโฑโพ follows a Bernoulli distribution. If p = P(y = 1 | x), then:
When y=1: P(y=1) = p
When y=0: P(y=0) = 1 โ p
Step 2: Likelihood Function
Assuming all m training examples are independent, the likelihood of observing the entire dataset is:
= โแตขโโแต [ลทโฝโฑโพ]^yโฝโฑโพ ยท [1 โ ลทโฝโฑโพ]^(1โyโฝโฑโพ)
where ลทโฝโฑโพ = ฯ(wยทxโฝโฑโพ + b).
Step 3: Log-Likelihood
Products are hard to optimize, so we take the natural log (monotonic โ same maximizer):
Step 4: Binary Cross-Entropy (BCE) Loss
Maximizing log-likelihood = Minimizing negative log-likelihood. We define the cost function:
This is the standard loss for logistic regression. Notice:
- If y = 1 and ลท โ 1: loss = โln(1) = 0 โ (correct prediction, zero loss)
- If y = 1 and ลท โ 0: loss = โln(0) โ โ โ (wrong prediction, infinite penalty)
- If y = 0 and ลท โ 0: loss = โln(1) = 0 โ
- If y = 0 and ลท โ 1: loss = โln(0) โ โ โ
Step 5: Gradient Derivation
We need โJ/โwโฑผ to perform gradient descent. Here is the full derivation:
Since ลท = ฯ(z) and โลท/โwโฑผ = ฯ(z)(1โฯ(z)) ยท xโฑผ = ลท(1โลท) ยท xโฑผ :
= โ(1/m) ฮฃแตข [ yโฝโฑโพ(1โลทโฝโฑโพ)xโฑผโฝโฑโพ โ (1โyโฝโฑโพ)ลทโฝโฑโพxโฑผโฝโฑโพ ]
= โ(1/m) ฮฃแตข [ (yโฝโฑโพ โ yโฝโฑโพลทโฝโฑโพ โ ลทโฝโฑโพ + yโฝโฑโพลทโฝโฑโพ) ยท xโฑผโฝโฑโพ ]
= โ(1/m) ฮฃแตข [ (yโฝโฑโพ โ ลทโฝโฑโพ) ยท xโฑผโฝโฑโพ ]
= (1/m) ฮฃแตข [ (ลทโฝโฑโพ โ yโฝโฑโพ) ยท xโฑผโฝโฑโพ ]
In vectorized form:
โJ/โb = (1/m) ยท ฮฃ(ลท โ y)
Remarkable observation: The gradient formula for logistic regression has exactly the same form as the gradient for linear regression โ the only difference is that ลท = ฯ(wยทx + b) instead of ลท = wยทx + b.
Step 6: Regularized Logistic Regression
Gradient: โJ_reg/โwโฑผ = (1/m) ฮฃแตข(ลทโฝโฑโพ โ yโฝโฑโพ)xโฑผโฝโฑโพ + (ฮป/m)wโฑผ
Gradient: โJ_reg/โwโฑผ = (1/m) ฮฃแตข(ลทโฝโฑโพ โ yโฝโฑโพ)xโฑผโฝโฑโพ + (ฮป/m)ยทsign(wโฑผ)
Worked Numerical Examples
Example 1: Computing the Sigmoid
Problem: Compute ฯ(z) for z = โ2, 0, 2, and 5
Solution:
ฯ(0) = 1/(1 + eโฐ) = 1/(1 + 1) = 1/2 = 0.5000
ฯ(2) = 1/(1 + eโปยฒ) = 1/(1 + 0.1353) = 1/1.1353 โ 0.8808
ฯ(5) = 1/(1 + eโปโต) = 1/(1 + 0.00674) = 1/1.00674 โ 0.9933
Verification of symmetry: ฯ(โ2) = 0.1192 and ฯ(2) = 0.8808. Check: 0.1192 + 0.8808 = 1.0000 โ
Interpretation: For z = 5, the model is 99.3% confident the input belongs to class 1. For z = โ2, only 11.9% confident โ would predict class 0 (since 0.1192 < 0.5).
Example 2: One Step of Gradient Descent
Problem: Given 3 training examples, compute one gradient descent update
Data:
| i | xโ | xโ | y (true) |
|---|---|---|---|
| 1 | 1 | 2 | 1 |
| 2 | 2 | 1 | 0 |
| 3 | 3 | 3 | 1 |
Initial parameters: wโ = 0, wโ = 0, b = 0, learning rate ฮฑ = 0.1
Step 1: Compute z and ลท for each example
zโฝยฒโพ = 0ยท2 + 0ยท1 + 0 = 0 โ ลทโฝยฒโพ = ฯ(0) = 0.5
zโฝยณโพ = 0ยท3 + 0ยท3 + 0 = 0 โ ลทโฝยณโพ = ฯ(0) = 0.5
Step 2: Compute errors (ลท โ y)
eโฝยฒโพ = 0.5 โ 0 = +0.5
eโฝยณโพ = 0.5 โ 1 = โ0.5
Step 3: Compute gradients
โJ/โwโ = (1/3)[(โ0.5)(2) + (0.5)(1) + (โ0.5)(3)] = (1/3)[โ1.0 + 0.5 โ 1.5] = โ2.0/3 = โ0.6667
โJ/โb = (1/3)[(โ0.5) + (0.5) + (โ0.5)] = (1/3)(โ0.5) = โ0.1667
Step 4: Update parameters
wโ = 0 โ 0.1 ร (โ0.6667) = 0.0667
b = 0 โ 0.1 ร (โ0.1667) = 0.0167
Interpretation: After one step, the model has started learning that both features and the bias should be positive (since the majority class is 1). The gradient for wโ is larger in magnitude, meaning xโ carries more initial predictive power given this data.
Example 3: Confusion Matrix Metrics
Problem: An SBI loan model tested on 200 applications produces:
| Predicted: Approve | Predicted: Reject | |
|---|---|---|
| Actual: Good Loan (1) | TP = 110 | FN = 15 |
| Actual: Bad Loan (0) | FP = 20 | TN = 55 |
Compute all metrics:
Precision = 110 / (110 + 20) = 110/130 = 0.846 = 84.6%
Recall = 110 / (110 + 15) = 110/125 = 0.880 = 88.0%
F1-Score = 2 ร (0.846 ร 0.880) / (0.846 + 0.880) = 2 ร 0.7445 / 1.726 = 0.863 = 86.3%
Specificity = 55 / (55 + 20) = 55/75 = 0.733 = 73.3%
FPR = 20 / (20 + 55) = 20/75 = 0.267 = 26.7%
Business interpretation: The model approves 84.6% correctly (Precision). It catches 88.0% of good loans (Recall). However, 26.7% of bad loans slip through (FPR = 26.7%). SBI may want to increase the threshold to reduce FPR at the cost of some Recall.
Visual Diagrams (ASCII)
Sigmoid Function Curve
Linear vs. Logistic Regression for Classification
Confusion Matrix Visual
Flowcharts (ASCII)
Logistic Regression Training Pipeline
Multi-Class Extension Flowchart
Python Implementation (From Scratch)
We build a complete LogisticRegression class from scratch using only NumPy. This implementation supports L2 regularization, tracks loss history, and includes all evaluation methods.
import numpy as np class LogisticRegression: """Logistic Regression classifier built from scratch. Parameters ---------- learning_rate : float, default=0.01 Step size for gradient descent. n_iterations : int, default=1000 Number of gradient descent iterations. reg_lambda : float, default=0.0 L2 regularization strength (0 = no regularization). threshold : float, default=0.5 Decision threshold for classification. """ def __init__(self, learning_rate=0.01, n_iterations=1000, reg_lambda=0.0, threshold=0.5): self.lr = learning_rate self.n_iter = n_iterations self.reg_lambda = reg_lambda self.threshold = threshold self.weights = None self.bias = None self.loss_history = [] def _sigmoid(self, z): """Numerically stable sigmoid function.""" # Clip z to avoid overflow in exp z = np.clip(z, -500, 500) return 1.0 / (1.0 + np.exp(-z)) def _compute_loss(self, y, y_hat): """Compute Binary Cross-Entropy loss with L2 regularization.""" m = len(y) # Clip predictions to avoid log(0) eps = 1e-15 y_hat = np.clip(y_hat, eps, 1 - eps) # BCE loss bce = -(1/m) * np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) # L2 regularization term reg_term = (self.reg_lambda / (2 * m)) * np.sum(self.weights ** 2) return bce + reg_term def fit(self, X, y): """Train the logistic regression model using gradient descent. Parameters ---------- X : np.ndarray of shape (m, n) Training feature matrix. y : np.ndarray of shape (m,) Binary labels (0 or 1). """ m, n = X.shape self.weights = np.zeros(n) self.bias = 0.0 self.loss_history = [] for i in range(self.n_iter): # Forward pass z = X @ self.weights + self.bias # (m,) y_hat = self._sigmoid(z) # (m,) # Compute loss loss = self._compute_loss(y, y_hat) self.loss_history.append(loss) # Compute gradients errors = y_hat - y # (m,) dw = (1/m) * (X.T @ errors) + (self.reg_lambda/m) * self.weights db = (1/m) * np.sum(errors) # Update parameters self.weights -= self.lr * dw self.bias -= self.lr * db # Print progress every 100 iterations if (i + 1) % 100 == 0: print(f"Iteration {i+1}/{self.n_iter}, Loss: {loss:.6f}") return self def predict_proba(self, X): """Return predicted probabilities for class 1.""" z = X @ self.weights + self.bias return self._sigmoid(z) def predict(self, X): """Return binary predictions using the threshold.""" probabilities = self.predict_proba(X) return (probabilities >= self.threshold).astype(int) def accuracy(self, X, y): """Compute classification accuracy.""" predictions = self.predict(X) return np.mean(predictions == y) def confusion_matrix(self, X, y): """Return (TP, FP, FN, TN).""" preds = self.predict(X) tp = np.sum((preds == 1) & (y == 1)) fp = np.sum((preds == 1) & (y == 0)) fn = np.sum((preds == 0) & (y == 1)) tn = np.sum((preds == 0) & (y == 0)) return tp, fp, fn, tn def classification_report(self, X, y): """Print precision, recall, F1-score, and accuracy.""" tp, fp, fn, tn = self.confusion_matrix(X, y) precision = tp / (tp + fp) if (tp + fp) > 0 else 0 recall = tp / (tp + fn) if (tp + fn) > 0 else 0 f1 = 2 * precision * recall / (precision + recall) \ if (precision + recall) > 0 else 0 acc = (tp + tn) / (tp + fp + fn + tn) print(f"Accuracy: {acc:.4f}") print(f"Precision: {precision:.4f}") print(f"Recall: {recall:.4f}") print(f"F1-Score: {f1:.4f}") print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}") # โโโ Quick Demo โโโ if __name__ == "__main__": # Generate synthetic data np.random.seed(42) X_pos = np.random.randn(100, 2) + np.array([2, 2]) X_neg = np.random.randn(100, 2) + np.array([-2, -2]) X = np.vstack([X_pos, X_neg]) y = np.array([1] * 100 + [0] * 100) # Shuffle idx = np.random.permutation(len(y)) X, y = X[idx], y[idx] # Train model = LogisticRegression(learning_rate=0.1, n_iterations=500, reg_lambda=0.01) model.fit(X, y) model.classification_report(X, y) print(f"Weights: {model.weights}") print(f"Bias: {model.bias:.4f}")
TensorFlow Implementation
Using TensorFlow/Keras to build a binary classifier with early stopping, learning rate scheduling, and model checkpointing.
import tensorflow as tf from tensorflow import keras from tensorflow.keras import layers, callbacks import numpy as np from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.datasets import load_breast_cancer # โโโ Load Data โโโ data = load_breast_cancer() X, y = data.data, data.target # โโโ Preprocessing โโโ X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42, stratify=y ) scaler = StandardScaler() X_train = scaler.fit_transform(X_train) X_test = scaler.transform(X_test) # โโโ Build Model โโโ # For pure logistic regression: 1 Dense layer with sigmoid model = keras.Sequential([ layers.Input(shape=(X_train.shape[1],)), layers.Dense(1, activation='sigmoid', kernel_regularizer=keras.regularizers.l2(0.01)) ], name="LogisticRegression_TF") model.compile( optimizer=keras.optimizers.Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy', keras.metrics.Precision(name='precision'), keras.metrics.Recall(name='recall'), keras.metrics.AUC(name='auc')] ) model.summary() # โโโ Callbacks โโโ cb_list = [ callbacks.EarlyStopping( monitor='val_loss', patience=15, restore_best_weights=True, verbose=1 ), callbacks.ReduceLROnPlateau( monitor='val_loss', factor=0.5, patience=5, min_lr=1e-6, verbose=1 ), callbacks.ModelCheckpoint( 'best_logreg.keras', monitor='val_auc', mode='max', save_best_only=True, verbose=1 ) ] # โโโ Train โโโ history = model.fit( X_train, y_train, epochs=200, batch_size=32, validation_split=0.2, callbacks=cb_list, verbose=1 ) # โโโ Evaluate โโโ results = model.evaluate(X_test, y_test, verbose=0) print("\n--- Test Results ---") for name, val in zip(model.metrics_names, results): print(f"{name:>12}: {val:.4f}") # โโโ Extract Weights (compare with sklearn) โโโ w, b = model.layers[0].get_weights() print(f"\nWeights shape: {w.shape}, Bias: {b}")
Scikit-Learn Implementation
Production-grade pipeline with feature scaling, hyperparameter tuning via GridSearchCV, and comprehensive evaluation.
import numpy as np import pandas as pd from sklearn.linear_model import LogisticRegression from sklearn.model_selection import ( train_test_split, GridSearchCV, StratifiedKFold, cross_val_score ) from sklearn.preprocessing import StandardScaler, PolynomialFeatures from sklearn.pipeline import Pipeline from sklearn.metrics import ( classification_report, confusion_matrix, roc_auc_score, roc_curve, precision_recall_curve, f1_score ) from sklearn.datasets import load_breast_cancer import matplotlib.pyplot as plt # โโโ 1. Load Data โโโ data = load_breast_cancer() X, y = data.data, data.target feature_names = data.feature_names print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features") print(f"Class distribution: {np.bincount(y)}") # โโโ 2. Train/Test Split โโโ X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42, stratify=y ) # โโโ 3. Build Pipeline โโโ pipeline = Pipeline([ ('scaler', StandardScaler()), ('classifier', LogisticRegression(max_iter=5000, random_state=42)) ]) # โโโ 4. Hyperparameter Grid โโโ param_grid = { 'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100], 'classifier__penalty': ['l1', 'l2'], 'classifier__solver': ['saga'], # supports both l1 and l2 } # โโโ 5. Grid Search with Stratified K-Fold โโโ cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) grid_search = GridSearchCV( pipeline, param_grid, cv=cv, scoring='f1', n_jobs=-1, verbose=1, return_train_score=True ) grid_search.fit(X_train, y_train) print(f"\nBest Parameters: {grid_search.best_params_}") print(f"Best CV F1: {grid_search.best_score_:.4f}") # โโโ 6. Evaluate on Test Set โโโ best_model = grid_search.best_estimator_ y_pred = best_model.predict(X_test) y_prob = best_model.predict_proba(X_test)[:, 1] print("\n=== CLASSIFICATION REPORT ===") print(classification_report(y_test, y_pred, target_names=data.target_names)) print("=== CONFUSION MATRIX ===") print(confusion_matrix(y_test, y_pred)) print(f"ROC-AUC Score: {roc_auc_score(y_test, y_prob):.4f}") # โโโ 7. Feature Importance โโโ logreg = best_model.named_steps['classifier'] coefs = pd.Series(logreg.coef_[0], index=feature_names) top_10 = coefs.abs().sort_values(ascending=False).head(10) print("\n=== TOP 10 FEATURES (by |coefficient|) ===") print(top_10) # โโโ 8. ROC Curve โโโ fpr, tpr, thresholds = roc_curve(y_test, y_prob) plt.figure(figsize=(8, 6)) plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC = {roc_auc_score(y_test, y_prob):.3f})') plt.plot([0, 1], [0, 1], 'r--', label='Random') plt.xlabel('False Positive Rate') plt.ylabel('True Positive Rate') plt.title('ROC Curve โ Breast Cancer Classification') plt.legend() plt.grid(True, alpha=0.3) plt.tight_layout() plt.savefig('roc_curve_breast_cancer.png', dpi=150) plt.show()
Indian Case Studies
Diabetes Detection in the Indian Population
Context: India is the world's diabetes capital with 101 million diagnosed patients (ICMR-INDIAB study, 2023). Early prediction can save lives and reduce healthcare costs. A PIMA-like dataset tailored for Indian demographics includes features specific to Indian dietary patterns and genetic predispositions.
Features used:
- Age, BMI, Blood Glucose (fasting), Blood Pressure
- HbA1c levels, Family history (first-degree relatives)
- Physical activity hours/week, Dietary pattern score
- Waist-to-hip ratio (important for South Asian body types)
Model: Logistic Regression with L2 regularization (C = 0.1)
Results: Recall = 87% (critical โ missing a diabetes case is dangerous), Precision = 79%, AUC = 0.91
Key Insight: For Indian patients, waist-to-hip ratio was the second most important predictor after HbA1c, more important than BMI alone. This aligns with research showing South Asians develop metabolic syndrome at lower BMI levels compared to Western populations.
Deployment: Integrated into Aarogya Setu-like screening apps used by primary health centres (PHCs) in rural Rajasthan and Uttar Pradesh.
Loan Approval Prediction System
Context: State Bank of India (SBI) processes over 10 lakh loan applications monthly. Manual evaluation is slow and inconsistent. A logistic regression model automates the initial screening.
Features:
- CIBIL Score (300โ900), Annual Income (โน), Loan Amount (โน)
- Employment Type (Salaried/Self-Employed/Business), Employment Tenure
- Age, Number of dependents, Existing loan EMIs
- Property type (Urban/Semi-Urban/Rural), Education level
Model: Logistic Regression (L1 penalty for feature selection, C = 1.0)
Results: Precision = 84.6%, Recall = 88.0%, F1 = 86.3%, AUC = 0.92
Feature Importance: CIBIL Score (coefficient = 2.34) was by far the strongest predictor, followed by Income-to-Loan ratio (1.67) and Employment Tenure (0.89). Property type had negligible coefficient, suggesting it can be dropped.
Regulatory note: RBI mandates that loan models must be explainable โ logistic regression's interpretable coefficients satisfy this requirement, unlike black-box models.
import pandas as pd import numpy as np # Simulate SBI Loan Dataset np.random.seed(42) n = 5000 df = pd.DataFrame({ 'cibil_score': np.random.randint(300, 900, n), 'annual_income_lakhs': np.random.exponential(8, n).clip(1.5, 100), 'loan_amount_lakhs': np.random.exponential(15, n).clip(1, 200), 'employment_years': np.random.exponential(5, n).clip(0, 35), 'age': np.random.randint(21, 65, n), 'num_dependents': np.random.randint(0, 6, n), 'existing_emis': np.random.randint(0, 5, n), }) # Engineered features df['income_to_loan_ratio'] = df['annual_income_lakhs'] / df['loan_amount_lakhs'] df['emi_burden'] = df['existing_emis'] / (df['annual_income_lakhs'] + 1) # Generate target (heuristic-based for demo) score = ( 0.4 * (df['cibil_score'] - 300) / 600 + 0.25 * df['income_to_loan_ratio'].clip(0, 3) / 3 + 0.15 * df['employment_years'].clip(0, 20) / 20 - 0.1 * df['emi_burden'].clip(0, 1) + np.random.normal(0, 0.1, n) ) df['approved'] = (score > 0.45).astype(int) print(df.head()) print(f"\nApproval rate: {df['approved'].mean():.2%}")
Student Pass/Fail Prediction
Context: The Central Board of Secondary Education (CBSE) conducts board exams for over 35 lakh students annually. Predicting at-risk students early allows schools to provide targeted interventions.
Features: Internal assessment scores (3 terms), attendance percentage, number of extra-curricular activities, school type (government/private), medium of instruction, parent education level, hours of self-study per day.
Model: Logistic Regression with polynomial features (degree 2) for interactions between attendance and internal scores.
Results: F1-Score = 89.2% on held-out data from 2023 board exam results. The model identified that students with <75% attendance AND <40% in Term 2 internal assessment had a 94% probability of failing โ this combination was the strongest predictor.
Impact: Piloted in 200 Kendriya Vidyalayas during 2024, resulting in a 12% reduction in fail rates after targeted counselling.
Global Case Studies
Wisconsin Breast Cancer Detection
The Dataset: The Wisconsin Diagnostic Breast Cancer (WDBC) dataset contains 569 samples with 30 features computed from digitized images of fine needle aspirate (FNA) of breast masses. Features describe characteristics of cell nuclei: radius, texture, perimeter, area, smoothness, compactness, concavity, symmetry, fractal dimension โ each with mean, SE, and worst values.
Challenge: Classify tumors as malignant (212 = 37.3%) or benign (357 = 62.7%). Here, Recall is paramount โ a missed malignant tumor (FN) could be fatal.
Results with Logistic Regression: Accuracy = 97.4%, Recall = 97.6%, AUC = 0.995. The top 3 features: worst concave points, worst perimeter, mean concave points.
Key Takeaway: Despite having only 30 features, logistic regression achieves near-perfect performance because the decision boundary in the feature space is approximately linear. This dataset has become a benchmark for teaching classification.
Credit Card Default Prediction
Context: Credit card companies worldwide (Visa, Mastercard, American Express) use logistic regression as a baseline model for predicting whether a customer will default on their next payment.
Dataset: UCI Default of Credit Card Clients (Taiwan, 30,000 records): features include credit limit, gender, education, marital status, age, payment history (6 months), bill amounts, and previous payment amounts.
Class Imbalance: Only 22.1% default โ a naive model predicting "no default" always achieves 77.9% accuracy but has zero recall. Solution: Use class_weight='balanced' in Scikit-Learn's LogisticRegression.
Results: With balanced weights: Recall (default) = 68%, Precision = 38%, F1 = 0.49. While these numbers seem low, in credit risk this recall is valuable โ catching 68% of defaulters saves millions.
Industry practice: Banks like JPMorgan and ICICI use logistic regression as the first-stage filter; flagged accounts are then reviewed by more complex ensemble models (XGBoost, neural networks).
Startup Applications
1. HealthifyMe โ Diabetes Risk Scoring
India's leading health app uses logistic regression to score users' diabetes risk based on their dietary logs, activity levels, BMI, and age. Users scoring above 0.7 are recommended to consult an endocrinologist. The model is lightweight enough to run on-device for instant feedback.
2. Razorpay โ Transaction Fraud Detection
Razorpay processes โน7,000+ crore in transactions daily. A logistic regression model serves as the first layer of fraud detection โ it flags suspicious transactions in <10ms (latency requirement). Features: transaction amount deviation, time of day, merchant category, device fingerprint, velocity checks. Flagged transactions go to a heavier XGBoost model for deeper analysis.
3. Unacademy โ Student Churn Prediction
Unacademy predicts which premium subscribers will cancel next month using logistic regression on features like: days since last video watched, quiz completion rate, forum engagement, time spent per session, and course progress percentage. Students with churn probability > 0.6 receive personalized retention offers.
4. Practo โ Appointment No-Show Prediction
Practo uses logistic regression to predict whether a patient will miss their doctor's appointment. Features: lead time (days between booking and appointment), previous no-show history, time of day, day of week, whether reminder SMS was sent. Prediction drives overbooking strategy and SMS reminder timing.
Government Applications
1. Aadhaar โ Biometric Match Verification
UIDAI uses a classifier (logistic regression as baseline) to determine whether a biometric sample (fingerprint minutiae features) matches the stored template. The decision threshold is set extremely conservatively (0.95) to minimize false matches (FP) while accepting some false rejections (FN), which can be retried.
2. NITI Aayog โ District Health Index
NITI Aayog's health monitoring dashboard uses logistic regression to classify districts as "needs intervention" (1) or "on track" (0) based on 14 health indicators: immunization coverage, maternal mortality proxy, infant mortality, sanitation access, primary health centre density, etc.
3. Indian Railways โ Ticketless Travel Detection
IRCTC uses classification models to flag anomalous booking patterns that suggest potential ticketless travel or ticket fraud. Features: booking time patterns, route frequency, payment method, cancellation history.
4. GST Portal โ Fraudulent Return Detection
GSTN uses logistic regression to flag potentially fraudulent GST returns for audit. Features include: input tax credit ratio, turnover consistency with industry average, filing delay patterns, and supplier network analysis features.
Industry Applications
1. Google โ Spam Detection (Gmail)
Gmail's original spam filter was a logistic regression model. Even today, logistic regression remains part of the ensemble โ features include word frequencies (TF-IDF), sender reputation, link analysis, header anomalies. It processes billions of emails daily with sub-millisecond latency.
2. Netflix โ Thumbnail Click Prediction
Netflix uses logistic regression to predict the probability a user will click on a particular movie/show thumbnail. Features: user viewing history embedding, thumbnail visual features, time of day, device type. The thumbnail with the highest click probability is shown โ this alone drives a 20% increase in engagement.
3. Tesla โ Component Failure Prediction
Tesla uses binary classification to predict whether a component (battery cell, motor bearing) will fail within the next N cycles. Sensor data is aggregated into features, and logistic regression provides a baseline interpretable model alongside more complex neural network models.
4. Amazon โ Product Return Prediction
Amazon predicts whether a customer will return a product. Features: product category, customer return history, review sentiment score, size discrepancy for clothing, delivery damage reports. This drives inventory and refund policy decisions.
5. Flipkart โ Delivery Success Prediction
Flipkart predicts whether a delivery will be successful on first attempt. Features: pin code, previous delivery success rate at address, order time, COD vs prepaid, customer availability history. Failed prediction triggers rescheduling proactively.
Mini Projects
๐ฌ Mini Project 1: Indian Diabetes Predictor
Objective: Build an end-to-end diabetes prediction system using logistic regression.
Steps:
- Load the PIMA Indians Diabetes Dataset (or generate an Indian-population variant)
- Perform EDA: distribution of features by diabetes status, correlation heatmap
- Handle missing values (zeros in glucose/BMI/BP are likely missing values)
- Feature engineering: BMI categories (Indian BMI scale differs โ overweight starts at 23, not 25)
- Train logistic regression from scratch AND with Scikit-Learn
- Compare: regularization strengths, threshold tuning for maximum Recall
- Build ROC curve and Precision-Recall curve
- Report: "At threshold=0.35, we achieve 92% Recall with 71% Precision"
Deliverables: Jupyter notebook with visualizations, model comparison table, deployed Streamlit app showing risk score with feature importance explanation.
import pandas as pd import numpy as np from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.metrics import classification_report, roc_auc_score # Load PIMA dataset (available from Kaggle) cols = ['pregnancies', 'glucose', 'bp', 'skin_thickness', 'insulin', 'bmi', 'dpf', 'age', 'outcome'] # df = pd.read_csv('diabetes.csv', names=cols, header=0) # For demo: generate synthetic PIMA-like data np.random.seed(42) n = 768 df = pd.DataFrame({ 'glucose': np.random.normal(120, 32, n).clip(0, 200), 'bmi': np.random.normal(32, 8, n).clip(15, 55), 'age': np.random.randint(21, 81, n), 'bp': np.random.normal(72, 12, n).clip(40, 120), 'insulin': np.random.exponential(80, n).clip(0, 800), 'dpf': np.random.exponential(0.5, n).clip(0.05, 2.5), }) # Simulate diabetes outcome risk = ( 0.3 * (df['glucose'] - 100) / 100 + 0.2 * (df['bmi'] - 25) / 30 + 0.15 * (df['age'] - 30) / 50 + 0.1 * df['dpf'] + np.random.normal(0, 0.15, n) ) df['outcome'] = (risk > 0.25).astype(int) # Split and train X = df.drop('outcome', axis=1) y = df['outcome'] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y) scaler = StandardScaler() X_train_s = scaler.fit_transform(X_train) X_test_s = scaler.transform(X_test) model = LogisticRegression(C=0.1, max_iter=1000) model.fit(X_train_s, y_train) y_pred = model.predict(X_test_s) y_prob = model.predict_proba(X_test_s)[:, 1] print(classification_report(y_test, y_pred)) print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}") # Feature importance for feat, coef in zip(X.columns, model.coef_[0]): print(f" {feat:>15}: {coef:+.4f}")
๐ฆ Mini Project 2: Loan Approval System
Objective: Build an automated loan approval system inspired by Indian banking.
Steps:
- Create synthetic dataset with Indian-specific features (CIBIL, income in โน, rural/urban)
- Handle class imbalance using SMOTE or class_weight='balanced'
- Build pipeline: StandardScaler โ PolynomialFeatures(degree=2) โ LogisticRegression
- Use GridSearchCV to optimize C, penalty, and polynomial degree
- Analyze which features drive approval/rejection (coefficient analysis)
- Build Streamlit dashboard with sliders for each feature showing live probability
- Add fairness analysis: check if approval rates differ by gender or region (bias detection)
Deliverables: Complete pipeline code, fairness report, and interactive web app.
๐ง Mini Project 3: Email Spam Classifier
Objective: Build a spam detector using TF-IDF features and logistic regression.
Steps:
- Use the SMS Spam Collection dataset (5,574 messages)
- Text preprocessing: lowercasing, removing punctuation, stopword removal, stemming
- Convert text to features using TfidfVectorizer (max_features=5000)
- Train logistic regression with L1 penalty (Lasso) for automatic feature selection
- Analyze top spam-indicative words (highest positive coefficients)
- Build confusion matrix โ here Precision matters (don't mark real emails as spam!)
End-of-Chapter Exercises
class_weight='balanced' manually. Show the formula for computing sample weights from class frequencies, then modify the gradient computation accordingly.Multiple Choice Questions
Q1. What is the range of the sigmoid function ฯ(z)?
Show Answer
Q2. The derivative of the sigmoid function ฯ'(z) equals:
Show Answer
Q3. In logistic regression, the cost function used is:
Show Answer
Q4. In a confusion matrix, a False Negative means:
Show Answer
Q5. For a cancer detection system, which metric should be maximized?
Show Answer
Q6. L1 regularization in logistic regression tends to:
Show Answer
Q7. How many classifiers does One-vs-Rest (OvR) train for K classes?
Show Answer
Q8. If ฯ(z) = 0.73, what is ฯ(โz)?
Show Answer
Q9. The logit function is:
Show Answer
Q10. Which of the following is TRUE about logistic regression?
Show Answer
Interview Questions
IQ 7.1: Why is MSE not used as the loss function for logistic regression?
Show Answer
IQ 7.2: How does logistic regression handle multi-class classification?
Show Answer
multi_class='multinomial'.IQ 7.3: Explain the trade-off between Precision and Recall. Give a real-world example.
Show Answer
IQ 7.4: What is the effect of feature scaling on logistic regression?
Show Answer
IQ 7.5: You have 99% negative and 1% positive samples. How do you handle this?
Show Answer
IQ 7.6: What does the coefficient wโฑผ in logistic regression represent?
Show Answer
IQ 7.7: What is the relationship between logistic regression and neural networks?
Show Answer
IQ 7.8: Can logistic regression model non-linear decision boundaries? How?
Show Answer
IQ 7.9: What is the difference between C parameter in Scikit-Learn and ฮป (lambda)?
Show Answer
IQ 7.10: Explain ROC curve and AUC. What does AUC = 0.5 mean? AUC = 1.0?
Show Answer
Research Problems
๐ฌ Research Problem 1: Fairness-Aware Logistic Regression for Indian Loan Approvals
Background: Machine learning models can perpetuate or amplify existing societal biases. In India, loan approval models might discriminate based on gender, caste (proxied by surname or pin code), or religion.
Problem: Develop a constrained logistic regression model that achieves equalized approval rates across protected groups while maintaining predictive accuracy. Formalize the fairness constraint as: |P(ลท=1 | group=A) โ P(ลท=1 | group=B)| โค ฮต, where ฮต is a tolerance parameter. Investigate the accuracy-fairness trade-off curve.
Reading: Zafar et al. (2017), "Fairness Constraints: Mechanisms for Fair Classification"; Hardt et al. (2016), "Equality of Opportunity in Supervised Learning".
๐ฌ Research Problem 2: Calibrated Probabilities for Clinical Decision Support
Background: Logistic regression outputs probabilities, but are they well-calibrated? A model is calibrated if, among all patients predicted to have 30% risk, exactly 30% actually have the disease.
Problem: Using Indian hospital data (AIIMS/PGIMER-style datasets), evaluate the calibration of logistic regression for diabetes prediction using reliability diagrams, Brier score, and Expected Calibration Error (ECE). Compare with Platt scaling and isotonic regression post-hoc calibration methods. Study how calibration degrades under distribution shift (e.g., model trained on urban patients, tested on rural patients).
๐ฌ Research Problem 3: Interpretable Feature Interactions via Pairwise Logistic Regression
Background: Standard logistic regression assumes features contribute independently to the log-odds. In reality, feature interactions (e.g., age ร BMI for diabetes) can be critical.
Problem: Develop a method to automatically discover the K most important pairwise feature interactions for logistic regression without exhaustively trying all O(nยฒ) pairs. Investigate: (1) LASSO on all pairwise interactions, (2) forward stepwise interaction selection based on likelihood ratio tests, (3) gradient-based interaction screening. Compare computational complexity and predictive performance on Indian health datasets.
๐ฌ Research Problem 4: Online Logistic Regression for Streaming Data
Background: In applications like UPI fraud detection (processing 100M+ transactions daily), models must update in real-time as new data arrives.
Problem: Implement and analyze online (streaming) logistic regression using Stochastic Gradient Descent with adaptive learning rates (AdaGrad, Adam). Study convergence guarantees, concept drift detection, and compare with periodic batch retraining. Analyze on simulated UPI transaction streams with evolving fraud patterns.
Key Takeaways
References
Foundational Texts
- Cox, D.R. (1958). "The Regression Analysis of Binary Sequences." Journal of the Royal Statistical Society: Series B, 20(2), 215โ242.
- Berkson, J. (1944). "Application of the Logistic Function to Bio-Assay." Journal of the American Statistical Association, 39(227), 357โ365.
- Hosmer, D.W., Lemeshow, S., & Sturdivant, R.X. (2013). Applied Logistic Regression, 3rd ed. Wiley.
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 4.3.
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 10.
Modern Machine Learning References
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Section 6.2.2 (Sigmoid Units).
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapter 4.
- Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." JMLR, 12, 2825โ2830.
- Ng, A. (2012). "Machine Learning" (Stanford CS229 Lecture Notes), Lecture 3: Logistic Regression.
Indian Case Study References
- ICMR-INDIAB Study (2023). "Diabetes Prevalence in India: Results from ICMR-INDIAB Study." The Lancet Diabetes & Endocrinology.
- Reserve Bank of India (2021). "Report on Trend and Progress of Banking in India 2020-21." RBI Publications.
- CBSE (2023). "Annual Report: Board Examination Statistics 2022โ23." Central Board of Secondary Education.
- UIDAI (2022). "Aadhaar Authentication Performance Report FY 2021-22." UIDAI Official Report.
Fairness and Ethics
- Hardt, M., Price, E., & Srebro, N. (2016). "Equality of Opportunity in Supervised Learning." NeurIPS.
- Zafar, M.B. et al. (2017). "Fairness Constraints: Mechanisms for Fair Classification." AISTATS.
Online Resources
- UCI Machine Learning Repository โ Breast Cancer Wisconsin (Diagnostic) Dataset.
- UCI Machine Learning Repository โ Default of Credit Card Clients Dataset.
- Kaggle โ PIMA Indians Diabetes Database.
- Scikit-Learn Documentation โ LogisticRegression API Reference.
- TensorFlow Documentation โ Binary Classification Tutorial.