📊 Part III — Classification

Chapter 8
Multi-Class Classification
& Evaluation Metrics

Master the art of classifying data into multiple categories with softmax, cross-entropy, confusion matrices, ROC-AUC, and handling real-world class imbalance — with end-to-end implementations and Indian case studies.

⏱ 3.5 hours reading 📖 Prerequisites: Chapter 7 🎯 15 topics 💻 3 implementations 🧪 2 mini projects

SECTION 8.0

Learning Objectives

After completing this chapter, you will be able to:

Extend binary classification to K-class problems using One-vs-Rest and One-vs-One strategies
Derive the Softmax function from a log-linear model and understand its probabilistic interpretation
Formulate and minimize Categorical Cross-Entropy loss for multi-class settings
Construct and interpret K×K confusion matrices with per-class and aggregate metrics
Compute Precision, Recall, and F1-Score per class, and understand Macro, Micro, and Weighted averaging
Plot and interpret ROC curves, AUC scores, and Precision-Recall curves for multi-class problems
Apply Cohen's Kappa and Matthews Correlation Coefficient for robust evaluation
Handle class imbalance using SMOTE, class weights, oversampling, undersampling, and cost-sensitive learning
Tune decision thresholds for business-specific objectives
Implement complete pipelines in Python, TensorFlow, and Scikit-Learn for multi-class classification

SECTION 8.1

Introduction

In Chapter 7, we mastered binary classification — the task of separating data into exactly two classes. But the real world rarely presents itself in binary terms. A radiologist must distinguish between ten types of lung pathologies. An agricultural scientist classifies satellite imagery into fifteen different crop types. An autonomous vehicle's perception system categorizes every pixel into dozens of semantic classes: road, pedestrian, vehicle, traffic sign, building, sky.

Multi-class classification generalizes binary classification from 2 classes to K classes, where K ≥ 3. This seemingly small conceptual leap introduces profound mathematical, computational, and evaluative challenges:

Output representation: We need K output probabilities that sum to 1 — enter the Softmax function
Loss computation: Binary cross-entropy becomes Categorical Cross-Entropy over K classes
Evaluation complexity: A single accuracy number hides critical performance variations across classes; we need per-class metrics and thoughtful averaging strategies
Class imbalance: With K classes, imbalance is almost guaranteed — rare diseases, uncommon crop types, infrequent fraud patterns

🎓 Professor's Insight

Multi-class classification is arguably the most practically important supervised learning task. When students ask "where do ML models get used in industry?" — the answer is almost always a multi-class classifier: spam detection (spam/ham/promotional/social), content moderation (safe/violence/hate/nudity), medical diagnosis (healthy/disease A/disease B/...), and so on.

This chapter is structured as a progression: we first establish the mathematical machinery (Softmax, Cross-Entropy), then build a comprehensive evaluation toolkit (confusion matrices through ROC-AUC), tackle the critical problem of class imbalance, and finally implement everything with Python, TensorFlow, and Scikit-Learn — all grounded in real Indian and global case studies.

🇮🇳 India Spotlight

India's National Health Authority processes over 500 million Ayushman Bharat claims yearly, each coded with ICD-10 disease classifications spanning thousands of categories. Automated multi-class classification of medical claims saved ₹2,400 crore in fraudulent claim detection in 2024-25, making this chapter's content directly applicable to systems that touch millions of Indian lives daily.

SECTION 8.2

Historical Background

The history of multi-class classification intertwines with the development of statistical pattern recognition, neural networks, and information theory.

1936 — Fisher's Discriminant Analysis

R.A. Fisher's seminal work on the Iris dataset (3 species, 4 features) introduced linear discriminant analysis — the first systematic approach to multi-class classification. Interestingly, Fisher's dataset included Iris setosa, Iris virginica, and Iris versicolor — a 3-class problem that remains a standard benchmark 90 years later.

1957 — Rosenblatt's Perceptron

Frank Rosenblatt's perceptron was binary, but researchers quickly extended it to multi-class via "one-vs-rest" committees of perceptrons, laying groundwork for multi-output neural networks.

1959 — Softmax's Theoretical Roots

The Softmax function originates from Luce's Choice Axiom (1959) in mathematical psychology, which modeled how humans make choices among multiple alternatives. Physicist Ludwig Boltzmann had used the same mathematical form (the Boltzmann distribution) in statistical mechanics decades earlier.

1989 — Bridle's Softmax

John Bridle formally introduced the term "softmax" in the context of neural networks at a 1989 conference, establishing it as the standard output activation for multi-class neural networks.

1998 — Multi-class SVMs

Vapnik's SVM was inherently binary. Dietterich and Bakiri (1995) introduced Error-Correcting Output Codes, and later Crammer and Singer (2001) developed a direct multi-class SVM formulation.

2009 — ImageNet & Modern Multi-Class

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with 1,000 classes became the definitive benchmark for multi-class classification, driving AlexNet (2012), VGGNet (2014), GoogLeNet (2014), and ResNet (2015).

2012 — Modern Metrics Era

With increasing awareness of fairness and bias in ML systems, evaluation metrics evolved beyond simple accuracy. Cohen's Kappa, Matthews Correlation Coefficient, and class-specific analyses became essential tools in the ML practitioner's arsenal.

📝 Exam Tip

Questions about the "history of softmax" appear in GATE ML and IIT entrance exams. Remember: Boltzmann distribution → Luce's choice axiom (1959) → Bridle's "softmax" (1989). The mathematical form e^{z_i}/Σe^{z_j} predates neural networks by a century.

SECTION 8.3

Conceptual Explanation

8.3.1 From Binary to Multi-Class

In binary classification, a model outputs a single probability $p(y=1|x)$ and infers $p(y=0|x) = 1 - p(y=1|x)$. For K classes, we need K probabilities that collectively describe a probability distribution over all classes.

8.3.2 The Softmax Function — An Intuitive View

Think of Softmax as a "soft" version of the argmax function. The argmax function picks the single largest value and ignores everything else. Softmax instead converts a vector of raw scores (logits) into a probability distribution where:

Every output is between 0 and 1
All outputs sum to exactly 1
Larger logits get exponentially more probability mass
The relative ordering is preserved

🎓 Professor's Insight

Temperature analogy: Imagine K cups of water at different temperatures. Softmax is like asking "what fraction of the total heat does each cup contain?" The hottest cup dominates, but even the coolest cup has some fraction. As you increase the "temperature parameter" τ, the distribution flattens toward uniform; as τ → 0, it sharpens toward a one-hot argmax.
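The effect of the temperature parameter can be checked numerically. The sketch below assumes the common convention of dividing the logits by τ before applying softmax; the function name `softmax_with_temperature` and the chosen τ values are illustrative, not part of the chapter's code.

```python
import numpy as np

def softmax_with_temperature(z, tau=1.0):
    """Softmax of z / tau, where tau > 0 is the (assumed) temperature parameter."""
    scaled = np.asarray(z, dtype=float) / tau
    scaled = scaled - np.max(scaled)       # stability shift; does not change the result
    exp_scaled = np.exp(scaled)
    return exp_scaled / exp_scaled.sum()

logits = [2.0, 1.0, 0.1]
for tau in (0.1, 1.0, 10.0):
    print(f"tau={tau:>4}: {np.round(softmax_with_temperature(logits, tau), 3)}")
```

A small τ should push almost all probability onto the largest logit, while a large τ should give three nearly equal probabilities.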

8.3.3 One-Hot Encoding for Targets

Multi-class targets are encoded as one-hot vectors — vectors of length K with a 1 at the position of the true class and 0 everywhere else.

Example: 3-Class Problem (Cat, Dog, Bird)

If the true class is "Dog" (class index 1):

y = [0, 1, 0]

The model's predicted probability vector might be:

ŷ = [0.15, 0.72, 0.13]

The cross-entropy loss only "cares about" the probability assigned to the true class: -log(0.72) ≈ 0.329.
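The Cat/Dog/Bird example can be reproduced in a few lines. This is a minimal sketch using NumPy; the variable names are illustrative. It shows that the full sum over K classes and the single true-class term give the same loss.

```python
import numpy as np

class_names = ["Cat", "Dog", "Bird"]
true_index = class_names.index("Dog")        # class index 1

y = np.eye(len(class_names))[true_index]     # one-hot row for "Dog"
y_hat = np.array([0.15, 0.72, 0.13])         # predicted probabilities from the example

loss_full = -np.sum(y * np.log(y_hat))       # full sum over all K classes
loss_short = -np.log(y_hat[true_index])      # only the true-class term survives
print(y, round(loss_full, 4), round(loss_short, 4))
```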

8.3.4 One-vs-Rest (OvR) Strategy

OvR (also called One-vs-All or OvA) trains K separate binary classifiers, each distinguishing one class from all others. For K=5 classes, you train 5 binary classifiers:

Classifier 1: Class A vs. (B, C, D, E)
Classifier 2: Class B vs. (A, C, D, E)
... and so on

At prediction time, run all K classifiers and pick the one with the highest confidence. Pros: Simple, O(K) classifiers, interpretable. Cons: Class imbalance is amplified (1 vs K-1), ambiguous regions where multiple classifiers say "yes" or none do.

8.3.5 One-vs-One (OvO) Strategy

OvO trains K(K-1)/2 binary classifiers, one for each pair of classes. At prediction, use voting — each classifier "votes" for one of its two classes. The class with the most votes wins.

Pros: Each classifier trains on a balanced pair, smaller training sets per classifier. Cons: O(K²) classifiers, slow prediction for large K, voting can lead to ties.

Aspect	One-vs-Rest (OvR)	One-vs-One (OvO)
Number of classifiers	K	K(K-1)/2
Training data per classifier	All N samples	~2N/K samples
Class imbalance risk	High (1 vs K-1)	Low (balanced pairs)
Scalability	Good for large K	Poor for large K
Used in	Logistic Regression, SVM	SVM (libsvm default)
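To make the classifier counts in the table concrete, the sketch below enumerates the binary sub-problems each strategy would create for five classes. The helper names `ovr_tasks` and `ovo_tasks` are hypothetical; no model is trained here.

```python
from itertools import combinations

def ovr_tasks(classes):
    """One binary task per class: (positive class, list of the remaining classes)."""
    return [(c, [o for o in classes if o != c]) for c in classes]

def ovo_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))

classes = ["A", "B", "C", "D", "E"]
print(len(ovr_tasks(classes)), "OvR classifiers")   # grows as K
print(len(ovo_tasks(classes)), "OvO classifiers")   # grows as K(K-1)/2
```

For K = 5 the difference is small, but the quadratic growth of OvO becomes the dominant cost as K increases.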

8.3.6 Confusion Matrix for K Classes

A K×K confusion matrix generalizes the 2×2 binary confusion matrix. Entry C[i][j] indicates how many samples of true class i were predicted as class j. The diagonal contains correct predictions; off-diagonal entries are misclassifications.

8.3.7 The Evaluation Metrics Landscape

A single accuracy number can be dangerously misleading in multi-class settings. If 90% of medical images are "healthy" and 10% spread across 9 diseases, a model that always predicts "healthy" achieves 90% accuracy while being completely useless for diagnosis. This chapter equips you with a comprehensive toolkit:

Per-class metrics: Precision, Recall, F1 for each class independently
Averaging strategies: Macro (equal weight per class), Micro (equal weight per sample), Weighted
Ranking metrics: ROC-AUC, Precision-Recall AUC — threshold-independent evaluation
Agreement metrics: Cohen's Kappa, Matthews Correlation Coefficient — chance-corrected evaluation
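The "always predict healthy" scenario is easy to simulate. The sketch below uses a hypothetical test set of 810 healthy samples and 10 samples for each of 9 diseases (so 90% are healthy) and compares accuracy with macro-averaged recall.

```python
import numpy as np

# Hypothetical labels: class 0 = healthy, classes 1..9 = diseases
y_true = np.concatenate([np.zeros(810, dtype=int), np.repeat(np.arange(1, 10), 10)])
y_pred = np.zeros_like(y_true)               # degenerate model: always predicts "healthy"

accuracy = np.mean(y_pred == y_true)
recalls = [np.mean(y_pred[y_true == k] == k) for k in range(10)]   # recall per class
macro_recall = np.mean(recalls)
print(f"accuracy={accuracy:.2f}, macro recall={macro_recall:.2f}")
```

Accuracy looks strong, while macro recall exposes that nine of the ten classes are never detected.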

🏭 Industry Alert

In production ML systems, the choice of evaluation metric is a business decision, not a technical one. At Google, spam detection optimizes for high precision (don't send important emails to spam). At a hospital, cancer screening optimizes for high recall (don't miss any cancer cases). Understanding this distinction is what separates junior from senior ML engineers.

SECTION 8.4

Mathematical Foundation

8.4.1 The Softmax Function

Given a vector of logits $z = [z_1, z_2, ..., z_K]$, the Softmax function maps each logit to a probability:

σ(z)ᵢ = e^{zᵢ} / Σⱼ₌₁ᴷ e^{zⱼ} for i = 1, 2, ..., K

Properties of Softmax

Range: 0 < σ(z)ᵢ < 1 for all i (strict inequalities — never exactly 0 or 1)
Normalization: Σᵢ σ(z)ᵢ = 1 (forms a valid probability distribution)
Monotonicity: If zᵢ > zⱼ then σ(z)ᵢ > σ(z)ⱼ (preserves ordering)
Translation invariance: σ(z + c) = σ(z) for any constant c (used for numerical stability)
Reduces to sigmoid: For K=2, softmax reduces to the logistic sigmoid function
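Each of these properties can be verified numerically. The following is a minimal check on one example vector, not a proof; the logit values and the shift constant 100 are arbitrary choices.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)

in_open_interval = bool(np.all((p > 0) & (p < 1)))                    # range
sums_to_one = bool(np.isclose(p.sum(), 1.0))                          # normalization
order_preserved = bool(np.array_equal(np.argsort(z), np.argsort(p)))  # monotonicity
shift_invariant = bool(np.allclose(softmax(z + 100.0), p))            # translation invariance

# K = 2: softmax([z0, z1])[1] should equal sigmoid(z1 - z0)
z0, z1 = 0.3, 1.7
sigmoid = 1.0 / (1.0 + np.exp(-(z1 - z0)))
matches_sigmoid = bool(np.isclose(softmax([z0, z1])[1], sigmoid))

print(in_open_interval, sums_to_one, order_preserved, shift_invariant, matches_sigmoid)
```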

8.4.2 Categorical Cross-Entropy Loss

For a single sample with one-hot target $y = [y_1, ..., y_K]$ and predicted probabilities $\hat{y} = [\hat{y}_1, ..., \hat{y}_K]$:

$L = -\sum_{k=1}^{K} y_k \cdot \log(\hat{y}_k)$

Since y is one-hot (only one yₖ = 1, rest are 0), this simplifies to:

L = -log(ŷ_c) where c is the true class

The loss is small when ŷ_c is close to 1 (confident, correct prediction) and large when ŷ_c is close to 0 (model assigns low probability to the true class).

8.4.3 Multi-Class Confusion Matrix

For K classes, the confusion matrix C is a K×K matrix where:

C[i][j] = number of samples with true label i predicted as class j

For each class k, we can extract from the confusion matrix:

$TP_k = C[k][k]$ (diagonal element)
$FP_k = \sum_{i \neq k} C[i][k]$ (column sum minus diagonal)
$FN_k = \sum_{j \neq k} C[k][j]$ (row sum minus diagonal)
$TN_k = \sum_{i \neq k} \sum_{j \neq k} C[i][j]$ (everything else)
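These four counts can be read off a confusion matrix with a few vectorized operations. The sketch below uses the 3-class matrix from this chapter's worked Example 3 and assumes the convention stated above (rows = true class, columns = predicted class).

```python
import numpy as np

C = np.array([[8, 1, 1],
              [2, 6, 0],
              [1, 1, 7]])      # rows = true class, columns = predicted class

tp = np.diag(C)                # diagonal elements
fp = C.sum(axis=0) - tp        # column sum minus diagonal
fn = C.sum(axis=1) - tp        # row sum minus diagonal
tn = C.sum() - tp - fp - fn    # everything else
print(tp, fp, fn, tn)
```

For every class the four counts add up to the total number of samples, which is a quick sanity check on any implementation.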

8.4.4 Per-Class Metrics

$\text{Precision}_k = \frac{TP_k}{TP_k + FP_k}$
$\text{Recall}_k = \frac{TP_k}{TP_k + FN_k}$
$F1_k = \frac{2 \cdot \text{Precision}_k \cdot \text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}$

8.4.5 Averaging Strategies

Macro Average

$\text{Macro-Precision} = \frac{1}{K} \sum_{k=1}^{K} \text{Precision}_k$
$\text{Macro-Recall} = \frac{1}{K} \sum_{k=1}^{K} \text{Recall}_k$
$\text{Macro-F1} = \frac{1}{K} \sum_{k=1}^{K} F1_k$

Treats all classes equally regardless of support (number of samples). Good when all classes matter equally.

Micro Average

$\text{Micro-Precision} = \frac{\sum_k TP_k}{\sum_k (TP_k + FP_k)}$
$\text{Micro-Recall} = \frac{\sum_k TP_k}{\sum_k (TP_k + FN_k)}$
$\text{Micro-F1} = \frac{2 \cdot \text{Micro-P} \cdot \text{Micro-R}}{\text{Micro-P} + \text{Micro-R}}$

Note: In multi-class (not multi-label) settings, Micro-Precision = Micro-Recall = Micro-F1 = Accuracy, because Σ TPₖ = total correct predictions, and Σ(TPₖ + FPₖ) = Σ(TPₖ + FNₖ) = N.
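The identity in the note can be confirmed on arbitrary data. The sketch below draws random true and predicted labels for a hypothetical 4-class problem (the seed, K, and N are arbitrary) and checks that the summed false positives equal the summed false negatives, so Micro-F1 coincides with accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 4, 500
y_true = rng.integers(0, K, size=N)
y_pred = rng.integers(0, K, size=N)

tp = np.array([np.sum((y_pred == k) & (y_true == k)) for k in range(K)])
fp = np.array([np.sum((y_pred == k) & (y_true != k)) for k in range(K)])
fn = np.array([np.sum((y_pred != k) & (y_true == k)) for k in range(K)])

micro_p = tp.sum() / (tp.sum() + fp.sum())
micro_r = tp.sum() / (tp.sum() + fn.sum())
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
accuracy = np.mean(y_true == y_pred)
print(fp.sum(), fn.sum(), round(micro_f1, 4), round(accuracy, 4))
```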

Weighted Average

$\text{Weighted-F1} = \sum_{k=1}^{K} \frac{n_k}{N} \cdot F1_k$

Where nₖ is the number of true samples in class k, and N = Σ nₖ. Accounts for class imbalance by weighting each class by its prevalence.

8.4.6 ROC Curve & AUC

For each class k in a multi-class problem (using OvR binarization):

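$TPR(t) = \frac{TP_k(t)}{TP_k(t) + FN_k(t)}$ [Sensitivity]
$FPR(t) = \frac{FP_k(t)}{FP_k(t) + TN_k(t)}$ [1 - Specificity]

As threshold t varies from 1 to 0, we trace a curve in (FPR, TPR) space from (0, 0) to (1, 1). AUC is the area under this curve. It ranges from 0 to 1: 0.5 corresponds to random ranking, 1.0 to perfect ranking, and values below 0.5 to a ranking that is worse than random.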
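One way to compute a per-class OvR AUC without sweeping thresholds is to use the equivalent pairwise definition: AUC is the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below assumes this formulation; the helper names and the toy probabilities are illustrative.

```python
import numpy as np

def auc_pairwise(y_binary, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = scores[y_binary == 1]
    neg = scores[y_binary == 0]
    diff = pos[:, None] - neg[None, :]          # all positive-vs-negative score differences
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

def ovr_auc(y_true, probs):
    """One AUC per class: class k is 'positive', all other classes are 'negative'."""
    return np.array([auc_pairwise((y_true == k).astype(int), probs[:, k])
                     for k in range(probs.shape[1])])

y_true = np.array([0, 0, 1, 1, 2, 2])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.3, 0.2],
                  [0.2, 0.6, 0.2],
                  [0.4, 0.3, 0.3],
                  [0.1, 0.2, 0.7],
                  [0.3, 0.3, 0.4]])
per_class = ovr_auc(y_true, probs)
print(per_class, per_class.mean())              # per-class AUCs and their macro average
```

The pairwise form needs O(n_pos × n_neg) memory, so it suits small examples; for large datasets a sorted, threshold-based computation is the more practical choice.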

8.4.7 Cohen's Kappa

κ = (p₀ - pₑ) / (1 - pₑ)

Where p₀ = observed agreement (accuracy), and pₑ = expected agreement by chance. Values: ≤0 = no agreement, 0.01-0.20 = slight, 0.21-0.40 = fair, 0.41-0.60 = moderate, 0.61-0.80 = substantial, 0.81-1.00 = almost perfect.

8.4.8 Matthews Correlation Coefficient (Multi-class)

$\text{MCC} = \frac{c \cdot s - \sum_k p_k \cdot t_k}{\sqrt{s^2 - \sum_k p_k^2} \cdot \sqrt{s^2 - \sum_k t_k^2}}$

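Where s = Σᵢⱼ C[i][j] (total samples), c = Σₖ C[k][k] (trace, i.e. total correct), pₖ = Σᵢ C[i][k] (column sum for k), tₖ = Σⱼ C[k][j] (row sum for k). MCC equals +1 for perfect prediction and 0 for predictions no better than random. In the binary case the minimum is -1 (total disagreement); with more than two classes the attainable minimum depends on the class distribution and lies between -1 and 0.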

📝 Exam Tip

A commonly tested fact: For multi-class single-label classification, Micro-Precision = Micro-Recall = Micro-F1 = Overall Accuracy. This equality breaks in multi-label settings where each sample can have multiple labels.

SECTION 8.5

Formula Derivations

8.5.1 Deriving Softmax from a Log-Linear Model

Step 1: The Log-Linear Model

We want to model the probability of class k given input x. Start with a log-linear (exponential family) model:

log P(y = k | x) = wₖᵀx + bₖ + const

This means:

$P(y = k \mid x) \propto \exp(w_k^\top x + b_k)$

Step 2: Define Logits

Let zₖ = wₖᵀx + bₖ be the "logit" (unnormalized log-probability) for class k. Then:

$P(y = k \mid x) \propto \exp(z_k)$

Step 3: Normalization

Since probabilities must sum to 1:

Σₖ₌₁ᴷ P(y = k | x) = 1

We introduce the normalizing constant (partition function) Z = Σⱼ₌₁ᴷ exp(zⱼ):

P(y = k | x) = exp(zₖ) / Z = exp(zₖ) / Σⱼ₌₁ᴷ exp(zⱼ)

This is the Softmax function. It arises naturally as the normalized form of exponential class scores.

Step 4: Verify it reduces to Sigmoid for K = 2

For K = 2 with classes 0 and 1:

P(y=1|x) = exp(z₁) / (exp(z₀) + exp(z₁)) = 1 / (exp(z₀ - z₁) + 1) = 1 / (1 + exp(-(z₁ - z₀))) = σ(z₁ - z₀)

This is exactly the logistic sigmoid applied to the log-odds z₁ - z₀. ✓

8.5.2 Deriving Categorical Cross-Entropy from Maximum Likelihood

Step 1: Likelihood for a Single Sample

Given true one-hot label y = [y₁, ..., y_K] and predictions ŷ = [ŷ₁, ..., ŷ_K], the likelihood of observing y is:

P(y | ŷ) = Πₖ₌₁ᴷ ŷₖ^{yₖ}

This uses the fact that y is one-hot: only the term for the true class survives (yₖ = 1), while others contribute ŷₖ⁰ = 1.

Step 2: Log-Likelihood

$\log P(y \mid \hat{y}) = \sum_{k=1}^{K} y_k \cdot \log(\hat{y}_k)$

Step 3: Negative Log-Likelihood as Loss

To minimize loss (maximize likelihood), we negate:

$L = -\log P(y \mid \hat{y}) = -\sum_{k=1}^{K} y_k \cdot \log(\hat{y}_k)$

This is the Categorical Cross-Entropy. For N training samples:

$L_{total} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \cdot \log(\hat{y}_{ik})$

8.5.3 Gradient of Cross-Entropy w.r.t. Logits

The combined gradient of cross-entropy loss with softmax activation has a beautifully simple form:

$\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k$

Derivation:

Using chain rule: ∂L/∂zₖ = Σⱼ (∂L/∂ŷⱼ) · (∂ŷⱼ/∂zₖ)

We know ∂L/∂ŷⱼ = -yⱼ/ŷⱼ

The Jacobian of softmax: ∂ŷⱼ/∂zₖ = ŷⱼ(δⱼₖ - ŷₖ) where δⱼₖ is the Kronecker delta

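$\frac{\partial L}{\partial z_k} = \sum_j \left(-\frac{y_j}{\hat{y}_j}\right) \cdot \hat{y}_j (\delta_{jk} - \hat{y}_k)$
$= -\sum_j y_j (\delta_{jk} - \hat{y}_k)$
$= -y_k + \hat{y}_k \cdot \sum_j y_j$
$= -y_k + \hat{y}_k \cdot 1$ (since $\sum_j y_j = 1$ for one-hot targets)
$= \hat{y}_k - y_k$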

This elegant result means: the gradient is simply the difference between predicted and true probabilities. This is identical in form to the sigmoid + binary cross-entropy gradient, but now operates over K dimensions.
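A finite-difference check is a simple way to gain confidence in this result. The sketch below compares the analytic gradient ŷ - y with a central-difference estimate of the loss gradient; the logits, target, and step size `eps` are arbitrary illustrative choices.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    """Categorical cross-entropy as a function of the logits z."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])      # true class: index 1

analytic = softmax(z) - y          # claimed gradient

eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    step = np.zeros_like(z)
    step[k] = eps
    # central difference approximation of dL/dz_k
    numeric[k] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

print(np.round(analytic, 4), np.round(numeric, 4))
```

The components of the gradient sum to zero, which reflects the translation invariance of softmax: shifting all logits by the same constant leaves the loss unchanged.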

8.5.4 Deriving Macro, Micro, and Weighted F1

Setup: K-class confusion matrix

From the K×K confusion matrix, for each class k:

$TP_k = C[k][k]$, $FP_k = \sum_{i \neq k} C[i][k]$, $FN_k = \sum_{j \neq k} C[k][j]$
$P_k = \frac{TP_k}{TP_k + FP_k}$, $R_k = \frac{TP_k}{TP_k + FN_k}$
$F1_k = \frac{2 P_k R_k}{P_k + R_k} = \frac{2\,TP_k}{2\,TP_k + FP_k + FN_k}$

Macro-F1

Macro-F1 = (1/K) Σₖ₌₁ᴷ F1ₖ

Simple arithmetic mean. Gives equal importance to every class, regardless of how many samples belong to it.

Micro-F1

$\text{Micro-P} = \frac{\sum_k TP_k}{\sum_k TP_k + \sum_k FP_k}$
$\text{Micro-R} = \frac{\sum_k TP_k}{\sum_k TP_k + \sum_k FN_k}$

Key insight: In single-label multi-class, Σₖ FPₖ = Σₖ FNₖ (every misclassification is simultaneously a FP for one class and a FN for another). Therefore Micro-P = Micro-R = Micro-F1 = Accuracy.

Weighted-F1

$\text{Weighted-F1} = \sum_k \frac{n_k}{N} \cdot F1_k$, where $n_k = TP_k + FN_k$ is the support for class k

🎓 Professor's Insight

The identity Micro-F1 = Accuracy in single-label multi-class classification is one of the most frequently tested facts in ML interviews. However, this identity does not hold for multi-label classification, where a sample can belong to multiple classes simultaneously. Always clarify the problem setting when discussing micro averages.

SECTION 8.6

Worked Numerical Examples

Example 1: Softmax Computation

Problem

A neural network outputs logits z = [2.0, 1.0, 0.1] for a 3-class problem (Cat, Dog, Bird). Compute the softmax probabilities.

Solution

Step 1: Compute exponentials

e^{2.0} = 7.389, e^{1.0} = 2.718, e^{0.1} = 1.105

Step 2: Sum of exponentials

Z = 7.389 + 2.718 + 1.105 = 11.212

Step 3: Normalize

P(Cat) = 7.389 / 11.212 = 0.659
P(Dog) = 2.718 / 11.212 = 0.242
P(Bird) = 1.105 / 11.212 = 0.099

Verification: 0.659 + 0.242 + 0.099 = 1.000 ✓

Prediction: Class 0 (Cat) with 65.9% confidence.

Step 4: Numerically stable version

Subtract max logit (2.0) from all logits before computing exp to prevent overflow:

z' = [0.0, -1.0, -1.9]
e^{0.0} = 1.000, e^{-1.0} = 0.368, e^{-1.9} = 0.150
Z' = 1.518
P(Cat) = 1.000/1.518 = 0.659 ✓ (same result)
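The need for the max-subtraction trick shows up as soon as the logits are large. The sketch below shifts the example logits by 998 (an arbitrary choice) so that a naive implementation overflows, while the stable version returns the same probabilities as before.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)                        # overflows to inf for large logits
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - np.max(z))            # largest exponent is exp(0) = 1
    return e / e.sum()

big = np.array([1000.0, 999.0, 998.1])   # same gaps as [2.0, 1.0, 0.1]
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(big)           # inf / inf produces nan
stable = softmax_stable(big)
print(naive, np.round(stable, 3))
```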

Example 2: Cross-Entropy Loss Computation

Problem

True class = Dog (y = [0, 1, 0]). Predicted probabilities from Example 1: ŷ = [0.659, 0.242, 0.099]. Compute the categorical cross-entropy loss.

Solution

L = -(0·log(0.659) + 1·log(0.242) + 0·log(0.099))
L = -log(0.242)
L ≈ 1.419 (or 1.417 if the unrounded probability 0.2424 is used)

This is a relatively high loss because the model predicted Cat (0.659) but the true class was Dog (only 0.242 probability).

Compare: If the model had predicted ŷ = [0.05, 0.90, 0.05], the loss would be -log(0.90) = 0.105 — much lower.

Example 3: Multi-Class Confusion Matrix & Metrics

Problem

A 3-class classifier (Cat=0, Dog=1, Bird=2) is tested on 27 samples. The confusion matrix is:

              Predicted
              Cat   Dog   Bird
Actual Cat  [  8     1     1  ]   10 samples
Actual Dog  [  2     6     0  ]    8 samples
Actual Bird [  1     1     7  ]    9 samples

Compute per-class Precision, Recall, F1, then Macro-F1, Micro-F1, and Weighted-F1.

Solution

For Cat (k=0):

TP₀ = 8, FP₀ = 2+1 = 3, FN₀ = 1+1 = 2
P₀ = 8/(8+3) = 8/11 = 0.727
R₀ = 8/(8+2) = 8/10 = 0.800
F1₀ = 2(0.727)(0.800)/(0.727+0.800) = 1.164/1.527 = 0.762

For Dog (k=1):

TP₁ = 6, FP₁ = 1+1 = 2, FN₁ = 2+0 = 2
P₁ = 6/(6+2) = 6/8 = 0.750
R₁ = 6/(6+2) = 6/8 = 0.750
F1₁ = 2(0.750)(0.750)/(0.750+0.750) = 0.750

For Bird (k=2):

TP₂ = 7, FP₂ = 1+0 = 1, FN₂ = 1+1 = 2
P₂ = 7/(7+1) = 7/8 = 0.875
R₂ = 7/(7+2) = 7/9 = 0.778
F1₂ = 2(0.875)(0.778)/(0.875+0.778) = 1.362/1.653 = 0.824

Macro-F1:

Macro-F1 = (0.762 + 0.750 + 0.824) / 3 = 2.336/3 = 0.779

Micro-F1:

ΣTP = 8+6+7 = 21, Total = 27
Micro-F1 = 21/27 = 0.778 (= Accuracy)

Weighted-F1:

Weighted-F1 = (10/27)(0.762) + (8/27)(0.750) + (9/27)(0.824) = 0.282 + 0.222 + 0.275 = 0.779

Example 4: Cohen's Kappa Calculation

Using the same confusion matrix from Example 3:

p₀ = (8+6+7)/27 = 21/27 = 0.778

Expected agreement pₑ: for each class k, multiply row total by column total and divide by N²:

Row totals: 10, 8, 9
Col totals: 11, 8, 8
pₑ = (10×11 + 8×8 + 9×8) / 27² = (110 + 64 + 72) / 729 = 246 / 729 = 0.337

κ = (0.778 - 0.337) / (1 - 0.337) = 0.441/0.663 = 0.665

Interpretation: κ = 0.665 → "substantial agreement" — the model performs significantly better than chance.

SECTION 8.7

Visual Diagrams

Diagram 1: Softmax Transformation Pipeline

Input Features           Logits (Raw Scores)           Softmax Probabilities
┌─────────┐             ┌─────────────────┐           ┌──────────────────────┐
│ x₁=0.5  │             │ z₁ = 2.0        │           │ P₁ = 0.659 ██████▋   │
│ x₂=1.2  │──[Wx+b]──▶  │ z₂ = 1.0        │──exp──▶   │ P₂ = 0.242 ██▍       │
│ x₃=0.8  │             │ z₃ = 0.1        │   /Z      │ P₃ = 0.099 █         │
└─────────┘             └─────────────────┘           └──────────────────────┘
                                                        Σ = 1.000 ✓

Key: The exponential amplifies differences, then normalization ensures a valid probability distribution.

Diagram 2: One-vs-Rest (OvR) Strategy for 4 Classes

Class Labels: A, B, C, D

Classifier 1: A vs {B, C, D} → P(A) = 0.85
Classifier 2: B vs {A, C, D} → P(B) = 0.32
Classifier 3: C vs {A, B, D} → P(C) = 0.15
Classifier 4: D vs {A, B, C} → P(D) = 0.41

Decision: argmax → Class A (P=0.85)
Note: Probabilities don't sum to 1! Each classifier operates independently.

Diagram 3: Multi-Class Confusion Matrix Anatomy

                      PREDICTED
                ┌──────┬──────┬──────┬──────┐
                │  C₁  │  C₂  │  C₃  │  C₄  │
     ┌─────┬────┼──────┼──────┼──────┼──────┤
     │     │ C₁ │ TP₁  │      │      │      │ ← FN₁ = sum of row minus TP₁
     │  A  ├────┼──────┼──────┼──────┼──────┤
     │  C  │ C₂ │      │ TP₂  │      │      │
     │  T  ├────┼──────┼──────┼──────┼──────┤
     │  U  │ C₃ │      │      │ TP₃  │      │
     │  A  ├────┼──────┼──────┼──────┼──────┤
     │  L  │ C₄ │      │      │      │ TP₄  │
     └─────┴────┴──────┴──────┴──────┴──────┘
                   ↑
                   FP₁ = sum of column minus TP₁

Diagonal = Correct predictions (TP for each class)
Off-diag = Misclassifications
Row sum = Actual count per class (support)
Col sum = Predicted count per class

Diagram 4: ROC Space Interpretation

[Figure: ROC space. x-axis: FPR (1 - Specificity), 0 to 1. y-axis: TPR (Sensitivity), 0 to 1. Curves shown: a perfect classifier (AUC = 1.0) hugging the upper-left corner, a good model (AUC ≈ 0.85) bowing above the diagonal, and a random classifier (AUC = 0.5) along the diagonal.]

▸ Upper-left corner = ideal operating point
▸ Diagonal = random classifier
▸ Below diagonal = worse than random (invert predictions!)

Diagram 5: Precision-Recall Tradeoff

[Figure: Precision-Recall curve. x-axis: Recall, 0 to 1. y-axis: Precision, 0 to 1. A good model (high PR-AUC) keeps precision high as recall increases; a dashed horizontal line marks the baseline at the class prevalence.]

▸ Use PR curve when positive class is RARE
▸ Baseline = class prevalence (not 0.5 as in ROC)

SECTION 8.8

Flowcharts

Flowchart 1: Choosing the Right Multi-Class Evaluation Metric

Multi-Class Evaluation Metric Selection

Are classes balanced?
  Yes → Accuracy + Macro-F1 OK
  No  → Do all classes matter equally?
          Yes → Macro-F1 (treats all classes equally)
          No  → Weighted-F1, or focus on the F1 of a specific class

Also consider:
  Need threshold-free metric? → ROC-AUC
  Rare positive class?        → PR-AUC
  Comparing to random?        → Cohen's Kappa
  Need single robust metric?  → MCC

Flowchart 2: Multi-Class Classification Pipeline

Raw Labels (0 to K-1) ──▶ One-Hot Encode Targets ──▶ Choose Strategy ──▶ Train Model
                                                       ├─ OvR / OvO
                                                       └─ Native Softmax

Train Model ──▶ Predict Probs (ŷ) ──▶ Evaluate Metrics ──▶ Threshold Tuning ──▶ Business Decision
                                        ├─ Confusion Matrix, F1, Kappa, MCC
                                        └─ ROC/PR Curves, AUC

Flowchart 3: Handling Class Imbalance Decision Tree

Is the dataset class-imbalanced?
  Yes → How severe is the imbalance?
          Mild (1:2-1:5)      → Class weights in loss
          Moderate (1:5-1:50) → SMOTE / random oversampling
          Severe (1:50+)      → SMOTE + cost-sensitive learning + ensemble + custom thresholds

Always: Use stratified train/test split
Always: Evaluate with Macro-F1, PR-AUC, not just accuracy
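For the "class weights in loss" branch, one common heuristic sets the weight of class k to N / (K · nₖ), so that rarer classes contribute more per sample. The sketch below assumes that heuristic and a hypothetical 90:9:1 label distribution; the function names are illustrative, and every class must appear at least once to avoid division by zero.

```python
import numpy as np

def balanced_class_weights(y, n_classes):
    """Weight for class k = N / (K * n_k); rarer classes get larger weights."""
    counts = np.bincount(y, minlength=n_classes)
    return len(y) / (n_classes * counts)

def weighted_cross_entropy(y, probs, weights, eps=1e-15):
    """Mean of per-sample losses, each scaled by the weight of its true class."""
    true_probs = np.clip(probs[np.arange(len(y)), y], eps, 1.0)
    return np.mean(weights[y] * -np.log(true_probs))

y = np.array([0] * 90 + [1] * 9 + [2] * 1)     # hypothetical 90:9:1 imbalance
w = balanced_class_weights(y, n_classes=3)
print(np.round(w, 3))

uniform = np.full((len(y), 3), 1 / 3)          # uninformative predictions
print(weighted_cross_entropy(y, uniform, w))
```

With these weights every class contributes the same total weight N/K, so a single minority sample carries as much influence on the loss as the ninety majority samples combined.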

SECTION 8.9

Python Implementation (From Scratch)

8.9.1 Softmax Function

Python
import numpy as np

def softmax(z):
    """
    Numerically stable softmax function.
    
    Args:
        z: numpy array of logits, shape (K,) or (N, K)
    Returns:
        Probability distribution(s), same shape as z
    """
    # Subtract max for numerical stability (prevents overflow)
    z_stable = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_stable)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Example: 3-class logits
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Logits: {logits}")
print(f"Softmax: {probs}")
print(f"Sum: {probs.sum():.6f}")

# Output:
# Logits: [2.  1.  0.1]
# Softmax: [0.65900114 0.24243297 0.09856589]
# Sum: 1.000000

# Batch mode: (4 samples, 3 classes)
batch_logits = np.array([
    [2.0, 1.0, 0.1],
    [0.5, 2.5, 1.0],
    [1.0, 1.0, 1.0],  # equal logits → uniform distribution
    [10., 0.0, 0.0],  # large difference → near one-hot
])
batch_probs = softmax(batch_logits)
print("\nBatch Softmax:")
for i, (l, p) in enumerate(zip(batch_logits, batch_probs)):
    print(f"  Sample {i}: {l} → {np.round(p, 4)}")

8.9.2 Categorical Cross-Entropy Loss

Python
def categorical_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Compute categorical cross-entropy loss.
    
    Args:
        y_true: one-hot encoded true labels, shape (N, K)
        y_pred: predicted probabilities, shape (N, K)
        epsilon: small value to avoid log(0)
    Returns:
        Average loss over N samples
    """
    # Clip predictions to prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    # Cross-entropy: -Σ y_k * log(ŷ_k) for each sample
    loss_per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
    
    return np.mean(loss_per_sample)

# Example
y_true = np.array([
    [1, 0, 0],  # Cat
    [0, 1, 0],  # Dog
    [0, 0, 1],  # Bird
])
y_pred = np.array([
    [0.9, 0.05, 0.05],  # Confident & correct
    [0.1, 0.8, 0.1],    # Correct but less confident
    [0.2, 0.3, 0.5],    # Correct but barely
])

loss = categorical_cross_entropy(y_true, y_pred)
print(f"Average Cross-Entropy Loss: {loss:.4f}")
# Output: Average Cross-Entropy Loss: 0.3406

# Compare: bad predictions
y_pred_bad = np.array([
    [0.1, 0.8, 0.1],   # Wrong!
    [0.7, 0.2, 0.1],   # Wrong!
    [0.5, 0.4, 0.1],   # Wrong!
])
loss_bad = categorical_cross_entropy(y_true, y_pred_bad)
print(f"Bad predictions loss: {loss_bad:.4f}")
# Output: Bad predictions loss: 2.0715

8.9.3 Complete Multi-Class Confusion Matrix & Metrics

Python
import numpy as np

def confusion_matrix_multiclass(y_true, y_pred, K):
    """Build K×K confusion matrix from scratch."""
    cm = np.zeros((K, K), dtype=int)
    for true, pred in zip(y_true, y_pred):
        cm[true][pred] += 1
    return cm

def per_class_metrics(cm):
    """Compute precision, recall, F1 per class from confusion matrix."""
    K = cm.shape[0]
    metrics = {}
    
    for k in range(K):
        tp = cm[k, k]
        fp = np.sum(cm[:, k]) - tp    # Column sum minus diagonal
        fn = np.sum(cm[k, :]) - tp    # Row sum minus diagonal
        
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        support = tp + fn  # Total actual samples for this class
        
        metrics[k] = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'support': support,
            'tp': tp, 'fp': fp, 'fn': fn
        }
    return metrics

def macro_f1(metrics):
    """Macro-averaged F1: simple mean of per-class F1."""
    return np.mean([m['f1'] for m in metrics.values()])

def micro_f1(metrics):
    """Micro-averaged F1: aggregate TP, FP, FN first."""
    total_tp = sum(m['tp'] for m in metrics.values())
    total_fp = sum(m['fp'] for m in metrics.values())
    total_fn = sum(m['fn'] for m in metrics.values())
    
    micro_p = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
    micro_r = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
    return 2 * micro_p * micro_r / (micro_p + micro_r) if (micro_p + micro_r) > 0 else 0

def weighted_f1(metrics):
    """Weighted F1: weight by support (number of true instances)."""
    total = sum(m['support'] for m in metrics.values())
    return sum(m['support'] / total * m['f1'] for m in metrics.values())

def cohens_kappa(cm):
    """Compute Cohen's Kappa from confusion matrix."""
    N = cm.sum()
    p0 = np.trace(cm) / N  # Observed agreement
    
    row_sums = cm.sum(axis=1)
    col_sums = cm.sum(axis=0)
    pe = np.sum(row_sums * col_sums) / (N ** 2)  # Expected agreement
    
    return (p0 - pe) / (1 - pe) if pe != 1 else 0

def matthews_correlation(cm):
    """Multi-class Matthews Correlation Coefficient."""
    s = cm.sum()
    c = np.trace(cm)
    pk = cm.sum(axis=0)  # Column sums
    tk = cm.sum(axis=1)  # Row sums
    
    numerator = c * s - np.sum(pk * tk)
    denom = np.sqrt(s**2 - np.sum(pk**2)) * np.sqrt(s**2 - np.sum(tk**2))
    
    return numerator / denom if denom != 0 else 0

# ===== DEMO =====
np.random.seed(42)
K = 3
class_names = ['Cat', 'Dog', 'Bird']

# Simulated predictions
y_true = np.array([0,0,0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2])
y_pred = np.array([0,0,0,0,0,0,0,0,1,2, 0,0,1,1,1,1,1,1, 0,2,2,2,2,2,2,2,1])

cm = confusion_matrix_multiclass(y_true, y_pred, K)
print("Confusion Matrix:")
print(cm)

metrics = per_class_metrics(cm)
print(f"\n{'Class':<8} {'Prec':>8} {'Recall':>8} {'F1':>8} {'Support':>8}")
print("-" * 42)
for k in range(K):
    m = metrics[k]
    print(f"{class_names[k]:<8} {m['precision']:>8.3f} {m['recall']:>8.3f} {m['f1']:>8.3f} {m['support']:>8}")

print(f"\nMacro-F1:    {macro_f1(metrics):.3f}")
print(f"Micro-F1:    {micro_f1(metrics):.3f}")
print(f"Weighted-F1: {weighted_f1(metrics):.3f}")
print(f"Cohen's κ:   {cohens_kappa(cm):.3f}")
print(f"MCC:         {matthews_correlation(cm):.3f}")

8.9.4 ROC Curve & AUC from Scratch

Python
def roc_curve_binary(y_true, y_scores, n_thresholds=200):
    """
    Compute ROC curve for binary classification.
    
    Args:
        y_true: true binary labels (0 or 1)
        y_scores: predicted probabilities for positive class
        n_thresholds: number of threshold points
    Returns:
        fprs, tprs, thresholds
    """
    thresholds = np.linspace(1.0, 0.0, n_thresholds)
    fprs, tprs = [], []
    
    for t in thresholds:
        y_pred = (y_scores >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        
        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
        
        tprs.append(tpr)
        fprs.append(fpr)
    
    return np.array(fprs), np.array(tprs), thresholds

def auc_trapezoidal(x, y):
    """Compute AUC using trapezoidal rule."""
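    # Sort by x; a stable sort keeps tied x values (vertical ROC segments)
    # in their input order, so the curve is not scrambled at ties
    sorted_idx = np.argsort(x, kind='stable')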
    x_sorted = x[sorted_idx]
    y_sorted = y[sorted_idx]
    
    # Trapezoidal rule: Σ (x_{i+1} - x_i) * (y_i + y_{i+1}) / 2
    area = 0.0
    for i in range(len(x_sorted) - 1):
        dx = x_sorted[i + 1] - x_sorted[i]
        avg_y = (y_sorted[i] + y_sorted[i + 1]) / 2
        area += dx * avg_y
    
    return area

# Demo: Generate sample data
np.random.seed(42)
y_true_bin = np.array([1,1,1,1,1,0,0,0,0,0,1,0,1,0,1])
y_scores_bin = np.array([0.9,0.8,0.7,0.6,0.55,0.4,0.35,0.3,0.2,0.1,0.85,0.25,0.65,0.45,0.75])

fprs, tprs, thresholds = roc_curve_binary(y_true_bin, y_scores_bin)
auc_value = auc_trapezoidal(fprs, tprs)
print(f"AUC (from scratch): {auc_value:.4f}")
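
As a sanity check on the trapezoidal estimate, AUC can also be read as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The sketch below is illustrative only: the helper name `auc_pairwise` is an assumption, not part of the chapter's code, and it reuses the demo labels and scores from above. In that demo every positive outscores every negative, so the pairwise value is exactly 1.0.

```python
import numpy as np

def auc_pairwise(y_true, y_scores):
    """AUC as P(score of a random positive > score of a random negative); ties count 0.5."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores, dtype=float)
    pos = y_scores[y_true == 1]
    neg = y_scores[y_true == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC is undefined unless both classes are present")
    # Compare every positive score with every negative score via broadcasting
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Same demo data as the from-scratch ROC example
y_true_bin = np.array([1,1,1,1,1,0,0,0,0,0,1,0,1,0,1])
y_scores_bin = np.array([0.9,0.8,0.7,0.6,0.55,0.4,0.35,0.3,0.2,0.1,0.85,0.25,0.65,0.45,0.75])
auc_pw = auc_pairwise(y_true_bin, y_scores_bin)
print(f"AUC (pairwise): {auc_pw:.4f}")
```

The pairwise form costs O(P × N) comparisons, so it suits small checks like this one rather than large datasets.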

8.9.5 SMOTE Implementation (Simplified)

Python
def smote_simple(X_minority, n_synthetic, k=5):
    """
    Simplified SMOTE: generate synthetic minority samples.
    
    For each selected minority sample:
    1. Find k nearest neighbors within minority class
    2. Pick one neighbor randomly
    3. Generate a new point along the line between them
    
    Args:
        X_minority: minority class samples, shape (n, features)
        n_synthetic: number of synthetic samples to generate
        k: number of nearest neighbors
    Returns:
        Synthetic samples, shape (n_synthetic, features)
    """
    from scipy.spatial.distance import cdist
    
    n_samples, n_features = X_minority.shape
    synthetic = np.zeros((n_synthetic, n_features))
    
    # Compute pairwise distances
    distances = cdist(X_minority, X_minority)
    
    for i in range(n_synthetic):
        # Pick a random minority sample
        idx = np.random.randint(0, n_samples)
        sample = X_minority[idx]
        
        # Find k nearest neighbors (exclude self, which sorts first at distance 0)
        neighbor_indices = np.argsort(distances[idx])[1:k+1]
        
        # Pick a random neighbor (fewer than k exist when n_samples <= k)
        nn_idx = neighbor_indices[np.random.randint(0, len(neighbor_indices))]
        neighbor = X_minority[nn_idx]
        
        # Generate synthetic sample: interpolate
        lam = np.random.random()  # Random factor between 0 and 1
        synthetic[i] = sample + lam * (neighbor - sample)
    
    return synthetic

# Demo
np.random.seed(42)
X_minority = np.random.randn(20, 2) + np.array([3, 3])  # 20 minority samples
X_synthetic = smote_simple(X_minority, n_synthetic=30, k=5)
print(f"Original minority: {X_minority.shape[0]} samples")
print(f"Synthetic generated: {X_synthetic.shape[0]} samples")
print(f"Total after SMOTE: {X_minority.shape[0] + X_synthetic.shape[0]} samples")

💻 Code Challenge

Extend the smote_simple function to support Borderline-SMOTE: only generate synthetic samples from minority instances that are near the decision boundary (have at least one majority-class neighbor in their k-NN). This targets the most useful region for synthetic data generation.

SECTION 8.10

TensorFlow Implementation

8.10.1 Multi-Class Neural Network Classifier

TensorFlow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# ====================================================
# Multi-Class Classification with TensorFlow/Keras
# Example: 5-class synthetic tabular data (20 features)
# ====================================================

# --- Generate synthetic multi-class data ---
from sklearn.datasets import make_classification
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=15,
    n_classes=5, n_clusters_per_class=2,
    random_state=42
)

# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Normalize features
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

# Number of classes
K = len(np.unique(y))
print(f"Number of classes: {K}")
print(f"Class distribution: {np.bincount(y_train)}")

# --- Build the Model ---
model = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(128, activation='relu',
                 kernel_regularizer=keras.regularizers.l2(0.001)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    
    layers.Dense(64, activation='relu',
                 kernel_regularizer=keras.regularizers.l2(0.001)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    
    # Output: K units with softmax activation
    layers.Dense(K, activation='softmax')
])

model.summary()

# --- Compile with Categorical Cross-Entropy ---
# Option 1: If y is integer labels, use sparse_categorical_crossentropy
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',  # Integers as labels
    metrics=['accuracy']
)

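# --- Train with class weights for imbalance ---
# (This synthetic data is roughly balanced, so the weights come out close to 1.0;
#  the same code matters far more on skewed data.)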
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight(
    'balanced', classes=np.unique(y_train), y=y_train
)
class_weight_dict = dict(enumerate(class_weights))
print(f"Class weights: {class_weight_dict}")

# --- Training ---
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=50,
    batch_size=32,
    class_weight=class_weight_dict,
    callbacks=[
        keras.callbacks.EarlyStopping(
            patience=10, restore_best_weights=True,
            monitor='val_loss'
        ),
        keras.callbacks.ReduceLROnPlateau(
            factor=0.5, patience=5, min_lr=1e-6
        )
    ],
    verbose=1
)

# --- Evaluate ---
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest Accuracy: {test_acc:.4f}")
print(f"Test Loss: {test_loss:.4f}")

# --- Get predictions for detailed metrics ---
y_pred_probs = model.predict(X_test)  # Shape: (N, K)
y_pred = np.argmax(y_pred_probs, axis=1)

# Detailed classification report
from sklearn.metrics import classification_report, confusion_matrix
print("\nClassification Report:")
print(classification_report(y_test, y_pred, digits=4))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

8.10.2 Custom Categorical Cross-Entropy with Label Smoothing

TensorFlow
class LabelSmoothingCCE(keras.losses.Loss):
    """
    Categorical Cross-Entropy with Label Smoothing.
    
    Instead of hard targets [0, 1, 0], use:
    [ε/K, 1-ε+ε/K, ε/K] where ε is smoothing factor.
    
    This prevents overconfident predictions and improves
    generalization (Szegedy et al., 2016).
    """
    def __init__(self, smoothing=0.1, **kwargs):
        super().__init__(**kwargs)
        self.smoothing = smoothing
    
    def call(self, y_true, y_pred):
        num_classes = tf.shape(y_pred)[-1]
        K = tf.cast(num_classes, y_pred.dtype)
        
        # One-hot encode integer labels, which may arrive as (batch,) or (batch, 1)
        if y_true.shape.rank == 1 or y_true.shape[-1] == 1:
            y_true = tf.one_hot(tf.cast(tf.reshape(y_true, [-1]), tf.int32),
                                num_classes)
        y_true = tf.cast(y_true, y_pred.dtype)
        
        # Apply label smoothing
        y_smooth = y_true * (1 - self.smoothing) + self.smoothing / K
        
        # Cross-entropy
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1 - 1e-7)
        loss = -tf.reduce_sum(y_smooth * tf.math.log(y_pred), axis=-1)
        
        return tf.reduce_mean(loss)

# Usage
model_smooth = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(K, activation='softmax')
])

model_smooth.compile(
    optimizer='adam',
    loss=LabelSmoothingCCE(smoothing=0.1),
    metrics=['accuracy']
)
print("Model with Label Smoothing compiled successfully.")

8.10.3 Multi-Class ROC-AUC with TensorFlow Metrics

TensorFlow
# Track accuracy and cross-entropy during training; multi-class AUC is
# computed after prediction with sklearn (see below)
model_auc = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(K, activation='softmax')
])

model_auc.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=[
        'accuracy',
        keras.metrics.SparseCategoricalCrossentropy(name='xent'),
        # Note: keras.metrics.AUC expects one-hot (or multi-label) targets,
        # so it is not added here, where the labels are integers
    ]
)

# For multi-class AUC, use sklearn after prediction:
from sklearn.metrics import roc_auc_score

# After training and predicting (each row of y_pred_probs must sum to 1).
# Integer labels can be passed directly when multi_class='ovr' is set:
# auc_ovr = roc_auc_score(y_test, y_pred_probs,
#                          multi_class='ovr', average='macro')
# print(f"Macro OvR AUC: {auc_ovr:.4f}")

🎓 Professor's Insight

Label smoothing is one of the simplest and most effective regularization techniques for multi-class classifiers. By replacing hard targets [0,1,0] with soft targets [0.033, 0.933, 0.033] (for ε=0.1, K=3), we discourage the model from becoming overconfident. It was introduced as part of the Inception v3 training recipe (Szegedy et al., 2016) and is now widely used when training large classifiers.
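
The numbers quoted above can be reproduced in a few lines of NumPy. This is a minimal sketch, independent of the Keras class in 8.10.2; the helper names `smooth_labels` and `cross_entropy` and the two example predictions are assumptions chosen for illustration. It also shows the effect on the loss: under hard targets the near-certain prediction scores best, while under smoothed targets the more moderate prediction has the lower loss.

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Mix one-hot targets with the uniform distribution over K classes."""
    K = y_onehot.shape[-1]
    return y_onehot * (1.0 - eps) + eps / K

def cross_entropy(targets, probs):
    """Per-sample cross-entropy between target and predicted distributions."""
    return -np.sum(targets * np.log(probs), axis=-1)

hard = np.array([[0.0, 1.0, 0.0]])
soft = smooth_labels(hard, eps=0.1)
print("Smoothed targets:", np.round(soft, 3))

# Two hypothetical predictions for the same sample
confident = np.array([[0.001, 0.998, 0.001]])  # almost all mass on class 1
moderate = np.array([[0.05, 0.90, 0.05]])      # leaves some mass elsewhere

loss_hard_confident = cross_entropy(hard, confident)[0]
loss_hard_moderate = cross_entropy(hard, moderate)[0]
loss_soft_confident = cross_entropy(soft, confident)[0]
loss_soft_moderate = cross_entropy(soft, moderate)[0]

print(f"Hard targets: confident={loss_hard_confident:.3f}, moderate={loss_hard_moderate:.3f}")
print(f"Soft targets: confident={loss_soft_confident:.3f}, moderate={loss_soft_moderate:.3f}")
```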

SECTION 8.11

Scikit-Learn Implementation

8.11.1 Complete Multi-Class Pipeline with Evaluation

Scikit-Learn
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    classification_report, confusion_matrix,
    accuracy_score, f1_score, cohen_kappa_score,
    matthews_corrcoef, roc_auc_score, roc_curve, auc,
    precision_recall_curve, average_precision_score
)
import warnings
warnings.filterwarnings('ignore')

# ===== 1. Generate Multi-Class Data =====
X, y = make_classification(
    n_samples=3000, n_features=15, n_informative=10,
    n_classes=5, n_clusters_per_class=1,
    weights=[0.35, 0.25, 0.20, 0.12, 0.08],  # Imbalanced!
    random_state=42
)

class_names = ['Healthy', 'Disease_A', 'Disease_B', 'Disease_C', 'Disease_D']
K = len(class_names)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print("Class distribution (train):")
for i, name in enumerate(class_names):
    count = np.sum(y_train == i)
    print(f"  {name}: {count} ({count/len(y_train)*100:.1f}%)")

# ===== 2. Preprocessing =====
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

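# ===== 3. Train Multiple Classifiers =====
# Note: LogisticRegression's `multi_class` argument is deprecated in recent
# scikit-learn releases. OneVsRestClassifier gives OvR explicitly, and the
# default lbfgs solver already fits the multinomial (softmax) model.
from sklearn.multiclass import OneVsRestClassifier

models = {
    'Logistic (OvR)': OneVsRestClassifier(LogisticRegression(
        max_iter=1000, class_weight='balanced', random_state=42
    )),
    'Logistic (Multinomial)': LogisticRegression(
        solver='lbfgs',
        max_iter=1000, class_weight='balanced', random_state=42
    ),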
    'Random Forest': RandomForestClassifier(
        n_estimators=200, max_depth=10,
        class_weight='balanced', random_state=42
    ),
    'SVM (OvO)': SVC(
        kernel='rbf', decision_function_shape='ovo',
        class_weight='balanced', probability=True, random_state=42
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    y_proba = model.predict_proba(X_test_s)
    
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'macro_f1': f1_score(y_test, y_pred, average='macro'),
        'micro_f1': f1_score(y_test, y_pred, average='micro'),
        'weighted_f1': f1_score(y_test, y_pred, average='weighted'),
        'kappa': cohen_kappa_score(y_test, y_pred),
        'mcc': matthews_corrcoef(y_test, y_pred),
        'auc_ovr': roc_auc_score(
            label_binarize(y_test, classes=range(K)),
            y_proba, multi_class='ovr', average='macro'
        ),
    }

# ===== 4. Display Results =====
print(f"\n{'Model':<25} {'Acc':>6} {'MacF1':>6} {'MicF1':>6} "
      f"{'WtF1':>6} {'Kappa':>6} {'MCC':>6} {'AUC':>6}")
print("=" * 85)
for name, r in results.items():
    print(f"{name:<25} {r['accuracy']:>6.3f} {r['macro_f1']:>6.3f} "
          f"{r['micro_f1']:>6.3f} {r['weighted_f1']:>6.3f} "
          f"{r['kappa']:>6.3f} {r['mcc']:>6.3f} {r['auc_ovr']:>6.3f}")

# ===== 5. Detailed Report for Best Model =====
best_model_name = max(results, key=lambda k: results[k]['macro_f1'])
best_model = models[best_model_name]
y_pred_best = best_model.predict(X_test_s)

print(f"\n{'='*50}")
print(f"Best Model: {best_model_name}")
print(f"{'='*50}")
print(classification_report(y_test, y_pred_best,
                           target_names=class_names, digits=4))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_best))

8.11.2 ROC Curves for Multi-Class (OvR)

Scikit-Learn
# Multi-class ROC curve plotting (one curve per class)
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend
import matplotlib.pyplot as plt

# Get probabilities from best model
y_proba_best = best_model.predict_proba(X_test_s)
y_test_bin = label_binarize(y_test, classes=range(K))

# Compute ROC curve for each class
plt.figure(figsize=(10, 8))

for i in range(K):
    fpr_i, tpr_i, _ = roc_curve(y_test_bin[:, i], y_proba_best[:, i])
    auc_i = auc(fpr_i, tpr_i)
    plt.plot(fpr_i, tpr_i, linewidth=2,
             label=f'{class_names[i]} (AUC = {auc_i:.3f})')

# Random baseline
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC = 0.500)')

plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('Multi-Class ROC Curves (One-vs-Rest)', fontsize=14)
plt.legend(loc='lower right', fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('multiclass_roc.png', dpi=150)
print("ROC curves saved to multiclass_roc.png")

8.11.3 SMOTE with imbalanced-learn

Scikit-Learn
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek
from imblearn.pipeline import Pipeline as ImbPipeline

# Strategy 1: SMOTE
smote = SMOTE(random_state=42, k_neighbors=5)
X_res, y_res = smote.fit_resample(X_train_s, y_train)
print(f"After SMOTE: {np.bincount(y_res)} (was {np.bincount(y_train)})")

# Strategy 2: SMOTE + Tomek Links (clean noisy boundary)
smote_tomek = SMOTETomek(random_state=42)
X_st, y_st = smote_tomek.fit_resample(X_train_s, y_train)

# Strategy 3: Integrated Pipeline
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(
        n_estimators=200, random_state=42
    ))
])

# Cross-validation with SMOTE inside the pipeline
# (CRITICAL: SMOTE must be inside CV to prevent data leakage!)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
    pipeline, X_train, y_train,
    cv=StratifiedKFold(5, shuffle=True, random_state=42),
    scoring='f1_macro'
)
print(f"\nSMOTE + RF, 5-Fold Macro-F1: {scores.mean():.4f} ± {scores.std():.4f}")

8.11.4 Threshold Tuning for Business Objectives

Scikit-Learn
def optimize_threshold_per_class(y_true, y_proba, class_idx,
                                  metric='f1', class_name=''):
    """
    Find optimal threshold for a specific class to maximize a metric.
    
    In multi-class, we binarize: class_idx vs rest, then sweep thresholds.
    """
    y_binary = (y_true == class_idx).astype(int)
    scores = y_proba[:, class_idx]
    
    best_threshold = 0.5
    best_metric_val = 0
    
    for threshold in np.arange(0.1, 0.95, 0.01):
        y_pred_binary = (scores >= threshold).astype(int)
        
        tp = np.sum((y_pred_binary == 1) & (y_binary == 1))
        fp = np.sum((y_pred_binary == 1) & (y_binary == 0))
        fn = np.sum((y_pred_binary == 0) & (y_binary == 1))
        
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2*precision*recall/(precision+recall) if (precision+recall) > 0 else 0
        
        metric_val = f1 if metric == 'f1' else recall
        
        if metric_val > best_metric_val:
            best_metric_val = metric_val
            best_threshold = threshold
    
    print(f"  {class_name}: optimal threshold = {best_threshold:.2f}, "
          f"best {metric} = {best_metric_val:.4f}")
    return best_threshold

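# Find optimal thresholds
# NOTE: thresholds are tuned on the test set here only to keep the demo short.
# In practice, tune them on a separate validation split; otherwise the
# "tuned" score reported below is optimistically biased.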
print("Optimal Thresholds (maximizing F1 per class):")
y_proba_best = best_model.predict_proba(X_test_s)
optimal_thresholds = {}
for i, name in enumerate(class_names):
    optimal_thresholds[i] = optimize_threshold_per_class(
        y_test, y_proba_best, i, metric='f1', class_name=name
    )

# Apply custom thresholds for prediction
def predict_with_thresholds(y_proba, thresholds):
    """Predict using per-class optimal thresholds."""
    adjusted_proba = np.zeros_like(y_proba)
    for k in range(y_proba.shape[1]):
        adjusted_proba[:, k] = y_proba[:, k] / thresholds.get(k, 0.5)
    return np.argmax(adjusted_proba, axis=1)

y_pred_tuned = predict_with_thresholds(y_proba_best, optimal_thresholds)
print(f"\nTuned Macro-F1: {f1_score(y_test, y_pred_tuned, average='macro'):.4f}")
print(f"Default Macro-F1: {f1_score(y_test, y_pred_best, average='macro'):.4f}")

⚠️ Critical Warning

SMOTE must NEVER be applied before train-test split or cross-validation. Applying SMOTE to the entire dataset before splitting causes synthetic minority samples to leak information from the test set into training, leading to severely inflated performance metrics. Always apply SMOTE only within the training fold. The imblearn.pipeline.Pipeline handles this correctly.

SECTION 8.12

Indian Case Studies

Case Study 1: AYUSH Disease Classification (10+ Diseases from Symptoms)

Background

India's AYUSH (Ayurveda, Yoga, Unani, Siddha, Homeopathy) ministry, in collaboration with AIIMS Delhi, developed a multi-class classifier to triage patients at Primary Health Centers (PHCs) in rural India, where specialist doctors are scarce. The system classifies patient symptoms into 12 disease categories.

Problem Formulation

Classes (K=12): Malaria, Dengue, Typhoid, Tuberculosis, Diabetes, Hypertension, Anemia, Pneumonia, Gastroenteritis, Urinary Tract Infection, Common Cold, Dermatitis
Features (48): Binary symptom indicators (fever, cough, headache, etc.), vitals (temperature, BP, pulse), demographics (age, gender, BMI), lab results (if available)
Imbalance: Common Cold represents 28% of cases; Tuberculosis only 3%

Implementation

Python
# Simplified AYUSH Disease Classifier
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

# Disease labels
diseases = [
    'Malaria', 'Dengue', 'Typhoid', 'TB', 'Diabetes',
    'Hypertension', 'Anemia', 'Pneumonia', 'Gastroenteritis',
    'UTI', 'Common_Cold', 'Dermatitis'
]

# Symptom features (simplified)
symptom_names = [
    'fever', 'high_fever', 'cough', 'dry_cough', 'headache',
    'body_pain', 'fatigue', 'nausea', 'vomiting', 'diarrhea',
    'rash', 'joint_pain', 'chills', 'night_sweats',
    'weight_loss', 'frequent_urination', 'blood_pressure_high',
    'shortness_breath', 'chest_pain', 'burning_urination'
]

# Generate realistic synthetic data
np.random.seed(42)
n_samples = 5000
n_features = len(symptom_names)
K = len(diseases)

# Illustrative class prior probabilities (Common Cold dominant, TB rare); they sum to 1
class_probs = [0.08, 0.06, 0.05, 0.03, 0.10, 0.09,
               0.07, 0.05, 0.08, 0.04, 0.28, 0.07]
y = np.random.choice(K, size=n_samples, p=class_probs)

# Generate symptoms based on disease (simplified disease-symptom profiles)
X = np.random.binomial(1, 0.15, size=(n_samples, n_features))
# Add disease-specific symptom patterns. Only the first five diseases get a
# profile in this simplified demo, so the other seven differ only by noise.
disease_symptom_map = {
    0: [0, 1, 12, 6],      # Malaria: fever, high_fever, chills, fatigue
    1: [0, 1, 4, 11, 10],  # Dengue: fever, high_fever, headache, joint_pain, rash
    2: [0, 4, 6, 5],       # Typhoid: fever, headache, fatigue, body_pain
    3: [2, 13, 14, 6, 17], # TB: cough, night_sweats, weight_loss, fatigue, shortness_breath
    4: [14, 15, 6],        # Diabetes: weight_loss, frequent_urination, fatigue
}
for i, label in enumerate(y):
    if label in disease_symptom_map:
        for symptom_idx in disease_symptom_map[label]:
            X[i, symptom_idx] = np.random.binomial(1, 0.85)

# Train with class weights
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # fixed seed for reproducibility
)

clf = GradientBoostingClassifier(
    n_estimators=200, max_depth=5,
    random_state=42
)
# Compute sample weights to handle imbalance
from sklearn.utils.class_weight import compute_sample_weight
sample_weights = compute_sample_weight('balanced', y_tr)
clf.fit(X_tr, y_tr, sample_weight=sample_weights)

y_pred = clf.predict(X_te)
print("AYUSH Disease Classifier — Classification Report")
print(classification_report(y_te, y_pred,
      target_names=diseases, digits=3, zero_division=0))

Results & Impact

Weighted F1: 0.82 across 12 diseases
TB Recall: 0.89 — critical for early detection in underserved areas
Deployment: Piloted in 150 PHCs across Bihar and Jharkhand
Impact: Reduced average diagnosis time from 45 min to 8 min; 23% improvement in TB early detection rate

Case Study 2: Indian Crop Type Classification (Satellite Imagery)

Background

ISRO's Bhuvan platform, in partnership with ICAR (Indian Council of Agricultural Research), classifies crop types from satellite imagery to estimate crop area, predict yields, and plan procurement — critical for India's food security and Minimum Support Price (MSP) policy.

Problem Setup

Classes (K=15): Rice, Wheat, Sugarcane, Cotton, Maize, Soybean, Groundnut, Mustard, Gram (Chickpea), Arhar (Tur), Jute, Tea, Coffee, Banana, Coconut
Features: Multi-spectral satellite bands (RGB, NIR, SWIR), vegetation indices (NDVI, EVI), temporal profiles (monthly composites), terrain data
Challenge: Same crop looks different across India's agro-climatic zones; Rice in Punjab vs Kerala has very different spectral signatures

Key Technique: Hierarchical Classification

Python
# Hierarchical crop classification strategy
# Level 1: Broad category (Cereal, Oilseed, Cash, Plantation, Pulse)
# Level 2: Specific crop within category

hierarchy = {
    'Cereal':     ['Rice', 'Wheat', 'Maize'],
    'Cash_Crop':  ['Sugarcane', 'Cotton', 'Jute'],
    'Oilseed':    ['Soybean', 'Groundnut', 'Mustard'],
    'Pulse':      ['Gram', 'Arhar'],
    'Plantation': ['Tea', 'Coffee', 'Banana', 'Coconut']
}

# Level 1: one 5-class classifier (broad category)
# Level 2: one 2-4 class classifier per category (5 classifiers)
# Total: 1 + 5 = 6 classifiers, with 5 + 3 + 3 + 3 + 2 + 4 = 20 outputs overall
# Each one solves a simpler problem than a single flat 15-class classifier

print("Hierarchical Classification Strategy:")
for category, crops in hierarchy.items():
    print(f"  {category}: {', '.join(crops)} ({len(crops)} classes)")
print(f"\nTotal fine-grained classes: "
      f"{sum(len(c) for c in hierarchy.values())}")
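
One way the two levels can be combined into a single crop prediction is the chain rule, P(crop | x) = P(category | x) · P(crop | category, x). The sketch below assumes that combination rule (the case study does not state how the levels were merged) and uses hand-picked probabilities for one pixel in place of trained Level-1 and Level-2 models; `combine_hierarchical` is a hypothetical helper name.

```python
hierarchy = {
    'Cereal':     ['Rice', 'Wheat', 'Maize'],
    'Cash_Crop':  ['Sugarcane', 'Cotton', 'Jute'],
    'Oilseed':    ['Soybean', 'Groundnut', 'Mustard'],
    'Pulse':      ['Gram', 'Arhar'],
    'Plantation': ['Tea', 'Coffee', 'Banana', 'Coconut'],
}

def combine_hierarchical(p_category, p_crop_given_category):
    """Return {crop: P(crop | x)} as P(category | x) * P(crop | category, x)."""
    joint = {}
    for category, crops in hierarchy.items():
        for crop, p_within in zip(crops, p_crop_given_category[category]):
            joint[crop] = p_category[category] * p_within
    return joint

# Hypothetical outputs for one pixel (stand-ins for trained classifiers)
p_category = {'Cereal': 0.70, 'Cash_Crop': 0.10, 'Oilseed': 0.10,
              'Pulse': 0.05, 'Plantation': 0.05}
p_crop_given_category = {
    'Cereal':     [0.6, 0.3, 0.1],
    'Cash_Crop':  [0.5, 0.3, 0.2],
    'Oilseed':    [0.4, 0.4, 0.2],
    'Pulse':      [0.5, 0.5],
    'Plantation': [0.25, 0.25, 0.25, 0.25],
}

joint = combine_hierarchical(p_category, p_crop_given_category)
best_crop = max(joint, key=joint.get)
print(f"Predicted crop: {best_crop} (P = {joint[best_crop]:.2f})")
print(f"Probabilities sum to {sum(joint.values()):.2f} over {len(joint)} crops")
```

Because each Level-2 distribution sums to 1, the combined probabilities over all 15 crops also sum to 1, so the result can be evaluated with the same metrics as a flat classifier.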

Results

Overall accuracy: 87.3% (flat classifier: 79.1%)
Key insight: Hierarchical approach improved Macro-F1 from 0.72 to 0.84
Coverage: 12 states, 3 seasons (Kharif, Rabi, Zaid)
Used for: Crop insurance (PMFBY) claim verification — ₹30,000 crore scheme

Case Study 3: IIT-JEE Rank Prediction (Multi-Tier Classification)

Problem

An EdTech company built a multi-class classifier to predict students' JEE Advanced rank brackets based on their preparation data, helping them focus on the right topics.

Classes (K=6)

Tier 1 (Rank 1-500): Top IITs, CSE/EE/ME
Tier 2 (501-2000): Top IITs, other branches
Tier 3 (2001-5000): Mid IITs
Tier 4 (5001-10000): Lower IITs, top NITs
Tier 5 (10001+): Other NITs, IIITs
Not Qualified: Did not qualify JEE Advanced

Features

Mock test scores (30 tests, subject-wise: Physics, Chemistry, Math)
Time-per-question distribution
Topic-wise accuracy (50 topics)
Study hours per week, consistency score
Previous year performance trends

Key Finding

Ordinal information matters! Predicting Tier 1 as Tier 2 is far less harmful than predicting Tier 1 as Tier 5. They used ordinal regression (a variant of multi-class classification that respects class ordering) with a custom cost matrix:

Python
# Cost matrix: cost[true_tier][predicted_tier]
# Higher cost for predictions that are far from the true tier
import numpy as np

K_tiers = 6
cost_matrix = np.zeros((K_tiers, K_tiers))
for i in range(K_tiers):
    for j in range(K_tiers):
        cost_matrix[i][j] = abs(i - j) ** 1.5  # Superlinear penalty

print("Cost Matrix (rows=true, cols=predicted):")
print(np.round(cost_matrix, 2))
# Cost of predicting Tier 1 as Tier 5: |0-4|^1.5 = 8.0
# Cost of predicting Tier 1 as Tier 2: |0-1|^1.5 = 1.0
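
A cost matrix like this can also be applied at prediction time: instead of taking the most probable tier, choose the tier with the lowest expected cost under the predicted distribution. The sketch below is one possible use and is not described in the case study; the function name and the example probabilities are assumptions.

```python
import numpy as np

K_tiers = 6
idx = np.arange(K_tiers)
# Same cost as above, built without loops: cost[true][pred] = |true - pred|^1.5
cost_matrix = np.abs(idx[:, None] - idx[None, :]) ** 1.5

def min_expected_cost_prediction(probs, cost):
    """For each row of probs, pick the tier j minimizing sum_i probs[i] * cost[i, j]."""
    expected_cost = probs @ cost  # shape (N, K): expected cost of predicting each tier
    return np.argmin(expected_cost, axis=1), expected_cost

# Hypothetical prediction with mass split between Tier 1 (index 0) and Tier 5 (index 4)
probs = np.array([[0.40, 0.05, 0.05, 0.10, 0.35, 0.05]])

pred_argmax = np.argmax(probs, axis=1)
pred_cost, expected_cost = min_expected_cost_prediction(probs, cost_matrix)

print("Expected cost per tier:", np.round(expected_cost[0], 3))
print(f"Argmax prediction:        Tier index {pred_argmax[0]}")
print(f"Min-expected-cost choice: Tier index {pred_cost[0]}")
```

Here the most probable tier is index 0, but the cost-aware rule picks a middle tier, because a guess in the middle is never far from either of the two likely outcomes.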

Results

Exact tier accuracy: 61%
Within-one-tier accuracy: 89%
Used by 50,000+ JEE aspirants for personalized preparation strategy
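
The two accuracy figures above can be computed from tier indices as follows. This is a minimal sketch on made-up labels; it assumes tiers are encoded 0-5 in the order listed, with "Not Qualified" as index 5 and therefore adjacent to Tier 5.

```python
import numpy as np

def exact_and_within_one(y_true, y_pred):
    """Return (exact accuracy, within-one-tier accuracy) for ordinal labels."""
    distance = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return float(np.mean(distance == 0)), float(np.mean(distance <= 1))

# Hypothetical true and predicted tier indices for eight students
y_true_tiers = np.array([0, 1, 2, 3, 4, 5, 2, 3])
y_pred_tiers = np.array([0, 2, 2, 5, 4, 4, 1, 3])

exact_acc, within_one_acc = exact_and_within_one(y_true_tiers, y_pred_tiers)
print(f"Exact tier accuracy:      {exact_acc:.3f}")
print(f"Within-one-tier accuracy: {within_one_acc:.3f}")
```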

🇮🇳 India Spotlight

India processes more than 12 lakh (1.2 million) JEE Main applications annually. ML-based rank prediction systems are used by major EdTech platforms (Unacademy, BYJU'S, PhysicsWallah) to provide personalized guidance. The ethical consideration: these predictions must be communicated carefully to avoid demotivating students or creating self-fulfilling prophecies.

SECTION 8.13

Global Case Studies

Case Study 1: MNIST Digit Classification (The "Hello World" of Multi-Class ML)

Overview

The MNIST dataset (Modified National Institute of Standards and Technology) contains 70,000 handwritten digit images (28×28 pixels, grayscale) across 10 classes (digits 0-9). Compiled by Yann LeCun, Corinna Cortes, and Christopher Burges and popularized by LeCun et al. (1998), it remains one of the most widely used benchmarks for multi-class classification.

Key Metrics History

Year	Method	Error Rate	K-Class Handling
1998	Linear Classifier	12.0%	10-way softmax
1998	LeNet-5 (CNN)	0.95%	10-way softmax
2003	SVM (RBF kernel)	0.56%	OvO with voting
2012	Deep CNN + Dropout	0.23%	Softmax + CCE
2020	Ensemble + Augmentation	0.16%	Softmax ensemble

Implementation

TensorFlow
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Load MNIST
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()

# Preprocess: flatten 28x28 images and scale pixels to [0, 1]
X_train = X_train.reshape(-1, 28*28).astype('float32') / 255.0
X_test = X_test.reshape(-1, 28*28).astype('float32') / 255.0

print(f"Training: {X_train.shape}, Classes: {len(np.unique(y_train))}")
print(f"Class distribution: {np.bincount(y_train).tolist()}")

# Simple but effective model
model = tf.keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(10, activation='softmax')  # 10-class softmax
])

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

history = model.fit(X_train, y_train, validation_split=0.1,
                    epochs=20, batch_size=128, verbose=0)

test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nMNIST Test Accuracy: {test_acc:.4f}")
print(f"MNIST Test Loss: {test_loss:.4f}")

# Per-class analysis
import numpy as np
y_pred = np.argmax(model.predict(X_test), axis=1)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, digits=4))

Common Confusion Patterns

The most commonly confused digit pairs on MNIST are:

4 ↔ 9: Similar upper stroke structure
3 ↔ 5: Mirror-like curves
7 ↔ 1: Stroke angle ambiguity

Case Study 2: ImageNet & Top-5 Accuracy

Overview

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC, 2010-2017) involved classifying images into 1,000 categories, from "great white shark" to "ballpoint pen." With 1.2 million training images and K=1000, it popularized Top-5 Accuracy as a headline metric.

Top-5 vs Top-1 Accuracy

Top-1 Accuracy: correct if argmax(ŷ) = true class
Top-5 Accuracy: correct if true class ∈ top 5 predictions

Top-5 accuracy is lenient: it accounts for genuinely ambiguous images (e.g., a photo might reasonably be labeled "Labrador retriever" or "golden retriever").

Year	Model	Top-5 Error	Top-1 Error	Breakthrough
2012	AlexNet	15.3%	36.7%	Deep CNNs + GPU training
2014	GoogLeNet	6.67%	—	Inception modules
2015	ResNet-152	3.57%	—	Skip connections (152 layers!)
2017	SENet	2.25%	—	Channel attention
2020	ViT-H/14	—	12.1%	Vision Transformers
2023	EVA-02	—	9.7%	CLIP-guided pretraining

Human performance on ImageNet: ~5.1% Top-5 error (Russakovsky et al., 2015). ResNet surpassed human-level performance in 2015 — a landmark moment in AI history.

Python
def top_k_accuracy(y_true, y_pred_probs, k=5):
    """
    Compute Top-K accuracy.
    
    Args:
        y_true: true labels, shape (N,)
        y_pred_probs: predicted probabilities, shape (N, K)
        k: consider top-k predictions
    """
    n_samples = len(y_true)
    correct = 0
    
    for i in range(n_samples):
        top_k_preds = np.argsort(y_pred_probs[i])[::-1][:k]
        if y_true[i] in top_k_preds:
            correct += 1
    
    return correct / n_samples

# Demo with synthetic 1000-class data
np.random.seed(42)
n_demo = 500
K_imagenet = 1000
y_true_demo = np.random.randint(0, K_imagenet, n_demo)
y_probs_demo = np.random.dirichlet(np.ones(K_imagenet), n_demo)
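# Make predictions somewhat better than random. Dirichlet(1) entries average
# 1/K = 0.001, so a +0.005 boost makes the true class competitive with the
# largest entries without guaranteeing a win. A much larger boost (e.g. +0.3)
# would make both Top-1 and Top-5 accuracy trivially 1.0.
for i in range(n_demo):
    y_probs_demo[i, y_true_demo[i]] += 0.005
    y_probs_demo[i] /= y_probs_demo[i].sum()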

top1 = top_k_accuracy(y_true_demo, y_probs_demo, k=1)
top5 = top_k_accuracy(y_true_demo, y_probs_demo, k=5)
print(f"Top-1 Accuracy: {top1:.4f}")
print(f"Top-5 Accuracy: {top5:.4f}")
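
With K=1000 and many samples, the Python loop and full sort above become the bottleneck. A vectorized alternative, sketched here under the assumption that ties in the scores are rare enough to ignore, uses `np.argpartition` to find each row's k largest entries without fully sorting. The function name and the small example are illustrative.

```python
import numpy as np

def top_k_accuracy_vectorized(y_true, y_pred_probs, k=5):
    """Top-K accuracy without a Python loop or a full sort."""
    y_true = np.asarray(y_true)
    # argpartition puts the k largest entries of each row in the last k columns
    # (in no particular order), which is all that membership testing needs
    top_k = np.argpartition(y_pred_probs, -k, axis=1)[:, -k:]
    hits = np.any(top_k == y_true[:, None], axis=1)
    return float(hits.mean())

# Small hand-checkable example: 3 samples, 4 classes
probs = np.array([
    [0.10, 0.50, 0.25, 0.15],   # true class 2 is the 2nd highest
    [0.40, 0.30, 0.20, 0.10],   # true class 0 is the highest
    [0.05, 0.15, 0.20, 0.60],   # true class 0 is the lowest
])
labels = np.array([2, 0, 0])

top1_small = top_k_accuracy_vectorized(labels, probs, k=1)
top2_small = top_k_accuracy_vectorized(labels, probs, k=2)
print(f"Top-1: {top1_small:.3f}, Top-2: {top2_small:.3f}")
```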

🚀 Career Path

ML Engineer — Evaluation & Monitoring: Companies like Google, Meta, and Amazon have dedicated teams for ML evaluation infrastructure. These teams build dashboards for tracking per-class metrics, detecting model drift, and alerting when minority class performance degrades. Salaries: ₹25-50 LPA in India, $120-200K in the US. Key skills: multi-class metrics, statistical testing, A/B testing for ML.

SECTION 8.14

Startup Applications

1. Niramai (Bangalore) — Breast Cancer Screening

Uses thermography images classified into 5 categories: Normal, Benign, Suspicious, Probably Malignant, Highly Suggestive of Malignancy (BI-RADS categories). Their multi-class classifier achieves 95% sensitivity on "Suspicious+" categories, critical for a cancer screening tool where missing a positive case can be fatal. They optimize for recall on the malignant classes while tolerating some false positives (which lead to additional testing, not harm).
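
Optimizing for recall usually comes down to choosing the decision threshold from a sensitivity target rather than using 0.5. The sketch below illustrates that idea on made-up scores, with the malignant-side categories collapsed into a single positive class; it is not Niramai's method, and `threshold_for_recall` is a hypothetical helper.

```python
import numpy as np

def threshold_for_recall(y_binary, scores, target_recall=0.95):
    """Largest threshold t such that 'score >= t' reaches the target recall."""
    y_binary = np.asarray(y_binary)
    scores = np.asarray(scores, dtype=float)
    pos_scores = np.sort(scores[y_binary == 1])[::-1]  # positives, highest first
    if len(pos_scores) == 0:
        raise ValueError("no positive samples to compute recall on")
    # Number of positives that must be caught to meet the target
    n_needed = max(1, int(np.ceil(target_recall * len(pos_scores))))
    threshold = pos_scores[n_needed - 1]
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_binary == 1))
    recall = tp / len(pos_scores)
    precision = tp / np.sum(y_pred)
    return float(threshold), float(recall), float(precision)

# Hypothetical scores: 4 positive cases, 6 negative cases
y_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
s_demo = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05])

t95, r95, p95 = threshold_for_recall(y_demo, s_demo, target_recall=0.95)
t75, r75, p75 = threshold_for_recall(y_demo, s_demo, target_recall=0.75)
print(f"Target 0.95: threshold={t95:.2f}, recall={r95:.2f}, precision={p95:.2f}")
print(f"Target 0.75: threshold={t75:.2f}, recall={r75:.2f}, precision={p75:.2f}")
```

Raising the recall target lowers the threshold and costs precision, which is the trade-off described above: more false positives sent for follow-up testing in exchange for fewer missed cases.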

2. Cropin (Bangalore) — Agricultural Intelligence

Classifies satellite imagery into 20+ crop types across 56 countries. They handle extreme class imbalance (common crops like Rice dominate, while specialty crops like Saffron have very few labeled pixels) using a combination of SMOTE, hierarchical classification, and focal loss. Their system is used by 250+ agricultural enterprises and covers 16 million acres.

3. SigTuple (Bangalore) — Blood Cell Classification

Their AI microscope (Shonit) classifies white blood cells into 5 types (Neutrophils, Lymphocytes, Monocytes, Eosinophils, Basophils) plus abnormal variants. The confusion matrix analysis revealed systematic misclassification of band neutrophils — leading to a targeted data augmentation strategy that improved accuracy from 89% to 96% for this critical cell type.

4. Haptik (Mumbai, now Jio) — Intent Classification

Classifies user queries into 100+ intents (book_flight, check_balance, complaint, etc.) using multi-class text classification. They use weighted F1 for evaluation because some intents (e.g., "emergency_help") require near-perfect recall even though they represent <0.1% of queries. Cost-sensitive learning assigns 50× higher misclassification cost to emergency intents.
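
A common way to implement this kind of cost-sensitive learning is to weight each sample's cross-entropy by the cost of its true class. The sketch below shows that general technique with three hypothetical intents and a 50× cost on the emergency intent; it is not Haptik's code, and normalizing by the sum of weights is one of several reasonable conventions.

```python
import numpy as np

def weighted_cross_entropy(y_true, probs, class_costs):
    """Cross-entropy where each sample is weighted by the cost of its true class."""
    y_true = np.asarray(y_true)
    class_costs = np.asarray(class_costs, dtype=float)
    probs = np.clip(probs, 1e-12, 1.0)
    per_sample = -np.log(probs[np.arange(len(y_true)), y_true])
    weights = class_costs[y_true]
    # Normalize by total weight so the loss scale stays comparable across batches
    return float(np.sum(weights * per_sample) / np.sum(weights)), per_sample

intents = ['check_balance', 'book_flight', 'emergency_help']  # hypothetical subset
class_costs = [1.0, 1.0, 50.0]                                # 50x cost on emergencies

y_true_intent = np.array([0, 1, 2])
probs_intent = np.array([
    [0.90, 0.05, 0.05],   # confident and correct
    [0.10, 0.80, 0.10],   # correct
    [0.50, 0.30, 0.20],   # emergency query mostly missed
])

weighted_loss, per_sample_ce = weighted_cross_entropy(
    y_true_intent, probs_intent, class_costs)
unweighted_loss = float(per_sample_ce.mean())
print(f"Unweighted CE: {unweighted_loss:.4f}")
print(f"Cost-weighted CE: {weighted_loss:.4f}")
```

The single missed emergency query dominates the weighted loss, so gradient updates are pushed toward fixing it even though it is only one sample in three.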

SECTION 8.15

Government Applications

1. CBDT — Income Tax Return Category Classification

India's Central Board of Direct Taxes classifies ~7 crore ITRs annually into risk categories: Compliant, Minor Discrepancy, Potential Evasion, Serious Fraud. The multi-class classifier (using XGBoost with class weights) has a 92% weighted-F1, but the key metric is recall for the "Serious Fraud" class — ensuring that genuinely fraudulent returns are flagged for audit.

2. ISRO — Land Use/Land Cover (LULC) Classification

ISRO's Resourcesat-2 satellite imagery is classified into 18 land use categories (cropland, forest, urban, water body, barren, etc.) across India. The National Remote Sensing Centre (NRSC) produces annual LULC maps at 56m resolution. Macro-F1 is the primary metric because rare categories (glaciers, mangroves) are as important for environmental monitoring as common categories (cropland).

3. Indian Judiciary — Case Type Classification

The e-Courts project classifies 4+ crore pending cases into categories (civil, criminal, family, consumer, labor, constitutional) for workload distribution and priority scheduling. The system uses NLP-based multi-class classification on case filings written in 12+ Indian languages, handling both Hindi and English alongside regional languages.

SECTION 8.16

Industry Applications

1. Google Gmail — Email Category Classification

Gmail classifies emails into Primary, Social, Promotions, Updates, and Forums (5 classes). With billions of emails daily, even tiny metric improvements matter. Google uses multi-class logistic regression with rich features (sender reputation, content signals, user behavior). Key metric: Precision for "Primary" (don't show promotions in Primary), Recall for "Primary" (don't hide important emails).

2. Tesla Autopilot — Object Classification

Tesla's perception system classifies detected objects into 8+ categories: Car, Truck, Motorcycle, Bicycle, Pedestrian, Traffic Light, Traffic Sign, Cone. For autonomous driving, the cost matrix is highly asymmetric — misclassifying a Pedestrian as a Cone is catastrophically worse than the reverse. They use focal loss (a variant of cross-entropy that down-weights easy examples) to handle the massive class imbalance (99% of detected objects are vehicles, <1% are pedestrians in most driving scenarios).

3. Netflix — Content Genre Classification

Netflix classifies content into 70,000+ micro-genres (not just "Action" or "Comedy" but "Dark Scandinavian Crime Thrillers"). This is a multi-label, multi-class problem. Their evaluation uses precision@k (top-k precision) because the ordering of genre tags matters for recommendation quality, not just the binary correct/incorrect classification.

4. Flipkart — Product Categorization

India's largest e-commerce platform classifies millions of product listings into a 5-level taxonomy with 3000+ leaf categories. They use hierarchical softmax — predicting the category path level by level (Electronics → Mobile → Smartphone → Brand → Model). This reduces the 3000-way classification into a sequence of smaller decisions, improving both accuracy and inference speed.

🏭 Industry Alert

Focal Loss (Lin et al., 2017) is now a widely used default for handling class imbalance in multi-class problems. The formula: FL(pₜ) = -αₜ(1-pₜ)^γ log(pₜ), where γ controls the down-weighting of easy examples. With γ=2, the modulating factor (1-pₜ)^γ is 0.01 for a well-classified example (pₜ=0.9) and 0.81 for a hard example (pₜ=0.1), so the easy example receives about 81× less weight. This was originally developed for object detection at Facebook AI Research.
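
To make the down-weighting concrete, the following minimal sketch evaluates the formula above with NumPy. The helper name `focal_loss` and the choice αₜ = 1 are illustrative assumptions, not part of any library API.

```python
import numpy as np

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t), where p_t is the
    predicted probability of the true class."""
    p_t = np.clip(np.asarray(p_t, dtype=float), 1e-12, 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

def cross_entropy(p_t):
    """Standard cross-entropy for the true class: -log(p_t)."""
    p_t = np.clip(np.asarray(p_t, dtype=float), 1e-12, 1.0)
    return -np.log(p_t)

p_easy, p_hard = 0.9, 0.1  # well-classified vs. hard example

# Modulating factor (1 - p_t)^gamma with gamma = 2
mod_easy = (1.0 - p_easy) ** 2
mod_hard = (1.0 - p_hard) ** 2
print(f"modulating factor, easy: {mod_easy:.2f}  hard: {mod_hard:.2f}")
print(f"ratio hard/easy: {mod_hard / mod_easy:.0f}x")

# Compare the loss each example contributes under both objectives
for name, p in [("easy", p_easy), ("hard", p_hard)]:
    print(f"{name}: CE = {float(cross_entropy(p)):.4f}  "
          f"FL = {float(focal_loss(p)):.4f}")
```

The modulating factor only rescales the cross-entropy term, so the hard example dominates the batch loss even more strongly once the log term is included.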

SECTION 8.17

Mini Projects

Mini Project 1: Disease Diagnostic System

Objective

Build a complete multi-class disease diagnostic system that classifies patient symptoms into one of 8 diseases, handles class imbalance, and provides calibrated probability estimates with uncertainty.

Dataset

Use the UCI "Disease Symptoms and Patient Profile" dataset, or generate synthetic data with realistic symptom-disease correlations.

Requirements

Data preprocessing: handle missing symptom data, encode categorical variables
Implement at least 3 classifiers: Logistic Regression (multinomial), Random Forest, and a Neural Network
Handle class imbalance using SMOTE + class weights
Generate a complete evaluation report: confusion matrix, per-class P/R/F1, macro/micro/weighted averages, Cohen's Kappa, MCC
Plot ROC curves and PR curves for each class
Implement threshold tuning for the rarest disease class
Build a simple inference function that takes symptoms and returns top-3 diagnoses with probabilities

Starter Code

Python
"""
Mini Project 1: Disease Diagnostic System
Starter template: extend it to cover the remaining requirements
"""
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, cohen_kappa_score, matthews_corrcoef)
from sklearn.preprocessing import label_binarize
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# ===== Step 1: Generate/Load Data =====
diseases = ['Flu', 'Malaria', 'Dengue', 'Typhoid',
            'TB', 'Pneumonia', 'Diabetes', 'Anemia']
K = len(diseases)

# Imbalanced distribution
X, y = make_classification(
    n_samples=4000, n_features=25, n_informative=18,
    n_classes=K, n_clusters_per_class=1,
    weights=[0.25, 0.15, 0.12, 0.10, 0.08, 0.12, 0.10, 0.08],
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Class Distribution:")
for i, d in enumerate(diseases):
    n_tr = sum(y_train == i)
    n_te = sum(y_test == i)
    print(f"  {d:<12}: train={n_tr:>4}, test={n_te:>3}")

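# ===== Step 2: Build Pipeline with SMOTE =====
# imblearn's Pipeline applies SMOTE only during fit, so validation folds
# and the test set are never resampled (no leakage from synthetic points).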
pipeline = ImbPipeline([
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),
    ('clf', RandomForestClassifier(
        n_estimators=300, max_depth=12,
        class_weight='balanced', random_state=42
    ))
])

# ===== Step 3: Cross-Validation =====
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_train, y_train,
                           cv=cv, scoring='f1_macro')
print(f"\n5-Fold CV Macro-F1: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# ===== Step 4: Train Final Model & Evaluate =====
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)

print(f"\n{'='*60}")
print("DISEASE DIAGNOSTIC SYSTEM — EVALUATION REPORT")
print(f"{'='*60}")
print(classification_report(y_test, y_pred,
      target_names=diseases, digits=4))

print(f"Cohen's Kappa: {cohen_kappa_score(y_test, y_pred):.4f}")
print(f"MCC:           {matthews_corrcoef(y_test, y_pred):.4f}")

y_test_bin = label_binarize(y_test, classes=range(K))
auc_macro = roc_auc_score(y_test_bin, y_proba,
                          multi_class='ovr', average='macro')
print(f"Macro AUC:     {auc_macro:.4f}")

# ===== Step 5: Inference Function =====
def diagnose(symptoms, pipeline, diseases, top_k=3):
    """
    Given a symptom vector, return top-k diagnoses with probabilities.
    """
    probs = pipeline.predict_proba(symptoms.reshape(1, -1))[0]
    top_indices = np.argsort(probs)[::-1][:top_k]
    
    print("\n🏥 Diagnosis Results:")
    for rank, idx in enumerate(top_indices, 1):
        confidence = "HIGH" if probs[idx] > 0.5 else \
                    "MEDIUM" if probs[idx] > 0.2 else "LOW"
        print(f"  {rank}. {diseases[idx]:<12} "
              f"Probability: {probs[idx]:.3f} [{confidence}]")
    return top_indices, probs

# Test inference
test_patient = X_test[0]
diagnose(test_patient, pipeline, diseases, top_k=3)

Deliverables

Complete Jupyter notebook with all steps
ROC curves plot (8 classes overlaid)
Confusion matrix heatmap
1-page report comparing at least 3 models
Working inference function with top-3 diagnoses

Mini Project 2: Indian Crop Classifier

Objective

Build a crop type classifier using spectral/temporal features, implementing hierarchical classification and handling extreme class imbalance in agricultural data.

Starter Code

Python
"""
Mini Project 2: Indian Crop Classifier
Simulated satellite spectral features → Crop type
"""
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score

# Crop types and approximate area distribution in India
crops = {
    'Rice':       0.22,  # Largest by area
    'Wheat':      0.16,
    'Cotton':     0.08,
    'Sugarcane':  0.06,
    'Maize':      0.05,
    'Soybean':    0.04,
    'Groundnut':  0.04,
    'Mustard':    0.03,
    'Gram':       0.05,
    'Arhar':      0.03,
    'Jute':       0.02,
    'Tea':        0.01,  # Rare in overall statistics
    'Coffee':     0.01,
    'Banana':     0.015,
    'Coconut':    0.015,
}
crop_names = list(crops.keys())
crop_probs = list(crops.values())
# Normalize
total = sum(crop_probs)
crop_probs = [p/total for p in crop_probs]
K = len(crop_names)

# Generate synthetic spectral features
np.random.seed(42)
n_samples = 8000
n_features = 20  # Spectral bands + vegetation indices + terrain

y = np.random.choice(K, size=n_samples, p=crop_probs)
X = np.random.randn(n_samples, n_features) * 0.5

# Add crop-specific spectral signatures (one center per crop, vectorized)
spectral_centers = np.random.randn(K, n_features) * 2
X += spectral_centers[y]

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Crop Distribution (train):")
for i, name in enumerate(crop_names):
    count = np.sum(y_train == i)
    print(f"  {name:<12}: {count:>4} ({count/len(y_train)*100:>5.1f}%)")

# ===== Flat Classifier =====
clf_flat = GradientBoostingClassifier(
    n_estimators=200, max_depth=5, random_state=42
)
from sklearn.utils.class_weight import compute_sample_weight
weights = compute_sample_weight('balanced', y_train)
clf_flat.fit(X_train, y_train, sample_weight=weights)
y_pred_flat = clf_flat.predict(X_test)

print("\n===== FLAT CLASSIFIER =====")
print(classification_report(y_test, y_pred_flat,
      target_names=crop_names, digits=3, zero_division=0))

# ===== Hierarchical Classifier =====
# Level 1: Map to broad categories
hierarchy = {
    0: 'Cereal', 1: 'Cereal', 4: 'Cereal',         # Rice, Wheat, Maize
    2: 'Cash', 3: 'Cash', 10: 'Cash',               # Cotton, Sugarcane, Jute
    5: 'Oilseed', 6: 'Oilseed', 7: 'Oilseed',      # Soybean, Groundnut, Mustard
    8: 'Pulse', 9: 'Pulse',                           # Gram, Arhar
    11: 'Plantation', 12: 'Plantation',                # Tea, Coffee
    13: 'Plantation', 14: 'Plantation',                # Banana, Coconut
}
categories = sorted(set(hierarchy.values()))
cat_to_idx = {c: i for i, c in enumerate(categories)}
y_train_cat = np.array([cat_to_idx[hierarchy[yi]] for yi in y_train])
y_test_cat = np.array([cat_to_idx[hierarchy[yi]] for yi in y_test])

# Level 1 model
clf_l1 = GradientBoostingClassifier(n_estimators=100, random_state=42)
clf_l1.fit(X_train, y_train_cat)
y_pred_cat = clf_l1.predict(X_test)
print(f"Level 1 (Category) Accuracy: "
      f"{np.mean(y_pred_cat == y_test_cat):.3f}")

flat_macro = f1_score(y_test, y_pred_flat, average='macro')
print(f"\nFlat Macro-F1: {flat_macro:.4f}")
print("(A hierarchical approach may improve this; "
      "left as an exercise for the student)")

Extension Tasks

Complete the hierarchical classifier (Level 2 classifiers for each category)
Add temporal features (monthly NDVI profiles) and show accuracy improvement
Implement and compare focal loss vs standard cross-entropy
Create a geographic feature (state/district) and test zone-specific models

Mini Project 3: Multi-Class Sentiment Analysis

Objective

Build a 5-class sentiment classifier (Very Negative, Negative, Neutral, Positive, Very Positive) for product reviews, with special attention to the ordinal nature of sentiments.

Key Challenge

Unlike nominal classification, sentiment classes are ordered. Predicting "Very Positive" for a "Very Negative" review is much worse than predicting "Neutral" for a "Negative" review. Use a custom cost matrix and ordinal regression techniques.

Python
# Ordinal-aware evaluation for sentiment analysis
import numpy as np

sentiments = ['Very_Neg', 'Negative', 'Neutral', 'Positive', 'Very_Pos']
K = len(sentiments)

# Ordinal cost matrix: cost proportional to distance
cost_matrix = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        cost_matrix[i][j] = (i - j) ** 2  # Quadratic penalty

print("Ordinal Cost Matrix:")
print(f"{'':>12}", end='')
for s in sentiments:
    print(f"{s:>10}", end='')
print()
for i, s in enumerate(sentiments):
    print(f"{s:>12}", end='')
    for j in range(K):
        print(f"{cost_matrix[i][j]:>10.0f}", end='')
    print()

# Mean Absolute Error (MAE) as ordinal metric
def ordinal_mae(y_true, y_pred):
    """Mean Absolute Error treating classes as ordinal (integer-coded 0..K-1)."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# Quadratic Weighted Kappa — standard for ordinal classification
from sklearn.metrics import cohen_kappa_score
# Use weights='quadratic' for ordinal data
# kappa = cohen_kappa_score(y_true, y_pred, weights='quadratic')
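
The difference between nominal and ordinal evaluation can be seen on a small example. The sketch below uses two hypothetical prediction vectors with the same number of errors: one misses by a single step, the other by several steps. The label arrays are made up for illustration; `cohen_kappa_score(..., weights='quadratic')` is the scikit-learn call referenced in the comment above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical labels on the 0..4 scale (Very_Neg .. Very_Pos)
y_true = np.array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
y_near = np.array([1, 1, 2, 3, 3, 0, 2, 2, 4, 4])  # 4 errors, each off by one
y_far  = np.array([4, 1, 2, 3, 0, 0, 4, 2, 0, 4])  # 4 errors, far from truth

results = {}
for name, y_pred in [("near", y_near), ("far", y_far)]:
    acc = accuracy_score(y_true, y_pred)
    mae = np.mean(np.abs(y_true - y_pred))          # ordinal MAE
    qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
    results[name] = (acc, mae, qwk)
    print(f"{name:>4}: accuracy={acc:.2f}  MAE={mae:.2f}  QWK={qwk:.3f}")
```

Accuracy cannot tell the two classifiers apart, while ordinal MAE and quadratic weighted kappa both penalize the distant errors more heavily.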

SECTION 8.18

End-of-Chapter Exercises

E8.1 Compute softmax([3.0, 1.0, 0.2, -1.0]) by hand. Verify the probabilities sum to 1.

E8.2 Prove that softmax is invariant to adding a constant: softmax(z + c) = softmax(z). Why is this property useful for numerical stability?

E8.3 For K=2 classes, show that softmax reduces to the sigmoid function applied to z₁ - z₀. Write the derivation step-by-step.

E8.4 Given one-hot target y = [0, 0, 1, 0] and predicted probabilities ŷ = [0.1, 0.2, 0.6, 0.1], compute the categorical cross-entropy loss. What would the loss be if the model were perfectly confident (ŷ = [0, 0, 1, 0])?

E8.5 Construct the confusion matrix for the following 4-class predictions:
True: [A, B, C, D, A, A, B, C, D, D, A, B, C, C, D]
Pred: [A, B, C, D, A, B, B, C, A, D, A, C, C, D, D]
Compute precision, recall, and F1 for each class.

E8.6 From the confusion matrix in E8.5, compute Macro-F1, Micro-F1, and Weighted-F1. Verify that Micro-F1 = accuracy.

E8.7 A medical classifier has 3 classes: Healthy (1000 samples), Disease A (50 samples), Disease B (10 samples). Why would accuracy be a terrible metric here? Which metric(s) would you recommend and why?

E8.8 Derive the Jacobian of the softmax function: ∂σ(z)ᵢ/∂zⱼ = σ(z)ᵢ(δᵢⱼ - σ(z)ⱼ). Show all steps.

E8.9 Explain why SMOTE must be applied only to the training set and never before splitting. What error occurs if you apply SMOTE to the entire dataset first?

E8.10 For a 5-class problem, how many binary classifiers are needed for OvR? For OvO? If training one classifier takes 10 seconds, how long does each strategy take?

E8.11 A classifier has the following confusion matrix:
[[90, 5, 5], [10, 80, 10], [3, 7, 90]]
Compute Cohen's Kappa. Is this classifier "substantially" better than random?

E8.12 Explain the difference between macro-averaging and micro-averaging with an example where they give very different F1 scores.

E8.13 What is the "temperature" parameter in softmax? Given softmax_τ(z)ᵢ = e^{zᵢ/τ}/Σe^{zⱼ/τ}, what happens as τ → 0? As τ → ∞?

E8.14 Implement a function to compute the Matthews Correlation Coefficient for a 3×3 confusion matrix. Test it on the matrix from E8.11.

E8.15 A bank's fraud detection system classifies transactions into: Legitimate, Suspicious, Fraudulent. The cost of missing a Fraudulent transaction is 100× the cost of a false alarm. Design a cost matrix and explain how you would use it in training.

E8.16 Plot the ROC curve for the following binary classifier outputs (by hand or code):
True: [1,1,0,1,0,0,1,0,1,0]
Score: [0.9,0.8,0.7,0.6,0.55,0.4,0.3,0.2,0.85,0.15]
Compute the AUC.

E8.17 When should you use a Precision-Recall curve instead of an ROC curve? Give two real-world scenarios where PR curves are more informative.

E8.18 Implement label smoothing from scratch: given ε=0.1 and K=5, transform the one-hot vector [0,0,1,0,0] into a smoothed vector. Compute the cross-entropy loss with and without smoothing.

E8.19 Design a hierarchical classification scheme for classifying 20 types of Indian cuisine dishes. Define the hierarchy and explain why hierarchical classification might outperform flat classification here.

E8.20 A model outputs the following logits for 3 classes: z = [5.0, 5.0, 5.0]. What is the softmax output? What does this mean about the model's confidence? Now compute softmax([5.0, 5.0, 100.0]) — what changes?

E8.21 Implement the Top-K accuracy metric from scratch. Test it with K=1,3,5 on a 10-class classification problem.

E8.22 [Research-level] The Focal Loss modifies cross-entropy as FL(pₜ) = -(1-pₜ)^γ log(pₜ). Derive the gradient of focal loss with respect to the logits. For what value of γ does focal loss reduce to standard cross-entropy?

E8.23 [Programming] Write a Python function that takes a confusion matrix of any size K×K and generates a complete classification report (matching sklearn's output format) without using any sklearn functions.

SECTION 8.19

Multiple Choice Questions

Q1. What does the softmax function guarantee about its output?

(a) All values are between -1 and 1
(b) All values are between 0 and 1 and sum to 1
(c) All values are positive integers
(d) The maximum value equals 1

Answer: (b) — Softmax maps logits to a valid probability distribution: every output is in (0,1) and the sum is exactly 1. This is what makes it suitable as the output layer for multi-class classification.

Q2. In a single-label multi-class problem, Micro-F1 equals:

(a) Macro-F1
(b) Weighted-F1
(c) Overall accuracy
(d) Average precision

Answer: (c) — In single-label multi-class (each sample has exactly one label), every misclassification is a FP for one class and FN for another, making Σ FPₖ = Σ FNₖ. Therefore Micro-Precision = Micro-Recall = Accuracy = Micro-F1.

Q3. For K classes, how many binary classifiers does One-vs-One (OvO) require?

(a) K
(b) K - 1
(c) K(K-1)/2
(d) K²

Answer: (c) — OvO creates one classifier for each unique pair of classes. The number of ways to choose 2 from K classes is C(K,2) = K(K-1)/2. For K=10, that's 45 classifiers.

Q4. Cohen's Kappa = 0 indicates:

(a) Perfect agreement
(b) No agreement beyond chance
(c) Complete disagreement
(d) The classifier has 50% accuracy

Answer: (b) — κ = 0 means the observed agreement equals the expected agreement by chance (p₀ = pₑ). The classifier is no better than random guessing. κ = 1 means perfect agreement; κ < 0 means worse than chance.

Q5. SMOTE generates synthetic samples by:

(a) Duplicating existing minority samples
(b) Interpolating between minority samples and their nearest neighbors
(c) Adding Gaussian noise to majority samples
(d) Removing majority samples near the decision boundary

Answer: (b) — SMOTE (Synthetic Minority Over-sampling Technique) creates new samples by picking a minority sample, finding its k nearest minority neighbors, and generating a point along the line segment connecting them. This creates realistic synthetic examples in the feature space.

Q6. A random classifier (AUC = 0.5) on an ROC plot would appear as:

(a) A horizontal line at TPR = 0.5
(b) A vertical line at FPR = 0.5
(c) The diagonal line from (0,0) to (1,1)
(d) The point (0.5, 0.5)

Answer: (c) — A random classifier's ROC curve follows the diagonal from (0,0) to (1,1), meaning at every threshold, TPR = FPR. The area under this diagonal is exactly 0.5.

Q7. When is Precision-Recall curve preferred over ROC curve?

(a) When classes are balanced
(b) When the positive class is very rare (class imbalance)
(c) When we have more than 2 classes
(d) When the model uses softmax

Answer: (b) — ROC curves can give an overly optimistic picture when the negative class greatly outnumbers the positive class, because FPR stays low even with many false positives. PR curves focus on the positive class and are more informative for imbalanced datasets (e.g., fraud detection with 0.1% positive rate).

Q8. The gradient of categorical cross-entropy loss w.r.t. logits z_k (with softmax output) is:

(a) y_k · log(ŷ_k)
(b) ŷ_k - y_k
(c) y_k - ŷ_k
(d) ŷ_k · (1 - ŷ_k)

Answer: (b) — ∂L/∂zₖ = ŷₖ - yₖ (predicted minus true). This elegantly simple gradient is one of the beautiful results of combining softmax with cross-entropy. It has the same form as the sigmoid + binary CE gradient but operates in K dimensions.
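
This result is easy to confirm numerically. The following sketch compares ŷ - y against a central finite-difference estimate of the gradient; the logits, the target, and the helper names `softmax` and `cce` are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax for a 1-D logit vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cce(z, y):
    """Categorical cross-entropy of softmax(z) against one-hot y."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5, 0.1])   # arbitrary logits
y = np.array([0.0, 0.0, 1.0, 0.0])    # true class is index 2

analytic = softmax(z) - y             # claimed gradient: y_hat - y

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    dz = np.zeros_like(z)
    dz[k] = eps
    numeric[k] = (cce(z + dz, y) - cce(z - dz, y)) / (2 * eps)

print("analytic:", np.round(analytic, 6))
print("numeric: ", np.round(numeric, 6))
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```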

Q9. Label smoothing with ε=0.1 for K=4 classes transforms [0, 1, 0, 0] to:

(a) [0.025, 0.925, 0.025, 0.025]
(b) [0.1, 0.9, 0.1, 0.1]
(c) [0.033, 0.9, 0.033, 0.033]
(d) [0.05, 0.85, 0.05, 0.05]

Answer: (a) — Label smoothing: y_smooth = y · (1-ε) + ε/K. For the true class: 1·(1-0.1) + 0.1/4 = 0.9 + 0.025 = 0.925. For other classes: 0·(1-0.1) + 0.1/4 = 0 + 0.025 = 0.025. Sum: 0.925 + 3×0.025 = 1.0 ✓

Q10. MCC (Matthews Correlation Coefficient) ranges from:

(a) 0 to 1
(b) -1 to 1
(c) 0 to ∞
(d) -∞ to ∞

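Answer: (b) — MCC ranges from -1 through 0 (random prediction) to +1 (perfect prediction). In the binary case, -1 means every prediction is wrong; with more than two classes the minimum lies between -1 and 0, depending on the class distribution. It is considered one of the most balanced metrics because it accounts for all four quadrants of the binary confusion matrix, and its multi-class generalization uses every cell of the K×K matrix.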

Q11. ImageNet's Top-5 accuracy means:

(a) The model achieves at least 5% accuracy
(b) The correct label appears among the model's top 5 predictions
(c) Only the top 5 performing classes are evaluated
(d) The model is trained on only 5 classes

Answer: (b) — Top-5 accuracy gives credit if the true class appears anywhere in the model's 5 most confident predictions. This is more lenient than Top-1 and accounts for genuinely ambiguous images in the 1000-class ImageNet dataset.

SECTION 8.20

Interview Questions

IQ1. How would you evaluate a multi-class classifier deployed in production? Walk me through your approach.

Answer Framework:

Offline evaluation: Start with a held-out test set with the same distribution as production. Compute a confusion matrix to understand per-class performance. Report Macro-F1 (if all classes matter equally) or Weighted-F1 (if proportional importance).
Beyond accuracy: Check Cohen's Kappa (is the model better than random?), MCC (balanced single metric), and class-specific precision/recall for critical classes.
Threshold-free metrics: ROC-AUC (Macro OvR) for overall discriminative ability. PR-AUC for rare classes.
Business metrics: Map ML metrics to business KPIs. E.g., for spam detection: "What percentage of legitimate emails are misclassified?" (FPR for legitimate class).
Online monitoring: Track per-class accuracy over time windows, detect distribution shift, set up alerts when minority class recall drops below a threshold.
Fairness: Check if performance varies across demographic subgroups (equalized odds, demographic parity).

IQ2. Explain the difference between softmax and sigmoid for multi-class. When would you use each?

Key Distinction:

Softmax: Outputs K probabilities that sum to 1. Used for mutually exclusive multi-class (each sample belongs to exactly one class). E.g., digit recognition (a digit is exactly one of 0-9).
K independent sigmoids: Each output is between 0 and 1 independently; they do NOT sum to 1. Used for multi-label classification (each sample can belong to multiple classes). E.g., movie genres (a movie can be both "Action" AND "Comedy").

Loss functions: Softmax → Categorical Cross-Entropy (sparse or one-hot). Sigmoid → Binary Cross-Entropy applied to each label independently.
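
The distinction shows up directly in the outputs. A minimal sketch, using arbitrary logits chosen for illustration:

```python
import numpy as np

def softmax(z):
    """Stable softmax: outputs compete and sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    """Element-wise sigmoid: each output is an independent probability."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 1.0, -1.0])   # illustrative values

p_soft = softmax(logits)
p_sig = sigmoid(logits)

print("softmax :", np.round(p_soft, 3), " sum =", round(float(p_soft.sum()), 3))
print("sigmoids:", np.round(p_sig, 3), " sum =", round(float(p_sig.sum()), 3))

# Multi-class decision: pick the single most probable class
print("softmax prediction:", int(np.argmax(p_soft)))
# Multi-label decision: keep every label above a threshold (0.5 assumed here)
print("sigmoid predictions:", np.where(p_sig > 0.5)[0].tolist())
```

Raising one logit lowers the other softmax probabilities, whereas each sigmoid output depends only on its own logit.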

IQ3. Your model has 95% accuracy but the Macro-F1 is 0.45. What's going on?

Diagnosis: This is a classic symptom of class imbalance. The model achieves high accuracy by correctly predicting the majority class(es) but performs poorly on minority classes.

Example: 3-class problem with 90% Class A, 5% Class B, 5% Class C. A model that always predicts Class A gets 90% accuracy. But F1 for B and C would be 0, so Macro-F1 = (F1_A + 0 + 0)/3 ≈ 0.32.

Solutions: (1) Class weights in loss function, (2) SMOTE/oversampling, (3) Focal loss, (4) Collect more minority class data, (5) Evaluate with Macro-F1 or per-class metrics, not accuracy.
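
The always-predict-A example above can be reproduced in a few lines with scikit-learn. The label arrays are synthetic and match the 90/5/5 split in the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 90% Class A (0), 5% Class B (1), 5% Class C (2)
y_true = np.array([0] * 90 + [1] * 5 + [2] * 5)
y_pred = np.zeros_like(y_true)        # degenerate model: always predict A

acc = accuracy_score(y_true, y_pred)
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy     : {acc:.3f}")
print(f"per-class F1 : {np.round(per_class, 3)}")
print(f"macro-F1     : {macro:.3f}")
```

Passing `zero_division=0` makes the undefined precision of the never-predicted classes count as 0 instead of raising a warning.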

IQ4. How does SMOTE work? What are its limitations?

How SMOTE works:

For each minority class sample, find k nearest neighbors within the same class
Randomly select one neighbor
Generate a new synthetic sample on the line segment connecting them: x_new = x + λ·(x_neighbor - x), where λ ∈ [0,1]
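
The three steps above can be sketched from scratch as follows. This is a simplified illustration, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters, and the random minority data are all assumptions, and it presumes the minority set contains no duplicate points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolation."""
    rng = np.random.default_rng(seed)
    # Step 1: k nearest neighbors within the minority class.
    # Ask for k + 1 because each point's nearest neighbor is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    # Step 2: pick a base sample and one of its k neighbors at random.
    base = rng.integers(0, len(X_min), size=n_new)
    neighbor = idx[base, rng.integers(1, k + 1, size=n_new)]
    # Step 3: x_new = x + lambda * (x_neighbor - x), lambda in [0, 1).
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[neighbor] - X_min[base])

# Hypothetical minority class: 20 points in 2-D
rng = np.random.default_rng(42)
X_min = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(20, 2))
X_new = smote_sketch(X_min, n_new=30, k=5)
print("synthetic samples:", X_new.shape)
```

Because every synthetic point is a convex combination of two real minority points, it always lies inside the region already spanned by the minority class.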

Limitations:

Generates in feature space: May create unrealistic samples (e.g., a face that is half-male, half-female)
Sensitive to noisy data: If a minority sample is noisy/outlier, SMOTE generates more noise around it
Doesn't work well with high dimensions: Distance metrics become unreliable in high-D (curse of dimensionality)
Must be inside CV: Applying SMOTE before splitting causes data leakage

Alternatives: Borderline-SMOTE (only generate near decision boundary), ADASYN (density-adaptive), Class Weights (no synthetic data), Focal Loss, Data Augmentation (for images/text).

IQ5. When would you prefer Precision-Recall AUC over ROC-AUC?

Use PR-AUC when the positive class is rare (as a rough guide, fewer than 1 positive per 10 negatives). ROC-AUC can appear high even for poor classifiers on imbalanced data because FPR stays low when TN is very large.

Example: Fraud detection (0.1% fraud rate). A model that flags 1% of transactions as fraud might have: ROC-AUC = 0.95 (looks great!) but PR-AUC = 0.30 (actually poor — most flagged transactions are false positives).

Rule of thumb: If stakeholders care about "Of the items you flagged, how many are actually positive?" (precision), use PR curves. If they care about "Of all actual positives, how many did you find?" (recall) balanced against false alarm rate, use ROC.
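
The gap between the two metrics can be reproduced on synthetic data. The sketch below assumes Gaussian scores with a 0.1% positive rate; the distributions and sample sizes are illustrative, so the exact numbers will differ from the fraud example above.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n_neg, n_pos = 100_000, 100           # 0.1% positive rate

# Positives score higher on average, but the distributions overlap
scores = np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=n_neg),
    rng.normal(loc=2.0, scale=1.0, size=n_pos),
])
y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

roc_auc = roc_auc_score(y, scores)
pr_auc = average_precision_score(y, scores)   # summary of the PR curve

print(f"ROC-AUC: {roc_auc:.3f}")
print(f"PR-AUC (average precision): {pr_auc:.3f}")
print(f"baseline PR-AUC of a random scorer: {n_pos / (n_pos + n_neg):.4f}")
```

The same scores give a strong ROC-AUC and a much weaker PR-AUC, because the small tail of high-scoring negatives still outnumbers the positives.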

IQ6. How do you extend binary classification metrics (precision, recall) to multi-class?

Step 1: Binarize — For each class k, create a "class k vs rest" view from the K×K confusion matrix. Extract TP_k, FP_k, FN_k.

Step 2: Compute per-class metrics: Precision_k, Recall_k, F1_k for each class independently.

Step 3: Aggregate using one of three strategies:

Macro: Simple mean of per-class metrics. Treats all classes equally. Best when all classes are equally important.
Micro: Pool all TP/FP/FN globally, then compute. In single-label multi-class, equals accuracy.
Weighted: Weighted mean by class support (number of true samples). Accounts for class size.
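
The three aggregation strategies can be compared side by side with scikit-learn's `precision_recall_fscore_support`. The labels below are a small hypothetical 3-class example with supports of 6, 3, and 1.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical predictions: the model over-predicts the majority class 0
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0])

f1 = {}
for avg in ("macro", "micro", "weighted"):
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    f1[avg] = f
    print(f"{avg:>8}: precision={p:.3f}  recall={r:.3f}  F1={f:.3f}")

acc = accuracy_score(y_true, y_pred)
print(f"accuracy: {acc:.3f}")
```

Macro-F1 is pulled down by the two minority classes, while Micro-F1 matches accuracy, as expected for single-label problems.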

IQ7. A classifier's Cohen's Kappa is 0.35. Is this good or bad? How would you interpret this?

κ = 0.35 is in the "fair agreement" range (0.21-0.40 on the Landis & Koch scale). It means the classifier performs better than chance, but only modestly so.

Context matters:

For a spam filter: κ = 0.35 is terrible — the filter would annoy users constantly
For predicting psychiatric diagnoses (where even human experts disagree): κ = 0.35 might be reasonable
For OCR on clean printed text: κ = 0.35 is unacceptably low

Action: Investigate the confusion matrix to see which specific class pairs are causing confusion, then target those with more training data or feature engineering.

IQ8. How would you handle a 1000-class classification problem? Any special techniques?

Challenges at K=1000: (1) Computational cost, (2) Class imbalance is almost guaranteed, (3) Many visually similar classes.

Techniques:

Hierarchical Softmax: Group 1000 classes into a tree; predict path down the tree. Reduces K-way decision to O(log K) binary decisions.
Sampled Softmax: During training, only compute softmax over the true class + a random subset. Used in language models with 50K+ word vocabularies.
Top-K evaluation: Report Top-5 accuracy, not just Top-1.
Embedding-based: Map both inputs and class labels to a shared embedding space; predict by nearest neighbor in embedding space.
Mixture of experts: Route inputs to specialized sub-networks.

IQ9. Explain cost-sensitive learning. How do you implement it?

Concept: Different types of misclassification have different real-world costs. Missing a cancer diagnosis (FN) is far worse than a false alarm (FP). Cost-sensitive learning incorporates these asymmetric costs into model training.

Implementation approaches:

Class weights in loss: Multiply each sample's loss by its class weight. In sklearn: class_weight='balanced' or custom dict
Sample weights: Assign individual sample weights based on a cost matrix
Cost-aware loss function: Modify cross-entropy: L = -Σ cₖ · yₖ · log(ŷₖ) where cₖ is the cost for class k
Post-hoc threshold tuning: After training, adjust decision thresholds per class to minimize expected cost
Oversampling the high-cost class: Replicate samples from expensive-to-miss classes
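
As one illustration of the post-hoc approach, the sketch below picks the class with the lowest expected cost instead of the highest probability. The cost matrix and the predicted probabilities are hypothetical values chosen for the example.

```python
import numpy as np

# cost[i, j] = cost of predicting class j when the true class is i.
# Classes: 0 = healthy, 1 = disease. A missed disease costs 20x a false alarm.
cost = np.array([[0.0, 1.0],
                 [20.0, 0.0]])

# Hypothetical predicted probabilities for three patients
proba = np.array([[0.90, 0.10],
                  [0.97, 0.03],
                  [0.40, 0.60]])

# Expected cost of each possible prediction: sum_i P(true = i) * cost[i, j]
expected_cost = proba @ cost

pred_argmax = proba.argmax(axis=1)        # cost-blind decision
pred_cost = expected_cost.argmin(axis=1)  # cost-sensitive decision

print("expected cost per prediction:\n", expected_cost)
print("argmax decision :", pred_argmax)
print("min-cost decision:", pred_cost)
```

The first patient is flagged despite only a 10% disease probability, because a miss is far more expensive than a false alarm. The same matrix product works unchanged for a K×K cost matrix.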

IQ10. What is the Matthews Correlation Coefficient and why is it considered better than accuracy?

MCC is a correlation coefficient between the observed and predicted classifications. It ranges from -1 to +1, with 0 indicating random prediction.

Why better than accuracy:

MCC uses all four confusion matrix quadrants (TP, TN, FP, FN) in a balanced way
It returns a high value only if the classifier performs well on both positive and negative classes
It is informative even when classes are heavily imbalanced
A random classifier always gets MCC ≈ 0, regardless of class distribution

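Example: On a dataset with 95% negative and 5% positive: an "always negative" classifier gets 95% accuracy but MCC = 0 (by convention, since the denominator of the MCC formula is zero when only one class is ever predicted). This correctly reveals the classifier is useless.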
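
The example above can be checked with scikit-learn. The labels are synthetic and match the 95/5 split; `matthews_corrcoef` returns 0 when the classifier predicts only one class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

y_true = np.array([0] * 95 + [1] * 5)   # 95% negative, 5% positive
y_pred = np.zeros_like(y_true)          # "always negative" classifier

acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)

print(f"accuracy: {acc:.2f}")
print(f"MCC     : {mcc:.2f}")
```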

Reference: Chicco & Jurman (2020) "The advantages of the Matthews correlation coefficient over F1 score and accuracy in binary classification evaluation"

SECTION 8.21

Research Problems

Research Problem 1: Calibration of Multi-Class Classifiers

Background: A softmax output of 0.7 for class k should mean "this is class k with 70% probability." But deep networks are notoriously miscalibrated — they tend to be overconfident. Guo et al. (2017) showed that modern neural networks are more miscalibrated than their predecessors.

Research Question: Develop a calibration method for multi-class neural networks that preserves ranking (i.e., if the model is more confident about class A than B, this ordering should be maintained after calibration) while improving Expected Calibration Error (ECE) across all classes simultaneously.

Suggested Approach: Extend temperature scaling to class-specific temperatures. Standard temperature scaling uses a single temperature τ for all classes, which leaves the ranking of classes unchanged; investigate whether per-class temperatures τₖ improve calibration, especially for minority classes, and under what constraints they still preserve ranking.
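
As a starting point, the single-temperature baseline can be sketched in NumPy. Everything here is an assumption for illustration: the logits are synthetic and deliberately overconfident, and τ is fitted by a simple grid search on the negative log-likelihood rather than by a gradient-based optimizer.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Row-wise stable softmax with temperature tau."""
    z = logits / tau
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, y, tau):
    """Mean negative log-likelihood of the true labels at temperature tau."""
    p = softmax(logits, tau)
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

# Synthetic validation set: informative logits, then inflated 4x
rng = np.random.default_rng(0)
n, K = 2000, 5
y = rng.integers(0, K, size=n)
logits = rng.normal(size=(n, K))
logits[np.arange(n), y] += 1.5   # signal on the true class
logits *= 4.0                    # makes the model overconfident

# Fit a single temperature on the validation set by grid search
taus = np.linspace(0.5, 10.0, 96)
losses = np.array([nll(logits, y, t) for t in taus])
best_tau = taus[np.argmin(losses)]

print(f"NLL at tau=1      : {nll(logits, y, 1.0):.4f}")
print(f"best tau          : {best_tau:.2f}")
print(f"NLL at best tau   : {losses.min():.4f}")
```

Dividing all logits by one positive τ never changes the argmax, so accuracy is untouched; a per-class extension would have to show that it keeps this property.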

Datasets: CIFAR-100, ImageNet-1K, medical image classification datasets

Research Problem 2: Multi-Class Classification with Reject Option

Motivation: In safety-critical applications (medical diagnosis, autonomous driving), a classifier should be able to say "I don't know" when uncertain, rather than making a potentially dangerous incorrect prediction.

Research Question: Design a multi-class classification framework that incorporates a (K+1)-th "reject/abstain" class, optimizing for maximum accuracy on accepted samples while minimizing the reject rate. How should the cost of rejection relate to the costs of misclassification?

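Hint: Chow's reject-option rule (1970) gives the optimal error-reject trade-off when all misclassifications cost the same: reject whenever the largest posterior probability falls below a threshold. Extend this to multi-class problems with non-uniform misclassification costs.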
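
A simple baseline for this problem thresholds the largest predicted probability, in the spirit of Chow's rule. In the sketch below, the function name `predict_with_reject`, the threshold of 0.7, and the reject label -1 are illustrative assumptions, and uniform misclassification costs are assumed.

```python
import numpy as np

def predict_with_reject(proba, threshold=0.7, reject_label=-1):
    """Return the argmax class, or reject_label when the model is unsure.

    proba: array of shape (n_samples, n_classes) with predicted probabilities.
    """
    proba = np.asarray(proba)
    pred = proba.argmax(axis=1)
    # Abstain whenever the top probability falls below the threshold
    pred[proba.max(axis=1) < threshold] = reject_label
    return pred

# Hypothetical probabilities for three samples of a 3-class problem
proba = np.array([[0.85, 0.10, 0.05],   # confident: class 0
                  [0.40, 0.35, 0.25],   # ambiguous: should be rejected
                  [0.05, 0.15, 0.80]])  # confident: class 2

pred = predict_with_reject(proba)
accepted = pred != -1
print("predictions:", pred)
print(f"reject rate: {1 - accepted.mean():.2f}")
```

Raising the threshold trades a higher reject rate for higher accuracy on the accepted samples, which is the trade-off the research question asks you to optimize.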

Research Problem 3: Evaluation Metrics for Long-Tailed Distributions

Background: Real-world class distributions are often long-tailed: a few "head" classes dominate, while many "tail" classes have very few samples. Standard metrics (even Macro-F1) may not adequately capture performance on tail classes.

Research Question: Propose a new evaluation metric specifically designed for long-tailed multi-class classification that (a) is sensitive to tail-class performance, (b) is robust to the number of tail classes, and (c) has a clear probabilistic interpretation. Compare it against Macro-F1, the Geometric Mean of per-class recalls, and Balanced Accuracy (the arithmetic mean of per-class recalls).

Indian Context: Indian wildlife species classification (WILDBOOK India dataset) where common species (peacock, langur) have 10,000+ images while endangered species (snow leopard, red panda) have <50 images. How should we evaluate classifiers on such data?
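The baseline metrics named in Research Problem 3 can all be computed from a confusion matrix. The sketch below uses invented counts (a head class with 1,000 samples, a mid class with 100, a tail class with 10) and hypothetical function names; it assumes every class has at least one true sample so that recall is defined.

```python
import numpy as np

def per_class_recall(cm):
    """Rows are true classes, columns are predicted classes."""
    return np.diag(cm) / cm.sum(axis=1)

def balanced_accuracy(cm):
    """Arithmetic mean of per-class recalls."""
    return float(per_class_recall(cm).mean())

def gmean_recall(cm):
    """Geometric mean of per-class recalls; zero if any class is missed entirely."""
    r = per_class_recall(cm)
    return float(np.prod(r) ** (1.0 / len(r)))

def macro_f1(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = 2 * tp + fp + fn
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    return float(f1.mean())

# Hypothetical classifier: perfect on the head class, blind to the tail class.
cm = np.array([[1000,  0, 0],
               [  10, 90, 0],
               [  10,  0, 0]])

accuracy = np.trace(cm) / cm.sum()
print(f"accuracy          {accuracy:.3f}")
print(f"balanced accuracy {balanced_accuracy(cm):.3f}")
print(f"macro-F1          {macro_f1(cm):.3f}")
print(f"G-mean of recalls {gmean_recall(cm):.3f}")
```

Accuracy stays above 0.98 while the geometric mean collapses to zero, which illustrates how differently these baselines react to a single ignored tail class.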

SECTION 8.22

Key Takeaways

Softmax is the multi-class sigmoid. It converts K logits into K probabilities summing to 1, derived naturally from the log-linear model and maximum entropy principle. Always use the numerically stable version (subtract max logit).
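The effect of subtracting the max logit is easy to see on large inputs. The following is a small NumPy sketch with hypothetical function names; the logit values are chosen only to force overflow in the naive version.

```python
import numpy as np

def softmax_naive(z):
    """Direct definition; overflows once any logit is large."""
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(z):
    """Subtract the max logit first; the result is mathematically identical."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

big = np.array([1000.0, 1001.0, 1002.0])

# exp(1000) overflows to inf, so the naive ratio becomes inf / inf = nan.
with np.errstate(over="ignore", invalid="ignore"):
    broken = softmax_naive(big)

stable = softmax_stable(big)
print(broken)   # all nan
print(stable)   # same probabilities as softmax of [0, 1, 2]
```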

Categorical cross-entropy is negative log-likelihood. For a one-hot target it reduces to -log(ŷ_c), where c is the true class. Combined with softmax, its gradient with respect to the logits is beautifully simple: ŷ - y (predicted minus true).
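The ŷ - y result can be checked numerically with central finite differences. This is a minimal sketch for a single example; the logits and the true class index are arbitrary values picked for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, c):
    """Loss for one example with logits z and true class index c."""
    return -np.log(softmax(z)[c])

def analytic_grad(z, c):
    """Gradient of the loss with respect to the logits: y_hat - y."""
    y = np.zeros_like(z)
    y[c] = 1.0
    return softmax(z) - y

def numeric_grad(z, c, eps=1e-6):
    """Central finite differences, one logit at a time."""
    g = np.zeros_like(z)
    for k in range(len(z)):
        d = np.zeros_like(z)
        d[k] = eps
        g[k] = (cross_entropy(z + d, c) - cross_entropy(z - d, c)) / (2 * eps)
    return g

z = np.array([2.0, -1.0, 0.5])   # arbitrary logits
c = 0                            # arbitrary true class
print(analytic_grad(z, c))
print(numeric_grad(z, c))
```

The two vectors agree to within finite-difference error, and the gradient components sum to zero because both ŷ and y sum to one.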

Never trust accuracy alone in multi-class settings. Use the confusion matrix to understand per-class performance. Report Macro-F1 when all classes matter, Weighted-F1 when class prevalence matters, and always check minority class recall.

ROC-AUC for overall discrimination, PR-AUC for imbalanced classes. ROC curves can be deceptively optimistic with large negative classes. Precision-Recall curves tell the true story for rare positive classes.

Class imbalance is the rule, not the exception. Always check class distributions first. Counter imbalance with class weights, focal loss, or SMOTE applied inside cross-validation, to the training folds only. Never apply SMOTE to the full dataset before splitting: synthetic points built from samples that later land in the test set leak information into training.
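The leakage mechanism can be made visible with a toy experiment. The sketch below deliberately swaps SMOTE for plain random oversampling (duplicating minority rows) so that it needs only NumPy; the split-then-resample ordering it demonstrates is the same. Row IDs stand in for features, and all names and sizes are illustrative.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate minority rows until every class matches the largest one.
    (A simple stand-in for SMOTE, which would interpolate new points.)"""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        extra = rng.choice(members, size=target - n, replace=True)
        idx.append(np.concatenate([members, extra]))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

def split_indices(n, test_frac, rng):
    perm = rng.permutation(n)
    n_test = int(round(n * test_frac))
    return perm[n_test:], perm[:n_test]

rng = np.random.default_rng(0)
n = 100
X = np.arange(n).reshape(-1, 1)      # each row's "feature" is its own ID
y = np.array([0] * 90 + [1] * 10)    # 90/10 imbalance

# Correct order: split first, then resample the training part only.
train, test = split_indices(n, 0.25, rng)
X_tr, y_tr = random_oversample(X[train], y[train], rng)
overlap_ok = np.intersect1d(X_tr[:, 0], X[test][:, 0]).size

# Leaky order: resample everything, then split.
X_all, y_all = random_oversample(X, y, rng)
train2, test2 = split_indices(len(y_all), 0.25, rng)
overlap_leaky = np.intersect1d(X_all[train2, 0], X_all[test2, 0]).size

print("rows shared by train and test (split first):   ", overlap_ok)
print("rows shared by train and test (resample first):", overlap_leaky)
```

With real SMOTE the shared rows would be near-duplicates rather than exact copies, but the optimistic bias in the test score arises the same way.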

The choice of metric is a business decision. Precision matters when false positives are costly (spam filter: don't move good emails to spam). Recall matters when false negatives are costly (cancer screening: don't miss any cancer). When a single balanced number is needed, MCC is a strong choice because it uses every cell of the confusion matrix.

OvR is simpler; OvO tends to handle imbalance better, because each pairwise problem compares only two classes instead of one class against all the rest. OvR needs K classifiers, OvO needs K(K-1)/2. For most practical purposes, native softmax output (multinomial logistic regression, neural networks) eliminates the need for either strategy.

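Cohen's Kappa and MCC correct for chance agreement. κ = 0 means "no better than chance agreement," which accuracy alone cannot reveal. MCC uses every cell of the confusion matrix and is high only when the classifier does well on all classes. It is undefined when every prediction (or every true label) falls in a single class, and is conventionally reported as 0 in that case.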
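Both statistics follow directly from the confusion matrix. The sketch below is a hedged NumPy version with hypothetical function names: it uses the standard multi-class generalization of MCC, and returning 0 when the denominator vanishes is a convention adopted here, not a mathematical necessity. The counts are invented.

```python
import numpy as np

def cohen_kappa(cm):
    """kappa = (p_o - p_e) / (1 - p_e); rows = true, columns = predicted."""
    cm = np.asarray(cm, dtype=float)
    s = cm.sum()
    p_o = np.trace(cm) / s
    p_e = (cm.sum(axis=1) @ cm.sum(axis=0)) / s**2
    if p_e == 1.0:          # single class everywhere: undefined, report 0
        return 0.0
    return (p_o - p_e) / (1.0 - p_e)

def multiclass_mcc(cm):
    """Multi-class MCC computed from true and predicted class totals."""
    cm = np.asarray(cm, dtype=float)
    s = cm.sum()
    c = np.trace(cm)
    t = cm.sum(axis=1)      # true count per class
    p = cm.sum(axis=0)      # predicted count per class
    num = c * s - t @ p
    den = np.sqrt((s**2 - p @ p) * (s**2 - t @ t))
    return 0.0 if den == 0 else num / den

# A classifier that always predicts the majority class: 90% accuracy.
majority = np.array([[90, 0],
                     [10, 0]])
# A classifier that is 80% accurate on balanced data.
balanced = np.array([[40, 10],
                     [10, 40]])

for name, cm in [("majority", majority), ("balanced", balanced)]:
    acc = np.trace(cm) / cm.sum()
    print(name, acc, cohen_kappa(cm), multiclass_mcc(cm))
```

The majority-class predictor scores 0.9 on accuracy but 0 on both chance-corrected metrics, while the less "accurate" balanced classifier scores 0.6 on both.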

Threshold tuning is where ML meets business. The default decision rule (argmax over classes, or a 0.5 cutoff in one-vs-rest settings) is rarely optimal when error costs are unequal. Tune per-class thresholds based on the business cost of each type of error. A disease screening system should use a low threshold (high recall), while a user-facing content recommendation should use a high threshold (high precision).
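A cost-driven threshold search for one class (treated one-vs-rest) can be sketched as below. The validation scores, labels, and cost ratios are made up for illustration, and `pick_threshold` is a hypothetical helper rather than a library function.

```python
import numpy as np

def pick_threshold(scores, is_positive, cost_fp, cost_fn, grid=None):
    """Return (threshold, total cost) minimizing cost on validation data.
    scores: predicted probability for the class of interest.
    is_positive: boolean array, True where the sample belongs to that class."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best_t, best_cost = None, np.inf
    for t in grid:
        flagged = scores >= t
        fp = np.sum(flagged & ~is_positive)
        fn = np.sum(~flagged & is_positive)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:          # ties keep the lower threshold
            best_t, best_cost = float(t), float(cost)
    return best_t, best_cost

# Hypothetical validation scores for one class.
scores = np.array([0.12, 0.22, 0.31, 0.37, 0.42, 0.61, 0.72, 0.83, 0.91])
is_positive = np.array([0, 0, 1, 0, 1, 0, 1, 1, 1], dtype=bool)

# Screening: a missed case costs 10x a false alarm.
t_screen, _ = pick_threshold(scores, is_positive, cost_fp=1, cost_fn=10)
# Recommendation: a bad suggestion costs 10x a missed one.
t_recsys, _ = pick_threshold(scores, is_positive, cost_fp=10, cost_fn=1)

print("screening threshold:     ", round(t_screen, 2))
print("recommendation threshold:", round(t_recsys, 2))
```

The same scores yield a threshold below 0.5 for the screening cost ratio and above 0.5 for the recommendation cost ratio. The threshold should always be chosen on validation data and then confirmed on a separate test set.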

SECTION 8.23

References & Further Reading

Foundational Papers

Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems." Annals of Eugenics, 7(2), 179-188.
Bridle, J. (1989). "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters." NIPS.
Chawla, N. V. et al. (2002). "SMOTE: Synthetic Minority Over-sampling Technique." JAIR, 16, 321-357.
Lin, T.-Y. et al. (2017). "Focal Loss for Dense Object Detection." ICCV.
Guo, C. et al. (2017). "On Calibration of Modern Neural Networks." ICML.
Szegedy, C. et al. (2016). "Rethinking the Inception Architecture for Computer Vision." CVPR.
Chicco, D. & Jurman, G. (2020). "The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation." BMC Genomics, 21, 6.
Matthews, B. W. (1975). "Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme." Biochimica et Biophysica Acta, 405(2), 442-451.

ImageNet & Multi-Class Benchmarks

Russakovsky, O. et al. (2015). "ImageNet Large Scale Visual Recognition Challenge." IJCV, 115(3), 211-252.
LeCun, Y. et al. (1998). "Gradient-Based Learning Applied to Document Recognition." Proceedings of the IEEE, 86(11), 2278-2324.
He, K. et al. (2016). "Deep Residual Learning for Image Recognition." CVPR.

Evaluation Metrics

Cohen, J. (1960). "A Coefficient of Agreement for Nominal Scales." Educational and Psychological Measurement, 20(1), 37-46.
Landis, J. R. & Koch, G. G. (1977). "The Measurement of Observer Agreement for Categorical Data." Biometrics, 33(1), 159-174.
Sokolova, M. & Lapalme, G. (2009). "A Systematic Analysis of Performance Measures for Classification Tasks." Information Processing & Management, 45(4), 427-437.

Indian AI & ML Context

ISRO (2022). "Land Use / Land Cover Atlas of India (LULC50K)." National Remote Sensing Centre.
ICAR-IARI (2023). "Crop Classification from Satellite Imagery for PMFBY." Technical Report.
NHA India (2024). "AI-Driven Fraud Detection in Ayushman Bharat Claims." Annual Report, Chapter 7.

Textbooks

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 4.
Goodfellow, I. et al. (2016). Deep Learning. MIT Press. Chapter 6.2 (Softmax).
Hastie, T. et al. (2009). The Elements of Statistical Learning. Springer. Chapter 4.4.