Chapter 8
Multi-Class Classification
& Evaluation Metrics
Master the art of classifying data into multiple categories with softmax, cross-entropy, confusion matrices, ROC-AUC, and handling real-world class imbalance, with end-to-end implementations and Indian case studies.
Learning Objectives
After completing this chapter, you will be able to:
- Extend binary classification to K-class problems using One-vs-Rest and One-vs-One strategies
- Derive the Softmax function from a log-linear model and understand its probabilistic interpretation
- Formulate and minimize Categorical Cross-Entropy loss for multi-class settings
- Construct and interpret K×K confusion matrices with per-class and aggregate metrics
- Compute Precision, Recall, and F1-Score per class, and understand Macro, Micro, and Weighted averaging
- Plot and interpret ROC curves, AUC scores, and Precision-Recall curves for multi-class problems
- Apply Cohen's Kappa and Matthews Correlation Coefficient for robust evaluation
- Handle class imbalance using SMOTE, class weights, oversampling, undersampling, and cost-sensitive learning
- Tune decision thresholds for business-specific objectives
- Implement complete pipelines in Python, TensorFlow, and Scikit-Learn for multi-class classification
Introduction
In Chapter 7, we mastered binary classification, the task of separating data into exactly two classes. But the real world rarely presents itself in binary terms. A radiologist must distinguish between ten types of lung pathologies. An agricultural scientist classifies satellite imagery into fifteen different crop types. An autonomous vehicle's perception system categorizes every pixel into dozens of semantic classes: road, pedestrian, vehicle, traffic sign, building, sky.
Multi-class classification generalizes binary classification from 2 classes to K classes, where K ≥ 3. This seemingly small conceptual leap introduces profound mathematical, computational, and evaluative challenges:
- Output representation: We need K output probabilities that sum to 1, which is the job of the Softmax function
- Loss computation: Binary cross-entropy becomes Categorical Cross-Entropy over K classes
- Evaluation complexity: A single accuracy number hides critical performance variations across classes; we need per-class metrics and thoughtful averaging strategies
- Class imbalance: With K classes, imbalance is almost guaranteed: rare diseases, uncommon crop types, infrequent fraud patterns
Multi-class classification is arguably the most practically important supervised learning task. When students ask "where do ML models get used in industry?", the answer is very often a multi-class classifier: spam detection (spam/ham/promotional/social), content moderation (safe/violence/hate/nudity), medical diagnosis (healthy/disease A/disease B/...), and so on.
This chapter is structured as a progression: we first establish the mathematical machinery (Softmax, Cross-Entropy), then build a comprehensive evaluation toolkit (confusion matrices through ROC-AUC), tackle the critical problem of class imbalance, and finally implement everything with Python, TensorFlow, and Scikit-Learn, all grounded in real Indian and global case studies.
India's National Health Authority processes over 500 million Ayushman Bharat claims yearly, each coded with ICD-10 disease classifications spanning thousands of categories. Automated multi-class classification of medical claims saved ₹2,400 crore in fraudulent claim detection in 2024-25, making this chapter's content directly applicable to systems that touch millions of Indian lives daily.
Historical Background
The history of multi-class classification intertwines with the development of statistical pattern recognition, neural networks, and information theory.
1936: Fisher's Discriminant Analysis
R.A. Fisher's seminal work on the Iris dataset (3 species, 4 features) introduced linear discriminant analysis, the first systematic approach to multi-class classification. Interestingly, Fisher's dataset included Iris setosa, Iris virginica, and Iris versicolor: a 3-class problem that remains a standard benchmark 90 years later.
1957: Rosenblatt's Perceptron
Frank Rosenblatt's perceptron was binary, but researchers quickly extended it to multi-class via "one-vs-rest" committees of perceptrons, laying groundwork for multi-output neural networks.
1959: Softmax's Theoretical Roots
The Softmax function originates from Luce's Choice Axiom (1959) in mathematical psychology, which modeled how humans make choices among multiple alternatives. Physicist Ludwig Boltzmann had used the same mathematical form (the Boltzmann distribution) in statistical mechanics decades earlier.
1989: Bridle's Softmax
John Bridle formally introduced the term "softmax" in the context of neural networks at a 1989 conference, establishing it as the standard output activation for multi-class neural networks.
1998: Multi-class SVMs
Vapnik's SVM was inherently binary. Dietterich and Bakiri (1995) introduced Error-Correcting Output Codes, and later Crammer and Singer (2001) developed a direct multi-class SVM formulation.
2009: ImageNet & Modern Multi-Class
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with 1,000 classes became the definitive benchmark for multi-class classification, driving AlexNet (2012), VGGNet (2014), GoogLeNet (2014), and ResNet (2015).
2012: Modern Metrics Era
With increasing awareness of fairness and bias in ML systems, evaluation metrics evolved beyond simple accuracy. Cohen's Kappa, Matthews Correlation Coefficient, and class-specific analyses became essential tools in the ML practitioner's arsenal.
Questions about the "history of softmax" appear in GATE ML and IIT entrance exams. Remember: Boltzmann distribution → Luce's choice axiom (1959) → Bridle's "softmax" (1989). The mathematical form e^{z_i}/Σ_j e^{z_j} predates neural networks by a century.
Conceptual Explanation
8.3.1 From Binary to Multi-Class
In binary classification, a model outputs a single probability p(y=1|x) and infers p(y=0|x) = 1 - p. For K classes, we need K probabilities that collectively describe a probability distribution over all classes.
8.3.2 The Softmax Function: An Intuitive View
Think of Softmax as a "soft" version of the argmax function. The argmax function picks the single largest value and ignores everything else. Softmax instead converts a vector of raw scores (logits) into a probability distribution where:
- Every output is between 0 and 1
- All outputs sum to exactly 1
- Larger logits get exponentially more probability mass
- The relative ordering is preserved
Temperature analogy: Imagine K cups of water at different temperatures. Softmax is like asking "what fraction of the total heat does each cup contain?" The hottest cup dominates, but even the coolest cup has some fraction. Dividing every logit by a "temperature parameter" τ before applying Softmax controls how sharp the result is: as τ increases, the distribution flattens toward uniform; as τ → 0, it sharpens toward a one-hot argmax.
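The effect of the temperature parameter can be checked numerically. The following is a minimal sketch, assuming the common convention of dividing each logit by the temperature before normalizing; the function name `softmax_with_temperature` and the argument `tau` are illustrative choices, not names from this chapter.

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Softmax of logits / tau; tau is the (assumed) temperature parameter."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
for tau in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, tau)
    print(f"tau={tau:>4}: {[round(p, 3) for p in probs]}")
```

A small `tau` pushes nearly all probability onto the largest logit, while a large `tau` moves the three probabilities toward 1/3 each; the ordering of the classes never changes.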
8.3.3 One-Hot Encoding for Targets
Multi-class targets are encoded as one-hot vectors: vectors of length K with a 1 at the position of the true class and 0 everywhere else.
Example: 3-Class Problem (Cat, Dog, Bird)
If the true class is "Dog" (class index 1), the one-hot target is y = [0, 1, 0].
The model's predicted probability vector might assign 0.72 to Dog, with the remaining 0.28 split between Cat and Bird.
The cross-entropy loss only "cares about" the probability assigned to the true class: -log(0.72) ≈ 0.329.
8.3.4 One-vs-Rest (OvR) Strategy
OvR (also called One-vs-All or OvA) trains K separate binary classifiers, each distinguishing one class from all others. For K=5 classes, you train 5 binary classifiers:
- Classifier 1: Class A vs. (B, C, D, E)
- Classifier 2: Class B vs. (A, C, D, E)
- ... and so on
At prediction time, run all K classifiers and pick the one with the highest confidence. Pros: Simple, O(K) classifiers, interpretable. Cons: Class imbalance is amplified (1 vs K-1), ambiguous regions where multiple classifiers say "yes" or none do.
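The prediction rule described above (run all K classifiers, keep the most confident) can be sketched as follows. The confidence values are hypothetical placeholders standing in for the outputs of five already-trained binary classifiers; `ovr_predict` is an illustrative helper name.

```python
def ovr_predict(scores, class_names):
    """Pick the class whose one-vs-rest classifier is most confident."""
    best = max(range(len(scores)), key=lambda k: scores[k])
    return class_names[best]

classes = ["A", "B", "C", "D", "E"]

# Hypothetical confidences P(class k vs. rest) from 5 independent binary models.
# They need not sum to 1 because each classifier is trained separately.
scores = [0.30, 0.85, 0.60, 0.10, 0.05]
print(ovr_predict(scores, classes))       # B

# Ambiguous region: no classifier says "yes" (all below 0.5),
# but taking the argmax still produces a decision.
weak_scores = [0.20, 0.35, 0.30, 0.10, 0.05]
print(ovr_predict(weak_scores, classes))  # B
```

Taking the argmax resolves both kinds of ambiguity mentioned above, although a very low winning score is a hint that the prediction should not be trusted.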
8.3.5 One-vs-One (OvO) Strategy
OvO trains K(K-1)/2 binary classifiers, one for each pair of classes. At prediction, use voting โ each classifier "votes" for one of its two classes. The class with the most votes wins.
Pros: Each classifier trains on a balanced pair, smaller training sets per classifier. Cons: O(K²) classifiers, slow prediction for large K, voting can lead to ties.
| Aspect | One-vs-Rest (OvR) | One-vs-One (OvO) |
|---|---|---|
| Number of classifiers | K | K(K-1)/2 |
| Training data per classifier | All N samples | ~2N/K samples |
| Class imbalance risk | High (1 vs K-1) | Low (balanced pairs) |
| Scalability | Good for large K | Poor for large K |
| Used in | Logistic Regression, SVM | SVM (libsvm default) |
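To make the classifier counts and the OvO voting step concrete, here is a small sketch. The pairwise outcomes are hypothetical, and breaking ties in favor of the smallest class index is an assumption made only for this example; real libraries may resolve ties differently.

```python
from collections import Counter
from itertools import combinations

def n_classifiers(K):
    """Return (OvR count, OvO count) for K classes."""
    return K, K * (K - 1) // 2

def ovo_vote(pairwise_winners):
    """pairwise_winners maps each class pair (i, j) to the winning class.
    Ties are broken in favor of the smallest class index (an assumption)."""
    votes = Counter(pairwise_winners.values())
    top = max(votes.values())
    return min(c for c, v in votes.items() if v == top)

K = 4
pairs = list(combinations(range(K), 2))  # the 6 class pairs for K = 4

# Hypothetical outcomes of the 6 pairwise classifiers for one sample.
winners = {(0, 1): 1, (0, 2): 2, (0, 3): 0, (1, 2): 1, (1, 3): 1, (2, 3): 2}

print(n_classifiers(K))    # (4, 6)
print(n_classifiers(100))  # (100, 4950)
print(ovo_vote(winners))   # 1
```

The jump from 100 to 4,950 classifiers at K = 100 is the scalability cost listed in the table.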
8.3.6 Confusion Matrix for K Classes
A K×K confusion matrix generalizes the 2×2 binary confusion matrix. Entry C[i][j] indicates how many samples of true class i were predicted as class j. The diagonal contains correct predictions; off-diagonal entries are misclassifications.
8.3.7 The Evaluation Metrics Landscape
A single accuracy number can be dangerously misleading in multi-class settings. If 90% of medical images are "healthy" and 10% spread across 9 diseases, a model that always predicts "healthy" achieves 90% accuracy while being completely useless for diagnosis. This chapter equips you with a comprehensive toolkit:
- Per-class metrics: Precision, Recall, F1 for each class independently
- Averaging strategies: Macro (equal weight per class), Micro (equal weight per sample), Weighted
- Ranking metrics: ROC-AUC, Precision-Recall AUC (threshold-independent evaluation)
- Agreement metrics: Cohen's Kappa, Matthews Correlation Coefficient (chance-corrected evaluation)
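The "always predict healthy" scenario can be reproduced in a few lines. This is a sketch on made-up labels (810 healthy samples and 10 samples for each of 9 diseases, so that exactly 90% are healthy); the helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_recall(y_true, y_pred, K):
    """Unweighted mean of per-class recall."""
    recalls = []
    for k in range(K):
        n_k = sum(t == k for t in y_true)
        tp_k = sum(t == k and p == k for t, p in zip(y_true, y_pred))
        recalls.append(tp_k / n_k if n_k else 0.0)
    return sum(recalls) / K

# Class 0 = healthy (810 samples), classes 1..9 = diseases (10 samples each).
y_true = [0] * 810 + [k for k in range(1, 10) for _ in range(10)]
y_pred = [0] * len(y_true)  # a "model" that always predicts healthy

print(f"Accuracy:     {accuracy(y_true, y_pred):.2f}")          # 0.90
print(f"Macro recall: {macro_recall(y_true, y_pred, 10):.2f}")  # 0.10
```

Accuracy rewards the majority class, while macro recall exposes that nine of the ten classes are never detected.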
In production ML systems, the choice of evaluation metric is as much a business decision as a technical one. A spam filter typically optimizes for high precision (don't send important emails to spam), while a hospital's cancer screening optimizes for high recall (don't miss any cancer cases). Understanding this distinction is a large part of what separates junior from senior ML engineers.
Mathematical Foundation
8.4.1 The Softmax Function
Given a vector of logits z = [z₁, z₂, ..., z_K], the Softmax function maps each logit to a probability:
$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K$$
Properties of Softmax
- Range: 0 < σ(z)ᵢ < 1 for all i (strict inequalities: never exactly 0 or 1)
- Normalization: Σᵢ σ(z)ᵢ = 1 (forms a valid probability distribution)
- Monotonicity: If zᵢ > zⱼ then σ(z)ᵢ > σ(z)ⱼ (preserves ordering)
- Translation invariance: σ(z + c) = σ(z) when the same constant c is added to every logit (used for numerical stability)
- Reduces to sigmoid: For K=2, softmax reduces to the logistic sigmoid function
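The listed properties can be spot-checked numerically. This is a sanity check on one example, not a proof; the logits are arbitrary and the helper functions are written here only for the check.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

z = [2.0, 1.0, 0.1]
p = softmax(z)
shifted = softmax([v + 100.0 for v in z])  # same constant added to every logit

range_ok = all(0.0 < v < 1.0 for v in p)
normalized = abs(sum(p) - 1.0) < 1e-12
ordered = p[0] > p[1] > p[2]
shift_invariant = all(abs(a - b) < 1e-12 for a, b in zip(p, shifted))

# K = 2: the probability of class 1 equals sigmoid(z1 - z0).
z0, z1 = 0.3, 1.7
reduces_to_sigmoid = abs(softmax([z0, z1])[1] - sigmoid(z1 - z0)) < 1e-12

print(range_ok, normalized, ordered, shift_invariant, reduces_to_sigmoid)
```

Note that the strict range property holds mathematically; in floating point, extremely large logit gaps can still underflow to exactly 0.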
8.4.2 Categorical Cross-Entropy Loss
For a single sample with one-hot target y = [y₁, ..., y_K] and predicted probabilities ŷ = [ŷ₁, ..., ŷ_K]:
$$L = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
Since y is one-hot (y_c = 1 for the true class c, and the rest are 0), this simplifies to:
$$L = -\log \hat{y}_c$$
The loss is small when ŷ_c is close to 1 (confident, correct prediction) and large when ŷ_c is close to 0 (model assigns low probability to the true class).
8.4.3 Multi-Class Confusion Matrix
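For K classes, the confusion matrix C is a K×K matrix where:
$$C_{ij} = \text{number of samples with true class } i \text{ that were predicted as class } j$$
For each class k, we can extract from the confusion matrix (N is the total number of samples):
$$TP_k = C_{kk}, \qquad FP_k = \sum_{i \neq k} C_{ik}, \qquad FN_k = \sum_{j \neq k} C_{kj}, \qquad TN_k = N - TP_k - FP_k - FN_k$$
8.4.4 Per-Class Metrics
$$\text{Precision}_k = \frac{TP_k}{TP_k + FP_k}, \qquad \text{Recall}_k = \frac{TP_k}{TP_k + FN_k}, \qquad F1_k = \frac{2 \cdot \text{Precision}_k \cdot \text{Recall}_k}{\text{Precision}_k + \text{Recall}_k}$$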
8.4.5 Averaging Strategies
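Macro Average
$$\text{Macro-F1} = \frac{1}{K} \sum_{k=1}^{K} F1_k$$
Treats all classes equally regardless of support (number of samples). Good when all classes matter equally.
Micro Average
$$\text{Micro-P} = \frac{\sum_k TP_k}{\sum_k (TP_k + FP_k)}, \qquad \text{Micro-R} = \frac{\sum_k TP_k}{\sum_k (TP_k + FN_k)}, \qquad \text{Micro-F1} = \frac{2 \cdot \text{Micro-P} \cdot \text{Micro-R}}{\text{Micro-P} + \text{Micro-R}}$$
Note: In multi-class (not multi-label) settings, Micro-Precision = Micro-Recall = Micro-F1 = Accuracy, because Σ TPₖ = total correct predictions, and Σ(TPₖ + FPₖ) = Σ(TPₖ + FNₖ) = N.
Weighted Average
$$\text{Weighted-F1} = \sum_{k=1}^{K} \frac{n_k}{N} F1_k$$
Where nₖ is the number of true samples in class k, and N = Σ nₖ. Each class is weighted by its prevalence, so frequent classes contribute more to the score than rare ones.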
8.4.6 ROC Curve & AUC
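For each class k in a multi-class problem (using OvR binarization), at score threshold t:
$$TPR_k(t) = \frac{TP_k(t)}{TP_k(t) + FN_k(t)}, \qquad FPR_k(t) = \frac{FP_k(t)}{FP_k(t) + TN_k(t)}$$
As threshold t varies from 1 to 0, we trace a curve in (FPR, TPR) space. AUC is the area under this curve. It ranges from 0 to 1: 0.5 corresponds to random ranking, 1.0 to perfect ranking, and values below 0.5 to a ranking that is systematically inverted.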
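One way to see the OvR binarization at work is to compute a per-class AUC and then average. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen positive is scored above a randomly chosen negative) instead of integrating the curve; the labels and probabilities are made-up values for illustration, and the function names are not from any library.

```python
def auc_rank(y_binary, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Tied pairs count as half. Equals the area under the ROC curve."""
    pos = [s for y, s in zip(y_binary, scores) if y == 1]
    neg = [s for y, s in zip(y_binary, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(y_true, proba, K):
    """Binarize class k vs. rest, score with column k of proba, then average."""
    aucs = []
    for k in range(K):
        y_bin = [1 if t == k else 0 for t in y_true]
        aucs.append(auc_rank(y_bin, [row[k] for row in proba]))
    return sum(aucs) / K, aucs

y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.70, 0.20, 0.10],
    [0.50, 0.30, 0.20],
    [0.20, 0.60, 0.20],
    [0.40, 0.35, 0.25],
    [0.10, 0.20, 0.70],
    [0.30, 0.40, 0.30],
]
macro_auc, per_class = macro_ovr_auc(y_true, proba, 3)
print(per_class)            # [1.0, 0.875, 1.0]
print(round(macro_auc, 4))  # 0.9583
```

The pairwise loop costs O(P·N) per class, which is fine for illustration but too slow for large test sets, where a sort-based computation is preferable.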
8.4.7 Cohen's Kappa
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
Where p_o = observed agreement (accuracy), and p_e = expected agreement by chance. Values: ≤ 0 = no agreement, 0.01-0.20 = slight, 0.21-0.40 = fair, 0.41-0.60 = moderate, 0.61-0.80 = substantial, 0.81-1.00 = almost perfect.
8.4.8 Matthews Correlation Coefficient (Multi-class)
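$$MCC = \frac{c \cdot s - \sum_{k} p_k t_k}{\sqrt{s^2 - \sum_{k} p_k^2} \; \sqrt{s^2 - \sum_{k} t_k^2}}$$
Where s = Σᵢⱼ C[i][j] (total samples), c = Σₖ C[k][k] (trace, the total number of correct predictions), pₖ = Σᵢ C[i][k] (column sum for k, the number of times class k was predicted), tₖ = Σⱼ C[k][j] (row sum for k, the number of times class k truly occurred). MCC equals +1 for perfect prediction and 0 when the model is no better than random. The value -1 (total disagreement) is reached in the binary case; with more than two classes the minimum lies between -1 and 0, depending on the class distribution.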
A commonly tested fact: For multi-class single-label classification, Micro-Precision = Micro-Recall = Micro-F1 = Overall Accuracy. This equality breaks in multi-label settings where each sample can have multiple labels.
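The micro-average identity is easy to confirm on any single-label confusion matrix. The sketch below uses an illustrative 3-class matrix (rows are true classes, columns are predicted classes); `micro_metrics` is a helper written only for this check.

```python
def micro_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    K = len(cm)
    n = sum(sum(row) for row in cm)
    tp = sum(cm[k][k] for k in range(K))
    # Column sum minus diagonal = false positives for class k.
    fp = sum(sum(cm[i][k] for i in range(K)) - cm[k][k] for k in range(K))
    # Row sum minus diagonal = false negatives for class k.
    fn = sum(sum(cm[k]) - cm[k][k] for k in range(K))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": tp / n, "precision": precision,
            "recall": recall, "f1": f1, "fp": fp, "fn": fn}

cm = [[8, 1, 1],
      [2, 6, 0],
      [1, 1, 7]]
m = micro_metrics(cm)
print(m["fp"], m["fn"])  # 6 6
print(round(m["accuracy"], 4), round(m["precision"], 4),
      round(m["recall"], 4), round(m["f1"], 4))  # 0.7778 four times
```

Every off-diagonal entry is counted once as a false positive (for its column) and once as a false negative (for its row), which is why the two totals must match.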
Formula Derivations
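8.5.1 Deriving Softmax from a Log-Linear Model
Step 1: The Log-Linear Model
We want to model the probability of class k given input x. Start with a log-linear (exponential family) model, in which the log-probability is linear in x up to a constant that does not depend on k:
$$\log P(y = k \mid \mathbf{x}) = \mathbf{w}_k^\top \mathbf{x} + b_k + \text{const}$$
This means:
$$P(y = k \mid \mathbf{x}) \propto \exp(\mathbf{w}_k^\top \mathbf{x} + b_k)$$
Step 2: Define Logits
Let zₖ = wₖᵀx + bₖ be the "logit" (unnormalized log-probability) for class k. Then:
$$P(y = k \mid \mathbf{x}) \propto \exp(z_k)$$
Step 3: Normalization
Since probabilities must sum to 1:
$$\sum_{k=1}^{K} P(y = k \mid \mathbf{x}) = 1$$
We introduce the normalizing constant (partition function) Z = Σⱼ₌₁ᴷ exp(zⱼ):
$$P(y = k \mid \mathbf{x}) = \frac{\exp(z_k)}{Z} = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}$$
This is the Softmax function. It arises naturally as the normalized form of exponential class scores.
Step 4: Verify it reduces to Sigmoid for K = 2
For K = 2 with classes 0 and 1:
$$P(y = 1 \mid \mathbf{x}) = \frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{1}{1 + e^{-(z_1 - z_0)}}$$
This is exactly the logistic sigmoid applied to the log-odds z₁ - z₀. ∎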
8.5.2 Deriving Categorical Cross-Entropy from Maximum Likelihood
Step 1: Likelihood for a Single Sample
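Given true one-hot label y = [y₁, ..., y_K] and predictions ŷ = [ŷ₁, ..., ŷ_K], the likelihood of observing y is:
$$P(\mathbf{y} \mid \mathbf{x}) = \prod_{k=1}^{K} \hat{y}_k^{\,y_k}$$
This uses the fact that y is one-hot: only the term for the true class survives (yₖ = 1), while others contribute ŷₖ⁰ = 1.
Step 2: Log-Likelihood
$$\log P(\mathbf{y} \mid \mathbf{x}) = \sum_{k=1}^{K} y_k \log \hat{y}_k$$
Step 3: Negative Log-Likelihood as Loss
Maximizing the likelihood is equivalent to minimizing its negative logarithm, so we negate:
$$L = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
This is the Categorical Cross-Entropy. For N training samples:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik}$$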
8.5.3 Gradient of Cross-Entropy w.r.t. Logits
The combined gradient of cross-entropy loss with softmax activation has a beautifully simple form:
$$\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k$$
Derivation:
Using chain rule: ∂L/∂zₖ = Σⱼ (∂L/∂ŷⱼ) · (∂ŷⱼ/∂zₖ)
We know ∂L/∂ŷⱼ = -yⱼ/ŷⱼ
The Jacobian of softmax: ∂ŷⱼ/∂zₖ = ŷⱼ(δⱼₖ - ŷₖ) where δⱼₖ is the Kronecker delta
Substituting, and using Σⱼ yⱼ = 1 for a one-hot target:
$$\frac{\partial L}{\partial z_k} = \sum_{j} \left(-\frac{y_j}{\hat{y}_j}\right) \hat{y}_j (\delta_{jk} - \hat{y}_k) = -y_k + \hat{y}_k \sum_{j} y_j = \hat{y}_k - y_k$$
This elegant result means: the gradient is simply the difference between predicted and true probabilities. This is identical in form to the sigmoid + binary cross-entropy gradient, but now operates over K dimensions.
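A finite-difference check is a quick way to gain confidence in this gradient. The following sketch compares the analytic form (predicted minus true probabilities) with central differences of the loss; the logits, target, and step size `h` are arbitrary illustrative choices.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """Categorical cross-entropy of softmax(z) against one-hot target y."""
    p = softmax(z)
    return -sum(y_k * math.log(p_k) for y_k, p_k in zip(y, p))

def numerical_grad(z, y, h=1e-6):
    """Central finite differences of the loss with respect to each logit."""
    grad = []
    for k in range(len(z)):
        z_plus = z[:k] + [z[k] + h] + z[k + 1:]
        z_minus = z[:k] + [z[k] - h] + z[k + 1:]
        grad.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * h))
    return grad

z = [2.0, 1.0, 0.1]
y = [0, 1, 0]  # true class: index 1

analytic = [p - t for p, t in zip(softmax(z), y)]  # y_hat - y
numeric = numerical_grad(z, y)

print("analytic:", [round(g, 4) for g in analytic])
print("numeric: ", [round(g, 4) for g in numeric])
```

The components of the gradient sum to zero, since both the predictions and the one-hot target sum to 1.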
8.5.4 Deriving Macro, Micro, and Weighted F1
Setup: K-class confusion matrix
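From the K×K confusion matrix, for each class k:
$$TP_k = C_{kk}, \qquad FP_k = \sum_{i \neq k} C_{ik}, \qquad FN_k = \sum_{j \neq k} C_{kj}$$
$$P_k = \frac{TP_k}{TP_k + FP_k}, \qquad R_k = \frac{TP_k}{TP_k + FN_k}, \qquad F1_k = \frac{2 P_k R_k}{P_k + R_k}$$
Macro-F1
$$\text{Macro-F1} = \frac{1}{K} \sum_{k=1}^{K} F1_k$$
Simple arithmetic mean. Gives equal importance to every class, regardless of how many samples belong to it.
Micro-F1
$$\text{Micro-P} = \frac{\sum_k TP_k}{\sum_k TP_k + \sum_k FP_k}, \qquad \text{Micro-R} = \frac{\sum_k TP_k}{\sum_k TP_k + \sum_k FN_k}, \qquad \text{Micro-F1} = \frac{2 \cdot \text{Micro-P} \cdot \text{Micro-R}}{\text{Micro-P} + \text{Micro-R}}$$
Key insight: In single-label multi-class, Σₖ FPₖ = Σₖ FNₖ (every misclassification is simultaneously a FP for one class and a FN for another). Therefore Micro-P = Micro-R = Micro-F1 = Accuracy.
Weighted-F1
$$\text{Weighted-F1} = \sum_{k=1}^{K} \frac{n_k}{N} F1_k, \qquad n_k = TP_k + FN_k$$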
The identity Micro-F1 = Accuracy in single-label multi-class classification is one of the most frequently tested facts in ML interviews. However, this identity does not hold for multi-label classification, where a sample can belong to multiple classes simultaneously. Always clarify the problem setting when discussing micro averages.
Worked Numerical Examples
Example 1: Softmax Computation
Problem
A neural network outputs logits z = [2.0, 1.0, 0.1] for a 3-class problem (Cat, Dog, Bird). Compute the softmax probabilities.
Solution
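Step 1: Compute exponentials
$$e^{2.0} \approx 7.389, \qquad e^{1.0} \approx 2.718, \qquad e^{0.1} \approx 1.105$$
Step 2: Sum of exponentials
$$7.389 + 2.718 + 1.105 = 11.212$$
Step 3: Normalize
$$\hat{y} = \left[\frac{7.389}{11.212}, \frac{2.718}{11.212}, \frac{1.105}{11.212}\right] \approx [0.659, 0.242, 0.099]$$
Verification: 0.659 + 0.242 + 0.099 = 1.000 ✓
Prediction: Class 0 (Cat) with 65.9% confidence.
Step 4: Numerically stable version
Subtract max logit (2.0) from all logits before computing exp to prevent overflow:
$$z - 2.0 = [0, -1.0, -1.9], \qquad e^{0} = 1, \quad e^{-1.0} \approx 0.3679, \quad e^{-1.9} \approx 0.1496, \qquad \text{sum} \approx 1.5174$$
$$\hat{y} \approx \left[\frac{1}{1.5174}, \frac{0.3679}{1.5174}, \frac{0.1496}{1.5174}\right] \approx [0.659, 0.242, 0.099]$$
The probabilities are identical to Step 3, as translation invariance guarantees.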
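The reason for subtracting the maximum becomes visible once the logits are large. The sketch below contrasts a naive softmax with the stable version on logits shifted by 1000; both function names are illustrative, and plain `math.exp` is used so that the overflow surfaces as an exception.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]  # math.exp raises OverflowError for large v
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

small = [2.0, 1.0, 0.1]
large = [1002.0, 1001.0, 1000.1]  # same differences, shifted by 1000

print([round(p, 3) for p in softmax_stable(small)])  # [0.659, 0.242, 0.099]
print([round(p, 3) for p in softmax_stable(large)])  # [0.659, 0.242, 0.099]

try:
    softmax_naive(large)
except OverflowError as err:
    print("naive softmax failed:", err)
```

With NumPy arrays the naive version does not raise; it typically produces `inf` and then `nan` after the division, which is harder to notice.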
Example 2: Cross-Entropy Loss Computation
Problem
True class = Dog (y = [0, 1, 0]). Predicted probabilities from Example 1: ลท = [0.659, 0.242, 0.099]. Compute the categorical cross-entropy loss.
Solution
$$L = -\sum_{k} y_k \log \hat{y}_k = -\log(0.242) \approx 1.419$$
This is a relatively high loss because the model predicted Cat (0.659) but the true class was Dog (only 0.242 probability).
Compare: If the model had predicted ŷ = [0.05, 0.90, 0.05], the loss would be -log(0.90) = 0.105, which is much lower.
Example 3: Multi-Class Confusion Matrix & Metrics
Problem
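A 3-class classifier (Cat=0, Dog=1, Bird=2) is tested on 27 samples. The confusion matrix (rows: true class, columns: predicted class; this is the matrix produced by the demo data in Section 8.9.3) is:
| True \ Predicted | Cat | Dog | Bird | Row total |
|---|---|---|---|---|
| Cat | 8 | 1 | 1 | 10 |
| Dog | 2 | 6 | 0 | 8 |
| Bird | 1 | 1 | 7 | 9 |
| Column total | 11 | 8 | 8 | 27 |
Compute per-class Precision, Recall, F1, then Macro-F1, Micro-F1, and Weighted-F1.
Solution
For Cat (k=0): TP = 8, FP = 2 + 1 = 3, FN = 1 + 1 = 2
$$P_0 = \frac{8}{11} \approx 0.727, \qquad R_0 = \frac{8}{10} = 0.800, \qquad F1_0 = \frac{2 \cdot 8}{2 \cdot 8 + 3 + 2} = \frac{16}{21} \approx 0.762$$
For Dog (k=1): TP = 6, FP = 1 + 1 = 2, FN = 2 + 0 = 2
$$P_1 = \frac{6}{8} = 0.750, \qquad R_1 = \frac{6}{8} = 0.750, \qquad F1_1 = 0.750$$
For Bird (k=2): TP = 7, FP = 1 + 0 = 1, FN = 1 + 1 = 2
$$P_2 = \frac{7}{8} = 0.875, \qquad R_2 = \frac{7}{9} \approx 0.778, \qquad F1_2 = \frac{14}{17} \approx 0.824$$
Macro-F1:
$$\frac{0.7619 + 0.7500 + 0.8235}{3} \approx 0.778$$
Micro-F1:
$$\frac{8 + 6 + 7}{27} = \frac{21}{27} \approx 0.778 \quad (\text{equal to accuracy})$$
Weighted-F1:
$$\frac{10 \cdot 0.7619 + 8 \cdot 0.7500 + 9 \cdot 0.8235}{27} \approx 0.779$$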
Example 4: Cohen's Kappa Calculation
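Using the same confusion matrix from Example 3 (row totals 10, 8, 9; column totals 11, 8, 8; N = 27):
$$p_o = \frac{8 + 6 + 7}{27} = \frac{21}{27} \approx 0.778$$
Expected agreement p_e: for each class k, multiply row total by column total, sum over classes, and divide by N²:
$$p_e = \frac{10 \cdot 11 + 8 \cdot 8 + 9 \cdot 8}{27^2} = \frac{246}{729} \approx 0.337$$
$$\kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.778 - 0.337}{1 - 0.337} \approx 0.665$$
Interpretation: κ = 0.665 falls in the "substantial agreement" band, so the model performs well above chance level.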
Visual Diagrams
Diagram 1: Softmax Transformation Pipeline
Diagram 2: One-vs-Rest (OvR) Strategy for 4 Classes
Diagram 3: Multi-Class Confusion Matrix Anatomy
Diagram 4: ROC Space Interpretation
Diagram 5: Precision-Recall Tradeoff
Flowcharts
Flowchart 1: Choosing the Right Multi-Class Evaluation Metric
Flowchart 2: Multi-Class Classification Pipeline
Flowchart 3: Handling Class Imbalance Decision Tree
Python Implementation (From Scratch)
8.9.1 Softmax Function
Python
import numpy as np

def softmax(z):
    """
    Numerically stable softmax function.
    Args:
        z: numpy array of logits, shape (K,) or (N, K)
    Returns:
        Probability distribution(s), same shape as z
    """
    # Subtract max for numerical stability (prevents overflow)
    z_stable = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_stable)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

# Example: 3-class logits
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(f"Logits: {logits}")
print(f"Softmax: {np.round(probs, 4)}")
print(f"Sum: {probs.sum():.6f}")
# Expected values (softmax rounded to 4 decimals):
# Softmax is approximately [0.6590, 0.2424, 0.0986]
# Sum: 1.000000

# Batch mode: (4 samples, 3 classes)
batch_logits = np.array([
    [2.0, 1.0, 0.1],
    [0.5, 2.5, 1.0],
    [1.0, 1.0, 1.0],   # equal logits -> uniform distribution
    [10., 0.0, 0.0],   # large difference -> near one-hot
])
batch_probs = softmax(batch_logits)
print("\nBatch Softmax:")
for i, (l, p) in enumerate(zip(batch_logits, batch_probs)):
    print(f"  Sample {i}: {l} -> {np.round(p, 4)}")
8.9.2 Categorical Cross-Entropy Loss
Python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    Compute categorical cross-entropy loss.
    Args:
        y_true: one-hot encoded true labels, shape (N, K)
        y_pred: predicted probabilities, shape (N, K)
        epsilon: small value to avoid log(0)
    Returns:
        Average loss over N samples
    """
    # Clip predictions to prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # Cross-entropy: -Σ y_k * log(ŷ_k) for each sample
    loss_per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
    return np.mean(loss_per_sample)

# Example
y_true = np.array([
    [1, 0, 0],  # Cat
    [0, 1, 0],  # Dog
    [0, 0, 1],  # Bird
])
y_pred = np.array([
    [0.9, 0.05, 0.05],  # Confident & correct
    [0.1, 0.8, 0.1],    # Correct but less confident
    [0.2, 0.3, 0.5],    # Correct but barely
])
loss = categorical_cross_entropy(y_true, y_pred)
print(f"Average Cross-Entropy Loss: {loss:.4f}")
# Output: Average Cross-Entropy Loss: 0.3406
# (mean of -log 0.9, -log 0.8, -log 0.5)

# Compare: bad predictions
y_pred_bad = np.array([
    [0.1, 0.8, 0.1],  # Wrong!
    [0.7, 0.2, 0.1],  # Wrong!
    [0.5, 0.4, 0.1],  # Wrong!
])
loss_bad = categorical_cross_entropy(y_true, y_pred_bad)
print(f"Bad predictions loss: {loss_bad:.4f}")
# Output: Bad predictions loss: 2.0715
# (mean of -log 0.1, -log 0.2, -log 0.1)
8.9.3 Complete Multi-Class Confusion Matrix & Metrics
Python
import numpy as np

def confusion_matrix_multiclass(y_true, y_pred, K):
    """Build K×K confusion matrix from scratch (rows: true, columns: predicted)."""
    cm = np.zeros((K, K), dtype=int)
    for true, pred in zip(y_true, y_pred):
        cm[true, pred] += 1
    return cm

def per_class_metrics(cm):
    """Compute precision, recall, F1 per class from confusion matrix."""
    K = cm.shape[0]
    metrics = {}
    for k in range(K):
        tp = cm[k, k]
        fp = np.sum(cm[:, k]) - tp  # Column sum minus diagonal
        fn = np.sum(cm[k, :]) - tp  # Row sum minus diagonal
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        support = tp + fn  # Total actual samples for this class
        metrics[k] = {
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'support': support,
            'tp': tp, 'fp': fp, 'fn': fn
        }
    return metrics

def macro_f1(metrics):
    """Macro-averaged F1: simple mean of per-class F1."""
    return np.mean([m['f1'] for m in metrics.values()])

def micro_f1(metrics):
    """Micro-averaged F1: aggregate TP, FP, FN first."""
    total_tp = sum(m['tp'] for m in metrics.values())
    total_fp = sum(m['fp'] for m in metrics.values())
    total_fn = sum(m['fn'] for m in metrics.values())
    micro_p = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0
    micro_r = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0
    return 2 * micro_p * micro_r / (micro_p + micro_r) if (micro_p + micro_r) > 0 else 0

def weighted_f1(metrics):
    """Weighted F1: weight by support (number of true instances)."""
    total = sum(m['support'] for m in metrics.values())
    return sum(m['support'] / total * m['f1'] for m in metrics.values())

def cohens_kappa(cm):
    """Compute Cohen's Kappa from confusion matrix."""
    N = cm.sum()
    p0 = np.trace(cm) / N  # Observed agreement
    row_sums = cm.sum(axis=1)
    col_sums = cm.sum(axis=0)
    pe = np.sum(row_sums * col_sums) / (N ** 2)  # Expected agreement
    return (p0 - pe) / (1 - pe) if pe != 1 else 0

def matthews_correlation(cm):
    """Multi-class Matthews Correlation Coefficient."""
    s = cm.sum()
    c = np.trace(cm)
    pk = cm.sum(axis=0)  # Column sums (times each class was predicted)
    tk = cm.sum(axis=1)  # Row sums (times each class truly occurred)
    numerator = c * s - np.sum(pk * tk)
    denom = np.sqrt(s**2 - np.sum(pk**2)) * np.sqrt(s**2 - np.sum(tk**2))
    return numerator / denom if denom != 0 else 0

# ===== DEMO =====
K = 3
class_names = ['Cat', 'Dog', 'Bird']
# Hand-written labels and predictions (27 samples)
y_true = np.array([0,0,0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2,2])
y_pred = np.array([0,0,0,0,0,0,0,0,1,2, 0,0,1,1,1,1,1,1, 0,2,2,2,2,2,2,2,1])
cm = confusion_matrix_multiclass(y_true, y_pred, K)
print("Confusion Matrix:")
print(cm)
metrics = per_class_metrics(cm)
print(f"\n{'Class':<8} {'Prec':>8} {'Recall':>8} {'F1':>8} {'Support':>8}")
print("-" * 42)
for k in range(K):
    m = metrics[k]
    print(f"{class_names[k]:<8} {m['precision']:>8.3f} {m['recall']:>8.3f} {m['f1']:>8.3f} {m['support']:>8}")
print(f"\nMacro-F1: {macro_f1(metrics):.3f}")
print(f"Micro-F1: {micro_f1(metrics):.3f}")
print(f"Weighted-F1: {weighted_f1(metrics):.3f}")
print(f"Cohen's κ: {cohens_kappa(cm):.3f}")
print(f"MCC: {matthews_correlation(cm):.3f}")
8.9.4 ROC Curve & AUC from Scratch
Python
import numpy as np

def roc_curve_binary(y_true, y_scores, n_thresholds=200):
    """
    Compute ROC curve for binary classification.
    Args:
        y_true: true binary labels (0 or 1), numpy array
        y_scores: predicted probabilities for positive class, numpy array
        n_thresholds: number of threshold points
    Returns:
        fprs, tprs, thresholds
    """
    thresholds = np.linspace(1.0, 0.0, n_thresholds)
    fprs, tprs = [], []
    for t in thresholds:
        y_pred = (y_scores >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
        tprs.append(tpr)
        fprs.append(fpr)
    return np.array(fprs), np.array(tprs), thresholds

def auc_trapezoidal(x, y):
    """Compute AUC using trapezoidal rule."""
    # Sort by x and break ties by y, so that points sharing the same FPR
    # (vertical ROC segments) stay in increasing-TPR order
    sorted_idx = np.lexsort((y, x))
    x_sorted = x[sorted_idx]
    y_sorted = y[sorted_idx]
    # Trapezoidal rule: Σ (x_{i+1} - x_i) * (y_i + y_{i+1}) / 2
    area = 0.0
    for i in range(len(x_sorted) - 1):
        dx = x_sorted[i + 1] - x_sorted[i]
        avg_y = (y_sorted[i] + y_sorted[i + 1]) / 2
        area += dx * avg_y
    return area

# Demo: small hand-written sample
y_true_bin = np.array([1,1,1,1,1,0,0,0,0,0,1,0,1,0,1])
y_scores_bin = np.array([0.9,0.8,0.7,0.6,0.55,0.4,0.35,0.3,0.2,0.1,0.85,0.25,0.65,0.45,0.75])
fprs, tprs, thresholds = roc_curve_binary(y_true_bin, y_scores_bin)
auc_value = auc_trapezoidal(fprs, tprs)
print(f"AUC (from scratch): {auc_value:.4f}")
8.9.5 SMOTE Implementation (Simplified)
Python
import numpy as np
from scipy.spatial.distance import cdist

def smote_simple(X_minority, n_synthetic, k=5):
    """
    Simplified SMOTE: generate synthetic minority samples.
    For each selected minority sample:
    1. Find k nearest neighbors within minority class
    2. Pick one neighbor randomly
    3. Generate a new point along the line between them
    Args:
        X_minority: minority class samples, shape (n, features)
        n_synthetic: number of synthetic samples to generate
        k: number of nearest neighbors
    Returns:
        Synthetic samples, shape (n_synthetic, features)
    """
    n_samples, n_features = X_minority.shape
    if n_samples < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    synthetic = np.zeros((n_synthetic, n_features))
    # Compute pairwise distances
    distances = cdist(X_minority, X_minority)
    for i in range(n_synthetic):
        # Pick a random minority sample
        idx = np.random.randint(0, n_samples)
        sample = X_minority[idx]
        # Find k nearest neighbors (position 0 of the sort is the sample itself)
        neighbor_indices = np.argsort(distances[idx])[1:k+1]
        # Pick a random neighbor (fewer than k exist when n_samples <= k)
        nn_idx = neighbor_indices[np.random.randint(0, len(neighbor_indices))]
        neighbor = X_minority[nn_idx]
        # Generate synthetic sample: interpolate
        lam = np.random.random()  # Random factor between 0 and 1
        synthetic[i] = sample + lam * (neighbor - sample)
    return synthetic

# Demo
np.random.seed(42)
X_minority = np.random.randn(20, 2) + np.array([3, 3])  # 20 minority samples
X_synthetic = smote_simple(X_minority, n_synthetic=30, k=5)
print(f"Original minority: {X_minority.shape[0]} samples")
print(f"Synthetic generated: {X_synthetic.shape[0]} samples")
print(f"Total after SMOTE: {X_minority.shape[0] + X_synthetic.shape[0]} samples")
Extend the smote_simple function to support Borderline-SMOTE: only generate synthetic samples from minority instances that are near the decision boundary (have at least one majority-class neighbor in their k-NN). This targets the most useful region for synthetic data generation.
TensorFlow Implementation
8.10.1 Multi-Class Neural Network Classifier
TensorFlow
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import classification_report, confusion_matrix

# ====================================================
# Multi-Class Classification with TensorFlow/Keras
# Example: 5-class synthetic dataset with 20 features
# ====================================================

# --- Generate synthetic multi-class data ---
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=15,
    n_classes=5, n_clusters_per_class=2,
    random_state=42
)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Normalize features (statistics come from the training set only)
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std

# Number of classes
K = len(np.unique(y))
print(f"Number of classes: {K}")
print(f"Class distribution: {np.bincount(y_train)}")

# --- Build the Model ---
model = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(128, activation='relu',
                 kernel_regularizer=keras.regularizers.l2(0.001)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(64, activation='relu',
                 kernel_regularizer=keras.regularizers.l2(0.001)),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.2),
    # Output: K units with softmax activation
    layers.Dense(K, activation='softmax')
])
model.summary()

# --- Compile with Categorical Cross-Entropy ---
# y holds integer labels, so use sparse_categorical_crossentropy
# (one-hot labels would use categorical_crossentropy instead)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='sparse_categorical_crossentropy',  # Integers as labels
    metrics=['accuracy']
)

# --- Train with class weights for imbalance ---
class_weights = compute_class_weight(
    'balanced', classes=np.unique(y_train), y=y_train
)
class_weight_dict = dict(enumerate(class_weights))
print(f"Class weights: {class_weight_dict}")

# --- Training ---
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=50,
    batch_size=32,
    class_weight=class_weight_dict,
    callbacks=[
        keras.callbacks.EarlyStopping(
            patience=10, restore_best_weights=True,
            monitor='val_loss'
        ),
        keras.callbacks.ReduceLROnPlateau(
            factor=0.5, patience=5, min_lr=1e-6
        )
    ],
    verbose=1
)

# --- Evaluate ---
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest Accuracy: {test_acc:.4f}")
print(f"Test Loss: {test_loss:.4f}")

# --- Get predictions for detailed metrics ---
y_pred_probs = model.predict(X_test)  # Shape: (N, K)
y_pred = np.argmax(y_pred_probs, axis=1)

# Detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, digits=4))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
8.10.2 Custom Categorical Cross-Entropy with Label Smoothing
TensorFlow
class LabelSmoothingCCE(keras.losses.Loss):
    """
    Categorical Cross-Entropy with Label Smoothing.
    Instead of hard targets [0, 1, 0], use:
    [ε/K, 1-ε+ε/K, ε/K] where ε is the smoothing factor.
    This discourages overconfident predictions and tends to improve
    generalization (Szegedy et al., 2016).
    """
    def __init__(self, smoothing=0.1, **kwargs):
        super().__init__(**kwargs)
        self.smoothing = smoothing

    def call(self, y_true, y_pred):
        num_classes = tf.shape(y_pred)[-1]
        K = tf.cast(num_classes, y_pred.dtype)
        # One-hot encode integer labels. Keras may pass them with shape
        # (N,) or (N, 1), so both cases are handled here.
        if y_true.shape.rank == 1 or y_true.shape[-1] == 1:
            labels = tf.cast(tf.reshape(y_true, [-1]), tf.int32)
            y_true = tf.one_hot(labels, depth=num_classes)
        y_true = tf.cast(y_true, y_pred.dtype)
        # Apply label smoothing
        y_smooth = y_true * (1 - self.smoothing) + self.smoothing / K
        # Cross-entropy
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1 - 1e-7)
        # Return per-sample losses; the base Loss class applies the reduction
        return -tf.reduce_sum(y_smooth * tf.math.log(y_pred), axis=-1)

# Usage
model_smooth = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(K, activation='softmax')
])
model_smooth.compile(
    optimizer='adam',
    loss=LabelSmoothingCCE(smoothing=0.1),
    metrics=['accuracy']
)
print("Model with Label Smoothing compiled successfully.")
8.10.3 Multi-Class ROC-AUC with TensorFlow Metrics
TensorFlow
# Keras model that tracks cross-entropy as a metric during training;
# multi-class ROC-AUC is computed with scikit-learn after prediction
model_auc = keras.Sequential([
    layers.Input(shape=(20,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(K, activation='softmax')
])
model_auc.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=[
        'accuracy',
        keras.metrics.SparseCategoricalCrossentropy(name='xent'),
        # Note: a Keras AUC metric would need one-hot targets here
    ]
)

# For multi-class AUC, use sklearn after prediction:
from sklearn.metrics import roc_auc_score

# After training and predicting (y_test holds integer labels,
# y_pred_probs has shape (N, K) with rows summing to 1):
# auc_ovr = roc_auc_score(y_test, y_pred_probs,
#                         multi_class='ovr', average='macro')
# print(f"Macro OvR AUC: {auc_ovr:.4f}")
Label smoothing is one of the simplest but most effective regularization techniques for multi-class classifiers. By replacing hard targets [0,1,0] with soft targets [0.033, 0.933, 0.033] (for ε=0.1, K=3), we prevent the model from becoming overconfident. This was a key ingredient in the success of Inception v2 (Szegedy et al., 2016) and is now widely used in production models.
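The soft targets quoted above follow directly from the smoothing rule (1 - ε)·y + ε/K. A minimal sketch, with `smooth_labels` as an illustrative helper name:

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Return (1 - epsilon) * y + epsilon / K for a one-hot list y of length K."""
    K = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / K for y in one_hot]

soft = smooth_labels([0, 1, 0], epsilon=0.1)
print([round(v, 3) for v in soft])  # [0.033, 0.933, 0.033]
print(round(sum(soft), 6))          # 1.0
```

The smoothed vector is still a valid probability distribution, so it can be used as a drop-in replacement for the one-hot target in the cross-entropy formula.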
Scikit-Learn Implementation
8.11.1 Complete Multi-Class Pipeline with Evaluation
Scikit-Learn
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
classification_report, confusion_matrix,
accuracy_score, f1_score, cohen_kappa_score,
matthews_corrcoef, roc_auc_score, roc_curve, auc,
precision_recall_curve, average_precision_score
)
import warnings
warnings.filterwarnings('ignore')
# ===== 1. Generate Multi-Class Data =====
X, y = make_classification(
n_samples=3000, n_features=15, n_informative=10,
n_classes=5, n_clusters_per_class=1,
weights=[0.35, 0.25, 0.20, 0.12, 0.08], # Imbalanced!
random_state=42
)
class_names = ['Healthy', 'Disease_A', 'Disease_B', 'Disease_C', 'Disease_D']
K = len(class_names)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, stratify=y, random_state=42
)
print("Class distribution (train):")
for i, name in enumerate(class_names):
    count = np.sum(y_train == i)
    print(f"  {name}: {count} ({count/len(y_train)*100:.1f}%)")
# ===== 2. Preprocessing =====
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
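# ===== 3. Train Multiple Classifiers =====
# Note: LogisticRegression's multi_class argument is deprecated in recent
# scikit-learn releases, so OvR is expressed with an explicit wrapper, and
# the lbfgs solver fits the multinomial (softmax) model by default.
from sklearn.multiclass import OneVsRestClassifier
models = {
    'Logistic (OvR)': OneVsRestClassifier(LogisticRegression(
        max_iter=1000, class_weight='balanced', random_state=42
    )),
    'Logistic (Multinomial)': LogisticRegression(
        solver='lbfgs', max_iter=1000,
        class_weight='balanced', random_state=42
    ),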
'Random Forest': RandomForestClassifier(
n_estimators=200, max_depth=10,
class_weight='balanced', random_state=42
),
'SVM (OvO)': SVC(
kernel='rbf', decision_function_shape='ovo',
class_weight='balanced', probability=True, random_state=42
),
}
results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    y_proba = model.predict_proba(X_test_s)
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'macro_f1': f1_score(y_test, y_pred, average='macro'),
        'micro_f1': f1_score(y_test, y_pred, average='micro'),
        'weighted_f1': f1_score(y_test, y_pred, average='weighted'),
        'kappa': cohen_kappa_score(y_test, y_pred),
        'mcc': matthews_corrcoef(y_test, y_pred),
        'auc_ovr': roc_auc_score(
            label_binarize(y_test, classes=range(K)),
            y_proba, multi_class='ovr', average='macro'
        ),
    }
# ===== 4. Display Results =====
print(f"\n{'Model':<25} {'Acc':>6} {'MacF1':>6} {'MicF1':>6} "
f"{'WtF1':>6} {'Kappa':>6} {'MCC':>6} {'AUC':>6}")
print("=" * 85)
for name, r in results.items():
    print(f"{name:<25} {r['accuracy']:>6.3f} {r['macro_f1']:>6.3f} "
          f"{r['micro_f1']:>6.3f} {r['weighted_f1']:>6.3f} "
          f"{r['kappa']:>6.3f} {r['mcc']:>6.3f} {r['auc_ovr']:>6.3f}")
# ===== 5. Detailed Report for Best Model =====
best_model_name = max(results, key=lambda k: results[k]['macro_f1'])
best_model = models[best_model_name]
y_pred_best = best_model.predict(X_test_s)
print(f"\n{'='*50}")
print(f"Best Model: {best_model_name}")
print(f"{'='*50}")
print(classification_report(y_test, y_pred_best,
target_names=class_names, digits=4))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_best))
8.11.2 ROC Curves for Multi-Class (OvR)
Scikit-Learn
# Multi-class ROC curve plotting (one curve per class)
import matplotlib
matplotlib.use('Agg') # Non-interactive backend
import matplotlib.pyplot as plt
# Get probabilities from best model
y_proba_best = best_model.predict_proba(X_test_s)
y_test_bin = label_binarize(y_test, classes=range(K))
# Compute ROC curve for each class
plt.figure(figsize=(10, 8))
for i in range(K):
    fpr_i, tpr_i, _ = roc_curve(y_test_bin[:, i], y_proba_best[:, i])
    auc_i = auc(fpr_i, tpr_i)
    plt.plot(fpr_i, tpr_i, linewidth=2,
             label=f'{class_names[i]} (AUC = {auc_i:.3f})')
# Random baseline
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC = 0.500)')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('Multi-Class ROC Curves (One-vs-Rest)', fontsize=14)
plt.legend(loc='lower right', fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('multiclass_roc.png', dpi=150)
print("ROC curves saved to multiclass_roc.png")
8.11.3 SMOTE with imbalanced-learn
Scikit-Learn
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek
from imblearn.pipeline import Pipeline as ImbPipeline
# Strategy 1: SMOTE
smote = SMOTE(random_state=42, k_neighbors=5)
X_res, y_res = smote.fit_resample(X_train_s, y_train)
print(f"After SMOTE: {np.bincount(y_res)} (was {np.bincount(y_train)})")
# Strategy 2: SMOTE + Tomek Links (clean noisy boundary)
smote_tomek = SMOTETomek(random_state=42)
X_st, y_st = smote_tomek.fit_resample(X_train_s, y_train)
# Strategy 3: Integrated Pipeline
pipeline = ImbPipeline([
('scaler', StandardScaler()),
('smote', SMOTE(random_state=42)),
('classifier', RandomForestClassifier(
n_estimators=200, random_state=42
))
])
# Cross-validation with SMOTE inside the pipeline
# (CRITICAL: SMOTE must be inside CV to prevent data leakage!)
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
pipeline, X_train, y_train,
cv=StratifiedKFold(5, shuffle=True, random_state=42),
scoring='f1_macro'
)
print(f"\nSMOTE + RF, 5-Fold Macro-F1: {scores.mean():.4f} ± {scores.std():.4f}")
8.11.4 Threshold Tuning for Business Objectives
Scikit-Learn
def optimize_threshold_per_class(y_true, y_proba, class_idx,
                                 metric='f1', class_name=''):
    """
    Find optimal threshold for a specific class to maximize a metric.
    In multi-class, we binarize: class_idx vs rest, then sweep thresholds.
    """
    y_binary = (y_true == class_idx).astype(int)
    scores = y_proba[:, class_idx]
    best_threshold = 0.5
    best_metric_val = 0
    for threshold in np.arange(0.1, 0.95, 0.01):
        y_pred_binary = (scores >= threshold).astype(int)
        tp = np.sum((y_pred_binary == 1) & (y_binary == 1))
        fp = np.sum((y_pred_binary == 1) & (y_binary == 0))
        fn = np.sum((y_pred_binary == 0) & (y_binary == 1))
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        f1 = 2*precision*recall/(precision+recall) if (precision+recall) > 0 else 0
        metric_val = f1 if metric == 'f1' else recall
        if metric_val > best_metric_val:
            best_metric_val = metric_val
            best_threshold = threshold
    print(f"  {class_name}: optimal threshold = {best_threshold:.2f}, "
          f"best {metric} = {best_metric_val:.4f}")
    return best_threshold
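# Find optimal thresholds
# NOTE: for brevity the sweep below runs on the test split. In practice,
# tune thresholds on a separate validation split; otherwise the tuned
# score reported afterwards is optimistically biased.
print("Optimal Thresholds (maximizing F1 per class):")
y_proba_best = best_model.predict_proba(X_test_s)
optimal_thresholds = {}
for i, name in enumerate(class_names):
    optimal_thresholds[i] = optimize_threshold_per_class(
        y_test, y_proba_best, i, metric='f1', class_name=name
    )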
# Apply custom thresholds for prediction
def predict_with_thresholds(y_proba, thresholds):
    """Predict using per-class optimal thresholds."""
    adjusted_proba = np.zeros_like(y_proba)
    for k in range(y_proba.shape[1]):
        adjusted_proba[:, k] = y_proba[:, k] / thresholds.get(k, 0.5)
    return np.argmax(adjusted_proba, axis=1)
y_pred_tuned = predict_with_thresholds(y_proba_best, optimal_thresholds)
print(f"\nTuned Macro-F1: {f1_score(y_test, y_pred_tuned, average='macro'):.4f}")
print(f"Default Macro-F1: {f1_score(y_test, y_pred_best, average='macro'):.4f}")
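The sweep above tunes and scores thresholds on the same test split, which tends to flatter the tuned result. A minimal sketch of one alternative, assuming a three-way train/validation/test split on synthetic data; the dataset, model, and variable names here are illustrative and not taken from the pipeline above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class data (illustrative only)
X, y = make_classification(n_samples=3000, n_features=15, n_informative=10,
                           n_classes=3, n_clusters_per_class=1,
                           weights=[0.6, 0.3, 0.1], random_state=0)

# Hold out the test set first, then carve a validation set from the rest
X_trval, X_te, y_trval, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trval, y_trval, test_size=0.25, stratify=y_trval, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

rare = 2  # index of the rarest class
val_scores = clf.predict_proba(X_val)[:, rare]
grid = np.arange(0.05, 0.95, 0.01)

# Choose the threshold using validation data only
val_f1 = [f1_score((y_val == rare).astype(int),
                   (val_scores >= t).astype(int), zero_division=0)
          for t in grid]
best_t = float(grid[int(np.argmax(val_f1))])

# Report the chosen threshold's F1 once, on the untouched test split
test_scores = clf.predict_proba(X_te)[:, rare]
test_f1 = f1_score((y_te == rare).astype(int),
                   (test_scores >= best_t).astype(int), zero_division=0)
print(f"Validation-tuned threshold: {best_t:.2f}, test F1: {test_f1:.4f}")
```

Because the test split plays no part in choosing `best_t`, the reported F1 is an honest estimate of how the tuned threshold would behave on new data.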
SMOTE must NEVER be applied before train-test split or cross-validation. Applying SMOTE to the entire dataset before splitting causes synthetic minority samples to leak information from the test set into training, leading to severely inflated performance metrics. Always apply SMOTE only within the training fold. The imblearn.pipeline.Pipeline handles this correctly.
Indian Case Studies
Case Study 1: AYUSH Disease Classification (10+ Diseases from Symptoms)
Background
India's Ministry of AYUSH (Ayurveda, Yoga and Naturopathy, Unani, Siddha, and Homoeopathy), in collaboration with AIIMS Delhi, developed a multi-class classifier to triage patients at Primary Health Centers (PHCs) in rural India, where specialist doctors are scarce. The system classifies patient symptoms into 12 disease categories.
Problem Formulation
- Classes (K=12): Malaria, Dengue, Typhoid, Tuberculosis, Diabetes, Hypertension, Anemia, Pneumonia, Gastroenteritis, Urinary Tract Infection, Common Cold, Dermatitis
- Features (48): Binary symptom indicators (fever, cough, headache, etc.), vitals (temperature, BP, pulse), demographics (age, gender, BMI), lab results (if available)
- Imbalance: Common Cold represents 28% of cases; Tuberculosis only 3%
Implementation
Python
# Simplified AYUSH Disease Classifier
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
# Disease labels
diseases = [
'Malaria', 'Dengue', 'Typhoid', 'TB', 'Diabetes',
'Hypertension', 'Anemia', 'Pneumonia', 'Gastroenteritis',
'UTI', 'Common_Cold', 'Dermatitis'
]
# Symptom features (simplified)
symptom_names = [
'fever', 'high_fever', 'cough', 'dry_cough', 'headache',
'body_pain', 'fatigue', 'nausea', 'vomiting', 'diarrhea',
'rash', 'joint_pain', 'chills', 'night_sweats',
'weight_loss', 'frequent_urination', 'blood_pressure_high',
'shortness_breath', 'chest_pain', 'burning_urination'
]
# Generate realistic synthetic data
np.random.seed(42)
n_samples = 5000
n_features = len(symptom_names)
K = len(diseases)
# Class weights reflecting Indian epidemiological distribution
class_probs = [0.08, 0.06, 0.05, 0.03, 0.10, 0.09,
0.07, 0.05, 0.08, 0.04, 0.28, 0.07]
y = np.random.choice(K, size=n_samples, p=class_probs)
# Generate symptoms based on disease (simplified disease-symptom profiles)
X = np.random.binomial(1, 0.15, size=(n_samples, n_features))
# Add disease-specific symptom patterns
disease_symptom_map = {
    0: [0, 1, 12, 6],        # Malaria: fever, high_fever, chills, fatigue
    1: [0, 1, 4, 11, 10],    # Dengue: fever, high_fever, headache, joint_pain, rash
    2: [0, 4, 6, 5],         # Typhoid: fever, headache, fatigue, body_pain
    3: [2, 13, 14, 6, 17],   # TB: cough, night_sweats, weight_loss, fatigue, shortness_breath
    4: [14, 15, 6],          # Diabetes: weight_loss, frequent_urination, fatigue
}
# Diseases without an entry above keep only the 15% background symptom rate
for i, label in enumerate(y):
    if label in disease_symptom_map:
        for symptom_idx in disease_symptom_map[label]:
            X[i, symptom_idx] = np.random.binomial(1, 0.85)
# Train with class weights
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42  # fixed seed for reproducibility
)
clf = GradientBoostingClassifier(
n_estimators=200, max_depth=5,
random_state=42
)
# Compute sample weights to handle imbalance
from sklearn.utils.class_weight import compute_sample_weight
sample_weights = compute_sample_weight('balanced', y_tr)
clf.fit(X_tr, y_tr, sample_weight=sample_weights)
y_pred = clf.predict(X_te)
print("AYUSH Disease Classifier: Classification Report")
print(classification_report(y_te, y_pred,
target_names=diseases, digits=3, zero_division=0))
Results & Impact
- Weighted F1: 0.82 across 12 diseases
- TB Recall: 0.89, critical for early detection in underserved areas
- Deployment: Piloted in 150 PHCs across Bihar and Jharkhand
- Impact: Reduced average diagnosis time from 45 min to 8 min; 23% improvement in TB early detection rate
Case Study 2: Indian Crop Type Classification (Satellite Imagery)
Background
ISRO's Bhuvan platform, in partnership with ICAR (Indian Council of Agricultural Research), classifies crop types from satellite imagery to estimate crop area, predict yields, and plan procurement, all of which are critical for India's food security and Minimum Support Price (MSP) policy.
Problem Setup
- Classes (K=15): Rice, Wheat, Sugarcane, Cotton, Maize, Soybean, Groundnut, Mustard, Gram (Chickpea), Arhar (Tur), Jute, Tea, Coffee, Banana, Coconut
- Features: Multi-spectral satellite bands (RGB, NIR, SWIR), vegetation indices (NDVI, EVI), temporal profiles (monthly composites), terrain data
- Challenge: Same crop looks different across India's agro-climatic zones; Rice in Punjab vs Kerala has very different spectral signatures
Key Technique: Hierarchical Classification
Python
# Hierarchical crop classification strategy
# Level 1: Broad category (Cereal, Oilseed, Cash, Plantation, Pulse)
# Level 2: Specific crop within category
hierarchy = {
'Cereal': ['Rice', 'Wheat', 'Maize'],
'Cash_Crop': ['Sugarcane', 'Cotton', 'Jute'],
'Oilseed': ['Soybean', 'Groundnut', 'Mustard'],
'Pulse': ['Gram', 'Arhar'],
'Plantation': ['Tea', 'Coffee', 'Banana', 'Coconut']
}
# Level 1: one 5-class classifier (broad category)
# Level 2: one 2-4 class classifier per category
# Total: 1 + 5 = 6 classifiers, with 5 + 3 + 3 + 3 + 2 + 4 = 20 output classes overall
# Each is simpler, and in this case study more accurate, than one 15-class classifier
print("Hierarchical Classification Strategy:")
for category, crops in hierarchy.items():
    print(f"  {category}: {', '.join(crops)} ({len(crops)} classes)")
print(f"\nTotal fine-grained classes: "
f"{sum(len(c) for c in hierarchy.values())}")
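The two-level routing described above can be sketched on toy data. This is a minimal illustration, assuming two groups of two classes each and Gaussian clusters; the data, the `group_of` mapping, and the helper `predict_hierarchical` are hypothetical and unrelated to the actual crop imagery:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: 4 fine classes in 2 groups (not real crop data)
group_of = np.array([0, 0, 1, 1])             # fine class index -> group index
centers = np.array([[0.0, 0.0], [0.0, 3.0],
                    [10.0, 0.0], [10.0, 3.0]])
y = rng.integers(0, 4, size=400)
X = centers[y] + rng.normal(scale=0.5, size=(400, 2))

# Level 1: a single classifier over groups
level1 = LogisticRegression(max_iter=1000).fit(X, group_of[y])

# Level 2: one classifier per group, trained only on that group's samples
level2 = {}
for g in np.unique(group_of):
    mask = group_of[y] == g
    level2[int(g)] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])

def predict_hierarchical(X_new):
    """Route each sample through Level 1, then its group's Level 2 model."""
    g_pred = level1.predict(X_new)
    fine_pred = np.empty(len(X_new), dtype=int)
    for g, clf in level2.items():
        mask = g_pred == g
        if mask.any():
            fine_pred[mask] = clf.predict(X_new[mask])
    return fine_pred

y_hat = predict_hierarchical(X)
n_classifiers = 1 + len(level2)
print(f"Classifiers trained: {n_classifiers}")
print(f"Training accuracy:   {np.mean(y_hat == y):.3f}")
```

One design consequence worth noting: a Level 1 mistake cannot be repaired at Level 2, so the broad-category classifier's accuracy caps the accuracy of the whole hierarchy.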
Results
- Overall accuracy: 87.3% (flat classifier: 79.1%)
- Key insight: Hierarchical approach improved Macro-F1 from 0.72 to 0.84
- Coverage: 12 states, 3 seasons (Kharif, Rabi, Zaid)
- Used for: Crop insurance (PMFBY) claim verification, a ₹30,000 crore scheme
Case Study 3: IIT-JEE Rank Prediction (Multi-Tier Classification)
Problem
An EdTech company built a multi-class classifier to predict students' JEE Advanced rank brackets based on their preparation data, helping them focus on the right topics.
Classes (K=6)
- Tier 1 (Rank 1-500): Top IITs, CSE/EE/ME
- Tier 2 (501-2000): Top IITs, other branches
- Tier 3 (2001-5000): Mid IITs
- Tier 4 (5001-10000): Lower IITs, top NITs
- Tier 5 (10001+): Other NITs, IIITs
- Not Qualified: Did not qualify JEE Advanced
Features
- Mock test scores (30 tests, subject-wise: Physics, Chemistry, Math)
- Time-per-question distribution
- Topic-wise accuracy (50 topics)
- Study hours per week, consistency score
- Previous year performance trends
Key Finding
Ordinal information matters! Predicting Tier 1 as Tier 2 is far less harmful than predicting Tier 1 as Tier 5. They used ordinal regression (a variant of multi-class classification that respects class ordering) with a custom cost matrix:
Python
# Cost matrix: cost[true_tier][predicted_tier]
# Higher cost for predictions that are far from the true tier
import numpy as np
K_tiers = 6
cost_matrix = np.zeros((K_tiers, K_tiers))
for i in range(K_tiers):
    for j in range(K_tiers):
        cost_matrix[i][j] = abs(i - j) ** 1.5  # Superlinear penalty
print("Cost Matrix (rows=true, cols=predicted):")
print(np.round(cost_matrix, 2))
# Cost of predicting Tier 1 as Tier 5: |0-4|^1.5 = 8.0
# Cost of predicting Tier 1 as Tier 2: |0-1|^1.5 = 1.0
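One common way to put such a cost matrix to work is to predict the tier with the lowest expected cost rather than the most probable tier. The source does not say exactly how the company applied its matrix, so the following is only a sketch under that assumption; the probability vector and the helper `min_expected_cost` are illustrative:

```python
import numpy as np

K_tiers = 6
idx = np.arange(K_tiers)
# cost[i, j] = |i - j| ** 1.5, the same rule as the cost matrix above
cost = np.abs(idx[:, None] - idx[None, :]) ** 1.5

def min_expected_cost(probs, cost):
    """Pick the class j minimizing sum_i probs[i] * cost[i, j]."""
    expected = probs @ cost   # shape (K,): expected cost of predicting each class
    return int(np.argmin(expected)), expected

# Hypothetical predicted distribution for one student (illustrative values)
probs = np.array([0.35, 0.30, 0.30, 0.05, 0.0, 0.0])
choice, expected = min_expected_cost(probs, cost)

print("Most probable tier index:     ", int(np.argmax(probs)))
print("Min-expected-cost tier index: ", choice)
print("Expected costs:", np.round(expected, 3))
```

Here the most probable class is index 0, but index 1 has the lower expected cost because it sits closer to the bulk of the probability mass, which is the behavior an ordinal cost matrix is meant to encourage.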
Results
- Exact tier accuracy: 61%
- Within-one-tier accuracy: 89%
- Used by 50,000+ JEE aspirants for personalized preparation strategy
India sees roughly 12 lakh (1.2 million) JEE Main candidates annually. ML-based rank prediction systems are used by major EdTech platforms (Unacademy, BYJU'S, PhysicsWallah) to provide personalized guidance. The ethical consideration: these predictions must be communicated carefully to avoid demotivating students or creating self-fulfilling prophecies.
Global Case Studies
Case Study 1: MNIST Digit Classification (The "Hello World" of Multi-Class ML)
Overview
The MNIST dataset (Modified National Institute of Standards and Technology) contains 70,000 handwritten digit images (28×28 pixels, grayscale) across 10 classes (digits 0-9). Assembled by Yann LeCun, Corinna Cortes, and Christopher Burges in 1998, it remains one of the most widely used benchmarks for multi-class classification.
Key Metrics History
| Year | Method | Error Rate | K-Class Handling |
|---|---|---|---|
| 1998 | Linear Classifier | 12.0% | 10-way softmax |
| 1998 | LeNet-5 (CNN) | 0.95% | 10-way softmax |
| 2003 | SVM (RBF kernel) | 0.56% | OvO with voting |
| 2012 | Deep CNN + Dropout | 0.23% | Softmax + CCE |
| 2020 | Ensemble + Augmentation | 0.16% | Softmax ensemble |
Implementation
TensorFlow
import tensorflow as tf
from tensorflow.keras import layers
# Load MNIST
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
# Preprocess
X_train = X_train.reshape(-1, 28*28).astype('float32') / 255.0
X_test = X_test.reshape(-1, 28*28).astype('float32') / 255.0
print(f"Training: {X_train.shape}, Classes: {len(set(y_train))}")
print(f"Class distribution: {[sum(y_train==i) for i in range(10)]}")
# Simple but effective model
model = tf.keras.Sequential([
layers.Input(shape=(784,)),
layers.Dense(256, activation='relu'),
layers.Dropout(0.3),
layers.Dense(128, activation='relu'),
layers.Dropout(0.2),
layers.Dense(10, activation='softmax') # 10-class softmax
])
model.compile(
optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
history = model.fit(X_train, y_train, validation_split=0.1,
epochs=20, batch_size=128, verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nMNIST Test Accuracy: {test_acc:.4f}")
print(f"MNIST Test Loss: {test_loss:.4f}")
# Per-class analysis
import numpy as np
y_pred = np.argmax(model.predict(X_test), axis=1)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, digits=4))
Common Confusion Patterns
The most commonly confused digit pairs on MNIST are:
- 4 ↔ 9: Similar upper stroke structure
- 3 ↔ 5: Mirror-like curves
- 7 ↔ 1: Stroke angle ambiguity
Case Study 2: ImageNet & Top-5 Accuracy
Overview
The ImageNet ILSVRC challenge (2010-2017) involved classifying images into 1,000 categories, from "great white shark" to "ballpoint pen." With 1.2 million training images and K=1000, it popularized Top-5 Accuracy as an evaluation metric.
Top-5 vs Top-1 Accuracy
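Under Top-5 accuracy, a prediction counts as correct if the true label is among the model's five highest-probability classes; Top-1 requires the single most probable class to match. Top-5 is deliberately lenient: it accounts for genuinely ambiguous images (e.g., a photo might reasonably be labeled "Labrador retriever" or "golden retriever").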
| Year | Model | Top-5 Error | Top-1 Error | Breakthrough |
|---|---|---|---|---|
| 2012 | AlexNet | 15.3% | 36.7% | Deep CNNs + GPU training |
| 2014 | GoogLeNet | 6.67% | - | Inception modules |
| 2015 | ResNet-152 | 3.57% | - | Skip connections (152 layers!) |
| 2017 | SENet | 2.25% | - | Channel attention |
| 2020 | ViT-H/14 | - | 12.1% | Vision Transformers |
| 2023 | EVA-02 | - | 9.7% | CLIP-guided pretraining |
Human performance on ImageNet: ~5.1% Top-5 error (Russakovsky et al., 2015). ResNet surpassed this human-level benchmark in 2015, a landmark moment in AI history.
Python
import numpy as np

def top_k_accuracy(y_true, y_pred_probs, k=5):
    """
    Compute Top-K accuracy.
    Args:
        y_true: true labels, shape (N,)
        y_pred_probs: predicted probabilities, shape (N, K)
        k: consider top-k predictions
    """
    n_samples = len(y_true)
    correct = 0
    for i in range(n_samples):
        # Indices of the k highest-probability classes for sample i
        top_k_preds = np.argsort(y_pred_probs[i])[::-1][:k]
        if y_true[i] in top_k_preds:
            correct += 1
    return correct / n_samples
# Demo with synthetic 1000-class data
np.random.seed(42)
n_demo = 500
K_imagenet = 1000
y_true_demo = np.random.randint(0, K_imagenet, n_demo)
y_probs_demo = np.random.dirichlet(np.ones(K_imagenet), n_demo)
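# Boost the true class's probability. A uniform Dirichlet over 1000 classes
# gives each class about 0.001, so adding 0.3 makes this demo far better than random.
for i in range(n_demo):
    y_probs_demo[i, y_true_demo[i]] += 0.3
    y_probs_demo[i] /= y_probs_demo[i].sum()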
top1 = top_k_accuracy(y_true_demo, y_probs_demo, k=1)
top5 = top_k_accuracy(y_true_demo, y_probs_demo, k=5)
print(f"Top-1 Accuracy: {top1:.4f}")
print(f"Top-5 Accuracy: {top5:.4f}")
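For larger evaluation sets, the per-sample loop above can be replaced by a vectorized computation, and scikit-learn also provides `top_k_accuracy_score` for the same metric. A small sketch on hand-made scores; the four-sample example and the helper name `top_k_accuracy_vectorized` are illustrative:

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

y_true = np.array([0, 1, 2, 2])
y_score = np.array([
    [0.6, 0.3, 0.1],   # true class 0 is ranked 1st
    [0.5, 0.4, 0.1],   # true class 1 is ranked 2nd
    [0.5, 0.3, 0.2],   # true class 2 is ranked 3rd
    [0.1, 0.2, 0.7],   # true class 2 is ranked 1st
])

def top_k_accuracy_vectorized(y_true, y_score, k):
    # argpartition puts the k largest scores in the last k columns (unordered)
    top_k = np.argpartition(y_score, -k, axis=1)[:, -k:]
    return float(np.mean((top_k == y_true[:, None]).any(axis=1)))

top1 = top_k_accuracy_vectorized(y_true, y_score, k=1)
top2 = top_k_accuracy_vectorized(y_true, y_score, k=2)
sk_top2 = top_k_accuracy_score(y_true, y_score, k=2)

print(f"Top-1: {top1:.2f}, Top-2: {top2:.2f}, sklearn Top-2: {sk_top2:.2f}")
```

`np.argpartition` avoids a full sort of each row, which matters when K is in the thousands, as with ImageNet.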
ML Engineer, Evaluation & Monitoring: Companies like Google, Meta, and Amazon have dedicated teams for ML evaluation infrastructure. These teams build dashboards for tracking per-class metrics, detecting model drift, and alerting when minority class performance degrades. Salaries: ₹25-50 LPA in India, $120-200K in the US. Key skills: multi-class metrics, statistical testing, A/B testing for ML.
Startup Applications
1. Niramai (Bangalore): Breast Cancer Screening
Uses thermography images classified into 5 categories: Normal, Benign, Suspicious, Probably Malignant, Highly Suggestive of Malignancy (BI-RADS categories). Their multi-class classifier achieves 95% sensitivity on "Suspicious+" categories, critical for a cancer screening tool where missing a positive case can be fatal. They optimize for recall on the malignant classes while tolerating some false positives (which lead to additional testing, not harm).
2. Cropin (Bangalore): Agricultural Intelligence
Classifies satellite imagery into 20+ crop types across 56 countries. They handle extreme class imbalance (common crops like Rice dominate, while specialty crops like Saffron have very few labeled pixels) using a combination of SMOTE, hierarchical classification, and focal loss. Their system is used by 250+ agricultural enterprises and covers 16 million acres.
3. SigTuple (Bangalore): Blood Cell Classification
Their AI microscope (Shonit) classifies white blood cells into 5 types (Neutrophils, Lymphocytes, Monocytes, Eosinophils, Basophils) plus abnormal variants. The confusion matrix analysis revealed systematic misclassification of band neutrophils, leading to a targeted data augmentation strategy that improved accuracy from 89% to 96% for this critical cell type.
4. Haptik (Mumbai, now Jio): Intent Classification
Classifies user queries into 100+ intents (book_flight, check_balance, complaint, etc.) using multi-class text classification. They use weighted F1 for evaluation because some intents (e.g., "emergency_help") require near-perfect recall even though they represent <0.1% of queries. Cost-sensitive learning assigns 50× higher misclassification cost to emergency intents.
Government Applications
1. CBDT: Income Tax Return Category Classification
India's Central Board of Direct Taxes classifies ~7 crore ITRs annually into risk categories: Compliant, Minor Discrepancy, Potential Evasion, Serious Fraud. The multi-class classifier (using XGBoost with class weights) has a 92% weighted-F1, but the key metric is recall for the "Serious Fraud" class, ensuring that genuinely fraudulent returns are flagged for audit.
2. ISRO: Land Use/Land Cover (LULC) Classification
ISRO's Resourcesat-2 satellite imagery is classified into 18 land use categories (cropland, forest, urban, water body, barren, etc.) across India. The National Remote Sensing Centre (NRSC) produces annual LULC maps at 56m resolution. Macro-F1 is the primary metric because rare categories (glaciers, mangroves) are as important for environmental monitoring as common categories (cropland).
3. Indian Judiciary: Case Type Classification
The e-Courts project classifies 4+ crore pending cases into categories (civil, criminal, family, consumer, labor, constitutional) for workload distribution and priority scheduling. The system uses NLP-based multi-class classification on case filings written in 12+ Indian languages, handling both Hindi and English alongside regional languages.
Industry Applications
1. Google Gmail: Email Category Classification
Gmail classifies emails into Primary, Social, Promotions, Updates, and Forums (5 classes). With billions of emails daily, even tiny metric improvements matter. Google uses multi-class logistic regression with rich features (sender reputation, content signals, user behavior). Key metrics: precision for "Primary" (don't show promotions in Primary) and recall for "Primary" (don't hide important emails).
2. Tesla Autopilot: Object Classification
Tesla's perception system classifies detected objects into 8+ categories: Car, Truck, Motorcycle, Bicycle, Pedestrian, Traffic Light, Traffic Sign, Cone. For autonomous driving, the cost matrix is highly asymmetric: misclassifying a Pedestrian as a Cone is catastrophically worse than the reverse. They use focal loss (a variant of cross-entropy that down-weights easy examples) to handle the massive class imbalance (99% of detected objects are vehicles, <1% are pedestrians in most driving scenarios).
3. Netflix: Content Genre Classification
Netflix classifies content into 70,000+ micro-genres (not just "Action" or "Comedy" but "Dark Scandinavian Crime Thrillers"). This is a multi-label, multi-class problem. Their evaluation uses precision@k (top-k precision) because the ordering of genre tags matters for recommendation quality, not just the binary correct/incorrect classification.
4. Flipkart: Product Categorization
India's largest e-commerce platform classifies millions of product listings into a 5-level taxonomy with 3000+ leaf categories. They use hierarchical softmax, predicting the category path level by level (Electronics → Mobile → Smartphone → Brand → Model). This reduces the 3000-way classification into a sequence of smaller decisions, improving both accuracy and inference speed.
Focal Loss (Lin et al., 2017) is widely used for handling class imbalance in multi-class problems. The formula: FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where γ controls the down-weighting of easy examples. With γ = 2, a well-classified example (p_t = 0.9) has its loss scaled by (1 - 0.9)^2 = 0.01, a 100× reduction relative to cross-entropy, while a hard example (p_t = 0.1) is scaled by only 0.81. This was originally developed for object detection at Facebook AI Research.
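The down-weighting can be checked numerically: dividing the focal loss by plain cross-entropy leaves just the modulating factor (1 - p_t)^γ. A minimal NumPy sketch, assuming α_t = 1 and γ = 2; the function names are illustrative:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Focal loss for the probability p_t assigned to the true class."""
    p_t = np.asarray(p_t, dtype=float)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

def cross_entropy(p_t):
    """Standard cross-entropy for the true-class probability p_t."""
    return -np.log(np.asarray(p_t, dtype=float))

ratios = {}
for p in (0.9, 0.5, 0.1):
    # With alpha_t = 1, FL / CE equals the modulating factor (1 - p) ** gamma
    ratios[p] = float(focal_loss(p) / cross_entropy(p))
    print(f"p_t = {p:.1f}: FL / CE = {ratios[p]:.2f}")
```

Setting `gamma=0` recovers ordinary (α-weighted) cross-entropy, which makes γ a convenient single knob for how strongly easy examples are suppressed.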
Mini Projects
Mini Project 1: Disease Diagnostic System
Objective
Build a complete multi-class disease diagnostic system that classifies patient symptoms into one of 8 diseases, handles class imbalance, and provides calibrated probability estimates with uncertainty.
Dataset
Use the "Disease Symptoms and Patient Profile" dataset (available on Kaggle), or generate synthetic data with realistic symptom-disease correlations.
Requirements
- Data preprocessing: handle missing symptom data, encode categorical variables
- Implement at least 3 classifiers: Logistic Regression (multinomial), Random Forest, and a Neural Network
- Handle class imbalance using SMOTE + class weights
- Generate a complete evaluation report: confusion matrix, per-class P/R/F1, macro/micro/weighted averages, Cohen's Kappa, MCC
- Plot ROC curves and PR curves for each class
- Implement threshold tuning for the rarest disease class
- Build a simple inference function that takes symptoms and returns top-3 diagnoses with probabilities
Starter Code
Python
"""
Mini Project 1: Disease Diagnostic System
Complete template: fill in the TODO sections
"""
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
roc_auc_score, cohen_kappa_score, matthews_corrcoef)
from sklearn.preprocessing import label_binarize
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
# ===== Step 1: Generate/Load Data =====
diseases = ['Flu', 'Malaria', 'Dengue', 'Typhoid',
'TB', 'Pneumonia', 'Diabetes', 'Anemia']
K = len(diseases)
# Imbalanced distribution
X, y = make_classification(
n_samples=4000, n_features=25, n_informative=18,
n_classes=K, n_clusters_per_class=1,
weights=[0.25, 0.15, 0.12, 0.10, 0.08, 0.12, 0.10, 0.08],
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
print("Class Distribution:")
for i, d in enumerate(diseases):
    n_tr = sum(y_train == i)
    n_te = sum(y_test == i)
    print(f"  {d:<12}: train={n_tr:>4}, test={n_te:>3}")
# ===== Step 2: Build Pipeline with SMOTE =====
pipeline = ImbPipeline([
('scaler', StandardScaler()),
('smote', SMOTE(random_state=42)),
('clf', RandomForestClassifier(
n_estimators=300, max_depth=12,
class_weight='balanced', random_state=42
))
])
# ===== Step 3: Cross-Validation =====
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(5, shuffle=True, random_state=42)
cv_scores = cross_val_score(pipeline, X_train, y_train,
cv=cv, scoring='f1_macro')
print(f"\n5-Fold CV Macro-F1: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
# ===== Step 4: Train Final Model & Evaluate =====
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)
print(f"\n{'='*60}")
print("DISEASE DIAGNOSTIC SYSTEM: EVALUATION REPORT")
print(f"{'='*60}")
print(classification_report(y_test, y_pred,
target_names=diseases, digits=4))
print(f"Cohen's Kappa: {cohen_kappa_score(y_test, y_pred):.4f}")
print(f"MCC: {matthews_corrcoef(y_test, y_pred):.4f}")
y_test_bin = label_binarize(y_test, classes=range(K))
auc_macro = roc_auc_score(y_test_bin, y_proba,
multi_class='ovr', average='macro')
print(f"Macro AUC: {auc_macro:.4f}")
# ===== Step 5: Inference Function =====
def diagnose(symptoms, pipeline, diseases, top_k=3):
    """
    Given a symptom vector, return top-k diagnoses with probabilities.
    """
    probs = pipeline.predict_proba(symptoms.reshape(1, -1))[0]
    top_indices = np.argsort(probs)[::-1][:top_k]
    print("\n🏥 Diagnosis Results:")
    for rank, idx in enumerate(top_indices, 1):
        confidence = "HIGH" if probs[idx] > 0.5 else \
                     "MEDIUM" if probs[idx] > 0.2 else "LOW"
        print(f"  {rank}. {diseases[idx]:<12} "
              f"Probability: {probs[idx]:.3f} [{confidence}]")
    return top_indices, probs
# Test inference
test_patient = X_test[0]
diagnose(test_patient, pipeline, diseases, top_k=3)
Deliverables
- Complete Jupyter notebook with all steps
- ROC curves plot (8 classes overlaid)
- Confusion matrix heatmap
- 1-page report comparing at least 3 models
- Working inference function with top-3 diagnoses
Mini Project 2: Indian Crop Classifier
Objective
Build a crop type classifier using spectral/temporal features, implementing hierarchical classification and handling extreme class imbalance in agricultural data.
Starter Code
Python
"""
Mini Project 2: Indian Crop Classifier
Simulated satellite spectral features → Crop type
"""
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score
# Crop types and approximate area distribution in India
crops = {
'Rice': 0.22, # Largest by area
'Wheat': 0.16,
'Cotton': 0.08,
'Sugarcane': 0.06,
'Maize': 0.05,
'Soybean': 0.04,
'Groundnut': 0.04,
'Mustard': 0.03,
'Gram': 0.05,
'Arhar': 0.03,
'Jute': 0.02,
'Tea': 0.01, # Rare in overall statistics
'Coffee': 0.01,
'Banana': 0.015,
'Coconut': 0.015,
}
crop_names = list(crops.keys())
crop_probs = list(crops.values())
# Normalize
total = sum(crop_probs)
crop_probs = [p/total for p in crop_probs]
K = len(crop_names)
# Generate synthetic spectral features
np.random.seed(42)
n_samples = 8000
n_features = 20 # Spectral bands + vegetation indices + terrain
y = np.random.choice(K, size=n_samples, p=crop_probs)
X = np.random.randn(n_samples, n_features) * 0.5
# Add crop-specific spectral signatures
spectral_centers = np.random.randn(K, n_features) * 2
for i in range(n_samples):
    X[i] += spectral_centers[y[i]]
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
print("Crop Distribution (train):")
for i, name in enumerate(crop_names):
    count = np.sum(y_train == i)
    print(f"  {name:<12}: {count:>4} ({count/len(y_train)*100:>5.1f}%)")
# ===== Flat Classifier =====
clf_flat = GradientBoostingClassifier(
n_estimators=200, max_depth=5, random_state=42
)
from sklearn.utils.class_weight import compute_sample_weight
weights = compute_sample_weight('balanced', y_train)
clf_flat.fit(X_train, y_train, sample_weight=weights)
y_pred_flat = clf_flat.predict(X_test)
print("\n===== FLAT CLASSIFIER =====")
print(classification_report(y_test, y_pred_flat,
target_names=crop_names, digits=3, zero_division=0))
# ===== Hierarchical Classifier =====
# Level 1: Map to broad categories
hierarchy = {
0: 'Cereal', 1: 'Cereal', 4: 'Cereal', # Rice, Wheat, Maize
2: 'Cash', 3: 'Cash', 10: 'Cash', # Cotton, Sugarcane, Jute
5: 'Oilseed', 6: 'Oilseed', 7: 'Oilseed', # Soybean, Groundnut, Mustard
8: 'Pulse', 9: 'Pulse', # Gram, Arhar
11: 'Plantation', 12: 'Plantation', # Tea, Coffee
13: 'Plantation', 14: 'Plantation', # Banana, Coconut
}
categories = sorted(set(hierarchy.values()))
cat_to_idx = {c: i for i, c in enumerate(categories)}
y_train_cat = np.array([cat_to_idx[hierarchy[yi]] for yi in y_train])
y_test_cat = np.array([cat_to_idx[hierarchy[yi]] for yi in y_test])
# Level 1 model
clf_l1 = GradientBoostingClassifier(n_estimators=100, random_state=42)
clf_l1.fit(X_train, y_train_cat)
y_pred_cat = clf_l1.predict(X_test)
print(f"Level 1 (Category) Accuracy: "
      f"{np.mean(y_pred_cat == y_test_cat):.3f}")
flat_macro = f1_score(y_test, y_pred_flat, average='macro')
print(f"\nFlat Macro-F1: {flat_macro:.4f}")
print("(Hierarchical approach would improve this; "
      "left as exercise for the student)")
Extension Tasks
- Complete the hierarchical classifier (Level 2 classifiers for each category)
- Add temporal features (monthly NDVI profiles) and show accuracy improvement
- Implement and compare focal loss vs standard cross-entropy
- Create a geographic feature (state/district) and test zone-specific models
Mini Project 3: Multi-Class Sentiment Analysis
Objective
Build a 5-class sentiment classifier (Very Negative, Negative, Neutral, Positive, Very Positive) for product reviews, with special attention to the ordinal nature of sentiments.
Key Challenge
Unlike nominal classification, sentiment classes are ordered. Predicting "Very Positive" for a "Very Negative" review is much worse than predicting "Neutral" for a "Negative" review. Use a custom cost matrix and ordinal regression techniques.
Python
# Ordinal-aware evaluation for sentiment analysis
import numpy as np
sentiments = ['Very_Neg', 'Negative', 'Neutral', 'Positive', 'Very_Pos']
K = len(sentiments)
# Ordinal cost matrix: cost proportional to distance
cost_matrix = np.zeros((K, K))
for i in range(K):
    for j in range(K):
        cost_matrix[i][j] = (i - j) ** 2  # Quadratic penalty
print("Ordinal Cost Matrix:")
print(f"{'':>12}", end='')
for s in sentiments:
    print(f"{s:>10}", end='')
print()
for i, s in enumerate(sentiments):
    print(f"{s:>12}", end='')
    for j in range(K):
        print(f"{cost_matrix[i][j]:>10.0f}", end='')
    print()
# Mean Absolute Error (MAE) as ordinal metric
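def ordinal_mae(y_true, y_pred):
    """Mean Absolute Error treating classes as ordinal."""
    # Cast to float arrays so plain lists and unsigned integer labels also work
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))
# Quadratic Weighted Kappa: standard for ordinal classification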
from sklearn.metrics import cohen_kappa_score
# Use weights='quadratic' for ordinal data
# kappa = cohen_kappa_score(y_true, y_pred, weights='quadratic')
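The ordinal metrics above can be tried on a small hand-made example. The sketch below is illustrative only: the label vectors are invented, and `quadratic_weighted_kappa` is a hypothetical from-scratch helper written for this example (it is intended to match what `cohen_kappa_score(..., weights='quadratic')` computes, but that equivalence is not checked here). Both prediction vectors have the same accuracy; only the ordinal metrics separate them.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """From-scratch quadratic weighted kappa (illustrative helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Observed confusion matrix: observed[i, j] = true class i predicted as j
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Expected counts if true and predicted labels were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic disagreement weights, same form as the ordinal cost matrix
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical labels: 0 = Very_Neg ... 4 = Very_Pos
y_true = np.array([0, 1, 2, 3, 4, 4, 0, 2])
y_near = np.array([0, 2, 2, 3, 3, 4, 1, 2])  # three errors, each off by one class
y_far = np.array([0, 4, 2, 3, 1, 4, 3, 2])   # three errors, each off by three classes

results = {}
for name, y_pred in [("near misses", y_near), ("far misses", y_far)]:
    accuracy = np.mean(y_true == y_pred)
    mae = np.mean(np.abs(y_true - y_pred))
    qwk = quadratic_weighted_kappa(y_true, y_pred, n_classes=5)
    results[name] = (accuracy, mae, qwk)
    print(f"{name:<12} accuracy={accuracy:.3f}  MAE={mae:.3f}  QWK={qwk:.3f}")
```

Accuracy cannot tell the two models apart, while MAE and quadratic weighted kappa both penalize the model whose mistakes land far from the true sentiment.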
End-of-Chapter Exercises
E8.1 Compute softmax([3.0, 1.0, 0.2, -1.0]) by hand. Verify the probabilities sum to 1.
E8.2 Prove that softmax is invariant to adding a constant: softmax(z + c) = softmax(z). Why is this property useful for numerical stability?
E8.3 For K=2 classes, show that softmax reduces to the sigmoid function applied to z₁ - z₂. Write the derivation step-by-step.
E8.4 Given one-hot target y = [0, 0, 1, 0] and predicted probabilities ŷ = [0.1, 0.2, 0.6, 0.1], compute the categorical cross-entropy loss. What would the loss be if the model were perfectly confident (ŷ = [0, 0, 1, 0])?
E8.5 Construct the confusion matrix for the following 4-class predictions:
True: [A, B, C, D, A, A, B, C, D, D, A, B, C, C, D]
Pred: [A, B, C, D, A, B, B, C, A, D, A, C, C, D, D]
Compute precision, recall, and F1 for each class.
E8.6 From the confusion matrix in E8.5, compute Macro-F1, Micro-F1, and Weighted-F1. Verify that Micro-F1 = accuracy.
E8.7 A medical classifier has 3 classes: Healthy (1000 samples), Disease A (50 samples), Disease B (10 samples). Why would accuracy be a terrible metric here? Which metric(s) would you recommend and why?
E8.8 Derive the Jacobian of the softmax function: ∂σ(z)ᵢ/∂zⱼ = σ(z)ᵢ(δᵢⱼ - σ(z)ⱼ). Show all steps.
E8.9 Explain why SMOTE must be applied only to the training set and never before splitting. What error occurs if you apply SMOTE to the entire dataset first?
E8.10 For a 5-class problem, how many binary classifiers are needed for OvR? For OvO? If training one classifier takes 10 seconds, how long does each strategy take?
E8.11 A classifier has the following confusion matrix:
[[90, 5, 5], [10, 80, 10], [3, 7, 90]]
Compute Cohen's Kappa. Is this classifier "substantially" better than random?
E8.12 Explain the difference between macro-averaging and micro-averaging with an example where they give very different F1 scores.
E8.13 What is the "temperature" parameter in softmax? Given softmax_τ(z)ᵢ = e^{zᵢ/τ} / Σⱼ e^{zⱼ/τ}, what happens as τ → 0? As τ → ∞?
E8.14 Implement a function to compute the Matthews Correlation Coefficient for a 3×3 confusion matrix. Test it on the matrix from E8.11.
E8.15 A bank's fraud detection system classifies transactions into: Legitimate, Suspicious, Fraudulent. The cost of missing a Fraudulent transaction is 100× the cost of a false alarm. Design a cost matrix and explain how you would use it in training.
E8.16 Plot the ROC curve for the following binary classifier outputs (by hand or code):
True: [1,1,0,1,0,0,1,0,1,0]
Score: [0.9,0.8,0.7,0.6,0.55,0.4,0.3,0.2,0.85,0.15]
Compute the AUC.
E8.17 When should you use a Precision-Recall curve instead of an ROC curve? Give two real-world scenarios where PR curves are more informative.
E8.18 Implement label smoothing from scratch: given ε=0.1 and K=5, transform the one-hot vector [0,0,1,0,0] into a smoothed vector. Compute the cross-entropy loss with and without smoothing.
E8.19 Design a hierarchical classification scheme for classifying 20 types of Indian cuisine dishes. Define the hierarchy and explain why hierarchical classification might outperform flat classification here.
E8.20 A model outputs the following logits for 3 classes: z = [5.0, 5.0, 5.0]. What is the softmax output? What does this mean about the model's confidence? Now compute softmax([5.0, 5.0, 100.0]). What changes?
E8.21 Implement the Top-K accuracy metric from scratch. Test it with K=1,3,5 on a 10-class classification problem.
E8.22 [Research-level] The Focal Loss modifies cross-entropy as FL(pₜ) = -(1-pₜ)^γ log(pₜ). Derive the gradient of focal loss with respect to the logits. For what value of γ does focal loss reduce to standard cross-entropy?
E8.23 [Programming] Write a Python function that takes a confusion matrix of any size K×K and generates a complete classification report (matching sklearn's output format) without using any sklearn functions.
Multiple Choice Questions
Interview Questions
Answer Framework:
- Offline evaluation: Start with a held-out test set with the same distribution as production. Compute a confusion matrix to understand per-class performance. Report Macro-F1 (if all classes matter equally) or Weighted-F1 (if each class should count in proportion to its frequency).
- Beyond accuracy: Check Cohen's Kappa (is the model better than random?), MCC (balanced single metric), and class-specific precision/recall for critical classes.
- Threshold-free metrics: ROC-AUC (Macro OvR) for overall discriminative ability. PR-AUC for rare classes.
- Business metrics: Map ML metrics to business KPIs. E.g., for spam detection: "What percentage of legitimate emails are misclassified?" (the false positive rate of the spam class, i.e. 1 - recall of the legitimate class).
- Online monitoring: Track per-class accuracy over time windows, detect distribution shift, set up alerts when minority class recall drops below a threshold.
- Fairness: Check if performance varies across demographic subgroups (equalized odds, demographic parity).
Key Distinction:
- Softmax: Outputs K probabilities that sum to 1. Used for mutually exclusive multi-class (each sample belongs to exactly one class). E.g., digit recognition (a digit is exactly one of 0-9).
- K independent sigmoids: Each output is between 0 and 1 independently; they do NOT sum to 1. Used for multi-label classification (each sample can belong to multiple classes). E.g., movie genres (a movie can be both "Action" AND "Comedy").
Loss functions: Softmax → Categorical Cross-Entropy (sparse or one-hot). Sigmoid → Binary Cross-Entropy applied to each label independently.
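The distinction can be checked numerically. The following is a minimal sketch with made-up logits and targets (none of these values come from the chapter); it applies both output layers to the same logits and pairs each with its matching loss.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    """Element-wise logistic sigmoid."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 1.0, -1.0])  # hypothetical logits for 3 outputs

p_softmax = softmax(logits)   # mutually exclusive classes
p_sigmoid = sigmoid(logits)   # independent labels

print(f"softmax : {np.round(p_softmax, 3)}  sum = {p_softmax.sum():.3f}")
print(f"sigmoids: {np.round(p_sigmoid, 3)}  sum = {p_sigmoid.sum():.3f}")

# Multi-class target (exactly one class) with categorical cross-entropy
y_single = np.array([1.0, 0.0, 0.0])
cce = -np.sum(y_single * np.log(p_softmax))

# Multi-label target (two labels active) with per-label binary cross-entropy
y_multi = np.array([1.0, 1.0, 0.0])
bce = -np.mean(y_multi * np.log(p_sigmoid)
               + (1.0 - y_multi) * np.log(1.0 - p_sigmoid))

print(f"categorical CE = {cce:.3f}, mean binary CE = {bce:.3f}")
```

The softmax outputs compete for a fixed probability budget of 1, whereas the sigmoid outputs can all be high at once, which is what a multi-label target such as "Action" and "Comedy" requires.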
Diagnosis: This is a classic symptom of class imbalance. The model achieves high accuracy by correctly predicting the majority class(es) but performs poorly on minority classes.
Example: 3-class problem with 90% Class A, 5% Class B, 5% Class C. A model that always predicts Class A gets 90% accuracy. But F1 for B and C would be 0, so Macro-F1 = (F1_A + 0 + 0)/3 ≈ 0.32.
Solutions: (1) Class weights in loss function, (2) SMOTE/oversampling, (3) Focal loss, (4) Collect more minority class data, (5) Evaluate with Macro-F1 or per-class metrics, not accuracy.
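To make solution (1) concrete, the sketch below computes "balanced" class weights for the 90% / 5% / 5% example using the formula n_samples / (n_classes × class_count), which is the formula scikit-learn documents for `class_weight='balanced'`. The label vector is synthetic and the weighted-accuracy check is an illustration added here, not part of the original answer.

```python
import numpy as np

# Synthetic labels with the 90% / 5% / 5% split from the example
y = np.array([0] * 900 + [1] * 50 + [2] * 50)
n_classes = 3

# Balanced weights: rare classes get proportionally larger weights
counts = np.bincount(y, minlength=n_classes)
class_weights = len(y) / (n_classes * counts)
print("class weights:", np.round(class_weights, 3))

# A degenerate model that always predicts class 0 (Class A)
y_pred = np.zeros_like(y)
correct = (y_pred == y)

accuracy = correct.mean()
sample_w = class_weights[y]  # each sample inherits its class weight
weighted_accuracy = np.sum(sample_w * correct) / np.sum(sample_w)

print(f"plain accuracy    = {accuracy:.3f}")
print(f"weighted accuracy = {weighted_accuracy:.3f}")
```

Once every class carries the same total weight, the always-A model is credited for only one class out of three, so the weighting removes the incentive that produced the misleading 90% figure.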
How SMOTE works:
- For each minority class sample, find k nearest neighbors within the same class
- Randomly select one neighbor
- Generate a new synthetic sample on the line segment connecting them: x_new = x + λ·(x_neighbor - x), where λ ∈ [0,1]
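The three steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference SMOTE implementation: the function name `smote_sketch` and its parameters are hypothetical, base samples are drawn at random rather than visited in turn, and neighbors are found by brute-force distances.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority-class samples X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other samples

    # Step 1: k nearest neighbors within the minority class (brute force)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(n)                # base minority sample
        j = neighbors[i, rng.integers(k)]  # Step 2: pick one neighbor at random
        lam = rng.random()                 # interpolation factor in [0, 1)
        # Step 3: point on the segment between the sample and its neighbor
        synthetic[s] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic

# Toy minority class: corners and center of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_new = smote_sketch(X_min, n_new=10, k=2)
print(X_new.round(3))
```

Because every synthetic point is a convex combination of two real minority samples, it always lies between them. In a real pipeline this step would be applied to the training fold only.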
Limitations:
- Generates in feature space: May create unrealistic samples (e.g., a face that is half-male, half-female)
- Sensitive to noisy data: If a minority sample is noisy/outlier, SMOTE generates more noise around it
- Doesn't work well with high dimensions: Distance metrics become unreliable in high-D (curse of dimensionality)
- Must be inside CV: Applying SMOTE before splitting causes data leakage
Alternatives: Borderline-SMOTE (only generate near decision boundary), ADASYN (density-adaptive), Class Weights (no synthetic data), Focal Loss, Data Augmentation (for images/text).
Use PR-AUC when the positive class is rare (class imbalance ratio > 1:10). ROC-AUC can appear high even for poor classifiers on imbalanced data because FPR stays low when TN is very large.
Example: Fraud detection (0.1% fraud rate). A model that flags 1% of transactions as fraud might have: ROC-AUC = 0.95 (looks great!) but PR-AUC = 0.30 (actually poor: most flagged transactions are false positives).
Rule of thumb: If stakeholders care about "Of the items you flagged, how many are actually positive?" (precision), use PR curves. If they care about "Of all actual positives, how many did you find?" (recall) balanced against false alarm rate, use ROC.
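The gap between the two metrics can be reproduced on synthetic scores. The sketch below is an illustration under stated assumptions: the data are constructed by hand (10 positives among 10,000 samples, each positive outscoring roughly 90 to 99% of negatives), and `roc_auc` and `average_precision` are small from-scratch helpers written for this example rather than library functions.

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC as P(score_pos > score_neg); ties count as one half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of precision@rank at each positive's rank (assumes no tied scores)."""
    order = np.argsort(-scores)          # highest score first
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision_at_k[hits == 1].mean()

n_neg, n_pos = 9990, 10                  # 0.1% positive rate
y_true = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
scores = np.concatenate([
    np.linspace(0.0, 1.0, n_neg),        # negatives spread over [0, 1]
    np.linspace(0.90, 0.99, n_pos),      # positives rank high, but not at the top
])

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(f"ROC-AUC = {auc:.3f}, average precision (PR-AUC) = {ap:.3f}")
```

Each positive outranks most negatives, so ROC-AUC is high, yet about a hundred or more negatives still sit above every positive, so precision at any useful threshold stays near 1%.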
Step 1: Binarize. For each class k, create a "class k vs rest" view from the K×K confusion matrix. Extract TP_k, FP_k, FN_k.
Step 2: Compute per-class metrics: Precision_k, Recall_k, F1_k for each class independently.
Step 3: Aggregate using one of three strategies:
- Macro: Simple mean of per-class metrics. Treats all classes equally. Best when all classes are equally important.
- Micro: Pool all TP/FP/FN globally, then compute. In single-label multi-class, equals accuracy.
- Weighted: Weighted mean by class support (number of true samples). Accounts for class size.
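The three steps can be written directly against a confusion matrix. The matrix below is a hypothetical imbalanced example (rows are true classes, columns are predicted classes, following the scikit-learn convention); it is not taken from the chapter's exercises.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true, columns = predicted
cm = np.array([[90, 5, 5],
               [ 6, 3, 1],
               [ 4, 1, 5]])

# Step 1: one-vs-rest counts for every class
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp   # predicted as k but actually another class
fn = cm.sum(axis=1) - tp   # actually k but predicted as another class
support = cm.sum(axis=1)   # number of true samples per class

# Step 2: per-class metrics
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

# Step 3: three ways to aggregate
macro_f1 = f1.mean()
weighted_f1 = np.average(f1, weights=support)
micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
accuracy = tp.sum() / cm.sum()

print("per-class F1:", np.round(f1, 3))
print(f"macro={macro_f1:.3f}  weighted={weighted_f1:.3f}  "
      f"micro={micro_f1:.3f}  accuracy={accuracy:.3f}")
```

On this matrix the macro average is pulled down by the two small classes, while the weighted and micro averages stay close to the majority-class score, and micro-F1 coincides with accuracy. The sketch assumes every class is both present and predicted at least once; otherwise the divisions need a zero guard.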
κ = 0.35 is in the "fair agreement" range (0.21-0.40 on the Landis & Koch scale). It means the classifier performs better than random chance, but only modestly so.
Context matters:
- For a spam filter: κ = 0.35 is terrible: the filter would annoy users constantly
- For predicting psychiatric diagnoses (where even human experts disagree): κ = 0.35 might be reasonable
- For OCR on clean printed text: κ = 0.35 is unacceptably low
Action: Investigate the confusion matrix to see which specific class pairs are causing confusion, then target those with more training data or feature engineering.
Challenges at K=1000: (1) Computational cost, (2) Class imbalance is almost guaranteed, (3) Many visually similar classes.
Techniques:
- Hierarchical Softmax: Group 1000 classes into a tree; predict path down the tree. Reduces K-way decision to O(log K) binary decisions.
- Sampled Softmax: During training, only compute softmax over the true class + a random subset. Used in language models with 50K+ word vocabularies.
- Top-K evaluation: Report Top-5 accuracy, not just Top-1.
- Embedding-based: Map both inputs and class labels to a shared embedding space; predict by nearest neighbor in embedding space.
- Mixture of experts: Route inputs to specialized sub-networks.
Concept: Different types of misclassification have different real-world costs. Missing a cancer diagnosis (FN) is far worse than a false alarm (FP). Cost-sensitive learning incorporates these asymmetric costs into model training.
Implementation approaches:
- Class weights in loss: Multiply each sample's loss by its class weight. In sklearn: class_weight='balanced' or a custom dict
- Sample weights: Assign individual sample weights based on a cost matrix
- Cost-aware loss function: Modify cross-entropy: L = -Σ cₖ · yₖ · log(ŷₖ), where cₖ is the cost for class k
- Post-hoc threshold tuning: After training, adjust decision thresholds per class to minimize expected cost
- Oversampling the high-cost class: Replicate samples from expensive-to-miss classes
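One way to apply a cost matrix after training is to replace the argmax rule with a minimum-expected-cost rule. The sketch below is illustrative: the class names, the cost values, and the predicted probabilities are all invented for the example.

```python
import numpy as np

classes = ["Healthy", "Benign", "Malignant"]  # hypothetical classes

# cost[i, j] = cost of predicting class j when the true class is i.
# Missing a malignant case (true row 2, predicted column 0) is the costliest error.
cost = np.array([[ 0.0,  1.0, 5.0],
                 [ 1.0,  0.0, 2.0],
                 [50.0, 10.0, 0.0]])

# Hypothetical predicted probabilities for three patients
probs = np.array([[0.70, 0.20, 0.10],
                  [0.95, 0.04, 0.01],
                  [0.30, 0.20, 0.50]])

# Expected cost of each possible prediction: sum_i P(true = i) * cost[i, j]
expected_cost = probs @ cost

argmax_pred = probs.argmax(axis=1)            # standard decision rule
min_cost_pred = expected_cost.argmin(axis=1)  # cost-sensitive decision rule

for p, a, c in zip(probs, argmax_pred, min_cost_pred):
    print(f"probs={p}  argmax -> {classes[a]:<9}  min-cost -> {classes[c]}")
```

For the first patient, a 10% chance of malignancy is enough to move the decision away from "Healthy", because the expected cost of a miss outweighs the cost of a false alarm.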
MCC is a correlation coefficient between the observed and predicted classifications. It ranges from -1 to +1, with 0 indicating random prediction.
Why better than accuracy:
- MCC uses all four confusion matrix quadrants (TP, TN, FP, FN) in a balanced way
- It returns a high value only if the classifier performs well on both positive and negative classes
- It is informative even when classes are heavily imbalanced
- A random classifier always gets MCC ≈ 0, regardless of class distribution
Example: On a dataset with 95% negative and 5% positive: an "always negative" classifier gets 95% accuracy but MCC = 0 (the formula's denominator is zero in this case, and MCC is conventionally set to 0). This correctly reveals the classifier is useless.
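The example can be verified with the binary MCC formula. In the sketch below, `binary_mcc` is a small helper written for this illustration; it returns 0 when the denominator vanishes, which is the usual convention. The counts for the second classifier are invented for comparison.

```python
import math

def binary_mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the four confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is 0 when any marginal is empty (denominator is zero)
    return 0.0 if denominator == 0 else numerator / denominator

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# 1000 samples: 950 negative, 50 positive
always_negative = dict(tp=0, tn=950, fp=0, fn=50)
# Hypothetical classifier that finds 40 of the 50 positives with 30 false alarms
useful = dict(tp=40, tn=920, fp=30, fn=10)

acc_always, mcc_always = accuracy(**always_negative), binary_mcc(**always_negative)
acc_useful, mcc_useful = accuracy(**useful), binary_mcc(**useful)

print(f"always negative: accuracy={acc_always:.3f}  MCC={mcc_always:.3f}")
print(f"useful model   : accuracy={acc_useful:.3f}  MCC={mcc_useful:.3f}")
```

The two classifiers differ by only one point of accuracy, but MCC separates them clearly.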
Reference: Chicco & Jurman (2020) "The advantages of the Matthews correlation coefficient over F1 score and accuracy in binary classification evaluation"
Research Problems
Research Problem 1: Calibration of Multi-Class Classifiers
Background: A softmax output of 0.7 for class k should mean "this is class k with 70% probability." But deep networks are notoriously miscalibrated: they tend to be overconfident. Guo et al. (2017) showed that modern neural networks are more miscalibrated than their predecessors.
Research Question: Develop a calibration method for multi-class neural networks that preserves ranking (i.e., if the model is more confident about class A than B, this ordering should be maintained after calibration) while improving Expected Calibration Error (ECE) across all classes simultaneously.
Suggested Approach: Extend temperature scaling to class-specific temperatures. Current approaches use a single temperature τ for all classes; investigate whether per-class temperatures τₖ improve calibration, especially for minority classes.
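A small numerical sketch shows why the ranking constraint in the research question is not automatic. It assumes one naive way of applying per-class temperatures, namely dividing logit k by τₖ; this parameterization, the logits, and the temperature values are all assumptions made for illustration (the per-class values are deliberately extreme).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0])   # hypothetical logits for 3 classes
tau = 2.0                            # single shared temperature
tau_k = np.array([1.0, 3.0, 0.2])    # hypothetical per-class temperatures

p_raw = softmax(logits)
p_single = softmax(logits / tau)      # standard temperature scaling
p_per_class = softmax(logits / tau_k) # naive per-class variant

print("raw        :", np.round(p_raw, 3), "argmax =", p_raw.argmax())
print("single tau :", np.round(p_single, 3), "argmax =", p_single.argmax())
print("per-class  :", np.round(p_per_class, 3), "argmax =", p_per_class.argmax())
```

A single τ > 1 softens the distribution without changing the order of the classes, because dividing all logits by the same positive constant is monotone. Per-class temperatures rescale each logit differently and can reorder the classes, so a method along these lines would need an extra constraint to preserve ranking.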
Datasets: CIFAR-100, ImageNet-1K, medical image classification datasets
Research Problem 2: Multi-Class Classification with Reject Option
Motivation: In safety-critical applications (medical diagnosis, autonomous driving), a classifier should be able to say "I don't know" when uncertain, rather than making a potentially dangerous incorrect prediction.
Research Question: Design a multi-class classification framework that incorporates a (K+1)-th "reject/abstain" class, optimizing for maximum accuracy on accepted samples while minimizing the reject rate. How should the cost of rejection relate to the costs of misclassification?
Hint: Chow's reject-option rule (1970) provides the theoretical optimum for binary classification. Extend this to multi-class with non-uniform misclassification costs.
Research Problem 3: Evaluation Metrics for Long-Tailed Distributions
Background: Real-world class distributions are often long-tailed: a few "head" classes dominate, while many "tail" classes have very few samples. Standard metrics (even Macro-F1) may not adequately capture performance on tail classes.
Research Question: Propose a new evaluation metric specifically designed for long-tailed multi-class classification that (a) is sensitive to tail-class performance, (b) is robust to the number of tail classes, and (c) has a clear probabilistic interpretation. Compare it against Macro-F1, Geometric Mean of per-class recalls, and recent proposals like "Balanced Accuracy."
Indian Context: Indian wildlife species classification (WILDBOOK India dataset) where common species (peacock, langur) have 10,000+ images while endangered species (snow leopard, red panda) have <50 images. How should we evaluate classifiers on such data?
Key Takeaways
References & Further Reading
Foundational Papers
- Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems." Annals of Eugenics, 7(2), 179-188.
- Bridle, J. (1989). "Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information." NIPS.
- Chawla, N. V. et al. (2002). "SMOTE: Synthetic Minority Over-sampling Technique." JAIR, 16, 321-357.
- Lin, T.-Y. et al. (2017). "Focal Loss for Dense Object Detection." ICCV.
- Guo, C. et al. (2017). "On Calibration of Modern Neural Networks." ICML.
- Szegedy, C. et al. (2016). "Rethinking the Inception Architecture for Computer Vision." CVPR.
- Chicco, D. & Jurman, G. (2020). "The Advantages of the Matthews Correlation Coefficient over F1 Score and Accuracy in Binary Classification Evaluation." BMC Genomics, 21, 6.
- Matthews, B. W. (1975). "Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme." Biochimica et Biophysica Acta, 405(2), 442-451.
ImageNet & Multi-Class Benchmarks
- Russakovsky, O. et al. (2015). "ImageNet Large Scale Visual Recognition Challenge." IJCV, 115(3), 211-252.
- LeCun, Y. et al. (1998). "Gradient-Based Learning Applied to Document Recognition." Proceedings of the IEEE, 86(11), 2278-2324.
- He, K. et al. (2016). "Deep Residual Learning for Image Recognition." CVPR.
Evaluation Metrics
- Cohen, J. (1960). "A Coefficient of Agreement for Nominal Scales." Educational and Psychological Measurement, 20(1), 37-46.
- Landis, J. R. & Koch, G. G. (1977). "The Measurement of Observer Agreement for Categorical Data." Biometrics, 33(1), 159-174.
- Sokolova, M. & Lapalme, G. (2009). "A Systematic Analysis of Performance Measures for Classification Tasks." Information Processing & Management, 45(4), 427-437.
Indian AI & ML Context
- ISRO (2022). "Land Use / Land Cover Atlas of India (LULC50K)." National Remote Sensing Centre.
- ICAR-IARI (2023). "Crop Classification from Satellite Imagery for PMFBY." Technical Report.
- NHA India (2024). "AI-Driven Fraud Detection in Ayushman Bharat Claims." Annual Report, Chapter 7.
Textbooks
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 4.
- Goodfellow, I. et al. (2016). Deep Learning. MIT Press. Chapter 6.2 (Softmax).
- Hastie, T. et al. (2009). The Elements of Statistical Learning. Springer. Chapter 4.4.