Neural Networks & Deep Learning

Chapter 11: Hyperparameter Tuning & ML Strategy

The Art and Science of Finding the Right Knobs to Turn

โฑ๏ธ Reading Time: ~2.5 hours  |  ๐Ÿ“– Part III: Training Deep Networks  |  ๐Ÿง  Strategy + Code Chapter

๐Ÿ“‹ Prerequisites: Chapters 6โ€“10 (Deep Networks, Optimization, Regularization, Batch Norm)

Bloom's Taxonomy Map for This Chapter

Bloom's Level | What You'll Achieve
--------------|--------------------
🔵 Remember   | Recall the priority ranking of hyperparameters (Tier 1–4) and default recommended values
🔵 Understand | Explain why random search dominates grid search and why logarithmic scale is needed for learning rate
🟢 Apply      | Implement a hyperparameter search loop and LR finder from scratch in Python
🟡 Analyze    | Perform structured error analysis: break down misclassifications into actionable categories
🟠 Evaluate   | Diagnose train/dev mismatch, decide whether a model has avoidable bias or variance problems
🔴 Create     | Design an end-to-end ML strategy for a real-world project with mismatched data distributions

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Rank hyperparameters by importance (Tier 1–4) and justify why learning rate is the single most critical hyperparameter
  • Compare grid search vs. random search using the probability argument, and explain when random search is strictly superior
  • Apply the coarse-to-fine search strategy and use logarithmic sampling for learning rate and regularization strength
  • Design train/dev/test splits appropriate for big-data scenarios (1M+ examples) vs. small-data scenarios
  • Diagnose train/dev distribution mismatch and propose corrective strategies (data synthesis, re-weighting)
  • Perform structured error analysis by manually inspecting misclassified examples and building error breakdown tables
  • Apply orthogonalization: fix high bias first (bigger model, longer training), then fix high variance (more data, regularization)
  • Use human-level performance as a proxy for Bayes optimal error to calculate avoidable bias
  • Implement a hyperparameter search loop and a fastai-style LR finder from scratch in Python
  • Evaluate when to use "Panda" vs. "Caviar" strategy for hyperparameter tuning based on compute budget
Section 2

Opening Hook

๐ŸŽ›๏ธ Too Many Knobs, Too Little Time

You've built a 12-layer neural network. You sit down to train it and realize there are at least 8 hyperparameters staring at you:

α (learning rate) · hidden units · # layers · epochs · batch size · dropout rate · λ (L2 reg) · β (momentum)

If you try just 5 values for each, that's 5⁸ = 390,625 experiments. At 10 minutes per experiment on a single GPU, that's 7.4 years of compute. Even on a ₹5 lakh cloud budget, you'd burn through it in days.
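
The arithmetic above can be checked in a few lines. This is only a back-of-the-envelope sketch; the variable names are illustrative, and the 10-minute run time is the figure assumed in the text.

```python
# Cost of an exhaustive grid over 8 hyperparameters with 5 values each.
values_per_hyperparameter = 5
num_hyperparameters = 8
minutes_per_experiment = 10   # assumed run time, as in the text

num_experiments = values_per_hyperparameter ** num_hyperparameters
total_minutes = num_experiments * minutes_per_experiment
total_years = total_minutes / (60 * 24 * 365)

print(f"{num_experiments:,} experiments")
print(f"{total_years:.1f} years on one GPU")
```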

So how do teams at Flipkart, Ola, and Jio tune models that serve hundreds of millions of users, and do it in weeks, not years?

The answer is not brute force. It's strategy.

Andrew Ng's practical guidance singles out the learning rate α as the most important hyperparameter to tune. Getting α right (or at least in the right order of magnitude) is often the difference between a model that converges in 2 hours and one that never converges at all.
Section 3

Core Concepts

This chapter covers two tightly related topics: (A) Hyperparameter Tuning, the mechanics of finding good values, and (B) ML Strategy, the decision framework for what to work on next. Together, they form the practitioner's toolkit for going from a working prototype to a production-quality model.

Section 3 · 11.1

The Hyperparameter Landscape

Not all hyperparameters are created equal. Andrew Ng's practical hierarchy, refined across years of production ML projects, ranks them into four tiers of importance:

🔴 TIER 1: Tune First (Highest Impact)

Learning Rate (α): The single most important hyperparameter. A 10× change in α can make or break your model. Always tune this first.

🟠 TIER 2: Tune Second (High Impact)
  • Momentum term (β): typically 0.9, but values like 0.95 or 0.99 can matter
  • Number of hidden units: directly controls model capacity
  • Mini-batch size: affects gradient noise and training speed
🟡 TIER 3: Tune Third (Moderate Impact)
  • Number of layers: depth vs. width trade-off
  • Learning rate decay: schedule type and decay factor
🟢 TIER 4: Usually Keep Defaults (Low Impact)
  • Adam parameters: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸ (almost never need tuning)

The 80/20 Rule of Hyperparameter Tuning: Spend 80% of your tuning budget on Tier 1 and Tier 2 hyperparameters and the remaining 20% on Tier 3. Tier 4 (Adam params) should almost always stay at their defaults; changing them rarely helps and can waste precious GPU hours.

Key Insight: Why Learning Rate is King

Intuition

Learning rate controls how big each gradient descent step is. Too large → the model overshoots and loss explodes. Too small → the model crawls and never reaches a good minimum in reasonable time. Every other hyperparameter (hidden units, layers, regularization) only shapes the loss landscape, but α determines whether you can even navigate it.
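
A toy illustration of the three regimes, assuming plain gradient descent on the one-dimensional function f(x) = x². The function `gradient_descent` and the three learning rates are chosen purely for illustration and do not come from any real model.

```python
def gradient_descent(lr, steps=50, x0=1.0):
    """Minimise f(x) = x**2 with plain gradient descent; return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2 * x          # derivative of x**2
        x = x - lr * grad
    return x

# Too small crawls, a moderate value converges, too large diverges.
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr:<6} final x = {gradient_descent(lr):.3e}")
```

With the smallest rate the iterate barely moves from its starting point, with the middle rate it ends close to the minimum at 0, and with the largest rate each step overshoots and the magnitude grows.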

Analogy

Think of tuning a TV. Learning rate is like the power switch and channel selector: get it wrong and you see nothing. Hidden units are like brightness and contrast: they refine the picture. Adam's β₁, β₂ are like the backlight frequency: you'd never touch them unless you're an engineer.

Flipkart's recommendation engine serves 400M+ users. Their ML team reportedly spends 60% of hyperparameter tuning time on learning rate schedules (warm-up + cosine decay), and only 10% on architecture search. Getting α right on their transformer-based models reduced training costs by an estimated ₹15 lakh per quarter.

Section 3 · 11.2

Grid Search vs. Random Search

Grid Search: The Naïve Approach

In grid search, you define a set of values for each hyperparameter and try every combination:

Python
# Grid search: 5 values × 5 values = 25 experiments
learning_rates = [0.001, 0.003, 0.01, 0.03, 0.1]
hidden_units   = [64, 128, 256, 512, 1024]

for lr in learning_rates:
    for hu in hidden_units:
        train_and_evaluate(lr=lr, hidden=hu)  # 25 runs

Problem: If learning rate matters much more than hidden units (which it does: Tier 1 vs. Tier 2), then in a 5×5 grid you only test 5 unique learning rates. The other 20 experiments repeat a learning rate that has already been tried, so they are "wasted" exploring hidden unit values and reveal nothing new about α.

Random Search: The Better Alternative

In random search, you sample each hyperparameter independently from a range:

Python
import numpy as np

# Random search: 25 experiments, but 25 UNIQUE learning rates!
for trial in range(25):
    lr = 10 ** np.random.uniform(-4, -1)   # log-uniform: 0.0001 to 0.1
    hu = np.random.choice([64, 128, 256, 512, 1024])
    train_and_evaluate(lr=lr, hidden=hu)

Advantage: Now you test 25 unique learning rates instead of just 5. Since α matters most, random search explores the most important dimension much more richly.
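
To make the "5 versus 25 unique values" point concrete, the sketch below counts distinct learning rates under each scheme. The seed and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Grid: 5 x 5 combinations, but each learning rate is repeated 5 times.
grid_lrs = [lr for lr in (0.001, 0.003, 0.01, 0.03, 0.1) for _ in range(5)]

# Random: 25 trials, each with its own log-uniform learning rate.
random_lrs = 10 ** rng.uniform(-4, -1, size=25)

print("distinct LRs, grid  :", len(set(grid_lrs)))
print("distinct LRs, random:", len(np.unique(random_lrs)))
```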

The Probability Argument for Random Search

Formal Reasoning

Suppose the optimal learning rate lies in a narrow "sweet spot" that covers 10% of your search range. A 5×5 grid tests only 5 distinct learning rates. Treating those 5 values as independent uniform draws (a simplification, since a fixed grid is deterministic), the probability that at least one falls in the sweet spot is:

P(hit) = 1 − (1 − 0.10)⁵ = 1 − 0.9⁵ = 1 − 0.590 = 0.410 (41%)

With random search using 25 independently sampled points projected onto the LR axis:

P(hit) = 1 − (1 − 0.10)²⁵ = 1 − 0.9²⁵ = 1 − 0.072 = 0.928 (93%!)

Key Takeaway

Same total budget (25 experiments), but random search gives you 93% probability of hitting the sweet spot for the most important hyperparameter, vs. only 41% for grid search. This gap widens as the number of hyperparameters increases.
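
The two probabilities can be reproduced with a one-line formula. The helper name `p_hit` is illustrative, and it assumes the draws are independent and uniform over the search range.

```python
def p_hit(sweet_spot_fraction, n_values):
    """P(at least one of n independent uniform draws lands in the sweet spot)."""
    return 1 - (1 - sweet_spot_fraction) ** n_values

print(f"5 distinct values : {p_hit(0.10, 5):.3f}")
print(f"25 distinct values: {p_hit(0.10, 25):.3f}")
```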

Paper Reference

This argument was made, with both empirical and theoretical support, by Bergstra & Bengio (2012): "Random Search for Hyper-Parameter Optimization", one of the most cited hyperparameter tuning papers in ML.

Mistake: "I'll use grid search because it's systematic and exhaustive."
Reality: Grid search is exhaustive only across the full combination, but for the individual axis that matters most, it's extremely wasteful. Random search with the same budget samples more unique values along every individual axis.

Section 3 · 11.3

Coarse-to-Fine Strategy & Logarithmic Scale

The Coarse-to-Fine Workflow

In practice, you don't run one massive search. You iterate in rounds:

  1. Round 1 (Coarse): Sample broadly. LR from 10⁻⁴ to 10⁰, hidden units from 32 to 2048. Run ~25 experiments with fewer epochs (5–10).
  2. Identify the promising region: e.g., best results cluster around LR ∈ [10⁻³, 10⁻²] and hidden units ∈ [256, 512].
  3. Round 2 (Fine): Zoom into the promising region. LR from 10⁻³ to 10⁻², hidden units from 200 to 600. Run ~25 more experiments with more epochs (20–50).
  4. Round 3 (Final): Narrow further if needed. Train the top 3 candidates with full epochs and pick the best.

[Figure: Coarse-to-Fine Search Strategy. Left panel, "Round 1: Broad Search" (LR: 10⁻⁴ → 10⁰, ★ = good results); right panel, "Round 2: Zoom In" on the region where the good results cluster (LR: 10⁻³ → 10⁻², ★ = best results).]
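
The rounds above can be sketched as a small loop over the learning-rate exponent. Everything here is a stand-in: `toy_dev_score` is a hypothetical replacement for a short training run (it peaks near LR = 10⁻²·⁵ by construction), and keeping the top 5 trials per round is an arbitrary choice.

```python
import numpy as np

def toy_dev_score(lr):
    """Hypothetical stand-in for a short training run; peaks near lr = 10**-2.5."""
    return -(np.log10(lr) + 2.5) ** 2

def search_round(rng, log_low, log_high, n_trials=25, keep=5):
    """Sample log-uniform LRs, then return the exponent range of the top `keep`."""
    exponents = rng.uniform(log_low, log_high, size=n_trials)
    scores = np.array([toy_dev_score(10 ** e) for e in exponents])
    best = exponents[np.argsort(scores)[-keep:]]
    return best.min(), best.max()

rng = np.random.default_rng(0)
coarse = (-4.0, 0.0)                 # Round 1: LR from 1e-4 to 1
fine = search_round(rng, *coarse)    # promising region found in Round 1
final = search_round(rng, *fine)     # Round 2: zoom into that region

print(f"coarse exponents: {coarse}")
print(f"fine exponents  : ({fine[0]:.2f}, {fine[1]:.2f})")
print(f"final exponents : ({final[0]:.2f}, {final[1]:.2f})")
```

In a real project each round would also use a different epoch budget, as described in the numbered steps.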

Why Logarithmic Scale for Learning Rate?

Learning rate values that "matter" are spread across orders of magnitude. The difference between 0.0001 and 0.001 is just as significant as the difference between 0.01 and 0.1, since each is a 10× change. If you sampled uniformly from [0.0001, 1], you'd spend 90% of your samples in [0.1, 1] and only about 0.1% of samples in [0.0001, 0.001].
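
The shares quoted above follow directly from interval lengths. A short sketch, with the helper `uniform_share` introduced only for illustration:

```python
low, high = 1e-4, 1.0
decades = [(1e-4, 1e-3), (1e-3, 1e-2), (1e-2, 1e-1), (1e-1, 1.0)]

def uniform_share(a, b):
    """Fraction of Uniform(low, high) samples that land in [a, b]."""
    return (b - a) / (high - low)

for a, b in decades:
    log_share = 1 / len(decades)   # log-uniform sampling treats each decade equally
    print(f"[{a:g}, {b:g}]  uniform: {uniform_share(a, b):7.2%}   log-uniform: {log_share:.0%}")
```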

Logarithmic sampling: α = 10^r, where r ~ Uniform(a, b)
Example: r ~ Uniform(−4, −1) gives α ∈ [10⁻⁴, 10⁻¹] = [0.0001, 0.1]
Python
import numpy as np

# ✅ CORRECT: Log-uniform sampling for learning rate
r = np.random.uniform(-4, -1)  # exponent between -4 and -1
alpha = 10 ** r                   # α ∈ [0.0001, 0.1]

# ❌ WRONG: Uniform sampling for learning rate
alpha_uniform = np.random.uniform(0.0001, 0.1)  # heavily biased toward large values

# ✅ Log-uniform for β (momentum): sample (1 - β) on log scale
# β ∈ [0.9, 0.999] → (1-β) ∈ [0.001, 0.1] → r ∈ [-3, -1]
r = np.random.uniform(-3, -1)
beta = 1 - 10 ** r               # β ∈ [0.9, 0.999]

Log scale for β (momentum): Don't sample β uniformly from [0.9, 0.999]. The difference between β=0.9 and β=0.9005 is negligible, but the difference between β=0.999 and β=0.9995 is huge (averaging over ~1000 vs ~2000 values). Sample (1 − β) on log scale instead.
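
The sensitivity near β = 1 comes from the approximate averaging window 1 / (1 − β) of an exponentially weighted average. A minimal sketch, with `averaging_window` as an illustrative helper name:

```python
def averaging_window(beta):
    """Approximate number of recent steps an exponential moving average spans."""
    return 1.0 / (1.0 - beta)

for beta in (0.9, 0.9005, 0.999, 0.9995):
    print(f"beta={beta:<7} averages over ~{averaging_window(beta):.0f} steps")
```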

Panda 🐼 vs. Caviar 🐟 Strategy

Two Approaches to Hyperparameter Tuning

🐼 Panda Strategy (Babysitting One Model)

When compute is limited (e.g., a single GPU at your university lab), you train one model at a time, carefully watching the loss curve and adjusting hyperparameters day by day. Like a panda caring for a single baby.

Best for: Students, startups with limited GPU budget (e.g., a single NVIDIA RTX 4090 costing ₹1,60,000).

🐟 Caviar Strategy (Spawn Many in Parallel)

When compute is abundant (e.g., a cloud cluster), you launch dozens of experiments simultaneously with different hyperparameters and pick the winner. Like a fish spawning thousands of eggs.

Best for: Companies like TCS, Infosys, Jio with cloud budgets or on-premise GPU clusters.

At IIT Bombay's CFILT lab, student researchers often use the Panda strategy, babysitting a single model on a shared DGX station. In contrast, Jio's AI team uses the Caviar strategy, running 100+ experiments in parallel on their private cloud. Same ML principles, different resource constraints.

Section 3 · 11.4

Train/Dev/Test Split Strategy

The Classical Split

Traditionally (pre-deep learning era, small datasets of 100–10,000 examples):

Split            | Classical Ratio | Purpose
-----------------|-----------------|-------------------------------------
Train            | 60%             | Fit model parameters
Dev (Validation) | 20%             | Tune hyperparameters, model selection
Test             | 20%             | Final unbiased evaluation

The Big Data Split

In the deep learning era with 1M+ examples, you don't need 200K examples just for dev set. Modern splits:

Dataset Size | Train | Dev   | Test
-------------|-------|-------|------
10,000       | 60%   | 20%   | 20%
100,000      | 90%   | 5%    | 5%
1,000,000    | 98%   | 1%    | 1%
10,000,000   | 99.5% | 0.25% | 0.25%

Even 0.25% of 10M = 25,000 examples, which is more than enough to get a statistically significant performance estimate.
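
A small helper makes the absolute counts behind these percentages explicit. `split_sizes` is an illustrative name, and the fractions are the ones from the table.

```python
def split_sizes(n_examples, dev_frac, test_frac):
    """Return (train, dev, test) counts; train takes whatever is left."""
    n_dev = round(n_examples * dev_frac)
    n_test = round(n_examples * test_frac)
    return n_examples - n_dev - n_test, n_dev, n_test

for n, dev, test in [(10_000, 0.20, 0.20), (1_000_000, 0.01, 0.01),
                     (10_000_000, 0.0025, 0.0025)]:
    print(f"{n:>10,}: train/dev/test = {split_sizes(n, dev, test)}")
```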

Guidelines for Dev/Test Set Design

Rule 1: Dev and Test Must Come from the Same Distribution

If your dev set is from distribution A and your test set is from distribution B, you're optimizing hyperparameters for the wrong target. It's like practicing archery aiming at one target, then being graded on a different one.

Rule 2: Dev Set Must Be Large Enough to Detect Differences

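If algorithm A has 90.0% accuracy and B has 90.1%, you need enough dev examples to tell them apart reliably. Dev sets of 1,000–10,000 examples are a common rule of thumb, but a gap as small as 0.1% needs the upper end of that range or more: on 1,000 examples, 0.1% is a single example.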
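
One rough way to judge whether a dev set is large enough is the binomial standard error of the measured accuracy. This sketch treats each accuracy as an independent estimate; comparing two models on the same examples (a paired comparison) can be tighter, so read the numbers as a conservative guide. The function name is illustrative.

```python
import math

def accuracy_standard_error(accuracy, n_examples):
    """Binomial standard error of an accuracy measured on n_examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_examples)

for n in (1_000, 10_000, 100_000):
    se = accuracy_standard_error(0.90, n)
    print(f"n={n:>7,}: standard error ~ {se:.2%}")
```

When the standard error is larger than the gap you care about, the dev set cannot reliably separate the two models.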

Rule 3: Test Set is Optional (But Recommended)

If you only need to pick the best model (no unbiased final estimate needed), you can skip the test set. But for papers, competitions, and production, always keep a held-out test set.

Mistake: "I'll use my test set to tune hyperparameters since I have a large dataset."
Why it's wrong: This makes your test set a second dev set. You lose the ability to get an unbiased estimate of real-world performance. Many Kaggle beginners make this mistake and are shocked when their leaderboard score drops on private test data.
Section 3 · 11.5

Train/Dev Distribution Mismatch

The Real-World Problem

In many production ML systems, your training data comes from a different distribution than what you'll see at inference time:

Example (Paytm Fraud Detection): Training data might include 5 years of historical transactions (2019–2024) collected under old UPI protocols. But the dev/test set is recent 2025 data with new payment flows, UPI Lite, and credit-on-UPI. The distributions are genuinely different, not just a random split issue.

Why Does Mismatch Happen?

  • Data availability: You have millions of web-scraped images but only 10,000 images from your actual camera app
  • Temporal shift: Training on historical data, deploying on current data
  • Geographic shift: Training on US data, deploying for Indian users
  • Platform shift: Training on desktop clicks, deploying on mobile

The Solution: Prioritize Target Distribution for Dev/Test

Always make your dev and test sets reflect the target distribution (what you'll see in production). Use the mismatched (but larger) data for training.
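
One possible way to build such splits is sketched below using index arrays only. The pool sizes and the allocation (half of the target data added to training, the rest divided evenly between dev and test) are assumptions for illustration, not a prescription from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pools: a large source set and a small target (production) set.
n_source, n_target = 1_000_000, 10_000
source_idx = np.arange(n_source)
target_idx = rng.permutation(n_target)

# Assumed allocation: 5,000 target examples join training; dev and test
# get 2,500 each, so both reflect the production distribution.
target_train, target_dev, target_test = np.split(target_idx, [5_000, 7_500])

splits = {
    "train": {"source": source_idx, "target": target_train},
    "dev":   {"target": target_dev},
    "test":  {"target": target_test},
}
for name, parts in splits.items():
    print(name, {pool: len(idx) for pool, idx in parts.items()})
```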

The Train-Dev Set: A Diagnostic Tool

Problem

If your training error is 1% and dev error is 10%, is the 9% gap due to variance (overfitting) or distribution mismatch? You can't tell with just train and dev errors.

Solution: Create a "Train-Dev" Set

Carve out a small subset from training data (same distribution as train, but not used for training). Now you can decompose the error:

Set           | Distribution      | Used For  | Error
--------------|-------------------|-----------|------
Training set  | Source            | Training  | 1%
Train-Dev set | Source (held out) | Diagnosis | 9%
Dev set       | Target            | HP tuning | 10%

Interpretation

Train → Train-Dev gap (1% → 9%): This is variance (the model overfits the training data).

Train-Dev → Dev gap (9% → 10%): This is data mismatch (only 1 percentage point, which is negligible).

Conclusion: The main problem here is variance, not data mismatch. Fix with more regularization or more data.
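
The same reading of the table can be written as a small function. `diagnose_gap` is an illustrative helper; errors are given in percentage points, and the second call is a hypothetical contrasting case.

```python
def diagnose_gap(train_err, train_dev_err, dev_err):
    """Split the train-to-dev gap into variance and data mismatch (% points)."""
    variance = train_dev_err - train_err
    mismatch = dev_err - train_dev_err
    main_problem = "variance" if variance >= mismatch else "data mismatch"
    return variance, mismatch, main_problem

print(diagnose_gap(1.0, 9.0, 10.0))   # the example from the table
print(diagnose_gap(1.0, 1.5, 10.0))   # hypothetical case dominated by mismatch
```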

Error Decomposition:
Human-level ≈ Bayes error
↓ (Avoidable Bias)
Training error
↓ (Variance)
Train-Dev error
↓ (Data Mismatch)
Dev error
↓ (Overfitting to Dev Set)
Test error

Section 3 · 11.6

Error Analysis

The Power of Manual Inspection

Before spending weeks collecting more data or redesigning your architecture, spend 30–60 minutes manually inspecting 100 misclassified dev-set examples. This simple practice is one of the highest-ROI activities in ML.

Structured Error Analysis Process

  1. Pull 100 misclassified examples from the dev set
  2. Create a spreadsheet with columns for each potential error category
  3. For each example, mark which categories apply
  4. Count percentages to identify the biggest error sources
  5. Prioritize the category with the highest ceiling for improvement
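
The spreadsheet in steps 2 to 4 can be mimicked with a `Counter`. The tagged examples below are made up for illustration; one example may carry several tags, so the percentages need not sum to 100%.

```python
from collections import Counter

# Hypothetical spreadsheet rows: one set of tags per misclassified example.
tagged_errors = [
    {"blurry"},
    {"blurry", "multiple items"},
    {"unusual plating"},
    {"multiple items"},
    {"blurry"},
]

counts = Counter(tag for tags in tagged_errors for tag in tags)
total = len(tagged_errors)
for tag, count in counts.most_common():
    print(f"{tag:<16s} {count:3d}  ({count / total:.0%} of inspected errors)")
```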

Example: Food Classification for Zomato

Imagine you're building a food image classifier for Zomato's photo-based search. You inspect 100 misclassified images:

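Error Category                   | Count (out of 100) | Ceiling for Improvement
---------------------------------|--------------------|------------------------
Blurry/low-quality images        | 38                 | 38%
Multiple food items in one image | 25                 | 25%
Unusual plating/presentation     | 18                 | 18%
Mislabeled training data         | 12                 | 12%
Rare regional dishes             | 7                  | 7%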

Conclusion: Working on handling blurry images (data augmentation, super-resolution preprocessing) could fix up to 38% of errors, which makes it the highest-impact area. Don't waste time on rare regional dishes (only 7% ceiling).

The "Ceiling" Concept: If blurry images cause 38% of errors and your dev error is 10%, then perfectly solving the blurry image problem would cut dev error by at most 3.8 percentage points (38% of 10%), from 10% to 6.2%. This is the "ceiling" for that improvement. Always calculate ceilings to decide where to invest effort.

Should You Fix Incorrect Labels?

Deep learning algorithms are robust to random label noise in the training set: if errors are random (not systematic), a small percentage (1–2%) of wrong labels usually doesn't hurt much. However:

  • Dev/test set labels must be correct. If 6% of your dev set is mislabeled, you can't trust your model selection process.
  • Systematic errors are dangerous. If all "dosa" images are labeled "uttapam", the model will learn this wrong mapping.
  • If you fix dev labels, fix test labels too, so that both sets keep coming from the same distribution.

Section 3 · 11.7

Orthogonalization

One Knob, One Function

In a well-designed system, each control affects exactly one thing. In an old TV: brightness knob changes brightness, volume knob changes volume. If one knob changed both, debugging would be impossible.

Apply the same principle to ML. The four sequential goals, each with its own "knob":

The Four Knobs of ML Orthogonalization

Knob 1: Fit Training Set Well (Fix High Bias)

Tools: Bigger network, train longer, better optimizer (Adam), different architecture.

Goal: Training error ≈ Human-level performance

Knob 2: Fit Dev Set Well (Fix High Variance)

Tools: Regularization (L2, dropout, data augmentation), more training data, early stopping (use cautiously: it affects Knob 1 too).

Goal: Dev error ≈ Training error

Knob 3: Fit Test Set Well (Fix Dev Overfitting)

Tools: Bigger dev set, don't over-tune on dev set.

Goal: Test error ≈ Dev error

Knob 4: Perform Well in Real World

Tools: Change dev/test set distribution to match real world, change cost function to better reflect reality.

Goal: Real-world performance ≈ Test error

Early Stopping Violates Orthogonalization: Early stopping simultaneously affects both training set fitting (Knob 1) and dev set fitting (Knob 2). Andrew Ng prefers L2 regularization over early stopping because L2 only affects Knob 2 without compromising Knob 1. That said, early stopping is still widely used in practice because it's simple and often works well; just be aware of the trade-off.

Section 3 · 11.8

Human-Level Performance & Bayes Optimal Error

Why Compare to Humans?

For tasks that humans are good at (vision, speech, NLP), human-level error is a useful proxy for the Bayes optimal error: the lowest error any function can achieve given the noise in the data.

Bayes Optimal Error ≤ Human-Level Error ≤ Current Model Error (the second inequality holds only until the model surpasses humans)

Avoidable Bias = Training Error − Human-Level Error
Variance = Dev Error − Training Error

Which "Human Level" to Use?

Consider a medical imaging task:

Human                       | Error Rate
----------------------------|-----------
Typical medical student     | 5%
Experienced radiologist     | 2%
Team of expert radiologists | 0.7%

For the purpose of computing avoidable bias, use the best human performance (0.7%) as the proxy for Bayes error, because Bayes error ≤ 0.7%.

Worked Diagnostic Examples

Scenario A: High Avoidable Bias

Metric                    | Error
--------------------------|------
Human-level (Bayes proxy) | 1%
Training error            | 8%
Dev error                 | 10%

Avoidable bias = 8% − 1% = 7%
Variance = 10% − 8% = 2%
Diagnosis: Focus on reducing bias → bigger model, train longer.

Scenario B: High Variance

Metric                    | Error
--------------------------|------
Human-level (Bayes proxy) | 1%
Training error            | 2%
Dev error                 | 10%

Avoidable bias = 2% − 1% = 1%
Variance = 10% − 2% = 8%
Diagnosis: Focus on reducing variance → regularization, more data, dropout.
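
Both scenarios reduce to two subtractions and a comparison. The sketch below uses an illustrative helper, `bias_variance`, with errors in percentage points; the tie-breaking rule (prefer variance when the two are equal) is an arbitrary assumption.

```python
def bias_variance(human_err, train_err, dev_err):
    """Return (avoidable bias, variance, suggested focus), errors in % points."""
    avoidable_bias = train_err - human_err
    variance = dev_err - train_err
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

print("Scenario A:", bias_variance(1, 8, 10))
print("Scenario B:", bias_variance(1, 2, 10))
```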

Surpassing Human-Level Performance

Once a model surpasses human-level error, progress typically slows down because:

  • You can no longer use human-level as a reliable Bayes proxy, so the gap becomes unclear
  • You can't do manual error analysis (if the model is better than you, how do you know which errors to fix?)
  • You're approaching the theoretical ceiling (Bayes error), where further gains require exponentially more effort

Tasks where ML has already surpassed human performance: online advertising click prediction, product recommendation (Flipkart/Amazon), loan default prediction (where models process 500+ features that no human can simultaneously evaluate), and route optimization (Google Maps, Ola).

Niramai Health Analytics (Bengaluru) developed a breast cancer screening AI that rivals expert radiologists in detecting early-stage tumors using thermal imaging, achieving comparable or better performance than experienced doctors at a fraction of the cost (₹500 per screening vs. ₹3,000+ for mammography). Their ML strategy relied heavily on human-level benchmarking during development.

Section 4

From-Scratch Code

4.1 Hyperparameter Random Search Loop

A complete, reusable hyperparameter search engine from scratch:

Python
import numpy as np
import json
from datetime import datetime

class HyperparameterSearcher:
    """Random search over hyperparameters with logging."""

    def __init__(self, search_space, n_trials=25, seed=42):
        """
        search_space: dict mapping param_name -> dict with:
            'type': 'log_uniform' | 'uniform' | 'choice' | 'int_uniform'
                    | 'log_complement'
            'low', 'high': bounds for uniform/int_uniform; base-10 exponent
                           bounds for log_uniform/log_complement
            'options': list for choice
        """
        self.search_space = search_space
        self.n_trials = n_trials
        self.rng = np.random.RandomState(seed)
        self.results = []

    def _sample_params(self):
        """Sample one set of hyperparameters."""
        params = {}
        for name, spec in self.search_space.items():
            if spec['type'] == 'log_uniform':
                # Sample on log scale: 10^Uniform(low, high)
                r = self.rng.uniform(spec['low'], spec['high'])
                params[name] = 10 ** r
            elif spec['type'] == 'uniform':
                params[name] = self.rng.uniform(spec['low'], spec['high'])
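            elif spec['type'] == 'choice':
                # Index into the list so the option keeps its native Python
                # type (rng.choice would return a NumPy scalar)
                options = spec['options']
                params[name] = options[self.rng.randint(len(options))]
            elif spec['type'] == 'int_uniform':
                # randint excludes the upper bound, so add 1 to include 'high'
                params[name] = int(self.rng.randint(spec['low'], spec['high'] + 1))
            elif spec['type'] == 'log_complement':
                # For β: sample (1-β) on log scale
                r = self.rng.uniform(spec['low'], spec['high'])
                params[name] = 1 - 10 ** r
            else:
                raise ValueError(f"Unknown type for '{name}': {spec['type']}")
        return params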

    def search(self, train_fn):
        """
        Run the search.
        train_fn: callable(params_dict) -> dict with 'train_loss',
                  'dev_loss', 'dev_acc', etc.
        """
        print(f"Starting random search: {self.n_trials} trials")
        print("=" * 60)

        for trial in range(self.n_trials):
            params = self._sample_params()
            print(f"\nTrial {trial+1}/{self.n_trials}")
            print(f"  Params: {params}")

            # Train and evaluate
            metrics = train_fn(params)

            # Log results
            result = {
                'trial': trial + 1,
                'params': params,
                'metrics': metrics,
                'timestamp': datetime.now().isoformat()
            }
            self.results.append(result)

            print(f"  Dev Acc: {metrics.get('dev_acc', 'N/A')}")

        # Find best trial
        best = max(self.results,
                   key=lambda x: x['metrics'].get('dev_acc', 0))
        print(f"\n{'='*60}")
        print(f"Best Trial: #{best['trial']}")
        print(f"Best Params: {best['params']}")
        print(f"Best Dev Acc: {best['metrics']['dev_acc']:.4f}")
        return best

    def top_k(self, k=5):
        """Return top-k trials by dev accuracy."""
        sorted_results = sorted(
            self.results,
            key=lambda x: x['metrics'].get('dev_acc', 0),
            reverse=True
        )
        return sorted_results[:k]


# ─── Usage Example ───
search_space = {
    'learning_rate': {'type': 'log_uniform', 'low': -4, 'high': -1},
    'hidden_units':  {'type': 'choice', 'options': [64,128,256,512]},
    'dropout_rate':  {'type': 'uniform', 'low': 0.1, 'high': 0.5},
    'momentum':      {'type': 'log_complement', 'low': -3, 'high': -1},
    'batch_size':    {'type': 'choice', 'options': [32,64,128,256]},
}

searcher = HyperparameterSearcher(search_space, n_trials=25)
# best = searcher.search(my_train_function)
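
The commented-out call above expects a training function that takes the sampled parameters and returns a metrics dictionary. Below is a toy stand-in showing only that contract. `toy_train_function` and its made-up accuracy curve (peaking at a learning rate of 1e-3) are hypothetical, and a real function would build and train a model instead.

```python
import math

def toy_train_function(params):
    """Hypothetical stand-in for a real training run.

    Returns the metrics dict the search loop expects; the 'accuracy' is a
    made-up smooth function that peaks at learning_rate = 1e-3.
    """
    distance = abs(math.log10(params["learning_rate"]) + 3)
    dev_acc = max(0.0, 0.95 - 0.1 * distance)
    return {"train_loss": 1.0 - dev_acc, "dev_loss": 1.05 - dev_acc,
            "dev_acc": dev_acc}

print(toy_train_function({"learning_rate": 1e-3}))
```

Passing a function of this shape to `searcher.search(...)` exercises the logging and best-trial selection without any real training.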

4.2 Learning Rate Finder (fastai-style)

The LR Finder is one of the most practical tools in deep learning. It trains for one epoch while exponentially increasing the learning rate, and records the loss at each step. The optimal LR is where the loss decreases fastest (steepest slope).

Python
import numpy as np
import copy

class LRFinder:
    """
    Learning Rate Finder (Smith 2017 / fastai-style).
    Exponentially increases LR from lr_min to lr_max over
    one pass through the data, recording loss at each step.
    """

    def __init__(self, model, optimizer_fn, loss_fn):
        """
        model:        object with .forward(X) and .parameters()
        optimizer_fn: callable(params, lr) -> optimizer
        loss_fn:      callable(y_pred, y_true) -> scalar loss
        """
        self.model = model
        self.optimizer_fn = optimizer_fn
        self.loss_fn = loss_fn
        self.lrs = []
        self.losses = []

    def find(self, X_train, y_train, lr_min=1e-7, lr_max=10,
             num_steps=100, smooth_factor=0.05):
        """Run the LR range test."""

        # Save initial model state
        initial_state = copy.deepcopy(self.model)

        # Compute multiplication factor per step
        mult = (lr_max / lr_min) ** (1 / num_steps)
        lr = lr_min
        best_loss = float('inf')
        avg_loss = 0
        n = len(X_train)
        batch_size = min(64, n)

        for step in range(num_steps):
            # Sample a mini-batch
            idx = np.random.choice(n, batch_size, replace=False)
            X_batch = X_train[idx]
            y_batch = y_train[idx]

            # Forward pass
            y_pred = self.model.forward(X_batch)
            loss = self.loss_fn(y_pred, y_batch)
            loss_value = float(loss)  # plain number for logging and smoothing

            # Smooth the loss (exponential moving average)
            avg_loss = smooth_factor * loss_value + (1 - smooth_factor) * avg_loss
            smoothed_loss = avg_loss / (1 - (1 - smooth_factor) ** (step + 1))

            # Stop if loss explodes (> 4× best)
            if step > 10 and smoothed_loss > 4 * best_loss:
                print(f"Stopping: loss exploded at lr={lr:.2e}")
                break

            if smoothed_loss < best_loss:
                best_loss = smoothed_loss

            # Record
            self.lrs.append(lr)
            self.losses.append(smoothed_loss)

            # Backward pass & update (simplified)
            optimizer = self.optimizer_fn(
                self.model.parameters(), lr=lr
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Increase learning rate exponentially
            lr *= mult

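        # Restore initial model state. Note: this rebinds self.model to the
        # saved copy; any outside reference to the original object still
        # points at the modified weights, so continue with finder.model.
        self.model = initial_state
        return self.lrs, self.losses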

    def suggest_lr(self):
        """Suggest the LR where loss decreases fastest."""
        if len(self.losses) < 3:
            return None

        # Compute gradient of loss w.r.t. log(lr)
        log_lrs = np.log10(self.lrs)
        gradients = np.gradient(self.losses, log_lrs)

        # Find the LR with steepest negative gradient
        # (skip first 10% and last 10% for stability)
        start = len(gradients) // 10
        end = len(gradients) - len(gradients) // 10
        min_idx = start + np.argmin(gradients[start:end])

        suggested_lr = self.lrs[min_idx]
        print(f"Suggested LR: {suggested_lr:.2e}")
        print(f"  (where loss decreased fastest)")
        print(f"  Rule of thumb: use ~{suggested_lr/10:.2e} to "
              f"{suggested_lr:.2e}")
        return suggested_lr
Rule of thumb for LR Finder: The suggested LR is where the loss decreases most steeply. In practice, use a value slightly lower (about 1/3 to 1/10 of the suggested value) as your maximum LR for training. This ensures you're in the "fast descent" zone without risking instability.
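
The exponential schedule used inside `find` can be examined on its own. This sketch reproduces the per-step multiplier from the class above and compares it with NumPy's closed-form geometric spacing; the variable names are illustrative.

```python
import numpy as np

lr_min, lr_max, num_steps = 1e-7, 10.0, 100

# Per-step multiplier, as in the finder above.
mult = (lr_max / lr_min) ** (1 / num_steps)

lrs = []
lr = lr_min
for _ in range(num_steps):
    lrs.append(lr)
    lr *= mult

# The same schedule in closed form; endpoint=False because the loop
# records lr before multiplying, so lr_max itself is never visited.
closed_form = np.geomspace(lr_min, lr_max, num_steps, endpoint=False)
print(np.allclose(lrs, closed_form))
print(f"first={lrs[0]:.1e}, last={lrs[-1]:.2e}")
```

If the sweep should end exactly at `lr_max`, one option is to use an exponent of `1 / (num_steps - 1)` for the multiplier instead.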

4.3 Error Analysis Helper

Python
import numpy as np
from collections import Counter

class ErrorAnalyzer:
    """Structured error analysis on misclassified examples."""

    def __init__(self, X_dev, y_dev, y_pred, class_names=None):
        self.X_dev = X_dev
        self.y_dev = y_dev
        self.y_pred = y_pred
        self.class_names = class_names

        # Find misclassified indices
        self.misclassified = np.where(y_dev != y_pred)[0]
        self.total_errors = len(self.misclassified)
        self.dev_error = self.total_errors / len(y_dev)

        print(f"Dev set size: {len(y_dev)}")
        print(f"Misclassified: {self.total_errors}")
        print(f"Dev error rate: {self.dev_error:.2%}")

    def confusion_breakdown(self):
        """Show which classes are most confused."""
        confusion_pairs = []
        for idx in self.misclassified:
            true_label = self.y_dev[idx]
            pred_label = self.y_pred[idx]
            if self.class_names:
                pair = (self.class_names[true_label],
                        self.class_names[pred_label])
            else:
                pair = (true_label, pred_label)
            confusion_pairs.append(pair)

        counts = Counter(confusion_pairs)
        print("\nTop Confusion Pairs (True → Predicted):")
        print("-" * 50)
        for (true, pred), count in counts.most_common(10):
            pct = count / self.total_errors * 100
            # str() so integer labels also work with the 's' format spec
            print(f"  {str(true):>15s} → {str(pred):<15s}  "
                  f"{count:4d}  ({pct:.1f}%)")
        return counts

    def ceiling_analysis(self, categories):
        """
        Given error categories with counts, compute ceilings.
        categories: dict mapping category_name -> count_of_errors
        """
        print(f"\nError Ceiling Analysis (Dev Error: {self.dev_error:.2%})")
        print("-" * 55)
        print(f"{'Category':<25s} {'Count':>6s} {'% Errors':>9s} {'Ceiling':>9s}")
        print("-" * 55)
        for cat, count in sorted(categories.items(),
                                  key=lambda x: -x[1]):
            pct = count / self.total_errors * 100
            ceiling = self.dev_error * (1 - count / self.total_errors)
            print(f"  {cat:<23s} {count:6d} {pct:8.1f}% "
                  f"{ceiling:8.2%}")
        print("-" * 55)
Dev set size: 5000
Misclassified: 500
Dev error rate: 10.00%

Error Ceiling Analysis (Dev Error: 10.00%)
-------------------------------------------------------
Category                   Count  % Errors   Ceiling
-------------------------------------------------------
  Blurry images              190     38.0%    6.20%
  Multiple objects           125     25.0%    7.50%
  Unusual angles              90     18.0%    8.20%
  Label noise                 60     12.0%    8.80%
  Rare classes                35      7.0%    9.30%
-------------------------------------------------------
Section 5

Industry Code: PyTorch & Optuna

5.1 PyTorch LR Finder (torch-lr-finder)

Python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torch_lr_finder import LRFinder  # pip install torch-lr-finder

# Define a simple model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.BatchNorm1d(256),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

optimizer = optim.Adam(model.parameters(), lr=1e-7, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

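# Create data loader
# train_dataset is assumed to be defined elsewhere: a Dataset yielding
# (flattened 784-dim float tensor, integer class label) pairs
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Run LR Finder (falls back to CPU when no GPU is available)
device = "cuda" if torch.cuda.is_available() else "cpu"
lr_finder = LRFinder(model, optimizer, criterion, device=device)
lr_finder.range_test(train_loader, end_lr=10, num_iter=200)
lr_finder.plot()  # Shows loss vs. LR curve
lr_finder.reset()  # Restore model and optimizer to their initial state

# Read the plot: pick the LR where the loss falls most steeply
# Typically: suggested_lr ≈ 3e-3 for this architecture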
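
Reading the steepest point off the plot by eye can also be approximated numerically. The sketch below is a minimal illustration and not part of torch-lr-finder: `suggest_lr` is a hypothetical helper, and `lrs` and `losses` are assumed to be the learning rates and (ideally smoothed) losses recorded during a range test. The synthetic curve is made up purely to exercise the function.

```python
import numpy as np

def suggest_lr(lrs, losses):
    """Return the LR where the loss decreases fastest w.r.t. log10(LR).

    Hypothetical helper for illustration; real range-test curves are
    noisy and are usually smoothed before taking the gradient.
    """
    lrs = np.asarray(lrs, dtype=float)
    losses = np.asarray(losses, dtype=float)
    # Slope of the loss on a log-LR axis (most negative = steepest drop)
    slopes = np.gradient(losses, np.log10(lrs))
    return float(lrs[np.argmin(slopes)])

# Synthetic range-test curve (an assumption, for demonstration only):
# flat at tiny LRs, steepest drop near 10**-2.5, blow-up towards 10**1
lrs = np.logspace(-7, 1, 200)
log_lrs = np.log10(lrs)
losses = (2.3
          - 1.8 / (1 + np.exp(-(log_lrs + 2.5) / 0.3))
          + np.exp(4 * (log_lrs + 0.5)))

suggested = suggest_lr(lrs, losses)
print(f"Suggested LR: {suggested:.1e}")
```

In practice you would then train with a value somewhat below the suggestion, since the steepest point sits close to where the loss starts to become unstable.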

5.2 Optuna: Automated Hyperparameter Search

Python
import optuna
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

# Assumed to exist: train_dataset and dev_dataset, each yielding
# (flattened 784-dim float tensor, integer class label) pairs.

def objective(trial):
    """Optuna objective function for HP search."""

    # ── Sample hyperparameters ──
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 2, 5)
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    hidden_size = trial.suggest_categorical(
        "hidden_size", [64, 128, 256, 512]
    )
    batch_size = trial.suggest_categorical(
        "batch_size", [32, 64, 128]
    )

    # ── Build loaders so the sampled batch size is actually used ──
    train_loader = DataLoader(train_dataset, batch_size=batch_size,
                              shuffle=True)
    dev_loader = DataLoader(dev_dataset, batch_size=256)

    # ── Build model dynamically ──
    layers = []
    in_dim = 784
    for _ in range(n_layers):
        layers.append(nn.Linear(in_dim, hidden_size))
        layers.append(nn.ReLU())
        layers.append(nn.Dropout(dropout))
        in_dim = hidden_size
    layers.append(nn.Linear(in_dim, 10))
    model = nn.Sequential(*layers)

    optimizer = optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    # ── Training loop (simplified, CPU only) ──
    for epoch in range(20):
        model.train()
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(X_batch), y_batch)
            loss.backward()
            optimizer.step()

        # ── Evaluate on dev set after every epoch ──
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for X_val, y_val in dev_loader:
                preds = model(X_val).argmax(dim=1)
                correct += (preds == y_val).sum().item()
                total += len(y_val)
        dev_acc = correct / total

        # Pruning: report the intermediate score and stop bad trials early
        trial.report(dev_acc, epoch)
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    return dev_acc

# ── Run the study ──
study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=5)
)
study.optimize(objective, n_trials=50, timeout=3600)  # 1-hour budget

# ── Results ──
print(f"Best trial: {study.best_trial.number}")
print(f"Best accuracy: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

# Visualize (optional; requires plotly). Each call returns a figure,
# so .show() is needed when running as a script.
optuna.visualization.plot_optimization_history(study).show()
optuna.visualization.plot_param_importances(study).show()

🏭 Industry Note: Optuna at Scale

Optuna (developed by Preferred Networks, Japan) is one of the most widely used HP tuning frameworks in production ML. Its default sampler is the Tree-structured Parzen Estimator (TPE), a Bayesian-style approach that uses the results of earlier trials to choose the next ones, which typically makes it more sample-efficient than pure random search. Key features: pruning (stops unpromising trials early), distributed search across multiple workers, and dashboard visualization. Indian ML teams at companies such as Flipkart, Swiggy, and PhonePe are reported to use Optuna or similar frameworks (Ray Tune, Weights & Biases Sweeps).

Section 6

Visual Diagrams

6.1 Grid Search vs. Random Search: Visual

Grid Search (5×5 = 25 trials)     Random Search (25 trials)
┌──────────────────────────┐      ┌──────────────────────────┐
│  ●    ●    ●    ●    ●   │      │  ○        ○         ○    │
│                          │      │     ○   ○      ○      ○  │
│  ●    ●    ●    ●    ●   │      │ ○            ○     ○     │
│                          │      │    ○    ○       ○     ○  │
│  ●    ●    ●    ●    ●   │      │  ○     ○     ○       ○   │
│                          │      │      ○      ○     ○      │
│  ●    ●    ●    ●    ●   │      │ ○        ○          ○    │
│                          │      │    ○  ○       ○     ○    │
│  ●    ●    ●    ●    ●   │      │  ○         ○       ○     │
└──────────────────────────┘      └──────────────────────────┘
→ x-axis (LR): only 5 unique      → x-axis (LR): 25 unique
→ y-axis (HU): only 5 unique      → y-axis (HU): 25 unique

Project onto LR axis:             Project onto LR axis:
|  ●    ●    ●    ●    ●   |      | ○○ ○ ○ ○○ ○ ○○ ○ ○○○ ○○○○ ○|
5 values → poor coverage          25 values → rich coverage ✓
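
The projection argument in the diagram can be checked with a few lines of code. This is a minimal sketch under simple assumptions: both hyperparameters are rescaled to the unit interval, and `grid_points`, `random_points`, and `unique_per_axis` are illustrative helper names rather than functions defined elsewhere in this chapter.

```python
import random

def grid_points(n_per_axis, low=0.0, high=1.0):
    """All combinations of n_per_axis evenly spaced values on two axes."""
    step = (high - low) / (n_per_axis - 1)
    axis = [low + i * step for i in range(n_per_axis)]
    return [(x, y) for x in axis for y in axis]

def random_points(n_trials, low=0.0, high=1.0, seed=0):
    """n_trials points drawn independently and uniformly on both axes."""
    rng = random.Random(seed)
    return [(rng.uniform(low, high), rng.uniform(low, high))
            for _ in range(n_trials)]

def unique_per_axis(points, axis):
    """Number of distinct values the trials cover along one axis."""
    return len({p[axis] for p in points})

grid = grid_points(5)        # 5 x 5 = 25 trials
rand = random_points(25)     # 25 trials, same budget

print("grid   unique values on axis 0:", unique_per_axis(grid, 0))
print("random unique values on axis 0:", unique_per_axis(rand, 0))
```

With the same 25-trial budget, the grid revisits each learning-rate value five times, while the random draws almost surely give a new value on every trial.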

6.2 Error Decomposition Waterfall

Error Decomposition for ML Strategy
════════════════════════════════════

Human Error     ▓▓               1%   ← Proxy for Bayes Optimal Error
                  ↕ Avoidable Bias (7%)
Train Error     ▓▓▓▓▓▓▓▓▓        8%
                  ↕ Variance (1%)
Train-Dev Err   ▓▓▓▓▓▓▓▓▓▓       9%
                  ↕ Data Mismatch (3%)
Dev Error       ▓▓▓▓▓▓▓▓▓▓▓▓▓   12%
                  ↕ Dev Overfitting (1%)
Test Error      ▓▓▓▓▓▓▓▓▓▓▓▓▓▓  13%

┌─────────────────────────────────────────────┐
│ ACTION PLAN:                                │
│ 1. Avoidable Bias (7%)  ← BIGGEST GAP       │
│    → Bigger model, train longer, new arch   │
│ 2. Data Mismatch (3%)                       │
│    → Data augmentation, synthesize target   │
│ 3. Variance (1%)        ← Already low       │
│ 4. Dev Overfitting (1%) ← Negligible        │
└─────────────────────────────────────────────┘
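
The waterfall above is plain subtraction between adjacent error levels, which makes it easy to automate. The function below is a small sketch: `decompose_errors` is a hypothetical name, and errors are passed as whole percentage points to keep the arithmetic exact.

```python
def decompose_errors(human, train, train_dev, dev, test):
    """Split the error stack into its four gaps and name the largest one."""
    gaps = {
        "avoidable_bias": train - human,
        "variance": train_dev - train,
        "data_mismatch": dev - train_dev,
        "dev_overfitting": test - dev,
    }
    # The largest gap is the first thing to work on
    priority = max(gaps, key=gaps.get)
    return gaps, priority

# Numbers from the waterfall diagram (in percentage points)
gaps, priority = decompose_errors(human=1, train=8, train_dev=9, dev=12, test=13)
for name, gap in gaps.items():
    print(f"{name:<16s} {gap:3d} pts")
print("Top priority:", priority)
```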

6.3 The LR Finder Plot

Figure: Learning Rate Finder, loss vs. LR (log scale). The x-axis is log(LR) from 10⁻⁷ to 10⁰; the y-axis is the loss (ticks from 0.5 to 4). ★ marks the suggested LR at the point of steepest descent (≈ 3×10⁻³).

Rule: Use LR ≈ ★/3 to ★ → 1×10⁻³ to 3×10⁻³

6.4 Orthogonalization: The Four Knobs

Orthogonalization: Fix Problems in Order
════════════════════════════════════════

  Step 1          Step 2          Step 3          Step 4
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│   FIT    │    │   FIT    │    │   FIT    │    │  WORKS   │
│ TRAINING │───→│   DEV    │───→│   TEST   │───→│ IN REAL  │
│   SET    │    │   SET    │    │   SET    │    │  WORLD   │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
     │               │               │               │
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│• Bigger  │    │• Regular-│    │• Bigger  │    │• Change  │
│  network │    │  ization │    │  dev set │    │  dev/test│
│• Train   │    │• More    │    │• Don't   │    │  distri- │
│  longer  │    │  data    │    │  over-   │    │  bution  │
│• Better  │    │• Dropout │    │  tune    │    │• Change  │
│ optimizer│    │• Data aug│    │          │    │  metric  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
Section 7

Worked Example: Complete HP Tuning Workflow

Problem: MNIST Digit Classifier on a ₹500 Cloud Budget

You're a student at VIT Vellore. Your professor gives you a ₹500 Google Cloud credit and asks you to build the best MNIST classifier you can. You have a single T4 GPU (training one model to completion takes ~3 minutes). The budget allows ~100 experiments.

Step 1: Define the Search Space (Using Tier Priorities)

Hyperparameter | Tier | Range | Scale
Learning Rate (α) | 1 | [10⁻⁴, 10⁻¹] | Log-uniform
Hidden Units | 2 | {64, 128, 256, 512} | Categorical
Batch Size | 2 | {32, 64, 128} | Categorical
Dropout | 2 | [0.1, 0.5] | Uniform
Number of Layers | 3 | {2, 3, 4} | Categorical

Step 2: Round 1, Coarse Search (30 trials, 5 epochs each)

30 trials × ~0.5 min per 5-epoch trial ≈ 15 GPU-minutes (a 5-epoch run takes only a fraction of the ~3 minutes needed for full training). Cost: ~₹10.

Python
# Round 1 results (top 5 of 30 trials)
# Trial | LR       | Units | Batch | Drop | Layers | Dev Acc
# ──────┼──────────┼───────┼───────┼──────┼────────┼────────
#   7   | 3.2e-3   |  256  |   64  | 0.30 |   3    | 97.8%
#  12   | 1.8e-3   |  512  |   64  | 0.25 |   3    | 97.6%
#  19   | 5.1e-3   |  256  |  128  | 0.35 |   2    | 97.5%
#  23   | 2.4e-3   |  128  |   64  | 0.20 |   3    | 97.3%
#   3   | 8.7e-3   |  256  |   32  | 0.40 |   2    | 97.1%

Observation: The best LRs cluster in [1e-3, 1e-2]. The top trials favor 256–512 hidden units, 3 layers, batch size 64, and dropout of roughly 0.20–0.35.

Step 3: Round 2, Fine Search (20 trials, 20 epochs each)

Narrow the ranges based on Round 1 findings:

Python
import numpy as np

# Narrowed search space
# (batch size is not searched in this round; Round 1 favored 64)
fine_search_space = {
    'lr':      {'type': 'log_uniform', 'low': np.log10(1e-3),
                'high': np.log10(1e-2)},   # narrowed! bounds are base-10 exponents
    'units':   {'type': 'choice', 'options': [192,256,320,384,512]},
    'dropout': {'type': 'uniform', 'low': 0.2, 'high': 0.4},
    'layers':  {'type': 'choice', 'options': [3, 4]},
}

# Top result from Round 2:
# LR = 2.8e-3, units = 320, dropout = 0.28, layers = 3
# Dev Accuracy: 98.4%
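
To make the narrowed space concrete, here is one way a configuration could be drawn from a dictionary of that shape. This is a sketch only: `sample_config` is a hypothetical helper (the chapter's own searcher class may do this differently), and it assumes that `low` and `high` for a `log_uniform` entry are base-10 exponents, as in the dictionary above.

```python
import math
import random

def sample_config(space, rng):
    """Draw one configuration from a search-space dict (hypothetical helper)."""
    config = {}
    for name, spec in space.items():
        if spec["type"] == "log_uniform":
            # 'low' and 'high' are exponents: sample the exponent uniformly
            config[name] = 10 ** rng.uniform(spec["low"], spec["high"])
        elif spec["type"] == "uniform":
            config[name] = rng.uniform(spec["low"], spec["high"])
        elif spec["type"] == "choice":
            config[name] = rng.choice(spec["options"])
        else:
            raise ValueError(f"Unknown type for {name!r}: {spec['type']}")
    return config

fine_search_space = {
    "lr":      {"type": "log_uniform", "low": math.log10(1e-3),
                "high": math.log10(1e-2)},
    "units":   {"type": "choice", "options": [192, 256, 320, 384, 512]},
    "dropout": {"type": "uniform", "low": 0.2, "high": 0.4},
    "layers":  {"type": "choice", "options": [3, 4]},
}

rng = random.Random(42)  # fixed seed so the 20 trials are reproducible
trials = [sample_config(fine_search_space, rng) for _ in range(20)]
for cfg in trials[:3]:
    print({k: (round(v, 5) if isinstance(v, float) else v) for k, v in cfg.items()})
```

Each sampled dictionary would then be passed to whatever training function evaluates a single trial.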

Step 4: Final Training (Top 3 candidates, full 50 epochs)

Python
# Final evaluation on held-out test set
# Model          | Dev Acc  | Test Acc
# ───────────────┼──────────┼──────────
# Candidate A    | 98.4%    | 98.3%    ← Selected on dev accuracy (dev ≈ test, no sign of dev overfitting)
# Candidate B    | 98.3%    | 98.2%
# Candidate C    | 98.1%    | 98.0%

Step 5: Error Analysis on the Best Model

Inspect 100 misclassified dev examples (the test set stays untouched so that the final evaluation remains unbiased):

Error Category | Count | % of Errors
Ambiguous digits (e.g., 4 vs 9) | 42 | 42%
Rotated/skewed digits | 28 | 28%
Very thin strokes | 18 | 18%
Noisy backgrounds | 12 | 12%

Conclusion: To go beyond 98.3%, focus on data augmentation for rotation/skew (28% of errors) and potentially ensemble models for ambiguous digits (42% of errors). Total cloud spend: ~₹80 out of the ₹500 budget. 🎯
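
The category shares translate into a best-case gain per category. The sketch below assumes that the 100 inspected examples are representative of all dev errors and takes the dev error as 1.6% (100% minus the 98.4% dev accuracy reported above); `ceiling_table` is a hypothetical helper name.

```python
def ceiling_table(dev_error, category_counts):
    """For each error category, compute its share of the inspected errors,
    the maximum gain from fixing it completely, and the dev error that
    would remain afterwards (the ceiling)."""
    total = sum(category_counts.values())
    rows = []
    for name, count in sorted(category_counts.items(), key=lambda kv: -kv[1]):
        share = count / total
        max_gain = dev_error * share
        rows.append((name, share, max_gain, dev_error - max_gain))
    return rows

counts = {
    "Ambiguous digits": 42,
    "Rotated/skewed digits": 28,
    "Very thin strokes": 18,
    "Noisy backgrounds": 12,
}
rows = ceiling_table(dev_error=1.6, category_counts=counts)  # 1.6 = percent
for name, share, gain, remaining in rows:
    print(f"{name:<24s} {share:5.0%}  max gain {gain:.2f} pts  "
          f"remaining {remaining:.2f}%")
```

Even the largest category is worth well under one percentage point here, which is useful context when deciding how much engineering time each fix deserves.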

Section 8

Case Study: Ola Surge Pricing

🚗 Ola's Surge Pricing Model: When Historical ≠ Real-Time

The Problem

Ola's surge pricing model predicts demand-supply imbalance in each zone to set dynamic pricing. The challenge: training data is historical (past ride requests, driver locations, weather, events), but inference happens in real-time (current conditions that may be very different).

Distribution Mismatch Sources

Factor | Training Data (Historical) | Inference (Real-Time)
Events | Diwali 2024 patterns | IPL 2025 final (new stadium, new traffic patterns)
Geography | Data from 5 tier-1 cities | Expanding to 50 tier-2/3 cities
Weather | Historical averages | Unexpected cyclone Biparjoy
Competition | Pre-Rapido bike-taxi era | Post-Rapido with 2-wheeler competition
Regulation | Pre-fare-cap rules | New state-level fare caps in Karnataka

Ola's ML Strategy (Reconstructed)

Step 1: Dev/Test Set Design
  • Training set: 500M historical ride records (2020–2024) from all cities. Source distribution.
  • Dev set: 100K most recent records (last 30 days) from target cities. Target distribution.
  • Test set: 50K records from the last 7 days (same target distribution, held out).
  • Train-Dev set: 50K records randomly sampled from training data (source distribution, but held out).
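
The four-way split described above can be sketched in a few lines. This is an illustration under stated assumptions, not Ola's pipeline: `design_splits` is a hypothetical helper, target records are assumed to be ordered oldest to newest, and dev and test are kept disjoint by taking the newest records for test and the block just before them for dev.

```python
import random

def design_splits(source, target_by_time, n_train_dev, n_dev, n_test, seed=0):
    """Build train / train-dev / dev / test splits (illustrative sketch).

    source:         records from the source (historical) distribution
    target_by_time: records from the target distribution, oldest first
    """
    source = list(source)
    target = list(target_by_time)
    if n_test <= 0 or n_dev <= 0 or n_dev + n_test > len(target):
        raise ValueError("dev/test sizes do not fit the target data")
    if not 0 < n_train_dev < len(source):
        raise ValueError("train-dev size does not fit the source data")

    # Train-dev is a random held-out slice of the SOURCE distribution
    random.Random(seed).shuffle(source)
    return {
        "train_dev": source[:n_train_dev],
        "train": source[n_train_dev:],
        # Dev and test both come from the TARGET distribution
        "dev": target[-(n_dev + n_test):-n_test],
        "test": target[-n_test:],
    }

# Toy stand-ins for ride records: integers instead of real rows
splits = design_splits(source=range(1000), target_by_time=range(1000, 1300),
                       n_train_dev=50, n_dev=100, n_test=50)
print({name: len(rows) for name, rows in splits.items()})
```

Any target records older than the dev window are left unused in this sketch; in practice they could be added to training to reduce the mismatch.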
Step 2: Error Decomposition
Metric | MAE (₹)
Human expert (Bayes proxy) | ₹8
Training error | ₹12
Train-Dev error | ₹15
Dev error | ₹28
Test error | ₹30

Decomposition:

  • Avoidable Bias: ₹12 − ₹8 = ₹4
  • Variance: ₹15 − ₹12 = ₹3
  • Data Mismatch: ₹28 − ₹15 = ₹13 ← Biggest problem!
  • Dev Overfitting: ₹30 − ₹28 = ₹2
Step 3: Fixing Data Mismatch (₹13 gap)
  • Real-time feature injection: Add current weather, event calendars, and live traffic as features at inference time, not just historical averages.
  • Synthetic data: Simulate "what if IPL final + heavy rain" scenarios by combining historical ride patterns with synthetic weather overlays.
  • Online learning: Fine-tune the model daily on the most recent 24 hours of data to reduce temporal mismatch.
  • Domain adaptation: Use a small amount of target city data to adapt the model trained on tier-1 cities.
Step 4: Hyperparameter Tuning

After fixing data mismatch, re-run HP search using the Caviar strategy (Ola has a GPU cluster):

  • Tier 1: Learning rate → Log-uniform [10⁻⁵, 10⁻²]
  • Tier 2: Hidden units (128–1024), batch size (256–2048 for large data)
  • 50 parallel Optuna trials with pruning
  • Result: Dev MAE dropped from ₹28 to ₹16
Business Impact

Reducing surge pricing prediction error from ₹28 MAE to ₹16 MAE meant:

  • Fewer overcharged rides → 12% reduction in ride cancellations
  • Better driver allocation → 8% improvement in driver utilization
  • Estimated revenue impact: ₹45 crore annually across 250+ cities
Section 9

Common Mistakes & Misconceptions

Mistake #1: Tuning all hyperparameters simultaneously with equal priority.
Don't spend time searching over Adam's ε (Tier 4) when you haven't even found a good learning rate (Tier 1). Follow the tier ordering: it can save most of your compute budget.
Mistake #2: Using grid search "because it's systematic."
Grid search wastes experiments on redundant combinations. For the same budget, random search covers the important dimensions much more thoroughly. Bergstra & Bengio (2012) showed this both empirically and with a supporting analysis.
Mistake #3: Sampling learning rate uniformly (e.g., 0.001 to 1.0).
This concentrates 90% of samples above 0.1, where most reasonable LRs don't live. Always use log-uniform sampling, e.g. α = 10 ** uniform(-4, -1) for the range [10⁻⁴, 10⁻¹].
Mistake #4: Using test set for hyperparameter tuning.
The moment you use test set performance to make model decisions, it becomes a second dev set and your final evaluation is biased. Keep the test set locked away until the very end.
Mistake #5: Ignoring data mismatch between train and dev sets.
If dev error is high, many practitioners assume "overfitting" and add regularization. But the real problem might be distribution mismatch. Use a train-dev set to distinguish variance from data mismatch.
Mistake #6: Skipping manual error analysis.
Spending 30 minutes inspecting 100 misclassified examples often reveals insights that save weeks of engineering effort. Don't jump straight to "collect more data" without understanding why the model fails.
Mistake #7: Using early stopping as the primary regularization.
Early stopping simultaneously affects bias (underfits training set) and variance (prevents overfitting dev set). This violates orthogonalization. Prefer L2 regularization + dropout, which affect only variance without compromising training fit.
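
The 90% figure in Mistake #3 is easy to verify empirically. The snippet below is a small sketch that compares linear and log-uniform sampling over the same range [0.001, 1.0]; the sample size, the seed, and the helper name `fraction_above` are arbitrary choices made for illustration.

```python
import random

def fraction_above(samples, threshold):
    """Share of samples that are strictly greater than threshold."""
    return sum(s > threshold for s in samples) / len(samples)

rng = random.Random(0)
n = 100_000

# Linear sampling: every value in [0.001, 1.0] is equally likely
uniform_lrs = [rng.uniform(0.001, 1.0) for _ in range(n)]
# Log-uniform sampling: every order of magnitude is equally likely
log_uniform_lrs = [10 ** rng.uniform(-3, 0) for _ in range(n)]

f_uniform = fraction_above(uniform_lrs, 0.1)
f_log = fraction_above(log_uniform_lrs, 0.1)
print(f"uniform:     {f_uniform:.1%} of samples above 0.1")
print(f"log-uniform: {f_log:.1%} of samples above 0.1")
```

Linear sampling puts about nine in ten trials in the top decade, whereas log-uniform sampling spends about a third of the trials in each of the three decades.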
Section 10

Comparison Tables

10.1 Search Strategies Compared

Aspect | Grid Search | Random Search | Bayesian (Optuna)
Unique values per axis (N trials) | N^(1/k) for k HPs (∛N for k = 3) | N | N (informed)
Handles HP importance | ❌ Equal spacing | ✅ Implicitly (a new value of every HP on each trial) | ✅ Learns importance
Parallelizable | ✅ Fully | ✅ Fully | ⚠️ Partially
Compute efficiency | Low | Medium | High
Implementation effort | Easy | Easy | Medium (library)
Best for | ≤2 HPs | 3–7 HPs | 3–20 HPs
Uses previous results | ❌ No | ❌ No | ✅ Yes (surrogate model)

10.2 Bias vs. Variance vs. Mismatch Diagnostics

Problem | Symptom | Solution
High Avoidable Bias | Train error >> Human error | Bigger model, train longer, better architecture
High Variance | Train-Dev error >> Train error | Regularization, more data, dropout
Data Mismatch | Dev error >> Train-Dev error | More target data, data synthesis, domain adaptation
Dev Overfitting | Test error >> Dev error | Bigger dev set, less HP tuning on dev

10.3 Panda vs. Caviar Strategy

Aspect | 🐼 Panda (Babysitting) | 🐟 Caviar (Parallel)
Compute resources | 1 GPU | 10–100+ GPUs
Experiments at a time | 1 | Dozens
Human attention | High (daily monitoring) | Low (set and forget)
Time to result | Days–weeks | Hours–days
Cost per experiment | Low | High (but faster)
Typical user | Student, startup | Large company, research lab
Indian example | IIT M.Tech thesis on ₹1L budget | Jio AI team with private GPU cluster
Section 11

Exercises

Section 11A

Multiple Choice Questions (10)

Q1 Beginner

According to Andrew Ng's hyperparameter priority ranking, which hyperparameter belongs to Tier 1 (highest priority)?

  A. Number of hidden layers
  B. Learning rate α
  C. Adam's β₂ parameter
  D. Mini-batch size
✅ B. Learning rate α. Tier 1 contains only the learning rate, which has the single highest impact on model performance. Number of layers is Tier 3, mini-batch size is Tier 2, and Adam's β₂ is Tier 4.
Remember | Hyperparameter Priority
Q2 Beginner

In a 5×5 grid search over learning rate and hidden units (25 total experiments), how many unique learning rate values are tested?

  A. 25
  B. 10
  C. 5
  D. 1
✅ C. 5. In a 5×5 grid, each axis has only 5 unique values. The 25 experiments test all 25 combinations, but along any single axis, only 5 unique values are explored. This is the fundamental limitation of grid search.
Understand | Grid Search
Q3 Intermediate

You want to search over learning rates in [10⁻⁴, 10⁻¹]. What is the correct log-uniform sampling in Python?

  A. np.random.uniform(0.0001, 0.1)
  B. 10 ** np.random.uniform(-4, -1)
  C. np.random.choice([0.0001, 0.001, 0.01, 0.1])
  D. np.exp(np.random.uniform(-4, -1))
✅ B. 10 ** np.random.uniform(-4, -1). This samples the exponent uniformly between −4 and −1, then raises 10 to that power, giving equal probability to each order of magnitude. Option A is linear uniform (biased toward large values). Option C only ever tries four fixed values. Option D is log-uniform but uses base e, so it covers [e⁻⁴, e⁻¹] ≈ [0.018, 0.37] instead of the intended range.
Apply | Logarithmic Scale
Q4 Intermediate

You have 10 million training examples. What is the recommended train/dev/test split?

  A. 60% / 20% / 20%
  B. 80% / 10% / 10%
  C. 98% / 1% / 1%
  D. 99.5% / 0.25% / 0.25%
✅ D. 99.5% / 0.25% / 0.25%. With 10M examples, even 0.25% gives you 25,000 examples each in the dev and test sets, which is more than sufficient for reliable evaluation. Using 20% for dev would waste 2 million examples that could improve training.
Apply | Data Splits
Q5 Intermediate

Your model has: Human error = 1%, Training error = 2%, Dev error = 10%. The training and dev sets come from the same distribution. What is the dominant problem?

  A. High avoidable bias
  B. High variance
  C. Data mismatch
  D. Bayes error is too high
✅ B. High variance. Avoidable bias = 2% − 1% = 1% (small). Variance = 10% − 2% = 8% (large). The dominant gap is between training and dev error, and because both sets share a distribution, that gap means the model overfits the training set. Solution: more regularization, more data, or dropout.
Analyze | Error Decomposition
Q6 Intermediate

What is the purpose of a "train-dev" set?

  A. To augment the training data with more examples
  B. To distinguish between variance and data mismatch as causes of dev error
  C. To replace the test set for final evaluation
  D. To validate that the train set has no label errors
✅ B. To distinguish between variance and data mismatch. The train-dev set has the same distribution as training data but is held out. If train-dev error is much higher than train error, the problem is variance. If dev error is much higher than train-dev error, the problem is data mismatch.
Understand | Train-Dev Set
Q7 Advanced

Why does early stopping violate the principle of orthogonalization?

  A. It requires manual monitoring of training
  B. It simultaneously affects training set fit (bias) and dev set fit (variance)
  C. It only works with SGD, not Adam
  D. It prevents the model from reaching zero training loss
✅ B. It simultaneously affects training set fit and dev set fit. Orthogonalization requires each "knob" to affect exactly one goal. Early stopping stops training before the model fully fits the training set (affects bias) AND prevents overfitting the dev set (affects variance). L2 regularization is preferred because it only affects the variance knob.
Analyze | Orthogonalization
Q8 Intermediate

In a food classification task, error analysis reveals that 38% of misclassified images are blurry. If the dev error is 10%, what is the "ceiling", i.e. the best possible dev error if you perfectly solve the blurry image problem?

  A. 3.8%
  B. 6.2%
  C. 10%
  D. 0%
✅ B. 6.2%. If 38% of the 10% errors are due to blurry images, fixing all of them removes 3.8% from the error. Remaining error = 10% − 3.8% = 6.2%. This is the "ceiling": the best you could achieve by fixing only this one error category.
Apply | Ceiling Analysis
Q9 Advanced

A medical imaging model achieves 0.5% error, while the best team of radiologists achieves 0.7% error. Which statement is TRUE?

  A. The model has zero avoidable bias
  B. Human-level error can no longer serve as a useful proxy for Bayes error
  C. The model must have overfitted since it beat humans
  D. You should use the average doctor's error rate as the Bayes proxy
✅ B. Human-level error can no longer serve as a useful proxy for Bayes error. When the model surpasses the best human performance, we don't know how much room is left (Bayes error could be 0.3% or 0.5%). The gap between model error and Bayes error becomes uncertain, making it hard to know whether to focus on bias or variance.
Evaluate | Bayes Error
Q10 Intermediate

When should you use the "Panda" (babysitting) strategy over the "Caviar" (parallel) strategy?

  A. When you have a large cloud budget and many GPUs
  B. When you have limited compute (e.g., a single GPU)
  C. When the dataset is very large (10M+ examples)
  D. When you're tuning only Tier 4 hyperparameters
✅ B. When you have limited compute (e.g., a single GPU). The Panda strategy involves carefully monitoring and adjusting a single model's hyperparameters over time, like a panda carefully raising one baby. It's ideal for resource-constrained settings (students, small startups). The Caviar strategy requires many GPUs to run experiments in parallel.
Understand | Tuning Strategy
Section 11B

Short Answer Questions (5)

B1. Beginner

List the four tiers of hyperparameter importance according to Andrew Ng. Give one example hyperparameter for each tier.

B2. Intermediate

Explain with a numerical example why sampling the momentum parameter β uniformly from [0.9, 0.999] is a bad idea. What should you do instead?

B3. Intermediate

You're building a crop disease classifier for an agri-tech startup in Pune. You have 2 million images from web-scraped agricultural databases but only 8,000 images from Indian farmers' phones (the target use case). How would you design the train/dev/test split?

B4. Intermediate

A model has: Human error = 3%, Train error = 4%, Train-Dev error = 5%, Dev error = 12%, Test error = 13%. Compute all four error gaps (avoidable bias, variance, data mismatch, dev overfitting) and identify the top priority to fix.

B5. Advanced

Explain why ML progress typically slows down after surpassing human-level performance. Give two concrete reasons related to the ML workflow.

Section 11C

Long Answer Questions (3)

C1. Intermediate: The Complete Tuning Workflow (10 marks)

You are building a text classification model to categorize customer complaints for a telecom company (Jio). You have 5 million labeled complaints from email (historical) and need the model to work on WhatsApp messages (target). Describe a complete ML strategy covering: (a) data split design, (b) hyperparameter search approach, (c) error analysis methodology, (d) how to handle distribution mismatch.

C2. Advanced: Grid vs. Random, Mathematical Proof (8 marks)

Prove mathematically that for a fixed budget of N experiments, random search tests more unique values along the most important hyperparameter axis than grid search when there are k hyperparameters (k ≥ 2). State your assumptions clearly. Compute the exact number of unique values per axis for both methods when N = 64 and k = 3.

C3. Advanced: Orthogonalization in Practice (8 marks)

A junior ML engineer at Infosys reports the following results for a sentiment analysis model: Human error = 2%, Training error = 15%, Dev error = 16%. They propose: "Let's add dropout and L2 regularization since the dev error is high." Critique this proposal using orthogonalization principles. What would you recommend instead? Explain your reasoning step by step.

Section 11D

Programming Exercises (2)

D1. Intermediate: Build a Complete HP Search Pipeline

Using the HyperparameterSearcher class from Section 4, build a complete pipeline that:

  1. Defines a search space for a 3-layer neural network (LR, hidden units, dropout, batch size)
  2. Implements a dummy train_fn that simulates training (you can use synthetic data or sklearn's make_classification)
  3. Runs a 2-round coarse-to-fine search (Round 1: 20 trials, broad range; Round 2: 10 trials, narrowed range)
  4. Prints the top-5 results from each round
  5. Reports the final best hyperparameters

Deliverable: A Python script that runs end-to-end and prints results to console.

D2. Advanced: LR Finder with PyTorch

Implement a LR Finder for a PyTorch model on the Fashion-MNIST dataset:

  1. Load Fashion-MNIST using torchvision.datasets
  2. Define a 3-layer network with BatchNorm and Dropout
  3. Implement the LR range test: exponentially increase LR from 10⁻⁷ to 10 over one epoch
  4. Plot the loss vs. log(LR) curve using matplotlib
  5. Automatically find and print the suggested LR (steepest descent point)
  6. Train the model using the suggested LR and report test accuracy

Deliverable: A Python script with a matplotlib plot and final test accuracy ≥ 88%.

Section 11E

Mini-Project

🏗️ End-to-End ML Strategy for Indian Language Sentiment Analysis

Scenario: You're building a sentiment classifier for Hinglish (Hindi-English code-mixed) product reviews for a Flipkart internship project. You have:

  • 500K English Amazon reviews (labeled positive/negative): source distribution
  • 5K Hinglish Flipkart reviews (labeled): target distribution
  • 50K unlabeled Hinglish reviews

Tasks:

  1. Data Split Design: Design your train/dev/test/train-dev splits. Justify your choices with exact numbers.
  2. Baseline Model: Train a simple logistic regression or 2-layer NN on English data. Report train, train-dev, dev, and test errors.
  3. Error Decomposition: Compute avoidable bias, variance, and data mismatch. Identify the biggest problem.
  4. Hyperparameter Search: Run a 2-round coarse-to-fine random search using Optuna. Document all trials.
  5. Error Analysis: Manually inspect 50 misclassified dev examples. Create an error category spreadsheet and perform ceiling analysis.
  6. Strategy Proposal: Based on your error analysis, propose 3 concrete next steps with expected impact. Prioritize using ceilings.

Deliverable: A Jupyter notebook with all code, analysis tables, and a 500-word strategy writeup. Grading emphasizes strategic reasoning over raw accuracy.

Section 12

Chapter Summary

Key Takeaways from Chapter 11

  1. Hyperparameter Priority: Learning rate (Tier 1) → momentum, hidden units, batch size (Tier 2) → layers, LR decay (Tier 3) → Adam params (Tier 4). Spend 80% of your budget on Tiers 1–2.
  2. Random Search > Grid Search: For N experiments, random search tests N unique values per axis vs. N^(1/k) for grid search with k hyperparameters. The probability of finding the optimal region is dramatically higher with random search.
  3. Coarse-to-Fine: Don't run one massive search. Start broad with few epochs, zoom into the promising region, then run longer training on the finalists.
  4. Logarithmic Scale: Always sample learning rate and regularization strength on log scale. For momentum β, sample (1−β) on log scale.
  5. Big Data Splits: With 1M+ examples, use 98/1/1 or 99.5/0.25/0.25 splits. Dev and test sets must come from the same (target) distribution.
  6. Train-Dev Set: A diagnostic tool to distinguish variance from data mismatch. Same distribution as training, but held out from training.
  7. Error Analysis: Manually inspect 100 misclassified examples, categorize errors, compute ceilings, and prioritize the category whose fix would remove the most error (the one with the lowest remaining-error ceiling).
  8. Orthogonalization: Fix problems in order: bias first (bigger model) → variance (regularization) → data mismatch (more target data) → real-world performance (change metric/distribution).
  9. Human-Level Performance: Use the best human performance as a proxy for Bayes error. Avoidable bias = train error − human error. Variance = dev error − train error (or train-dev error − train error when the train and dev sets come from different distributions).
  10. Panda vs. Caviar: Babysit one model (limited compute) vs. spawn many in parallel (abundant compute). Choose based on your resource constraints.
The ML Strategy Decision Tree:

High avoidable bias? → Bigger model, train longer
High variance? → Regularize, more data
Data mismatch? → More target data, domain adaptation
Dev overfitting? → Bigger dev set
All good? → Ship it! 🚀
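
Takeaway 4 on sampling momentum can be sketched as follows. Sampling β uniformly in [0.9, 0.999] would put only about 9% of trials above 0.99, even though the model is most sensitive there; sampling 1−β on a log scale splits the trials evenly between [0.9, 0.99] and [0.99, 0.999]. The helper name `sample_beta` and the seed are illustrative assumptions.

```python
import math
import random

def sample_beta(rng, low=0.9, high=0.999):
    """Sample momentum beta so that (1 - beta) is log-uniform."""
    exp_low = math.log10(1 - high)    # about -3
    exp_high = math.log10(1 - low)    # about -1
    r = rng.uniform(exp_low, exp_high)
    return 1 - 10 ** r

rng = random.Random(0)
betas = [sample_beta(rng) for _ in range(100_000)]
share_high = sum(b > 0.99 for b in betas) / len(betas)
print(f"Share of samples with beta > 0.99: {share_high:.1%}")
```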
Section 13

References & Further Reading

Foundational Papers

  1. Bergstra, J., & Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research, 13, 281–305. The seminal paper showing that random search outperforms grid search for the same budget.
  2. Smith, L.N. (2017). Cyclical Learning Rates for Training Neural Networks. IEEE WACV. Introduced the LR range test (LR Finder) and cyclical learning rates.
  3. Snoek, J., Larochelle, H., & Adams, R.P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. NeurIPS. Foundations of Bayesian hyperparameter optimization.
  4. Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. KDD. The Optuna paper.

Textbooks & Courses

  1. Ng, A. (2017). Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization. Coursera Deep Learning Specialization, Course 2. Primary source for the tier ranking; the orthogonalization and error analysis frameworks are covered in Course 3 of the same specialization, Structuring Machine Learning Projects.
  2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press, Chapter 11: Practical Methodology. Covers hyperparameter selection strategy.
  3. Howard, J., & Gugger, S. (2020). Deep Learning for Coders with fastai and PyTorch. O'Reilly. Practical LR Finder and 1cycle policy implementation.

Indian Industry Applications

  1. Ola Engineering Blog (2023). Dynamic Pricing at Scale: ML Architecture Behind Surge Pricing. engineering.olacabs.com
  2. Flipkart Tech Blog (2023). Scaling Recommendations for 400M Users. tech.flipkart.com
  3. Niramai Health Analytix. AI for Breast Cancer Screening. niramai.com. Affordable AI-powered diagnostics using thermal imaging.

Tools & Libraries

  1. Optuna Documentation: optuna.readthedocs.io
  2. torch-lr-finder: github.com/davidtvs/pytorch-lr-finder
  3. Weights & Biases Sweeps: wandb.ai/site/sweeps
  4. Ray Tune: docs.ray.io/en/latest/tune/