PyTorch • EduArtha

PyTorch: Fundamentals

Master the most popular deep learning framework. From tensors to production deployment — build, train, and ship real AI models with PyTorch.

⏱ 3–5 months | 14 Chapters | 50+ Exercises | 14 Projects | Industry Problems

Part I

Tensor Foundations

The building blocks of every PyTorch program

Chapter 1

Tensors & Operations

Learning Objectives

Create tensors from Python lists, NumPy arrays, and built-in factories
Understand dtypes, shapes, strides, and memory layout
Master reshaping, indexing, slicing, and advanced indexing
Apply broadcasting rules for element-wise operations
Convert between PyTorch tensors and NumPy arrays

What Is a Tensor?

A tensor is a multi-dimensional array — the fundamental data structure in PyTorch. Scalars are 0-D tensors, vectors are 1-D, matrices are 2-D, and anything higher is an n-D tensor. Every neural network input, output, weight, and gradient is a tensor.

Dimension	Name	Shape Example	Use Case
0-D	Scalar	`torch.tensor(3.14)`	Loss value
1-D	Vector	`(512,)`	Bias, embedding
2-D	Matrix	`(64, 784)`	Batch of flat images
3-D	3-Tensor	`(32, 100, 512)`	Batch of sequences
4-D	4-Tensor	`(16, 3, 224, 224)`	Batch of RGB images

Creating Tensors

Python
import torch

# From Python data
a = torch.tensor([1, 2, 3])                      # int64 by default
b = torch.tensor([[1.0, 2.0], [3.0, 4.0]])      # float32

# Factory functions
zeros = torch.zeros(3, 4)                        # 3×4 of zeros
ones  = torch.ones(2, 3, dtype=torch.float16)    # specify dtype
rand  = torch.randn(5, 5)                        # standard normal
eye   = torch.eye(4)                              # 4×4 identity
arange = torch.arange(0, 10, 2)                  # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5)              # 5 evenly spaced

# From NumPy (shares memory!)
import numpy as np
np_arr = np.array([1, 2, 3])
t = torch.from_numpy(np_arr)                      # zero-copy
back = t.numpy()                                   # back to numpy

Shared Memory Warning

torch.from_numpy() shares memory with the NumPy array. Modifying one changes the other. Use .clone() if you need an independent copy.
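
The sharing behaviour described in the warning can be seen in a few lines. This is a minimal sketch assuming NumPy and PyTorch are both installed; the variable names `shared` and `independent` are illustrative and not part of the original example.

```python
import numpy as np
import torch

np_arr = np.array([1, 2, 3])
shared = torch.from_numpy(np_arr)                # uses the NumPy buffer directly
independent = torch.from_numpy(np_arr).clone()   # owns a separate copy

np_arr[0] = 99                                   # write through the NumPy side

print(shared)        # first element is now 99
print(independent)   # first element is still 1
```

The same holds in the other direction: an in-place change to `shared` is visible in `np_arr`.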

Reshaping & Views

Python
x = torch.arange(12)

# Reshape (may copy)
a = x.reshape(3, 4)
b = x.reshape(2, -1)    # -1 = infer → (2, 6)

# View (never copies; raises if the memory layout is incompatible, e.g. after permute)
c = x.view(4, 3)

# Squeeze / Unsqueeze
t = torch.randn(1, 3, 1)
print(t.squeeze().shape)      # (3,)   removes every size-1 dim
print(t.unsqueeze(0).shape)   # (1,1,3,1) — add dim at 0

# Permute / Transpose
img = torch.randn(3, 224, 224)   # C,H,W
hwc = img.permute(1, 2, 0)         # H,W,C for matplotlib
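
The difference between `view` and `reshape` comes down to strides. The sketch below, which uses a small illustrative tensor rather than anything from the original text, shows that a transpose only swaps strides, that `view` then refuses to flatten the result, and that `reshape` or `.contiguous()` resolves it by copying.

```python
import torch

x = torch.arange(12).reshape(3, 4)
print(x.stride(), x.is_contiguous())      # (4, 1) True

xt = x.t()                                # same storage, strides swapped
print(xt.stride(), xt.is_contiguous())    # (1, 4) False

view_failed = False
try:
    xt.view(12)                           # this layout cannot be expressed as a flat view
except RuntimeError:
    view_failed = True
print("view failed:", view_failed)

flat_copy = xt.reshape(12)                # reshape falls back to a copy
flat_view = xt.contiguous().view(12)      # or copy explicitly, then view
print(torch.equal(flat_copy, flat_view))
```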

Indexing & Broadcasting

Python
m = torch.randn(4, 5)

# Basic indexing
row0 = m[0]           # first row
elem  = m[2, 3]       # element at row 2, col 3
cols  = m[:, 1:3]     # all rows, cols 1-2

# Boolean indexing
mask = m > 0
positives = m[mask]    # flat tensor of positive values

# Broadcasting: (4,5) + (5,) → element-wise add
bias = torch.randn(5)
result = m + bias      # bias broadcast across rows

# Broadcasting rules:
# 1. Align shapes from the right
# 2. Dimensions must be equal OR one of them is 1
# (4,1,3) + (1,5,3) → (4,5,3) ✓
# (4,3)   + (5,3)   → ERROR     ✗
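
The two broadcasting rules listed above can be written out as a small pure-Python function. `broadcast_shape` is a hypothetical helper for illustration only; PyTorch ships its own `torch.broadcast_shapes` for real use.

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    out = []
    # Rule 1: align from the right; a missing dimension counts as size 1.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        # Rule 2: sizes must be equal, or one of them must be 1.
        if da == db or db == 1:
            out.append(da)
        elif da == 1:
            out.append(db)
        else:
            raise ValueError(f"cannot broadcast {a} with {b}")
    return tuple(reversed(out))

print(broadcast_shape((4, 5), (5,)))          # (4, 5)
print(broadcast_shape((4, 1, 3), (1, 5, 3)))  # (4, 5, 3)
```

Calling `broadcast_shape((4, 3), (5, 3))` raises `ValueError`, matching the failing case in the rules above.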

Exercises

Ex 1.1: Create a 5×5 tensor with values 1-25, then extract the 3×3 center sub-matrix.

Solution

t = torch.arange(1, 26).reshape(5, 5)
center = t[1:4, 1:4]
print(center)

Ex 1.2: Given a batch of images (B, C, H, W), compute the mean pixel value per channel.

Solution

imgs = torch.randn(16, 3, 32, 32)
channel_means = imgs.mean(dim=(0, 2, 3))  # shape: (3,)
print(channel_means)

Ex 1.3: Use broadcasting to add a row-vector (1, 5) and a column-vector (3, 1) to get a (3, 5) matrix.

Solution

row = torch.tensor([[1, 2, 3, 4, 5]])
col = torch.tensor([[10], [20], [30]])
result = row + col  # (3, 5)
print(result)

Project: Tensor Statistics Explorer

Build a function that takes any tensor and prints comprehensive statistics:

Python
def tensor_report(t, name="tensor"):
    print(f"═══ {name} ═══")
    print(f"Shape:    {t.shape}")
    print(f"Dtype:    {t.dtype}")
    print(f"Device:   {t.device}")
    print(f"Numel:    {t.numel():,}")
    print(f"Memory:   {t.element_size() * t.numel() / 1024:.1f} KB")
    if t.is_floating_point():
        print(f"Mean:     {t.mean():.4f}")
        print(f"Std:      {t.std():.4f}")
        print(f"Min/Max:  {t.min():.4f} / {t.max():.4f}")
        print(f"Has NaN:  {t.isnan().any()}")
        print(f"Has Inf:  {t.isinf().any()}")

# Test it
x = torch.randn(16, 3, 224, 224)
tensor_report(x, "ImageNet Batch")

Industry: Tensor Operations at Scale

Netflix Recommendation System: Netflix's recommendation engine processes user-item interaction matrices as sparse tensors. A typical matrix has ~200M users × ~15K titles. PyTorch's sparse tensor operations (torch.sparse_coo_tensor) enable efficient matrix factorization on GPUs, computing millions of recommendations in seconds rather than hours.

Why This Matters for AI

Every AI model processes tensors. The largest GPT-3 model represents text as tensors of shape (batch, seq_len, 12288). Stable Diffusion processes image tensors of shape (batch, 4, 64, 64) in latent space for 512×512 images. Understanding shapes, broadcasting, and memory layout is essential, because shape mismatches are among the most common bugs in deep learning.

Key Takeaways

Tensors are multi-dimensional arrays — the core data structure in PyTorch
Use factory functions (zeros, randn, arange) for common patterns
view never copies memory; reshape may copy if not contiguous
Broadcasting follows right-alignment rules — dimensions must match or be 1
NumPy interop shares memory — use .clone() for independence

Chapter 2

GPU Computing with CUDA

Learning Objectives

Check CUDA availability and manage GPU devices
Move tensors and models between CPU and GPU with .to()
Understand GPU memory management and avoid OOM errors
Benchmark CPU vs GPU performance for matrix operations

Device Management

Python
import torch

# Check availability
print(torch.cuda.is_available())      # True if GPU available
print(torch.cuda.device_count())       # Number of GPUs
print(torch.cuda.get_device_name(0))   # e.g., 'NVIDIA A100'

# Best practice: device-agnostic code
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")

# Move tensors to GPU
x = torch.randn(1000, 1000)
x_gpu = x.to(device)           # copy to GPU
x_gpu = x.cuda()               # shorthand for the current CUDA device; raises without CUDA
x_cpu = x_gpu.cpu()            # back to CPU

# Create directly on GPU
y = torch.randn(1000, 1000, device=device)

Golden Rule

All tensors in an operation must be on the same device. x_cpu + y_gpu will raise RuntimeError. Always move data to the same device before computing.

Memory Management

Python
# Monitor GPU memory
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached:    {torch.cuda.memory_reserved() / 1e9:.2f} GB")

# Free cached memory
torch.cuda.empty_cache()

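# Track peak memory for a block of work
# (torch.cuda.memory_stats() returns a dict; it is not a context manager)
torch.cuda.reset_peak_memory_stats()
big = torch.randn(10000, 10000, device="cuda")   # 10^8 float32 values = 0.4 GB
del big
torch.cuda.empty_cache()
print(f"Peak:      {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")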

CPU vs GPU Benchmarking

Python
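import time
import torch

sizes = [100, 500, 1000, 5000]
for n in sizes:
    # CPU
    a_cpu = torch.randn(n, n)
    b_cpu = torch.randn(n, n)
    start = time.perf_counter()
    _ = a_cpu @ b_cpu
    cpu_time = time.perf_counter() - start

    # GPU (requires CUDA)
    a_gpu = a_cpu.cuda()
    b_gpu = b_cpu.cuda()
    _ = a_gpu @ b_gpu              # warm-up: keeps one-time CUDA setup out of the timing
    torch.cuda.synchronize()       # wait for pending kernels before starting the clock
    start = time.perf_counter()
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize()       # kernels run asynchronously; wait before stopping the clock
    gpu_time = time.perf_counter() - start

    print(f"n={n:5d}  CPU: {cpu_time:.4f}s  GPU: {gpu_time:.4f}s  Speedup: {cpu_time/gpu_time:.1f}x")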

Exercises

Ex 2.1: Write device-agnostic code that creates a 2048×2048 identity matrix on the best available device.

Solution

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
I = torch.eye(2048, device=device)
print(I.device, I.shape)

Ex 2.2: Write a memory report function that prints allocated/reserved GPU memory in MB.

Solution

def gpu_memory_report():
    alloc = torch.cuda.memory_allocated() / 1e6
    reserved = torch.cuda.memory_reserved() / 1e6
    print(f"Allocated: {alloc:.1f} MB | Reserved: {reserved:.1f} MB")

Project: GPU Benchmark Suite

Python
import torch, time

def benchmark_ops(device, n=2048, warmup=3, runs=10):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    ops = {
        "MatMul": lambda: a @ b,
        "Add":    lambda: a + b,
        "SVD":    lambda: torch.linalg.svd(a[:512, :512]),
    }
    for name, op in ops.items():
        for _ in range(warmup): op()
        if device.type == "cuda": torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs): op()
        if device.type == "cuda": torch.cuda.synchronize()
        avg = (time.time() - start) / runs
        print(f"[{device}] {name}: {avg*1000:.2f} ms")

benchmark_ops(torch.device("cpu"))
if torch.cuda.is_available():
    benchmark_ops(torch.device("cuda"))

Industry: GPU Fleet Management at Google

Google's TPU/GPU clusters use device-agnostic code patterns similar to PyTorch's .to(device). Teams at DeepMind manage thousands of accelerators, tracking memory utilization per job. Models like Gemini, which Google reports training on TPUs, require multi-node training where memory management directly impacts training cost ($10K+/hour for large clusters).

Why This Matters for AI

A widely cited third-party estimate put the compute cost of training GPT-3 at roughly $4.6M. Understanding GPU memory, CUDA synchronization, and efficient data transfer can be the difference between a 2-day and a 2-week training run. Every production ML engineer needs to master device management.

Key Takeaways

Always write device-agnostic code with torch.device
Use .to(device) to move tensors and models — all operands must share a device
GPU excels at large parallel operations; small tensors can be slower on GPU due to transfer overhead
Monitor memory with torch.cuda.memory_allocated() and free caches when needed

Chapter 3

Autograd & Computational Graphs

Learning Objectives

Understand automatic differentiation and computational graphs
Use requires_grad, backward(), and .grad
Control gradient computation with no_grad() and detach()
Implement gradient accumulation and zeroing

How Autograd Works

PyTorch builds a dynamic computational graph (DAG) as operations execute. Each tensor records which operations created it. Calling .backward() traverses this graph in reverse to compute gradients via the chain rule.

∂L/∂w = (∂L/∂y) · (∂y/∂w) — Chain Rule

Python
import torch

# Simple gradient computation
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x + 1   # y = x² + 2x + 1
y.backward()                  # dy/dx = 2x + 2
print(x.grad)                 # tensor(8.)  — at x=3: 2(3)+2 = 8

# Multi-variable
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(3.0)
y = w * x + b               # y = wx + b
loss = (y - 10.0) ** 2     # MSE loss
loss.backward()
print(f"dL/dw = {w.grad}")  # -18.0
print(f"dL/db = {b.grad}")  # -6.0
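
The multi-variable gradients above can be reproduced by hand. The sketch below applies the chain rule to the same values (w = 2, b = 1, x = 3, target 10) and cross-checks the result with central finite differences; `loss_fn` and `eps` are illustrative names, and no PyTorch is needed.

```python
def loss_fn(w, b, x=3.0, target=10.0):
    y = w * x + b
    return (y - target) ** 2

w, b, x = 2.0, 1.0, 3.0
y = w * x + b                      # 7.0

# Chain rule: dL/dw = dL/dy * dy/dw = 2(y - 10) * x
dL_dw = 2 * (y - 10.0) * x         # -18.0
# Chain rule: dL/db = dL/dy * dy/db = 2(y - 10) * 1
dL_db = 2 * (y - 10.0) * 1.0       # -6.0

# Independent check with central finite differences
eps = 1e-6
num_dw = (loss_fn(w + eps, b) - loss_fn(w - eps, b)) / (2 * eps)
num_db = (loss_fn(w, b + eps) - loss_fn(w, b - eps)) / (2 * eps)

print(dL_dw, num_dw)
print(dL_db, num_db)
```

Both pairs agree up to floating-point rounding, which is the same answer autograd stores in `w.grad` and `b.grad`.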

Gradient Control

Python
# Disable gradients for inference (saves memory + faster)
with torch.no_grad():
    pred = model(x)   # no graph built

# Detach a tensor from the graph
hidden = encoder(x).detach()  # stops gradient flow here
output = decoder(hidden)

# CRITICAL: Zero gradients before each backward pass
optimizer.zero_grad()    # ← without this, gradients accumulate!
loss.backward()
optimizer.step()

# Gradient accumulation (for large effective batch sizes)
accumulation_steps = 4
for i, (x, y) in enumerate(dataloader):
    loss = criterion(model(x), y) / accumulation_steps   # scale so the summed gradient is an average
    loss.backward()                # gradients accumulate
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Common Pitfall

Forgetting optimizer.zero_grad() is one of the most common PyTorch bugs. Gradients accumulate by default, so each backward() call adds to whatever is already stored in .grad. If your loss doesn't decrease, check this first!
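
The accumulation behaviour is easy to see on a single scalar. This is a minimal sketch with an illustrative parameter `w`; it resets the gradient by hand, which is the per-parameter equivalent of what `optimizer.zero_grad()` does for every parameter it manages.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()     # 3.0

(3 * w).backward()        # no zeroing in between
second = w.grad.item()    # 6.0: the new gradient was added to the old one

w.grad.zero_()            # manual reset
(3 * w).backward()
third = w.grad.item()     # 3.0 again

print(first, second, third)
```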

Exercises

Ex 3.1: Compute the gradient of f(x) = sin(x) · e^x at x = π/4 using autograd, and verify against the analytical answer.

Solution

x = torch.tensor(torch.pi / 4, requires_grad=True)
f = torch.sin(x) * torch.exp(x)
f.backward()
print(f"Autograd: {x.grad:.6f}")
# Analytical: d/dx[sin(x)·eˣ] = (sin(x) + cos(x))·eˣ
analytical = (torch.sin(x) + torch.cos(x)) * torch.exp(x)
print(f"Analytical: {analytical:.6f}")

Ex 3.2: Implement a manual linear regression training loop for 5 epochs using autograd (no nn.Module).

Solution

X = torch.randn(100, 1)
y = 3 * X + 2 + torch.randn(100, 1) * 0.1
w = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1
for epoch in range(5):
    pred = X * w + b
    loss = ((pred - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()
    b.grad.zero_()
    print(f"Epoch {epoch}: loss={loss:.4f}, w={w.item():.3f}, b={b.item():.3f}")

Project: Gradient Descent Visualizer

Python
import torch

def gradient_descent_trace(f, x0, lr=0.1, steps=20):
    """Trace gradient descent on a 1D function."""
    x = torch.tensor(x0, requires_grad=True, dtype=torch.float32)
    history = []
    for i in range(steps):
        y = f(x)
        y.backward()
        history.append((x.item(), y.item(), x.grad.item()))
        with torch.no_grad():
            x -= lr * x.grad
        x.grad.zero_()
    return history

# f(x) = (x-3)² + 1  → minimum at x=3
trace = gradient_descent_trace(lambda x: (x - 3)**2 + 1, x0=-2.0)
for x, y, g in trace[:5]:
    print(f"x={x:7.3f}  f(x)={y:7.3f}  grad={g:7.3f}")

Industry: Gradient Checkpointing at Meta

Training a 70B-parameter LLaMA model requires far more memory than a single GPU provides. Meta uses gradient checkpointing: intermediate activations are discarded during the forward pass and recomputed during the backward pass. This trades roughly 30% more compute for roughly 60% less memory (the exact figures depend on the model and on which layers are checkpointed), enabling training of models that otherwise wouldn't fit.
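
PyTorch exposes this technique through `torch.utils.checkpoint`. The sketch below is a toy illustration, not Meta's training code: the small `block` module and the tensor sizes are assumptions chosen so it runs on CPU. It checks that checkpointing changes memory behaviour only, not the gradients.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
x = torch.randn(8, 16, requires_grad=True)

# Regular forward/backward: activations inside `block` are kept for backward.
block(x).sum().backward()
grad_plain = x.grad.clone()
x.grad = None
block.zero_grad()

# Checkpointed: activations inside `block` are dropped after the forward
# pass and recomputed when backward reaches this segment.
checkpoint(block, x, use_reentrant=False).sum().backward()
grad_ckpt = x.grad.clone()

print(torch.allclose(grad_plain, grad_ckpt))
```

In a deep network the usual pattern is to wrap each repeated block (for example each transformer layer) in `checkpoint`, so only block boundaries are stored.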

Key Takeaways

PyTorch builds a dynamic computation graph — rebuilt each forward pass
.backward() computes gradients via reverse-mode auto-differentiation
Always zero_grad() before .backward() — gradients accumulate by default
Use torch.no_grad() for inference and .detach() to stop gradient flow

Chapter 4

Data Loading & Transforms

Learning Objectives

Use Dataset and DataLoader for efficient data pipelines
Apply transforms for data preprocessing and augmentation
Build custom datasets for any data format
Optimize loading with multiprocessing and pinned memory

Dataset & DataLoader

Python
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms

# Built-in dataset with transforms
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
train_data = datasets.CIFAR10('./data', train=True, download=True, transform=transform)
loader = DataLoader(train_data, batch_size=64, shuffle=True, num_workers=4, pin_memory=True)

for images, labels in loader:
    print(images.shape)   # (64, 3, 224, 224)
    break

Custom Dataset

Python
import os, torch
from PIL import Image

class CustomImageDataset(Dataset):
    def __init__(self, img_dir, labels_file, transform=None):
        self.img_dir = img_dir
        self.transform = transform
        # Load labels: {filename: label}
        self.labels = {}
        with open(labels_file) as f:
            for line in f:
                name, label = line.strip().split(",")
                self.labels[name] = int(label)
        self.images = list(self.labels.keys())

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.images[idx])
        image = Image.open(img_path).convert("RGB")
        label = self.labels[self.images[idx]]
        if self.transform:
            image = self.transform(image)
        return image, label

Data Augmentation

Python
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.1),
])

Exercises

Ex 4.1: Create a custom Dataset that generates synthetic regression data (y = 2x + noise) with 1000 samples.

Solution

class SyntheticRegression(Dataset):
    def __init__(self, n=1000):
        self.x = torch.randn(n, 1)
        self.y = 2 * self.x + torch.randn(n, 1) * 0.1
    def __len__(self): return len(self.x)
    def __getitem__(self, i): return self.x[i], self.y[i]

ds = SyntheticRegression()
dl = DataLoader(ds, batch_size=32, shuffle=True)

Project: DataLoader Speed Benchmarker

Python
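import time
import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(10000, 3, 32, 32)
y = torch.randint(0, 10, (10000,))
ds = TensorDataset(X, y)

# On Windows/macOS, run this loop under `if __name__ == "__main__":` (workers are spawned)
for nw in [0, 2, 4, 8]:
    # pin_memory only helps when batches are later copied to a CUDA device
    loader = DataLoader(ds, batch_size=64, num_workers=nw,
                        pin_memory=torch.cuda.is_available())
    start = time.perf_counter()
    for batch in loader:
        pass
    print(f"num_workers={nw}: {time.perf_counter()-start:.3f}s")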

Industry: Data Pipelines at Tesla Autopilot

Tesla processes 1.5 billion miles of driving video data. Their PyTorch data pipeline uses custom IterableDataset subclasses to stream video frames from distributed storage, apply real-time augmentation (rain/fog/night simulation), and feed 8-camera multi-view batches to HydraNet — processing 36 FPS per camera across thousands of GPUs.

Key Takeaways

Implement __len__ and __getitem__ for custom datasets
Use num_workers > 0 and pin_memory=True for faster GPU training
Apply data augmentation only to training data, not validation/test
Use transforms.Compose to chain preprocessing steps

Part II

Building Models

From layers to complete training pipelines

Chapter 5

nn.Module & Model Architecture

Learning Objectives

Build models by subclassing nn.Module
Use Sequential, ModuleList, and ModuleDict
Inspect and manage model parameters
Implement forward pass patterns and skip connections

Your First nn.Module

Python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, in_features, hidden, out_features):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden, out_features)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = SimpleNet(784, 256, 10)
print(model)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

Sequential & ModuleList

Python
# Sequential — simple stack
model = nn.Sequential(
    nn.Linear(784, 512),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# ModuleList — dynamic layers
class DynamicNet(nn.Module):
    def __init__(self, layer_sizes):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(layer_sizes[i], layer_sizes[i+1])
            for i in range(len(layer_sizes) - 1)
        ])

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = torch.relu(layer(x))
        return self.layers[-1](x)

net = DynamicNet([784, 512, 256, 128, 10])

Always Use ModuleList, Not Python List

A plain Python list won't register submodules. model.parameters() won't find weights in a plain list. Always use nn.ModuleList or nn.ModuleDict for dynamic layers.
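
The effect is easy to verify by counting parameters. In this sketch, `PlainListNet`, `ModuleListNet`, and `count_params` are hypothetical names used only for the comparison.

```python
import torch.nn as nn

class PlainListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain list: the layers exist but are NOT registered as submodules
        self.layers = [nn.Linear(4, 4), nn.Linear(4, 2)]

class ModuleListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # ModuleList: each layer is registered and visible to parameters()
        self.layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 2)])

def count_params(module):
    return sum(p.numel() for p in module.parameters())

print(count_params(PlainListNet()))    # 0
print(count_params(ModuleListNet()))   # 30 = (4*4 + 4) + (4*2 + 2)
```

An optimizer built from `PlainListNet().parameters()` would have nothing to update, and `.to(device)` or `state_dict()` would skip those layers as well.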

Exercises

Ex 5.1: Build a ResidualBlock that adds the input to the output (skip connection). Input and output dimensions must match.

Solution

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim)
        )
    def forward(self, x):
        return torch.relu(x + self.block(x))

Project: Model Parameter Inspector

Python
def model_summary(model):
    total = 0
    print(f"{'Layer':<40} {'Shape':<20} {'Params':>10}")
    print("─" * 72)
    for name, param in model.named_parameters():
        n = param.numel()
        total += n
        print(f"{name:<40} {str(list(param.shape)):<20} {n:>10,}")
    print("─" * 72)
    print(f"{'TOTAL':<40} {'':<20} {total:>10,}")
    print(f"Memory: {total * 4 / 1e6:.2f} MB (float32)")

model_summary(SimpleNet(784, 256, 10))

Industry: Modular Architecture at Hugging Face

Hugging Face's Transformers library builds every model (BERT, GPT-2, LLaMA) from composable nn.Module subclasses. Each attention block, feed-forward layer, and normalization layer is a separate module. This modularity lets researchers swap components, such as replacing standard attention with Flash Attention or changing normalization from LayerNorm to RMSNorm, with small, localized changes.

Key Takeaways

Always subclass nn.Module and implement forward()
Use nn.Sequential for simple stacks, ModuleList for dynamic architectures
model.parameters() iterates all learnable weights for optimizers
Skip connections (residual) improve gradient flow in deep networks

Chapter 6

Loss Functions & Optimizers

Learning Objectives

Choose the right loss function for classification vs regression
Understand CrossEntropyLoss, MSELoss, BCELoss, and their variants
Configure optimizers: SGD, Adam, AdamW
Implement learning rate scheduling strategies

Common Loss Functions

Loss	Task	Input	Formula
`nn.MSELoss`	Regression	Any shape	mean((ŷ - y)²)
`nn.CrossEntropyLoss`	Multi-class	Raw logits	-log(softmax(ŷ)[y])
`nn.BCEWithLogitsLoss`	Binary/Multi-label	Raw logits	-[y·log(σ(ŷ)) + (1-y)·log(1-σ(ŷ))]
`nn.L1Loss`	Regression	Any shape	mean(|ŷ - y|)

Python
import torch, torch.nn as nn

# Classification — CrossEntropyLoss expects RAW LOGITS (no softmax!)
criterion = nn.CrossEntropyLoss()
logits = torch.randn(8, 10)       # batch=8, 10 classes
targets = torch.randint(0, 10, (8,))  # integer labels
loss = criterion(logits, targets)

# Regression — MSELoss
criterion = nn.MSELoss()
pred = torch.randn(32, 1)
target = torch.randn(32, 1)
loss = criterion(pred, target)
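
To make the CrossEntropyLoss formula from the table concrete, the sketch below recomputes it by hand from log-softmax and compares it with the built-in loss. The seed and the names `builtin` and `manual` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8, 10)               # raw scores, no softmax applied
targets = torch.randint(0, 10, (8,))

builtin = nn.CrossEntropyLoss()(logits, targets)

log_probs = torch.log_softmax(logits, dim=1)        # (8, 10)
picked = log_probs[torch.arange(8), targets]        # log-probability of the true class
manual = -picked.mean()                             # default reduction is 'mean'

print(torch.allclose(builtin, manual))
```

Because the loss applies log-softmax itself, passing already-softmaxed probabilities would normalize twice and distort the gradients.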

Optimizers & LR Scheduling

Python
import torch.optim as optim

model = nn.Linear(784, 10)

# Adam — most popular default
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# AdamW — decoupled weight decay (better for transformers)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Learning rate scheduling (attach one scheduler per optimizer)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# Alternative: one cycle (warmup, then annealing).
# OneCycleLR must be stepped after every batch, not every epoch.
# scheduler = optim.lr_scheduler.OneCycleLR(
#     optimizer, max_lr=1e-3, total_steps=1000
# )

# Usage in training loop (CosineAnnealingLR, T_max=100 epochs)
for epoch in range(100):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # update LR after each epoch

Exercises

Ex 6.1: Compare Adam vs SGD on a simple classification task. Train for 20 epochs and plot the loss curves.

Solution

model_adam = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 3))
model_sgd = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 3))
model_sgd.load_state_dict(model_adam.state_dict())  # same init

X = torch.randn(200, 20); y = torch.randint(0, 3, (200,))
opt_adam = optim.Adam(model_adam.parameters(), lr=1e-3)
opt_sgd = optim.SGD(model_sgd.parameters(), lr=0.01, momentum=0.9)
crit = nn.CrossEntropyLoss()
for e in range(20):
    for opt, mdl, name in [(opt_adam, model_adam, "Adam"), (opt_sgd, model_sgd, "SGD")]:
        opt.zero_grad()
        loss = crit(mdl(X), y)
        loss.backward()
        opt.step()
    if e % 5 == 0:
        print(f"Epoch {e}: Adam={crit(model_adam(X),y):.3f} SGD={crit(model_sgd(X),y):.3f}")

Project: LR Finder

Python
def lr_finder(model, loader, criterion, min_lr=1e-7, max_lr=1, steps=100):
    optimizer = optim.SGD(model.parameters(), lr=min_lr)
    lr_mult = (max_lr / min_lr) ** (1 / steps)
    lrs, losses = [], []
    for i, (x, y) in enumerate(loader):
        if i >= steps: break
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(optimizer.param_groups[0]['lr'])
        losses.append(loss.item())
        optimizer.param_groups[0]['lr'] *= lr_mult
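    # The loss minimum sits close to where training starts to diverge;
    # a common heuristic is to pick a rate somewhat below it.
    print(f"LR at minimum loss ≈ {lrs[losses.index(min(losses))]:.2e}")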
    return lrs, losses

Industry: Optimizer Choice at OpenAI

The GPT-3 paper reports training with Adam, cosine learning rate decay, and a linear warmup over the first 375 million tokens. The warmup phase gradually increases the LR from 0 to its peak, preventing early instability. A weight decay of 0.1 acts as regularization. Variants of this recipe, typically with AdamW, are now a common default for large language models.
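
A linear warmup followed by cosine decay is a short formula. The function below is a sketch under assumed settings: `lr_at` is a hypothetical helper, and `peak_lr`, `warmup_steps`, and `total_steps` are illustrative values, not the configuration of any published model.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps              # linear ramp from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    progress = min(progress, 1.0)                         # clamp past the end
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 down to 0
    return min_lr + (peak_lr - min_lr) * cosine

for step in (0, 1000, 2000, 51_000, 100_000):
    print(step, lr_at(step))
```

One way to use such a function in PyTorch is through `optim.lr_scheduler.LambdaLR`, which expects a multiplier on the optimizer's base rate, so the lambda would return `lr_at(step) / peak_lr`.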

Key Takeaways

CrossEntropyLoss expects raw logits — it applies softmax internally
AdamW with cosine scheduling is the go-to for transformers
Learning rate is the most important hyperparameter — use LR finders
Weight decay ≠ L2 regularization in Adam (use AdamW for proper decoupling)

Chapter 7

Training Loops & Validation

Learning Objectives

Write a complete training loop with validation
Implement early stopping and model checkpointing
Track and log metrics (loss, accuracy, F1)
Handle model.train() vs model.eval() modes

The Complete Training Loop

Python
import torch, torch.nn as nn
from torch.utils.data import DataLoader

def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()   # enable dropout, batchnorm training mode
    total_loss, correct, total = 0, 0, 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(X)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * X.size(0)
        correct += (logits.argmax(1) == y).sum().item()
        total += X.size(0)
    return total_loss / total, correct / total

@torch.no_grad()
def evaluate(model, loader, criterion, device):
    model.eval()    # disable dropout, batchnorm eval mode
    total_loss, correct, total = 0, 0, 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        logits = model(X)
        loss = criterion(logits, y)
        total_loss += loss.item() * X.size(0)
        correct += (logits.argmax(1) == y).sum().item()
        total += X.size(0)
    return total_loss / total, correct / total
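
The reason both functions set the mode explicitly is that some layers behave differently in each mode. A minimal sketch with a standalone dropout layer (sizes and seed are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

layer.train()
train_out = layer(x)       # roughly half the entries zeroed, survivors scaled by 1/(1-p) = 2

layer.eval()
eval_out = layer(x)        # dropout disabled: output equals input

print(train_out.unique())
print(torch.equal(eval_out, x))
```

Evaluating while still in training mode therefore gives noisy, understated metrics, and BatchNorm layers would keep updating their running statistics on validation data.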

Early Stopping & Checkpointing

Python
best_val_loss = float('inf')
patience, patience_counter = 5, 0

for epoch in range(100):
    train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
    val_loss, val_acc = evaluate(model, val_loader, criterion, device)

    print(f"Epoch {epoch+1:3d} │ Train Loss: {train_loss:.4f} Acc: {train_acc:.3f} │ Val Loss: {val_loss:.4f} Acc: {val_acc:.3f}")

    # Checkpoint best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'val_loss': val_loss,
        }, 'best_model.pth')
        print(f"  ✓ Saved best model (val_loss={val_loss:.4f})")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"  ⚠ Early stopping at epoch {epoch+1}")
            break

# Load best model for inference
checkpoint = torch.load('best_model.pth', map_location=device)  # load onto the current device
model.load_state_dict(checkpoint['model_state_dict'])
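
The patience logic from the loop above can also be factored into a small reusable object. `EarlyStopping` here is a hypothetical helper, not a PyTorch class, and the validation losses are made-up numbers that exercise both branches.

```python
class EarlyStopping:
    """Tracks the best validation loss and counts epochs without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Return (improved, should_stop) for this epoch's validation loss."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
for epoch, val_loss in enumerate([0.90, 0.70, 0.72, 0.71, 0.69]):
    improved, stop = stopper.step(val_loss)
    print(epoch, val_loss, improved, stop)
    if stop:
        break
```

With a patience of 2 the loop stops at epoch 3 and never sees the later 0.69, which illustrates the trade-off: a small patience saves compute but can stop before a late improvement. In a training loop, `improved` would be the cue to save a checkpoint.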

Exercises

Ex 7.1: Add a progress bar using tqdm to the training loop. Display loss and accuracy per batch.

Solution

from tqdm import tqdm
pbar = tqdm(loader, desc=f"Epoch {epoch}")   # keep a handle to update the bar
for X, y in pbar:
    # ... training code (computes loss and updates correct, total) ...
    pbar.set_postfix(loss=loss.item(), acc=correct/total)

Project: MNIST Classifier — End to End

Python
import torch, torch.nn as nn, torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,),(0.3081,))])
train_ds = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_ds = datasets.MNIST('./data', train=False, transform=transform)
train_loader = DataLoader(train_ds, 128, shuffle=True)
test_loader = DataLoader(test_ds, 256)

model = nn.Sequential(nn.Flatten(), nn.Linear(784,256), nn.ReLU(), nn.Dropout(0.2),
                       nn.Linear(256,10)).to(device)
optimizer = optim.Adam(model.parameters(), 1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        criterion(model(X), y).backward()
        optimizer.step()
    _, acc = evaluate(model, test_loader, criterion, device)
    print(f"Epoch {epoch+1}: Test Acc = {acc:.3f}")

Industry: Training Infrastructure at Anthropic

Anthropic's Claude training loop includes gradient norm monitoring, loss spike detection, and automatic checkpoint recovery. When a loss spike occurs (NaN or sudden increase), the system automatically rolls back to the last stable checkpoint and reduces the learning rate — preventing days of wasted GPU time on 2048+ GPU clusters.

Key Takeaways

Always switch between model.train() and model.eval()
Use @torch.no_grad() decorator for evaluation functions
Save both model and optimizer state for resumable training
Early stopping prevents overfitting — monitor validation loss

Chapter 8

CNNs in PyTorch

Learning Objectives

Build CNNs with Conv2d, MaxPool2d, and BatchNorm2d
Calculate output dimensions for conv/pool layers
Implement LeNet and ResNet architectures
Apply transfer learning with pretrained models

Convolution Basics

Output Size = ⌊(Input + 2·Padding - Kernel) / Stride⌋ + 1

Python
import torch.nn as nn

# Basic CNN
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),   # (3,32,32) → (32,32,32)
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),                    # → (32,16,16)
            nn.Conv2d(32, 64, 3, padding=1),  # → (64,16,16)
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),                    # → (64,8,8)
            nn.Conv2d(64, 128, 3, padding=1), # → (128,8,8)
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),             # → (128,1,1)
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)   # (B, 128)
        return self.classifier(x)
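
The shape comments in SimpleCNN follow directly from the output-size formula. The sketch below walks a 32×32 input through the same layers; `conv_out` is a hypothetical helper that assumes dilation 1.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (dilation 1 assumed)."""
    return (size + 2 * padding - kernel) // stride + 1

s = 32
s = conv_out(s, kernel=3, padding=1)   # Conv2d(3, 32, 3, padding=1)    -> 32
s = conv_out(s, kernel=2, stride=2)    # MaxPool2d(2)                   -> 16
s = conv_out(s, kernel=3, padding=1)   # Conv2d(32, 64, 3, padding=1)   -> 16
s = conv_out(s, kernel=2, stride=2)    # MaxPool2d(2)                   -> 8
s = conv_out(s, kernel=3, padding=1)   # Conv2d(64, 128, 3, padding=1)  -> 8
print(s)                               # 8
```

A 3×3 kernel with padding 1 and stride 1 preserves the spatial size, so only the pooling layers shrink the feature map here.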

Transfer Learning

Python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Load pretrained ResNet18
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all layers
for param in model.parameters():
    param.requires_grad = False

# Replace classification head
model.fc = nn.Linear(model.fc.in_features, 5)  # 5 classes

# Only train the new head
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
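
It is worth checking which parameters are actually trainable after freezing. The sketch below uses a tiny stand-in network instead of a downloaded ResNet (an assumption made so it runs offline); the freeze-then-replace pattern is the same.

```python
import torch.nn as nn

# Stand-in for a pretrained backbone plus its original head (hypothetical sizes)
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Linear(64, 10),            # original head
)

for p in model.parameters():
    p.requires_grad = False       # freeze everything

model[2] = nn.Linear(64, 5)       # new head: fresh parameters require grad by default

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
frozen = [n for n, p in model.named_parameters() if not p.requires_grad]
print(trainable)   # ['2.weight', '2.bias']
print(frozen)      # ['0.weight', '0.bias']
```

The order matters: replacing the head before the freezing loop would freeze the new layer too, leaving nothing to train.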

Exercises

Ex 8.1: Calculate the output shape of Conv2d(3, 64, kernel_size=7, stride=2, padding=3) given input (1, 3, 224, 224).

Solution

# Output = ⌊(224 + 2·3 - 7) / 2⌋ + 1 = ⌊223/2⌋ + 1 = 112
# Shape: (1, 64, 112, 112)
conv = nn.Conv2d(3, 64, 7, stride=2, padding=3)
x = torch.randn(1, 3, 224, 224)
print(conv(x).shape)  # torch.Size([1, 64, 112, 112])

Project: CIFAR-10 CNN with 90%+ Accuracy

Python
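# Full training pipeline: transfer learning on CIFAR-10
import torch, torch.nn as nn, torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)   # 512 input features for ResNet18
model = model.to(device)

transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225])
])
train_ds = datasets.CIFAR10('./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_ds, 64, shuffle=True, num_workers=2)

optimizer = optim.Adam(model.parameters(), 1e-4)
criterion = nn.CrossEntropyLoss()
# Train for 5 epochs with the Chapter 7 loop; test accuracy above 90% is typical, though results vary by run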

Industry: CNNs in Autonomous Driving

Waymo's perception stack uses multi-scale CNNs (Feature Pyramid Networks) processing 5 LiDAR and 29 camera feeds simultaneously. Each camera frame passes through a backbone CNN (EfficientNet variant), producing feature maps at 1/4, 1/8, 1/16, 1/32 resolution — enabling detection of both pedestrians (near, large) and traffic signs (far, small) in the same forward pass.

Key Takeaways

Conv2d extracts spatial features; MaxPool2d reduces spatial dimensions
BatchNorm2d stabilizes training and allows higher learning rates
AdaptiveAvgPool2d(1) handles any input size before the classifier
Transfer learning: freeze backbone, replace head, fine-tune — fastest path to good results

Part III

Advanced PyTorch

Production-grade techniques for serious models

Chapter 9

Recurrent Networks

Learning Objectives

Understand RNN, LSTM, and GRU architectures in PyTorch
Process variable-length sequences with packing/padding
Build a text generation model with character-level LSTM
Handle hidden state initialization and management

LSTM in PyTorch

Python
import torch, torch.nn as nn

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                           batch_first=True, dropout=0.3, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)  # *2 for bidirectional

    def forward(self, x):
        emb = self.embedding(x)            # (B, T) → (B, T, E)
        output, (h_n, c_n) = self.lstm(emb) # output: (B, T, 2H)
        # Use last hidden states from both directions
        hidden = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (B, 2H)
        return self.fc(hidden)

model = TextClassifier(vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=5)
x = torch.randint(0, 10000, (8, 50))  # batch=8, seq_len=50
out = model(x)  # (8, 5)
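
The classifier above assumes every sequence in the batch has the same length. For variable-length sequences, one common approach is to pad them to a common length and pack them so the LSTM skips the padding. The following is a minimal sketch with made-up sequence lengths and feature sizes, not part of the `TextClassifier` above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
# Three sequences of different lengths, each step with 8 features
seqs = [torch.randn(3, 8), torch.randn(5, 8), torch.randn(2, 8)]
lengths = torch.tensor([len(s) for s in seqs])        # must live on the CPU

padded = pad_sequence(seqs, batch_first=True)         # (3, 5, 8), zero-padded
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)   # no need to pre-sort by length

lstm = nn.LSTM(8, 16, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)  # (3, 5, 16)

# h_n holds the state at each sequence's true last step, not at the padded end
last_valid = out[torch.arange(len(seqs)), lengths - 1]
print(torch.allclose(h_n[-1], last_valid, atol=1e-6))
```

With packing, `h_n` can be fed to the classifier head directly; without it, the final hidden state of shorter sequences would have been computed over padding tokens.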

Exercises

Ex 9.1: Build a GRU-based model for time-series forecasting (predict the next value from a sequence of past values).

Solution

class TimeSeriesGRU(nn.Module):
    def __init__(self, input_dim=1, hidden=64):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)
    def forward(self, x):
        out, _ = self.gru(x)
        return self.fc(out[:, -1, :])  # last timestep

Project: Character-Level Text Generator

Python
class CharRNN(nn.Module):
    def __init__(self, vocab_size, embed=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, vocab_size)
    def forward(self, x, hidden=None):
        emb = self.embed(x)
        out, hidden = self.lstm(emb, hidden)
        return self.fc(out), hidden

@torch.no_grad()  # sampling needs no gradient tracking
def generate(model, start_char, char2idx, idx2char, length=200, temperature=0.8):
    model.eval()
    x = torch.tensor([[char2idx[start_char]]])
    hidden = None
    result = [start_char]
    for _ in range(length):
        logits, hidden = model(x, hidden)
        # temperature < 1 sharpens the distribution; > 1 flattens it
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        idx = torch.multinomial(probs, 1).item()
        result.append(idx2char[idx])
        x = torch.tensor([[idx]])
    return ''.join(result)
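
Dividing the logits by a constant before the softmax in `generate` is temperature scaling. The sketch below, using made-up logits, shows how the temperature changes the sampling distribution: lower values concentrate probability on the most likely character, higher values spread it out.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])   # illustrative scores for three characters

top_prob = {}
for temperature in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    top_prob[temperature] = probs.max().item()
    print(temperature, [round(p, 3) for p in probs.tolist()])
```

Low temperatures tend to give repetitive but coherent text; high temperatures give more varied but noisier text.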

Industry: LSTMs in Finance

Goldman Sachs and Two Sigma use LSTM networks for time-series forecasting of market volatility. Their models process sequences of 60+ features (price, volume, order flow, sentiment) across 500+ timesteps, predicting intraday price movements with sub-millisecond inference latency using ONNX-optimized PyTorch models.

Key Takeaways

LSTM handles long-range dependencies better than a vanilla RNN (its gated, additive cell-state update mitigates vanishing gradients)
Use batch_first=True for intuitive (B, T, F) tensor layouts
Bidirectional doubles the output feature size (2 × hidden_dim); concatenate the final forward and backward states
For most NLP tasks, Transformers have replaced RNNs — but RNNs remain strong for streaming/real-time data

Chapter 10

Custom Layers & Hooks

Learning Objectives

Write custom nn.Module layers with learnable parameters
Implement custom autograd functions with torch.autograd.Function
Use forward and backward hooks for debugging and visualization
Inspect gradients and activations at any layer

Custom Layer with Parameters

Python
class ScaleShift(nn.Module):
    """Learnable scale and shift: y = gamma * x + beta"""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.gamma * x + self.beta
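
When an operation needs a hand-written backward pass, `torch.autograd.Function` lets you define both directions explicitly. The sketch below uses a deliberately simple operation (squaring) so the gradient is easy to verify; the class name `Square` is just an illustration.

```python
import torch

class Square(torch.autograd.Function):
    """y = x^2 with a hand-written gradient dy/dx = 2x."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # keep x for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x      # chain rule: upstream gradient × local gradient

x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
Square.apply(x).sum().backward()
print(x.grad)  # 2 * x → [2., -4., 6.]

# gradcheck compares the analytic backward against finite differences (needs float64)
x64 = torch.randn(4, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(Square.apply, (x64,))
```

Custom functions are called through `.apply`, never by instantiating the class.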

Forward Hooks for Activation Inspection

Python
activations = {}

def save_activation(name):
    def hook(module, input, output):
        activations[name] = output.detach()
    return hook

model = SimpleCNN()
model.features[0].register_forward_hook(save_activation('conv1'))
model.features[4].register_forward_hook(save_activation('conv2'))

# Forward pass — hooks capture activations
_ = model(torch.randn(1, 3, 32, 32))
print(activations['conv1'].shape)  # (1, 32, 32, 32)
print(activations['conv2'].shape)  # (1, 64, 16, 16)

Gradient Hooks

Python
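gradients = {}

def save_gradient(name):
    def hook(module, grad_input, grad_output):
        # grad_output[0]: gradient of the loss w.r.t. this module's output
        gradients[name] = grad_output[0].detach()
    return hook

# register_backward_hook is deprecated; use the "full" variant instead
model.features[0].register_full_backward_hook(save_gradient('conv1_grad'))

criterion = nn.CrossEntropyLoss()
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))  # dummy batch
loss = criterion(model(x), y)
loss.backward()
print(f"Conv1 gradient norm: {gradients['conv1_grad'].norm():.4f}")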

Exercises

Ex 10.1: Implement a custom Swish activation function as an nn.Module: swish(x) = x · σ(x).

Solution

class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)

Project: Gradient Flow Debugger

Python
def plot_gradient_flow(model):
    """Print gradient statistics for each layer."""
    print(f"{'Layer':<40} {'Mean Grad':>12} {'Max Grad':>12}")
    print("─" * 66)
    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name:<40} {param.grad.abs().mean():>12.6f} {param.grad.abs().max():>12.6f}")

# Use after loss.backward()
loss.backward()
plot_gradient_flow(model)

Industry: Model Interpretability at Google Health

Google's medical AI team uses gradient hooks extensively for Grad-CAM visualizations, highlighting which regions of X-ray images the model focuses on when diagnosing pneumonia or cancer. Interpretability of this kind supports clinical review and regulatory scrutiny ahead of deployment. Hooks capture the activations and gradients at the last convolutional layer to generate attention heatmaps overlaid on the original image.
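
The Grad-CAM procedure described above can be sketched with one forward hook and one tensor hook. This is a minimal illustration on a tiny, untrained network with random input; the architecture and the `store` dictionary are assumptions made for the example, not the setup of any production system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),     # net[2] is the last conv layer
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 4),
).eval()

store = {}

def save_grad(grad):
    store["grad"] = grad                            # d(score) / d(activation)

def forward_hook(module, inputs, output):
    store["act"] = output                           # feature maps, (1, 16, H, W)
    output.register_hook(save_grad)                 # fires during backward

handle = net[2].register_forward_hook(forward_hook)
x = torch.randn(1, 3, 32, 32)
logits = net(x)
cls = logits.argmax(dim=1).item()
logits[0, cls].backward()                           # gradient of the predicted class score
handle.remove()

# Channel weights = spatially averaged gradients; heatmap = ReLU of the weighted sum
weights = store["grad"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * store["act"]).sum(dim=1)).detach()   # (1, 32, 32)
cam = cam / (cam.max() + 1e-8)                      # scale to [0, 1] for display
print(cam.shape)
```

In practice the heatmap is upsampled to the input resolution and blended with the original image.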

Key Takeaways

Use nn.Parameter to create learnable parameters in custom modules
Forward hooks capture activations; backward hooks capture gradients
Hooks are essential for debugging vanishing/exploding gradients
Always .detach() tensors saved in hooks to prevent memory leaks

Chapter 11

Distributed Training & Mixed Precision

Learning Objectives

Scale training with DataParallel and DistributedDataParallel
Use Automatic Mixed Precision (AMP) for faster training
Implement gradient accumulation for large effective batch sizes
Understand multi-GPU communication patterns

DataParallel (Simple Multi-GPU)

Python
# Easiest way — wrap model in DataParallel
model = SimpleCNN().cuda()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # splits batch across GPUs

Mixed Precision Training (AMP)

Python
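# Note: recent PyTorch releases deprecate torch.cuda.amp in favor of
# torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda")
from torch.cuda.amp import autocast, GradScaler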

scaler = GradScaler()

for X, y in train_loader:
    X, y = X.cuda(), y.cuda()
    optimizer.zero_grad()

    # Forward pass under autocast: eligible ops (matmul, conv) run in FP16
    with autocast():
        logits = model(X)
        loss = criterion(logits, y)

    # Backward pass with gradient scaling
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

AMP Speedup

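Mixed precision typically gives a 1.5–3× speedup on modern GPUs (A100, H100), usually with negligible accuracy loss. Under autocast, matrix multiplications and convolutions run in FP16, while the model weights, optimizer state, and numerically sensitive ops (such as softmax and reductions) stay in FP32; GradScaler scales the loss so that small gradients do not underflow in FP16. Use it by default unless you have a specific reason not to.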
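
To see why the loss is scaled before the backward pass, consider a gradient value too small for FP16. The sketch below uses an illustrative value of 1e-8 and a scale of 2**16 (the default initial scale of `GradScaler`); it runs on the CPU and only demonstrates the number formats, not a training step.

```python
import torch

tiny_grad = torch.tensor(1e-8)     # below FP16's smallest representable magnitude (~6e-8)
scale = 65536.0                    # 2**16

unscaled_fp16 = tiny_grad.half()                 # underflows to zero
scaled_fp16 = (tiny_grad * scale).half()         # large enough to survive in FP16
recovered = scaled_fp16.float() / scale          # unscale again in FP32

print(unscaled_fp16.item(), scaled_fp16.item(), recovered.item())
```

`scaler.step(optimizer)` performs the unscaling for you and skips the update when the scaled gradients contain infinities or NaNs.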

Gradient Accumulation

Python
# Simulate batch_size=256 with actual batch_size=64
accumulation_steps = 4
for i, (X, y) in enumerate(train_loader):
    X, y = X.to(device), y.to(device)
    with autocast():
        loss = criterion(model(X), y) / accumulation_steps
    scaler.scale(loss).backward()

    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
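
Dividing each micro-batch loss by `accumulation_steps` is what makes the accumulated gradient match the gradient of one large batch. The sketch below checks this on a small linear model without AMP; the model and data are made up for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
X, y = torch.randn(256, 10), torch.randn(256, 1)

# Reference: one backward pass over the full batch of 256
model.zero_grad()
criterion(model(X), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 64, each loss divided by 4
accumulation_steps = 4
model.zero_grad()
for X_mb, y_mb in zip(X.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = criterion(model(X_mb), y_mb) / accumulation_steps
    loss.backward()                       # gradients add up in .grad

accum_grad = model.weight.grad.clone()
print(torch.allclose(full_grad, accum_grad, atol=1e-5))
```

The equivalence holds for losses averaged over equally sized micro-batches; layers that use batch statistics, such as BatchNorm, still see only the micro-batch.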

Exercises

Ex 11.1: Measure training speed (samples/sec) with and without AMP on your GPU.

Solution

import time
for use_amp in [False, True]:
    scaler = GradScaler(enabled=use_amp)
    torch.cuda.synchronize()                 # finish pending GPU work before timing
    start = time.perf_counter()
    for X, y in train_loader:
        X, y = X.cuda(), y.cuda()
        optimizer.zero_grad()
        with autocast(enabled=use_amp):
            loss = criterion(model(X), y)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    torch.cuda.synchronize()                 # wait for queued kernels to finish
    elapsed = time.perf_counter() - start
    print(f"AMP={use_amp}: {len(train_loader.dataset)/elapsed:.0f} samples/sec")

Project: Multi-GPU Training Script

Python
# Complete DDP training script (launch with: torchrun --nproc_per_node=2 train.py)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK as environment variables
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

def train():
    local_rank = setup()
    model = SimpleCNN().to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    # ... training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    train()

Industry: Training LLaMA at Meta

Meta trained LLaMA-70B on 2048 A100 GPUs using FSDP (Fully Sharded Data Parallel) — PyTorch's evolution of DDP. FSDP shards model parameters, gradients, and optimizer states across GPUs, enabling models that don't fit on a single GPU. Combined with AMP and gradient checkpointing, they achieved 95% GPU utilization across the cluster.

Key Takeaways

DataParallel is simple but slow; DistributedDataParallel is production-grade
AMP gives free 1.5-3× speedup — use it always
Gradient accumulation simulates larger batch sizes without more memory
Launch DDP with torchrun — it handles rank/world_size automatically

Chapter 12

TorchScript & ONNX Export

Learning Objectives

Convert models to TorchScript via tracing and scripting
Export models to ONNX format for cross-platform deployment
Optimize models for inference (quantization, pruning)
Deploy to mobile with PyTorch Mobile / ExecuTorch

TorchScript

Python
# Method 1: Tracing (for models without control flow)
model = SimpleCNN().eval()
example = torch.randn(1, 3, 32, 32)
traced = torch.jit.trace(model, example)
traced.save("model_traced.pt")

# Method 2: Scripting (handles if/for/while)
@torch.jit.script
def relu_threshold(x: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    if x.mean() > threshold:
        return torch.relu(x)
    return x

# Load the saved module (no Python class definition needed; also loadable from C++ via libtorch)
loaded = torch.jit.load("model_traced.pt")
output = loaded(example)
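
Tracing records only the operations executed for the example input, so a data-dependent branch is frozen into the graph. The sketch below shows this pitfall with a hypothetical function `branchy`, comparing the traced version against plain eager execution. Expect a `TracerWarning` when it runs.

```python
import torch

def branchy(x):
    # Data-dependent control flow: the branch taken depends on the input values
    if x.sum() > 0:
        return x * 2
    return x + 10

# The example input is positive, so only the `x * 2` branch is recorded
traced = torch.jit.trace(branchy, torch.ones(3))

neg = -torch.ones(3)
eager_out = branchy(neg)    # takes the `x + 10` branch
traced_out = traced(neg)    # silently replays `x * 2`

print(eager_out, traced_out)
```

Scripting, as in `relu_threshold` above, compiles the `if` statement itself and therefore keeps both branches.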

ONNX Export

Python
import torch.onnx

model = SimpleCNN().eval()
dummy = torch.randn(1, 3, 32, 32)

torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}},  # variable batch size
    opset_version=17
)

# Run with ONNX Runtime (often 10-40% faster than eager PyTorch; benchmark on your own model)
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
result = session.run(None, {"image": dummy.numpy()})

Quantization

Python
# Dynamic quantization (easiest, good for LSTMs/Transformers)
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.LSTM}, dtype=torch.qint8
)
# Only the listed layer types are quantized: their weights shrink ~4× (FP32 → INT8)
# and CPU inference is often ~2× faster, depending on the model
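
The roughly 4× size reduction comes from storing each weight in 1 byte instead of 4. The sketch below shows the arithmetic with a simple symmetric per-tensor scheme written by hand; this is an assumption chosen for clarity and is not necessarily the exact scheme `quantize_dynamic` applies internally.

```python
import torch

torch.manual_seed(0)
w = torch.randn(256, 256)                       # FP32 weights, 4 bytes each

# Symmetric quantization: map [-max|w|, +max|w|] onto the integers [-127, 127]
scale = w.abs().max() / 127
q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)   # 1 byte each

# Dequantize to see how much precision was lost
w_hat = q.float() * scale
max_err = (w - w_hat).abs().max()

print(w.element_size(), q.element_size())       # bytes per element: 4 vs 1
print(f"max abs error {max_err:.5f}, half a step {scale / 2:.5f}")
```

The rounding error per weight is bounded by about half a quantization step, which is why accuracy usually drops only slightly.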

Exercises

Ex 12.1: Export a model to ONNX and compare inference speed between PyTorch and ONNX Runtime.

Solution

import time, onnxruntime as ort
model.eval()
dummy = torch.randn(1, 3, 32, 32)
dummy_np = dummy.numpy()
# PyTorch (autograd disabled for a fair inference comparison)
start = time.perf_counter()
with torch.no_grad():
    for _ in range(1000):
        model(dummy)
pt_time = time.perf_counter() - start
# ONNX Runtime
sess = ort.InferenceSession("model.onnx")
start = time.perf_counter()
for _ in range(1000):
    sess.run(None, {"image": dummy_np})
ort_time = time.perf_counter() - start
print(f"PyTorch: {pt_time:.2f}s | ONNX: {ort_time:.2f}s | Speedup: {pt_time/ort_time:.1f}x")

Project: Full Model Export Pipeline

Python
def export_pipeline(model, example_input, name="model"):
    model.eval()
    # 1. TorchScript
    traced = torch.jit.trace(model, example_input)
    traced.save(f"{name}.pt")
    # 2. ONNX
    torch.onnx.export(model, example_input, f"{name}.onnx", opset_version=17)
    # 3. Quantized
    q_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, torch.qint8)
    traced_q = torch.jit.trace(q_model, example_input)
    traced_q.save(f"{name}_quantized.pt")
    print(f"Exported: {name}.pt, {name}.onnx, {name}_quantized.pt")

Industry: Edge Deployment at Apple

Apple deploys PyTorch models to billions of iPhones via CoreML conversion. Models go through: PyTorch → ONNX → CoreML → on-device Neural Engine. Face ID's depth estimation model runs at 30 FPS entirely on the iPhone's Neural Engine. The PyTorch→ONNX export step is the critical bridge enabling this pipeline.

Key Takeaways

TorchScript removes Python dependency — use trace for simple models, script for control flow
ONNX is the universal exchange format — enables deployment anywhere
Dynamic quantization gives up to ~4× size reduction for the quantized layers, usually with minimal accuracy loss
Always benchmark exported models against PyTorch baseline

Part IV

Industry Applications

Real-world problems solved with PyTorch

Chapter 13

Industry Problem: Medical Image Classification

Learning Objectives

Build an end-to-end medical image classification pipeline
Handle class imbalance with weighted sampling and focal loss
Implement proper medical AI evaluation (sensitivity, specificity, AUC)
Deploy with confidence calibration and uncertainty estimation

Problem Statement

Build a chest X-ray classifier that detects pneumonia from X-ray images. The dataset has ~5,000 images with severe class imbalance (3:1 pneumonia to normal). This mirrors a real clinical deployment scenario.

Data Pipeline

Python
from torchvision import transforms, datasets
from torch.utils.data import DataLoader, WeightedRandomSampler
import numpy as np

# Medical-specific augmentation (no vertical flip: anatomy matters!)
train_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # ImageFolder loads RGB; the model expects 1 channel
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
    transforms.Normalize([0.485], [0.229]),
])

# Weighted sampler for class imbalance
train_data = datasets.ImageFolder('data/train', transform=train_transform)
targets = np.array(train_data.targets)    # labels, read without loading any images
class_counts = np.bincount(targets)
weights = 1.0 / class_counts              # rarer class → larger weight
sample_weights = weights[targets]
sampler = WeightedRandomSampler(sample_weights.tolist(), num_samples=len(train_data))
train_loader = DataLoader(train_data, batch_size=32, sampler=sampler)

Model Architecture

Python
import torch.nn as nn
from torchvision import models

class ChestXRayModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Replace the first conv for grayscale input (1 channel instead of 3).
        # The new layer is randomly initialized, so its pretrained weights are lost.
        self.backbone.conv1 = nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False)
        self.backbone.fc = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, 2)  # Normal vs Pneumonia
        )

    def forward(self, x):
        return self.backbone(x)

Medical-Grade Evaluation

Python
import torch
from sklearn.metrics import classification_report, roc_auc_score

@torch.no_grad()
def medical_evaluate(model, loader, device):
    model.eval()
    all_preds, all_labels, all_probs = [], [], []
    for X, y in loader:
        X = X.to(device)
        logits = model(X)
        probs = torch.softmax(logits, dim=1)[:, 1]  # P(pneumonia)
        preds = (probs > 0.5).long()
        all_preds.extend(preds.cpu().tolist())
        all_labels.extend(y.tolist())
        all_probs.extend(probs.cpu().tolist())

    print(classification_report(all_labels, all_preds,
          target_names=['Normal', 'Pneumonia']))
    print(f"AUC-ROC: {roc_auc_score(all_labels, all_probs):.4f}")

Clinical Deployment Considerations

In medical AI, sensitivity (recall for the positive class) is more important than accuracy. Missing a pneumonia case (false negative) is far worse than a false alarm. Adjust the classification threshold to achieve ≥95% sensitivity, even at the cost of lower specificity.
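
One way to pick such a threshold is to read it off the predicted probabilities of the positive cases in a validation set. The sketch below uses only NumPy and synthetic scores; the helper name `threshold_for_sensitivity` and the data are illustrative assumptions, and in practice the inputs would be `all_labels` and `all_probs` collected by `medical_evaluate`.

```python
import numpy as np

def threshold_for_sensitivity(labels, probs, target=0.95):
    """Return a threshold whose sensitivity on this data is at least `target`."""
    labels, probs = np.asarray(labels), np.asarray(probs)
    pos_probs = np.sort(probs[labels == 1])
    # Number of positives we may miss while keeping sensitivity >= target
    n_miss = int(np.floor((1 - target) * len(pos_probs)))
    threshold = pos_probs[n_miss]
    preds = probs >= threshold
    sensitivity = preds[labels == 1].mean()        # true positive rate
    specificity = (~preds[labels == 0]).mean()     # true negative rate
    return threshold, sensitivity, specificity

# Synthetic validation scores, for illustration only
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
probs = np.clip(0.35 + 0.3 * labels + rng.normal(0, 0.15, size=1000), 0, 1)

thr, sens, spec = threshold_for_sensitivity(labels, probs)
print(f"threshold={thr:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

The threshold should be chosen on validation data and then confirmed on a held-out test set, since it is itself a fitted parameter.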

Exercises

Ex 13.1: Implement Focal Loss to handle class imbalance: FL(pₜ) = -αₜ(1-pₜ)^γ log(pₜ).

Solution

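class FocalLoss(nn.Module):
    """Focal loss for two-class logits; alpha weights label 1, (1 - alpha) weights label 0."""
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha, self.gamma = alpha, gamma
    def forward(self, logits, targets):
        ce = nn.functional.cross_entropy(logits, targets, reduction='none')  # -log(p_t)
        pt = torch.exp(-ce)                                                   # p_t
        t = targets.float()
        alpha_t = self.alpha * t + (1 - self.alpha) * (1 - t)                 # per-sample α_t
        return (alpha_t * (1 - pt) ** self.gamma * ce).mean()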

Project: Full Medical AI Pipeline

Combine everything: data loading with class balancing → ResNet50 transfer learning → training with focal loss → evaluation with AUC-ROC and sensitivity/specificity → export to ONNX for hospital deployment. Target: ≥95% sensitivity, ≥85% specificity, AUC > 0.95.

Industry: Google Health's Chest X-Ray AI

Google Health deployed a chest X-ray AI system in hospitals across India and Thailand. Key engineering decisions: (1) Model ensembling — 3 ResNet variants averaged for robustness; (2) Calibration — temperature scaling ensures predicted probabilities match actual disease rates; (3) Human-in-the-loop — AI flags suspicious cases for radiologist review, reducing report turnaround from 11 days to 3 hours. The PyTorch models run on NVIDIA T4 GPUs in Google Cloud with ONNX Runtime serving.

Key Takeaways

Medical AI requires domain-specific augmentation — no arbitrary flips/rotations
Use weighted sampling or focal loss for class imbalance
Evaluate with sensitivity/specificity/AUC — not just accuracy
Clinical deployment requires calibration, uncertainty, and human oversight

Chapter 14

Industry Problem: Real-Time Object Detection

Learning Objectives

Understand YOLO-style single-shot detection architecture
Implement inference pipeline for video streams
Optimize for edge deployment (quantization, TensorRT)
Build a production monitoring dashboard for model drift

Object Detection with Pretrained Faster R-CNN

Python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
from torchvision.models.detection import FasterRCNN_ResNet50_FPN_V2_Weights

# Load pretrained Faster R-CNN
weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn_v2(weights=weights)
model.eval()

# Inference
img = torch.rand(3, 480, 640)  # C, H, W; pixel values in [0, 1], as the model expects
with torch.no_grad():
    predictions = model([img])

# predictions[0] contains: boxes, labels, scores
boxes = predictions[0]['boxes']    # (N, 4) — x1,y1,x2,y2
labels = predictions[0]['labels']  # (N,) — class IDs
scores = predictions[0]['scores']  # (N,) — confidence

# Filter low-confidence detections
keep = scores > 0.5
print(f"Detected {keep.sum()} objects")

Real-Time Video Inference

Python
import cv2, time, torch
from torchvision import transforms

def realtime_detection(model, video_source=0, conf_threshold=0.5):
    cap = cv2.VideoCapture(video_source)
    transform = transforms.Compose([transforms.ToTensor()])
    fps_history = []

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret: break

        start = time.time()
        img_tensor = transform(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        with torch.no_grad():
            preds = model([img_tensor.cuda()])[0]

        # Draw bounding boxes
        for box, score, label in zip(preds['boxes'], preds['scores'], preds['labels']):
            if score > conf_threshold:
                x1, y1, x2, y2 = box.int().tolist()
                cv2.rectangle(frame, (x1,y1), (x2,y2), (0,255,0), 2)
                cv2.putText(frame, f"{label.item()}: {score:.2f}",
                           (x1,y1-10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0,255,0), 2)

        fps = 1.0 / (time.time() - start)
        fps_history.append(fps)  # keep per-frame FPS for later averaging
        cv2.putText(frame, f"FPS: {fps:.1f}", (10,30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0,0,255), 2)
        cv2.imshow('Detection', frame)
        if cv2.waitKey(1) == 27: break

    cap.release()
    cv2.destroyAllWindows()

Edge Optimization

Python
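# Quantize for mobile/edge deployment
# (dynamic quantization converts only the nn.Linear layers and targets CPU inference)
model_int8 = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Export to ONNX for TensorRT optimization
dummy = [torch.rand(3, 480, 640)]   # the detector takes a list of images
torch.onnx.export(model, (dummy,), "detector.onnx", opset_version=17)

# TensorRT can give a 5-10× speedup on NVIDIA edge devices (Jetson),
# depending on the model and precision; measure on your target hardware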

Production Monitoring

Python
import json, time
from collections import deque

class DetectionMonitor:
    def __init__(self, window_size=100):
        self.latencies = deque(maxlen=window_size)
        self.confidences = deque(maxlen=window_size)
        self.detection_counts = deque(maxlen=window_size)

    def log(self, latency, preds):
        self.latencies.append(latency)
        scores = preds['scores'][preds['scores'] > 0.5]
        self.confidences.extend(scores.tolist())
        self.detection_counts.append(len(scores))

    def report(self):
        import numpy as np
        return {
            "avg_latency_ms": np.mean(self.latencies) * 1000,
            "p99_latency_ms": np.percentile(self.latencies, 99) * 1000,
            "avg_confidence": np.mean(self.confidences) if self.confidences else 0,
            "avg_detections": np.mean(self.detection_counts),
        }

Exercises

Ex 14.1: Add Non-Maximum Suppression (NMS) with IoU threshold 0.5 to filter overlapping detections.

Solution

from torchvision.ops import nms
keep_idx = nms(preds['boxes'], preds['scores'], iou_threshold=0.5)
filtered_boxes = preds['boxes'][keep_idx]
filtered_scores = preds['scores'][keep_idx]
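
NMS decides which boxes overlap "too much" using Intersection over Union (IoU). The sketch below computes IoU for a pair of boxes by hand to show what the 0.5 threshold means; `box_iou_pair` is a hypothetical helper written for this example, and the boxes are made up.

```python
import torch

def box_iou_pair(a: torch.Tensor, b: torch.Tensor) -> float:
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = torch.max(a[0], b[0]), torch.max(a[1], b[1])
    x2, y2 = torch.min(a[2], b[2]), torch.min(a[3], b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)   # 0 if the boxes are disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (inter / (area_a + area_b - inter)).item()

box = torch.tensor([0.0, 0.0, 10.0, 10.0])
shifted = torch.tensor([2.0, 0.0, 12.0, 10.0])    # same size, moved 2 px to the right
far = torch.tensor([20.0, 20.0, 30.0, 30.0])

iou_near = box_iou_pair(box, shifted)   # 80 / 120 ≈ 0.667, above the 0.5 threshold
iou_far = box_iou_pair(box, far)        # 0.0, no overlap
print(iou_near, iou_far)
```

With `iou_threshold=0.5`, NMS would drop the lower-scoring box of the first pair and keep both boxes of the second.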

Ex 14.2: Implement a model drift detector that alerts when average confidence drops below a threshold.

Solution

def check_drift(monitor, threshold=0.7):
    report = monitor.report()
    if report["avg_confidence"] < threshold:
        print(f"⚠ DRIFT ALERT: Avg confidence {report['avg_confidence']:.3f} < {threshold}")
        return True
    return False

Project: Complete Detection API Server

Python
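# FastAPI endpoint for object detection
import io

import torch
from fastapi import FastAPI, UploadFile
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2

app = FastAPI()
model = fasterrcnn_resnet50_fpn_v2(weights="DEFAULT").eval().cuda()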

@app.post("/detect")
async def detect(file: UploadFile):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    tensor = transforms.ToTensor()(image).cuda()
    with torch.no_grad():
        preds = model([tensor])[0]
    keep = preds['scores'] > 0.5
    return {
        "detections": [{
            "box": preds['boxes'][i].tolist(),
            "label": preds['labels'][i].item(),
            "confidence": preds['scores'][i].item()
        } for i in keep.nonzero().flatten().tolist()]
    }

Industry: Amazon Go — Cashierless Stores

Amazon Go stores use hundreds of ceiling cameras running real-time object detection to track customers and products. Their PyTorch-based detection pipeline processes 30+ camera feeds simultaneously, detecting product pickups/putbacks with sub-200ms latency. The system uses TensorRT-optimized models on NVIDIA Jetson edge devices, with a central GPU cluster for model updates and A/B testing. Detection accuracy directly impacts revenue — a 1% improvement in product tracking accuracy translates to millions in reduced shrinkage.

Why This Matters for AI

Object detection is the bridge between AI and the physical world. Self-driving cars, warehouse robotics, medical imaging, surveillance, and AR all depend on fast, accurate detection. Mastering the full pipeline — from model training to edge deployment to production monitoring — is what separates an ML researcher from an ML engineer.

Key Takeaways

Use pretrained detection models (Faster R-CNN, YOLO) as starting points
Real-time inference typically calls for a GPU plus optimizations such as quantization or TensorRT
NMS filters overlapping detections — critical for clean output
Production systems need monitoring: latency, confidence drift, accuracy degradation
Edge deployment (Jetson, mobile) requires ONNX → TensorRT/CoreML conversion