PyTorch • EduArtha
PyTorch: Fundamentals
Master the most popular deep learning framework. From tensors to production deployment: build, train, and ship real AI models with PyTorch.
3–5 months | 14 Chapters | 50+ Exercises | 14 Projects | Industry Problems
Tensor Foundations
The building blocks of every PyTorch program
Tensors & Operations
Learning Objectives
- Create tensors from Python lists, NumPy arrays, and built-in factories
- Understand dtypes, shapes, strides, and memory layout
- Master reshaping, indexing, slicing, and advanced indexing
- Apply broadcasting rules for element-wise operations
- Convert between PyTorch tensors and NumPy arrays
What Is a Tensor?
A tensor is a multi-dimensional array, the fundamental data structure in PyTorch. Scalars are 0-D tensors, vectors are 1-D, matrices are 2-D, and anything higher is an n-D tensor. Every neural network input, output, weight, and gradient is a tensor.
| Dimension | Name | Shape Example | Use Case |
|---|---|---|---|
| 0-D | Scalar | (), e.g. torch.tensor(3.14) | Loss value |
| 1-D | Vector | (512,) | Bias, embedding |
| 2-D | Matrix | (64, 784) | Batch of flat images |
| 3-D | 3-Tensor | (32, 100, 512) | Batch of sequences |
| 4-D | 4-Tensor | (16, 3, 224, 224) | Batch of RGB images |
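The rows of the table can be checked directly. The sketch below builds one placeholder tensor per row (the dictionary keys and the use of torch.zeros are illustrative choices, not part of the course material) and prints its number of dimensions and shape.

```python
import torch

# One placeholder tensor per table row; shapes follow the "Shape Example" column.
examples = {
    "scalar": torch.tensor(3.14),
    "vector": torch.zeros(512),
    "matrix": torch.zeros(64, 784),
    "3-tensor": torch.zeros(32, 100, 512),
    "4-tensor": torch.zeros(16, 3, 224, 224),
}

# ndim is the number of dimensions; shape lists the size of each one.
dims = {name: t.ndim for name, t in examples.items()}
for name, t in examples.items():
    print(f"{name:>8}: ndim={t.ndim}, shape={tuple(t.shape)}")
```

Note that a 0-D tensor has the empty shape (), which is different from a 1-D tensor with a single element.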
Creating Tensors
Python
import torch
# From Python data
a = torch.tensor([1, 2, 3]) # int64 by default
b = torch.tensor([[1.0, 2.0], [3.0, 4.0]]) # float32
# Factory functions
zeros = torch.zeros(3, 4) # 3×4 of zeros
ones = torch.ones(2, 3, dtype=torch.float16) # specify dtype
rand = torch.randn(5, 5) # standard normal
eye = torch.eye(4) # 4×4 identity
arange = torch.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linspace = torch.linspace(0, 1, 5) # 5 evenly spaced
# From NumPy (shares memory!)
import numpy as np
np_arr = np.array([1, 2, 3])
t = torch.from_numpy(np_arr) # zero-copy
back = t.numpy() # back to numpy
Shared Memory Warning
torch.from_numpy() shares memory with the NumPy array. Modifying one changes the other. Use .clone() if you need an independent copy.
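A minimal sketch of the difference, assuming a small integer array: the tensor created with from_numpy() sees a later in-place change to the array, while the cloned tensor does not. The variable names are illustrative.

```python
import numpy as np
import torch

np_arr = np.array([1, 2, 3])
shared = torch.from_numpy(np_arr)               # shares the NumPy buffer
independent = torch.from_numpy(np_arr).clone()  # owns its own copy

np_arr[0] = 99  # modify the NumPy array in place

print(shared)       # first element reflects the change
print(independent)  # still holds the original values
```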
Reshaping & Views
Python
x = torch.arange(12)
# Reshape (returns a view when possible, otherwise copies)
a = x.reshape(3, 4)
b = x.reshape(2, -1) # -1 = infer -> (2, 6)
# View (never copies; raises an error if the memory layout is incompatible, e.g. non-contiguous)
c = x.view(4, 3)
# Squeeze / Unsqueeze
t = torch.randn(1, 3, 1)
print(t.squeeze().shape) # (3,): removes all size-1 dims
print(t.unsqueeze(0).shape) # (1,1,3,1): adds a dim at index 0
# Permute / Transpose
img = torch.randn(3, 224, 224) # C,H,W
hwc = img.permute(1, 2, 0) # H,W,C for matplotlib
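The difference between view and reshape only shows up on non-contiguous tensors, such as the result of a transpose or permute. The following sketch (variable names are illustrative) shows view failing on a transposed tensor while reshape falls back to a copy.

```python
import torch

x = torch.arange(12).reshape(3, 4)
xt = x.t()                      # transpose: same storage, swapped strides
print(xt.is_contiguous())       # False

try:
    xt.view(12)                 # this layout cannot be flattened without a copy
    view_failed = False
except RuntimeError:
    view_failed = True

flat = xt.reshape(12)            # reshape copies when a view is impossible
flat2 = xt.contiguous().view(12) # explicit copy, then view
```

Calling .contiguous() before .view() makes the copy explicit, which can be easier to reason about in performance-sensitive code.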
Indexing & Broadcasting
Python
m = torch.randn(4, 5)
# Basic indexing
row0 = m[0] # first row
elem = m[2, 3] # element at row 2, col 3
cols = m[:, 1:3] # all rows, cols 1-2
# Boolean indexing
mask = m > 0
positives = m[mask] # flat tensor of positive values
# Broadcasting: (4,5) + (5,) -> element-wise add
bias = torch.randn(5)
result = m + bias # bias broadcast across rows
# Broadcasting rules:
# 1. Align shapes from the right
# 2. Dimensions must be equal OR one of them is 1
# (4,1,3) + (1,5,3) -> (4,5,3) ✓
# (4,3) + (5,3) -> ERROR ✗
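The two shape examples in the rules can be checked without allocating any data by using torch.broadcast_shapes. This is a small sketch; the `compatible` flag is just an illustrative name.

```python
import torch

# Compatible: every right-aligned pair of dims is equal or contains a 1
out_shape = torch.broadcast_shapes((4, 1, 3), (1, 5, 3))
print(out_shape)

# Incompatible: 4 vs 5 in the leading dim, and neither is 1
try:
    torch.broadcast_shapes((4, 3), (5, 3))
    compatible = True
except RuntimeError:
    compatible = False
print(compatible)
```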
Exercises
Ex 1.1: Create a 5×5 tensor with values 1-25, then extract the 3×3 center sub-matrix.
Solution
t = torch.arange(1, 26).reshape(5, 5)
center = t[1:4, 1:4]
print(center)
Ex 1.2: Given a batch of images (B, C, H, W), compute the mean pixel value per channel.
Solution
imgs = torch.randn(16, 3, 32, 32)
channel_means = imgs.mean(dim=(0, 2, 3)) # shape: (3,)
print(channel_means)
Ex 1.3: Use broadcasting to add a row-vector (1, 5) and a column-vector (3, 1) to get a (3, 5) matrix.
Solution
row = torch.tensor([[1, 2, 3, 4, 5]])
col = torch.tensor([[10], [20], [30]])
result = row + col # (3, 5)
print(result)
Project: Tensor Statistics Explorer
Build a function that takes any tensor and prints comprehensive statistics:
Python
def tensor_report(t, name="tensor"):
    print(f"=== {name} ===")
    print(f"Shape: {t.shape}")
    print(f"Dtype: {t.dtype}")
    print(f"Device: {t.device}")
    print(f"Numel: {t.numel():,}")
    print(f"Memory: {t.element_size() * t.numel() / 1024:.1f} KB")
    # Mean/std and NaN/Inf checks only make sense for floating-point tensors
    if t.is_floating_point():
        print(f"Mean: {t.mean():.4f}")
        print(f"Std: {t.std():.4f}")
        print(f"Min/Max: {t.min():.4f} / {t.max():.4f}")
        print(f"Has NaN: {t.isnan().any()}")
        print(f"Has Inf: {t.isinf().any()}")
# Test it
x = torch.randn(16, 3, 224, 224)
tensor_report(x, "ImageNet Batch")
Industry: Tensor Operations at Scale
Netflix Recommendation System: Netflix's recommendation engine processes user-item interaction matrices as sparse tensors. A typical matrix has ~200M users × ~15K titles. PyTorch's sparse tensor operations (torch.sparse_coo_tensor) enable efficient matrix factorization on GPUs, computing millions of recommendations in seconds rather than hours.
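To make the sparse representation concrete, here is a toy sketch of a user-item matrix stored in COO format and multiplied by a dense factor matrix. The 4×5 matrix, the ratings, and the names `interactions` and `title_factors` are hypothetical and only stand in for the much larger matrices described above.

```python
import torch

# Hypothetical 4-user x 5-title interaction matrix with three observed ratings
indices = torch.tensor([[0, 1, 3],    # user ids (rows)
                        [2, 0, 4]])   # title ids (columns)
values = torch.tensor([5.0, 3.0, 4.0])
interactions = torch.sparse_coo_tensor(indices, values, size=(4, 5))

# Dense latent factors for titles: (titles, latent_dim)
title_factors = torch.randn(5, 2)

# Sparse @ dense -> dense (users, latent_dim); only stored entries contribute
user_scores = torch.sparse.mm(interactions, title_factors)
print(user_scores.shape)
```

Only the three stored entries take up memory, which is what makes this format practical when almost all user-title pairs are unobserved.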
Why This Matters for AI
Every AI model processes tensors. GPT-3, for example, works on activations of shape (batch, seq_len, 12288). Stable Diffusion processes image tensors of (batch, 4, 64, 64) in latent space. Understanding shapes, broadcasting, and memory layout is essential: shape mismatches are among the most common bugs in deep learning.
Key Takeaways
- Tensors are multi-dimensional arrays: the core data structure in PyTorch
- Use factory functions (zeros, randn, arange) for common patterns
- view never copies memory; reshape may copy if not contiguous
- Broadcasting follows right-alignment rules: dimensions must match or be 1
- NumPy interop shares memory: use .clone() for independence
GPU Computing with CUDA
Learning Objectives
- Check CUDA availability and manage GPU devices
- Move tensors and models between CPU and GPU with .to()
- Understand GPU memory management and avoid OOM errors
- Benchmark CPU vs GPU performance for matrix operations
Device Management
Python
import torch
# Check availability
print(torch.cuda.is_available()) # True if GPU available
print(torch.cuda.device_count()) # Number of GPUs
print(torch.cuda.get_device_name(0)) # e.g., 'NVIDIA A100'
# Best practice: device-agnostic code
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")
# Move tensors to GPU
x = torch.randn(1000, 1000)
x_gpu = x.to(device) # copy to GPU
x_gpu = x.cuda() # shorthand (GPU 0)
x_cpu = x_gpu.cpu() # back to CPU
# Create directly on GPU
y = torch.randn(1000, 1000, device=device)
Golden Rule
All tensors in an operation must be on the same device. x_cpu + y_gpu will raise RuntimeError. Always move data to the same device before computing.
Memory Management
Python
# Monitor GPU memory
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
# Free cached memory
torch.cuda.empty_cache()
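# Track peak memory around an allocation
# (torch.cuda.memory_stats() returns a dict and is not a context manager)
torch.cuda.reset_peak_memory_stats()
big = torch.randn(10000, 10000, device="cuda")
print(f"Peak: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
del big
torch.cuda.empty_cache()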
CPU vs GPU Benchmarking
Python
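import time
sizes = [100, 500, 1000, 5000]
for n in sizes:
    # CPU
    a_cpu = torch.randn(n, n)
    b_cpu = torch.randn(n, n)
    start = time.perf_counter()
    _ = a_cpu @ b_cpu
    cpu_time = time.perf_counter() - start
    # GPU (requires a CUDA device)
    a_gpu = a_cpu.cuda()
    b_gpu = b_cpu.cuda()
    _ = a_gpu @ b_gpu # warm-up: the first CUDA call includes one-time setup cost
    torch.cuda.synchronize() # wait for pending work before starting the clock
    start = time.perf_counter()
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize() # CUDA kernels run asynchronously; wait for the result
    gpu_time = time.perf_counter() - start
    print(f"n={n:5d} CPU: {cpu_time:.4f}s GPU: {gpu_time:.4f}s Speedup: {cpu_time/gpu_time:.1f}x")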
Exercises
Ex 2.1: Write device-agnostic code that creates a 2048×2048 identity matrix on the best available device.
Solution
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
I = torch.eye(2048, device=device)
print(I.device, I.shape)
Ex 2.2: Write a memory report function that prints allocated/reserved GPU memory in MB.
Solution
def gpu_memory_report():
    alloc = torch.cuda.memory_allocated() / 1e6
    reserved = torch.cuda.memory_reserved() / 1e6
    print(f"Allocated: {alloc:.1f} MB | Reserved: {reserved:.1f} MB")
Project: GPU Benchmark Suite
Python
import torch, time
def benchmark_ops(device, n=2048, warmup=3, runs=10):
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    ops = {
        "MatMul": lambda: a @ b,
        "Add": lambda: a + b,
        "SVD": lambda: torch.linalg.svd(a[:512, :512]),
    }
    for name, op in ops.items():
        # Warm-up runs are excluded from the timing
        for _ in range(warmup):
            op()
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            op()
        if device.type == "cuda":
            torch.cuda.synchronize()
        avg = (time.time() - start) / runs
        print(f"[{device}] {name}: {avg*1000:.2f} ms")

benchmark_ops(torch.device("cpu"))
if torch.cuda.is_available():
    benchmark_ops(torch.device("cuda"))
Industry: GPU Fleet Management at Google
Google's TPU/GPU clusters use device-agnostic code patterns identical to PyTorch's .to(device). Teams at DeepMind manage thousands of A100 GPUs, tracking memory utilization per-job. Models like Gemini require multi-node GPU training where memory management directly impacts training cost ($10K+/hour for large clusters).
Why This Matters for AI
Training a single GPT-3 model has been estimated to cost roughly $4.6M in GPU compute. Understanding GPU memory, CUDA synchronization, and efficient data transfer can be the difference between a 2-day and a 2-week training run. Every production ML engineer should be comfortable with device management.
Key Takeaways
- Always write device-agnostic code with torch.device
- Use .to(device) to move tensors and models; all operands must share a device
- GPU excels at large parallel operations; small tensors can be slower on GPU due to transfer overhead
- Monitor memory with torch.cuda.memory_allocated() and free caches when needed
Autograd & Computational Graphs
Learning Objectives
- Understand automatic differentiation and computational graphs
- Use requires_grad, backward(), and .grad
- Control gradient computation with no_grad() and detach()
- Implement gradient accumulation and zeroing
How Autograd Works
PyTorch builds a dynamic computational graph (DAG) as operations execute. Each tensor records which operations created it. Calling .backward() traverses this graph in reverse to compute gradients via the chain rule.
Python
import torch
# Simple gradient computation
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x + 1 # y = x² + 2x + 1
y.backward() # dy/dx = 2x + 2
print(x.grad) # tensor(8.), since at x=3: 2(3)+2 = 8
# Multi-variable
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(3.0)
y = w * x + b # y = wx + b
loss = (y - 10.0) ** 2 # MSE loss
loss.backward()
print(f"dL/dw = {w.grad}") # -18.0
print(f"dL/db = {b.grad}") # -6.0
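The graph that autograd records can be inspected through each tensor's grad_fn attribute. The sketch below (with illustrative variable names) shows that user-created leaf tensors have no grad_fn, while results of operations point to the backward node that produced them.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
z = y + 1

print(x.is_leaf, x.grad_fn)   # leaf tensor created by the user: grad_fn is None
print(y.grad_fn)              # backward node for the power operation
print(z.grad_fn)              # backward node for the addition

z.backward()                  # walks the graph from z back to x
print(x.grad)                 # dz/dx = 2x, evaluated at x = 3
```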
Gradient Control
Python
# Disable gradients for inference (saves memory + faster)
with torch.no_grad():
    pred = model(x) # no graph built
# Detach a tensor from the graph
hidden = encoder(x).detach() # stops gradient flow here
output = decoder(hidden)
# CRITICAL: Zero gradients before each backward pass
optimizer.zero_grad() # without this, gradients accumulate!
loss.backward()
optimizer.step()
# Gradient accumulation (for large effective batch sizes)
accumulation_steps = 4
for i, (x, y) in enumerate(dataloader):
    # here the model is assumed to return the loss directly
    loss = model(x, y) / accumulation_steps
    loss.backward() # gradients accumulate
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
Common Pitfall
Forgetting optimizer.zero_grad() is one of the most common PyTorch bugs. Gradients accumulate by default: each backward() call adds to whatever is already stored in .grad. If your loss doesn't decrease, check this first!
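The accumulation behaviour is easy to see on a single scalar parameter. This is a minimal sketch with illustrative names; it calls backward() twice without zeroing, then once more after zeroing.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(w * 3).backward()
first = w.grad.item()      # gradient of 3w with respect to w

(w * 3).backward()         # no zeroing in between
second = w.grad.item()     # the new gradient was added to the old one

w.grad.zero_()             # what optimizer.zero_grad() does for each parameter
(w * 3).backward()
third = w.grad.item()      # back to a single gradient

print(first, second, third)
```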
Exercises
Ex 3.1: Compute the gradient of f(x) = sin(x) · e^x at x = π/4 using autograd, and verify against the analytical answer.
Solution
x = torch.tensor(torch.pi / 4, requires_grad=True)
f = torch.sin(x) * torch.exp(x)
f.backward()
print(f"Autograd: {x.grad:.6f}")
# Analytical: d/dx[sin(x)·eˣ] = (sin(x) + cos(x))·eˣ
with torch.no_grad(): # the check itself does not need a graph
    analytical = (torch.sin(x) + torch.cos(x)) * torch.exp(x)
print(f"Analytical: {analytical:.6f}")
Ex 3.2: Implement a manual linear regression training loop for 5 epochs using autograd (no nn.Module).
Solution
X = torch.randn(100, 1)
y = 3 * X + 2 + torch.randn(100, 1) * 0.1
w = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1
for epoch in range(5):
    pred = X * w + b
    loss = ((pred - y) ** 2).mean()
    loss.backward()
    # Update parameters outside the graph, then reset gradients
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()
    b.grad.zero_()
    print(f"Epoch {epoch}: loss={loss:.4f}, w={w.item():.3f}, b={b.item():.3f}")
Project: Gradient Descent Visualizer
Python
import torch
def gradient_descent_trace(f, x0, lr=0.1, steps=20):
    """Trace gradient descent on a 1D function."""
    x = torch.tensor(x0, requires_grad=True, dtype=torch.float32)
    history = []
    for i in range(steps):
        y = f(x)
        y.backward()
        history.append((x.item(), y.item(), x.grad.item()))
        with torch.no_grad():
            x -= lr * x.grad
        x.grad.zero_()
    return history

# f(x) = (x-3)² + 1, minimum at x=3
trace = gradient_descent_trace(lambda x: (x - 3)**2 + 1, x0=-2.0)
for x, y, g in trace[:5]:
    print(f"x={x:7.3f} f(x)={y:7.3f} grad={g:7.3f}")
Industry: Gradient Checkpointing at Meta
Training LLaMA-70B requires more GPU memory than is available on a single device. Meta uses gradient checkpointing: intermediate activations are discarded during the forward pass and recomputed during the backward pass. This trades roughly 30% more compute for roughly 60% less memory, enabling training of models that otherwise wouldn't fit.
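On a small scale, the same idea is available through torch.utils.checkpoint. The sketch below is not Meta's training code; it uses a toy two-layer block (an assumption for illustration) to show that checkpointing changes when activations are computed, not the resulting gradients.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))
x = torch.randn(8, 32, requires_grad=True)

# Regular forward/backward: activations inside `block` are stored for backward
block(x).sum().backward()
grad_plain = x.grad.clone()
x.grad = None
block.zero_grad()

# Checkpointed: activations inside `block` are dropped and recomputed in backward
checkpoint(block, x, use_reentrant=False).sum().backward()
grad_ckpt = x.grad.clone()

print(torch.allclose(grad_plain, grad_ckpt))
```

The memory saving only becomes noticeable with deep stacks of large blocks; for a toy model like this one the extra forward pass is pure overhead.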
Key Takeaways
- PyTorch builds a dynamic computation graph, rebuilt each forward pass
- .backward() computes gradients via reverse-mode auto-differentiation
- Always zero_grad() before .backward(); gradients accumulate by default
- Use torch.no_grad() for inference and .detach() to stop gradient flow
Data Loading & Transforms
Learning Objectives
- Use Dataset and DataLoader for efficient data pipelines
- Apply transforms for data preprocessing and augmentation
- Build custom datasets for any data format
- Optimize loading with multiprocessing and pinned memory
Dataset & DataLoader
Python
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms
# Built-in dataset with transforms
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
])
train_data = datasets.CIFAR10('./data', train=True, download=True, transform=transform)
loader = DataLoader(train_data, batch_size=64, shuffle=True, num_workers=4, pin_memory=True)
for images, labels in loader:
    print(images.shape) # (64, 3, 224, 224)
    break
Custom Dataset
Python
import os, torch
from PIL import Image
from torch.utils.data import Dataset

class CustomImageDataset(Dataset):
    def __init__(self, img_dir, labels_file, transform=None):
        self.img_dir = img_dir
        self.transform = transform
        # Load labels: {filename: label}, one "filename,label" pair per line
        self.labels = {}
        with open(labels_file) as f:
            for line in f:
                name, label = line.strip().split(",")
                self.labels[name] = int(label)
        self.images = list(self.labels.keys())

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.images[idx])
        image = Image.open(img_path).convert("RGB")
        label = self.labels[self.images[idx]]
        if self.transform:
            image = self.transform(image)
        return image, label
Data Augmentation
Python
train_transform = transforms.Compose([
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(p=0.5),
transforms.ColorJitter(brightness=0.2, contrast=0.2),
transforms.RandomRotation(15),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
transforms.RandomErasing(p=0.1),
])
Exercises
Ex 4.1: Create a custom Dataset that generates synthetic regression data (y = 2x + noise) with 1000 samples.
Solution
class SyntheticRegression(Dataset):
    def __init__(self, n=1000):
        self.x = torch.randn(n, 1)
        self.y = 2 * self.x + torch.randn(n, 1) * 0.1
    def __len__(self): return len(self.x)
    def __getitem__(self, i): return self.x[i], self.y[i]

ds = SyntheticRegression()
dl = DataLoader(ds, batch_size=32, shuffle=True)
Project: DataLoader Speed Benchmarker
Python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset
X = torch.randn(10000, 3, 32, 32)
y = torch.randint(0, 10, (10000,))
ds = TensorDataset(X, y)
# On Windows/macOS, run this under `if __name__ == "__main__":` when num_workers > 0
for nw in [0, 2, 4, 8]:
    loader = DataLoader(ds, batch_size=64, num_workers=nw, pin_memory=True)
    start = time.time()
    for batch in loader:
        pass
    print(f"num_workers={nw}: {time.time()-start:.3f}s")
Industry: Data Pipelines at Tesla Autopilot
Tesla processes 1.5 billion miles of driving video data. Their PyTorch data pipeline uses custom IterableDataset subclasses to stream video frames from distributed storage, apply real-time augmentation (rain/fog/night simulation), and feed 8-camera multi-view batches to HydraNet, processing 36 FPS per camera across thousands of GPUs.
Key Takeaways
- Implement __len__ and __getitem__ for custom datasets
- Use num_workers > 0 and pin_memory=True for faster GPU training
- Apply data augmentation only to training data, not validation/test
- Use transforms.Compose to chain preprocessing steps
Building Models
From layers to complete training pipelines
nn.Module & Model Architecture
Learning Objectives
- Build models by subclassing nn.Module
- Use Sequential, ModuleList, and ModuleDict
- Inspect and manage model parameters
- Implement forward pass patterns and skip connections
Your First nn.Module
Python
import torch.nn as nn
class SimpleNet(nn.Module):
    def __init__(self, in_features, hidden, out_features):
        super().__init__()
        self.fc1 = nn.Linear(in_features, hidden)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden, out_features)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = SimpleNet(784, 256, 10)
print(model)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
Sequential & ModuleList
Python
# Sequential: simple stack
model = nn.Sequential(
nn.Linear(784, 512),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(512, 256),
nn.ReLU(),
nn.Linear(256, 10)
)
# ModuleList: dynamic layers
class DynamicNet(nn.Module):
    def __init__(self, layer_sizes):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(layer_sizes[i], layer_sizes[i+1])
            for i in range(len(layer_sizes) - 1)
        ])

    def forward(self, x):
        # ReLU after every layer except the last
        for layer in self.layers[:-1]:
            x = torch.relu(layer(x))
        return self.layers[-1](x)

net = DynamicNet([784, 512, 256, 128, 10])
Always Use ModuleList, Not Python List
A plain Python list won't register submodules. model.parameters() won't find weights in a plain list. Always use nn.ModuleList or nn.ModuleDict for dynamic layers.
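The effect is easy to verify by counting parameters. In this sketch, the two class names are hypothetical and exist only to contrast a plain list with nn.ModuleList.

```python
import torch.nn as nn

class PlainListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: the layers are NOT registered as submodules
        self.layers = [nn.Linear(4, 4), nn.Linear(4, 2)]

class ModuleListNet(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList: the layers are registered and visible to parameters()
        self.layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 2)])

n_plain = sum(p.numel() for p in PlainListNet().parameters())
n_registered = sum(p.numel() for p in ModuleListNet().parameters())
print(n_plain, n_registered)
```

An optimizer built from PlainListNet().parameters() would receive nothing to train, and .to(device) or state_dict() would skip those layers as well.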
Exercises
Ex 5.1: Build a ResidualBlock that adds the input to the output (skip connection). Input and output dimensions must match.
Solution
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim)
        )

    def forward(self, x):
        # Skip connection: add the input back before the final activation
        return torch.relu(x + self.block(x))
Project: Model Parameter Inspector
Python
def model_summary(model):
    total = 0
    print(f"{'Layer':<40} {'Shape':<20} {'Params':>10}")
    print("─" * 72)
    for name, param in model.named_parameters():
        n = param.numel()
        total += n
        print(f"{name:<40} {str(list(param.shape)):<20} {n:>10,}")
    print("─" * 72)
    print(f"{'TOTAL':<40} {'':<20} {total:>10,}")
    print(f"Memory: {total * 4 / 1e6:.2f} MB (float32)")

model_summary(SimpleNet(784, 256, 10))
Industry: Modular Architecture at Hugging Face
Hugging Face's Transformers library builds every model (BERT, GPT-2, LLaMA) as composable nn.Module subclasses. Each attention block, feed-forward layer, and normalization is a separate module. This modularity lets researchers swap components (replace attention with Flash Attention, change normalization from LayerNorm to RMSNorm) with small, localized changes.
Key Takeaways
- Always subclass nn.Module and implement forward()
- Use nn.Sequential for simple stacks, ModuleList for dynamic architectures
- model.parameters() iterates all learnable weights for optimizers
- Skip connections (residual) improve gradient flow in deep networks
Loss Functions & Optimizers
Learning Objectives
- Choose the right loss function for classification vs regression
- Understand CrossEntropyLoss, MSELoss, BCELoss, and their variants
- Configure optimizers: SGD, Adam, AdamW
- Implement learning rate scheduling strategies
Common Loss Functions
| Loss | Task | Input | Formula |
|---|---|---|---|
| nn.MSELoss | Regression | Any shape | mean((ŷ - y)²) |
| nn.CrossEntropyLoss | Multi-class | Raw logits | -log(softmax(ŷ)[y]) |
| nn.BCEWithLogitsLoss | Binary/Multi-label | Raw logits | -[y·log(σ(ŷ)) + (1-y)·log(1-σ(ŷ))] |
| nn.L1Loss | Regression | Any shape | mean(\|ŷ - y\|) |
Python
import torch, torch.nn as nn
# Classification: CrossEntropyLoss expects RAW LOGITS (no softmax!)
criterion = nn.CrossEntropyLoss()
logits = torch.randn(8, 10) # batch=8, 10 classes
targets = torch.randint(0, 10, (8,)) # integer labels
loss = criterion(logits, targets)
# Regression: MSELoss
criterion = nn.MSELoss()
pred = torch.randn(32, 1)
target = torch.randn(32, 1)
loss = criterion(pred, target)
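The CrossEntropyLoss formula from the table can be reproduced by hand, which also shows why the input must be raw logits. This is a small sketch with random data; `manual` and `builtin` are illustrative names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)            # batch=8, 10 classes, raw scores
targets = torch.randint(0, 10, (8,))   # integer class labels

builtin = nn.CrossEntropyLoss()(logits, targets)

# Manual version: log-softmax, pick the log-probability of the true class,
# negate, and average over the batch (the default reduction is "mean")
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(8), targets].mean()

print(builtin.item(), manual.item())
```

Applying softmax to the logits before passing them to CrossEntropyLoss would normalize twice and distort the loss.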
Optimizers & LR Scheduling
Python
import torch.optim as optim
model = nn.Linear(784, 10)
# Adam: most popular default
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# AdamW: decoupled weight decay (better for transformers)
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
# Learning rate scheduling: cosine decay, stepped once per epoch
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# Alternative: one cycle (warmup + annealing). OneCycleLR is stepped once per
# batch, so total_steps must equal the total number of batches:
# scheduler = optim.lr_scheduler.OneCycleLR(
#     optimizer, max_lr=1e-3, total_steps=1000
# )
# Usage in training loop
for epoch in range(100):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step() # CosineAnnealingLR: update LR after each epoch
Exercises
Ex 6.1: Compare Adam vs SGD on a simple classification task. Train for 20 epochs and plot the loss curves.
Solution
model_adam = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 3))
model_sgd = nn.Sequential(nn.Linear(20, 50), nn.ReLU(), nn.Linear(50, 3))
model_sgd.load_state_dict(model_adam.state_dict()) # same init
X = torch.randn(200, 20); y = torch.randint(0, 3, (200,))
opt_adam = optim.Adam(model_adam.parameters(), lr=1e-3)
opt_sgd = optim.SGD(model_sgd.parameters(), lr=0.01, momentum=0.9)
crit = nn.CrossEntropyLoss()
for e in range(20):
    for opt, mdl, name in [(opt_adam, model_adam, "Adam"), (opt_sgd, model_sgd, "SGD")]:
        opt.zero_grad()
        loss = crit(mdl(X), y)
        loss.backward()
        opt.step()
    if e % 5 == 0:
        print(f"Epoch {e}: Adam={crit(model_adam(X),y):.3f} SGD={crit(model_sgd(X),y):.3f}")
Project: LR Finder
Python
def lr_finder(model, loader, criterion, min_lr=1e-7, max_lr=1, steps=100):
    optimizer = optim.SGD(model.parameters(), lr=min_lr)
    # Multiply the LR by a constant factor each step: min_lr -> max_lr over `steps`
    lr_mult = (max_lr / min_lr) ** (1 / steps)
    lrs, losses = [], []
    for i, (x, y) in enumerate(loader):
        if i >= steps:
            break
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(optimizer.param_groups[0]['lr'])
        losses.append(loss.item())
        optimizer.param_groups[0]['lr'] *= lr_mult
    print(f"Best LR ≈ {lrs[losses.index(min(losses))]:.2e}")
    return lrs, losses
Industry: Optimizer Choice at OpenAI
OpenAI has not published GPT-4's training recipe, but the GPT-3 paper describes Adam with cosine learning rate decay, a linear warmup, and a weight decay of 0.1. The warmup phase gradually increases the LR from near 0 to its peak, preventing early instability, and the weight decay acts as regularization. Variants of this recipe, usually with AdamW, are now industry standard for large language models.
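A warmup-plus-cosine schedule of this kind can be sketched with LambdaLR. The step counts, peak learning rate, and the helper name `lr_factor` below are illustrative assumptions, not values from any published training run.

```python
import math
import torch.nn as nn
import torch.optim as optim

warmup_steps, total_steps, peak_lr = 10, 100, 3e-4   # illustrative values

def lr_factor(step):
    """Multiplier applied to peak_lr: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

model = nn.Linear(4, 2)
optimizer = optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.1)
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

lrs = []
for step in range(total_steps):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()       # forward/backward omitted in this sketch
    scheduler.step()       # this schedule is stepped once per batch

print(f"start={lrs[0]:.2e} peak={max(lrs):.2e} end={lrs[-1]:.2e}")
```

In a real loop the forward and backward passes would sit before optimizer.step(); they are left out here so the schedule itself is easy to inspect.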
Key Takeaways
- CrossEntropyLoss expects raw logits; it applies softmax internally
- AdamW with cosine scheduling is the go-to for transformers
- Learning rate is the most important hyperparameter: use LR finders
- Weight decay ≠ L2 regularization in Adam (use AdamW for proper decoupling)
Training Loops & Validation
Learning Objectives
- Write a complete training loop with validation
- Implement early stopping and model checkpointing
- Track and log metrics (loss, accuracy, F1)
- Handle model.train() vs model.eval() modes
The Complete Training Loop
Python
import torch, torch.nn as nn
from torch.utils.data import DataLoader
def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train() # enable dropout, batchnorm training mode
    total_loss, correct, total = 0, 0, 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        logits = model(X)
        loss = criterion(logits, y)
        loss.backward()
        optimizer.step()
        # Weight the batch loss by batch size so the epoch average is exact
        total_loss += loss.item() * X.size(0)
        correct += (logits.argmax(1) == y).sum().item()
        total += X.size(0)
    return total_loss / total, correct / total

@torch.no_grad()
def evaluate(model, loader, criterion, device):
    model.eval() # disable dropout, batchnorm eval mode
    total_loss, correct, total = 0, 0, 0
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        logits = model(X)
        loss = criterion(logits, y)
        total_loss += loss.item() * X.size(0)
        correct += (logits.argmax(1) == y).sum().item()
        total += X.size(0)
    return total_loss / total, correct / total
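What model.train() and model.eval() actually change can be seen on a single dropout layer. A minimal sketch, using a vector of ones as an illustrative input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)   # some entries zeroed, survivors scaled by 1/(1-p) = 2

drop.eval()
eval_out = drop(x)    # dropout is a no-op in eval mode

print(train_out[:8])
print(eval_out[:8])
```

Evaluating with the model left in training mode therefore gives noisy, systematically worse metrics, which is why evaluate() calls model.eval() first.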
Early Stopping & Checkpointing
Python
best_val_loss = float('inf')
patience, patience_counter = 5, 0
for epoch in range(100):
    train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
    val_loss, val_acc = evaluate(model, val_loader, criterion, device)
    print(f"Epoch {epoch+1:3d} | Train Loss: {train_loss:.4f} Acc: {train_acc:.3f} | Val Loss: {val_loss:.4f} Acc: {val_acc:.3f}")
    # Checkpoint best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'val_loss': val_loss,
        }, 'best_model.pth')
        print(f"  -> Saved best model (val_loss={val_loss:.4f})")
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"  -> Early stopping at epoch {epoch+1}")
            break

# Load best model for inference
checkpoint = torch.load('best_model.pth')
model.load_state_dict(checkpoint['model_state_dict'])
Exercises
Ex 7.1: Add a progress bar using tqdm to the training loop. Display loss and accuracy per batch.
Solution
from tqdm import tqdm
pbar = tqdm(loader, desc=f"Epoch {epoch}")
for X, y in pbar:
    # ... training code (computes loss and updates correct, total) ...
    pbar.set_postfix(loss=loss.item(), acc=correct/total)
Project: MNIST Classifier β End to End
Python
import torch, torch.nn as nn, torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,),(0.3081,))])
train_ds = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_ds = datasets.MNIST('./data', train=False, transform=transform)
train_loader = DataLoader(train_ds, 128, shuffle=True)
test_loader = DataLoader(test_ds, 256)
model = nn.Sequential(nn.Flatten(), nn.Linear(784,256), nn.ReLU(), nn.Dropout(0.2),
nn.Linear(256,10)).to(device)
optimizer = optim.Adam(model.parameters(), 1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(5):
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        criterion(model(X), y).backward()
        optimizer.step()
    # evaluate() is the function defined in "The Complete Training Loop"
    _, acc = evaluate(model, test_loader, criterion, device)
    print(f"Epoch {epoch+1}: Test Acc = {acc:.3f}")
Industry: Training Infrastructure at Anthropic
Anthropic's Claude training loop includes gradient norm monitoring, loss spike detection, and automatic checkpoint recovery. When a loss spike occurs (NaN or sudden increase), the system automatically rolls back to the last stable checkpoint and reduces the learning rate, preventing days of wasted GPU time on 2048+ GPU clusters.
Key Takeaways
- Always switch between model.train() and model.eval()
- Use @torch.no_grad() decorator for evaluation functions
- Save both model and optimizer state for resumable training
- Early stopping prevents overfitting: monitor validation loss
CNNs in PyTorch
Learning Objectives
- Build CNNs with Conv2d, MaxPool2d, and BatchNorm2d
- Calculate output dimensions for conv/pool layers
- Implement LeNet and ResNet architectures
- Apply transfer learning with pretrained models
Convolution Basics
Python
import torch.nn as nn
# Basic CNN
class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), # (3,32,32) -> (32,32,32)
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2), # -> (32,16,16)
            nn.Conv2d(32, 64, 3, padding=1), # -> (64,16,16)
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2), # -> (64,8,8)
            nn.Conv2d(64, 128, 3, padding=1), # -> (128,8,8)
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), # -> (128,1,1)
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1) # (B, 128)
        return self.classifier(x)
Transfer Learning
Python
import torch.nn as nn
import torch.optim as optim
from torchvision import models
# Load pretrained ResNet18
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
# Freeze all layers
for param in model.parameters():
    param.requires_grad = False
# Replace classification head (new layers have requires_grad=True by default)
model.fc = nn.Linear(model.fc.in_features, 5) # 5 classes
# Only train the new head
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
Exercises
Ex 8.1: Calculate the output shape of Conv2d(3, 64, kernel_size=7, stride=2, padding=3) given input (1, 3, 224, 224).
Solution
# Output = ⌊(224 + 2·3 - 7) / 2⌋ + 1 = ⌊223/2⌋ + 1 = 112
# Shape: (1, 64, 112, 112)
conv = nn.Conv2d(3, 64, 7, stride=2, padding=3)
x = torch.randn(1, 3, 224, 224)
print(conv(x).shape) # torch.Size([1, 64, 112, 112])
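The same calculation can be wrapped in a small helper and checked against a real layer. The function name `conv2d_out_size` is hypothetical; the formula is the one documented for nn.Conv2d, with the dilation term included for completeness.

```python
import torch
import torch.nn as nn

def conv2d_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output height or width of a Conv2d/MaxPool2d layer along one dimension."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

# Exercise 8.1: kernel 7, stride 2, padding 3 on a 224-pixel input
h = conv2d_out_size(224, kernel_size=7, stride=2, padding=3)

conv = nn.Conv2d(3, 64, 7, stride=2, padding=3)
out = conv(torch.randn(1, 3, 224, 224))
print(h, tuple(out.shape))

# A 3x3 conv with padding=1 keeps the size; a 2x2 max-pool (stride 2) halves it
same = conv2d_out_size(32, kernel_size=3, padding=1)
halved = conv2d_out_size(32, kernel_size=2, stride=2)
```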
Project: CIFAR-10 CNN with 90%+ Accuracy
Python
# Full training pipeline: transfer learning on CIFAR-10
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(512, 10)
model = model.to(device)
transform = transforms.Compose([
transforms.Resize(224),
transforms.ToTensor(),
transforms.Normalize([0.485,0.456,0.406],[0.229,0.224,0.225])
])
train_ds = datasets.CIFAR10('./data', train=True, download=True, transform=transform)
train_loader = DataLoader(train_ds, 64, shuffle=True, num_workers=2)
optimizer = optim.Adam(model.parameters(), 1e-4)
criterion = nn.CrossEntropyLoss()
# Train for 5 epochs; expect roughly 93% accuracy (results vary by run)
Industry: CNNs in Autonomous Driving
Waymo's perception stack uses multi-scale CNNs (Feature Pyramid Networks) processing 5 LiDAR and 29 camera feeds simultaneously. Each camera frame passes through a backbone CNN (EfficientNet variant), producing feature maps at 1/4, 1/8, 1/16, 1/32 resolution, enabling detection of both pedestrians (near, large) and traffic signs (far, small) in the same forward pass.
Key Takeaways
- Conv2d extracts spatial features; MaxPool2d reduces spatial dimensions
- BatchNorm2d stabilizes training and allows higher learning rates
- AdaptiveAvgPool2d(1) handles any input size before the classifier
- Transfer learning: freeze backbone, replace head, fine-tune; often the fastest path to good results
Advanced PyTorch
Production-grade techniques for serious models
Recurrent Networks
Learning Objectives
- Understand RNN, LSTM, and GRU architectures in PyTorch
- Process variable-length sequences with packing/padding
- Build a text generation model with character-level LSTM
- Handle hidden state initialization and management
LSTM in PyTorch
Python
import torch, torch.nn as nn
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            batch_first=True, dropout=0.3, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_classes) # *2 for bidirectional

    def forward(self, x):
        emb = self.embedding(x) # (B, T) -> (B, T, E)
        output, (h_n, c_n) = self.lstm(emb) # output: (B, T, 2H)
        # Use last hidden states from both directions (last layer: forward, backward)
        hidden = torch.cat([h_n[-2], h_n[-1]], dim=1) # (B, 2H)
        return self.fc(hidden)

model = TextClassifier(vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=5)
x = torch.randint(0, 10000, (8, 50)) # batch=8, seq_len=50
out = model(x) # (8, 5)
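The classifier above assumes every sequence in the batch has the same length. For variable-length sequences, the learning objectives mention packing and padding; the sketch below shows one way to do this with the utilities in torch.nn.utils.rnn. The sequence lengths and feature sizes are arbitrary illustrative values.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
# Three sequences of different lengths, each with 8 features per timestep
seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(2, 8)]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)   # (3, 5, 8), zero-padded
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = nn.LSTM(8, 16, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)           # padding steps are skipped
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(output.shape, out_lengths.tolist())
```

With packing, h_n holds the state at each sequence's true last timestep rather than at the padded end, which is usually what a classifier head should consume.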
Exercises
Ex 9.1: Build a GRU-based model for sequence-to-sequence prediction (predict next value in a time series).
Solution
class TimeSeriesGRU(nn.Module):
    def __init__(self, input_dim=1, hidden=64):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.gru(x)
        return self.fc(out[:, -1, :]) # last timestep
Project: Character-Level Text Generator
Python
class CharRNN(nn.Module):
    def __init__(self, vocab_size, embed=64, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, vocab_size)

    def forward(self, x, hidden=None):
        emb = self.embed(x)
        out, hidden = self.lstm(emb, hidden)
        return self.fc(out), hidden

@torch.no_grad()  # generation needs no gradients
def generate(model, start_char, char2idx, idx2char, length=200):
    model.eval()
    x = torch.tensor([[char2idx[start_char]]])
    hidden = None
    result = [start_char]
    for _ in range(length):
        logits, hidden = model(x, hidden)
        # Temperature 0.8: slightly sharper than the raw softmax
        probs = torch.softmax(logits[0, -1] / 0.8, dim=-1)
        idx = torch.multinomial(probs, 1).item()
        result.append(idx2char[idx])
        x = torch.tensor([[idx]])
    return ''.join(result)
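The generator divides the logits by 0.8 before the softmax. As a rough illustration of what that temperature does, the sketch below applies several temperatures to one set of logits. The logit values and the helper name `sampling_probs` are made up for this example.

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])  # hypothetical scores for three characters

def sampling_probs(logits, temperature):
    """Softmax over logits divided by the temperature."""
    return torch.softmax(logits / temperature, dim=-1)

for t in (0.5, 0.8, 2.0):
    probs = sampling_probs(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs.tolist()]}")
```

Temperatures below 1 concentrate probability on the top-scoring character (more repetitive, safer text). Temperatures above 1 flatten the distribution (more varied, noisier text).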
Industry: LSTMs in Finance
Goldman Sachs and Two Sigma use LSTM networks for time-series forecasting of market volatility. Their models process sequences of 60+ features (price, volume, order flow, sentiment) across 500+ timesteps, predicting intraday price movements with sub-millisecond inference latency using ONNX-optimized PyTorch models.
Key Takeaways
- LSTM handles long-range dependencies better than vanilla RNN (its gated cell state mitigates vanishing gradients)
- Use batch_first=True for intuitive (B, T, F) tensor layouts
- Bidirectional doubles the output feature size: concatenate forward and backward states
- For most NLP tasks, Transformers have replaced RNNs, but RNNs remain strong for streaming/real-time data
Custom Layers & Hooks
Learning Objectives
- Write custom nn.Module layers with learnable parameters
- Implement custom autograd functions with torch.autograd.Function
- Use forward and backward hooks for debugging and visualization
- Inspect gradients and activations at any layer
Custom Layer with Parameters
Python
class ScaleShift(nn.Module):
    """Learnable scale and shift: y = gamma * x + beta"""
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.gamma * x + self.beta
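The learning objectives also mention torch.autograd.Function, which lets you define the backward pass by hand instead of relying on autograd's recorded graph. The sketch below uses a deliberately trivial operation (squaring) so the hand-written gradient is easy to check. The class name `Square` is illustrative only.

```python
import torch

class Square(torch.autograd.Function):
    """y = x^2 with a manually written gradient dy/dx = 2x."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)      # stash the input for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_output    # chain rule: upstream grad times local grad

x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
y = Square.apply(x).sum()             # custom Functions are called through .apply
y.backward()
print(x.grad)
```

In practice a custom Function is worth writing when the default gradient is numerically unstable, memory-hungry, or undefined. For ordinary layers an nn.Module such as ScaleShift is enough.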
Forward Hooks for Activation Inspection
Python
activations = {}

def save_activation(name):
    def hook(module, input, output):
        activations[name] = output.detach()
    return hook

model = SimpleCNN()
model.features[0].register_forward_hook(save_activation('conv1'))
model.features[4].register_forward_hook(save_activation('conv2'))
# Forward pass: hooks capture activations
_ = model(torch.randn(1, 3, 32, 32))
print(activations['conv1'].shape)  # (1, 32, 32, 32)
print(activations['conv2'].shape)  # (1, 64, 16, 16)
Gradient Hooks
Python
gradients = {}

def save_gradient(name):
    def hook(module, grad_input, grad_output):
        gradients[name] = grad_output[0].detach()
    return hook

# register_full_backward_hook replaces the deprecated register_backward_hook
model.features[0].register_full_backward_hook(save_gradient('conv1_grad'))
loss = criterion(model(x), y)
loss.backward()
print(f"Conv1 gradient norm: {gradients['conv1_grad'].norm():.4f}")
Exercises
Ex 10.1: Implement a custom Swish activation function as an nn.Module: swish(x) = x · σ(x).
Solution
class Swish(nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)
Project: Gradient Flow Debugger
Python
def plot_gradient_flow(model):
    """Print gradient statistics for each layer."""
    print(f"{'Layer':<40} {'Mean Grad':>12} {'Max Grad':>12}")
    print("─" * 66)
    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name:<40} {param.grad.abs().mean():>12.6f} {param.grad.abs().max():>12.6f}")

# Use after loss.backward()
loss.backward()
plot_gradient_flow(model)
Industry: Model Interpretability at Google Health
Google's medical AI team uses gradient hooks extensively for Grad-CAM visualizations, highlighting which regions of X-ray images the model focuses on when diagnosing pneumonia or cancer. This interpretability is required by FDA regulations for clinical deployment. Hooks capture gradients at the last convolutional layer to generate attention heatmaps overlaid on the original image.
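To make the Grad-CAM idea concrete, here is a minimal sketch on a tiny, randomly initialised CNN. The network, the input size, and the `store` dictionary are assumptions for illustration, not the architecture of any deployed system. A forward hook keeps the last convolution's output, a tensor hook keeps its gradient, and the heatmap is the ReLU of the gradient-weighted sum of channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
net = nn.Sequential(
    nn.Conv2d(1, 4, 3, padding=1), nn.ReLU(),
    nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(),    # net[2] is the last conv layer
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 2),
).eval()

store = {}

def forward_hook(module, inputs, output):
    store["act"] = output.detach()                          # feature maps
    output.register_hook(lambda g: store.update(grad=g))    # their gradient

handle = net[2].register_forward_hook(forward_hook)
image = torch.randn(1, 1, 16, 16)                 # stand-in for a grayscale X-ray
logits = net(image)
logits[0, logits.argmax()].backward()             # gradient of the predicted class
handle.remove()

# Channel weights = spatially averaged gradients; CAM = ReLU(weighted sum)
weights = store["grad"].mean(dim=(2, 3), keepdim=True)     # (1, 8, 1, 1)
cam = F.relu((weights * store["act"]).sum(dim=1))          # (1, 16, 16)
cam = cam / (cam.max() + 1e-8)                             # scale to [0, 1]
print(cam.shape)
```

For a real model the heatmap is usually upsampled to the input resolution before being overlaid on the image.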
Key Takeaways
- Use nn.Parameter to create learnable parameters in custom modules
- Forward hooks capture activations; backward hooks capture gradients
- Hooks are essential for debugging vanishing/exploding gradients
- Always .detach() tensors saved in hooks to prevent memory leaks
Distributed Training & Mixed Precision
Learning Objectives
- Scale training with DataParallel and DistributedDataParallel
- Use Automatic Mixed Precision (AMP) for faster training
- Implement gradient accumulation for large effective batch sizes
- Understand multi-GPU communication patterns
DataParallel (Simple Multi-GPU)
Python
# Easiest way: wrap model in DataParallel
model = SimpleCNN().cuda()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # splits batch across GPUs
Mixed Precision Training (AMP)
Python
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for X, y in train_loader:
    X, y = X.cuda(), y.cuda()
    optimizer.zero_grad()
    # Forward pass under autocast (FP16 where it is numerically safe)
    with autocast():
        logits = model(X)
        loss = criterion(logits, y)
    # Backward pass with gradient scaling
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
AMP Speedup
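Mixed precision typically gives a 1.5-3× speedup on modern GPUs (A100, H100), usually with negligible accuracy loss. It runs most matrix and convolution math in FP16 while keeping the weights and numerically sensitive operations in FP32; the gradient scaler guards against FP16 underflow. Use it by default unless you have a specific reason not to.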
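You can observe the dtype split without a GPU. The sketch below uses CPU autocast with bfloat16 as a stand-in for CUDA FP16. That substitution is an assumption made so the example runs anywhere. The layer size is arbitrary.

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 4)            # parameters are created and stored in FP32
x = torch.randn(2, 8)

# Inside autocast, eligible ops (such as the linear layer's matmul) run in reduced precision
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = layer(x)

y_full = layer(x)                  # outside autocast: plain FP32

print(layer.weight.dtype, y.dtype, y_full.dtype)
```

The weights never change dtype; only the computation inside the context is downcast, which is why the optimizer can keep updating full-precision parameters.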
Gradient Accumulation
Python
# Simulate batch_size=256 with actual batch_size=64
accumulation_steps = 4
for i, (X, y) in enumerate(train_loader):
    X, y = X.to(device), y.to(device)
    with autocast():
        loss = criterion(model(X), y) / accumulation_steps
    scaler.scale(loss).backward()
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
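Dividing the loss by accumulation_steps is what makes the accumulated gradient match a single large batch. The following CPU-only sketch checks that claim on a toy linear model. The model, data, and sizes are assumptions, and the equality relies on a mean-reduced loss and equally sized micro-batches.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()                       # mean reduction
X, y = torch.randn(64, 10), torch.randn(64, 1)

# Reference: one backward pass over the full batch of 64
model.zero_grad()
criterion(model(X), y).backward()
full_grad = model.weight.grad.clone()

# Accumulate over 4 micro-batches of 16, scaling each loss by 1/4
steps = 4
model.zero_grad()
for X_chunk, y_chunk in zip(X.chunk(steps), y.chunk(steps)):
    (criterion(model(X_chunk), y_chunk) / steps).backward()
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

Layers whose behaviour depends on batch statistics (BatchNorm, for example) still see the small micro-batch, so accumulation is not a perfect substitute there.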
Exercises
Ex 11.1: Measure training speed (samples/sec) with and without AMP on your GPU.
Solution
import time

for use_amp in [False, True]:
    scaler = GradScaler(enabled=use_amp)
    start = time.time()
    for X, y in train_loader:
        X, y = X.cuda(), y.cuda()
        optimizer.zero_grad()
        with autocast(enabled=use_amp):
            loss = criterion(model(X), y)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    elapsed = time.time() - start
    print(f"AMP={use_amp}: {len(train_loader.dataset)/elapsed:.0f} samples/sec")
Project: Multi-GPU Training Script
Python
# DDP training script skeleton (launch with: torchrun --nproc_per_node=2 train.py)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup():
    # torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank

def train():
    local_rank = setup()
    model = SimpleCNN().to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    # ... training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    train()
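A DDP training loop normally needs each process to read a different slice of the dataset. One common way to do that, shown here as a sketch rather than as part of the script above, is DistributedSampler. Passing num_replicas and rank explicitly lets the example run in a single process without initialising a process group. The ten-item dataset is a toy assumption.

```python
import torch
from torch.utils.data import TensorDataset, DistributedSampler

dataset = TensorDataset(torch.arange(10))      # toy dataset with 10 samples
world_size = 2                                 # pretend we have two processes

# Build the index list each rank would iterate over
shards = [
    list(DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False))
    for rank in range(world_size)
]
for rank, indices in enumerate(shards):
    print(f"rank {rank}: {indices}")
```

In a real run you would pass the sampler to the DataLoader via `sampler=` and, when shuffling, call `sampler.set_epoch(epoch)` at the start of every epoch so each epoch uses a different ordering.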
Industry: Training LLaMA at Meta
Meta trained LLaMA-70B on 2048 A100 GPUs using FSDP (Fully Sharded Data Parallel), PyTorch's evolution of DDP. FSDP shards model parameters, gradients, and optimizer states across GPUs, enabling models that don't fit on a single GPU. Combined with AMP and gradient checkpointing, they achieved 95% GPU utilization across the cluster.
Key Takeaways
- DataParallel is simple but slow; DistributedDataParallel is production-grade
- AMP typically gives a 1.5-3× speedup at little cost: use it by default
- Gradient accumulation simulates larger batch sizes without more memory
- Launch DDP with torchrun: it handles rank/world_size automatically
TorchScript & ONNX Export
Learning Objectives
- Convert models to TorchScript via tracing and scripting
- Export models to ONNX format for cross-platform deployment
- Optimize models for inference (quantization, pruning)
- Deploy to mobile with PyTorch Mobile / ExecuTorch
TorchScript
Python
# Method 1: Tracing (for models without control flow)
model = SimpleCNN().eval()
example = torch.randn(1, 3, 32, 32)
traced = torch.jit.trace(model, example)
traced.save("model_traced.pt")

# Method 2: Scripting (handles if/for/while)
@torch.jit.script
def relu_threshold(x: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    if x.mean() > threshold:
        return torch.relu(x)
    return x

# Load and run without Python
loaded = torch.jit.load("model_traced.pt")
output = loaded(example)
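The "without control flow" caveat on tracing is easy to miss, so here is a small sketch of what goes wrong. Tracing records only the operations executed for the example input, so a data-dependent `if` gets frozen to whichever branch that input took. The function `gate` is a made-up example.

```python
import warnings
import torch

def gate(x):
    # Data-dependent branch: the path taken depends on the values in x
    if x.sum() > 0:
        return x + 10
    return x - 10

with warnings.catch_warnings():
    warnings.simplefilter("ignore")    # tracing warns about the tensor-to-bool conversion
    traced = torch.jit.trace(gate, torch.ones(3))   # example input takes the first branch

neg = -torch.ones(3)
eager_out = gate(neg)       # eager Python takes the second branch
traced_out = traced(neg)    # the trace replays the first branch regardless
print(eager_out, traced_out)
```

The two outputs disagree, which is the reason to prefer torch.jit.script (or to restructure the model) whenever the forward pass branches on tensor values.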
ONNX Export
Python
import torch.onnx

model = SimpleCNN().eval()
dummy = torch.randn(1, 3, 32, 32)
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}},  # variable batch size
    opset_version=17
)

# Run with ONNX Runtime (often quoted as 10-40% faster than PyTorch; measure on your own model)
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
result = session.run(None, {"image": dummy.numpy()})
Quantization
Python
# Dynamic quantization (easiest, good for LSTMs/Transformers)
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.LSTM}, dtype=torch.qint8
)
# Model size: ~4× smaller, inference: ~2× faster
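The "~4× smaller" figure follows from storing each weight in 8 bits instead of 32. The sketch below does the arithmetic by hand with a simple symmetric per-tensor scheme. This is an illustrative simplification, not necessarily the exact scheme quantize_dynamic applies internally, and the tensor is random.

```python
import torch

torch.manual_seed(0)
w = torch.randn(256, 256)                       # stand-in for an FP32 weight matrix

# Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]
scale = w.abs().max() / 127
q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
w_hat = q.float() * scale                       # dequantized approximation

bytes_fp32 = w.element_size() * w.nelement()    # 4 bytes per weight
bytes_int8 = q.element_size() * q.nelement()    # 1 byte per weight
ratio = bytes_fp32 / bytes_int8
max_err = (w - w_hat).abs().max()
print(f"size ratio: {ratio:.1f}x, max abs error: {max_err:.5f}, scale: {scale:.5f}")
```

The rounding error per weight is bounded by half the scale, which is why accuracy usually degrades only slightly for layers whose weights are not dominated by a few outliers.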
Exercises
Ex 12.1: Export a model to ONNX and compare inference speed between PyTorch and ONNX Runtime.
Solution
import time
import torch
import onnxruntime as ort

dummy = torch.randn(1, 3, 32, 32)
# PyTorch (under no_grad so autograd bookkeeping is not timed)
start = time.time()
with torch.no_grad():
    for _ in range(1000):
        model(dummy)
pt_time = time.time() - start
# ONNX Runtime
sess = ort.InferenceSession("model.onnx")
ort_input = {"image": dummy.numpy()}
start = time.time()
for _ in range(1000):
    sess.run(None, ort_input)
ort_time = time.time() - start
print(f"PyTorch: {pt_time:.2f}s | ONNX: {ort_time:.2f}s | Speedup: {pt_time/ort_time:.1f}x")
Project: Full Model Export Pipeline
Python
def export_pipeline(model, example_input, name="model"):
    model.eval()
    # 1. TorchScript
    traced = torch.jit.trace(model, example_input)
    traced.save(f"{name}.pt")
    # 2. ONNX
    torch.onnx.export(model, example_input, f"{name}.onnx", opset_version=17)
    # 3. Quantized
    q_model = torch.quantization.quantize_dynamic(model, {nn.Linear}, torch.qint8)
    traced_q = torch.jit.trace(q_model, example_input)
    traced_q.save(f"{name}_quantized.pt")
    print(f"Exported: {name}.pt, {name}.onnx, {name}_quantized.pt")
Industry: Edge Deployment at Apple
Apple deploys PyTorch models to billions of iPhones via CoreML conversion. Models go through: PyTorch → ONNX → CoreML → on-device Neural Engine. Face ID's depth estimation model runs at 30 FPS entirely on the iPhone's Neural Engine. The PyTorch → ONNX export step is the critical bridge enabling this pipeline.
Key Takeaways
- TorchScript removes Python dependency: use trace for simple models, script for control flow
- ONNX is the universal exchange format: it enables deployment anywhere
- Dynamic quantization gives 4× size reduction with minimal accuracy loss
- Always benchmark exported models against PyTorch baseline
Industry Applications
Real-world problems solved with PyTorch
Industry Problem: Medical Image Classification
Learning Objectives
- Build an end-to-end medical image classification pipeline
- Handle class imbalance with weighted sampling and focal loss
- Implement proper medical AI evaluation (sensitivity, specificity, AUC)
- Deploy with confidence calibration and uncertainty estimation
Problem Statement
Build a chest X-ray classifier that detects pneumonia from X-ray images. The dataset has ~5,000 images with severe class imbalance (3:1 pneumonia to normal). This mirrors a real clinical deployment scenario.
Data Pipeline
Python
from torchvision import transforms, datasets
from torch.utils.data import DataLoader, WeightedRandomSampler
import numpy as np

# Medical-specific augmentation (no vertical flip: anatomy matters!)
train_transform = transforms.Compose([
    # ImageFolder loads RGB by default; the model below expects 1 channel
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
    transforms.ToTensor(),
    transforms.Normalize([0.485], [0.229]),
])

# Weighted sampler for class imbalance
train_data = datasets.ImageFolder('data/train', transform=train_transform)
targets = np.array(train_data.targets)  # labels, without decoding every image
class_counts = np.bincount(targets)
weights = 1.0 / class_counts
sample_weights = weights[targets]
sampler = WeightedRandomSampler(sample_weights, len(train_data))
train_loader = DataLoader(train_data, batch_size=32, sampler=sampler)
Model Architecture
Python
from torchvision import models
class ChestXRayModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Modify first conv for grayscale (1 channel instead of 3)
        self.backbone.conv1 = nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False)
        self.backbone.fc = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(2048, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, 2)  # Normal vs Pneumonia
        )

    def forward(self, x):
        return self.backbone(x)
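Replacing conv1 with a fresh nn.Conv2d discards the pretrained first-layer filters. One optional refinement, not part of the model above, is to initialise the 1-channel filters from the sum of the RGB filters, so a grayscale image produces the same response as it would when copied into three channels. The sketch uses a randomly initialised stand-in for the pretrained layer so it runs without downloading weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for the pretrained 3-channel conv1 (random here, pretrained in practice)
rgb_conv = nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False)
gray_conv = nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False)

# Collapse the RGB filters into one channel by summing over the input-channel axis
with torch.no_grad():
    gray_conv.weight.copy_(rgb_conv.weight.sum(dim=1, keepdim=True))

gray = torch.randn(2, 1, 32, 32)                    # batch of grayscale images
out_rgb = rgb_conv(gray.repeat(1, 3, 1, 1))         # same image copied into R, G, B
out_gray = gray_conv(gray)
print(torch.allclose(out_rgb, out_gray, atol=1e-4))
```

With a real ResNet50 you would read the weights from `backbone.conv1` before overwriting it. Whether this helps on X-rays is an empirical question worth testing against the random initialisation.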
Medical-Grade Evaluation
Python
from sklearn.metrics import classification_report, roc_auc_score

@torch.no_grad()
def medical_evaluate(model, loader, device):
    model.eval()
    all_preds, all_labels, all_probs = [], [], []
    for X, y in loader:
        X = X.to(device)
        logits = model(X)
        probs = torch.softmax(logits, dim=1)[:, 1]  # P(pneumonia)
        preds = (probs > 0.5).long()
        all_preds.extend(preds.cpu().tolist())
        all_labels.extend(y.tolist())
        all_probs.extend(probs.cpu().tolist())
    print(classification_report(all_labels, all_preds,
                                target_names=['Normal', 'Pneumonia']))
    print(f"AUC-ROC: {roc_auc_score(all_labels, all_probs):.4f}")
Clinical Deployment Considerations
In medical AI, sensitivity (recall for the positive class) is more important than accuracy. Missing a pneumonia case (false negative) is far worse than a false alarm. Adjust the classification threshold to achieve ≥95% sensitivity, even at the cost of lower specificity.
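One simple way to pick such a threshold is to sort the predicted probabilities of the true positive cases on a validation set and place the cut-off just low enough that at most 5% of them fall below it. The sketch below does this with NumPy. The function name and the synthetic scores are assumptions for illustration, not real clinical data.

```python
import numpy as np

def threshold_for_sensitivity(labels, probs, target=0.95):
    """Return (threshold, sensitivity, specificity) with sensitivity >= target."""
    labels = np.asarray(labels)
    probs = np.asarray(probs)
    pos_probs = np.sort(probs[labels == 1])
    # Number of positive cases we can afford to miss and still reach the target
    n_miss = int(np.floor((1 - target) * len(pos_probs)))
    threshold = pos_probs[n_miss]
    preds = probs >= threshold
    sensitivity = (preds & (labels == 1)).sum() / (labels == 1).sum()
    specificity = (~preds & (labels == 0)).sum() / (labels == 0).sum()
    return threshold, sensitivity, specificity

# Synthetic validation scores (made up for illustration)
rng = np.random.default_rng(0)
labels = (rng.random(400) < 0.75).astype(int)        # roughly 3:1 positive to negative
probs = np.clip(0.5 + 0.25 * (2 * labels - 1) + rng.normal(0, 0.2, 400), 0, 1)

threshold, sensitivity, specificity = threshold_for_sensitivity(labels, probs)
print(f"threshold={threshold:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

Choose the threshold on a held-out validation split and report the resulting sensitivity and specificity on a separate test split; tuning and reporting on the same data overstates performance.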
Exercises
Ex 13.1: Implement Focal Loss to handle class imbalance: FL(pₜ) = -αₜ (1 - pₜ)^γ log(pₜ).
Solution
class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha, self.gamma = alpha, gamma

    def forward(self, logits, targets):
        ce = nn.functional.cross_entropy(logits, targets, reduction='none')  # -log(p_t)
        pt = torch.exp(-ce)                                                  # p_t
        return (self.alpha * (1 - pt) ** self.gamma * ce).mean()
Project: Full Medical AI Pipeline
Combine everything: data loading with class balancing → ResNet50 transfer learning → training with focal loss → evaluation with AUC-ROC and sensitivity/specificity → export to ONNX for hospital deployment. Target: ≥95% sensitivity, ≥85% specificity, AUC > 0.95.
Industry: Google Health's Chest X-Ray AI
Google Health deployed a chest X-ray AI system in hospitals across India and Thailand. Key engineering decisions: (1) Model ensembling: 3 ResNet variants averaged for robustness; (2) Calibration: temperature scaling ensures predicted probabilities match actual disease rates; (3) Human-in-the-loop: AI flags suspicious cases for radiologist review, reducing report turnaround from 11 days to 3 hours. The PyTorch models run on NVIDIA T4 GPUs in Google Cloud with ONNX Runtime serving.
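Temperature scaling, mentioned above, fits a single scalar T on held-out data and divides the logits by it before the softmax. The sketch below fits T by minimising cross-entropy with Adam on synthetic, deliberately overconfident logits. The function name, optimiser settings, and data are all assumptions for illustration and say nothing about how any production system is configured.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.05):
    """Fit a scalar temperature T > 0 that minimises NLL of softmax(logits / T)."""
    log_t = torch.zeros(1, requires_grad=True)     # optimise log T so T stays positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# Synthetic validation set: well-calibrated logits scaled by 5 to make them overconfident
torch.manual_seed(0)
labels = torch.randint(0, 2, (500,))
calibrated = torch.randn(500, 2) + F.one_hot(labels, 2).float()
logits = 5.0 * calibrated

T = fit_temperature(logits, labels)
nll_before = F.cross_entropy(logits, labels).item()
nll_after = F.cross_entropy(logits / T, labels).item()
print(f"T={T:.2f}  NLL before={nll_before:.3f}  after={nll_after:.3f}")
```

Because dividing by a positive scalar never changes the argmax, temperature scaling leaves accuracy and AUC untouched and only adjusts how confident the probabilities are.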
Key Takeaways
- Medical AI requires domain-specific augmentation: no arbitrary flips/rotations
- Use weighted sampling or focal loss for class imbalance
- Evaluate with sensitivity/specificity/AUC, not just accuracy
- Clinical deployment requires calibration, uncertainty, and human oversight
Industry Problem: Real-Time Object Detection
Learning Objectives
- Understand YOLO-style single-shot detection architecture
- Implement inference pipeline for video streams
- Optimize for edge deployment (quantization, TensorRT)
- Build a production monitoring dashboard for model drift
Object Detection with Pretrained Faster R-CNN
Python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2
from torchvision.models.detection import FasterRCNN_ResNet50_FPN_V2_Weights

# Load pretrained Faster R-CNN
weights = FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn_v2(weights=weights)
model.eval()

# Inference
img = torch.rand(3, 480, 640)  # C, H, W with values in [0, 1], as the model expects
with torch.no_grad():
    predictions = model([img])

# predictions[0] contains: boxes, labels, scores
boxes = predictions[0]['boxes']    # (N, 4): x1, y1, x2, y2
labels = predictions[0]['labels']  # (N,): class IDs
scores = predictions[0]['scores']  # (N,): confidence

# Filter low-confidence detections
keep = scores > 0.5
print(f"Detected {keep.sum()} objects")
Real-Time Video Inference
Python
import cv2
import time
import torch
from torchvision import transforms

def realtime_detection(model, video_source=0, conf_threshold=0.5):
    cap = cv2.VideoCapture(video_source)
    transform = transforms.Compose([transforms.ToTensor()])
    fps_history = []
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        start = time.time()
        img_tensor = transform(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        with torch.no_grad():
            preds = model([img_tensor.cuda()])[0]
        # Draw bounding boxes
        for box, score, label in zip(preds['boxes'], preds['scores'], preds['labels']):
            if score > conf_threshold:
                x1, y1, x2, y2 = box.int().tolist()
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
                cv2.putText(frame, f"{label.item()}: {score:.2f}",
                            (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
        fps = 1.0 / (time.time() - start)
        cv2.putText(frame, f"FPS: {fps:.1f}", (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
        cv2.imshow('Detection', frame)
        if cv2.waitKey(1) == 27:  # Esc key
            break
    cap.release()
    cv2.destroyAllWindows()
Edge Optimization
Python
# Quantize for mobile/edge deployment
model_int8 = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# Export to ONNX for TensorRT optimization
dummy = [torch.rand(3, 480, 640)]
torch.onnx.export(model, dummy, "detector.onnx", opset_version=17)
# TensorRT is often quoted as giving a 5-10× speedup on NVIDIA edge devices (Jetson)
Production Monitoring
Python
import json
import time
from collections import deque
import numpy as np

class DetectionMonitor:
    def __init__(self, window_size=100):
        self.latencies = deque(maxlen=window_size)
        self.confidences = deque(maxlen=window_size)
        self.detection_counts = deque(maxlen=window_size)

    def log(self, latency, preds):
        self.latencies.append(latency)
        scores = preds['scores'][preds['scores'] > 0.5]
        self.confidences.extend(scores.tolist())
        self.detection_counts.append(len(scores))

    def report(self):
        return {
            "avg_latency_ms": np.mean(self.latencies) * 1000,
            "p99_latency_ms": np.percentile(self.latencies, 99) * 1000,
            "avg_confidence": np.mean(self.confidences) if self.confidences else 0,
            "avg_detections": np.mean(self.detection_counts),
        }
Exercises
Ex 14.1: Add Non-Maximum Suppression (NMS) with IoU threshold 0.5 to filter overlapping detections.
Solution
from torchvision.ops import nms
keep_idx = nms(preds['boxes'], preds['scores'], iou_threshold=0.5)
filtered_boxes = preds['boxes'][keep_idx]
filtered_scores = preds['scores'][keep_idx]
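The solution above relies on torchvision.ops.nms. To show what the IoU threshold means, here is a plain-PyTorch reference sketch of IoU and greedy NMS on three hand-made boxes. The helper names `iou` and `greedy_nms` are illustrative, and this loop is far slower than the library routine.

```python
import torch

def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = torch.max(a[0], b[0]), torch.max(a[1], b[1])
    x2, y2 = torch.min(a[2], b[2]), torch.min(a[3], b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (inter / (area_a + area_b - inter)).item()

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = scores.argsort(descending=True).tolist()
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = torch.tensor([[0.0, 0.0, 10.0, 10.0],     # box 0
                      [1.0, 1.0, 11.0, 11.0],     # box 1: heavy overlap with box 0
                      [50.0, 50.0, 60.0, 60.0]])  # box 2: far away
scores = torch.tensor([0.9, 0.8, 0.7])

keep = greedy_nms(boxes, scores, iou_threshold=0.5)
print(keep, round(iou(boxes[0], boxes[1]), 3))
```

Boxes 0 and 1 overlap with IoU 81/119 (about 0.68), so the lower-scoring box 1 is suppressed while the distant box 2 survives.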
Ex 14.2: Implement a model drift detector that alerts when average confidence drops below a threshold.
Solution
def check_drift(monitor, threshold=0.7):
    report = monitor.report()
    if report["avg_confidence"] < threshold:
        print(f"⚠ DRIFT ALERT: Avg confidence {report['avg_confidence']:.3f} < {threshold}")
        return True
    return False
Project: Complete Detection API Server
Python
# FastAPI endpoint for object detection
import io
import torch
from fastapi import FastAPI, UploadFile
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn_v2

app = FastAPI()
model = fasterrcnn_resnet50_fpn_v2(weights="DEFAULT").eval().cuda()

@app.post("/detect")
async def detect(file: UploadFile):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    tensor = transforms.ToTensor()(image).cuda()
    with torch.no_grad():
        preds = model([tensor])[0]
    keep = preds['scores'] > 0.5
    return {
        "detections": [{
            "box": preds['boxes'][i].tolist(),
            "label": preds['labels'][i].item(),
            "confidence": preds['scores'][i].item()
        } for i in keep.nonzero().flatten().tolist()]
    }
Industry: Amazon Go β Cashierless Stores
Amazon Go stores use hundreds of ceiling cameras running real-time object detection to track customers and products. Their PyTorch-based detection pipeline processes 30+ camera feeds simultaneously, detecting product pickups/putbacks with sub-200ms latency. The system uses TensorRT-optimized models on NVIDIA Jetson edge devices, with a central GPU cluster for model updates and A/B testing. Detection accuracy directly impacts revenue: a 1% improvement in product tracking accuracy translates to millions in reduced shrinkage.
Why This Matters for AI
Object detection is the bridge between AI and the physical world. Self-driving cars, warehouse robotics, medical imaging, surveillance, and AR all depend on fast, accurate detection. Mastering the full pipeline, from model training to edge deployment to production monitoring, is what separates an ML researcher from an ML engineer.
Key Takeaways
- Use pretrained detection models (Faster R-CNN, YOLO) as starting points
- Real-time inference requires GPU + quantization + TensorRT optimization
- NMS filters overlapping detections, which is critical for clean output
- Production systems need monitoring: latency, confidence drift, accuracy degradation
- Edge deployment (Jetson, mobile) requires ONNX → TensorRT/CoreML conversion