Neural Networks & Deep Learning

Chapter 16: Autoencoders and Representation Learning

Compressing the World Into Vectors: From Identity Mapping to Generative Models

⏱️ Reading Time: ~3 hours  |  📖 Unit 5: Specialized Architectures  |  🧠 Theory + Code Chapter

📋 Prerequisites: Chapter 10 (Batch Normalization & Deep Networks), Chapter 12 (CNNs), Basic Probability (KL divergence)

Bloom's Taxonomy Map for This Chapter

Bloom's Level | What You'll Achieve
🔵 Remember | Recall the encoder-bottleneck-decoder architecture, list five autoencoder variants (vanilla, sparse, denoising, contractive, variational), and state the ELBO formula
🔵 Understand | Explain why undercomplete bottlenecks force compression, how the KL penalty induces sparsity, why the reparameterization trick enables backpropagation through sampling, and how AEs relate to PCA
🟢 Apply | Implement vanilla AE and VAE from scratch in NumPy; build PyTorch autoencoders for MNIST; compute reconstruction error for anomaly detection
🟡 Analyze | Trace information flow through encoder/decoder, analyze latent space structure, compare reconstruction loss landscapes across AE variants, decompose ELBO into reconstruction + KL terms
🟠 Evaluate | Critically compare AE vs PCA for dimensionality reduction, evaluate when to choose denoising vs contractive vs variational, justify architecture choices for anomaly detection vs generation
🔴 Create | Design a VAE for MNIST digit generation with latent space interpolation; propose an anomaly detection pipeline for financial transaction data
Section 1

Learning Objectives

After completing this chapter, you will be able to:

  1. Define the autoencoder framework (encoder function f(x), latent code z, decoder function g(z)) and explain why training it to reproduce its own input is not trivial
  2. Distinguish between undercomplete (bottleneck < input dim) and overcomplete (bottleneck ≥ input dim) autoencoders, and explain when each is useful
  3. Derive the sparse autoencoder loss with KL-divergence penalty and implement it in both NumPy and PyTorch
  4. Explain why denoising autoencoders learn more robust features than vanilla AEs, and how contractive autoencoders achieve a similar effect via a Jacobian penalty
  5. Derive the ELBO (Evidence Lower Bound) for Variational Autoencoders from first principles, including the reparameterization trick, and implement a complete VAE
  6. Apply autoencoders to real-world tasks: anomaly detection (Paytm UPI fraud), audio feature learning (Spotify), dimensionality reduction, and denoising
  7. Compare autoencoders with PCA, showing that a linear AE with MSE loss recovers the PCA subspace
  8. Visualize and interpret 2D latent spaces, perform latent space interpolation, and generate new samples from a trained VAE
Section 2

Opening Hook

🎬 200 Million Taste Profiles in 64 Dimensions

Netflix has over 200 million subscribers. Each user has watched, paused, rewound, and binged their way through thousands of hours of content. Their viewing history, across 17,000+ titles, could be represented as a sparse 17,000-dimensional vector. But storing, comparing, and reasoning over 200 million such vectors? Computationally nightmarish.

Instead, Netflix compresses each user into a dense 64-dimensional embedding: a compact "taste fingerprint" that captures whether you love dark Scandinavian thrillers, Bollywood rom-coms, or Studio Ghibli anime. Two users who are "close" in this 64-dimensional space get similar recommendations, even if they've never watched the same movie.

How do you learn this compression? You can't hand-engineer it, because the features are too abstract. Instead, you train a neural network to compress and reconstruct: squeeze 17,000 dimensions through a 64-neuron bottleneck, then try to recover the original vector. Whatever survives the bottleneck is the essential information. Everything else was noise.

This compress-then-reconstruct architecture is called an autoencoder, and it's one of the most elegant ideas in deep learning. In this chapter, you'll learn to build them from scratch, derive the math behind their probabilistic cousin (the VAE), and see how companies from Paytm to Spotify use them in production.

Section 3

The Intuition First

The Suitcase Analogy

Imagine you're packing for a two-week trip, but your airline only allows a tiny carry-on bag. You can't bring everything, so you must decide what's essential. You fold clothes tightly, choose versatile items, and leave behind the non-essentials.

Now imagine your friend at the destination has to reconstruct your entire wardrobe from what arrives in that tiny bag. If they can do it well (if the essentials you packed are enough to approximate everything you needed), then you've learned an excellent compression.

That's exactly what an autoencoder does:

  • Encoder = You packing the suitcase (compress 784 MNIST pixels → 32 numbers)
  • Bottleneck = The tiny carry-on bag (the 32-dimensional latent code z)
  • Decoder = Your friend unpacking and reconstructing (32 numbers → 784 pixels)
  • Training signal = How different is the reconstruction from the original?

[Diagram] The autoencoder: compress → remember → reconstruct. Input x (784-dim) → Encoder (compress) → Latent z (32-dim) → Decoder (reconstruct) → Output x̂ (784-dim). Loss = ‖x − x̂‖² (minimize this!)

The "Aha" Question

Why would you train a network to output its own input? That sounds trivially useless, since the identity function does this perfectly. The magic is in the constraint: by forcing the data through a narrow bottleneck, you prevent the network from learning the identity and instead force it to learn which features matter most. The bottleneck IS the learned representation.

The word "autoencoder" literally means "self-encoder": a network that learns to encode its own input. The concept dates back to 1986, when Rumelhart, Hinton, and Williams showed that networks with bottleneck layers learn useful internal representations. Hinton revisited them in 2006 to help launch the deep learning revolution.

Section 4 · 16.1

The Autoencoder Architecture

Formal Definition

An autoencoder consists of two functions:

Encoder: z = f_θ(x)      Decoder: x̂ = g_φ(z)
Loss: L(θ, φ) = ‖x − g_φ(f_θ(x))‖²

where:

  • x ∈ ℝ^n is the input (e.g., a 784-dimensional flattened MNIST image)
  • z ∈ ℝ^d is the latent code (bottleneck representation), with d < n for undercomplete AEs
  • x̂ ∈ ℝ^n is the reconstruction
  • θ, φ are the encoder and decoder parameters (weights & biases)

You train by minimizing the reconstruction loss over your dataset:

L = (1/N) Σᵢ ‖xᵢ − g_φ(f_θ(xᵢ))‖²

A Concrete Architecture

For MNIST (28×28 = 784 pixels), a simple autoencoder might look like:

[Diagram] Encoder: Input (784) → Dense 256 (ReLU) → Dense 128 (ReLU) → Dense 32 (linear). Decoder: Latent (32) → Dense 128 (ReLU) → Dense 256 (ReLU) → Dense 784 (sigmoid). Overall: 784 → 256 → 128 → 32 → 128 → 256 → 784, with the 32-unit layer as the bottleneck (latent code z).

Why sigmoid on the output? MNIST pixels are normalized to [0,1]. Using sigmoid ensures the output is in the same range, and you can use binary cross-entropy (BCE) instead of MSE. BCE often works better for binary/normalized data because it treats each pixel as a Bernoulli probability: L = −Σᵢ [xᵢ log x̂ᵢ + (1 − xᵢ) log(1 − x̂ᵢ)]

Step-by-step forward pass for a single MNIST image:

1. Flatten 28×28 image → vector x ∈ ℝ^784

2. Encoder layer 1: h₁ = ReLU(W₁x + b₁), where W₁ ∈ ℝ^(256×784)

3. Encoder layer 2: h₂ = ReLU(W₂h₁ + b₂), where W₂ ∈ ℝ^(128×256)

4. Bottleneck: z = W₃h₂ + b₃, where W₃ ∈ ℝ^(32×128) → z ∈ ℝ^32

5. Decoder layer 1: h₃ = ReLU(W₄z + b₄), where W₄ ∈ ℝ^(128×32)

6. Decoder layer 2: h₄ = ReLU(W₅h₃ + b₅), where W₅ ∈ ℝ^(256×128)

7. Output: x̂ = σ(W₆h₄ + b₆), where W₆ ∈ ℝ^(784×256)

8. Loss: L = ‖x − x̂‖² or BCE(x, x̂)

Total parameters: 784·256 + 256·128 + 128·32 + 32·128 + 128·256 + 256·784 = 475,136 weights, plus 1,584 biases, for ≈ 477K in total
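The parameter count is easy to check mechanically. The sketch below sums weights and biases for the layer widths listed in the forward pass; the helper name count_parameters is ours, introduced only for illustration.

```python
# Layer widths of the MNIST autoencoder: 784 -> 256 -> 128 -> 32 -> 128 -> 256 -> 784
layer_sizes = [784, 256, 128, 32, 128, 256, 784]

def count_parameters(sizes):
    """Return (weights, biases) for a stack of fully connected layers."""
    # Each layer has n_in * n_out weights and n_out biases
    weights = sum(n_in * n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])
    return weights, biases

weights, biases = count_parameters(layer_sizes)
total = weights + biases
print(f"weights={weights:,}  biases={biases:,}  total={total:,}")
```

Almost all of the parameters sit in the two 784×256 layers, which is why swapping the dense layers for convolutions (Chapter 12) shrinks image autoencoders so much.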

Section 5 · 16.2

Undercomplete vs Overcomplete Autoencoders

Undercomplete AE (d < n)

When the latent dimension d is smaller than the input dimension n, the autoencoder is undercomplete. It must learn to compress, because there simply isn't enough capacity in the bottleneck to copy the input through unchanged.

This is the classic, intuitive case. Think of it like summarizing a novel in a tweet: you're forced to extract the essential meaning.

Overcomplete AE (d ≥ n)

What if the latent dimension is larger than or equal to the input? Now the network has more capacity than needed: it can trivially learn the identity function by copying each input dimension to a latent dimension.

This sounds useless, but with the right regularization, overcomplete autoencoders learn powerful features:

  • Sparse AE: Penalize activations so most latent units are inactive → learns selective features
  • Denoising AE: Corrupt input, reconstruct clean version → learns robust features
  • Contractive AE: Penalize the Jacobian of the encoder → learns locally invariant features

Key Insight: The Bottleneck Isn't the Only Way to Learn

The Core Idea:

An autoencoder learns useful representations NOT just from having a narrow bottleneck, but from any constraint that prevents it from learning the trivial identity. The bottleneck is one such constraint, but sparsity, noise injection, and Jacobian penalties achieve the same goal in different ways.

Why It Matters:

Overcomplete autoencoders with regularization often learn better features than undercomplete ones, because they have more capacity to capture subtle patterns. They're just prevented from using that capacity to cheat.

Type | d vs n | Regularization | What Forces Learning
Undercomplete | d < n | None needed | Bottleneck compresses
Sparse | d ≥ n | KL sparsity | Most neurons stay off
Denoising | d ≥ n | Input corruption | Must denoise to reconstruct
Contractive | d ≥ n | Jacobian penalty | Encoder must be stable
Section 6 · 16.3

Sparse Autoencoders

The Intuition

Imagine a classroom of 500 students (neurons), but only 20 should raise their hands for any given question. Each student specializes in recognizing something specific. When you show a cat picture, only the "whiskers," "pointy ears," and "fur texture" students activate, while the rest stay silent. That's sparsity.

Mathematical Foundation

For each hidden neuron j, define the average activation over the training set:

ρ̂ⱼ = (1/N) Σᵢ aⱼ(xᵢ)

where aⱼ(xᵢ) is the activation of neuron j for input xᵢ (assumed to lie in (0, 1), e.g., a sigmoid unit, so that it can be read as a firing probability). We want ρ̂ⱼ ≈ ρ, where ρ is a small target (e.g., 0.05).

We penalize deviation from the target using KL divergence:

KL(ρ ‖ ρ̂ⱼ) = ρ log(ρ / ρ̂ⱼ) + (1 − ρ) log((1 − ρ) / (1 − ρ̂ⱼ))

The total sparse autoencoder loss becomes:

L_sparse = L_reconstruction + β Σⱼ KL(ρ ‖ ρ̂ⱼ)

where β controls the strength of the sparsity penalty (a common starting point is β = 3, ρ = 0.05).

Why KL divergence and not just L2 on activations?

You could use an L₂ penalty: Σⱼ (ρ̂ⱼ − ρ)². But that penalty is bounded: with ρ = 0.05 it can never exceed (1 − ρ)² ≈ 0.90 per neuron. The KL penalty, in contrast, grows without bound as ρ̂ⱼ approaches 0 or 1, so neurons far from the target are pushed back much harder. Because ρ is small, there is far more room to overshoot than to undershoot: a neuron that fires too often (ρ̂ⱼ = 0.8 when ρ = 0.05) pays a large penalty, while one that fires slightly too rarely (ρ̂ⱼ = 0.01) pays a small one. In practice this encourages neurons to stay inactive by default and only fire when they detect something truly important.

Numerically: KL(0.05 ‖ 0.80) = 0.05·ln(0.05/0.80) + 0.95·ln(0.95/0.20) ≈ 1.34

While: KL(0.05 ‖ 0.01) = 0.05·ln(0.05/0.01) + 0.95·ln(0.95/0.99) ≈ 0.041

In this example the "too active" neuron is penalized about 32× more than the "too inactive" one.

Sparse AE Loss: L = ‖x − x̂‖² + β Σⱼ KL(ρ ‖ ρ̂ⱼ)

ρ = target sparsity (e.g., 0.05), β = sparsity weight, ρ̂ⱼ = avg activation of neuron j

Key: KL is 0 when ρ̂ⱼ = ρ and grows without bound as ρ̂ⱼ → 0 or 1
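A minimal NumPy sketch of the sparsity penalty follows. It assumes sigmoid-like activations in (0, 1); the function name kl_sparsity_penalty, the clipping constant eps, and the random stand-in activations are our own choices for illustration, not part of the chapter's reference code.

```python
import numpy as np

def kl_sparsity_penalty(rho, rho_hat, eps=1e-8):
    """Sum over hidden units of KL(rho || rho_hat_j) for Bernoulli distributions."""
    # Clip to avoid log(0) when a unit is always off or always on
    rho_hat = np.clip(np.asarray(rho_hat, dtype=float), eps, 1.0 - eps)
    kl = rho * np.log(rho / rho_hat) + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    return float(np.sum(kl))

# Penalty for a single unit at different average activations
too_active = kl_sparsity_penalty(0.05, [0.80])
too_quiet = kl_sparsity_penalty(0.05, [0.01])
on_target = kl_sparsity_penalty(0.05, [0.05])
print(too_active, too_quiet, on_target)

# In training, rho_hat is the batch mean of each hidden unit's activation
rng = np.random.default_rng(0)
hidden = rng.uniform(0.0, 1.0, size=(256, 10))   # stand-in for sigmoid activations
rho_hat = hidden.mean(axis=0)                    # one average per hidden unit
beta = 3.0
sparsity_loss = beta * kl_sparsity_penalty(0.05, rho_hat)
```

The value sparsity_loss would be added to the reconstruction loss. Estimating ρ̂ⱼ from a mini-batch instead of the full training set is a common shortcut, so treat it as an approximation.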

Section 7 · 16.4

Denoising Autoencoders (DAE)

The Key Idea

Instead of training the autoencoder on clean data, you corrupt the input first (add Gaussian noise, randomly zero out pixels with masking noise, or apply salt-and-pepper noise) and then train the network to reconstruct the clean original.

[Diagram] Denoising autoencoder: clean x (e.g., [3, 7, 1]) → corrupt (noise / mask) → noisy x̃ (e.g., [3.2, 0, 1.5]) → encode → z → decode → x̂ ≈ x (e.g., [3.1, 6.9, 1.0]). Loss = ‖x − x̂‖², compared against the CLEAN x.

x̃ ~ q(x̃ | x)      (corruption process)
L_DAE = 𝔼_{q(x̃|x)} [‖x − g_φ(f_θ(x̃))‖²]

Common Corruption Methods

Method | Description | Typical Setting
Gaussian | x̃ = x + ε, ε ~ N(0, σ²) | σ = 0.3–0.5
Masking | Randomly set a fraction of pixels to 0 | 30–50% dropout
Salt & Pepper | Random pixels → 0 or 1 | 10–20% of pixels
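The three corruption methods can be sketched in a few lines of NumPy. This assumes inputs scaled to [0, 1]; the function names and default noise levels are illustrative choices taken from the ranges in the table.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(x, sigma=0.3):
    """Additive Gaussian corruption: x_tilde = x + eps, eps ~ N(0, sigma^2)."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def masking_noise(x, drop_frac=0.3):
    """Set a random fraction of entries to 0."""
    keep = rng.random(x.shape) >= drop_frac
    return x * keep

def salt_and_pepper(x, frac=0.1):
    """Set a random fraction of entries to 0 or 1 with equal probability."""
    x_noisy = x.copy()
    flip = rng.random(x.shape) < frac      # which entries get corrupted
    salt = rng.random(x.shape) < 0.5       # corrupted entries: 1 (salt) or 0 (pepper)
    x_noisy[flip & salt] = 1.0
    x_noisy[flip & ~salt] = 0.0
    return x_noisy

x = rng.random((4, 784))                   # stand-in for a batch of flattened images
x_gauss = gaussian_noise(x)
x_mask = masking_noise(x)
x_sp = salt_and_pepper(x)
```

During training you would feed the corrupted batch to the encoder but compute the loss against the clean x, drawing fresh noise for every batch.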

Why Does This Work?

By corrupting the input, you force the autoencoder to learn the statistical structure of the data distribution, not just memorize inputs. The network must figure out: "Given that pixel (3,7) is corrupted, what should it be, based on the surrounding pixels?" This requires learning the data manifold.

Paper: "Extracting and Composing Robust Features with Denoising Autoencoders", Vincent et al. (2008). This landmark paper showed that DAEs learn features that capture the structure of the data-generating distribution. Follow-up work (Vincent, 2011) showed that, for Gaussian corruption, the DAE objective is equivalent to score matching, i.e., learning the gradient of the log-density ∇ₓ log p(x). This connects denoising to score-based diffusion models (2020–2025). Modern diffusion models like DALL-E 2 and Stable Diffusion can be viewed, at their core, as very deep denoisers trained across many noise levels!

Section 8 · 16.5

Contractive Autoencoders (CAE)

The Intuition

Think of the encoder as a mapping from input space to latent space. A contractive autoencoder says: "Small changes in the input should produce even smaller changes in the latent code." In other words, the encoding should be locally robust: it shouldn't jump around wildly when you slightly perturb the input.

The Jacobian Penalty

Formally, we penalize the Frobenius norm of the Jacobian of the encoder:

L_CAE = ‖x − x̂‖² + λ ‖J_f(x)‖²_F

where J_f(x) = ∂f(x)/∂x ∈ ℝ^(d×n),   ‖J‖²_F = Σᵢⱼ J²ᵢⱼ

The Jacobian matrix J ∈ ℝ^(d×n) has entry Jᵢⱼ = ∂zᵢ/∂xⱼ: how much latent dimension i changes when input dimension j changes. Penalizing this forces the encoder to be insensitive to small input variations, learning only the directions that truly matter.
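For a single-layer sigmoid encoder z = σ(Wx + b), the Jacobian has the closed form Jᵢⱼ = zᵢ(1 − zᵢ)Wᵢⱼ, so the penalty needs no explicit Jacobian. The sketch below assumes exactly that encoder (an assumption of ours; deeper encoders generally need automatic differentiation) and checks the closed form against finite differences. The function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(W, b, x):
    """||J_f(x)||_F^2 for f(x) = sigmoid(W x + b), using J_ij = z_i (1 - z_i) W_ij."""
    z = sigmoid(W @ x + b)
    return float(np.sum((z * (1.0 - z)) ** 2 * np.sum(W ** 2, axis=1)))

def numerical_penalty(W, b, x, h=1e-6):
    """Same quantity via a central-difference Jacobian (slow, for checking only)."""
    d, n = W.shape
    J = np.zeros((d, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = h
        J[:, j] = (sigmoid(W @ (x + step) + b) - sigmoid(W @ (x - step) + b)) / (2 * h)
    return float(np.sum(J ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))    # d = 3 latent units, n = 5 inputs
b = rng.normal(size=3)
x = rng.normal(size=5)

analytic = contractive_penalty(W, b, x)
numeric = numerical_penalty(W, b, x)
print(analytic, numeric)
```

The closed form costs about as much as one extra forward pass, which is why the single-layer sigmoid case is the usual textbook setting for CAEs.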

❌ MYTH: "Contractive AEs and Denoising AEs do completely different things."

✅ TRUTH: They're deeply related! Rifai et al. (2011) showed that the DAE implicitly minimizes a term proportional to the Frobenius norm of the Jacobian. Denoising achieves contraction through stochastic corruption; CAE achieves it through an explicit penalty.

🔍 WHY IT MATTERS: Understanding this connection lets you choose: DAE if you want simplicity, CAE if you want precise control over robustness.

Section 9 · 16.6

Variational Autoencoders (VAE)

This is the crown jewel of this chapter. Buckle up: we'll derive everything from first principles.

The Problem with Regular Autoencoders

A vanilla AE learns a deterministic mapping: each input x maps to exactly one point z in latent space. The latent space has no imposed structure, so points can be scattered arbitrarily. If you sample a random point z from this latent space and decode it, you'll likely get garbage. Regular AEs compress but cannot generate.

The VAE Idea: Make the Latent Space Probabilistic

Instead of encoding x to a single point z, we encode it to a probability distribution: specifically, a Gaussian N(μ, σ²). Then we sample z from this distribution. The decoder must be able to reconstruct x from any sample drawn from the distribution, not just one specific point.

[Diagram] Variational autoencoder: input x → encoder (outputs μ and log σ²) → sample z = μ + σε with ε ~ N(0, 1) (the reparameterization trick) → decoder g(z) → output x̂.

The ELBO Derivation (From Scratch)

Goal: We want to maximize the log-likelihood of our data: log p(x).

Problem: Computing p(x) = ∫ p(x|z) p(z) dz is intractable: we'd need to integrate over all possible z.

Solution: Introduce an approximate posterior q_φ(z|x) and derive a tractable lower bound.

Step 1: Start with log p(x) and introduce q_φ(z|x):

log p(x) = log ∫ p(x,z) dz = log ∫ [p(x,z) / q(z|x)] · q(z|x) dz

Step 2: Apply Jensen's inequality (log of expectation ≥ expectation of log):

log p(x) ≥ ∫ q(z|x) log [p(x,z) / q(z|x)] dz

= 𝔼_q[log p(x,z) − log q(z|x)]

Step 3: Expand p(x,z) = p(x|z) · p(z):

= 𝔼_q[log p(x|z) + log p(z) − log q(z|x)]

= 𝔼_q[log p(x|z)] − 𝔼_q[log q(z|x) − log p(z)]

= 𝔼_q[log p(x|z)] − KL(q(z|x) ‖ p(z))

This is the ELBO!

ELBO = 𝔼_{q_φ(z|x)}[log p(x|z)] − KL(q_φ(z|x) ‖ p(z))

First term (reconstruction): "How well can I reconstruct x from z?"
Second term (KL regularizer): "How close is q(z|x) to the prior p(z)?"

The Two Terms Explained

Term 1: Reconstruction term, 𝔼_q[log p(x|z)]

This is the expected log-likelihood of the data given the latent code. Maximizing it is the same as minimizing a reconstruction loss. In practice:

  • For binary/normalized data: Binary Cross-Entropy between x and x̂
  • For continuous data: MSE (equivalent to Gaussian p(x|z) with fixed variance)

Term 2: KL Divergence, KL(q(z|x) ‖ p(z))

This regularizes the latent space, pushing q(z|x) = N(μ, σ²) toward the prior p(z) = N(0, I). It has a closed-form solution for two Gaussians:

KL(N(μ, σ²) ‖ N(0, I)) = −½ Σⱼ (1 + log σ²ⱼ − μ²ⱼ − σ²ⱼ)

Deriving the KL term for univariate Gaussians:

KL(N(μ,σ²) ‖ N(0,1)) = ∫ q(z) log [q(z)/p(z)] dz

= ∫ q(z) [log q(z) − log p(z)] dz

= −½ log(2πσ²) − ½ + ½ log(2π) + ½(μ² + σ²)

(using the fact that 𝔼[z²] = μ² + σ² for z ~ N(μ,σ²) and that the entropy of N(μ,σ²) is ½ log(2πeσ²))

= −½ log σ² − ½ + ½μ² + ½σ²

= −½(1 + log σ² − μ² − σ²)

For a d-dimensional diagonal Gaussian, sum over all dimensions j.
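With both ELBO terms available in closed form, the bound can be checked on a toy model where log p(x) is known exactly. The sketch below assumes a one-dimensional model of our own choosing (not from the chapter): prior p(z) = N(0, 1) and likelihood p(x|z) = N(z, 1), which gives p(x) = N(0, 2) and exact posterior N(x/2, 1/2).

```python
import math

def log_evidence(x):
    """Exact log p(x) for the toy model: p(x) = N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

def elbo(x, m, s2):
    """ELBO for q(z|x) = N(m, s2), with both terms in closed form."""
    # E_q[log p(x|z)] with p(x|z) = N(z, 1); note E_q[(x - z)^2] = (x - m)^2 + s2
    recon = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    # KL(N(m, s2) || N(0, 1)) = -1/2 (1 + log s2 - m^2 - s2)
    kl = -0.5 * (1.0 + math.log(s2) - m ** 2 - s2)
    return recon - kl

x = 1.0
exact = log_evidence(x)
tight = elbo(x, m=x / 2.0, s2=0.5)   # q equals the true posterior
loose = elbo(x, m=0.0, s2=1.0)       # q equals the prior
print(exact, tight, loose)
```

The bound is tight when q matches the true posterior and strictly lower otherwise; the gap is KL(q(z|x) ‖ p(z|x)), which is what training the encoder shrinks.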

The Reparameterization Trick

Here's a subtle problem: how do you backpropagate through a sampling operation? If z ~ N(μ, σ²), the sampling is stochastic, and gradients can't flow through random number generators.

Solution: Instead of sampling z directly, reparameterize:

z = μ + σ ⊙ ε,    where ε ~ N(0, I)

Now the randomness is in ε (which has no learnable parameters), and z is a deterministic function of μ and σ. Gradients flow through μ and σ just fine!

[Diagram] The reparameterization trick. Can't backprop: μ and σ feed a stochastic "sample" node that produces z. Can backprop: z = μ + σ × ε, with ε ~ N(0, I) entering as an external input (no gradient needed there).
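A small NumPy sketch makes the trick concrete. It assumes the encoder outputs μ and log σ² (called logvar here); the function names and example values are illustrative. Because ε is drawn outside the differentiable path, the derivatives of z are simple: ∂z/∂μ = 1 and ∂z/∂(log σ²) = ½σε.

```python
import numpy as np

def reparameterize(mu, logvar, eps):
    """z = mu + sigma * eps, with sigma = exp(0.5 * logvar)."""
    sigma = np.exp(0.5 * logvar)
    return mu + sigma * eps

def reparameterize_grads(logvar, eps):
    """Elementwise partial derivatives of z with respect to mu and logvar."""
    sigma = np.exp(0.5 * logvar)
    dz_dmu = np.ones_like(sigma)
    dz_dlogvar = 0.5 * sigma * eps
    return dz_dmu, dz_dlogvar

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.3])
logvar = np.array([-0.8, 0.2])
eps = rng.standard_normal(mu.shape)     # the only source of randomness

z = reparameterize(mu, logvar, eps)
dz_dmu, dz_dlogvar = reparameterize_grads(logvar, eps)

# Finite-difference check with the SAME eps: z is a smooth function of logvar
h = 1e-6
fd = (reparameterize(mu, logvar + h, eps) - reparameterize(mu, logvar - h, eps)) / (2 * h)

# Many draws of eps recover the intended distribution N(mu, sigma^2)
samples = reparameterize(mu, logvar, rng.standard_normal((200_000, 2)))
print(samples.mean(axis=0), samples.std(axis=0))
```

In a full VAE these two derivatives are multiplied by the decoder's gradient with respect to z and then backpropagated into the encoder.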

❌ MYTH: "The VAE encoder outputs z directly."

✅ TRUTH: The encoder outputs TWO vectors: μ and log σ² (not σ² directly: we use log σ² for numerical stability, since σ² must be positive while log σ² can be any real number). The actual z is then sampled using the reparameterization trick.

🔍 WHY IT MATTERS: If you code the encoder to output σ² directly, you'll need to constrain it to be positive (e.g., softplus). Using log σ² is simpler and more stable.

VAE Loss = −ELBO = Reconstruction + KL

Reconstruction: BCE(x, x̂) or MSE(x, x̂)

KL: −½ Σⱼ (1 + log σ²ⱼ − μ²ⱼ − σ²ⱼ)

Reparameterization: z = μ + σ ⊙ ε, ε ~ N(0, I)

Encoder outputs: μ and log σ² (not z directly!)

Section 10 · 16.7

Autoencoders vs PCA

The Theorem: Linear AE = PCA

Here's a beautiful result. Consider a single-layer autoencoder with no activation function (linear encoder and decoder) trained with MSE loss:

z = Wx + b₁      x̂ = W'z + b₂      L = ‖x − x̂‖²

At the optimum, the columns of the decoder W' span the same subspace as the top-d principal components of the data! The latent code z is the PCA projection up to an invertible linear change of basis (a rotation is the simplest case).

Proof sketch:

1. The linear AE minimizes L = 𝔼[‖x − W'Wx‖²] (assuming zero-mean data, so the biases can be dropped).

2. Since W'W has rank at most d, this is exactly the rank-d projection error, the same objective PCA minimizes.

3. The columns of W' (the decoder matrix) must span the same subspace as the top-d eigenvectors of the data covariance matrix Σ = 𝔼[xxᵀ].

4. The key insight: the encoder and decoder learn complementary linear maps. W projects onto a d-dimensional subspace, and W' reconstructs from it. The optimal subspace is the one that preserves the most variance, which is exactly PCA.

5. Caveat: the learned weights won't necessarily equal the PCA eigenvectors. They could be any rotation, or more generally any invertible linear mix, of them. But the subspace is the same.
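The caveat in step 5 can be checked numerically without training anything. The sketch below builds the PCA solution directly, then shows that mixing the latent basis with an invertible matrix leaves the reconstruction error unchanged, while a random subspace does worse. The synthetic data and the matrix A are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic zero-mean data with correlated features (illustrative only)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
X = X - X.mean(axis=0)

# PCA via eigendecomposition of the covariance matrix
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)            # eigh returns ascending eigenvalues
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

d = 2
U = eigvecs[:, :d]                                # (n, d): top-d principal directions

def recon_error(W_enc, W_dec):
    """Mean squared reconstruction error of the linear AE x_hat = W_dec @ W_enc @ x."""
    X_hat = (X @ W_enc.T) @ W_dec.T
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))

# PCA as a linear AE: encoder U^T, decoder U
pca_error = recon_error(U.T, U)

# Mix the latent basis with an invertible matrix A: same subspace, same error
A = np.array([[2.0, 1.0], [0.5, 3.0]])
mixed_error = recon_error(A @ U.T, U @ np.linalg.inv(A))

# A random d-dimensional subspace cannot beat the PCA subspace
Q, _ = np.linalg.qr(rng.normal(size=(6, d)))
random_error = recon_error(Q.T, Q)

print(pca_error, mixed_error, random_error, eigvals[d:].sum())
```

The PCA error equals the sum of the discarded eigenvalues, which is the variance the bottleneck cannot carry.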

When AEs Beat PCA

Feature | PCA | Nonlinear AE
Transformation | Linear only | Arbitrary nonlinear
Data manifold | Flat hyperplane | Curved manifold
Computation | Eigendecomposition, O(n³) | SGD training, O(epochs)
Interpretability | Eigenvectors are interpretable | Latent dims are opaque
Scalability | Struggles with n > 10K | Handles millions of features
New data | Simple projection | Forward pass through encoder

Interview Gold: If asked "AE vs PCA?", start with the equivalence (linear AE = PCA), then explain how nonlinearity gives AEs strictly more expressive power. Mention that PCA is still preferred when you want interpretability or when the data relationships are approximately linear.

Section 11 · 16.8

Anomaly Detection with Autoencoders

The Core Idea

Train an autoencoder on normal data only. The AE learns to compress and reconstruct normal patterns well. When an anomalous input arrives, the AE fails to reconstruct it, and the reconstruction error spikes. You flag any input with reconstruction error above a threshold as anomalous.

anomaly_score(x) = ‖x − g(f(x))‖²
Flag as anomaly if anomaly_score(x) > τ
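The scoring rule takes only a few lines once you have a reconstruction function. In the sketch below, a projection onto a line stands in for a trained autoencoder g(f(x)); that stand-in, the synthetic data, and the 99.5th-percentile threshold are assumptions made for illustration.

```python
import numpy as np

def anomaly_scores(x, reconstruct):
    """Squared reconstruction error per row: ||x - g(f(x))||^2."""
    x_hat = reconstruct(x)
    return np.sum((x - x_hat) ** 2, axis=1)

rng = np.random.default_rng(0)
u = np.array([1.0, 1.0]) / np.sqrt(2.0)    # "normal" data lies near this direction

def project_onto_line(x):
    """Stand-in for a trained AE: a 1-D linear bottleneck along u."""
    return np.outer(x @ u, u)

# Normal data: points along u plus small noise
normal = np.outer(rng.normal(size=1000), u) + 0.05 * rng.normal(size=(1000, 2))
normal_scores = anomaly_scores(normal, project_onto_line)

# Threshold tau: a high percentile of the scores on normal data
tau = np.percentile(normal_scores, 99.5)

# One point on the normal pattern, one far off it
candidates = np.array([[1.0, 1.0], [3.0, -3.0]])
flags = anomaly_scores(candidates, project_onto_line) > tau
print(tau, flags)
```

With a real autoencoder you would replace project_onto_line with the model's forward pass and compute tau on a held-out set of normal examples.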

🇮🇳 Indian Industry: Paytm UPI Anomaly Detection

Case Study: Detecting Fraud in 8 Billion Monthly Transactions

Paytm processes over 8 billion UPI transactions per month, roughly 3,000 per second on average, with spikes during festivals like Diwali reaching 10,000+ TPS. Fraud accounts for a tiny fraction (<0.01%), but at this scale, even 0.01% means 800,000 potentially fraudulent transactions per month.

The Architecture

Each transaction is featurized into a ~120-dimensional vector:

  • Transaction features: amount (log-scaled), time-of-day (cyclical encoding), day-of-week, merchant category
  • User behavior: rolling 7-day average spend, typical transaction time, usual merchant types, device fingerprint hash
  • Graph features: sender-receiver trust score, merchant reputation score, network centrality

An autoencoder (120 → 64 → 32 → 64 → 120) is trained on millions of legitimate transactions. At inference:

  1. Compute reconstruction error for each incoming transaction
  2. If error > adaptive threshold (based on rolling 99.5th percentile), flag for review
  3. Secondary classifier (gradient-boosted trees) on flagged transactions to reduce false positives

Results

  • 95% recall on known fraud patterns
  • 70% recall on novel/zero-day fraud (vs 20% for rule-based systems)
  • False positive rate: ~0.3% (manageable for human review)
  • Inference latency: <5ms per transaction on GPU

Key insight: The autoencoder excels at catching novel fraud patterns, types never seen before. Rule-based systems catch known patterns; the AE catches the unknown unknowns.

🇺🇸 US/Global Industry: Spotify Audio Feature Learning

Case Study: Compressing 100M Songs into Embeddings

Spotify has over 100 million tracks in its catalog. For personalized recommendations, the platform needs a compact representation of each track's audio characteristics. Human-labeled features (genre, mood) are expensive and inconsistent. Instead, Spotify uses autoencoders to learn audio features directly from mel-spectrograms.

The Pipeline

  1. Input: 3-second mel-spectrogram clips (128 mel bins × 130 time frames = 16,640 dimensions)
  2. Encoder: Convolutional AE with 4 conv layers → bottleneck of 128 dimensions
  3. Training: Trained on 10M tracks with MSE reconstruction loss + VQ-VAE quantization
  4. Output: 128-dimensional audio embedding per track

Applications

  • Cold-start recommendations: New tracks with no play history get embeddings from audio alone
  • Music similarity: cosine similarity in latent space correlates with human "sounds-like" judgments
  • Anomaly detection: Identifying mislabeled tracks, corrupted audio, or AI-generated content
  • Playlist generation: Smooth interpolation in latent space for seamless transitions between tracks

Key insight: The AE-learned features capture musical properties (tempo, energy, instrumentalness) without explicit labels. They emerge naturally from reconstruction pressure.

🇮🇳 Paytm: Anomaly Detection

  • Task: Fraud detection in 8B monthly UPI txns
  • AE role: Reconstruction error as anomaly score
  • Input: 120-dim transaction feature vector
  • Bottleneck: 32 dimensions
  • Challenge: Ultra-low latency (<5ms), extreme class imbalance (<0.01% fraud)
  • Bonus: Catches zero-day fraud patterns

🇺🇸 Spotify: Audio Features

  • Task: Learn audio embeddings for 100M tracks
  • AE role: Compress mel-spectrogram → embedding
  • Input: 16,640-dim spectrogram
  • Bottleneck: 128 dimensions
  • Challenge: Scale, cold-start problem, diverse genres
  • Bonus: Emergent musical features without labels

Job Roles Using Autoencoders:

  • ML Engineer (Fraud/Risk): Build anomaly detection pipelines at Paytm, Razorpay, Stripe, PayPal. Salary: ₹25-55 LPA (India) / $150-250K (US)
  • Research Scientist (Generative AI): Design VAE architectures for drug discovery, image generation, music synthesis. Salary: ₹30-60 LPA / $180-300K
  • Data Scientist (RecSys): Build embedding systems for Netflix, Spotify, Amazon. Salary: ₹20-45 LPA / $140-220K
  • Applied Scientist (NLP): Sentence embeddings with autoencoder pre-training. Salary: ₹22-50 LPA / $160-280K
Section 12

Worked Examples

Example 1: By-Hand Forward Pass (Vanilla AE)

Hand Calculation: 3→2→3 Autoencoder

Setup:

Input: x = [0.8, 0.4, 0.1]

Encoder weights: W₁ = [[0.5, 0.3, -0.2], [-0.1, 0.7, 0.4]], b₁ = [0.1, -0.1]

Decoder weights: W₂ = [[0.6, -0.3], [0.2, 0.8], [-0.4, 0.5]], b₂ = [0.05, 0.02, -0.03]

Activation: ReLU (encoder), Sigmoid (decoder)

Step 1: Encode

h = W₁x + b₁

h₁ = 0.5(0.8) + 0.3(0.4) + (-0.2)(0.1) + 0.1 = 0.40 + 0.12 - 0.02 + 0.1 = 0.60

h₂ = (-0.1)(0.8) + 0.7(0.4) + 0.4(0.1) + (-0.1) = -0.08 + 0.28 + 0.04 - 0.1 = 0.14

z = ReLU(h) = [ReLU(0.60), ReLU(0.14)] = [0.60, 0.14]

Step 2: Decode

o = W₂z + b₂

o₁ = 0.6(0.60) + (-0.3)(0.14) + 0.05 = 0.36 - 0.042 + 0.05 = 0.368

o₂ = 0.2(0.60) + 0.8(0.14) + 0.02 = 0.12 + 0.112 + 0.02 = 0.252

o₃ = (-0.4)(0.60) + 0.5(0.14) + (-0.03) = -0.24 + 0.07 - 0.03 = -0.200

x̂ = σ(o) = [σ(0.368), σ(0.252), σ(-0.200)] = [0.591, 0.563, 0.450]

Step 3: Compute Loss (MSE)

L = (1/3)[(0.8-0.591)² + (0.4-0.563)² + (0.1-0.450)²]

= (1/3)[0.0437 + 0.0266 + 0.1225] = 0.0643
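The hand calculation can be reproduced in NumPy as a check. The sketch uses exactly the weights, biases, and activations from the setup above.

```python
import numpy as np

x = np.array([0.8, 0.4, 0.1])

# Encoder (3 -> 2) and decoder (2 -> 3) parameters from the worked example
W1 = np.array([[0.5, 0.3, -0.2],
               [-0.1, 0.7, 0.4]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[0.6, -0.3],
               [0.2, 0.8],
               [-0.4, 0.5]])
b2 = np.array([0.05, 0.02, -0.03])

# Step 1: encode with ReLU
z = np.maximum(0.0, W1 @ x + b1)

# Step 2: decode with sigmoid
o = W2 @ z + b2
x_hat = 1.0 / (1.0 + np.exp(-o))

# Step 3: mean squared error over the 3 outputs
loss = np.mean((x - x_hat) ** 2)

print("z     =", np.round(z, 3))
print("o     =", np.round(o, 3))
print("x_hat =", np.round(x_hat, 3))
print("loss  =", round(float(loss), 4))
```

The reconstruction is poor because these weights are arbitrary. Training would adjust W₁, W₂, b₁, b₂ to drive this loss down.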

Example 2: Computing VAE KL Divergence

VAE KL Term: Numerical Calculation

Setup:

Encoder outputs for a single input x: μ = [0.5, -0.3], log σ² = [-0.8, 0.2]

So: σ² = [exp(-0.8), exp(0.2)] = [0.449, 1.221]

Prior: p(z) = N(0, I)

KL Computation:

KL = -½ Σⱼ (1 + log σ²ⱼ - μ²ⱼ - σ²ⱼ)

For j=1: -½(1 + (-0.8) - 0.25 - 0.449) = -½(-0.499) = 0.250

For j=2: -½(1 + 0.2 - 0.09 - 1.221) = -½(-0.111) = 0.056

Total KL = 0.250 + 0.056 = 0.306 (0.305 if the two terms are not rounded first)

Interpretation:

Dimension 1 contributes more to KL because μ₁ = 0.5 is further from 0 and σ₁² = 0.449 is further from 1. The KL term pushes every (μⱼ, σ²ⱼ) toward (0, 1).
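The same numbers can be checked in code, and the closed form can be compared against a brute-force numerical integral of ∫ q(z) log[q(z)/p(z)] dz. The function names and integration grid below are our own choices for illustration.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Per-dimension closed-form KL(N(mu, exp(logvar)) || N(0, 1))."""
    mu = np.asarray(mu, dtype=float)
    logvar = np.asarray(logvar, dtype=float)
    return -0.5 * (1.0 + logvar - mu ** 2 - np.exp(logvar))

def kl_numeric_1d(mu, logvar, lo=-12.0, hi=12.0, n=200_001):
    """Riemann-sum approximation of the KL integral for one dimension."""
    z = np.linspace(lo, hi, n)
    dz = z[1] - z[0]
    var = np.exp(logvar)
    log_q = -0.5 * np.log(2.0 * np.pi * var) - (z - mu) ** 2 / (2.0 * var)
    log_p = -0.5 * np.log(2.0 * np.pi) - z ** 2 / 2.0
    return float(np.sum(np.exp(log_q) * (log_q - log_p)) * dz)

mu = [0.5, -0.3]
logvar = [-0.8, 0.2]

per_dim = kl_to_standard_normal(mu, logvar)
total = float(per_dim.sum())
numeric = [kl_numeric_1d(m, lv) for m, lv in zip(mu, logvar)]

print("closed form per dimension:", np.round(per_dim, 4))
print("numerical per dimension:  ", np.round(numeric, 4))
print("total KL:", round(total, 4))
```

In a VAE training loop only the closed form is used; the numerical integral is there purely as a sanity check.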

Example 3: Anomaly Detection Threshold (Indian Industry)

Setting Anomaly Threshold at Paytm

Scenario:

You've trained an AE on 1 million legitimate UPI transactions. The reconstruction errors on a validation set of 100,000 normal transactions have mean = 0.023 and std = 0.012. For this quick estimate we treat them as approximately normal. (Real reconstruction errors are usually right-skewed, so with real data read the percentile directly off the validation errors.)

Step 1: Compute percentiles

99th percentile of reconstruction error on normal data: 0.023 + 2.33 × 0.012 ≈ 0.051

99.5th percentile: 0.023 + 2.58 × 0.012 ≈ 0.054

Step 2: Set threshold

Choose τ = 0.054 (99.5th percentile). This means ~0.5% of normal transactions will be false positives.

Step 3: Evaluate on known fraud

On a test set of 500 known fraudulent transactions, reconstruction errors: mean = 0.187, std = 0.095.

Fraction above τ = 0.054: the threshold sits (0.054 − 0.187)/0.095 ≈ −1.40 standard deviations from the fraud mean, so approximately 92% of fraud errors exceed it → recall = 0.92

Step 4: Business decision

0.5% FPR on 8B transactions/month = 40M false alerts → too many for human review!

Solution: Use the AE as a first filter; a secondary classifier (XGBoost) on flagged transactions then reduces the FPR to 0.03%.
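The threshold arithmetic can be reproduced with Python's standard library. The sketch treats both error distributions as normal, which is a simplifying assumption for this estimate; the variable names are ours.

```python
from statistics import NormalDist

# Reconstruction-error distributions under the normal approximation
normal_errors = NormalDist(mu=0.023, sigma=0.012)   # legitimate transactions
fraud_errors = NormalDist(mu=0.187, sigma=0.095)    # known fraud

# Steps 1-2: threshold at the 99.5th percentile of normal errors
tau = normal_errors.inv_cdf(0.995)

# Step 3: recall = fraction of fraud errors above the threshold
recall = 1.0 - fraud_errors.cdf(tau)

# Step 4: expected false alerts per month at a 0.5% false positive rate
monthly_txns = 8_000_000_000
false_alerts = monthly_txns * 5 // 1000             # 0.5% = 5 per 1000

print(f"tau = {tau:.4f}, recall = {recall:.3f}, false alerts = {false_alerts:,}")
```

With real validation errors you would skip the distributional assumption and take the empirical percentile instead, for example with numpy.percentile.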

Section 13

Python Implementation: From Scratch (NumPy)

Vanilla Autoencoder (NumPy)

Python · NumPy
import numpy as np

class VanillaAutoencoder:
    """Vanilla AE: 784 → 128 → 32 → 128 → 784"""
    def __init__(self, input_dim=784, hidden_dim=128, latent_dim=32, lr=0.001):
        self.lr = lr
        # He initialization: std = sqrt(2 / fan_in), suited to the ReLU layers
        self.W1 = np.random.randn(input_dim, hidden_dim) * np.sqrt(2.0 / input_dim)
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, latent_dim) * np.sqrt(2.0 / hidden_dim)
        self.b2 = np.zeros(latent_dim)
        self.W3 = np.random.randn(latent_dim, hidden_dim) * np.sqrt(2.0 / latent_dim)
        self.b3 = np.zeros(hidden_dim)
        self.W4 = np.random.randn(hidden_dim, input_dim) * np.sqrt(2.0 / hidden_dim)
        self.b4 = np.zeros(input_dim)

    def relu(self, x):
        return np.maximum(0, x)

    def relu_grad(self, x):
        return (x > 0).astype(float)

    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

    def forward(self, x):
        # Encoder
        self.z1 = x @ self.W1 + self.b1
        self.a1 = self.relu(self.z1)                # (batch, 128)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.latent = self.z2                        # (batch, 32), linear bottleneck
        # Decoder
        self.z3 = self.latent @ self.W3 + self.b3
        self.a3 = self.relu(self.z3)                # (batch, 128)
        self.z4 = self.a3 @ self.W4 + self.b4
        self.output = self.sigmoid(self.z4)          # (batch, 784)
        return self.output

    def compute_loss(self, x, x_hat):
        # Binary cross-entropy
        eps = 1e-8
        bce = -np.mean(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
        return bce

    def backward(self, x):
        batch_size = x.shape[0]
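        # Gradient w.r.t. the pre-sigmoid logits (BCE + sigmoid shortcut), summed over
        # features and averaged over the batch: input_dim times the scale of the np.mean loss above
        dz4 = (self.output - x) / batch_size          # (batch, 784)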
        dW4 = self.a3.T @ dz4
        db4 = dz4.sum(axis=0)
        da3 = dz4 @ self.W4.T
        dz3 = da3 * self.relu_grad(self.z3)
        dW3 = self.latent.T @ dz3
        db3 = dz3.sum(axis=0)
        dlatent = dz3 @ self.W3.T
        # Latent layer is linear, so dz2 = dlatent
        dz2 = dlatent
        dW2 = self.a1.T @ dz2
        db2 = dz2.sum(axis=0)
        da1 = dz2 @ self.W2.T
        dz1 = da1 * self.relu_grad(self.z1)
        dW1 = x.T @ dz1
        db1 = dz1.sum(axis=0)
        # Update weights
        for W, dW in [(self.W1, dW1), (self.W2, dW2),
                        (self.W3, dW3), (self.W4, dW4)]:
            W -= self.lr * dW
        for b, db in [(self.b1, db1), (self.b2, db2),
                        (self.b3, db3), (self.b4, db4)]:
            b -= self.lr * db

    def train_step(self, x_batch):
        x_hat = self.forward(x_batch)
        loss = self.compute_loss(x_batch, x_hat)
        self.backward(x_batch)
        return loss

# ── Usage ──
# from sklearn.datasets import fetch_openml
# mnist = fetch_openml('mnist_784', version=1)
# X = mnist.data.values / 255.0  # normalize to [0,1]
# ae = VanillaAutoencoder()
# for epoch in range(50):
#     for i in range(0, len(X), 128):
#         loss = ae.train_step(X[i:i+128])
#     print(f"Epoch {epoch}: loss={loss:.4f}")

Variational Autoencoder: NumPy

Python · NumPy
class VAE_NumPy:
    """Variational AE: 784 → 256 → (μ, logvar) → z (2-dim) → 256 → 784"""
    def __init__(self, input_dim=784, hidden=256, latent=2, lr=0.001):
        self.lr = lr
        self.latent = latent
        # Encoder: input → hidden → (mu, logvar)
        self.W1 = np.random.randn(input_dim, hidden) * np.sqrt(2.0/input_dim)
        self.b1 = np.zeros(hidden)
        self.W_mu = np.random.randn(hidden, latent) * np.sqrt(2.0/hidden)
        self.b_mu = np.zeros(latent)
        self.W_lv = np.random.randn(hidden, latent) * np.sqrt(2.0/hidden)
        self.b_lv = np.zeros(latent)
        # Decoder: z → hidden → output
        self.W3 = np.random.randn(latent, hidden) * np.sqrt(2.0/latent)
        self.b3 = np.zeros(hidden)
        self.W4 = np.random.randn(hidden, input_dim) * np.sqrt(2.0/hidden)
        self.b4 = np.zeros(input_dim)

    def relu(self, x): return np.maximum(0, x)
    def relu_d(self, x): return (x > 0).astype(float)
    def sigmoid(self, x): return 1/(1+np.exp(-np.clip(x,-500,500)))

    def forward(self, x):
        B = x.shape[0]
        # Encode
        self.h1_pre = x @ self.W1 + self.b1
        self.h1 = self.relu(self.h1_pre)
        self.mu = self.h1 @ self.W_mu + self.b_mu       # (B, latent)
        self.logvar = self.h1 @ self.W_lv + self.b_lv    # (B, latent)
        # Reparameterize: z = mu + exp(0.5*logvar) * eps
        self.eps = np.random.randn(B, self.latent)
        self.std = np.exp(0.5 * self.logvar)
        self.z = self.mu + self.std * self.eps           # (B, latent)
        # Decode
        self.h3_pre = self.z @ self.W3 + self.b3
        self.h3 = self.relu(self.h3_pre)
        self.out_pre = self.h3 @ self.W4 + self.b4
        self.x_hat = self.sigmoid(self.out_pre)          # (B, 784)
        return self.x_hat

    def loss(self, x):
        eps = 1e-8
        # Reconstruction: BCE
        recon = -np.sum(x*np.log(self.x_hat+eps) + (1-x)*np.log(1-self.x_hat+eps)) / x.shape[0]
        # KL: -0.5 * sum(1 + logvar - mu^2 - exp(logvar))
        kl = -0.5 * np.sum(1 + self.logvar - self.mu**2 - np.exp(self.logvar)) / x.shape[0]
        return recon + kl, recon, kl

    def backward(self, x):
        B = x.shape[0]
        # ── Decoder gradients ──
        d_out = (self.x_hat - x) / B                    # BCE+sigmoid shortcut
        dW4 = self.h3.T @ d_out
        db4 = d_out.sum(0)
        dh3 = d_out @ self.W4.T * self.relu_d(self.h3_pre)
        dW3 = self.z.T @ dh3
        db3 = dh3.sum(0)
        dz = dh3 @ self.W3.T                            # (B, latent)
        # ── Reparameterization gradients ──
        d_mu = dz + (self.mu) / B                        # recon + KL grad w.r.t. μ
        d_logvar = dz * 0.5 * self.std * self.eps \
                   + 0.5 * (np.exp(self.logvar) - 1) / B  # KL grad w.r.t. logvar
        # ── Encoder gradients ──
        dW_mu = self.h1.T @ d_mu
        db_mu = d_mu.sum(0)
        dW_lv = self.h1.T @ d_logvar
        db_lv = d_logvar.sum(0)
        dh1 = (d_mu @ self.W_mu.T + d_logvar @ self.W_lv.T) * self.relu_d(self.h1_pre)
        dW1 = x.T @ dh1
        db1 = dh1.sum(0)
        # ── Update all weights ──
        for p, g in [(self.W1,dW1),(self.b1,db1),(self.W_mu,dW_mu),(self.b_mu,db_mu),
                       (self.W_lv,dW_lv),(self.b_lv,db_lv),(self.W3,dW3),(self.b3,db3),
                       (self.W4,dW4),(self.b4,db4)]:
            p -= self.lr * g

    def train_step(self, x):
        self.forward(x)
        total, recon, kl = self.loss(x)
        self.backward(x)
        return total, recon, kl

# Usage: vae = VAE_NumPy(latent=2); vae.train_step(batch)
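Hand-written backward passes are easy to get subtly wrong, so it can help to compare the analytic gradients against finite differences. The sketch below does this for the KL part of `d_mu` and `d_logvar` only. It is a standalone check with its own random inputs, not a test of the full class, and the helper names `kl_term` and `numeric_grad` are illustrative.

```python
import numpy as np

def kl_term(mu, logvar):
    """Batch-averaged KL(N(mu, sigma^2) || N(0, I)), as in VAE_NumPy.loss."""
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar)) / mu.shape[0]

rng = np.random.default_rng(0)
mu = rng.normal(size=(4, 2))
logvar = rng.normal(size=(4, 2))
B = mu.shape[0]

# Analytic gradients, matching the KL parts of d_mu and d_logvar in backward()
grad_mu = mu / B
grad_logvar = 0.5 * (np.exp(logvar) - 1) / B

def numeric_grad(f, x, h=1e-6):
    """Central finite differences, perturbing one entry of x at a time (in place)."""
    g = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f()
        x[idx] = old - h
        f_minus = f()
        x[idx] = old
        g[idx] = (f_plus - f_minus) / (2 * h)
    return g

num_mu = numeric_grad(lambda: kl_term(mu, logvar), mu)
num_logvar = numeric_grad(lambda: kl_term(mu, logvar), logvar)
print(np.max(np.abs(num_mu - grad_mu)), np.max(np.abs(num_logvar - grad_logvar)))
```

Both printed differences should be tiny. The same pattern extends to the reconstruction path if the noise `eps` is held fixed between evaluations.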

A student wrote the following VAE KL loss. What's wrong?

# BUGGY CODE: find the error!
kl_loss = -0.5 * np.sum(1 + self.logvar - self.mu**2 - self.logvar**2)

Bug: The last term should be np.exp(self.logvar), NOT self.logvar**2! The KL formula is: -½Σ(1 + log σ² - μ² - σ²). Since the network outputs logvar = log σ², we need exp(logvar) to get σ². Squaring logvar gives (log σ²)², which is a different quantity altogether. With this bug the KL term no longer regularizes the latent space properly, so the latent space ends up disorganized and sampling from it produces garbage.
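A quick way to expose this kind of bug is to evaluate the loss at a point where the answer is known. When the encoder output already equals the prior (μ = 0, log σ² = 0), the KL divergence must be exactly zero. The sketch below compares the two versions there; the function names are illustrative, and the formulas are written with the signs distributed (algebraically identical to the ones above).

```python
import numpy as np

def kl_correct(mu, logvar):
    # Same as -0.5 * sum(1 + logvar - mu^2 - exp(logvar))
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1 - logvar)

def kl_buggy(mu, logvar):
    # The student's version: logvar is squared instead of exponentiated
    return 0.5 * np.sum(logvar**2 + mu**2 - 1 - logvar)

# Encoder output that already matches the prior N(0, I)
mu = np.zeros(2)
logvar = np.zeros(2)

print(kl_correct(mu, logvar))   # zero: posterior equals prior
print(kl_buggy(mu, logvar))     # negative, which a true KL divergence can never be
```

A negative "KL" is a reliable warning sign, because a KL divergence is non-negative by definition.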

Section 14

PyTorch Implementations

Vanilla Autoencoder: PyTorch

Python · PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

class Autoencoder(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, latent_dim)      # no activation → linear bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid()
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat, z

# ── Training Loop ──
transform = transforms.Compose([transforms.ToTensor(), transforms.Lambda(lambda x: x.view(-1))])
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
loader = DataLoader(train_data, batch_size=128, shuffle=True)

model = Autoencoder(latent_dim=32)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCELoss(reduction='mean')

for epoch in range(20):
    total_loss = 0
    for x, _ in loader:
        x_hat, z = model(x)
        loss = criterion(x_hat, x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}: loss={total_loss/len(loader):.4f}")

Variational Autoencoder: PyTorch

Python · PyTorch
class VAE(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        # Encoder: shared layers + two heads (mu, logvar)
        self.encoder = nn.Sequential(
            nn.Linear(784, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.fc_mu     = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)
        # Decoder
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 784), nn.Sigmoid(),
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)           # ε ~ N(0, I)
        return mu + std * eps                  # z = μ + σ⊙ε

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        x_hat = self.decode(z)
        return x_hat, mu, logvar, z

def vae_loss(x, x_hat, mu, logvar):
    # Reconstruction: BCE
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
    # KL divergence: -0.5 * Σ(1 + log(σ²) - μ² - σ²)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon + kl) / x.size(0), recon / x.size(0), kl / x.size(0)

# ── Training Loop ──
vae = VAE(latent_dim=2)
opt = optim.Adam(vae.parameters(), lr=1e-3)

for epoch in range(30):
    for x, _ in loader:
        x_hat, mu, logvar, z = vae(x)
        loss, recon, kl = vae_loss(x, x_hat, mu, logvar)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"Epoch {epoch+1}: loss={loss:.1f}  recon={recon:.1f}  kl={kl:.1f}")

Latent Space Visualization & Generation

Python · Visualization
import numpy as np
import matplotlib.pyplot as plt

# ── 1. Visualize latent space (color by digit) ──
vae.eval()
all_z, all_y = [], []
with torch.no_grad():
    for x, y in loader:
        _, mu, _, _ = vae(x)
        all_z.append(mu.numpy())
        all_y.append(y.numpy())

Z = np.concatenate(all_z)
Y = np.concatenate(all_y)

plt.figure(figsize=(10, 8))
for digit in range(10):
    mask = Y == digit
    plt.scatter(Z[mask, 0], Z[mask, 1], s=2, alpha=0.5, label=str(digit))
plt.legend(); plt.title("VAE Latent Space (2D)")
plt.xlabel("z₁"); plt.ylabel("z₂")
plt.savefig("vae_latent_space.png", dpi=150)
plt.show()

# ── 2. Generate digits by decoding a regular grid of latent points ──
n = 15
grid = np.linspace(-3, 3, n)
fig, axes = plt.subplots(n, n, figsize=(12, 12))
for i, yi in enumerate(grid):
    for j, xi in enumerate(grid):
        z = torch.tensor([[xi, yi]], dtype=torch.float32)
        with torch.no_grad():
            img = vae.decode(z).view(28, 28).numpy()
        axes[i][j].imshow(img, cmap='gray')
        axes[i][j].axis('off')
plt.suptitle("Generated Digits: Traversing 2D Latent Space")
plt.tight_layout()
plt.savefig("vae_generated_grid.png", dpi=150)
plt.show()
Expected output after training (exact values vary from run to run):
Epoch 1: loss=214.3 recon=209.1 kl=5.2
Epoch 10: loss=158.7 recon=151.3 kl=7.4
Epoch 30: loss=147.2 recon=138.6 kl=8.6
The latent space plot will show 10 clusters (one per digit) with smooth transitions between them. The generation grid will show digits morphing smoothly from one to another.
Section 15

Visual Diagrams

Diagram 1: The Autoencoder Zoo

THE AUTOENCODER FAMILY TREE

AUTOENCODER (base concept): Input → Encode → Bottleneck → Decode
├── UNDERCOMPLETE (d < n): bottleneck forces learning
│   └── Vanilla AE (Ch 16.1)
├── OVERCOMPLETE (d ≥ n): regularized
│   ├── Sparse AE (16.3)
│   ├── DAE (16.4)
│   └── CAE (16.5)
└── VAE (probabilistic) (Ch 16.6)

Diagram 2: VAE Loss Landscape

ELBO = RECONSTRUCTION − KL DIVERGENCE

Horizontal axis: High Recon Loss (blurry output) ─── Low Recon Loss (sharp output)
Vertical axis: High KL (top) to Low KL (bottom)

  Too much KL weight (β >> 1) = blurry but smooth latent
  ★ Sweet spot (β = 1): good recon + organized latent
  Too little KL weight (β << 1) = sharp but messy latent

Diagram 3: AE vs PCA (Data Manifolds)

PCA PROJECTION (linear): PCA finds a flat line. PCA misses the curve! High reconstruction error.
AE PROJECTION (nonlinear): AE traces the curve! AE captures the manifold! Low reconstruction error.
Section 16

Common Misconceptions

❌ MYTH: "Autoencoders are generative models; you can sample from them like GANs."

✅ TRUTH: Vanilla autoencoders are NOT generative. Their latent space is unstructured, so random samples from it produce garbage. Among the variants in this chapter, only the VAE is generative out of the box, because its latent space is regularized to match a known prior distribution (N(0,I)) that you can actually sample from. (VQ-VAEs are also generative, but they rely on a separately learned prior over discrete codes.)

🔍 WHY IT MATTERS: Don't use a vanilla AE when your goal is generation. Use a VAE or GAN instead. If your goal is just compression, feature learning, or anomaly detection, a vanilla AE is fine.

❌ MYTH: "Deeper autoencoders always learn better representations."

✅ TRUTH: An overly deep or overly wide encoder can memorize training data, learning an identity-like mapping through sheer capacity even with a bottleneck. The right architecture depends on data complexity. For MNIST, 2-3 encoder layers suffice. For ImageNet, you might use a ResNet encoder.

🔍 WHY IT MATTERS: Monitor both training AND validation reconstruction error. If the gap is large, your AE is memorizing: add dropout, reduce capacity, or switch to a DAE.

❌ MYTH: "The KL term in VAEs is a nuisance, so I should minimize β to get better reconstructions."

✅ TRUTH: Without the KL term, a VAE collapses to a deterministic AE with unstructured latent space. The KL term is what gives the VAE its generative power: it ensures the latent space is smooth, continuous, and samplable. β-VAE (Higgins et al., 2017) showed that increasing β beyond 1 can encourage disentangled representations where each latent dimension controls a single factor of variation.

🔍 WHY IT MATTERS: The reconstruction-KL tradeoff is fundamental. Don't blindly minimize KL; tune β for your use case. Want sharp images? Lower β. Want disentangled features? Raise β.

❌ MYTH: "Autoencoders are just for images."

✅ TRUTH: AEs work on ANY data type: tabular data (fraud detection), text (sentence embeddings), audio (speech denoising), time series (anomaly detection), molecular graphs (drug discovery), and point clouds (3D shape compression).

🔍 WHY IT MATTERS: Don't limit your thinking. If you have data and want to learn a compact representation, an autoencoder is worth trying.

Section 17

GATE / Exam Corner

Previous Year Questions & Patterns

GATE CS 2023 (Adapted)

An autoencoder with input dimension 100, a single hidden layer of 20 neurons (with sigmoid activation), and an output layer of 100 neurons is trained with MSE loss. This autoencoder performs dimensionality reduction similar to which technique?

  A. K-Means Clustering
  B. Principal Component Analysis (PCA)
  C. Independent Component Analysis (ICA)
  D. t-SNE
Answer: B. An autoencoder with a bottleneck layer learns a compressed representation. With a single linear hidden layer and MSE loss, it recovers the same subspace as the top-20 principal components. Even with sigmoid activation, it approximates a nonlinear generalization of PCA.
Understand · GATE CS
GATE DA 2024 (Adapted)

In a Variational Autoencoder, the reparameterization trick is used primarily to:

  A. Speed up training by reducing the number of parameters
  B. Enable backpropagation through the stochastic sampling operation
  C. Ensure the decoder output is always in the range [0,1]
  D. Regularize the latent space to prevent overfitting
Answer: B. The reparameterization trick (z = μ + σ⊙ε, ε~N(0,I)) moves the stochasticity to ε (which has no learnable parameters), making z a deterministic function of μ and σ. This allows gradients to flow through μ and σ via standard backpropagation. Option D describes the KL term, not the reparameterization trick.
Understand · GATE DA
GATE CS 2022 (Adapted)

Consider a denoising autoencoder trained by corrupting input x to x̃ using masking noise (randomly zeroing 30% of inputs), then reconstructing x from x̃. The loss function compares:

  A. x̃ with the encoder output z
  B. x̃ with the decoder output x̂
  C. x (original clean input) with the decoder output x̂
  D. z with a fixed target vector
Answer: C. The key insight of denoising autoencoders: the network receives the CORRUPTED input x̃ but the loss is computed against the CLEAN input x. This forces the network to learn the data distribution, not just memorize the identity function.
Remember · GATE CS

Exam-Ready Formula Sheet

Concept | Formula
Vanilla AE Loss | L = ‖x - g(f(x))‖² or BCE(x, x̂)
Sparse AE Loss | L = L_recon + β Σⱼ KL(ρ ‖ ρ̂ⱼ)
DAE Loss | L = ‖x - g(f(x̃))‖², x̃ = corrupt(x)
CAE Loss | L = ‖x - x̂‖² + λ ‖∂f/∂x‖²_F
VAE ELBO | ELBO = 𝔼[log p(x|z)] − KL(q(z|x) ‖ p(z))
VAE KL (Gaussian) | KL = −½ Σ(1 + log σ² − μ² − σ²)
Reparameterization | z = μ + σ ⊙ ε, ε ~ N(0, I)
Linear AE = PCA | Optimal W spans top-d eigenvectors of Cov(x)

GATE Prediction Table (2025–2027)

Topic | Probability | Type | Key Focus
AE architecture basics | High | 1-mark MCQ | Encoder-decoder-bottleneck
VAE ELBO formula | Medium-High | 2-mark MCQ | Two-term decomposition
AE vs PCA | Medium | 1-mark MCQ | Linear AE = PCA
Reparameterization trick | Medium | 2-mark MCQ | Why it's needed
Denoising AE | Medium | 1-mark MCQ | Train on noisy, compare with clean
Section 18

Interview Prep

Conceptual Questions

Q1: "What's the difference between an autoencoder and PCA?"

🏆 Strong Answer:

"A linear autoencoder with MSE loss is mathematically equivalent to PCA: the optimal encoder weights span the same subspace as the top principal components. But a nonlinear AE (with ReLU, etc.) can learn curved manifolds that PCA can't capture. Think of a Swiss roll dataset: PCA sees a flat projection, while a nonlinear AE unrolls it. PCA is still preferred for interpretability and when data is approximately linear. AEs scale better to high dimensions (millions of features) and can be combined with CNN/RNN encoders for structured data."

Q2: "Explain the reparameterization trick in VAEs."

🏆 Strong Answer:

"In a VAE, we want to backpropagate through z ~ N(μ, σ²), but sampling is a stochastic operation with no defined gradient. The reparameterization trick rewrites z = μ + σ·ε where ε ~ N(0,1). Now z is a deterministic function of μ and σ (just multiply and add), so gradients flow through normally. The randomness is 'externalized' into ε, which has no learnable parameters and doesn't need gradients. This is analogous to how dropout externalizes randomness with a fixed mask per forward pass."

Q3: "How would you use an autoencoder for anomaly detection?"

🏆 Strong Answer (Indian Context):

"Train the AE exclusively on normal data. The AE learns the manifold of normality: what normal transactions/images/signals look like. At inference, compute reconstruction error. Normal inputs reconstruct well (low error), anomalies reconstruct poorly (high error) because the AE never learned those patterns. Set a threshold at the 99th percentile of validation reconstruction errors. At a payments company such as Paytm, for example, this approach can catch novel fraud patterns that rule-based systems miss, because the AE flags anything that deviates from 'normal', regardless of the specific fraud mechanism."

🏆 Strong Follow-up (Limitations):

"Limitations: (1) Choosing the threshold is tricky and domain-specific. (2) If anomalies are in the training data, the AE learns them too. (3) Reconstruction error alone may not distinguish types of anomalies. Solution: ensemble with supervised models, use adaptive thresholds, or use the latent code for secondary classification."

Coding Questions

Coding Q1: "Implement the VAE loss function in PyTorch."

import torch
import torch.nn.functional as F

def vae_loss_fn(x, x_hat, mu, logvar):
    # Reconstruction: sum over features, mean over batch
    recon = F.binary_cross_entropy(x_hat, x, reduction='sum')
    # KL: closed-form for diagonal Gaussian vs N(0,I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon + kl) / x.size(0)

Key insight: Use reduction='sum' for BCE (not 'mean'), then divide by batch size manually. This ensures the reconstruction term and KL term are on the same scale. Using 'mean' for BCE would average over all batch × 784 pixel values, making the reconstruction term 784× smaller relative to the KL term for MNIST.
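The scale difference between the two reductions can be confirmed without training anything. The sketch below uses random placeholder tensors in NumPy (not real MNIST data or a real decoder) and computes the element-wise BCE by hand, so it only illustrates the bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, features = 128, 784
x = rng.integers(0, 2, size=(batch, features)).astype(float)     # fake binary "images"
x_hat = np.clip(rng.random((batch, features)), 1e-6, 1 - 1e-6)   # fake decoder outputs

bce = -(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))         # element-wise BCE

per_example_sum = bce.sum() / batch   # reduction='sum', then divide by batch size
elementwise_mean = bce.mean()         # reduction='mean' averages over batch * features

ratio = per_example_sum / elementwise_mean
print(ratio)                          # equals the feature count
```

Because the KL term is summed over latent dimensions and averaged over the batch, pairing it with the element-wise mean would silently down-weight reconstruction by the number of pixels.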

Coding Q2: "Implement the reparameterization trick."

import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)  # σ = exp(½ log σ²)
    eps = torch.randn_like(std)     # ε ~ N(0, I)
    return mu + std * eps            # z = μ + σ⊙ε

Common mistake: Some candidates use logvar directly as std, forgetting the exp(0.5 * ...). Remember: the encoder outputs log σ², and exp(½ · log σ²) = exp(log σ) = σ.
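The effect of that mistake shows up directly in the sample statistics. The following NumPy sketch (an illustration with made-up values μ = 1.5 and σ² = 0.25, not part of the PyTorch answer) draws many reparameterized samples and compares the correct transform with the version that treats logvar as if it were σ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, logvar = 1.5, np.log(0.25)       # target distribution N(1.5, 0.25), so sigma = 0.5

std = np.exp(0.5 * logvar)           # sigma = exp(0.5 * log sigma^2)
eps = rng.standard_normal(200_000)   # noise drawn independently of mu and logvar
z = mu + std * eps                   # correct reparameterized samples

wrong_z = mu + logvar * eps          # the common mistake: logvar used as sigma

print(z.mean(), z.std())             # close to 1.5 and 0.5
print(wrong_z.std())                 # close to |log 0.25| ~ 1.39, not 0.5
```

The mean is unaffected by the bug, which is why it can go unnoticed if only reconstructions are inspected.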

Case Study Questions

Case: "Design an anomaly detection system for credit card fraud."

Approach:
  1. Data: Feature-engineer transactions: amount, time, merchant category, user history (rolling stats)
  2. Architecture: Tabular AE (120 → 64 → 32 → 64 → 120) with BatchNorm and Dropout
  3. Training: Only on verified legitimate transactions. Use MSE loss. Train for 50 epochs with early stopping on validation recon error
  4. Threshold: 99.5th percentile of validation reconstruction errors. Make adaptive: recalculate weekly
  5. Deployment: Stream transactions through encoder, compute recon error in <5ms, flag if above threshold
  6. Monitoring: Track distribution shift in reconstruction errors over time. Alert if baseline shifts
Why AE over supervised?

Fraud is rare (<0.01%), so labeled data is scarce. AEs learn "what's normal" from abundant normal data. They catch novel fraud patterns that supervised models (trained on known fraud types) miss.

Section 19

Hands-On Lab: MNIST VAE with Latent Space Exploration

Lab Objective

Build a complete VAE pipeline: train on MNIST, visualize the 2D latent space, generate new digits by sampling, and interpolate between digits in latent space.

Lab Steps

Checklist

  • Step 1: Load and preprocess MNIST (normalize to [0,1], flatten)
  • Step 2: Implement VAE class (encoder with μ and log σ², reparameterization, decoder)
  • Step 3: Implement VAE loss (BCE reconstruction + KL divergence)
  • Step 4: Train for 30 epochs, log both recon and KL components
  • Step 5: Plot training curves (total loss, recon, KL) over epochs
  • Step 6: Encode test set and plot 2D latent space (scatter, color by digit)
  • Step 7: Generate a 15×15 grid of digits by sampling z from [-3,3]²
  • Step 8: Interpolate between two digits: find μ for "3" and "7", lerp z, decode
  • Step 9 (bonus): try latent_dim=10 and compare reconstruction quality
  • Step 10 (bonus): implement β-VAE (β=4) and compare latent space structure

Grading Rubric

Component | Points | Criteria
VAE Implementation | 25 | Correct encoder (two heads), reparameterization, decoder
Loss Function | 20 | Correct BCE reconstruction + KL divergence + proper scaling
Training Loop | 15 | Converges, loss decreases, no NaN/Inf
Latent Space Plot | 15 | Clear clusters, smooth transitions, correct labels
Generation Grid | 15 | Recognizable digits, smooth morphing across grid
Interpolation | 10 | Smooth transition between two specific digits
Total | 100 |
Section 20

Exercises

Section A: Conceptual Questions (5)

A1
Beginner

Why can't you generate meaningful new images by sampling random points from the latent space of a vanilla (non-variational) autoencoder?

A vanilla AE's latent space has no imposed structure: points are scattered arbitrarily with gaps between them. Random samples likely fall in these gaps where the decoder was never trained, producing garbage. VAEs fix this by regularizing the latent space to match N(0,I), ensuring the entire space is "filled" meaningfully.
A2
Beginner

List three real-world applications of autoencoders, and for each, specify which AE variant you would use and why.

(1) Anomaly detection (fraud): Vanilla/undercomplete AE, trained on normal data and used to flag high recon error. (2) Image denoising: Denoising AE, trained to reconstruct clean from noisy. (3) Drug molecule generation: VAE, whose smooth latent space allows interpolation and sampling of novel molecules.
A3
Intermediate

Explain the difference between the encoder output of a vanilla AE and a VAE. What does each output, and why?

Vanilla AE encoder outputs a single deterministic vector z. VAE encoder outputs TWO vectors: μ (mean) and log σ² (log-variance), defining a Gaussian distribution. z is then sampled from this distribution. This makes the latent space probabilistic, enabling generation.
A4
Intermediate

Why does a denoising autoencoder compare its output with the CLEAN input (not the noisy input)? What would happen if you compared with the noisy input?

Comparing with clean forces the network to learn the data manifold and "undo" the corruption. If you compared with the noisy input, the network would learn to preserve the noise: it would become a regular AE on noisy data, learning nothing useful about the true data distribution.
A5
Beginner

In the context of autoencoders, what is "representation learning"? How does it differ from manual feature engineering?

Representation learning lets the model discover useful features (representations) automatically from data through the reconstruction objective. Manual feature engineering requires domain experts to design features by hand. AE representations are often more powerful because they capture complex nonlinear relationships that humans might miss.

Section B: Mathematical Questions (8)

B1
Intermediate

Derive the KL divergence between N(μ, σ²) and N(0, 1) from the definition KL = ∫ q(z) log[q(z)/p(z)] dz. Show all steps.

See Section 16.6 Professor's Whiteboard for the complete derivation. Key steps: expand log ratio, use properties of Gaussian (𝔼[z]=μ, 𝔼[z²]=μ²+σ², entropy = ½log(2πeσ²)). Result: KL = -½(1 + log σ² - μ² - σ²).
B2
Intermediate

A VAE encoder outputs μ = [1.2, -0.5, 0.0] and log σ² = [-1.0, 0.5, 0.3]. Compute the total KL divergence.

σ² = [e⁻¹, e^0.5, e^0.3] = [0.368, 1.649, 1.350]
KL₁ = -½(1 + (-1.0) - 1.44 - 0.368) = -½(-1.808) = 0.904
KL₂ = -½(1 + 0.5 - 0.25 - 1.649) = -½(-0.399) = 0.200
KL₃ = -½(1 + 0.3 - 0.0 - 1.350) = -½(-0.050) = 0.025
Total = 0.904 + 0.200 + 0.025 = 1.129 (≈ 1.128 if the unrounded terms are summed)
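The hand calculation can be cross-checked numerically. This short NumPy sketch evaluates the same closed-form expression per dimension; the variable names are illustrative.

```python
import numpy as np

mu = np.array([1.2, -0.5, 0.0])
logvar = np.array([-1.0, 0.5, 0.3])

# Closed-form KL(N(mu, sigma^2) || N(0, 1)) for each latent dimension
kl_per_dim = -0.5 * (1 + logvar - mu**2 - np.exp(logvar))
kl_total = float(kl_per_dim.sum())

print(np.round(kl_per_dim, 3))
print(round(kl_total, 3))
```

Small differences in the last digit relative to the hand calculation come from rounding the intermediate values.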
B3
Advanced

Prove that a single-layer linear autoencoder (no activation, MSE loss) recovers the PCA subspace. [Hint: take derivative of L = 𝔼[‖x - W₂W₁x‖²] w.r.t. W₂ and show optimal W₂ = W₁ᵀ, then characterize W₁.]

Setting ∂L/∂W₂ = 0 gives W₂ = Cov(x,z)·Cov(z,z)⁻¹. When W₂=W₁ᵀ (tied weights), the problem becomes minimizing ‖x - W₁ᵀW₁x‖², which is the projection error. By the Eckart-Young theorem, the optimal rank-d projection uses the top-d eigenvectors of Σ = 𝔼[xxᵀ] (for zero-mean x). These are exactly the principal components.
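The claim can also be illustrated numerically. The sketch below builds synthetic zero-mean data (an assumption made for the demo), uses the top-d covariance eigenvectors as tied encoder weights, and checks that the reconstruction error equals the variance in the discarded directions. A random orthonormal encoder of the same rank is included for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, d = 2000, 5, 2
# Synthetic data: different variance per axis, then mixed by a random matrix
X = rng.standard_normal((n, dim)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X = X @ rng.standard_normal((dim, dim))
X -= X.mean(axis=0)                           # PCA assumes zero-mean data

cov = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
W1 = eigvecs[:, -d:].T                        # encoder rows: top-d eigenvectors

def recon_error(W):
    """Mean squared error of the tied-weight linear AE x -> W^T W x."""
    X_hat = X @ W.T @ W
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

pca_error = recon_error(W1)
discarded = eigvals[:-d].sum()                # variance in the dropped directions

# Another orthonormal rank-d encoder should do no better
Q, _ = np.linalg.qr(rng.standard_normal((dim, d)))
random_error = recon_error(Q.T)

print(pca_error, discarded, random_error)
```

A linear AE trained by gradient descent would typically converge to some rotation of `W1` within the same subspace, not necessarily to the eigenvectors themselves.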
B4
Intermediate

Compute the sparse autoencoder KL penalty for a neuron with average activation ρ̂ = 0.7 when the target sparsity is ρ = 0.05.

KL(0.05 ‖ 0.70) = 0.05·ln(0.05/0.70) + 0.95·ln(0.95/0.30)
= 0.05·ln(0.0714) + 0.95·ln(3.167)
= 0.05·(-2.639) + 0.95·(1.153)
= -0.132 + 1.095 = 0.963
This is a large penalty, pushing the neuron to fire less often.
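The same penalty can be written as a small function, which makes it easy to see how quickly it grows as a neuron's average activation moves away from the target. The function name `sparsity_kl` is illustrative.

```python
import math

def sparsity_kl(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

penalty = sparsity_kl(0.05, 0.7)
on_target = sparsity_kl(0.05, 0.05)   # no penalty when the neuron hits the target

print(round(penalty, 3), on_target)
```

In a sparse AE this value would be computed per hidden unit, summed, and multiplied by the weight β before being added to the reconstruction loss.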
B5
Intermediate

For a contractive autoencoder with encoder f(x) = sigmoid(Wx + b), compute the Jacobian J = ∂f/∂x and its Frobenius norm.

Let h = Wx + b, f(x) = σ(h). Then ∂fᵢ/∂xⱼ = σ'(hᵢ)·Wᵢⱼ = fᵢ(1-fᵢ)·Wᵢⱼ.
So J = diag(f⊙(1-f)) · W, where ⊙ is element-wise product.
‖J‖²_F = Σᵢ Σⱼ [fᵢ(1-fᵢ)]² · W²ᵢⱼ = Σᵢ [fᵢ(1-fᵢ)]² · ‖Wᵢ‖², where Wᵢ is the i-th row of W.
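As a sanity check on this result, the closed-form Jacobian can be compared with a finite-difference estimate. The sketch below uses small random weights (3 hidden units, 5 inputs) chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))      # 5 inputs -> 3 hidden units
b = rng.standard_normal(3)
x = rng.standard_normal(5)

def encoder(x):
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

f = encoder(x)
J = (f * (1 - f))[:, None] * W                                   # diag(f * (1 - f)) @ W
frob_sq = np.sum((f * (1 - f)) ** 2 * np.sum(W ** 2, axis=1))    # row-wise shortcut

# Finite-difference Jacobian, one input direction per column
h = 1e-6
J_num = np.stack([(encoder(x + h * e) - encoder(x - h * e)) / (2 * h)
                  for e in np.eye(5)], axis=1)

print(np.max(np.abs(J - J_num)), frob_sq, np.sum(J ** 2))
```

The row-wise shortcut is what makes the contractive penalty cheap for a single sigmoid layer: it avoids forming the full Jacobian for every example.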
B6
Advanced

Show that the ELBO is tight (equals log p(x)) if and only if q(z|x) = p(z|x). What does this imply for VAE training?

log p(x) = ELBO + KL(q(z|x) ‖ p(z|x)). Since KL ≥ 0, ELBO ≤ log p(x). Equality holds iff KL(q‖p) = 0, i.e., q(z|x) = p(z|x). This means the VAE's approximation is perfect when the encoder perfectly matches the true posterior. In practice, we use q from a restricted family (diagonal Gaussians) that rarely contains the true posterior, so the ELBO is generally a strict lower bound. More expressive q (normalizing flows, IAF) tighten it.
B7
Intermediate

An autoencoder is trained on 10,000 normal transactions. The reconstruction errors follow N(0.02, 0.008²). What threshold captures 99% of normal data? If fraud has recon error N(0.15, 0.05²), what is the recall at this threshold?

99th percentile: τ = 0.02 + 2.326×0.008 = 0.02 + 0.0186 = 0.0386
P(fraud_error > 0.0386) = P(Z > (0.0386-0.15)/0.05) = P(Z > -2.228) = Φ(2.228) ≈ 0.987
So recall ≈ 98.7%, which is excellent!
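Python's standard library can reproduce both numbers without a z-table. This sketch uses `statistics.NormalDist` with the two Gaussians stated in the question.

```python
from statistics import NormalDist

normal_errors = NormalDist(mu=0.02, sigma=0.008)
fraud_errors = NormalDist(mu=0.15, sigma=0.05)

tau = normal_errors.inv_cdf(0.99)        # 99th percentile of normal errors
recall = 1.0 - fraud_errors.cdf(tau)     # fraction of fraud above the threshold

print(round(tau, 4), round(recall, 3))
```

Real reconstruction errors are often right-skewed, so in practice the threshold is usually taken as an empirical percentile of validation errors.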
B8
Intermediate

How many trainable parameters does a 784→256→128→32→128→256→784 autoencoder have? (Include biases.)

Encoder: 784×256+256 + 256×128+128 + 128×32+32 = 200,960 + 32,896 + 4,128 = 237,984
Decoder: 32×128+128 + 128×256+256 + 256×784+784 = 4,224 + 33,024 + 201,488 = 238,736
Total = 237,984 + 238,736 = 476,720
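The count generalizes to any stack of fully connected layers. A short sketch, with the layer sizes taken from the question and an illustrative helper name:

```python
layer_sizes = [784, 256, 128, 32, 128, 256, 784]

def dense_params(n_in, n_out):
    return n_in * n_out + n_out   # weight matrix plus bias vector

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
encoder_params = sum(per_layer[:3])
decoder_params = sum(per_layer[3:])
total_params = encoder_params + decoder_params

print(per_layer, total_params)
```

The decoder is slightly larger than the encoder only because its bias vectors are wider.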

Section C: Coding Questions (4)

C1
Intermediate

Implement a denoising autoencoder in PyTorch that adds Gaussian noise (σ=0.3) to MNIST inputs during training. Include the noise-addition step in the training loop.

Key snippet: noisy_x = x + 0.3 * torch.randn_like(x); noisy_x = torch.clamp(noisy_x, 0, 1); x_hat = model(noisy_x); loss = F.mse_loss(x_hat, x). Note that the loss compares with CLEAN x, not noisy_x. (If the model returns a tuple such as (x_hat, z), like the Autoencoder in Section 14, unpack it first.)
C2
Intermediate

Write a function that takes a trained VAE, two digit images (e.g., "3" and "7"), and generates 10 intermediate images by linearly interpolating in latent space.

Encode both images to get μ₃ and μ₇. For t in np.linspace(0, 1, 10): z = (1-t)*μ₃ + t*μ₇, decode z. This produces a smooth morphing from "3" to "7".
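The interpolation step itself needs no model and can be sketched in NumPy. The latent means below are hypothetical placeholders standing in for the encoder outputs of a "3" and a "7", and `interpolate_latents` is an illustrative name.

```python
import numpy as np

def interpolate_latents(mu_a, mu_b, steps=10):
    """Return `steps` latent vectors evenly spaced on the line from mu_a to mu_b."""
    ts = np.linspace(0.0, 1.0, steps)[:, None]          # shape (steps, 1)
    return (1.0 - ts) * mu_a[None, :] + ts * mu_b[None, :]

# Hypothetical encoder means for a "3" and a "7" in a 2-D latent space
mu_3 = np.array([-1.0, 0.5])
mu_7 = np.array([2.0, -1.5])

path = interpolate_latents(mu_3, mu_7)
print(path.shape)
# Each row of `path` would then be converted to a tensor and passed to the decoder
```

Using the encoder means rather than sampled z values keeps the endpoints deterministic, so the first and last frames reconstruct the original digits as closely as the model allows.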
C3
Advanced

Implement an anomaly detection pipeline: train an AE on MNIST digits 0-4 only, then evaluate reconstruction error on digits 5-9 (anomalies). Plot the ROC curve.

Train AE on digits 0-4. At test time, compute recon error for all 10 digits. Digits 5-9 should have higher error. Use sklearn.metrics.roc_curve with labels (0 for normal, 1 for anomaly) and recon errors as scores. Plot FPR vs TPR.
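The evaluation half of this pipeline can be sketched without any model by treating reconstruction errors as anomaly scores. The errors below are synthetic stand-ins drawn from two Gaussians (not real MNIST results), and the ROC curve is computed in plain NumPy so the mechanics are visible; `sklearn.metrics.roc_curve` would serve the same purpose.

```python
import numpy as np

def roc_curve_np(labels, scores):
    """ROC points from sweeping a threshold over scores (higher = more anomalous)."""
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    tpr = np.concatenate([[0.0], np.cumsum(labels) / labels.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - labels) / (1 - labels).sum()])
    return fpr, tpr

rng = np.random.default_rng(0)
# Synthetic stand-ins for reconstruction errors
normal_err = rng.normal(0.02, 0.01, size=1000)    # digits 0-4 (label 0)
anomaly_err = rng.normal(0.06, 0.02, size=1000)   # digits 5-9 (label 1)
scores = np.concatenate([normal_err, anomaly_err])
labels = np.concatenate([np.zeros(1000), np.ones(1000)])

fpr, tpr = roc_curve_np(labels, scores)
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))   # trapezoidal rule
print(round(auc, 3))
```

With a trained AE, `scores` would be the per-image reconstruction error on the test set, and plotting `fpr` against `tpr` gives the requested curve.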
C4
Intermediate

Modify the VAE from Section 14 to implement ฮฒ-VAE with ฮฒ=4. Train it and compare the latent space visualization with the standard VAE (ฮฒ=1). What changes?

Change loss: (recon + beta * kl) / batch_size with beta=4. The latent space becomes more organized and disentangled: each dimension tends to capture one factor of variation. But reconstructions become blurrier because the KL is weighted more heavily.
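The loss change can be illustrated on plain numbers. This sketch assumes the usual diagonal-Gaussian encoder with a standard normal prior, for which the KL term has a closed form; gaussian_kl and beta_vae_loss are illustrative names, and recon_errors is assumed to hold one already-computed reconstruction loss per example.

```python
import math

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one example, closed form."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def beta_vae_loss(recon_errors, mus, logvars, beta=4.0):
    """Batch-averaged loss: (sum of reconstruction terms + beta * sum of KL terms) / batch_size."""
    batch_size = len(recon_errors)
    recon = sum(recon_errors)
    kl = sum(gaussian_kl(m, lv) for m, lv in zip(mus, logvars))
    return (recon + beta * kl) / batch_size

# Two examples, 1-D latent, unit variance, mean 1.0 (KL = 0.5 each).
recon_errors = [2.0, 4.0]
mus, logvars = [[1.0], [1.0]], [[0.0], [0.0]]
print(beta_vae_loss(recon_errors, mus, logvars, beta=1.0))  # standard VAE
print(beta_vae_loss(recon_errors, mus, logvars, beta=4.0))  # beta-VAE
```

With the same encoder outputs, raising β from 1 to 4 leaves the reconstruction term untouched and quadruples the KL contribution, which is why the optimizer trades reconstruction sharpness for a latent distribution closer to the prior.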

Section D: Critical Thinking (3)

D1
Advanced

A company claims their autoencoder "achieves 0 reconstruction error on all training data." Should you be impressed? What questions would you ask?

You should NOT be impressed: this likely means the AE has memorized the training data (overfitting). Questions: (1) What's the validation reconstruction error? (2) What's the bottleneck size relative to input? (3) How many parameters vs training examples? (4) Does the latent space show meaningful clusters? Zero training loss is easy to reach with a large enough network and, on its own, says nothing about usefulness for any downstream task.
D2
Advanced

VAEs often produce blurrier images than GANs. Why? What modifications have been proposed to address this?

VAEs optimize a lower bound (ELBO) and use pixel-wise reconstruction loss, which averages over all modes of the data distribution, producing blurry means. GANs use adversarial loss that captures perceptual sharpness. Fixes: (1) VAE-GAN hybrids (add discriminator), (2) VQ-VAE (discrete latent codes), (3) Hierarchical VAEs (NVAE), (4) Better decoders with perceptual loss, (5) Diffusion-based decoders.
D3
Advanced

Compare the use of autoencoders for anomaly detection in Paytm's UPI system (India) vs credit card fraud detection at Stripe (US). What domain-specific differences would affect your architecture choices?

India (UPI): Higher volume (8B/month), lower avg transaction value, mobile-first (device fingerprinting is crucial), regulatory requirements (RBI mandates), P2P transfers dominant. Need: ultra-low latency, batch feature updates.
US (Stripe): Lower volume but higher values, card-not-present transactions, more sophisticated fraud (account takeover), PCI compliance, international transactions. Need: more features per transaction, multi-currency handling.
Architecture differences: Indian model needs more velocity features (txn frequency), US needs more cross-border features and cardholder verification signals.

★ Starred Research Questions (2)

★1
Advanced · Research

Read about VQ-VAE (van den Oord et al., 2017). How does it differ from a standard VAE? Why does discretizing the latent space help for generation tasks? Implement a simplified VQ-VAE for MNIST.

VQ-VAE uses a discrete latent space with a codebook of learned embeddings. The encoder output is "snapped" to the nearest codebook vector. This avoids posterior collapse (a common VAE issue) and enables autoregressive generation over discrete codes. Because the nearest-neighbor lookup is not differentiable, a straight-through estimator copies gradients from the quantized vector back to the encoder output, while the codebook itself is updated by a separate codebook loss. Key advantage: sharper reconstructions than standard VAE.
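The "snapping" step is a nearest-neighbor lookup, sketched below on plain Python lists. This covers only the forward quantization; quantize is an illustrative name, squared Euclidean distance is assumed as the metric, and the gradient handling (straight-through copy to the encoder) would need an autodiff framework and is not shown.

```python
def quantize(z_e, codebook):
    """Snap an encoder output to its nearest codebook vector (squared Euclidean distance)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    index = min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))
    return index, codebook[index]

# Toy codebook with three 2-D embeddings.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
index, z_q = quantize([0.9, 0.8], codebook)
print(index, z_q)  # 1 [1.0, 1.0]
```

The returned index is the discrete code: an image becomes a grid of such integers, which is what makes autoregressive modeling over the latent codes possible.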
★2
Advanced · Research

Investigate the "posterior collapse" problem in VAEs: the decoder ignores z and the KL term drops to 0. When does this happen? What solutions have been proposed? Implement KL annealing (gradually increase β from 0 to 1 during training) and show it mitigates the problem.

Posterior collapse occurs when the decoder is too powerful (e.g., autoregressive LSTM) and can model p(x) without using z. The encoder then sets q(z|x) = p(z) to minimize KL to 0. Solutions: (1) KL annealing (warm-up β from 0→1), (2) Free bits (minimum KL per dimension), (3) δ-VAE (lower bound on KL), (4) Weaker decoders. Implementation: multiply KL by min(1, epoch/warmup_epochs).
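The warm-up rule stated above can be written as a small schedule function. This is a minimal sketch of linear KL annealing; kl_weight and annealed_loss are illustrative names, and the default of 10 warm-up epochs is an arbitrary assumption rather than a recommended value.

```python
def kl_weight(epoch, warmup_epochs):
    """Linear KL warm-up: 0 at epoch 0, rising to 1 at warmup_epochs, then held at 1."""
    if warmup_epochs <= 0:
        return 1.0  # no warm-up requested
    return min(1.0, epoch / warmup_epochs)

def annealed_loss(recon, kl, epoch, warmup_epochs=10):
    """VAE loss with the KL term scaled by the current warm-up weight."""
    return recon + kl_weight(epoch, warmup_epochs) * kl

# Print the schedule for the first few epochs.
for epoch in (0, 5, 10, 20):
    print(epoch, kl_weight(epoch, 10))
```

Early in training the loss is almost pure reconstruction, so the decoder is pushed to use z before the KL penalty is strong enough to drive q(z|x) toward the prior.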
Section 21

Connections

How This Chapter Connects

← Builds On:
  • Chapter 10 (Batch Norm & Deep Networks): Training deep encoder/decoder stacks requires BatchNorm for stability
  • Chapter 12 (CNNs): Convolutional encoders/decoders for image autoencoders; transpose convolutions in decoder
  • Chapter 6 (Backpropagation): Understanding backprop through encoder-decoder is essential for implementing AEs from scratch
  • Chapter 9 (Regularization): Sparse, denoising, and contractive AEs are all regularization strategies
→ Enables:
  • Chapter 16-GAN (GANs): VAE-GAN hybrids combine the best of both; understanding latent spaces is prerequisite for GANs
  • Chapter 19 (Recommendation Systems): AE embeddings power collaborative filtering at Netflix, Spotify
  • Diffusion Models: Score-based diffusion models are deeply connected to denoising autoencoders
  • Self-Supervised Learning: Masked autoencoders (MAE) by He et al. (2022) are the backbone of modern visual pre-training
🔬 Research Frontier:
  • Masked Autoencoders (MAE): He et al. (2022). Mask 75% of image patches, reconstruct → powerful visual features
  • VQ-VAE-2: Razavi et al. (2019). Hierarchical discrete VAE rivaling GANs in image quality
  • Diffusion + VAE: Latent diffusion models (Stable Diffusion) run diffusion in the VAE latent space for efficiency
🏭 Industry Implementation:
  • Netflix: Collaborative filtering with AE embeddings (200M users)
  • Spotify: Audio feature learning with convolutional AEs
  • Paytm: UPI fraud detection with reconstruction error
  • Google: AE-based compression in YouTube video encoding
  • Tesla: Sensor anomaly detection in autopilot systems

Masked Autoencoders Are Scalable Vision Learners, by He, Chen, Xie, Li, Dollár, Girshick (Meta AI, 2022). This paper showed that masking 75% of image patches and training a ViT to reconstruct them produces embeddings rivaling supervised pre-training on ImageNet. The key insight: aggressive masking forces the model to learn semantic understanding, not just texture. This connects denoising AEs (2008) to modern self-supervised learning (2022): the same "corrupt and reconstruct" idea, scaled up with Transformers.

Section 22

Chapter Summary

7 Key Takeaways

  1. Autoencoders learn by compression: Squeeze data through a bottleneck, then reconstruct. Whatever survives the bottleneck is the learned representation: the essential features of the data.
  2. Five variants serve different purposes: Undercomplete (bottleneck forces compression), Sparse (KL penalty → selective features), Denoising (corruption → robust features), Contractive (Jacobian penalty → stable encoding), Variational (probabilistic → generation + structure).
  3. Linear AE = PCA: A single-layer linear AE with MSE loss recovers the same subspace as PCA. Nonlinear AEs are strictly more powerful, capturing curved manifolds.
  4. VAEs are generative: By making the latent space probabilistic and regularizing with KL divergence, VAEs enable sampling and generation. The ELBO = Reconstruction − KL.
  5. The reparameterization trick is essential: z = μ + σ⊙ε externalizes randomness to ε, allowing gradients to flow through μ and σ during backprop.
  6. Anomaly detection via reconstruction error: Train on normal data, flag high reconstruction error. Used at Paytm for UPI fraud detection (8B txns/month); it catches novel fraud patterns that rule-based systems miss.
  7. Modern connection to diffusion models: Denoising autoencoders (2008) anticipated score-based diffusion models (2020+). The "corrupt and reconstruct" idea is more powerful than ever, powering Stable Diffusion, DALL-E, and masked autoencoder pre-training.

Key Equation

VAE ELBO: log p(x) ≥ 𝔼_{q(z|x)}[log p(x|z)] − KL(q(z|x) ‖ p(z))

Key Intuition

"An autoencoder learns what's important by being forced to forget what's not. The bottleneck is a question: 'If you could only remember 32 numbers about this image, what would they be?' The answer IS the representation."

Section 23

Further Reading

🇮🇳 Indian Resources

  • NPTEL: "Deep Learning" by Prof. Mitesh Khapra (IIT Madras). Lectures 35-38 cover autoencoders and VAEs with excellent mathematical rigor
  • NPTEL: "Introduction to Machine Learning" by Prof. Sudeshna Sarkar (IIT Kharagpur). PCA and dimensionality reduction context
  • GATE Preparation: "Deep Learning" by Ian Goodfellow. Chapter 14 (Autoencoders) is the gold standard reference
  • Padhai (One Fourth Labs): Free VAE tutorial series with Python implementation walkthroughs

🌍 Global Resources

  • Original VAE Paper: "Auto-Encoding Variational Bayes", Kingma & Welling (2014). The foundational paper. Read the derivation alongside our Section 16.6
  • Tutorial: "An Introduction to Variational Autoencoders", Kingma & Welling (2019). A comprehensive 50-page tutorial
  • Distill.pub: "Understanding the Variational Lower Bound". Interactive visualization of the ELBO
  • 3Blue1Brown: "But what is a neural network?" Foundation for understanding AE architectures
  • Lilian Weng's Blog: "From Autoencoder to Beta-VAE". Outstanding survey of all AE variants
  • Paper: "Masked Autoencoders Are Scalable Vision Learners", He et al. (2022). Modern connection to self-supervised learning
  • Paper: "Neural Discrete Representation Learning" (VQ-VAE), van den Oord et al. (2017)
  • Goodfellow et al.: "Deep Learning" textbook, Chapter 14 (free online at deeplearningbook.org)