Dimensionality Reduction
PCA, t-SNE, UMAP โ Taming the Curse of High-Dimensional Data
Learning Objectives
Upon completing this chapter, you will be able to:
Introduction
Imagine you're trying to describe a person. You could use thousands of measurements โ height, weight, shoe size, arm length, hair color (encoded numerically), income, education years, and so on. But do you really need all of them? Height and arm length are highly correlated; so are income and education level. Much of the information in those thousands of features is redundant.
Dimensionality reduction is the art and science of finding a smaller set of variables that captures the essential structure of your data. It is one of the most powerful tools in the machine learning arsenal โ used for visualization, noise removal, computational speedup, and combating the infamous curse of dimensionality.
This chapter explores four major families of techniques:
- PCA (Principal Component Analysis) โ The classical linear method based on maximum variance / eigenvalue decomposition
- t-SNE (t-distributed Stochastic Neighbor Embedding) โ A non-linear method optimized for 2D/3D visualization
- UMAP (Uniform Manifold Approximation and Projection) โ A newer non-linear method that preserves both local and global structure
- LDA (Linear Discriminant Analysis) โ A supervised method that maximizes class separability
We'll also study the connection between PCA and SVD (Singular Value Decomposition), explore Kernel PCA for non-linear mappings, and contrast feature selection with feature extraction. Every concept will be accompanied by mathematical derivations from first principles, hand-computed examples, and Python code.
Historical Background
3.1 The Birth of PCA (1901โ1936)
Karl Pearson introduced the concept in 1901 as fitting planes to point clouds by minimizing orthogonal distances. Harold Hotelling formalized it in 1933 using eigenvalue decomposition of the covariance matrix, giving us the PCA we know today. The method became computationally practical only in the 1960s with the advent of digital computers and efficient algorithms for eigenvalue computation (QR algorithm by John Francis, 1961).
3.2 SVD and the Numerical Revolution (1960sโ1970s)
Gene Golub and William Kahan's 1965 algorithm for computing the Singular Value Decomposition made SVD numerically stable and efficient. The realization that PCA could be computed via SVD (without explicitly forming the covariance matrix) was transformative for large datasets.
3.3 Kernel Methods and Non-Linear Extensions (1998)
Bernhard Schรถlkopf, Alexander Smola, and Klaus-Robert Mรผller introduced Kernel PCA in 1998, extending PCA to non-linear manifolds using the kernel trick โ the same insight that powered SVMs.
3.4 t-SNE: A Visualization Revolution (2008)
Laurens van der Maaten and Geoffrey Hinton published t-SNE in 2008, building on Hinton and Roweis's earlier SNE (2002). The key innovation was using a Student's t-distribution in the low-dimensional space to solve the crowding problem. The method quickly became the gold standard for visualizing high-dimensional data like MNIST digits and gene expression data.
3.5 UMAP: Speed and Global Structure (2018)
Leland McInnes, John Healy, and James Melville published UMAP in 2018, grounding dimensionality reduction in the mathematics of algebraic topology (Riemannian geometry and fuzzy simplicial sets). UMAP offered dramatic speedups over t-SNE while better preserving global structure.
Conceptual Explanation
4.1 The Curse of Dimensionality
Richard Bellman coined the term "curse of dimensionality" in 1961. The key problems are:
Problem 1: Distance Concentration
In high dimensions, all points become approximately equidistant. If you sample points uniformly in a d-dimensional unit hypercube, the ratio of the maximum to minimum distance approaches 1 as d โ โ. This makes distance-based algorithms (KNN, K-Means, RBF kernels) unreliable.
Problem 2: Volume Explosion
A d-dimensional unit hypercube has 2d corners. To sample the space with the same density as 10 samples per dimension, you need 10d points. For d = 100 features, that's 10100 โ vastly more than atoms in the universe (~1080).
Problem 3: Empty Space
Most of the volume of a high-dimensional unit sphere concentrates near its surface. A d-dimensional sphere of radius r = 0.99 has volume (0.99)d relative to the full sphere. For d = 500, this is essentially 0. Almost all points are "on the boundary."
Problem 4: Statistical Sparsity
More features mean more parameters to estimate, requiring exponentially more data to achieve the same statistical reliability. Overfitting becomes inevitable.
4.2 Feature Selection vs Feature Extraction
| Aspect | Feature Selection | Feature Extraction |
|---|---|---|
| Approach | Choose a subset of original features | Create new features as combinations of originals |
| Interpretability | High โ original features are preserved | Low โ new features may be abstract |
| Methods | Filter (correlation), Wrapper (RFE), Embedded (Lasso) | PCA, t-SNE, UMAP, Autoencoders |
| Information Loss | Higher โ discarded features are gone | Lower โ all features contribute to new ones |
| When to Use | When interpretability is critical | When maximum information preservation is needed |
4.3 Overview of Methods
PCA โ Linear, Unsupervised
PCA finds orthogonal directions (principal components) that maximize variance. It's a linear transformation โ the new features are linear combinations of the old ones. Best for: preprocessing, noise removal, compression, and exploratory analysis when data lies on a linear subspace.
t-SNE โ Non-Linear, Unsupervised
t-SNE models pairwise similarities as probabilities in both high-D and low-D spaces, then minimizes the KL divergence between them. It excels at revealing cluster structure in 2D/3D visualizations. Limitations: non-parametric (can't transform new points), slow on large datasets, random initialization can give different results.
UMAP โ Non-Linear, Unsupervised
UMAP builds a weighted graph in high-D space (fuzzy simplicial set), then optimizes a low-D graph to match it. It preserves both local and global structure, runs much faster than t-SNE, and produces a parametric mapping that can transform new data.
LDA โ Linear, Supervised
LDA finds directions that maximize the ratio of between-class variance to within-class variance. Unlike PCA, it uses class labels. Maximum components = C-1 (where C is the number of classes).
| Method | Linear? | Supervised? | Preserves | Speed | Best For |
|---|---|---|---|---|---|
| PCA | โ Yes | โ No | Global variance | โก Fast | Preprocessing, compression |
| Kernel PCA | โ No | โ No | Non-linear structure | ๐ข Slow | Non-linear manifolds |
| t-SNE | โ No | โ No | Local neighborhoods | ๐ข Slow | 2D/3D visualization |
| UMAP | โ No | โ No | Local + global | โก Fast | Visualization + downstream |
| LDA | โ Yes | โ Yes | Class separability | โก Fast | Classification preprocessing |
Mathematical Foundation
5.1 Covariance Matrix
Given a data matrix X โ โnรd (n samples, d features), center the data by subtracting the mean: Xฬ = X - ฮผ, where ฮผ = (1/n)ฮฃixi.
Cjk = (1/(n-1)) ฮฃi=1n (xij - ฮผj)(xik - ฮผk)
The covariance matrix is symmetric positive semi-definite (all eigenvalues โฅ 0). Diagonal entries are variances; off-diagonal entries are covariances between feature pairs.
5.2 Eigenvalue Decomposition
Since C is symmetric, it admits an eigendecomposition:
C = V ฮ VT
where V = [v1 | v2 | ... | vd] (orthonormal eigenvectors)
and ฮ = diag(ฮป1, ฮป2, ..., ฮปd) with ฮป1 โฅ ฮป2 โฅ ... โฅ ฮปd โฅ 0
5.3 PCA Projection
To reduce from d to k dimensions, select the top k eigenvectors W = [v1 | ... | vk] โ โdรk:
Each row zi = WT xฬi is the projection of sample i onto k principal components
5.4 Explained Variance Ratio
Cumulative EVR for k components = ฮฃi=1k ฮปi / ฮฃj=1d ฮปj
The explained variance ratio tells us what fraction of the total variance in the data is captured by each principal component. A common heuristic is to choose k such that the cumulative EVR exceeds 95%.
5.5 SVD Connection to PCA
The centered data matrix Xฬ has a Singular Value Decomposition:
where U โ โnรn (left singular vectors), ฮฃ โ โnรd (singular values ฯi)
V โ โdรd (right singular vectors = eigenvectors of XฬTXฬ)
The connection: C = (1/(n-1))XฬTXฬ = (1/(n-1))VฮฃTฮฃVT. So ฮปi = ฯiยฒ / (n-1), and the right singular vectors V are the principal component directions. This means PCA can be computed via SVD without ever forming the dรd covariance matrix โ crucial when d is large.
5.6 Kernel PCA
For non-linear data, we map to a higher-dimensional space ฯ: โd โ โD (D >> d) and perform PCA there. Using the kernel trick K(xi, xj) = ฯ(xi)ยทฯ(xj), we work with the nรn kernel matrix instead of the DรD covariance matrix:
where Kฬ is the centered kernel matrix:
Kฬ = K - 1nK - K1n + 1nK1n
(1n = (1/n) matrix of all ones)
5.7 t-SNE Mathematics
Step 1: High-D Pairwise Similarities
For each pair of points, compute a conditional probability representing "how likely is xj to be a neighbor of xi under a Gaussian centered at xi":
Symmetrize: pij = (pj|i + pi|j) / 2n
The bandwidth ฯi is set so that the perplexity of the conditional distribution equals a user-specified value (typically 5โ50). Perplexity โ effective number of neighbors:
Step 2: Low-D Similarities with Student-t Distribution
In the low-dimensional space, similarities use a heavy-tailed Student's t-distribution (1 degree of freedom = Cauchy):
The heavy tail allows moderately distant points in high-D to be placed far apart in low-D without penalty, solving the crowding problem.
Step 3: Minimize KL Divergence
Gradient: โC/โyi = 4 ฮฃj (pij - qij)(yi - yj)(1 + ||yi - yj||ยฒ)-1
5.8 UMAP Mathematics (Simplified)
UMAP constructs a weighted graph in high-D using k-nearest neighbors with a fuzzy membership function:
where ฯi = distance to nearest neighbor (ensures local connectivity)
ฯi is calibrated to achieve log2(k) = ฮฃj w(xi, xj)
The fuzzy union of local membership functions is symmetrized: wij = wiโj + wjโi - wiโjยทwjโi. UMAP then minimizes a cross-entropy objective (not KL divergence), which allows repulsive forces between non-neighbors:
5.9 LDA Mathematics
SB = ฮฃc nc (ฮผc - ฮผ)(ฮผc - ฮผ)T (Between-class scatter)
SW = ฮฃc ฮฃiโc (xi - ฮผc)(xi - ฮผc)T (Within-class scatter)
Optimal: SW-1 SB w = ฮป w โ generalized eigenproblem
Formula Derivations
6.1 Deriving PCA from Maximum Variance
We want to find a unit vector w such that the variance of the projected data is maximized.
Step 1: Set up the projection
Given centered data xฬi, the projection of point i onto direction w is: zi = wTxฬi
Step 2: Write the variance of the projection
Var(z) = (1/(n-1)) ฮฃi ziยฒ = (1/(n-1)) ฮฃi (wTxฬi)ยฒ = (1/(n-1)) wT (ฮฃi xฬixฬiT) w = wT C w
Step 3: Constrained optimization
Maximize wTCw subject to wTw = 1.
Using Lagrange multipliers: L = wTCw - ฮป(wTw - 1)
Step 4: Take the derivative and set to zero
โL/โw = 2Cw - 2ฮปw = 0
โน Cw = ฮปw
Step 5: Interpret the result
This is the eigenvalue equation! w must be an eigenvector of C. The variance along this direction is: wTCw = wTฮปw = ฮป. So the variance equals the eigenvalue, and to maximize variance, we pick the eigenvector with the largest eigenvalue. โ
Step 6: Second component
The second principal component maximizes variance subject to being orthogonal to the first: w2Tw1 = 0. By an analogous derivation with two Lagrange constraints, w2 is the eigenvector with the second largest eigenvalue. This pattern continues for all k components.
6.2 Deriving the t-SNE Gradient
The cost function is C = ฮฃiโ j pij log(pij/qij).
Step 1: Rewrite using qij
Let dij = ||yi - yj||ยฒ and Z = ฮฃkโ l(1 + dkl)-1. Then qij = (1 + dij)-1 / Z.
Step 2: Differentiate
โC/โyi = 2 ฮฃjโ i (pij - qij) ยท 2(yi - yj) ยท (1 + dij)-1
= 4 ฮฃj (pij - qij)(yi - yj)(1 + ||yi - yj||ยฒ)-1
Interpretation: If pij > qij (points should be closer than they are), the gradient pulls yi toward yj. If pij < qij, it pushes them apart. The (1+dij)-1 term gives more weight to nearby points.
6.3 PCA via SVD Derivation
Start with the SVD: Xฬ = UฮฃVT. The covariance matrix:
C = (1/(n-1)) XฬTXฬ = (1/(n-1)) VฮฃTUTUฮฃVT = (1/(n-1)) Vฮฃ2VT
Comparing with C = VฮVT, we get ฮ = ฮฃ2/(n-1), i.e., eigenvalue ฮปi = ฯiยฒ/(n-1).
The projection: Z = XฬVk = UฮฃVTVk = Ukฮฃk (using orthogonality of V).
Key advantage: For n << d, computing the SVD of Xฬ (nรd) is cheaper than the eigendecomposition of C (dรd). Scikit-learn uses this approach.
Worked Numerical Examples
7.1 PCA by Hand: 3 Features โ 2 Components
Given Data (5 samples, 3 features)
| Sample | xโ | xโ | xโ |
|---|---|---|---|
| A | 2 | 4 | 1 |
| B | 4 | 6 | 3 |
| C | 6 | 8 | 5 |
| D | 8 | 10 | 7 |
| E | 10 | 12 | 9 |
Step 1: Compute means
ฮผโ = (2+4+6+8+10)/5 = 6, ฮผโ = (4+6+8+10+12)/5 = 8, ฮผโ = (1+3+5+7+9)/5 = 5
Step 2: Center the data (subtract means)
| Sample | xฬโ | xฬโ | xฬโ |
|---|---|---|---|
| A | -4 | -4 | -4 |
| B | -2 | -2 | -2 |
| C | 0 | 0 | 0 |
| D | 2 | 2 | 2 |
| E | 4 | 4 | 4 |
Step 3: Compute covariance matrix C = (1/4) Xฬแต Xฬ
XฬแตXฬ = [(-4)ยฒ+(-2)ยฒ+0+4+16, ...] Let's compute each entry:
ฮฃxฬโยฒ = 16+4+0+4+16 = 40 โ Cโโ = 40/4 = 10
ฮฃxฬโxฬโ = 16+4+0+4+16 = 40 โ Cโโ = 40/4 = 10
ฮฃxฬโxฬโ = 16+4+0+4+16 = 40 โ Cโโ = 40/4 = 10
Similarly: Cโโ = 10, Cโโ = 10, Cโโ = 10
Step 4: Find eigenvalues
det(C - ฮปI) = 0. Since all rows are identical (rank 1 matrix), we know there are two zero eigenvalues.
trace(C) = 30 = sum of eigenvalues. Since rank = 1, we have ฮปโ = 30, ฮปโ = 0, ฮปโ = 0.
Verification: Cยท[1,1,1]แต = [30,30,30]แต = 30ยท[1,1,1]แต โ
Step 5: Find eigenvectors
For ฮปโ = 30: vโ = [1,1,1]แต/โ3 = [0.577, 0.577, 0.577]แต
For ฮปโ = 0: any vector orthogonal to vโ, e.g., vโ = [1,-1,0]แต/โ2 = [0.707, -0.707, 0]แต
For ฮปโ = 0: vโ = [1,1,-2]แต/โ6 = [0.408, 0.408, -0.816]แต
Step 6: Project onto top 2 components
W = [vโ | vโ] (3ร2 matrix). Z = Xฬ ยท W:
z_A = [-4, -4, -4] ยท W = [-4(0.577+0.577+0.577), -4(0.707-0.707+0)] = [-6.928, 0]
z_B = [-3.464, 0], z_C = [0, 0], z_D = [3.464, 0], z_E = [6.928, 0]
Step 7: Explained variance
EVRโ = 30/30 = 100%. All variance is captured by PC1 (because the data lies on a perfect line in 3D).
7.2 Explained Variance โ A More Realistic Example
Suppose after computing PCA on a 10-feature dataset, the eigenvalues are:
ฮป = [5.8, 3.2, 1.5, 0.9, 0.6, 0.4, 0.3, 0.2, 0.07, 0.03]
Total = 13.0
EVR = [44.6%, 24.6%, 11.5%, 6.9%, 4.6%, 3.1%, 2.3%, 1.5%, 0.5%, 0.2%]
Cumulative = [44.6%, 69.2%, 80.8%, 87.7%, 92.3%, 95.4%, ...]
Decision: For 95% threshold, choose k = 6 components (reducing 10 โ 6). For 90%, choose k = 5.
Visual Diagrams
Flowcharts
Python Implementation from Scratch
10.1 PCA Class โ Full Eigendecomposition
import numpy as np
import matplotlib.pyplot as plt
class PCA:
"""
Principal Component Analysis from scratch.
Implements the eigendecomposition approach:
1. Center data (optionally standardize)
2. Compute covariance matrix
3. Eigendecompose
4. Project onto top-k eigenvectors
"""
def __init__(self, n_components=2, standardize=True):
"""
Parameters:
n_components: int or float
If int: number of components to keep
If float (0-1): keep enough for this explained variance ratio
standardize: bool
If True, scale features to unit variance (recommended)
"""
self.n_components = n_components
self.standardize = standardize
self.components_ = None # Principal component directions (k x d)
self.eigenvalues_ = None # All eigenvalues
self.explained_variance_ = None
self.explained_variance_ratio_ = None
self.mean_ = None
self.std_ = None
def fit(self, X):
"""Fit PCA on training data X (n_samples x n_features)."""
n, d = X.shape
# Step 1: Center (and optionally scale)
self.mean_ = np.mean(X, axis=0)
X_centered = X - self.mean_
if self.standardize:
self.std_ = np.std(X, axis=0, ddof=0)
# Avoid division by zero for constant features
self.std_[self.std_ == 0] = 1.0
X_centered = X_centered / self.std_
# Step 2: Covariance matrix (d x d)
# Using (n-1) for unbiased estimator
cov_matrix = (X_centered.T @ X_centered) / (n - 1)
# Step 3: Eigenvalue decomposition
# np.linalg.eigh is for symmetric matrices (faster, numerically stable)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# Sort in descending order (eigh returns ascending)
sorted_idx = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[sorted_idx]
eigenvectors = eigenvectors[:, sorted_idx]
# Clip tiny negative eigenvalues (numerical noise)
eigenvalues = np.maximum(eigenvalues, 0)
self.eigenvalues_ = eigenvalues
self.explained_variance_ = eigenvalues
total_var = np.sum(eigenvalues)
self.explained_variance_ratio_ = eigenvalues / total_var if total_var > 0 else eigenvalues
# Determine number of components
if isinstance(self.n_components, float) and 0 < self.n_components < 1:
cumulative = np.cumsum(self.explained_variance_ratio_)
k = np.searchsorted(cumulative, self.n_components) + 1
self.n_components = min(k, d)
self.n_components = min(self.n_components, d)
# Step 4: Select top-k eigenvectors
# components_ shape: (k, d) โ each row is a principal direction
self.components_ = eigenvectors[:, :self.n_components].T
return self
def transform(self, X):
"""Project data onto principal components."""
X_centered = X - self.mean_
if self.standardize:
X_centered = X_centered / self.std_
# Z = X_centered @ W where W = components_.T (d x k)
return X_centered @ self.components_.T
def fit_transform(self, X):
"""Fit and transform in one step."""
self.fit(X)
return self.transform(X)
def inverse_transform(self, Z):
"""Reconstruct data from reduced representation."""
X_approx = Z @ self.components_ # (n x k) @ (k x d) = (n x d)
if self.standardize:
X_approx = X_approx * self.std_
X_approx = X_approx + self.mean_
return X_approx
def plot_scree(self, figsize=(10, 4)):
"""Plot scree diagram and cumulative explained variance."""
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=figsize)
k = min(len(self.eigenvalues_), 20) # Show up to 20
x = range(1, k + 1)
# Scree plot
ax1.bar(x, self.eigenvalues_[:k], color='#059669', alpha=0.7)
ax1.plot(x, self.eigenvalues_[:k], 'o-', color='#0891b2', linewidth=2)
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Eigenvalue')
ax1.set_title('Scree Plot')
ax1.set_xticks(x)
ax1.grid(alpha=0.3)
# Cumulative variance
cumulative = np.cumsum(self.explained_variance_ratio_[:k]) * 100
ax2.plot(x, cumulative, 'o-', color='#059669', linewidth=2, markersize=6)
ax2.axhline(y=95, color='#f43f5e', linestyle='--', label='95% threshold')
ax2.fill_between(x, cumulative, alpha=0.1, color='#059669')
ax2.set_xlabel('Number of Components')
ax2.set_ylabel('Cumulative Explained Variance (%)')
ax2.set_title('Cumulative Explained Variance')
ax2.set_xticks(x)
ax2.legend()
ax2.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# ============== DEMO ==============
if __name__ == "__main__":
# Generate sample data: 3 clusters in 5D
np.random.seed(42)
n = 200
# 3 clusters with different centers
c1 = np.random.randn(n, 5) + np.array([3, 3, 3, 0, 0])
c2 = np.random.randn(n, 5) + np.array([-3, -3, -3, 0, 0])
c3 = np.random.randn(n, 5) + np.array([0, 0, 0, 5, 5])
X = np.vstack([c1, c2, c3])
labels = np.array([0]*n + [1]*n + [2]*n)
# Apply PCA
pca = PCA(n_components=2, standardize=True)
Z = pca.fit_transform(X)
print(f"Original shape: {X.shape}")
print(f"Reduced shape: {Z.shape}")
print(f"Explained variance ratios: {pca.explained_variance_ratio_[:5].round(4)}")
print(f"Cumulative (2 components): {sum(pca.explained_variance_ratio_[:2]):.4f}")
# Plot
pca.plot_scree()
plt.figure(figsize=(8, 6))
colors = ['#059669', '#0891b2', '#f59e0b']
for i, c in enumerate(colors):
mask = labels == i
plt.scatter(Z[mask, 0], Z[mask, 1], c=c, label=f'Cluster {i}', alpha=0.6, s=30)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA: 5D โ 2D Projection')
plt.legend()
plt.grid(alpha=0.3)
plt.show()
10.2 PCA via SVD (Numerically Superior)
import numpy as np
class PCA_SVD:
"""PCA via Singular Value Decomposition โ avoids forming covariance matrix."""
def __init__(self, n_components=2):
self.n_components = n_components
def fit(self, X):
n, d = X.shape
self.mean_ = X.mean(axis=0)
X_centered = X - self.mean_
# Truncated SVD: only compute top-k singular values/vectors
# Full SVD: Xฬ = U ฮฃ Vแต
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
# Principal directions are rows of Vt
self.components_ = Vt[:self.n_components] # (k x d)
# Eigenvalues of covariance = ฯยฒ/(n-1)
self.explained_variance_ = (S ** 2) / (n - 1)
total = self.explained_variance_.sum()
self.explained_variance_ratio_ = self.explained_variance_ / total
return self
def transform(self, X):
return (X - self.mean_) @ self.components_.T
def fit_transform(self, X):
self.fit(X)
return self.transform(X)
# Demo
np.random.seed(42)
X = np.random.randn(500, 50) # 500 samples, 50 features
pca = PCA_SVD(n_components=5)
Z = pca.fit_transform(X)
print(f"Reduced: {X.shape} โ {Z.shape}")
print(f"Top 5 EVR: {pca.explained_variance_ratio_[:5].round(4)}")
10.3 t-SNE Simplified Implementation
import numpy as np
def compute_pairwise_distances(X):
"""Compute squared Euclidean distances between all pairs."""
sum_sq = np.sum(X ** 2, axis=1)
# ||xi - xj||ยฒ = ||xi||ยฒ + ||xj||ยฒ - 2 xiยทxj
D = sum_sq[:, np.newaxis] + sum_sq[np.newaxis, :] - 2 * X @ X.T
return np.maximum(D, 0) # Clip numerical negatives
def compute_perplexity_probabilities(D, target_perplexity=30.0, tol=1e-5, max_iter=50):
"""
For each point, find ฯแตข via binary search so that
Perplexity(Pแตข) = target_perplexity.
Returns symmetrized joint probabilities P.
"""
n = D.shape[0]
P = np.zeros((n, n))
target_entropy = np.log2(target_perplexity)
for i in range(n):
# Binary search for ฯแตข (actually searching over ฮฒ = 1/(2ฯยฒ))
beta_min, beta_max = -np.inf, np.inf
beta = 1.0
Di = D[i].copy()
Di[i] = np.inf # Exclude self
for _ in range(max_iter):
# Compute conditional probabilities p(j|i)
exp_D = np.exp(-Di * beta)
exp_D[i] = 0
sum_exp = np.sum(exp_D)
if sum_exp == 0:
P_i = np.zeros(n)
else:
P_i = exp_D / sum_exp
# Compute entropy
H = -np.sum(P_i[P_i > 0] * np.log2(P_i[P_i > 0]))
# Check if we hit target
diff = H - target_entropy
if abs(diff) < tol:
break
# Update ฮฒ via binary search
if diff > 0: # Entropy too high โ increase ฮฒ (decrease ฯ)
beta_min = beta
beta = beta * 2 if beta_max == np.inf else (beta + beta_max) / 2
else:
beta_max = beta
beta = beta / 2 if beta_min == -np.inf else (beta + beta_min) / 2
P[i] = P_i
# Symmetrize: p_ij = (p_j|i + p_i|j) / (2n)
P = (P + P.T) / (2 * n)
P = np.maximum(P, 1e-12) # Avoid log(0)
return P
def tsne(X, n_components=2, perplexity=30.0, n_iter=500, lr=200.0, momentum=0.8):
"""
Simplified t-SNE implementation.
Parameters:
X: ndarray (n_samples, n_features)
n_components: output dimensions (2 or 3)
perplexity: effective number of neighbors (5-50)
n_iter: gradient descent iterations
lr: learning rate
momentum: momentum coefficient
"""
n = X.shape[0]
# Step 1: Pairwise distances in high-D
D_high = compute_pairwise_distances(X)
# Step 2: Compute high-D probabilities with target perplexity
P = compute_perplexity_probabilities(D_high, target_perplexity=perplexity)
# Early exaggeration (multiply P by 4 for first 100 iterations)
P_exag = P * 4.0
# Step 3: Initialize low-D embedding
Y = np.random.randn(n, n_components) * 1e-4
velocity = np.zeros_like(Y)
for t in range(n_iter):
# Use exaggerated P for early iterations
P_use = P_exag if t < 100 else P
# Pairwise distances in low-D
D_low = compute_pairwise_distances(Y)
# Compute q_ij using Student-t distribution
Q_num = 1.0 / (1.0 + D_low) # (1 + ||yi - yj||ยฒ)โปยน
np.fill_diagonal(Q_num, 0)
Q = Q_num / np.sum(Q_num)
Q = np.maximum(Q, 1e-12)
# Compute gradient
# โC/โyแตข = 4 ฮฃโฑผ (pแตขโฑผ - qแตขโฑผ)(yแตข - yโฑผ)(1 + ||yแตข - yโฑผ||ยฒ)โปยน
PQ_diff = P_use - Q
# Efficient gradient computation
grad = np.zeros_like(Y)
for i in range(n):
diff = Y[i] - Y # (n, n_components)
grad[i] = 4 * np.sum((PQ_diff[i, :, np.newaxis]) * diff * Q_num[i, :, np.newaxis], axis=0)
# Update with momentum
velocity = momentum * velocity - lr * grad
Y += velocity
# Re-center
Y -= Y.mean(axis=0)
if (t + 1) % 100 == 0:
cost = np.sum(P_use * np.log(P_use / Q))
print(f" Iteration {t+1}/{n_iter}, KL divergence: {cost:.4f}")
return Y
# Demo
if __name__ == "__main__":
np.random.seed(42)
# Create 3 clusters in 20D
X = np.vstack([
np.random.randn(50, 20) + np.array([5]*10 + [0]*10),
np.random.randn(50, 20) + np.array([0]*10 + [5]*10),
np.random.randn(50, 20) + np.array([-5]*10 + [-5]*10),
])
print("Running t-SNE (this may take a moment)...")
Y = tsne(X, n_components=2, perplexity=15, n_iter=500, lr=100)
print(f"Embedding shape: {Y.shape}")
TensorFlow Implementation
11.1 PCA with TensorFlow (GPU-Accelerated)
import tensorflow as tf
import numpy as np
class TF_PCA:
"""GPU-accelerated PCA using TensorFlow's SVD."""
def __init__(self, n_components=2):
self.n_components = n_components
def fit_transform(self, X_np):
"""
Fit PCA and return transformed data.
Uses TF's SVD for GPU acceleration on large matrices.
"""
X = tf.constant(X_np, dtype=tf.float64)
# Center data
mean = tf.reduce_mean(X, axis=0)
X_centered = X - mean
# SVD: Xฬ = U ฮฃ Vแต
S, U, V = tf.linalg.svd(X_centered, full_matrices=False)
# Project onto top k right singular vectors
V_k = V[:, :self.n_components] # (d x k)
Z = tf.matmul(X_centered, V_k) # (n x k)
# Explained variance
n = tf.cast(tf.shape(X)[0], tf.float64)
eigenvalues = S ** 2 / (n - 1)
total = tf.reduce_sum(eigenvalues)
evr = eigenvalues / total
self.explained_variance_ratio_ = evr.numpy()
self.components_ = tf.transpose(V_k).numpy()
return Z.numpy()
# Demo
np.random.seed(42)
X = np.random.randn(1000, 100)
pca = TF_PCA(n_components=10)
Z = pca.fit_transform(X)
print(f"Shape: {X.shape} โ {Z.shape}")
print(f"Top 5 EVR: {pca.explained_variance_ratio_[:5].round(4)}")
11.2 Autoencoder for Non-Linear Dimensionality Reduction
import tensorflow as tf
from tensorflow import keras
import numpy as np
def build_autoencoder(input_dim, encoding_dim=2, hidden_dims=[128, 64]):
"""
Build a symmetric autoencoder.
Encoder: input_dim โ hidden1 โ hidden2 โ encoding_dim
Decoder: encoding_dim โ hidden2 โ hidden1 โ input_dim
"""
# Encoder
encoder_input = keras.Input(shape=(input_dim,))
x = encoder_input
for dim in hidden_dims:
x = keras.layers.Dense(dim, activation='relu')(x)
x = keras.layers.BatchNormalization()(x)
encoded = keras.layers.Dense(encoding_dim, activation='linear', name='bottleneck')(x)
# Decoder (mirror of encoder)
x = encoded
for dim in reversed(hidden_dims):
x = keras.layers.Dense(dim, activation='relu')(x)
x = keras.layers.BatchNormalization()(x)
decoded = keras.layers.Dense(input_dim, activation='linear')(x)
# Models
autoencoder = keras.Model(encoder_input, decoded, name='autoencoder')
encoder = keras.Model(encoder_input, encoded, name='encoder')
autoencoder.compile(
optimizer=keras.optimizers.Adam(learning_rate=1e-3),
loss='mse'
)
return autoencoder, encoder
# Example: MNIST dimensionality reduction
# Load data
(X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train = X_train.reshape(-1, 784).astype('float32') / 255.0
X_test = X_test.reshape(-1, 784).astype('float32') / 255.0
# Build and train
autoencoder, encoder = build_autoencoder(784, encoding_dim=2)
autoencoder.summary()
history = autoencoder.fit(
X_train, X_train,
epochs=30,
batch_size=256,
validation_data=(X_test, X_test),
verbose=1
)
# Get 2D embeddings
Z_train = encoder.predict(X_train)
print(f"Embedding shape: {Z_train.shape}") # (60000, 2)
# Visualize (compare with PCA and t-SNE!)
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
scatter = plt.scatter(Z_train[:5000, 0], Z_train[:5000, 1],
c=y_train[:5000], cmap='tab10', s=3, alpha=0.5)
plt.colorbar(scatter, label='Digit')
plt.title('Autoencoder: MNIST 784D โ 2D')
plt.xlabel('Latent Dim 1')
plt.ylabel('Latent Dim 2')
plt.show()
Scikit-Learn Implementation
12.1 Complete PCA Pipeline with Visualization
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_wine, load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
# ============================================
# PART 1: PCA on Wine Dataset
# ============================================
wine = load_wine()
X, y = wine.data, wine.target
print(f"Wine dataset: {X.shape} โ {len(wine.feature_names)} features")
# Pipeline: Scale โ PCA
pipe_pca = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=2))
])
X_pca = pipe_pca.fit_transform(X)
pca = pipe_pca.named_steps['pca']
print(f"\nExplained Variance Ratio: {pca.explained_variance_ratio_.round(4)}")
print(f"Cumulative: {np.cumsum(pca.explained_variance_ratio_).round(4)}")
# Scree Plot
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Full PCA for scree plot
pca_full = PCA().fit(StandardScaler().fit_transform(X))
axes[0].bar(range(1, 14), pca_full.explained_variance_ratio_, color='#059669', alpha=0.7)
axes[0].plot(range(1, 14), pca_full.explained_variance_ratio_, 'o-', color='#0891b2')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Scree Plot')
# Cumulative variance
axes[1].plot(range(1, 14), np.cumsum(pca_full.explained_variance_ratio_),
'o-', color='#059669', linewidth=2)
axes[1].axhline(y=0.95, color='red', linestyle='--', label='95%')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Variance')
axes[1].legend()
# 2D Projection
colors = ['#059669', '#0891b2', '#f59e0b']
for i in range(3):
mask = y == i
axes[2].scatter(X_pca[mask, 0], X_pca[mask, 1],
c=colors[i], label=wine.target_names[i], s=50, alpha=0.7)
axes[2].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
axes[2].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
axes[2].set_title('Wine: PCA 2D Projection')
axes[2].legend()
plt.tight_layout()
plt.show()
# Biplot: show feature contributions
def biplot(pca_model, X_pca, feature_names, y, ax):
"""Create a biplot showing both data points and feature loadings."""
colors = ['#059669', '#0891b2', '#f59e0b']
for i, c in enumerate(colors):
mask = y == i
ax.scatter(X_pca[mask, 0], X_pca[mask, 1], c=c, alpha=0.4, s=30)
# Feature loadings (arrows)
loadings = pca_model.components_.T # (d x k)
scale = 3 # Arrow scale factor
for i, name in enumerate(feature_names):
ax.arrow(0, 0, loadings[i, 0]*scale, loadings[i, 1]*scale,
color='red', alpha=0.7, head_width=0.05)
ax.text(loadings[i, 0]*scale*1.1, loadings[i, 1]*scale*1.1,
name, fontsize=7, color='red')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('Biplot')
ax.grid(alpha=0.3)
fig, ax = plt.subplots(figsize=(10, 8))
biplot(pca, X_pca, wine.feature_names, y, ax)
plt.tight_layout()
plt.show()
# ============================================
# PART 2: PCA for Classification Improvement
# ============================================
print("\n--- PCA + SVM Classification ---")
for k in [2, 4, 6, 8, 10, 13]:
pipe = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=k)),
('svm', SVC(kernel='rbf', C=10))
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f" k={k:2d}: accuracy = {scores.mean():.4f} ยฑ {scores.std():.4f}")
# ============================================
# PART 3: Kernel PCA
# ============================================
from sklearn.datasets import make_moons
X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Linear PCA
pca_linear = PCA(n_components=1)
X_lin = pca_linear.fit_transform(X_moons)
axes[0].scatter(X_lin, np.zeros_like(X_lin), c=y_moons, cmap='RdYlGn', s=30)
axes[0].set_title('Linear PCA (1D)')
# Kernel PCA with RBF
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X_moons)
axes[1].scatter(X_kpca[:, 0], X_kpca[:, 1], c=y_moons, cmap='RdYlGn', s=30)
axes[1].set_title('Kernel PCA (RBF, ฮณ=10)')
# Original data
axes[2].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='RdYlGn', s=30)
axes[2].set_title('Original Moon Data')
plt.tight_layout()
plt.show()
12.2 t-SNE Pipeline with MNIST
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import time
# Load digits dataset (8x8 images, 64 features, 10 classes)
digits = load_digits()
X, y = digits.data, digits.target
print(f"Digits dataset: {X.shape}, {len(np.unique(y))} classes")
# Scale data
X_scaled = StandardScaler().fit_transform(X)
# ============================================
# Strategy: PCA first (64โ30), then t-SNE (30โ2)
# This is the recommended approach for large datasets!
# ============================================
pca_pre = PCA(n_components=30)
X_pca30 = pca_pre.fit_transform(X_scaled)
print(f"PCA 64โ30: {sum(pca_pre.explained_variance_ratio_):.2%} variance retained")
# t-SNE with different perplexities
perplexities = [5, 15, 30, 50]
fig, axes = plt.subplots(1, 4, figsize=(24, 5))
for ax, perp in zip(axes, perplexities):
start = time.time()
tsne = TSNE(n_components=2, perplexity=perp, random_state=42,
n_iter=1000, learning_rate='auto', init='pca')
X_tsne = tsne.fit_transform(X_pca30)
elapsed = time.time() - start
scatter = ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=8, alpha=0.7)
ax.set_title(f'Perplexity = {perp}\n(KL = {tsne.kl_divergence_:.3f}, {elapsed:.1f}s)')
ax.set_xticks([])
ax.set_yticks([])
plt.suptitle('t-SNE: Effect of Perplexity on MNIST Digits', fontsize=14, fontweight='bold')
plt.colorbar(scatter, ax=axes, label='Digit', shrink=0.8)
plt.tight_layout()
plt.show()
# ============================================
# Comparison: PCA vs t-SNE
# ============================================
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# PCA
X_pca2 = PCA(n_components=2).fit_transform(X_scaled)
axes[0].scatter(X_pca2[:, 0], X_pca2[:, 1], c=y, cmap='tab10', s=8, alpha=0.5)
axes[0].set_title('PCA (2D)')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42,
n_iter=1000, learning_rate='auto', init='pca')
X_tsne2 = tsne.fit_transform(X_pca30)
scatter = axes[1].scatter(X_tsne2[:, 0], X_tsne2[:, 1], c=y, cmap='tab10', s=8, alpha=0.5)
axes[1].set_title('t-SNE (2D)')
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')
plt.colorbar(scatter, ax=axes, label='Digit')
plt.suptitle('PCA vs t-SNE: MNIST Digits', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
12.3 UMAP Pipeline
# pip install umap-learn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import time
try:
import umap
digits = load_digits()
X, y = digits.data, digits.target
X_scaled = StandardScaler().fit_transform(X)
# UMAP with different n_neighbors
n_neighbors_list = [5, 15, 50, 200]
fig, axes = plt.subplots(1, 4, figsize=(24, 5))
for ax, nn in zip(axes, n_neighbors_list):
start = time.time()
reducer = umap.UMAP(
n_components=2,
n_neighbors=nn,
min_dist=0.1,
metric='euclidean',
random_state=42
)
X_umap = reducer.fit_transform(X_scaled)
elapsed = time.time() - start
scatter = ax.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=8, alpha=0.7)
ax.set_title(f'n_neighbors = {nn}\n({elapsed:.1f}s)')
ax.set_xticks([])
ax.set_yticks([])
plt.suptitle('UMAP: Effect of n_neighbors on MNIST', fontsize=14, fontweight='bold')
plt.colorbar(scatter, ax=axes, label='Digit', shrink=0.8)
plt.tight_layout()
plt.show()
# ============================================
# Grand Comparison: PCA vs t-SNE vs UMAP
# ============================================
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# PCA
t0 = time.time()
X_pca = PCA(n_components=2).fit_transform(X_scaled)
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=5, alpha=0.5)
axes[0].set_title(f'PCA ({time.time()-t0:.3f}s)')
# t-SNE
t0 = time.time()
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42, init='pca').fit_transform(X_scaled)
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=5, alpha=0.5)
axes[1].set_title(f't-SNE ({time.time()-t0:.1f}s)')
# UMAP
t0 = time.time()
X_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X_scaled)
scatter = axes[2].scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=5, alpha=0.5)
axes[2].set_title(f'UMAP ({time.time()-t0:.1f}s)')
plt.colorbar(scatter, ax=axes, label='Digit')
plt.suptitle('Grand Comparison: PCA vs t-SNE vs UMAP', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
except ImportError:
print("Install umap-learn: pip install umap-learn")
12.4 LDA Implementation
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
import numpy as np
wine = load_wine()
X, y = wine.data, wine.target
X_scaled = StandardScaler().fit_transform(X)
# LDA: max components = C-1 = 2
lda = LDA(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
# Compare PCA vs LDA
from sklearn.decomposition import PCA
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
colors = ['#059669', '#0891b2', '#f59e0b']
names = wine.target_names
for ax, X_proj, title in [(axes[0], PCA(2).fit_transform(X_scaled), 'PCA'),
(axes[1], X_lda, 'LDA')]:
for i, (c, name) in enumerate(zip(colors, names)):
mask = y == i
ax.scatter(X_proj[mask, 0], X_proj[mask, 1], c=c, label=name, s=50, alpha=0.7)
ax.set_title(f'{title} Projection (Wine Dataset)')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
# Classification accuracy comparison
print("\n--- PCA vs LDA for Classification ---")
for method_name, reducer in [('PCA(2)', PCA(2)), ('LDA(2)', LDA(n_components=2))]:
pipe = Pipeline([
('scaler', StandardScaler()),
('reduce', reducer),
('svm', SVC(kernel='rbf'))
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f" {method_name}: {scores.mean():.4f} ยฑ {scores.std():.4f}")
Indian Case Studies
๐ฎ๐ณ Case Study 1: Aadhaar Biometric Dimensionality Reduction
Challenge
India's Aadhaar system stores biometrics for 1.4 billion people. Each fingerprint scan produces a high-dimensional minutiae template (~200+ features per finger ร 10 fingers). Iris scans generate IrisCodes of 2,048 bits per eye. Storing and matching these at scale is computationally prohibitive.
Solution: Dimensionality Reduction in Biometric Matching
UIDAI employs several dimensionality reduction strategies:
- Fingerprint templates: ISO/IEC 19794-2 minutiae extraction reduces raw fingerprint images (typically 300ร400 pixels = 120K dimensions) to ~60-100 minutiae points (each with x, y, angle = 3 features), achieving roughly 400:1 reduction
- IrisCode: Daugman's algorithm projects iris texture patterns onto a 2,048-bit binary code using Gabor wavelets โ a form of feature extraction that reduces ~300K pixel values to 256 bytes
- Multi-modal fusion: PCA is used to combine fingerprint and iris features into a compact representation for 1:N identification (searching the entire database)
Scale
Over 100 billion biometric authentications processed. PCA-based indexing reduces search from 1.4B full template comparisons to ~100K candidates per query. Average authentication time: under 200ms.
Python Simulation
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Simulate biometric data: 10,000 people, 200 features per person
np.random.seed(42)
n_people = 10000
n_features = 200 # Minutiae + IrisCode features
# Generate clustered data (each person = cluster of templates)
n_classes = 100 # Simulate 100 unique identities
samples_per_class = n_people // n_classes
X_biometric = []
y_identity = []
for person_id in range(n_classes):
center = np.random.randn(n_features) * 3
templates = center + np.random.randn(samples_per_class, n_features) * 0.5
X_biometric.append(templates)
y_identity.extend([person_id] * samples_per_class)
X = np.vstack(X_biometric)
y = np.array(y_identity)
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Without PCA: KNN on full 200D
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
knn_full = KNeighborsClassifier(n_neighbors=3)
knn_full.fit(X_train_s, y_train)
acc_full = accuracy_score(y_test, knn_full.predict(X_test_s))
# With PCA: reduce to 20D
pca = PCA(n_components=20)
X_train_pca = pca.fit_transform(X_train_s)
X_test_pca = pca.transform(X_test_s)
knn_pca = KNeighborsClassifier(n_neighbors=3)
knn_pca.fit(X_train_pca, y_train)
acc_pca = accuracy_score(y_test, knn_pca.predict(X_test_pca))
print("=== Aadhaar Biometric Matching Simulation ===")
print(f"Full features (200D): accuracy = {acc_full:.4f}")
print(f"PCA reduced ( 20D): accuracy = {acc_pca:.4f}")
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.2%}")
print(f"Dimensionality reduction: {200/20:.0f}x")
print(f"Search speedup (KNN): ~{(200/20):.0f}x faster distance computations")
๐ฎ๐ณ Case Study 2: ISRO Satellite Image Compression
Challenge
ISRO's Cartosat and Resourcesat satellites generate hyperspectral images with 200+ spectral bands per pixel. A single scene can be 10,000ร10,000 pixels ร 200 bands = 20 billion values. Downlinking this data to ground stations (bandwidth-limited) requires massive compression.
Solution: PCA-Based Spectral Compression
ISRO uses PCA to reduce spectral dimensionality:
- 200+ spectral bands โ 10-15 principal components (retaining 99%+ variance)
- Each pixel: 200 values โ 15 values = 13:1 compression
- Combined with spatial compression (JPEG2000), achieves 100:1 total compression
- Enables near-real-time crop monitoring under the Fasal programme
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Simulate hyperspectral image: 100x100 pixels, 200 bands
np.random.seed(42)
height, width, bands = 100, 100, 200
image = np.random.randn(height * width, bands)
# Add correlated structure (simulating real spectral correlations)
basis = np.random.randn(10, bands) # 10 underlying spectral patterns
coefficients = np.random.randn(height * width, 10)
image = coefficients @ basis + np.random.randn(height * width, bands) * 0.1
# Apply PCA for spectral compression
pca = PCA(n_components=15)
compressed = pca.fit_transform(image)
reconstructed = pca.inverse_transform(compressed)
# Compute reconstruction quality
mse = np.mean((image - reconstructed) ** 2)
signal_power = np.mean(image ** 2)
psnr = 10 * np.log10(signal_power / mse) if mse > 0 else float('inf')
print("=== ISRO Spectral Compression Simulation ===")
print(f"Original: {height}x{width}x{bands} = {height*width*bands:,} values")
print(f"Compressed: {height}x{width}x{pca.n_components_} = {height*width*pca.n_components_:,} values")
print(f"Compression ratio: {bands/pca.n_components_:.1f}:1")
print(f"Variance retained: {sum(pca.explained_variance_ratio_):.4%}")
print(f"Reconstruction PSNR: {psnr:.1f} dB")
๐ฎ๐ณ Case Study 3: Flipkart Product Embedding
Application
Flipkart manages a catalog of 150+ million products. Product descriptions are embedded into 768-dimensional vectors using BERT. For real-time similar-product recommendations, they use PCA to reduce embeddings from 768D to 128D, followed by approximate nearest neighbor search (FAISS). This reduces memory from 4.5 GB to 760 MB for 10M products while maintaining 95%+ recommendation quality.
Global Case Studies
๐ Case Study 1: MNIST Visualization with t-SNE
The Iconic Visualization
The t-SNE visualization of MNIST handwritten digits (784 pixels โ 2D) is perhaps the most famous use of dimensionality reduction in ML. Published by van der Maaten in 2008, it showed for the first time that t-SNE could separate all 10 digit classes into distinct, tight clusters โ something PCA completely fails to do.
Key insights from the MNIST t-SNE plot:
- Digits 0, 1, 6 form well-separated clusters (visually distinct shapes)
- Digits 3, 5, 8 often overlap (similar curved strokes)
- Digits 4, 7, 9 partially overlap (shared angular features)
- Within-cluster structure often reveals writing styles (e.g., 7 with/without crossbar)
๐ Case Study 2: Genomics Data Reduction
The Challenge
A typical gene expression study measures ~20,000 genes across a few hundred patient samples. This is the extreme n << d regime where the curse of dimensionality is most severe.
Application: Cancer Subtype Discovery
The Cancer Genome Atlas (TCGA) project used PCA and t-SNE to discover cancer subtypes:
- Breast cancer: PCA of 20,000 gene expressions revealed 4 molecular subtypes (Luminal A, Luminal B, HER2+, Basal-like) that had different prognoses and treatment responses
- Single-cell RNA-seq: UMAP is now the standard for visualizing single-cell data (10,000-50,000 genes ร 100,000+ cells). Tools like Scanpy and Seurat use UMAP by default
๐ Case Study 3: Google โ Embedding Spaces at Scale
Application
Google's recommendation systems embed users and items into high-dimensional spaces (typically 64-512 dimensions). For serving recommendations at 8.5 billion search queries/day, they use:
- Product Quantization (PQ): A form of dimensionality reduction that splits 128D vectors into 8 sub-vectors of 16D each, quantizing each sub-vector to 256 centroids. This compresses each embedding from 512 bytes to 8 bytes
- ScaNN (Scalable Nearest Neighbors): Uses PCA as a preprocessing step for approximate nearest neighbor search across billions of embeddings
๐ Case Study 4: Tesla Autopilot โ Sensor Fusion
Application
Tesla's 8 cameras produce ~1.2 billion pixels per second. The vision backbone reduces each camera's 1280ร960ร3 image (~3.7M dimensions) to a 256-dimensional feature vector using a neural network (a learned form of dimensionality reduction). These compact representations are fused across cameras and time for 3D scene understanding.
Startup Applications
๐งฌ Elucidata (India)
Uses UMAP on multi-omics data (genomics + proteomics + metabolomics) to help pharmaceutical companies identify drug targets. Their platform reduces 50,000+ molecular features to interpretable 2D visualizations that biologists can explore interactively.
๐จ Runway ML (US)
Uses PCA and t-SNE in their "Concept Canvas" to let artists explore the latent space of generative models. Users can navigate the reduced space to find novel image generations. The 512D StyleGAN latent space is projected to 2D for interactive exploration.
๐ก๏ธ Darktrace (UK)
Their Enterprise Immune System uses PCA on network traffic features (100+ dimensions including packet sizes, timing patterns, protocols) to detect anomalies. Reducing to 10 principal components makes the real-time computation feasible across 10,000+ devices per enterprise.
๐ SigTuple (India)
Uses dimensionality reduction on medical image features (blood cell morphology โ 200+ shape/color features) to assist pathologists. PCA reduces feature vectors before classification, enabling real-time analysis on low-power edge devices in rural health centers.
Government Applications
๐ฎ๐ณ DRDO โ Radar Signal Processing
DRDO's radar systems use PCA for clutter suppression. Radar returns contain signal (target) + clutter (ground, weather) + noise. PCA decomposes returns into principal components: the first few correspond to strong clutter, which can be removed. This is equivalent to applying a subspace projection filter, improving target detection by 10-15 dB.
๐บ๐ธ NSA โ Signal Intelligence
Communication signals contain thousands of features (frequency components, timing patterns, modulation characteristics). PCA and LDA are used to classify signal types and identify emitters. Dimensionality reduction enables real-time processing of intercepted communications.
๐ฎ๐ณ Census of India โ Demographic Analysis
India's census collects 100+ socioeconomic variables per household across 640+ districts. PCA identifies the key factors driving demographic variation: typically 5-7 principal components capture 80% of variance, corresponding to interpretable factors like urbanization, literacy, agricultural dependency, and industrial development.
๐ช๐บ ESA/Copernicus โ Earth Observation
The European Space Agency's Sentinel-2 satellites capture 13 spectral bands. PCA reduces these to 3 components for false-color visualization, enabling rapid assessment of vegetation health, water quality, and urban sprawl across the European Union.
Industry Applications
๐ Pharma โ Drug Discovery
Molecular descriptors for drug candidates often number in the thousands (topological indices, electronic properties, binding affinities). PCA reduces these to 10-20 "chemical space" coordinates. Companies like AstraZeneca use this to navigate vast libraries of candidate molecules efficiently.
๐ญ Manufacturing โ Process Control
A semiconductor fab monitors 500+ process variables per wafer. PCA-based multivariate statistical process control (MSPC) reduces these to ~20 principal components, detecting out-of-control conditions as deviations in the reduced space. This catches defects 3-5x faster than monitoring individual variables.
๐ต Spotify โ Audio Features
Spotify extracts 128 audio features from each song (tempo, key, energy, spectral features, MFCCs). PCA and autoencoders compress these into 32-dimensional "taste profiles" for users and songs. The Discover Weekly algorithm matches users to songs by proximity in this reduced space.
๐ฐ Finance โ Risk Factor Models
Portfolio managers track thousands of assets. PCA on the correlation matrix of returns identifies the "risk factors" driving market movements. The first principal component typically captures 30-40% of variance (the "market factor"), followed by sector rotations, value/growth, and size factors โ the foundation of factor investing.
Mini Projects
๐ฏ Mini Project 1: Eigenfaces โ Face Recognition with PCA
Build a face recognition system using PCA (the "Eigenfaces" method, Turk & Pentland, 1991).
"""
Mini Project 1: Eigenfaces for Face Recognition
=================================================
Uses PCA to reduce face image dimensionality, then KNN for classification.
The principal components (eigenfaces) capture the key modes of variation
in human faces: lighting, expression, identity.
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Load Labeled Faces in the Wild dataset
# Only keep people with โฅ 50 images for reliable training
print("Loading LFW dataset (this may download ~200MB on first run)...")
lfw = fetch_lfw_people(min_faces_per_person=50, resize=0.5)
X, y = lfw.data, lfw.target
names = lfw.target_names
n_samples, n_features = X.shape
h, w = lfw.images.shape[1], lfw.images.shape[2]
print(f"Dataset: {n_samples} images, {n_features} pixels each ({h}x{w})")
print(f"Classes: {len(names)} people: {names}")
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
# Step 1: PCA (finding Eigenfaces)
n_components = 100 # Keep 100 principal components
pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True, random_state=42)
pca.fit(X_train)
# Transform data
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
print(f"\nDimensionality: {n_features} โ {n_components}")
print(f"Variance explained: {sum(pca.explained_variance_ratio_):.2%}")
# Visualize top eigenfaces
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i, ax in enumerate(axes.flat):
eigenface = pca.components_[i].reshape(h, w)
ax.imshow(eigenface, cmap='RdBu_r')
ax.set_title(f'PC {i+1}\n({pca.explained_variance_ratio_[i]:.1%})')
ax.axis('off')
plt.suptitle('Top 10 Eigenfaces', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Step 2: SVM Classification on PCA-reduced features
svm = SVC(kernel='rbf', C=10, gamma='scale', class_weight='balanced')
svm.fit(X_train_pca, y_train)
y_pred = svm.predict(X_test_pca)
print("\n" + classification_report(y_test, y_pred, target_names=names))
# Step 3: Reconstruction quality at different k
fig, axes = plt.subplots(2, 6, figsize=(18, 6))
sample = X_test[0] # Original face
axes[0, 0].imshow(sample.reshape(h, w), cmap='gray')
axes[0, 0].set_title('Original')
axes[0, 0].axis('off')
for i, k in enumerate([5, 10, 25, 50, 100]):
pca_k = PCA(n_components=k).fit(X_train)
reconstructed = pca_k.inverse_transform(pca_k.transform([sample]))[0]
axes[0, i+1].imshow(reconstructed.reshape(h, w), cmap='gray')
axes[0, i+1].set_title(f'k={k} ({sum(pca_k.explained_variance_ratio_):.0%})')
axes[0, i+1].axis('off')
# Show reconstruction error
for i, k in enumerate([5, 10, 25, 50, 100]):
pca_k = PCA(n_components=k).fit(X_train)
recon = pca_k.inverse_transform(pca_k.transform([sample]))[0]
error = np.abs(sample - recon)
axes[1, i+1].imshow(error.reshape(h, w), cmap='hot')
axes[1, i+1].set_title(f'Error (k={k})')
axes[1, i+1].axis('off')
axes[1, 0].axis('off')
plt.suptitle('Face Reconstruction at Different PCA Dimensions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
๐ฏ Mini Project 2: High-Dimensional Data Explorer
Build an interactive tool that applies PCA, t-SNE, and UMAP to any dataset and compares results side-by-side.
"""
Mini Project 2: High-Dimensional Data Explorer
=================================================
A comprehensive tool for comparing dimensionality reduction methods.
Generates publication-quality comparison plots.
"""
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits, load_wine, load_breast_cancer
import time
class DimReductionExplorer:
"""Compares PCA, Kernel PCA, t-SNE, and (optionally) UMAP."""
def __init__(self, X, y, feature_names=None, class_names=None, dataset_name="Dataset"):
self.X_raw = X
self.y = y
self.feature_names = feature_names
self.class_names = class_names
self.dataset_name = dataset_name
self.scaler = StandardScaler()
self.X = self.scaler.fit_transform(X)
self.results = {}
def run_all(self, tsne_perplexity=30, kpca_kernel='rbf', kpca_gamma=None):
"""Run all dimensionality reduction methods."""
print(f"=== {self.dataset_name}: {self.X.shape[0]} samples, {self.X.shape[1]} features ===\n")
# PCA
t0 = time.time()
pca = PCA(n_components=2)
self.results['PCA'] = {
'embedding': pca.fit_transform(self.X),
'time': time.time() - t0,
'model': pca
}
print(f"PCA: {time.time()-t0:.3f}s | EVR: {sum(pca.explained_variance_ratio_):.2%}")
# Kernel PCA
t0 = time.time()
kpca = KernelPCA(n_components=2, kernel=kpca_kernel, gamma=kpca_gamma)
self.results['Kernel PCA'] = {
'embedding': kpca.fit_transform(self.X),
'time': time.time() - t0,
'model': kpca
}
print(f"Kernel PCA: {time.time()-t0:.3f}s")
# t-SNE
t0 = time.time()
tsne = TSNE(n_components=2, perplexity=tsne_perplexity, random_state=42, init='pca')
self.results['t-SNE'] = {
'embedding': tsne.fit_transform(self.X),
'time': time.time() - t0,
'model': tsne
}
print(f"t-SNE: {time.time()-t0:.1f}s | KL div: {tsne.kl_divergence_:.4f}")
# UMAP (optional)
try:
import umap
t0 = time.time()
reducer = umap.UMAP(n_components=2, random_state=42)
self.results['UMAP'] = {
'embedding': reducer.fit_transform(self.X),
'time': time.time() - t0,
'model': reducer
}
print(f"UMAP: {time.time()-t0:.1f}s")
except ImportError:
print("UMAP: skipped (pip install umap-learn)")
def plot_comparison(self, figsize=(20, 5)):
"""Plot all methods side by side."""
methods = list(self.results.keys())
n_methods = len(methods)
fig, axes = plt.subplots(1, n_methods, figsize=(5*n_methods, 5))
if n_methods == 1:
axes = [axes]
cmap = 'tab10' if len(np.unique(self.y)) <= 10 else 'tab20'
for ax, method in zip(axes, methods):
Z = self.results[method]['embedding']
t = self.results[method]['time']
scatter = ax.scatter(Z[:, 0], Z[:, 1], c=self.y, cmap=cmap,
s=10, alpha=0.6, edgecolors='none')
ax.set_title(f'{method}\n({t:.2f}s)', fontsize=12, fontweight='bold')
ax.set_xticks([])
ax.set_yticks([])
plt.colorbar(scatter, ax=axes, label='Class', shrink=0.8)
plt.suptitle(f'{self.dataset_name}: Dimensionality Reduction Comparison',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
def plot_pca_analysis(self, max_components=None):
"""Detailed PCA analysis: scree plot, cumulative variance, loadings."""
if max_components is None:
max_components = min(self.X.shape)
pca_full = PCA(n_components=max_components).fit(self.X)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
k = min(max_components, 20)
# Scree plot
axes[0].bar(range(1, k+1), pca_full.explained_variance_ratio_[:k],
color='#059669', alpha=0.7)
axes[0].set_xlabel('Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Scree Plot')
# Cumulative
cum = np.cumsum(pca_full.explained_variance_ratio_[:k]) * 100
axes[1].plot(range(1, k+1), cum, 'o-', color='#059669', linewidth=2)
axes[1].axhline(y=95, color='red', linestyle='--', label='95%')
axes[1].axhline(y=90, color='orange', linestyle='--', label='90%')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Variance (%)')
axes[1].set_title('Cumulative Explained Variance')
axes[1].legend()
# Top features for PC1 and PC2
if self.feature_names is not None:
loadings = pca_full.components_[:2]
top_features_pc1 = np.argsort(np.abs(loadings[0]))[-5:]
axes[2].barh(range(5), loadings[0][top_features_pc1], color='#059669')
axes[2].set_yticks(range(5))
axes[2].set_yticklabels([self.feature_names[i] for i in top_features_pc1])
axes[2].set_title('Top 5 Feature Loadings (PC1)')
plt.suptitle(f'{self.dataset_name}: PCA Analysis', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# ============================================
# DEMO: Run on multiple datasets
# ============================================
if __name__ == "__main__":
# Dataset 1: Wine
wine = load_wine()
explorer = DimReductionExplorer(
wine.data, wine.target,
feature_names=wine.feature_names,
class_names=wine.target_names,
dataset_name="Wine Dataset"
)
explorer.run_all()
explorer.plot_comparison()
explorer.plot_pca_analysis()
# Dataset 2: Digits
digits = load_digits()
explorer2 = DimReductionExplorer(
digits.data, digits.target,
dataset_name="Digits Dataset (64D)"
)
explorer2.run_all(tsne_perplexity=30)
explorer2.plot_comparison()
๐ฏ Mini Project 3: Image Compression with PCA
Compress images by applying PCA to image patches and reconstructing at different compression ratios.
"""
Mini Project 3: Image Compression using PCA
=============================================
Treats each row of the image as a sample with width features.
PCA extracts principal components = dominant patterns across rows.
"""
import numpy as np
import matplotlib.pyplot as plt
# Create a sample grayscale image (or load your own)
from sklearn.datasets import load_sample_image
try:
china = load_sample_image("china.jpg")
# Convert to grayscale
img = np.mean(china, axis=2) # Average RGB channels
except:
# Fallback: create a synthetic image
x = np.linspace(0, 4*np.pi, 200)
y = np.linspace(0, 4*np.pi, 300)
X, Y = np.meshgrid(x, y)
img = np.sin(X) * np.cos(Y) + np.random.randn(300, 200) * 0.1
print(f"Image shape: {img.shape}")
print(f"Original size: {img.nbytes / 1024:.1f} KB")
# Apply PCA with different numbers of components
from sklearn.decomposition import PCA
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
k_values = [1, 5, 10, 25, 50, 100, 200, img.shape[1]]
for ax, k in zip(axes.flat, k_values):
k = min(k, min(img.shape))
pca = PCA(n_components=k)
compressed = pca.fit_transform(img) # (height x k)
reconstructed = pca.inverse_transform(compressed) # (height x width)
# Compression ratio
original_size = img.shape[0] * img.shape[1]
compressed_size = compressed.shape[0] * compressed.shape[1] + pca.components_.size
ratio = original_size / compressed_size
mse = np.mean((img - reconstructed) ** 2)
ax.imshow(reconstructed, cmap='gray')
ax.set_title(f'k={k}\nRatio: {ratio:.1f}x, MSE: {mse:.2f}')
ax.axis('off')
plt.suptitle('Image Compression with PCA', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
End-of-Chapter Exercises
Conceptual Questions
Explain the curse of dimensionality using the "volume of hypersphere" argument. Show mathematically that Vsphere/Vcube โ 0 as d โ โ.
Why must we center the data before applying PCA? What happens if we don't? Prove that the mean of the projected data along each principal component is zero.
Explain why PCA is sensitive to the scale of features. Give an example where standardization changes the principal components dramatically.
Derive the reconstruction error formula for PCA: E = ฮฃi=k+1d ฮปi. Show that discarding the smallest eigenvalues minimizes the MSE.
Explain the crowding problem in SNE and how t-SNE solves it using a heavy-tailed t-distribution. Why is KL divergence asymmetric, and how does this affect the embedding?
Compare PCA, t-SNE, and UMAP along five dimensions: (1) linearity, (2) computational complexity, (3) preservation of global vs. local structure, (4) out-of-sample extension, (5) determinism.
Explain why LDA is limited to at most C-1 components (where C is the number of classes). What happens if SW is singular?
Computational Exercises
Given the data matrix X = [[1,2],[3,4],[5,6],[7,8]], compute PCA by hand: (a) center the data, (b) compute the covariance matrix, (c) find eigenvalues and eigenvectors, (d) project onto PC1.
For the covariance matrix C = [[4, 2], [2, 3]], find the eigenvalues, eigenvectors, and explain what the principal components represent geometrically.
Implement Kernel PCA from scratch using the RBF kernel. Test it on the make_moons dataset and compare with linear PCA.
Write a function that computes the optimal number of PCA components using: (a) the 95% variance threshold, (b) Kaiser's rule (ฮป > 1 for standardized data), (c) the elbow method (maximum curvature of scree plot).
Implement the biplot function from scratch. For the Wine dataset, create a biplot and interpret which features drive each principal component.
Programming Projects
Apply PCA to the MNIST dataset. Plot accuracy of a KNN classifier vs. number of PCA components (k = 1, 2, 5, 10, 20, 50, 100, 200, 784). Find the optimal k.
Run t-SNE on MNIST with perplexities [5, 10, 30, 50, 100] and learning rates [10, 50, 200, 1000]. Create a 5ร4 grid of visualizations. Which hyperparameter has more impact?
Compare PCA + Logistic Regression vs. LDA + Logistic Regression on three datasets (Wine, Iris, Breast Cancer). Report accuracy, training time, and number of components used.
Implement incremental PCA for a dataset that doesn't fit in memory. Use sklearn's IncrementalPCA with batch_size=100 on a 100,000-sample dataset. Compare results with standard PCA.
Apply UMAP with different metrics (euclidean, cosine, manhattan) to the 20 Newsgroups text dataset (TF-IDF features). Which metric produces the best cluster separation?
Research-Oriented Exercises
Implement Sparse PCA using sklearn and compare the resulting components with standard PCA. When and why would sparse components be preferred?
The Johnson-Lindenstrauss lemma states that random projections can approximately preserve distances. Implement a random projection and compare with PCA for nearest-neighbor queries. At what dimensionality do they converge?
Implement Probabilistic PCA (PPCA) which models data as X = Wz + ฮผ + ฮต where z ~ N(0, I) and ฮต ~ N(0, ฯยฒI). Show how it handles missing data gracefully, unlike standard PCA.
Compare the stability of t-SNE and UMAP embeddings across different random seeds. Run each 10 times and quantify consistency using Procrustes distance. Which is more reproducible?
Build an "embedding space interpolation" tool: given two images, project them to PCA space, interpolate between their PCA coordinates, and reconstruct intermediate images. Apply to faces and digits.
Multiple Choice Questions
Interview Questions
Q: Explain PCA to a non-technical stakeholder. When would you use it and when would you avoid it?
Model Answer: PCA finds the most important "directions" in your data. Imagine a cloud of data points in 3D โ PCA finds the best 2D plane to project them onto while losing the least information, like taking the best possible photograph of a 3D object. Use it when: features are correlated, you need to speed up ML models, or reduce storage. Avoid it when: feature interpretability is critical (e.g., regulatory compliance), data has strong non-linear structure, or when the number of features is already small.
Q: What's the computational complexity of PCA? How does SVD help with the n >> d and n << d cases?
Model Answer: Eigendecomposition of dรd covariance matrix: O(dยณ). But forming the covariance matrix costs O(ndยฒ). SVD of nรd data matrix costs O(min(nยฒd, ndยฒ)). When n >> d: SVD costs O(ndยฒ), same as eigendecomposition route. When n << d (e.g., genomics): SVD is O(nยฒd), much cheaper than the O(dยณ) eigendecomposition. Randomized SVD (used in sklearn) can further reduce to O(ndk) for k components.
Q: You're building a product recommendation system with 1 million products, each represented as a 768-dimensional BERT embedding. How would you enable real-time nearest-neighbor search?
Model Answer: (1) PCA: reduce 768D โ 128D (retaining ~95% variance, 6x memory reduction). (2) Product Quantization: split 128D into 8 sub-vectors of 16D, quantize each to 256 centroids = 8 bytes/product. Total memory: 1M ร 8 bytes = 8 MB (vs 3 GB original). (3) Use FAISS or ScaNN for ANN search. (4) For production: build an IVF (Inverted File) index with ~1000 partitions. Query time: ~1ms (vs ~1s for brute force). PCA is the critical first step that makes everything else feasible.
Q: Why is t-SNE non-parametric (can't transform new points), while UMAP can? How does parametric t-SNE solve this?
Model Answer: t-SNE optimizes positions y_i directly via gradient descent โ there's no learned function f(x) โ y. To embed a new point x_new, you'd need to re-run the entire optimization. UMAP, however, learns a graph structure that can be used to map new points by finding their nearest neighbors in the training set and interpolating. Parametric t-SNE (Sainburg et al., 2021) uses a neural network to learn f(x) โ y, training it with the same KL divergence objective. This gives the best of both worlds: t-SNE quality + out-of-sample capability.
Q: How would you apply PCA to hyperspectral satellite images for crop classification?
Model Answer: (1) Stack all pixels as rows, spectral bands as columns (NรB matrix). (2) Standardize each band. (3) Apply PCA to reduce B=200 bands to k=15 components (capturing 99%+ variance). (4) Feed reduced features to a Random Forest or SVM classifier with training labels (crop type). Key insight: spectral bands are highly correlated (neighboring wavelengths measure similar properties), making PCA extremely effective. Additional trick: use "MNF transform" (Minimum Noise Fraction), a noise-aware variant of PCA that orders components by signal-to-noise ratio rather than variance.
Q: Can you use PCA for feature selection? What's the difference?
Model Answer: PCA does feature extraction, not selection. PCs are linear combinations of ALL original features โ you can't remove any single feature. However, you can use PCA loadings to identify which features contribute most to the top PCs, then keep only those features (this is "PCA-guided feature selection"). Also, Sparse PCA (which adds an L1 penalty) produces components that use only a few original features, bridging the gap between extraction and selection.
Q: You applied t-SNE and got beautiful clusters, but your classifier accuracy didn't improve. Why?
Model Answer: t-SNE is optimized for visualization (preserving local neighborhoods), not for preserving distances or linear separability. Several issues: (1) t-SNE distorts distances โ close clusters might actually be far apart and vice versa. (2) t-SNE is non-parametric โ you can't apply the same transformation to test data. (3) The 2D embedding discards information that a classifier could use. For classification, use PCA (preserves variance) or LDA (maximizes class separation) instead. t-SNE is a diagnostic tool, not a preprocessing step.
Q: In finance, PCA applied to the correlation matrix of stock returns typically shows the first PC explaining 30-40% of variance. What does this PC represent?
Model Answer: The first PC is the "market factor" โ it has roughly equal positive loadings on all stocks and represents broad market movements (risk-on/risk-off sentiment). PC2 often represents sector rotation (e.g., growth vs. value). PC3 might capture size effects. This is the basis of factor models (APT, Fama-French). Traders use PCA to decompose portfolio risk: the residual variance after removing the top PCs represents idiosyncratic/stock-specific risk.
Q: Prove that PCA minimizes reconstruction error. What's the connection between maximum variance and minimum error?
Model Answer: Total variance = ฮฃ ฮปแตข. Keeping k components preserves ฮฃ_{i=1}^k ฮปแตข variance. Reconstruction error = ฮฃ_{i=k+1}^d ฮปแตข = Total - Preserved. Minimizing reconstruction error โบ maximizing preserved variance โบ choosing the largest k eigenvalues. The proof uses the Eckart-Young theorem: the best rank-k approximation to a matrix (in Frobenius norm) is given by its top-k SVD components.
Q: You have a dataset with 1000 features but only 100 samples. What problems do you face, and how does PCA help?
Model Answer: With p >> n: (1) Covariance matrix (1000ร1000) is rank-deficient (rank โค 99), making it non-invertible. (2) Models will overfit โ more parameters than data points. (3) Distance-based methods fail (curse of dimensionality). PCA helps by: (a) Reducing to at most 99 meaningful components (99 non-zero eigenvalues). (b) Typically 10-20 PCs capture most variance, giving a well-conditioned feature space. (c) Computing PCA via SVD of the 100ร1000 data matrix (cheaper than 1000ร1000 eigendecomposition). This is the standard approach in genomics, text mining, and neuroimaging.
Research Problems
๐ฌ Research Problem 1: Adaptive Perplexity for t-SNE
Standard t-SNE uses a single perplexity value for all points. But in datasets with varying density (e.g., a dense cluster near a sparse one), uniform perplexity is suboptimal. Design and implement an adaptive perplexity scheme where each point uses a different perplexity based on local density. Compare with standard t-SNE on synthetic datasets with multi-scale cluster structure.
Starting point: Compute local density ฯi = 1/dฬk (inverse of mean k-nearest-neighbor distance). Set perplexityi โ ฯi-ฮฑ for some parameter ฮฑ. Evaluate using: (a) trustworthiness metric, (b) silhouette score of known clusters, (c) visual quality.
Key reference: Linderman et al., "Clustering with t-SNE, provably" (2019).
๐ฌ Research Problem 2: Dimensionality Reduction for Time Series
Standard PCA treats each feature independently, ignoring temporal correlations. Design a "Temporal PCA" that finds principal components in the time-frequency domain. Apply to: (a) EEG brain signals (64 channels ร 1000 time points), (b) stock price histories (500 stocks ร 252 trading days).
Approach: Apply Short-Time Fourier Transform to each channel, flatten the time-frequency representation, then apply PCA. Compare with: standard PCA on raw time series, Dynamic Time Warping + MDS, and the method of Dynamic Mode Decomposition (DMD).
๐ฌ Research Problem 3: Fairness in Dimensionality Reduction
PCA maximizes total variance, which may disproportionately represent majority groups. For example, in face recognition, eigenfaces trained on predominantly light-skinned faces may poorly represent darker-skinned faces. Design a "Fair PCA" that ensures equal reconstruction error across demographic groups.
Formulation: Minimize maxgโGroups Eg[||x - xฬ||ยฒ] subject to dim(projection) = k. Compare with: standard PCA, group-stratified PCA (separate PCA per group), and regularized PCA with group-wise reconstruction constraints.
Key reference: Samadi et al., "The Price of Fair PCA" (NeurIPS 2018).
๐ฌ Research Problem 4: Topology-Preserving Dimensionality Reduction
UMAP is inspired by topology, but how well does it actually preserve topological features (holes, loops, connected components)? Develop quantitative metrics based on persistent homology to evaluate how well PCA, t-SNE, UMAP, and autoencoders preserve the topological structure of data manifolds.
Tools: Use Ripser or GUDHI for persistent homology computation. Test on: (a) torus, (b) Klein bottle samples, (c) real protein conformation data.
Key Takeaways
References
Foundational Papers
- Pearson, K. (1901). "On Lines and Planes of Closest Fit to Systems of Points in Space." Philosophical Magazine, 2(11), 559โ572.
- Hotelling, H. (1933). "Analysis of a Complex of Statistical Variables into Principal Components." Journal of Educational Psychology, 24(6), 417โ441.
- Fisher, R.A. (1936). "The Use of Multiple Measurements in Taxonomic Problems." Annals of Eugenics, 7(2), 179โ188.
- Golub, G.H. & Kahan, W. (1965). "Calculating the Singular Values and Pseudo-Inverse of a Matrix." SIAM Journal on Numerical Analysis, 2(2), 205โ224.
- Schรถlkopf, B., Smola, A., & Mรผller, K.R. (1998). "Nonlinear Component Analysis as a Kernel Eigenvalue Problem." Neural Computation, 10(5), 1299โ1319.
t-SNE and UMAP
- Hinton, G.E. & Roweis, S.T. (2002). "Stochastic Neighbor Embedding." NeurIPS.
- van der Maaten, L. & Hinton, G. (2008). "Visualizing Data using t-SNE." JMLR, 9, 2579โ2605.
- McInnes, L., Healy, J., & Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." arXiv:1802.03426.
- Linderman, G.C. et al. (2019). "Fast Interpolation-based t-SNE for Improved Visualization of Single-Cell RNA-Seq Data." Nature Methods, 16, 243โ245.
Applications
- Turk, M. & Pentland, A. (1991). "Eigenfaces for Recognition." Journal of Cognitive Neuroscience, 3(1), 71โ86.
- Jolliffe, I.T. & Cadima, J. (2016). "Principal Component Analysis: A Review and Recent Developments." Phil. Trans. R. Soc. A, 374, 20150202.
- Becht, E. et al. (2019). "Dimensionality Reduction for Visualizing Single-Cell Data Using UMAP." Nature Biotechnology, 37, 38โ44.
Textbooks
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 12 (PCA).
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 20.
- Gรฉron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed. O'Reilly. Chapter 8.
Indian Context
- UIDAI Technical Reports on Biometric De-duplication Architecture (2020).
- ISRO/NRSC Hyperspectral Remote Sensing Documentation, NRSC-TR-2019.
- Rao, C.R. (1948). "The Utilization of Multiple Measurements in Problems of Biological Classification." JRSS B, 10(2), 159โ203.