Unsupervised Learning — Clustering
Discovering Hidden Structure: K-Means, Hierarchical, DBSCAN, and Gaussian Mixture Models
Learning Objectives
Distinguish supervised from unsupervised learning and articulate when clustering is the right choice
Implement K-Means from scratch, derive convergence via within-cluster SSE minimization
Explain K-Means++ initialization and prove its O(log k)-competitive guarantee
Apply Elbow Method and Silhouette Score to select optimal number of clusters
Build hierarchical clustering (agglomerative & divisive) and interpret dendrograms
Master DBSCAN for density-based clustering with noise handling via ε and MinPts
Derive the Expectation-Maximization (EM) algorithm for Gaussian Mixture Models
Evaluate clusters using Silhouette, Davies-Bouldin, Calinski-Harabasz, and ARI metrics
Compare K-Means, Hierarchical, DBSCAN, and GMM — know which to use when
Build end-to-end clustering projects: customer segmentation and image color quantization
Introduction
Imagine walking into a library where none of the books have category labels. You'd naturally start grouping them — novels here, science texts there, history books in another pile. You're performing clustering: discovering natural groupings in data without being told what those groups are.
In supervised learning (Chapters 6–14), we had labeled data: "this email is spam," "this image is a cat." In unsupervised learning, we have only the data itself — no labels, no teacher. The algorithm must discover structure on its own.
Clustering is the most fundamental unsupervised learning task. Given a dataset of n points in d-dimensional space, the goal is to partition them into k groups (clusters) such that points within the same cluster are "similar" to each other and "dissimilar" to points in other clusters.
Why Clustering Matters
- Customer Segmentation — Jio segments its 200M+ users into behavioral clusters for targeted plans
- Anomaly Detection — Points that don't belong to any cluster may be fraudulent transactions
- Data Exploration — Understand the structure of a new dataset before building models
- Feature Engineering — Cluster assignments become features for downstream supervised models
- Compression — Image color quantization reduces 16M colors to k representative colors
- Biology — Gene expression clustering reveals functional groups of genes
Historical Background
Clustering has one of the richest histories in machine learning, spanning mathematics, statistics, and computer science:
| Year | Milestone | Contributor | Significance |
|---|---|---|---|
| 1932 | Anthropometric clustering | Driver & Kroeber | First formal cluster analysis in social science |
| 1957 | K-Means algorithm | Stuart Lloyd (Bell Labs) | Originally for PCM signal quantization; published 1982 |
| 1963 | K-Means rediscovery | Edward Forgy | Published the iterative K-Means independently |
| 1967 | Hierarchical clustering | Johnson (single/complete linkage) | Systematic taxonomy via dendrograms |
| 1977 | EM algorithm formalized | Dempster, Laird, Rubin | General framework for GMMs and missing data |
| 1996 | DBSCAN | Ester, Kriegel, Sander, Xu | Density-based clustering, arbitrary-shaped clusters |
| 2007 | K-Means++ | Arthur & Vassilvitskii | O(log k)-competitive initialization |
| 2011 | Mini-Batch K-Means | Sculley (Google) | Scalable K-Means for web-scale data |
| 2017 | Deep Clustering (DEC) | Xie, Girshick, Farhadi | Neural network-based unsupervised clustering |
Conceptual Explanation
4.1 Supervised vs. Unsupervised Learning
In supervised learning, we have pairs (xᵢ, yᵢ) and learn a mapping f: X → Y. In unsupervised learning, we have only {x₁, x₂, ..., xₙ} and seek to discover hidden patterns, groupings, or representations.
SUPERVISED LEARNING UNSUPERVISED LEARNING
═══════════════════ ═════════════════════
Input: (x₁,y₁), ..., (xₙ,yₙ) Input: x₁, x₂, ..., xₙ
↓ ↓
Learn: f(x) → y Discover: Structure in X
↓ ↓
Output: Predicted labels Output: Clusters, components,
associations
Examples: Examples:
• Classification • Clustering ← THIS CHAPTER
• Regression • Dimensionality Reduction
• Anomaly Detection
• Association Rules
4.2 K-Means Clustering
K-Means is the most popular clustering algorithm. The idea is elegantly simple:
- Choose k — the number of clusters
- Initialize — pick k points as initial centroids
- Assign — assign each data point to the nearest centroid
- Update — recompute each centroid as the mean of its assigned points
- Repeat — iterate steps 3–4 until convergence
The algorithm converges because it monotonically decreases the within-cluster sum of squares (WCSS) — also called inertia:
4.3 K-Means++ Initialization
Standard K-Means randomly initializes centroids, which can lead to poor local minima. K-Means++ is a smart initialization strategy:
- Pick the first centroid uniformly at random from the data
- For each remaining point x, compute D(x) = distance to nearest existing centroid
- Pick the next centroid with probability proportional to D(x)²
- Repeat steps 2–3 until k centroids are chosen
This ensures centroids are spread out, leading to O(log k)-competitive approximation to the optimal WCSS.
4.4 Choosing Optimal K
Elbow Method
Plot WCSS vs. k. The "elbow" — where the curve bends sharply — suggests the optimal k. Beyond this point, adding more clusters gives diminishing returns.
Silhouette Score
For each point i, the silhouette score measures how similar it is to its own cluster versus the nearest neighboring cluster:
s(i) ranges from -1 (wrong cluster) to +1 (perfectly clustered). The overall silhouette score is the mean over all points.
4.5 Hierarchical Clustering
Instead of specifying k upfront, hierarchical clustering builds a tree (dendrogram) of nested clusterings:
- Agglomerative (bottom-up): Start with each point as its own cluster. Repeatedly merge the two closest clusters.
- Divisive (top-down): Start with all points in one cluster. Recursively split the least coherent cluster.
Linkage Types
The key design choice is how to measure distance between clusters:
| Linkage | Distance Between A and B | Behavior |
|---|---|---|
| Single | min{d(a,b) : a∈A, b∈B} | Can find elongated clusters; sensitive to noise ("chaining") |
| Complete | max{d(a,b) : a∈A, b∈B} | Produces compact, equal-diameter clusters |
| Average (UPGMA) | mean{d(a,b) : a∈A, b∈B} | Compromise between single and complete |
| Ward's | ΔJ (increase in WCSS upon merge) | Minimizes variance; produces equal-sized clusters |
4.6 DBSCAN: Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines clusters as dense regions separated by sparse regions. Two hyperparameters:
- ε (eps): The radius of the neighborhood
- MinPts: Minimum points required to form a dense region
Three point types:
- Core point: Has ≥ MinPts neighbors within ε-radius
- Border point: Within ε of a core point but has < MinPts neighbors
- Noise point: Neither core nor border — the outlier!
4.7 Gaussian Mixture Models (GMM)
While K-Means does "hard" assignment (each point belongs to exactly one cluster), GMMs do soft clustering — each point has a probability of belonging to each cluster.
A GMM models the data as a mixture of k Gaussian distributions:
The parameters {πₖ, μₖ, Σₖ} are learned via the Expectation-Maximization (EM) algorithm.
4.8 Cluster Evaluation Metrics
Silhouette Score
How well-separated are clusters? Range: [-1, 1]. Higher = better.
Davies-Bouldin Index
Ratio of within-cluster to between-cluster distances. Lower = better.
Calinski-Harabasz
Ratio of between-cluster to within-cluster variance. Higher = better.
ARI (Adjusted Rand)
Measures agreement with ground truth (when available). Range: [-1, 1].
Mathematical Foundation
5.1 K-Means as an Optimization Problem
Given n data points {x₁, ..., xₙ} in ℝᵈ, K-Means seeks a partition C = {C₁, ..., Cₖ} and centroids {μ₁, ..., μₖ} that minimize:
This is an NP-hard problem in general (even for k=2 in d-dimensions). K-Means gives a local optimum via alternating minimization.
5.2 Why K-Means Converges
The algorithm alternates two steps, each of which can only decrease (or maintain) J:
- Assignment step (fix μ, optimize C): For each xᵢ, assigning it to the nearest centroid minimizes ||xᵢ - μⱼ||². This cannot increase J.
- Update step (fix C, optimize μ): Given cluster Cⱼ, the value of μ that minimizes Σ_{i∈Cⱼ} ||xᵢ - μ||² is the mean: μⱼ = (1/|Cⱼ|) Σ_{i∈Cⱼ} xᵢ. This cannot increase J.
Since J is bounded below by 0 and each step decreases it, the algorithm must converge. There are finitely many partitions, so it terminates in finite steps (usually fast).
5.3 K-Means++ Competitive Ratio
Let OPT denote the optimal WCSS. K-Means++ initialization produces initial centroids with expected cost:
5.4 Silhouette Score Formalization
For point xᵢ assigned to cluster Cₖ:
5.5 Gaussian Mixture Model — Full Specification
The probability density under a GMM with k components:
5.6 Davies-Bouldin Index
5.7 Calinski-Harabasz Index
Formula Derivations
6.1 Deriving the Optimal Centroid (Mean Minimizes SSE)
We prove that the centroid μⱼ that minimizes within-cluster SSE is the mean of the cluster points.
Given: Cluster Cⱼ = {x₁, x₂, ..., xₘ}. Find μ* = argmin_μ Σᵢ ||xᵢ - μ||²
The Hessian ∇²f = 2mI ≻ 0 (positive definite), confirming this is a global minimum.
6.2 Deriving the EM Algorithm for GMMs
The log-likelihood for a GMM is:
Direct maximization is intractable (ln of sum). The EM algorithm introduces latent variables zᵢ ∈ {1,...,k} indicating cluster assignment.
E-Step: Compute Responsibilities
γᵢⱼ is the "soft" probability that point xᵢ belongs to cluster j.
M-Step: Update Parameters
Maximize the expected complete-data log-likelihood. Define Nⱼ = Σᵢ γᵢⱼ (effective number of points in cluster j).
Proof sketch for μⱼ: Taking ∂/∂μⱼ of the expected log-likelihood Q(θ, θ^{old}) = Σᵢ Σⱼ γᵢⱼ ln[πⱼ N(xᵢ|μⱼ,Σⱼ)] and setting to zero:
6.3 Ward's Linkage — Merge Cost Derivation
When merging clusters A and B, the increase in total WCSS is:
Proof: The WCSS of the merged cluster A∪B with mean μ_{A∪B} = (|A|μ_A + |B|μ_B)/(|A|+|B|) can be expanded. The increase equals the variance due to the difference in means, weighted by the harmonic-like factor |A|·|B|/(|A|+|B|).
Worked Numerical Examples
Example 1: K-Means — 3 Iterations by Hand
Data points (2D): A(1,1), B(1.5,2), C(3,4), D(5,7), E(3.5,5), F(4.5,5), G(3.5,4.5)
k = 2. Initial centroids: μ₁ = A(1,1), μ₂ = D(5,7)
Iteration 1 — Assignment Step
| Point | Coords | dist² to μ₁(1,1) | dist² to μ₂(5,7) | Cluster |
|---|---|---|---|---|
| A | (1, 1) | 0 | 52 | C₁ |
| B | (1.5, 2) | 1.25 | 37.25 | C₁ |
| C | (3, 4) | 13 | 13 | C₁ (tie → pick first) |
| D | (5, 7) | 52 | 0 | C₂ |
| E | (3.5, 5) | 22.25 | 6.25 | C₂ |
| F | (4.5, 5) | 28.25 | 4.25 | C₂ |
| G | (3.5, 4.5) | 18.5 | 8.5 | C₂ |
C₁ = {A, B, C}, C₂ = {D, E, F, G}
Iteration 1 — Update Step
Iteration 2 — Assignment Step
| Point | dist² to μ₁(1.83,2.33) | dist² to μ₂(4.13,5.38) | Cluster |
|---|---|---|---|
| A(1,1) | 2.47 | 28.89 | C₁ |
| B(1.5,2) | 0.22 | 18.33 | C₁ |
| C(3,4) | 4.15 | 3.16 | C₂ ← moved! |
| D(5,7) | 31.78 | 3.41 | C₂ |
| E(3.5,5) | 9.90 | 0.53 | C₂ |
| F(4.5,5) | 18.26 | 0.28 | C₂ |
| G(3.5,4.5) | 7.47 | 1.14 | C₂ |
C₁ = {A, B}, C₂ = {C, D, E, F, G} — Point C moved to C₂!
Iteration 2 — Update Step
Iteration 3 — Assignment Step
Recomputing distances with μ₁=(1.25,1.5) and μ₂=(3.9,5.1):
All points remain in the same clusters: C₁ = {A, B}, C₂ = {C, D, E, F, G}. Converged!
Example 2: Silhouette Score Computation
Clusters: C₁ = {(0,0), (1,0), (0,1)}, C₂ = {(5,5), (6,5), (5,6)}
Compute silhouette for point x₁ = (0,0) in C₁:
Visual Diagrams
K-Means Iteration Visualization
Iteration 0 (Init) Iteration 1 Iteration 2 (Converged)
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ · · │ │ ■ ■ │ │ ■ ■ │
│ · · │ │ ■ ■ │ │ ■ ★ ■ │
│ ★ │ │ ★ │ │ │
│ │ │ │ │ │
│ ★ │ │ ★ │ │ ★ │
│ · · │ │ ● ● │ │ ● ● │
│ · │ │ ● │ │ ● │
└──────────────────┘ └──────────────────┘ └──────────────────┘
★ = centroid · = unassigned ● = cluster 1 ■ = cluster 2
DBSCAN Point Classification
ε = 1.5, MinPts = 3
┌─────────────────────────────────┐
│ │
│ ●───● Core points: ● │
│ │ ╲ │ Border pts: ○ │
│ ●───●───○ Noise pts: ✕ │
│ │
│ ✕ │
│ │
│ ●───● │
│ │ ╲ │ │
│ ○────●───● │
│ │
│ ✕ │
└─────────────────────────────────┘
Cluster 1: top-left group (4 core + 1 border)
Cluster 2: bottom-center group (4 core + 1 border)
Noise: 2 isolated points (✕)
Dendrogram Reading
Height (distance)
│
8 ├─────────────────────┐
│ │
6 ├─────────┐ │
│ │ │
4 ├───┐ │ │
│ │ │ │
2 ├─┐ │ ┌─┘ ┌─┘
│ │ │ │ │
0 ├─┴─┴───┴───────────┴──
A B C D E F G
└──┘ │ │ └──┘ │
merge │ │ merge │
at 2 │ │ at 2 │
└─────┘ │ └──────┘
merge │ merge
at 4 │ at 3
└─────────┘
merge at 6
└─────────────────────┘
merge at 8
Cut at height 5 → 2 clusters: {A,B,C,D} and {E,F,G}
Cut at height 3 → 3 clusters: {A,B}, {C,D}, {E,F,G}
GMM Soft Clustering vs K-Means Hard Clustering
K-MEANS (Hard Assignment) GMM (Soft Assignment)
┌──────────────────────┐ ┌──────────────────────┐
│ ● ● ● │ ■ ■ ■ │ │ ● ● ● ■ ■ ■ │
│ ● ● │ ■ ■ │ │ ● ● ◐ ■ ■ │
│ ● ▌ ■ │ │ ● ◑ ■ │
│ │ │ │ │
│ Hard boundary │ │ ◐ = 70% cluster 1 │
│ between clusters │ │ ◑ = 40% cluster 1 │
│ │ │ Smooth probabilities │
└──────────────────────┘ └──────────────────────┘
Flowcharts
K-Means Algorithm Flowchart
┌──────────────────┐
│ START: Input │
│ X, k (clusters) │
└────────┬─────────┘
▼
┌──────────────────┐
│ Initialize k │
│ centroids │
│ (random/K-M++) │
└────────┬─────────┘
▼
┌────►┌──────────────────┐
│ │ ASSIGN each xᵢ │
│ │ to nearest μⱼ │
│ │ cᵢ = argmin │
│ │ ||xᵢ - μⱼ||² │
│ └────────┬─────────┘
│ ▼
│ ┌──────────────────┐
│ │ UPDATE centroids│
│ │ μⱼ = mean(Cⱼ) │
│ └────────┬─────────┘
│ ▼
│ ┌──────────────────┐
│ NO │ Converged? │
├─────┤ (assignments │
│ │ unchanged?) │
│ └────────┬─────────┘
│ │ YES
│ ▼
│ ┌──────────────────┐
│ │ OUTPUT: Clusters│
│ │ C₁,...,Cₖ and │
│ │ centroids μ₁..μₖ│
│ └──────────────────┘
DBSCAN Algorithm Flowchart
┌─────────────────────────┐
│ START: Input X, ε, MinPts│
└───────────┬─────────────┘
▼
┌─────────────────────────┐
│ Mark all points UNVISITED│
│ cluster_id = 0 │
└───────────┬─────────────┘
▼
┌─────────────────────────┐
│ Pick unvisited point P │◄──────────────┐
│ Mark P as VISITED │ │
└───────────┬─────────────┘ │
▼ │
┌─────────────────────────┐ │
│ Find neighbors N(P) │ │
│ within radius ε │ │
└───────────┬─────────────┘ │
▼ │
┌───────────────┐ │
│|N(P)| ≥ MinPts?│ │
└───┬───────┬───┘ │
YES│ │NO │
▼ ▼ │
┌──────────┐ ┌──────────┐ │
│ CORE PT │ │ NOISE │ │
│ cluster++│ │ (may │ │
│ Expand │ │ become │ │
│ cluster │ │ border) │ │
│ via BFS │ └──────────┘ │
└─────┬────┘ │
▼ │
┌───────────────┐ │
│ More unvisited │───YES───────────────────┘
│ points? │
└───────┬───────┘
│ NO
▼
┌──────────────────┐
│ OUTPUT: Clusters │
│ + Noise points │
└──────────────────┘
EM Algorithm Flowchart
┌──────────────────────────┐
│ START: Data X, k comps │
│ Init: πⱼ, μⱼ, Σⱼ │
└────────────┬─────────────┘
▼
┌───►┌──────────────────────┐
│ │ E-STEP │
│ │ Compute γᵢⱼ = │
│ │ P(zᵢ=j | xᵢ, θ) │
│ │ (responsibilities) │
│ └────────────┬─────────┘
│ ▼
│ ┌──────────────────────┐
│ │ M-STEP │
│ │ Update: μⱼ, Σⱼ, πⱼ │
│ │ using γᵢⱼ weights │
│ └────────────┬─────────┘
│ ▼
│ ┌──────────────────────┐
│ │ Compute ℓ(θ) │
│ NO │ Converged? │
├────┤ |ℓ_new - ℓ_old| <ε? │
│ └────────────┬─────────┘
│ │ YES
│ ▼
│ ┌──────────────────────┐
│ │ OUTPUT: │
│ │ πⱼ, μⱼ, Σⱼ for each │
│ │ component + γᵢⱼ │
│ └──────────────────────┘
Choosing the Right Clustering Algorithm
┌─────────────────────┐
│ Do you know k │
│ (number of clusters)?│
└─────┬──────┬────────┘
YES │ │ NO
▼ ▼
┌──────────┐ ┌──────────────┐
│Spherical │ │Need to handle│
│clusters? │ │noise/outliers│
└──┬────┬──┘ └──┬───────┬───┘
YES│ │NO YES│ │NO
▼ ▼ ▼ ▼
┌──────┐┌─────┐┌──────┐┌──────────┐
│K-Means││ GMM ││DBSCAN││Hierarchi-│
│ ││ ││ ││cal (cut │
│Fast, ││Soft ││Auto k││dendrogram│
│simple ││prob ││Noise ││at desired│
│ ││ ││label ││height) │
└──────┘└─────┘└──────┘└──────────┘
Python Implementation from Scratch
10.1 K-Means from Scratch
import numpy as np
import matplotlib.pyplot as plt
class KMeans:
"""K-Means clustering from scratch with K-Means++ initialization."""
def __init__(self, k=3, max_iters=100, tol=1e-4, init='kmeans++', random_state=42):
self.k = k
self.max_iters = max_iters
self.tol = tol
self.init = init
self.random_state = random_state
self.centroids = None
self.labels = None
self.inertia_ = None
self.n_iters_ = 0
def _init_centroids(self, X):
"""Initialize centroids using random or K-Means++."""
rng = np.random.RandomState(self.random_state)
n = X.shape[0]
if self.init == 'random':
indices = rng.choice(n, self.k, replace=False)
return X[indices].copy()
elif self.init == 'kmeans++':
centroids = [X[rng.randint(n)]]
for _ in range(1, self.k):
# Compute squared distances to nearest centroid
dists = np.array([
min(np.sum((x - c) ** 2) for c in centroids)
for x in X
])
# Probability proportional to D(x)^2
probs = dists / dists.sum()
# Weighted random selection
idx = rng.choice(n, p=probs)
centroids.append(X[idx])
return np.array(centroids)
def _assign_clusters(self, X):
"""Assign each point to nearest centroid."""
# Vectorized distance computation
# X: (n, d), centroids: (k, d)
# Using broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d)
diffs = X[:, np.newaxis, :] - self.centroids[np.newaxis, :, :]
distances = np.sum(diffs ** 2, axis=2) # (n, k)
return np.argmin(distances, axis=1)
def _update_centroids(self, X, labels):
"""Update centroids as cluster means."""
new_centroids = np.zeros_like(self.centroids)
for j in range(self.k):
cluster_points = X[labels == j]
if len(cluster_points) > 0:
new_centroids[j] = cluster_points.mean(axis=0)
else:
# Handle empty cluster: reinitialize randomly
new_centroids[j] = X[np.random.randint(X.shape[0])]
return new_centroids
def _compute_inertia(self, X, labels):
"""Compute within-cluster sum of squares (WCSS)."""
inertia = 0
for j in range(self.k):
cluster_points = X[labels == j]
if len(cluster_points) > 0:
inertia += np.sum((cluster_points - self.centroids[j]) ** 2)
return inertia
def fit(self, X):
"""Fit K-Means to data X."""
self.centroids = self._init_centroids(X)
for i in range(self.max_iters):
# Assignment step
self.labels = self._assign_clusters(X)
# Update step
new_centroids = self._update_centroids(X, self.labels)
# Check convergence
shift = np.sum((new_centroids - self.centroids) ** 2)
self.centroids = new_centroids
self.n_iters_ = i + 1
if shift < self.tol:
break
self.inertia_ = self._compute_inertia(X, self.labels)
return self
def predict(self, X):
"""Predict cluster labels for new data."""
return self._assign_clusters(X)
def fit_predict(self, X):
"""Fit and return labels."""
self.fit(X)
return self.labels
# --- Demo ---
np.random.seed(42)
# Generate 3 blobs
X = np.vstack([
np.random.randn(50, 2) + [0, 0],
np.random.randn(50, 2) + [5, 5],
np.random.randn(50, 2) + [10, 0],
])
km = KMeans(k=3, init='kmeans++')
labels = km.fit_predict(X)
print(f"Converged in {km.n_iters_} iterations")
print(f"Inertia (WCSS): {km.inertia_:.2f}")
print(f"Centroids:\n{km.centroids}")
# Elbow method
inertias = []
K_range = range(1, 10)
for k in K_range:
km_temp = KMeans(k=k)
km_temp.fit(X)
inertias.append(km_temp.inertia_)
print("\nElbow Method - Inertias:")
for k, inertia in zip(K_range, inertias):
bar = '█' * int(inertia / 50)
print(f" k={k}: {inertia:8.1f} |{bar}")
10.2 DBSCAN from Scratch
import numpy as np
from collections import deque
class DBSCAN:
"""DBSCAN clustering from scratch."""
def __init__(self, eps=0.5, min_pts=5):
self.eps = eps
self.min_pts = min_pts
self.labels = None
self.core_points_ = None
self.n_clusters_ = 0
def _get_neighbors(self, X, point_idx):
"""Find all points within eps of X[point_idx]."""
distances = np.sqrt(np.sum((X - X[point_idx]) ** 2, axis=1))
return np.where(distances <= self.eps)[0]
def fit(self, X):
"""Fit DBSCAN to data X."""
n = X.shape[0]
self.labels = np.full(n, -1) # -1 = unvisited/noise
visited = np.zeros(n, dtype=bool)
cluster_id = 0
core_mask = np.zeros(n, dtype=bool)
# Precompute distance matrix for efficiency
# (for small datasets; for large ones, use spatial index)
dist_matrix = np.sqrt(
np.sum((X[:, np.newaxis] - X[np.newaxis, :]) ** 2, axis=2)
)
for i in range(n):
if visited[i]:
continue
visited[i] = True
# Find neighbors
neighbors = np.where(dist_matrix[i] <= self.eps)[0]
if len(neighbors) < self.min_pts:
# Mark as noise (may become border later)
self.labels[i] = -1
continue
# Core point — start new cluster
core_mask[i] = True
self.labels[i] = cluster_id
# BFS expansion
queue = deque(neighbors.tolist())
while queue:
q = queue.popleft()
if not visited[q]:
visited[q] = True
q_neighbors = np.where(dist_matrix[q] <= self.eps)[0]
if len(q_neighbors) >= self.min_pts:
core_mask[q] = True
queue.extend(q_neighbors.tolist())
# Assign to cluster if not yet assigned
if self.labels[q] == -1:
self.labels[q] = cluster_id
cluster_id += 1
self.core_points_ = core_mask
self.n_clusters_ = cluster_id
return self
def fit_predict(self, X):
"""Fit and return labels."""
self.fit(X)
return self.labels
# --- Demo ---
np.random.seed(42)
# Two crescent moons (non-spherical clusters)
from sklearn.datasets import make_moons
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=42)
db = DBSCAN(eps=0.2, min_pts=5)
labels = db.fit_predict(X)
print(f"Number of clusters found: {db.n_clusters_}")
print(f"Noise points: {np.sum(labels == -1)}")
print(f"Core points: {np.sum(db.core_points_)}")
print(f"Cluster sizes: {[np.sum(labels == i) for i in range(db.n_clusters_)]}")
10.3 Silhouette Score from Scratch
import numpy as np
def silhouette_score(X, labels):
"""Compute mean silhouette score from scratch."""
n = X.shape[0]
unique_labels = np.unique(labels)
k = len(unique_labels)
if k <= 1 or k >= n:
return 0.0 # Undefined for 1 cluster or n clusters
# Pairwise distance matrix
dist_matrix = np.sqrt(
np.sum((X[:, np.newaxis] - X[np.newaxis, :]) ** 2, axis=2)
)
silhouettes = np.zeros(n)
for i in range(n):
own_cluster = labels[i]
own_mask = labels == own_cluster
# a(i): mean intra-cluster distance
if own_mask.sum() > 1:
a_i = dist_matrix[i][own_mask].sum() / (own_mask.sum() - 1)
else:
a_i = 0
# b(i): min mean distance to other clusters
b_i = np.inf
for label in unique_labels:
if label == own_cluster or label == -1:
continue
other_mask = labels == label
if other_mask.sum() > 0:
mean_dist = dist_matrix[i][other_mask].mean()
b_i = min(b_i, mean_dist)
if b_i == np.inf:
silhouettes[i] = 0
else:
silhouettes[i] = (b_i - a_i) / max(a_i, b_i)
return silhouettes.mean()
# --- Demo ---
np.random.seed(42)
X = np.vstack([
np.random.randn(30, 2) + [0, 0],
np.random.randn(30, 2) + [5, 5],
np.random.randn(30, 2) + [10, 0],
])
from kmeans_scratch import KMeans # from above
for k in range(2, 7):
km = KMeans(k=k)
labels = km.fit_predict(X)
score = silhouette_score(X, labels)
bar = '█' * int(score * 30)
print(f"k={k}: Silhouette = {score:.4f} |{bar}")
# Best k should be 3 (highest silhouette)
TensorFlow Implementation
While clustering is traditionally non-neural, TensorFlow can accelerate K-Means on GPU and enables deep clustering approaches.
import tensorflow as tf
import numpy as np
class TFKMeans:
"""GPU-accelerated K-Means using TensorFlow."""
def __init__(self, k=3, max_iters=100, tol=1e-4):
self.k = k
self.max_iters = max_iters
self.tol = tol
@tf.function
def _assign_step(self, X, centroids):
"""Vectorized assignment using TF ops."""
# X: (n, d), centroids: (k, d)
# Expand: (n, 1, d) - (1, k, d) = (n, k, d)
diffs = tf.expand_dims(X, 1) - tf.expand_dims(centroids, 0)
distances = tf.reduce_sum(diffs ** 2, axis=2) # (n, k)
return tf.argmin(distances, axis=1) # (n,)
@tf.function
def _update_step(self, X, labels, k):
"""Update centroids as cluster means."""
new_centroids = []
for j in tf.range(k):
mask = tf.equal(labels, tf.cast(j, labels.dtype))
cluster_pts = tf.boolean_mask(X, mask)
new_centroids.append(tf.reduce_mean(cluster_pts, axis=0))
return tf.stack(new_centroids)
def fit(self, X_np):
"""Fit K-Means on data."""
X = tf.constant(X_np, dtype=tf.float32)
n = X.shape[0]
# K-Means++ init
indices = [np.random.randint(n)]
for _ in range(1, self.k):
centroids_so_far = tf.gather(X, indices)
diffs = tf.expand_dims(X, 1) - tf.expand_dims(centroids_so_far, 0)
dists = tf.reduce_min(tf.reduce_sum(diffs**2, axis=2), axis=1)
probs = dists / tf.reduce_sum(dists)
idx = np.random.choice(n, p=probs.numpy())
indices.append(idx)
self.centroids = tf.Variable(tf.gather(X, indices))
for i in range(self.max_iters):
labels = self._assign_step(X, self.centroids)
new_centroids = self._update_step(X, labels, self.k)
shift = tf.reduce_sum((new_centroids - self.centroids) ** 2)
self.centroids.assign(new_centroids)
if shift.numpy() < self.tol:
print(f"Converged at iteration {i+1}")
break
self.labels_ = labels.numpy()
return self
# --- Demo ---
np.random.seed(42)
X = np.vstack([
np.random.randn(100, 2) + [0, 0],
np.random.randn(100, 2) + [6, 6],
np.random.randn(100, 2) + [12, 0],
]).astype(np.float32)
model = TFKMeans(k=3)
model.fit(X)
print(f"TF Centroids:\n{model.centroids.numpy()}")
print(f"Cluster sizes: {[np.sum(model.labels_ == i) for i in range(3)]}")
# --- Deep Clustering with Autoencoder ---
class DeepCluster(tf.keras.Model):
"""Autoencoder-based deep clustering."""
def __init__(self, latent_dim=10, n_clusters=3):
super().__init__()
self.encoder = tf.keras.Sequential([
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(latent_dim, activation='relu'),
])
self.decoder = tf.keras.Sequential([
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(784, activation='sigmoid'), # MNIST
])
self.cluster_centers = tf.Variable(
tf.random.normal([n_clusters, latent_dim])
)
def call(self, x):
z = self.encoder(x)
x_hat = self.decoder(z)
return x_hat, z
print("\nDeep Clustering architecture defined for high-dimensional data.")
Scikit-Learn Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
silhouette_score, silhouette_samples,
davies_bouldin_score, calinski_harabasz_score,
adjusted_rand_score
)
from sklearn.datasets import make_blobs, make_moons
from scipy.cluster.hierarchy import dendrogram, linkage
import warnings
warnings.filterwarnings('ignore')
# =========================================
# 1. K-MEANS with Elbow & Silhouette
# =========================================
print("=" * 60)
print("1. K-MEANS CLUSTERING")
print("=" * 60)
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8,
random_state=42)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Elbow Method
inertias = []
sil_scores = []
K_range = range(2, 10)
for k in K_range:
km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
km.fit(X_scaled)
inertias.append(km.inertia_)
sil_scores.append(silhouette_score(X_scaled, km.labels_))
print("\nElbow Method Results:")
for k, inertia, sil in zip(K_range, inertias, sil_scores):
bar_i = '█' * int(inertia / 20)
bar_s = '█' * int(sil * 40)
print(f" k={k}: Inertia={inertia:7.1f} |{bar_i}")
print(f" Silhouette={sil:.4f} |{bar_s}")
# Best K-Means
best_k = list(K_range)[np.argmax(sil_scores)]
print(f"\nOptimal k by Silhouette: {best_k}")
km_final = KMeans(n_clusters=best_k, init='k-means++', n_init=10,
random_state=42)
labels_km = km_final.fit_predict(X_scaled)
print(f"Final Inertia: {km_final.inertia_:.2f}")
print(f"Silhouette: {silhouette_score(X_scaled, labels_km):.4f}")
print(f"Davies-Bouldin: {davies_bouldin_score(X_scaled, labels_km):.4f}")
print(f"Calinski-Harabasz: {calinski_harabasz_score(X_scaled, labels_km):.1f}")
print(f"ARI (vs true): {adjusted_rand_score(y_true, labels_km):.4f}")
# =========================================
# 2. HIERARCHICAL CLUSTERING
# =========================================
print("\n" + "=" * 60)
print("2. HIERARCHICAL CLUSTERING")
print("=" * 60)
# Test all linkage types
for link in ['ward', 'complete', 'average', 'single']:
hc = AgglomerativeClustering(
n_clusters=4, linkage=link
)
labels_hc = hc.fit_predict(X_scaled)
sil = silhouette_score(X_scaled, labels_hc)
ari = adjusted_rand_score(y_true, labels_hc)
print(f" {link:10s} linkage: Silhouette={sil:.4f}, ARI={ari:.4f}")
# Dendrogram with scipy
Z = linkage(X_scaled[:50], method='ward') # Subset for readability
print(f"\nLinkage matrix shape: {Z.shape}")
print(" Columns: [idx1, idx2, distance, n_points]")
print(f" First 3 merges:")
for row in Z[:3]:
print(f" Merge clusters {int(row[0])} & {int(row[1])}, "
f"dist={row[2]:.3f}, size={int(row[3])}")
# =========================================
# 3. DBSCAN
# =========================================
print("\n" + "=" * 60)
print("3. DBSCAN CLUSTERING")
print("=" * 60)
X_moons, y_moons = make_moons(n_samples=300, noise=0.08, random_state=42)
X_moons_scaled = StandardScaler().fit_transform(X_moons)
# Parameter search
print("\nParameter sensitivity:")
for eps in [0.2, 0.3, 0.5, 0.7]:
for min_pts in [3, 5, 10]:
db = DBSCAN(eps=eps, min_samples=min_pts)
labels_db = db.fit_predict(X_moons_scaled)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
n_noise = (labels_db == -1).sum()
if n_clusters >= 2:
sil = silhouette_score(X_moons_scaled[labels_db != -1],
labels_db[labels_db != -1])
else:
sil = -1
print(f" eps={eps}, min_pts={min_pts}: "
f"clusters={n_clusters}, noise={n_noise}, sil={sil:.3f}")
# Best DBSCAN
db_best = DBSCAN(eps=0.3, min_samples=5)
labels_db = db_best.fit_predict(X_moons_scaled)
print(f"\nBest DBSCAN: {len(set(labels_db)) - 1} clusters found")
print(f"Noise points: {(labels_db == -1).sum()}")
print(f"ARI vs true: {adjusted_rand_score(y_moons, labels_db):.4f}")
# =========================================
# 4. GAUSSIAN MIXTURE MODEL
# =========================================
print("\n" + "=" * 60)
print("4. GAUSSIAN MIXTURE MODEL (EM)")
print("=" * 60)
# BIC/AIC for model selection
print("\nModel selection via BIC/AIC:")
for k in range(2, 8):
gmm = GaussianMixture(n_components=k, covariance_type='full',
n_init=3, random_state=42)
gmm.fit(X_scaled)
labels_gmm = gmm.predict(X_scaled)
sil = silhouette_score(X_scaled, labels_gmm)
print(f" k={k}: BIC={gmm.bic(X_scaled):8.1f}, "
f"AIC={gmm.aic(X_scaled):8.1f}, Silhouette={sil:.4f}")
# Best GMM
gmm_best = GaussianMixture(n_components=4, covariance_type='full',
random_state=42)
gmm_best.fit(X_scaled)
labels_gmm = gmm_best.predict(X_scaled)
proba = gmm_best.predict_proba(X_scaled)
print(f"\nBest GMM (k=4):")
print(f" Converged: {gmm_best.converged_}")
print(f" Iterations: {gmm_best.n_iter_}")
print(f" ARI vs true: {adjusted_rand_score(y_true, labels_gmm):.4f}")
print(f"\nSoft assignments (first 5 points):")
for i in range(5):
probs_str = ', '.join(f'{p:.3f}' for p in proba[i])
print(f" Point {i}: [{probs_str}] → Cluster {labels_gmm[i]}")
print(f"\nMixing weights: {gmm_best.weights_.round(3)}")
# =========================================
# 5. COMPREHENSIVE COMPARISON
# =========================================
print("\n" + "=" * 60)
print("5. ALGORITHM COMPARISON")
print("=" * 60)
algorithms = {
'K-Means (k=4)': KMeans(n_clusters=4, random_state=42),
'Hierarchical (Ward)': AgglomerativeClustering(n_clusters=4, linkage='ward'),
'DBSCAN (0.5, 5)': DBSCAN(eps=0.5, min_samples=5),
'GMM (k=4)': GaussianMixture(n_components=4, random_state=42),
}
print(f"\n{'Algorithm':<25} {'Clusters':>8} {'Silhouette':>10} "
f"{'DB-Index':>10} {'ARI':>8}")
print("-" * 65)
for name, algo in algorithms.items():
if hasattr(algo, 'fit_predict'):
labels = algo.fit_predict(X_scaled)
else:
algo.fit(X_scaled)
labels = algo.predict(X_scaled)
mask = labels >= 0
n_clusters = len(set(labels[mask]))
if n_clusters >= 2:
sil = silhouette_score(X_scaled[mask], labels[mask])
db = davies_bouldin_score(X_scaled[mask], labels[mask])
else:
sil = db = float('nan')
ari = adjusted_rand_score(y_true, labels)
print(f"{name:<25} {n_clusters:>8d} {sil:>10.4f} {db:>10.4f} {ari:>8.4f}")
Indian Case Studies
Jio Customer Segmentation — 200M+ Users
Telecom · Clustering at ScaleProblem
Reliance Jio has 400M+ subscribers generating diverse usage patterns. Targeted plans, content recommendations, and network optimization require understanding distinct user segments — without predefined labels.
Approach
- Features: Daily data usage (GB), voice minutes, video streaming hours, recharge frequency, location (urban/rural), device tier, app usage patterns
- Preprocessing: StandardScaler + PCA to 15 dimensions (from 50+ raw features)
- Algorithm: Mini-Batch K-Means (k=8 discovered via elbow + business validation)
- Validation: Silhouette score = 0.41 (acceptable for high-dimensional real data)
Discovered Segments
| Segment | Size | Profile | Jio Action |
|---|---|---|---|
| Data Heavy | 18% | 10GB+/day, heavy streaming | ₹999 unlimited plans |
| Voice Primary | 22% | Low data, high calls, rural | ₹149 voice-first plans |
| Price Sensitive | 25% | Recharge only when offers | Flash sales, sachets |
| JioTV Fans | 12% | Heavy video, moderate data | Content bundles |
| Business Users | 8% | Weekday heavy, enterprise apps | JioBusiness plans |
| Light Users | 15% | Feature phone, minimal data | JioPhone specials |
Impact
Clustering-driven personalized plans increased ARPU (Average Revenue Per User) by 15% and reduced churn by 12% in the price-sensitive segment through targeted retention offers.
# Simulated Jio customer segmentation
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
np.random.seed(42)
# Simulate 10,000 users with 8 features
n_users = 10000
data_usage_gb = np.concatenate([
np.random.exponential(2, 3000), # light users
np.random.normal(8, 2, 4000), # moderate
np.random.normal(15, 3, 3000), # heavy
])
voice_mins = np.random.normal(120, 60, n_users).clip(0)
video_hours = np.random.exponential(1.5, n_users)
recharge_freq = np.random.poisson(2, n_users) + 1
location = np.random.binomial(1, 0.6, n_users) # 1=urban
device_tier = np.random.choice([1, 2, 3], n_users, p=[0.3, 0.5, 0.2])
app_diversity = np.random.poisson(5, n_users)
monthly_spend = np.random.lognormal(5.5, 0.8, n_users)
X = np.column_stack([
data_usage_gb, voice_mins, video_hours, recharge_freq,
location, device_tier, app_diversity, monthly_spend
])
# Pipeline
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# PCA for visualization
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
print(f"PCA explained variance: {pca.explained_variance_ratio_.sum():.2%}")
# Mini-Batch K-Means for scalability
mbkm = MiniBatchKMeans(n_clusters=6, batch_size=1000, random_state=42)
labels = mbkm.fit_predict(X_scaled)
# Segment analysis
print("\nSegment Analysis:")
print(f"{'Segment':<12} {'Count':>6} {'Avg Data(GB)':>12} "
f"{'Avg Voice':>10} {'Avg Spend':>10}")
print("-" * 55)
for seg in range(6):
mask = labels == seg
print(f"Segment {seg:<4} {mask.sum():>6} "
f"{data_usage_gb[mask].mean():>12.1f} "
f"{voice_mins[mask].mean():>10.0f} "
f"{monthly_spend[mask].mean():>10.0f}")
Indian Census — District-Level Population Clustering
Government · Socioeconomic AnalysisProblem
India's Census (2011: 640 districts, 2021: 773 districts) captures hundreds of indicators per district. Clustering helps identify development patterns and target government programs.
Features Used
- Population density, urbanization rate, literacy rate, sex ratio
- Agricultural employment %, industrial employment %, service sector %
- Infant mortality rate, access to electricity, road connectivity
- Scheduled Caste/Tribe population %, HDI indicators
Methodology
Ward's hierarchical clustering with 5 clusters, validated against known development indices:
- Cluster 1 (Metro): High urbanization, high literacy — Delhi, Mumbai, Bangalore districts
- Cluster 2 (Developing Urban): Medium urbanization, growing services — Tier-2 cities
- Cluster 3 (Agricultural): High agricultural employment, moderate literacy — Punjab, Haryana heartland
- Cluster 4 (Tribal/Forest): Low density, high ST population — Chhattisgarh, Jharkhand
- Cluster 5 (Under-served): Low literacy, high IMR — parts of Bihar, UP
Policy Impact
The clustering directly informed Aspirational Districts Programme — identifying 112 districts requiring concentrated development resources. Cluster membership helps predict which interventions (education, healthcare, infrastructure) will be most effective.
Global Case Studies
Spotify — Music Clustering & Discover Weekly
Entertainment · Audio FeaturesHow Spotify Clusters Music
Spotify's audio analysis API extracts features for each of its 100M+ tracks:
- Acoustic features: Danceability, energy, valence (mood), tempo, key, loudness, speechiness, acousticness, instrumentalness
- Spectral features: MFCC (Mel-Frequency Cepstral Coefficients) — 13-dimensional representation of audio spectrum
- Embedding features: 128-dimensional vectors from deep learning models trained on listener co-occurrence
Multi-Level Clustering
- Macro clusters (~5000): K-Means on audio embeddings — broad genres (rock, jazz, hip-hop, classical, electronic)
- Micro clusters (~100,000): Further subdivision — "90s grunge," "lo-fi bedroom pop," "Bollywood dance"
- User taste clusters: GMM on user listening histories — each user is a mixture of music taste clusters
Discover Weekly Pipeline
For each user: find their taste cluster → find songs in those clusters → filter out already-heard songs → rank by popularity and novelty → deliver 30 songs every Monday. This serves 600M+ personalized playlists weekly.
# Spotify-style audio feature clustering
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
# Simulated tracks with Spotify-like features
n_tracks = 1000
features = {
'danceability': np.random.beta(5, 3, n_tracks),
'energy': np.random.beta(4, 3, n_tracks),
'valence': np.random.beta(3, 3, n_tracks),
'tempo': np.random.normal(120, 30, n_tracks).clip(60, 200),
'acousticness': np.random.beta(2, 5, n_tracks),
'instrumentalness': np.random.beta(1.5, 8, n_tracks),
'speechiness': np.random.beta(1.5, 10, n_tracks),
}
X = np.column_stack(list(features.values()))
X_scaled = StandardScaler().fit_transform(X)
# Cluster into micro-genres
km = KMeans(n_clusters=8, init='k-means++', n_init=10, random_state=42)
genre_labels = km.fit_predict(X_scaled)
# Analyze clusters
print("Discovered Micro-Genres:")
print(f"{'Cluster':<10} {'Count':>6} {'Dance':>7} {'Energy':>8} "
f"{'Valence':>9} {'Tempo':>7}")
print("-" * 52)
for c in range(8):
mask = genre_labels == c
print(f"Genre {c:<5} {mask.sum():>5} "
f"{features['danceability'][mask].mean():>6.3f} "
f"{features['energy'][mask].mean():>7.3f} "
f"{features['valence'][mask].mean():>8.3f} "
f"{features['tempo'][mask].mean():>6.1f}")
Amazon — Product Grouping & Recommendation
E-Commerce · Product TaxonomyProblem
Amazon's catalog has 350M+ products. Manual categorization doesn't scale. Clustering helps:
- Product similarity: Group products with similar descriptions, images, and purchase patterns
- "Customers also bought": Users within the same behavioral cluster tend to buy similar items
- Inventory optimization: Cluster warehouses by demand patterns to optimize stocking
Technique Stack
- Text clustering: TF-IDF embeddings of product descriptions → K-Means for initial grouping
- Image clustering: ResNet features of product images → DBSCAN for visual similarity
- Behavioral clustering: User-product interaction matrices → GMM on user purchase embeddings
- Hierarchical taxonomy: Agglomerative clustering builds the category tree (Electronics → Phones → Android Phones → Samsung Galaxy)
Scale
Amazon processes 2.5M product clustering updates daily. Mini-Batch K-Means with FAISS (Facebook AI Similarity Search) enables nearest-centroid queries in <1ms for real-time recommendation.
Startup Applications
Razorpay 🇮🇳
Clusters merchants by transaction patterns for fraud risk scoring. Unusual clusters flag potential fraud rings.
Meesho 🇮🇳
Clusters resellers by behavior to personalize product catalogs — "fashion-focused" vs "electronics-focused" resellers.
CRED 🇮🇳
Clusters credit card users by spending patterns to offer personalized rewards and cashback categories.
PhonePe 🇮🇳
Clusters UPI merchants by transaction volume and timing for dynamic pricing of payment processing.
Government Applications
- Aspirational Districts (NITI Aayog): Hierarchical clustering of 773 districts by development indicators identifies the 112 most under-served districts for targeted intervention
- Smart Cities Mission: K-Means clustering of cities by infrastructure readiness, digital adoption, and governance to prioritize smart city investment
- ISRO Earth Observation: DBSCAN on satellite imagery pixels for land-use classification — urban, agricultural, forest, water, barren — handling mixed pixels as noise
- Indian Railways: Clustering train routes by passenger load patterns to optimize scheduling — "festival rush" vs "daily commuter" vs "tourist season" clusters
- Public Health (ICMR): Spatial clustering of disease outbreaks (dengue, malaria) using DBSCAN on geo-coded cases to identify hotspots for targeted intervention
Industry Applications
| Industry | Clustering Application | Algorithm | Impact |
|---|---|---|---|
| Healthcare | Patient subtyping (diabetes subtypes) | GMM | Personalized treatment protocols |
| Finance | Portfolio diversification | Hierarchical | Optimal asset allocation |
| Retail | Market basket analysis | K-Means | Store layout optimization |
| Manufacturing | Quality control (defect grouping) | DBSCAN | Root cause identification |
| Cybersecurity | Network intrusion detection | DBSCAN | Anomalous traffic isolation |
| NLP | Topic discovery in documents | K-Means + TF-IDF | Content organization |
| Genomics | Gene expression profiling | Hierarchical | Disease biomarker discovery |
| Astronomy | Galaxy classification | GMM | Morphological taxonomy |
Mini Projects
Project 1: Customer Segmentation Dashboard
Objective: Build an end-to-end customer segmentation pipeline with interactive visualization.
"""
Mini Project 1: Customer Segmentation Dashboard
Dataset: Mall Customer Segmentation (simulated)
"""
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, silhouette_samples
np.random.seed(42)
# --- Simulate Mall Customer Data ---
n_customers = 500
data = {
'annual_income_k': np.concatenate([
np.random.normal(30, 8, 150), # Low income
np.random.normal(55, 10, 200), # Middle
np.random.normal(90, 12, 150), # High
]),
'spending_score': np.concatenate([
np.random.normal(70, 15, 100), # High spenders (low income)
np.random.normal(30, 10, 100), # Low spenders (low income)
np.random.normal(50, 15, 150), # Average
np.random.normal(75, 12, 80), # High spenders (high income)
np.random.normal(20, 8, 70), # Low spenders (high income)
]),
'age': np.concatenate([
np.random.normal(25, 5, 150),
np.random.normal(40, 8, 200),
np.random.normal(55, 7, 150),
]),
'visit_frequency': np.concatenate([
np.random.poisson(8, 200),
np.random.poisson(3, 150),
np.random.poisson(5, 150),
]),
}
X = np.column_stack([data[f] for f in data]).astype(float)
# --- Preprocessing ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# --- Elbow + Silhouette Analysis ---
K_range = range(2, 10)
results = {'k': [], 'inertia': [], 'silhouette': []}
for k in K_range:
km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
labels = km.fit_predict(X_scaled)
results['k'].append(k)
results['inertia'].append(km.inertia_)
results['silhouette'].append(silhouette_score(X_scaled, labels))
# Print analysis
print("Optimal K Analysis:")
print(f"{'k':>3} {'Inertia':>10} {'Silhouette':>12}")
print("-" * 28)
for k, ine, sil in zip(results['k'], results['inertia'], results['silhouette']):
marker = ' ← BEST' if sil == max(results['silhouette']) else ''
print(f"{k:>3} {ine:>10.1f} {sil:>12.4f}{marker}")
# --- Final Clustering ---
best_k = results['k'][np.argmax(results['silhouette'])]
km_final = KMeans(n_clusters=best_k, init='k-means++', n_init=10,
random_state=42)
labels = km_final.fit_predict(X_scaled)
# --- Segment Profiles ---
print(f"\n--- Customer Segments (k={best_k}) ---")
print(f"{'Segment':<12} {'Count':>6} {'Income':>8} {'Spend':>7} "
f"{'Age':>6} {'Visits':>7}")
print("-" * 52)
segment_names = ['Budget', 'Value', 'Premium', 'VIP', 'Casual']
for seg in range(best_k):
mask = labels == seg
name = segment_names[seg] if seg < len(segment_names) else f"Seg{seg}"
print(f"{name:<12} {mask.sum():>6} "
f"{data['annual_income_k'][mask].mean():>8.1f} "
f"{data['spending_score'][mask].mean():>7.1f} "
f"{data['age'][mask].mean():>6.1f} "
f"{data['visit_frequency'][mask].mean():>7.1f}")
# --- PCA Visualization ---
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(f"\nPCA Explained Variance: {pca.explained_variance_ratio_.sum():.2%}")
print(f"Silhouette Score: {silhouette_score(X_scaled, labels):.4f}")
# --- Actionable Recommendations ---
print("\n--- Marketing Recommendations ---")
for seg in range(best_k):
mask = labels == seg
income = data['annual_income_k'][mask].mean()
spend = data['spending_score'][mask].mean()
name = segment_names[seg] if seg < len(segment_names) else f"Seg{seg}"
if income > 70 and spend > 60:
action = "VIP loyalty program, exclusive previews"
elif income > 70 and spend < 40:
action = "Targeted promotions to increase engagement"
elif income < 40 and spend > 60:
action = "Installment plans, EMI options"
elif income < 40 and spend < 40:
action = "Value deals, festive season discounts"
else:
action = "Balanced offers, membership benefits"
print(f" {name}: {action}")
Project 2: Image Color Quantization
Objective: Reduce the number of colors in an image using K-Means — a classic application of clustering to compress images.
"""
Mini Project 2: Image Color Quantization using K-Means
Reduces 16M possible colors to k representative colors
"""
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min
def create_sample_image(height=100, width=100):
"""Create a synthetic image with distinct color regions."""
img = np.zeros((height, width, 3), dtype=np.uint8)
# Sky (light blue)
img[:40, :, :] = [135, 206, 235]
# Sun (yellow)
for i in range(15, 30):
for j in range(70, 85):
if (i-22)**2 + (j-77)**2 < 49:
img[i, j] = [255, 223, 0]
# Grass (green)
img[40:70, :, :] = [34, 139, 34]
# Earth (brown)
img[70:, :, :] = [139, 90, 43]
# Add noise
noise = np.random.randint(-15, 15, img.shape)
img = np.clip(img.astype(int) + noise, 0, 255).astype(np.uint8)
return img
def quantize_image(image, n_colors=8):
"""Quantize image colors using K-Means."""
h, w, c = image.shape
# Reshape to (n_pixels, 3)
pixels = image.reshape(-1, 3).astype(np.float64)
print(f"Original: {len(np.unique(pixels, axis=0))} unique colors")
# Mini-Batch K-Means for speed
kmeans = MiniBatchKMeans(
n_clusters=n_colors,
batch_size=1000,
random_state=42
)
labels = kmeans.fit_predict(pixels)
# Replace each pixel with its centroid color
quantized_pixels = kmeans.cluster_centers_[labels]
quantized_image = quantized_pixels.reshape(h, w, c).astype(np.uint8)
# Compression ratio
original_size = h * w * 3 * 8 # bits
# Quantized: palette (n_colors * 3 * 8 bits) + indices (h*w * log2(n_colors) bits)
compressed_size = n_colors * 3 * 8 + h * w * np.ceil(np.log2(n_colors))
ratio = original_size / compressed_size
print(f"Quantized: {n_colors} unique colors")
print(f"Compression ratio: {ratio:.1f}x")
print(f"Palette (RGB):")
for i, color in enumerate(kmeans.cluster_centers_.astype(int)):
count = (labels == i).sum()
pct = count / len(labels) * 100
print(f" Color {i}: RGB({color[0]:3d}, {color[1]:3d}, {color[2]:3d})"
f" — {pct:.1f}% of pixels")
return quantized_image, kmeans
# --- Run ---
img = create_sample_image()
print(f"Image shape: {img.shape}")
print(f"Total pixels: {img.shape[0] * img.shape[1]}\n")
for n_colors in [4, 8, 16]:
print(f"\n{'='*40}")
print(f"Quantization with {n_colors} colors:")
print(f"{'='*40}")
q_img, km = quantize_image(img, n_colors)
# MSE between original and quantized
mse = np.mean((img.astype(float) - q_img.astype(float)) ** 2)
psnr = 10 * np.log10(255**2 / mse) if mse > 0 else float('inf')
print(f"MSE: {mse:.2f}")
print(f"PSNR: {psnr:.2f} dB")
Project 3: Document Topic Discovery
Objective: Cluster news articles by topic using TF-IDF + K-Means.
"""
Mini Project 3: Document Topic Discovery
Clusters text documents into topics using TF-IDF + KMeans
"""
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Sample news headlines
documents = [
# Sports
"India wins cricket world cup after thrilling final",
"Virat Kohli scores century in IPL match",
"Premier League football season kicks off",
"Olympic gold medal for Indian athlete in javelin",
"Tennis grand slam results surprise fans",
# Technology
"New AI model breaks language understanding records",
"Smartphone sales surge in Indian market",
"Tech startup raises funding for cloud platform",
"Machine learning transforms healthcare diagnosis",
"Quantum computing milestone achieved by researchers",
# Politics
"Parliament session discusses new education policy",
"State elections results show shifting voter patterns",
"Government announces new infrastructure spending bill",
"International summit focuses on climate change policy",
"Political parties prepare for upcoming general elections",
# Business
"Stock market reaches all-time high on positive earnings",
"RBI announces interest rate decision for quarter",
"Startup unicorn IPO values company at billion dollars",
"Manufacturing sector shows strong growth this quarter",
"Oil prices impact global economic recovery forecast",
]
# TF-IDF Vectorization
vectorizer = TfidfVectorizer(max_features=200, stop_words='english')
X_tfidf = vectorizer.fit_transform(documents)
print(f"TF-IDF matrix shape: {X_tfidf.shape}")
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
# Find optimal k
for k in range(2, 7):
km = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = km.fit_predict(X_tfidf)
sil = silhouette_score(X_tfidf, labels)
print(f"k={k}: Silhouette={sil:.4f}")
# Final clustering
km = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = km.fit_predict(X_tfidf)
# Display results
print("\nDiscovered Topics:")
feature_names = vectorizer.get_feature_names_out()
for c in range(4):
print(f"\n--- Topic {c} ---")
# Top terms per cluster (highest centroid values)
center = km.cluster_centers_[c]
top_indices = center.argsort()[-5:][::-1]
top_terms = [feature_names[i] for i in top_indices]
print(f"Top terms: {', '.join(top_terms)}")
print(f"Documents:")
for i, doc in enumerate(documents):
if labels[i] == c:
print(f" • {doc[:60]}")
End-of-Chapter Exercises
Multiple Choice Questions
Q1. K-Means clustering minimizes which objective?
Q2. In K-Means++, the probability of selecting the next centroid is proportional to:
Q3. Which clustering algorithm can automatically determine the number of clusters?
Q4. A silhouette score of -0.3 for a data point indicates:
Q5. Which linkage type in hierarchical clustering is most susceptible to the "chaining" effect?
Q6. In the EM algorithm for GMMs, the E-step computes:
Q7. DBSCAN with ε=0.5 and MinPts=5 classifies a point as a core point if:
Q8. The key advantage of GMMs over K-Means is:
Q9. The time complexity of standard K-Means per iteration is:
Q10. For the Davies-Bouldin Index, which value indicates better clustering?
Q11. What happens if a K-Means cluster becomes empty during an iteration?
Q12. Which evaluation metric requires ground truth labels?
Interview Questions
Answer: The choice depends on the data characteristics and requirements:
- K-Means: Use when you expect roughly spherical, equally-sized clusters and know k. Fast (O(nkd)), scalable. Best for: customer segmentation, image quantization.
- DBSCAN: Use when clusters have arbitrary shapes, you don't know k, and you need to identify outliers. Best for: spatial data, anomaly detection. Struggles with varying densities.
- Hierarchical: Use when you want to explore multiple granularities via a dendrogram, small-to-medium datasets. Best for: taxonomy building, gene expression analysis. O(n²) memory limits scalability.
- GMM: Use when clusters overlap and you need soft assignments. Best for: probabilistic applications, when clusters are elliptical.
Answer: K-Means converges to a local minimum of the WCSS objective, not necessarily the global minimum. The NP-hard nature of the optimization means different initializations can lead to different solutions. Mitigations: (1) K-Means++ initialization, (2) multiple restarts (n_init=10 in sklearn), (3) verify using evaluation metrics. A "converged but wrong" result often happens when k is wrong or when clusters are non-spherical.
Answer: The "curse of dimensionality" makes distances meaningless in high dimensions (all points become equally far apart). Solutions:
- PCA/UMAP: Reduce dimensionality before clustering
- Feature selection: Remove irrelevant features
- Subspace clustering: Find clusters in different subspaces
- Deep clustering: Learn a low-dimensional embedding using autoencoders, then cluster in latent space
- Cosine distance: More meaningful than Euclidean in high dimensions (used in text clustering)
Answer: K-Means is a special case of EM for GMMs where: (1) all covariance matrices are σ²I (identical isotropic), (2) mixing weights are uniform (πⱼ = 1/k), and (3) we take σ → 0 so soft assignments become hard assignments. The E-step becomes hard assignment (argmin distance), and the M-step becomes computing cluster means. This is known as "hard EM."
Answer: Use intrinsic evaluation metrics: (1) Silhouette Score: measures cluster cohesion vs separation [-1, 1]. (2) Davies-Bouldin Index: measures ratio of within/between cluster distances (lower = better). (3) Calinski-Harabasz: ratio of between/within variance (higher = better). (4) Elbow method: look for diminishing returns in WCSS. (5) Business validation: do the clusters make domain sense? (6) Stability: bootstrap the data and check if clusters are consistent.
Answer: The ε and MinPts parameters are mistuned:
- Everything in one cluster: ε is too large — every point reaches every other point via density connectivity. Solution: decrease ε.
- Everything is noise: ε is too small or MinPts is too large — no point has enough neighbors to be a core point. Solution: increase ε or decrease MinPts.
- Tuning strategy: Plot the k-distance graph (sorted distances to the k-th nearest neighbor). The "elbow" in this graph suggests a good ε value. MinPts ≈ 2 × dimensionality is a common heuristic.
Answer: Standard algorithms won't fit in memory. Scalable approaches:
- Mini-Batch K-Means: Processes random subsets; sklearn implementation works on 10M+ points
- BIRCH: Builds a compact summary (CF tree) in one pass, then clusters summaries
- Distributed K-Means: MapReduce/Spark — each mapper assigns local points, reducer updates global centroids
- FAISS (Facebook): GPU-accelerated nearest neighbor search for trillion-scale clustering
- Approximate methods: Sample 1%, cluster, assign rest to nearest centroids
Answer: Clustering is heavily influenced by preprocessing:
- Feature scaling: StandardScaler or MinMaxScaler — distance-based algorithms are sensitive to scale
- Outlier handling: Winsorize or remove extreme outliers (they distort centroids)
- Dimensionality reduction: PCA for de-correlation; UMAP for nonlinear manifolds
- Feature engineering: Log-transform skewed features, encode categoricals properly
- Missing values: Impute before clustering (KNN imputation is good for clustering pipelines)
Answer: Hard clustering (K-Means): each point belongs to exactly one cluster. Example: "This customer is in the 'Premium' segment." Soft clustering (GMM): each point has probabilities across all clusters. Example: "This customer is 60% Premium, 30% Value-seeker, 10% Budget." Soft clustering is better when categories overlap — a Spotify user who listens to both jazz and rock shouldn't be forced into one genre cluster. GMMs, fuzzy c-means, and topic models (LDA) provide soft assignments.
Answer:
- K-Means: Time: O(n·k·d·I) where I = iterations (usually <100). Space: O(n·d + k·d). Very scalable.
- Hierarchical (agglomerative): Time: O(n³) naive, O(n² log n) with priority queue. Space: O(n²) for the distance matrix. This limits it to ~10,000-50,000 points in practice.
- DBSCAN: Time: O(n²) naive, O(n log n) with spatial index (KD-tree). Space: O(n).
- GMM (EM): Time: O(n·k·d²·I) (d² for covariance). Space: O(k·d²).
Research Problems
Key Takeaways
-
1
Unsupervised ≠ Unlabeled: Clustering discovers structure without labels. The algorithm defines what "good clusters" means through its objective function — different algorithms find different structure in the same data.
-
2
K-Means is fast but limited: O(nkd) per iteration, converges to local minimum of WCSS. Assumes spherical, equal-sized clusters. Use K-Means++ initialization and multiple restarts. Best for: large datasets, roughly spherical clusters.
-
3
Hierarchical gives you a tree: The dendrogram lets you explore clusterings at every granularity by cutting at different heights. Ward's linkage is the most robust. Limitation: O(n²) space makes it impractical beyond ~50K points.
-
4
DBSCAN finds arbitrary shapes and noise: The only algorithm here that can label points as outliers. No need to specify k. But tuning ε and MinPts requires care — use the k-distance graph. Fails with clusters of varying density.
-
5
GMMs provide probabilistic assignments: Soft clustering is more informative than hard clustering. EM algorithm guarantees monotonic log-likelihood increase. K-Means is a special case of GMM (with isotropic covariances, hard assignments).
-
6
Evaluation requires multiple metrics: Use Silhouette (cohesion vs separation), Davies-Bouldin (lower = better), Calinski-Harabasz (higher = better), and always validate with domain knowledge. No single metric tells the whole story.
-
7
Preprocessing is critical: Feature scaling, outlier handling, and dimensionality reduction dramatically affect results. Always StandardScale before distance-based clustering. Consider PCA for de-correlation.
-
8
Scale matters: K-Means (Mini-Batch) scales to billions of points. Hierarchical clustering is limited to tens of thousands. DBSCAN with spatial indexing handles millions. Choose your algorithm based on your data size.
-
9
Business interpretation completes the pipeline: Finding k clusters is only half the job. The other half is interpreting what each cluster represents and what action to take for each segment. A technically perfect clustering that business stakeholders can't understand is useless.
References
Foundational Papers
- Lloyd, S. P. (1982). Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. [Original K-Means paper, written 1957]
- Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD-96. [DBSCAN]
- Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. JRSS Series B, 39(1), 1–38. [EM Algorithm]
- Arthur, D. & Vassilvitskii, S. (2007). K-Means++: The Advantages of Careful Seeding. SODA '07. [K-Means++ initialization]
- Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. JASA, 58(301), 236–244. [Ward's linkage]
Textbooks
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 9: Mixture Models and EM.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapter 14: Unsupervised Learning.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 21: Clustering.
- Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd ed. O'Reilly. Chapter 9.
- Jain, A. K. (2010). Data Clustering: 50 Years Beyond K-Means. Pattern Recognition Letters, 31(8), 651–666.
Indian Context References
- NITI Aayog (2018). Aspirational Districts — Methodology for Identification. Government of India.
- Rao, C. R. (1945). Information and Accuracy Attainable in the Estimation of Statistical Parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–91. [Rao distance for clustering]
- UIDAI (2016). Aadhaar Technology Architecture. [Biometric deduplication via blocking/clustering]
- TRAI Reports on Telecom Usage Patterns — subscriber segmentation methodology.
Software Documentation
- Scikit-Learn Clustering Documentation: sklearn.cluster
- SciPy Hierarchical Clustering: scipy.cluster.hierarchy
- HDBSCAN Library: McInnes, L. (2017). hdbscan.readthedocs.io