Phase 3 • EduArtha

Classical Machine Learning

Before deep learning, you must understand the simpler models. They build your intuition for how learning works. This book covers supervised learning, unsupervised learning, model evaluation, and production ML frameworks.

⏱ 3–5 months | 14 Chapters | 55+ Exercises

Part I

Supervised Learning

Learning from labeled data

Chapter 1

Linear & Logistic Regression

Learning Objectives

Understand linear regression: hypothesis, cost function, gradient descent
Master logistic regression for binary classification
Implement both from scratch and with scikit-learn
Interpret coefficients and understand assumptions

Linear Regression

Linear regression finds the best-fit line through data by minimizing the sum of squared errors. The model assumes a linear relationship: ŷ = w₁x₁ + w₂x₂ + ... + wₙxₙ + b.

Cost function (MSE): J(w) = (1/2m) Σᵢ (ŷᵢ - yᵢ)² → Minimize by adjusting weights w

Python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import make_regression

# Generate synthetic data
X, y = make_regression(n_samples=500, n_features=3, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scikit-learn
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Coefficients: {model.coef_.round(2)}")
print(f"Intercept:    {model.intercept_:.2f}")
print(f"R² Score:     {r2_score(y_test, y_pred):.4f}")
print(f"RMSE:         {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")

From Scratch: Gradient Descent

Python
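import numpy as np

class LinearRegressionGD:
    """Linear regression trained with batch gradient descent."""

    def __init__(self, lr=0.01, epochs=1000):
        self.lr = lr            # learning rate (step size)
        self.epochs = epochs    # number of full passes over the data

    def fit(self, X, y):
        m, n = X.shape
        self.w = np.zeros(n)
        self.b = 0.0
        self.losses = []

        for _ in range(self.epochs):
            y_pred = X @ self.w + self.b
            error = y_pred - y
            # Record the MSE of the current weights, before the update
            self.losses.append(np.mean(error**2))
            # Gradient step for J(w) = (1/2m) Σ (ŷ - y)²
            self.w -= self.lr * (1/m) * (X.T @ error)
            self.b -= self.lr * (1/m) * np.sum(error)
        return self

    def predict(self, X):
        return X @ self.w + self.b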

Logistic Regression

Logistic regression applies the sigmoid function σ(z) = 1/(1+e⁻ᶻ) to convert linear output into a probability between 0 and 1. It uses binary cross-entropy loss instead of MSE.
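
The same idea can be sketched from scratch. The class below is a minimal illustration, not a replacement for scikit-learn: the name LogisticRegressionGD, the learning rate, the epoch count, and the synthetic two-blob data are all assumptions chosen for the example. It uses the fact that the gradient of binary cross-entropy with respect to the linear output is simply (prediction - label).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LogisticRegressionGD:
    """Binary logistic regression trained with batch gradient descent (illustrative)."""

    def __init__(self, lr=0.1, epochs=2000):
        self.lr = lr
        self.epochs = epochs

    def fit(self, X, y):
        m, n = X.shape
        self.w = np.zeros(n)
        self.b = 0.0
        for _ in range(self.epochs):
            p = sigmoid(X @ self.w + self.b)   # predicted P(y=1 | x)
            error = p - y                      # gradient of BCE w.r.t. the linear output
            self.w -= self.lr * (X.T @ error) / m
            self.b -= self.lr * error.mean()
        return self

    def predict_proba(self, X):
        return sigmoid(X @ self.w + self.b)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)

# Toy data: two well-separated Gaussian blobs (synthetic, for illustration only)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

demo_clf = LogisticRegressionGD().fit(X_demo, y_demo)
demo_acc = (demo_clf.predict(X_demo) == y_demo).mean()
print(f"Training accuracy: {demo_acc:.2%}")
```

The update rule has the same shape as the linear regression version; only the prediction (sigmoid of the linear output) and the loss differ.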

Python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Scale inside a pipeline so the scaler is fit on the training split only
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred, target_names=data.target_names))

Project: House Price Prediction

Python
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso

housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Compare Linear, Ridge (L2), Lasso (L1)
for name, model in [("Linear", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X_train_s, y_train)
    score = model.score(X_test_s, y_test)
    print(f"{name:8s} R²: {score:.4f}")

Exercises

Exercise 1.1: What is the difference between R² and adjusted R²?

R² measures the proportion of variance explained, but on the training data it never decreases when you add features (even irrelevant ones). Adjusted R² penalizes additional features: it increases only if a new feature improves the model more than would be expected by chance. Use adjusted R² when comparing models with different numbers of features.
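
As a worked example, the standard formula is adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), with n samples and p features. The helper name adjusted_r2 and the numbers below are illustrative assumptions, not values from a real model.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# The same R² of 0.80 on 100 samples, reached with 3 vs 30 features
adj_small = adjusted_r2(0.80, n_samples=100, n_features=3)
adj_large = adjusted_r2(0.80, n_samples=100, n_features=30)
print(f"3 features:  {adj_small:.4f}")
print(f"30 features: {adj_large:.4f}")
```

The model that needed 30 features to reach the same R² is penalized more heavily.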

Exercise 1.2: Why must you scale features before logistic regression?

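scikit-learn's LogisticRegression is fit with an iterative solver (L-BFGS by default) and applies L2 regularization by default. Features with large ranges (e.g., salary: 50000) dominate small-range features (e.g., age: 30): the loss surface becomes badly conditioned, so the solver converges slowly or stops at max_iter before reaching the optimum, and the penalty shrinks coefficients unevenly across features. StandardScaler (mean=0, std=1) puts all features on equal footing.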

Exercise 1.3: When would Ridge (L2) outperform Lasso (L1)?

Ridge: When all features are somewhat relevant — it shrinks coefficients but doesn't zero them out. Better for multicollinearity. Lasso: When you suspect many features are irrelevant — it drives coefficients to exactly zero, performing automatic feature selection. Use ElasticNet for a mix of both.
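
The difference in sparsity can be checked on synthetic data. This is a sketch under assumed settings: 20 features of which only 5 are informative, and alpha=5.0 for both models, chosen only to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only 5 carry signal (assumed setup)
X_syn, y_syn = make_regression(n_samples=200, n_features=20, n_informative=5,
                               noise=10, random_state=0)
X_syn = StandardScaler().fit_transform(X_syn)

ridge = Ridge(alpha=5.0).fit(X_syn, y_syn)
lasso = Lasso(alpha=5.0).fit(X_syn, y_syn)

# Count coefficients that are exactly zero
ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print(f"Ridge zero coefficients: {ridge_zeros} of 20")
print(f"Lasso zero coefficients: {lasso_zeros} of 20")
```

Ridge shrinks every coefficient but leaves them non-zero, while Lasso removes features outright.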

Exercise 1.4: Implement the sigmoid function and binary cross-entropy loss

def sigmoid(z): return 1 / (1 + np.exp(-z))
def bce_loss(y, y_hat):
    eps = 1e-15
    y_hat = np.clip(y_hat, eps, 1-eps)
    return -np.mean(y*np.log(y_hat) + (1-y)*np.log(1-y_hat))

Chapter Summary

Linear regression minimizes MSE to find the best-fit hyperplane
Logistic regression uses sigmoid + cross-entropy for classification
Ridge (L2) and Lasso (L1) add regularization to prevent overfitting
Always scale features before training gradient-based models

Chapter 2

Decision Trees & Random Forests

Learning Objectives

Understand how decision trees split data using impurity measures
Master random forests: bagging + feature randomness
Interpret feature importance and control overfitting
Know when to use trees vs linear models

Decision Trees

A decision tree recursively splits data by asking yes/no questions on features, choosing splits that maximize information gain (reduce impurity). It's one of the most interpretable ML models.

Impurity Measure	Formula	Used For
Gini Index	1 - Σ pᵢ²	Classification (default in sklearn)
Entropy	-Σ pᵢ log₂(pᵢ)	Classification (information gain)
MSE	(1/n)Σ(yᵢ - ȳ)²	Regression
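
The two classification measures in the table can be computed directly from class proportions. The function names gini and entropy and the tiny label lists are illustrative choices.

```python
import numpy as np

def gini(labels):
    """Gini index: 1 - sum(p_i^2) over class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure = [1, 1, 1, 1]     # a node containing one class only
mixed = [0, 0, 1, 1]    # a node with a 50/50 split

print(f"pure:  gini={gini(pure):.2f}, entropy={abs(entropy(pure)):.2f}")
print(f"mixed: gini={gini(mixed):.2f}, entropy={entropy(mixed):.2f}")
```

A pure node scores 0 on both measures; a 50/50 binary node reaches the maximum (0.5 for Gini, 1 bit for entropy). A split is good when the children are purer than the parent.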

Python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(f"Accuracy: {tree.score(X_test, y_test):.2%}")
print("\nTree Rules:")
print(export_text(tree, feature_names=iris.feature_names))

Random Forests

A random forest builds many decision trees on random subsets of data (bagging) and random subsets of features, then votes/averages their predictions. This dramatically reduces overfitting.

Python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
print(f"RF Accuracy: {rf.score(X_test, y_test):.2%}")

# Feature importance — which features matter most?
importance = pd.Series(rf.feature_importances_, index=iris.feature_names)
print("\nFeature Importance:")
print(importance.sort_values(ascending=False))

Controlling Overfitting in Trees

max_depth: Limits tree depth.
min_samples_split: Minimum samples required to split a node.
min_samples_leaf: Minimum samples required in a leaf.
max_features: Number of features considered per split.

Start with the defaults, then tune with cross-validation.

Project: Customer Churn Prediction

Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np

# Simulate churn data
np.random.seed(42)
n = 1000
X = np.column_stack([
    np.random.randint(1, 72, n),         # tenure (months)
    np.random.uniform(20, 120, n),       # monthly charge
    np.random.randint(0, 10, n),          # support calls
    np.random.choice([0,1], n)            # has contract
])
# Churn more likely with short tenure, high charge, many calls
churn_prob = 1/(1 + np.exp(-(-2 + 0.05*(60-X[:,0]) + 0.02*X[:,1] + 0.3*X[:,2] - 1.5*X[:,3])))
y = (np.random.random(n) < churn_prob).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=42)
rf.fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te), target_names=['Stay','Churn']))

Exercises

Exercise 2.1: Why do single decision trees tend to overfit?

Without constraints, a tree can grow until each leaf has one sample, memorizing the training data. It creates complex rules that capture noise rather than signal. Solutions: limit depth, require min samples per leaf, prune, or use ensembles (random forest).

Exercise 2.2: What makes random forests better than a single tree?

RF reduces variance through (1) bagging: each tree trains on a bootstrap sample (drawn with replacement from the training set), and (2) feature randomness: each split considers only a random subset of the features (√p of the p features by default for classification in scikit-learn), which keeps the trees diverse. Averaging many diverse, slightly overfitted trees reduces overall error.

Exercise 2.3: How do you interpret feature importance in random forests?

Feature importance = average decrease in impurity (Gini/entropy) across all trees when that feature is used. Higher = more predictive. Caveat: correlated features share importance, and high-cardinality features may be overweighted. Use permutation importance for more reliable results.
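
Permutation importance is available in scikit-learn as sklearn.inspection.permutation_importance. The sketch below assumes the Iris data and a 70/30 split; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris_pi = load_iris()
Xp_tr, Xp_te, yp_tr, yp_te = train_test_split(
    iris_pi.data, iris_pi.target, test_size=0.3, random_state=42)

rf_pi = RandomForestClassifier(n_estimators=100, random_state=42).fit(Xp_tr, yp_tr)

# Shuffle one feature at a time on held-out data and measure the drop in accuracy
result = permutation_importance(rf_pi, Xp_te, yp_te, n_repeats=10, random_state=42)
for name, mean, std in zip(iris_pi.feature_names,
                           result.importances_mean, result.importances_std):
    print(f"{name:20s} {mean:.3f} +/- {std:.3f}")
```

Because it is measured on held-out data, this score reflects what the model actually relies on for generalization rather than how often a feature was used for splitting.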

Chapter Summary

Decision trees split data by maximizing information gain at each node
Random forests combine many diverse trees via bagging + feature randomness
Feature importance reveals which inputs drive predictions
Control overfitting with max_depth, min_samples_split, min_samples_leaf, and max_features

Chapter 3

Support Vector Machines (SVMs)

Learning Objectives

Understand maximum margin classifiers and support vectors
Master the kernel trick for non-linear classification
Choose between linear, RBF, and polynomial kernels
Apply SVMs for text classification

Maximum Margin Classifier

SVM finds the hyperplane that maximizes the margin — the distance between the nearest data points of each class (support vectors) and the decision boundary. Larger margin = better generalization.

Python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=2,
    n_redundant=0, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Always scale for SVM!
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Compare kernels
for kernel in ['linear', 'rbf', 'poly']:
    svm = SVC(kernel=kernel, C=1.0)
    svm.fit(X_train_s, y_train)
    acc = svm.score(X_test_s, y_test)
    print(f"{kernel:8s} → Accuracy: {acc:.2%}, Support vectors: {svm.n_support_}")

The Kernel Trick

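When data isn't linearly separable, the kernel trick implicitly maps it to a higher-dimensional space where it is often separable, without ever computing the transformation explicitly. The RBF (Gaussian) kernel corresponds to an infinite-dimensional feature space and is the default in scikit-learn's SVC. Parameter C controls the tradeoff between margin width and training errors; gamma controls how far the influence of a single training point reaches (larger gamma means a smaller radius and a more flexible boundary).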
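
A small worked example may make "without computing the transformation" concrete. It uses the degree-2 polynomial kernel rather than RBF, because its feature map is finite and can be written out; the function names phi and poly2_kernel are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D point (no bias term)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: (x . y)^2."""
    return np.dot(x, y) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))   # inner product in the 3D feature space
implicit = poly2_kernel(a, b)       # same value, computed in the original 2D space
print(explicit, implicit)
```

Both routes give the same inner product, but the kernel never builds the higher-dimensional vectors. For RBF the explicit map would be infinite-dimensional, so the kernel form is the only practical one.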

Project: Handwritten Digit Classification

Python
from sklearn.datasets import load_digits

digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

svm = SVC(kernel='rbf', C=10, gamma='scale')
svm.fit(X_tr_s, y_tr)
print(f"Digit classification accuracy: {svm.score(X_te_s, y_te):.2%}")

Exercises

Exercise 3.1: What happens with high C vs low C?

High C: Approaches a hard margin; the model tries to classify every training point correctly, risking overfitting. Low C: Soft margin; allows some misclassifications in exchange for a wider margin and often better generalization. Think of C as "how much do I care about individual errors vs overall margin width."

Exercise 3.2: Why is SVM not suitable for very large datasets?

Kernel SVM training scales between O(n²) and O(n³) in the number of samples, so at around 1M samples it becomes extremely slow. The model also stores its support vectors, which can require significant memory. For large datasets, use a linear SVM (LinearSVC, or SGDClassifier with hinge loss) or switch to random forests/gradient boosting.

Exercise 3.3: When would you choose SVM over random forest?

SVM excels with: (1) high-dimensional data with few samples (text classification), (2) clear margin of separation, (3) when you need a theoretically grounded model (max margin). Random forest is better for: large datasets, mixed feature types, feature importance needs, and when interpretability matters.

Chapter Summary

SVMs maximize the margin between classes using support vectors
The kernel trick enables non-linear classification without explicit transformation
RBF kernel is the default; C and gamma are the key hyperparameters
Always scale features before SVM; impractical for very large datasets

Chapter 4

k-Nearest Neighbors (kNN)

Learning Objectives

Understand lazy learning and instance-based classification
Choose k and distance metrics wisely
Know kNN strengths and limitations

Python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_wine

wine = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then reuse it for the test split
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

# Compare several values of k (in practice, pick k with cross-validation, not the test set)
for k in [1, 3, 5, 7, 11, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr_s, y_tr)
    print(f"k={k:2d} → Accuracy: {knn.score(X_te_s, y_te):.2%}")
# k=1 overfits (memorizes), large k underfits (too smooth)

Pros	Cons
No training phase (lazy)	Slow prediction: O(n) per query
Simple, intuitive	Curse of dimensionality
Non-parametric (any shape)	Sensitive to irrelevant features
Works for classification & regression	Must store entire dataset

Exercises

Exercise 4.1: What is the curse of dimensionality and how does it affect kNN?

In high dimensions, all points become roughly equidistant, so the concept of "nearest" loses meaning. Most of the volume of a 10D cube lies near its surface and corners, not its center. kNN degrades because distance metrics become unreliable. Solution: reduce dimensions with PCA first, or use tree-based models.
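
The loss of contrast between near and far points can be observed with random data. This sketch assumes points drawn uniformly from the unit cube and uses a made-up helper, distance_contrast, that reports (farthest - nearest) / nearest for one query point.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n_points=500):
    """(farthest - nearest) / nearest distance from one random query point."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrast = {d: distance_contrast(d) for d in [2, 10, 100, 1000]}
for d, c in contrast.items():
    print(f"dim={d:4d}  relative contrast={c:.2f}")
```

In 2D the farthest point is many times farther than the nearest one; in 1000D the two distances are nearly the same, which is why the nearest neighbor carries little information.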

Exercise 4.2: Why use odd values of k?

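Odd k avoids ties in binary classification. With k=4, you might get 2 votes for each class; odd k (3, 5, 7) guarantees a majority decision. For multi-class problems, ties are still possible even with odd k. With uniform weights, scikit-learn resolves a tied vote in favor of the class that comes first in sorted label order, so the outcome is arbitrary; weights='distance' makes exact ties unlikely.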

Exercise 4.3: Implement weighted kNN where closer neighbors have more influence

knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
# Closer neighbors get weight = 1/distance
# This helps when the decision boundary is complex

Chapter Summary

kNN classifies by majority vote of k nearest neighbors — no training phase
Always scale features; kNN is distance-based
Small k overfits (captures noise), large k underfits (over-smooths)
Impractical for high-dimensional or large datasets

Chapter 5

Ensemble Methods: Boosting & Bagging

Learning Objectives

Understand bagging (Random Forest) vs boosting (AdaBoost, XGBoost)
Master gradient boosting — the king of tabular ML
Use XGBoost and LightGBM for competition-winning models

Bagging vs Boosting

Aspect	Bagging (RF)	Boosting (XGBoost)
Strategy	Parallel: independent trees	Sequential: each fixes previous errors
Reduces	Variance (overfitting)	Bias (underfitting)
Risk	Rarely overfits	Can overfit if too many rounds
Speed	Parallelizable	Sequential (slower)

Python
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, test_size=0.2)

# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X_tr, y_tr)
print(f"Gradient Boosting: {gb.score(X_te, y_te):.2%}")

# AdaBoost
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5)
ada.fit(X_tr, y_tr)
print(f"AdaBoost:          {ada.score(X_te, y_te):.2%}")

XGBoost — Industry Standard

Python
# pip install xgboost
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1,
    subsample=0.8, colsample_bytree=0.8,
    eval_metric='logloss', random_state=42
)
model.fit(X_tr, y_tr, eval_set=[(X_te, y_te)], verbose=0)
print(f"XGBoost: {model.score(X_te, y_te):.2%}")

# Feature importance
xgb.plot_importance(model, max_num_features=10)

Why This Matters

XGBoost and LightGBM are staples of Kaggle competitions and real-world tabular ML. On structured data they often match or outperform neural networks while training faster, needing less data, and being easier to interpret. If your data is in a table (CSV/SQL), gradient boosting is usually the best first model to try.

Exercises

Exercise 5.1: What does learning_rate do in boosting?

Learning rate (η) shrinks each tree's contribution: prediction += η × tree_output. Small η (0.01-0.1) means each tree contributes less, requiring more trees but reducing overfitting. Large η learns faster but may overfit. Rule: use small learning rate with many estimators.
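
The update rule prediction += η × tree_output can be sketched from scratch for squared-error regression. This is a simplified illustration, not how XGBoost is implemented: the function name boost, the depth-2 trees, and the synthetic data are assumptions, and it tracks training error only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X_gb, y_gb = make_regression(n_samples=300, n_features=5, noise=10, random_state=0)

def boost(X, y, eta, n_rounds=50):
    """Gradient boosting for squared error: each tree fits the current residuals."""
    pred = np.full(len(y), y.mean())           # start from the mean prediction
    mse_history = [np.mean((y - pred) ** 2)]
    for _ in range(n_rounds):
        residuals = y - pred                   # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
        pred += eta * tree.predict(X)          # shrink the tree's contribution by eta
        mse_history.append(np.mean((y - pred) ** 2))
    return mse_history

for eta in [0.05, 0.5]:
    history = boost(X_gb, y_gb, eta)
    print(f"eta={eta}: train MSE {history[0]:.0f} -> {history[-1]:.0f}")
```

A larger η drives the training error down in fewer rounds; whether that helps or hurts generalization has to be judged on a validation set, which this sketch leaves out.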

Exercise 5.2: How does XGBoost handle missing values?

XGBoost learns the optimal direction (left or right) to send missing values at each split. During training, it tries both directions and picks the one that minimizes loss. This means you don't need to impute missing values before XGBoost — a major practical advantage.

Exercise 5.3: Compare RF, GBM, and XGBoost on a dataset of your choice

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
models = {
    "RF": RandomForestClassifier(n_estimators=100),
    "GBM": GradientBoostingClassifier(n_estimators=100),
    "XGB": xgb.XGBClassifier(n_estimators=100, eval_metric='logloss')
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    print(f"{name}: {m.score(X_te, y_te):.2%}")

Chapter Summary

Bagging (RF) reduces variance; boosting reduces bias
Gradient boosting builds trees sequentially to correct previous errors
XGBoost/LightGBM are the gold standard for tabular data
Key parameters: n_estimators, learning_rate, max_depth, subsample

Chapter 6

Naive Bayes Classifiers

Learning Objectives

Apply Bayes' theorem for classification
Understand the "naive" independence assumption
Use Gaussian, Multinomial, and Bernoulli variants
Build a text spam classifier

Python
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Text classification — Spam detection
emails = [
    "Win a free iPhone now click here",
    "Meeting scheduled for tomorrow 3pm",
    "Congratulations you won $1000 prize",
    "Please review the attached report",
    "Free trial offer limited time only",
    "Lunch meeting moved to 2pm",
    "Claim your reward now act fast",
    "Updated project timeline attached",
]
labels = [1,0,1,0,1,0,1,0]  # 1=spam, 0=ham

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

clf = MultinomialNB()
clf.fit(X, labels)

test = ["Free prize winner click now", "Project deadline is Friday"]
X_test = vectorizer.transform(test)
print(clf.predict(X_test))  # [1 0]: spam, ham

Variant	Feature Type	Use Case
GaussianNB	Continuous (real-valued)	Iris, sensor data
MultinomialNB	Counts / frequencies	Text classification, word counts
BernoulliNB	Binary (0/1)	Binary text features, presence/absence

Exercises

Exercise 6.1: Why does Naive Bayes work well despite the independence assumption being wrong?

Even though word correlations exist ("free" and "win" co-occur in spam), the classifier only needs correct ranking of class probabilities, not exact values. The independence assumption simplifies computation dramatically while often producing competitive rankings. It excels with limited data and high dimensions.

Exercise 6.2: When would Naive Bayes outperform logistic regression?

(1) Very small training sets: NB needs fewer examples to estimate its parameters. (2) Very high-dimensional sparse data (text with thousands of words). (3) Real-time requirements: NB trains in a single pass over the data, and prediction is just a sum of per-feature log-probabilities. (4) When features genuinely are somewhat independent.

Exercise 6.3: What is Laplace smoothing and why is it necessary?

If a word never appears in spam during training, P(word|spam) = 0, making the entire probability zero regardless of other evidence. Laplace smoothing adds α (usually 1) to every count: P(word|spam) = (count + α) / (total + α×V). This ensures no probability is ever zero.
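
The formula is easy to verify by hand. The counts below are hypothetical (10 word tokens seen in spam, a vocabulary of 5 words), and smoothed_prob is an illustrative helper; in scikit-learn the same idea is exposed through the alpha parameter of MultinomialNB.

```python
def smoothed_prob(count, total, vocab_size, alpha=1.0):
    """P(word | class) with additive (Laplace) smoothing."""
    return (count + alpha) / (total + alpha * vocab_size)

# Hypothetical counts: 10 word tokens seen in spam, vocabulary of 5 words
unseen = smoothed_prob(count=0, total=10, vocab_size=5)   # word never seen in spam
seen = smoothed_prob(count=4, total=10, vocab_size=5)     # word seen 4 times in spam
print(f"unseen word: {unseen:.4f}")
print(f"seen word:   {seen:.4f}")
```

The unseen word gets probability 1/15 instead of 0, so a single unfamiliar word can no longer wipe out the evidence from every other word in the message.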

Chapter Summary

Naive Bayes applies Bayes' theorem with the independence assumption
MultinomialNB is the standard for text classification (spam, sentiment)
Extremely fast training and prediction — great for baselines
Laplace smoothing prevents zero probabilities from unseen features

Part II

Unsupervised Learning

Discovering patterns without labels

Chapter 7

Clustering: k-Means & DBSCAN

Learning Objectives

Understand k-Means algorithm: initialization, iteration, convergence
Master DBSCAN for density-based clustering
Choose the right k using elbow method and silhouette score
Know when to use which clustering algorithm

k-Means Clustering

k-Means partitions data into k clusters by iteratively: (1) assigning each point to the nearest centroid, (2) moving centroids to the mean of assigned points. Repeats until convergence.

Python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

# Elbow Method — find optimal k
inertias = []
for k in range(2, 10):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}: inertia={km.inertia_:.0f}, silhouette={sil:.3f}")

# Best model
km = KMeans(n_clusters=4, random_state=42, n_init=10)
km.fit(X)
print(f"\nCluster sizes: {np.bincount(km.labels_)}")

DBSCAN — Density-Based Clustering

Unlike k-Means, DBSCAN doesn't require specifying k. It finds clusters of arbitrary shape by identifying dense regions separated by sparse areas. Points in sparse regions are labeled as noise.

Python
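from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)

# Silhouette favors convex clusters, so it can rank k-Means above DBSCAN here.
# Compare against the true labels with the adjusted Rand index (ARI) instead:
# 1.0 means perfect agreement, values near 0 mean random labeling.

# k-Means struggles on non-spherical data
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_moons)
print(f"k-Means ARI: {adjusted_rand_score(y_moons, km.labels_):.3f}")

# DBSCAN follows the density of each moon
db = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)
print(f"DBSCAN ARI:  {adjusted_rand_score(y_moons, db.labels_):.3f}")
print(f"Noise points: {(db.labels_ == -1).sum()}")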

Feature	k-Means	DBSCAN
Must specify k?	Yes	No (auto-discovers)
Cluster shape	Spherical only	Arbitrary shapes
Handles noise?	No (assigns all)	Yes (labels outliers)
Scalability	O(n·k·d) per iteration (fast)	O(n log n) with a spatial index, O(n²) worst case

Project: Customer Segmentation

Python
# Simulate customer data
np.random.seed(42)
customers = np.column_stack([
    np.concatenate([np.random.normal(30,10,100), np.random.normal(60,8,100), np.random.normal(45,12,100)]),
    np.concatenate([np.random.normal(2000,500,100), np.random.normal(8000,1500,100), np.random.normal(4500,1000,100)])
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)

km = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = km.fit_predict(X_scaled)

for i in range(3):
    mask = labels == i
    avg_age = customers[mask, 0].mean()
    avg_spend = customers[mask, 1].mean()
    print(f"Segment {i}: {mask.sum()} customers, "
          f"avg age={avg_age:.0f}, avg spend=${avg_spend:,.0f}")

Exercises

Exercise 7.1: What is the silhouette score and how do you interpret it?

Silhouette score measures how similar a point is to its own cluster vs the nearest other cluster. Range: [-1, 1]. Score near 1 = well-clustered. Near 0 = on boundary. Negative = probably in wrong cluster. Average silhouette across all points tells you overall clustering quality.
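
For a single point i the score is s(i) = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes this by hand and compares it with scikit-learn's silhouette_samples. It assumes every cluster has at least two points; the helper name silhouette_point is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X_sil, _ = make_blobs(n_samples=150, centers=3, random_state=0)
labels_sil = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_sil)

def silhouette_point(i, X, labels):
    """s(i) = (b - a) / max(a, b) for a single point i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False                                # exclude the point itself
    a = dists[same].mean()                         # mean distance within own cluster
    b = min(dists[labels == c].mean()              # mean distance to nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_point(i, X_sil, labels_sil) for i in range(len(X_sil))])
reference = silhouette_samples(X_sil, labels_sil)
print(f"Mean silhouette (manual):  {manual.mean():.3f}")
print(f"Mean silhouette (sklearn): {reference.mean():.3f}")
```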

Exercise 7.2: How do you choose eps and min_samples in DBSCAN?

Use k-distance plot: for each point, compute distance to its k-th nearest neighbor (k = min_samples), sort descending, and look for the "elbow." That distance ≈ eps. Rule of thumb: min_samples = 2 × dimensions. Start with min_samples=5, then tune eps.
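
The k-distance curve can be computed with scikit-learn's NearestNeighbors. This sketch assumes the two-moons data and min_samples=5; plotting is left out so the block stays dependency-light.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X_kd, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
min_samples = 5

# Calling kneighbors() with no argument excludes each point from its own neighbor list
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_kd)
distances, _ = nn.kneighbors()
k_dist = np.sort(distances[:, -1])[::-1]   # distance to the k-th neighbor, descending

print(f"Largest k-distance: {k_dist[0]:.3f}")
print(f"Median k-distance:  {np.median(k_dist):.3f}")
# Plot k_dist (e.g. with matplotlib) and read eps off the elbow of the curve
```

Points to the left of the elbow have unusually distant neighbors and are likely noise; an eps just above the flat part of the curve keeps the dense points connected.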

Exercise 7.3: Why does k-Means fail on non-spherical clusters?

k-Means assigns points to the nearest centroid using Euclidean distance, which partitions the space into Voronoi cells that are always convex. Moon-shaped or elongated clusters get split incorrectly because a single centroid cannot represent the cluster's true shape. DBSCAN uses density connectivity instead.

Exercise 7.4: Implement k-Means from scratch

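import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # Distance from every point to every centroid: shape (n, k)
        dists = np.linalg.norm(X[:, None] - centroids, axis=2)
        labels = np.argmin(dists, axis=1)
        # Recompute centroids; keep the old one if a cluster is empty
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids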

Chapter Summary

k-Means is fast and simple but limited to spherical clusters and requires specifying k
DBSCAN discovers clusters of arbitrary shape and identifies noise points
Elbow method and silhouette score help choose optimal k for k-Means
Always scale data before clustering

Chapter 8

Dimensionality Reduction: PCA, t-SNE, UMAP

Learning Objectives

Understand PCA mathematically — eigendecomposition of covariance matrix
Visualize high-dimensional data with t-SNE and UMAP
Know when to use each technique

PCA — Principal Component Analysis

PCA finds the directions of maximum variance in your data and projects onto those axes. It reduces dimensions while retaining the most information possible.
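
The eigendecomposition view can be sketched in a few lines of NumPy and checked against scikit-learn. The example assumes the Iris data (4 features); variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data
X_centered = X_iris - X_iris.mean(axis=0)          # PCA works on centered data

cov = np.cov(X_centered, rowvar=False)             # 4x4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)             # eigh: symmetric input, ascending eigenvalues
order = np.argsort(eigvals)[::-1]                  # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

manual_ratio = eigvals / eigvals.sum()             # explained variance ratio
X_proj = X_centered @ eigvecs[:, :2]               # project onto the top 2 components

sk_ratio = PCA(n_components=4).fit(X_iris).explained_variance_ratio_
print("Manual: ", manual_ratio.round(4))
print("sklearn:", sk_ratio.round(4))
```

The eigenvectors are the principal directions and the eigenvalues are the variances along them. The projected coordinates match scikit-learn's up to the sign of each component, since an eigenvector's direction is only defined up to a flip.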

Python
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

digits = load_digits()  # 64 features (8×8 pixels)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(digits.data)
print(f"Original: {digits.data.shape}")       # (1797, 64)
print(f"Reduced:  {X_2d.shape}")                # (1797, 2)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")

# How many components to keep 95% variance?
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(digits.data)
print(f"Components for 95%: {pca_95.n_components_}")  # ~28 of 64

t-SNE & UMAP for Visualization

Python
from sklearn.manifold import TSNE

# t-SNE — best for 2D visualization of clusters
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(digits.data)

# UMAP — faster, preserves global structure better
# pip install umap-learn
# import umap
# reducer = umap.UMAP(n_components=2)
# X_umap = reducer.fit_transform(digits.data)

Method	Preserves	Speed	Use For
PCA	Global variance	Very fast	Preprocessing, noise reduction
t-SNE	Local structure	Slow (O(n²) exact, O(n log n) with Barnes-Hut)	Visualization only
UMAP	Local + global	Fast	Visualization + preprocessing

Project: Visualize Handwritten Digits in 2D

Python
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt

# load_digits is scikit-learn's small 8×8 digits dataset (1,797 images),
# an MNIST-like stand-in rather than MNIST itself
digits = load_digits()
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_vis = tsne.fit_transform(digits.data)

plt.figure(figsize=(10,8))
scatter = plt.scatter(X_vis[:,0], X_vis[:,1], c=digits.target,
                      cmap='tab10', alpha=0.7, s=10)
plt.colorbar(scatter, label='Digit')
plt.title('Handwritten Digits (8×8): t-SNE Visualization')
plt.savefig('digits_tsne.png', dpi=150)
# Most digits form distinct clusters; similar shapes can overlap

Exercises

Exercise 8.1: Why can't you use t-SNE for preprocessing (only visualization)?

t-SNE is non-parametric — it can't transform new data points. The mapping is specific to the training set. PCA learns a linear transformation that can be applied to new data. Also, t-SNE doesn't preserve distances (only neighbor relationships), making it unsuitable for downstream ML.

Exercise 8.2: How do you choose the number of PCA components?

Plot cumulative explained variance ratio. Choose the "elbow" or the point where you retain 95% of variance. Alternatively, use PCA(n_components=0.95) to auto-select. For visualization: 2-3 components. For preprocessing: typically keep 95-99% variance.

Exercise 8.3: What does the perplexity parameter control in t-SNE?

Perplexity ≈ "number of effective neighbors." Low perplexity (5-10) focuses on very local structure. High perplexity (30-50) considers more neighbors, revealing global structure. Typical range: 5-50. Always try multiple values, as results are sensitive to this parameter.

Chapter Summary

PCA projects onto directions of maximum variance: fast, linear, and approximately invertible (exact only if all components are kept)
t-SNE excels at 2D visualization of cluster structure
UMAP is faster than t-SNE and preserves global structure better
Use PCA for preprocessing, t-SNE/UMAP for visualization

Chapter 9

Gaussian Mixture Models & Autoencoders

Learning Objectives

Understand GMMs as soft clustering via probability distributions
Learn the Expectation-Maximization (EM) algorithm
Grasp autoencoder concepts for dimensionality reduction

Gaussian Mixture Models

GMM assumes data comes from a mixture of k Gaussian distributions. Unlike k-Means (hard assignment), GMM gives probabilistic membership — each point has a probability of belonging to each cluster.

Python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.5, random_state=42)

gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)
labels = gmm.predict(X)
probs = gmm.predict_proba(X)  # Soft assignment!

print(f"Point 0 probabilities: {probs[0].round(3)}")
# e.g., [0.001, 0.002, 0.997] — 99.7% cluster 2

# Model selection with BIC
for k in range(2, 7):
    g = GaussianMixture(n_components=k, random_state=42).fit(X)
    print(f"k={k}: BIC={g.bic(X):.0f}")  # Lower is better

Autoencoders (Concept)

An autoencoder is a neural network that learns to compress data into a low-dimensional "bottleneck" representation and reconstruct it. The bottleneck captures the most important features — similar to PCA but non-linear.

Input (784) → Encoder → Bottleneck (32) → Decoder → Output (784)

Autoencoders are used for: dimensionality reduction, denoising, anomaly detection (high reconstruction error = anomaly), and generative models (VAEs).

Exercises

Exercise 9.1: How is GMM different from k-Means?

k-Means: Hard assignment (each point belongs to exactly one cluster), assumes spherical clusters of equal size. GMM: Soft assignment (probability of belonging), models elliptical clusters of different sizes and orientations, uses EM algorithm. GMM is more flexible but slower.

Exercise 9.2: What is the EM algorithm?

E-step (Expectation): Fix the parameters and compute the probability (responsibility) of each Gaussian for each point. M-step (Maximization): Fix the responsibilities and update each Gaussian's parameters (mean, covariance, weight). Repeat until convergence. Each iteration is guaranteed not to decrease the likelihood, but EM may converge to a local optimum, so results depend on initialization.
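
The two steps can be sketched for the simplest case, a two-component mixture in one dimension. The data, the initial guesses, and the fixed 50 iterations are assumptions for illustration; scikit-learn's GaussianMixture handles the general multivariate case with convergence checks.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1D data drawn from two Gaussians (assumed toy setup)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Initial guesses for the two components
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])
weight = np.array([0.5, 0.5])
log_likelihoods = []

for _ in range(50):
    # E-step: responsibility of each component for each point, shape (n, 2)
    dens = weight * gaussian_pdf(x[:, None], mu, var)
    log_likelihoods.append(np.log(dens.sum(axis=1)).sum())
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from responsibility-weighted data
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    weight = nk / len(x)

print("means:  ", mu.round(2))
print("weights:", weight.round(2))
```

The recorded log-likelihood never goes down from one iteration to the next, which is the guarantee that makes EM a safe, if sometimes slow, optimizer.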

Exercise 9.3: When would you use an autoencoder over PCA?

Use autoencoder when: (1) Data has non-linear structure that PCA can't capture. (2) You need denoising (denoising autoencoder). (3) You want generative capabilities (VAE). Use PCA when: data is roughly linear, you need fast training, or you need interpretable components.

Chapter Summary

GMMs provide soft (probabilistic) cluster assignments via mixture of Gaussians
EM algorithm alternates between assigning probabilities and updating parameters
BIC/AIC help select the optimal number of components
Autoencoders learn non-linear dimensionality reduction via neural networks

Chapter 10

Anomaly Detection

Learning Objectives

Detect outliers with Isolation Forest and One-Class SVM
Use statistical methods: z-score, IQR
Build a fraud detection system

Python
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
import numpy as np

# Normal data + injected anomalies
np.random.seed(42)
X_normal = np.random.randn(500, 2)
X_anomaly = np.random.uniform(-6, 6, (20, 2))
X = np.vstack([X_normal, X_anomaly])

# Isolation Forest — isolates anomalies via random splits
iso = IsolationForest(contamination=0.05, random_state=42)
preds = iso.fit_predict(X)  # 1=normal, -1=anomaly
print(f"Isolation Forest: {(preds==-1).sum()} anomalies detected")

# One-Class SVM — learns boundary of normal data
ocsvm = OneClassSVM(kernel='rbf', nu=0.05)
preds_svm = ocsvm.fit_predict(X)
print(f"One-Class SVM: {(preds_svm==-1).sum()} anomalies detected")

# Statistical: Z-score
z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
stat_anomalies = (z_scores > 3).any(axis=1)
print(f"Z-score (>3σ): {stat_anomalies.sum()} anomalies detected")
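The learning objectives also mention the IQR rule, which the code above does not cover. A minimal sketch, assuming a single numeric feature and the conventional 1.5 × IQR fences (the data values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
# 500 typical values plus three injected extremes at the end
values = np.concatenate([rng.normal(50, 10, 500), [150.0, -40.0, 200.0]])

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

iqr_outliers = (values < lower) | (values > upper)
print(f"Fences: [{lower:.1f}, {upper:.1f}]")
print(f"IQR rule: {iqr_outliers.sum()} anomalies detected")
```

Unlike the z-score, the IQR rule relies on quartiles rather than the mean and standard deviation, so a few extreme values do not distort the threshold itself.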

Project: Credit Card Fraud Detection

Python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report

# Simulate transaction data (normally distributed amounts)
np.random.seed(42)
normal_txns = np.column_stack([
    np.random.normal(50, 20, 9900),    # amount
    np.random.normal(5, 2, 9900),      # frequency
    np.random.normal(100, 30, 9900),   # distance_from_home
])
fraud_txns = np.column_stack([
    np.random.normal(500, 200, 100),   # large amounts
    np.random.normal(20, 5, 100),      # high frequency
    np.random.normal(500, 150, 100),   # far from home
])
X = np.vstack([normal_txns, fraud_txns])
y_true = np.array([1]*9900 + [-1]*100)

iso = IsolationForest(contamination=0.02, random_state=42)
y_pred = iso.fit_predict(X)

print(classification_report(y_true, y_pred, target_names=['Fraud', 'Normal']))

Exercises

Exercise 10.1: How does Isolation Forest work?

It builds random trees that split features randomly. Anomalies are isolated in fewer splits (shorter path length) because they are rare and different. Normal points are in dense regions, requiring more splits. Average path length across trees gives the anomaly score.
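The effect of path length shows up in the model's scores. The sketch below assumes synthetic 2-D Gaussian training data and two hand-picked probe points; score_samples returns lower values for points that are easier to isolate.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X_train = rng.normal(0, 1, (500, 2))  # dense cloud around the origin

iso = IsolationForest(n_estimators=100, random_state=42).fit(X_train)

# One point in the dense region, one far away from it
probe = np.array([[0.0, 0.0], [6.0, 6.0]])
scores = iso.score_samples(probe)  # lower = shorter average path = more anomalous

print(f"Score at (0, 0): {scores[0]:.3f}")
print(f"Score at (6, 6): {scores[1]:.3f}")
```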

Exercise 10.2: What is the contamination parameter?

It's the expected proportion of anomalies in the dataset. contamination=0.05 means 5% of data is expected to be anomalous. It sets the decision threshold. If unknown, use "auto" or evaluate with precision-recall curves.
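To see how contamination acts as a threshold, the sketch below refits the same model with three values and reports the fraction of training points flagged. The data is synthetic and the specific values are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (1000, 2))

flagged = {}
for c in [0.02, 0.05, 0.10]:
    preds = IsolationForest(contamination=c, random_state=42).fit_predict(X)
    flagged[c] = (preds == -1).mean()  # fraction of training points labeled anomalous
    print(f"contamination={c:.2f} -> flagged fraction={flagged[c]:.3f}")
```

The anomaly scores themselves do not change between runs; only the cutoff applied to them moves, which is why a wrong contamination value produces too many or too few alerts.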

Exercise 10.3: Why is accuracy a bad metric for anomaly detection?

With 99% normal data, a model that predicts everything as "normal" gets 99% accuracy but catches zero fraud. Use precision (how many flagged are truly fraud), recall (how many actual frauds are caught), and F1-score. AUC-PR (area under the precision-recall curve) is usually more informative than AUC-ROC for heavily imbalanced anomaly detection, because it ignores the large pool of true negatives.
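The accuracy trap is easy to reproduce. This sketch assumes 1,000 made-up labels with 1% fraud and a "model" that always predicts normal:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 990 normal transactions (0) and 10 frauds (1)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros_like(y_true)  # always predict "normal"

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
prec = precision_score(y_true, y_pred, zero_division=0)  # no positives predicted

print(f"Accuracy:  {acc:.2%}")
print(f"Recall:    {rec:.2%}")
print(f"Precision: {prec:.2%}")
```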

Chapter Summary

Isolation Forest isolates anomalies using random splits — fast and effective
One-Class SVM learns the boundary of normal data in feature space
Statistical methods (z-score, IQR) work well for univariate anomalies
Use precision/recall instead of accuracy for anomaly detection

Part III

Model Evaluation

Measuring what matters: the difference between a good and a great model

Chapter 11

Train/Validation/Test Splits & Cross-Validation

Learning Objectives

Understand why and how to split data properly
Master k-fold, stratified, and time-series cross-validation
Avoid data leakage — the #1 ML pitfall

Data Splitting Strategies

Strategy	Use Case	Split Ratio
Train/Test	Quick evaluation	80/20
Train/Val/Test	Hyperparameter tuning	60/20/20
k-Fold CV	Robust estimate	k rotations
Stratified k-Fold	Imbalanced classes	k rotations (preserves class ratio)
Time Series Split	Temporal data	Expanding window

Python
from sklearn.model_selection import (cross_val_score, StratifiedKFold,
                                     TimeSeriesSplit)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# Stratified k-Fold (preserves class distribution)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(rf, X, y, cv=skf, scoring='accuracy')
print(f"5-Fold CV: {scores.mean():.4f} ± {scores.std():.4f}")
print(f"Per fold: {scores.round(4)}")

# Time Series Split — never look into the future!
tscv = TimeSeriesSplit(n_splits=5)
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {i+1}: Train={len(train_idx)}, Test={len(test_idx)}")

Data Leakage — The Silent Killer

Data leakage occurs when information from the test set accidentally influences training. Common sources: (1) Scaling before splitting (the scaler then uses test-set statistics), (2) Feature engineering using the full dataset, (3) Time-based features computed from future data. Always: split first, then preprocess.
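"Split first, then preprocess" looks like this in practice. A minimal sketch using the breast cancer dataset from this chapter and StandardScaler; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1) Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 2) Learn scaling statistics from the training rows only
scaler = StandardScaler().fit(X_train)

# 3) Apply the same training statistics to both sets
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(f"Train means ~ 0: {np.allclose(X_train_s.mean(axis=0), 0, atol=1e-8)}")
print(f"Test means  ~ 0: {np.allclose(X_test_s.mean(axis=0), 0, atol=1e-8)}")
```

The test set's means are not exactly zero after scaling, and that is expected: it was transformed with statistics it did not contribute to, just as unseen production data would be.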

Exercises

Exercise 11.1: Why is a single train/test split unreliable?

One split gives one score that depends on which specific examples ended up in test vs train. It has high variance. Cross-validation gives k scores, providing a mean and a spread (standard deviation) that indicates how stable the estimate is. One lucky/unlucky split can be very misleading.
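The variance is visible if the same model is evaluated on several different random splits. A sketch, assuming a decision tree on the breast cancer data and ten arbitrary seeds:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

scores = []
for seed in range(10):
    # Only the split changes between runs; the model settings stay fixed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

scores = np.array(scores)
print(f"Min: {scores.min():.4f}  Max: {scores.max():.4f}")
print(f"Mean ± std: {scores.mean():.4f} ± {scores.std():.4f}")
```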

Exercise 11.2: When must you use TimeSeriesSplit instead of random split?

Whenever data has temporal ordering (stock prices, weather, user behavior over time). Random splitting would let the model train on future data and predict the past — unrealistically inflating performance. TimeSeriesSplit always trains on past and tests on future, mimicking real deployment.
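The "past trains, future tests" property can be checked directly on the indices TimeSeriesSplit produces. A small sketch with 20 placeholder observations assumed to be in time order:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # row index stands in for time
tscv = TimeSeriesSplit(n_splits=4)

folds = list(tscv.split(X))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train ends at {train_idx.max()}, "
          f"test covers {test_idx.min()}..{test_idx.max()}")

# Every training index precedes every test index in every fold
no_future_leak = all(tr.max() < te.min() for tr, te in folds)
print(f"No future data in training: {no_future_leak}")
```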

Exercise 11.3: What is the difference between validation and test sets?

Validation: Used during development to tune hyperparameters and select models. You see its score repeatedly. Test: Used ONCE at the very end to estimate real-world performance. If you tune on the test set, you're overfitting to it — your reported performance will be optimistic.
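A 60/20/20 split can be produced with two calls to train_test_split. This is one common way to do it, sketched on the breast cancer data; the second call uses 0.25 because 25% of the remaining 80% equals 20% of the whole.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# First carve off the test set (20%), to be touched only once at the end
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Then split the remainder into train (60% overall) and validation (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval)

n = len(X)
print(f"Train: {len(X_train)} ({len(X_train)/n:.0%})")
print(f"Val:   {len(X_val)} ({len(X_val)/n:.0%})")
print(f"Test:  {len(X_test)} ({len(X_test)/n:.0%})")
```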

Exercise 11.4: How does stratification help with imbalanced datasets?

Without stratification, the number of positives in each fold varies by chance. On a small or highly imbalanced dataset (say 100 samples with 5% positive), a fold can end up with zero positives, which makes evaluation meaningless. Stratified k-fold keeps the class ratio approximately the same in every fold, giving reliable per-class metrics and meaningful cross-validation.
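Counting positives per test fold makes the difference concrete. The sketch assumes 1,000 synthetic labels with 5% positives and compares plain KFold with StratifiedKFold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([1] * 50 + [0] * 950)  # 5% positive
X = np.zeros((1000, 1))             # features are irrelevant for this check

plain = KFold(n_splits=5, shuffle=True, random_state=0)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

plain_counts = [int(y[test].sum()) for _, test in plain.split(X)]
strat_counts = [int(y[test].sum()) for _, test in strat.split(X, y)]

print(f"Positives per fold, KFold:           {plain_counts}")
print(f"Positives per fold, StratifiedKFold: {strat_counts}")
```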

Chapter Summary

Cross-validation gives robust performance estimates; single splits are unreliable
Stratified k-fold preserves class distribution — essential for imbalanced data
TimeSeriesSplit respects temporal ordering — never leak future information
Data leakage silently inflates metrics; always split before preprocessing

Chapter 12

Metrics, Bias-Variance & Regularization

Learning Objectives

Master classification metrics: precision, recall, F1, AUC-ROC
Understand the bias-variance tradeoff deeply
Detect and fix overfitting with regularization

Classification Metrics

Metric	Formula	Optimize When
Precision	TP / (TP + FP)	False positives are costly (spam filter)
Recall	TP / (TP + FN)	False negatives are costly (cancer detection)
F1	2 × (P × R) / (P + R)	You need balance between P and R
AUC-ROC	Area under ROC curve	Overall ranking quality
Accuracy	(TP + TN) / Total	Balanced classes only
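The formulas in the table can be verified by hand against scikit-learn. A small sketch with ten made-up labels:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Precision={precision:.2f} Recall={recall:.2f} "
      f"F1={f1:.2f} Accuracy={accuracy:.2f}")
```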

Python
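from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
# Fixed seed and stratification keep the split reproducible and class-balanced
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target)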

clf = LogisticRegression(max_iter=5000)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_proba = clf.predict_proba(X_te)[:, 1]

print("Confusion Matrix:")
print(confusion_matrix(y_te, y_pred))
print(f"\nAUC-ROC: {roc_auc_score(y_te, y_proba):.4f}")
print(classification_report(y_te, y_pred))

Bias-Variance Tradeoff

Total Error = Bias² + Variance + Irreducible Noise

High bias (underfitting): Model is too simple — consistently wrong. Fix: use more complex model, add features. High variance (overfitting): Model memorizes training data — great on train, poor on test. Fix: more data, regularization, simpler model.
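Both failure modes can be seen by comparing train and test error as model complexity grows. The sketch below assumes noisy samples from a sine curve and unregularized polynomial fits of degree 1, 3, and 15; the data and degrees are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

def make_data(n):
    """Noisy samples from sin(2*pi*x) on [0, 1]."""
    X = rng.uniform(0, 1, (n, 1))
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, n)
    return X, y

X_train, y_train = make_data(30)
X_test, y_test = make_data(200)

results = {}
for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    results[degree] = (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )
    print(f"Degree {degree:2d}: train MSE={results[degree][0]:.4f}, "
          f"test MSE={results[degree][1]:.4f}")
```

High bias shows up as large error on both sets; high variance shows up as a small training error paired with a noticeably larger test error.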

Regularization

Python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Demonstrating overfitting and regularization
np.random.seed(42)
X = np.sort(np.random.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2*np.pi*X).ravel() + np.random.normal(0, 0.2, 30)

# Degree 1 underfits and degree 3 fits well; degree 15 would overfit,
# so it gets a stronger Ridge penalty (alpha=1.0) to rein it in
for degree in [1, 3, 15]:
    model = make_pipeline(
        PolynomialFeatures(degree),
        Ridge(alpha=0.01 if degree < 15 else 1.0)
    )
    model.fit(X, y)
    print(f"Degree {degree:2d}: Train R²={model.score(X,y):.4f}")

Project: Diagnose and Fix Overfitting

Python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

def plot_learning_curve(model, X, y):
    """Print train and validation scores for each training-set size."""
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10))
    train_mean = train_scores.mean(axis=1)
    val_mean = val_scores.mean(axis=1)

    print("Training size | Train score | Val score")
    for size, tr, val in zip(train_sizes, train_mean, val_mean):
        print(f"  {size:11d} | {tr:.4f}      | {val:.4f}")

    # Measure the gap at the largest training size, where the curves have settled
    gap = train_mean[-1] - val_mean[-1]
    print(f"\nGap: {gap:.4f}")
    print("Overfitting!" if gap > 0.1 else "Good fit")

# Compare overfitting (deep tree) vs regularized (limited depth)
data = load_breast_cancer()

print("=== Deep Tree (likely overfit) ===")
plot_learning_curve(DecisionTreeClassifier(max_depth=None), data.data, data.target)

print("\n=== Shallow Tree (regularized) ===")
plot_learning_curve(DecisionTreeClassifier(max_depth=3), data.data, data.target)

Exercises

Exercise 12.1: When would you optimize for recall over precision?

Optimize recall when missing a positive is very costly: cancer screening (missing cancer = death), fraud detection (missing fraud = financial loss), manufacturing defect detection. Accept more false positives to catch more true positives. Optimize precision when false alarms are costly: email spam (legitimate email in spam folder).

Exercise 12.2: A model has 98% train accuracy but 72% test accuracy. Diagnose it.

26% gap = severe overfitting (high variance). The model memorized training data. Fixes: (1) Add regularization (increase alpha in Ridge/Lasso), (2) Reduce model complexity (fewer features, shallower trees), (3) Get more training data, (4) Add dropout (neural nets), (5) Use cross-validation for model selection.

Exercise 12.3: Explain the ROC curve and why AUC is useful

The ROC curve plots True Positive Rate vs False Positive Rate at every probability threshold. AUC (Area Under Curve) summarizes this: AUC=1.0 is perfect, AUC=0.5 is random. AUC is threshold-independent — it tells you how well the model ranks positives above negatives regardless of the cutoff you choose.
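The ranking interpretation can be checked numerically: AUC equals the fraction of (positive, negative) pairs in which the positive receives the higher score, with ties counted as half. A sketch on synthetic scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
# Positives score higher on average, but the two groups overlap
scores = rng.normal(0, 1, 200) + 1.5 * y_true

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive with every negative
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"Pairwise ranking estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:             {roc_auc_score(y_true, scores):.4f}")
```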

Exercise 12.4: What's the difference between L1 (Lasso) and L2 (Ridge) regularization?

L1 (Lasso): Adds |w| penalty. Drives some weights exactly to zero → feature selection. Sparse solutions. L2 (Ridge): Adds w² penalty. Shrinks all weights but doesn't zero them. Better when all features are relevant. ElasticNet: combines both with a mixing parameter.
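The sparsity difference is easy to observe. The sketch assumes a synthetic regression problem in which only 3 of 20 features carry signal, with alpha=1.0 for both models; these values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, of which only 3 actually influence the target
X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=5.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

n_zero_lasso = int((lasso.coef_ == 0).sum())
n_zero_ridge = int((ridge.coef_ == 0).sum())

print(f"Lasso: {n_zero_lasso} of 20 coefficients are exactly zero")
print(f"Ridge: {n_zero_ridge} of 20 coefficients are exactly zero")
print(f"Largest Ridge coefficient on a zeroed Lasso feature: "
      f"{np.abs(ridge.coef_[lasso.coef_ == 0]).max():.3f}")
```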

Chapter Summary

Precision matters when false positives are costly; recall when false negatives are
AUC-ROC is a threshold-independent measure of ranking quality
Bias-variance tradeoff: simple models underfit, complex models overfit
Regularization (L1/L2) prevents overfitting by penalizing large weights
Learning curves diagnose underfitting vs overfitting

Part IV

ML Frameworks

Production-ready machine learning

Chapter 13

Scikit-learn Pipelines & Feature Engineering

Learning Objectives

Build end-to-end ML pipelines with Pipeline and ColumnTransformer
Automate feature scaling, encoding, and imputation
Create custom transformers for domain-specific features

Python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd

# Simulated mixed-type dataset
df = pd.DataFrame({
    'age': [25, 32, np.nan, 45, 28, 52, 38],
    'income': [30000, 55000, 42000, np.nan, 48000, 72000, 60000],
    'education': ['BS','MS','BS','PhD','BS','MS','PhD'],
    'target': [0,1,0,1,0,1,1]
})

numeric_features = ['age', 'income']
categorical_features = ['education']

# Column-specific preprocessing
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numeric_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical_features)
])

# Full pipeline: preprocess + model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

X = df.drop('target', axis=1)
y = df['target']
pipeline.fit(X, y)
print("Pipeline trained successfully!")
print(f"Prediction: {pipeline.predict(X.iloc[:2])}")

Why Pipelines Matter

Pipelines prevent data leakage (scaler fits only on train fold in CV), make code reproducible, enable easy serialization (pickle the whole pipeline), and simplify deployment — one object handles preprocessing + prediction.
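Two of these benefits, leak-free cross-validation and single-object serialization, fit in a few lines. The sketch assumes a scaler plus logistic regression on the breast cancer data and uses pickle in memory; the step names are illustrative.

```python
import pickle

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The scaler is refit inside each fold, on that fold's training portion only
cv_scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

# Preprocessing and model travel together through serialization
pipe.fit(X, y)
restored = pickle.loads(pickle.dumps(pipe))
same = np.array_equal(restored.predict(X), pipe.predict(X))
print(f"Restored pipeline gives identical predictions: {same}")
```

Only unpickle files from sources you trust, and load them with the same scikit-learn version that produced them where possible.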

Project: End-to-End Titanic Pipeline

Python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
X, y = df[features], df['Survived']

preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imp', SimpleImputer(strategy='median')),
        ('scl', StandardScaler())
    ]), ['Age','SibSp','Parch','Fare']),
    ('cat', Pipeline([
        ('imp', SimpleImputer(strategy='most_frequent')),
        ('enc', OneHotEncoder(handle_unknown='ignore'))
    ]), ['Pclass','Sex','Embarked'])
])

pipe = Pipeline([('pre', preprocessor), ('clf', GradientBoostingClassifier())])
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"Titanic CV Accuracy: {scores.mean():.2%} ± {scores.std():.2%}")

Exercises

Exercise 13.1: What is data leakage in the context of pipelines?

If you fit the scaler on the entire dataset before cross-validation, the test fold's statistics influence the scaling — information leaks from test to train. Pipelines prevent this by fitting the preprocessor only on the training fold in each CV iteration.

Exercise 13.2: How do you create a custom transformer?

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data
    def transform(self, X):
        return np.log1p(X)  # log(1 + x), defined at x = 0
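A custom transformer drops into a Pipeline like any built-in step. The usage sketch below repeats the LogTransformer class so it runs on its own, and assumes synthetic right-skewed data whose target is linear in log(1 + x):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class LogTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.log1p(X)

rng = np.random.default_rng(0)
X = rng.lognormal(mean=3.0, sigma=1.0, size=(300, 1))   # right-skewed feature
y = 3.0 * np.log1p(X).ravel() + rng.normal(0, 0.1, 300)

raw_model = LinearRegression().fit(X, y)
log_pipe = Pipeline([("log", LogTransformer()),
                     ("model", LinearRegression())]).fit(X, y)

r2_raw = raw_model.score(X, y)
r2_log = log_pipe.score(X, y)
print(f"R² on raw feature:        {r2_raw:.4f}")
print(f"R² with LogTransformer:   {r2_log:.4f}")
```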

Exercise 13.3: Why use OneHotEncoder instead of LabelEncoder for features?

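LabelEncoder assigns integer codes in sorted order (Blue=0, Green=1, Red=2), implying an ordering that doesn't exist for nominal categories. The model might think Red>Green>Blue. OneHotEncoder creates one binary column per category, so there is no false ordering. LabelEncoder is intended for the target variable; for features that are genuinely ordered, use OrdinalEncoder with an explicit category order.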
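The two encodings can be compared side by side on a toy color column (the values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(["Red", "Blue", "Green", "Blue"])

# LabelEncoder: one integer per category, assigned in sorted order
le = LabelEncoder()
codes = le.fit_transform(colors)
print(f"Classes: {list(le.classes_)}")
print(f"Codes:   {codes.tolist()}")

# OneHotEncoder: one binary column per category, exactly one 1 per row
ohe = OneHotEncoder()
onehot = ohe.fit_transform(colors.reshape(-1, 1)).toarray()
print(f"One-hot columns: {list(ohe.categories_[0])}")
print(onehot)
```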

Chapter Summary

Pipelines chain preprocessing + model into a single reproducible object
ColumnTransformer applies different transformations to different feature types
Pipelines prevent data leakage automatically during cross-validation
Custom transformers extend pipelines with domain-specific feature engineering

Chapter 14

Hyperparameter Tuning & MLflow

Learning Objectives

Use GridSearchCV and RandomizedSearchCV for systematic tuning
Understand Bayesian optimization concepts
Track experiments with MLflow for reproducibility

GridSearch vs RandomizedSearch

Python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from scipy.stats import randint, uniform

data = load_breast_cancer()
X, y = data.data, data.target

# GridSearchCV — exhaustive (try ALL combinations)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best score:  {grid.best_score_:.4f}")

# RandomizedSearchCV: samples N random combos (faster for large spaces)
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': randint(2, 20),
    'max_features': uniform(0.1, 0.9)  # uniform(loc, scale) samples from [0.1, 1.0]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42), param_dist,
    n_iter=50, cv=5, scoring='accuracy', n_jobs=-1,
    random_state=42)  # seed the sampling so the search is reproducible
random_search.fit(X, y)
print(f"\nRandom best: {random_search.best_params_}")
print(f"Random score: {random_search.best_score_:.4f}")

Method	Tries	Best For	Speed
GridSearchCV	All combinations	Small parameter spaces	Slow (exponential in the number of parameters)
RandomizedSearchCV	N random samples	Large/continuous spaces	Fast (linear in N)
Bayesian (Optuna)	Informed samples	Expensive evaluations	Most efficient

MLflow — Experiment Tracking

Python
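# pip install mlflow
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

mlflow.set_experiment("breast_cancer_classification")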

for n_est in [50, 100, 200]:
    for max_d in [3, 5, 7]:
        with mlflow.start_run(run_name=f"RF_{n_est}_{max_d}"):
            # Train
            rf = RandomForestClassifier(n_estimators=n_est, max_depth=max_d)
            scores = cross_val_score(rf, X, y, cv=5)

            # Log parameters and metrics
            mlflow.log_param("n_estimators", n_est)
            mlflow.log_param("max_depth", max_d)
            mlflow.log_metric("cv_accuracy_mean", scores.mean())
            mlflow.log_metric("cv_accuracy_std", scores.std())

            # Log model
            rf.fit(X, y)
            mlflow.sklearn.log_model(rf, "model")

print("All experiments logged! Run 'mlflow ui' to view.")

Why Experiment Tracking Matters

Without tracking, you lose track of which parameters gave which results after dozens of experiments. MLflow records the parameters, metrics, models, and artifacts you log, and mlflow.autolog() can capture many of them automatically. You can compare runs in a dashboard, reproduce any experiment, and load or serve the best model directly from its run ID.

Exercises

Exercise 14.1: When should you use RandomizedSearchCV over GridSearchCV?

Use Randomized when: (1) parameter space is large (3+ params with many values each), (2) parameters are continuous, (3) you have limited compute budget. Grid with 5 params × 10 values each = 100,000 combinations. Randomized search with 100 iterations evaluates 1000× fewer candidates, and with 60 or more independent samples there is roughly a 95% chance that at least one falls in the best 5% of the search space.
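The arithmetic behind this answer can be worked out directly. The sketch assumes each random draw is independent and that "good" means landing in the best 5% of configurations; the helper name prob_hit_top is made up for illustration.

```python
# Size of an exhaustive grid: 10 candidate values for each of 5 parameters
n_params, n_values = 5, 10
grid_size = n_values ** n_params
print(f"Grid combinations: {grid_size:,}")

def prob_hit_top(n_iter, top_fraction=0.05):
    """Chance that at least one of n_iter independent random draws
    lands in the best `top_fraction` of the search space."""
    return 1 - (1 - top_fraction) ** n_iter

for n_iter in [10, 60, 100]:
    print(f"n_iter={n_iter:3d}: P(at least one top-5% config) = "
          f"{prob_hit_top(n_iter):.3f}")

print(f"Evaluations saved vs grid at n_iter=100: {grid_size // 100}x")
```

This probability does not depend on the size of the grid, which is why random search scales well to large spaces. It does not guarantee the best configuration, only a good one.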

Exercise 14.2: What is Bayesian optimization and how does it improve on grid/random?

Bayesian optimization builds a probabilistic surrogate model of the objective function (classically a Gaussian Process; Optuna and HyperOpt commonly use a Tree-structured Parzen Estimator instead). It uses this model to choose the next parameters to evaluate, focusing on promising regions. It typically reaches a comparable score in far fewer evaluations than random search, which matters most when each evaluation is expensive.

Exercise 14.3: How do you use MLflow to deploy a model?

# Save model
mlflow.sklearn.log_model(best_model, "model")

# Load and serve
loaded = mlflow.sklearn.load_model("runs:/<run_id>/model")
prediction = loaded.predict(new_data)

# Or serve as REST API:
# mlflow models serve -m runs:/<run_id>/model --port 5000

Chapter Summary

GridSearchCV tries all combinations — thorough but slow for large spaces
RandomizedSearchCV samples randomly — often finds near-optimal in less time
Bayesian optimization is most efficient for expensive model evaluations
MLflow tracks experiments, logs metrics/models, and enables reproducibility

🎓 Congratulations!

You've completed Classical Machine Learning. You now understand supervised and unsupervised learning, model evaluation, and production ML frameworks — the foundation for building real-world AI systems.