Phase 3 • EduArtha
Classical Machine Learning
Before deep learning, you must understand the simpler models. They build your intuition for how learning works. This book covers supervised learning, unsupervised learning, model evaluation, and production ML frameworks.
⏱ 3–5 months | 14 Chapters | 55+ Exercises
Supervised Learning
Learning from labeled data
Linear & Logistic Regression
Learning Objectives
- Understand linear regression: hypothesis, cost function, gradient descent
- Master logistic regression for binary classification
- Implement both from scratch and with scikit-learn
- Interpret coefficients and understand assumptions
Linear Regression
Linear regression finds the best-fit line through data by minimizing the sum of squared errors. The model assumes a linear relationship: ŷ = w₁x₁ + w₂x₂ + ... + wₙxₙ + b.
Python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import make_regression
# Generate synthetic data
X, y = make_regression(n_samples=500, n_features=3, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Scikit-learn
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"Coefficients: {model.coef_.round(2)}")
print(f"Intercept: {model.intercept_:.2f}")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
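The squared-error objective also has a closed-form minimizer, which makes a handy sanity check for any fitted linear model. The sketch below is an illustration on assumed synthetic data (the names `true_w` and `true_b`, and the noise level, are made up for this example). It solves the least-squares problem with `np.linalg.lstsq` after appending a column of ones for the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth, chosen only for this illustration
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0

X = rng.normal(size=(200, 3))
y = X @ true_w + true_b + rng.normal(scale=0.1, size=200)

# Append a column of ones so the intercept b is learned as an extra weight
X_aug = np.column_stack([X, np.ones(len(X))])

# Least-squares solution: minimizes ||X_aug @ theta - y||^2
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = theta[:3], theta[3]

print("weights:", w_hat.round(2))
print("intercept:", round(float(b_hat), 2))
```

With little noise, the recovered weights and intercept should land close to the values used to generate the data.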
From Scratch: Gradient Descent
Python
class LinearRegressionGD:
    def __init__(self, lr=0.01, epochs=1000):
        self.lr = lr
        self.epochs = epochs
    def fit(self, X, y):
        m, n = X.shape
        self.w = np.zeros(n)
        self.b = 0
        self.losses = []
        for _ in range(self.epochs):
            y_pred = X @ self.w + self.b
            error = y_pred - y
            # Gradient step on (1/2) * MSE for weights and bias
            self.w -= self.lr * (1/m) * (X.T @ error)
            self.b -= self.lr * (1/m) * np.sum(error)
            self.losses.append(np.mean(error**2))
    def predict(self, X):
        return X @ self.w + self.b
Logistic Regression
Logistic regression applies the sigmoid function σ(z) = 1/(1 + e⁻ᶻ) to convert the linear output into a probability between 0 and 1. It uses binary cross-entropy loss instead of MSE.
Python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred, target_names=data.target_names))
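To mirror the from-scratch treatment given to linear regression, here is a minimal sketch of logistic regression trained by batch gradient descent on the binary cross-entropy loss. The helper name `fit_logistic_gd`, the learning rate, and the two-blob toy data are assumptions made for this example. scikit-learn's `LogisticRegression` uses a different solver and adds regularization by default.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on mean binary cross-entropy."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)          # predicted P(y=1 | x)
        grad_w = X.T @ (p - y) / m      # gradient of BCE w.r.t. w
        grad_b = np.mean(p - y)         # gradient of BCE w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data (an assumption for this sketch): two Gaussian blobs in 2D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

w, b = fit_logistic_gd(X, y)
accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print(f"Training accuracy: {accuracy:.2%}")
```

The gradient has the same form as in linear regression (prediction minus target, times the features); only the prediction function changes.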
Project: House Price Prediction
Python
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
housing.data, housing.target, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Compare Linear, Ridge (L2), Lasso (L1)
for name, model in [("Linear", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X_train_s, y_train)
    score = model.score(X_test_s, y_test)
    print(f"{name:8s} R²: {score:.4f}")
Exercises
Exercise 1.1: What is the difference between R² and adjusted R²?
R² measures the proportion of variance explained, but on the training data it never decreases when you add features (even irrelevant ones). Adjusted R² penalizes additional features: it only increases if the new feature improves the model more than expected by chance. Use adjusted R² when comparing models with different numbers of features.
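The penalty can be made concrete with the standard formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features. The numbers below are hypothetical and chosen only to show a case where raw R² rises slightly while adjusted R² falls.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical numbers: adding 7 features nudges R^2 from 0.800 to 0.805
small = adjusted_r2(0.800, n_samples=100, n_features=3)
large = adjusted_r2(0.805, n_samples=100, n_features=10)

print(f"3 features:  adjusted R^2 = {small:.4f}")
print(f"10 features: adjusted R^2 = {large:.4f}")
```

Here the larger model has the higher R² but the lower adjusted R², which suggests the extra features are not earning their keep.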
Exercise 1.2: Why must you scale features before logistic regression?
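Logistic regression is fit with an iterative optimizer (gradient descent in a from-scratch implementation). Features with large ranges (e.g., salary: 50000) dominate over small-range features (e.g., age: 30). Without scaling, gradients are uneven, which slows convergence and can leave the solver short of the optimum within its iteration budget; regularization also penalizes coefficients unevenly. StandardScaler (mean=0, std=1) puts all features on equal footing.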
Exercise 1.3: When would Ridge (L2) outperform Lasso (L1)?
Ridge: When all features are somewhat relevant; it shrinks coefficients but doesn't zero them out. Better for multicollinearity. Lasso: When you suspect many features are irrelevant; it drives coefficients to exactly zero, performing automatic feature selection. Use ElasticNet for a mix of both.
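The difference between shrinking and zeroing is easiest to see in a simplified setting. The sketch below assumes an orthonormal design matrix (XᵀX = I), where both penalties have closed-form solutions in terms of the unregularized coefficients. The coefficient values and `alpha` are hypothetical, and the objectives are scaled differently from scikit-learn's, so the numbers illustrate the mechanism only.

```python
import numpy as np

# Hypothetical least-squares coefficients for an orthonormal design
w_ols = np.array([3.0, -0.4, 0.2, -2.0])
alpha = 0.5

# Ridge, minimizing 0.5*||y - Xw||^2 + 0.5*alpha*||w||^2: uniform shrinkage
w_ridge = w_ols / (1 + alpha)

# Lasso, minimizing 0.5*||y - Xw||^2 + alpha*||w||_1: soft-thresholding
w_lasso = np.sign(w_ols) * np.maximum(np.abs(w_ols) - alpha, 0.0)

print("ridge:", w_ridge.round(3))
print("lasso:", w_lasso.round(3))
```

Ridge scales every coefficient by the same factor, so none reaches zero; Lasso subtracts a fixed amount, so the two small coefficients are removed entirely.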
Exercise 1.4: Implement the sigmoid function and binary cross-entropy loss
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
def bce_loss(y, y_hat):
    eps = 1e-15
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
Chapter Summary
- Linear regression minimizes MSE to find the best-fit hyperplane
- Logistic regression uses sigmoid + cross-entropy for classification
- Ridge (L2) and Lasso (L1) add regularization to prevent overfitting
- Always scale features before training gradient-based models
Decision Trees & Random Forests
Learning Objectives
- Understand how decision trees split data using impurity measures
- Master random forests: bagging + feature randomness
- Interpret feature importance and control overfitting
- Know when to use trees vs linear models
Decision Trees
A decision tree recursively splits data by asking yes/no questions on features, choosing splits that maximize information gain (reduce impurity). It's one of the most interpretable ML models.
| Impurity Measure | Formula | Used For |
|---|---|---|
| Gini Index | 1 - Σ pᵢ² | Classification (default in sklearn) |
| Entropy | -Σ pᵢ log₂(pᵢ) | Classification (information gain) |
| MSE | (1/n) Σ (yᵢ - ȳ)² | Regression |
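The classification measures in the table are straightforward to compute from class counts. The sketch below is a hand-rolled illustration (the helper names `gini` and `entropy` and the example node are assumptions, not scikit-learn functions). It evaluates a parent node and the weighted impurity after one candidate split.

```python
import numpy as np

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]                      # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

# Hypothetical node with 10 samples per class, and one candidate split
parent = [10, 10]
left, right = [9, 1], [1, 9]

n = sum(parent)
weighted_child_gini = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)

print(f"parent gini={gini(parent):.3f}, entropy={entropy(parent):.3f}")
print(f"gini after split={weighted_child_gini:.3f}")
```

A tree learner evaluates many candidate splits this way and keeps the one with the largest drop from parent impurity to weighted child impurity.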
Python
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_iris
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2, random_state=42)
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(f"Accuracy: {tree.score(X_test, y_test):.2%}")
print("\nTree Rules:")
print(export_text(tree, feature_names=iris.feature_names))
Random Forests
A random forest builds many decision trees on random subsets of data (bagging) and random subsets of features, then votes/averages their predictions. This dramatically reduces overfitting.
Python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)
print(f"RF Accuracy: {rf.score(X_test, y_test):.2%}")
# Feature importance: which features matter most?
importance = pd.Series(rf.feature_importances_, index=iris.feature_names)
print("\nFeature Importance:")
print(importance.sort_values(ascending=False))
Controlling Overfitting in Trees
Key hyperparameters for limiting tree complexity:
- max_depth: limits tree depth
- min_samples_split: minimum samples required to split a node
- min_samples_leaf: minimum samples required in a leaf
- max_features: number of features considered per split
Start with defaults, then tune with cross-validation.
Project: Customer Churn Prediction
Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np
# Simulate churn data
np.random.seed(42)
n = 1000
X = np.column_stack([
np.random.randint(1, 72, n), # tenure (months)
np.random.uniform(20, 120, n), # monthly charge
np.random.randint(0, 10, n), # support calls
np.random.choice([0,1], n) # has contract
])
# Churn more likely with short tenure, high charge, many calls
churn_prob = 1/(1 + np.exp(-(-2 + 0.05*(60-X[:,0]) + 0.02*X[:,1] + 0.3*X[:,2] - 1.5*X[:,3])))
y = (np.random.random(n) < churn_prob).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
rf = RandomForestClassifier(n_estimators=200, max_depth=6)
rf.fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te), target_names=['Stay','Churn']))
Exercises
Exercise 2.1: Why do single decision trees tend to overfit?
Without constraints, a tree can grow until each leaf has one sample, memorizing the training data. It creates complex rules that capture noise rather than signal. Solutions: limit depth, require min samples per leaf, prune, or use ensembles (random forest).
Exercise 2.2: What makes random forests better than a single tree?
RF reduces variance through (1) bagging: each tree trains on a bootstrap sample (random subset with replacement), (2) feature randomness: each split considers only √n features, ensuring trees are diverse. Averaging many diverse, slightly overfitted trees reduces overall error.
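A quick simulation shows what a bootstrap sample looks like in practice. Drawing n rows with replacement leaves roughly 1 - 1/e (about 63%) of the distinct rows in the sample, so each tree sees a noticeably different dataset. The sample size below is an arbitrary choice for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# One bootstrap sample: n row indices drawn with replacement
indices = rng.integers(0, n, size=n)
unique_fraction = np.unique(indices).size / n

print(f"Unique rows in the bootstrap sample: {unique_fraction:.1%}")
print(f"Theoretical limit 1 - 1/e:           {1 - np.exp(-1):.1%}")
```

The rows left out of a given tree's sample are its "out-of-bag" rows, which can serve as a built-in validation set.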
Exercise 2.3: How do you interpret feature importance in random forests?
Feature importance = average decrease in impurity (Gini/entropy) across all trees when that feature is used. Higher = more predictive. Caveat: correlated features share importance, and high-cardinality features may be overweighted. Use permutation importance for more reliable results.
Chapter Summary
- Decision trees split data by maximizing information gain at each node
- Random forests combine many diverse trees via bagging + feature randomness
- Feature importance reveals which inputs drive predictions
- Control overfitting with max_depth, min_samples_leaf, and n_estimators
Support Vector Machines (SVMs)
Learning Objectives
- Understand maximum margin classifiers and support vectors
- Master the kernel trick for non-linear classification
- Choose between linear, RBF, and polynomial kernels
- Apply SVMs for text classification
Maximum Margin Classifier
SVM finds the hyperplane that maximizes the margin: the distance between the decision boundary and the nearest data points of each class (the support vectors). A larger margin generally means better generalization.
Python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
X, y = make_classification(n_samples=500, n_features=2,
n_redundant=0, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Always scale for SVM!
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
# Compare kernels
for kernel in ['linear', 'rbf', 'poly']:
    svm = SVC(kernel=kernel, C=1.0)
    svm.fit(X_train_s, y_train)
    acc = svm.score(X_test_s, y_test)
    print(f"{kernel:8s} -> Accuracy: {acc:.2%}, Support vectors: {svm.n_support_}")
The Kernel Trick
When data isn't linearly separable, the kernel trick maps it to a higher-dimensional space where it is separable, without ever computing the transformation explicitly. The RBF (Gaussian) kernel corresponds to an infinite-dimensional feature space and is the default choice. Parameter C controls the trade-off between margin width and training errors; gamma controls how far the influence of a single training point reaches under the RBF kernel.
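The "without computing the transformation" claim can be checked by hand for a small case. The sketch below uses the homogeneous degree-2 polynomial kernel on 2D inputs, for which the explicit feature map is known. The vectors `a` and `b` are arbitrary examples, and `phi` and `poly_kernel` are hand-written helpers rather than library functions.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))   # dot product in the 3D feature space
implicit = poly_kernel(a, b)        # same value, computed in 2D

print(explicit, implicit)
```

Both numbers agree, yet the kernel never builds the 3D vectors. For the RBF kernel the explicit map would be infinite-dimensional, so the kernel form is the only practical route.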
Project: Handwritten Digit Classification
Python
from sklearn.datasets import load_digits
digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(digits.data, digits.target, test_size=0.2)
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)  # reuse the training statistics; do not refit on test data
svm = SVC(kernel='rbf', C=10, gamma='scale')
svm.fit(X_tr_s, y_tr)
print(f"Digit classification accuracy: {svm.score(X_te_s, y_te):.2%}")
Exercises
Exercise 3.1: What happens with high C vs low C?
High C: Hard margin β tries to classify every training point correctly, risking overfitting. Low C: Soft margin β allows some misclassifications for a wider margin, better generalization. Think of C as "how much do I care about individual errors vs overall margin width."
Exercise 3.2: Why is SVM not suitable for very large datasets?
SVM training cost grows between O(n²) and O(n³) in the number of samples. For 1M samples, it becomes extremely slow. SVMs also store support vectors, requiring significant memory. For large datasets, use a linear SVM (LinearSVC, or SGDClassifier with hinge loss) or switch to random forests/gradient boosting.
Exercise 3.3: When would you choose SVM over random forest?
SVM excels with: (1) high-dimensional data with few samples (text classification), (2) clear margin of separation, (3) when you need a theoretically grounded model (max margin). Random forest is better for: large datasets, mixed feature types, feature importance needs, and when interpretability matters.
Chapter Summary
- SVMs maximize the margin between classes using support vectors
- The kernel trick enables non-linear classification without explicit transformation
- RBF kernel is the default; C and gamma are the key hyperparameters
- Always scale features before SVM; impractical for very large datasets
k-Nearest Neighbors (kNN)
Learning Objectives
- Understand lazy learning and instance-based classification
- Choose k and distance metrics wisely
- Know kNN strengths and limitations
Python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_wine
wine = load_wine()
X_tr, X_te, y_tr, y_te = train_test_split(wine.data, wine.target, test_size=0.2)
scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)  # reuse the training statistics; do not refit on test data
# Find best k
for k in [1, 3, 5, 7, 11, 15]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr_s, y_tr)
    print(f"k={k:2d} -> Accuracy: {knn.score(X_te_s, y_te):.2%}")
# k=1 overfits (memorizes), large k underfits (too smooth)
| Pros | Cons |
|---|---|
| No training phase (lazy) | Slow prediction: O(n) per query |
| Simple, intuitive | Curse of dimensionality |
| Non-parametric (any shape) | Sensitive to irrelevant features |
| Works for classification & regression | Must store entire dataset |
Exercises
Exercise 4.1: What is the curse of dimensionality and how does it affect kNN?
In high dimensions, all points become roughly equidistant, so the concept of "nearest" loses meaning. For example, most of the volume of a 10-dimensional cube lies near its surface rather than its center. kNN degrades because distance metrics become unreliable. Solution: reduce dimensions with PCA first, or use tree-based models.
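The "roughly equidistant" effect is easy to observe numerically. The sketch below draws uniform random points in a unit hypercube (an assumed setup for illustration) and compares the distance from a query point to its nearest and farthest neighbor as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(dim, n_points=500):
    """Ratio of nearest to farthest distance from one query point."""
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 10, 100, 1000)}
for dim, ratio in ratios.items():
    print(f"dim={dim:4d}  nearest/farthest = {ratio:.3f}")
```

As the ratio approaches 1, the nearest neighbor is barely closer than the farthest one, which is why neighbor votes carry less information in high dimensions.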
Exercise 4.2: Why use odd values of k?
Odd k avoids ties in binary classification. With k=4, you might get 2 votes for each class. Odd k (3, 5, 7) guarantees a majority decision. For multi-class, ties are still possible; with uniform weights, scikit-learn resolves a tied vote by label order rather than by distance, so use weights='distance' if nearer neighbors should decide.
Exercise 4.3: Implement weighted kNN where closer neighbors have more influence
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
# Closer neighbors get weight = 1/distance
# This helps when the decision boundary is complex
Chapter Summary
- kNN classifies by majority vote of the k nearest neighbors, with no training phase
- Always scale features; kNN is distance-based
- Small k overfits (captures noise), large k underfits (over-smooths)
- Impractical for high-dimensional or large datasets
Ensemble Methods: Boosting & Bagging
Learning Objectives
- Understand bagging (Random Forest) vs boosting (AdaBoost, XGBoost)
- Master gradient boosting, the king of tabular ML
- Use XGBoost and LightGBM for competition-winning models
Bagging vs Boosting
| Aspect | Bagging (RF) | Boosting (XGBoost) |
|---|---|---|
| Strategy | Parallel: independent trees | Sequential: each fixes previous errors |
| Reduces | Variance (overfitting) | Bias (underfitting) |
| Risk | Rarely overfits | Can overfit if too many rounds |
| Speed | Parallelizable | Sequential (slower) |
Python
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, test_size=0.2)
# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X_tr, y_tr)
print(f"Gradient Boosting: {gb.score(X_te, y_te):.2%}")
# AdaBoost
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5)
ada.fit(X_tr, y_tr)
print(f"AdaBoost: {ada.score(X_te, y_te):.2%}")
XGBoost: Industry Standard
Python
# pip install xgboost
import xgboost as xgb
model = xgb.XGBClassifier(
n_estimators=200, max_depth=4, learning_rate=0.1,
subsample=0.8, colsample_bytree=0.8,
eval_metric='logloss', random_state=42
)
model.fit(X_tr, y_tr, eval_set=[(X_te, y_te)], verbose=0)
print(f"XGBoost: {model.score(X_te, y_te):.2%}")
# Feature importance
xgb.plot_importance(model, max_num_features=10)
Why This Matters
XGBoost and LightGBM dominate Kaggle competitions and real-world tabular ML. On structured data they often match or outperform neural networks while training faster, needing less data, and being easier to interpret. If your data is in a table (CSV/SQL), gradient boosting is usually the best first model to try.
Exercises
Exercise 5.1: What does learning_rate do in boosting?
Learning rate (η) shrinks each tree's contribution: prediction += η × tree_output. Small η (0.01-0.1) means each tree contributes less, requiring more trees but reducing overfitting. Large η learns faster but may overfit. Rule: use a small learning rate with many estimators.
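The shrinkage update can be traced in a small from-scratch booster. This is a simplified sketch for squared loss on 1D data, using one-split "stumps" as the weak learners; the helper names `fit_stump` and `boost` and the toy sine data are assumptions for this example, and real libraries add regularization, subsampling, and much faster split finding.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold split on 1D x under squared error."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda q: np.where(q <= t, left_value, right_value)

def boost(x, y, eta, n_rounds):
    """Gradient boosting for squared loss: each stump fits the residuals."""
    prediction = np.full_like(y, y.mean())
    losses = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - prediction)
        prediction = prediction + eta * stump(x)   # shrunken update
        losses.append(np.mean((y - prediction) ** 2))
    return losses

# Toy 1D regression problem (an assumption for this sketch)
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

results = {eta: boost(x, y, eta, n_rounds=20) for eta in (1.0, 0.1)}
for eta, losses in results.items():
    print(f"eta={eta}: MSE after 1 round={losses[0]:.4f}, after 20 rounds={losses[-1]:.4f}")
```

The larger η drives training error down faster per round. Whether that helps on unseen data is a separate question that needs a validation set, which this sketch does not include.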
Exercise 5.2: How does XGBoost handle missing values?
XGBoost learns the optimal direction (left or right) to send missing values at each split. During training, it tries both directions and picks the one that minimizes loss. This means you don't need to impute missing values before XGBoost, a major practical advantage.
Exercise 5.3: Compare RF, GBM, and XGBoost on a dataset of your choice
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
models = {
    "RF": RandomForestClassifier(n_estimators=100),
    "GBM": GradientBoostingClassifier(n_estimators=100),
    "XGB": xgb.XGBClassifier(n_estimators=100, eval_metric='logloss')
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    print(f"{name}: {m.score(X_te, y_te):.2%}")
Chapter Summary
- Bagging (RF) reduces variance; boosting reduces bias
- Gradient boosting builds trees sequentially to correct previous errors
- XGBoost/LightGBM are the gold standard for tabular data
- Key parameters: n_estimators, learning_rate, max_depth, subsample
Naive Bayes Classifiers
Learning Objectives
- Apply Bayes' theorem for classification
- Understand the "naive" independence assumption
- Use Gaussian, Multinomial, and Bernoulli variants
- Build a text spam classifier
Python
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
# Text classification β Spam detection
emails = [
"Win a free iPhone now click here",
"Meeting scheduled for tomorrow 3pm",
"Congratulations you won $1000 prize",
"Please review the attached report",
"Free trial offer limited time only",
"Lunch meeting moved to 2pm",
"Claim your reward now act fast",
"Updated project timeline attached",
]
labels = [1,0,1,0,1,0,1,0] # 1=spam, 0=ham
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
clf = MultinomialNB()
clf.fit(X, labels)
test = ["Free prize winner click now", "Project deadline is Friday"]
X_test = vectorizer.transform(test)
print(clf.predict(X_test))  # [1 0] -> spam, ham
| Variant | Feature Type | Use Case |
|---|---|---|
| GaussianNB | Continuous (real-valued) | Iris, sensor data |
| MultinomialNB | Counts / frequencies | Text classification, word counts |
| BernoulliNB | Binary (0/1) | Binary text features, presence/absence |
Exercises
Exercise 6.1: Why does Naive Bayes work well despite the independence assumption being wrong?
Even though word correlations exist ("free" and "win" co-occur in spam), the classifier only needs correct ranking of class probabilities, not exact values. The independence assumption simplifies computation dramatically while often producing competitive rankings. It excels with limited data and high dimensions.
Exercise 6.2: When would Naive Bayes outperform logistic regression?
(1) Very small training sets: NB needs fewer examples. (2) Very high-dimensional sparse data (text with thousands of words). (3) Real-time requirements: NB prediction is a cheap sum of per-feature log-probabilities. (4) When features genuinely are somewhat independent.
Exercise 6.3: What is Laplace smoothing and why is it necessary?
If a word never appears in spam during training, P(word|spam) = 0, making the entire probability zero regardless of other evidence. Laplace smoothing adds α (usually 1) to every count: P(word|spam) = (count + α) / (total + α×V), where V is the vocabulary size. This ensures no probability is ever zero.
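The smoothing formula can be checked on a toy corpus. The tokens and vocabulary below are hypothetical; the point is that a word unseen in spam gets a small non-zero probability, and the smoothed probabilities still sum to 1 over the vocabulary.

```python
from collections import Counter

# Tiny hypothetical spam corpus, already tokenized
spam_tokens = "win free prize now free offer click now".split()
vocabulary = set(spam_tokens) | {"meeting", "report"}   # includes ham-only words

counts = Counter(spam_tokens)
total = len(spam_tokens)
V = len(vocabulary)

def p_word_given_spam(word, alpha=1.0):
    """Laplace-smoothed estimate: (count + alpha) / (total + alpha * V)."""
    return (counts[word] + alpha) / (total + alpha * V)

print(f"P(free | spam)    = {p_word_given_spam('free'):.3f}")
print(f"P(meeting | spam) = {p_word_given_spam('meeting'):.3f}")   # unseen in spam
print(f"Unsmoothed        = {p_word_given_spam('meeting', alpha=0.0):.3f}")
```

In scikit-learn this corresponds to the `alpha` parameter of `MultinomialNB`, whose default is 1.0.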
Chapter Summary
- Naive Bayes applies Bayes' theorem with the independence assumption
- MultinomialNB is the standard for text classification (spam, sentiment)
- Extremely fast training and prediction, great for baselines
- Laplace smoothing prevents zero probabilities from unseen features
Unsupervised Learning
Discovering patterns without labels
Clustering: k-Means & DBSCAN
Learning Objectives
- Understand k-Means algorithm: initialization, iteration, convergence
- Master DBSCAN for density-based clustering
- Choose the right k using elbow method and silhouette score
- Know when to use which clustering algorithm
k-Means Clustering
k-Means partitions data into k clusters by iteratively: (1) assigning each point to the nearest centroid, (2) moving centroids to the mean of assigned points. Repeats until convergence.
Python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
# Elbow Method: find optimal k
inertias = []
for k in range(2, 10):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)
    sil = silhouette_score(X, km.labels_)
    print(f"k={k}: inertia={km.inertia_:.0f}, silhouette={sil:.3f}")
# Best model
km = KMeans(n_clusters=4, random_state=42, n_init=10)
km.fit(X)
print(f"\nCluster sizes: {np.bincount(km.labels_)}")
DBSCAN: Density-Based Clustering
Unlike k-Means, DBSCAN doesn't require specifying k. It finds clusters of arbitrary shape by identifying dense regions separated by sparse areas. Points in sparse regions are labeled as noise.
Python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
X_moons, _ = make_moons(n_samples=300, noise=0.1)
# k-Means fails on non-spherical data
km = KMeans(n_clusters=2, n_init=10).fit(X_moons)
print(f"k-Means silhouette: {silhouette_score(X_moons, km.labels_):.3f}")
# DBSCAN can follow the moon shapes (silhouette favors convex clusters, so it may still score lower)
db = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)
print(f"DBSCAN silhouette: {silhouette_score(X_moons, db.labels_):.3f}")
print(f"Noise points: {(db.labels_ == -1).sum()}")
| Feature | k-Means | DBSCAN |
|---|---|---|
| Must specify k? | Yes | No (auto-discovers) |
| Cluster shape | Spherical only | Arbitrary shapes |
| Handles noise? | No (assigns all) | Yes (labels outliers) |
| Scalability | O(n·k) per iteration, fast | O(n log n) with indexing |
Project: Customer Segmentation
Python
# Simulate customer data
np.random.seed(42)
customers = np.column_stack([
np.concatenate([np.random.normal(30,10,100), np.random.normal(60,8,100), np.random.normal(45,12,100)]),
np.concatenate([np.random.normal(2000,500,100), np.random.normal(8000,1500,100), np.random.normal(4500,1000,100)])
])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)
km = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = km.fit_predict(X_scaled)
for i in range(3):
    mask = labels == i
    avg_age = customers[mask, 0].mean()
    avg_spend = customers[mask, 1].mean()
    print(f"Segment {i}: {mask.sum()} customers, "
          f"avg age={avg_age:.0f}, avg spend=${avg_spend:,.0f}")
Exercises
Exercise 7.1: What is the silhouette score and how do you interpret it?
Silhouette score measures how similar a point is to its own cluster vs the nearest other cluster. Range: [-1, 1]. Score near 1 = well-clustered. Near 0 = on boundary. Negative = probably in wrong cluster. Average silhouette across all points tells you overall clustering quality.
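For a single point, the score is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes it by hand on a tiny made-up dataset; it assumes every cluster has at least two points, and `silhouette_of_point` is a hypothetical helper, not a library function.

```python
import numpy as np

def silhouette_of_point(X, labels, i):
    """s(i) = (b - a) / max(a, b) for a single sample i."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False                                   # exclude the point itself
    a = dists[same].mean()                            # mean intra-cluster distance
    b = min(dists[labels == c].mean()                 # nearest other cluster
            for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

# Two hypothetical, well-separated 1D clusters
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

scores = [silhouette_of_point(X, labels, i) for i in range(len(X))]
print(np.round(scores, 3), "mean:", round(float(np.mean(scores)), 3))
```

Averaging these per-point values gives the overall score that `silhouette_score` reports.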
Exercise 7.2: How do you choose eps and min_samples in DBSCAN?
Use a k-distance plot: for each point, compute the distance to its k-th nearest neighbor (k = min_samples), sort the distances in descending order, and look for the "elbow." That distance ≈ eps. Rule of thumb: min_samples = 2 × dimensions. Start with min_samples=5, then tune eps.
Exercise 7.3: Why does k-Means fail on non-spherical clusters?
k-Means assigns points to the nearest centroid using Euclidean distance, creating Voronoi cells that are always convex/spherical. Moon-shaped or elongated clusters get split incorrectly because the centroid doesn't represent the cluster's true shape. DBSCAN uses density connectivity instead.
Exercise 7.4: Implement k-Means from scratch
def kmeans(X, k, max_iter=100):
    # Initialize centroids as k distinct random data points
    centroids = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point
        dists = np.linalg.norm(X[:, None] - centroids, axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: move each centroid to the mean of its points
        new_centroids = np.array([X[labels==i].mean(axis=0) for i in range(k)])
        if np.allclose(centroids, new_centroids): break
        centroids = new_centroids
    return labels, centroids
Chapter Summary
- k-Means is fast and simple but limited to spherical clusters and requires specifying k
- DBSCAN discovers clusters of arbitrary shape and identifies noise points
- Elbow method and silhouette score help choose optimal k for k-Means
- Always scale data before clustering
Dimensionality Reduction: PCA, t-SNE, UMAP
Learning Objectives
- Understand PCA mathematically: eigendecomposition of the covariance matrix
- Visualize high-dimensional data with t-SNE and UMAP
- Know when to use each technique
PCA: Principal Component Analysis
PCA finds the directions of maximum variance in your data and projects onto those axes. It reduces dimensions while retaining the most information possible.
Python
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits
digits = load_digits()  # 64 features (8×8 pixels)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(digits.data)
print(f"Original: {digits.data.shape}") # (1797, 64)
print(f"Reduced: {X_2d.shape}") # (1797, 2)
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.1%}")
# How many components to keep 95% variance?
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(digits.data)
print(f"Components for 95%: {pca_95.n_components_}") # ~28 of 64
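Under the hood, PCA amounts to an eigendecomposition of the covariance matrix of the centered data. The sketch below walks through those steps in NumPy on assumed synthetic 2D data (the correlation structure is made up for the example); scikit-learn's `PCA` reaches an equivalent result through an SVD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 2D data: most variance lies along one direction
z = rng.normal(size=500)
X = np.column_stack([z, 0.5 * z + rng.normal(scale=0.2, size=500)])

# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition (eigh: symmetric input)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by decreasing variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top principal component
X_1d = X_centered @ eigvecs[:, :1]
explained = eigvals / eigvals.sum()

print("explained variance ratio:", explained.round(3))
```

The eigenvectors are the principal directions, and each eigenvalue is the variance of the data along its direction, which is where `explained_variance_ratio_` comes from.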
t-SNE & UMAP for Visualization
Python
from sklearn.manifold import TSNE
# t-SNE: best for 2D visualization of clusters
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(digits.data)
# UMAP: faster, preserves global structure better
# pip install umap-learn
# import umap
# reducer = umap.UMAP(n_components=2)
# X_umap = reducer.fit_transform(digits.data)
| Method | Preserves | Speed | Use For |
|---|---|---|---|
| PCA | Global variance | Very fast | Preprocessing, noise reduction |
| t-SNE | Local structure | Slow (O(n²)) | Visualization only |
| UMAP | Local + global | Fast | Visualization + preprocessing |
Project: Visualize MNIST Digits in 2D
Python
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_vis = tsne.fit_transform(digits.data)
plt.figure(figsize=(10,8))
scatter = plt.scatter(X_vis[:,0], X_vis[:,1], c=digits.target,
cmap='tab10', alpha=0.7, s=10)
plt.colorbar(scatter, label='Digit')
plt.title('MNIST Digits: t-SNE Visualization')
plt.savefig('mnist_tsne.png', dpi=150)
# Each digit forms a distinct cluster!
Exercises
Exercise 8.1: Why can't you use t-SNE for preprocessing (only visualization)?
t-SNE is non-parametric: it can't transform new data points. The mapping is specific to the training set. PCA learns a linear transformation that can be applied to new data. Also, t-SNE doesn't preserve distances (only neighbor relationships), making it unsuitable for downstream ML.
Exercise 8.2: How do you choose the number of PCA components?
Plot cumulative explained variance ratio. Choose the "elbow" or the point where you retain 95% of variance. Alternatively, use PCA(n_components=0.95) to auto-select. For visualization: 2-3 components. For preprocessing: typically keep 95-99% variance.
Exercise 8.3: What does the perplexity parameter control in t-SNE?
Perplexity ≈ "number of effective neighbors." Low perplexity (5-10) focuses on very local structure. High perplexity (30-50) considers more neighbors, revealing global structure. Typical range: 5-50. Always try multiple values, as results are sensitive to this parameter.
Chapter Summary
- PCA projects onto directions of maximum variance: fast, linear, and invertible up to the discarded variance
- t-SNE excels at 2D visualization of cluster structure
- UMAP is faster than t-SNE and preserves global structure better
- Use PCA for preprocessing, t-SNE/UMAP for visualization
Gaussian Mixture Models & Autoencoders
Learning Objectives
- Understand GMMs as soft clustering via probability distributions
- Learn the Expectation-Maximization (EM) algorithm
- Grasp autoencoder concepts for dimensionality reduction
Gaussian Mixture Models
GMM assumes data comes from a mixture of k Gaussian distributions. Unlike k-Means (hard assignment), GMM gives probabilistic membership: each point has a probability of belonging to each cluster.
Python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.5, random_state=42)
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X)
labels = gmm.predict(X)
probs = gmm.predict_proba(X) # Soft assignment!
print(f"Point 0 probabilities: {probs[0].round(3)}")
# e.g., [0.001, 0.002, 0.997] -> 99.7% cluster 2
# Model selection with BIC
for k in range(2, 7):
    g = GaussianMixture(n_components=k, random_state=42).fit(X)
    print(f"k={k}: BIC={g.bic(X):.0f}")  # Lower is better
Autoencoders (Concept)
An autoencoder is a neural network that learns to compress data into a low-dimensional "bottleneck" representation and reconstruct it. The bottleneck captures the most important features, similar to PCA but non-linear.
Autoencoders are used for: dimensionality reduction, denoising, anomaly detection (high reconstruction error = anomaly), and generative models (VAEs).
Exercises
Exercise 9.1: How is GMM different from k-Means?
k-Means: Hard assignment (each point belongs to exactly one cluster), assumes spherical clusters of equal size. GMM: Soft assignment (probability of belonging), models elliptical clusters of different sizes and orientations, uses EM algorithm. GMM is more flexible but slower.
Exercise 9.2: What is the EM algorithm?
E-step (Expectation): Fix parameters, compute probability of each point belonging to each Gaussian. M-step (Maximization): Fix assignments, update Gaussian parameters (mean, covariance, weight). Repeat until convergence. It's guaranteed to improve the likelihood at each step.
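The two steps can be written out in a few lines for the simplest case. The sketch below fits a two-component Gaussian mixture to 1D data with plain NumPy; the function name `em_gmm_1d`, the crude initialization, and the synthetic data are assumptions for this example, and `GaussianMixture` handles the general multivariate case with better initialization and convergence checks.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1D Gaussian mixture."""
    means = np.array([x.min(), x.max()])        # crude initialization
    variances = np.array([x.var(), x.var()])
    weights = np.array([0.5, 0.5])
    log_likelihoods = []
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = P(component k | x_i)
        dens = weights * normal_pdf(x[:, None], means, variances)
        log_likelihoods.append(np.log(dens.sum(axis=1)).sum())
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        nk = r.sum(axis=0)
        weights = nk / len(x)
        means = (r * x[:, None]).sum(axis=0) / nk
        variances = (r * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances, log_likelihoods

# Hypothetical data: two Gaussians centered at 0 and 5
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)])

weights, means, variances, ll = em_gmm_1d(x)
print("weights:", weights.round(2), "means:", means.round(2))
```

The recorded log-likelihood never decreases from one iteration to the next, which is the guarantee mentioned above. EM can still converge to a local optimum, so real implementations usually try several initializations.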
Exercise 9.3: When would you use an autoencoder over PCA?
Use autoencoder when: (1) Data has non-linear structure that PCA can't capture. (2) You need denoising (denoising autoencoder). (3) You want generative capabilities (VAE). Use PCA when: data is roughly linear, you need fast training, or you need interpretable components.
Chapter Summary
- GMMs provide soft (probabilistic) cluster assignments via mixture of Gaussians
- EM algorithm alternates between assigning probabilities and updating parameters
- BIC/AIC help select the optimal number of components
- Autoencoders learn non-linear dimensionality reduction via neural networks
Anomaly Detection
Learning Objectives
- Detect outliers with Isolation Forest and One-Class SVM
- Use statistical methods: z-score, IQR
- Build a fraud detection system
Python
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
import numpy as np
# Normal data + injected anomalies
np.random.seed(42)
X_normal = np.random.randn(500, 2)
X_anomaly = np.random.uniform(-6, 6, (20, 2))
X = np.vstack([X_normal, X_anomaly])
# Isolation Forest: isolates anomalies via random splits
iso = IsolationForest(contamination=0.05, random_state=42)
preds = iso.fit_predict(X) # 1=normal, -1=anomaly
print(f"Isolation Forest: {(preds==-1).sum()} anomalies detected")
# One-Class SVM: learns boundary of normal data
ocsvm = OneClassSVM(kernel='rbf', nu=0.05)
preds_svm = ocsvm.fit_predict(X)
print(f"One-Class SVM: {(preds_svm==-1).sum()} anomalies detected")
# Statistical: Z-score
z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
stat_anomalies = (z_scores > 3).any(axis=1)
print(f"Z-score (>3σ): {stat_anomalies.sum()} anomalies detected")
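The learning objectives also list the IQR rule, which the example above does not cover. A minimal sketch on a single feature might look like the following; the synthetic `values` array and the conventional 1.5 multiplier (Tukey's fences) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# 500 ordinary readings plus three injected outliers at the end
values = np.concatenate([rng.normal(50, 5, 500), [120.0, -30.0, 95.0]])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey's fences
iqr_outliers = (values < lower) | (values > upper)

print(f"IQR fences: [{lower:.1f}, {upper:.1f}]")
print(f"IQR rule flagged {iqr_outliers.sum()} points")
```

Unlike the z-score, the fences are built from quartiles, so a few extreme values have little effect on where the thresholds fall.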
Project: Credit Card Fraud Detection
Python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report
# Simulate transaction data (normally distributed amounts)
np.random.seed(42)
normal_txns = np.column_stack([
np.random.normal(50, 20, 9900), # amount
np.random.normal(5, 2, 9900), # frequency
np.random.normal(100, 30, 9900), # distance_from_home
])
fraud_txns = np.column_stack([
np.random.normal(500, 200, 100), # large amounts
np.random.normal(20, 5, 100), # high frequency
np.random.normal(500, 150, 100), # far from home
])
X = np.vstack([normal_txns, fraud_txns])
y_true = np.array([1]*9900 + [-1]*100)
iso = IsolationForest(contamination=0.02, random_state=42)
y_pred = iso.fit_predict(X)
print(classification_report(y_true, y_pred, target_names=['Fraud', 'Normal']))
Exercises
Exercise 10.1: How does Isolation Forest work?
It builds random trees that split features randomly. Anomalies are isolated in fewer splits (shorter path length) because they are rare and different. Normal points are in dense regions, requiring more splits. Average path length across trees gives the anomaly score.
Exercise 10.2: What is the contamination parameter?
It's the expected proportion of anomalies in the dataset. contamination=0.05 means 5% of data is expected to be anomalous. It sets the decision threshold. If unknown, use "auto" or evaluate with precision-recall curves.
Exercise 10.3: Why is accuracy a bad metric for anomaly detection?
With 99% normal data, a model that predicts everything as "normal" gets 99% accuracy but catches zero fraud. Use precision (how many flagged are truly fraud), recall (how many actual frauds are caught), and F1-score. The area under the precision-recall curve (AUC-PR) is usually a more informative summary than accuracy or AUC-ROC when anomalies are rare.
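The argument is easy to check numerically. The sketch below scores a do-nothing "model" on a made-up label vector with 1% fraud; the arrays are illustrative, not real transaction data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 1 = fraud, 0 = normal; 10 frauds among 1000 transactions
labels = np.array([1] * 10 + [0] * 990)
always_normal = np.zeros_like(labels)  # a "model" that never flags anything

acc = accuracy_score(labels, always_normal)
rec = recall_score(labels, always_normal, zero_division=0)
prec = precision_score(labels, always_normal, zero_division=0)
f1 = f1_score(labels, always_normal, zero_division=0)
print(f"Accuracy:  {acc:.2%}")
print(f"Recall:    {rec:.2%}")
print(f"Precision: {prec:.2%}")
print(f"F1:        {f1:.2%}")
```

`zero_division=0` tells scikit-learn to report 0 rather than warn when no positive predictions exist.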
Chapter Summary
- Isolation Forest isolates anomalies using random splits; it is fast and effective
- One-Class SVM learns the boundary of normal data in feature space
- Statistical methods (z-score, IQR) work well for univariate anomalies
- Use precision/recall instead of accuracy for anomaly detection
Model Evaluation
Measuring what matters: the difference between a good and great model
Train/Validation/Test Splits & Cross-Validation
Learning Objectives
- Understand why and how to split data properly
- Master k-fold, stratified, and time-series cross-validation
- Avoid data leakage, the #1 ML pitfall
Data Splitting Strategies
| Strategy | Use Case | Split Ratio |
|---|---|---|
| Train/Test | Quick evaluation | 80/20 |
| Train/Val/Test | Hyperparameter tuning | 60/20/20 |
| k-Fold CV | Robust estimate | k rotations |
| Stratified k-Fold | Imbalanced classes | k rotations (preserves class ratio) |
| Time Series Split | Temporal data | Expanding window |
Python
from sklearn.model_selection import (cross_val_score, StratifiedKFold,
TimeSeriesSplit)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target
# Stratified k-Fold (preserves class distribution)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf, X, y, cv=skf, scoring='accuracy')
print(f"5-Fold CV: {scores.mean():.4f} ± {scores.std():.4f}")
print(f"Per fold: {scores.round(4)}")
# Time Series Split: never look into the future!
tscv = TimeSeriesSplit(n_splits=5)
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Fold {i+1}: Train={len(train_idx)}, Test={len(test_idx)}")
Data Leakage: The Silent Killer
Data leakage occurs when information from the test set accidentally influences training. Common sources: (1) scaling before splitting, so the scaler's mean and standard deviation include test-set rows; (2) feature engineering that uses the full dataset; (3) time-based features computed from future values. Always split first, then fit preprocessing on the training data only.
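A short sketch of the "split first, then preprocess" rule, using the breast cancer dataset from earlier examples. The variable names (`scaler`, `leaky_scaler`) are illustrative; the leaky variant is shown only for contrast.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Correct: the scaler learns mean and std from the training rows only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # test rows are transformed, never fitted on

# Leaky (for contrast only): statistics computed on all rows, test rows included
leaky_scaler = StandardScaler().fit(X)

print("Train-only mean of feature 0:", np.round(scaler.mean_[0], 3))
print("Full-data mean of feature 0: ", np.round(leaky_scaler.mean_[0], 3))
```

The two means differ slightly, and that difference is exactly the test-set information a leaky workflow would pass into training.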
Exercises
Exercise 11.1: Why is a single train/test split unreliable?
One split gives one score that depends on which specific examples ended up in test vs train, so the estimate has high variance. Cross-validation gives k scores, providing a mean and a measure of spread (the standard deviation across folds). One lucky or unlucky split can be very misleading.
Exercise 11.2: When must you use TimeSeriesSplit instead of random split?
Whenever data has temporal ordering (stock prices, weather, user behavior over time). Random splitting would let the model train on future data and predict the past, unrealistically inflating performance. TimeSeriesSplit always trains on past and tests on future, mimicking real deployment.
Exercise 11.3: What is the difference between validation and test sets?
Validation: Used during development to tune hyperparameters and select models. You see its score repeatedly. Test: Used ONCE at the very end to estimate real-world performance. If you tune on the test set, you're overfitting to it, and your reported performance will be optimistic.
Exercise 11.4: How does stratification help with imbalanced datasets?
Without stratification, a random split of 1000 samples (5% positive) could give a fold with 0 positives, making evaluation meaningless. Stratified k-fold ensures each fold has the same class ratio (5% positive), giving reliable per-class metrics and meaningful cross-validation.
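The effect can be seen by counting positives per test fold. The sketch below uses a deliberately ordered label vector (all positives at the end) and unshuffled folds, which is a worst case chosen for illustration rather than a typical random split.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 1000 samples, 5% positive, with the positives stored last
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((len(y), 1))  # features are irrelevant for this illustration

plain = [int(y[test].sum()) for _, test in KFold(n_splits=5).split(X)]
strat = [int(y[test].sum()) for _, test in StratifiedKFold(n_splits=5).split(X, y)]

print("Positives per test fold, KFold:          ", plain)
print("Positives per test fold, StratifiedKFold:", strat)
# KFold           -> [0, 0, 0, 0, 50]
# StratifiedKFold -> [10, 10, 10, 10, 10]
```

Four of the five plain folds contain no positives at all, so recall or AUC cannot even be computed on them.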
Chapter Summary
- Cross-validation gives robust performance estimates; single splits are unreliable
- Stratified k-fold preserves class distribution, which is essential for imbalanced data
- TimeSeriesSplit respects temporal ordering: never leak future information
- Data leakage silently inflates metrics; always split before preprocessing
Metrics, Bias-Variance & Regularization
Learning Objectives
- Master classification metrics: precision, recall, F1, AUC-ROC
- Understand the bias-variance tradeoff deeply
- Detect and fix overfitting with regularization
Classification Metrics
| Metric | Formula | Optimize When |
|---|---|---|
| Precision | TP / (TP + FP) | False positives are costly (spam filter) |
| Recall | TP / (TP + FN) | False negatives are costly (cancer detection) |
| F1 | 2 × (P × R) / (P + R) | You need balance between P and R |
| AUC-ROC | Area under ROC curve | Overall ranking quality |
| Accuracy | (TP + TN) / Total | Balanced classes only |
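To connect the formulas in the table to the confusion matrix, here is a small worked example on ten hand-made labels. The arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# TP=3 FP=1 FN=1 TN=5
# precision=0.750 recall=0.750 f1=0.750 accuracy=0.800
```

The hand-computed values match what `precision_score`, `recall_score`, and `f1_score` return for the same inputs.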
Python
from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_auc_score, roc_curve)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, test_size=0.2)
clf = LogisticRegression(max_iter=5000)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_proba = clf.predict_proba(X_te)[:, 1]
print("Confusion Matrix:")
print(confusion_matrix(y_te, y_pred))
print(f"\nAUC-ROC: {roc_auc_score(y_te, y_proba):.4f}")
print(classification_report(y_te, y_pred))
Bias-Variance Tradeoff
High bias (underfitting): Model is too simple and consistently wrong. Fix: use a more complex model, add features. High variance (overfitting): Model memorizes training data, so it is great on train and poor on test. Fix: more data, regularization, simpler model.
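One way to see both failure modes is to fit polynomials of increasing degree and compare train and held-out scores. This is a sketch on synthetic noisy-sine data; the helper `noisy_sine`, the sample sizes, and the chosen degrees are assumptions for illustration, and no regularization is applied here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def noisy_sine(n):
    """Sample n points from sin(2*pi*x) on [0, 1] with Gaussian noise."""
    x = rng.uniform(0, 1, n).reshape(-1, 1)
    return x, np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, n)

X_train, y_train = noisy_sine(30)
X_test, y_test = noisy_sine(200)

results = {}
for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    results[degree] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"degree {degree:2d}: train R2={results[degree][0]:.3f}  "
          f"test R2={results[degree][1]:.3f}")
```

A low score on both sets points to high bias, while a high train score paired with a noticeably lower test score points to high variance.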
Regularization
Python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
# Demonstrating overfitting and regularization
np.random.seed(42)
X = np.sort(np.random.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2*np.pi*X).ravel() + np.random.normal(0, 0.2, 30)
# Underfit (degree 1), Good fit (degree 3), Overfit (degree 15)
# The degree-15 model gets a stronger Ridge penalty (alpha=1.0) to rein it in
for degree in [1, 3, 15]:
    model = make_pipeline(
        PolynomialFeatures(degree),
        Ridge(alpha=0.01 if degree < 15 else 1.0)
    )
    model.fit(X, y)
    print(f"Degree {degree:2d}: Train R²={model.score(X,y):.4f}")
Project: Diagnose and Fix Overfitting
Python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier
def plot_learning_curve(model, X, y):
    """Print train and validation scores at increasing training sizes."""
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10))
    print("Training size | Train score | Val score")
    for size, tr, val in zip(train_sizes, train_scores.mean(axis=1),
                             val_scores.mean(axis=1)):
        print(f" {size:11d} | {tr:.4f} | {val:.4f}")
    # Gap between train and validation score, averaged over all sizes and folds
    gap = train_scores.mean() - val_scores.mean()
    print(f"\nGap: {gap:.4f}")
    print("Overfitting!" if gap > 0.1 else "Good fit")
# Compare overfitting (deep tree) vs regularized (limited depth)
data = load_breast_cancer()
print("=== Deep Tree (likely overfit) ===")
plot_learning_curve(DecisionTreeClassifier(max_depth=None), data.data, data.target)
print("\n=== Shallow Tree (regularized) ===")
plot_learning_curve(DecisionTreeClassifier(max_depth=3), data.data, data.target)
Exercises
Exercise 12.1: When would you optimize for recall over precision?
Optimize recall when missing a positive is very costly: cancer screening (missing cancer = death), fraud detection (missing fraud = financial loss), manufacturing defect detection. Accept more false positives to catch more true positives. Optimize precision when false alarms are costly: email spam (legitimate email in spam folder).
Exercise 12.2: A model has 98% train accuracy but 72% test accuracy. Diagnose it.
A 26-point gap indicates severe overfitting (high variance). The model memorized training data. Fixes: (1) Add regularization (increase alpha in Ridge/Lasso), (2) Reduce model complexity (fewer features, shallower trees), (3) Get more training data, (4) Add dropout (neural nets), (5) Use cross-validation for model selection.
Exercise 12.3: Explain the ROC curve and why AUC is useful
The ROC curve plots True Positive Rate vs False Positive Rate at every probability threshold. AUC (Area Under Curve) summarizes this: AUC=1.0 is perfect, AUC=0.5 is random. AUC is threshold-independent: it tells you how well the model ranks positives above negatives regardless of the cutoff you choose.
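The ranking interpretation can be verified directly: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The labels and scores below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90])

pos = scores[y_true == 1]
neg = scores[y_true == 0]
# Compare every positive score with every negative score; ties count as half a win
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"Pairwise ranking AUC: {pairwise_auc:.4f}")
print(f"roc_auc_score:        {roc_auc_score(y_true, scores):.4f}")
# Both print 0.8333 (20 of 24 pairs are ranked correctly)
```

No threshold appears anywhere in the pairwise computation, which is why AUC is described as threshold-independent.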
Exercise 12.4: What's the difference between L1 (Lasso) and L2 (Ridge) regularization?
L1 (Lasso): Adds |w| penalty. Drives some weights exactly to zero, which acts as feature selection. Sparse solutions. L2 (Ridge): Adds w² penalty. Shrinks all weights but doesn't zero them. Better when all features are relevant. ElasticNet: combines both with a mixing parameter.
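The sparsity difference shows up clearly on synthetic data where only a few features matter. This sketch assumes a `make_regression` dataset with 3 informative features out of 10 and `alpha=1.0` for both models; those values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, only 3 of which actually drive the target (synthetic data)
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", lasso.coef_.round(2))
print("Ridge coefficients:", ridge.coef_.round(2))
print("Exact zeros  Lasso:", int(np.sum(lasso.coef_ == 0)),
      " Ridge:", int(np.sum(ridge.coef_ == 0)))
```

Lasso is expected to set coefficients of uninformative features to exactly zero, while Ridge leaves them small but nonzero.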
Chapter Summary
- Precision matters when false positives are costly; recall when false negatives are
- AUC-ROC is a threshold-independent measure of ranking quality
- Bias-variance tradeoff: simple models underfit, complex models overfit
- Regularization (L1/L2) prevents overfitting by penalizing large weights
- Learning curves diagnose underfitting vs overfitting
ML Frameworks
Production-ready machine learning
Scikit-learn Pipelines & Feature Engineering
Learning Objectives
- Build end-to-end ML pipelines with Pipeline and ColumnTransformer
- Automate feature scaling, encoding, and imputation
- Create custom transformers for domain-specific features
Python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
# Simulated mixed-type dataset
df = pd.DataFrame({
'age': [25, 32, np.nan, 45, 28, 52, 38],
'income': [30000, 55000, 42000, np.nan, 48000, 72000, 60000],
'education': ['BS','MS','BS','PhD','BS','MS','PhD'],
'target': [0,1,0,1,0,1,1]
})
numeric_features = ['age', 'income']
categorical_features = ['education']
# Column-specific preprocessing
preprocessor = ColumnTransformer([
('num', Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
]), numeric_features),
('cat', Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
]), categorical_features)
])
# Full pipeline: preprocess + model
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100))
])
X = df.drop('target', axis=1)
y = df['target']
pipeline.fit(X, y)
print(f"Pipeline trained successfully!")
print(f"Prediction: {pipeline.predict(X[:2])}")
Why Pipelines Matter
Pipelines prevent data leakage (scaler fits only on train fold in CV), make code reproducible, enable easy serialization (pickle the whole pipeline), and simplify deployment: one object handles preprocessing + prediction.
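The serialization point can be sketched with the standard library's `pickle`. The two-step pipeline below is a simplified stand-in for the one above, and the in-memory round trip is an assumption made to keep the example self-contained; in practice the bytes would be written to a file. Only unpickle data from a trusted source, ideally with the same scikit-learn version that created it.

```python
import pickle

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Serialize preprocessing + model together, then restore them as one object
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)

same = np.array_equal(pipe.predict(X[:10]), restored.predict(X[:10]))
print("Restored pipeline reproduces predictions:", same)
```

Because the fitted scaler travels with the classifier, the restored object accepts raw features and applies the same scaling it learned during training.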
Project: End-to-End Titanic Pipeline
Python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']
X, y = df[features], df['Survived']
preprocessor = ColumnTransformer([
('num', Pipeline([
('imp', SimpleImputer(strategy='median')),
('scl', StandardScaler())
]), ['Age','SibSp','Parch','Fare']),
('cat', Pipeline([
('imp', SimpleImputer(strategy='most_frequent')),
('enc', OneHotEncoder(handle_unknown='ignore'))
]), ['Pclass','Sex','Embarked'])
])
pipe = Pipeline([('pre', preprocessor), ('clf', GradientBoostingClassifier())])
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f"Titanic CV Accuracy: {scores.mean():.2%} ± {scores.std():.2%}")
Exercises
Exercise 13.1: What is data leakage in the context of pipelines?
If you fit the scaler on the entire dataset before cross-validation, the test fold's statistics influence the scaling: information leaks from test to train. Pipelines prevent this by fitting the preprocessor only on the training fold in each CV iteration.
Exercise 13.2: How do you create a custom transformer?
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
class LogTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # nothing to learn from the data
    def transform(self, X):
        return np.log1p(X)  # log(1 + x), defined at x = 0
Exercise 13.3: Why use OneHotEncoder instead of LabelEncoder for features?
LabelEncoder maps each category to an integer (for example Blue=0, Green=1, Red=2), implying an ordering that doesn't exist for nominal categories. The model might treat Red > Green > Blue as meaningful. OneHotEncoder creates one binary column per category, so there is no false ordering. Use LabelEncoder only for the target variable.
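A quick comparison of the two encoders on a toy colour column makes the difference concrete. The `colors` array is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(["Red", "Blue", "Green", "Blue"])

# LabelEncoder: one integer per category, assigned in sorted order
label_codes = LabelEncoder().fit_transform(colors)
print("LabelEncoder:", label_codes)  # [2 0 1 0]

# OneHotEncoder: one binary column per category (expects a 2D input)
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(colors.reshape(-1, 1)).toarray()
print("Categories:", encoder.categories_[0])  # ['Blue' 'Green' 'Red']
print(one_hot)
```

In the one-hot matrix every row contains a single 1, so no category is numerically "larger" than another.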
Chapter Summary
- Pipelines chain preprocessing + model into a single reproducible object
- ColumnTransformer applies different transformations to different feature types
- Pipelines prevent data leakage automatically during cross-validation
- Custom transformers extend pipelines with domain-specific feature engineering
Hyperparameter Tuning & MLflow
Learning Objectives
- Use GridSearchCV and RandomizedSearchCV for systematic tuning
- Understand Bayesian optimization concepts
- Track experiments with MLflow for reproducibility
GridSearch vs RandomizedSearch
Python
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from scipy.stats import randint, uniform
data = load_breast_cancer()
X, y = data.data, data.target
# GridSearchCV: exhaustive (try ALL combinations)
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [3, 5, 7, None],
'min_samples_split': [2, 5, 10]
}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.4f}")
# RandomizedSearchCV: samples N random combos (faster for large spaces)
param_dist = {
'n_estimators': randint(50, 500),
'max_depth': [3, 5, 7, 10, None],
'min_samples_split': randint(2, 20),
'max_features': uniform(0.1, 0.9)
}
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42), param_dist,
n_iter=50, cv=5, scoring='accuracy', n_jobs=-1)
random_search.fit(X, y)
print(f"\nRandom best: {random_search.best_params_}")
print(f"Random score: {random_search.best_score_:.4f}")
| Method | Tries | Best For | Speed |
|---|---|---|---|
| GridSearchCV | All combinations | Small parameter spaces | Slow (exponential) |
| RandomizedSearchCV | N random samples | Large/continuous spaces | Fast (linear in N) |
| Bayesian (Optuna) | Informed samples | Expensive evaluations | Most efficient |
MLflow: Experiment Tracking
Python
# pip install mlflow
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# X, y: the breast cancer data loaded in the previous example
mlflow.set_experiment("breast_cancer_classification")
for n_est in [50, 100, 200]:
    for max_d in [3, 5, 7]:
        with mlflow.start_run(run_name=f"RF_{n_est}_{max_d}"):
            # Train
            rf = RandomForestClassifier(n_estimators=n_est, max_depth=max_d)
            scores = cross_val_score(rf, X, y, cv=5)
            # Log parameters and metrics
            mlflow.log_param("n_estimators", n_est)
            mlflow.log_param("max_depth", max_d)
            mlflow.log_metric("cv_accuracy_mean", scores.mean())
            mlflow.log_metric("cv_accuracy_std", scores.std())
            # Log model (refit on all data, since cross_val_score fits clones)
            rf.fit(X, y)
            mlflow.sklearn.log_model(rf, "model")
print("All experiments logged! Run 'mlflow ui' to view.")
Why Experiment Tracking Matters
Without tracking, you lose track of which parameters gave which results after dozens of experiments. MLflow records parameters, metrics, models, and artifacts for each run, either through explicit log calls like those above or through its autologging support. You can compare runs in a dashboard, reproduce any experiment, and load or serve the best model directly from its run.
Exercises
Exercise 14.1: When should you use RandomizedSearchCV over GridSearchCV?
Use Randomized when: (1) parameter space is large (3+ params with many values each), (2) parameters are continuous, (3) you have limited compute budget. Grid with 5 params × 10 values each = 100,000 combinations. Randomized search with 100 iterations evaluates 1000× fewer candidates and often lands close to the best grid score, though this is not guaranteed.
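The arithmetic behind this comparison can be checked with scikit-learn's `ParameterGrid`, which enumerates the combinations a grid search would try. The small grid mirrors the one in the GridSearchCV example; the five-by-ten grid is the hypothetical case from the answer above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid shape as the GridSearchCV example: 3 x 4 x 3 values
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 5, 10],
}
n_combos = len(ParameterGrid(param_grid))
cv_folds = 5
print(f"Grid combinations: {n_combos}, model fits with {cv_folds}-fold CV: "
      f"{n_combos * cv_folds}")
# Grid combinations: 36, model fits with 5-fold CV: 180

# Five hyperparameters with ten candidate values each
big_grid = 10 ** 5
n_iter = 100
print(f"Large grid: {big_grid:,} combinations; random search with n_iter={n_iter} "
      f"evaluates {big_grid // n_iter}x fewer candidates")
```

The cost of a grid grows multiplicatively with each added parameter, whereas the cost of a randomized search is fixed by `n_iter`.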
Exercise 14.2: What is Bayesian optimization and how does it improve on grid/random?
Bayesian optimization builds a probabilistic surrogate model of the objective function (for example a Gaussian Process or a Tree-structured Parzen Estimator). It uses this model to choose the next parameters to evaluate, focusing on promising regions. Libraries like Optuna and HyperOpt implement this style of search. When each evaluation is expensive, it typically needs fewer trials than random search to reach a comparable score.
Exercise 14.3: How do you use MLflow to deploy a model?
# Save model
mlflow.sklearn.log_model(best_model, "model")
# Load and serve
loaded = mlflow.sklearn.load_model("runs:/<run_id>/model")
prediction = loaded.predict(new_data)
# Or serve as REST API:
# mlflow models serve -m runs:/<run_id>/model --port 5000
Chapter Summary
- GridSearchCV tries all combinations: thorough but slow for large spaces
- RandomizedSearchCV samples randomly and often finds near-optimal settings in less time
- Bayesian optimization is most efficient for expensive model evaluations
- MLflow tracks experiments, logs metrics/models, and enables reproducibility
Congratulations!
You've completed Classical Machine Learning. You now understand supervised and unsupervised learning, model evaluation, and production ML frameworks: the foundation for building real-world AI systems.
© 2025 EduArtha - Classical Machine Learning Complete Guide