Neural Networks & Deep Learning
Chapter 21: MLOps, from Jupyter Notebook to Production
Bridging the Last Mile Between Model Training and Real-World Impact
Reading Time: ~3 hours | Part VI: MLOps & Deployment | Engineering + Code Chapter
Prerequisites: Chapters 12–17 (CNN, Transfer Learning), Basic Python, Familiarity with REST APIs
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| Remember | Recall the stages of the ML lifecycle, list model serialization formats (SavedModel, ONNX, TorchScript), and name key MLOps tools |
| Understand | Explain why a model that works in Jupyter fails in production, differentiate between data drift, prediction drift, and concept drift |
| Apply | Wrap a trained model in a FastAPI endpoint, write a Dockerfile for ML inference, and deploy to Google Cloud Run |
| Analyze | Diagnose production model degradation using monitoring dashboards, compare different serving architectures for latency and throughput |
| Evaluate | Choose between MLflow, W&B, DVC, and Kubeflow for a given team size and use case; decide between cloud vs. edge deployment for Indian connectivity constraints |
| Create | Build an end-to-end MLOps pipeline: train a plant disease model, package as FastAPI + Docker, deploy with CI/CD, and add monitoring |
Learning Objectives
By the end of this chapter, you will be able to:
- Map the complete ML lifecycle from data collection through deployment, monitoring, and retraining, and identify where most projects fail (the "deployment gap")
- Serialize trained models using TensorFlow SavedModel, ONNX, and TorchScript formats, and explain the trade-offs of each
- Build a production-grade REST API using FastAPI with Pydantic validation, async inference, health checks, and proper error handling
- Containerize an ML application using Docker with multi-stage builds optimized for minimal image size
- Deploy models using TensorFlow Serving, and understand model versioning with A/B testing strategies
- Implement monitoring for data drift, prediction drift, and concept drift using statistical tests
- Optimize models for edge deployment on Android devices using TFLite, especially for low-connectivity scenarios in rural India
- Design a complete MLOps pipeline from experiment tracking (MLflow) through CI/CD to production monitoring
- Compare MLOps tools (MLflow, Weights & Biases, DVC, Kubeflow) and select the right stack for different organizational needs
- Build a Streamlit UI for demo and internal stakeholder consumption of ML predictions
Opening Hook
"94% accuracy. Manager: Great, deploy it. Now what?"
Priya, a data scientist at a Bangalore fintech startup, spent 3 months training a loan default prediction model. 94% accuracy on the test set. Her manager was ecstatic: "Ship it by Friday."
That was 6 weeks ago. Priya is still trying to deploy it.
The model ran perfectly in her Jupyter Notebook on her 32 GB RAM laptop. But the production server has 4 GB. Her pickle file threw version errors. The Flask API crashed under 50 concurrent requests. And nobody told her that RBI regulations require model explainability logs.
Priya's story is not unique. According to Gartner, only 53% of ML prototypes ever make it to production. The gap between a working notebook and a production system is what MLOps exists to bridge.
In this chapter, you'll learn to be the engineer who can say: "94% accuracy, AND it's deployed, monitored, and auto-retraining."
The ideas behind MLOps are often traced to a 2015 Google paper describing the company's internal ML infrastructure; the term "MLOps" itself came into common use a few years later. Today, the MLOps market is valued at over $1.4 billion globally, and India's IT services companies (TCS, Infosys, Wipro) are among the largest MLOps consulting providers worldwide.
Core Concepts
21.1 The ML Lifecycle: Why Notebooks Are Not Products
The journey from data to value follows a structured lifecycle. Most university courses cover only the first three stages. Production ML requires mastering all six:
The Complete ML Lifecycle
Stage 1: Data Collection & Preparation. Gathering raw data, cleaning, labeling, versioning datasets. For Flipkart's product recommendation, this means processing 1.5 billion+ click events daily.
Stage 2: Feature Engineering & Training. Feature extraction, model architecture selection, hyperparameter tuning. The part you've mastered in Chapters 1–20.
Stage 3: Evaluation & Validation. Test set metrics, cross-validation, fairness audits. "94% accuracy" lives here.
Stage 4: Deployment (Serving). Packaging the model as a service (REST API, gRPC), containerizing (Docker), deploying to cloud or edge. This is where most projects die.
Stage 5: Monitoring. Tracking prediction quality, data drift, latency, throughput. Without monitoring, your model silently degrades.
Stage 6: Retraining & CI/CD. Automated retraining pipelines triggered by drift detection or scheduled intervals. Continuous integration for ML code, continuous delivery for models.
ML in Production = Code + Data + Model + Infra + Monitoring + Governance
The "Hidden Technical Debt" Problem
A landmark Google paper (Sculley et al., 2015) revealed that in real-world ML systems, the actual model training code is just a tiny fraction of the total codebase. The vast majority is infrastructure: data pipelines, feature stores, serving infrastructure, monitoring, and configuration management.
At Paytm, the fraud detection team reported that ML model code accounts for only ~5% of their production codebase. The remaining 95% handles data ingestion from 450+ million wallets, feature computation at 10,000+ TPS (transactions per second), A/B testing across regional segments, and compliance logging required by RBI.
21.2 Model Serialization: Saving Models for Production
Your trained model exists as weights in GPU memory. To deploy it, you must serialize it, that is, convert it to a portable, self-contained format that any server can load without your original training code.
Model Serialization Formats
1. TensorFlow SavedModel: TensorFlow's native format. Saves the complete computation graph + weights + signatures. Directly loadable by TensorFlow Serving, and convertible to TFLite and TF.js.
2. ONNX (Open Neural Network Exchange): Framework-agnostic format supported by Microsoft, Facebook, and Amazon. Train in PyTorch, deploy with ONNX Runtime on any platform. Ideal for cross-framework interoperability.
3. TorchScript (PyTorch): PyTorch's serialization via torch.jit.trace() or torch.jit.script(). Converts Python models to an intermediate representation that runs without a Python dependency.
4. Pickle / Joblib: Simple binary serialization. Not recommended for production: version-dependent, security risks (arbitrary code execution), and no graph optimization.
TensorFlow SavedModel
# Save a Keras model as TensorFlow SavedModel
import tensorflow as tf

# After training your model (e.g., plant disease classifier from Ch17)
model = tf.keras.models.load_model('plant_disease_model.h5')

# Save as SavedModel (directory format)
tf.saved_model.save(model, './saved_model/plant_disease/1')
# The '1' is the version number, which is critical for model versioning!

# Directory structure:
# saved_model/plant_disease/1/
# ├── saved_model.pb      <- computation graph
# ├── fingerprint.pb
# └── variables/
#     ├── variables.data-00000-of-00001   <- weights
#     └── variables.index

# Load it back
loaded = tf.saved_model.load('./saved_model/plant_disease/1')
print(loaded.signatures)  # Shows input/output specs
ONNX Export from PyTorch
import torch
import torch.onnx

# Assume `model` is a trained PyTorch model
model.eval()

# Create dummy input matching your model's expected shape
dummy_input = torch.randn(1, 3, 224, 224)  # Batch=1, 3 channels, 224×224

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "plant_disease.onnx",
    input_names=["image"],
    output_names=["prediction"],
    dynamic_axes={"image": {0: "batch_size"}},  # Allow variable batch
    opset_version=13
)

# Verify with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("plant_disease.onnx")
result = session.run(None, {"image": dummy_input.numpy()})
print(f"Prediction shape: {result[0].shape}")
TorchScript
# Method 1: Tracing (works for models without control flow)
traced_model = torch.jit.trace(model, dummy_input)
traced_model.save("plant_disease_traced.pt")
# Method 2: Scripting (works with if/else, loops)
scripted_model = torch.jit.script(model)
scripted_model.save("plant_disease_scripted.pt")
# Load without needing the model class definition!
loaded = torch.jit.load("plant_disease_traced.pt")
output = loaded(dummy_input)
| Format | Framework | Graph Optimization | Cross-Platform | Production-Ready |
|---|---|---|---|---|
| SavedModel | TensorFlow | Yes (XLA, TF-TRT) | TF ecosystem | ★★★ |
| ONNX | Any → ONNX Runtime | Yes (graph fusion) | Yes (universal) | ★★★ |
| TorchScript | PyTorch | Yes (fusion passes) | C++, mobile | ★★ |
| pickle/joblib | Scikit-learn | None | No (Python only) | Not recommended |
Never use pickle for production ML models. Pickle files are Python-version and library-version dependent. A model pickled with scikit-learn 1.2 may not load with scikit-learn 1.3. Worse, unpickling can execute arbitrary code, so a malicious pickle file can compromise your server. Always use SavedModel, ONNX, or TorchScript.
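To make the arbitrary-code-execution risk concrete, here is a minimal, harmless sketch. The class name Malicious and the expression it evaluates are illustrative assumptions; a real attack would substitute something destructive for the arithmetic.

```python
import pickle


class Malicious:
    """Stand-in for an attacker-crafted object (hypothetical example)."""

    def __reduce__(self):
        # pickle records "call eval('6 * 7')" instead of any object state.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Malicious())

# Loading the bytes runs the embedded call; no method is invoked explicitly.
result = pickle.loads(payload)
print(result)  # 42
```

The caller only asked to load data, yet an expression chosen by whoever produced the file was evaluated. This is why model files from untrusted sources should never be unpickled.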
21.3 REST API with FastAPI: The /predict Endpoint
FastAPI is the modern Python web framework of choice for ML serving. It's built on Starlette (async ASGI) and Pydantic (data validation), offering automatic OpenAPI docs, type validation, and async request handling out of the box.
Why FastAPI Over Flask for ML?
| Feature | Flask | FastAPI |
|---|---|---|
| Async Support | No: WSGI (sync) | Yes: ASGI (native async) |
| Request Validation | Manual | Pydantic auto-validation |
| Auto API Docs | No: needs a Swagger plugin | Built-in /docs |
| Type Hints | Optional | Used for validation and docs |
| Performance (indicative, workload-dependent) | ~500 req/s | ~3000+ req/s |
| ML Suitability | Good for prototypes | Production-grade |
Complete FastAPI ML Server
# app.py - Production FastAPI ML serving application
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
import tensorflow as tf
import numpy as np
from PIL import Image
import io
import time
import logging

# --- Setup ---
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(
    title="Plant Disease Classifier API",
    description="Predicts plant disease from leaf images (Ch17 model)",
    version="1.0.0"
)

# CORS: allow the Streamlit frontend to call this API.
# "*" is convenient for demos; restrict origins in production.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# --- Load model at startup (not per-request!) ---
CLASS_NAMES = [
    "Healthy", "Early Blight", "Late Blight",
    "Bacterial Spot", "Yellow Leaf Curl"
]
model = None

@app.on_event("startup")
async def load_model():
    global model
    logger.info("Loading plant disease model...")
    model = tf.saved_model.load("./saved_model/plant_disease/1")
    logger.info("Model loaded successfully!")

# --- Pydantic Response Schemas ---
class PredictionResult(BaseModel):
    predicted_class: str = Field(..., example="Early Blight")
    confidence: float = Field(..., ge=0, le=1, example=0.94)
    all_probabilities: dict = Field(...)
    inference_time_ms: float = Field(..., example=45.2)

class HealthResponse(BaseModel):
    status: str = "healthy"
    model_loaded: bool = True
    version: str = "1.0.0"

# --- Helper: preprocess image ---
def preprocess_image(image_bytes: bytes) -> np.ndarray:
    """Resize to 224x224, normalize to [0,1], add batch dim."""
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    img = img.resize((224, 224))
    arr = np.array(img, dtype=np.float32) / 255.0
    return np.expand_dims(arr, axis=0)  # (1, 224, 224, 3)

# --- Endpoints ---
@app.get("/health", response_model=HealthResponse)
async def health_check():
    return HealthResponse(model_loaded=model is not None)

@app.post("/predict", response_model=PredictionResult)
async def predict(file: UploadFile = File(...)):
    """Upload a leaf image and get disease prediction."""
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded yet")

    # Validate file type
    if file.content_type not in ["image/jpeg", "image/png"]:
        raise HTTPException(
            status_code=400,
            detail="Only JPEG/PNG images accepted"
        )

    # Read & preprocess (a corrupt upload becomes a 400, not a 500)
    image_bytes = await file.read()
    try:
        input_tensor = preprocess_image(image_bytes)
    except Exception:
        raise HTTPException(status_code=400, detail="Could not decode image")

    # Inference with timing
    start = time.perf_counter()
    infer = model.signatures["serving_default"]
    predictions = infer(tf.constant(input_tensor))
    output_key = list(predictions.keys())[0]
    # Assumes the model outputs logits; drop the softmax if the final
    # layer already applies one.
    probs = tf.nn.softmax(predictions[output_key]).numpy()[0]
    elapsed_ms = (time.perf_counter() - start) * 1000

    # Build response
    predicted_idx = int(np.argmax(probs))
    return PredictionResult(
        predicted_class=CLASS_NAMES[predicted_idx],
        confidence=round(float(probs[predicted_idx]), 4),
        all_probabilities={
            name: round(float(p), 4)
            for name, p in zip(CLASS_NAMES, probs)
        },
        inference_time_ms=round(elapsed_ms, 2)
    )
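The /predict handler checks file.content_type, but that value is supplied by the client and can be wrong or forged. A minimal sketch of a stricter check follows; the helper name sniff_image_type is hypothetical, and it assumes that inspecting the leading "magic bytes" of the upload is sufficient for this use case.

```python
from typing import Optional

# Well-known file signatures for the two formats the API accepts.
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return the MIME type implied by the first bytes, or None if unknown."""
    if data.startswith(JPEG_MAGIC):
        return "image/jpeg"
    if data.startswith(PNG_MAGIC):
        return "image/png"
    return None


# A text file mislabelled as a JPEG is still rejected.
print(sniff_image_type(b"hello, not an image"))       # None
print(sniff_image_type(PNG_MAGIC + b"\x00" * 16))     # image/png
```

In the endpoint, this could be called on image_bytes right after reading the upload, returning a 400 when the result is None.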
Run & Test
# Terminal: Start the server
$ uvicorn app:app --host 0.0.0.0 --port 8000 --workers 4
# Another terminal: Test with curl
$ curl -X POST "http://localhost:8000/predict" \
-F "file=@tomato_leaf.jpg"
# Or open browser → http://localhost:8000/docs
# FastAPI auto-generates interactive Swagger UI!
Always load the model at startup (using @app.on_event("startup")), not inside the /predict function. Loading a SavedModel takes 2–10 seconds. If you load per-request, every prediction will have multi-second latency and you'll run out of memory fast.
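The load-once rule can be illustrated without TensorFlow. In this sketch, slow_load is a stand-in (an assumption) for tf.saved_model.load, and functools.lru_cache plays the role of the startup hook by caching the first result.

```python
import functools
import time

load_calls = {"count": 0}  # counts how often the expensive load runs


def slow_load():
    """Stand-in for an expensive model load (hypothetical)."""
    load_calls["count"] += 1
    time.sleep(0.05)  # pretend this is the multi-second SavedModel load
    return {"name": "fake-model"}


@functools.lru_cache(maxsize=1)
def get_model():
    # The first call pays the load cost; later calls reuse the cached object.
    return slow_load()


def handle_request(x):
    model = get_model()
    return model["name"], x * 2


for i in range(5):
    handle_request(i)

print(load_calls["count"])  # 1
```

Five requests trigger a single load. Moving the slow_load() call inside handle_request would pay the delay on every request instead.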
21.4 Docker Containerization for ML
Docker solves the "it works on my machine" problem. Your container bundles the OS, Python version, libraries, model weights, and application code into a single portable image.
Why Docker for ML?
Reproducibility: Exact same environment in development, staging, and production. No more "but it worked in my Jupyter notebook!"
Dependency Isolation: TensorFlow 2.15 needs CUDA 12.2 and cuDNN 8.9. PyTorch 2.1 needs CUDA 12.1. Docker keeps them separate.
Scalability: Kubernetes can spin up 50 replicas of your container in seconds during peak traffic. Essential for Zomato's lunch-hour recommendation surge.
Portability: Same image runs on AWS, GCP, Azure, or an on-premise server at TCS's data center in Mumbai.
Dockerfile for ML Serving (Multi-Stage Build)
# Dockerfile - Optimized for ML serving

# Stage 1: Builder (install heavy dependencies)
FROM python:3.11-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: Runtime (minimal image)
FROM python:3.11-slim

# Metadata
LABEL maintainer="mlops-team@startup.in"
LABEL version="1.0"

# Security: non-root user
RUN groupadd -r mluser && useradd -r -g mluser mluser
WORKDIR /app

# Copy installed packages from builder
COPY --from=builder /install /usr/local

# Copy application code
COPY app.py .
COPY saved_model/ ./saved_model/

# Switch to non-root user
USER mluser

# Expose port
EXPOSE 8000

# Health check (uses Python because slim images do not ship curl)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)" || exit 1

# Start server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
requirements.txt
fastapi==0.104.1
uvicorn[standard]==0.24.0
tensorflow-cpu==2.15.0 # Use CPU version to keep image small
numpy==1.26.2
Pillow==10.1.0
python-multipart==0.0.6 # Required for file uploads
Build & Run
# Build the image
$ docker build -t plant-disease-api:v1 .
# Check image size
$ docker images plant-disease-api
# REPOSITORY TAG SIZE
# plant-disease-api v1 1.2GB (with TF-CPU)
# Run the container
$ docker run -d \
--name plant-api \
-p 8000:8000 \
--memory=2g \
--cpus=2 \
plant-disease-api:v1
# Test it
$ curl -X POST http://localhost:8000/predict -F "file=@leaf.jpg"
# View logs
$ docker logs -f plant-api
Don't copy your entire training dataset into the Docker image. A common beginner mistake is COPY . . which copies training data, checkpoints, and Jupyter notebooks, ballooning the image to 20+ GB. Only copy the serving code and the final model artifact.
.dockerignore
# Don't copy training artifacts into production image
__pycache__/
*.pyc
.git/
.gitignore
data/
notebooks/
*.ipynb
checkpoints/
wandb/
mlruns/
*.h5
README.md
21.5 TensorFlow Serving: Industrial-Grade Model Serving
While FastAPI is great for lightweight deployment, TensorFlow Serving is purpose-built for serving TF models at scale. It handles model versioning, batching, and hardware acceleration natively.
TensorFlow Serving Features
Model Versioning: Place model versions in numbered directories (/1, /2, /3). TF Serving automatically detects new versions and hot-swaps to the latest one, giving zero-downtime updates.
Request Batching: With batching enabled (the --enable_batching flag), TF Serving groups incoming requests into batches to maximize GPU utilization. A GPU that processes one image at a time leaves most of its capacity idle.
gRPC & REST: Supports both gRPC (binary, fast, ~3× faster than REST) and REST endpoints out of the box.
GPU Acceleration: Natively supports NVIDIA GPUs with TensorRT optimization.
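The request batching idea can be sketched in a few lines. This is a simplified illustration, not TF Serving's implementation: the function name micro_batches is hypothetical, and it groups by size only, whereas a real server also flushes a partial batch after a timeout so that a lone request is not left waiting.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(requests: Iterable[T], max_batch_size: int) -> Iterator[List[T]]:
    """Group a stream of requests into lists of at most max_batch_size items."""
    batch: List[T] = []
    for request in requests:
        batch.append(request)
        if len(batch) == max_batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


batches = list(micro_batches(range(10), max_batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Ten requests become three model invocations instead of ten, which is where the accelerator utilization gain comes from.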
Deploy with TensorFlow Serving (Docker)
# Directory structure expected by TF Serving:
# models/
# └── plant_disease/
#     ├── 1/          <- Version 1
#     │   └── saved_model.pb + variables/
#     └── 2/          <- Version 2 (auto-detected!)
#         └── saved_model.pb + variables/
# Run TensorFlow Serving container
$ docker run -d \
--name tf-serving \
-p 8501:8501 \
-p 8500:8500 \
-v "$(pwd)/models:/models" \
-e MODEL_NAME=plant_disease \
tensorflow/serving:latest
# REST API prediction (port 8501)
$ curl -X POST http://localhost:8501/v1/models/plant_disease:predict \
-H "Content-Type: application/json" \
-d '{"instances": [[[0.1, 0.2, 0.3], ...]]}'
# Check model status
$ curl http://localhost:8501/v1/models/plant_disease
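The REST predict call expects a JSON body whose "instances" key holds one nested list per input. As a rough sketch, assuming a hypothetical helper build_predict_payload and a tiny 2×2 RGB placeholder image (the real model expects 224×224×3), the body can be built like this:

```python
import json


def build_predict_payload(images):
    """Serialize a list of images (nested lists of floats) for the predict API."""
    return json.dumps({"instances": images})


# Placeholder 2x2 RGB image: rows -> pixels -> [R, G, B] values in [0, 1].
tiny_image = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(2)]

body = build_predict_payload([tiny_image])
decoded = json.loads(body)
print(len(decoded["instances"]))  # 1 image in this batch
```

The resulting string is what goes after -d in the curl command, or into the data argument of an HTTP client.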
TF Serving Configuration for A/B Testing
# model_config.config - Serve multiple versions simultaneously
# Note: this only makes both versions available. TF Serving does not
# split traffic itself; the 90/10 routing is done by the client.
model_config_list {
  config {
    name: "plant_disease"
    base_path: "/models/plant_disease"
    model_platform: "tensorflow"
    model_version_policy {
      specific {
        versions: 1  # Champion model
        versions: 2  # Challenger model
      }
    }
  }
}

# Start TF Serving with config file
$ docker run -d \
    -p 8501:8501 \
    -v "$(pwd)/models:/models" \
    -v "$(pwd)/model_config.config:/config" \
    tensorflow/serving \
    --model_config_file=/config
A/B Traffic Splitting with Python
import random
import requests

def predict_with_ab_test(image_data, champion_pct=0.9):
    """Route traffic: 90% to v1 (champion), 10% to v2 (challenger)."""
    base_url = "http://localhost:8501/v1/models/plant_disease"
    if random.random() < champion_pct:
        version = 1  # Champion
    else:
        version = 2  # Challenger
    url = f"{base_url}/versions/{version}:predict"
    response = requests.post(url, json={"instances": [image_data]})
    result = response.json()
    result["model_version"] = version
    return result
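Random routing sends the same user to different model versions on successive requests, which can muddy an A/B comparison. One common alternative, sketched here under the assumption that each request carries a stable user ID, is to hash that ID into a bucket. The function name assign_version is hypothetical.

```python
import hashlib


def assign_version(user_id: str, champion_pct: float = 0.9) -> int:
    """Deterministically map a user to model version 1 or 2."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return 1 if bucket < champion_pct else 2


# The same user always lands on the same version.
print(assign_version("user-42") == assign_version("user-42"))  # True
```

The returned version can replace the random.random() branch in predict_with_ab_test, keeping the overall split close to 90/10 while making each user's experience consistent.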
Flipkart uses A/B testing extensively for their search ranking ML models. During the Big Billion Days sale, they run up to 15 simultaneous model variants to optimize conversion rates across different product categories and regional user segments. Their ML platform serves 100,000+ predictions per second during peak traffic.
21.6 Model Versioning & Experiment Tracking
In production, you don't have one model. You have dozens of experiments, each with different hyperparameters, data versions, and feature sets. Model versioning answers: "Which model is running? What data was it trained on? Who approved it?"
MLflow: The Open-Source Experiment Tracker
Tracking: Log parameters, metrics, artifacts, and source code for every training run.
Model Registry: Central repository with stages: None → Staging → Production → Archived.
Projects: Reproducible ML projects with MLproject files specifying environment and entry points.
Models: Deploy models directly from the registry with mlflow models serve.
MLflow Experiment Tracking
import mlflow
import mlflow.tensorflow
import tensorflow as tf
import matplotlib.pyplot as plt

# Set tracking URI (local or remote server)
mlflow.set_tracking_uri("http://mlflow-server.internal:5000")
mlflow.set_experiment("plant-disease-classifier")

# Start a training run
with mlflow.start_run(run_name="resnet50-augmented-v3"):
    # Log hyperparameters
    mlflow.log_params({
        "model_arch": "ResNet50",
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 50,
        "augmentation": "random_flip+rotate+zoom",
        "dataset_version": "plantvillage-v3-india-augmented",
        "optimizer": "Adam",
    })

    # Train the model (build_resnet50_classifier, train_ds, and val_ds
    # are assumed to be defined by your training code)
    model = build_resnet50_classifier()
    history = model.fit(train_ds, validation_data=val_ds, epochs=50)

    # Log metrics (size and latency are example values; measure your own)
    mlflow.log_metrics({
        "val_accuracy": history.history["val_accuracy"][-1],
        "val_loss": history.history["val_loss"][-1],
        "train_accuracy": history.history["accuracy"][-1],
        "model_size_mb": 98.5,
        "inference_latency_ms": 42.0,
    })

    # Log the model artifact
    mlflow.tensorflow.log_model(
        model,
        artifact_path="model",
        registered_model_name="PlantDiseaseClassifier"
    )

    # Log training curves as artifact
    fig, ax = plt.subplots()
    ax.plot(history.history["val_accuracy"])
    ax.set_title("Validation Accuracy")
    fig.savefig("training_curve.png")
    mlflow.log_artifact("training_curve.png")

print("Run logged! Check the MLflow UI at http://mlflow-server.internal:5000")
Model Registry: Promoting to Production
import mlflow.tensorflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote model version 3 to Production
client.transition_model_version_stage(
    name="PlantDiseaseClassifier",
    version=3,
    stage="Production"
)

# Archive the old version
client.transition_model_version_stage(
    name="PlantDiseaseClassifier",
    version=2,
    stage="Archived"
)

# Load the production model for serving
model_uri = "models:/PlantDiseaseClassifier/Production"
prod_model = mlflow.tensorflow.load_model(model_uri)
print("Production model v3 loaded!")
Tag every training run with the Git commit hash of your code so that each result can be traced to the exact source: mlflow.log_param("git_hash", subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()). The .decode() is needed because check_output returns bytes. Three months later, when you need to reproduce a specific result, you'll know exactly which code generated it.
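A slightly more defensive version of that one-liner is sketched below. The helper name current_git_hash and the "unknown" fallback are assumptions; the fallback covers training jobs that run outside a Git checkout or on machines without Git installed.

```python
import subprocess


def current_git_hash(default: str = "unknown") -> str:
    """Return the current commit hash as text, or `default` if unavailable."""
    try:
        raw = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
        )
    except (subprocess.CalledProcessError, OSError):
        # Not a Git repository, no commits yet, or git is not installed.
        return default
    return raw.decode("utf-8").strip()


git_hash = current_git_hash()
print(git_hash)
```

Inside a run, the value can then be logged with mlflow.log_param("git_hash", git_hash).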
21.7 Monitoring: Data Drift, Prediction Drift, Concept Drift
A model deployed without monitoring is a ticking time bomb. Real-world data changes over time: user behavior shifts, seasonal patterns emerge, economic conditions fluctuate. This phenomenon is called drift.
Three Types of Drift
1. Data Drift: The distribution of input features changes. Example: A loan default model trained on metro-city data starts receiving rural applicant data with very different income distributions.
2. Prediction Drift (Output Drift): The distribution of model predictions changes. Example: Your spam classifier suddenly flags 60% of emails as spam instead of the usual 5%. Something is wrong with inputs or the model itself.
3. Concept Drift: The relationship between inputs and outputs changes. Example: Post-COVID, Zomato's "will this user order again?" model broke because dining patterns fundamentally changed: more deliveries, fewer dine-ins, different time-of-day patterns.
$$D_{\mathrm{KS}} = \sup_x \left| F_1(x) - F_2(x) \right|$$
where $F_1$ is the training CDF and $F_2$ is the production CDF of a feature. If $D_{\mathrm{KS}}$ exceeds a threshold (e.g., 0.1), flag a data drift alert.
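To see what the KS statistic measures, it can be computed by hand from two samples. This is a from-scratch sketch for intuition (the function name ks_statistic is an assumption); in practice scipy.stats.ks_2samp does the same job and also returns a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those
    # points is enough to find the supremum.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

Here half of the first sample lies below every value in the second, so the CDFs differ by 0.5 at x = 2. That is well above the 0.1 threshold and would be flagged as drift.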
Drift Detection Implementation
import numpy as np
from scipy import stats
from datetime import datetime
import json
import logging

logger = logging.getLogger("drift_monitor")

class DriftMonitor:
    """Monitor data drift (KS test) and prediction drift (PSI)."""

    def __init__(self, reference_data: np.ndarray,
                 reference_predictions: np.ndarray,
                 feature_names: list,
                 ks_threshold: float = 0.1,
                 psi_threshold: float = 0.2):
        self.reference_data = reference_data
        self.reference_preds = reference_predictions
        self.feature_names = feature_names
        self.ks_threshold = ks_threshold
        self.psi_threshold = psi_threshold
        self.alerts = []

    def _calculate_psi(self, expected: np.ndarray,
                       actual: np.ndarray,
                       bins: int = 10) -> float:
        """Population Stability Index: measures distribution shift."""
        expected_pct, bin_edges = np.histogram(expected, bins=bins)
        actual_pct, _ = np.histogram(actual, bins=bin_edges)
        # Normalize and add small epsilon to avoid division by zero
        expected_pct = expected_pct / len(expected) + 1e-6
        actual_pct = actual_pct / len(actual) + 1e-6
        psi = np.sum(
            (actual_pct - expected_pct) * np.log(actual_pct / expected_pct)
        )
        return float(psi)

    def check_data_drift(self, production_data: np.ndarray) -> dict:
        """KS test per feature: training vs production distribution."""
        drift_report = {"timestamp": datetime.now().isoformat(),
                        "drifted_features": [], "details": {}}
        for i, feature in enumerate(self.feature_names):
            ks_stat, p_value = stats.ks_2samp(
                self.reference_data[:, i],
                production_data[:, i]
            )
            # Cast NumPy scalars to built-in types so json.dumps accepts them
            is_drifted = bool(ks_stat > self.ks_threshold)
            drift_report["details"][feature] = {
                "ks_statistic": round(float(ks_stat), 4),
                "p_value": round(float(p_value), 6),
                "is_drifted": is_drifted
            }
            if is_drifted:
                drift_report["drifted_features"].append(feature)
                logger.warning(f"DRIFT DETECTED: {feature} "
                               f"(KS={ks_stat:.4f}, p={p_value:.6f})")
        return drift_report

    def check_prediction_drift(self, prod_preds: np.ndarray) -> dict:
        """PSI on prediction distribution."""
        psi = self._calculate_psi(self.reference_preds, prod_preds)
        is_drifted = psi > self.psi_threshold
        if is_drifted:
            logger.warning(f"PREDICTION DRIFT: PSI={psi:.4f}")
        return {
            "psi": round(psi, 4),
            "is_drifted": is_drifted,
            "threshold": self.psi_threshold,
            "reference_mean": float(np.mean(self.reference_preds)),
            "production_mean": float(np.mean(prod_preds)),
        }

# --- Usage Example ---
# Suppose we have reference (training) and production data
np.random.seed(42)
ref_data = np.random.randn(1000, 5)          # Training data
prod_data = np.random.randn(500, 5) + 0.3    # Shifted production data
ref_preds = np.random.uniform(0, 1, 1000)    # Training predictions

monitor = DriftMonitor(
    reference_data=ref_data,
    reference_predictions=ref_preds,
    feature_names=["income", "age", "credit_score",
                   "loan_amount", "tenure"],
)
report = monitor.check_data_drift(prod_data)
print(json.dumps(report, indent=2))
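The DriftMonitor above covers data drift and prediction drift. Concept drift is harder because it needs ground-truth labels, which usually arrive late. One simple proxy, sketched here with a hypothetical RollingAccuracyMonitor class and assumed default thresholds, is to track accuracy over a sliding window of labeled feedback and compare it with the accuracy measured at deployment time. A drop can also be caused by data drift alone, so treat the signal as a prompt to investigate rather than a diagnosis.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Flag a sustained accuracy drop relative to a deployment-time baseline."""

    def __init__(self, baseline_accuracy, window=100, max_drop=0.05):
        self.baseline = baseline_accuracy
        self.max_drop = max_drop
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = wrong

    def update(self, predicted, actual):
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_drifting(self):
        # Wait for a full window so a few early mistakes do not raise an alert.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return (self.baseline - self.accuracy()) > self.max_drop


rolling = RollingAccuracyMonitor(baseline_accuracy=0.9, window=10)
for _ in range(10):
    rolling.update("Healthy", "Healthy")       # model still correct
print(rolling.is_drifting())                   # False

for i in range(10):
    actual = "Healthy" if i % 2 == 0 else "Late Blight"
    rolling.update("Healthy", actual)          # now wrong half the time
print(rolling.is_drifting())                   # True
```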
At Jio, the content recommendation model experiences severe concept drift during IPL season every year. User engagement patterns shift dramatically: watch time increases 3×, genre preferences shift from movies to sports, and time-of-day patterns change completely. Their MLOps pipeline now includes seasonal retraining triggers that automatically detect and respond to IPL-period drift.
21.8 Edge Deployment: TFLite for Android in India
India has 750+ million smartphone users, but large portions still rely on 2G/3G connectivity. Sending images to a cloud API requires stable bandwidth and adds latency. Edge deployment puts the model on the phone itself, so predictions happen offline.
TFLite: TensorFlow for Mobile
Quantization: Reduce model weights from Float32 (4 bytes/weight) to Int8 (1 byte/weight). Cuts size by about 4× and speeds up inference on ARM CPUs.
Model Size: A ResNet50 model goes from ~98 MB to ~25 MB after quantization. Fits comfortably on a ₹8,000 entry-level Android phone.
Offline Inference: A farmer in rural Bihar can photograph a diseased tomato leaf and get a prediction without internet connectivity.
Supported Operations: Not all TF ops are supported in TFLite. Use tf.lite.OpsSet.TFLITE_BUILTINS for maximum compatibility; add SELECT_TF_OPS for unsupported ones (increases binary size).
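The Float32-to-Int8 step is ordinary arithmetic: each real value is mapped to an 8-bit integer through a scale and a zero point. The sketch below shows that affine mapping for a single tensor range. The function names are illustrative, and the real TFLite converter chooses ranges per tensor (or per channel) from calibration data rather than from fixed bounds.

```python
def quant_params(x_min, x_max, q_min=0, q_max=255):
    """Derive scale and zero point so that [x_min, x_max] maps onto [q_min, q_max]."""
    x_min = min(x_min, 0.0)  # the range must contain 0 so that 0.0 is exact
    x_max = max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = int(round(q_min - x_min / scale))
    return scale, zero_point


def quantize(x, scale, zero_point, q_min=0, q_max=255):
    q = int(round(x / scale)) + zero_point
    return max(q_min, min(q_max, q))  # clamp to the 8-bit range


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


scale, zero_point = quant_params(-1.0, 1.0)
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
restored = [dequantize(quantize(w, scale, zero_point), scale, zero_point)
            for w in weights]
for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Each restored value differs from the original by at most about one quantization step (the scale), which is the source of the small accuracy loss that quantized models typically show.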
Convert to TFLite with Quantization
import tensorflow as tf
import numpy as np

# Load the trained Keras model
model = tf.keras.models.load_model("plant_disease_model.h5")

# --- Method 1: Dynamic Range Quantization (simplest) ---
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open("plant_disease_dynamic.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Dynamic quant: {len(tflite_model) / 1e6:.1f} MB")

# --- Method 2: Full Integer Quantization (smallest) ---
def representative_dataset():
    """Provide ~100 samples for calibration.

    Random data is only a placeholder here; use real, preprocessed
    training images so the calibrated ranges match production inputs.
    """
    for _ in range(100):
        sample = np.random.rand(1, 224, 224, 3).astype(np.float32)
        yield [sample]

converter2 = tf.lite.TFLiteConverter.from_keras_model(model)
converter2.optimizations = [tf.lite.Optimize.DEFAULT]
converter2.representative_dataset = representative_dataset
converter2.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter2.inference_input_type = tf.uint8
converter2.inference_output_type = tf.uint8
tflite_int8 = converter2.convert()
with open("plant_disease_int8.tflite", "wb") as f:
    f.write(tflite_int8)
print(f"Int8 quant: {len(tflite_int8) / 1e6:.1f} MB")
Run TFLite Inference (Python, mimics Android behavior)
# TFLite inference: a similar Interpreter API is available in Android Java/Kotlin
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="plant_disease_int8.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare input
img = tf.io.read_file("test_leaf.jpg")
img = tf.image.decode_jpeg(img, channels=3)
img = tf.image.resize(img, [224, 224])
img = tf.cast(img, tf.uint8)  # Int8 model expects uint8 input
input_data = tf.expand_dims(img, 0)

# Run inference
interpreter.set_tensor(input_details[0]["index"], input_data.numpy())
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]["index"])

CLASS_NAMES = ["Healthy", "Early Blight", "Late Blight",
               "Bacterial Spot", "Yellow Leaf Curl"]
predicted = CLASS_NAMES[np.argmax(output)]
print(f"Prediction: {predicted}")
| Model Variant | Size | Latency (Pixel 6) | Accuracy |
|---|---|---|---|
| Original (Float32) | 98 MB | ~180 ms | 94.2% |
| Dynamic Range Quant | 25 MB | ~65 ms | 93.8% |
| Full Int8 Quant | 25 MB | ~45 ms | 93.1% |
| Float16 Quant | 49 MB | ~90 ms (GPU) | 94.1% |
Microsoft's Kaizala team (now integrated into Teams) deployed TFLite models for language detection on Indian phones to route messages in the correct language (Hindi, Tamil, Telugu, etc.) across their 100+ million user base, all running locally on devices as low-end as ₹5,000 feature phones.
From-Scratch Code: Minimal Model Server in Pure Python
Before using FastAPI, let's build a barebones HTTP model server using only the http.server module from Python's standard library, plus NumPy for the fake model. This helps you understand what frameworks like FastAPI abstract away.
# minimal_server.py - HTTP model server with no web framework
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
import numpy as np
import time

# --- Simulate a "model" with a simple function ---
class SimpleModel:
    """Simulates a trained model for demonstration."""

    def __init__(self):
        # In production, this would load TF/PyTorch model
        self.classes = ["Healthy", "Early Blight", "Late Blight"]
        self.weights = np.random.randn(10, 3)  # Fake weights
        print("Model loaded")

    def predict(self, features: list) -> dict:
        """Run inference on input features (expects 10 numbers)."""
        x = np.array(features[:10]).reshape(1, -1)
        logits = x @ self.weights
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax
        idx = int(np.argmax(probs))
        return {
            "class": self.classes[idx],
            "confidence": round(float(probs[0][idx]), 4),
        }

# Load model ONCE at startup
model = SimpleModel()

# --- HTTP Request Handler ---
class MLHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            self._respond(200, {"status": "healthy"})
        else:
            self._respond(404, {"error": "Not found"})

    def do_POST(self):
        if self.path == "/predict":
            # Read request body
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length).decode("utf-8")
            try:
                data = json.loads(body)
                features = data.get("features", [])
                if not features:
                    self._respond(400, {"error": "Missing 'features'"})
                    return
                # Run inference with timing
                start = time.perf_counter()
                result = model.predict(features)
                result["latency_ms"] = round(
                    (time.perf_counter() - start) * 1000, 2
                )
                self._respond(200, result)
            except json.JSONDecodeError:
                self._respond(400, {"error": "Invalid JSON"})
            except Exception as e:
                self._respond(500, {"error": str(e)})
        else:
            self._respond(404, {"error": "Not found"})

    def _respond(self, status: int, data: dict):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(data).encode("utf-8"))

    def log_message(self, format, *args):
        # Custom logging with timestamp
        print(f"[{time.strftime('%H:%M:%S')}] {args[0]}")

# --- Start Server ---
if __name__ == "__main__":
    server = HTTPServer(("", 8000), MLHandler)
    print("ML Server running on http://localhost:8000")
    print("  POST /predict - Send features for prediction")
    print("  GET  /health  - Health check")
    server.serve_forever()
This server, well under 100 lines, demonstrates the core pattern: load once, serve many. FastAPI adds validation, async, OpenAPI docs, and middleware, but the fundamental architecture is identical.
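To make "load once, serve many" concrete, here is a minimal sketch that starts a tiny server in a background thread, sends it several requests, and counts how often the model was constructed. The names `CountingModel`, `call_predict`, and `run_demo` are hypothetical helpers for this illustration, and the "model" simply sums its inputs.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

LOAD_COUNT = 0  # how many times the stand-in model was constructed


class CountingModel:
    """Stand-in model that records each load; a real one would read weights from disk."""

    def __init__(self):
        global LOAD_COUNT
        LOAD_COUNT += 1

    def predict(self, features):
        return {"sum": sum(features)}


MODEL = CountingModel()  # constructed ONCE, when the module is loaded


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))["features"]
        body = json.dumps(MODEL.predict(features)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def call_predict(port, features):
    """POST a feature list to the local server and decode the JSON reply."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(
            "POST",
            "/predict",
            body=json.dumps({"features": features}),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


def run_demo(n_requests=3):
    """Serve n_requests from one process and report how often the model was loaded."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = let the OS pick a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        results = [call_predict(port, [i, i + 1]) for i in range(n_requests)]
    finally:
        server.shutdown()
        server.server_close()
    return results, LOAD_COUNT
```

Calling `run_demo(3)` serves three requests while the load counter stays at one, which is the property every serving framework in this chapter preserves.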
Industry Code: Full Deployment Pipeline
Let's take the plant disease model from Chapter 17 and build the complete production pipeline: FastAPI server → Dockerfile → Cloud Run deployment → Streamlit UI.
5A. Deploy to Google Cloud Run
#!/bin/bash
# deploy.sh: Full deployment script for Google Cloud Run
set -e
# Configuration
PROJECT_ID="my-ml-project-india"
REGION="asia-south1"  # Mumbai region (low latency for users in India)
SERVICE_NAME="plant-disease-api"
IMAGE_NAME="gcr.io/${PROJECT_ID}/${SERVICE_NAME}"
# Step 1: Build Docker image
echo "🔨 Building Docker image..."
docker build -t "${IMAGE_NAME}:latest" .
# Step 2: Push to Google Container Registry
echo "📤 Pushing to GCR..."
docker push "${IMAGE_NAME}:latest"
# Step 3: Deploy to Cloud Run
# Comments cannot follow a trailing backslash, so the flags are explained here:
#   --min-instances 1        keeps 1 instance warm (avoids cold starts)
#   --max-instances 10       auto-scales up to 10 instances
#   --allow-unauthenticated  makes the API public (use IAM for production)
echo "Deploying to Cloud Run..."
gcloud run deploy "${SERVICE_NAME}" \
    --image "${IMAGE_NAME}:latest" \
    --platform managed \
    --region "${REGION}" \
    --memory 2Gi \
    --cpu 2 \
    --timeout 60 \
    --concurrency 80 \
    --min-instances 1 \
    --max-instances 10 \
    --allow-unauthenticated
# Step 4: Get the URL
URL=$(gcloud run services describe "${SERVICE_NAME}" \
    --region "${REGION}" --format "value(status.url)")
echo "✅ Deployed! API URL: ${URL}"
echo "  Health: ${URL}/health"
echo "  Docs: ${URL}/docs"
echo "  Predict: curl -X POST ${URL}/predict -F 'file=@leaf.jpg'"
5B. Streamlit UI for Stakeholders
# streamlit_app.py: User-friendly frontend for model predictions
import streamlit as st
import requests
from PIL import Image

# --- Page Config ---
st.set_page_config(
    page_title="🌿 Plant Disease Detector",
    page_icon="🌿",
    layout="centered"
)
# API endpoint (Cloud Run URL or local)
API_URL = st.sidebar.text_input(
    "API URL",
    value="http://localhost:8000"
)
# --- Main UI ---
st.title("🌿 Plant Disease Detector")
st.markdown("""
Upload a photo of a plant leaf to detect diseases.
Built for Indian farmers; works with tomato, potato, and pepper leaves.
""")
# --- Camera or Upload ---
tab1, tab2 = st.tabs(["📷 Camera", "Upload"])
with tab1:
    camera_image = st.camera_input("Take a photo of the leaf")
with tab2:
    uploaded_file = st.file_uploader(
        "Choose a leaf image",
        type=["jpg", "jpeg", "png"]
    )
# Prefer an uploaded file; otherwise fall back to the camera capture
image_source = uploaded_file if uploaded_file is not None else camera_image
# --- Predict ---
if image_source is not None:
    # Display the image
    img = Image.open(image_source)
    st.image(img, caption="Uploaded Leaf", use_column_width=True)
    if st.button("Analyze Leaf", type="primary"):
        with st.spinner("Analyzing..."):
            # Send to API (with a timeout so the UI never hangs indefinitely)
            image_source.seek(0)
            files = {"file": ("leaf.jpg", image_source, "image/jpeg")}
            try:
                response = requests.post(
                    f"{API_URL}/predict", files=files, timeout=30
                )
            except requests.RequestException as exc:
                st.error(f"Could not reach the API: {exc}")
                st.stop()
        if response.status_code == 200:
            result = response.json()
            # Display results
            col1, col2 = st.columns(2)
            with col1:
                st.metric("Prediction", result["predicted_class"])
            with col2:
                st.metric("Confidence",
                          f"{result['confidence']*100:.1f}%")
            # Probability bar chart
            st.bar_chart(result["all_probabilities"])
            # Inference time
            st.caption(
                f"⚡ Inference: {result['inference_time_ms']:.1f} ms"
            )
            # Treatment recommendations for Indian farmers
            if result["predicted_class"] != "Healthy":
                st.warning(f"⚠️ Disease detected: "
                           f"{result['predicted_class']}")
                st.info("Consult your local Krishi Vigyan "
                        "Kendra (KVK) for treatment options. "
                        "Call Kisan Call Centre: 1800-180-1551")
            else:
                st.success("✅ Your plant looks healthy!")
        else:
            st.error(f"API Error: {response.text}")
# --- Sidebar Info ---
st.sidebar.markdown("---")
st.sidebar.markdown("### ℹ️ About")
st.sidebar.markdown("""
- **Model**: ResNet50 (Chapter 17)
- **Dataset**: PlantVillage + India crops
- **Accuracy**: 94.2%
- **Supported**: Tomato, Potato, Pepper
- **Cost**: ₹0 (open-source)
""")
Run the Streamlit App
# Install Streamlit
$ pip install streamlit
# Run the app
$ streamlit run streamlit_app.py --server.port 8501
# Opens browser at http://localhost:8501
5C. GitHub Actions CI/CD Pipeline
# .github/workflows/deploy.yml: Auto-deploy on push to main
name: Deploy Plant Disease API
on:
  push:
    branches: [main]
    paths:
      - 'app.py'
      - 'Dockerfile'
      - 'requirements.txt'
      - 'saved_model/**'
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: pip install pytest httpx
      - run: pytest tests/ -v
  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - uses: google-github-actions/setup-gcloud@v2
      - run: gcloud auth configure-docker
      - run: |
          docker build -t gcr.io/${{ secrets.GCP_PROJECT }}/plant-disease-api .
          docker push gcr.io/${{ secrets.GCP_PROJECT }}/plant-disease-api
      - run: |
          gcloud run deploy plant-disease-api \
            --image gcr.io/${{ secrets.GCP_PROJECT }}/plant-disease-api \
            --region asia-south1 \
            --platform managed
Always run tests before deployment. Write a test_app.py using httpx.AsyncClient to test your FastAPI endpoints. A broken deployment at 2 AM costs your team sleep and your company revenue. At Zomato, every ML model deployment requires passing 3 stages: unit tests, integration tests, and shadow traffic comparison against the current production model.
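The shadow-traffic stage mentioned above can be sketched in a few lines: the same requests go through both models, only the champion's answer is served, and the challenger's answers are compared offline. This is a simplified illustration, not any company's actual pipeline; `shadow_compare`, the 95% agreement gate, and the toy threshold models are all assumptions made for the example.

```python
def shadow_compare(requests, champion, challenger, min_agreement=0.95):
    """Replay requests through both models and summarize how often they agree.

    Only the champion's output would reach users; the challenger runs in the shadow.
    """
    disagreements = []
    for req in requests:
        served = champion(req)     # what the user actually receives
        shadow = challenger(req)   # logged for comparison, never served
        if served != shadow:
            disagreements.append((req, served, shadow))
    total = len(requests)
    agreement = 1.0 if total == 0 else 1 - len(disagreements) / total
    return {
        "agreement": agreement,
        "passes_gate": agreement >= min_agreement,
        "disagreements": disagreements,
    }


# Toy models: each maps a single "disease score" to a label (hypothetical thresholds)
def champion_model(score):
    return "Healthy" if score < 0.5 else "Late Blight"


def challenger_model(score):
    return "Healthy" if score < 0.6 else "Late Blight"


report = shadow_compare([0.1, 0.2, 0.55, 0.9], champion_model, challenger_model)
print(report["agreement"], report["passes_gate"])
```

Low agreement is not automatically bad (the challenger may be right), but the disagreement list tells you exactly which inputs to review before promoting the new model.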
Visual Diagrams
6A. End-to-End MLOps Architecture
6B. Model Serving Comparison
6C. Drift Detection Flow
Worked Example: From Notebook to Production in 7 Steps
Let's trace the complete journey of deploying the plant disease classifier for a Kisan (farmer) app targeting 10,000 daily users across Uttar Pradesh and Maharashtra.
Problem Statement
AgriTech startup "KhetScan" (fictional) wants to deploy the Chapter 17 plant disease model as a mobile-first service. Requirements:
- Handle 10,000 daily predictions (peak: 100/minute during morning farm visits)
- Latency < 500 ms per prediction
- Support offline mode for low-connectivity areas (2G regions)
- Cost budget: ₹15,000/month for cloud infrastructure
- Auto-retrain monthly on new labeled images from the field
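Before choosing infrastructure, it helps to turn these requirements into request rates. The short calculation below uses only the numbers stated above; applying Little's law (requests in flight = arrival rate × time in system) with the 500 ms budget as a worst case is an assumption of this sketch.

```python
DAILY_PREDICTIONS = 10_000
PEAK_PER_MINUTE = 100
LATENCY_BUDGET_S = 0.5  # the "< 500 ms per prediction" requirement

avg_rps = DAILY_PREDICTIONS / (24 * 60 * 60)
peak_rps = PEAK_PER_MINUTE / 60
# Little's law: average requests in flight = arrival rate x time per request
peak_in_flight = peak_rps * LATENCY_BUDGET_S

print(f"Average load: {avg_rps:.3f} req/s")               # 0.116 req/s
print(f"Peak load:    {peak_rps:.2f} req/s")               # 1.67 req/s
print(f"In flight at peak (worst case): {peak_in_flight:.2f}")  # 0.83
```

Even at peak, fewer than one request is in flight on average, so a single warm instance is likely enough for this workload; the scaling limit exists for bursts, not for steady-state load.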
Step 1: Export the Trained Model
# We trained this in Chapter 17; now export it for production
import os

import tensorflow as tf

model = tf.keras.models.load_model("plant_disease_ch17.h5")
print(f"Model params: {model.count_params():,}")  # 23,587,716
print(f"Model size: {os.path.getsize('plant_disease_ch17.h5')/1e6:.1f} MB")  # 94.3 MB
# Save as SavedModel for TF Serving
tf.saved_model.save(model, "./models/plant_disease/1")
# Also create TFLite for offline mobile
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite = converter.convert()
with open("plant_disease.tflite", "wb") as f:
    f.write(tflite)
print(f"TFLite size: {len(tflite)/1e6:.1f} MB")  # 24.1 MB
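The file sizes in the export script follow almost directly from the parameter count. As a rough check, assuming 4 bytes per Float32 weight and 1 byte per Int8 weight (and ignoring file metadata), the arithmetic looks like this:

```python
PARAMS = 23_587_716  # parameter count printed by the export script


def size_mb(n_params, bytes_per_param):
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return n_params * bytes_per_param / 1e6


float32_mb = size_mb(PARAMS, 4)
int8_mb = size_mb(PARAMS, 1)

print(f"Float32 weights: {float32_mb:.1f} MB")        # 94.4 MB
print(f"Int8 weights:    {int8_mb:.1f} MB")           # 23.6 MB
print(f"Reduction:       {float32_mb / int8_mb:.0f}x")  # 4x
```

The estimates land close to the sizes reported in the script's comments (94.3 MB and 24.1 MB). The small gap on the TFLite side is presumably due to metadata and tensors that stay in higher precision.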
Step 2: Build FastAPI Server (from Section 21.3)
Use the complete app.py from Section 21.3. Add request logging for monitoring:
# Add to app.py: log every prediction for monitoring
import csv
from datetime import datetime

def log_prediction(filename, predicted_class, confidence, latency):
    """Append prediction to CSV log for drift monitoring."""
    with open("predictions_log.csv", "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            datetime.now().isoformat(),
            filename,
            predicted_class,
            confidence,
            latency
        ])
Step 3: Dockerize (from Section 21.4)
Use the Dockerfile from Section 21.4. Build and test locally.
Step 4: Deploy to Cloud Run (Mumbai region)
# asia-south1 = Mumbai (low latency for Indian users)
$ gcloud run deploy khetscan-api \
    --image gcr.io/khetscan/plant-api:v1 \
    --region asia-south1 \
    --memory 2Gi --cpu 2 \
    --min-instances 1 --max-instances 5 \
    --allow-unauthenticated
# Estimated cost: ~₹8,000/month for this config
Step 5: Cost Estimation (INR)
| Component | Monthly Cost (₹) | Notes |
|---|---|---|
| Cloud Run (2 vCPU, 2 GB) | ₹6,500 | min-instances=1, auto-scales to 5 |
| Container Registry | ₹200 | 1.2 GB image storage |
| Cloud Logging | ₹500 | First 50 GB free |
| MLflow Server (e2-small) | ₹1,800 | Experiment tracking |
| Cloud Storage (model artifacts) | ₹100 | ~5 GB |
| Total | ₹9,100/month | Within ₹15,000 budget ✓ |
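The table's total and the remaining budget can be verified with a few lines. The per-prediction figure additionally assumes a 30-day month at the stated 10,000 predictions per day, which is an assumption of this sketch rather than a number from the table.

```python
BUDGET_INR = 15_000
monthly_costs_inr = {
    "Cloud Run (2 vCPU, 2 GB)": 6_500,
    "Container Registry": 200,
    "Cloud Logging": 500,
    "MLflow Server (e2-small)": 1_800,
    "Cloud Storage (model artifacts)": 100,
}

total_inr = sum(monthly_costs_inr.values())
headroom_inr = BUDGET_INR - total_inr
# Assumption: 10,000 predictions/day over a 30-day month
cost_per_prediction = total_inr / (10_000 * 30)

print(f"Total:    Rs {total_inr:,}/month")      # Rs 9,100/month
print(f"Headroom: Rs {headroom_inr:,}/month")   # Rs 5,900/month
print(f"Cost per prediction: Rs {cost_per_prediction:.3f}")  # Rs 0.030
```

Cloud Run dominates the bill at roughly 70% of the total, so that is the line item to revisit first if traffic assumptions change.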
Step 6: Set Up Monitoring
# Weekly drift check cron job
# Run every Monday at 6 AM IST
# crontab: 0 6 * * 1 python check_drift.py
import pandas as pd

# Load prediction logs and keep only the past week
logs = pd.read_csv("predictions_log.csv",
                   names=["timestamp", "file", "class",
                          "confidence", "latency"],
                   parse_dates=["timestamp"])
week_ago = pd.Timestamp.now() - pd.Timedelta(days=7)
logs = logs[logs["timestamp"] >= week_ago]
# Check 1: Prediction distribution shift
class_dist = logs["class"].value_counts(normalize=True)
print("Prediction distribution this week:")
print(class_dist)
# Expected: Healthy ~60%, Early Blight ~20%, Late Blight ~15%, ...
# If Healthy drops below 40%, something may have changed
# Check 2: Confidence decay
avg_confidence = logs["confidence"].mean()
print(f"Average confidence: {avg_confidence:.3f}")
if avg_confidence < 0.7:
    print("⚠️ ALERT: Confidence dropping, possible data drift!")
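Eyeballing the class distribution works for a first check, but a single number is easier to alert on. One common choice is the Population Stability Index, PSI = Σ (actual − expected) · ln(actual / expected), summed over classes. The sketch below applies it to the expected distribution quoted in the comments; the 5% "Other" share, the example week, and the rule-of-thumb thresholds (below 0.1 stable, above roughly 0.2 worth investigating) are assumptions for illustration.

```python
import math


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two categorical distributions.

    Both arguments map class name -> share of predictions (shares sum to 1).
    """
    score = 0.0
    for key in set(expected) | set(actual):
        e = max(expected.get(key, 0.0), eps)  # eps avoids log(0) for unseen classes
        a = max(actual.get(key, 0.0), eps)
        score += (a - e) * math.log(a / e)
    return score


# Baseline from training time; "Other" fills the remainder (assumed)
baseline = {"Healthy": 0.60, "Early Blight": 0.20, "Late Blight": 0.15, "Other": 0.05}
# Hypothetical week in which Healthy predictions dropped to 38%
this_week = {"Healthy": 0.38, "Early Blight": 0.30, "Late Blight": 0.27, "Other": 0.05}

score = psi(baseline, this_week)
print(f"PSI = {score:.3f}")
if score > 0.2:
    print("Prediction drift is large enough to investigate")
```

In the weekly script, `this_week` would come from `class_dist.to_dict()`; a rising PSI may reflect a real disease outbreak rather than a model problem, so treat it as a prompt to inspect samples, not as proof of failure.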
Step 7: TFLite Integration for Offline Mode
The TFLite model (24 MB) is bundled with the Android app. When the farmer has no internet, inference runs locally on the phone. When connectivity is available, predictions are sent to the cloud API for logging and retraining data collection.
Online (Cloud API): Latency ~200ms, always latest model, requires internet
Offline (TFLite): Latency ~60ms, bundled model version, no internet needed
Hybrid: Try cloud first, fall back to TFLite (best of both worlds)
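The hybrid strategy is a plain try-then-fall-back pattern. The actual app would implement it on Android; the Python sketch below only shows the control flow, with `cloud_predict` and `local_predict` as hypothetical callables standing in for the HTTP call and the TFLite interpreter.

```python
def predict_hybrid(image_bytes, cloud_predict, local_predict):
    """Try the cloud API first; fall back to on-device inference if the network fails."""
    try:
        result = cloud_predict(image_bytes)
        return {**result, "source": "cloud"}
    except OSError:
        # OSError covers ConnectionError and TimeoutError in Python 3
        result = local_predict(image_bytes)
        # Flag the prediction so it can be uploaded once connectivity returns
        return {**result, "source": "tflite", "needs_sync": True}


# Stand-ins for the real backends (hypothetical)
def cloud_ok(image_bytes):
    return {"class": "Healthy"}


def cloud_down(image_bytes):
    raise ConnectionError("no network")


def tflite_local(image_bytes):
    return {"class": "Early Blight"}


online = predict_hybrid(b"leaf", cloud_ok, tflite_local)
offline = predict_hybrid(b"leaf", cloud_down, tflite_local)
print(online["source"], offline["source"])
```

In practice the cloud call also needs a short timeout, otherwise a farmer on a weak 2G link waits for the full network timeout before the local model ever runs.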
Case Study: Infosys ML Platform & MLflow
🏢 Infosys Nia: Enterprise AI Platform for Indian Enterprises
Background
Infosys, India's second-largest IT services company (revenue: ₹1.5 lakh crore/year, 300,000+ employees), needed to standardize ML deployment across 200+ client engagements spanning banking, telecom, manufacturing, and retail.
The Challenge
Before their ML platform (Infosys Nia), each project team built deployment pipelines from scratch:
- Team A used Flask + pickle on AWS EC2
- Team B used Django + ONNX on Azure
- Team C used custom gRPC on-premise servers
- No experiment tracking: "Which model version is in production?" had no answer
- Average time from model-ready to production: 4–6 months
MLOps Solution
Infosys built a centralized ML platform with these components:
| Component | Tool | Purpose |
|---|---|---|
| Experiment Tracking | MLflow | Log all training runs, parameters, metrics |
| Model Registry | MLflow + Custom | Version control with approval workflows |
| Feature Store | Feast | Centralized, reusable feature pipelines |
| Serving | TF Serving + Triton | Unified serving with A/B testing |
| Monitoring | Prometheus + Grafana | Drift detection, latency, throughput |
| Orchestration | Kubeflow Pipelines | End-to-end ML pipeline automation |
| CI/CD | Jenkins + GitHub Actions | Automated testing and deployment |
Results
- Deployment time reduced from 4–6 months to 2–3 weeks
- Model reproducibility: 100% of experiments tracked and reproducible
- Cost savings: ~₹2 crore/year in reduced engineering overhead
- Standardization: 50+ projects using the same platform within 1 year
- Client satisfaction: 35% improvement in project delivery timelines
Key Lesson
The biggest ROI came not from fancy algorithms but from infrastructure standardization. Having a single, well-documented deployment path meant any data scientist could deploy a model in 2 weeks instead of 4 months. The platform paid for itself in 6 months.
TCS iON uses a similar platform for their education technology products. Their exam proctoring ML model (detecting cheating in online exams) is deployed across 50+ exam centers in India using Kubernetes with edge inference: each center runs a local TFLite model for real-time face detection, while the heavier behavioral analysis model runs on their Mumbai data center. This hybrid approach handles the connectivity challenges of tier-2 and tier-3 city exam centers.
Common Mistakes & Misconceptions
Mistake #1: "My model works in Jupyter, so it's production-ready."
Jupyter notebooks hide stateful bugs: cells run out of order, global variables leak between cells, and pip install in a notebook doesn't guarantee the same environment in production. Always refactor notebook code into proper Python modules with explicit dependencies.
Mistake #2: Loading the model inside the prediction function.
def predict(data): model = tf.keras.models.load_model("model.h5"); return model.predict(data). This loads the model on every single request. A 100 MB model can take 3–5 seconds to load, so your API becomes unusably slow. Load once at startup, predict many times.
Mistake #3: No input validation on the /predict endpoint.
Production APIs receive garbage: corrupt images, wrong formats, adversarial inputs, and request payloads 100× larger than expected. Without Pydantic validation and size limits, your server can crash or produce meaningless results. Always validate before inference.
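For image uploads, two cheap checks catch much of this garbage before the model ever sees it: a size cap and a look at the file's leading "magic" bytes. The sketch below uses only the standard library; the 5 MB limit and the function name `validate_upload` are assumptions for illustration.

```python
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # hypothetical 5 MB cap; tune for your clients

# Leading bytes that identify the accepted formats
MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
}


def validate_upload(data: bytes) -> str:
    """Return the detected image type, or raise ValueError for unusable input."""
    if not data:
        raise ValueError("Empty upload")
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("Upload exceeds size limit")
    for magic, kind in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return kind
    raise ValueError("Unsupported image format")


print(validate_upload(b"\xff\xd8\xff\xe0" + b"\x00" * 16))  # jpeg
```

A header check only confirms what the file claims to be. A truncated or corrupt image can still pass, so the decode step (for example with Pillow) should be wrapped in its own error handling that returns a 400 rather than a 500.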
Mistake #4: "Deploy and forget" with no monitoring.
A model deployed without monitoring will eventually degrade without anyone noticing. Real-world data drifts: seasonal changes, user behavior shifts, upstream data pipeline breaks. If you don't monitor, you'll only discover your model is broken when a customer complains (or worse, when revenue drops).
Mistake #5: Using model.predict() instead of model() in TensorFlow.
In TensorFlow 2.x, model.predict() has significant overhead for single-sample inference because it is designed for batch processing of large inputs and sets up a batched data pipeline on every call. For serving, call the model directly with model(input_tensor, training=False); for single inputs this can be 10–50× faster.
Mistake #6: Ignoring the ₹ cost of GPU instances for serving.
GPU instances are expensive to keep running around the clock: even an older p2.xlarge on AWS costs roughly ₹1,800/day at US on-demand rates (about $0.90/hour), and newer GPU instances cost several times more. Most inference workloads don't need GPUs; a well-optimized CPU model with batching handles 100+ requests/second. Use GPUs for training, CPUs for serving (unless you have extreme latency requirements).
Comparison Table โ MLOps Tools
10A. MLflow vs W&B vs DVC vs Kubeflow
| Feature | MLflow | Weights & Biases | DVC | Kubeflow |
|---|---|---|---|---|
| Primary Use | Experiment tracking + registry | Experiment tracking + visualization | Data & model versioning | End-to-end ML pipelines |
| Open Source | Yes (fully OSS) | Partial (free tier, paid pro) | Yes (fully OSS) | Yes (fully OSS) |
| Setup Complexity | Low (pip install) | Very Low (SaaS) | Low (pip install) | High (Kubernetes required) |
| Experiment Tracking | Good | Excellent (best UI) | No (not primary focus) | Partial (basic) |
| Model Registry | Built-in | Model registry | Git-based only | Partial (via add-ons) |
| Data Versioning | Partial (basic artifacts) | Partial (artifacts only) | Yes (primary feature) | No (not built-in) |
| Pipeline Orchestration | Partial (MLflow Projects) | No | Partial (DVC pipelines) | Yes (Kubeflow Pipelines) |
| Serving | Yes (mlflow models serve) | No | No | Yes (KServe, formerly KFServing) |
| Team Size | 2–20 (startups, mid) | 1–50 (any size) | 2–10 (ML engineers) | 10–100 (enterprise) |
| Cost (10-person team) | ₹0 (self-hosted) | ~₹50,000/mo (Teams) | ₹0 (self-hosted) | ₹0 (+ infra costs) |
| Best For | All-round MLOps starter | Research teams, visualization | Data-heavy ML projects | Enterprise K8s shops |
10B. Serving Options Compared
| Option | Latency | Throughput | Scale | Complexity | Cost (₹/mo) |
|---|---|---|---|---|---|
| FastAPI + Uvicorn | ~50ms | 500–3K req/s | Single server / Cloud Run | Low | ₹2,000–₹10,000 |
| TF Serving | ~20ms | 5K–50K req/s | Docker / Kubernetes | Medium | ₹5,000–₹50,000 |
| Triton Inference Server | ~10ms | 10K–100K req/s | Multi-GPU, multi-model | High | ₹50,000+ |
| AWS SageMaker Endpoint | ~100ms | 1K–10K req/s | Fully managed | Low | ₹15,000–₹1,00,000 |
| TFLite (On-device) | ~30ms | N/A (local) | Per-device | Medium | ₹0 |
Start with MLflow + FastAPI + Cloud Run. This combination gives you experiment tracking, model registry, production serving, and auto-scaling for under ₹10,000/month. Only move to Kubeflow + Triton when you have 10+ models in production and a dedicated ML platform team.
Exercises
Section A: Multiple-Choice Questions (10)
Which of the following is the MOST common reason ML models fail to reach production?
- Low accuracy on test set
- Infrastructure and deployment challenges
- Insufficient training data
- GPU unavailability
What is the primary advantage of ONNX over TensorFlow SavedModel?
- Faster inference speed
- Smaller file size
- Framework-agnostic interoperability
- Built-in model versioning
In FastAPI, why should you load the ML model in @app.on_event("startup") rather than inside the /predict endpoint?
- FastAPI doesn't support loading models inside endpoints
- To avoid loading the model on every request (which adds seconds of latency)
- Because Python garbage collection deletes the model after each request
- For security reasons: endpoints can't access the filesystem
What does a multi-stage Docker build accomplish for ML serving?
- Runs multiple models simultaneously
- Reduces final image size by separating build and runtime dependencies
- Enables GPU access inside containers
- Automatically scales the number of containers
Which type of drift occurs when the relationship between input features and the target variable changes over time?
- Data drift (covariate shift)
- Prediction drift (output drift)
- Concept drift
- Feature drift
TensorFlow Serving's model versioning stores models in numbered directories (/1, /2, /3). What happens when you add a new directory /4?
- Nothing; you must restart the server
- TF Serving automatically detects and loads the new version
- The old versions are deleted
- TF Serving crashes and needs manual intervention
You quantize a model from Float32 to Int8. What is the approximate reduction in model size?
- 2×
- 4×
- 8×
- 16×
In A/B testing for ML models, a "champion-challenger" setup means:
- Two models compete to be the fastest
- The current production model (champion) receives most traffic while a new model (challenger) gets a small percentage
- Both models are trained simultaneously on different data
- The model with higher accuracy automatically replaces the other
Which MLOps tool would you recommend for a 3-person startup that needs experiment tracking and model registry at zero cost?
- Kubeflow (requires Kubernetes cluster)
- Weights & Biases (paid Teams plan)
- MLflow (open-source, self-hosted)
- AWS SageMaker (managed, expensive)
MLflow installs with pip install mlflow and provides experiment tracking + model registry at zero licensing cost. Kubeflow requires Kubernetes (too complex for 3 people), W&B Teams is paid, and SageMaker costs ₹15,000+/month.
Why is pickle NOT recommended for production model serialization?
- Pickle files are too large
- Pickle doesn't support neural networks
- Pickle is Python-version dependent, insecure, and has no graph optimization
- Pickle only works on Windows
Section B: Short Answer Questions (5)
Explain the difference between data drift and concept drift with an example from an Indian e-commerce company like Flipkart.
Concept drift: The relationship between inputs and outputs changes. Example: During Diwali season, the same user features (age, location, past purchases) map to completely different purchase behaviors, as users who normally buy electronics start buying gifts and sweets. The mapping P(purchase|features) has changed.
List the six stages of the ML lifecycle and briefly describe what happens at each stage.
2. Feature Engineering & Training: Extract features, select architecture, tune hyperparameters, train model.
3. Evaluation & Validation: Test set metrics, cross-validation, fairness audits.
4. Deployment (Serving): Package as API/service, containerize, deploy to cloud/edge.
5. Monitoring: Track prediction quality, data drift, latency, throughput.
6. Retraining & CI/CD: Automated retraining pipelines, continuous integration/delivery for models.
Why does FastAPI outperform Flask for ML serving? Mention at least three technical reasons.
2. Pydantic validation: FastAPI automatically validates request/response schemas via type hints. Flask requires manual validation code.
3. Performance: FastAPI benchmarks at ~3000+ req/s vs Flask's ~500 req/s due to Starlette's async event loop.
4. Auto-generated OpenAPI docs: FastAPI auto-generates interactive /docs endpoint โ invaluable for frontend teams consuming your ML API.
What is the purpose of a .dockerignore file in ML projects? Why is it especially important for ML compared to regular web apps?
.dockerignore prevents specified files from being included in the Docker build context. For ML projects, this is critical because ML directories contain massive files: training datasets (10+ GB), model checkpoints (multiple 100 MB files), Jupyter notebooks, wandb logs, and TensorBoard event files. Without .dockerignore, COPY . . would include all of these, creating a Docker image that's 20–50 GB instead of 1–2 GB, taking hours to build and push.
Explain why you might choose TFLite edge deployment over cloud API deployment for an agricultural app targeting Indian farmers.
2. Latency: Cloud API adds ~200-500ms network latency. TFLite inference is ~30-60ms locally โ critical for real-time camera-based disease detection.
3. Cost: No cloud server costs per prediction. For 10,000 daily users, cloud costs add up; on-device inference is free.
4. Privacy: Farm images stay on the device โ no sensitive location/crop data sent to cloud servers.
5. Device compatibility: TFLite runs on low-end Android phones (₹5,000–₹8,000 range) that most Indian farmers use.
Section C: Long Answer Questions (3)
Design an MLOps pipeline for a Paytm-like company deploying a fraud detection model. Include: (a) experiment tracking setup, (b) model serving architecture, (c) monitoring strategy, (d) retraining triggers. Draw a system architecture diagram as part of your answer. [15 marks]
Compare the three types of drift (data, prediction, concept) in detail. For each type: (a) define it mathematically, (b) provide a real-world Indian industry example, (c) describe the detection method (statistical test), and (d) explain the remediation strategy. [15 marks]
A startup is choosing between four MLOps tool stacks for their 8-person ML team with a monthly budget of ₹50,000. Compare: (a) MLflow + FastAPI + Cloud Run, (b) W&B + TF Serving + GKE, (c) DVC + BentoML + AWS Lambda, (d) Kubeflow + Triton + On-premise. Evaluate each on cost, complexity, scalability, and team skill requirements. Recommend one with justification. [15 marks]
Section D: Programming Exercises (2)
Deploy a Sentiment Analysis Model as a REST API
Take a pre-trained sentiment analysis model (you may use a simple TF/Keras text classifier or HuggingFace pipeline) and:
- Write a FastAPI application with a /analyze endpoint that accepts JSON with a "text" field
- Add Pydantic input/output validation (text length: 1–5000 chars, output: sentiment label + confidence)
- Add a /health endpoint and a /batch endpoint (accepts list of texts)
- Write a Dockerfile to containerize the application
- Include at least 3 unit tests using pytest and httpx
Hint: For a quick sentiment model, use from transformers import pipeline; classifier = pipeline("sentiment-analysis")
Build a Production Monitoring Script
Create a Python monitoring script that:
- Reads a CSV log of predictions (columns: timestamp, input_features, predicted_class, confidence)
- Computes data drift using KS test for numerical features and Chi-square test for categorical features
- Computes prediction drift using Population Stability Index (PSI)
- Generates an HTML drift report with color-coded alerts (green=OK, yellow=warning, red=critical)
- Sends a Slack notification if any drift metric exceeds the threshold
Use the DriftMonitor class from Section 21.7 as a starting point. Extend it with Chi-square test and HTML report generation.
Section E: Mini-Project
Mini-Project: End-to-End MLOps Pipeline for Indian Crop Disease Detection
Objective
Build a complete, production-ready ML deployment pipeline for a crop disease detection service targeting Indian farmers.
Requirements
- Model Training (with MLflow): Train the plant disease classifier from Chapter 17. Log at least 5 experiment runs with different hyperparameters using MLflow. Register the best model in MLflow Model Registry.
- FastAPI Server: Build a production API with:
  - /predict: Image upload → disease prediction
  - /health: Server health check
  - /model-info: Returns model version, training date, accuracy
  - Request logging to CSV for drift monitoring
- Docker + Cloud Deploy: Containerize with Docker (multi-stage build). Deploy to Google Cloud Run (asia-south1 region). Document the deployment cost estimation in INR.
- Streamlit Frontend: Build a user-friendly UI that allows image upload and camera capture. Display predictions with confidence bars. Include Hindi/English language toggle.
- Monitoring: Implement weekly drift detection using the DriftMonitor class. Generate automated drift reports. Set up alerting (email or Slack).
- TFLite Conversion: Convert the model to TFLite (Int8 quantized). Benchmark accuracy vs. original. Document size reduction.
Deliverables
- GitHub repository with clean README, all code, and Dockerfile
- MLflow experiment tracking screenshots (5+ runs)
- Deployed Cloud Run URL (working for at least 1 week)
- Streamlit app demo video (2 minutes)
- Monitoring report showing drift analysis on synthetic data
- Cost analysis document (projected monthly cost in ₹ for 1K, 10K, 100K daily users)
Evaluation Rubric
| Criterion | Marks |
|---|---|
| MLflow integration (5+ tracked experiments) | 15 |
| FastAPI server (clean code, validation, error handling) | 20 |
| Docker + Cloud Run deployment (working URL) | 20 |
| Streamlit UI (user-friendly, camera support) | 15 |
| Monitoring & drift detection | 15 |
| TFLite conversion + benchmarks | 10 |
| Documentation & code quality | 5 |
| Total | 100 |
Chapter Summary
🧠 Key Takeaways from Chapter 21
- The Deployment Gap is Real: Only ~53% of ML prototypes reach production. The gap isn't about model accuracy; it's about infrastructure, monitoring, and engineering.
- ML Lifecycle: Data → Train → Evaluate → Deploy → Monitor → Retrain. Production ML requires mastering all six stages, not just the first three.
- Model Serialization: Use SavedModel (TensorFlow), ONNX (cross-framework), or TorchScript (PyTorch) for production. Never use pickle.
- FastAPI for ML Serving: Load model at startup, validate inputs with Pydantic, use async handlers, add health checks, and auto-generate API docs.
- Docker: Containerize everything for reproducibility. Use multi-stage builds and .dockerignore to keep images small. Run as non-root user for security.
- TF Serving: Purpose-built for serving TF models at scale. Supports automatic model versioning, request batching, and A/B testing via version-numbered directories.
- Model Versioning: Use MLflow for experiment tracking and model registry. Tag every run with Git commit hash. Promote models through stages: Staging → Production → Archived.
- Monitoring is Non-Negotiable: Track data drift (KS test), prediction drift (PSI), and concept drift. A deployed model without monitoring will silently degrade.
- Edge Deployment: TFLite with Int8 quantization reduces models by ~4× with minimal accuracy loss. Essential for Indian low-connectivity scenarios.
- Start Simple: MLflow + FastAPI + Cloud Run is the ideal MLOps starter stack. Costs under ₹10,000/month and scales to 10K+ daily users.
Level 0: Manual Jupyter notebooks → Level 1: Automated training + manual deploy →
Level 2: CI/CD for models + monitoring → Level 3: Full automation with drift-triggered retraining