Neural Networks & Deep Learning

Chapter 21: MLOps — From Jupyter Notebook to Production

Bridging the Last Mile Between Model Training and Real-World Impact

⏱️ Reading Time: ~3 hours | 📖 Part VI: MLOps & Deployment | 🚀 Engineering + Code Chapter

📋 Prerequisites: Chapters 12–17 (CNN, Transfer Learning), Basic Python, Familiarity with REST APIs

Bloom's Taxonomy Map for This Chapter

Bloom's Level	What You'll Achieve
🔵 Remember	Recall the stages of the ML lifecycle, list model serialization formats (SavedModel, ONNX, TorchScript), and name key MLOps tools
🔵 Understand	Explain why a model that works in Jupyter fails in production, differentiate between data drift, prediction drift, and concept drift
🟢 Apply	Wrap a trained model in a FastAPI endpoint, write a Dockerfile for ML inference, and deploy to Google Cloud Run
🟡 Analyze	Diagnose production model degradation using monitoring dashboards, compare different serving architectures for latency and throughput
🟠 Evaluate	Choose between MLflow, W&B, DVC, and Kubeflow for a given team size and use case; decide between cloud vs. edge deployment for Indian connectivity constraints
🔴 Create	Build an end-to-end MLOps pipeline: train a plant disease model, package as FastAPI + Docker, deploy with CI/CD, and add monitoring

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

Map the complete ML lifecycle from data collection through deployment, monitoring, and retraining — and identify where most projects fail (the "deployment gap")
Serialize trained models using TensorFlow SavedModel, ONNX, and TorchScript formats, and explain the trade-offs of each
Build a production-grade REST API using FastAPI with Pydantic validation, async inference, health checks, and proper error handling
Containerize an ML application using Docker with multi-stage builds optimized for minimal image size
Deploy models using TensorFlow Serving, and understand model versioning with A/B testing strategies
Implement monitoring for data drift, prediction drift, and concept drift using statistical tests
Optimize models for edge deployment on Android devices using TFLite, especially for low-connectivity scenarios in rural India
Design a complete MLOps pipeline from experiment tracking (MLflow) through CI/CD to production monitoring
Compare MLOps tools (MLflow, Weights & Biases, DVC, Kubeflow) and select the right stack for different organizational needs
Build a Streamlit UI for demo and internal stakeholder consumption of ML predictions

Section 2

Opening Hook

🎯 "94% accuracy. Manager: Great, deploy it. Now what?"

Priya, a data scientist at a Bangalore fintech startup, spent 3 months training a loan default prediction model. 94% accuracy on the test set. Her manager was ecstatic: "Ship it by Friday."

That was 6 weeks ago. Priya is still trying to deploy it.

The model ran perfectly in her Jupyter Notebook on her 32 GB RAM laptop. But the production server has 4 GB. Her pickle file threw version errors. The Flask API crashed under 50 concurrent requests. And nobody told her that RBI regulations require model explainability logs.

Priya's story is not unique. According to Gartner, only 53% of ML prototypes ever make it to production. The gap between a working notebook and a production system is what MLOps exists to bridge.

In this chapter, you'll learn to be the engineer who can say: "94% accuracy — AND it's deployed, monitored, and auto-retraining."

Flipkart · Paytm · Infosys · Jio · Zomato

The ideas behind MLOps are often traced to a 2015 Google paper on hidden technical debt in ML systems (Sculley et al.); the term "MLOps" itself came into common use a few years later. Today, the MLOps market is valued at over $1.4 billion globally, and India's IT services companies (TCS, Infosys, Wipro) are among the largest MLOps consulting providers worldwide.

Section 3

Core Concepts

21.1 The ML Lifecycle — Why Notebooks Are Not Products

The journey from data to value follows a structured lifecycle. Most university courses cover only the first three stages. Production ML requires mastering all six:

📋 The Complete ML Lifecycle

Stage 1: Data Collection & Preparation

Gathering raw data, cleaning, labeling, versioning datasets. For Flipkart's product recommendation, this means processing 1.5 billion+ click events daily.

Stage 2: Feature Engineering & Training

Feature extraction, model architecture selection, hyperparameter tuning. The part you've mastered in Chapters 1–20.

Stage 3: Evaluation & Validation

Test set metrics, cross-validation, fairness audits. "94% accuracy" lives here.

Stage 4: Deployment (Serving)

Packaging the model as a service (REST API, gRPC), containerizing (Docker), deploying to cloud or edge. This is where most projects die.

Stage 5: Monitoring

Tracking prediction quality, data drift, latency, throughput. Without monitoring, your model silently degrades.

Stage 6: Retraining & CI/CD

Automated retraining pipelines triggered by drift detection or scheduled intervals. Continuous integration for ML code, continuous delivery for models.

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│   DATA   │───▶│  TRAIN   │───▶│ EVALUATE │───▶│  DEPLOY  │───▶│ MONITOR  │───▶│ RETRAIN  │
│Collection│    │  Model   │    │ Validate │    │  Serve   │    │  Drift   │    │  CI/CD   │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
     ▲                                                                               │
     └───────────────────────────────────────────────────────────────────────────────┘
                                ♻️ CONTINUOUS FEEDBACK LOOP

The MLOps Equation:
ML in Production = Code + Data + Model + Infra + Monitoring + Governance

The "Hidden Technical Debt" Problem

A landmark Google paper (Sculley et al., 2015) revealed that in real-world ML systems, the actual model training code is just a tiny fraction of the total codebase. The vast majority is infrastructure: data pipelines, feature stores, serving infrastructure, monitoring, and configuration management.

At Paytm, the fraud detection team reported that ML model code accounts for only ~5% of their production codebase. The remaining 95% handles data ingestion from 450+ million wallets, feature computation at 10,000+ TPS (transactions per second), A/B testing across regional segments, and compliance logging required by RBI.

21.2 Model Serialization — Saving Models for Production

Your trained model exists as weights in GPU memory. To deploy it, you must serialize it — convert it to a portable, self-contained format that any server can load without your original training code.

🔧 Model Serialization Formats

1. TensorFlow SavedModel

TensorFlow's native format. Saves the complete computation graph + weights + signatures. Directly loadable by TensorFlow Serving, and the starting point for conversion to TFLite and TF.js.

2. ONNX (Open Neural Network Exchange)

Framework-agnostic format supported by Microsoft, Facebook, and Amazon. Train in PyTorch, deploy with ONNX Runtime on any platform. Ideal for cross-framework interoperability.

3. TorchScript (PyTorch)

PyTorch's serialization via torch.jit.trace() or torch.jit.script(). Converts Python models to an intermediate representation that runs without Python dependency.

4. Pickle / Joblib (Scikit-learn)

Simple binary serialization. Not recommended for production — version-dependent, security risks (arbitrary code execution), and no graph optimization.

TensorFlow SavedModel

# Save a Keras model as TensorFlow SavedModel
import tensorflow as tf

# After training your model (e.g., plant disease classifier from Ch17)
model = tf.keras.models.load_model('plant_disease_model.h5')

# Save as SavedModel (directory format)
tf.saved_model.save(model, './saved_model/plant_disease/1')

# The '1' is the version number — critical for model versioning!
# Directory structure:
# saved_model/plant_disease/1/
#   ├── saved_model.pb        ← computation graph
#   ├── fingerprint.pb
#   └── variables/
#       ├── variables.data-00000-of-00001  ← weights
#       └── variables.index

# Load it back
loaded = tf.saved_model.load('./saved_model/plant_disease/1')
print(loaded.signatures)  # Shows input/output specs

ONNX Export from PyTorch

import torch
import torch.onnx

# Assume `model` is a trained PyTorch model
model.eval()

# Create dummy input matching your model's expected shape
dummy_input = torch.randn(1, 3, 224, 224)  # Batch=1, 3 channels, 224×224

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "plant_disease.onnx",
    input_names=["image"],
    output_names=["prediction"],
    dynamic_axes={"image": {0: "batch_size"}},  # Allow variable batch
    opset_version=13
)

# Verify with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("plant_disease.onnx")
result = session.run(None, {"image": dummy_input.numpy()})
print(f"Prediction shape: {result[0].shape}")

TorchScript

# Method 1: Tracing (works for models without control flow)
traced_model = torch.jit.trace(model, dummy_input)
traced_model.save("plant_disease_traced.pt")

# Method 2: Scripting (works with if/else, loops)
scripted_model = torch.jit.script(model)
scripted_model.save("plant_disease_scripted.pt")

# Load without needing the model class definition!
loaded = torch.jit.load("plant_disease_traced.pt")
output = loaded(dummy_input)

Format	Framework	Graph Optimization	Cross-Platform	Production-Ready
`SavedModel`	TensorFlow	✅ XLA, TF-TRT	TF ecosystem	✅✅✅
`ONNX`	Any → ONNX Runtime	✅ Graph fusion	✅ Universal	✅✅✅
`TorchScript`	PyTorch	✅ Fusion passes	C++, mobile	✅✅
`pickle/joblib`	Scikit-learn	❌ None	❌ Python only	⚠️ Not recommended

Never use pickle for production ML models. Pickle files are Python-version and library-version dependent. A model pickled with scikit-learn 1.2 may not load with scikit-learn 1.3. Worse, pickle can execute arbitrary code — a malicious pickle file can compromise your server. Always use SavedModel, ONNX, or TorchScript.
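To make the arbitrary-code-execution risk concrete, here is a minimal and deliberately harmless sketch. The `Payload` class is hypothetical; it uses `__reduce__` to tell pickle which callable to invoke at load time. The callable here is `os.getcwd`, but a malicious file could just as easily name `os.system`.

```python
import os
import pickle


class Payload:
    """Hypothetical object whose unpickling runs a function chosen by its author."""

    def __reduce__(self):
        # pickle stores (callable, args) and calls callable(*args) on load.
        # os.getcwd is harmless; an attacker could reference os.system instead.
        return (os.getcwd, ())


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload; it simply calls os.getcwd().
result = pickle.loads(blob)
print(type(result).__name__, result == os.getcwd())
```

The loaded "model" is whatever the embedded callable returned, which is why a pickle file from an untrusted source should never be loaded on a serving host.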

21.3 REST API with FastAPI — The /predict Endpoint

FastAPI is the modern Python web framework of choice for ML serving. It's built on Starlette (async ASGI) and Pydantic (data validation), offering automatic OpenAPI docs, type validation, and async request handling out of the box.

Why FastAPI Over Flask for ML?

Feature	Flask	FastAPI
Async Support	❌ WSGI (sync)	✅ ASGI (native async)
Request Validation	Manual	✅ Pydantic auto-validation
Auto API Docs	❌ Need Swagger plugin	✅ Built-in /docs
Type Hints	Optional	✅ Drive validation and docs
Performance (indicative; depends on workload and server setup)	~500 req/s	~3000+ req/s
ML Suitability	Good for prototypes	✅ Production-grade
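One caveat on the async row: declaring an endpoint `async def` does not by itself make a blocking model call concurrent. The standard-library sketch below uses `time.sleep` as a stand-in for inference (an assumption for illustration, not the chapter's model) and compares calling it directly inside a coroutine with offloading it via `asyncio.to_thread`. All function names are hypothetical.

```python
import asyncio
import time


def fake_inference(x):
    """Stand-in for a blocking model call (assumption: about 100 ms per request)."""
    time.sleep(0.1)
    return x * 2


async def predict_blocking(x):
    # Runs on the event loop thread, so other requests wait their turn.
    return fake_inference(x)


async def predict_offloaded(x):
    # Runs in a worker thread, leaving the event loop free for other requests.
    return await asyncio.to_thread(fake_inference, x)


async def timed(handler, n=4):
    """Fire n concurrent requests at a handler and measure the wall-clock time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(handler(i) for i in range(n)))
    return results, time.perf_counter() - start


blocking_results, blocking_s = asyncio.run(timed(predict_blocking))
offloaded_results, offloaded_s = asyncio.run(timed(predict_offloaded))
print(f"blocking:  {blocking_s:.2f}s for {len(blocking_results)} requests")
print(f"offloaded: {offloaded_s:.2f}s for {len(offloaded_results)} requests")
```

In a FastAPI app the same idea applies: either declare the endpoint with plain `def`, which the framework runs in a thread pool, or offload the blocking call explicitly as shown.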

Complete FastAPI ML Server

# app.py — Production FastAPI ML serving application
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel, Field
from typing import List, Optional
import tensorflow as tf
import numpy as np
from PIL import Image
import io
import time
import logging

# ─── Setup ────────────────────────────────────────
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(
    title="Plant Disease Classifier API",
    description="Predicts plant disease from leaf images (Ch17 model)",
    version="1.0.0"
)

# CORS: allow the Streamlit frontend to call this API.
# The wildcard is fine for a demo; list explicit origins in production.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# ─── Load model at startup (not per-request!) ────
CLASS_NAMES = [
    "Healthy", "Early Blight", "Late Blight",
    "Bacterial Spot", "Yellow Leaf Curl"
]
model = None

@app.on_event("startup")
async def load_model():
    global model
    logger.info("Loading plant disease model...")
    model = tf.saved_model.load("./saved_model/plant_disease/1")
    logger.info("✅ Model loaded successfully!")

# ─── Pydantic Response Schemas ────────────────────
class PredictionResult(BaseModel):
    predicted_class: str = Field(..., example="Early Blight")
    confidence: float = Field(..., ge=0, le=1, example=0.94)
    all_probabilities: dict = Field(...)
    inference_time_ms: float = Field(..., example=45.2)

class HealthResponse(BaseModel):
    status: str = "healthy"
    model_loaded: bool = True
    version: str = "1.0.0"

# ─── Helper: preprocess image ─────────────────────
def preprocess_image(image_bytes: bytes) -> np.ndarray:
    """Resize to 224x224, normalize to [0,1], add batch dim."""
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    img = img.resize((224, 224))
    arr = np.array(img, dtype=np.float32) / 255.0
    return np.expand_dims(arr, axis=0)  # (1, 224, 224, 3)

# ─── Endpoints ────────────────────────────────────
@app.get("/health", response_model=HealthResponse)
async def health_check():
    return HealthResponse(model_loaded=model is not None)

@app.post("/predict", response_model=PredictionResult)
async def predict(file: UploadFile = File(...)):
    """Upload a leaf image and get disease prediction."""
    # Validate file type
    if file.content_type not in ["image/jpeg", "image/png"]:
        raise HTTPException(
            status_code=400,
            detail="Only JPEG/PNG images accepted"
        )

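    # Fail fast if the startup hook has not loaded the model yet
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    # Read & preprocess (a corrupt upload makes PIL raise, so map that to a 400)
    image_bytes = await file.read()
    try:
        input_tensor = preprocess_image(image_bytes)
    except Exception:
        raise HTTPException(status_code=400, detail="Could not decode image")

    # Inference with timing
    start = time.perf_counter()
    infer = model.signatures["serving_default"]
    predictions = infer(tf.constant(input_tensor))
    output_key = list(predictions.keys())[0]
    # Assumes the exported model returns raw logits; drop the softmax if the
    # final layer already applies one.
    probs = tf.nn.softmax(predictions[output_key]).numpy()[0]
    elapsed_ms = (time.perf_counter() - start) * 1000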

    # Build response
    predicted_idx = int(np.argmax(probs))
    return PredictionResult(
        predicted_class=CLASS_NAMES[predicted_idx],
        confidence=round(float(probs[predicted_idx]), 4),
        all_probabilities={
            name: round(float(p), 4)
            for name, p in zip(CLASS_NAMES, probs)
        },
        inference_time_ms=round(elapsed_ms, 2)
    )

Run & Test

# Terminal: Start the server
$ uvicorn app:app --host 0.0.0.0 --port 8000 --workers 4

# Another terminal: Test with curl
$ curl -X POST "http://localhost:8000/predict" \
    -F "file=@tomato_leaf.jpg"

# Or open browser → http://localhost:8000/docs
# FastAPI auto-generates interactive Swagger UI!

{
  "predicted_class": "Early Blight",
  "confidence": 0.9412,
  "all_probabilities": {
    "Healthy": 0.0203,
    "Early Blight": 0.9412,
    "Late Blight": 0.0189,
    "Bacterial Spot": 0.0101,
    "Yellow Leaf Curl": 0.0095
  },
  "inference_time_ms": 42.37
}

Always load the model at startup (using @app.on_event("startup")), not inside the /predict function. Loading a SavedModel takes 2–10 seconds. If you load per-request, every prediction will have multi-second latency and you'll run out of memory fast.
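The load-once idea can also be sketched without a web framework. In the toy example below, `slow_load` is a hypothetical stand-in for `tf.saved_model.load` that records each call, and `get_model` wraps it in `functools.lru_cache` so that only the first request pays the loading cost. This is a lazy variant of the startup hook, shown under those assumptions rather than as the chapter's serving code.

```python
import time
from functools import lru_cache

load_calls = []  # records every real load so the effect is visible


def slow_load(path):
    """Hypothetical stand-in for tf.saved_model.load (sleeps instead of loading)."""
    load_calls.append(path)
    time.sleep(0.05)
    return {"path": path}


@lru_cache(maxsize=1)
def get_model(path="./saved_model/plant_disease/1"):
    # The first call loads; later calls return the cached object.
    return slow_load(path)


def handle_request(features):
    model = get_model()  # cheap after the first request
    return {"model_path": model["path"], "n_features": len(features)}


for _ in range(3):
    handle_request([0.1, 0.2, 0.3])

print(f"requests served: 3, model loads: {len(load_calls)}")
```

The startup hook remains preferable for serving, because it moves the loading delay out of the first user's request and lets the health check report readiness.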

21.4 Docker Containerization for ML

Docker solves the "it works on my machine" problem. Your container bundles the OS, Python version, libraries, model weights, and application code into a single portable image.

🐳 Why Docker for ML?

Reproducibility

Exact same environment in development, staging, and production. No more "but it worked in my Jupyter notebook!"

Dependency Isolation

TensorFlow 2.15 needs CUDA 12.2 and cuDNN 8.9. PyTorch 2.1 needs CUDA 12.1. Docker keeps them separate.

Scalability

Kubernetes can spin up 50 replicas of your container in seconds during peak traffic. Essential for Zomato's lunch-hour recommendation surge.

Portability

Same image runs on AWS, GCP, Azure, or an on-premise server at TCS's data center in Mumbai.

Dockerfile for ML Serving (Multi-Stage Build)

# Dockerfile — Optimized for ML serving
# Stage 1: Builder (install heavy dependencies)
FROM python:3.11-slim AS builder

WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: Runtime (minimal image)
FROM python:3.11-slim

# Metadata
LABEL maintainer="mlops-team@startup.in"
LABEL version="1.0"

# Security: non-root user
RUN groupadd -r mluser && useradd -r -g mluser mluser

WORKDIR /app

# Copy installed packages from builder
COPY --from=builder /install /usr/local

# Copy application code
COPY app.py .
COPY saved_model/ ./saved_model/

# Switch to non-root user
USER mluser

# Expose port
EXPOSE 8000

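# Health check (python:3.11-slim does not ship curl, so use the standard library)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1

# Start server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]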

requirements.txt

fastapi==0.104.1
uvicorn[standard]==0.24.0
tensorflow-cpu==2.15.0     # Use CPU version to keep image small
numpy==1.26.2
Pillow==10.1.0
python-multipart==0.0.6    # Required for file uploads

Build & Run

# Build the image
$ docker build -t plant-disease-api:v1 .

# Check image size
$ docker images plant-disease-api
# REPOSITORY          TAG    SIZE
# plant-disease-api   v1     1.2GB  (with TF-CPU)

# Run the container
$ docker run -d \
    --name plant-api \
    -p 8000:8000 \
    --memory=2g \
    --cpus=2 \
    plant-disease-api:v1

# Test it
$ curl -X POST http://localhost:8000/predict -F "file=@leaf.jpg"

# View logs
$ docker logs -f plant-api

Don't copy your entire training dataset into the Docker image. A common beginner mistake is COPY . . which copies training data, checkpoints, Jupyter notebooks — ballooning the image to 20+ GB. Only copy the serving code and the final model artifact.

.dockerignore

# Don't copy training artifacts into production image
__pycache__/
*.pyc
.git/
.gitignore
data/
notebooks/
*.ipynb
checkpoints/
wandb/
mlruns/
*.h5
README.md

21.5 TensorFlow Serving — Industrial-Grade Model Serving

While FastAPI is great for lightweight deployment, TensorFlow Serving is purpose-built for serving TF models at scale. It handles model versioning, batching, and hardware acceleration natively.

⚡ TensorFlow Serving Features

Automatic Model Versioning

Place model versions in numbered directories (/1, /2, /3). TF Serving automatically detects and hot-swaps to the latest version — zero-downtime updates.

Request Batching

Automatically batches incoming requests to maximize GPU utilization. A GPU that processes one image at a time leaves most of its parallel capacity idle.

gRPC & REST

Supports both gRPC (binary protocol, typically faster than REST for large tensor payloads) and REST endpoints out of the box.

GPU Acceleration

Natively supports NVIDIA GPUs with TensorRT optimization.
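The numbered-directory convention behind automatic versioning can be illustrated with a few lines of Python. This is only a sketch of the convention, assuming versions are compared as integers; `latest_version` is a hypothetical helper, not part of TF Serving, whose own discovery logic lives inside the server.

```python
import tempfile
from pathlib import Path


def latest_version(model_dir):
    """Return the highest integer-named subdirectory, or None if there is none."""
    versions = [
        int(p.name)
        for p in Path(model_dir).iterdir()
        if p.is_dir() and p.name.isdigit()
    ]
    return max(versions, default=None)


with tempfile.TemporaryDirectory() as root:
    base = Path(root) / "plant_disease"
    # "notes" is ignored because it is not a version number.
    for name in ("1", "2", "10", "notes"):
        (base / name).mkdir(parents=True)
    newest = latest_version(base)

print(newest)  # 10, because versions compare as integers, not as strings
```

A practical consequence is that version 10 outranks version 2, so plain integer names are safer than zero-padded strings or date stamps with separators.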

Deploy with TensorFlow Serving (Docker)

# Directory structure expected by TF Serving:
# models/
#   └── plant_disease/
#       ├── 1/               ← Version 1
#       │   └── saved_model.pb + variables/
#       └── 2/               ← Version 2 (auto-detected!)
#           └── saved_model.pb + variables/

# Run TensorFlow Serving container
$ docker run -d \
    --name tf-serving \
    -p 8501:8501 \
    -p 8500:8500 \
    -v "$(pwd)/models:/models" \
    -e MODEL_NAME=plant_disease \
    tensorflow/serving:latest

# REST API prediction (port 8501)
$ curl -X POST http://localhost:8501/v1/models/plant_disease:predict \
    -H "Content-Type: application/json" \
    -d '{"instances": [[[0.1, 0.2, 0.3], ...]]}'

# Check model status
$ curl http://localhost:8501/v1/models/plant_disease

TF Serving Configuration for A/B Testing

# model_config.config — Serve multiple versions simultaneously
model_config_list {
  config {
    name: "plant_disease"
    base_path: "/models/plant_disease"
    model_platform: "tensorflow"
    model_version_policy {
      specific {
        versions: 1    # Champion model
        versions: 2    # Challenger model (the traffic split is done by the caller, not by TF Serving)
      }
    }
  }
}

# Start TF Serving with config file
$ docker run -d \
    -p 8501:8501 \
    -v "$(pwd)/models:/models" \
    -v "$(pwd)/model_config.config:/config" \
    tensorflow/serving \
    --model_config_file=/config

A/B Traffic Splitting with Python

import random
import requests

def predict_with_ab_test(image_data, champion_pct=0.9):
    """Route traffic: 90% to v1 (champion), 10% to v2 (challenger)."""
    base_url = "http://localhost:8501/v1/models/plant_disease"

    if random.random() < champion_pct:
        version = 1  # Champion
    else:
        version = 2  # Challenger

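    url = f"{base_url}/versions/{version}:predict"
    response = requests.post(url, json={"instances": [image_data]}, timeout=5)
    response.raise_for_status()  # surface 4xx/5xx instead of parsing an error body

    result = response.json()
    result["model_version"] = version  # record the variant for later analysis
    return result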

Flipkart uses A/B testing extensively for their search ranking ML models. During the Big Billion Days sale, they run up to 15 simultaneous model variants to optimize conversion rates across different product categories and regional user segments. Their ML platform serves 100,000+ predictions per second during peak traffic.
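A purely random split assigns a variant per request, so the same user can bounce between champion and challenger. A common alternative is sticky assignment by hashing a stable identifier. The sketch below assumes a `user_id` string is available (hypothetical, as is the `assign_version` helper) and maps it deterministically to version 1 or 2.

```python
import hashlib


def assign_version(user_id: str, champion_pct: float = 0.9) -> int:
    """Deterministically map a user to version 1 (champion) or 2 (challenger)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return 1 if bucket < champion_pct else 2


users = [f"user-{i}" for i in range(10_000)]
assignments = [assign_version(u) for u in users]
champion_share = assignments.count(1) / len(assignments)

print(f"champion share: {champion_share:.3f}")
print("same user, same variant:", assign_version("user-42") == assign_version("user-42"))
```

Because the assignment depends only on the identifier, it survives restarts and works across several API replicas without shared state.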

21.6 Model Versioning & Experiment Tracking

In production, you don't have one model — you have dozens of experiments, each with different hyperparameters, data versions, and feature sets. Model versioning answers: "Which model is running? What data was it trained on? Who approved it?"

📊 MLflow — The Open-Source Experiment Tracker

Tracking

Log parameters, metrics, artifacts, and source code for every training run.

Model Registry

Central repository with stages: None → Staging → Production → Archived.

Projects

Reproducible ML projects with MLproject files specifying environment and entry points.

Serving

Deploy models directly from the registry with mlflow models serve.

MLflow Experiment Tracking

import mlflow
import mlflow.tensorflow
import tensorflow as tf

# Set tracking URI (local or remote server)
mlflow.set_tracking_uri("http://mlflow-server.internal:5000")
mlflow.set_experiment("plant-disease-classifier")

# Start a training run
with mlflow.start_run(run_name="resnet50-augmented-v3"):

    # Log hyperparameters
    mlflow.log_params({
        "model_arch": "ResNet50",
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 50,
        "augmentation": "random_flip+rotate+zoom",
        "dataset_version": "plantvillage-v3-india-augmented",
        "optimizer": "Adam",
    })

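    # Train the model (build_resnet50_classifier, train_ds, and val_ds are
    # assumed to come from your own training code; they are not defined here)
    model = build_resnet50_classifier()
    history = model.fit(train_ds, validation_data=val_ds, epochs=50)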

    # Log metrics
    mlflow.log_metrics({
        "val_accuracy": history.history["val_accuracy"][-1],
        "val_loss": history.history["val_loss"][-1],
        "train_accuracy": history.history["accuracy"][-1],
        "model_size_mb": 98.5,
        "inference_latency_ms": 42.0,
    })

    # Log the model artifact
    mlflow.tensorflow.log_model(
        model,
        artifact_path="model",
        registered_model_name="PlantDiseaseClassifier"
    )

    # Log training curves as artifact
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.plot(history.history["val_accuracy"])
    ax.set_title("Validation Accuracy")
    fig.savefig("training_curve.png")
    mlflow.log_artifact("training_curve.png")

print("Run logged! Check the MLflow UI at http://mlflow-server.internal:5000")

Model Registry — Promoting to Production

import mlflow
import mlflow.tensorflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote model version 3 to Production
# (recent MLflow releases deprecate stages in favor of aliases; check your version)
client.transition_model_version_stage(
    name="PlantDiseaseClassifier",
    version=3,
    stage="Production"
)

# Archive the old version
client.transition_model_version_stage(
    name="PlantDiseaseClassifier",
    version=2,
    stage="Archived"
)

# Load the production model for serving
model_uri = "models:/PlantDiseaseClassifier/Production"
prod_model = mlflow.tensorflow.load_model(model_uri)
print("✅ Production model v3 loaded!")

Tag every training run with the Git commit hash of your code so that results can be traced back to the exact source: mlflow.log_param("git_hash", subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()). The .decode() matters because check_output returns bytes. Three months later, when you need to reproduce a specific result, you'll know exactly which code generated it.

21.7 Monitoring — Data Drift, Prediction Drift, Concept Drift

A model deployed without monitoring is a ticking time bomb. Real-world data changes over time — user behavior shifts, seasonal patterns emerge, economic conditions fluctuate. This phenomenon is called drift.

📈 Three Types of Drift

1. Data Drift (Covariate Shift)

The distribution of input features changes. Example: A loan default model trained on metro-city data starts receiving rural applicant data with very different income distributions.

2. Prediction Drift (Output Drift)

The distribution of model predictions changes. Example: Your spam classifier suddenly flags 60% of emails as spam instead of the usual 5%. Something is wrong with inputs or the model itself.

3. Concept Drift

The relationship between inputs and outputs changes. Example: Post-COVID, Zomato's "will this user order again?" model broke because dining patterns fundamentally changed — more deliveries, fewer dine-ins, different time-of-day patterns.

Kolmogorov-Smirnov Test for Drift Detection:
KS Statistic = sup|F₁(x) − F₂(x)| where F₁ = training CDF, F₂ = production CDF
If KS > threshold (e.g., 0.1), flag data drift alert
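To see what the statistic measures, the formula can be computed by hand from the two empirical CDFs. The sketch below uses only the standard library; `ks_statistic` is a hypothetical helper for illustration, and in practice `scipy.stats.ks_2samp` also returns a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample1, sample2):
    """Largest vertical gap between the empirical CDFs of two samples."""
    s1, s2 = sorted(sample1), sorted(sample2)
    n1, n2 = len(s1), len(s2)
    # The supremum is attained at an observed value, so checking those suffices.
    # bisect_right(s, x) / n is the fraction of the sample that is <= x.
    return max(
        abs(bisect_right(s1, x) / n1 - bisect_right(s2, x) / n2)
        for x in s1 + s2
    )


train = [1, 2, 3, 4]
prod = [3, 4, 5, 6]

print(ks_statistic(train, prod))   # 0.5
print(ks_statistic(train, train))  # 0.0
```

In the first call, half of the training sample lies at or below 2 while none of the production sample does, which gives the gap of 0.5.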

Drift Detection Implementation

import numpy as np
from scipy import stats
from datetime import datetime
import json
import logging

logger = logging.getLogger("drift_monitor")

class DriftMonitor:
    """Monitor data drift (per-feature KS test) and prediction drift (PSI).

    Concept drift is not covered: detecting it requires ground-truth labels.
    """

    def __init__(self, reference_data: np.ndarray,
                 reference_predictions: np.ndarray,
                 feature_names: list,
                 ks_threshold: float = 0.1,
                 psi_threshold: float = 0.2):
        self.reference_data = reference_data
        self.reference_preds = reference_predictions
        self.feature_names = feature_names
        self.ks_threshold = ks_threshold
        self.psi_threshold = psi_threshold
        self.alerts = []

    def _calculate_psi(self, expected: np.ndarray,
                       actual: np.ndarray,
                       bins: int = 10) -> float:
        """Population Stability Index — measures distribution shift."""
        expected_pct, bin_edges = np.histogram(expected, bins=bins)
        actual_pct, _ = np.histogram(actual, bins=bin_edges)

        # Normalize and add small epsilon to avoid division by zero
        expected_pct = expected_pct / len(expected) + 1e-6
        actual_pct = actual_pct / len(actual) + 1e-6

        psi = np.sum(
            (actual_pct - expected_pct) * np.log(actual_pct / expected_pct)
        )
        return float(psi)

    def check_data_drift(self, production_data: np.ndarray) -> dict:
        """KS test per feature: training vs production distribution."""
        drift_report = {"timestamp": datetime.now().isoformat(),
                        "drifted_features": [], "details": {}}

        for i, feature in enumerate(self.feature_names):
            ks_stat, p_value = stats.ks_2samp(
                self.reference_data[:, i],
                production_data[:, i]
            )
            # Cast to a plain bool: json.dumps cannot serialize numpy.bool_
            is_drifted = bool(ks_stat > self.ks_threshold)

            drift_report["details"][feature] = {
                "ks_statistic": round(ks_stat, 4),
                "p_value": round(p_value, 6),
                "is_drifted": is_drifted
            }

            if is_drifted:
                drift_report["drifted_features"].append(feature)
                logger.warning(f"🚨 DRIFT DETECTED: {feature} "
                              f"(KS={ks_stat:.4f}, p={p_value:.6f})")

        return drift_report

    def check_prediction_drift(self, prod_preds: np.ndarray) -> dict:
        """PSI on prediction distribution."""
        psi = self._calculate_psi(self.reference_preds, prod_preds)
        is_drifted = psi > self.psi_threshold

        if is_drifted:
            logger.warning(f"🚨 PREDICTION DRIFT: PSI={psi:.4f}")

        return {
            "psi": round(psi, 4),
            "is_drifted": is_drifted,
            "threshold": self.psi_threshold,
            "reference_mean": float(np.mean(self.reference_preds)),
            "production_mean": float(np.mean(prod_preds)),
        }

# ─── Usage Example ────────────────────────────────
# Suppose we have reference (training) and production data
np.random.seed(42)
ref_data = np.random.randn(1000, 5)             # Training data
prod_data = np.random.randn(500, 5) + 0.3      # Shifted production data
ref_preds = np.random.uniform(0, 1, 1000)       # Training predictions

monitor = DriftMonitor(
    reference_data=ref_data,
    reference_predictions=ref_preds,
    feature_names=["income", "age", "credit_score",
                   "loan_amount", "tenure"],
)

report = monitor.check_data_drift(prod_data)
print(json.dumps(report, indent=2))

Illustrative output (exact KS statistics and p-values depend on the random draw):

🚨 DRIFT DETECTED: income (KS=0.1420, p=0.000312)
🚨 DRIFT DETECTED: age (KS=0.1156, p=0.003841)
🚨 DRIFT DETECTED: credit_score (KS=0.1284, p=0.001052)
🚨 DRIFT DETECTED: loan_amount (KS=0.1098, p=0.006127)
🚨 DRIFT DETECTED: tenure (KS=0.1312, p=0.000823)
{
  "timestamp": "2026-06-24T14:30:00",
  "drifted_features": ["income", "age", "credit_score", "loan_amount", "tenure"],
  "details": { ... }
}
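Concept drift cannot be seen from inputs or predictions alone, because it is the input-to-label relationship that changes. One simple way to watch for it, once ground-truth labels arrive, is to track accuracy over a sliding window and compare it with the offline baseline. The sketch below is a minimal illustration; `RollingAccuracyMonitor`, its parameters, and the thresholds are all hypothetical.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Flag degradation when windowed accuracy falls below baseline - max_drop."""

    def __init__(self, baseline_accuracy, window=100, max_drop=0.05):
        self.baseline = baseline_accuracy
        self.max_drop = max_drop
        self.outcomes = deque(maxlen=window)  # True/False per labelled prediction

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self):
        acc = self.accuracy()
        return acc is not None and acc < self.baseline - self.max_drop


acc_monitor = RollingAccuracyMonitor(baseline_accuracy=0.94, window=50)

# Phase 1: every labelled prediction is correct.
for _ in range(50):
    acc_monitor.record(predicted=1, actual=1)
healthy = acc_monitor.is_degraded()

# Phase 2: one in five predictions is now wrong (80% windowed accuracy).
for i in range(50):
    acc_monitor.record(predicted=1, actual=1 if i % 5 else 0)
degraded = acc_monitor.is_degraded()

print(healthy, degraded, acc_monitor.accuracy())  # False True 0.8
```

The catch is label delay: for loan defaults the true outcome may arrive months later, so this check complements the faster KS and PSI signals rather than replacing them.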

At Jio, the content recommendation model experiences severe concept drift during IPL season every year. User engagement patterns shift dramatically: watch time increases 3×, genre preferences shift from movies to sports, and time-of-day patterns change completely. The MLOps pipeline now includes seasonal retraining triggers that automatically detect and respond to IPL-period drift.

21.8 Edge Deployment — TFLite for Android in India

India has 750+ million smartphone users, but large portions still rely on 2G/3G connectivity. Sending images to a cloud API requires stable bandwidth and adds latency. Edge deployment puts the model on the phone itself — predictions happen offline.

📱 TFLite — TensorFlow for Mobile

Quantization

Reduce model from Float32 (4 bytes/weight) to Int8 (1 byte/weight). Cuts size by 4× and speeds up inference on ARM CPUs.

Model Size

A ResNet50 model goes from ~98 MB → ~25 MB after quantization. Fits comfortably on a ₹8,000 entry-level Android phone.

Offline Inference

A farmer in rural Bihar can photograph a diseased tomato leaf and get a prediction without internet connectivity.

Supported Operations

Not all TF ops are supported in TFLite. Use tf.lite.OpsSet.TFLITE_BUILTINS for maximum compatibility; add SELECT_TF_OPS for unsupported ones (increases binary size).
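The 4× size reduction comes from storing each weight as an 8-bit integer plus a shared scale and zero point. The sketch below walks through that affine mapping in plain Python. It illustrates the arithmetic under simplified assumptions (one scale per tensor, unsigned 8-bit range) and is not TFLite's converter; the helper names are hypothetical.

```python
def quant_params(x_min, x_max, qmin=0, qmax=255):
    """Scale and zero point so that [x_min, x_max] maps onto [qmin, qmax]."""
    # The range must contain 0.0 so that zero is represented exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)  # assumes x_max > x_min
    zero_point = round(qmin - x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp to the 8-bit range


def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)


weights = [-1.0, -0.3, 0.0, 0.4, 1.5]
scale, zero_point = quant_params(min(weights), max(weights))

quantized = [quantize(w, scale, zero_point) for w in weights]
restored = [dequantize(q, scale, zero_point) for q in quantized]
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"scale={scale:.5f}, zero_point={zero_point}")
print("quantized:", quantized)
print(f"max round-trip error: {max_error:.5f} (bound: scale/2 = {scale / 2:.5f})")
```

Each weight is recovered to within half a quantization step, which is why accuracy usually drops only slightly when the value range is well calibrated.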

Convert to TFLite with Quantization

import tensorflow as tf
import numpy as np

# Load the trained Keras model
model = tf.keras.models.load_model("plant_disease_model.h5")

# ─── Method 1: Dynamic Range Quantization (simplest) ──
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("plant_disease_dynamic.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Dynamic quant: {len(tflite_model) / 1e6:.1f} MB")

# ─── Method 2: Full Integer Quantization (smallest) ───
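def representative_dataset():
    """Yield ~100 calibration samples.

    Random noise is only a placeholder here; calibrate with real, preprocessed
    training images, otherwise the quantization ranges will be poorly chosen.
    """
    for _ in range(100):
        sample = np.random.rand(1, 224, 224, 3).astype(np.float32)
        yield [sample]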

converter2 = tf.lite.TFLiteConverter.from_keras_model(model)
converter2.optimizations = [tf.lite.Optimize.DEFAULT]
converter2.representative_dataset = representative_dataset
converter2.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter2.inference_input_type = tf.uint8
converter2.inference_output_type = tf.uint8

tflite_int8 = converter2.convert()
with open("plant_disease_int8.tflite", "wb") as f:
    f.write(tflite_int8)
print(f"Int8 quant: {len(tflite_int8) / 1e6:.1f} MB")

Dynamic quant: 25.1 MB
Int8 quant: 24.8 MB

Run TFLite Inference (Python — mimics Android behavior)

# TFLite inference — same API available in Android Java/Kotlin
interpreter = tf.lite.Interpreter(model_path="plant_disease_int8.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare input
img = tf.io.read_file("test_leaf.jpg")
img = tf.image.decode_jpeg(img, channels=3)
img = tf.image.resize(img, [224, 224])
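# The model takes uint8 input, so raw 0-255 pixels are passed directly. This
# matches a model calibrated on [0, 1] inputs (scale about 1/255, zero point 0);
# check input_details[0]["quantization"] for the actual parameters.
img = tf.cast(img, tf.uint8)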
input_data = tf.expand_dims(img, 0)

# Run inference
interpreter.set_tensor(input_details[0]["index"], input_data.numpy())
interpreter.invoke()
output = interpreter.get_tensor(output_details[0]["index"])

CLASS_NAMES = ["Healthy", "Early Blight", "Late Blight",
               "Bacterial Spot", "Yellow Leaf Curl"]
predicted = CLASS_NAMES[np.argmax(output)]
print(f"Prediction: {predicted}")
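Because the converter was configured with a uint8 output type, `output` holds quantized integers rather than probabilities. `np.argmax` still selects the right class (the mapping is monotonic for a positive scale), but reporting a confidence score requires dequantizing first. The sketch below shows the affine mapping, real ≈ scale × (q − zero_point). `SCALE`, `ZERO_POINT`, and `raw_output` are made-up illustration values; with a real model they would come from `output_details[0]["quantization"]`.

```python
# Affine (de)quantization sketch. SCALE and ZERO_POINT are hypothetical
# values chosen for illustration, not taken from the chapter's model.
SCALE = 1 / 256        # assumed output scale
ZERO_POINT = 0         # assumed output zero point


def dequantize(q_values, scale=SCALE, zero_point=ZERO_POINT):
    """Map uint8 outputs back to approximate real values."""
    return [scale * (q - zero_point) for q in q_values]


def quantize(real_values, scale=SCALE, zero_point=ZERO_POINT):
    """Map real values to uint8, clamping to the representable range."""
    return [max(0, min(255, round(x / scale) + zero_point))
            for x in real_values]


raw_output = [13, 8, 205, 20, 10]   # pretend uint8 output for 5 classes
probs = dequantize(raw_output)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"class index {best}, confidence ~{probs[best]:.3f}")
```

The clamp in `quantize` is where precision is lost: any value outside the calibrated range collapses to 0 or 255, which is one reason calibration data should resemble production inputs.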

Model Variant	Size	Latency (Pixel 6)	Accuracy
Original (Float32)	98 MB	~180 ms	94.2%
Dynamic Range Quant	25 MB	~65 ms	93.8%
Full Int8 Quant	25 MB	~45 ms	93.1%
Float16 Quant	49 MB	~90 ms (GPU)	94.1%
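The sizes in the table follow mostly from bytes per weight. A rough check, assuming the parameter count quoted in this chapter's worked example (23,587,716) and counting weights only; real files are somewhat larger because they also store graph structure and metadata:

```python
PARAMS = 23_587_716  # parameter count quoted in this chapter's worked example

BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weights_size_mb(n_params, dtype):
    """Approximate size of the weights alone, in MB (1 MB = 1e6 bytes)."""
    return n_params * BYTES_PER_WEIGHT[dtype] / 1e6


for dtype in BYTES_PER_WEIGHT:
    print(f"{dtype:8s} {weights_size_mb(PARAMS, dtype):6.1f} MB")

ratio = weights_size_mb(PARAMS, "float32") / weights_size_mb(PARAMS, "int8")
print(f"float32 -> int8 reduction: {ratio:.0f}x")
```

This is why dynamic-range and full-integer quantization land at nearly the same file size: both store weights as 8-bit integers. They differ in whether activations are also quantized, which affects latency more than size.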

Microsoft's Kaizala team (now integrated into Teams) deployed TFLite models for language detection on Indian phones to route messages in the correct language (Hindi, Tamil, Telugu, etc.) across their 100+ million user base, all running locally on devices as low-end as ₹5,000 entry-level smartphones.

Section 4

From-Scratch Code — Minimal Model Server in Pure Python

Before using FastAPI, let's build a barebones HTTP model server using only Python's standard library. This helps you understand what frameworks like FastAPI abstract away.

# minimal_server.py — HTTP model server with zero dependencies
from http.server import HTTPServer, BaseHTTPRequestHandler
import json
import numpy as np
import time

# ─── Simulate a "model" with a simple function ───
class SimpleModel:
    """Simulates a trained model for demonstration."""

    def __init__(self):
        # In production, this would load TF/PyTorch model
        self.classes = ["Healthy", "Early Blight", "Late Blight"]
        self.weights = np.random.randn(10, 3)  # Fake weights
        print("✅ Model loaded")

    def predict(self, features: list) -> dict:
        """Run inference on input features."""
        x = np.array(features[:10], dtype=float).reshape(1, -1)
        logits = x @ self.weights
        # Softmax; subtracting the max keeps np.exp from overflowing
        exp = np.exp(logits - logits.max())
        probs = exp / exp.sum()
        idx = int(np.argmax(probs))
        return {
            "class": self.classes[idx],
            "confidence": round(float(probs[0][idx]), 4),
        }

# Load model ONCE at startup
model = SimpleModel()

# ─── HTTP Request Handler ─────────────────────────
class MLHandler(BaseHTTPRequestHandler):

    def do_GET(self):
        if self.path == "/health":
            self._respond(200, {"status": "healthy"})
        else:
            self._respond(404, {"error": "Not found"})

    def do_POST(self):
        if self.path == "/predict":
            # Read request body
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length).decode("utf-8")

            try:
                data = json.loads(body)
                features = data.get("features", [])
                if not features:
                    self._respond(400, {"error": "Missing 'features'"})
                    return

                # Run inference with timing
                start = time.perf_counter()
                result = model.predict(features)
                result["latency_ms"] = round(
                    (time.perf_counter() - start) * 1000, 2
                )
                self._respond(200, result)

            except json.JSONDecodeError:
                self._respond(400, {"error": "Invalid JSON"})
            except Exception as e:
                self._respond(500, {"error": str(e)})
        else:
            self._respond(404, {"error": "Not found"})

    def _respond(self, status: int, data: dict):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(data).encode("utf-8"))

    def log_message(self, format, *args):
        # Custom logging with timestamp
        print(f"[{time.strftime('%H:%M:%S')}] {args[0]}")

# ─── Start Server ─────────────────────────────────
if __name__ == "__main__":
    server = HTTPServer(("", 8000), MLHandler)
    print("🚀 ML Server running on http://localhost:8000")
    print("   POST /predict  — Send features for prediction")
    print("   GET  /health   — Health check")
    server.serve_forever()

✅ Model loaded
🚀 ML Server running on http://localhost:8000
   POST /predict  — Send features for prediction
   GET  /health   — Health check

# Test:
# curl -X POST http://localhost:8000/predict \
#   -H "Content-Type: application/json" \
#   -d '{"features": [0.5, 0.3, 0.8, 0.1, 0.9, 0.2, 0.7, 0.4, 0.6, 0.3]}'
# → {"class": "Late Blight", "confidence": 0.6821, "latency_ms": 0.15}

This 80-line server demonstrates the core pattern: load once, serve many. FastAPI adds validation, async, OpenAPI docs, and middleware — but the fundamental architecture is identical.
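To make "validation" concrete: the minimal server only checks that `features` is non-empty, so a request with three features or a string in the list surfaces as a 500 error from NumPy. The function below is a hand-written sketch of the checks a schema library would perform. `validate_features` and `N_FEATURES` are illustrative names, not part of the server above.

```python
import math

N_FEATURES = 10  # matches the 10x3 fake weight matrix in SimpleModel


def validate_features(payload):
    """Return (features, None) when the payload is usable, else (None, error)."""
    if not isinstance(payload, dict):
        return None, "Body must be a JSON object"
    features = payload.get("features")
    if not isinstance(features, list):
        return None, "'features' must be a list"
    if len(features) != N_FEATURES:
        return None, f"Expected {N_FEATURES} features, got {len(features)}"
    cleaned = []
    for value in features:
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None, "Features must be numbers"
        try:
            number = float(value)
        except OverflowError:
            return None, "Feature value out of range"
        if not math.isfinite(number):
            return None, "Features must be finite"
        cleaned.append(number)
    return cleaned, None


print(validate_features({"features": [0.5] * 10}))
print(validate_features({"features": [1, 2, 3]}))
```

In the handler, a non-`None` error would map to a 400 response before `model.predict` is ever called, so malformed input never reaches the model.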

Section 5

Industry Code — Full Deployment Pipeline

Let's take the plant disease model from Chapter 17 and build the complete production pipeline: FastAPI server → Dockerfile → Cloud Run deployment → Streamlit UI.

5A. Deploy to Google Cloud Run

#!/bin/bash
# deploy.sh: full deployment script for Google Cloud Run
set -e

# Configuration
PROJECT_ID="my-ml-project-india"
REGION="asia-south1"            # Mumbai region (lowest latency for India)
SERVICE_NAME="plant-disease-api"
IMAGE_NAME="gcr.io/${PROJECT_ID}/${SERVICE_NAME}"

# Step 1: Build Docker image
echo "🔨 Building Docker image..."
docker build -t ${IMAGE_NAME}:latest .

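# Step 2: Push to Google Container Registry
# (gcr.io is deprecated in favor of Artifact Registry; newer projects use a
#  REGION-docker.pkg.dev/PROJECT/REPO/IMAGE path instead)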
echo "📤 Pushing to GCR..."
docker push ${IMAGE_NAME}:latest

# Step 3: Deploy to Cloud Run
echo "🚀 Deploying to Cloud Run..."
# --min-instances 1 keeps one instance warm (avoids cold starts)
# --max-instances 10 caps auto-scaling at 10 instances
# --allow-unauthenticated makes the API public (use IAM for production)
gcloud run deploy ${SERVICE_NAME} \
    --image ${IMAGE_NAME}:latest \
    --platform managed \
    --region ${REGION} \
    --memory 2Gi \
    --cpu 2 \
    --timeout 60 \
    --concurrency 80 \
    --min-instances 1 \
    --max-instances 10 \
    --allow-unauthenticated

# Step 4: Get the URL
URL=$(gcloud run services describe ${SERVICE_NAME} \
    --region ${REGION} --format "value(status.url)")
echo "✅ Deployed! API URL: ${URL}"
echo "   Health: ${URL}/health"
echo "   Docs:   ${URL}/docs"
echo "   Predict: curl -X POST ${URL}/predict -F 'file=@leaf.jpg'"

5B. Streamlit UI for Stakeholders

# streamlit_app.py — User-friendly frontend for model predictions
import streamlit as st
import requests
from PIL import Image
import io

# ─── Page Config ──────────────────────────────────
st.set_page_config(
    page_title="🌿 Plant Disease Detector",
    page_icon="🌿",
    layout="centered"
)

# API endpoint (Cloud Run URL or local)
API_URL = st.sidebar.text_input(
    "API URL",
    value="http://localhost:8000"
)

# ─── Main UI ──────────────────────────────────────
st.title("🌿 Plant Disease Detector")
st.markdown("""
Upload a photo of a plant leaf to detect diseases.
Built for Indian farmers — works with tomato, potato, and pepper leaves.
""")

# ─── Camera or Upload ─────────────────────────────
tab1, tab2 = st.tabs(["📷 Camera", "📁 Upload"])

with tab1:
    camera_image = st.camera_input("Take a photo of the leaf")

with tab2:
    uploaded_file = st.file_uploader(
        "Choose a leaf image",
        type=["jpg", "jpeg", "png"]
    )

# Prefer an uploaded file; otherwise use the camera photo (None if neither)
image_source = uploaded_file if uploaded_file is not None else camera_image

# ─── Predict ──────────────────────────────────────
if image_source is not None:
    # Display the image
    img = Image.open(image_source)
    st.image(img, caption="Uploaded Leaf", use_column_width=True)

    if st.button("🔍 Analyze Leaf", type="primary"):
        with st.spinner("Analyzing..."):
            # Send to API
            image_source.seek(0)
            files = {"file": ("leaf.jpg", image_source, "image/jpeg")}
            # A timeout stops the UI from hanging if the API is unreachable
            response = requests.post(f"{API_URL}/predict", files=files,
                                     timeout=30)

            if response.status_code == 200:
                result = response.json()

                # Display results
                col1, col2 = st.columns(2)
                with col1:
                    st.metric("Prediction", result["predicted_class"])
                with col2:
                    st.metric("Confidence",
                              f"{result['confidence']*100:.1f}%")

                # Probability bar chart
                st.bar_chart(result["all_probabilities"])

                # Inference time
                st.caption(
                    f"⚡ Inference: {result['inference_time_ms']:.1f} ms"
                )

                # Treatment recommendations for Indian farmers
                if result["predicted_class"] != "Healthy":
                    st.warning(f"⚠️ Disease detected: "
                              f"{result['predicted_class']}")
                    st.info("💊 Consult your local Krishi Vigyan "
                           "Kendra (KVK) for treatment options. "
                           "Call Kisan Call Centre: 1800-180-1551")
                else:
                    st.success("✅ Your plant looks healthy!")
            else:
                st.error(f"API Error: {response.text}")

# ─── Sidebar Info ─────────────────────────────────
st.sidebar.markdown("---")
st.sidebar.markdown("### ℹ️ About")
st.sidebar.markdown("""
- **Model**: ResNet50 (Chapter 17)
- **Dataset**: PlantVillage + India crops
- **Accuracy**: 94.2%
- **Supported**: Tomato, Potato, Pepper
- **Cost**: ₹0 (open-source)
""")

Run the Streamlit App

# Install Streamlit
$ pip install streamlit

# Run the app
$ streamlit run streamlit_app.py --server.port 8501

# Opens browser at http://localhost:8501

5C. GitHub Actions CI/CD Pipeline

# .github/workflows/deploy.yml — Auto-deploy on push to main
name: Deploy Plant Disease API

on:
  push:
    branches: [main]
    paths:
      - 'app.py'
      - 'Dockerfile'
      - 'requirements.txt'
      - 'saved_model/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: pip install pytest httpx
      - run: pytest tests/ -v

  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: google-github-actions/auth@v2
        with:
          credentials_json: ${{ secrets.GCP_SA_KEY }}
      - uses: google-github-actions/setup-gcloud@v2
      - run: gcloud auth configure-docker
      - run: |
          docker build -t gcr.io/${{ secrets.GCP_PROJECT }}/plant-disease-api .
          docker push gcr.io/${{ secrets.GCP_PROJECT }}/plant-disease-api
      - run: |
          gcloud run deploy plant-disease-api \
            --image gcr.io/${{ secrets.GCP_PROJECT }}/plant-disease-api \
            --region asia-south1 \
            --platform managed

Always run tests before deployment. Write a test_app.py using httpx.AsyncClient to test your FastAPI endpoints. A broken deployment at 2 AM costs your team sleep and your company revenue. At Zomato, every ML model deployment requires passing 3 stages: unit tests, integration tests, and shadow traffic comparison against the current production model.
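The shadow-traffic stage can be illustrated with a small gate function: the challenger receives a copy of production requests, its predictions are logged but never returned to users, and the two sets of predictions are then compared. This is a minimal sketch; the function names and the 0.95 agreement threshold are assumptions for illustration, not a description of any company's actual pipeline.

```python
def shadow_agreement(champion_preds, challenger_preds):
    """Fraction of requests on which both models predicted the same class."""
    if len(champion_preds) != len(challenger_preds):
        raise ValueError("Prediction lists must have the same length")
    if not champion_preds:
        raise ValueError("Need at least one prediction")
    matches = sum(a == b for a, b in zip(champion_preds, challenger_preds))
    return matches / len(champion_preds)


def shadow_gate(champion_preds, challenger_preds, min_agreement=0.95):
    """Pass only if the challenger agrees with production often enough."""
    return shadow_agreement(champion_preds, challenger_preds) >= min_agreement


champion = ["Healthy", "Early Blight", "Late Blight", "Healthy"]
challenger = ["Healthy", "Early Blight", "Healthy", "Healthy"]
print(shadow_agreement(champion, challenger))
print(shadow_gate(champion, challenger))
```

Agreement alone does not say which model is right. A low score is a signal to inspect the disagreements against labeled data before promoting or rejecting the challenger.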

Section 6

Visual Diagrams

6A. End-to-End MLOps Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           MLOps ARCHITECTURE                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────┐   ┌──────────┐   ┌───────────┐   ┌──────────────────┐     │
│  │   Data   │   │ Feature  │   │   Model   │   │      Model       │     │
│  │   Lake   │──▶│  Store   │──▶│ Training  │──▶│     Registry     │     │
│  │ (S3/GCS) │   │ (Feast)  │   │   (GPU)   │   │     (MLflow)     │     │
│  └──────────┘   └──────────┘   └───────────┘   └────────┬─────────┘     │
│       │         │                                       │               │
│       │         │                                   ┌───┘               │
│       │         │  Experiment Tracking              │                   │
│       │         │  (MLflow / W&B)                   ▼                   │
│       │         │                          ┌─────────────────┐          │
│       │         │                          │      CI/CD      │          │
│       │         │                          │     (GitHub     │          │
│       │         │                          │    Actions)     │          │
│       │         │                          └────────┬────────┘          │
│       │         │                                   │                   │
│       │         │                                   ▼                   │
│       │         │                          ┌─────────────────┐          │
│       │         │    ┌───REST/gRPC────────▶│     Serving     │          │
│       │         │    │                     │  (TF Serving /  │          │
│       │         │    │  ┌──────────────┐   │    FastAPI)     │          │
│       │         │    │  │ API Gateway  │   └────────┬────────┘          │
│       │         │    │  │ + Load       │            │                   │
│       │         │    │  │ Balancer     │◀───────────┘                   │
│       │         │    │  └──────┬───────┘                                │
│       │         │    │         │                                        │
│       │         │    │         ▼                                        │
│       │         │    │  ┌──────────────┐   ┌─────────────────┐          │
│       │         │    │  │  Monitoring  │──▶│    Alerting     │          │
│       │         │    │  │ (Prometheus  │   │  (PagerDuty /   │          │
│       │         │    │  │  + Grafana)  │   │     Slack)      │          │
│       │         │    │  └──────────────┘   └─────────────────┘          │
│       │         │    │                                                  │
└───────┼─────────┼────┼──────────────────────────────────────────────────┘
        │         │    │
        │         ▼    │
        │    ┌─────────┴──┐
        │    │   Users    │
        │    │ (Mobile /  │
        │    │  Web /     │
        │    │ Streamlit) │
        │    └────────────┘
        │
        └─── Data Pipeline (Airflow / Prefect) ──▶ Retrain Trigger

6B. Model Serving Comparison

SERVING OPTIONS COMPARED
┌─────────────────────────────────────────────────────────┐
│                                                         │
│  Simple                                        Complex  │
│  ◄───────────────────────────────────────────────────►  │
│                                                         │
│  Flask/FastAPI     TF Serving        Kubernetes         │
│  + Gunicorn        + Docker          + Istio            │
│                                                         │
│  • 1-10 req/s      • 100-10K req/s   • 10K-1M req/s     │
│  • Prototype       • Production      • Enterprise       │
│  • ₹500/mo         • ₹5,000/mo       • ₹50,000/mo       │
│  • 1 developer     • 2-3 developers  • ML Platform team │
│                                                         │
│  ┌─────────┐       ┌─────────┐       ┌─────────┐        │
│  │  MVP /  │ ───▶  │  Scale  │ ───▶  │ Enter-  │        │
│  │  Demo   │       │   Up    │       │ prise   │        │
│  └─────────┘       └─────────┘       └─────────┘        │
│                                                         │
└─────────────────────────────────────────────────────────┘

6C. Drift Detection Flow

DRIFT DETECTION PIPELINE

┌──────────┐     ┌──────────────┐     ┌──────────────┐
│Production│     │ Statistical  │     │   Decision   │
│   Data   │────▶│    Tests     │────▶│    Engine    │
└──────────┘     └──────────────┘     └──────┬───────┘
                                             │
     ┌───────────────────────────────────────┼──────────────┐
     │                                       │              │
     ▼                                       ▼              ▼
┌──────────┐                            ┌──────────┐    ┌────────┐
│    No    │                            │  Alert   │    │Retrain │
│  Action  │                            │   Team   │    │Trigger │
└──────────┘                            └──────────┘    └────────┘

Tests Used:
├── KS Test (Kolmogorov-Smirnov)  → Data Drift
├── PSI (Population Stability)    → Prediction Drift
├── Chi-Square                    → Categorical Features
└── Page-Hinkley                  → Concept Drift
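As one concrete instance of the "statistical test → decision engine" flow, the sketch below computes the Population Stability Index between a baseline and a current class distribution, then maps the value to one of the three outcomes. The 0.1 and 0.25 cut-offs are a widely used rule of thumb rather than values prescribed by this chapter, and the two distributions are invented for illustration.

```python
import math


def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    if len(expected_fracs) != len(actual_fracs):
        raise ValueError("Both distributions need the same bins")
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e = max(e, eps)   # avoid log(0) and division by zero for empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def decide(psi_value, warn=0.1, retrain=0.25):
    """Map a PSI value onto the three outcomes of the decision engine."""
    if psi_value < warn:
        return "no_action"
    if psi_value < retrain:
        return "alert_team"
    return "retrain_trigger"


baseline = [0.60, 0.20, 0.15, 0.05]    # hypothetical training-time shares
this_week = [0.40, 0.30, 0.20, 0.10]   # hypothetical production shares
score = psi(baseline, this_week)
print(f"PSI = {score:.3f} -> {decide(score)}")
```

Each term `(a - e) * log(a / e)` is non-negative, so PSI is zero only when the two distributions match bin for bin and grows as they diverge.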

Section 7

Worked Example — From Notebook to Production in 7 Steps

Let's trace the complete journey of deploying the plant disease classifier for a Kisan (farmer) app targeting 10,000 daily users across Uttar Pradesh and Maharashtra.

📝 Problem Statement

AgriTech startup "KhetScan" (fictional) wants to deploy the Chapter 17 plant disease model as a mobile-first service. Requirements:

Handle 10,000 daily predictions (peak: 100/minute during morning farm visits)
Latency < 500 ms per prediction
Support offline mode for low-connectivity areas (2G regions)
Cost budget: ₹15,000/month for cloud infrastructure
Auto-retrain monthly on new labeled images from the field
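Before choosing infrastructure, it helps to turn these requirements into request rates. The back-of-envelope sketch below uses Little's law (requests in flight = arrival rate × time in system). The per-instance concurrency of 80 is an assumption borrowed from the `--concurrency` flag in the deployment script; a CPU-bound model may sustain far fewer simultaneous requests per instance.

```python
import math

DAILY_PREDICTIONS = 10_000
PEAK_PER_MINUTE = 100
LATENCY_S = 0.5                 # worst case allowed by the latency requirement
CONCURRENCY_PER_INSTANCE = 80   # assumed, from the --concurrency flag

avg_rps = DAILY_PREDICTIONS / 86_400      # seconds per day
peak_rps = PEAK_PER_MINUTE / 60
in_flight = peak_rps * LATENCY_S          # Little's law: L = lambda * W
instances = max(1, math.ceil(in_flight / CONCURRENCY_PER_INSTANCE))

print(f"average load : {avg_rps:.2f} req/s")
print(f"peak load    : {peak_rps:.2f} req/s")
print(f"in flight    : {in_flight:.2f} requests at peak")
print(f"instances    : {instances}")
```

Even at peak, fewer than one request is in flight on average, so a single warm instance covers the stated load and the auto-scaling ceiling mainly provides headroom for bursts.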

Step 1: Export the Trained Model

# We trained this in Chapter 17 — now export for production
import os
import tensorflow as tf

model = tf.keras.models.load_model("plant_disease_ch17.h5")
print(f"Model params: {model.count_params():,}")      # 23,587,716
print(f"Model size: {os.path.getsize('plant_disease_ch17.h5')/1e6:.1f} MB")  # 94.3 MB

# Save as SavedModel for TF Serving
tf.saved_model.save(model, "./models/plant_disease/1")

# Also create TFLite for offline mobile
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite = converter.convert()
with open("plant_disease.tflite", "wb") as f:
    f.write(tflite)
print(f"TFLite size: {len(tflite)/1e6:.1f} MB")  # 24.1 MB

Step 2: Build FastAPI Server (from Section 21.3)

Use the complete app.py from Section 21.3. Add request logging for monitoring:

# Add to app.py — Log every prediction for monitoring
import csv
from datetime import datetime

def log_prediction(filename, predicted_class, confidence, latency):
    """Append prediction to CSV log for drift monitoring."""
    with open("predictions_log.csv", "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([
            datetime.now().isoformat(),
            filename,
            predicted_class,
            confidence,
            latency
        ])

Step 3: Dockerize (from Section 21.4)

Use the Dockerfile from Section 21.4. Build and test locally.

Step 4: Deploy to Cloud Run (Mumbai region)

# asia-south1 = Mumbai → lowest latency for Indian users
$ gcloud run deploy khetscan-api \
    --image gcr.io/khetscan/plant-api:v1 \
    --region asia-south1 \
    --memory 2Gi --cpu 2 \
    --min-instances 1 --max-instances 5 \
    --allow-unauthenticated

# Estimated Cloud Run cost: ~₹6,500/month for this config (itemized in Step 5)

Step 5: Cost Estimation (INR)

Component	Monthly Cost (₹)	Notes
Cloud Run (2 vCPU, 2 GB)	₹6,500	min-instances=1, auto-scales to 5
Container Registry	₹200	1.2 GB image storage
Cloud Logging	₹500	First 50 GB free
MLflow Server (e2-small)	₹1,800	Experiment tracking
Cloud Storage (model artifacts)	₹100	~5 GB
Total	₹9,100/month	Within ₹15,000 budget ✅
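The total can be checked, and kept up to date as line items change, with a few lines of arithmetic. The figures below are copied from the table; only the variable names are new.

```python
MONTHLY_COSTS_INR = {
    "Cloud Run (2 vCPU, 2 GB)": 6_500,
    "Container Registry": 200,
    "Cloud Logging": 500,
    "MLflow Server (e2-small)": 1_800,
    "Cloud Storage (model artifacts)": 100,
}
BUDGET_INR = 15_000

total = sum(MONTHLY_COSTS_INR.values())
headroom = BUDGET_INR - total
largest_item = max(MONTHLY_COSTS_INR, key=MONTHLY_COSTS_INR.get)

print(f"Total: ₹{total:,}/month ({total / BUDGET_INR:.0%} of budget)")
print(f"Headroom: ₹{headroom:,}/month")
print(f"Largest line item: {largest_item}")
```

Cloud Run accounts for roughly 70% of the bill, so it is the first place to look if costs need to come down.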

Step 6: Set Up Monitoring

# Weekly drift check cron job
# Run every Monday at 6 AM IST
# crontab: 0 6 * * 1 python check_drift.py

import pandas as pd
import numpy as np

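# Load prediction logs and keep only the past week
logs = pd.read_csv("predictions_log.csv",
                   names=["timestamp", "file", "class",
                          "confidence", "latency"],
                   parse_dates=["timestamp"])
week_ago = pd.Timestamp.now() - pd.Timedelta(days=7)
logs = logs[logs["timestamp"] >= week_ago]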

# Check 1: Prediction distribution shift
class_dist = logs["class"].value_counts(normalize=True)
print("Prediction distribution this week:")
print(class_dist)
# Expected: Healthy ~60%, Early Blight ~20%, Late Blight ~15%, ...
# If Healthy drops below 40%, something may have changed

# Check 2: Confidence decay
avg_confidence = logs["confidence"].mean()
print(f"Average confidence: {avg_confidence:.3f}")
if avg_confidence < 0.7:
    print("⚠️ ALERT: Confidence dropping — possible data drift!")
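The "Healthy drops below 40%" rule in Check 1 can be replaced by a chi-square goodness-of-fit test, which looks at all classes at once. The sketch below uses only the standard library. The baseline shares come from the comment in the drift check; the "Other" bucket for the remaining 5% is an assumption so that the shares sum to 1, and 7.815 is the chi-square critical value for 3 degrees of freedom at a 5% significance level.

```python
from collections import Counter

# Baseline shares; "Other" is an assumed bucket for the remaining classes.
BASELINE = {"Healthy": 0.60, "Early Blight": 0.20,
            "Late Blight": 0.15, "Other": 0.05}
CHI2_CRITICAL_DF3 = 7.815   # df = 4 classes - 1, alpha = 0.05


def chi_square_stat(predicted_classes, baseline=BASELINE):
    """Chi-square statistic of observed class counts against the baseline."""
    counts = Counter(c if c in baseline else "Other"
                     for c in predicted_classes)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("No predictions to test")
    return sum((counts.get(label, 0) - share * n) ** 2 / (share * n)
               for label, share in baseline.items())


def distribution_shifted(predicted_classes):
    return chi_square_stat(predicted_classes) > CHI2_CRITICAL_DF3


week = (["Healthy"] * 40 + ["Early Blight"] * 30
        + ["Late Blight"] * 20 + ["Bacterial Spot"] * 10)
print(f"chi-square = {chi_square_stat(week):.2f}")
print("shifted" if distribution_shifted(week) else "stable")
```

With thousands of predictions per week, even small shifts become statistically significant, so in practice the test is best paired with an effect-size threshold before it pages anyone.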

Step 7: TFLite Integration for Offline Mode

The TFLite model (24 MB) is bundled with the Android app. When the farmer has no internet, inference runs locally on the phone. When connectivity is available, predictions are sent to the cloud API for logging and retraining data collection.

Deployment Decision Matrix:
Online (Cloud API): Latency ~200ms, always latest model, requires internet
Offline (TFLite): Latency ~60ms, bundled model version, no internet needed
Hybrid: Try cloud first, fallback to TFLite — best of both worlds ✅
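The hybrid strategy reduces to a try/except around the network call. The real implementation would live in the Android app (Kotlin or Java); the Python sketch below only illustrates the control flow, and `cloud_predict` and `local_predict` are hypothetical callables standing in for the HTTP client and the TFLite interpreter.

```python
def predict_hybrid(image_bytes, cloud_predict, local_predict):
    """Try the cloud API first; fall back to the on-device model on failure."""
    try:
        result = cloud_predict(image_bytes)
        return {**result, "source": "cloud"}
    except OSError:
        # ConnectionError and TimeoutError are both subclasses of OSError
        result = local_predict(image_bytes)
        return {**result, "source": "tflite"}


def flaky_cloud(_image):
    """Stub that simulates a 2G connection timing out."""
    raise TimeoutError("network too slow")


def working_cloud(_image):
    return {"class": "Late Blight", "confidence": 0.93}


def on_device(_image):
    return {"class": "Late Blight", "confidence": 0.91}


print(predict_hybrid(b"fake-jpeg", flaky_cloud, on_device))
print(predict_hybrid(b"fake-jpeg", working_cloud, on_device))
```

Tagging each result with its `source` matters for monitoring: predictions made offline never reach the server log unless the app uploads them later, which would otherwise bias the drift statistics toward well-connected users.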

Section 8

Case Study — Infosys ML Platform & MLflow

🏢 Infosys Nia — Enterprise AI Platform for Indian Enterprises

Background

Infosys, India's second-largest IT services company (revenue: ₹1.5 lakh crore/year, 300,000+ employees), needed to standardize ML deployment across 200+ client engagements spanning banking, telecom, manufacturing, and retail.

The Challenge

Before their ML platform (Infosys Nia), each project team built deployment pipelines from scratch:

Team A used Flask + pickle on AWS EC2
Team B used Django + ONNX on Azure
Team C used custom gRPC on-premise servers
No experiment tracking — "Which model version is in production?" had no answer
Average time from model-ready to production: 4–6 months

MLOps Solution

Infosys built a centralized ML platform with these components:

Component	Tool	Purpose
Experiment Tracking	MLflow	Log all training runs, parameters, metrics
Model Registry	MLflow + Custom	Version control with approval workflows
Feature Store	Feast	Centralized, reusable feature pipelines
Serving	TF Serving + Triton	Unified serving with A/B testing
Monitoring	Prometheus + Grafana	Drift detection, latency, throughput
Orchestration	Kubeflow Pipelines	End-to-end ML pipeline automation
CI/CD	Jenkins + GitHub Actions	Automated testing and deployment

Results

Deployment time reduced from 4–6 months to 2–3 weeks
Model reproducibility: 100% of experiments tracked and reproducible
Cost savings: ~₹2 crore/year in reduced engineering overhead
Standardization: 50+ projects using the same platform within 1 year
Client satisfaction: 35% improvement in project delivery timelines

Key Lesson

The biggest ROI came not from fancy algorithms but from infrastructure standardization. Having a single, well-documented deployment path meant any data scientist could deploy a model in 2 weeks instead of 4 months. The platform paid for itself in 6 months.

TCS iON uses a similar platform for their education technology products. Their exam proctoring ML model (detecting cheating in online exams) is deployed across 50+ exam centers in India using Kubernetes with edge inference — each center runs a local TFLite model for real-time face detection, while the heavier behavioral analysis model runs on their Mumbai data center. This hybrid approach handles the connectivity challenges of tier-2 and tier-3 city exam centers.

Section 9

Common Mistakes & Misconceptions

Mistake #1: "My model works in Jupyter, so it's production-ready."
Jupyter notebooks hide stateful bugs — cells run out of order, global variables leak between cells, and pip install in a notebook doesn't guarantee the same environment in production. Always refactor notebook code into proper Python modules with explicit dependencies.

Mistake #2: Loading the model inside the prediction function.
def predict(data): model = tf.keras.models.load_model("model.h5"); return model.predict(data) — This loads the model on every single request. A 100 MB model takes 3–5 seconds to load. Your API becomes unusably slow. Load once at startup, predict many times.
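The load-once pattern can be demonstrated without TensorFlow. In this minimal sketch, `_expensive_load` is a hypothetical stand-in for `tf.keras.models.load_model`, and `functools.lru_cache` ensures it runs once per process regardless of how many requests arrive.

```python
import functools

LOAD_CALLS = 0   # counts how many times the "model" is loaded


def _expensive_load(path):
    """Stand-in for tf.keras.models.load_model (hypothetical)."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    return lambda data: [x * 2 for x in data]   # fake "model"


@functools.lru_cache(maxsize=1)
def get_model(path="model.h5"):
    """Load the model on first use and reuse it afterwards."""
    return _expensive_load(path)


def predict(data):
    return get_model()(data)


for request in ([1, 2], [3], [4, 5, 6]):
    predict(request)
print(f"requests served: 3, model loads: {LOAD_CALLS}")
```

Loading lazily on first use means the first request pays the load time. Loading eagerly at startup, as the minimal server in Section 4 does, moves that cost before the service starts accepting traffic.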

Mistake #3: No input validation on the /predict endpoint.
Production APIs receive garbage: corrupt images, wrong formats, adversarial inputs, and request payloads 100× larger than expected. Without Pydantic validation and size limits, your server will crash or produce meaningless results. Always validate before inference.
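For image uploads, two cheap checks catch most garbage before the model sees it: a size limit and a look at the file's leading bytes. The sketch below is plain Python rather than Pydantic, and the 5 MB limit and function names are assumptions for illustration. The magic numbers are the standard JPEG and PNG file signatures.

```python
MAX_UPLOAD_BYTES = 5 * 1024 * 1024   # assumed 5 MB limit

MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "image/jpeg",
    b"\x89PNG\r\n\x1a\n": "image/png",
}


def sniff_image_type(data):
    """Identify the format from the file signature, not the filename."""
    for magic, mime in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return mime
    return None


def validate_upload(data):
    """Return (mime_type, None) if acceptable, else (None, error message)."""
    if not data:
        return None, "Empty upload"
    if len(data) > MAX_UPLOAD_BYTES:
        return None, "File too large"
    mime = sniff_image_type(data)
    if mime is None:
        return None, "Not a JPEG or PNG image"
    return mime, None


print(validate_upload(b"\xff\xd8\xff\xe0" + b"\x00" * 100))
print(validate_upload(b"<html>not an image</html>"))
```

A matching signature does not guarantee the file decodes, so the decode step should still sit inside a try/except that returns a 400 response.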

Mistake #4: "Deploy and forget" — no monitoring.
A model deployed without monitoring is guaranteed to silently fail. Real-world data drifts — seasonal changes, user behavior shifts, upstream data pipeline breaks. If you don't monitor, you'll only discover your model is broken when a customer complains (or worse, when revenue drops).

Mistake #5: Using model.predict() instead of model() in TensorFlow.
In TensorFlow 2.x, model.predict() has significant overhead for single-sample inference because it wraps the input in a data pipeline and runs a batched prediction loop with callbacks, machinery designed for large datasets. For serving, call the model directly with model(input_tensor, training=False); depending on model size, this can be 10–50× faster for single inputs.

Mistake #6: Ignoring the ₹ cost of GPU instances for serving.
A p2.xlarge GPU instance on AWS costs upwards of ₹1,800/day (about $0.90/hour at US on-demand rates, and more in some regions). Most inference workloads don't need GPUs: a well-optimized CPU model with batching handles 100+ requests/second. Use GPUs for training, CPUs for serving (unless you have extreme latency requirements).

Section 10

Comparison Table — MLOps Tools

10A. MLflow vs W&B vs DVC vs Kubeflow

Feature	MLflow	Weights & Biases	DVC	Kubeflow
Primary Use	Experiment tracking + registry	Experiment tracking + visualization	Data & model versioning	End-to-end ML pipelines
Open Source	✅ Fully OSS	⚠️ Free tier, paid pro	✅ Fully OSS	✅ Fully OSS
Setup Complexity	Low (pip install)	Very Low (SaaS)	Low (pip install)	High (Kubernetes required)
Experiment Tracking	✅ Good	✅✅ Excellent (best UI)	❌ Not primary focus	⚠️ Basic
Model Registry	✅ Built-in	✅ Model registry	❌ Git-based only	⚠️ Via add-ons
Data Versioning	⚠️ Basic artifacts	⚠️ Artifacts only	✅✅ Primary feature	❌ Not built-in
Pipeline Orchestration	⚠️ MLflow Projects	❌ No	⚠️ DVC pipelines	✅✅ Kubeflow Pipelines
Serving	✅ mlflow models serve	❌ No	❌ No	✅ KServe (formerly KFServing)
Team Size	2–20 (startups, mid)	1–50 (any size)	2–10 (ML engineers)	10–100 (enterprise)
Cost (10-person team)	₹0 (self-hosted)	~₹50,000/mo (Teams)	₹0 (self-hosted)	₹0 (+ infra costs)
Best For	All-round MLOps starter	Research teams, visualization	Data-heavy ML projects	Enterprise K8s shops

10B. Serving Options Compared

Option	Latency	Throughput	Scale	Complexity	Cost (₹/mo)
FastAPI + Uvicorn	~50ms	500–3K req/s	Single server / Cloud Run	Low	₹2,000–₹10,000
TF Serving	~20ms	5K–50K req/s	Docker / Kubernetes	Medium	₹5,000–₹50,000
Triton Inference Server	~10ms	10K–100K req/s	Multi-GPU, multi-model	High	₹50,000+
AWS SageMaker Endpoint	~100ms	1K–10K req/s	Fully managed	Low	₹15,000–₹1,00,000
TFLite (On-device)	~30ms	N/A (local)	Per-device	Medium	₹0

Start with MLflow + FastAPI + Cloud Run. This combination gives you experiment tracking, model registry, production serving, and auto-scaling for under ₹10,000/month. Only move to Kubeflow + Triton when you have 10+ models in production and a dedicated ML platform team.

Section 11

Exercises

Section A — Multiple-Choice Questions (10)

Which of the following is the MOST common reason ML models fail to reach production?

A. Low accuracy on test set
B. Infrastructure and deployment challenges
C. Insufficient training data
D. GPU unavailability

✅ B. According to industry surveys, the "deployment gap" — infrastructure, versioning, monitoring, and organizational challenges — is the #1 reason. Most models achieve adequate accuracy but fail at the deployment stage.

Understand | Beginner

What is the primary advantage of ONNX over TensorFlow SavedModel?

A. Faster inference speed
B. Smaller file size
C. Framework-agnostic interoperability
D. Built-in model versioning

✅ C. ONNX is designed to be framework-agnostic — you can train in PyTorch and deploy with ONNX Runtime, TensorRT, or any other ONNX-compatible runtime. SavedModel is limited to the TensorFlow ecosystem.

Remember | Beginner

In FastAPI, why should you load the ML model in @app.on_event("startup") rather than inside the /predict endpoint?

A. FastAPI doesn't support loading models inside endpoints
B. To avoid loading the model on every request (which adds seconds of latency)
C. Because Python garbage collection deletes the model after each request
D. For security reasons: endpoints can't access the filesystem

✅ B. Model loading is expensive (2–10 seconds for large models). Loading at startup means it's done once, and all subsequent requests use the already-loaded model from memory. Loading per-request would make every prediction take seconds instead of milliseconds.

Understand | Intermediate

What does a multi-stage Docker build accomplish for ML serving?

A. Runs multiple models simultaneously
B. Reduces final image size by separating build and runtime dependencies
C. Enables GPU access inside containers
D. Automatically scales the number of containers

✅ B. Multi-stage builds install compilation tools and build dependencies in one stage, then copy only the compiled packages to a slim final image. This can reduce image size from 3+ GB to under 1.5 GB.

Understand | Intermediate

Which type of drift occurs when the relationship between input features and the target variable changes over time?

A. Data drift (covariate shift)
B. Prediction drift (output drift)
C. Concept drift
D. Feature drift

✅ C. Concept drift occurs when the underlying relationship P(Y|X) changes. For example, post-COVID dining behavior changed, so the mapping from user features to "will order" fundamentally shifted — even though the features themselves looked similar.

Remember | Beginner

TensorFlow Serving's model versioning stores models in numbered directories (/1, /2, /3). What happens when you add a new directory /4?

A. Nothing; you must restart the server
B. TF Serving automatically detects and loads the new version
C. The old versions are deleted
D. TF Serving crashes and needs manual intervention

✅ B. TF Serving has a file system poller that automatically detects new model versions. It loads the new version and, by default, unloads old versions — enabling zero-downtime model updates.

Remember | Intermediate

You quantize a model from Float32 to Int8. What is the approximate reduction in model size?

A. 2×
B. 4×
C. 8×
D. 16×

✅ B. Float32 uses 4 bytes per weight. Int8 uses 1 byte per weight. So the model size is reduced by approximately 4× (from ~98 MB to ~25 MB in our example). Actual reduction varies slightly due to model structure overhead.

Apply | Beginner

In A/B testing for ML models, a "champion-challenger" setup means:

A. Two models compete to be the fastest
B. The current production model (champion) receives most traffic while a new model (challenger) gets a small percentage
C. Both models are trained simultaneously on different data
D. The model with higher accuracy automatically replaces the other

✅ B. Champion-challenger is a safe deployment strategy. The champion (current best) handles 90%+ of traffic while the challenger (new candidate) gets 5–10%. If the challenger's real-world metrics beat the champion's, it's promoted to champion.

Understand | Intermediate

Which MLOps tool would you recommend for a 3-person startup that needs experiment tracking and model registry at zero cost?

A. Kubeflow (requires Kubernetes cluster)
B. Weights & Biases (paid Teams plan)
C. MLflow (open-source, self-hosted)
D. AWS SageMaker (managed, expensive)

✅ C. MLflow is fully open-source, can be installed with pip install mlflow, and provides experiment tracking + model registry at zero licensing cost. Kubeflow requires Kubernetes (too complex for 3 people), W&B Teams is paid, and SageMaker costs ₹15,000+/month.

Evaluate | Intermediate

Q10

Why is pickle NOT recommended for production model serialization?

A. Pickle files are too large
B. Pickle doesn't support neural networks
C. Pickle is Python-version dependent, insecure, and has no graph optimization
D. Pickle only works on Windows

✅ C. Pickle files are tightly coupled to the Python version and library versions used to create them. They can also execute arbitrary code upon loading (security risk), and they don't support computation graph optimizations like operator fusion or quantization.

Understand | Beginner

Section B — Short Answer Questions (5)

B1 Intermediate

Explain the difference between data drift and concept drift with an example from an Indian e-commerce company like Flipkart.

Data drift: The distribution of input features changes. Example: Flipkart's recommendation model was trained on data from metro cities, but as Flipkart expanded to tier-3 cities, the input feature distributions (average order value, product categories, browsing patterns) shifted significantly.

Concept drift: The relationship between inputs and outputs changes. Example: During Diwali season, the same user features (age, location, past purchases) map to completely different purchase behaviors — users who normally buy electronics start buying gifts and sweets. The mapping P(purchase|features) has changed.

B2 Beginner

List the six stages of the ML lifecycle and briefly describe what happens at each stage.

1. Data Collection & Preparation: Gather, clean, label, and version datasets.
2. Feature Engineering & Training: Extract features, select architecture, tune hyperparameters, train model.
3. Evaluation & Validation: Test set metrics, cross-validation, fairness audits.
4. Deployment (Serving): Package as API/service, containerize, deploy to cloud/edge.
5. Monitoring: Track prediction quality, data drift, latency, throughput.
6. Retraining & CI/CD: Automated retraining pipelines, continuous integration/delivery for models.

B3 Intermediate

Why does FastAPI outperform Flask for ML serving? Mention at least three technical reasons.

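1. ASGI vs WSGI: FastAPI uses ASGI (async), allowing non-blocking I/O. Flask uses WSGI (sync), so each worker blocks on one request at a time. This matters when requests wait on I/O, such as receiving image uploads or calling a feature store; CPU-bound steps like image decoding and inference still need a thread pool or multiple worker processes.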
2. Pydantic validation: FastAPI automatically validates request/response schemas via type hints. Flask requires manual validation code.
3. Performance: FastAPI benchmarks at ~3000+ req/s vs Flask's ~500 req/s due to Starlette's async event loop.
4. Auto-generated OpenAPI docs: FastAPI auto-generates interactive /docs endpoint — invaluable for frontend teams consuming your ML API.

B4 Intermediate

What is the purpose of a .dockerignore file in ML projects? Why is it especially important for ML compared to regular web apps?

.dockerignore prevents specified files from being included in the Docker build context. For ML projects, this is critical because ML directories contain massive files: training datasets (10+ GB), model checkpoints (multiple 100 MB files), Jupyter notebooks, wandb logs, and TensorBoard event files. Without .dockerignore, COPY . . would include all of these, creating a Docker image that's 20-50 GB instead of 1-2 GB, taking hours to build and push.

B5 Advanced

Explain why you might choose TFLite edge deployment over cloud API deployment for an agricultural app targeting Indian farmers.

1. Connectivity: Rural India (where most farmers are) often has unreliable 2G/3G connectivity. Cloud APIs require stable internet; TFLite works fully offline.
2. Latency: Cloud API adds ~200-500ms network latency. TFLite inference is ~30-60ms locally — critical for real-time camera-based disease detection.
3. Cost: No cloud server costs per prediction. For 10,000 daily users, cloud inference costs add up, whereas on-device inference adds no per-prediction serving cost.
4. Privacy: Farm images stay on the device — no sensitive location/crop data sent to cloud servers.
5. Device compatibility: TFLite runs on low-end Android phones (₹5,000-₹8,000 range) that most Indian farmers use.

Section C — Long Answer Questions (3)

C1 Intermediate

Design an MLOps pipeline for a Paytm-like company deploying a fraud detection model. Include: (a) experiment tracking setup, (b) model serving architecture, (c) monitoring strategy, (d) retraining triggers. Draw a system architecture diagram as part of your answer. [15 marks]

C2 Advanced

Compare the three types of drift (data, prediction, concept) in detail. For each type: (a) define it mathematically, (b) provide a real-world Indian industry example, (c) describe the detection method (statistical test), and (d) explain the remediation strategy. [15 marks]

C3 Advanced

A startup is choosing between four MLOps tool stacks for their 8-person ML team with a monthly budget of ₹50,000. Compare: (a) MLflow + FastAPI + Cloud Run, (b) W&B + TF Serving + GKE, (c) DVC + BentoML + AWS Lambda, (d) Kubeflow + Triton + On-premise. Evaluate each on cost, complexity, scalability, and team skill requirements. Recommend one with justification. [15 marks]

Section D — Programming Exercises (2)

D1 Intermediate

Deploy a Sentiment Analysis Model as a REST API

Take a pre-trained sentiment analysis model (you may use a simple TF/Keras text classifier or HuggingFace pipeline) and:

Write a FastAPI application with a /analyze endpoint that accepts JSON with a "text" field
Add Pydantic input/output validation (text length: 1–5000 chars, output: sentiment label + confidence)
Add a /health endpoint and a /batch endpoint (accepts list of texts)
Write a Dockerfile to containerize the application
Include at least 3 unit tests using pytest and httpx

Hint: For a quick sentiment model, use from transformers import pipeline; classifier = pipeline("sentiment-analysis")

D2 Advanced

Build a Production Monitoring Script

Create a Python monitoring script that:

Reads a CSV log of predictions (columns: timestamp, input_features, predicted_class, confidence)
Computes data drift using KS test for numerical features and Chi-square test for categorical features
Computes prediction drift using Population Stability Index (PSI)
Generates an HTML drift report with color-coded alerts (green=OK, yellow=warning, red=critical)
Sends a Slack notification if any drift metric exceeds the threshold

Use the DriftMonitor class from Section 21.7 as a starting point. Extend it with Chi-square test and HTML report generation.
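As a reference point for the PSI step, here is a minimal standard-library sketch for the predicted_class column. It is not the chapter's DriftMonitor implementation. It uses the common definition PSI = Σ (a_i − e_i) · ln(a_i / e_i), where e_i and a_i are the baseline and current proportions of class i. The 0.1 and 0.25 cut-offs are a widely used rule of thumb and are assumed here, as are the class names and the `eps` floor for empty classes.

```python
import math
from collections import Counter


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two lists of class labels."""
    e_counts, a_counts = Counter(expected), Counter(actual)
    total = 0.0
    for category in sorted(set(expected) | set(actual)):
        # Floor proportions at eps so an empty class does not produce log(0)
        e = max(e_counts[category] / len(expected), eps)
        a = max(a_counts[category] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total


def alert_level(value, warn=0.1, crit=0.25):
    """Map a PSI value to the color codes used in the drift report."""
    if value >= crit:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"


# Hypothetical logs: baseline week vs. current week of predicted classes
baseline = ["healthy"] * 80 + ["blight"] * 20
current = ["healthy"] * 50 + ["blight"] * 50

score = psi(baseline, current)
print(f"PSI = {score:.4f} -> {alert_level(score)}")
```

For numerical columns such as confidence, the same function can be applied after binning the values into fixed buckets derived from the baseline.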

Section E — Mini-Project

🚀 Mini-Project: End-to-End MLOps Pipeline for Indian Crop Disease Detection

Objective

Build a complete, production-ready ML deployment pipeline for a crop disease detection service targeting Indian farmers.

Requirements

Model Training (with MLflow): Train the plant disease classifier from Chapter 17. Log at least 5 experiment runs with different hyperparameters using MLflow. Register the best model in MLflow Model Registry.
FastAPI Server: Build a production API with:
- /predict — Image upload → disease prediction
- /health — Server health check
- /model-info — Returns model version, training date, accuracy
- Request logging to CSV for drift monitoring
Docker + Cloud Deploy: Containerize with Docker (multi-stage build). Deploy to Google Cloud Run (asia-south1 region). Document the deployment cost estimation in INR.
Streamlit Frontend: Build a user-friendly UI that allows image upload and camera capture. Display predictions with confidence bars. Include Hindi/English language toggle.
Monitoring: Implement weekly drift detection using the DriftMonitor class. Generate automated drift reports. Set up alerting (email or Slack).
TFLite Conversion: Convert the model to TFLite (Int8 quantized). Benchmark accuracy vs. original. Document size reduction.
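When documenting the size reduction, it helps to know what to expect. The back-of-the-envelope sketch below assumes weights dominate the file size and that every weight moves from 4-byte float32 to 1-byte int8; the parameter count is a hypothetical placeholder, not the Chapter 17 model's actual size. Real TFLite files also store quantization scales, zero points, and metadata, and some ops may stay in float, so measured ratios are usually close to 4× rather than exact.

```python
BYTES_FLOAT32 = 4
BYTES_INT8 = 1


def model_size_mb(num_params: int, bytes_per_weight: int) -> float:
    """Approximate weight storage in MiB, ignoring metadata and overhead."""
    return num_params * bytes_per_weight / (1024 ** 2)


num_params = 3_500_000  # hypothetical parameter count for illustration

fp32_mb = model_size_mb(num_params, BYTES_FLOAT32)
int8_mb = model_size_mb(num_params, BYTES_INT8)
ratio = fp32_mb / int8_mb

print(f"float32: {fp32_mb:.2f} MB, int8: {int8_mb:.2f} MB, ratio: {ratio:.1f}x")
```

Comparing this estimate with the actual file sizes on disk (for example via os.path.getsize) is a quick sanity check that quantization was applied to the whole model.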

Deliverables

GitHub repository with clean README, all code, and Dockerfile
MLflow experiment tracking screenshots (5+ runs)
Deployed Cloud Run URL (working for at least 1 week)
Streamlit app demo video (2 minutes)
Monitoring report showing drift analysis on synthetic data
Cost analysis document (projected monthly cost in ₹ for 1K, 10K, 100K daily users)

Evaluation Rubric

Criterion	Marks
MLflow integration (5+ tracked experiments)	15
FastAPI server (clean code, validation, error handling)	20
Docker + Cloud Run deployment (working URL)	20
Streamlit UI (user-friendly, camera support)	15
Monitoring & drift detection	15
TFLite conversion + benchmarks	10
Documentation & code quality	5
Total	100

Section 12

Chapter Summary

🧠 Key Takeaways from Chapter 21

The Deployment Gap is Real: Only ~53% of ML prototypes reach production. The gap isn't about model accuracy — it's about infrastructure, monitoring, and engineering.
ML Lifecycle: Data → Train → Evaluate → Deploy → Monitor → Retrain. Production ML requires mastering all six stages, not just the first three.
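Model Serialization: Use SavedModel (TensorFlow), ONNX (cross-framework), or TorchScript (PyTorch) for production. Never use pickle: unpickling can execute arbitrary code, and pickled models are tied to the exact Python and library versions that created them.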
FastAPI for ML Serving: Load model at startup, validate inputs with Pydantic, use async handlers, add health checks, and auto-generate API docs.
Docker: Containerize everything for reproducibility. Use multi-stage builds and .dockerignore to keep images small. Run as non-root user for security.
TF Serving: Purpose-built for serving TF models at scale. Supports automatic model versioning, request batching, and A/B testing via version-numbered directories.
Model Versioning: Use MLflow for experiment tracking and model registry. Tag every run with Git commit hash. Promote models through stages: Staging → Production → Archived.
Monitoring is Non-Negotiable: Track data drift (KS test), prediction drift (PSI), and concept drift. A deployed model without monitoring will silently degrade.
Edge Deployment: TFLite with Int8 quantization reduces model size by ~4× with minimal accuracy loss. Essential for Indian low-connectivity scenarios.
Start Simple: MLflow + FastAPI + Cloud Run is the ideal MLOps starter stack. It costs under ₹10,000/month and scales to 10K+ daily users.

The MLOps Maturity Model:
Level 0: Manual Jupyter notebooks → Level 1: Automated training + manual deploy → Level 2: CI/CD for models + monitoring → Level 3: Full automation with drift-triggered retraining

Section 13

References & Further Reading

Foundational Papers

Sculley, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." Advances in Neural Information Processing Systems (NeurIPS). — The landmark paper on ML infrastructure complexity.
Paleyes, A., Urma, R.-G., & Lawrence, N. D. (2022). "Challenges in Deploying Machine Learning: A Survey of Case Studies." ACM Computing Surveys, 55(6). — Comprehensive survey of deployment failures.
Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2018). "Data Lifecycle Challenges in Production Machine Learning." ACM SIGMOD Record, 47(2).

Framework Documentation

FastAPI Documentation. https://fastapi.tiangolo.com — Official docs with excellent ML serving examples.
TensorFlow Serving. https://www.tensorflow.org/tfx/serving — Architecture guide and API reference.
MLflow Documentation. https://mlflow.org/docs/latest/index.html — Experiment tracking, model registry, and deployment.
ONNX Runtime. https://onnxruntime.ai — Cross-platform inference engine.
TensorFlow Lite Guide. https://www.tensorflow.org/lite/guide — Mobile and edge deployment.

Books

Huyen, C. (2022). Designing Machine Learning Systems. O'Reilly Media. — The definitive guide to production ML systems.
Gift, N., Deza, A., & Behrman, K. (2021). Practical MLOps. O'Reilly Media. — Hands-on MLOps engineering.
Treveil, M., et al. (2020). Introducing MLOps. O'Reilly Media. — MLOps concepts and best practices.

Indian Industry & Tools

Infosys Nia Platform. https://www.infosys.com/nia.html — Enterprise AI platform case studies.
Docker Documentation. https://docs.docker.com — Containerization best practices.
Google Cloud Run. https://cloud.google.com/run/docs — Serverless container deployment (asia-south1 region).
Streamlit Documentation. https://docs.streamlit.io — Building ML demo UIs.

Monitoring & Drift

Evidently AI. https://evidentlyai.com — Open-source ML monitoring and drift detection.
Gama, J., et al. (2014). "A Survey on Concept Drift Adaptation." ACM Computing Surveys, 46(4). — Comprehensive drift taxonomy and detection methods.