Neural Networks & Deep Learning

Chapter 17: Applied Deep Learning

Computer Vision Projects: From Farm to City

⏱️ Reading Time: ~5 hours  |  📖 Part V: Applied Deep Learning  |  🔨 Project-Driven Chapter

📋 Prerequisites: Chapter 12 (CNNs), Chapter 13 (Modern Architectures), Chapter 14 (Object Detection basics), Python/TensorFlow proficiency

Bloom's Taxonomy Map for This Chapter

Bloom's Level | What You'll Achieve
🔵 Remember | Recall standard CV project pipelines: data collection → preprocessing → model selection → training → evaluation → deployment
🔵 Understand | Explain why MobileNetV2 suits edge deployment, why CTC loss handles variable-length OCR, and how YOLO achieves real-time detection
🟢 Apply | Build 5 complete CV projects: plant disease detection, currency recognition, traffic sign classification, mask detection, and Devanagari OCR
🟡 Analyze | Diagnose model failures through confusion matrices, precision-recall trade-offs, and failure-mode analysis per project
🟠 Evaluate | Choose optimal architectures, hyperparameters, and deployment strategies for real-world constraints (₹8,000 phone, real-time webcam, etc.)
🔴 Create | Design end-to-end deployable CV systems with data pipelines, model optimization (TFLite/ONNX), and production monitoring
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Build a plant disease detection system using MobileNetV2 transfer learning on the PlantVillage dataset, achieve >95% accuracy, and deploy it on a ₹8,000 Android smartphone via TensorFlow Lite
  • Develop an Indian currency note recognition app with voice output for visually impaired users, handling ₹10 to ₹2000 denominations under varied lighting conditions
  • Train a traffic sign recognition model using ResNet-18 transfer learning, adapted specifically for Indian road signs that differ from European/US standards
  • Implement a real-time face mask detection system using YOLOv5, capable of processing webcam feeds at >25 FPS for COVID-compliance monitoring
  • Create a Devanagari handwritten text OCR pipeline using CNN + CTC loss, handling the unique challenges of matras, conjuncts, and shirorekha
  • Apply the complete ML engineering lifecycle, from problem framing and dataset curation to model evaluation, optimization, and deployment, across all five projects
  • Evaluate model performance using domain-appropriate metrics: accuracy, F1-score, mAP, CER/WER, confusion matrices, and Grad-CAM visualizations
  • Convert trained models to deployment formats (TFLite, ONNX, TorchScript) and benchmark inference latency on target hardware
Section 2

Opening Hook: When Pixels Save Livelihoods

🌾 A Farmer in Madhya Pradesh Opens His Phone…

Ramesh, a soybean farmer in Sehore district, notices strange spots on his crop leaves. He can't afford an agronomist visit (₹2,000+). Instead, he opens an app, takes a photo, and within 3 seconds, the app tells him: "Bacterial Blight: spray copper oxychloride immediately." He saves his ₹1.5 lakh crop. The app runs entirely on his ₹8,000 Redmi phone, with no internet needed.

Across the country in Mumbai, Priya, who is visually impaired, receives change at a kirana store. She holds each note in front of her phone camera. "Two Hundred Rupees," the phone announces. No more relying on strangers' honesty.

Both these apps are powered by the same technology you'll build in this chapter: Convolutional Neural Networks deployed on edge devices.

CropIn • Microsoft AI for Good • Google TFLite • IIIT Hyderabad • RBI Accessibility
India's CV opportunity is massive: Agriculture employs 42% of the workforce (₹19.7 lakh crore GDP contribution). Indian roads have 4.6 lakh+ road accidents annually. 63 million visually impaired Indians need accessible currency. 22 official languages need OCR. Computer vision isn't just technology here; it's livelihood infrastructure.
Section 3

Core Concepts: The Applied CV Pipeline

Before diving into projects, let's establish the universal pipeline that every applied computer vision project follows:

🔄 The Applied CV Project Lifecycle

Step 1: Problem Framing

Define the task type (classification, detection, segmentation, OCR), success metrics, and deployment constraints. A ₹8,000 phone can't run ResNet-152.

Step 2: Dataset Engineering

Collect/curate data, handle class imbalance, apply augmentation. Data quality matters more than model complexity, always.

Step 3: Model Selection

Choose architecture based on accuracy-latency trade-off. Transfer learning from ImageNet pretrained models is the default starting point.

Step 4: Training & Validation

Fine-tune with proper learning rate schedules, monitor validation metrics, use early stopping.

Step 5: Evaluation & Error Analysis

Go beyond accuracy: examine confusion matrices, per-class metrics, failure cases, Grad-CAM attention maps.

Step 6: Optimization & Deployment

Quantize (FP32 → INT8), convert to TFLite/ONNX, benchmark latency, wrap in app/API.

Transfer Learning: The 80/20 of Applied CV

Most projects in this chapter use transfer learning: taking a model pretrained on ImageNet (1.2M images, 1000 classes) and adapting it to our specific task (Project 2 trains a small custom CNN instead). This is the single most important technique in applied CV:

Fine-Tuning Strategy: Freeze base layers → Train new head → Gradually unfreeze → Fine-tune with low LR (1e-5)
Backbone | Params | Top-1 Accuracy | Latency (Pixel 3) | Best For
MobileNetV2 | 3.4M | 71.8% | 6.5ms | Mobile/Edge deployment
ResNet-18 | 11.7M | 69.8% | 14ms | Good accuracy-speed balance
ResNet-50 | 25.6M | 76.1% | 32ms | Server-side, high accuracy
EfficientNet-B0 | 5.3M | 77.1% | 11ms | Best accuracy per FLOP
YOLOv5s | 7.2M | n/a | 22ms | Real-time object detection

The golden rule of applied CV: Start with the smallest model that meets your accuracy threshold. You can always scale up, but scaling down a deployed system is painful. MobileNetV2 with proper fine-tuning often matches ResNet-50 on domain-specific tasks with 7× fewer parameters.
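
The golden rule can be read as a small constrained search: keep the backbones that satisfy the latency budget and the accuracy floor, then take the one with the fewest parameters. The sketch below uses the classification rows of the backbone table. The helper name `pick_backbone` and the example thresholds are illustrative assumptions, and the ImageNet top-1 figures only stand in for the validation accuracy you would measure on your own task.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Backbone:
    name: str
    params_m: float     # parameters, in millions
    top1: float         # ImageNet top-1 accuracy, percent
    latency_ms: float   # latency on a Pixel 3, as listed in the table


# Classification rows copied from the backbone table
CANDIDATES = [
    Backbone("MobileNetV2", 3.4, 71.8, 6.5),
    Backbone("ResNet-18", 11.7, 69.8, 14.0),
    Backbone("ResNet-50", 25.6, 76.1, 32.0),
    Backbone("EfficientNet-B0", 5.3, 77.1, 11.0),
]


def pick_backbone(min_top1: float, max_latency_ms: float) -> Optional[Backbone]:
    """Return the smallest backbone meeting both constraints, or None."""
    feasible = [
        b for b in CANDIDATES
        if b.top1 >= min_top1 and b.latency_ms <= max_latency_ms
    ]
    return min(feasible, key=lambda b: b.params_m) if feasible else None


# Hypothetical constraint sets: a phone budget, a server budget, an impossible one
edge_choice = pick_backbone(min_top1=70.0, max_latency_ms=10.0)
server_choice = pick_backbone(min_top1=76.0, max_latency_ms=50.0)
nothing_fits = pick_backbone(min_top1=80.0, max_latency_ms=5.0)

print(edge_choice.name if edge_choice else "no backbone fits")
print(server_choice.name if server_choice else "no backbone fits")
```

When no candidate fits, the function returns `None`, which is the signal to relax a constraint rather than to reach for a larger model.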

Data Augmentation: Your Free Dataset Multiplier

With limited real-world data, augmentation is crucial. Here's the augmentation toolkit we'll use across projects:

Python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Standard augmentation pipeline for classification projects
train_datagen = ImageDataGenerator(
    rescale=1.0/255,
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.15,
    zoom_range=0.2,
    horizontal_flip=True,
    brightness_range=[0.8, 1.2],
    fill_mode='nearest'
)

# Validation: ONLY rescale, NO augmentation
val_datagen = ImageDataGenerator(rescale=1.0/255)
Project 1

🌿 Plant Disease Detection for Indian Agriculture

Plant Disease Detection: Saving ₹90,000 Crore in Annual Crop Loss

Classification • MobileNetV2 • TFLite • Agriculture

Problem Statement

Indian farmers lose approximately 15-25% of their crop yield annually to plant diseases, translating to over ₹90,000 crore in losses. Most smallholder farmers (86% of Indian farmers hold <2 hectares) cannot afford expert agronomist consultations. Early disease detection via smartphone cameras can enable timely intervention, saving crops and livelihoods.

The Indian Council of Agricultural Research (ICAR) and state agricultural universities have been working on digital agriculture. Apps like Plantix (by PEAT GmbH, popular in India) and Kisan Suvidha already demonstrate this concept. Your project replicates this core technology from scratch.

Dataset: PlantVillage

Property | Details
Source | PlantVillage (Penn State University), openly available on Kaggle
Total Images | 54,305 images across 38 classes
Plants Covered | 14 crop species: Tomato, Potato, Apple, Corn, Grape, etc.
Classes | 26 diseases + 12 healthy classes
Resolution | 256 × 256 pixels, RGB
Indian Relevance | Tomato (₹15,000 cr market), Potato (₹12,000 cr), Corn (₹8,000 cr): all major Indian crops

Complete Code: Google Colab Runnable

Python
# =====================================================
# PROJECT 1: Plant Disease Detection using MobileNetV2
# Dataset: PlantVillage (Kaggle)
# Target: >95% accuracy, deployable on ₹8000 phone
# =====================================================

# Step 1: Setup and Dataset Download
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import (
    GlobalAveragePooling2D, Dense, Dropout, BatchNormalization
)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import (
    EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
)

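# Download PlantVillage dataset (on Kaggle/Colab)
# Kaggle mirrors differ in how many of the 38 classes they include, so compare
# the class count printed in Step 2 against NUM_CLASSES before training.
# !kaggle datasets download -d emmarex/plantdisease
# !unzip plantdisease.zip -d ./plantvillage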

# Configuration
IMG_SIZE = 224
BATCH_SIZE = 32
NUM_CLASSES = 38
EPOCHS = 25
DATA_DIR = './plantvillage/PlantVillage'

print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
Python
# Step 2: Data Loading with tf.data pipeline (faster than ImageDataGenerator)

def load_dataset(data_dir, img_size, batch_size, validation_split=0.2):
    """Load dataset using tf.keras.utils for optimal performance."""
    
    train_ds = tf.keras.utils.image_dataset_from_directory(
        data_dir,
        validation_split=validation_split,
        subset='training',
        seed=42,
        image_size=(img_size, img_size),
        batch_size=batch_size,
        label_mode='categorical'
    )
    
    val_ds = tf.keras.utils.image_dataset_from_directory(
        data_dir,
        validation_split=validation_split,
        subset='validation',
        seed=42,
        image_size=(img_size, img_size),
        batch_size=batch_size,
        label_mode='categorical'
    )
    
    class_names = train_ds.class_names
    print(f"Found {len(class_names)} classes:")
    for i, name in enumerate(class_names):
        print(f"  {i}: {name}")
    
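    # Performance optimization: prefetch overlaps data loading with training.
    # No in-memory cache() here: 54,305 float32 images at 224x224x3 need
    # roughly 30 GB, more than a typical Colab VM offers. An extra shuffle on
    # batches is also skipped because image_dataset_from_directory already
    # shuffles the file list.
    AUTOTUNE = tf.data.AUTOTUNE
    train_ds = train_ds.prefetch(AUTOTUNE)
    val_ds = val_ds.prefetch(AUTOTUNE)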
    
    return train_ds, val_ds, class_names

train_ds, val_ds, class_names = load_dataset(DATA_DIR, IMG_SIZE, BATCH_SIZE)
Python
# Step 3: Data Augmentation Layer (built into model for TFLite compatibility)

data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.2),
    tf.keras.layers.RandomZoom(0.2),
    tf.keras.layers.RandomContrast(0.2),
    tf.keras.layers.RandomBrightness(0.1),
], name='augmentation')

# Visualize augmented samples
for images, labels in train_ds.take(1):
    fig, axes = plt.subplots(2, 4, figsize=(14, 7))
    for i in range(4):
        # Original
        axes[0][i].imshow(images[i].numpy().astype("uint8"))
        axes[0][i].set_title("Original", fontsize=9)
        axes[0][i].axis('off')
        # Augmented
        aug_img = data_augmentation(tf.expand_dims(images[i], 0), training=True)
        aug_img = tf.clip_by_value(aug_img, 0, 255)  # keep values valid for uint8
        axes[1][i].imshow(aug_img[0].numpy().astype("uint8"))
        axes[1][i].set_title("Augmented", fontsize=9)
        axes[1][i].axis('off')
    plt.tight_layout()
    plt.show()
Python
# Step 4: Build MobileNetV2 Transfer Learning Model

def build_plant_model(num_classes, img_size=224):
    """Build MobileNetV2-based classifier for plant diseases."""
    
    # Input layer
    inputs = tf.keras.Input(shape=(img_size, img_size, 3))
    
    # Augmentation (only active during training)
    x = data_augmentation(inputs)
    
    # Preprocessing for MobileNetV2 (scale to [-1, 1])
    x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
    
    # Base model: freeze initially
    base_model = MobileNetV2(
        input_shape=(img_size, img_size, 3),
        include_top=False,
        weights='imagenet'
    )
    base_model.trainable = False  # Freeze all layers
    
    x = base_model(x, training=False)
    
    # Classification head
    x = GlobalAveragePooling2D()(x)
    x = BatchNormalization()(x)
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.5)(x)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.3)(x)
    outputs = Dense(num_classes, activation='softmax')(x)
    
    model = Model(inputs, outputs)
    
    print(f"Total params: {model.count_params():,}")
    print(f"Trainable params: {sum(tf.keras.backend.count_params(w) for w in model.trainable_weights):,}")
    
    return model, base_model

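# Size the output layer from the classes actually found on disk so the head
# always matches the one-hot labels produced in Step 2
model, base_model = build_plant_model(len(class_names))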
Total params: 2,587,430
Trainable params: 329,382
Non-trainable params: 2,258,048
Python
# Step 5: Phase 1 Training (train only the head)

model.compile(
    optimizer=Adam(learning_rate=1e-3),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

callbacks_phase1 = [
    EarlyStopping(monitor='val_accuracy', patience=5,
                  restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                      patience=3, min_lr=1e-6),
]

print("Phase 1: Training classification head only...")
history1 = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=callbacks_phase1
)
Python
# Step 6: Phase 2 (unfreeze and fine-tune top layers)

# Unfreeze the last 30 layers of MobileNetV2
base_model.trainable = True
for layer in base_model.layers[:-30]:
    layer.trainable = False

print(f"Fine-tuning {sum(1 for l in base_model.layers if l.trainable)} layers")

# Re-compile with MUCH lower learning rate
model.compile(
    optimizer=Adam(learning_rate=1e-5),  # 100x lower!
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

callbacks_phase2 = [
    EarlyStopping(monitor='val_accuracy', patience=5,
                  restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                      patience=3, min_lr=1e-7),
    ModelCheckpoint('best_plant_model.keras',
                    monitor='val_accuracy',
                    save_best_only=True)
]

print("Phase 2: Fine-tuning with unfrozen layers...")
history2 = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=callbacks_phase2
)
Python
# Step 7: Evaluation (Confusion Matrix and Per-Class Metrics)

from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Predict on validation set
y_true, y_pred = [], []
for images, labels in val_ds:
    preds = model.predict(images, verbose=0)
    y_true.extend(np.argmax(labels.numpy(), axis=1))
    y_pred.extend(np.argmax(preds, axis=1))

y_true, y_pred = np.array(y_true), np.array(y_pred)

# Classification report
print(classification_report(y_true, y_pred,
                            target_names=class_names, digits=3))

# Confusion matrix heatmap
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(16, 14))
sns.heatmap(cm, annot=False, cmap='Purples',
            xticklabels=class_names, yticklabels=class_names)
plt.title('Plant Disease Detection: Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.xticks(rotation=90, fontsize=7)
plt.yticks(fontsize=7)
plt.tight_layout()
plt.show()

# Overall accuracy
accuracy = np.mean(y_true == y_pred)
print(f"\nOverall Accuracy: {accuracy:.4f} ({accuracy*100:.1f}%)")
Val Accuracy: 96.2%  |  Macro F1: 0.95  |  Parameters: 3.4M  |  Phone Latency: 6.5ms
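
Macro F1 can sit below accuracy because it averages the per-class F1 scores with equal weight, so a weak minority class pulls it down even when overall accuracy looks healthy. As a minimal sketch of what `classification_report` computes in Step 7, the snippet below derives precision, recall, and F1 from a confusion matrix with NumPy. The 3-class matrix is a made-up toy example, not a result from this project, and `per_class_metrics` is a hypothetical helper name.

```python
import numpy as np


def per_class_metrics(cm: np.ndarray):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)   # column sums: how often each class was predicted
    actual = cm.sum(axis=1)      # row sums: how many samples each class really has
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1


# Toy confusion matrix (rows = true class, columns = predicted class)
toy_cm = np.array([
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
])

precision, recall, f1 = per_class_metrics(toy_cm)
macro_f1 = f1.mean()                          # unweighted mean over classes
accuracy = np.trace(toy_cm) / toy_cm.sum()    # correct predictions / all samples

print("Per-class F1:", np.round(f1, 3))
print(f"Macro F1: {macro_f1:.3f}  Accuracy: {accuracy:.3f}")
```

Passing the `cm` array from Step 7 to the same function gives per-class numbers that can be sorted to find the weakest disease classes first.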
Python
# Step 8: Convert to TFLite for Mobile Deployment

# Load best model
best_model = tf.keras.models.load_model('best_plant_model.keras')

# Convert to TFLite with INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(best_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Representative dataset for quantization calibration
def representative_dataset():
    for images, _ in val_ds.take(100):
        for img in images:
            yield [tf.expand_dims(img, 0)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

# Convert and save
tflite_model = converter.convert()
tflite_path = 'plant_disease_model.tflite'
with open(tflite_path, 'wb') as f:
    f.write(tflite_model)

# Compare sizes
original_size = os.path.getsize('best_plant_model.keras') / (1024*1024)
tflite_size = os.path.getsize(tflite_path) / (1024*1024)
print(f"Original model: {original_size:.1f} MB")
print(f"TFLite model:   {tflite_size:.1f} MB")
print(f"Compression:    {original_size/tflite_size:.1f}x")
Original model: 10.2 MB
TFLite model:   3.1 MB
Compression:    3.3x
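
Most of the size reduction comes from storing each weight in 8 bits instead of 32. The sketch below illustrates the affine mapping behind 8-bit quantization, real ≈ scale × (q − zero_point), on a random weight matrix. It is a simplified NumPy illustration under stated assumptions, not the TFLite converter's actual code; the function names and the random weights are hypothetical.

```python
import numpy as np


def quantize_int8(x: np.ndarray):
    """Affine (asymmetric) quantization of a float array to int8."""
    qmin, qmax = -128, 127
    # The representable range must include 0 so that zero maps to an integer
    lo = min(float(x.min()), 0.0)
    hi = max(float(x.max()), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = int(round(qmin - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point


def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Map int8 codes back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(256, 128)).astype(np.float32)

q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

max_error = float(np.abs(weights - restored).max())
size_ratio = weights.nbytes / q.nbytes   # float32 bytes vs int8 bytes

print(f"scale={scale:.6f}, zero_point={zero_point}")
print(f"max round-trip error: {max_error:.6f}")
print(f"storage ratio: {size_ratio:.1f}x")
```

The 4x ratio applies to a single weight array. A complete model file also carries graph structure and metadata, so the file-level ratio is usually somewhat lower.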
Python
# Step 9: Inference with TFLite (simulates phone runtime)

def predict_plant_disease(image_path, tflite_path, class_names):
    """Run inference using TFLite interpreter."""
    import time
    
    # Load TFLite model
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    
    # Load and preprocess image
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    img_array = tf.keras.utils.img_to_array(img).astype(np.uint8)
    img_array = np.expand_dims(img_array, axis=0)
    
    # Run inference with timing
    start = time.time()
    interpreter.set_tensor(input_details[0]['index'], img_array)
    interpreter.invoke()
    output = interpreter.get_tensor(output_details[0]['index'])
    latency = (time.time() - start) * 1000
    
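    # Get prediction: dequantize the uint8 scores with the model's own output
    # scale and zero point instead of assuming a fixed divisor of 255
    scale, zero_point = output_details[0]['quantization']
    probs = (output[0].astype(np.float32) - zero_point) * scale
    pred_class = int(np.argmax(probs))
    confidence = float(probs[pred_class])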
    
    print(f"Prediction: {class_names[pred_class]}")
    print(f"Confidence: {confidence:.2%}")
    print(f"Latency:    {latency:.1f} ms")
    
    return class_names[pred_class], confidence

# Example usage
# predict_plant_disease('test_leaf.jpg', 'plant_disease_model.tflite', class_names)

Deployment Considerations

  • Offline-first: Most Indian farms lack reliable internet. TFLite runs entirely on-device.
  • Camera quality: ₹8,000 phones have 8-13 MP cameras, sufficient for leaf close-ups at 224×224.
  • Multi-language UI: Labels should map to Hindi/regional language disease names and remedies.
  • Battery: Single inference uses <10 mJ. Even with 100 scans/day, battery impact is negligible.
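
The battery claim is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a 4000 mAh battery at a nominal 3.85 V; both figures are assumptions typical of budget phones and are not taken from this chapter. It counts only model inference, not the camera, screen, or app overhead.

```python
# Assumed battery: 4000 mAh at a nominal 3.85 V (illustrative, not from the chapter)
BATTERY_MAH = 4000
NOMINAL_VOLTS = 3.85

ENERGY_PER_INFERENCE_J = 10e-3   # upper bound quoted above: 10 mJ per inference
SCANS_PER_DAY = 100

# Ah * V gives watt-hours; multiplying by 3600 s/h converts to joules
battery_joules = BATTERY_MAH / 1000 * NOMINAL_VOLTS * 3600
daily_inference_joules = ENERGY_PER_INFERENCE_J * SCANS_PER_DAY
share_of_battery = daily_inference_joules / battery_joules

print(f"Battery capacity:    {battery_joules:,.0f} J")
print(f"Inference per day:   {daily_inference_joules:.1f} J")
print(f"Share of one charge: {share_of_battery:.4%}")
```

Under these assumptions, a day of scanning costs a few thousandths of a percent of one charge, so the camera preview and display will dominate real-world battery use.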

Improvement Suggestions

  • Add Grad-CAM visualization to show which part of the leaf the model is looking at (builds farmer trust)
  • Extend to Indian-specific crops: Bajra, Jowar, Sugarcane, Mustard
  • Collect field data (not lab-controlled PlantVillage images) for robustness
  • Integrate weather API + disease prediction: "High humidity → Blight risk next week"
Project 2

💵 Indian Currency Note Recognition for Visually Impaired

Currency Note Recognition: Enabling Financial Independence

Classification • Custom CNN • TTS • Accessibility

Problem Statement

India has approximately 63 million visually impaired citizens (WHO, 2023). Despite RBI introducing tactile marks on currency notes, wear and tear makes them unreliable. A smartphone app that recognizes currency denominations and announces them via text-to-speech enables financial independence. The system must recognize ₹10, ₹20, ₹50, ₹100, ₹200, ₹500, and ₹2000 notes under varied lighting and orientations.

The RBI mandates accessibility features on Indian currency. The Mahatma Gandhi (New) Series introduced in 2016 has distinct sizes and colors per denomination, a feature that helps both human recognition and CV models. Your model leverages these visual cues.

Dataset Preparation

Denomination | Dominant Color | Size (mm) | Images to Collect
₹10 | Chocolate Brown | 123 × 63 | 500+
₹20 | Greenish Yellow | 129 × 63 | 500+
₹50 | Fluorescent Blue | 135 × 66 | 500+
₹100 | Lavender | 142 × 66 | 500+
₹200 | Bright Yellow | 146 × 66 | 500+
₹500 | Stone Grey | 150 × 66 | 500+
₹2000 | Magenta | 166 × 66 | 500+

Data Collection Strategy: Photograph each note from multiple angles (0°, 90°, 180°, 270°), both sides, under different lighting (daylight, tube light, dim), and with various backgrounds (wooden table, cloth, palm of hand). Include crumpled and partially folded notes. Aim for 500+ images per class.
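
One way to keep the collection balanced is to enumerate every combination of capture conditions and derive how many photos each one needs. The sketch below builds that grid from the conditions listed in the strategy; the variable names are hypothetical, and note condition (crumpled, folded) is left out as an extra axis you would add for a real shoot.

```python
from itertools import product
import math

# Capture conditions taken from the data collection strategy
ANGLES = [0, 90, 180, 270]
SIDES = ["front", "back"]
LIGHTING = ["daylight", "tube light", "dim"]
BACKGROUNDS = ["wooden table", "cloth", "palm of hand"]
TARGET_PER_CLASS = 500

# Every (angle, side, lighting, background) combination for one denomination
capture_plan = list(product(ANGLES, SIDES, LIGHTING, BACKGROUNDS))

# Photos needed per combination to reach the per-class target
shots_per_condition = math.ceil(TARGET_PER_CLASS / len(capture_plan))
total_per_class = shots_per_condition * len(capture_plan)

print(f"Conditions per denomination: {len(capture_plan)}")
print(f"Shots per condition:         {shots_per_condition}")
print(f"Images per class:            {total_per_class}")
```

Ticking off this grid during collection makes gaps visible early, for example a denomination that was never photographed in dim light.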

Complete Code

Python
# =====================================================
# PROJECT 2: Indian Currency Note Recognition
# Custom CNN with Voice Output (TTS)
# Target: >98% accuracy on 7 denominations
# =====================================================

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Configuration
IMG_SIZE = 128                # Smaller for faster inference
BATCH_SIZE = 32
NUM_CLASSES = 7
EPOCHS = 30
DATA_DIR = './indian_currency_dataset'

# Class mapping
DENOMINATIONS = {
    0: '₹10 (Ten Rupees)',
    1: '₹20 (Twenty Rupees)',
    2: '₹50 (Fifty Rupees)',
    3: '₹100 (One Hundred Rupees)',
    4: '₹200 (Two Hundred Rupees)',
    5: '₹500 (Five Hundred Rupees)',
    6: '₹2000 (Two Thousand Rupees)',
}

# Hindi equivalents for TTS
DENOMINATIONS_HINDI = {
    0: 'Dus Rupaye',
    1: 'Bees Rupaye',
    2: 'Pachaas Rupaye',
    3: 'Ek Sau Rupaye',
    4: 'Do Sau Rupaye',
    5: 'Paanch Sau Rupaye',
    6: 'Do Hazaar Rupaye',
}
Python
# Step 2: Data Pipeline with Heavy Augmentation

# Augmentation: simulates real-world conditions
augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.3),         # Notes held at angles
    layers.RandomZoom((-0.2, 0.2)),     # Different distances
    layers.RandomBrightness(0.3),       # Lighting variation
    layers.RandomContrast(0.3),         # Shadow effects
])

# Load datasets
train_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR,
    validation_split=0.2,
    subset='training',
    seed=42,
    image_size=(IMG_SIZE, IMG_SIZE),
    batch_size=BATCH_SIZE,
    label_mode='categorical'
)

val_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR,
    validation_split=0.2,
    subset='validation',
    seed=42,
    image_size=(IMG_SIZE, IMG_SIZE),
    batch_size=BATCH_SIZE,
    label_mode='categorical'
)

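class_names = train_ds.class_names
# Class indices follow the alphanumeric order of the folder names (for example
# '10', '100', '20', ...), so confirm this order matches DENOMINATIONS above
# or rename the folders so that it does.
print("Class index order:", class_names)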

# Optimize pipeline
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(AUTOTUNE)
val_ds = val_ds.cache().prefetch(AUTOTUNE)
Python
# Step 3: Custom CNN Architecture (Optimized for Currency)

def build_currency_cnn(num_classes=7, img_size=128):
    """
    Custom CNN for currency recognition.
    Design rationale:
    - Color is the primary discriminant → keep 3 channels
    - Relatively few classes (7) → don't need deep network
    - Must be fast → keep it lightweight
    """
    
    inputs = layers.Input(shape=(img_size, img_size, 3))
    
    # Augmentation + normalization
    x = augmentation(inputs)
    x = layers.Rescaling(1.0/255)(x)
    
    # Block 1: Color and edge features
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(0.25)(x)
    
    # Block 2: Pattern features (Ashoka Pillar, portrait)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(0.25)(x)
    
    # Block 3: High-level denomination features
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(0.25)(x)
    
    # Block 4: Abstract features
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling2D()(x)
    
    # Classifier
    x = layers.Dense(256, activation='relu')(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(128, activation='relu')(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    
    model = models.Model(inputs, outputs, name='CurrencyNet')
    return model

model = build_currency_cnn()
model.summary()
Model: "CurrencyNet"
_________________________________________________________________
Total params: 973,671 (3.71 MB)
Trainable params: 972,135 (3.71 MB)
Non-trainable params: 1,536 (6.00 KB)
Python
# Step 4: Training with Class Weights (handle imbalance)

from sklearn.utils.class_weight import compute_class_weight

# Compute class weights
y_train_labels = []
for _, labels in train_ds:
    y_train_labels.extend(np.argmax(labels.numpy(), axis=1))
y_train_labels = np.array(y_train_labels)

class_weights_arr = compute_class_weight(
    'balanced', classes=np.unique(y_train_labels), y=y_train_labels
)
class_weights = dict(enumerate(class_weights_arr))
print("Class weights:", {DENOMINATIONS[k]: f'{v:.2f}' for k, v in class_weights.items()})

# Compile and train
model.compile(
    optimizer=Adam(learning_rate=1e-3),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor='val_accuracy', patience=7,
        restore_best_weights=True
    ),
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss', factor=0.5, patience=3
    ),
    tf.keras.callbacks.ModelCheckpoint(
        'best_currency_model.keras',
        monitor='val_accuracy', save_best_only=True
    ),
]

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS,
    class_weight=class_weights,
    callbacks=callbacks
)
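
The `'balanced'` mode used in Step 4 follows the formula documented by scikit-learn: weight = n_samples / (n_classes × class_count), so rarer classes receive proportionally larger weights. A minimal NumPy sketch of that formula is below. The label counts are invented for illustration, and `balanced_class_weights` is a hypothetical helper, not part of the project code.

```python
import numpy as np


def balanced_class_weights(labels: np.ndarray) -> dict:
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


# Invented imbalance: 600, 300, and 100 images for three classes
toy_labels = np.repeat([0, 1, 2], [600, 300, 100])
toy_weights = balanced_class_weights(toy_labels)

for cls, weight in toy_weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

Each class ends up contributing the same total weight to the loss (count × weight is constant), which is what stops the majority denominations from dominating training.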
Python
# Step 5: Voice Output using Text-to-Speech

def recognize_and_speak(image_path, model, class_names, lang='en'):
    """
    Recognize currency note and announce denomination via TTS.
    Works on: Desktop (pyttsx3) or Android (Android TTS API).
    """
    import pyttsx3  # pip install pyttsx3
    
    # Load and preprocess
    img = tf.keras.utils.load_img(image_path, target_size=(128, 128))
    img_array = tf.keras.utils.img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    
    # Predict
    prediction = model.predict(img_array, verbose=0)
    pred_class = np.argmax(prediction[0])
    confidence = prediction[0][pred_class]
    
    # Determine text to speak
    if confidence < 0.85:
        speech_text = "Note not clearly visible. Please try again."
    else:
        if lang == 'hi':
            speech_text = DENOMINATIONS_HINDI[pred_class]
        else:
            speech_text = DENOMINATIONS[pred_class]
    
    # Text-to-Speech
    engine = pyttsx3.init()
    engine.setProperty('rate', 150)   # Slower for clarity
    engine.setProperty('volume', 1.0)
    engine.say(speech_text)
    engine.runAndWait()
    
    print(f"Detected: {DENOMINATIONS[pred_class]}")
    print(f"Confidence: {confidence:.2%}")
    print(f"Spoken: {speech_text}")
    
    return pred_class, confidence

# Usage:
# recognize_and_speak('test_note.jpg', model, class_names, lang='hi')
Python
# Step 6: Real-time Webcam Recognition

def realtime_currency_detection(model_path):
    """Real-time currency detection using webcam."""
    import cv2
    
    model = tf.keras.models.load_model(model_path)
    cap = cv2.VideoCapture(0)
    
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        
        # Preprocess frame
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        resized = cv2.resize(rgb, (128, 128))
        input_tensor = np.expand_dims(resized, axis=0).astype(np.float32)
        
        # Predict
        prediction = model.predict(input_tensor, verbose=0)
        pred_class = np.argmax(prediction[0])
        confidence = prediction[0][pred_class]
        
        # Display result
        label = DENOMINATIONS[pred_class]
        color = (0, 255, 0) if confidence > 0.85 else (0, 165, 255)
        cv2.putText(frame, f"{label}: {confidence:.1%}",
                    (10, 40), cv2.FONT_HERSHEY_SIMPLEX,
                    1.0, color, 2)
        
        cv2.imshow('Currency Detector', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    
    cap.release()
    cv2.destroyAllWindows()

# realtime_currency_detection('best_currency_model.keras')
Test Accuracy: 98.4%  |  Macro F1: 0.98  |  Parameters: 0.97M  |  Inference Time: 4.2ms

Deployment Considerations

  • Accessibility: App must have large buttons, haptic feedback, and screen-reader compatibility
  • Confidence threshold: Only announce if confidence >85%. Otherwise, ask user to retry (prevents wrong announcements)
  • New note series: RBI periodically introduces new designs. Model needs periodic retraining/OTA update mechanism
  • Counterfeit detection: Not in scope, because this requires UV/IR imaging. Clearly disclose this limitation
Confusing ₹50 and ₹500: Both can appear grayish under certain lighting. Solution: Train with diverse lighting conditions and add color histogram as an auxiliary feature. Also ensure the training set includes both old and new series notes.
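
A color histogram of the kind suggested above can be computed in a few lines. The sketch below builds a normalized per-channel histogram with NumPy and compares two flat synthetic patches whose RGB values are made up for illustration; they are not measured note colors. How the feature is combined with the CNN (for example, concatenated before the classifier head) is a design choice the chapter leaves open, and `color_histogram` is a hypothetical helper.

```python
import numpy as np


def color_histogram(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Concatenated per-channel histogram of an HxWx3 uint8 RGB image.

    Returns a vector of length 3 * bins that sums to 1.
    """
    channel_hists = []
    for ch in range(3):
        hist, _ = np.histogram(image[..., ch], bins=bins, range=(0, 256))
        channel_hists.append(hist)
    feats = np.concatenate(channel_hists).astype(np.float32)
    return feats / feats.sum()


# Two flat synthetic patches with made-up colors: one bluish, one neutral gray
patch_a = np.full((128, 128, 3), (120, 130, 200), dtype=np.uint8)
patch_b = np.full((128, 128, 3), (150, 150, 150), dtype=np.uint8)

feat_a = color_histogram(patch_a)
feat_b = color_histogram(patch_b)

# L1 distance between the two feature vectors (0 means identical color profile)
l1_distance = float(np.abs(feat_a - feat_b).sum())
print(f"Feature length: {feat_a.shape[0]}, L1 distance: {l1_distance:.3f}")
```

Because the histogram discards spatial layout, it stays stable when the note is rotated or partly folded, which is the property that makes it a useful complement to the convolutional features.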
Project 3

🚦 Traffic Sign Recognition for Indian Roads

Traffic Sign Recognition: Making Indian Roads Safer

Classification • ResNet-18 • Transfer Learning • ADAS

Problem Statement

India records 4.6 lakh road accidents annually, causing 1.68 lakh deaths (MoRTH, 2023). Advanced Driver Assistance Systems (ADAS) that recognize traffic signs can alert drivers and save lives. However, Indian traffic signs differ significantly from European (GTSRB) and US datasets: bilingual text (Hindi + English), different symbols, faded/damaged signs, and chaotic visual environments.

Indian traffic signs follow IRC (Indian Roads Congress) standards. Key differences from European signs: bilingual text, different speed limit values (30/40/60/80 km/h), unique warning signs (cattle crossing, speed breaker ahead), and mandatory signs specific to Indian roads. Companies like Mobileye (used in Tata cars) and Sternlabs are building India-specific ADAS.

Indian Traffic Sign Categories

Category | Shape | Examples | Count
Mandatory | Blue Circle | Keep Left, Ahead Only, Roundabout | 12 classes
Prohibitory | Red Circle | No Entry, Speed Limit 40, No Horn | 15 classes
Warning | Yellow Triangle | Speed Breaker, Curve, School Zone | 20 classes
Informatory | Green/Blue Rectangle | Hospital, Petrol Pump, Parking | 10 classes

Complete Code: ResNet-18 Transfer Learning

Python
# =====================================================
# PROJECT 3: Indian Traffic Sign Recognition
# ResNet-18 Transfer Learning (PyTorch)
# Target: >94% accuracy on Indian signs
# =====================================================

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision
from torchvision import transforms, datasets, models
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm

# Configuration
IMG_SIZE = 224
BATCH_SIZE = 32
NUM_CLASSES = 57   # Indian traffic sign classes
EPOCHS = 25
LR = 1e-3
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {DEVICE}")
Python
# Step 2: Data Transforms (Indian Road Conditions)

train_transforms = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(IMG_SIZE),
    # No horizontal flip: mirroring changes the meaning of directional signs
    # (e.g. Keep Left), which would silently corrupt their labels
    transforms.RandomRotation(15),
    transforms.ColorJitter(
        brightness=0.3,
        contrast=0.3,
        saturation=0.2,
        hue=0.1
    ),
    transforms.RandomAffine(
        degrees=0,
        translate=(0.1, 0.1),
        scale=(0.85, 1.15)
    ),
    # Simulate dusty/foggy Indian road conditions
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])

val_transforms = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])

# Load dataset (organized in class folders)
train_dataset = datasets.ImageFolder('./indian_traffic_signs/train', train_transforms)
val_dataset = datasets.ImageFolder('./indian_traffic_signs/val', val_transforms)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                          shuffle=True, num_workers=4, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE,
                        shuffle=False, num_workers=4, pin_memory=True)

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Classes: {len(train_dataset.classes)}")
Python
# Step 3: ResNet-18 Transfer Learning Model

class TrafficSignNet(nn.Module):
    """ResNet-18 fine-tuned for Indian traffic signs."""
    
    def __init__(self, num_classes=57, pretrained=True):
        super().__init__()
        
        # Load pretrained ResNet-18
        self.backbone = models.resnet18(
            weights=models.ResNet18_Weights.IMAGENET1K_V1 if pretrained else None
        )
        
        # Freeze early layers (low-level features transfer well)
        for name, param in self.backbone.named_parameters():
            if 'layer3' not in name and 'layer4' not in name:
                param.requires_grad = False
        
        # Replace classifier head
        in_features = self.backbone.fc.in_features  # 512
        self.backbone.fc = nn.Sequential(
            nn.Linear(in_features, 256),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes)
        )
    
    def forward(self, x):
        return self.backbone(x)

model = TrafficSignNet(NUM_CLASSES).to(DEVICE)

# Count trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Total params: {total:,}")
print(f"Trainable params: {trainable:,} ({trainable/total*100:.1f}%)")
Total params: 11,348,089
Trainable params: 10,665,017 (94.0%)
Python
# Step 4: Training Loop with Cosine Annealing

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=LR, weight_decay=1e-4
)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss, correct, total = 0.0, 0, 0
    
    for images, labels in tqdm(loader, desc='Training'):
        images, labels = images.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item() * images.size(0)
        _, preds = outputs.max(1)
        correct += preds.eq(labels).sum().item()
        total += labels.size(0)
    
    return running_loss / total, correct / total

def evaluate(model, loader, criterion, device):
    model.eval()
    running_loss, correct, total = 0.0, 0, 0
    
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            
            running_loss += loss.item() * images.size(0)
            _, preds = outputs.max(1)
            correct += preds.eq(labels).sum().item()
            total += labels.size(0)
    
    return running_loss / total, correct / total

# Training loop
best_val_acc = 0
for epoch in range(EPOCHS):
    train_loss, train_acc = train_one_epoch(
        model, train_loader, criterion, optimizer, DEVICE
    )
    val_loss, val_acc = evaluate(model, val_loader, criterion, DEVICE)
    scheduler.step()
    
    print(f"Epoch {epoch+1}/{EPOCHS} | "
          f"Train Loss: {train_loss:.4f} Acc: {train_acc:.4f} | "
          f"Val Loss: {val_loss:.4f} Acc: {val_acc:.4f} | "
          f"LR: {scheduler.get_last_lr()[0]:.6f}")
    
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), 'best_traffic_sign_model.pth')
        print(f"  ✓ Saved best model (val_acc: {val_acc:.4f})")

print(f"\nBest Validation Accuracy: {best_val_acc:.4f}")
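To make the schedule used in Step 4 concrete, the sketch below reproduces in plain Python the closed-form curve that cosine annealing follows when the minimum learning rate is left at 0. The helper name `cosine_lr` is introduced here for illustration only and is not part of PyTorch; the defaults mirror `LR = 1e-3` and `EPOCHS = 25` from the configuration.

```python
import math

def cosine_lr(epoch, base_lr=1e-3, t_max=25, eta_min=0.0):
    """Closed-form cosine annealing: decay from base_lr to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

# The rate falls slowly at first, fastest mid-training, and flattens near the end
for epoch in (0, 5, 12, 20, 25):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch):.6f}")
```

The slow start keeps early updates large enough to adapt the new head, while the flat tail lets the unfrozen backbone layers settle with very small steps.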
Python
# Step 5: ONNX Export for Cross-Platform Deployment

# Load best model
model.load_state_dict(torch.load('best_traffic_sign_model.pth'))
model.eval()

# Export to ONNX
dummy_input = torch.randn(1, 3, IMG_SIZE, IMG_SIZE).to(DEVICE)

torch.onnx.export(
    model,
    dummy_input,
    'traffic_sign_model.onnx',
    input_names=['image'],
    output_names=['prediction'],
    dynamic_axes={
        'image': {0: 'batch_size'},
        'prediction': {0: 'batch_size'}
    },
    opset_version=11
)

import os
onnx_size = os.path.getsize('traffic_sign_model.onnx') / (1024 * 1024)
print(f"ONNX model size: {onnx_size:.1f} MB")
print("Ready for deployment with ONNX Runtime!")
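The exported network ends in a plain `nn.Linear` layer, so the `prediction` output holds raw logits rather than probabilities. A minimal NumPy sketch of the post-processing a deployment would likely need is shown below; the function name `top_k_predictions`, the class names, and the logit values are all hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def top_k_predictions(logits, class_names, k=3):
    """Return the k most likely (class_name, probability) pairs for one image."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    order = np.argsort(probs)[::-1][:k]
    return [(class_names[i], float(probs[i])) for i in order]

# Hypothetical logits for a 4-class toy problem
names = ["keep_left", "no_entry", "speed_limit_40", "school_zone"]
for name, p in top_k_predictions([2.0, 0.5, 3.1, -1.0], names, k=3):
    print(f"{name}: {p:.3f}")
```

Reporting the top few classes instead of only the argmax makes it easier to apply a confidence threshold before alerting a driver.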
Val Accuracy: 94.7%  |  Macro F1: 0.93  |  Parameters: 11.3M  |  GPU Latency: 14 ms
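Macro F1 is reported alongside accuracy because it weights all 57 classes equally, so a rare sign class that the model often misses pulls the score down even when overall accuracy looks fine. The sketch below shows one way to compute it from a confusion matrix; the helper `per_class_f1` and the toy counts are made up for illustration.

```python
import numpy as np

def per_class_f1(conf_matrix):
    """conf_matrix[i, j] = number of samples with true class i predicted as class j."""
    cm = np.asarray(conf_matrix, dtype=np.float64)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)   # column sums: everything predicted as each class
    actual = cm.sum(axis=1)      # row sums: everything that truly belongs to each class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Toy 3-class confusion matrix (made-up counts, third class is rare)
cm = [[50, 2, 0],
      [5, 40, 5],
      [0, 3, 10]]
precision, recall, f1 = per_class_f1(cm)
print("per-class F1:", np.round(f1, 3))
print("macro F1:", round(float(f1.mean()), 3))
print("accuracy:", round(float(np.trace(np.asarray(cm)) / np.sum(cm)), 3))
```

In this toy case the rare third class has the lowest F1, and the macro average sits below the accuracy as a result.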

Challenges Specific to Indian Roads

  • Faded signs: Sun-bleached paint reduces color contrast. Heavy augmentation with brightness/contrast jitter helps.
  • Occlusion: Signs hidden behind trees, advertisements, or other signs. Multi-scale detection needed.
  • Bilingual text: Hindi + English on informatory signs. OCR integration can supplement classification.
  • Non-standard placement: Signs at unusual heights, angles, or locations compared to Western standards.
Project 4

😷 Face Mask Detection: Real-Time Compliance

P4

Face Mask Detection: Post-COVID Public Health Monitoring

Object Detection • YOLOv5 • Real-Time • Webcam

Problem Statement

During and after the COVID-19 pandemic, organizations needed automated systems to monitor mask compliance in public spaces such as offices, malls, metro stations, and hospitals. Manual monitoring is impractical for high-footfall areas like Delhi Metro (60 lakh daily riders) or Mumbai local trains. A real-time system using existing CCTV cameras can detect mask violations and trigger alerts.

Indian Railways, Delhi Metro Rail Corporation (DMRC), and airport authorities like Airports Authority of India (AAI) actively deployed face mask detection systems. Staqu Technologies (Gurugram) and iMerit built Indian-specific solutions. The challenge: detecting masks on faces partially covered by dupattas, scarves, and beards, all of which are common in Indian settings.

Dataset

Property | Details
Source | Face Mask Detection Dataset (Kaggle) + WIDER Face + Custom annotations
Classes | 3: With Mask, Without Mask, Mask Worn Incorrectly
Images | ~12,000 annotated images
Annotations | YOLO format (class, x_center, y_center, width, height)
Challenge | Multiple faces per image, varying scales, occluded faces
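Annotation tools often export boxes as pixel corners, while the YOLO format in the table stores a normalized center and size. The sketch below converts between the two; the function names `to_yolo_label` and `from_yolo_label` and the example box are illustrative assumptions, not part of the YOLOv5 repository.

```python
def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space corner box to a YOLO label line (normalized center format)."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

def from_yolo_label(line, img_w, img_h):
    """Inverse conversion: YOLO label line to (class_id, x_min, y_min, x_max, y_max) in pixels."""
    class_id, xc, yc, w, h = line.split()
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    return int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2

# A 100x100 pixel face box in a 640x480 image, class 0 (with_mask)
line = to_yolo_label(0, 200, 100, 300, 200, img_w=640, img_h=480)
print(line)
print(from_yolo_label(line, 640, 480))
```

Because every value is normalized, the same label file stays valid if the image is later resized, which is why the training command can change `--img` without touching the annotations.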

Complete Code: YOLOv5 Fine-Tuning

Python
# =====================================================
# PROJECT 4: Face Mask Detection using YOLOv5
# Real-time object detection at >25 FPS
# Classes: with_mask, without_mask, mask_incorrect
# =====================================================

# Step 1: Install and setup YOLOv5
# !git clone https://github.com/ultralytics/yolov5
# !cd yolov5 && pip install -r requirements.txt

import torch
import os
import yaml
import numpy as np

# Verify GPU
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
Python
# Step 2: Prepare Dataset Configuration (data.yaml)

data_config = {
    'train': './mask_dataset/train/images',
    'val': './mask_dataset/val/images',
    'test': './mask_dataset/test/images',
    'nc': 3,   # number of classes
    'names': ['with_mask', 'without_mask', 'mask_incorrect']
}

with open('mask_data.yaml', 'w') as f:
    yaml.dump(data_config, f)

print("Dataset config saved to mask_data.yaml")

# Expected directory structure:
# mask_dataset/
# ├── train/
# │   ├── images/  (*.jpg)
# │   └── labels/  (*.txt, YOLO format)
# ├── val/
# │   ├── images/
# │   └── labels/
# └── test/
#     ├── images/
#     └── labels/

# YOLO label format per line:
# class_id x_center y_center width height
# (all normalized to [0, 1])
# Example: 0 0.453 0.312 0.145 0.198
Python
# Step 3: Train YOLOv5s (small variant for real-time)

# From command line (Colab/terminal):
# !python yolov5/train.py \
#     --img 640 \
#     --batch 16 \
#     --epochs 50 \
#     --data mask_data.yaml \
#     --weights yolov5s.pt \
#     --name mask_detector \
#     --patience 10 \
#     --cache

# Programmatic training (alternative)
import subprocess
result = subprocess.run([
    'python', 'yolov5/train.py',
    '--img', '640',
    '--batch', '16',
    '--epochs', '50',
    '--data', 'mask_data.yaml',
    '--weights', 'yolov5s.pt',
    '--name', 'mask_detector',
    '--patience', '10',
    '--cache',
], capture_output=True, text=True)
print(result.stdout[-500:])  # Print last 500 chars
Python
# Step 4: Real-Time Webcam Inference

def realtime_mask_detection(model_path='yolov5/runs/train/mask_detector/weights/best.pt'):
    """Run real-time face mask detection on webcam feed."""
    import cv2
    import time
    
    # Load trained YOLOv5 model
    model = torch.hub.load('ultralytics/yolov5', 'custom',
                           path=model_path, force_reload=True)
    model.conf = 0.5   # Confidence threshold
    model.iou = 0.45   # NMS IoU threshold
    
    # Color codes (BGR): green=mask, red=no_mask, orange=incorrect
    COLORS = {
        'with_mask': (0, 255, 0),
        'without_mask': (0, 0, 255),
        'mask_incorrect': (0, 165, 255)
    }
    
    cap = cv2.VideoCapture(0)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
    
    fps_counter = []
    
    while True:
        start_time = time.time()
        ret, frame = cap.read()
        if not ret:
            break
        
        # Inference
        results = model(frame)
        detections = results.pandas().xyxy[0]
        
        # Draw detections
        mask_count, no_mask_count = 0, 0
        for _, det in detections.iterrows():
            x1, y1 = int(det['xmin']), int(det['ymin'])
            x2, y2 = int(det['xmax']), int(det['ymax'])
            label = det['name']
            conf = det['confidence']
            color = COLORS.get(label, (255, 255, 255))
            
            # Draw bounding box
            cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
            cv2.putText(frame, f"{label} {conf:.2f}",
                        (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX,
                        0.6, color, 2)
            
            if label == 'with_mask':
                mask_count += 1
            else:
                no_mask_count += 1
        
        # FPS counter
        fps = 1.0 / (time.time() - start_time)
        fps_counter.append(fps)
        
        # Status bar
        status = f"FPS: {fps:.0f} | Masked: {mask_count} | Unmasked: {no_mask_count}"
        cv2.putText(frame, status, (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)
        
        # Alert if violation detected
        if no_mask_count > 0:
            # OpenCV's Hershey fonts render ASCII only, so the alert text avoids symbols
            cv2.putText(frame, "MASK VIOLATION DETECTED",
                        (10, 65), cv2.FONT_HERSHEY_SIMPLEX,
                        0.8, (0, 0, 255), 2)
        
        cv2.imshow('Face Mask Detection', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    
    avg_fps = np.mean(fps_counter)
    print(f"\nAverage FPS: {avg_fps:.1f}")
    cap.release()
    cv2.destroyAllWindows()

# realtime_mask_detection()
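The `model.iou = 0.45` setting above controls non-maximum suppression (NMS), which removes duplicate boxes that cover the same face. The following is a simplified, class-agnostic NumPy sketch of the idea, not YOLOv5's own implementation; the function names and the three example boxes are assumptions made for illustration.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the best box, drop boxes that overlap it too much, repeat."""
    boxes = np.asarray(boxes, dtype=np.float64)
    order = np.argsort(scores)[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

# Two near-duplicate detections of one face plus a separate face
boxes = [[100, 100, 200, 200], [105, 105, 205, 205], [300, 300, 380, 380]]
scores = [0.90, 0.80, 0.75]
print(nms(boxes, scores))
```

Lowering the threshold suppresses more aggressively, which can merge neighboring faces in a crowd; raising it leaves more duplicates.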
Python
# Step 5: Evaluation with mAP and Per-Class Metrics

# Validate on test set
# !python yolov5/val.py \
#     --weights yolov5/runs/train/mask_detector/weights/best.pt \
#     --data mask_data.yaml \
#     --img 640 \
#     --task test \
#     --verbose

# Illustrative per-class metrics (hard-coded here). Replace them with the
# numbers that val.py prints for your own run.
def print_yolo_metrics():
    """Display key detection metrics."""
    metrics = {
        'with_mask': {'precision': 0.94, 'recall': 0.92, 'mAP@50': 0.95},
        'without_mask': {'precision': 0.91, 'recall': 0.89, 'mAP@50': 0.92},
        'mask_incorrect': {'precision': 0.82, 'recall': 0.78, 'mAP@50': 0.83},
    }
    
    print(f"{'Class':<20} {'Precision':>10} {'Recall':>10} {'mAP@50':>10}")
    print("-" * 52)
    for cls, m in metrics.items():
        print(f"{cls:<20} {m['precision']:>10.3f} {m['recall']:>10.3f} {m['mAP@50']:>10.3f}")
    print("-" * 52)
    avg_map = np.mean([m['mAP@50'] for m in metrics.values()])
    print(f"{'Overall mAP@50':<20} {'':>10} {'':>10} {avg_map:>10.3f}")

print_yolo_metrics()
Class                 Precision     Recall     mAP@50
----------------------------------------------------
with_mask                 0.940      0.920      0.950
without_mask              0.910      0.890      0.920
mask_incorrect            0.820      0.780      0.830
----------------------------------------------------
Overall mAP@50                                  0.900
mAP@50: 90.0%  |  Real-Time Speed: 28 FPS  |  Parameters: 7.2M  |  Classes: 3
"Mask worn incorrectly" is the hardest class: Chin-only masks, nose-exposed masks, and loose masks are visually similar to both "with mask" and "without mask." This class needs the most training data and careful annotation. Use tight bounding boxes around the face region, not the full head.

Deployment Architecture

  • Edge deployment: NVIDIA Jetson Nano (₹12,000) can run YOLOv5s at ~25 FPS on 720p
  • Cloud fallback: Send frames to AWS/GCP GPU instance for higher accuracy models
  • Alert pipeline: Detection → Alert Queue (Redis) → Dashboard + Buzzer + SMS
  • Privacy: Process frames locally, don't store identifiable images. GDPR/DPDPA compliance is critical
Project 5

📜 OCR for Indian Language Documents: Devanagari Handwriting

P5

Devanagari Handwritten OCR: Digitizing India's Documents

OCR • CNN + CTC Loss • Sequence Recognition • NLP

Problem Statement

India's government processes millions of handwritten forms daily: ration cards, land records, census data, exam answer sheets, and postal forms. Many are in Hindi (Devanagari script). Manual digitization is slow (5-10 forms/hour/person) and error-prone. An OCR system for Devanagari handwriting can accelerate Digital India initiatives. The challenge: Devanagari has 13 vowels, 33 consonants, matras (vowel diacritics), conjuncts (combined consonants), and the shirorekha (headline), which together make it far more complex than Latin OCR.

The Digital India programme targets paperless governance. IIIT Hyderabad's CVIT lab and IIT Bombay's LTRC have pioneered Indian language OCR research. Google's Tesseract supports Devanagari, but struggles with handwritten text. The National Land Records Modernization Programme (NLRMP) alone needs to digitize 60+ crore land records, many in Hindi.

Devanagari Script Challenges

Challenge | Description | Impact on OCR
Shirorekha (headline) | Horizontal line connecting characters in a word | Must segment before recognition
Matras (vowel signs) | Diacritics above, below, or beside consonants | Changes character identity completely
Conjuncts (संयुक्त) | Combined consonants: क्ष, त्र, ज्ञ, श्र | Creates new visual patterns, needs conjunct classes
Half-letters (हलंत) | Consonants without inherent vowel: क्, त्, प् | Subtle visual difference from full letters
Handwriting variance | Individual styles, pen pressure, slant | High intra-class variance
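One practical consequence of matras and conjuncts is that a single visible syllable is stored as several Unicode code points, and a character-level OCR model predicts those code points one by one. The short sketch below inspects a word with the standard `unicodedata` module; the helper name `describe` is hypothetical, and the word is written with escape sequences so that each code point is explicit.

```python
import unicodedata

def describe(word):
    """List each code point of a word with its official Unicode name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in word]

# "namaste" in Devanagari: NA, MA, SA, VIRAMA, TA, VOWEL SIGN E
word = "\u0928\u092e\u0938\u094d\u0924\u0947"
print(word, "has", len(word), "code points")
for code, name in describe(word):
    print(code, name)
```

The word looks like four written units but has six code points, so label lengths, the `MAX_LABEL_LEN` budget, and error rates based on `len()` all count code points rather than visible syllables.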

Complete Code: CNN + CTC Loss

Python
# =====================================================
# PROJECT 5: Devanagari Handwritten OCR
# CNN Feature Extractor + CTC Loss for sequence decoding
# =====================================================

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models, backend as K

# Configuration
IMG_WIDTH = 256        # Word image width
IMG_HEIGHT = 64        # Word image height
MAX_LABEL_LEN = 32     # Max characters per word
BATCH_SIZE = 64

# Devanagari character set
DEVANAGARI_CHARS = [
    # Vowels (स्वर)
    'अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ए', 'ऐ', 'ओ', 'औ', 'अं', 'अः',
    # Consonants (व्यंजन)
    'क', 'ख', 'ग', 'घ', 'ङ',
    'च', 'छ', 'ज', 'झ', 'ञ',
    'ट', 'ठ', 'ड', 'ढ', 'ण',
    'त', 'थ', 'द', 'ध', 'न',
    'प', 'फ', 'ब', 'भ', 'म',
    'य', 'र', 'ल', 'व',
    'श', 'ष', 'स', 'ह',
    # Matras (vowel signs)
    'ा', 'ि', 'ी', 'ु', 'ू', 'े', 'ै', 'ो', 'ौ',
    # Special
    '्', 'ं', 'ः', 'ँ',
    # Digits (अंक)
    '०', '१', '२', '३', '४', '५', '६', '७', '८', '९',
    # Space
    ' '
]

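# Index 0 is reserved for label padding and the last index for the CTC blank
NUM_CHARS = len(DEVANAGARI_CHARS) + 2

# Character to index mapping (characters start at 1 so that 0 only ever means padding)
char_to_idx = {c: i + 1 for i, c in enumerate(DEVANAGARI_CHARS)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
idx_to_char[0] = ''   # padding decodes to nothing

print(f"Character set size: {len(DEVANAGARI_CHARS)}")
print(f"Total classes (with padding and CTC blank): {NUM_CHARS}")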
Python
# Step 2: Build CNN + RNN + CTC Model (CRNN Architecture)

def build_crnn_model(img_width, img_height, num_chars):
    """
    CRNN: Convolutional Recurrent Neural Network for OCR.
    
    Architecture:
    Input Image → CNN (feature extraction) → Reshape →
    BiLSTM (sequence modeling) → Dense (per-timestep prediction) →
    CTC Loss (alignment-free training)
    """
    
    # Input
    input_img = layers.Input(shape=(img_height, img_width, 1),
                             name='image_input')
    labels = layers.Input(shape=(None,), name='label_input',
                          dtype='int32')
    
    # CNN Feature Extractor
    # Block 1
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(input_img)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x)
    
    # Block 2
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x)
    
    # Block 3
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 1))(x)  # Pool only height!
    
    # Block 4
    x = layers.Conv2D(512, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.25)(x)
    
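    # Reshape: (batch, height, width, channels) → (batch, width, features)
    # After 3 poolings: height = 64/8 = 8, width = 256/4 = 64
    # Move width to the front first so each timestep is one image column;
    # reshaping without the permute would mix pixels from different columns
    x = layers.Permute((2, 1, 3))(x)                   # (batch, width, height, channels)
    new_shape = (x.shape[1], x.shape[2] * x.shape[3])  # (width, height*channels)
    x = layers.Reshape(target_shape=new_shape)(x)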
    x = layers.Dense(256, activation='relu')(x)
    x = layers.Dropout(0.25)(x)
    
    # Bidirectional LSTM for sequence modeling
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True,
                                         dropout=0.25))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True,
                                         dropout=0.25))(x)
    
    # Output: per-timestep character probabilities
    output = layers.Dense(num_chars, activation='softmax',
                          name='predictions')(x)
    
    # CTC Loss Layer
    ctc_loss = layers.Lambda(
        lambda args: ctc_loss_func(*args),
        name='ctc_loss'
    )([labels, output])
    
    # Training model (with CTC loss)
    train_model = models.Model(
        inputs=[input_img, labels],
        outputs=ctc_loss
    )
    
    # Inference model (without CTC loss)
    inference_model = models.Model(
        inputs=input_img,
        outputs=output
    )
    
    return train_model, inference_model

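def ctc_loss_func(y_true, y_pred):
    """Compute the per-sample CTC loss from softmax outputs."""
    batch_size = tf.shape(y_pred)[0]
    time_steps = tf.shape(y_pred)[1]
    # Label length = number of non-padding entries
    # (assumes index 0 is used only for padding)
    label_length = tf.math.count_nonzero(y_true, axis=1, dtype=tf.int32)
    input_length = tf.fill([batch_size], time_steps)
    
    # tf.nn.ctc_loss expects unnormalized logits, but y_pred is already a
    # softmax output. Taking the log gives valid log-probabilities.
    log_probs = tf.math.log(y_pred + 1e-7)
    
    loss = tf.nn.ctc_loss(
        labels=tf.cast(y_true, tf.int32),
        logits=log_probs,
        label_length=label_length,
        logit_length=input_length,
        logits_time_major=False,
        blank_index=-1   # last class index is the CTC blank
    )
    # Shape (batch, 1) so Keras can average it like any other per-sample loss
    return tf.expand_dims(loss, axis=-1)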

train_model, inference_model = build_crnn_model(IMG_WIDTH, IMG_HEIGHT, NUM_CHARS)
train_model.summary()
Python
# Step 3: Data Preprocessing for Handwritten Images

def preprocess_image(image_path, img_width=256, img_height=64):
    """
    Preprocess handwritten Devanagari word image.
    Steps: Grayscale → Normalize → Invert (white text on black) → Resize with padding
    """
    # Read image as single-channel grayscale
    img = tf.io.read_file(image_path)
    img = tf.image.decode_png(img, channels=1)
    
    # Normalize to [0, 1]
    img = tf.cast(img, tf.float32) / 255.0
    
    # Invert (handwriting is dark on light background)
    img = 1.0 - img
    
    # Resize preserving aspect ratio. resize_with_pad fills with zeros, so
    # inverting first keeps the padding the same color as the background.
    img = tf.image.resize_with_pad(img, img_height, img_width)
    
    return img

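def encode_label(text, char_to_idx, max_len):
    """Convert Devanagari text to a zero-padded integer sequence."""
    # Skip characters outside the character set instead of mapping them to 0,
    # which is also the padding value
    encoded = [char_to_idx[c] for c in text if c in char_to_idx][:max_len]
    # Pad with zeros
    encoded += [0] * (max_len - len(encoded))
    return np.array(encoded, dtype=np.int32)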

def decode_prediction(pred, idx_to_char):
    """
    CTC greedy decoding: 
    1. Take argmax at each timestep
    2. Remove consecutive duplicates
    3. Remove blank tokens
    """
    # Greedy decode
    pred_indices = np.argmax(pred, axis=-1)
    
    # Remove consecutive duplicates
    decoded = []
    prev = -1
    for idx in pred_indices:
        if idx != prev and idx != len(idx_to_char):  # Not blank
            if idx in idx_to_char:
                decoded.append(idx_to_char[idx])
        prev = idx
    
    return ''.join(decoded)

# Example
sample_text = "नमस्ते"
encoded = encode_label(sample_text, char_to_idx, MAX_LABEL_LEN)
print(f"Text: {sample_text}")
print(f"Encoded: {encoded[:10]}...")
Python
# Step 4: Training

# Compile with dummy loss (CTC loss is built into model)
train_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=lambda y_true, y_pred: y_pred  # CTC loss already computed
)

callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor='val_loss', patience=10,
        restore_best_weights=True
    ),
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss', factor=0.5, patience=5
    ),
    tf.keras.callbacks.ModelCheckpoint(
        'best_devanagari_ocr.keras',
        monitor='val_loss', save_best_only=True
    ),
]

# Note: train_data and val_data should be tf.data.Dataset
# yielding ((image_batch, label_batch), dummy_output)
# history = train_model.fit(
#     train_data,
#     validation_data=val_data,
#     epochs=50,
#     callbacks=callbacks
# )
Python
# Step 5: Evaluation with Character Error Rate (CER) & Word Error Rate (WER)

def character_error_rate(y_true_texts, y_pred_texts):
    """
    CER = Edit_Distance(predicted, ground_truth) / len(ground_truth)
    Standard metric for OCR evaluation.
    """
    total_chars = 0
    total_errors = 0
    
    for true, pred in zip(y_true_texts, y_pred_texts):
        # Levenshtein distance
        distance = levenshtein_distance(true, pred)
        total_errors += distance
        total_chars += len(true)
    
    return total_errors / total_chars if total_chars > 0 else 0

def levenshtein_distance(s1, s2):
    """Compute edit distance between two strings."""
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    
    if len(s2) == 0:
        return len(s1)
    
    prev_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        curr_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = prev_row[j + 1] + 1
            deletions = curr_row[j] + 1
            substitutions = prev_row[j] + (c1 != c2)
            curr_row.append(min(insertions, deletions, substitutions))
        prev_row = curr_row
    
    return prev_row[-1]

# Example evaluation
true_texts = ["नमस्ते", "भारत", "शिक्षा"]
pred_texts = ["नमस्ते", "भारत", "शिक्षो"]  # Last one has error

cer = character_error_rate(true_texts, pred_texts)
print(f"Character Error Rate (CER): {cer:.4f} ({cer*100:.1f}%)")
Character Error Rate (CER): 0.0625 (6.2%)
CER: 7.1%  |  WER: 12.3%  |  Parameters: ~4M  |  Character Classes: 56
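Step 5 defines CER only, so here is a hedged sketch of how the word error rate could be computed with the same edit-distance idea applied to whitespace-separated words instead of characters. The names `edit_distance` and `word_error_rate` are introduced for this example, and the sentences are English toy data; the logic is identical for Hindi text split on spaces.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr_row = [i]
        for j, h in enumerate(hyp, start=1):
            curr_row.append(min(prev_row[j] + 1,              # delete a reference token
                                curr_row[j - 1] + 1,          # insert a hypothesis token
                                prev_row[j - 1] + (r != h)))  # substitute (free if equal)
        prev_row = curr_row
    return prev_row[-1]

def word_error_rate(references, hypotheses):
    """WER = total word-level edits / total number of reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words if words else 0.0

refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sit on mat", "hello world"]
print(f"WER: {word_error_rate(refs, hyps):.3f}")
```

WER is usually higher than CER for the same model because a single wrong character makes the whole word count as an error.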

Improvement Suggestions

  • Attention mechanism: Replace CTC with a Transformer-based attention decoder for better accuracy on long words
  • Language model: Post-process with a Hindi n-gram language model to correct OCR errors
  • Multi-script: Extend to Bengali, Tamil, and Telugu by sharing the CNN backbone and training separate decoders
  • Line segmentation: Add a text line detection step (CRAFT or DBNet) before word recognition
  • Synthetic data: Generate training data using Hindi fonts with random distortions to augment real handwriting
Section 6

Visual Diagrams

6.1 Applied CV Project Pipeline (Universal)

PROBLEM FRAMING → DATASET ENGINEERING → MODEL SELECTION → TRAINING & TUNING → EVALUATION → OPTIMIZATION → DEPLOYMENT

1. PROBLEM FRAMING: task type, metrics, constraints, budget
2. DATASET ENGINEERING: collect, augment, split, balance
3. MODEL SELECTION: backbone, head design, complexity budget
4. TRAINING & TUNING: Phase 1 (head only), Phase 2 (fine-tune)
5. EVALUATION: confusion matrix, per-class F1-score, Grad-CAM
6. OPTIMIZATION: quantize (FP32 → INT8), prune, distill, benchmark
7. DEPLOYMENT: TFLite, ONNX, API wrap, monitor, A/B test

6.2 Project Architecture Comparison

PROJECT 1 (Plant Disease)
Image → MobileNetV2 → GAP → Dense(256) → Dense(38) → Softmax
Classification (single label)

PROJECT 2 (Currency)
Image → Custom CNN (4 blocks) → GAP → Dense(256) → Dense(7) → Softmax → TTS Engine
Classification + Voice Output

PROJECT 3 (Traffic Signs)
Image → ResNet-18 → Custom FC → Dense(256) → Dense(57) → Softmax
Transfer Learning (PyTorch)

PROJECT 4 (Mask Detection)
Image → YOLOv5 Backbone → FPN Neck → 3 Detection Heads → NMS → [class, x, y, w, h, conf]
Detection (multiple objects)

PROJECT 5 (Devanagari OCR)
Image → CNN (4 blocks) → Reshape → BiLSTM (2 layers) → Dense(56) → CTC Decode → Text
Sequence Recognition

6.3 CTC Loss: How It Aligns Predictions to Labels

Input Image:  [ न म स् ते ]   (word: "नमस्ते")
CNN Output:   64 timesteps of 56-class probability vectors

CTC Alignment: Many valid alignments exist for the same label:

Timestep:     1  2  3  4  5  6  7  8  9  10  11  12 ...
Alignment 1:  -  न  न  -  म  -  -  स  ्  -  ते  -
Alignment 2:  न  -  -  म  -  स  ्  -  ते  -  -  -
Alignment 3:  -  -  न  -  म  -  स  ्  ते  -  -  -

"-" = CTC blank token

CTC Loss = -log(sum of probabilities of ALL valid alignments)

After decoding (greedy/beam search):
→ Collapse consecutive duplicates → Remove blanks → "नमस्ते" ✓
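The statement that CTC loss sums over all valid alignments can be checked by brute force on a toy problem. The sketch below enumerates every possible path for 3 timesteps and 3 classes and adds up the probability of those that collapse to the target label. The function names and the probability table are made up for illustration, and real implementations use dynamic programming because enumeration grows exponentially with the number of timesteps.

```python
import itertools
import math
import numpy as np

def collapse(path, blank):
    """CTC collapse: merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return tuple(out)

def ctc_loss_brute_force(probs, label, blank):
    """Return (-log p, p) where p is the total probability of all paths matching label."""
    num_steps, num_classes = probs.shape
    total = 0.0
    for path in itertools.product(range(num_classes), repeat=num_steps):
        if collapse(path, blank) == tuple(label):
            total += math.prod(probs[t, s] for t, s in enumerate(path))
    return -math.log(total), total

# Toy problem: 3 timesteps, classes {0: 'a', 1: 'b', 2: blank}
probs = np.array([[0.6, 0.1, 0.3],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.6, 0.3]])
loss, total = ctc_loss_brute_force(probs, label=[0, 1], blank=2)
print(f"P('ab') = {total:.3f}, CTC loss = {loss:.4f}")
```

Five paths contribute here (aab, abb, a-b, ab-, -ab), which is the small-scale version of the three alignments drawn in the diagram.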
Section 7

Worked Example (End-to-End): From Leaf Photo to Diagnosis

🔬 Step-by-Step: A Tomato Leaf Passes Through Project 1

Step 1: Image Capture

The farmer photographs a tomato leaf showing yellow spots. The phone camera produces a 3024 × 4032 × 3 JPEG (12 MP).

Step 2: Preprocessing

The app resizes the image to 224 × 224 × 3. MobileNetV2 preprocessing scales pixels to [-1, 1].

Original: 3024×4032×3 → Resized: 224×224×3 → Preprocessed: values in [-1, 1]
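The scaling in Step 2 is a single line of arithmetic: divide by 127.5 and subtract 1, which is what the Keras MobileNetV2 `preprocess_input` function applies. A minimal NumPy sketch follows; the helper name `scale_to_unit_range` is illustrative.

```python
import numpy as np

def scale_to_unit_range(pixels_uint8):
    """Map uint8 pixel values in [0, 255] to floats in [-1, 1]."""
    return pixels_uint8.astype(np.float32) / 127.5 - 1.0

# One pixel with the darkest, a middle, and the brightest channel value
image = np.array([[[0, 128, 255]]], dtype=np.uint8)
print(scale_to_unit_range(image))
```

The same scaling must be applied inside the mobile app, since feeding raw 0-255 values to a model trained on [-1, 1] inputs silently degrades its predictions.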

Step 3: MobileNetV2 Feature Extraction

The frozen MobileNetV2 backbone extracts a 7 × 7 × 1280 feature map. Each of the 1280 channels detects different visual patterns: edges, textures, color blobs, disease-specific patterns.

Input: (1, 224, 224, 3) → MobileNetV2 → Output: (1, 7, 7, 1280)

Step 4: Global Average Pooling

GAP compresses each 7 × 7 feature map into a single number by averaging, producing a 1280-dim vector.

(1, 7, 7, 1280) → GAP → (1, 1280)
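Global average pooling is simply a mean over the two spatial axes. The NumPy sketch below uses random numbers as a stand-in for the real backbone output to show the shape change.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for the (batch, height, width, channels) backbone output
feature_map = rng.random((1, 7, 7, 1280), dtype=np.float32)

# Global average pooling: mean over the height and width axes
pooled = feature_map.mean(axis=(1, 2))

print(feature_map.shape, "->", pooled.shape)
```

Because the 49 spatial positions are averaged away, the classification head has 1280 inputs instead of 62,720, which keeps the head small enough for on-device use.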

Step 5: Classification Head

The trained head maps 1280 features → 256 → 128 → 38 class probabilities.

(1, 1280) → Dense(256, ReLU) → Dense(128, ReLU) → Dense(38, Softmax)

Step 6: Output

Softmax output: class 27 ("Tomato___Early_blight") gets probability 0.93. Top-3:

  • Tomato Early Blight: 93.2%
  • Tomato Septoria Leaf Spot: 4.1%
  • Tomato Late Blight: 1.8%
Step 7: App Display

App shows: "Early Blight detected (93% confidence). Recommended: Apply Mancozeb 75% WP @ 2.5 g/L water. Spray every 10 days." The message appears in Hindi if the user preference is set.

Latency Breakdown (on Redmi Note 12, ₹12,999)
Step | Time (ms)
Image decode + resize | 3.2
MobileNetV2 inference (TFLite INT8) | 6.5
Post-processing + UI update | 1.8
Total | 11.5
Section 8

Case Study: CropIn and AI-Powered Agriculture at Scale in India

🌾 CropIn Technology: From Bangalore Startup to Global AgriTech Leader

Background

CropIn Technology Solutions (founded 2010, Bangalore) is India's leading AgriTech AI company. Their platform "SmartFarm" uses satellite imagery + smartphone CV to monitor crop health across 56+ countries, covering 21.6 million acres.

The CV Pipeline

  • Satellite imagery: Sentinel-2 and Planet Labs provide multi-spectral imagery at 10m resolution. NDVI (Normalized Difference Vegetation Index) is computed for large-area crop health assessment.
  • Smartphone CV: Field agents capture close-up images of crops. CNN-based models identify diseases, pest damage, and nutrient deficiencies, similar to our Project 1.
  • Model stack: EfficientNet-B3 backbone → Multi-task head (disease classification + severity estimation + pest identification). Trained on 12 million annotated images.
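The NDVI mentioned in the satellite step is a per-pixel ratio of near-infrared and red reflectance: NDVI = (NIR - Red) / (NIR + Red). The sketch below is a generic NumPy version, not CropIn's pipeline; the function name `ndvi` and the reflectance values are made up for illustration. For Sentinel-2, the NIR and red inputs would typically come from bands 8 and 4.

```python
import numpy as np

def ndvi(nir, red, eps=1e-8):
    """NDVI = (NIR - Red) / (NIR + Red); eps avoids division by zero on dark pixels."""
    nir = np.asarray(nir, dtype=np.float64)
    red = np.asarray(red, dtype=np.float64)
    return (nir - red) / (nir + red + eps)

# Made-up reflectances: dense vegetation, sparse vegetation, bare soil
nir = [0.50, 0.30, 0.25]
red = [0.05, 0.15, 0.20]
print(np.round(ndvi(nir, red), 2))
```

Healthy leaves reflect strongly in the near-infrared and absorb red light, so higher values indicate denser or healthier vegetation, and a drop over time can flag a field for a close-up smartphone inspection.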

Scale & Impact

Metric | Value
Farmers impacted | 7.1 million across 56 countries
Acres monitored | 21.6 million
Crops covered | 388 crop types
Disease detection accuracy | ~95% (field conditions)
Funding raised | $24 million (Series C)
Yield improvement | 15-20% for monitored farms

Technical Challenges Solved

  • Low connectivity: Models run on-device with results synced when connectivity is available
  • Diverse crops: Transfer learning from a shared backbone fine-tuned per crop family
  • Regional diseases: Different diseases are prevalent in Maharashtra, Karnataka, and Punjab, so model selection is location-aware
  • Farmer trust: Grad-CAM visualizations show which leaf regions triggered the diagnosis

Key Takeaway

CropIn demonstrates that the techniques you learned in this chapter (MobileNet transfer learning, TFLite conversion, on-device inference) are the building blocks of a ₹1,000+ crore business impacting millions of lives.

Section 9

Common Mistakes in Applied CV Projects

Mistake 1: Training on lab data, deploying in the field. PlantVillage images are taken in controlled lighting on plain backgrounds. Real farm photos have soil, multiple leaves, fingers, and shadows. Always validate on field-collected data before claiming "95% accuracy."
Mistake 2: Ignoring class imbalance. In mask detection, "with_mask" images may outnumber "mask_incorrect" 10:1. Without class weights or oversampling, the model learns to predict the majority class. Use class_weight or focal loss.
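One common way to derive class weights is the "balanced" heuristic, weight = N / (K × n_c), where N is the number of samples, K the number of classes, and n_c the count for class c. The sketch below uses hypothetical counts, and the class name `without_mask` is our assumption for the third class.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: N / (K * n_c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical label counts with a 10:1 imbalance between two classes.
labels = (["with_mask"] * 1000
          + ["without_mask"] * 400
          + ["mask_incorrect"] * 100)

weights = balanced_class_weights(labels)
print(weights)
# {'with_mask': 0.5, 'without_mask': 1.25, 'mask_incorrect': 5.0}
```

A dictionary like this, keyed by integer class index instead of name, is what Keras expects for the `class_weight` argument of `model.fit`.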
Mistake 3: Using horizontal flip for asymmetric objects. Traffic signs like "Keep Left" and "Keep Right" are mirror images. Random horizontal flip will confuse the model. Only use flips when the object is truly symmetric.
Mistake 4: Fine-tuning with too high a learning rate. Using lr=1e-3 for fine-tuning pretrained layers destroys learned ImageNet features. Use lr=1e-5 or lower for unfrozen pretrained layers. This is one of the most common reasons fine-tuning "doesn't work."
Mistake 5: Not testing on the target device. A model with 95% accuracy on a Colab GPU may run at 0.5 FPS on a ₹8,000 phone. Always benchmark on the target hardware before finalizing the architecture, for example with the TFLite Benchmark Tool.
Mistake 6: Evaluating OCR with accuracy instead of CER. Per-character accuracy of 95% sounds good, but it means about one error every 20 characters. If errors are independent, a 5-character word is read correctly only about 77% of the time (0.95^5 ≈ 0.77), so roughly one word in four is wrong. CER (Character Error Rate) and WER (Word Error Rate) are the correct metrics for OCR.
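Both metrics reduce to an edit distance divided by the reference length. The following is a minimal sketch using the standard Levenshtein recurrence; the function names are ours. It counts Unicode code points, so for Devanagari a consonant with a matra counts as more than one "character" unless you first segment the text into grapheme clusters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word Error Rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

print(cer("hello world", "hallo world"))  # 1 edit over 11 characters
print(wer("hello world", "hallo world"))  # 1 of 2 words wrong
```

A single wrong character gives a CER of about 9% but a WER of 50%, which is why the two numbers should be reported together.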
Mistake 7: Forgetting data leakage in augmentation. If you augment first, then split train/val, augmented versions of the same image may appear in both sets. Always split first, then augment only the training set.
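The safe ordering can be sketched in a few lines. Here `fake_augment` is a hypothetical stand-in that only renames a file. In a real pipeline it would apply flips, rotations, or color jitter to the image itself.

```python
import random

def split_then_augment(items, augment, val_fraction=0.2, copies=2, seed=42):
    """Split first, then augment only the training portion."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    val = shuffled[:n_val]      # validation stays untouched
    train = shuffled[n_val:]
    augmented = [augment(x, rng) for x in train for _ in range(copies)]
    return train + augmented, val

def fake_augment(name, rng):
    """Hypothetical augmentation: tags the source name with a random id."""
    return f"{name}_aug{rng.randint(0, 9999)}"

images = [f"leaf_{i:03d}" for i in range(10)]
train_set, val_set = split_then_augment(images, fake_augment)

# No validation image, original or augmented, appears in the training set.
leaked = [t for t in train_set if t.split("_aug")[0] in val_set]
print(len(train_set), len(val_set), len(leaked))  # 24 2 0
```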
Section 10

Comparison Table: All Five Projects

Aspect | P1: Plant Disease | P2: Currency | P3: Traffic Signs | P4: Mask Detection | P5: Devanagari OCR
Task Type | Classification | Classification | Classification | Object Detection | Sequence Recognition
Architecture | MobileNetV2 | Custom CNN | ResNet-18 | YOLOv5s | CRNN (CNN+LSTM)
Framework | TensorFlow/Keras | TensorFlow/Keras | PyTorch | PyTorch (YOLOv5) | TensorFlow/Keras
Classes | 38 | 7 | 57 | 3 | 56 characters
Key Metric | Accuracy: 96.2% | Accuracy: 98.4% | Accuracy: 94.7% | mAP@50: 90.0% | CER: 7.1%
Parameters | 3.4M | 0.97M | 11.3M | 7.2M | ~4M
Deployment | TFLite (phone) | TFLite + TTS | ONNX (car ADAS) | Edge GPU (Jetson) | Cloud/Server
Loss Function | Cross-Entropy | Cross-Entropy | Cross-Entropy + Label Smoothing | YOLOv5 composite | CTC Loss
Real-Time? | Yes (single image) | Yes (single image) | Yes (14 ms/frame) | Yes (28 FPS) | Near-real-time
Indian Context | ₹90K cr crop loss | 63M visually impaired | 4.6L accidents/yr | COVID compliance | Digital India OCR

When to Use Which Architecture?

Scenario | Recommended Approach | Why
Single object in image, need class label | Classification (MobileNetV2/EfficientNet) | Simple, fast, well-studied
Multiple objects, need locations | Detection (YOLOv5/RetinaNet) | Outputs bounding boxes + classes
Need to read text in image | OCR (CRNN + CTC or Transformer) | Handles variable-length output
Mobile/edge deployment critical | MobileNetV2 + TFLite INT8 | 3.4M params, 6.5 ms on phone
Highest accuracy, server deployment | EfficientNet-B4 or Vision Transformer | More params but cloud handles it
Section 11

Exercises

Section A: Multiple Choice Questions (10)

Q1

In Project 1, why is MobileNetV2 chosen over ResNet-50 for plant disease detection on a ₹8,000 smartphone?

  A. MobileNetV2 has higher ImageNet accuracy
  B. MobileNetV2 has fewer parameters and lower latency on mobile CPUs
  C. ResNet-50 cannot be converted to TFLite
  D. MobileNetV2 doesn't need pretraining
✅ B: MobileNetV2 (3.4M params, 6.5 ms on Pixel 3) vs. ResNet-50 (25.6M params, 32 ms). On low-end phones, the roughly 5× latency difference makes ResNet-50 impractical for real-time use. ResNet-50 can be converted to TFLite; it is just too slow.
Understand | Project 1
Q2

During fine-tuning in Phase 2, the learning rate is reduced from 1e-3 to 1e-5. What happens if you keep using 1e-3?

  A. Training converges faster to a better minimum
  B. Pretrained ImageNet features are destroyed by large gradient updates
  C. The model underfits because the learning rate is too small
  D. There is no effect since the base layers are frozen
✅ B: Pretrained weights encode useful low-level features (edges, textures). A high learning rate causes catastrophic forgetting: large updates overwrite these features before the model can adapt. In Phase 2 the base layers are unfrozen, so the learning rate matters.
Understand | Transfer Learning
Q3

In the currency recognition project (P2), why is a confidence threshold of 85% used before announcing the denomination?

  A. To save battery by reducing TTS calls
  B. To prevent incorrect announcements that could cause financial harm to visually impaired users
  C. Because the model always outputs confidence above 85%
  D. To comply with RBI regulations
✅ B: For visually impaired users, an incorrect denomination announcement (e.g., saying "₹100" for a ₹500 note) could lead to financial loss. The 85% threshold ensures that uncertain predictions trigger a "please retry" message rather than a wrong answer. Safety-critical applications need high confidence thresholds.
Analyze | Project 2
Q4

Why should horizontal flip NOT be used when augmenting traffic sign data?

  A. It makes training slower
  B. It creates mislabeled signs: "Keep Left" becomes "Keep Right," confusing the model
  C. Traffic signs are always perfectly upright
  D. Horizontal flip only works for natural images
✅ B: Many traffic signs are directional. A "Turn Left" sign flipped horizontally becomes "Turn Right", a completely different class. This augmentation creates incorrectly labeled training samples, degrading model performance on directional signs.
Analyze | Project 3
Q5

In YOLOv5-based mask detection, what does NMS (Non-Maximum Suppression) with IoU threshold 0.45 do?

  A. Removes all detections with confidence below 0.45
  B. Removes duplicate bounding boxes for the same face by keeping only the highest-confidence one
  C. Limits the total number of detections to 45
  D. Increases the model's recall by 45%
✅ B: YOLO produces many overlapping bounding boxes per object. NMS compares IoU (Intersection over Union) between boxes of the same class and suppresses (removes) any box whose IoU with a higher-confidence box exceeds 0.45. Ideally this leaves one box per face.
Understand | Project 4
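The suppression rule in Q5 can be sketched as a greedy loop over one class. The boxes and scores below are made up, and YOLOv5 uses its own batched GPU implementation rather than this code.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.45):
    """Greedy NMS for one class; detections is a list of (box, score)."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest remaining confidence
        kept.append(best)
        # Drop every box that overlaps the kept one too strongly.
        remaining = [d for d in remaining
                     if iou(best[0], d[0]) <= iou_threshold]
    return kept

# Two overlapping boxes on one face plus a separate face (made-up values).
detections = [((10, 10, 110, 110), 0.92),
              ((15, 12, 112, 108), 0.85),
              ((200, 50, 300, 150), 0.80)]

kept = nms(detections)
print([score for _, score in kept])  # [0.92, 0.8]
```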
Q6

In Project 5 (Devanagari OCR), why is CTC loss used instead of standard cross-entropy?

  A. CTC loss is faster to compute
  B. CTC handles variable-length output without requiring character-level alignment between input and labels
  C. Cross-entropy cannot be used with RNNs
  D. CTC automatically handles Devanagari matras
✅ B: In OCR, we don't know which pixel column corresponds to which character. CTC loss marginalizes over all possible alignments between the input sequence (CNN time steps) and the output label sequence. This is crucial because "नमस्ते" could align in many different ways across the 64 time steps.
Understand | Project 5
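At inference time, the simplest way to turn per-time-step predictions back into text is greedy CTC decoding: merge consecutive repeats, then drop blanks. The sketch below assumes the input is already the argmax symbol at each time step and uses "-" as a stand-in for the blank token.

```python
from itertools import groupby

BLANK = "-"  # stand-in symbol for the CTC blank token

def ctc_greedy_decode(path, blank=BLANK):
    """Collapse consecutive repeats, then remove blanks."""
    collapsed = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in collapsed if s != blank)

print(ctc_greedy_decode(list("--cc-a-tt-")))  # cat
print(ctc_greedy_decode(list("hheel-llo")))   # hello
print(ctc_greedy_decode(list("hheelllo")))    # helo
```

The last two calls show why the blank matters: without a blank between the two runs of "l", a genuine double letter collapses into one.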
Q7

When converting a Keras model to TFLite with INT8 quantization, model size decreases from 10.2 MB to 3.1 MB. What best explains this roughly 3.3× reduction?

  A. FP32 (4 bytes) → INT8 (1 byte) = 4× reduction, but overhead keeps it at 3.3×
  B. FP64 → FP16 = 4× reduction
  C. Only weights are quantized, biases remain FP32
  D. The model architecture itself is simplified
✅ A: FP32 uses 4 bytes per parameter; INT8 uses 1 byte, giving a theoretical 4× compression. In practice, some layers (BatchNorm, biases) may stay in higher precision, and model metadata/graph structure adds fixed overhead, yielding ~3.3× compression. The model architecture is unchanged; only the number representation changes.
Analyze | Deployment
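To see what "only the number representation changes" means, here is a sketch of generic affine INT8 quantization, where real ≈ scale × (q − zero_point). It illustrates the idea on four made-up weights and is not a reproduction of the TFLite converter, which chooses ranges per tensor or per channel.

```python
def quantize_params(x_min, x_max, q_min=-128, q_max=127):
    """Pick scale and zero point so [x_min, x_max] maps onto INT8."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # range must contain 0
    scale = (x_max - x_min) / (q_max - q_min)
    if scale == 0:
        scale = 1.0  # degenerate all-zero tensor
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    """Map a float to its nearest INT8 code, clamped to the valid range."""
    q = round(x / scale) + zero_point
    return max(q_min, min(q_max, q))

def dequantize(q, scale, zero_point):
    """Recover the approximate float value from an INT8 code."""
    return scale * (q - zero_point)

weights = [-0.5, 0.0, 0.25, 1.0]  # made-up FP32 weights
scale, zero_point = quantize_params(min(weights), max(weights))
codes = [quantize(w, scale, zero_point) for w in weights]
restored = [dequantize(q, scale, zero_point) for q in codes]

print(codes)
print([round(r, 4) for r in restored])
print(round(10.2 / 3.1, 2))  # size ratio from Q7: 3.29
```

Each restored weight differs from the original by at most half a quantization step (scale / 2), which is why accuracy usually drops only slightly.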
Q8

Which evaluation metric is most appropriate for comparing OCR systems?

  A. Top-1 Accuracy
  B. mAP@50
  C. Character Error Rate (CER)
  D. AUC-ROC
✅ C: CER measures the edit distance between predicted and ground truth text, normalized by ground truth length. It captures insertions, deletions, and substitutions, which together cover all types of OCR errors. Top-1 accuracy treats entire words as right/wrong (too harsh), mAP is for detection, and AUC-ROC is for binary classification.
Evaluate | Metrics
Q9

In the traffic sign project, label smoothing of 0.1 is applied. What does this do?

  A. Smooths the input images to remove noise
  B. Replaces hard labels [0,0,1,0] with soft labels [0.025, 0.025, 0.925, 0.025], preventing overconfident predictions
  C. Applies Gaussian smoothing to the loss curve
  D. Reduces the number of training labels by 10%
✅ B: Label smoothing with ε=0.1 takes 0.1 of the probability mass from the true class and spreads it uniformly over all K classes, so the true class gets 1 − ε + ε/K and every other class gets ε/K. For the 4-class example in option B, that is 0.925 and 0.025. For 57 classes, the true class gets 0.9 + 0.1/57 ≈ 0.9018, and each wrong class gets 0.1/57 ≈ 0.00175. This regularizes the model, preventing it from becoming overconfident and improving generalization.
Understand | Regularization
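The arithmetic in Q9 is easy to check with a small helper. The sketch follows the uniform-over-all-classes convention (true class gets 1 − ε + ε/K), and the function name `smooth_labels` is ours.

```python
def smooth_labels(num_classes, true_index, epsilon=0.1):
    """Soft target: epsilon spread uniformly over all classes."""
    off_value = epsilon / num_classes
    target = [off_value] * num_classes
    target[true_index] = 1.0 - epsilon + off_value
    return target

four = smooth_labels(4, true_index=2)
print([round(v, 4) for v in four])     # [0.025, 0.025, 0.925, 0.025]

signs = smooth_labels(57, true_index=0)
print(round(signs[0], 4), round(signs[1], 5))  # 0.9018 0.00175
```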
Q10

Which of the following is the correct order for building an applied CV project?

  A. Train model → Collect data → Evaluate → Deploy
  B. Select architecture → Collect data → Deploy → Evaluate
  C. Frame problem → Collect/prepare data → Select/train model → Evaluate → Optimize → Deploy
  D. Deploy baseline → Collect data from production → Retrain → Evaluate
✅ C: The standard applied CV pipeline: (1) Problem framing with clear metrics and constraints, (2) Data collection and engineering, (3) Model selection and training, (4) Thorough evaluation, (5) Optimization for target hardware, (6) Deployment with monitoring. Option D describes a valid iteration strategy but not the initial build.
Remember | Pipeline

Section B: Short Answer Questions (5)

B1

Explain why Global Average Pooling (GAP) is preferred over Flatten + Dense in MobileNetV2 for plant disease detection. What specific advantage does it provide for mobile deployment?

Understand | Beginner
B2

A currency recognition model achieves 99% accuracy in the lab but 82% accuracy when tested with real users. List three reasons for this performance gap and one solution for each.

Analyze | Intermediate
B3

Compare the loss functions used in Project 1 (categorical cross-entropy), Project 4 (YOLO composite loss), and Project 5 (CTC loss). Why can't a single loss function work for all three tasks?

Evaluate | Intermediate
B4

Explain CTC greedy decoding with an example. Given a CTC output sequence [blank, क, क, blank, म, blank, ल, ल, blank], what is the decoded text? Show each step.

Apply | Beginner
B5

Why is cosine annealing learning rate schedule used in the traffic sign project instead of a constant learning rate? Draw a rough sketch of how the learning rate changes over 25 epochs.

Understand | Intermediate

Section C: Long Answer Questions (3)

C1

[15 marks] You are building a CV system for the Indian Postal Service to automatically read handwritten PIN codes (6 digits, Devanagari numerals ०-९) from postcards. Design the complete pipeline: (a) dataset collection strategy, (b) model architecture with justification, (c) loss function choice, (d) evaluation metrics, (e) deployment plan for 100+ post offices. Discuss at least three challenges specific to this application.

Create | Advanced
C2

[12 marks] Compare transfer learning (Project 1: MobileNetV2 pretrained) with training from scratch (Project 2: custom CNN). When does each approach work better? Analyze the data requirements, training time, and final accuracy for both approaches. Use specific numbers from this chapter to support your argument.

Evaluate | Intermediate
C3

[12 marks] Discuss the ethical considerations in deploying face mask detection systems (Project 4) in Indian public spaces. Cover: (a) privacy concerns under the DPDPA (Digital Personal Data Protection Act, 2023), (b) potential for misuse (surveillance), (c) bias (performance across different skin tones, face types, head coverings), (d) consent and transparency requirements. Propose a responsible deployment framework.

Evaluate | Advanced

Section D: Programming Exercises (5)

D1

Grad-CAM Visualization (Project 1): Implement Grad-CAM for the plant disease model. Given a leaf image, generate a heatmap showing which regions the model focuses on. Overlay the heatmap on the original image. Use TensorFlow's tf.GradientTape.

Apply | Intermediate
D2

Data Augmentation Ablation (Project 2): Train the currency CNN three times: (a) no augmentation, (b) only geometric augmentation (flip, rotate), (c) full augmentation (geometric + color). Plot all three training curves on the same graph. Report final accuracy for each. What is the accuracy improvement from augmentation?

Analyze | Intermediate
D3

Confusion Matrix Analysis (Project 3): Generate and visualize the confusion matrix for the traffic sign model. Identify the top-3 most confused class pairs. For each pair, hypothesize why confusion occurs and suggest a data/model fix.

Analyze | Intermediate
D4

FPS Benchmarking (Project 4): Benchmark the mask detection model at three resolutions: 320×320, 640×640, and 1280×1280. For each, record FPS, mAP@50, and memory usage. Plot the accuracy vs. speed trade-off curve. Which resolution is optimal for Delhi Metro CCTV deployment?

Evaluate | Advanced
D5

CTC Beam Search (Project 5): Implement beam search decoding (beam width=5) for the OCR model and compare results against greedy decoding on 100 test samples. Report CER for both methods. How much does beam search improve accuracy?

Apply | Advanced

Section E โ€” Mini-Projects

E1

Indian Food Recognition (Classification): Build a model that classifies 20 popular Indian dishes (dosa, idli, biryani, chole bhature, pav bhaji, etc.) from food images. Use EfficientNet-B0 transfer learning. Collect 200+ images per class from Google Images/Zomato. Deploy as a Streamlit web app where users upload food photos and get calorie estimates. Target: >90% accuracy.

Create | Intermediate
E2

Vehicle Number Plate Recognition (OCR + Detection): Build an end-to-end ANPR (Automatic Number Plate Recognition) system for Indian number plates. Step 1: Detect plate region using YOLOv5. Step 2: Read characters using CRNN+CTC. Handle Indian plate formats (e.g., MH 12 AB 1234, DL 01 C AB 1234). Test on 200+ images. Target: >85% full-plate read accuracy.

Create | Advanced
Section 12

Chapter Summary

Key Takeaways: Applied Computer Vision

  • The applied CV pipeline is universal: Problem Framing → Dataset Engineering → Model Selection → Training → Evaluation → Optimization → Deployment. Master this lifecycle, and you can tackle any CV problem.
  • Transfer learning is the default: Start with ImageNet-pretrained backbones (MobileNetV2, ResNet, EfficientNet). Only train from scratch when your domain is radically different from natural images (medical X-rays, satellite imagery).
  • Two-phase fine-tuning works best: In Phase 1, freeze the backbone and train only the new classification head with lr=1e-3. In Phase 2, unfreeze the top layers and fine-tune with lr=1e-5. This preserves learned features while adapting to your task.
  • Data augmentation is non-negotiable: RandomFlip, Rotation, Zoom, Brightness, and Contrast simulate real-world variation and can improve accuracy by 5-15%. But be domain-aware: no horizontal flip for traffic signs!
  • Classification ≠ Detection ≠ OCR: Different tasks need different architectures (CNN+Softmax vs. YOLO vs. CRNN+CTC) and different metrics (Accuracy vs. mAP vs. CER). Choose based on your problem, not popularity.
  • Deployment optimization is half the battle: TFLite INT8 quantization gives ~3-4× compression with <1% accuracy drop. Always benchmark on target hardware. A 96% accurate model that runs at 0.5 FPS is useless.
  • Indian context matters: Bilingual traffic signs, Devanagari script complexity, ₹8,000 smartphone constraints, diverse lighting conditions, and faded or damaged objects all mean that solutions must be designed for local reality, not just benchmarked on Western datasets.
  • Evaluation must go beyond accuracy: Confusion matrices reveal which classes are confused. Grad-CAM shows where the model looks. Per-class F1 exposes minority-class failures. Always perform error analysis, not just metric reporting.

Project Quick Reference

Project | Best Architecture | Key Metric | Deployment
P1: Plant Disease | MobileNetV2 (fine-tuned) | 96.2% accuracy | TFLite on Android
P2: Currency Notes | Custom CNN (4 blocks) | 98.4% accuracy | TFLite + TTS
P3: Traffic Signs | ResNet-18 (fine-tuned) | 94.7% accuracy | ONNX for ADAS
P4: Face Mask | YOLOv5s | 90.0% mAP@50 | Jetson Nano edge
P5: Devanagari OCR | CRNN (CNN+BiLSTM+CTC) | 7.1% CER | Cloud API
The techniques in this chapter are not just academic exercises. CropIn uses CNN-based crop disease detection to monitor 21.6 million acres. DigiYatra uses face detection at Indian airports. Google Lens reads Hindi text using CRNN-style models. You now have the skills to build production-grade CV systems that impact millions of lives across India.
Section 13

References

Research Papers

  1. Sandler, M., et al. (2018). "MobileNetV2: Inverted Residuals and Linear Bottlenecks." CVPR 2018. Foundation for Project 1.
  2. He, K., et al. (2016). "Deep Residual Learning for Image Recognition." CVPR 2016. ResNet architecture used in Project 3.
  3. Redmon, J. & Farhadi, A. (2018). "YOLOv3: An Incremental Improvement." arXiv:1804.02767. Object detection foundation for Project 4.
  4. Shi, B., Bai, X., & Yao, C. (2017). "An End-to-End Trainable Neural Network for Image-Based Sequence Recognition." IEEE TPAMI. CRNN architecture for Project 5.
  5. Graves, A., et al. (2006). "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006. CTC loss theory.
  6. Hughes, D. P. & Salathé, M. (2015). "An open access repository of images on plant health." arXiv:1511.08060. PlantVillage dataset paper.
  7. Stallkamp, J., et al. (2012). "Man vs. Computer: Benchmarking Machine Learning Algorithms for Traffic Sign Recognition." Neural Networks. GTSRB benchmark reference.
  8. Selvaraju, R. R., et al. (2017). "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization." ICCV 2017. Model interpretability technique.

Indian Context & Datasets

  1. ICAR Annual Report 2022-23. "Crop Losses due to Pests and Diseases in India." Agricultural loss statistics.
  2. Ministry of Road Transport & Highways (MoRTH). "Road Accidents in India 2022." Traffic safety statistics.
  3. Reserve Bank of India. "Banknote Features: Mahatma Gandhi (New) Series." Currency note specifications.
  4. IIIT Hyderabad CVIT Lab. "Indian Language OCR Benchmark." Devanagari OCR research.
  5. CropIn Technology. "SmartFarm Platform: Technical Whitepaper." Industry case study reference.

Libraries & Tools

  1. TensorFlow Team. TensorFlow Lite Documentation. tensorflow.org/lite
  2. Ultralytics. YOLOv5 Documentation. docs.ultralytics.com
  3. PyTorch Team. TorchVision Models. pytorch.org/vision
  4. ONNX Runtime. Cross-Platform Inference. onnxruntime.ai

Textbooks

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapters 9 (CNNs) and 10 (Sequence Modeling).
  2. Chollet, F. (2021). Deep Learning with Python. 2nd Edition. Manning. Transfer learning and Keras implementation patterns.
  3. Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. 3rd Edition. O'Reilly. Practical CV project methodology.