Neural Networks & Deep Learning
Chapter 17: Applied Deep Learning
Computer Vision Projects: From Farm to City
Reading Time: ~5 hours | Part V: Applied Deep Learning | Project-Driven Chapter
Prerequisites: Chapter 12 (CNNs), Chapter 13 (Modern Architectures), Chapter 14 (Object Detection basics), Python/TensorFlow proficiency
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| 🔵 Remember | Recall standard CV project pipelines: data collection → preprocessing → model selection → training → evaluation → deployment |
| 🔵 Understand | Explain why MobileNetV2 suits edge deployment, why CTC loss handles variable-length OCR, and how YOLO achieves real-time detection |
| 🟢 Apply | Build 5 complete CV projects: plant disease detection, currency recognition, traffic sign classification, mask detection, and Devanagari OCR |
| 🟡 Analyze | Diagnose model failures through confusion matrices, precision-recall trade-offs, and failure-mode analysis per project |
| 🟠 Evaluate | Choose optimal architectures, hyperparameters, and deployment strategies for real-world constraints (₹8,000 phone, real-time webcam, etc.) |
| 🔴 Create | Design end-to-end deployable CV systems with data pipelines, model optimization (TFLite/ONNX), and production monitoring |
Learning Objectives
By the end of this chapter, you will be able to:
- Build a plant disease detection system using MobileNetV2 transfer learning on the PlantVillage dataset, achieve >95% accuracy, and deploy it on a ₹8,000 Android smartphone via TensorFlow Lite
- Develop an Indian currency note recognition app with voice output for visually impaired users, handling ₹10 to ₹2000 denominations under varied lighting conditions
- Train a traffic sign recognition model using ResNet-18 transfer learning, adapted specifically for Indian road signs that differ from European/US standards
- Implement a real-time face mask detection system using YOLOv5, capable of processing webcam feeds at >25 FPS for COVID-compliance monitoring
- Create a Devanagari handwritten text OCR pipeline using CNN + CTC loss, handling the unique challenges of matras, conjuncts, and shirorekha
- Apply the complete ML engineering lifecycle (from problem framing and dataset curation to model evaluation, optimization, and deployment) across all five projects
- Evaluate model performance using domain-appropriate metrics: accuracy, F1-score, mAP, CER/WER, confusion matrices, and Grad-CAM visualizations
- Convert trained models to deployment formats (TFLite, ONNX, TorchScript) and benchmark inference latency on target hardware
Opening Hook: When Pixels Save Livelihoods
🌾 A Farmer in Madhya Pradesh Opens His Phone…
Ramesh, a soybean farmer in Sehore district, notices strange spots on his crop leaves. He can't afford an agronomist visit (₹2,000+). Instead, he opens an app, takes a photo, and within 3 seconds the app tells him: "Bacterial Blight: spray copper oxychloride immediately." He saves his ₹1.5 lakh crop. The app runs entirely on his ₹8,000 Redmi phone, with no internet needed.
Across the country in Mumbai, Priya, who is visually impaired, receives change at a kirana store. She holds each note in front of her phone camera. "Two Hundred Rupees," the phone announces. No more relying on strangers' honesty.
Both these apps are powered by the same technology you'll build in this chapter: Convolutional Neural Networks deployed on edge devices.
CropIn • Microsoft AI for Good • Google TFLite • IIIT Hyderabad • RBI Accessibility
Core Concepts: The Applied CV Pipeline
Before diving into projects, let's establish the universal pipeline that every applied computer vision project follows:
The Applied CV Project Lifecycle
- Step 1: Problem Framing. Define the task type (classification, detection, segmentation, OCR), success metrics, and deployment constraints. A ₹8,000 phone can't run ResNet-152.
- Step 2: Dataset Engineering. Collect/curate data, handle class imbalance, apply augmentation. Data quality beats model complexity, always.
- Step 3: Model Selection. Choose architecture based on accuracy-latency trade-off. Transfer learning from ImageNet pretrained models is the default starting point.
- Step 4: Training & Validation. Fine-tune with proper learning rate schedules, monitor validation metrics, use early stopping.
- Step 5: Evaluation & Error Analysis. Go beyond accuracy: examine confusion matrices, per-class metrics, failure cases, Grad-CAM attention maps.
- Step 6: Optimization & Deployment. Quantize (FP32 → INT8), convert to TFLite/ONNX, benchmark latency, wrap in app/API.
Transfer Learning: The 80/20 of Applied CV
Most of the projects in this chapter use transfer learning: taking a model pretrained on a large dataset such as ImageNet (1.2M images, 1000 classes) and adapting it to our specific task (Project 2 trains a small custom CNN from scratch instead). This is the single most important technique in applied CV:
| Backbone | Params | Top-1 Accuracy | Latency (Pixel 3) | Best For |
|---|---|---|---|---|
| MobileNetV2 | 3.4M | 71.8% | 6.5ms | Mobile/Edge deployment |
| ResNet-18 | 11.7M | 69.8% | 14ms | Good accuracy-speed balance |
| ResNet-50 | 25.6M | 76.1% | 32ms | Server-side, high accuracy |
| EfficientNet-B0 | 5.3M | 77.1% | 11ms | Best accuracy per FLOP |
| YOLOv5s | 7.2M | N/A (detector) | 22ms | Real-time object detection |
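The parameter counts in the table translate into a rough weight-storage size: each FP32 weight takes 4 bytes and each INT8 weight takes 1 byte. The sketch below does that arithmetic for the backbones listed above. The helper name `weight_size_mb` is illustrative, and the estimate ignores file-format overhead and any layers kept in float, so treat it as a ballpark figure rather than a measured file size.

```python
# Rough weight-storage estimate from the parameter counts in the table (millions).
BACKBONE_PARAMS_M = {
    "MobileNetV2": 3.4,
    "ResNet-18": 11.7,
    "ResNet-50": 25.6,
    "EfficientNet-B0": 5.3,
    "YOLOv5s": 7.2,
}


def weight_size_mb(params_millions, bytes_per_weight):
    """Approximate weight storage in MB (1 MB = 10**6 bytes)."""
    return params_millions * 1e6 * bytes_per_weight / 1e6


for name, params_m in BACKBONE_PARAMS_M.items():
    fp32 = weight_size_mb(params_m, 4)  # 32-bit floats
    int8 = weight_size_mb(params_m, 1)  # 8-bit integers after quantization
    print(f"{name:16s} FP32 ~{fp32:6.1f} MB | INT8 ~{int8:5.1f} MB")
```

The 4x gap between the two columns is the main reason INT8 quantization matters on low-storage phones.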
Data Augmentation: Your Free Dataset Multiplier
With limited real-world data, augmentation is crucial. Here's the augmentation toolkit we'll use across projects:
Python
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Standard augmentation pipeline for classification projects
train_datagen = ImageDataGenerator(
    rescale=1.0/255,
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.15,
    zoom_range=0.2,
    horizontal_flip=True,
    brightness_range=[0.8, 1.2],
    fill_mode='nearest'
)
# Validation: ONLY rescale, NO augmentation
val_datagen = ImageDataGenerator(rescale=1.0/255)
🌿 Plant Disease Detection for Indian Agriculture
Plant Disease Detection: Saving ₹90,000 Crore in Annual Crop Loss
Classification • MobileNetV2 • TFLite • Agriculture
Problem Statement
Indian farmers lose approximately 15–25% of their crop yield annually to plant diseases, translating to over ₹90,000 crore in losses. Most smallholder farmers (86% of Indian farmers hold <2 hectares) cannot afford expert agronomist consultations. Early disease detection via smartphone cameras can enable timely intervention, saving crops and livelihoods.
Dataset: PlantVillage
| Property | Details |
|---|---|
| Source | PlantVillage (Penn State University), openly available on Kaggle |
| Total Images | 54,305 images across 38 classes |
| Plants Covered | 14 crop species: Tomato, Potato, Apple, Corn, Grape, etc. |
| Classes | 26 diseases + 12 healthy classes |
| Resolution | 256 × 256 pixels, RGB |
| Indian Relevance | Tomato (₹15,000 cr market), Potato (₹12,000 cr), Corn (₹8,000 cr): all major Indian crops |
Complete Code (Google Colab Runnable)
Python
# =====================================================
# PROJECT 1: Plant Disease Detection using MobileNetV2
# Dataset: PlantVillage (Kaggle)
# Target: >95% accuracy, deployable on ₹8000 phone
# =====================================================
# Step 1: Setup and Dataset Download
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import (
    GlobalAveragePooling2D, Dense, Dropout, BatchNormalization
)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import (
    EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
)
# Download PlantVillage dataset (on Kaggle/Colab)
# !kaggle datasets download -d emmarex/plantdisease
# !unzip plantdisease.zip -d ./plantvillage
# Configuration
IMG_SIZE = 224
BATCH_SIZE = 32
NUM_CLASSES = 38
EPOCHS = 25
DATA_DIR = './plantvillage/PlantVillage'
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU available: {tf.config.list_physical_devices('GPU')}")
Python
# Step 2: Data Loading with tf.data pipeline (faster than ImageDataGenerator)
def load_dataset(data_dir, img_size, batch_size, validation_split=0.2):
    """Load train/validation splits with image_dataset_from_directory."""
    train_ds = tf.keras.utils.image_dataset_from_directory(
        data_dir,
        validation_split=validation_split,
        subset='training',
        seed=42,
        image_size=(img_size, img_size),
        batch_size=batch_size,
        label_mode='categorical'
    )
    val_ds = tf.keras.utils.image_dataset_from_directory(
        data_dir,
        validation_split=validation_split,
        subset='validation',
        seed=42,
        image_size=(img_size, img_size),
        batch_size=batch_size,
        label_mode='categorical'
    )
    class_names = train_ds.class_names
    print(f"Found {len(class_names)} classes:")
    for i, name in enumerate(class_names):
        print(f" {i}: {name}")
    # Performance optimization: prefetch only.
    # An in-memory cache() of ~43,000 float32 images at 224x224x3 (about 0.6 MB each)
    # needs roughly 26 GB, more RAM than a free Colab session typically provides.
    # image_dataset_from_directory already shuffles the training files
    # (shuffle=True by default), so no extra shuffle buffer is added here.
    AUTOTUNE = tf.data.AUTOTUNE
    train_ds = train_ds.prefetch(AUTOTUNE)
    val_ds = val_ds.prefetch(AUTOTUNE)
    return train_ds, val_ds, class_names
train_ds, val_ds, class_names = load_dataset(DATA_DIR, IMG_SIZE, BATCH_SIZE)
Python
# Step 3: Data Augmentation Layer (built into model for TFLite compatibility)
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.2),
    tf.keras.layers.RandomZoom(0.2),
    tf.keras.layers.RandomContrast(0.2),
    tf.keras.layers.RandomBrightness(0.1),
], name='augmentation')
# Visualize augmented samples
for images, labels in train_ds.take(1):
    fig, axes = plt.subplots(2, 4, figsize=(14, 7))
    for i in range(4):
        # Original
        axes[0][i].imshow(images[i].numpy().astype("uint8"))
        axes[0][i].set_title("Original", fontsize=9)
        axes[0][i].axis('off')
        # Augmented (training=True makes the random layers fire outside fit())
        aug_img = data_augmentation(tf.expand_dims(images[i], 0), training=True)
        # Clip before casting: contrast changes can push values outside [0, 255]
        aug_vis = np.clip(aug_img[0].numpy(), 0, 255).astype("uint8")
        axes[1][i].imshow(aug_vis)
        axes[1][i].set_title("Augmented", fontsize=9)
        axes[1][i].axis('off')
    plt.tight_layout()
    plt.show()
Python
# Step 4: Build MobileNetV2 Transfer Learning Model
def build_plant_model(num_classes, img_size=224):
    """Build MobileNetV2-based classifier for plant diseases."""
    # Input layer
    inputs = tf.keras.Input(shape=(img_size, img_size, 3))
    # Augmentation (only active during training)
    x = data_augmentation(inputs)
    # Preprocessing for MobileNetV2 (scale to [-1, 1])
    x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
    # Base model: frozen initially
    base_model = MobileNetV2(
        input_shape=(img_size, img_size, 3),
        include_top=False,
        weights='imagenet'
    )
    base_model.trainable = False  # Freeze all layers
    x = base_model(x, training=False)
    # Classification head
    x = GlobalAveragePooling2D()(x)
    x = BatchNormalization()(x)
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.5)(x)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.3)(x)
    outputs = Dense(num_classes, activation='softmax')(x)
    model = Model(inputs, outputs)
    print(f"Total params: {model.count_params():,}")
    # Count trainable weights from their shapes (works across Keras versions)
    trainable_count = sum(int(np.prod(tuple(w.shape))) for w in model.trainable_weights)
    print(f"Trainable params: {trainable_count:,}")
    return model, base_model
# Size the output layer from the class folders actually found on disk.
# This equals NUM_CLASSES (38) only when the full PlantVillage set is present.
model, base_model = build_plant_model(len(class_names))
Python
# Step 5: Phase 1 Training (train only the head)
model.compile(
    optimizer=Adam(learning_rate=1e-3),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
callbacks_phase1 = [
    EarlyStopping(monitor='val_accuracy', patience=5,
                  restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                      patience=3, min_lr=1e-6),
]
print("Phase 1: Training classification head only...")
history1 = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=10,
    callbacks=callbacks_phase1
)
Python
# Step 6: Phase 2 (unfreeze and fine-tune top layers)
# Unfreeze the last 30 layers of MobileNetV2
base_model.trainable = True
for layer in base_model.layers[:-30]:
    layer.trainable = False
print(f"Fine-tuning {sum(1 for l in base_model.layers if l.trainable)} layers")
# Re-compile with MUCH lower learning rate
model.compile(
    optimizer=Adam(learning_rate=1e-5),  # 100x lower!
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
callbacks_phase2 = [
    EarlyStopping(monitor='val_accuracy', patience=5,
                  restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                      patience=3, min_lr=1e-7),
    ModelCheckpoint('best_plant_model.keras',
                    monitor='val_accuracy',
                    save_best_only=True)
]
print("Phase 2: Fine-tuning with unfrozen layers...")
history2 = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    callbacks=callbacks_phase2
)
Python
# Step 7: Evaluation (Confusion Matrix and Per-Class Metrics)
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
# Predict on validation set
y_true, y_pred = [], []
for images, labels in val_ds:
    preds = model.predict(images, verbose=0)
    y_true.extend(np.argmax(labels.numpy(), axis=1))
    y_pred.extend(np.argmax(preds, axis=1))
y_true, y_pred = np.array(y_true), np.array(y_pred)
# Passing labels= keeps rows aligned with class_names even if a class
# happens to be absent from the validation split
all_labels = np.arange(len(class_names))
# Classification report
print(classification_report(y_true, y_pred, labels=all_labels,
                            target_names=class_names, digits=3))
# Confusion matrix heatmap
cm = confusion_matrix(y_true, y_pred, labels=all_labels)
plt.figure(figsize=(16, 14))
sns.heatmap(cm, annot=False, cmap='Purples',
            xticklabels=class_names, yticklabels=class_names)
plt.title('Plant Disease Detection: Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.xticks(rotation=90, fontsize=7)
plt.yticks(fontsize=7)
plt.tight_layout()
plt.show()
# Overall accuracy
accuracy = np.mean(y_true == y_pred)
print(f"\nOverall Accuracy: {accuracy:.4f} ({accuracy*100:.1f}%)")
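To make the per-class numbers in the classification report less of a black box, here is a minimal pure-Python sketch of how precision, recall, and F1 fall out of a confusion matrix. The function name and the 3-class matrix are illustrative (made-up counts, not PlantVillage results); the row/column layout follows the convention that rows are true classes and columns are predictions.

```python
def per_class_metrics(cm):
    """Return a list of (precision, recall, f1), one tuple per class.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, truly another class
        fn = sum(cm[k]) - tp                       # truly k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics


# Toy 3-class confusion matrix (hypothetical counts)
toy_cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]
for k, (p, r, f1) in enumerate(per_class_metrics(toy_cm)):
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

A class with high recall but low precision is being over-predicted; the matching column of the confusion matrix shows which other classes it is absorbing.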
Python
# Step 8: Convert to TFLite for Mobile Deployment
# Load best model
best_model = tf.keras.models.load_model('best_plant_model.keras')
# Convert to TFLite with INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(best_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Representative dataset for quantization calibration
def representative_dataset():
    for images, _ in val_ds.take(100):
        for img in images:
            yield [tf.expand_dims(img, 0)]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
# Convert and save
tflite_model = converter.convert()
tflite_path = 'plant_disease_model.tflite'
with open(tflite_path, 'wb') as f:
    f.write(tflite_model)
# Compare sizes
original_size = os.path.getsize('best_plant_model.keras') / (1024*1024)
tflite_size = os.path.getsize(tflite_path) / (1024*1024)
print(f"Original model: {original_size:.1f} MB")
print(f"TFLite model: {tflite_size:.1f} MB")
print(f"Compression: {original_size/tflite_size:.1f}x")
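The representative dataset exists because post-training quantization needs to know the range of values each tensor takes; from that range it derives a scale and a zero point that map floats to 8-bit integers. The following is a minimal pure-Python sketch of that affine mapping for a single tensor, assuming an unsigned 8-bit target. The function names are illustrative, and this is not the converter's actual implementation, which handles per-channel ranges and many other details.

```python
def quant_params(x_min, x_max, qmin=0, qmax=255):
    """Derive (scale, zero_point) so that [x_min, x_max] maps onto [qmin, qmax]."""
    # The range must contain 0.0 so that zero is represented exactly
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero tensor
    zero_point = int(round(qmin - x_min / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Float -> integer code, clamped to the representable range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Integer code -> approximate float."""
    return (q - zero_point) * scale


# Hypothetical activations seen while running calibration images
calibration = [-1.0, -0.25, 0.0, 0.4, 1.0]
scale, zp = quant_params(min(calibration), max(calibration))
for x in calibration:
    q = quantize(x, scale, zp)
    print(f"{x:+.2f} -> q={q:3d} -> {dequantize(q, scale, zp):+.4f}")
```

The round-trip error is bounded by roughly one quantization step (`scale`), which is why a calibration set that covers realistic inputs matters: an overly wide range wastes resolution, while a too-narrow one clips real activations.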
Python
# Step 9: Inference with TFLite (simulates phone runtime)
def predict_plant_disease(image_path, tflite_path, class_names):
    """Run inference using TFLite interpreter."""
    import time
    # Load TFLite model
    interpreter = tf.lite.Interpreter(model_path=tflite_path)
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()
    # Load and preprocess image
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    img_array = tf.keras.utils.img_to_array(img).astype(np.uint8)
    img_array = np.expand_dims(img_array, axis=0)
    # Run inference with timing
    start = time.perf_counter()
    interpreter.set_tensor(input_details[0]['index'], img_array)
    interpreter.invoke()
    output = interpreter.get_tensor(output_details[0]['index'])
    latency = (time.perf_counter() - start) * 1000
    # Get prediction
    pred_class = int(np.argmax(output[0]))
    # Dequantize with the output tensor's own scale and zero point
    # instead of assuming a fixed divisor of 255
    scale, zero_point = output_details[0]['quantization']
    confidence = (float(output[0][pred_class]) - zero_point) * scale
    print(f"Prediction: {class_names[pred_class]}")
    print(f"Confidence: {confidence:.2%}")
    print(f"Latency: {latency:.1f} ms")
    return class_names[pred_class], confidence
# Example usage
# predict_plant_disease('test_leaf.jpg', 'plant_disease_model.tflite', class_names)
Deployment Considerations
- Offline-first: Most Indian farms lack reliable internet. TFLite runs entirely on-device.
- Camera quality: ₹8,000 phones have 8-13 MP cameras, sufficient for leaf close-ups at 224×224.
- Multi-language UI: Labels should map to Hindi/regional language disease names and remedies.
- Battery: Single inference uses <10 mJ. Even with 100 scans/day, battery impact is negligible.
Improvement Suggestions
- Add Grad-CAM visualization to show which part of the leaf the model is looking at (builds farmer trust)
- Extend to Indian-specific crops: Bajra, Jowar, Sugarcane, Mustard
- Collect field data (not lab-controlled PlantVillage images) for robustness
- Integrate weather API + disease prediction: "High humidity → Blight risk next week"
💵 Indian Currency Note Recognition for Visually Impaired
Currency Note Recognition: Enabling Financial Independence
Classification • Custom CNN • TTS • Accessibility
Problem Statement
India has approximately 63 million visually impaired citizens (WHO, 2023). Despite RBI introducing tactile marks on currency notes, wear and tear makes them unreliable. A smartphone app that recognizes currency denominations and announces them via text-to-speech enables financial independence. The system must recognize ₹10, ₹20, ₹50, ₹100, ₹200, ₹500, and ₹2000 notes under varied lighting and orientations.
Dataset Preparation
| Denomination | Dominant Color | Size (mm) | Images to Collect |
|---|---|---|---|
| ₹10 | Chocolate Brown | 123 × 63 | 500+ |
| ₹20 | Greenish Yellow | 129 × 63 | 500+ |
| ₹50 | Fluorescent Blue | 135 × 66 | 500+ |
| ₹100 | Lavender | 142 × 66 | 500+ |
| ₹200 | Bright Yellow | 146 × 66 | 500+ |
| ₹500 | Stone Grey | 150 × 66 | 500+ |
| ₹2000 | Magenta | 166 × 66 | 500+ |
Complete Code
Python
# =====================================================
# PROJECT 2: Indian Currency Note Recognition
# Custom CNN with Voice Output (TTS)
# Target: >98% accuracy on 7 denominations
# =====================================================
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
# Configuration
IMG_SIZE = 128 # Smaller for faster inference
BATCH_SIZE = 32
NUM_CLASSES = 7
EPOCHS = 30
DATA_DIR = './indian_currency_dataset'
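# Class mapping: keys are the class indices produced by image_dataset_from_directory,
# which numbers the class folders in alphanumeric order of their names.
# Name the folders so they sort in this order (or pass class_names= explicitly
# when loading); otherwise these indices will not match the labels.
DENOMINATIONS = {
    0: '₹10 (Ten Rupees)',
    1: '₹20 (Twenty Rupees)',
    2: '₹50 (Fifty Rupees)',
    3: '₹100 (One Hundred Rupees)',
    4: '₹200 (Two Hundred Rupees)',
    5: '₹500 (Five Hundred Rupees)',
    6: '₹2000 (Two Thousand Rupees)',
}
# Hindi equivalents for TTS
DENOMINATIONS_HINDI = {
    0: 'Dus Rupaye',
    1: 'Bees Rupaye',
    2: 'Pachaas Rupaye',
    3: 'Ek Sau Rupaye',
    4: 'Do Sau Rupaye',
    5: 'Paanch Sau Rupaye',
    6: 'Do Hazaar Rupaye',
}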
Python
# Step 2: Data Pipeline with Heavy Augmentation
# Augmentation: simulates real-world conditions
augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.3),       # Notes held at angles
    layers.RandomZoom((-0.2, 0.2)),   # Different distances
    layers.RandomBrightness(0.3),     # Lighting variation
    layers.RandomContrast(0.3),       # Shadow effects
])
# Load datasets
train_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR,
    validation_split=0.2,
    subset='training',
    seed=42,
    image_size=(IMG_SIZE, IMG_SIZE),
    batch_size=BATCH_SIZE,
    label_mode='categorical'
)
val_ds = tf.keras.utils.image_dataset_from_directory(
    DATA_DIR,
    validation_split=0.2,
    subset='validation',
    seed=42,
    image_size=(IMG_SIZE, IMG_SIZE),
    batch_size=BATCH_SIZE,
    label_mode='categorical'
)
class_names = train_ds.class_names
# Optimize pipeline
AUTOTUNE = tf.data.AUTOTUNE
train_ds = train_ds.cache().shuffle(1000).prefetch(AUTOTUNE)
val_ds = val_ds.cache().prefetch(AUTOTUNE)
Python
# Step 3: Custom CNN Architecture, Optimized for Currency
def build_currency_cnn(num_classes=7, img_size=128):
    """
    Custom CNN for currency recognition.
    Design rationale:
    - Color is the primary discriminant → keep 3 channels
    - Relatively few classes (7) → don't need deep network
    - Must be fast → keep it lightweight
    """
    inputs = layers.Input(shape=(img_size, img_size, 3))
    # Augmentation + normalization
    x = augmentation(inputs)
    x = layers.Rescaling(1.0/255)(x)
    # Block 1: Color and edge features
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(0.25)(x)
    # Block 2: Pattern features (Ashoka Pillar, portrait)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(0.25)(x)
    # Block 3: High-level denomination features
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(0.25)(x)
    # Block 4: Abstract features
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling2D()(x)
    # Classifier
    x = layers.Dense(256, activation='relu')(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(128, activation='relu')(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    model = models.Model(inputs, outputs, name='CurrencyNet')
    return model
model = build_currency_cnn()
model.summary()
Python
# Step 4: Training with Class Weights (handle imbalance)
from sklearn.utils.class_weight import compute_class_weight
# Compute class weights
y_train_labels = []
for _, labels in train_ds:
    y_train_labels.extend(np.argmax(labels.numpy(), axis=1))
y_train_labels = np.array(y_train_labels)
class_weights_arr = compute_class_weight(
    'balanced', classes=np.unique(y_train_labels), y=y_train_labels
)
class_weights = dict(enumerate(class_weights_arr))
print("Class weights:", {DENOMINATIONS[k]: f'{v:.2f}' for k, v in class_weights.items()})
# Compile and train
model.compile(
    optimizer=Adam(learning_rate=1e-3),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor='val_accuracy', patience=7,
        restore_best_weights=True
    ),
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss', factor=0.5, patience=3
    ),
    tf.keras.callbacks.ModelCheckpoint(
        'best_currency_model.keras',
        monitor='val_accuracy', save_best_only=True
    ),
]
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS,
    class_weight=class_weights,
    callbacks=callbacks
)
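For reference, the `'balanced'` heuristic used above gives each class the weight `n_samples / (n_classes * count_of_that_class)`, so rarer classes count for more in the loss. A minimal pure-Python sketch of the same formula follows; the function name and the label counts are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Hypothetical imbalanced label list: class 0 is common, class 2 is rare
toy_labels = [0] * 600 + [1] * 300 + [2] * 100
weights = balanced_class_weights(toy_labels)
for c, w in weights.items():
    print(f"class {c}: weight {w:.3f}")
```

With these weights every class contributes the same total weight to the loss (count times weight is identical across classes), which is the sense in which the data is "balanced".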
Python
# Step 5: Voice Output using Text-to-Speech
def recognize_and_speak(image_path, model, class_names, lang='en'):
    """
    Recognize currency note and announce denomination via TTS.
    Works on: Desktop (pyttsx3) or Android (Android TTS API).
    """
    import pyttsx3  # pip install pyttsx3
    # Load and preprocess (the model rescales internally, so keep the 0-255 range)
    img = tf.keras.utils.load_img(image_path, target_size=(128, 128))
    img_array = tf.keras.utils.img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    # Predict
    prediction = model.predict(img_array, verbose=0)
    pred_class = int(np.argmax(prediction[0]))
    confidence = float(prediction[0][pred_class])
    # Determine text to speak
    if confidence < 0.85:
        speech_text = "Note not clearly visible. Please try again."
    else:
        if lang == 'hi':
            speech_text = DENOMINATIONS_HINDI[pred_class]
        else:
            speech_text = DENOMINATIONS[pred_class]
    # Text-to-Speech
    engine = pyttsx3.init()
    engine.setProperty('rate', 150)  # Slower for clarity
    engine.setProperty('volume', 1.0)
    engine.say(speech_text)
    engine.runAndWait()
    print(f"Detected: {DENOMINATIONS[pred_class]}")
    print(f"Confidence: {confidence:.2%}")
    print(f"Spoken: {speech_text}")
    return pred_class, confidence
# Usage:
# recognize_and_speak('test_note.jpg', model, class_names, lang='hi')
Python
# Step 6: Real-time Webcam Recognition
def realtime_currency_detection(model_path):
    """Real-time currency detection using webcam."""
    import cv2
    model = tf.keras.models.load_model(model_path)
    cap = cv2.VideoCapture(0)
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # Preprocess frame
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        resized = cv2.resize(rgb, (128, 128))
        input_tensor = np.expand_dims(resized, axis=0).astype(np.float32)
        # Predict
        prediction = model.predict(input_tensor, verbose=0)
        pred_class = int(np.argmax(prediction[0]))
        confidence = prediction[0][pred_class]
        # Display result. OpenCV's Hershey fonts are ASCII-only, so draw only
        # the spelled-out part of the label (the rupee sign would render as '?')
        label = DENOMINATIONS[pred_class].split('(')[-1].rstrip(')')
        color = (0, 255, 0) if confidence > 0.85 else (0, 165, 255)
        cv2.putText(frame, f"{label}: {confidence:.1%}",
                    (10, 40), cv2.FONT_HERSHEY_SIMPLEX,
                    1.0, color, 2)
        cv2.imshow('Currency Detector', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    cap.release()
    cv2.destroyAllWindows()
# realtime_currency_detection('best_currency_model.keras')
Deployment Considerations
- Accessibility: App must have large buttons, haptic feedback, and screen-reader compatibility
- Confidence threshold: Only announce if confidence >85%. Otherwise, ask user to retry (prevents wrong announcements)
- New note series: RBI periodically introduces new designs. Model needs periodic retraining/OTA update mechanism
- Counterfeit detection: Not in scope, since this requires UV/IR imaging. Clearly disclose this limitation
🚦 Traffic Sign Recognition for Indian Roads
Traffic Sign Recognition: Making Indian Roads Safer
Classification • ResNet-18 • Transfer Learning • ADAS
Problem Statement
India records 4.6 lakh road accidents annually, causing 1.68 lakh deaths (MoRTH, 2023). Advanced Driver Assistance Systems (ADAS) that recognize traffic signs can alert drivers and save lives. However, Indian traffic signs differ significantly from European (GTSRB) and US datasets: bilingual text (Hindi + English), different symbols, faded/damaged signs, and chaotic visual environments.
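The confidence-threshold rule is easiest to verify when it is separated from the model and the speech engine. The sketch below isolates that decision as a pure function so it can be unit-tested without a camera or TTS; the function name, label list, and probability vectors are illustrative assumptions, and 0.85 is the threshold quoted above.

```python
RETRY_PROMPT = "Note not clearly visible. Please try again."


def announcement(probabilities, labels, threshold=0.85):
    """Return the text to speak, or a retry prompt when the model is unsure."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return RETRY_PROMPT
    return labels[best]


LABELS = ["Ten Rupees", "Twenty Rupees", "Fifty Rupees", "One Hundred Rupees",
          "Two Hundred Rupees", "Five Hundred Rupees", "Two Thousand Rupees"]

# Hypothetical softmax outputs: one confident, one split between two notes
confident = [0.01, 0.01, 0.01, 0.02, 0.93, 0.01, 0.01]
uncertain = [0.05, 0.05, 0.10, 0.40, 0.35, 0.03, 0.02]
print(announcement(confident, LABELS))
print(announcement(uncertain, LABELS))
```

For an accessibility app a wrong announcement costs more than a retry, so erring toward the retry prompt is the safer default.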
Indian Traffic Sign Categories
| Category | Shape | Examples | Count |
|---|---|---|---|
| Mandatory | Blue Circle | Keep Left, Ahead Only, Roundabout | 12 classes |
| Prohibitory | Red Circle | No Entry, Speed Limit 40, No Horn | 15 classes |
| Warning | Red-Bordered Triangle | Speed Breaker, Curve, School Zone | 20 classes |
| Informatory | Green/Blue Rectangle | Hospital, Petrol Pump, Parking | 10 classes |
Complete Code: ResNet-18 Transfer Learning
Python
# =====================================================
# PROJECT 3: Indian Traffic Sign Recognition
# ResNet-18 Transfer Learning (PyTorch)
# Target: >94% accuracy on Indian signs
# =====================================================
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision
from torchvision import transforms, datasets, models
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
# Configuration
IMG_SIZE = 224
BATCH_SIZE = 32
NUM_CLASSES = 57 # Indian traffic sign classes
EPOCHS = 25
LR = 1e-3
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {DEVICE}")
Python
# Step 2: Data Transforms for Indian Road Conditions
train_transforms = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(IMG_SIZE),
    # No horizontal flip: mirroring turns direction-dependent signs
    # (e.g. Keep Left) into a different class and corrupts the labels
    transforms.RandomRotation(15),
    transforms.ColorJitter(
        brightness=0.3,
        contrast=0.3,
        saturation=0.2,
        hue=0.1
    ),
    transforms.RandomAffine(
        degrees=0,
        translate=(0.1, 0.1),
        scale=(0.85, 1.15)
    ),
    # Simulate dusty/foggy Indian road conditions
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])
val_transforms = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])
# Load dataset (organized in class folders)
train_dataset = datasets.ImageFolder('./indian_traffic_signs/train', train_transforms)
val_dataset = datasets.ImageFolder('./indian_traffic_signs/val', val_transforms)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                          shuffle=True, num_workers=4, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE,
                        shuffle=False, num_workers=4, pin_memory=True)
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Classes: {len(train_dataset.classes)}")
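The `Normalize` step with the ImageNet mean and standard deviation is plain per-channel arithmetic: after `ToTensor` scales pixels to [0, 1], each channel becomes `(x - mean) / std`. The pure-Python sketch below applies that to a single RGB pixel and inverts it, which is the operation needed to display a normalized tensor as an image again. The function names and the sample pixel are illustrative.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb_uint8):
    """Scale 0-255 to 0-1, then apply per-channel (x - mean) / std."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb_uint8, IMAGENET_MEAN, IMAGENET_STD)
    )


def denormalize_pixel(normalized):
    """Invert the normalization and return integer 0-255 channel values."""
    return tuple(
        round((value * std + mean) * 255.0)
        for value, mean, std in zip(normalized, IMAGENET_MEAN, IMAGENET_STD)
    )


red_sign_pixel = (200, 30, 40)  # made-up RGB value
normalized = normalize_pixel(red_sign_pixel)
print([f"{v:+.3f}" for v in normalized])
print(denormalize_pixel(normalized))
```

Using the same statistics the backbone was pretrained with keeps the input distribution close to what its early layers expect.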
Python
# Step 3: ResNet-18 Transfer Learning Model
class TrafficSignNet(nn.Module):
    """ResNet-18 fine-tuned for Indian traffic signs."""
    def __init__(self, num_classes=57, pretrained=True):
        super().__init__()
        # Load pretrained ResNet-18
        self.backbone = models.resnet18(
            weights=models.ResNet18_Weights.IMAGENET1K_V1 if pretrained else None
        )
        # Freeze early layers (low-level features transfer well)
        for name, param in self.backbone.named_parameters():
            if 'layer3' not in name and 'layer4' not in name:
                param.requires_grad = False
        # Replace classifier head (the new layers are trainable by default)
        in_features = self.backbone.fc.in_features  # 512
        self.backbone.fc = nn.Sequential(
            nn.Linear(in_features, 256),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes)
        )
    def forward(self, x):
        return self.backbone(x)
model = TrafficSignNet(NUM_CLASSES).to(DEVICE)
# Count trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Total params: {total:,}")
print(f"Trainable params: {trainable:,} ({trainable/total*100:.1f}%)")
Python
# Step 4: Training Loop with Cosine Annealing
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=LR, weight_decay=1e-4
)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()
    running_loss, correct, total = 0.0, 0, 0
    for images, labels in tqdm(loader, desc='Training'):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
        _, preds = outputs.max(1)
        correct += preds.eq(labels).sum().item()
        total += labels.size(0)
    return running_loss / total, correct / total
def evaluate(model, loader, criterion, device):
    model.eval()
    running_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            running_loss += loss.item() * images.size(0)
            _, preds = outputs.max(1)
            correct += preds.eq(labels).sum().item()
            total += labels.size(0)
    return running_loss / total, correct / total
# Training loop
best_val_acc = 0
for epoch in range(EPOCHS):
    train_loss, train_acc = train_one_epoch(
        model, train_loader, criterion, optimizer, DEVICE
    )
    val_loss, val_acc = evaluate(model, val_loader, criterion, DEVICE)
    scheduler.step()
    print(f"Epoch {epoch+1}/{EPOCHS} | "
          f"Train Loss: {train_loss:.4f} Acc: {train_acc:.4f} | "
          f"Val Loss: {val_loss:.4f} Acc: {val_acc:.4f} | "
          f"LR: {scheduler.get_last_lr()[0]:.6f}")
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), 'best_traffic_sign_model.pth')
        print(f"  Saved best model (val_acc: {val_acc:.4f})")
print(f"\nBest Validation Accuracy: {best_val_acc:.4f}")
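The cosine annealing schedule used in the loop above follows a simple closed form: the learning rate starts at the base value and decays along half a cosine wave to a minimum over `T_max` epochs. The sketch below evaluates that formula directly, assuming the same base rate (1e-3) and `T_max` (25) as the configuration and a minimum of 0; the function name is illustrative.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-3, t_max=25, eta_min=0.0):
    """eta_t = eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * t / T_max))."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 5, 12, 20, 25):
    print(f"epoch {epoch:2d}: lr = {cosine_annealing_lr(epoch):.6f}")
```

The curve is flat near both ends and steepest in the middle, so the model spends its final epochs making small, stable updates.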
Python
# Step 5: ONNX Export for Cross-Platform Deployment
# Load best model (map_location lets a GPU-trained checkpoint load on CPU too)
model.load_state_dict(torch.load('best_traffic_sign_model.pth', map_location=DEVICE))
model.eval()
# Export to ONNX
dummy_input = torch.randn(1, 3, IMG_SIZE, IMG_SIZE).to(DEVICE)
torch.onnx.export(
    model,
    dummy_input,
    'traffic_sign_model.onnx',
    input_names=['image'],
    output_names=['prediction'],
    dynamic_axes={
        'image': {0: 'batch_size'},
        'prediction': {0: 'batch_size'}
    },
    opset_version=11
)
import os
onnx_size = os.path.getsize('traffic_sign_model.onnx') / (1024 * 1024)
print(f"ONNX model size: {onnx_size:.1f} MB")
print("Ready for deployment with ONNX Runtime!")
Challenges Specific to Indian Roads
- Faded signs: Sun-bleached paint reduces color contrast. Heavy augmentation with brightness/contrast jitter helps.
- Occlusion: Signs hidden behind trees, advertisements, or other signs. Multi-scale detection needed.
- Bilingual text: Hindi + English on informatory signs. OCR integration can supplement classification.
- Non-standard placement: Signs at unusual heights, angles, or locations compared to Western standards.
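The brightness/contrast jitter suggested for faded signs can be sketched in a few lines. This is a dependency-free illustration on a flat list of pixel intensities in [0, 1]; the function name and the jitter ranges are assumptions made for the example. In a PyTorch pipeline the same effect is normally obtained with `torchvision.transforms.ColorJitter`.

```python
import random

def jitter_brightness_contrast(pixels, brightness=0.3, contrast=0.3, rng=random):
    """Randomly rescale contrast around the mean, then brightness.

    `pixels` is a flat list of floats in [0, 1]; the output is clipped to [0, 1].
    """
    b = 1.0 + rng.uniform(-brightness, brightness)  # brightness factor
    c = 1.0 + rng.uniform(-contrast, contrast)      # contrast factor
    mean = sum(pixels) / len(pixels)
    out = []
    for p in pixels:
        v = (p - mean) * c + mean  # stretch or squeeze around the mean
        v = v * b                  # brighten or darken
        out.append(min(1.0, max(0.0, v)))
    return out

rng = random.Random(0)
faded_patch = [0.45, 0.50, 0.55, 0.60]  # low-contrast, sun-bleached look
print(jitter_brightness_contrast(faded_patch, rng=rng))
```

Sampling factors below 1.0 simulates faded paint during training, so the model sees low-contrast signs even when the collected photos are mostly clean.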
😷 Face Mask Detection: Real-Time Compliance
Face Mask Detection: Post-COVID Public Health Monitoring
Object Detection • YOLOv5 • Real-Time • Webcam
Problem Statement
During and after the COVID-19 pandemic, organizations needed automated systems to monitor mask compliance in public spaces such as offices, malls, metro stations, and hospitals. Manual monitoring is impractical for high-footfall areas like the Delhi Metro (60 lakh daily riders) or Mumbai local trains. A real-time system using existing CCTV cameras can detect mask violations and trigger alerts.
Dataset
| Property | Details |
|---|---|
| Source | Face Mask Detection Dataset (Kaggle) + WIDER Face + Custom annotations |
| Classes | 3: With Mask, Without Mask, Mask Worn Incorrectly |
| Images | ~12,000 annotated images |
| Annotations | YOLO format (class, x_center, y_center, width, height) |
| Challenge | Multiple faces per image, varying scales, occluded faces |
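Annotation tools often export pixel-coordinate corner boxes, while the YOLO format in the table stores a normalized center, width, and height. A minimal conversion sketch follows; the helper names are illustrative and not part of YOLOv5.

```python
def to_yolo(box, img_w, img_h):
    """(xmin, ymin, xmax, ymax) in pixels -> normalized (x_center, y_center, w, h)."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2 / img_w,
            (ymin + ymax) / 2 / img_h,
            (xmax - xmin) / img_w,
            (ymax - ymin) / img_h)

def from_yolo(yolo_box, img_w, img_h):
    """Normalized (x_center, y_center, w, h) -> pixel (xmin, ymin, xmax, ymax)."""
    xc, yc, w, h = yolo_box
    return ((xc - w / 2) * img_w, (yc - h / 2) * img_h,
            (xc + w / 2) * img_w, (yc + h / 2) * img_h)

def format_label(class_id, box, img_w, img_h):
    """One line of a YOLO label file."""
    values = to_yolo(box, img_w, img_h)
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)

# A 200x200 px face box in a 640x480 image, class 0
print(format_label(0, (100, 50, 300, 250), 640, 480))
# 0 0.312500 0.312500 0.312500 0.416667
```

Because every value is normalized by the image size, the same label file stays valid when the image is resized for training.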
Complete Code: YOLOv5 Fine-Tuning
Python
# =====================================================
# PROJECT 4: Face Mask Detection using YOLOv5
# Real-time object detection at >25 FPS
# Classes: with_mask, without_mask, mask_incorrect
# =====================================================
# Step 1: Install and setup YOLOv5
# !git clone https://github.com/ultralytics/yolov5
# !cd yolov5 && pip install -r requirements.txt
import torch
import os
import yaml
import numpy as np
# Verify GPU
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
Python
# Step 2: Prepare Dataset Configuration (data.yaml)
# Absolute paths (os is imported in Step 1) avoid ambiguity about which
# directory the training script resolves relative paths against.
data_config = {
    'train': os.path.abspath('mask_dataset/train/images'),
    'val': os.path.abspath('mask_dataset/val/images'),
    'test': os.path.abspath('mask_dataset/test/images'),
    'nc': 3,  # number of classes
    'names': ['with_mask', 'without_mask', 'mask_incorrect']
}
with open('mask_data.yaml', 'w') as f:
    yaml.dump(data_config, f)
print("Dataset config saved to mask_data.yaml")
# Expected directory structure:
# mask_dataset/
# ├── train/
# │   ├── images/   (*.jpg)
# │   └── labels/   (*.txt, YOLO format)
# ├── val/
# │   ├── images/
# │   └── labels/
# └── test/
#     ├── images/
#     └── labels/
# YOLO label format per line:
# class_id x_center y_center width height
# (all normalized to [0, 1])
# Example: 0 0.453 0.312 0.145 0.198
Python
# Step 3: Train YOLOv5s (small variant for real-time)
# From command line (Colab/terminal):
# !python yolov5/train.py \
#   --img 640 \
#   --batch 16 \
#   --epochs 50 \
#   --data mask_data.yaml \
#   --weights yolov5s.pt \
#   --name mask_detector \
#   --patience 10 \
#   --cache
# Programmatic training (alternative)
import subprocess
import sys

result = subprocess.run([
    sys.executable, 'yolov5/train.py',  # same interpreter as this script
    '--img', '640',
    '--batch', '16',
    '--epochs', '50',
    '--data', 'mask_data.yaml',
    '--weights', 'yolov5s.pt',
    '--name', 'mask_detector',
    '--patience', '10',
    '--cache',
], capture_output=True, text=True)
# Training logs may be written to stderr, so show the tail of both streams
print(result.stdout[-500:])
print(result.stderr[-500:])
if result.returncode != 0:
    print(f"Training exited with code {result.returncode}")
Python
# Step 4: Real-Time Webcam Inference
def realtime_mask_detection(model_path='yolov5/runs/train/mask_detector/weights/best.pt'):
    """Run real-time face mask detection on webcam feed."""
    import cv2
    import time
    # Load trained YOLOv5 model
    model = torch.hub.load('ultralytics/yolov5', 'custom',
                           path=model_path, force_reload=True)
    model.conf = 0.5   # Confidence threshold
    model.iou = 0.45   # NMS IoU threshold
    # Color codes (BGR): green=mask, red=no_mask, orange=incorrect
    COLORS = {
        'with_mask': (0, 255, 0),
        'without_mask': (0, 0, 255),
        'mask_incorrect': (0, 165, 255)
    }
    cap = cv2.VideoCapture(0)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
    fps_counter = []
    while True:
        start_time = time.time()
        ret, frame = cap.read()
        if not ret:
            break
        # Inference: OpenCV delivers BGR frames, the hub model expects RGB arrays
        results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        detections = results.pandas().xyxy[0]
        # Draw detections (incorrectly worn masks count as violations)
        mask_count, no_mask_count = 0, 0
        for _, det in detections.iterrows():
            x1, y1 = int(det['xmin']), int(det['ymin'])
            x2, y2 = int(det['xmax']), int(det['ymax'])
            label = det['name']
            conf = det['confidence']
            color = COLORS.get(label, (255, 255, 255))
            # Draw bounding box
            cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
            cv2.putText(frame, f"{label} {conf:.2f}",
                        (x1, y1 - 10), cv2.FONT_HERSHEY_SIMPLEX,
                        0.6, color, 2)
            if label == 'with_mask':
                mask_count += 1
            else:
                no_mask_count += 1
        # FPS counter (guard against a zero elapsed time)
        fps = 1.0 / max(time.time() - start_time, 1e-6)
        fps_counter.append(fps)
        # Status bar
        status = f"FPS: {fps:.0f} | Masked: {mask_count} | Unmasked: {no_mask_count}"
        cv2.putText(frame, status, (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.7, (255, 255, 255), 2)
        # Alert if violation detected (Hershey fonts render ASCII only)
        if no_mask_count > 0:
            cv2.putText(frame, "MASK VIOLATION DETECTED",
                        (10, 65), cv2.FONT_HERSHEY_SIMPLEX,
                        0.8, (0, 0, 255), 2)
        cv2.imshow('Face Mask Detection', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
    if fps_counter:
        print(f"\nAverage FPS: {np.mean(fps_counter):.1f}")
    cap.release()
    cv2.destroyAllWindows()
# realtime_mask_detection()
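The `model.conf` and `model.iou` settings in Step 4 control two filtering stages: low-confidence boxes are dropped, then overlapping boxes for the same face are merged by non-maximum suppression (NMS). The sketch below shows the idea in plain Python. It is a simplified, class-agnostic version written for illustration and is not the YOLOv5 implementation.

```python
def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(detections, conf_thres=0.5, iou_thres=0.45):
    """Greedy NMS over a list of (box, confidence) pairs."""
    candidates = sorted((d for d in detections if d[1] >= conf_thres),
                        key=lambda d: d[1], reverse=True)
    kept = []
    for box, conf in candidates:
        # Keep a box only if it does not overlap a stronger kept box too much
        if all(iou(box, k[0]) <= iou_thres for k in kept):
            kept.append((box, conf))
    return kept

detections = [
    ((100, 100, 200, 200), 0.9),  # face A
    ((105, 105, 205, 205), 0.8),  # near-duplicate of face A (IoU ~ 0.82)
    ((300, 300, 400, 400), 0.7),  # face B
    ((0, 0, 50, 50), 0.3),        # below the confidence threshold
]
print(nms(detections))  # keeps the 0.9 and 0.7 boxes
```

Lowering `iou_thres` suppresses more aggressively, which can merge two people standing close together. Raising it keeps more duplicates.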
Python
# Step 5: Evaluation with mAP and Per-Class Metrics
# Validate on test set
# !python yolov5/val.py \
#   --weights yolov5/runs/train/mask_detector/weights/best.pt \
#   --data mask_data.yaml \
#   --img 640 \
#   --task test \
#   --verbose
# Summarize results. The values below are illustrative; replace them
# with the per-class numbers that val.py reports for your run.
def print_yolo_metrics():
    """Display key detection metrics."""
    metrics = {
        'with_mask': {'precision': 0.94, 'recall': 0.92, 'mAP@50': 0.95},
        'without_mask': {'precision': 0.91, 'recall': 0.89, 'mAP@50': 0.92},
        'mask_incorrect': {'precision': 0.82, 'recall': 0.78, 'mAP@50': 0.83},
    }
    print(f"{'Class':<20} {'Precision':>10} {'Recall':>10} {'mAP@50':>10}")
    print("-" * 52)
    for cls, m in metrics.items():
        print(f"{cls:<20} {m['precision']:>10.3f} {m['recall']:>10.3f} {m['mAP@50']:>10.3f}")
    print("-" * 52)
    avg_map = np.mean([m['mAP@50'] for m in metrics.values()])
    print(f"{'Overall mAP@50':<20} {'':>10} {'':>10} {avg_map:>10.3f}")

print_yolo_metrics()
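mAP@50 is the mean, over classes, of the average precision (AP) obtained when a detection counts as correct if its IoU with a ground-truth box is at least 0.5. The sketch below computes AP for one class from detections that have already been matched. It uses all-point interpolation; evaluation tools differ in interpolation details, so small numerical differences from a given tool's report are expected. The sample detections are made up.

```python
def average_precision(scored_hits, num_gt):
    """AP for one class.

    scored_hits: list of (confidence, is_true_positive) per detection.
    num_gt: number of ground-truth boxes of this class.
    """
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Precision envelope: make precision non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the precision-recall curve
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

hits = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
print(average_precision(hits, num_gt=4))  # 0.6875
```

One of the four ground-truth faces is never detected, so recall tops out at 0.75 and AP cannot reach 1.0 no matter how confident the remaining detections are.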
Deployment Architecture
- Edge deployment: NVIDIA Jetson Nano (₹12,000) can run YOLOv5s at ~25 FPS on 720p
- Cloud fallback: Send frames to AWS/GCP GPU instance for higher accuracy models
- Alert pipeline: Detection → Alert Queue (Redis) → Dashboard + Buzzer + SMS
- Privacy: Process frames locally and don't store identifiable images. GDPR/DPDPA compliance is critical
OCR for Indian Language Documents: Devanagari Handwriting
Devanagari Handwritten OCR: Digitizing India's Documents
OCR • CNN + CTC Loss • Sequence Recognition • NLP
Problem Statement
India's government processes millions of handwritten forms daily: ration cards, land records, census data, exam answer sheets, and postal forms. Many are in Hindi (Devanagari script). Manual digitization is slow (5-10 forms/hour/person) and error-prone. An OCR system for Devanagari handwriting can accelerate Digital India initiatives. The challenge: Devanagari has 13 vowels, 33 consonants, matras (vowel diacritics), conjuncts (consonant clusters built from half-letters), and the shirorekha (headline), making it far more complex than Latin OCR.
Devanagari Script Challenges
| Challenge | Description | Impact on OCR |
|---|---|---|
| Shirorekha (headline) | Horizontal line connecting characters in a word | Must segment before recognition |
| Matras (vowel signs) | Diacritics above, below, or beside consonants | Changes character identity completely |
| Conjuncts (संयुक्त) | Combined consonants: क्ष, त्र, ज्ञ, श्र | Creates new visual patterns, needs conjunct classes |
| Half-letters (हलंत) | Consonants without inherent vowel: क्, त्, प् | Subtle visual difference from full letters |
| Handwriting variance | Individual styles, pen pressure, slant | High intra-class variance |
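One practical consequence of the table above: a conjunct or a consonant with a matra looks like one glyph on paper but is stored as several Unicode code points. A character-level label set therefore works with code points, not visual glyphs. The short sketch below uses only the standard library to show the decomposition; the sample strings are illustrative.

```python
import unicodedata

def describe(text):
    """Return (character, Unicode name) for every code point in `text`."""
    return [(ch, unicodedata.name(ch)) for ch in text]

# conjunct, consonant + matra, half-letter
for word in ("क्ष", "कि", "क्"):
    names = [name for _, name in describe(word)]
    print(f"{word}: {len(word)} code points -> {names}")
```

The conjunct is three code points (KA, VIRAMA, SSA), which is why the label sequence for a word is usually longer than the number of glyphs a reader would count.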
Complete Code: CNN + CTC Loss
Python
# =====================================================
# PROJECT 5: Devanagari Handwritten OCR
# CNN Feature Extractor + CTC Loss for sequence decoding
# =====================================================
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models, backend as K
# Configuration
IMG_WIDTH = 256     # Word image width
IMG_HEIGHT = 64     # Word image height
MAX_LABEL_LEN = 32  # Max characters per word
BATCH_SIZE = 64
# Devanagari character set
DEVANAGARI_CHARS = [
    # Vowels (स्वर)
    # Note: 'अं' and 'अः' are two code points each, so per-character
    # encoding sees them as 'अ' followed by the sign.
    'अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ए', 'ऐ', 'ओ', 'औ', 'अं', 'अः',
    # Consonants (व्यंजन)
    'क', 'ख', 'ग', 'घ', 'ङ',
    'च', 'छ', 'ज', 'झ', 'ञ',
    'ट', 'ठ', 'ड', 'ढ', 'ण',
    'त', 'थ', 'द', 'ध', 'न',
    'प', 'फ', 'ब', 'भ', 'म',
    'य', 'र', 'ल', 'व',
    'श', 'ष', 'स', 'ह',
    # Matras (vowel signs)
    'ा', 'ि', 'ी', 'ु', 'ू', 'े', 'ै', 'ो', 'ौ',
    # Special: halant, anusvara, visarga, chandrabindu
    '्', 'ं', 'ः', 'ँ',
    # Digits (अंक)
    '०', '१', '२', '३', '४', '५', '६', '७', '८', '९',
    # Space
    ' '
]
# Index 0 is reserved for label padding, characters use 1..N, and the
# CTC blank is the last class (N + 1). Without the reserved index, the
# first character would be indistinguishable from zero padding.
PAD_IDX = 0
NUM_CHARS = len(DEVANAGARI_CHARS) + 2  # +1 for padding, +1 for CTC blank
# Character to index mapping
char_to_idx = {c: i + 1 for i, c in enumerate(DEVANAGARI_CHARS)}
idx_to_char = {i: c for c, i in char_to_idx.items()}
idx_to_char[PAD_IDX] = ''  # padding decodes to nothing
print(f"Character set size: {len(DEVANAGARI_CHARS)}")
print(f"Total classes (with padding and CTC blank): {NUM_CHARS}")
Python
# Step 2: Build CNN + RNN + CTC Model (CRNN Architecture)
def build_crnn_model(img_width, img_height, num_chars):
    """
    CRNN: Convolutional Recurrent Neural Network for OCR.
    Architecture:
    Input Image → CNN (feature extraction) → Reshape →
    BiLSTM (sequence modeling) → Dense (per-timestep prediction) →
    CTC Loss (alignment-free training)
    """
    # Input
    input_img = layers.Input(shape=(img_height, img_width, 1),
                             name='image_input')
    labels = layers.Input(shape=(None,), name='label_input',
                          dtype='int32')
    # CNN Feature Extractor
    # Block 1
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(input_img)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x)
    # Block 2
    x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 2))(x)
    # Block 3
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D((2, 1))(x)  # Pool only height!
    # Block 4
    x = layers.Conv2D(512, 3, padding='same', activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.25)(x)
    # Reshape: (batch, height, width, channels) → (batch, width, features)
    # After 3 poolings: height = 64/8 = 8, width = 256/4 = 64
    # Move width to the front first so each timestep holds one image column;
    # reshaping without this step would mix pixels from different columns.
    x = layers.Permute((2, 1, 3))(x)  # (batch, width, height, channels)
    new_shape = (x.shape[1], x.shape[2] * x.shape[3])  # (width, height*channels)
    x = layers.Reshape(target_shape=new_shape)(x)
    x = layers.Dense(256, activation='relu')(x)
    x = layers.Dropout(0.25)(x)
    # Bidirectional LSTM for sequence modeling
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True,
                                         dropout=0.25))(x)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True,
                                         dropout=0.25))(x)
    # Output: per-timestep character probabilities
    output = layers.Dense(num_chars, activation='softmax',
                          name='predictions')(x)
    # CTC Loss Layer
    ctc_loss = layers.Lambda(
        lambda args: ctc_loss_func(*args),
        name='ctc_loss'
    )([labels, output])
    # Training model (with CTC loss)
    train_model = models.Model(
        inputs=[input_img, labels],
        outputs=ctc_loss
    )
    # Inference model (without CTC loss)
    inference_model = models.Model(
        inputs=input_img,
        outputs=output
    )
    return train_model, inference_model

def ctc_loss_func(y_true, y_pred):
    """Compute per-sample CTC loss, shape (batch, 1)."""
    batch_size = tf.shape(y_pred)[0]
    input_length = tf.shape(y_pred)[1]
    # Labels are zero-padded, so the non-zero count is the true length
    label_length = tf.math.count_nonzero(y_true, axis=1, dtype=tf.int32)
    input_length = tf.fill([batch_size], input_length)
    loss = tf.nn.ctc_loss(
        labels=tf.cast(y_true, tf.int32),
        # y_pred holds softmax probabilities, but tf.nn.ctc_loss expects
        # logits, so pass log-probabilities instead
        logits=tf.math.log(y_pred + 1e-8),
        label_length=label_length,
        logit_length=input_length,
        logits_time_major=False,
        blank_index=-1  # blank is the last class
    )
    # Keep the batch dimension so Keras can average it as a model output
    return tf.expand_dims(loss, axis=-1)

train_model, inference_model = build_crnn_model(IMG_WIDTH, IMG_HEIGHT, NUM_CHARS)
train_model.summary()
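The pooling choices in the CNN decide how many timesteps the LSTM and the CTC loss see. The arithmetic can be checked without building the model. The helper names below are illustrative; the numbers are the ones used in this project (64×256 input, three pooling layers, 512 channels in the last block).

```python
def crnn_sequence_shape(img_height, img_width, pools, channels):
    """Return (timesteps, features_per_timestep) after the CNN and reshape."""
    h, w = img_height, img_width
    for pool_h, pool_w in pools:
        h //= pool_h
        w //= pool_w
    return w, h * channels

def min_ctc_steps(label):
    """Fewest timesteps CTC needs: one per character plus one blank
    between each pair of identical neighbours."""
    repeats = sum(1 for a, b in zip(label, label[1:]) if a == b)
    return len(label) + repeats

steps, features = crnn_sequence_shape(64, 256, [(2, 2), (2, 2), (2, 1)], 512)
print(steps, features)             # 64 4096
print(min_ctc_steps([7] * 32))     # 63: worst case for a 32-character label
```

With 64 timesteps, even a 32-character label made of one repeated character fits (63 ≤ 64). Pooling the width once more would leave 32 timesteps and make such labels impossible to align, which is why the third pooling layer reduces height only.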
Python
# Step 3: Data Preprocessing for Handwritten Images
def preprocess_image(image_path, img_width=256, img_height=64):
    """
    Preprocess handwritten Devanagari word image.
    Steps: Grayscale → Normalize → Invert (white text on black) → Resize with padding
    """
    # Read image
    img = tf.io.read_file(image_path)
    img = tf.image.decode_png(img, channels=1)
    # Normalize to [0, 1]
    img = tf.cast(img, tf.float32) / 255.0
    # Invert (handwriting is dark on light background)
    img = 1.0 - img
    # Resize preserving aspect ratio (pad to target size). Padding is added
    # as zeros, so it must happen after inversion to match the black
    # background instead of turning into white "ink".
    img = tf.image.resize_with_pad(img, img_height, img_width)
    return img

def encode_label(text, char_to_idx, max_len):
    """Convert Devanagari text to a zero-padded integer sequence."""
    # Skip characters outside the character set instead of mapping them to 0
    encoded = [char_to_idx[c] for c in text if c in char_to_idx][:max_len]
    # Pad with zeros
    encoded += [0] * (max_len - len(encoded))
    return np.array(encoded, dtype=np.int32)

def decode_prediction(pred, idx_to_char):
    """
    CTC greedy decoding for one sample of shape (timesteps, num_classes):
    1. Take argmax at each timestep
    2. Remove consecutive duplicates
    3. Remove blank tokens
    """
    blank_idx = len(idx_to_char)  # CTC blank is the last class
    # Greedy decode
    pred_indices = np.argmax(pred, axis=-1)
    # Remove consecutive duplicates
    decoded = []
    prev = -1
    for idx in pred_indices:
        if idx != prev and idx != blank_idx:  # New symbol and not blank
            if idx in idx_to_char:
                decoded.append(idx_to_char[idx])
        prev = idx  # Updated every step so a blank separates true repeats
    return ''.join(decoded)

# Example
sample_text = "नमस्ते"
encoded = encode_label(sample_text, char_to_idx, MAX_LABEL_LEN)
print(f"Text: {sample_text}")
print(f"Encoded: {encoded[:10]}...")
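The collapse rule inside `decode_prediction` is easier to see on bare index sequences. The sketch below is a standalone illustration of greedy CTC collapsing; here the blank index is passed in explicitly (0 in the example), whereas the model above places the blank at the last class.

```python
from itertools import groupby

def ctc_greedy_collapse(indices, blank):
    """Collapse a per-timestep argmax path into a label sequence."""
    merged = [k for k, _ in groupby(indices)]  # 1) merge consecutive repeats
    return [k for k in merged if k != blank]   # 2) drop blanks

BLANK = 0
# A blank between the two 1s keeps them as two separate characters
print(ctc_greedy_collapse([1, 1, 0, 1, 2], BLANK))  # [1, 1, 2]
# Without a blank, repeated frames collapse into one character
print(ctc_greedy_collapse([1, 1, 1, 2], BLANK))     # [1, 2]
```

The order of the two steps matters: removing blanks first would glue the two `1` runs together and lose the doubled character.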
Python
# Step 4: Training
# Compile with dummy loss (CTC loss is built into model)
train_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=lambda y_true, y_pred: y_pred  # CTC loss already computed
)
callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor='val_loss', patience=10,
        restore_best_weights=True
    ),
    tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss', factor=0.5, patience=5
    ),
    tf.keras.callbacks.ModelCheckpoint(
        'best_devanagari_ocr.keras',
        monitor='val_loss', save_best_only=True
    ),
]
# Note: train_data and val_data should be tf.data.Dataset
# yielding ((image_batch, label_batch), dummy_output)
# history = train_model.fit(
#     train_data,
#     validation_data=val_data,
#     epochs=50,
#     callbacks=callbacks
# )
Python
# Step 5: Evaluation with Character Error Rate (CER) & Word Error Rate (WER)
def character_error_rate(y_true_texts, y_pred_texts):
    """
    CER = Edit_Distance(predicted, ground_truth) / len(ground_truth)
    Standard metric for OCR evaluation.
    Lengths are counted in Unicode code points, so a matra or halant
    counts as one character.
    """
    total_chars = 0
    total_errors = 0
    for true, pred in zip(y_true_texts, y_pred_texts):
        # Levenshtein distance
        distance = levenshtein_distance(true, pred)
        total_errors += distance
        total_chars += len(true)
    return total_errors / total_chars if total_chars > 0 else 0

def levenshtein_distance(s1, s2):
    """Compute edit distance between two strings."""
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)
    prev_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        curr_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = prev_row[j + 1] + 1
            deletions = curr_row[j] + 1
            substitutions = prev_row[j] + (c1 != c2)
            curr_row.append(min(insertions, deletions, substitutions))
        prev_row = curr_row
    return prev_row[-1]

# Example evaluation
true_texts = ["नमस्ते", "भारत", "शिक्षा"]
pred_texts = ["नमस्ते", "भारत", "शिक्षी"]  # Last one has an error (wrong final matra)
cer = character_error_rate(true_texts, pred_texts)
print(f"Character Error Rate (CER): {cer:.4f} ({cer*100:.1f}%)")
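Word Error Rate follows the same recipe as CER but compares sequences of words instead of characters. A minimal, self-contained sketch, assuming words are separated by whitespace and that each entry is one line of text (function names are illustrative):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def word_error_rate(true_lines, pred_lines):
    """WER = word-level edit distance / number of reference words."""
    errors = words = 0
    for true, pred in zip(true_lines, pred_lines):
        t, p = true.split(), pred.split()
        errors += edit_distance(t, p)
        words += len(t)
    return errors / words if words else 0.0

true_lines = ["भारत एक देश है"]
pred_lines = ["भारत एक देस है"]  # one of four words is wrong
print(f"WER: {word_error_rate(true_lines, pred_lines):.2f}")  # WER: 0.25
```

WER is always at least as harsh as CER on the same output: a single wrong matra costs one character in CER but a whole word in WER.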
Improvement Suggestions
- Attention mechanism: Replace CTC with Transformer-based attention decoder for better accuracy on long words
- Language model: Post-process with a Hindi n-gram language model to correct OCR errors
- Multi-script: Extend to Bengali, Tamil, and Telugu by sharing the CNN backbone and training separate decoders
- Line segmentation: Add a text line detection step (CRAFT or DBNet) before word recognition
- Synthetic data: Generate training data using Hindi fonts with random distortions to augment real handwriting
Visual Diagrams
6.1 Applied CV Project Pipeline (Universal)
6.2 Project Architecture Comparison
6.3 CTC Loss: How It Aligns Predictions to Labels
Worked Example: End-to-End, From Leaf Photo to Diagnosis
🔬 Step-by-Step: A Tomato Leaf Passes Through Project 1
Step 1: Image Capture
Farmer photographs a tomato leaf showing yellow spots. Phone camera produces a 3024 × 4032 × 3 JPEG (12 MP).
Step 2: Preprocessing
App resizes to 224 × 224 × 3. MobileNetV2 preprocessing scales pixels to [-1, 1].
Original: 3024×4032×3 → Resized: 224×224×3 → Preprocessed: values in [-1, 1]
Step 3: Feature Extraction
The frozen MobileNetV2 backbone extracts a 7 × 7 × 1280 feature map. Each of the 1280 channels detects different visual patterns: edges, textures, color blobs, disease-specific patterns.
Input: (1, 224, 224, 3) → MobileNetV2 → Output: (1, 7, 7, 1280)
Step 4: Global Average Pooling
GAP compresses each 7 × 7 feature map into a single number by averaging, producing a 1280-dim vector.
(1, 7, 7, 1280) → GAP → (1, 1280)
Step 5: Classification Head
The trained head maps 1280 features → 256 → 128 → 38 class probabilities.
(1, 1280) → Dense(256, ReLU) → Dense(128, ReLU) → Dense(38, Softmax)
Step 6: Prediction
Softmax output: class 27 ("Tomato___Early_blight") gets probability 0.93. Top-3:
- Tomato Early Blight: 93.2%
- Tomato Septoria Leaf Spot: 4.1%
- Tomato Late Blight: 1.8%
Step 7: Recommendation
App shows: "Early Blight detected (93% confidence). Recommended: Apply Mancozeb 75% WP @ 2.5 g/L water. Spray every 10 days." The message is shown in Hindi if the user preference is set.
Latency Breakdown (on Redmi Note 12, ₹12,999)
| Step | Time (ms) |
|---|---|
| Image decode + resize | 3.2 |
| MobileNetV2 inference (TFLite INT8) | 6.5 |
| Post-processing + UI update | 1.8 |
| Total | 11.5 ms |
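Two steps of the walkthrough above, global average pooling and the softmax top-3 readout, are simple enough to reproduce by hand. The sketch below uses tiny made-up inputs (a 2×2×2 feature map and three logits) rather than real model outputs, and the function names are illustrative.

```python
import math

def global_average_pool(feature_map):
    """feature_map[h][w][c] -> list with one mean per channel."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [sum(feature_map[i][j][k] for i in range(h) for j in range(w)) / (h * w)
            for k in range(c)]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, names, k=3):
    """The k most probable (name, probability) pairs, best first."""
    return sorted(zip(names, probs), key=lambda t: t[1], reverse=True)[:k]

fmap = [[[1, 0], [3, 0]],
        [[5, 4], [7, 4]]]           # 2x2 spatial grid, 2 channels
print(global_average_pool(fmap))    # [4.0, 2.0]

names = ["Early blight", "Septoria leaf spot", "Late blight"]
for name, p in top_k(softmax([4.0, 0.9, 0.1]), names):
    print(f"{name}: {p:.1%}")
```

In the real pipeline the same pooling turns the 7×7×1280 map into 1280 numbers, and the softmax runs over 38 classes instead of three.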
Case Study: CropIn, AI-Powered Agriculture at Scale in India
🌾 CropIn Technology: From Bangalore Startup to Global AgriTech Leader
Background
CropIn Technology Solutions (founded 2010, Bangalore) is India's leading AgriTech AI company. Their platform "SmartFarm" uses satellite imagery + smartphone CV to monitor crop health across 56+ countries, covering 21.6 million acres.
The CV Pipeline
- Satellite imagery: Sentinel-2 and Planet Labs provide multi-spectral imagery at 10m resolution. NDVI (Normalized Difference Vegetation Index) computed for large-area crop health assessment.
- Smartphone CV: Field agents capture close-up images of crops. CNN-based models identify diseases, pest damage, and nutrient deficiencies, similar to our Project 1.
- Model stack: EfficientNet-B3 backbone → Multi-task head (disease classification + severity estimation + pest identification). Trained on 12 million annotated images.
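NDVI, mentioned in the satellite step above, is a per-pixel ratio of near-infrared and red reflectance: NDVI = (NIR − Red) / (NIR + Red). For Sentinel-2 these are bands B8 and B4. The sketch below applies the formula to a few made-up reflectance values; it illustrates the index itself, not CropIn's pipeline.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel, in [-1, 1]."""
    denom = nir + red
    return (nir - red) / denom if denom != 0 else 0.0

def ndvi_grid(nir_band, red_band):
    """Apply NDVI element-wise to two equally sized 2D reflectance grids."""
    return [[ndvi(n, r) for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir_band, red_band)]

# Illustrative reflectances: dense crop, sparse crop, bare soil
samples = {"dense crop": (0.50, 0.08), "sparse crop": (0.35, 0.15), "bare soil": (0.30, 0.25)}
for name, (nir, red) in samples.items():
    print(f"{name}: NDVI = {ndvi(nir, red):.2f}")
```

Healthy leaves reflect strongly in near-infrared and absorb red light, so values close to 1 indicate dense vegetation while values near 0 indicate soil or stressed crops.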
Scale & Impact
| Metric | Value |
|---|---|
| Farmers impacted | 7.1 million across 56 countries |
| Acres monitored | 21.6 million |
| Crops covered | 388 crop types |
| Disease detection accuracy | ~95% (field conditions) |
| Funding raised | $24 million (Series C) |
| Yield improvement | 15-20% for monitored farms |
Technical Challenges Solved
- Low connectivity: Models run on-device with results synced when connectivity is available
- Diverse crops: Transfer learning from a shared backbone fine-tuned per crop family
- Regional diseases: Different diseases are prevalent in Maharashtra vs. Karnataka vs. Punjab, which is handled with location-aware model selection
- Farmer trust: Grad-CAM visualizations show which leaf regions triggered the diagnosis
Key Takeaway
CropIn demonstrates that the exact techniques you learned in this chapter (MobileNet transfer learning, TFLite conversion, on-device inference) are the building blocks of a ₹1,000+ crore business impacting millions of lives.
Common Mistakes in Applied CV Projects
class_weight or focal loss.
Comparison Table โ All Five Projects
| Aspect | P1: Plant Disease | P2: Currency | P3: Traffic Signs | P4: Mask Detection | P5: Devanagari OCR |
|---|---|---|---|---|---|
| Task Type | Classification | Classification | Classification | Object Detection | Sequence Recognition |
| Architecture | MobileNetV2 | Custom CNN | ResNet-18 | YOLOv5s | CRNN (CNN+LSTM) |
| Framework | TensorFlow/Keras | TensorFlow/Keras | PyTorch | PyTorch (YOLOv5) | TensorFlow/Keras |
| Classes | 38 | 7 | 57 | 3 | 56 characters |
| Key Metric | Accuracy: 96.2% | Accuracy: 98.4% | Accuracy: 94.7% | mAP@50: 90.0% | CER: 7.1% |
| Parameters | 3.4M | 0.97M | 11.3M | 7.2M | ~4M |
| Deployment | TFLite (phone) | TFLite + TTS | ONNX (car ADAS) | Edge GPU (Jetson) | Cloud/Server |
| Loss Function | Cross-Entropy | Cross-Entropy | Cross-Entropy + Label Smooth | YOLOv5 composite | CTC Loss |
| Real-Time? | Yes (single image) | Yes (single image) | Yes (14ms/frame) | Yes (28 FPS) | Near-real-time |
| Indian Context | ₹90K cr crop loss | 63M visually impaired | 4.6L accidents/yr | COVID compliance | Digital India OCR |
When to Use Which Architecture?
| Scenario | Recommended Approach | Why |
|---|---|---|
| Single object in image, need class label | Classification (MobileNetV2/EfficientNet) | Simple, fast, well-studied |
| Multiple objects, need locations | Detection (YOLOv5/RetinaNet) | Outputs bounding boxes + classes |
| Need to read text in image | OCR (CRNN + CTC or Transformer) | Handles variable-length output |
| Mobile/edge deployment critical | MobileNetV2 + TFLite INT8 | 3.4M params, 6.5ms on phone |
| Highest accuracy, server deployment | EfficientNet-B4 or Vision Transformer | More params but cloud handles it |
Exercises
Section A: Multiple Choice Questions (10)
In Project 1, why is MobileNetV2 chosen over ResNet-50 for plant disease detection on a ₹8,000 smartphone?
- MobileNetV2 has higher ImageNet accuracy
- MobileNetV2 has fewer parameters and lower latency on mobile CPUs
- ResNet-50 cannot be converted to TFLite
- MobileNetV2 doesn't need pretraining
During fine-tuning in Phase 2, the learning rate is reduced from 1e-3 to 1e-5. What happens if you keep using 1e-3?
- Training converges faster to a better minimum
- Pretrained ImageNet features are destroyed by large gradient updates
- The model underfits because the learning rate is too small
- There is no effect since the base layers are frozen
In the currency recognition project (P2), why is a confidence threshold of 85% used before announcing the denomination?
- To save battery by reducing TTS calls
- To prevent incorrect announcements that could cause financial harm to visually impaired users
- Because the model always outputs confidence above 85%
- To comply with RBI regulations
Why should horizontal flip NOT be used when augmenting traffic sign data?
- It makes training slower
- It creates invalid signs: "Keep Left" becomes "Keep Right," confusing the model
- Traffic signs are always perfectly upright
- Horizontal flip only works for natural images
In YOLOv5-based mask detection, what does NMS (Non-Maximum Suppression) with IoU threshold 0.45 do?
- Removes all detections with confidence below 0.45
- Removes duplicate bounding boxes for the same face by keeping only the highest-confidence one
- Limits the total number of detections to 45
- Increases the model's recall by 45%
In Project 5 (Devanagari OCR), why is CTC loss used instead of standard cross-entropy?
- CTC loss is faster to compute
- CTC handles variable-length output without requiring character-level alignment between input and labels
- Cross-entropy cannot be used with RNNs
- CTC automatically handles Devanagari matras
When converting a Keras model to TFLite with INT8 quantization, model size decreases from 10.2 MB to 3.1 MB. What is the approximate bit-width reduction?
- FP32 (4 bytes) → INT8 (1 byte) = 4× reduction, but overhead keeps it at 3.3×
- FP64 โ FP16 = 4ร reduction
- Only weights are quantized, biases remain FP32
- The model architecture itself is simplified
Which evaluation metric is most appropriate for comparing OCR systems?
- Top-1 Accuracy
- mAP@50
- Character Error Rate (CER)
- AUC-ROC
In the traffic sign project, label smoothing of 0.1 is applied. What does this do?
- Smooths the input images to remove noise
- Replaces hard labels [0,0,1,0] with soft labels such as [0.033, 0.033, 0.9, 0.033], preventing overconfident predictions
- Applies Gaussian smoothing to the loss curve
- Reduces the number of training labels by 10%
Which of the following is the correct order for building an applied CV project?
- Train model โ Collect data โ Evaluate โ Deploy
- Select architecture โ Collect data โ Deploy โ Evaluate
- Frame problem โ Collect/prepare data โ Select/train model โ Evaluate โ Optimize โ Deploy
- Deploy baseline โ Collect data from production โ Retrain โ Evaluate
Section B: Short Answer Questions (5)
Explain why Global Average Pooling (GAP) is preferred over Flatten + Dense in MobileNetV2 for plant disease detection. What specific advantage does it provide for mobile deployment?
A currency recognition model achieves 99% accuracy in the lab but 82% accuracy when tested with real users. List three reasons for this performance gap and one solution for each.
Compare the loss functions used in Project 1 (categorical cross-entropy), Project 4 (YOLO composite loss), and Project 5 (CTC loss). Why can't a single loss function work for all three tasks?
Explain CTC greedy decoding with an example. Given a CTC output sequence [blank, क, क, blank, म, blank, ल, ल, blank], what is the decoded text? Show each step.
Why is cosine annealing learning rate schedule used in the traffic sign project instead of a constant learning rate? Draw a rough sketch of how the learning rate changes over 25 epochs.
Section C: Long Answer Questions (3)
[15 marks] You are building a CV system for the Indian Postal Service to automatically read handwritten PIN codes (6 digits, Devanagari numerals ०-९) from postcards. Design the complete pipeline: (a) dataset collection strategy, (b) model architecture with justification, (c) loss function choice, (d) evaluation metrics, (e) deployment plan for 100+ post offices. Discuss at least three challenges specific to this application.
[12 marks] Compare transfer learning (Project 1: MobileNetV2 pretrained) with training from scratch (Project 2: custom CNN). When does each approach work better? Analyze the data requirements, training time, and final accuracy for both approaches. Use specific numbers from this chapter to support your argument.
[12 marks] Discuss the ethical considerations in deploying face mask detection systems (Project 4) in Indian public spaces. Cover: (a) privacy concerns under the DPDPA (Digital Personal Data Protection Act, 2023), (b) potential for misuse (surveillance), (c) bias (performance across different skin tones, face types, head coverings), (d) consent and transparency requirements. Propose a responsible deployment framework.
Section D: Programming Exercises (5)
Grad-CAM Visualization (Project 1): Implement Grad-CAM for the plant disease model. Given a leaf image, generate a heatmap showing which regions the model focuses on. Overlay the heatmap on the original image. Use TensorFlow's tf.GradientTape.
Data Augmentation Ablation (Project 2): Train the currency CNN three times: (a) no augmentation, (b) only geometric augmentation (flip, rotate), (c) full augmentation (geometric + color). Plot all three training curves on the same graph. Report final accuracy for each. What is the accuracy improvement from augmentation?
Confusion Matrix Analysis (Project 3): Generate and visualize the confusion matrix for the traffic sign model. Identify the top-3 most confused class pairs. For each pair, hypothesize why confusion occurs and suggest a data/model fix.
FPS Benchmarking (Project 4): Benchmark the mask detection model at three resolutions: 320×320, 640×640, and 1280×1280. For each, record FPS, mAP@50, and memory usage. Plot the accuracy vs. speed trade-off curve. Which resolution is optimal for Delhi Metro CCTV deployment?
CTC Beam Search (Project 5): Implement beam search decoding (beam width=5) for the OCR model and compare results against greedy decoding on 100 test samples. Report CER for both methods. How much does beam search improve accuracy?
Section E: Mini-Projects
Indian Food Recognition (Classification): Build a model that classifies 20 popular Indian dishes (dosa, idli, biryani, chole bhature, pav bhaji, etc.) from food images. Use EfficientNet-B0 transfer learning. Collect 200+ images per class from Google Images/Zomato. Deploy as a Streamlit web app where users upload food photos and get calorie estimates. Target: >90% accuracy.
Vehicle Number Plate Recognition (OCR + Detection): Build an end-to-end ANPR (Automatic Number Plate Recognition) system for Indian number plates. Step 1: Detect plate region using YOLOv5. Step 2: Read characters using CRNN+CTC. Handle Indian plate formats (e.g., MH 12 AB 1234, DL 01 C AB 1234). Test on 200+ images. Target: >85% full-plate read accuracy.
Chapter Summary
Key Takeaways: Applied Computer Vision
- The applied CV pipeline is universal: Problem Framing → Dataset Engineering → Model Selection → Training → Evaluation → Optimization → Deployment. Master this lifecycle, and you can tackle any CV problem.
- Transfer learning is the default: Start with ImageNet-pretrained backbones (MobileNetV2, ResNet, EfficientNet). Only train from scratch when your domain is radically different from natural images (medical X-rays, satellite imagery).
- Two-phase fine-tuning works best: In Phase 1, freeze the backbone and train only the new classification head with lr=1e-3. In Phase 2, unfreeze the top layers and fine-tune with lr=1e-5. This preserves learned features while adapting to your task.
- Data augmentation is non-negotiable: RandomFlip, Rotation, Zoom, Brightness, and Contrast simulate real-world variation and can improve accuracy by 5-15%. But be domain-aware: no horizontal flip for traffic signs!
- Classification ≠ Detection ≠ OCR: Different tasks need different architectures (CNN+Softmax vs. YOLO vs. CRNN+CTC) and different metrics (Accuracy vs. mAP vs. CER). Choose based on your problem, not popularity.
- Deployment optimization is half the battle: TFLite INT8 quantization gives ~3-4× compression with <1% accuracy drop. Always benchmark on target hardware. A 96% accurate model that runs at 0.5 FPS is useless.
- Indian context matters: Bilingual traffic signs, Devanagari script complexity, ₹8,000 smartphone constraints, diverse lighting conditions, and faded or damaged objects mean solutions must be designed for local reality, not benchmarked only on Western datasets.
- Evaluation must go beyond accuracy: Confusion matrices reveal which classes are confused. Grad-CAM shows where the model looks. Per-class F1 exposes minority-class failures. Always perform error analysis, not just metric reporting.
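The compression figure in the deployment takeaway comes from storing each weight in 1 byte instead of 4, plus a small amount of metadata per tensor. The sketch below shows per-tensor affine quantization, where a real value is approximated as `scale * (q - zero_point)`. It is a simplified illustration written for this summary; the exact scheme TFLite applies differs between weights and activations.

```python
def quantize_int8(values):
    """Map floats to int8 codes with one scale and zero-point for the tensor."""
    lo = min(min(values), 0.0)  # range includes 0 so that 0.0 maps exactly
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point

def dequantize(codes, scale, zero_point):
    """Recover approximate floats from int8 codes."""
    return [(q - zero_point) * scale for q in codes]

weights = [-1.0, 0.0, 0.5, 2.0]  # made-up FP32 weights
codes, scale, zp = quantize_int8(weights)
restored = dequantize(codes, scale, zp)
print(codes, f"scale={scale:.5f}", f"zero_point={zp}")
print([round(w, 4) for w in restored])
```

Each restored weight is within about half a quantization step of the original, which is why accuracy usually drops only slightly. The stored scale and zero-point values are part of the overhead that keeps real-world compression a little below 4×.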
Project Quick Reference
| Project | Best Architecture | Key Metric | Deployment |
|---|---|---|---|
| P1: Plant Disease | MobileNetV2 (fine-tuned) | 96.2% accuracy | TFLite on Android |
| P2: Currency Notes | Custom CNN (4 blocks) | 98.4% accuracy | TFLite + TTS |
| P3: Traffic Signs | ResNet-18 (fine-tuned) | 94.7% accuracy | ONNX for ADAS |
| P4: Face Mask | YOLOv5s | 90.0% mAP@50 | Jetson Nano edge |
| P5: Devanagari OCR | CRNN (CNN+BiLSTM+CTC) | 7.1% CER | Cloud API |
References
Research Papers
- Sandler, M., et al. (2018). "MobileNetV2: Inverted Residuals and Linear Bottlenecks." CVPR 2018. Foundation for Project 1.
- He, K., et al. (2016). "Deep Residual Learning for Image Recognition." CVPR 2016. ResNet architecture used in Project 3.
- Redmon, J. & Farhadi, A. (2018). "YOLOv3: An Incremental Improvement." arXiv:1804.02767. Object detection foundation for Project 4.
- Shi, B., Bai, X., & Yao, C. (2017). "An End-to-End Trainable Neural Network for Image-Based Sequence Recognition." IEEE TPAMI. CRNN architecture for Project 5.
- Graves, A., et al. (2006). "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with RNNs." ICML 2006. CTC loss theory.
- Hughes, D. P. & Salathé, M. (2015). "An open access repository of images on plant health." arXiv:1511.08060. PlantVillage dataset paper.
- Stallkamp, J., et al. (2012). "Man vs. Computer: Benchmarking Machine Learning Algorithms for Traffic Sign Recognition." Neural Networks. GTSRB benchmark reference.
- Selvaraju, R. R., et al. (2017). "Grad-CAM: Visual Explanations from Deep Networks." ICCV 2017. Model interpretability technique.
Indian Context & Datasets
- ICAR Annual Report 2022-23. "Crop Losses due to Pests and Diseases in India." Agricultural loss statistics.
- Ministry of Road Transport & Highways (MoRTH). "Road Accidents in India, 2022." Traffic safety statistics.
- Reserve Bank of India. "Banknote Features: Mahatma Gandhi (New) Series." Currency note specifications.
- IIIT Hyderabad CVIT Lab. "Indian Language OCR Benchmark." Devanagari OCR research.
- CropIn Technology. "SmartFarm Platform: Technical Whitepaper." Industry case study reference.
Libraries & Tools
- TensorFlow Team. TensorFlow Lite Documentation. tensorflow.org/lite
- Ultralytics. YOLOv5 Documentation. docs.ultralytics.com
- PyTorch Team. TorchVision Models. pytorch.org/vision
- ONNX Runtime. Cross-Platform Inference. onnxruntime.ai
Textbooks
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapters 9 (CNNs) and 10 (Sequence Modeling).
- Chollet, F. (2021). Deep Learning with Python. 2nd Edition. Manning. Transfer learning and Keras implementation patterns.
- Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. 3rd Edition. O'Reilly. Practical CV project methodology.