EDUARTHA CV Chapter

Chapter 25: Computer Vision & Image Processing

📚 PART IX: Advanced Topics ⏱ Reading Time: 4 hours 🔗 Prerequisites: Chapter 18 (Deep Learning Basics)

1. Learning Objectives

By the end of this comprehensive chapter, you will be able to:

2. Introduction

Computer Vision (CV) is the subfield of Artificial Intelligence dedicated to enabling machines to interpret, understand, and extract actionable information from the visual world. While humans perform visual recognition effortlessly—differentiating a dog from a cat, or a highly illuminated street from a dark alley in a fraction of a second—replicating this biological capability in silicon is notoriously difficult.

The primary hurdle in Computer Vision is known as the Semantic Gap. A digital image is merely a 2D grid of numbers (pixels) indicating light intensity. The machine sees an array like [[255, 128], [0, 64]], but the human sees the edge of a building or the texture of a leaf. Bridging this gap between low-level pixel data and high-level semantic meaning requires layers of mathematical transformation and, more recently, deep neural networks.

Professor's Insight

To truly master Computer Vision, you must think in multi-dimensional arrays (tensors). When you look at an image, imagine it not as a flat picture, but as a stacked cube of spatial dimensions (Height, Width) and depth (Channels). Every operation—whether it is a simple blur or a complex YOLO detection—is fundamentally a geometric or statistical transformation of this tensor space.

Modern Computer Vision is no longer confined to academic laboratories. It powers autonomous driving, underpins facial biometric systems used by billions, enables augmented reality overlays in our smartphones, and helps radiologists detect anomalies faster than the human eye.

3. Historical Background

The evolution of Computer Vision is a fascinating tale of shifting paradigms, moving from rigid geometrical rules to data-driven deep learning.

4. Conceptual Explanation

4.1 Image Fundamentals: Pixels and Color Spaces

A digital image is composed of finite picture elements called pixels. The most common color space is RGB (Red, Green, Blue), an additive color model where colors are formed by combining different intensities of red, green, and blue light (0-255 for 8-bit depth).

The HSV (Hue, Saturation, Value) color space decouples color information (Hue) from lighting (Value). This makes HSV considerably more robust than RGB for tasks like color tracking or green-screen segmentation, as variations in lighting mainly affect the 'Value' channel.
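A minimal sketch of this decoupling, assuming a brightness change can be modelled as scaling all three RGB channels by the same factor (real lighting changes are rarely this clean). It uses Python's standard `colorsys` module; the helper name `dim` and the sample color are illustrative choices, not part of any library.

```python
import colorsys

def dim(rgb, factor):
    """Scale every RGB channel by the same factor (a crude brightness change)."""
    return tuple(channel * factor for channel in rgb)

# colorsys expects floats in [0, 1], so an 8-bit pixel is divided by 255 first.
pixel = (51 / 255, 153 / 255, 102 / 255)   # a mid green
darker = dim(pixel, 0.5)

h1, s1, v1 = colorsys.rgb_to_hsv(*pixel)
h2, s2, v2 = colorsys.rgb_to_hsv(*darker)

print(f"original: H={h1:.3f} S={s1:.3f} V={v1:.3f}")
print(f"dimmed:   H={h2:.3f} S={s2:.3f} V={v2:.3f}")
```

Under this idealized scaling, Hue and Saturation stay the same and only Value is halved, which is why a hue range is a more stable tracking criterion than an RGB range.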

4.2 Image Preprocessing & Augmentation

Raw images are messy. Preprocessing standardizes them:

4.3 Edge Detection and Feature Extraction

Edges represent regions in an image where there is a sharp change in intensity. The Sobel operator computes an approximation of the gradient of the image intensity. The Canny edge detector is a multi-stage algorithm that produces thin, well-localized edges while suppressing noise.

Features are "interesting" parts of an image. HOG (Histogram of Oriented Gradients) counts occurrences of gradient orientation. SIFT, SURF, and ORB find keypoints that are scale and rotation invariant.

4.4 Modern Paradigms: YOLO, SSD, and R-CNN

Object Detection identifies what is in an image and where it is.

Exam Tip

If asked to compare object detectors: remember that Faster R-CNN prioritizes Accuracy over Speed (Two-stage), while YOLO / SSD prioritize Speed over Accuracy (One-stage), though newer YOLO architectures (v8, v9) bridge this gap.

4.5 Segmentation & Pose Estimation

Semantic Segmentation (e.g., U-Net, DeepLab) classifies every individual pixel. Instance Segmentation (e.g., Mask R-CNN) goes a step further by distinguishing between different objects of the same class. Pose Estimation (OpenPose, MediaPipe) tracks key structural points of the human body.

5. Mathematical Foundation

5.1 The Convolution Operation

The core mathematical engine of Computer Vision is 2D Convolution. Let $I$ be an image matrix and $K$ be a kernel (or filter) matrix of size $m \times n$. The discrete convolution operation is defined as:

(I * K)(i, j) = \sum_{u=0}^{m-1} \sum_{v=0}^{n-1} I(i - u, j - v) \cdot K(u, v)

In practice, deep learning frameworks implement cross-correlation, but the term "convolution" is universally used.
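The difference between the two operations is only a kernel flip. The sketch below illustrates it on a single-pixel "impulse" image; the function names `cross_correlate2d` and `convolve2d` are illustrative, and only the "valid" output region (no padding, stride 1) is computed.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image without flipping it ('valid' region only)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """True convolution: cross-correlation with the kernel flipped on both axes."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

impulse = np.array([[0, 0, 0],
                    [0, 1, 0],
                    [0, 0, 0]])
kernel = np.array([[1, 2],
                   [3, 4]])

print(cross_correlate2d(impulse, kernel))  # [[4, 3], [2, 1]]: the kernel appears flipped
print(convolve2d(impulse, kernel))         # [[1, 2], [3, 4]]: the kernel appears as written
```

Because the kernel weights in a CNN are learned, the network can absorb the flip, so frameworks skip it.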

5.2 Sobel Gradients

To find edges, we calculate the derivative of the image. The Sobel operator uses two $3 \times 3$ kernels to approximate the derivative in the X and Y directions:

G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * I \quad , \quad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * I

The magnitude of the gradient at a pixel is $G = \sqrt{G_x^2 + G_y^2}$ and the direction is $\theta = \operatorname{atan2}(G_y, G_x)$, which reduces to $\arctan(G_y / G_x)$ when $G_x > 0$ and stays defined when $G_x = 0$.
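As a rough illustration, the sketch below applies both Sobel kernels to a tiny image containing one vertical edge. It assumes the kernels are applied as cross-correlation (no flip), so the sign of the response depends on that convention; `apply_kernel` is a hypothetical helper, and `np.arctan2` is used for the direction so that pixels with $G_x = 0$ do not cause a division by zero.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
SOBEL_Y = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])

def apply_kernel(image, kernel):
    """Cross-correlate a kernel with the image ('valid' region only)."""
    windows = sliding_window_view(image, kernel.shape)  # shape (H-2, W-2, 3, 3)
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Dark left half, bright right half: a single vertical edge.
image = np.array([[0, 0, 10, 10]] * 4, dtype=float)

gx = apply_kernel(image, SOBEL_X)
gy = apply_kernel(image, SOBEL_Y)
magnitude = np.sqrt(gx**2 + gy**2)
direction = np.arctan2(gy, gx)  # radians

print(gx)         # strong response: intensity changes along x
print(gy)         # zero: every row is identical, so nothing changes along y
print(magnitude)
```

Here the horizontal gradient carries the whole response and the direction is 0 radians, i.e. the gradient points across the edge, from dark to bright.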

5.3 Intersection over Union (IoU)

IoU measures the overlap between two bounding boxes. It is calculated as:

IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}

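A predicted box with IoU > 0.5 against a ground-truth box of the same class is typically counted as a "True Positive" detection, although stricter thresholds are also used.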

6. Formula Derivations

6.1 Output Dimensions of a Convolutional Layer

Let $W$ be the input size (width or height), $K$ the kernel size, $P$ the padding, and $S$ the stride.

The formula to calculate the output dimension $O$ is:

O = \lfloor \frac{W - K + 2P}{S} \rfloor + 1
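The formula translates directly into integer arithmetic. The sketch below assumes $W$ is the input size along one axis, $K$ the kernel size, $P$ the padding on each side, and $S$ the stride; the function name `conv_output_size` is illustrative.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output length along one axis: floor((W - K + 2P) / S) + 1."""
    if s < 1:
        raise ValueError("stride must be a positive integer")
    return (w - k + 2 * p) // s + 1

print(conv_output_size(32, 3, p=1, s=1))   # 3x3 kernel with padding 1 keeps the size: 32
print(conv_output_size(224, 7, p=3, s=2))  # (224 - 7 + 6) // 2 + 1 = 112
```

Floor division matters in the second case: (224 - 7 + 6) / 2 is 111.5, and the half-step that does not fit is simply dropped.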

6.2 YOLO Multi-Part Loss Function

YOLO formulates detection as a regression problem. Its loss function is a weighted sum of error terms, with coefficients such as $\lambda_{coord}$ controlling how heavily each kind of error is penalized:

  1. Coordinate Loss: Penalizes errors in bounding box coordinates $(x, y, w, h)$. So that a given deviation matters less in a large box than in a small one, YOLO regresses the square root of width and height: $\sum (\sqrt{w_i} - \sqrt{\hat{w}_i})^2$.
  2. Objectness Loss: Penalizes errors in the confidence score.
  3. Classification Loss: Penalizes errors in the class probabilities. The original YOLO uses squared error here as well; later versions switch to cross-entropy.
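The effect of the square root in the coordinate term can be checked numerically. The sketch below compares a plain squared error with the square-rooted version for the same 10-pixel width error on a small and a large box; the function names and the sample widths are illustrative, and only the width term is shown.

```python
import math

def width_term(w_true, w_pred):
    """Squared error on raw widths."""
    return (w_true - w_pred) ** 2

def sqrt_width_term(w_true, w_pred):
    """Squared error on square-rooted widths, as in the coordinate loss."""
    return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2

# Same 10-pixel error on a small box and on a large box.
small = sqrt_width_term(10, 20)
large = sqrt_width_term(100, 110)

print(f"raw error:  small={width_term(10, 20)}, large={width_term(100, 110)}")
print(f"sqrt error: small={small:.3f}, large={large:.3f}")
```

The raw squared error treats both cases identically, while the square-rooted term penalizes the small box several times more heavily, which matches the intent described in the coordinate loss.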

7. Worked Numerical Examples

Example 1: Convolution Calculation

Let Input Image Matrix $I$ (3x3) and Kernel $K$ (2x2), with Stride=1, Padding=0.

I = [ [1, 2, 0],
      [0, 1, 1],
      [1, 2, 0] ]
      
K = [ [1, 0],
      [0, -1] ]
            

We slide the 2x2 kernel over the 3x3 image without flipping it (cross-correlation, as deep learning frameworks do), giving a 2x2 output.
Top-Left element: $(1*1) + (2*0) + (0*0) + (1*-1) = 1 + 0 + 0 - 1 = 0$
Top-Right element: $(2*1) + (0*0) + (1*0) + (1*-1) = 2 - 1 = 1$
Bottom-Left element: $(0*1) + (1*0) + (1*0) + (2*-1) = -2$
Bottom-Right element: $(1*1) + (1*0) + (2*0) + (0*-1) = 1$

Resulting Feature Map: [[0, 1], [-2, 1]]

Example 2: Calculating IoU

Box A (Ground Truth): [x1=10, y1=10, x2=50, y2=50] -> Area = 1600
Box B (Prediction): [x1=30, y1=30, x2=70, y2=70] -> Area = 1600

Intersection coordinates:
$xI_1 = \max(10, 30) = 30$, $yI_1 = \max(10, 30) = 30$
$xI_2 = \min(50, 70) = 50$, $yI_2 = \min(50, 70) = 50$

Intersection Area = $(50 - 30) \times (50 - 30) = 400$.
Union Area = $1600 + 1600 - 400 = 2800$.
IoU = $400 / 2800 \approx 0.143$ (Poor overlap).

8. Visual Diagrams (ASCII Art)

8.1 Convolution Sliding Window Concept

  Input Image (5x5)         Kernel (3x3)      Output Feature Map (3x3)
+---+---+---+---+---+       +---+---+---+          +---+---+---+
| a | b | c | 0 | 1 |       | 1 | 0 | 1 |          | X |   |   |
+---+---+---+---+---+       +---+---+---+          +---+---+---+
| d | e | f | 1 | 0 |   *   | 0 | 1 | 0 |   ==>    |   |   |   |
+---+---+---+---+---+       +---+---+---+          +---+---+---+
| g | h | i | 0 | 0 |       | 1 | 0 | 1 |          |   |   |   |
+---+---+---+---+---+       +---+---+---+          +---+---+---+
| 1 | 0 | 1 | 1 | 0 |
+---+---+---+---+---+       X is calculated by element-wise
| 0 | 1 | 1 | 0 | 1 |       multiplication of the kernel with the
+---+---+---+---+---+       top-left 3x3 region [a..i] and summing.

8.2 YOLO Grid Architecture

  Original Image        SxS Grid (e.g., 7x7)     Bounding Box Predictions
+----------------+     +--+--+--+--+--+--+--+     +----------------+
|                |     |  |  |  |  |  |  |  |     | [Car]          |
|  +-----+       |     +--+--+--+--+--+--+--+     |  +-----+       |
|  | CAR |       | ==> |  |  |##|##|  |  |  |     |  |     |       |
|  +-----+       |     +--+--+--+--+--+--+--+     |  +-----+       |
|                |     |  |  |##|##|  |  |  |     |          [Dog] |
+----------------+     +--+--+--+--+--+--+--+     +----------------+

Grid cells (##) responsible for detecting the center of the car.

9. Flowcharts (ASCII Art)

9.1 General Deep Learning CV Pipeline

[ Raw Image Dataset ]
          |
          v
[ Data Augmentation & Preprocessing ] --> (Resize, Normalize, Flip, Crop)
          |
          v
[ Backbone CNN ] --> (Extracts Features: ResNet, CSPDarknet)
          |
          v
[ Neck (Feature Aggregation) ] --> (Combines multi-scale features: FPN)
          |
          v
[ Head (Prediction) ] --> (Outputs Classes, Bounding Boxes, or Masks)
          |
          v
[ Non-Maximum Suppression ] --> (Removes duplicate overlapping boxes)
          |
          v
[ Final Detection Output ]
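The Non-Maximum Suppression stage of this pipeline can be sketched in a few lines. This is the plain greedy variant for a single class, assuming boxes in [x1, y1, x2, y2] format and a hypothetical `iou_threshold` of 0.5; production detectors typically use vectorized or per-class versions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive greedy NMS."""
    # Visit boxes from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is treated as a duplicate; the third box does not overlap anything and is kept.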

9.2 Canny Edge Detection Pipeline

Input -> [ Gaussian Blur ] (Reduces noise)
                 |
                 v
         [ Sobel Filters ] (Computes Intensity Gradients Gx, Gy)
                 |
                 v
         [ Non-Maximum Suppression ] (Thins edges by removing non-local maxima)
                 |
                 v
         [ Hysteresis Thresholding ] (Links weak edges to strong edges) -> Output
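The final stage, hysteresis thresholding, is the least intuitive of the four, so here is a small NumPy sketch of it. It assumes a gradient-magnitude array as input, 8-connected neighborhoods, and two hypothetical thresholds `low` and `high`; it illustrates the idea and is not how any particular library implements it.

```python
import numpy as np

def hysteresis_threshold(magnitude, low, high):
    """Keep strong pixels, plus weak pixels connected to them (8-connectivity)."""
    strong = magnitude >= high
    weak = (magnitude >= low) & ~strong
    edges = strong.copy()
    h, w = edges.shape
    while True:
        # Mark every pixel that touches a current edge pixel.
        padded = np.pad(edges, 1, mode="constant", constant_values=False)
        neighbours = np.zeros_like(edges)
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                neighbours |= padded[dy:dy + h, dx:dx + w]
        grown = edges | (weak & neighbours)
        if np.array_equal(grown, edges):
            return edges  # nothing new was linked, so we are done
        edges = grown

magnitude = np.array([[0, 50, 60, 120, 0],
                      [0,  0,  0,   0, 0],
                      [55, 0,  0,   0, 0]], dtype=float)
edges = hysteresis_threshold(magnitude, low=40, high=100)
print(edges.astype(int))
```

The weak pixels with values 60 and 50 are kept because they chain back to the strong pixel (120), while the isolated weak pixel (55) is discarded as noise.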

10. Python Implementation (From Scratch)

We implement RGB-to-grayscale conversion, a 2D convolution function, and IoU using NumPy.

import numpy as np

def rgb_to_gray(img_rgb):
    """ Converts an RGB image (HxWx3) to Grayscale (HxW). """
    weights = np.array([0.2989, 0.5870, 0.1140])
    return np.dot(img_rgb[..., :3], weights)

def conv2d(image, kernel, stride=1, padding=0):
    """ Implements 2D convolution (as cross-correlation, i.e. without flipping the kernel) on a 2D single-channel image. """
    if padding > 0:
        image = np.pad(image, pad_width=padding, mode='constant', constant_values=0)
        
    img_h, img_w = image.shape
    kernel_h, kernel_w = kernel.shape
    
    out_h = (img_h - kernel_h) // stride + 1
    out_w = (img_w - kernel_w) // stride + 1
    output = np.zeros((out_h, out_w))
    
    for y in range(0, out_h):
        for x in range(0, out_w):
            y_start = y * stride
            y_end = y_start + kernel_h
            x_start = x * stride
            x_end = x_start + kernel_w
            
            roi = image[y_start:y_end, x_start:x_end]
            output[y, x] = np.sum(roi * kernel)
            
    return output

def calculate_iou(boxA, boxB):
    """ Calculate IoU. Boxes are [x1, y1, x2, y2]. """
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])
    
    interArea = max(0, xB - xA) * max(0, yB - yA)
    boxAArea = (boxA[2] - boxA[0]) * (boxA[3] - boxA[1])
    boxBArea = (boxB[2] - boxB[0]) * (boxB[3] - boxB[1])
    
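    union = float(boxAArea + boxBArea - interArea)
    # Two zero-area boxes have no union; report no overlap instead of dividing by zero.
    return interArea / union if union > 0 else 0.0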

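# Test the convolution on an image with a vertical edge (bright left columns, dark right columns)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
sample_image = np.array([[10, 10, 10, 0, 0], [10, 10, 10, 0, 0], [10, 10, 10, 0, 0]])
edges = conv2d(sample_image, sobel_x, padding=1)
print(edges)  # nonzero responses mark horizontal intensity changes, including at the zero-padded border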

Code Challenge

Modify the conv2d function above to handle 3-channel RGB images directly. You will need to make the kernel 3-dimensional (e.g., 3x3x3) and sum across the color channels as well!

11. TensorFlow Implementation

The example below classifies an image with a ResNet50 pre-trained on ImageNet, using TensorFlow/Keras.

import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image
import numpy as np

model = ResNet50(weights='imagenet')

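def classify_image(img_path):
    # ResNet50 expects 224x224 RGB input.
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    # Add a batch dimension: (224, 224, 3) -> (1, 224, 224, 3).
    x = np.expand_dims(x, axis=0)
    # Apply the same preprocessing the network was trained with.
    x = preprocess_input(x)
    
    preds = model.predict(x)
    # Map the raw ImageNet scores to (id, label, probability) tuples.
    results = decode_predictions(preds, top=3)[0]
    for i, (imagenet_id, label, prob) in enumerate(results):
        print(f"{i+1}. {label}: {prob*100:.2f}%")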

12. Scikit-Learn Pipeline

The example below extracts HOG features and sets up a scikit-learn pipeline (standardization followed by an RBF-kernel SVM) to classify them.

import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def extract_features(images):
    features = []
    for img in images:
        fd = hog(img, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), visualize=False)
        features.append(fd)
    return np.array(features)

clf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0))
])

13. Indian Case Studies

India Spotlight: CV at Population Scale

1. Aadhaar Face Authentication: UIDAI uses CV for deduplication. Facial feature extraction that handles diverse Indian demographics helps ensure that no individual registers twice.

2. ISRO's Satellite Imagery: ISRO processes satellite imagery with CV to classify land use, monitor deforestation, and predict crop yields for Indian agriculture grids.

3. AI Traffic Management: Metropolitan cities use OCR (ANPR) to read diverse Indian license plates and pose estimation to check for helmet compliance, generating e-challans automatically.

14. Global Case Studies

15. Startup Applications

Startups leverage CV in disruptive ways:

16. Government Applications

Governments deploy CV for public safety:

17. Industry Applications

Traditional industries use CV to automate processes:

Industry Alert: Edge AI

Heavy industries deploy models on "Edge Devices" operating locally on the factory floor to ensure low latency and data privacy.

18. Mini Projects

Mini Project 1: Indian License Plate Detector

Goal: Extract text from vehicle license plates.
Stack: OpenCV, EasyOCR.
Steps: Apply Bilateral Filter to remove noise, use Canny Edge detection, find rectangular contours, crop the region, and pass it to EasyOCR.

Mini Project 2: Smart Face Attendance System

Goal: Log attendance based on webcam face recognition.
Stack: OpenCV, face_recognition.
Steps: Store known face encodings, capture live video, extract current encodings, compare distances, and log matches to a CSV.

Mini Project 3: Automated Document Scanner

Goal: Convert a skewed photo of a document into a flat, PDF-style image.
Stack: OpenCV, NumPy.
Steps: Find the 4 corners of the document, apply cv2.getPerspectiveTransform() to warp it flat, and use Adaptive Gaussian Thresholding to sharpen text.

19. Exercises

  1. Explain the difference between RGB and HSV color spaces. Why is HSV generally preferred over RGB for object tracking?
  2. Calculate the output dimensions of a 128x128 image passed through a 5x5 convolution kernel with a stride of 2 and padding of 1.
  3. Apply the Sobel X matrix manually to a 3x3 matrix of all 1s. What is the result?
  4. Box A is [0,0,10,10]. Box B is [5,5,15,15]. Calculate the Intersection over Union (IoU) exactly.
  5. Given a 4x4 matrix, manually perform 2x2 Max Pooling with stride 2. Why is max pooling preferred over average pooling?
  6. List 5 different image augmentation techniques. Explain why rotating a handwritten '6' or '9' might be bad for an OCR dataset.
  7. What is the main structural difference between a one-stage detector (YOLO) and a two-stage detector (Faster R-CNN)?
  8. Give a real-world scenario where Semantic Segmentation fails but Instance Segmentation is required.
  9. Explain the algorithm of Non-Maximum Suppression (NMS).
  10. Define the concept of a "Receptive Field" in a deep CNN.
  11. Why does YOLO use a sum-of-squared-error loss for bounding box coordinates but cross-entropy for class probabilities?
  12. Write a 3-line Python script using OpenCV to read an image, convert it to grayscale, and save it back to disk.
  13. What makes SIFT features "scale-invariant"?
  14. Why are GPUs significantly better at processing image convolutions than traditional multi-core CPUs?
  15. What is Model Quantization and why is it crucial for deploying CV models on edge devices?
  16. Explain the basic adversarial concept behind Generative Adversarial Networks (GANs).
  17. How does a diffusion model differ from a GAN in generating images?
  18. Discuss potential algorithmic biases in facial recognition systems. Why is a diverse dataset critical?
  19. What are "Anchor Boxes" in SSD and Faster R-CNN?
  20. If an object detector takes 50ms to process a frame, what is the maximum FPS it can achieve?

20. Multiple Choice Questions (MCQs)


1. Which of the following color spaces separates image intensity from color information?
  • A) RGB
  • B) HSV
  • C) CMYK
  • D) Grayscale
Correct Answer: B) HSV.
2. In a CNN, what is the primary purpose of a Pooling layer?
  • A) To add non-linearity
  • B) To increase spatial dimensions
  • C) To reduce spatial dimensions and computational load
  • D) To calculate loss
Correct Answer: C.
3. You are designing an application to count exact numbers of overlapping blood cells. Which technique should you use?
  • A) Image Classification
  • B) Semantic Segmentation
  • C) Instance Segmentation
  • D) Pose Estimation
Correct Answer: C) Instance Segmentation.
4. What does YOLO stand for in the context of Object Detection?
  • A) You Only Learn Once
  • B) You Only Look Once
  • C) Yield Objects Locally Once
  • D) Yet another Object Learning Ontology
Correct Answer: B) You Only Look Once.
5. Which algorithm eliminates redundant overlapping bounding boxes?
  • A) Anchor Mapping
  • B) Gradient Descent
  • C) Non-Maximum Suppression (NMS)
  • D) Backpropagation
Correct Answer: C) Non-Maximum Suppression.
6. An IoU score of 0.95 indicates:
  • A) A terrible prediction
  • B) An almost perfect overlap
  • C) No intersection
  • D) Size difference
Correct Answer: B) An almost perfect overlap.
7. The Sobel operator is primarily used for:
  • A) Smoothing an image
  • B) Edge Detection
  • C) Changing color spaces
  • D) Generating images
Correct Answer: B) Edge Detection.
8. In U-Net, what is the purpose of "skip connections"?
  • A) To skip processing background
  • B) To pass high-resolution spatial information to the decoder
  • C) To prevent overfitting
  • D) To convert the model to a detector
Correct Answer: B.
9. Which of the following is a classic pre-DL "Feature Descriptor"?
  • A) ResNet
  • B) MobileNet
  • C) SIFT
  • D) U-Net
Correct Answer: C) SIFT.
10. The 5 values output per bounding box prediction in object detection usually represent:
  • A) Red, Green, Blue, Alpha, Class
  • B) x_center, y_center, width, height, confidence score
  • C) x1, y1, x2, y2, x3
  • D) Depth, Height, Width, Time, Class
Correct Answer: B.

21. Interview Questions

Career Path: CV Engineer Interviews

Interviewers in Computer Vision test your geometric intuition, matrix math, and classical algorithms alongside deep learning.

  1. Q: Explain the Semantic Gap in Computer Vision.
    A: It is the disconnect between the low-level pixel data (a matrix) and the high-level semantic meaning (e.g., "a dog") that a human perceives.
  2. Q: How do you handle scale variance in object detection?
    A: Using Feature Pyramid Networks (FPN) where predictions are made at multiple layers, and Anchor Boxes of various sizes.
  3. Q: Walk me through Canny Edge Detection.
    A: 1. Gaussian Blur. 2. Sobel operator. 3. Non-Maximum Suppression. 4. Hysteresis Thresholding.
  4. Q: Why does YOLO use fully convolutional architectures now?
    A: Fully connected layers force a fixed input image size. FCNs retain spatial dimensions and can process varying input sizes.
  5. Q: Explain Intersection over Union (IoU).
    A: Area of Overlap divided by Area of Union. It evaluates accuracy of bounding box predictions.
  6. Q: What is a Vision Transformer (ViT)?
    A: ViT splits an image into fixed-size patches, linearly embeds them, and uses a standard Transformer encoder with self-attention instead of convolutions.
  7. Q: Design a system to detect defective pills on a conveyor belt.
    A: Frame it as Anomaly Detection. Use a high-speed camera with consistent lighting, train a lightweight CNN (MobileNet) or one-class SVM, deployed on an Edge device.
  8. Q: What is the purpose of Data Augmentation?
    A: Increases dataset diversity to prevent overfitting. It is harmful if applied inappropriately (e.g., vertically flipping cars).
  9. Q: Contrast Semantic and Instance Segmentation.
    A: Semantic outputs a single class label map. Instance segmentation predicts a distinct binary mask for each specific object instance.
  10. Q: Explain 1x1 Convolutions.
    A: A 1x1 convolution looks at a single pixel across all channels, acting as a learned cross-channel projection that is typically used to reduce (or expand) the number of channels.
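The 1x1 convolution answer is easy to verify in NumPy: it is the same matrix multiplication applied independently at every pixel. The sketch below assumes a channels-last feature map of shape (H, W, C_in) and a weight matrix of shape (C_in, C_out), with random values standing in for learned weights and no bias term.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 4, 8))   # H x W x C_in
weights = rng.normal(size=(8, 2))          # C_in x C_out

# A 1x1 convolution applies the same C_in -> C_out linear map at every pixel.
reduced = np.einsum("hwc,cd->hwd", feature_map, weights)

print(feature_map.shape, "->", reduced.shape)  # (4, 4, 8) -> (4, 4, 2)

# Any single output pixel is just that pixel's channel vector times the weights.
print(np.allclose(reduced[1, 2], feature_map[1, 2] @ weights))  # True
```

The spatial size is untouched and only the channel count changes, which is why 1x1 convolutions are a cheap way to shrink a feature map before a more expensive 3x3 or 5x5 layer.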

22. Research Problems

23. Key Takeaways

24. References & Further Reading