Chapter 3: Python & Data Science Ecosystem

3.1 🎯 Learning Objectives

🐍

Python Mastery

Master Python fundamentals — data types, control flow, OOP, comprehensions — tailored for ML/AI work.

🔢

NumPy Proficiency

Perform 20+ array operations: broadcasting, vectorization, linear algebra, reshaping, indexing.

🐼

Pandas Expertise

Load, clean, transform, and analyze real Indian datasets using DataFrames with 20+ operations.

📊

Visualization Skills

Create publication-quality plots — line, scatter, bar, histogram, heatmap — with Matplotlib & Seaborn.

⚙️

Scikit-Learn Pipelines

Understand the estimator API (fit/predict/transform), build preprocessing pipelines, and evaluate models.

🧠

TensorFlow Setup

Install TensorFlow/Keras, configure GPU, perform tensor operations, and use eager execution.

📓

Notebook Workflows

Use Jupyter Notebooks and Google Colab for interactive development and reproducible research.

🇮🇳

Indian Data Analysis

Perform EDA on NSE stock data, Census of India, IMD weather data, and agricultural datasets.

3.2 📖 Introduction

Python has become the lingua franca of artificial intelligence and data science. Over 87% of ML practitioners use Python as their primary language (Kaggle Survey 2024). This chapter transforms you from a Python beginner into a proficient data science programmer.

Why Python Dominates AI/ML

Feature	Python	R	Julia	MATLAB	C++
Learning Curve	⭐ Easy	Medium	Medium	Easy	Hard
ML Libraries	⭐ 500+	200+	50+	100+	30+
Deep Learning	⭐ TF, PyTorch	Limited	Flux.jl	DL Toolbox	LibTorch
Production Deploy	⭐ Excellent	Poor	Growing	Poor	⭐ Excellent
Community Size	⭐ Massive	Large	Small	Medium	Large
Speed	Moderate	Slow	⭐ Fast	Moderate	⭐ Fastest
Cost	⭐ Free	⭐ Free	⭐ Free	₹1,50,000+/yr	⭐ Free
Industry Adoption	⭐ 87%	15%	3%	8%	12%

The Data Science Stack

Python Stack
# The complete Python data science ecosystem

# Layer 1: Core Language
Python 3.10+   # Base language

# Layer 2: Scientific Computing
NumPy          # N-dimensional arrays & linear algebra
SciPy          # Scientific algorithms (optimization, integration)

# Layer 3: Data Manipulation
Pandas         # DataFrames for structured data
Polars         # Modern high-performance alternative

# Layer 4: Visualization
Matplotlib     # Core plotting library
Seaborn        # Statistical visualization
Plotly         # Interactive plots

# Layer 5: Machine Learning
Scikit-Learn   # Classical ML algorithms
XGBoost        # Gradient boosting
LightGBM       # Fast gradient boosting

# Layer 6: Deep Learning
TensorFlow     # Google's DL framework
PyTorch        # Meta's DL framework
Keras          # High-level API

# Layer 7: Deployment
Flask / FastAPI   # API servers
Docker            # Containerization
MLflow            # Model tracking

3.3 📜 Historical Background

Timeline of Python in Data Science

Timeline: Python's Journey to AI Dominance
═══════════════════════════════════════════
1991 ─── Python 0.9 released by Guido van Rossum (Netherlands)
  │      Inspired by ABC language; focus on readability
1995 ─── NumPy predecessor (Numeric) created by Jim Hugunin
  │      First attempt at numerical computing in Python
2001 ─── SciPy project launched
  │      Scientific computing ecosystem begins
2003 ─── Matplotlib 0.1 released by John Hunter
  │      Plotting inspired by MATLAB's interface
2008 ─── Python 3.0 released (breaking change from 2.x)
  │      Pandas 0.1 by Wes McKinney at AQR Capital
2010 ─── Scikit-Learn 0.1 released (INRIA, France)
  │      Unified ML API: fit(), predict(), transform()
2011 ─── NumPy 1.6 with improved broadcasting
  │      IPython Notebook (precursor to Jupyter)
2014 ─── Jupyter Project born (Julia + Python + R)
  │      Google launches TensorFlow internally
2015 ─── TensorFlow open-sourced by Google Brain
  │      Keras released by François Chollet
2016 ─── PyTorch released by Facebook AI Research
  │      Python surpasses R in ML popularity
2017 ─── TensorFlow 1.0 + Google Colab launched
  │      India's AI/ML job postings grow 400%
2019 ─── TensorFlow 2.0 (eager execution by default)
  │      Python overtakes Java as the #2 language on GitHub
2022 ─── ChatGPT launches (built with Python/PyTorch)
  │      Demand for Python+AI skills explodes globally
2024 ─── Python 3.12; 14M+ data science practitioners
  │      India becomes 2nd largest AI talent pool
2025 ─── Python ecosystem: 500K+ packages on PyPI
         AI/ML libraries downloaded 2B+ times/month

3.4 💡 Conceptual Explanation

Python Fundamentals Crash Course

4.1 Data Types

Python
# === Numeric Types ===
age = 21                    # int
gpa = 9.2                   # float
complex_num = 3 + 4j         # complex

# === Strings ===
name = "Aadhaar"
greeting = f"Hello, {name}!"  # f-string formatting
multiline = """India has
1.4 billion people"""

# === Boolean ===
is_indian = True
has_gpu = False

# === Collections ===
# List (ordered, mutable)
cities = ["Mumbai", "Delhi", "Bangalore", "Chennai"]
cities.append("Hyderabad")

# Tuple (ordered, immutable)
coordinates = (28.6139, 77.2090)  # Delhi lat, long

# Dictionary (key-value pairs)
student = {
    "name": "Priya Sharma",
    "roll": 101,
    "branch": "CSE",
    "cgpa": 8.7
}

# Set (unordered, unique elements)
languages = {"Python", "R", "Julia", "Python"}  # {'Python','R','Julia'}

# Type checking
print(type(age))      # <class 'int'>
print(type(cities))   # <class 'list'>
print(isinstance(gpa, float))  # True

4.2 Control Flow

Python
# === Conditional Statements ===
marks = 85
if marks >= 90:
    grade = "A+"
elif marks >= 80:
    grade = "A"
elif marks >= 70:
    grade = "B"
else:
    grade = "C"
print(f"Grade: {grade}")  # Grade: A

# Ternary operator
result = "Pass" if marks >= 40 else "Fail"

# === Loops ===
# For loop with range
for i in range(1, 6):
    print(f"{i} × 7 = {i*7}")

# Iterating over collections
states = ["Maharashtra", "Karnataka", "Tamil Nadu"]
for idx, state in enumerate(states, 1):
    print(f"{idx}. {state}")

# While loop
n = 10
factorial = 1
while n > 0:
    factorial *= n
    n -= 1
print(f"10! = {factorial}")  # 3628800

# === List Comprehensions ===
squares = [x**2 for x in range(1, 11)]
# [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

evens = [x for x in range(1, 21) if x % 2 == 0]
# [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

# Dictionary comprehension
sq_dict = {x: x**2 for x in range(1, 6)}
# {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

# Nested comprehension (matrix creation)
matrix = [[i * j for j in range(1, 4)] for i in range(1, 4)]
# [[1,2,3], [2,4,6], [3,6,9]]

4.3 Functions & Lambda

Python
# === Function Definition ===
def calculate_bmi(weight_kg, height_m):
    """Calculate Body Mass Index.
    
    Args:
        weight_kg (float): Weight in kilograms
        height_m (float): Height in meters
    
    Returns:
        float: BMI value
    """
    bmi = weight_kg / (height_m ** 2)
    return round(bmi, 2)

print(calculate_bmi(70, 1.75))  # 22.86

# Default parameters
def greet(name, language="Hindi"):
    greetings = {"Hindi": "Namaste", "English": "Hello", "Tamil": "Vanakkam"}
    return f"{greetings.get(language, 'Hi')}, {name}!"

# *args and **kwargs
def statistics(*values):
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean)**2 for x in values) / n
    return {"mean": mean, "variance": variance, "std": variance**0.5}

print(statistics(85, 90, 78, 92, 88))

# Lambda functions
square = lambda x: x ** 2
add = lambda a, b: a + b

# Map, Filter, Reduce
prices_inr = [100, 250, 500, 1000]
prices_usd = list(map(lambda p: round(p / 83, 2), prices_inr))
expensive = list(filter(lambda p: p > 300, prices_inr))

from functools import reduce
total = reduce(lambda a, b: a + b, prices_inr)  # 1850

4.4 Object-Oriented Programming

Python
class Dataset:
    """A simple dataset class for ML workflows."""
    
    def __init__(self, name, features, target):
        self.name = name
        self.features = features      # list of lists
        self.target = target           # list
        self.n_samples = len(target)
        self.n_features = len(features[0]) if features else 0
    
    def __repr__(self):
        return f"Dataset('{self.name}', samples={self.n_samples}, features={self.n_features})"
    
    def head(self, n=5):
        for i in range(min(n, self.n_samples)):
            print(f"X={self.features[i]}, y={self.target[i]}")
    
    def train_test_split(self, test_ratio=0.2):
        split = int(self.n_samples * (1 - test_ratio))
        return (self.features[:split], self.target[:split],
                self.features[split:], self.target[split:])

# Inheritance
class ImageDataset(Dataset):
    def __init__(self, name, features, target, img_size):
        super().__init__(name, features, target)
        self.img_size = img_size
    
    def resize(self, new_size):
        self.img_size = new_size
        print(f"Images resized to {new_size}")

# Usage
ds = Dataset("Indian Housing", [[1200,2],[1800,3],[2400,4]], [50,75,100])
print(ds)  # Dataset('Indian Housing', samples=3, features=2)
ds.head()

3.5 📐 Mathematical Foundation

The mathematical operations that power data science are computed through NumPy's optimized C backend. Understanding the math behind these operations is crucial.

5.1 Vector Operations

A vector x ∈ ℝⁿ is an ordered collection of n real numbers:

x = [x₁, x₂, ..., xₙ]ᵀ

Dot Product:
x · y = Σᵢ xᵢyᵢ = x₁y₁ + x₂y₂ + ... + xₙyₙ

Euclidean Norm (L2):
‖x‖₂ = √(Σᵢ xᵢ²) = √(x₁² + x₂² + ... + xₙ²)

Manhattan Norm (L1):
‖x‖₁ = Σᵢ |xᵢ| = |x₁| + |x₂| + ... + |xₙ|

Cosine Similarity:
cos(θ) = (x · y) / (‖x‖₂ · ‖y‖₂)
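
The four formulas above map directly onto NumPy calls. The following is a minimal sketch with two made-up vectors (x and y are illustrative values, not data from this chapter), chosen so that each result can be checked by hand.

```python
import numpy as np

# Illustrative vectors (assumed values, chosen for easy mental arithmetic)
x = np.array([3.0, 4.0])
y = np.array([4.0, 3.0])

dot = np.dot(x, y)                     # 3*4 + 4*3 = 24
l2 = np.linalg.norm(x)                 # sqrt(3² + 4²) = 5
l1 = np.linalg.norm(x, ord=1)          # |3| + |4| = 7
cos_sim = dot / (np.linalg.norm(x) * np.linalg.norm(y))   # 24 / (5 * 5) = 0.96

print(f"dot={dot}, L2={l2}, L1={l1}, cosine={cos_sim:.2f}")
```

A cosine similarity close to 1 means the two vectors point in nearly the same direction, regardless of their lengths.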

5.2 Matrix Operations

Matrix Multiplication: C = A × B where A ∈ ℝᵐˣⁿ, B ∈ ℝⁿˣᵖ, C ∈ ℝᵐˣᵖ
Cᵢⱼ = Σₖ AᵢₖBₖⱼ  for k = 1 to n

Transpose:
(Aᵀ)ᵢⱼ = Aⱼᵢ

Determinant (2×2):
det(A) = |a b| = ad - bc
         |c d|

Inverse (2×2):
A⁻¹ = (1/det(A)) × | d  -b|
                     |-c   a|

Eigenvalue equation:
Av = λv  where λ = eigenvalue, v = eigenvector
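
As a quick check of the 2×2 formulas above, the sketch below computes the determinant and inverse by hand and compares them with NumPy's routines. The matrix A is an assumed example, picked because its eigenvalues (2 and 5) are easy to verify on paper.

```python
import numpy as np

# Assumed example matrix: trace 7, determinant 10, eigenvalues 2 and 5
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
a, b = A[0]
c, d = A[1]

# Determinant and inverse from the closed-form 2×2 formulas
det_manual = a * d - b * c
inv_manual = np.array([[d, -b],
                       [-c, a]]) / det_manual

print("det matches:", np.isclose(det_manual, np.linalg.det(A)))
print("inverse matches:", np.allclose(inv_manual, np.linalg.inv(A)))
print("A @ A⁻¹ is identity:", np.allclose(A @ inv_manual, np.eye(2)))

# Transpose swaps row and column indices
print("transpose rule holds:", A.T[0, 1] == A[1, 0])

# Each column of `eigenvectors` satisfies A v = λ v
eigenvalues, eigenvectors = np.linalg.eig(A)
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(f"λ={lam:.1f}: A v == λ v ->", np.allclose(A @ v, lam * v))
```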

5.3 Statistical Measures

Mean (Arithmetic Average):
μ = (1/n) Σᵢ xᵢ

Variance:
σ² = (1/n) Σᵢ (xᵢ - μ)²

Standard Deviation:
σ = √σ²

Covariance between X and Y:
Cov(X,Y) = (1/n) Σᵢ (xᵢ - μₓ)(yᵢ - μᵧ)

Pearson Correlation:
ρ(X,Y) = Cov(X,Y) / (σₓ · σᵧ)    where -1 ≤ ρ ≤ 1

Z-Score (Standardization):
z = (x - μ) / σ

Min-Max Normalization:
x_norm = (x - x_min) / (x_max - x_min)
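
The measures above can be computed step by step in NumPy. This sketch uses two small illustrative arrays (assumed values, with y = x − 1 so the correlation is exactly 1) and the 1/n population form used in the formulas, which is also NumPy's default for var and std.

```python
import numpy as np

# Illustrative data (assumed): y is a shifted copy of x, so ρ should be 1
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

mu = x.mean()                                        # (2+4+6+8)/4 = 5
var = ((x - mu) ** 2).mean()                         # (9+1+1+9)/4 = 5
sigma = np.sqrt(var)
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()    # 1/n covariance
rho = cov_xy / (x.std() * y.std())                   # Pearson correlation

z = (x - mu) / sigma                                 # z-scores: mean 0, std 1
x_norm = (x - x.min()) / (x.max() - x.min())         # min-max: range [0, 1]

print(f"mean={mu}, var={var}, std={sigma:.3f}, cov={cov_xy}, rho={rho:.2f}")
print("z-scores:", np.round(z, 3))
print("min-max:", np.round(x_norm, 3))
```

Note that pandas uses the 1/(n−1) sample form by default for .var() and .std(), so the two libraries give slightly different numbers on the same data unless ddof is set explicitly.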

3.6 🔣 Formula Derivations

Derivation 1: Why Vectorization Beats a Python Loop (Both Are O(n))

Given: Two vectors a, b ∈ ℝⁿ, compute c = a + b

Method 1: Python Loop
  For i = 0 to n-1:
    c[i] = a[i] + b[i]
  Time complexity: O(n) Python interpreter calls
  Each iteration: type-check → unbox → add → box → store
  Overhead per element: ~100 CPU cycles (interpreter)

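Method 2: NumPy Vectorization
  c = np.add(a, b)    # Single C-level call
  Time complexity: still O(n), but the loop runs in compiled C
  Internally: SIMD instructions can process 4-8 elements simultaneously
  Overhead per element: ~1-2 CPU cycles (raw hardware)

Speedup ≈ (100 × n) / (2 × n/4) = 200×   (n cancels: the gain is a constant factor, not a change in complexity)
For n = 1,000,000: Loop ≈ 0.5s, NumPy ≈ 0.002s (order-of-magnitude figures; actual timings depend on hardware)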
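
To see that both methods scale linearly with n while differing by a large constant factor, the sketch below times them at two sizes with the standard-library timeit module. The helper loop_add and the chosen sizes are assumptions made for illustration; the printed timings vary by machine, so no expected output is shown.

```python
import timeit
import numpy as np

def loop_add(a, b):
    """Element-by-element addition driven by the Python interpreter."""
    return [a[i] + b[i] for i in range(len(a))]

for n in (10_000, 100_000):
    a = np.random.rand(n)
    b = np.random.rand(n)
    # Take the best of 3 repeats (5 calls each) to reduce timing noise
    t_loop = min(timeit.repeat(lambda: loop_add(a, b), number=5, repeat=3))
    t_vec = min(timeit.repeat(lambda: a + b, number=5, repeat=3))
    print(f"n={n:>7}: loop {t_loop:.5f}s, numpy {t_vec:.6f}s, "
          f"ratio {t_loop / t_vec:.0f}x")

# Both approaches compute the same result
print("same result:", np.allclose(loop_add(a, b), a + b))
```

If the loop time grows roughly tenfold when n grows tenfold, that is the O(n) behaviour; the ratio column is the constant-factor advantage of vectorization.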

Derivation 2: Broadcasting Rules (First Principles)

Broadcasting allows operations on arrays of different shapes.

Rule 1: If arrays differ in ndim, prepend 1s to smaller shape.
  A.shape = (3, 4)    →  (3, 4)
  B.shape = (4,)      →  (1, 4)

Rule 2: Arrays with size 1 along a dimension are stretched.
  A.shape = (3, 4)    →  (3, 4)
  B.shape = (1, 4)    →  (3, 4)  ← B is repeated 3 times

Rule 3: If sizes differ and neither is 1 → Error!
  A.shape = (3, 4)
  B.shape = (3, 5)    →  INCOMPATIBLE!

Example: Standardize a dataset X (m×n) column-wise
  X.shape = (1000, 5)    # 1000 samples, 5 features
  mean.shape = (5,)      # → (1, 5) → (1000, 5)
  std.shape = (5,)       # → (1, 5) → (1000, 5)
  Z = (X - mean) / std   # Broadcasting handles it!
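
The three rules can be checked without allocating any data by using np.broadcast_shapes (available in NumPy 1.20 and later). The sketch below also runs the standardization example on synthetic data; the random matrix X is an assumption standing in for a real dataset.

```python
import numpy as np

# Rules 1 and 2: (4,) is treated as (1, 4), then stretched to (3, 4)
print(np.broadcast_shapes((3, 4), (4,)))

# Rule 3: sizes differ and neither is 1, so NumPy raises an error
try:
    np.broadcast_shapes((3, 4), (3, 5))
except ValueError as exc:
    print("Incompatible:", exc)

# Column-wise standardization of a synthetic (1000, 5) dataset
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 5))
mean = X.mean(axis=0)      # shape (5,)
std = X.std(axis=0)        # shape (5,)
Z = (X - mean) / std       # (1000, 5) with (5,) broadcasts to (1000, 5)

print("Z shape:", Z.shape)
print("column means ≈ 0:", np.allclose(Z.mean(axis=0), 0.0))
print("column stds ≈ 1:", np.allclose(Z.std(axis=0), 1.0))
```

Row-wise operations need keepdims=True (or an explicit reshape) so that the statistic has shape (m, 1) rather than (m,), which would otherwise be aligned with the last axis.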

Derivation 3: Covariance Matrix from First Principles

Given dataset X with m samples and n features: X ∈ ℝᵐˣⁿ

Step 1: Center the data (subtract column means)
  X̃ = X - μ    where μⱼ = (1/m) Σᵢ Xᵢⱼ

Step 2: Covariance matrix C ∈ ℝⁿˣⁿ
  C = (1/m) X̃ᵀ X̃

Step 3: Element Cⱼₖ = covariance between feature j and k
  Cⱼₖ = (1/m) Σᵢ (Xᵢⱼ - μⱼ)(Xᵢₖ - μₖ)

Properties:
  - C is symmetric: Cⱼₖ = Cₖⱼ
  - Diagonal: Cⱼⱼ = Var(feature j)
  - C is positive semi-definite: vᵀCv ≥ 0 for all v

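In NumPy: C = np.cov(X, rowvar=False, bias=True) or (X_centered.T @ X_centered) / m
(np.cov divides by m - 1 by default; bias=True selects the 1/m form used in this derivation)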
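
The derivation and the listed properties can be verified numerically. This is a minimal sketch on synthetic data (the random matrix X is an assumption); it uses the 1/m normalization from Step 2, which corresponds to bias=True in np.cov, whose default divides by m − 1 instead.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))      # synthetic data: m = 200 samples, n = 3 features
m = X.shape[0]

# Step 1: center each column; Step 2: C = (1/m) X̃ᵀ X̃
X_centered = X - X.mean(axis=0)
C = (X_centered.T @ X_centered) / m

print("matches np.cov (bias=True):",
      np.allclose(C, np.cov(X, rowvar=False, bias=True)))
print("symmetric:", np.allclose(C, C.T))
print("diagonal = variances:", np.allclose(np.diag(C), X.var(axis=0)))
# eigvalsh is for symmetric matrices; tiny negatives are floating-point noise
print("positive semi-definite:", bool(np.all(np.linalg.eigvalsh(C) >= -1e-12)))
```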

3.7 ✏️ Worked Numerical Examples

Example 1: NumPy Matrix Operations

Python
import numpy as np

# Given: Monthly sales data for 3 products in 4 cities (in ₹ lakhs)
sales = np.array([
    [12, 15, 8,  20],   # Product A: Mumbai, Delhi, Bangalore, Chennai
    [9,  11, 14, 7],    # Product B
    [18, 6,  10, 13]    # Product C
])

# Q1: Total sales per product
product_totals = sales.sum(axis=1)
print("Product totals:", product_totals)
# [55, 41, 47]  →  Product A leads with ₹55 lakhs

# Q2: Average sales per city
city_avg = sales.mean(axis=0)
print("City averages:", city_avg)
# [13.0, 10.67, 10.67, 13.33]  →  Chennai highest average

# Q3: Normalize sales (min-max per product)
mins = sales.min(axis=1, keepdims=True)
maxs = sales.max(axis=1, keepdims=True)
normalized = (sales - mins) / (maxs - mins)
print("Normalized:\n", np.round(normalized, 2))
# [[0.33, 0.58, 0.  , 1.  ],   ← Product A: Chennai is max
#  [0.29, 0.57, 1.  , 0.  ],   ← Product B: Bangalore is max
#  [1.  , 0.  , 0.33, 0.58]]   ← Product C: Mumbai is max

# Q4: Covariance between products
cov_matrix = np.cov(sales)
print("Covariance matrix:\n", np.round(cov_matrix, 2))

# Q5: Matrix multiplication — sales × price_per_unit
prices = np.array([[100], [150], [200], [120]])  # ₹ per unit per city
revenue = sales @ prices
print("Revenue per product (₹ lakhs):", revenue.flatten())
# [7450, 6190, 6260]   e.g. Product A: 12×100 + 15×150 + 8×200 + 20×120 = 7450

Example 2: Pandas EDA — Indian Student Dataset

Python
import pandas as pd
import numpy as np

# Create sample Indian student dataset
np.random.seed(42)
n = 200

data = {
    'Name': [f'Student_{i}' for i in range(1, n+1)],
    'State': np.random.choice(
        ['Maharashtra', 'Karnataka', 'Tamil Nadu', 'UP', 'Delhi', 'Kerala'], n),
    'Branch': np.random.choice(
        ['CSE', 'ECE', 'Mech', 'Civil', 'AI/ML'], n,
        p=[0.30, 0.20, 0.20, 0.15, 0.15]),
    'CGPA': np.random.normal(7.5, 1.2, n).clip(4.0, 10.0),
    'Package_LPA': np.random.exponential(6, n).clip(3, 50),
    'Python_Score': np.random.randint(40, 100, n),
    'Placed': np.random.choice([True, False], n, p=[0.72, 0.28])
}
df = pd.DataFrame(data)
df['CGPA'] = df['CGPA'].round(2)
df['Package_LPA'] = df['Package_LPA'].round(2)

# === EDA Operations ===

# 1. Basic info
print(df.shape)          # (200, 7)
df.info()                 # Prints column types and non-null counts itself (returns None)
print(df.describe())      # Statistical summary

# 2. Value counts
print(df['State'].value_counts())
print(df['Branch'].value_counts(normalize=True))

# 3. GroupBy — Average package by state
state_pkg = df.groupby('State')['Package_LPA'].agg(['mean', 'median', 'max'])
print(state_pkg.round(2))

# 4. Pivot table — Average CGPA by State × Branch
pivot = df.pivot_table(values='CGPA', index='State',
                       columns='Branch', aggfunc='mean')
print(pivot.round(2))

# 5. Correlation
numeric_cols = df.select_dtypes(include=['number'])
print(numeric_cols.corr().round(2))

# 6. Filtering — Top performers
toppers = df[(df['CGPA'] > 9.0) & df['Placed']]   # 'Placed' is already boolean
print(f"Toppers placed: {len(toppers)}")

# 7. Apply custom function
def classify_package(lpa):
    if lpa >= 15: return 'Super Dream'
    elif lpa >= 10: return 'Dream'
    elif lpa >= 5: return 'Good'
    else: return 'Average'

df['Offer_Type'] = df['Package_LPA'].apply(classify_package)
print(df['Offer_Type'].value_counts())

# 8. Handling missing data
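df_with_na = df.copy()
df_with_na.loc[0:10, 'Package_LPA'] = np.nan   # .loc slices are inclusive: rows 0 to 10
print(df_with_na.isnull().sum())
# Assign the result back: fillna(inplace=True) on a selected column
# does not reliably update the DataFrame in recent pandas versions
df_with_na['Package_LPA'] = df_with_na['Package_LPA'].fillna(
    df_with_na['Package_LPA'].median())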

3.8 📊 Visual Diagrams

Data Science Workflow Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                   DATA SCIENCE WORKFLOW IN PYTHON                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐   │
│  │   DATA    │    │   DATA    │    │  FEATURE  │    │   MODEL   │   │
│  │  LOADING  │───▸│ CLEANING  │───▸│   ENGG.   │───▸│ TRAINING  │   │
│  │           │    │           │    │           │    │           │   │
│  │ pandas    │    │ pandas    │    │ sklearn   │    │ sklearn   │   │
│  │ requests  │    │ numpy     │    │ numpy     │    │ tensorflow│   │
│  │ sqlalchemy│    │           │    │           │    │ pytorch   │   │
│  └───────────┘    └───────────┘    └───────────┘    └─────┬─────┘   │
│        ▲                                                  │         │
│        │          ┌───────────┐    ┌───────────┐          ▼         │
│        │          │ VISUALIZE │    │ EVALUATE  │    ┌───────────┐   │
│        └──────────│           │◂───│           │◂───│  PREDICT  │   │
│                   │ matplotlib│    │ metrics   │    │           │   │
│                   │ seaborn   │    │ cross_val │    │ deploy    │   │
│                   └───────────┘    └───────────┘    └───────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

NumPy Array Memory Layout

Python List                          NumPy Array
─────────────                        ─────────────────────────
┌──────┐                             ┌───┬───┬───┬───┬───┬───┐
│ ptr ─┼──▸ [1] (PyObject 28B)       │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │  (contiguous)
├──────┤                             └───┴───┴───┴───┴───┴───┘
│ ptr ─┼──▸ [2] (PyObject 28B)         ↑ Each element: 8 bytes (float64)
├──────┤                               Total: 48 bytes
│ ptr ─┼──▸ [3] (PyObject 28B)
├──────┤                             Cache-friendly: CPU loads entire
│ ptr ─┼──▸ [4] (PyObject 28B)       cache line (64B) in one fetch
├──────┤
│ ptr ─┼──▸ [5] (PyObject 28B)       SIMD: Process 4 doubles at once
├──────┤                             using AVX2 instructions
│ ptr ─┼──▸ [6] (PyObject 28B)
└──────┘                             Result: 100-200× faster than
Overhead: 48 + 6×28 = 216 bytes      Python list for math operations!
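
The layout difference can be inspected from Python itself. The sketch below is illustrative only: sys.getsizeof figures depend on the Python build and on the element type (the diagram's 28 bytes corresponds to a small int on 64-bit CPython; floats, used here, are typically a little smaller), so only the NumPy numbers are fixed.

```python
import sys
import numpy as np

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
py_list = list(values)
np_arr = np.array(values, dtype=np.float64)

# A list stores pointers; each element is a separate Python object
list_container = sys.getsizeof(py_list)                  # header + pointer slots
list_objects = sum(sys.getsizeof(v) for v in py_list)    # the float objects
print("list container bytes:", list_container)
print("list element bytes:  ", list_objects)

# An ndarray stores the raw values back to back in one buffer
print("ndarray data bytes:  ", np_arr.nbytes)            # 6 elements × 8 bytes
print("bytes between items: ", np_arr.strides)           # one float64 apart
print("C-contiguous:        ", np_arr.flags['C_CONTIGUOUS'])
```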

Pandas DataFrame Architecture

DataFrame
┌─────────┬──────────┬─────────┬───────────┬───────────┐
│ Index   │ Name     │ State   │ CGPA      │ Package   │
│ (int64) │ (object) │ (cat)   │ (float64) │ (float64) │
├─────────┼──────────┼─────────┼───────────┼───────────┤
│ 0       │ Aarav    │ MH      │ 8.50      │ 12.00     │
│ 1       │ Priya    │ KA      │ 9.20      │ 18.50     │
│ 2       │ Rahul    │ TN      │ 7.80      │ 8.00      │
│ 3       │ Sneha    │ DL      │ 8.90      │ 15.00     │
│ 4       │ Arjun    │ KL      │ 7.10      │ 6.50      │
└─────────┴──────────┴─────────┴───────────┴───────────┘
     ↓         ↓          ↓          ↓           ↓
  Series    Series     Series     Series      Series
(each column is a NumPy array internally)
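
The structure in the diagram can be rebuilt and inspected in a few lines. This sketch reuses the five rows shown above; note that the exact dtype reported for the text column depends on the pandas version, and that non-numeric columns such as categoricals are backed by pandas extension arrays rather than plain NumPy arrays.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Aarav", "Priya", "Rahul", "Sneha", "Arjun"],
    "State": pd.Categorical(["MH", "KA", "TN", "DL", "KL"]),
    "CGPA": [8.50, 9.20, 7.80, 8.90, 7.10],
    "Package": [12.00, 18.50, 8.00, 15.00, 6.50],
})

print(df.dtypes)                        # one dtype per column
print(df.index)                         # default integer index 0..4

cgpa = df["CGPA"]                       # selecting one column yields a Series
print(type(cgpa).__name__)
print(type(cgpa.to_numpy()).__name__)   # numeric columns convert to ndarray
```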

3.9 🔄 Flowcharts

EDA (Exploratory Data Analysis) Flowchart

                ┌─────────────────┐
                │  LOAD DATASET   │
                │  pd.read_csv()  │
                └────────┬────────┘
                         │
                ┌────────▼────────┐
                │ INITIAL INSPECT │
                │ .head()         │
                │ .shape          │
                │ .dtypes         │
                │ .info()         │
                └────────┬────────┘
                         │
          ┌──────────────▼──────────────┐
          │    CHECK MISSING VALUES     │
          │    .isnull().sum()          │
          └──────────────┬──────────────┘
                         │
            ┌────────────▼────────────┐
       ┌────┤     Missing > 30%?      ├────┐
       │YES └─────────────────────────┘ NO │
       ▼                                   ▼
┌────────────────┐                 ┌────────────────┐
│ DROP COLUMN    │                 │ IMPUTE VALUES  │
│ .drop()        │                 │ .fillna()      │
└───────┬────────┘                 │ mean/median    │
        │                          └───────┬────────┘
        └────────────────┬─────────────────┘
                         │
                 ┌───────▼───────┐
                 │ UNIVARIATE    │
                 │ .describe()   │
                 │ .value_counts │
                 │ histograms    │
                 └───────┬───────┘
                         │
                 ┌───────▼───────┐
                 │ BIVARIATE     │
                 │ .corr()       │
                 │ scatter plots │
                 │ box plots     │
                 └───────┬───────┘
                         │
                 ┌───────▼───────┐
                 │ MULTIVARIATE  │
                 │ pair plots    │
                 │ heatmaps      │
                 │ PCA           │
                 └───────┬───────┘
                         │
                 ┌───────▼───────┐
                 │ REPORT &      │
                 │ INSIGHTS      │
                 └───────────────┘
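
The missing-value branch of the flowchart can be sketched as a small helper. The function name handle_missing, the toy DataFrame, and the choice of median (numeric) or most frequent value (non-numeric) for imputation are assumptions for illustration; the 30% threshold comes from the flowchart.

```python
import numpy as np
import pandas as pd

def handle_missing(df, drop_threshold=0.30):
    """Drop columns with too many gaps, impute the remaining ones."""
    missing_share = df.isnull().mean()               # fraction missing per column
    to_drop = missing_share[missing_share > drop_threshold].index
    cleaned = df.drop(columns=to_drop)
    for col in cleaned.columns:
        if cleaned[col].isnull().any():
            if pd.api.types.is_numeric_dtype(cleaned[col]):
                fill_value = cleaned[col].median()
            else:
                fill_value = cleaned[col].mode().iloc[0]   # most frequent value
            cleaned[col] = cleaned[col].fillna(fill_value)
    return cleaned

# Toy data (assumed): package_lpa is 60% missing, the others 20%
raw = pd.DataFrame({
    "cgpa": [8.1, np.nan, 7.4, 9.0, 6.8],
    "package_lpa": [np.nan, np.nan, 6.5, np.nan, 12.0],
    "branch": ["CSE", "ECE", None, "CSE", "Mech"],
})

clean = handle_missing(raw)
print(clean)
print(clean.isnull().sum())
```

The 30% cut-off is a rule of thumb; for an important feature it may be worth imputing even at higher missing rates, or adding a "was missing" indicator column.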

Scikit-Learn Pipeline Flowchart

Raw Data ──▸ ┌──────────────────────────────────────────────┐
             │               sklearn Pipeline               │
             │                                              │
             │  ┌──────────┐  ┌──────────┐  ┌────────────┐  │
             │  │ Step 1   │  │ Step 2   │  │ Step 3     │  │
             │  │ Imputer  ├─▸│ Scaler   ├─▸│ Estimator  │  │
             │  │ .fit()   │  │ .fit()   │  │ .fit()     │  │
             │  │.transform│  │.transform│  │ .predict() │  │
             │  └──────────┘  └──────────┘  └────────────┘  │
             └──────────────────────┬───────────────────────┘
                                    │
             Predictions ◂──────────┘

pipeline.fit(X_train, y_train)    ← fits ALL steps
pipeline.predict(X_test)          ← transforms + predicts
pipeline.score(X_test, y_test)    ← evaluate end-to-end
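
The three-step pipeline in the flowchart can be sketched as follows. The synthetic data, the 5% missing rate, and the choice of LogisticRegression as the estimator are assumptions made for illustration; any scikit-learn estimator could take the final slot.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data with some values knocked out
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan       # ~5% missing at random

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),   # Step 1
    ("scaler", StandardScaler()),                    # Step 2
    ("estimator", LogisticRegression()),             # Step 3
])

pipeline.fit(X_train, y_train)               # fits every step on training data only
y_pred = pipeline.predict(X_test)            # transforms with fitted steps, then predicts
accuracy = pipeline.score(X_test, y_test)    # mean accuracy for classifiers
print(f"Test accuracy: {accuracy:.3f}")
```

Because the imputer and scaler are fitted inside the pipeline, their statistics come from the training split alone, which avoids leaking test-set information into preprocessing.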

3.10 🐍 Python Implementation

10.1 NumPy — 20+ Operations

Python
import numpy as np

# === 1. Array Creation ===
a = np.array([1, 2, 3, 4, 5])
zeros = np.zeros((3, 4))
ones = np.ones((2, 3))
identity = np.eye(4)
randoms = np.random.randn(3, 3)
linspace = np.linspace(0, 10, 50)   # 50 evenly spaced points
arange = np.arange(0, 100, 5)      # [0, 5, 10, ..., 95]

# === 2. Array Properties ===
X = np.random.randn(100, 5)
print(f"Shape: {X.shape}")         # (100, 5)
print(f"Dimensions: {X.ndim}")     # 2
print(f"Size: {X.size}")           # 500
print(f"Dtype: {X.dtype}")         # float64
print(f"Memory: {X.nbytes} bytes") # 4000

# === 3. Indexing & Slicing ===
arr = np.arange(20).reshape(4, 5)
print(arr[0, :])        # First row
print(arr[:, 2])        # Third column
print(arr[1:3, 2:4])   # Sub-matrix
print(arr[arr > 10])    # Boolean indexing

# === 4. Reshaping ===
flat = np.arange(12)
matrix = flat.reshape(3, 4)
tensor = flat.reshape(2, 2, 3)
col_vector = flat.reshape(-1, 1)   # (12, 1)

# === 5. Broadcasting ===
A = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([10, 20, 30])
print(A + b)   # [[11,22,33], [14,25,36]]
print(A * b)   # [[10,40,90], [40,100,180]]

# === 6. Aggregation ===
data = np.random.randint(1, 100, (5, 4))
print("Sum all:", data.sum())
print("Row sums:", data.sum(axis=1))
print("Col means:", data.mean(axis=0))
print("Max per row:", data.max(axis=1))
print("Std per col:", data.std(axis=0))

# === 7. Linear Algebra ===
A = np.array([[2, 1], [5, 3]])
b = np.array([4, 7])

# Solve Ax = b
x = np.linalg.solve(A, b)
print("Solution:", x)   # [5. -6.]

# Determinant, Inverse, Eigenvalues
print("Det:", np.linalg.det(A))        # 1.0
print("Inverse:\n", np.linalg.inv(A))
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)

# === 8. Vectorized Operations vs Loops ===
import time
n = 1_000_000
a = np.random.randn(n)
b = np.random.randn(n)

# NumPy vectorized (perf_counter is the recommended clock for short timings)
start = time.perf_counter()
c = a + b
print(f"NumPy: {time.perf_counter()-start:.4f}s")  # ~0.002s (machine-dependent)

# Python loop
start = time.perf_counter()
c_loop = [a[i]+b[i] for i in range(n)]
print(f"Loop: {time.perf_counter()-start:.4f}s")   # ~0.5s (roughly 250× slower)

# === 9-12. More Operations ===
# Sorting
arr = np.array([3, 1, 4, 1, 5, 9])
print("Sorted:", np.sort(arr))
print("Argsort:", np.argsort(arr))

# Stacking
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print("Vstack:\n", np.vstack([x, y]))
print("Hstack:", np.hstack([x, y]))
print("Column stack:\n", np.column_stack([x, y]))

# Where (conditional)
ages = np.array([15, 22, 17, 30, 12])
category = np.where(ages >= 18, 'Adult', 'Minor')
print(category)  # ['Minor' 'Adult' 'Minor' 'Adult' 'Minor']

# Unique values
states = np.array(['MH', 'KA', 'MH', 'TN', 'KA'])
unique, counts = np.unique(states, return_counts=True)
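# Convert NumPy scalars to plain Python types so the dict prints cleanly on any NumPy version
print({str(k): int(v) for k, v in zip(unique, counts)})  # {'KA': 2, 'MH': 2, 'TN': 1}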

# Random sampling (for ML)
np.random.seed(42)
indices = np.random.choice(100, size=20, replace=False)  # 20 random indices
bootstrap = np.random.choice(100, size=100, replace=True)  # Bootstrap sample

10.2 Pandas — 20+ Operations with Indian Data

Python
import pandas as pd
import numpy as np

# === DATA LOADING ===
# 1. CSV
# df = pd.read_csv('indian_census.csv')

# 2. JSON
# df = pd.read_json('nse_stocks.json')

# 3. Excel
# df = pd.read_excel('agriculture.xlsx', sheet_name='Wheat')

# 4. SQL
# from sqlalchemy import create_engine
# engine = create_engine('sqlite:///india_data.db')
# df = pd.read_sql('SELECT * FROM census', engine)

# 5. API / Web
# df = pd.read_html('https://en.wikipedia.org/wiki/States_of_India')[0]

# === Create Indian State GDP Dataset ===
gdp_data = {
    'State': ['Maharashtra', 'Tamil Nadu', 'Karnataka', 'Gujarat',
              'UP', 'West Bengal', 'Rajasthan', 'Telangana',
              'Kerala', 'AP', 'MP', 'Delhi'],
    'GDP_Lakh_Cr': [32.2, 19.4, 16.9, 16.7, 17.9, 12.5,
                     10.2, 11.5, 9.8, 10.1, 9.1, 9.0],
    'Population_Cr': [12.3, 7.7, 6.4, 6.4, 23.1, 9.7,
                       7.9, 3.7, 3.5, 5.3, 8.5, 1.9],
    'Literacy_%': [82.3, 80.1, 75.4, 78.0, 67.7, 76.3,
                    66.1, 72.8, 94.0, 67.0, 69.3, 86.2],
    'Region': ['West', 'South', 'South', 'West', 'North', 'East',
               'North', 'South', 'South', 'South', 'Central', 'North']
}
df = pd.DataFrame(gdp_data)

# === 20+ PANDAS OPERATIONS ===

# 1. Per-capita GDP
df['Per_Capita_Lakh'] = (df['GDP_Lakh_Cr'] / df['Population_Cr']).round(2)

# 2. Sort by GDP
print(df.sort_values('GDP_Lakh_Cr', ascending=False).head(5))

# 3. GroupBy region
regional = df.groupby('Region').agg({
    'GDP_Lakh_Cr': 'sum',
    'Population_Cr': 'sum',
    'Literacy_%': 'mean'
}).round(2)
print(regional)

# 4. Filter — high literacy states
literate = df[df['Literacy_%'] > 80]
print("High literacy:", literate['State'].tolist())

# 5. Rank
df['GDP_Rank'] = df['GDP_Lakh_Cr'].rank(ascending=False).astype(int)

# 6. String operations
df['State_Upper'] = df['State'].str.upper()
df['State_Len'] = df['State'].str.len()

# 7. Conditional column
df['Category'] = pd.cut(df['GDP_Lakh_Cr'],
    bins=[0, 10, 15, 20, 40],
    labels=['Small', 'Medium', 'Large', 'Mega'])

# 8. Merge with another dataset
it_data = pd.DataFrame({
    'State': ['Karnataka', 'Maharashtra', 'Tamil Nadu', 'Telangana'],
    'IT_Revenue_Cr': [5_00_000, 3_50_000, 1_80_000, 1_50_000]
})
merged = df.merge(it_data, on='State', how='left')
print(merged[['State', 'IT_Revenue_Cr']].dropna())

# 9. Pivot table
pivot = df.pivot_table(values='GDP_Lakh_Cr', index='Region',
                       aggfunc=['sum', 'mean', 'count'])
print(pivot)

# 10. Apply custom function
def development_index(row):
    return (row['Per_Capita_Lakh'] * 0.4 +
            row['Literacy_%'] * 0.01 * 0.6)

df['Dev_Index'] = df.apply(development_index, axis=1).round(3)
print(df.nlargest(5, 'Dev_Index')[['State', 'Dev_Index']])

10.3 Matplotlib & Seaborn — 15+ Plot Examples

Python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('seaborn-v0_8-darkgrid')

# === 1. Line Plot: India GDP Growth ===
years = np.arange(2015, 2026)
gdp = [2.1, 2.3, 2.7, 2.7, 2.9, 2.7, 3.2, 3.4, 3.6, 3.9, 4.2]
plt.figure(figsize=(10, 5))
plt.plot(years, gdp, 'o-', color='#059669', linewidth=2, markersize=8)
plt.title('India GDP Growth (Trillion USD)', fontsize=14, fontweight='bold')
plt.xlabel('Year'); plt.ylabel('GDP (Trillion $)')
plt.fill_between(years, gdp, alpha=0.2, color='#059669')
plt.tight_layout(); plt.savefig('india_gdp.png', dpi=150); plt.show()

# === 2. Bar Chart: Top 5 States by GDP ===
states = ['MH', 'TN', 'UP', 'KA', 'GJ']
gdp_vals = [32.2, 19.4, 17.9, 16.9, 16.7]
colors = ['#059669', '#0891b2', '#f59e0b', '#ef4444', '#8b5cf6']
plt.figure(figsize=(8, 5))
bars = plt.bar(states, gdp_vals, color=colors, edgecolor='white')
for bar, val in zip(bars, gdp_vals):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,
            f'₹{val}L Cr', ha='center', fontweight='bold')
plt.title('Top 5 Indian States by GDP'); plt.ylabel('GDP (₹ Lakh Crore)')
plt.tight_layout(); plt.show()

# === 3. Scatter Plot: Literacy vs Per-Capita GDP ===
literacy = [82.3, 80.1, 75.4, 78.0, 67.7, 94.0, 86.2]
per_cap = [2.62, 2.52, 2.64, 2.61, 0.78, 2.80, 4.74]
plt.figure(figsize=(8, 6))
plt.scatter(literacy, per_cap, s=150, c=per_cap, cmap='viridis',
           edgecolors='white', linewidth=2)
plt.colorbar(label='Per-Capita GDP (₹ Lakh)')
plt.xlabel('Literacy Rate (%)'); plt.ylabel('Per-Capita GDP')
plt.title('Literacy vs Per-Capita GDP (Indian States)')
plt.tight_layout(); plt.show()

# === 4. Histogram: CGPA Distribution ===
np.random.seed(42)
cgpas = np.random.normal(7.5, 1.2, 500).clip(4, 10)
plt.figure(figsize=(8, 5))
plt.hist(cgpas, bins=25, color='#0891b2', edgecolor='white', alpha=0.8)
plt.axvline(np.mean(cgpas), color='red', linestyle='--', label=f'Mean={np.mean(cgpas):.2f}')
plt.legend(); plt.xlabel('CGPA'); plt.ylabel('Frequency')
plt.title('CGPA Distribution (Indian Engineering College)')
plt.tight_layout(); plt.show()

# === 5. Heatmap: Correlation Matrix ===
np.random.seed(42)
df_corr = pd.DataFrame({
    'CGPA': np.random.normal(7.5, 1.2, 200),
    'Python': np.random.randint(40, 100, 200),
    'Math': np.random.randint(50, 100, 200),
    'Package_LPA': np.random.exponential(6, 200)
})
plt.figure(figsize=(8, 6))
sns.heatmap(df_corr.corr(), annot=True, cmap='coolwarm', center=0,
            fmt='.2f', square=True, linewidths=1)
plt.title('Feature Correlation Heatmap'); plt.tight_layout(); plt.show()

# === 6. Subplots: Multi-panel Dashboard ===
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Panel 1: Line
axes[0,0].plot(years, gdp, 'o-', color='#059669')
axes[0,0].set_title('GDP Growth')
# Panel 2: Bar
axes[0,1].bar(states, gdp_vals, color=colors)
axes[0,1].set_title('State GDP')
# Panel 3: Histogram
axes[1,0].hist(cgpas, bins=20, color='#0891b2', edgecolor='white')
axes[1,0].set_title('CGPA Distribution')
# Panel 4: Pie
regions = ['South', 'West', 'North', 'East', 'Central']
shares = [35, 28, 22, 10, 5]
axes[1,1].pie(shares, labels=regions, autopct='%1.1f%%',
            colors=['#059669','#0891b2','#f59e0b','#ef4444','#8b5cf6'])
axes[1,1].set_title('IT Revenue by Region')
plt.suptitle('India Economic Dashboard', fontsize=16, fontweight='bold')
plt.tight_layout(); plt.show()

# === 7. Box Plot ===
plt.figure(figsize=(8, 5))
branches = ['CSE', 'ECE', 'Mech', 'Civil', 'AI/ML']
data_box = [np.random.normal(loc, 2, 50) for loc in [12, 8, 7, 6, 15]]
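bp = plt.boxplot(data_box, patch_artist=True)
# Set tick labels separately: the boxplot `labels` keyword is deprecated in newer Matplotlib
plt.xticks(range(1, len(branches) + 1), branches)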
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
plt.title('Placement Package by Branch (LPA)')
plt.ylabel('Package (LPA)'); plt.tight_layout(); plt.show()

# === 8-15. More Seaborn Plots ===
# Sketches only: items 9-13 assume a student DataFrame `df` with the columns from
# Example 2; gdp_it, gdp_mfg, gdp_agri and literacy_rates are placeholder arrays.

# 8. Pair Plot
# sns.pairplot(df_corr, diag_kind='kde')

# 9. Violin Plot
# sns.violinplot(x='Branch', y='Package_LPA', data=df)

# 10. Count Plot (seaborn >= 0.13 expects hue when a palette is given)
# sns.countplot(x='State', hue='State', data=df, palette='viridis', legend=False)

# 11. Joint Plot
# sns.jointplot(x='CGPA', y='Package_LPA', data=df, kind='reg')

# 12. KDE Plot
# sns.kdeplot(data=df, x='CGPA', hue='Placed', fill=True)

# 13. Swarm Plot
# sns.swarmplot(x='Branch', y='CGPA', data=df)

# 14. Stacked Area Chart
# plt.stackplot(years, [gdp_it, gdp_mfg, gdp_agri])

# 15. Horizontal Bar Chart
# plt.barh(states, literacy_rates, color='#059669')

3.11 🧠 TensorFlow Implementation

11.1 Installation & GPU Setup

Bash
# CPU-only installation
pip install tensorflow

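# GPU installation (Linux/WSL2): the and-cuda extra installs matching CUDA and cuDNN
# libraries via pip; a compatible NVIDIA driver must already be present
pip install "tensorflow[and-cuda]"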

# Verify installation
python -c "import tensorflow as tf; print(tf.__version__)"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

11.2 Tensor Operations

Python
import tensorflow as tf
import numpy as np

# === Tensor Creation ===
# Constant tensors
a = tf.constant([1, 2, 3], dtype=tf.float32)
b = tf.constant([[1, 2], [3, 4]], dtype=tf.float32)
print(f"Shape: {a.shape}, Dtype: {a.dtype}")

# Special tensors
zeros = tf.zeros((3, 4))
ones = tf.ones((2, 3))
eye = tf.eye(3)
rand = tf.random.normal((3, 3), mean=0, stddev=1)
uniform = tf.random.uniform((2, 5), minval=0, maxval=10)

# Variable tensors (trainable parameters)
weights = tf.Variable(tf.random.normal((784, 128)))
bias = tf.Variable(tf.zeros(128))

# === Tensor Operations ===
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.constant([[5.0, 6.0], [7.0, 8.0]])

print("Add:", tf.add(x, y))
print("Multiply:", tf.multiply(x, y))    # Element-wise
print("MatMul:", tf.matmul(x, y))        # Matrix multiplication
print("Reduce Sum:", tf.reduce_sum(x))
print("Reduce Mean:", tf.reduce_mean(x))

# === Eager Execution (default in TF2) ===
print("Eager mode:", tf.executing_eagerly())  # True

# Automatic differentiation with GradientTape
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 2 * x + 1   # y = x² + 2x + 1
dy_dx = tape.gradient(y, x)
print(f"dy/dx at x=3: {dy_dx.numpy()}")  # 8.0 (2x + 2 = 8)

# === NumPy ↔ TensorFlow Conversion ===
np_arr = np.array([[1, 2], [3, 4]])
tensor = tf.convert_to_tensor(np_arr, dtype=tf.float32)
back_to_np = tensor.numpy()

# === GPU Check ===
print("GPUs:", tf.config.list_physical_devices('GPU'))
if tf.config.list_physical_devices('GPU'):
    with tf.device('/GPU:0'):
        large = tf.random.normal((10000, 10000))
        result = tf.matmul(large, large)
    print("GPU computation done!")
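
GradientTape becomes more useful once the gradient is fed back into an update. As a minimal sketch, the loop below runs plain gradient descent on the same function y = x² + 2x + 1 = (x + 1)², whose minimum is at x = −1. The learning rate and step count are assumed values chosen so the iteration converges quickly.

```python
import tensorflow as tf

x = tf.Variable(3.0)          # starting point
learning_rate = 0.1           # assumed step size

for step in range(100):
    with tf.GradientTape() as tape:
        y = x ** 2 + 2 * x + 1          # (x + 1)², minimum at x = -1
    grad = tape.gradient(y, x)          # dy/dx = 2x + 2
    x.assign_sub(learning_rate * grad)  # x ← x - lr * grad

print(f"x after 100 steps: {x.numpy():.4f}")
print(f"y at that point:   {(x ** 2 + 2 * x + 1).numpy():.6f}")
```

Each step shrinks the distance to −1 by a factor of 0.8 (since x − 0.1·(2x + 2) = 0.8x − 0.2), which is the same mechanism Keras optimizers apply to millions of weights at once.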

3.12 ⚙️ Scikit-Learn Implementation

12.1 The Estimator API

Python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# === Simulated Indian Housing Price Dataset ===
np.random.seed(42)
n = 500
area = np.random.uniform(500, 3000, n)          # sq ft
bedrooms = np.random.randint(1, 6, n)
distance_km = np.random.uniform(1, 50, n)        # from city center
floor_num = np.random.randint(1, 25, n)

# Price formula: base + area*rate - distance_penalty + bedroom_bonus + noise
price_lakhs = (20 + area * 0.03 - distance_km * 0.8 +
               bedrooms * 5 + np.random.normal(0, 5, n))

X = np.column_stack([area, bedrooms, distance_km, floor_num])
y = price_lakhs

# === Split Data ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

# === Pipeline: Scaler → Linear Regression ===
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R² Score: {r2:.4f}")
print(f"RMSE: {mse**0.5:.2f} lakhs")

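# === Feature Importance ===
# Coefficients are per one standard deviation of each feature,
# because the pipeline scales X before the regression.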
model = pipeline.named_steps['regressor']
feature_names = ['Area_sqft', 'Bedrooms', 'Distance_km', 'Floor']
for name, coef in zip(feature_names, model.coef_):
    print(f"  {name}: {coef:.4f}")
print(f"  Intercept: {model.intercept_:.4f}")

# === Predict New House ===
new_house = np.array([[1500, 3, 10, 5]])  # 1500sqft, 3BHK, 10km, 5th floor
predicted_price = pipeline.predict(new_house)
print(f"\nPredicted price: ₹{predicted_price[0]:.2f} Lakhs")

12.2 Complete Preprocessing Pipeline

Python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

# Define column types
numeric_features = ['Area_sqft', 'Bedrooms', 'Distance_km']
categorical_features = ['City', 'Furnishing']

# Numeric pipeline: impute → scale
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline: impute → one-hot encode
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combined preprocessor
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline: preprocess → model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Usage:
# full_pipeline.fit(X_train, y_train)
# predictions = full_pipeline.predict(X_test)

3.13 🇮🇳 Indian Case Studies

Case Study 1: NSE/BSE Stock Data Analysis

Python
import pandas as pd
import numpy as np

# Simulated NSE stock data for demonstration
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=252, freq='B')  # Business days

nse_data = pd.DataFrame({
    'Date': dates,
    'RELIANCE': 2500 + np.random.normal(0, 30, 252).cumsum(),
    'TCS': 3600 + np.random.normal(0, 25, 252).cumsum(),
    'INFY': 1500 + np.random.normal(0, 20, 252).cumsum(),
    'HDFC': 1600 + np.random.normal(0, 22, 252).cumsum(),
    'Volume_Lakhs': np.random.randint(50, 500, 252)
})
nse_data.set_index('Date', inplace=True)

# Analysis
# 1. Daily returns
returns = nse_data[['RELIANCE','TCS','INFY','HDFC']].pct_change().dropna()
print("Average Daily Returns:\n", returns.mean().round(4))

# 2. Volatility (annualized std)
volatility = returns.std() * np.sqrt(252)
print("Annualized Volatility:\n", volatility.round(4))

# 3. Correlation between stocks
print("Stock Correlations:\n", returns.corr().round(3))

# 4. 20-day Moving Average
nse_data['RELIANCE_MA20'] = nse_data['RELIANCE'].rolling(20).mean()

# 5. Monthly resampling ('ME' = month-end; on pandas < 2.2 use 'M')
monthly = nse_data[['RELIANCE']].resample('ME').agg({
    'RELIANCE': ['first', 'last', 'max', 'min']
})
print(monthly.head())

Case Study 2: Census of India Data Analysis

Python
import pandas as pd

# Simulated Indian Census data
census = pd.DataFrame({
    'State': ['Uttar Pradesh','Maharashtra','Bihar','West Bengal','Madhya Pradesh',
              'Tamil Nadu','Rajasthan','Karnataka','Gujarat','Andhra Pradesh'],
    'Population_Cr': [23.1,12.3,12.4,9.7,8.5,7.7,7.9,6.4,6.4,5.3],
    'Decadal_Growth_%': [20.1,15.9,25.4,13.8,20.3,15.6,21.3,15.7,19.2,11.0],
    'Literacy_%': [67.7,82.3,61.8,76.3,69.3,80.1,66.1,75.4,78.0,67.0],
    'Sex_Ratio': [912,929,918,950,931,996,928,973,919,993],
    'Urban_%': [22.3,45.2,11.3,31.9,27.6,48.4,24.9,38.6,42.6,29.6]
})

# Key analyses
print("=== Statistical Summary ===")
print(census.describe())

print("\n=== Correlation: Literacy vs Sex Ratio ===")
print(f"r = {census['Literacy_%'].corr(census['Sex_Ratio']):.3f}")

print("\n=== States with Sex Ratio > 950 ===")
print(census[census['Sex_Ratio'] > 950][['State', 'Sex_Ratio', 'Literacy_%']])

# Insight: in this sample the three southern states (TN, KA, AP) are the only
# ones with a sex ratio above 950; literacy is more mixed (AP is at 67.0%)

Case Study 3: IMD Weather Data

Python
import pandas as pd

# Simulated IMD rainfall data (India Meteorological Department)
months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
cities_weather = pd.DataFrame({
    'Month': months,
    'Mumbai_mm': [0,1,0,1,20,530,840,585,340,75,15,3],
    'Delhi_mm': [20,18,13,7,25,54,210,233,117,14,4,8],
    'Chennai_mm': [36,10,8,15,35,48,93,119,118,267,356,152],
    'Bangalore_mm': [2,7,4,47,120,81,110,137,195,180,65,16]
})

# Total annual rainfall
for city in ['Mumbai_mm','Delhi_mm','Chennai_mm','Bangalore_mm']:
    print(f"{city}: {cities_weather[city].sum()} mm/year")

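# Wettest month per city (Chennai peaks in Nov, during the northeast monsoon)
for city in ['Mumbai_mm','Delhi_mm','Chennai_mm','Bangalore_mm']:
    peak_idx = cities_weather[city].idxmax()
    peak_month = cities_weather.loc[peak_idx, 'Month']
    print(f"{city} peak: {peak_month} ({cities_weather[city].max()} mm)")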

Case Study 4: Agricultural Production Data

Python
import pandas as pd

# Indian agricultural production (from data.gov.in patterns)
agri = pd.DataFrame({
    'Crop': ['Rice','Wheat','Maize','Sugarcane','Cotton','Pulses','Soybean'],
    'Production_MT': [130,110,33,400,35,27,12],   # Million tonnes
    'Area_Mha': [44,30,10,5,12,25,12],
    'Top_State': ['WB','UP','KA','UP','GJ','MP','MP'],
    'MSP_Rs_Quintal': [2183,2275,2090,315,6620,6600,4600]
})

# Yield = Production / Area
agri['Yield_T_per_Ha'] = (agri['Production_MT'] / agri['Area_Mha']).round(2)
print(agri.sort_values('Yield_T_per_Ha', ascending=False))
# Sugarcane has highest yield (80 T/Ha) — it's a high-input crop

3.14 🌍 Global Case Studies

Case Study 1: Google — TensorFlow at Scale

Google uses Python/TensorFlow across Search ranking, Gmail spam filters, Google Photos (object recognition), Google Translate (seq2seq models), and YouTube recommendations. Their Colab platform provides free GPU/TPU access to millions of developers worldwide.

Stack: Python 3.10 + TensorFlow 2.x + JAX + BigQuery + Vertex AI

Scale: Processes 8.5 billion searches/day, each touching 50+ ML models in the pipeline.

Case Study 2: Tesla — Real-Time Computer Vision

Tesla's Autopilot uses PyTorch for training vision models on 300+ petabytes of camera data. Their data pipeline: cameras → video decoding (C++) → labeling pipeline (Python/Pandas) → model training (PyTorch) → inference (C++/TensorRT). Python is the glue that connects all stages.

Case Study 3: Netflix — Recommendation Engine

Netflix saves $1B/year through its Python-powered recommendation system. Their stack includes: NumPy/SciPy for matrix factorization, Pandas for A/B test analysis, Scikit-Learn for collaborative filtering, and custom deep learning models in TensorFlow. They analyze 200+ million subscribers' viewing patterns.

Case Study 4: Amazon — Supply Chain ML

Amazon uses Python across: demand forecasting (Prophet + XGBoost), product recommendations (deep learning), Alexa NLP (PyTorch), warehouse robotics (ROS + Python), and fraud detection. Their SageMaker platform provides managed Jupyter notebooks for ML development.

Case Study 5: OpenAI — GPT & ChatGPT

ChatGPT and GPT-4 were built using Python + PyTorch. The training pipeline uses: tokenizers (Python), distributed training orchestration (Python), RLHF implementation (Python/NumPy), and the API server (Python/FastAPI). Much of the current AI boom runs on Python.

3.15 🚀 Startup Applications

Indian AI Startups Using Python

Startup	Domain	Python Usage	Funding
Razorpay	FinTech	Fraud detection with Scikit-Learn, payment analytics with Pandas	$740M
Ola	Mobility	Demand prediction, surge pricing, route optimization with NumPy	$3.8B
Swiggy	FoodTech	Delivery time prediction, restaurant recommendation engines	$3.6B
CRED	FinTech	Credit scoring models, user segmentation with Pandas/Sklearn	$800M
Meesho	E-commerce	Product categorization, image search with TensorFlow	$1.1B
PhonePe	Payments	Transaction anomaly detection, user behavior modeling	$12B valuation
Nykaa	Beauty/E-com	Personalized recommendations, review sentiment analysis	IPO 2021
Zerodha	Stock Trading	Market analytics, algo trading signals, Kite API (Python SDK)	Bootstrapped

3.16 🏛️ Government Applications

Python in Indian Government Projects

1. Aadhaar Data Processing

UIDAI processes 1.4 billion biometric records. Python is used for: data quality checks (Pandas), demographic deduplication algorithms, enrollment analytics dashboards, and API response monitoring. The authentication system handles 100M+ requests/day.

2. UPI Transaction Analytics

NPCI uses Python for analyzing 10+ billion monthly UPI transactions. Pandas DataFrames process transaction logs for: fraud pattern detection, merchant analytics, peak usage analysis, and regulatory reporting to RBI.

3. ISRO Satellite Data Processing

ISRO uses Python (NumPy/SciPy) for: satellite image processing, orbit calculations, remote sensing data analysis (Bhuvan platform), and weather prediction models. The Chandrayaan missions used Python scripts for trajectory optimization.

4. data.gov.in — Open Data Platform

Python
# Accessing Indian government open data
import pandas as pd

# Agricultural data from data.gov.in
# url = 'https://data.gov.in/resource/...'
# df = pd.read_csv(url)  # Direct CSV download

# Example: Loading Indian crop production data
# df = pd.read_csv('IndiaAgricultureCropProduction.csv')
# print(df.groupby('State')['Production'].sum().nlargest(10))

# Jupyter Notebooks are used by IIT researchers for
# government data analysis projects under NITI Aayog

# Google Colab setup for government data analysis:
# 1. Go to colab.research.google.com
# 2. New notebook → Runtime → Change runtime type → GPU
# 3. Upload CSV or connect to Google Drive
# 4. !pip install any-needed-library

5. CoWIN Vaccination Analytics

During COVID-19, the CoWIN platform used Python for: real-time vaccination tracking (200M+ doses/month), slot availability prediction, demographic analysis of vaccination coverage, and AEFI (adverse events) monitoring using Pandas time-series analysis.

3.17 🏭 Industry Applications

Python Data Science Stack by Industry

Industry	Use Case	Libraries	Indian Example
Banking	Credit scoring, fraud detection	Pandas, XGBoost, Sklearn	SBI, ICICI risk models
Healthcare	Medical image analysis, drug discovery	TensorFlow, OpenCV, RDKit	Practo, Niramai (breast cancer AI)
Retail	Demand forecasting, inventory optimization	Prophet, Pandas, Plotly	Flipkart, Reliance Retail
Telecom	Churn prediction, network optimization	Sklearn, PySpark, Keras	Jio, Airtel analytics
Manufacturing	Predictive maintenance, quality control	SciPy, NumPy, TensorFlow	Tata Steel, Mahindra
Agriculture	Crop prediction, soil analysis	Sklearn, GeoPandas, Rasterio	CropIn, DeHaat
Education	Adaptive learning, performance prediction	Pandas, Sklearn, NLP libs	BYJU'S, Unacademy
Logistics	Route optimization, ETA prediction	NetworkX, OR-Tools, Sklearn	Delhivery, BlueDart

3.18 🛠️ Mini Projects

Mini Project 1: Indian Census EDA Dashboard

Python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# === MINI PROJECT: Indian State Census Dashboard ===

# 1. Create comprehensive dataset
census = pd.DataFrame({
    'State': ['UP','MH','Bihar','WB','MP','TN','RJ','KA','GJ','AP',
              'KL','TG','Delhi','Punjab','Haryana'],
    'Population_Cr': [23.1,12.3,12.4,9.7,8.5,7.7,7.9,6.4,6.4,5.3,
                       3.5,3.7,1.9,2.8,2.5],
    'Literacy': [67.7,82.3,61.8,76.3,69.3,80.1,66.1,75.4,78.0,67.0,
                  94.0,72.8,86.2,75.8,75.6],
    'Sex_Ratio': [912,929,918,950,931,996,928,973,919,993,
                   1084,988,868,895,879],
    'GDP_Lakh_Cr': [17.9,32.2,6.1,12.5,9.1,19.4,10.2,16.9,16.7,10.1,
                     9.8,11.5,9.0,5.5,8.2],
    'Region': ['North','West','East','East','Central','South','North','South',
               'West','South','South','South','North','North','North']
})
census['Per_Capita'] = (census['GDP_Lakh_Cr']/census['Population_Cr']).round(2)

# 2. Create 6-panel dashboard
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# No emoji in the title: Matplotlib's default fonts have no flag glyphs
fig.suptitle('Indian Census EDA Dashboard', fontsize=20, fontweight='bold')

# Panel 1: Population bar chart
census_sorted = census.sort_values('Population_Cr', ascending=True)
axes[0,0].barh(census_sorted['State'], census_sorted['Population_Cr'],
             color='#059669')
axes[0,0].set_title('Population (Crores)')

# Panel 2: Literacy distribution
axes[0,1].hist(census['Literacy'], bins=10, color='#0891b2', edgecolor='white')
axes[0,1].axvline(census['Literacy'].mean(), color='red', linestyle='--')
axes[0,1].set_title('Literacy Rate Distribution')

# Panel 3: Scatter — Literacy vs Sex Ratio
scatter = axes[0,2].scatter(census['Literacy'], census['Sex_Ratio'],
    s=census['Population_Cr']*20, c=census['Per_Capita'],
    cmap='viridis', alpha=0.8, edgecolors='white')
axes[0,2].set_xlabel('Literacy %'); axes[0,2].set_ylabel('Sex Ratio')
axes[0,2].set_title('Literacy vs Sex Ratio')
fig.colorbar(scatter, ax=axes[0,2], label='Per-capita GDP (Rs lakh)')

# Panel 4: Region-wise GDP
regional_gdp = census.groupby('Region')['GDP_Lakh_Cr'].sum()
axes[1,0].pie(regional_gdp, labels=regional_gdp.index, autopct='%1.1f%%',
            colors=['#059669','#0891b2','#f59e0b','#ef4444','#8b5cf6'])
axes[1,0].set_title('GDP Share by Region')

# Panel 5: Box plot — Per-Capita by Region
regions = census['Region'].unique()
box_data = [census[census['Region']==r]['Per_Capita'].values for r in regions]
axes[1,1].boxplot(box_data)
axes[1,1].set_xticklabels(regions)  # boxplot's labels= is deprecated in Matplotlib 3.9+
axes[1,1].set_title('Per-Capita GDP by Region')

# Panel 6: Correlation heatmap
corr = census[['Population_Cr','Literacy','Sex_Ratio','GDP_Lakh_Cr','Per_Capita']].corr()
sns.heatmap(corr, annot=True, cmap='RdYlGn', center=0, ax=axes[1,2], fmt='.2f')
axes[1,2].set_title('Feature Correlations')

plt.tight_layout()
plt.savefig('census_dashboard.png', dpi=150, bbox_inches='tight')
plt.show()

# 3. Key Insights Report
print("="*50)
print("KEY INSIGHTS FROM INDIAN CENSUS EDA")
print("="*50)
print(f"1. Highest literacy: {census.loc[census['Literacy'].idxmax(),'State']} ({census['Literacy'].max()}%)")
print(f"2. Lowest sex ratio: {census.loc[census['Sex_Ratio'].idxmin(),'State']} ({census['Sex_Ratio'].min()})")
print(f"3. Literacy-Sex Ratio correlation: {census['Literacy'].corr(census['Sex_Ratio']):.3f}")
print(f"4. Highest per-capita GDP: {census.loc[census['Per_Capita'].idxmax(),'State']}")
print(f"5. South region avg literacy: {census[census['Region']=='South']['Literacy'].mean():.1f}%")

Mini Project 2: Stock Market Data Analyzer

Python
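import zlib

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# === MINI PROJECT: NSE Stock Analyzer ===

class StockAnalyzer:
    """Analyze Indian stock market data using Python data science stack."""
    
    def __init__(self, ticker, days=252):
        self.ticker = ticker
        # crc32 gives the same seed on every run; the built-in hash() of a str
        # is randomized per process, so it would not be reproducible.
        np.random.seed(zlib.crc32(ticker.encode()))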
        self.dates = pd.date_range('2024-01-01', periods=days, freq='B')
        base = {'RELIANCE':2500, 'TCS':3600, 'INFY':1500, 'HDFC':1600}
        start = base.get(ticker, 1000)
        self.data = pd.DataFrame({
            'Date': self.dates,
            'Close': start + np.random.normal(0.5, 25, days).cumsum(),
            'Volume': np.random.randint(1_00_000, 50_00_000, days)
        }).set_index('Date')
    
    def daily_returns(self):
        self.data['Returns'] = self.data['Close'].pct_change()
        return self.data['Returns'].describe()
    
    def moving_averages(self):
        self.data['MA20'] = self.data['Close'].rolling(20).mean()
        self.data['MA50'] = self.data['Close'].rolling(50).mean()
    
    def volatility(self):
        if 'Returns' not in self.data.columns:
            self.daily_returns()
        return self.data['Returns'].std() * np.sqrt(252)
    
    def plot_dashboard(self):
        self.daily_returns()
        self.moving_averages()
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        fig.suptitle(f'{self.ticker} — Stock Analysis Dashboard', fontsize=16)
        
        # Price + Moving Averages
        axes[0,0].plot(self.data['Close'], label='Close', alpha=0.7)
        axes[0,0].plot(self.data['MA20'], label='MA20', linewidth=2)
        axes[0,0].plot(self.data['MA50'], label='MA50', linewidth=2)
        axes[0,0].legend(); axes[0,0].set_title('Price & Moving Averages')
        
        # Returns distribution
        axes[0,1].hist(self.data['Returns'].dropna(), bins=30, color='#0891b2',
                     edgecolor='white')
        axes[0,1].set_title('Daily Returns Distribution')
        
        # Volume
        axes[1,0].bar(self.data.index, self.data['Volume'], width=1,
                    color='#059669', alpha=0.6)
        axes[1,0].set_title('Trading Volume')
        
        # Cumulative returns
        cum_ret = (1 + self.data['Returns']).cumprod()
        axes[1,1].plot(cum_ret, color='#f59e0b', linewidth=2)
        axes[1,1].set_title('Cumulative Returns')
        
        plt.tight_layout(); plt.show()

# Usage
analyzer = StockAnalyzer('RELIANCE')
print(analyzer.daily_returns())
print(f"Annualized Volatility: {analyzer.volatility():.2%}")
analyzer.plot_dashboard()

Mini Project 3: Jupyter/Colab EDA Template

Python
# === Reusable EDA Function Library ===
# Save as eda_utils.py and import in any Jupyter notebook

def quick_eda(df, target_col=None):
    """Perform comprehensive EDA on any DataFrame."""
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    print("="*60)
    print("AUTOMATED EDA REPORT")
    print("="*60)
    
    # 1. Shape and types
    print(f"\n📊 Shape: {df.shape[0]} rows × {df.shape[1]} columns")
    print(f"\n📋 Data Types:\n{df.dtypes.value_counts()}")
    
    # 2. Missing values
    missing = df.isnull().sum()
    if missing.sum() > 0:
        print(f"\n⚠️ Missing Values:\n{missing[missing > 0]}")
    else:
        print("\n✅ No missing values!")
    
    # 3. Numeric summary
    print(f"\n📈 Numeric Summary:\n{df.describe().round(2)}")
    
    # 4. Categorical summary
    cat_cols = df.select_dtypes(include=['object','category']).columns
    for col in cat_cols:
        print(f"\n🏷️ {col}: {df[col].nunique()} unique values")
        print(df[col].value_counts().head(5))
    
    # 5. Auto-generate plots
    num_cols = df.select_dtypes(include=['number']).columns
    n = len(num_cols)
    if n > 0:
        fig, axes = plt.subplots(1, min(n, 4), figsize=(5*min(n,4), 4))
        if n == 1: axes = [axes]
        for i, col in enumerate(num_cols[:4]):
            axes[i].hist(df[col].dropna(), bins=20, color='#059669', edgecolor='white')
            axes[i].set_title(col)
        plt.tight_layout(); plt.show()
    
    return {'shape': df.shape, 'missing': missing.sum(), 'num_cols': len(num_cols)}

# Usage: quick_eda(pd.read_csv('any_dataset.csv'))

3.19 📝 End-of-Chapter Exercises

Easy E1: Create a NumPy array of shape (5, 4) with random integers between 10 and 99. Find the row-wise maximum and column-wise mean.

Easy E2: Write a list comprehension that generates all Pythagorean triplets (a, b, c) where a, b, c ≤ 50.

Easy E3: Load the Iris dataset from Scikit-Learn and display the first 10 rows using Pandas. Print the shape and dtypes.

Easy E4: Create a Python dictionary mapping Indian state abbreviations to full names. Convert it to a Pandas Series.

Easy E5: Write a function that takes a list of marks and returns the mean, median, mode, and standard deviation.

Medium E6: Use NumPy broadcasting to add a bias vector [1, 2, 3] to every row of a (100, 3) matrix. Verify the shape.

Medium E7: Create a Pandas DataFrame with 500 random student records (name, state, marks, branch). Group by state and find the top-scoring student per state.

Medium E8: Write a function that performs Min-Max normalization on a NumPy array column-wise. Verify that all values are between 0 and 1.

Medium E9: Create a Matplotlib figure with 4 subplots showing: (a) sine wave, (b) cosine wave, (c) both overlaid, (d) their product.

Medium E10: Implement a simple linear regression from scratch using NumPy (closed-form solution: θ = (XᵀX)⁻¹Xᵀy). Test on a synthetic dataset.

Medium E11: Merge two DataFrames — one with Indian state populations and another with GDP data — using inner, left, right, and outer joins. Compare results.

Medium E12: Create a Seaborn pair plot for the Iris dataset. Color-code by species. Identify which features separate classes best.

Medium E13: Use Pandas to load a CSV, handle missing values (3 strategies: drop, mean fill, forward fill), and compare the effects on the dataset statistics.

Medium E14: Write a TensorFlow program that creates two random matrices and computes their product, sum, and element-wise multiplication. Print execution times vs NumPy.

Medium E15: Build a Scikit-Learn pipeline with StandardScaler → PCA(n_components=2) → LogisticRegression. Train on Iris dataset and report accuracy.

Hard E16: Implement the PageRank algorithm using NumPy matrix operations. Create a 10-node web graph and find page rankings.

Hard E17: Download NSE historical data for 5 stocks (simulate if needed). Compute: Sharpe ratio, beta, Sortino ratio, max drawdown for each stock.

Hard E18: Build a complete EDA pipeline that auto-generates: summary stats, missing data report, correlation matrix, distribution plots, and outlier detection for any input DataFrame.

Hard E19: Compare execution speed of: Python loop, list comprehension, NumPy vectorization, and TensorFlow operation for computing the dot product of two 10M-element vectors. Create a timing bar chart.

Hard E20: Use web scraping (BeautifulSoup + requests) to extract a table from Wikipedia's "List of Indian states by GDP" page. Clean and analyze the data with Pandas.

Hard E21: Build a class-based ML experiment tracker that logs: dataset info, model parameters, metrics, and plots to a structured directory. Use OOP principles.

Hard E22: Implement k-means clustering from scratch using only NumPy. Apply to synthetic 2D data and visualize cluster assignments at each iteration.

3.20 ❓ Multiple Choice Questions

Q1: What is the primary reason NumPy arrays are faster than Python lists for numerical operations?

(a) NumPy uses Python's built-in optimizer (b) NumPy arrays are stored contiguously in memory and operations use C-level loops with SIMD (c) NumPy arrays use GPU acceleration by default (d) NumPy compiles Python to machine code

✅ (b) NumPy arrays are homogeneous, contiguous blocks of memory. Operations dispatch to optimized C/Fortran routines that leverage SIMD (Single Instruction, Multiple Data) CPU instructions, processing 4-8 elements simultaneously.

Q2: In Pandas, what does df.groupby('State')['Sales'].transform('mean') return?

(a) A DataFrame with one row per state (b) A Series with the same index as the original DataFrame (c) A scalar value (d) A dictionary of state→mean pairs

✅ (b) transform() returns a result with the same shape as the input, broadcasting each group's aggregate back to all members. Unlike agg() which reduces rows, transform() preserves the original index.
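
The difference between reducing and broadcasting can be seen in a small sketch. The DataFrame and its values are made up for illustration; only `groupby`, `mean`, and `transform` come from the question.

```python
import pandas as pd

# Hypothetical sales records; column names mirror the question.
df = pd.DataFrame({
    "State": ["KA", "KA", "TN", "TN", "TN"],
    "Sales": [100, 300, 50, 150, 100],
})

# Aggregation reduces: one value per group, indexed by State.
per_state = df.groupby("State")["Sales"].mean()

# transform broadcasts each group's mean back onto every original row.
state_mean = df.groupby("State")["Sales"].transform("mean")

# Typical use: compare each row with the average of its own group.
df["Sales_vs_state_mean"] = df["Sales"] - state_mean

print(per_state)
print(df)
```

Because `state_mean` keeps the original index, it can be assigned or subtracted column-wise without a merge.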

Q3: Which NumPy broadcasting rule is INCORRECT?

(a) Arrays with shape (3,4) and (4,) can be broadcast (b) Arrays with shape (3,4) and (3,1) can be broadcast (c) Arrays with shape (3,4) and (2,4) can be broadcast (d) Arrays with shape (1,4) and (3,1) can be broadcast

✅ (c) Shape (3,4) and (2,4) cannot be broadcast because dimension 0 has sizes 3 and 2, neither of which is 1. Broadcasting requires each dimension to either match or be 1.

Q4: What does the Scikit-Learn method fit_transform(X) do?

(a) Only fits the model (b) Fits the transformer on X, then transforms X using the learned parameters (c) Transforms X without fitting (d) Predicts labels for X

✅ (b) fit_transform(X) is equivalent to fit(X).transform(X) but potentially more efficient. It learns parameters (e.g., mean/std for StandardScaler) from X and then applies the transformation.

Q5: In TensorFlow 2.x, what is eager execution?

(a) A mode where operations build a static computation graph first (b) A mode where operations execute immediately and return concrete values (c) A GPU-specific optimization mode (d) A training-only acceleration feature

✅ (b) Eager execution (default in TF2) means operations evaluate immediately without building a graph. This makes debugging easier and feels more "Pythonic." Use @tf.function to opt into graph mode for production performance.

Q6: What is the output of np.array([1,2,3]) * np.array([1,2,3])?

(a) 14 (dot product) (b) array([1, 4, 9]) (element-wise) (c) A 3×3 matrix (outer product) (d) Error: ambiguous operation

✅ (b) The * operator performs element-wise multiplication (Hadamard product). For dot product, use np.dot() or @. For outer product, use np.outer().
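
The three products named in the answer, side by side, as a minimal sketch on the same vector used in the question:

```python
import numpy as np

v = np.array([1, 2, 3])

elementwise = v * v        # Hadamard product, shape (3,)
dot = v @ v                # inner product, a scalar (same as np.dot(v, v))
outer = np.outer(v, v)     # outer product, shape (3, 3)

print(elementwise)
print(dot)
print(outer)
```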

Q7: Which Pandas method is best for handling missing values by filling with the column mean?

(a) df.dropna() (b) df.fillna(df.mean()) (c) df.interpolate() (d) df.replace(np.nan, 0)

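✅ (b) df.fillna(df.mean()) replaces NaN values in each numeric column with that column's mean. df.mean() calculates column-wise means, and fillna() aligns them to the columns by label. If the DataFrame also contains non-numeric columns, use df.mean(numeric_only=True); since pandas 2.0, df.mean() raises a TypeError on columns it cannot average.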
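
A minimal sketch of mean-filling on a mixed-type DataFrame. The data is hypothetical, and `numeric_only=True` is added here because recent pandas versions refuse to average a text column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "City": ["Pune", "Delhi", None, "Kochi"],
    "Marks": [80.0, np.nan, 60.0, 70.0],
    "Age": [21.0, 23.0, np.nan, 22.0],
})

# Column-wise means of the numeric columns only.
col_means = df.mean(numeric_only=True)

# fillna matches the means to columns by label; City is left untouched.
filled = df.fillna(col_means)

print(filled)
```

The missing city stays missing, which is usually what you want: a mean has no meaning for a text column.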

Q8: In Matplotlib, what does plt.tight_layout() do?

(a) Makes the plot smaller (b) Automatically adjusts subplot parameters to prevent overlapping (c) Removes axis labels (d) Sets the figure DPI to maximum

✅ (b) tight_layout() automatically adjusts padding between and around subplots so that labels, titles, and tick marks don't overlap. Always call it before show() or savefig().

Q9: What does df.pivot_table(values='Sales', index='State', columns='Quarter', aggfunc='sum') create?

(a) A transposed DataFrame (b) A cross-tabulation with States as rows, Quarters as columns, and sum of Sales as values (c) A filtered DataFrame (d) A melted DataFrame

✅ (b) pivot_table() creates a spreadsheet-style pivot table. Each cell contains the aggregate (sum) of Sales for that specific State-Quarter combination. Missing combinations show NaN.
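
As a small illustration with made-up sales rows, the call from the question produces one row per state and one column per quarter. The `fill_value=0` variant is an optional extra, not part of the question.

```python
import pandas as pd

sales = pd.DataFrame({
    "State":   ["KA", "KA", "KA", "TN", "TN"],
    "Quarter": ["Q1", "Q1", "Q2", "Q1", "Q3"],
    "Sales":   [10, 20, 30, 40, 50],
})

# Two KA/Q1 rows are summed into one cell; absent combinations become NaN.
table = sales.pivot_table(values="Sales", index="State",
                          columns="Quarter", aggfunc="sum")

# Optional: show 0 instead of NaN for combinations with no rows.
table_zero = sales.pivot_table(values="Sales", index="State",
                               columns="Quarter", aggfunc="sum",
                               fill_value=0)

print(table)
print(table_zero)
```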

Q10: Python's GIL (Global Interpreter Lock) prevents true multi-threading for CPU-bound tasks. How do NumPy and TensorFlow work around this?

(a) They remove the GIL entirely (b) They release the GIL during C-extension calls and use internal multi-threading (c) They use multiprocessing instead (d) They use async/await

✅ (b) NumPy's C extensions release the GIL during computation, allowing internal BLAS/LAPACK routines to use multiple threads. Similarly, TensorFlow's C++ backend runs multi-threaded operations independent of Python's GIL.

3.21 💼 Interview Questions

IQ1: Explain the difference between loc and iloc in Pandas.

Answer: loc uses label-based indexing (e.g., df.loc['row_label', 'col_name']), while iloc uses integer position-based indexing (e.g., df.iloc[0, 2]). Key difference: loc includes the end of the slice, iloc excludes it (like Python ranges).
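
The slicing difference is easiest to see on a tiny example. The state codes and populations below are illustrative values only.

```python
import pandas as pd

df = pd.DataFrame(
    {"Population_Cr": [23.1, 12.3, 12.4, 9.7]},
    index=["UP", "MH", "BR", "WB"],
)

by_label = df.loc["UP":"BR"]    # label slice: the end label "BR" is included
by_position = df.iloc[0:2]      # position slice: position 2 is excluded

# Single-cell access, once by label and once by position.
mh_by_label = df.loc["MH", "Population_Cr"]
mh_by_position = df.iloc[1, 0]

print(by_label)
print(by_position)
```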

IQ2: Why is vectorization preferred over loops in data science?

Answer: Vectorized operations (NumPy/Pandas) leverage: (1) contiguous memory access (cache-friendly), (2) SIMD CPU instructions that process multiple elements per instruction, (3) C-level loops that avoid Python's per-element interpreter overhead (often 10-100× faster than a pure-Python loop), (4) optimized BLAS/LAPACK libraries for linear algebra.
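
A rough way to see the gap on your own machine is to time the same computation both ways. This is only a sketch: the array size is arbitrary, and the measured ratio depends on hardware and NumPy build, so no specific speedup is claimed.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(200_000)

def sum_squares_loop(values):
    """Pure-Python loop: one interpreter round trip per element."""
    total = 0.0
    for v in values:
        total += v * v
    return total

t0 = time.perf_counter()
loop_result = sum_squares_loop(x)
t1 = time.perf_counter()
vec_result = float(np.dot(x, x))   # one call into compiled code
t2 = time.perf_counter()

print(f"loop: {t1 - t0:.4f} s, vectorized: {t2 - t1:.4f} s")
```

Both paths compute the same sum of squares; only the place where the loop runs differs.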

IQ3: How do you handle a dataset with 40% missing values in a column?

Answer: Strategy depends on context: (1) Check why the values are missing (MCAR, Missing Completely at Random, is the easiest case); dropping the column is usually worth considering only when well over 50% is missing, (2) Create a binary "is_missing" indicator feature, (3) Use model-based imputation (KNN, IterativeImputer), (4) For categorical: add an "Unknown" category, (5) For time-series: forward/backward fill. Never drop 40% of the data without analysis.
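
Two of these strategies, the missing-value indicator and the "Unknown" category, can be sketched in a few lines. The columns and values are hypothetical, and median imputation stands in for the model-based imputers mentioned above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [4.5, np.nan, 7.0, np.nan, 5.5],       # hypothetical, in lakhs
    "city":   ["Pune", None, "Delhi", "Pune", None],
})

# Keep the fact that a value was missing as its own feature.
df["income_missing"] = df["income"].isna().astype(int)

# Simple stand-in for imputation: fill with the median of observed values.
df["income"] = df["income"].fillna(df["income"].median())

# For categoricals, an explicit level avoids inventing a city.
df["city"] = df["city"].fillna("Unknown")

print(df)
```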

IQ4: What is data leakage and how do Scikit-Learn pipelines prevent it?

Answer: Data leakage occurs when test data information "leaks" into training. Example: scaling the entire dataset before train/test split means the scaler learned test statistics. Pipelines prevent this by ensuring fit() is called only on training data, and transform() applies the same learned parameters to test data.
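
The scaling example can be checked directly by comparing the statistics each scaler learned. This is a sketch on synthetic data; the variable names and the data-generating formula are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Leaky: the scaler sees the test rows, so its mean/std include test statistics.
leaky_scaler = StandardScaler().fit(X)

# Safe: the pipeline fits the scaler on the training rows only.
pipe = Pipeline([("scaler", StandardScaler()), ("reg", LinearRegression())])
pipe.fit(X_train, y_train)
safe_scaler = pipe.named_steps["scaler"]

print("Train-only mean:", safe_scaler.mean_)
print("Full-data mean: ", leaky_scaler.mean_)
print("Test R^2:", pipe.score(X_test, y_test))
```

On clean synthetic data the leak barely changes the score; the point is that the pipeline's scaler never saw the test rows.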

IQ5: Explain broadcasting in NumPy with an example.

Answer: Broadcasting allows arithmetic between arrays of different shapes by automatically expanding the smaller array. Rules: (1) Align shapes from the right, (2) Dimensions must match or be 1, (3) Size-1 dimensions are "stretched." Example: shape (3,4) + shape (4,) → (4,) becomes (1,4) → stretched to (3,4).
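
The three rules can be written out by hand and compared with what NumPy does. `broadcast_shape` is a hypothetical helper written for this illustration, not a NumPy function.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or None if the shapes are incompatible."""
    # Rule 1: align from the right by padding the shorter shape with leading 1s.
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for da, db in zip(a, b):
        # Rule 2: sizes must match, or one of them must be 1.
        if da != db and da != 1 and db != 1:
            return None
        # Rule 3: a size-1 dimension is stretched to the other size.
        result.append(db if da == 1 else da)
    return tuple(result)

print(broadcast_shape((3, 4), (4,)))    # (3, 4)
print(broadcast_shape((3, 4), (2, 4)))  # None
print(broadcast_shape((1, 4), (3, 1)))  # (3, 4)

# The same rules applied by NumPy itself.
matrix = np.arange(12).reshape(3, 4)
row = np.array([10, 20, 30, 40])
shifted = matrix + row    # row is treated as shape (1, 4), then stretched
```

No data is copied during the stretch; NumPy reuses the single row for each of the three rows of `matrix`.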

IQ6: How would you optimize a Pandas operation that takes too long on 10M rows?

Answer: (1) Use vectorized operations instead of apply(), (2) Use appropriate dtypes (category for repeated strings, int32 instead of int64), (3) Use pd.eval() or query() for complex expressions, (4) Consider chunked reading with the chunksize parameter, (5) Switch to Polars or Dask for larger-than-memory (out-of-core) computation, (6) Use numba or swifter for custom functions.
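
Points (1) and (2) can be tried on a synthetic table. The column names and sizes are assumptions for illustration, and the exact memory saving depends on the pandas version and its string storage.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "state": rng.choice(["KA", "TN", "MH", "UP"], size=n),
    "units": rng.integers(0, 1000, size=n),
    "price": rng.uniform(10, 500, size=n),
})

before = df.memory_usage(deep=True).sum()

# Narrower dtypes: few distinct strings -> category, small ints -> int32.
slim = df.astype({"state": "category", "units": "int32"})
after = slim.memory_usage(deep=True).sum()

# Vectorized column arithmetic instead of df.apply(..., axis=1).
slim["revenue"] = slim["units"] * slim["price"]

print(f"Memory: {before / 1e6:.2f} MB -> {after / 1e6:.2f} MB")
```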

IQ7: What is the difference between tf.constant and tf.Variable?

Answer: tf.constant creates an immutable tensor (cannot be changed after creation) — used for input data and fixed hyperparameters. tf.Variable creates a mutable tensor — used for trainable parameters like weights and biases that are updated during gradient descent.

IQ8: Describe the Scikit-Learn estimator API. What are the core methods?

Answer: All Scikit-Learn estimators follow a consistent API: fit(X, y) learns parameters from training data, predict(X) makes predictions (classifiers/regressors), transform(X) transforms data (preprocessors), fit_transform(X) combines fit+transform, score(X, y) evaluates performance. This consistency allows pipeline chaining.
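
Because the API is uniform, a small custom class that implements `fit` and `transform` can sit in a pipeline next to built-in estimators. `MeanCenterer` below is a hypothetical transformer written for this sketch; the toy data is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        # Learned parameters end with an underscore by scikit-learn convention.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])   # y = 2 * x0 + 1

# fit_transform is inherited from TransformerMixin.
centered = MeanCenterer().fit_transform(X)

# The custom step chains with a built-in regressor.
pipe = Pipeline([("center", MeanCenterer()), ("reg", LinearRegression())])
pipe.fit(X, y)
predictions = pipe.predict(X)
```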

IQ9: How do you perform feature engineering for an Indian housing price dataset?

Answer: (1) Create per-sqft price from total price/area (for EDA or locality-level aggregates only; a feature derived from the target leaks it when price is what you predict), (2) Extract floor category (ground, low, mid, high), (3) Distance to nearest metro station, (4) Encode locality as mean target encoding, (5) Age of building from year built, (6) Create boolean features: has_parking, has_gym, near_hospital, (7) Season of listing (festive season premium in India).
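
A few of these features can be derived in pandas as follows. The listings, column names, floor bins, and reference year are all assumptions chosen for illustration.

```python
import pandas as pd

houses = pd.DataFrame({
    "price_lakhs": [75.0, 120.0, 48.0, 200.0],
    "area_sqft":   [1000, 1500, 800, 2000],
    "floor":       [0, 3, 9, 18],
    "year_built":  [2005, 2015, 1998, 2021],
    "amenities":   ["parking", "parking,gym", "", "gym"],
})

reference_year = 2024  # assumption: fixed so the example is reproducible

# Rupees per sq ft. Derived from the target, so keep it for EDA,
# not as a model input when price is being predicted.
houses["price_per_sqft"] = houses["price_lakhs"] * 1e5 / houses["area_sqft"]

# Floor category; the bin edges are an illustrative choice.
houses["floor_category"] = pd.cut(
    houses["floor"], bins=[-1, 0, 4, 10, 100],
    labels=["ground", "low", "mid", "high"])

houses["building_age"] = reference_year - houses["year_built"]
houses["has_parking"] = houses["amenities"].str.contains("parking")
houses["has_gym"] = houses["amenities"].str.contains("gym")

print(houses)
```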

IQ10: What is the difference between deep copy and shallow copy in Python? Why does it matter for DataFrames?

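Answer: Plain assignment (df2 = df) makes no copy at all: both names refer to the same DataFrame, so a change made through one is visible through the other. A shallow copy (df.copy(deep=False)) creates a new DataFrame object that still references the same underlying data; in pandas versions without Copy-on-Write, modifying the data through one can change the other. A deep copy (df.copy(), the default) creates an entirely independent copy of the data and index. This matters in data science when you create modified versions of a dataset: take a deep copy first to avoid accidentally corrupting the original.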
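
The distinction between an alias, a shallow copy, and a deep copy is easiest to verify with plain Python lists, followed by the DataFrame case. The values are illustrative; only behaviour that does not depend on the pandas version is shown.

```python
import copy
import pandas as pd

# Plain Python: a shallow copy duplicates the outer list only.
marks = [[90, 85], [70, 60]]
alias = marks                  # same object, no copy at all
shallow = copy.copy(marks)     # new outer list, shared inner lists
deep = copy.deepcopy(marks)    # fully independent

marks[0][0] = 0                # visible through alias and shallow, not deep

# pandas: df.copy() is deep by default, so later edits do not reach the backup.
df = pd.DataFrame({"Literacy": [67.7, 82.3]})
same = df                      # another name for the same DataFrame
backup = df.copy()
df.loc[0, "Literacy"] = 0.0

print(shallow[0][0], deep[0][0])
print(same.loc[0, "Literacy"], backup.loc[0, "Literacy"])
```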

3.22 🔬 Research Problems

Research Problem 1: Benchmarking Python Data Science Libraries

Objective: Systematically compare the performance of Pandas vs Polars vs DuckDB vs PySpark for common EDA operations on Indian government datasets of varying sizes (100K, 1M, 10M, 100M rows).

Tasks: (1) Benchmark: read_csv, groupby-aggregate, join, sort, filter, and pivot operations, (2) Measure memory usage, CPU utilization, and wall-clock time, (3) Test with real data from data.gov.in, (4) Identify crossover points where one library becomes better than another, (5) Publish results with reproducible Jupyter notebooks.

Impact: Help Indian data scientists choose the right tool for their data scale. The results could be presented at venues such as PyCon India or submitted to the Journal of Statistical Software.

Research Problem 2: Auto-EDA for Indian Multilingual Datasets

Objective: Build an automated EDA tool that handles datasets with Indian language text (Hindi, Tamil, Telugu, etc.) alongside English and numeric data. Current auto-EDA tools (pandas-profiling, since renamed ydata-profiling, and sweetviz) fail on mixed-script data.

Challenges: Unicode handling for Devanagari and the scripts used by Dravidian languages (Tamil, Telugu, Kannada, Malayalam), mixed-language column detection, appropriate NLP preprocessing (tokenization differs by language), and culturally relevant statistical summaries (e.g., Indian number system: lakhs, crores).
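
Two of these challenges can be illustrated with the standard library alone. The helper names format_indian and scripts_in are hypothetical. The script check relies on the first word of each character's Unicode name, which is a simple heuristic rather than a full script classifier.

```python
import unicodedata


def format_indian(n: int) -> str:
    """Group digits Indian-style: the last three digits, then groups of two."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])


def scripts_in(text: str) -> set:
    """Return the scripts of the letters in `text`, e.g. {'DEVANAGARI', 'LATIN'}."""
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


print(format_indian(150000))        # 1,50,000 (one lakh fifty thousand)
print(format_indian(10000000))      # 1,00,00,000 (one crore)
print(scripts_in("नमस्ते India"))    # Devanagari and Latin letters in one value
```

A column could then be flagged as mixed-script when scripts_in returns more than one script for a meaningful share of its values. The threshold for "meaningful" is left open here.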

Research Problem 3: Energy-Efficient ML Pipeline Design

Objective: Measure and optimize the carbon footprint of Python ML training pipelines. Given India's growing AI compute, understanding energy consumption is critical.

Approach: Use the CodeCarbon library to measure CO₂ emissions per training run. Compare: (1) NumPy vs CuPy (GPU) for preprocessing, (2) TensorFlow vs PyTorch energy efficiency, (3) the impact of batch size, model size, and data pipeline design on energy use. Propose a "green ML" scoring system for Indian AI labs.
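
The arithmetic behind such a comparison can be sketched without any measurement library: energy is average power times runtime, and emissions are energy times the carbon intensity of the electricity grid. The function name estimate_emissions_kg and all numbers below, including the grid intensity, are placeholders rather than measured values. This is a back-of-the-envelope model, not a description of how CodeCarbon works internally.

```python
def estimate_emissions_kg(avg_power_watts, runtime_seconds, grid_kg_co2_per_kwh):
    """Estimate emissions as energy (kWh) times grid carbon intensity (kg CO2 per kWh)."""
    energy_kwh = avg_power_watts * runtime_seconds / 3_600_000  # watt-seconds to kWh
    return energy_kwh * grid_kg_co2_per_kwh


# Placeholder grid intensity; substitute a sourced figure for the region studied.
GRID_INTENSITY = 0.7

# Two hypothetical training runs with made-up power draw and duration.
runs = {
    "run_a": {"avg_power_watts": 300, "runtime_seconds": 7200},
    "run_b": {"avg_power_watts": 150, "runtime_seconds": 3600},
}

estimates = {
    name: estimate_emissions_kg(r["avg_power_watts"], r["runtime_seconds"], GRID_INTENSITY)
    for name, r in runs.items()
}

for name, kg in estimates.items():
    print(f"{name}: {kg:.3f} kg CO2")
```

A scoring system along the lines proposed above could relate a quality metric to these estimates, for example accuracy gained per kilogram of CO₂. The exact definition is part of the research problem.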

3.23 🗝️ Key Takeaways

3.24 📚 References & Further Reading

Textbooks

McKinney, W. (2022). Python for Data Analysis, 3rd Edition. O'Reilly Media. — The definitive Pandas reference by its creator.
VanderPlas, J. (2023). Python Data Science Handbook, 2nd Edition. O'Reilly Media. — Covers NumPy, Pandas, Matplotlib, and Scikit-Learn comprehensively.
Géron, A. (2023). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition. O'Reilly. — Best practical ML book.
Raschka, S., Liu, Y. & Mirjalili, V. (2022). Machine Learning with PyTorch and Scikit-Learn. Packt Publishing.

Online Resources

NumPy Official Documentation: numpy.org/doc
Pandas Documentation: pandas.pydata.org/docs
Scikit-Learn User Guide: scikit-learn.org/stable
TensorFlow Tutorials: tensorflow.org/tutorials
Kaggle Learn — Free micro-courses: kaggle.com/learn
Google Colab: colab.research.google.com

Indian Data Sources

Open Government Data Platform India: data.gov.in
Census of India: censusindia.gov.in
Reserve Bank of India Data: rbi.org.in
India Meteorological Department: mausam.imd.gov.in
NSE India: nseindia.com
NITI Aayog Data Portal: niti.gov.in

Global Data Sources

Kaggle Datasets: kaggle.com/datasets
UCI Machine Learning Repository: archive.ics.uci.edu
World Bank Open Data: data.worldbank.org
WHO Global Health Observatory: who.int/data/gho
Google Dataset Search: datasetsearch.research.google.com

Research Papers

Harris, C. R. et al. (2020). "Array programming with NumPy." Nature, 585(7825), 357-362.
Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." JMLR, 12, 2825-2830.
Abadi, M. et al. (2016). "TensorFlow: A System for Large-Scale Machine Learning." OSDI, 265-283.
McKinney, W. (2011). "pandas: a Foundational Python Library for Data Analysis and Statistics." PyHPC.

🎉 Chapter 3 Complete!

You now have the complete Python data science toolkit. In Chapter 4, we'll apply these tools to build your first ML models: Linear Regression and Classification from scratch.

📖 Next: Chapter 4 — Linear Models: Regression & Classification