Chapter 3: Python & Data Science Ecosystem
3.1 Learning Objectives
Master Python fundamentals (data types, control flow, OOP, comprehensions) tailored for ML/AI work.
Perform 20+ array operations: broadcasting, vectorization, linear algebra, reshaping, indexing.
Load, clean, transform, and analyze real Indian datasets using DataFrames with 20+ operations.
Create publication-quality plots (line, scatter, bar, histogram, heatmap) with Matplotlib & Seaborn.
Understand the estimator API (fit/predict/transform), build preprocessing pipelines, and evaluate models.
Install TensorFlow/Keras, configure GPU, perform tensor operations, and use eager execution.
Use Jupyter Notebooks and Google Colab for interactive development and reproducible research.
Perform EDA on NSE stock data, Census of India, IMD weather data, and agricultural datasets.
3.2 Introduction
Python has become the lingua franca of artificial intelligence and data science. Over 87% of ML practitioners use Python as their primary language (Kaggle Survey 2024). This chapter transforms you from a Python beginner into a proficient data science programmer.
Why Python Dominates AI/ML
| Feature | Python | R | Julia | MATLAB | C++ |
|---|---|---|---|---|---|
| Learning Curve | ⭐ Easy | Medium | Medium | Easy | Hard |
| ML Libraries | ⭐ 500+ | 200+ | 50+ | 100+ | 30+ |
| Deep Learning | ⭐ TF, PyTorch | Limited | Flux.jl | DL Toolbox | LibTorch |
| Production Deploy | ⭐ Excellent | Poor | Growing | Poor | ⭐ Excellent |
| Community Size | ⭐ Massive | Large | Small | Medium | Large |
| Speed | Moderate | Slow | ⭐ Fast | Moderate | ⭐ Fastest |
| Cost | ⭐ Free | ⭐ Free | ⭐ Free | ₹1,50,000+/yr | ⭐ Free |
| Industry Adoption | ⭐ 87% | 15% | 3% | 8% | 12% |
The Data Science Stack
```text
# The complete Python data science ecosystem

# Layer 1: Core Language
Python 3.10+       # Base language

# Layer 2: Scientific Computing
NumPy              # N-dimensional arrays & linear algebra
SciPy              # Scientific algorithms (optimization, integration)

# Layer 3: Data Manipulation
Pandas             # DataFrames for structured data
Polars             # Modern high-performance alternative

# Layer 4: Visualization
Matplotlib         # Core plotting library
Seaborn            # Statistical visualization
Plotly             # Interactive plots

# Layer 5: Machine Learning
Scikit-Learn       # Classical ML algorithms
XGBoost            # Gradient boosting
LightGBM           # Fast gradient boosting

# Layer 6: Deep Learning
TensorFlow         # Google's DL framework
PyTorch            # Meta's DL framework
Keras              # High-level API

# Layer 7: Deployment
Flask / FastAPI    # API servers
Docker             # Containerization
MLflow             # Model tracking
```
3.3 Historical Background
Timeline of Python in Data Science
3.4 Conceptual Explanation
Python Fundamentals Crash Course
4.1 Data Types
```python
# === Numeric Types ===
age = 21                  # int
gpa = 9.2                 # float
complex_num = 3 + 4j      # complex

# === Strings ===
name = "Aadhaar"
greeting = f"Hello, {name}!"   # f-string formatting
multiline = """India has
1.4 billion people"""

# === Boolean ===
is_indian = True
has_gpu = False

# === Collections ===
# List (ordered, mutable)
cities = ["Mumbai", "Delhi", "Bangalore", "Chennai"]
cities.append("Hyderabad")

# Tuple (ordered, immutable)
coordinates = (28.6139, 77.2090)   # Delhi lat, long

# Dictionary (key-value pairs)
student = {
    "name": "Priya Sharma",
    "roll": 101,
    "branch": "CSE",
    "cgpa": 8.7
}

# Set (unordered, unique elements); the duplicate "Python" is dropped
languages = {"Python", "R", "Julia", "Python"}   # {'Python', 'R', 'Julia'}

# Type checking
print(type(age))                # <class 'int'>
print(type(cities))             # <class 'list'>
print(isinstance(gpa, float))   # True
```
4.2 Control Flow
```python
# === Conditional Statements ===
marks = 85
if marks >= 90:
    grade = "A+"
elif marks >= 80:
    grade = "A"
elif marks >= 70:
    grade = "B"
else:
    grade = "C"
print(f"Grade: {grade}")   # Grade: A

# Ternary operator
result = "Pass" if marks >= 40 else "Fail"

# === Loops ===
# For loop with range
for i in range(1, 6):
    print(f"{i} × 7 = {i*7}")

# Iterating over collections
states = ["Maharashtra", "Karnataka", "Tamil Nadu"]
for idx, state in enumerate(states, 1):
    print(f"{idx}. {state}")

# While loop
n = 10
factorial = 1
while n > 0:
    factorial *= n
    n -= 1
print(f"10! = {factorial}")   # 10! = 3628800

# === List Comprehensions ===
squares = [x**2 for x in range(1, 11)]
# [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

evens = [x for x in range(1, 21) if x % 2 == 0]
# [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

# Dictionary comprehension
sq_dict = {x: x**2 for x in range(1, 6)}
# {1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

# Nested comprehension (matrix creation)
matrix = [[i * j for j in range(1, 4)] for i in range(1, 4)]
# [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
```
4.3 Functions & Lambda
```python
from functools import reduce

# === Function Definition ===
def calculate_bmi(weight_kg, height_m):
    """Calculate Body Mass Index.

    Args:
        weight_kg (float): Weight in kilograms
        height_m (float): Height in meters

    Returns:
        float: BMI value
    """
    bmi = weight_kg / (height_m ** 2)
    return round(bmi, 2)

print(calculate_bmi(70, 1.75))   # 22.86

# Default parameters
def greet(name, language="Hindi"):
    greetings = {"Hindi": "Namaste", "English": "Hello", "Tamil": "Vanakkam"}
    return f"{greetings.get(language, 'Hi')}, {name}!"

# *args: accept any number of positional arguments
def statistics(*values):
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean)**2 for x in values) / n   # population variance (1/n)
    return {"mean": mean, "variance": variance, "std": variance**0.5}

print(statistics(85, 90, 78, 92, 88))

# Lambda functions
square = lambda x: x ** 2
add = lambda a, b: a + b

# Map, Filter, Reduce
prices_inr = [100, 250, 500, 1000]
prices_usd = list(map(lambda p: round(p / 83, 2), prices_inr))   # assumes ₹83 per USD
expensive = list(filter(lambda p: p > 300, prices_inr))          # [500, 1000]
total = reduce(lambda a, b: a + b, prices_inr)                   # 1850
```
4.4 Object-Oriented Programming
```python
class Dataset:
    """A simple dataset class for ML workflows."""

    def __init__(self, name, features, target):
        self.name = name
        self.features = features   # list of lists
        self.target = target       # list
        self.n_samples = len(target)
        self.n_features = len(features[0]) if features else 0

    def __repr__(self):
        return f"Dataset('{self.name}', samples={self.n_samples}, features={self.n_features})"

    def head(self, n=5):
        for i in range(min(n, self.n_samples)):
            print(f"X={self.features[i]}, y={self.target[i]}")

    def train_test_split(self, test_ratio=0.2):
        # Ordered split (no shuffling): the first rows train, the last rows test
        split = int(self.n_samples * (1 - test_ratio))
        return (self.features[:split], self.target[:split],
                self.features[split:], self.target[split:])


# Inheritance
class ImageDataset(Dataset):
    def __init__(self, name, features, target, img_size):
        super().__init__(name, features, target)
        self.img_size = img_size

    def resize(self, new_size):
        self.img_size = new_size
        print(f"Images resized to {new_size}")


# Usage
ds = Dataset("Indian Housing", [[1200, 2], [1800, 3], [2400, 4]], [50, 75, 100])
print(ds)   # Dataset('Indian Housing', samples=3, features=2)
ds.head()
```
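The `train_test_split` method above slices the rows in their stored order, so a dataset sorted by target or by date would give an unrepresentative test set. A minimal sketch of a shuffled variant follows; the helper name `shuffled_split` and the sample rows are illustrative assumptions, not part of the chapter's `Dataset` class.

```python
import random


def shuffled_split(features, target, test_ratio=0.2, seed=42):
    """Hypothetical helper: shuffle row indices, then split into train/test."""
    indices = list(range(len(target)))
    random.Random(seed).shuffle(indices)   # seeded, so the split is reproducible
    n_train = int(len(indices) * (1 - test_ratio))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    X_train = [features[i] for i in train_idx]
    y_train = [target[i] for i in train_idx]
    X_test = [features[i] for i in test_idx]
    y_test = [target[i] for i in test_idx]
    return X_train, y_train, X_test, y_test


# Illustrative rows: [area_sqft, bedrooms] -> price in lakhs
features = [[1200, 2], [1800, 3], [2400, 4], [900, 1], [3000, 5]]
target = [50, 75, 100, 40, 130]

X_train, y_train, X_test, y_test = shuffled_split(features, target)
print(len(X_train), len(X_test))
```

Shuffling indices rather than the rows themselves keeps each feature row paired with its target value.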
3.5 Mathematical Foundation
The mathematical operations that power data science are computed through NumPy's optimized C backend. Understanding the math behind these operations is crucial.
5.1 Vector Operations
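A vector $\mathbf{x} \in \mathbb{R}^n$ is an ordered collection of $n$ real numbers:

$$\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$$

Dot Product:

$$\mathbf{x} \cdot \mathbf{y} = \sum_i x_i y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n$$

Euclidean Norm (L2):

$$\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2} = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

Manhattan Norm (L1):

$$\|\mathbf{x}\|_1 = \sum_i |x_i| = |x_1| + |x_2| + \cdots + |x_n|$$

Cosine Similarity:

$$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2}$$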
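As a quick check of the vector formulas above, the sketch below evaluates each one with NumPy on two small vectors. The vectors are chosen only for easy arithmetic and are not from the chapter.

```python
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([4.0, 3.0])

dot = np.dot(x, y)                 # 3*4 + 4*3 = 24
l2 = np.linalg.norm(x)             # sqrt(3^2 + 4^2) = 5
l1 = np.linalg.norm(x, ord=1)      # |3| + |4| = 7
cos_sim = dot / (np.linalg.norm(x) * np.linalg.norm(y))   # 24 / (5 * 5) = 0.96

print(dot, l2, l1, cos_sim)
```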
5.2 Matrix Operations
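Matrix Multiplication: $C = AB$ where $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$, $C \in \mathbb{R}^{m \times p}$

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

Transpose:

$$(A^T)_{ij} = A_{ji}$$

Determinant (2×2):

$$\det(A) = \begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$$

Inverse (2×2), defined when $\det(A) \neq 0$:

$$A^{-1} = \frac{1}{\det(A)} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$

Eigenvalue equation:

$$A\mathbf{v} = \lambda \mathbf{v}$$

where $\lambda$ is an eigenvalue and $\mathbf{v}$ is the corresponding eigenvector.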
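The 2×2 formulas above can be checked numerically. This sketch builds the inverse from the closed-form expression, compares it with `np.linalg.inv`, and confirms that each eigenpair returned by `np.linalg.eig` satisfies the eigenvalue equation. The matrix is an arbitrary invertible example.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [5.0, 3.0]])
(a, b), (c, d) = A

det = a * d - b * c                                  # ad - bc
inv_formula = np.array([[d, -b], [-c, a]]) / det     # closed-form 2x2 inverse

# Compare the closed form with NumPy's general-purpose routine
inverse_matches = np.allclose(inv_formula, np.linalg.inv(A))

# Each column of `eigenvectors` pairs with the eigenvalue at the same index
eigenvalues, eigenvectors = np.linalg.eig(A)
eigen_ok = all(
    np.allclose(A @ eigenvectors[:, k], eigenvalues[k] * eigenvectors[:, k])
    for k in range(len(eigenvalues))
)

print(det, inverse_matches, eigen_ok)
```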
5.3 Statistical Measures
Mean (Arithmetic Average):

$$\mu = \frac{1}{n} \sum_i x_i$$

Variance:

$$\sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2$$

Standard Deviation:

$$\sigma = \sqrt{\sigma^2}$$

Covariance between X and Y:

$$\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_i (x_i - \mu_x)(y_i - \mu_y)$$

Pearson Correlation:

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \, \sigma_y}, \quad -1 \le \rho \le 1$$

Z-Score (Standardization):

$$z = \frac{x - \mu}{\sigma}$$

Min-Max Normalization:

$$x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
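The statistical measures above translate almost line for line into NumPy. The sketch below uses the population form (dividing by n) to match the formulas; the sample values are illustrative, with y = x - 1 so the correlation should come out as 1.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # y = x - 1, a perfect linear relation

mu_x, mu_y = x.mean(), y.mean()
var_x = ((x - mu_x) ** 2).mean()                 # population variance (1/n)
sigma_x = np.sqrt(var_x)
sigma_y = np.sqrt(((y - mu_y) ** 2).mean())
cov_xy = ((x - mu_x) * (y - mu_y)).mean()        # population covariance (1/n)
rho = cov_xy / (sigma_x * sigma_y)               # Pearson correlation

z = (x - mu_x) / sigma_x                         # z-scores: mean 0, std 1
x_norm = (x - x.min()) / (x.max() - x.min())     # min-max: rescaled to [0, 1]

print(var_x, rho, z, x_norm)
```

NumPy's `np.var` and `np.std` also default to the 1/n form, whereas pandas' `.var()` and `.std()` default to 1/(n-1), so the two libraries can give slightly different numbers on the same data.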
3.6 Formula Derivations
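Derivation 1: Why Vectorization Beats a Python Loop (Both O(n), Very Different Constants)
Given: two vectors $a, b \in \mathbb{R}^n$, compute $c = a + b$.

Method 1: Python loop

For i = 0 to n-1: c[i] = a[i] + b[i]

Time complexity: O(n) iterations, each executed by the Python interpreter.
Each iteration: type-check → unbox → add → box → store.
Overhead per element: roughly 100 CPU cycles (interpreter).

Method 2: NumPy vectorization

c = np.add(a, b)  # single C-level call

Time complexity: still O(n) additions, but the loop runs in compiled code with one interpreter call in total.
Internally: SIMD instructions process 4-8 elements simultaneously.
Overhead per element: roughly 1-2 CPU cycles (raw hardware).

$$\text{Speedup} \approx \frac{100 \times n}{2 \times n/4} = 200\times$$

For n = 1,000,000: loop ≈ 0.5 s, NumPy ≈ 0.002 s. These are order-of-magnitude figures; actual timings depend on hardware and NumPy build.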
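The cost comparison above can be checked empirically. The sketch below times an explicit Python loop against the vectorized sum using `timeit` and confirms both produce the same result. The array size is kept small so it runs quickly, and no particular ratio is asserted because timings vary by machine.

```python
import timeit

import numpy as np

n = 100_000
rng = np.random.default_rng(0)
a = rng.standard_normal(n)
b = rng.standard_normal(n)


def add_loop(a, b):
    """Element-by-element addition driven by the Python interpreter."""
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out


def add_vectorized(a, b):
    """Single call; the loop over elements runs in compiled code."""
    return a + b


same = np.allclose(add_loop(a, b), add_vectorized(a, b))
t_loop = timeit.timeit(lambda: add_loop(a, b), number=3) / 3
t_vec = timeit.timeit(lambda: add_vectorized(a, b), number=3) / 3

print(f"identical results: {same}")
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.6f}s  ratio: {t_loop / t_vec:.0f}x")
```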
Derivation 2: Broadcasting Rules (First Principles)
Broadcasting allows operations on arrays of different shapes.

Rule 1: If the arrays differ in ndim, prepend 1s to the smaller shape.
A.shape = (3, 4) → (3, 4)
B.shape = (4,) → (1, 4)

Rule 2: Arrays with size 1 along a dimension are stretched to match.
A.shape = (3, 4) → (3, 4)
B.shape = (1, 4) → (3, 4), so B is repeated 3 times

Rule 3: If the sizes differ and neither is 1, the operation raises an error.
A.shape = (3, 4)
B.shape = (3, 5) → incompatible

Example: standardize a dataset X (m×n) column-wise.
X.shape = (1000, 5)  # 1000 samples, 5 features
mean.shape = (5,)  # → (1, 5) → (1000, 5)
std.shape = (5,)  # → (1, 5) → (1000, 5)
Z = (X - mean) / std  # broadcasting handles it
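The three broadcasting rules above can be exercised directly. This sketch uses `np.broadcast_shapes` (available in NumPy 1.20 and later) to show a compatible and an incompatible pair, then runs the column-wise standardization example on random data. The data itself is an illustrative assumption.

```python
import numpy as np

# Rules 1 and 2: (4,) is treated as (1, 4) and stretched to (3, 4)
compatible = np.broadcast_shapes((3, 4), (4,))
print(compatible)

# Rule 3: trailing sizes 4 and 5 differ and neither is 1
try:
    np.broadcast_shapes((3, 4), (3, 5))
    incompatible_raised = False
except ValueError as err:
    incompatible_raised = True
    print("Incompatible shapes:", err)

# Column-wise standardization of a (1000, 5) dataset
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 5))
mean = X.mean(axis=0)      # shape (5,)
std = X.std(axis=0)        # shape (5,)
Z = (X - mean) / std       # (1000, 5) op (5,) -> (1000, 5)

print(Z.shape, Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
```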
Derivation 3: Covariance Matrix from First Principles
Given a dataset X with m samples and n features: $X \in \mathbb{R}^{m \times n}$

Step 1: Center the data (subtract column means)

$$\tilde{X} = X - \mu, \quad \mu_j = \frac{1}{m} \sum_i X_{ij}$$

Step 2: Covariance matrix $C \in \mathbb{R}^{n \times n}$

$$C = \frac{1}{m} \tilde{X}^T \tilde{X}$$

Step 3: Element $C_{jk}$ is the covariance between features j and k

$$C_{jk} = \frac{1}{m} \sum_i (X_{ij} - \mu_j)(X_{ik} - \mu_k)$$

Properties:
- C is symmetric: $C_{jk} = C_{kj}$
- Diagonal: $C_{jj} = \mathrm{Var}(\text{feature } j)$
- C is positive semi-definite: $v^T C v \ge 0$ for all $v$

In NumPy: C = np.cov(X.T, bias=True) or (X_centered.T @ X_centered) / m. Note that np.cov divides by m - 1 by default; bias=True is needed to match the 1/m definition used here.
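The derivation above can be verified on random data. This sketch builds C from the centered matrix with the 1/m normalization, compares it with `np.cov(..., bias=True)`, and checks the three listed properties. The data shape and seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))          # m = 200 samples, n = 3 features
m = X.shape[0]

X_centered = X - X.mean(axis=0)        # Step 1: subtract column means
C = (X_centered.T @ X_centered) / m    # Step 2: 1/m normalization

# np.cov treats rows as variables by default, hence rowvar=False;
# bias=True switches from the default 1/(m-1) to 1/m
C_np = np.cov(X, rowvar=False, bias=True)

matches_numpy = np.allclose(C, C_np)
is_symmetric = np.allclose(C, C.T)
diag_is_variance = np.allclose(np.diag(C), X.var(axis=0))
is_psd = bool(np.all(np.linalg.eigvalsh(C) >= -1e-12))   # eigenvalues >= 0 up to round-off

print(matches_numpy, is_symmetric, diag_is_variance, is_psd)
```

With the default `bias=False`, `np.cov` returns the sample covariance, which equals C scaled by m / (m - 1).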
3.7 Worked Numerical Examples
Example 1: NumPy Matrix Operations
```python
import numpy as np

# Given: Monthly sales data for 3 products in 4 cities (in ₹ lakhs)
sales = np.array([
    [12, 15, 8, 20],   # Product A: Mumbai, Delhi, Bangalore, Chennai
    [9, 11, 14, 7],    # Product B
    [18, 6, 10, 13]    # Product C
])

# Q1: Total sales per product
product_totals = sales.sum(axis=1)
print("Product totals:", product_totals)
# [55 41 47]: Product A leads with ₹55 lakhs

# Q2: Average sales per city
city_avg = sales.mean(axis=0)
print("City averages:", city_avg)
# [13.0, 10.67, 10.67, 13.33]: Chennai has the highest average

# Q3: Normalize sales (min-max per product)
mins = sales.min(axis=1, keepdims=True)
maxs = sales.max(axis=1, keepdims=True)
normalized = (sales - mins) / (maxs - mins)
print("Normalized:\n", np.round(normalized, 2))
# [[0.33, 0.58, 0.  , 1.  ],   Product A: Chennai is max
#  [0.29, 0.57, 1.  , 0.  ],   Product B: Bangalore is max
#  [1.  , 0.  , 0.33, 0.58]]   Product C: Mumbai is max

# Q4: Covariance between products (rows are variables; np.cov divides by n-1)
cov_matrix = np.cov(sales)
print("Covariance matrix:\n", np.round(cov_matrix, 2))

# Q5: Matrix multiplication: sales × price_per_unit
prices = np.array([[100], [150], [200], [120]])   # ₹ per unit per city
revenue = sales @ prices                           # (3, 4) @ (4, 1) -> (3, 1)
print("Revenue per product (₹ lakhs):", revenue.flatten())
# [7450 6190 6260]
```
Example 2: Pandas EDA on an Indian Student Dataset
```python
import pandas as pd
import numpy as np

# Create sample Indian student dataset
np.random.seed(42)
n = 200
data = {
    'Name': [f'Student_{i}' for i in range(1, n+1)],
    'State': np.random.choice(
        ['Maharashtra', 'Karnataka', 'Tamil Nadu', 'UP', 'Delhi', 'Kerala'], n),
    'Branch': np.random.choice(
        ['CSE', 'ECE', 'Mech', 'Civil', 'AI/ML'], n,
        p=[0.30, 0.20, 0.20, 0.15, 0.15]),
    'CGPA': np.random.normal(7.5, 1.2, n).clip(4.0, 10.0),
    'Package_LPA': np.random.exponential(6, n).clip(3, 50),
    'Python_Score': np.random.randint(40, 100, n),
    'Placed': np.random.choice([True, False], n, p=[0.72, 0.28])
}
df = pd.DataFrame(data)
df['CGPA'] = df['CGPA'].round(2)
df['Package_LPA'] = df['Package_LPA'].round(2)

# === EDA Operations ===
# 1. Basic info
print(df.shape)        # (200, 7)
df.info()              # Column types, non-null counts (prints directly)
print(df.describe())   # Statistical summary

# 2. Value counts
print(df['State'].value_counts())
print(df['Branch'].value_counts(normalize=True))

# 3. GroupBy: average package by state
state_pkg = df.groupby('State')['Package_LPA'].agg(['mean', 'median', 'max'])
print(state_pkg.round(2))

# 4. Pivot table: average CGPA by State × Branch
pivot = df.pivot_table(values='CGPA', index='State',
                       columns='Branch', aggfunc='mean')
print(pivot.round(2))

# 5. Correlation
numeric_cols = df.select_dtypes(include=['number'])
print(numeric_cols.corr().round(2))

# 6. Filtering: top performers ('Placed' is already boolean)
toppers = df[(df['CGPA'] > 9.0) & df['Placed']]
print(f"Toppers placed: {len(toppers)}")

# 7. Apply custom function
def classify_package(lpa):
    if lpa >= 15:
        return 'Super Dream'
    elif lpa >= 10:
        return 'Dream'
    elif lpa >= 5:
        return 'Good'
    else:
        return 'Average'

df['Offer_Type'] = df['Package_LPA'].apply(classify_package)
print(df['Offer_Type'].value_counts())

# 8. Handling missing data
df_with_na = df.copy()
df_with_na.loc[0:10, 'Package_LPA'] = np.nan   # .loc slicing is inclusive: 11 rows
print(df_with_na.isnull().sum())
# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
df_with_na['Package_LPA'] = df_with_na['Package_LPA'].fillna(
    df_with_na['Package_LPA'].median())
```
3.8 Visual Diagrams
Data Science Workflow Architecture
NumPy Array Memory Layout
Pandas DataFrame Architecture
3.9 Flowcharts
EDA (Exploratory Data Analysis) Flowchart
Scikit-Learn Pipeline Flowchart
3.10 Python Implementation
10.1 NumPy: 20+ Operations
```python
import time
import numpy as np

# === 1. Array Creation ===
a = np.array([1, 2, 3, 4, 5])
zeros = np.zeros((3, 4))
ones = np.ones((2, 3))
identity = np.eye(4)
randoms = np.random.randn(3, 3)
linspace = np.linspace(0, 10, 50)   # 50 evenly spaced points
arange = np.arange(0, 100, 5)       # [0, 5, 10, ..., 95]

# === 2. Array Properties ===
X = np.random.randn(100, 5)
print(f"Shape: {X.shape}")            # (100, 5)
print(f"Dimensions: {X.ndim}")        # 2
print(f"Size: {X.size}")              # 500
print(f"Dtype: {X.dtype}")            # float64
print(f"Memory: {X.nbytes} bytes")    # 4000

# === 3. Indexing & Slicing ===
arr = np.arange(20).reshape(4, 5)
print(arr[0, :])        # First row
print(arr[:, 2])        # Third column
print(arr[1:3, 2:4])    # Sub-matrix
print(arr[arr > 10])    # Boolean indexing

# === 4. Reshaping ===
flat = np.arange(12)
matrix = flat.reshape(3, 4)
tensor = flat.reshape(2, 2, 3)
col_vector = flat.reshape(-1, 1)   # (12, 1)

# === 5. Broadcasting ===
A = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
print(A + b)   # [[11,22,33], [14,25,36]]
print(A * b)   # [[10,40,90], [40,100,180]]

# === 6. Aggregation ===
data = np.random.randint(1, 100, (5, 4))
print("Sum all:", data.sum())
print("Row sums:", data.sum(axis=1))
print("Col means:", data.mean(axis=0))
print("Max per row:", data.max(axis=1))
print("Std per col:", data.std(axis=0))

# === 7. Linear Algebra ===
A = np.array([[2, 1], [5, 3]])
b = np.array([4, 7])

# Solve Ax = b
x = np.linalg.solve(A, b)
print("Solution:", x)   # [ 5. -6.]

# Determinant, Inverse, Eigenvalues
print("Det:", np.linalg.det(A))   # 1.0 (up to floating-point round-off)
print("Inverse:\n", np.linalg.inv(A))
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)

# === 8. Vectorized Operations vs Loops ===
n = 1_000_000
a = np.random.randn(n)
b = np.random.randn(n)

# NumPy vectorized
start = time.perf_counter()
c = a + b
print(f"NumPy: {time.perf_counter()-start:.4f}s")   # ~0.002s (machine-dependent)

# Python loop
start = time.perf_counter()
c_loop = [a[i]+b[i] for i in range(n)]
print(f"Loop: {time.perf_counter()-start:.4f}s")    # ~0.5s (on the order of 250× slower)

# === 9-12. More Operations ===
# Sorting
arr = np.array([3, 1, 4, 1, 5, 9])
print("Sorted:", np.sort(arr))
print("Argsort:", np.argsort(arr))

# Stacking
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print("Vstack:\n", np.vstack([x, y]))
print("Hstack:", np.hstack([x, y]))
print("Column stack:\n", np.column_stack([x, y]))

# Where (conditional)
ages = np.array([15, 22, 17, 30, 12])
category = np.where(ages >= 18, 'Adult', 'Minor')
print(category)   # ['Minor' 'Adult' 'Minor' 'Adult' 'Minor']

# Unique values
states = np.array(['MH', 'KA', 'MH', 'TN', 'KA'])
unique, counts = np.unique(states, return_counts=True)
print(dict(zip(unique, counts)))   # counts: KA=2, MH=2, TN=1

# Random sampling (for ML)
np.random.seed(42)
indices = np.random.choice(100, size=20, replace=False)    # 20 random indices
bootstrap = np.random.choice(100, size=100, replace=True)  # Bootstrap sample
```
10.2 Pandas: 20+ Operations with Indian Data
```python
import pandas as pd
import numpy as np

# === DATA LOADING ===
# 1. CSV
# df = pd.read_csv('indian_census.csv')
# 2. JSON
# df = pd.read_json('nse_stocks.json')
# 3. Excel
# df = pd.read_excel('agriculture.xlsx', sheet_name='Wheat')
# 4. SQL
# from sqlalchemy import create_engine
# engine = create_engine('sqlite:///india_data.db')
# df = pd.read_sql('SELECT * FROM census', engine)
# 5. API / Web
# df = pd.read_html('https://en.wikipedia.org/wiki/States_of_India')[0]

# === Create Indian State GDP Dataset ===
gdp_data = {
    'State': ['Maharashtra', 'Tamil Nadu', 'Karnataka', 'Gujarat', 'UP',
              'West Bengal', 'Rajasthan', 'Telangana', 'Kerala', 'AP',
              'MP', 'Delhi'],
    'GDP_Lakh_Cr': [32.2, 19.4, 16.9, 16.7, 17.9, 12.5, 10.2, 11.5, 9.8,
                    10.1, 9.1, 9.0],
    'Population_Cr': [12.3, 7.7, 6.4, 6.4, 23.1, 9.7, 7.9, 3.7, 3.5, 5.3,
                      8.5, 1.9],
    'Literacy_%': [82.3, 80.1, 75.4, 78.0, 67.7, 76.3, 66.1, 72.8, 94.0,
                   67.0, 69.3, 86.2],
    'Region': ['West', 'South', 'South', 'West', 'North', 'East', 'North',
               'South', 'South', 'South', 'Central', 'North']
}
df = pd.DataFrame(gdp_data)

# === 20+ PANDAS OPERATIONS ===
# 1. Per-capita GDP (₹ lakh crore / crore people = ₹ lakh per person)
df['Per_Capita_Lakh'] = (df['GDP_Lakh_Cr'] / df['Population_Cr']).round(2)

# 2. Sort by GDP
print(df.sort_values('GDP_Lakh_Cr', ascending=False).head(5))

# 3. GroupBy region
regional = df.groupby('Region').agg({
    'GDP_Lakh_Cr': 'sum',
    'Population_Cr': 'sum',
    'Literacy_%': 'mean'
}).round(2)
print(regional)

# 4. Filter: high literacy states
literate = df[df['Literacy_%'] > 80]
print("High literacy:", literate['State'].tolist())

# 5. Rank
df['GDP_Rank'] = df['GDP_Lakh_Cr'].rank(ascending=False).astype(int)

# 6. String operations
df['State_Upper'] = df['State'].str.upper()
df['State_Len'] = df['State'].str.len()

# 7. Conditional column
df['Category'] = pd.cut(df['GDP_Lakh_Cr'],
                        bins=[0, 10, 15, 20, 40],
                        labels=['Small', 'Medium', 'Large', 'Mega'])

# 8. Merge with another dataset
it_data = pd.DataFrame({
    'State': ['Karnataka', 'Maharashtra', 'Tamil Nadu', 'Telangana'],
    'IT_Revenue_Cr': [5_00_000, 3_50_000, 1_80_000, 1_50_000]
})
merged = df.merge(it_data, on='State', how='left')
print(merged[['State', 'IT_Revenue_Cr']].dropna())

# 9. Pivot table
pivot = df.pivot_table(values='GDP_Lakh_Cr', index='Region',
                       aggfunc=['sum', 'mean', 'count'])
print(pivot)

# 10. Apply custom function
def development_index(row):
    return (row['Per_Capita_Lakh'] * 0.4 +
            row['Literacy_%'] * 0.01 * 0.6)

df['Dev_Index'] = df.apply(development_index, axis=1).round(3)
print(df.nlargest(5, 'Dev_Index')[['State', 'Dev_Index']])
```
10.3 Matplotlib & Seaborn: 15+ Plot Examples
```python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('seaborn-v0_8-darkgrid')

# === 1. Line Plot: India GDP Growth ===
years = np.arange(2015, 2026)
gdp = [2.1, 2.3, 2.7, 2.7, 2.9, 2.7, 3.2, 3.4, 3.6, 3.9, 4.2]

plt.figure(figsize=(10, 5))
plt.plot(years, gdp, 'o-', color='#059669', linewidth=2, markersize=8)
plt.title('India GDP Growth (Trillion USD)', fontsize=14, fontweight='bold')
plt.xlabel('Year'); plt.ylabel('GDP (Trillion $)')
plt.fill_between(years, gdp, alpha=0.2, color='#059669')
plt.tight_layout(); plt.savefig('india_gdp.png', dpi=150); plt.show()

# === 2. Bar Chart: Top 5 States by GDP ===
states = ['MH', 'TN', 'UP', 'KA', 'GJ']
gdp_vals = [32.2, 19.4, 17.9, 16.9, 16.7]
colors = ['#059669', '#0891b2', '#f59e0b', '#ef4444', '#8b5cf6']

plt.figure(figsize=(8, 5))
bars = plt.bar(states, gdp_vals, color=colors, edgecolor='white')
for bar, val in zip(bars, gdp_vals):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,
             f'₹{val}L Cr', ha='center', fontweight='bold')
plt.title('Top 5 Indian States by GDP'); plt.ylabel('GDP (₹ Lakh Crore)')
plt.tight_layout(); plt.show()

# === 3. Scatter Plot: Literacy vs Per-Capita GDP ===
literacy = [82.3, 80.1, 75.4, 78.0, 67.7, 94.0, 86.2]
per_cap = [2.62, 2.52, 2.64, 2.61, 0.77, 2.80, 4.74]   # ₹ lakh per person

plt.figure(figsize=(8, 6))
plt.scatter(literacy, per_cap, s=150, c=per_cap, cmap='viridis',
            edgecolors='white', linewidth=2)
plt.colorbar(label='Per-Capita GDP (₹ Lakh)')
plt.xlabel('Literacy Rate (%)'); plt.ylabel('Per-Capita GDP')
plt.title('Literacy vs Per-Capita GDP (Indian States)')
plt.tight_layout(); plt.show()

# === 4. Histogram: CGPA Distribution ===
np.random.seed(42)
cgpas = np.random.normal(7.5, 1.2, 500).clip(4, 10)

plt.figure(figsize=(8, 5))
plt.hist(cgpas, bins=25, color='#0891b2', edgecolor='white', alpha=0.8)
plt.axvline(np.mean(cgpas), color='red', linestyle='--',
            label=f'Mean={np.mean(cgpas):.2f}')
plt.legend(); plt.xlabel('CGPA'); plt.ylabel('Frequency')
plt.title('CGPA Distribution (Indian Engineering College)')
plt.tight_layout(); plt.show()

# === 5. Heatmap: Correlation Matrix ===
np.random.seed(42)
df_corr = pd.DataFrame({
    'CGPA': np.random.normal(7.5, 1.2, 200),
    'Python': np.random.randint(40, 100, 200),
    'Math': np.random.randint(50, 100, 200),
    'Package_LPA': np.random.exponential(6, 200)
})

plt.figure(figsize=(8, 6))
sns.heatmap(df_corr.corr(), annot=True, cmap='coolwarm', center=0,
            fmt='.2f', square=True, linewidths=1)
plt.title('Feature Correlation Heatmap'); plt.tight_layout(); plt.show()

# === 6. Subplots: Multi-panel Dashboard ===
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Panel 1: Line
axes[0,0].plot(years, gdp, 'o-', color='#059669')
axes[0,0].set_title('GDP Growth')

# Panel 2: Bar
axes[0,1].bar(states, gdp_vals, color=colors)
axes[0,1].set_title('State GDP')

# Panel 3: Histogram
axes[1,0].hist(cgpas, bins=20, color='#0891b2', edgecolor='white')
axes[1,0].set_title('CGPA Distribution')

# Panel 4: Pie
regions = ['South', 'West', 'North', 'East', 'Central']
shares = [35, 28, 22, 10, 5]
axes[1,1].pie(shares, labels=regions, autopct='%1.1f%%',
              colors=['#059669','#0891b2','#f59e0b','#ef4444','#8b5cf6'])
axes[1,1].set_title('IT Revenue by Region')

plt.suptitle('India Economic Dashboard', fontsize=16, fontweight='bold')
plt.tight_layout(); plt.show()

# === 7. Box Plot ===
plt.figure(figsize=(8, 5))
branches = ['CSE', 'ECE', 'Mech', 'Civil', 'AI/ML']
data_box = [np.random.normal(loc, 2, 50) for loc in [12, 8, 7, 6, 15]]
bp = plt.boxplot(data_box, patch_artist=True)
# Set tick labels separately; boxplot's `labels` keyword is deprecated in newer Matplotlib
plt.xticks(range(1, len(branches) + 1), branches)
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
plt.title('Placement Package by Branch (LPA)')
plt.ylabel('Package (LPA)'); plt.tight_layout(); plt.show()

# === 8-15. More Seaborn Plots ===
# (these assume a DataFrame `df` with the student columns from Example 2)
# 8. Pair Plot
# sns.pairplot(df_corr, diag_kind='kde')
# 9. Violin Plot
# sns.violinplot(x='Branch', y='Package_LPA', data=df)
# 10. Count Plot
# sns.countplot(x='State', data=df, palette='viridis')
# 11. Joint Plot
# sns.jointplot(x='CGPA', y='Package_LPA', data=df, kind='reg')
# 12. KDE Plot
# sns.kdeplot(data=df, x='CGPA', hue='Placed', fill=True)
# 13. Swarm Plot
# sns.swarmplot(x='Branch', y='CGPA', data=df)
# 14. Stacked Area Chart
# plt.stackplot(years, [gdp_it, gdp_mfg, gdp_agri])
# 15. Horizontal Bar Chart
# plt.barh(states, literacy_rates, color='#059669')
```
3.11 TensorFlow Implementation
11.1 Installation & GPU Setup
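```bash
# CPU-only installation
pip install tensorflow

# GPU installation (Linux). The and-cuda extra pulls in CUDA/cuDNN libraries
# via pip; the versions depend on the TensorFlow release, and a compatible
# NVIDIA driver must already be installed.
pip install "tensorflow[and-cuda]"

# Verify installation
python -c "import tensorflow as tf; print(tf.__version__)"
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```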
11.2 Tensor Operations
```python
import tensorflow as tf
import numpy as np

# === Tensor Creation ===
# Constant tensors
a = tf.constant([1, 2, 3], dtype=tf.float32)
b = tf.constant([[1, 2], [3, 4]], dtype=tf.float32)
print(f"Shape: {a.shape}, Dtype: {a.dtype}")

# Special tensors
zeros = tf.zeros((3, 4))
ones = tf.ones((2, 3))
eye = tf.eye(3)
rand = tf.random.normal((3, 3), mean=0, stddev=1)
uniform = tf.random.uniform((2, 5), minval=0, maxval=10)

# Variable tensors (trainable parameters)
weights = tf.Variable(tf.random.normal((784, 128)))
bias = tf.Variable(tf.zeros((128,)))

# === Tensor Operations ===
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.constant([[5.0, 6.0], [7.0, 8.0]])

print("Add:", tf.add(x, y))
print("Multiply:", tf.multiply(x, y))    # Element-wise
print("MatMul:", tf.matmul(x, y))        # Matrix multiplication
print("Reduce Sum:", tf.reduce_sum(x))
print("Reduce Mean:", tf.reduce_mean(x))

# === Eager Execution (default in TF2) ===
print("Eager mode:", tf.executing_eagerly())   # True

# Automatic differentiation with GradientTape
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2 + 2 * x + 1   # y = x² + 2x + 1
dy_dx = tape.gradient(y, x)
print(f"dy/dx at x=3: {dy_dx.numpy()}")   # 8.0 (2x + 2 = 8)

# === NumPy ↔ TensorFlow Conversion ===
np_arr = np.array([[1, 2], [3, 4]])
tensor = tf.convert_to_tensor(np_arr, dtype=tf.float32)
back_to_np = tensor.numpy()

# === GPU Check ===
print("GPUs:", tf.config.list_physical_devices('GPU'))
if tf.config.list_physical_devices('GPU'):
    with tf.device('/GPU:0'):
        large = tf.random.normal((10000, 10000))
        result = tf.matmul(large, large)
    print("GPU computation done!")
```
3.12 Scikit-Learn Implementation
12.1 The Estimator API
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# === Simulated Indian Housing Price Dataset ===
np.random.seed(42)
n = 500
area = np.random.uniform(500, 3000, n)        # sq ft
bedrooms = np.random.randint(1, 6, n)
distance_km = np.random.uniform(1, 50, n)     # from city center
floor_num = np.random.randint(1, 25, n)

# Price formula: base + area*rate - distance_penalty + bedroom_bonus + noise
# (floor_num is deliberately not in the formula, so its coefficient should be near zero)
price_lakhs = (20 + area * 0.03 - distance_km * 0.8 +
               bedrooms * 5 + np.random.normal(0, 5, n))

X = np.column_stack([area, bedrooms, distance_km, floor_num])
y = price_lakhs

# === Split Data ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

# === Pipeline: Scaler → Linear Regression ===
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}")
print(f"R² Score: {r2:.4f}")
print(f"RMSE: {mse**0.5:.2f} lakhs")

# === Feature Importance ===
# Coefficients are on standardized features: change in price per one
# standard deviation of each feature
model = pipeline.named_steps['regressor']
feature_names = ['Area_sqft', 'Bedrooms', 'Distance_km', 'Floor']
for name, coef in zip(feature_names, model.coef_):
    print(f"  {name}: {coef:.4f}")
print(f"  Intercept: {model.intercept_:.4f}")

# === Predict New House ===
new_house = np.array([[1500, 3, 10, 5]])   # 1500 sq ft, 3 BHK, 10 km, 5th floor
predicted_price = pipeline.predict(new_house)
print(f"\nPredicted price: ₹{predicted_price[0]:.2f} Lakhs")
```
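To make the fit/transform/predict convention behind the pipeline above concrete, here is a minimal NumPy-only sketch of the same pattern. The classes `SimpleStandardScaler` and `SimpleLinearRegressor` are hypothetical stand-ins written for illustration; they are not Scikit-Learn's implementations, and the toy data is noise-free so the fitted model should recover the generating formula.

```python
import numpy as np


class SimpleStandardScaler:
    """Hypothetical transformer: learns column statistics in fit(), applies them in transform()."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        scale = X.std(axis=0)
        self.scale_ = np.where(scale == 0, 1.0, scale)   # guard constant columns
        return self                                       # returning self allows chaining

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) / self.scale_

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class SimpleLinearRegressor:
    """Hypothetical predictor: least-squares fit(), then predict()."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        design = np.column_stack([np.ones(len(X)), X])    # prepend intercept column
        params, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
        self.intercept_, self.coef_ = params[0], params[1:]
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coef_ + self.intercept_


# Toy data: price = 10 + 0.03 * area + 5 * bedrooms (no noise)
X_train = np.array([[500, 1], [1000, 2], [1500, 3], [2000, 3], [2500, 5]])
y_train = 10 + 0.03 * X_train[:, 0] + 5 * X_train[:, 1]
X_new = np.array([[1200, 2]])

scaler = SimpleStandardScaler().fit(X_train)              # statistics come from training data only
model = SimpleLinearRegressor().fit(scaler.transform(X_train), y_train)
prediction = model.predict(scaler.transform(X_new))       # reuse the same statistics on new data
print(prediction)
```

The point of the convention is that learned state (the trailing-underscore attributes) is set only in `fit`, so new data is transformed with training statistics and never with its own. This is the same behaviour a Scikit-Learn `Pipeline` provides automatically.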
12.2 Complete Preprocessing Pipeline
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

# Define column types
numeric_features = ['Area_sqft', 'Bedrooms', 'Distance_km']
categorical_features = ['City', 'Furnishing']

# Numeric pipeline: impute → scale
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline: impute → one-hot encode
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

# Combined preprocessor
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline: preprocess → model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Usage (X_train / X_test must be DataFrames containing the columns named above):
# full_pipeline.fit(X_train, y_train)
# predictions = full_pipeline.predict(X_test)
```
3.13 Indian Case Studies
Case Study 1: NSE/BSE Stock Data Analysis
```python
import pandas as pd
import numpy as np

# Simulated NSE stock data for demonstration
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=252, freq='B')   # Business days

nse_data = pd.DataFrame({
    'Date': dates,
    'RELIANCE': 2500 + np.random.normal(0, 30, 252).cumsum(),
    'TCS': 3600 + np.random.normal(0, 25, 252).cumsum(),
    'INFY': 1500 + np.random.normal(0, 20, 252).cumsum(),
    'HDFC': 1600 + np.random.normal(0, 22, 252).cumsum(),
    'Volume_Lakhs': np.random.randint(50, 500, 252)
})
nse_data.set_index('Date', inplace=True)

# Analysis
# 1. Daily returns
returns = nse_data[['RELIANCE', 'TCS', 'INFY', 'HDFC']].pct_change().dropna()
print("Average Daily Returns:\n", returns.mean().round(4))

# 2. Volatility (annualized std, assuming 252 trading days per year)
volatility = returns.std() * np.sqrt(252)
print("Annualized Volatility:\n", volatility.round(4))

# 3. Correlation between stocks
print("Stock Correlations:\n", returns.corr().round(3))

# 4. 20-day Moving Average
nse_data['RELIANCE_MA20'] = nse_data['RELIANCE'].rolling(20).mean()

# 5. Monthly resampling
# 'MS' labels each bin by month start; the 'M' alias is deprecated in recent pandas
monthly = nse_data[['RELIANCE']].resample('MS').agg({
    'RELIANCE': ['first', 'last', 'max', 'min']
})
print(monthly.head())
```
Case Study 2: Census of India Data Analysis
```python
import pandas as pd

# Simulated Indian Census data
census = pd.DataFrame({
    'State': ['Uttar Pradesh', 'Maharashtra', 'Bihar', 'West Bengal', 'Madhya Pradesh',
              'Tamil Nadu', 'Rajasthan', 'Karnataka', 'Gujarat', 'Andhra Pradesh'],
    'Population_Cr': [23.1, 12.3, 12.4, 9.7, 8.5, 7.7, 7.9, 6.4, 6.4, 5.3],
    'Decadal_Growth_%': [20.1, 15.9, 25.4, 13.8, 20.3, 15.6, 21.3, 15.7, 19.2, 11.0],
    'Literacy_%': [67.7, 82.3, 61.8, 76.3, 69.3, 80.1, 66.1, 75.4, 78.0, 67.0],
    'Sex_Ratio': [912, 929, 918, 950, 931, 996, 928, 973, 919, 993],
    'Urban_%': [22.3, 45.2, 11.3, 31.9, 27.6, 48.4, 24.9, 38.6, 42.6, 29.6]
})

# Key analyses
print("=== Statistical Summary ===")
print(census.describe())

print("\n=== Correlation: Literacy vs Sex Ratio ===")
print(f"r = {census['Literacy_%'].corr(census['Sex_Ratio']):.3f}")

print("\n=== States with Sex Ratio > 950 ===")
print(census[census['Sex_Ratio'] > 950][['State', 'Sex_Ratio', 'Literacy_%']])
# Insight: in this sample, the three states above 950 are all southern
# (Tamil Nadu, Karnataka, Andhra Pradesh); their literacy rates are mixed
```
Case Study 3: IMD Weather Data
```python
import pandas as pd

# Simulated IMD rainfall data (India Meteorological Department)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
cities_weather = pd.DataFrame({
    'Month': months,
    'Mumbai_mm': [0, 1, 0, 1, 20, 530, 840, 585, 340, 75, 15, 3],
    'Delhi_mm': [20, 18, 13, 7, 25, 54, 210, 233, 117, 14, 4, 8],
    'Chennai_mm': [36, 10, 8, 15, 35, 48, 93, 119, 118, 267, 356, 152],
    'Bangalore_mm': [2, 7, 4, 47, 120, 81, 110, 137, 195, 180, 65, 16]
})

# Total annual rainfall
for city in ['Mumbai_mm', 'Delhi_mm', 'Chennai_mm', 'Bangalore_mm']:
    print(f"{city}: {cities_weather[city].sum()} mm/year")

# Peak rainfall month per city
for city in ['Mumbai_mm', 'Delhi_mm', 'Chennai_mm']:
    peak_idx = cities_weather[city].idxmax()   # row label doubles as month position
    print(f"{city} peak: {months[peak_idx]} ({cities_weather[city].max()} mm)")
```
Case Study 4: Agricultural Production Data
```python
import pandas as pd

# Indian agricultural production (from data.gov.in patterns)
agri = pd.DataFrame({
    'Crop': ['Rice', 'Wheat', 'Maize', 'Sugarcane', 'Cotton', 'Pulses', 'Soybean'],
    'Production_MT': [130, 110, 33, 400, 35, 27, 12],   # Million tonnes
    'Area_Mha': [44, 30, 10, 5, 12, 25, 12],            # Million hectares
    'Top_State': ['WB', 'UP', 'KA', 'UP', 'GJ', 'MP', 'MP'],
    'MSP_Rs_Quintal': [2183, 2275, 2090, 315, 6620, 6600, 4600]
})

# Yield = Production / Area (million tonnes / million hectares = tonnes per hectare)
agri['Yield_T_per_Ha'] = (agri['Production_MT'] / agri['Area_Mha']).round(2)
print(agri.sort_values('Yield_T_per_Ha', ascending=False))
# Sugarcane has the highest yield (80 T/Ha); it is a high-input crop
```
3.14 Global Case Studies
Case Study 1: Google - TensorFlow at Scale
Google uses Python/TensorFlow across Search ranking, Gmail spam filters, Google Photos (object recognition), Google Translate (seq2seq models), and YouTube recommendations. Their Colab platform provides free GPU/TPU access to millions of developers worldwide.
Stack: Python 3.10 + TensorFlow 2.x + JAX + BigQuery + Vertex AI
Scale: Processes 8.5 billion searches/day, each touching 50+ ML models in the pipeline.
Case Study 2: Tesla - Real-Time Computer Vision
Tesla's Autopilot uses PyTorch for training vision models on 300+ petabytes of camera data. Their data pipeline: cameras → video decoding (C++) → labeling pipeline (Python/Pandas) → model training (PyTorch) → inference (C++/TensorRT). Python is the glue that connects all stages.
Case Study 3: Netflix - Recommendation Engine
Netflix saves $1B/year through its Python-powered recommendation system. Their stack includes: NumPy/SciPy for matrix factorization, Pandas for A/B test analysis, Scikit-Learn for collaborative filtering, and custom deep learning models in TensorFlow. They analyze 200+ million subscribers' viewing patterns.
Case Study 4: Amazon - Supply Chain ML
Amazon uses Python across: demand forecasting (Prophet + XGBoost), product recommendations (deep learning), Alexa NLP (PyTorch), warehouse robotics (ROS + Python), and fraud detection. Their SageMaker platform provides managed Jupyter notebooks for ML development.
Case Study 5: OpenAI - GPT & ChatGPT
ChatGPT and GPT-4 were built using Python + PyTorch. The training pipeline uses: tokenizers (Python), distributed training orchestration (Python), RLHF implementation (Python/NumPy), and the API server (Python/FastAPI). Much of the current AI ecosystem runs on Python.
3.15 Startup Applications
Indian AI Startups Using Python
| Startup | Domain | Python Usage | Funding |
|---|---|---|---|
| Razorpay | FinTech | Fraud detection with Scikit-Learn, payment analytics with Pandas | $740M |
| Ola | Mobility | Demand prediction, surge pricing, route optimization with NumPy | $3.8B |
| Swiggy | FoodTech | Delivery time prediction, restaurant recommendation engines | $3.6B |
| CRED | FinTech | Credit scoring models, user segmentation with Pandas/Sklearn | $800M |
| Meesho | E-commerce | Product categorization, image search with TensorFlow | $1.1B |
| PhonePe | Payments | Transaction anomaly detection, user behavior modeling | $12B valuation |
| Nykaa | Beauty/E-com | Personalized recommendations, review sentiment analysis | IPO 2021 |
| Zerodha | Stock Trading | Market analytics, algo trading signals, Kite API (Python SDK) | Bootstrapped |
3.16 Government Applications
Python in Indian Government Projects
1. Aadhaar Data Processing
UIDAI processes 1.4 billion biometric records. Python is used for: data quality checks (Pandas), demographic deduplication algorithms, enrollment analytics dashboards, and API response monitoring. The authentication system handles 100M+ requests/day.
2. UPI Transaction Analytics
NPCI uses Python for analyzing 10+ billion monthly UPI transactions. Pandas DataFrames process transaction logs for: fraud pattern detection, merchant analytics, peak usage analysis, and regulatory reporting to RBI.
3. ISRO Satellite Data Processing
ISRO uses Python (NumPy/SciPy) for: satellite image processing, orbit calculations, remote sensing data analysis (Bhuvan platform), and weather prediction models. The Chandrayaan missions used Python scripts for trajectory optimization.
4. data.gov.in - Open Data Platform
Python

    # Accessing Indian government open data
    import pandas as pd

    # Agricultural data from data.gov.in
    # url = 'https://data.gov.in/resource/...'
    # df = pd.read_csv(url)  # Direct CSV download

    # Example: Loading Indian crop production data
    # df = pd.read_csv('IndiaAgricultureCropProduction.csv')
    # print(df.groupby('State')['Production'].sum().nlargest(10))

    # Jupyter Notebooks are used by IIT researchers for
    # government data analysis projects under NITI Aayog

    # Google Colab setup for government data analysis:
    # 1. Go to colab.research.google.com
    # 2. New notebook → Runtime → Change runtime type → GPU
    # 3. Upload CSV or connect to Google Drive
    # 4. !pip install any-needed-library
5. CoWIN Vaccination Analytics
During COVID-19, the CoWIN platform used Python for: real-time vaccination tracking (200M+ doses/month), slot availability prediction, demographic analysis of vaccination coverage, and AEFI (adverse events) monitoring using Pandas time-series analysis.
3.17 Industry Applications
Python Data Science Stack by Industry
| Industry | Use Case | Libraries | Indian Example |
|---|---|---|---|
| Banking | Credit scoring, fraud detection | Pandas, XGBoost, Sklearn | SBI, ICICI risk models |
| Healthcare | Medical image analysis, drug discovery | TensorFlow, OpenCV, RDKit | Practo, Niramai (breast cancer AI) |
| Retail | Demand forecasting, inventory optimization | Prophet, Pandas, Plotly | Flipkart, Reliance Retail |
| Telecom | Churn prediction, network optimization | Sklearn, PySpark, Keras | Jio, Airtel analytics |
| Manufacturing | Predictive maintenance, quality control | SciPy, NumPy, TensorFlow | Tata Steel, Mahindra |
| Agriculture | Crop prediction, soil analysis | Sklearn, GeoPandas, Rasterio | CropIn, DeHaat |
| Education | Adaptive learning, performance prediction | Pandas, Sklearn, NLP libs | BYJU'S, Unacademy |
| Logistics | Route optimization, ETA prediction | NetworkX, OR-Tools, Sklearn | Delhivery, BlueDart |
3.18 Mini Projects
Mini Project 1: Indian Census EDA Dashboard
Python

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np

    # === MINI PROJECT: Indian State Census Dashboard ===

    # 1. Create comprehensive dataset
    census = pd.DataFrame({
        'State': ['UP','MH','Bihar','WB','MP','TN','RJ','KA','GJ','AP',
                  'KL','TG','Delhi','Punjab','Haryana'],
        'Population_Cr': [23.1,12.3,12.4,9.7,8.5,7.7,7.9,6.4,6.4,5.3,
                          3.5,3.7,1.9,2.8,2.5],
        'Literacy': [67.7,82.3,61.8,76.3,69.3,80.1,66.1,75.4,78.0,67.0,
                     94.0,72.8,86.2,75.8,75.6],
        'Sex_Ratio': [912,929,918,950,931,996,928,973,919,993,
                      1084,988,868,895,879],
        'GDP_Lakh_Cr': [17.9,32.2,6.1,12.5,9.1,19.4,10.2,16.9,16.7,10.1,
                        9.8,11.5,9.0,5.5,8.2],
        'Region': ['North','West','East','East','Central','South','North','South',
                   'West','South','South','South','North','North','North']
    })
    census['Per_Capita'] = (census['GDP_Lakh_Cr']/census['Population_Cr']).round(2)

    # 2. Create 6-panel dashboard
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('Indian Census EDA Dashboard', fontsize=20, fontweight='bold')

    # Panel 1: Population bar chart
    census_sorted = census.sort_values('Population_Cr', ascending=True)
    axes[0,0].barh(census_sorted['State'], census_sorted['Population_Cr'], color='#059669')
    axes[0,0].set_title('Population (Crores)')

    # Panel 2: Literacy distribution
    axes[0,1].hist(census['Literacy'], bins=10, color='#0891b2', edgecolor='white')
    axes[0,1].axvline(census['Literacy'].mean(), color='red', linestyle='--')
    axes[0,1].set_title('Literacy Rate Distribution')

    # Panel 3: Scatter of Literacy vs Sex Ratio
    scatter = axes[0,2].scatter(census['Literacy'], census['Sex_Ratio'],
                                s=census['Population_Cr']*20, c=census['Per_Capita'],
                                cmap='viridis', alpha=0.8, edgecolors='white')
    axes[0,2].set_xlabel('Literacy %'); axes[0,2].set_ylabel('Sex Ratio')
    axes[0,2].set_title('Literacy vs Sex Ratio')

    # Panel 4: Region-wise GDP
    regional_gdp = census.groupby('Region')['GDP_Lakh_Cr'].sum()
    axes[1,0].pie(regional_gdp, labels=regional_gdp.index, autopct='%1.1f%%',
                  colors=['#059669','#0891b2','#f59e0b','#ef4444','#8b5cf6'])
    axes[1,0].set_title('GDP Share by Region')

    # Panel 5: Box plot of Per-Capita GDP by Region
    regions = census['Region'].unique()
    box_data = [census[census['Region']==r]['Per_Capita'].values for r in regions]
    # Note: Matplotlib 3.9+ renames the `labels` argument to `tick_labels`
    axes[1,1].boxplot(box_data, labels=regions)
    axes[1,1].set_title('Per-Capita GDP by Region')

    # Panel 6: Correlation heatmap
    corr = census[['Population_Cr','Literacy','Sex_Ratio','GDP_Lakh_Cr','Per_Capita']].corr()
    sns.heatmap(corr, annot=True, cmap='RdYlGn', center=0, ax=axes[1,2], fmt='.2f')
    axes[1,2].set_title('Feature Correlations')

    plt.tight_layout()
    plt.savefig('census_dashboard.png', dpi=150, bbox_inches='tight')
    plt.show()

    # 3. Key Insights Report
    print("="*50)
    print("KEY INSIGHTS FROM INDIAN CENSUS EDA")
    print("="*50)
    print(f"1. Highest literacy: {census.loc[census['Literacy'].idxmax(),'State']} ({census['Literacy'].max()}%)")
    print(f"2. Lowest sex ratio: {census.loc[census['Sex_Ratio'].idxmin(),'State']} ({census['Sex_Ratio'].min()})")
    print(f"3. Literacy-Sex Ratio correlation: {census['Literacy'].corr(census['Sex_Ratio']):.3f}")
    print(f"4. Highest per-capita GDP: {census.loc[census['Per_Capita'].idxmax(),'State']}")
    print(f"5. South region avg literacy: {census[census['Region']=='South']['Literacy'].mean():.1f}%")
Mini Project 2: Stock Market Data Analyzer
Python

    import zlib

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    # === MINI PROJECT: NSE Stock Analyzer ===

    class StockAnalyzer:
        """Analyze Indian stock market data using Python data science stack."""

        def __init__(self, ticker, days=252):
            self.ticker = ticker
            # Built-in hash() of a str changes between Python processes,
            # so derive the seed from a stable checksum for reproducible data
            np.random.seed(zlib.crc32(ticker.encode()) % 2**31)
            self.dates = pd.date_range('2024-01-01', periods=days, freq='B')
            base = {'RELIANCE':2500, 'TCS':3600, 'INFY':1500, 'HDFC':1600}
            start = base.get(ticker, 1000)
            # Simulated random-walk prices, not real NSE data
            self.data = pd.DataFrame({
                'Date': self.dates,
                'Close': start + np.random.normal(0.5, 25, days).cumsum(),
                'Volume': np.random.randint(1_00_000, 50_00_000, days)
            }).set_index('Date')

        def daily_returns(self):
            self.data['Returns'] = self.data['Close'].pct_change()
            return self.data['Returns'].describe()

        def moving_averages(self):
            self.data['MA20'] = self.data['Close'].rolling(20).mean()
            self.data['MA50'] = self.data['Close'].rolling(50).mean()

        def volatility(self):
            # Annualize the daily standard deviation using 252 trading days
            if 'Returns' not in self.data.columns:
                self.daily_returns()
            return self.data['Returns'].std() * np.sqrt(252)

        def plot_dashboard(self):
            self.daily_returns()
            self.moving_averages()
            fig, axes = plt.subplots(2, 2, figsize=(14, 10))
            fig.suptitle(f'{self.ticker} - Stock Analysis Dashboard', fontsize=16)

            # Price + Moving Averages
            axes[0,0].plot(self.data['Close'], label='Close', alpha=0.7)
            axes[0,0].plot(self.data['MA20'], label='MA20', linewidth=2)
            axes[0,0].plot(self.data['MA50'], label='MA50', linewidth=2)
            axes[0,0].legend(); axes[0,0].set_title('Price & Moving Averages')

            # Returns distribution
            axes[0,1].hist(self.data['Returns'].dropna(), bins=30, color='#0891b2', edgecolor='white')
            axes[0,1].set_title('Daily Returns Distribution')

            # Volume
            axes[1,0].bar(self.data.index, self.data['Volume'], width=1, color='#059669', alpha=0.6)
            axes[1,0].set_title('Trading Volume')

            # Cumulative returns
            cum_ret = (1 + self.data['Returns']).cumprod()
            axes[1,1].plot(cum_ret, color='#f59e0b', linewidth=2)
            axes[1,1].set_title('Cumulative Returns')

            plt.tight_layout()
            plt.show()

    # Usage
    analyzer = StockAnalyzer('RELIANCE')
    print(analyzer.daily_returns())
    print(f"Annualized Volatility: {analyzer.volatility():.2%}")
    analyzer.plot_dashboard()
Mini Project 3: Jupyter/Colab EDA Template
Python

    # === Reusable EDA Function Library ===
    # Save as eda_utils.py and import in any Jupyter notebook

    def quick_eda(df, target_col=None):
        """Perform comprehensive EDA on any DataFrame."""
        import pandas as pd
        import matplotlib.pyplot as plt
        import seaborn as sns

        print("="*60)
        print("AUTOMATED EDA REPORT")
        print("="*60)

        # 1. Shape and types
        print(f"\nShape: {df.shape[0]} rows × {df.shape[1]} columns")
        print(f"\nData Types:\n{df.dtypes.value_counts()}")

        # 2. Missing values
        missing = df.isnull().sum()
        if missing.sum() > 0:
            print(f"\nMissing Values:\n{missing[missing > 0]}")
        else:
            print("\nNo missing values!")

        # 3. Numeric summary
        print(f"\nNumeric Summary:\n{df.describe().round(2)}")

        # 4. Categorical summary
        cat_cols = df.select_dtypes(include=['object','category']).columns
        for col in cat_cols:
            print(f"\n{col}: {df[col].nunique()} unique values")
            print(df[col].value_counts().head(5))

        # 5. Auto-generate plots (histograms of up to four numeric columns)
        num_cols = df.select_dtypes(include=['number']).columns
        n = len(num_cols)
        if n > 0:
            fig, axes = plt.subplots(1, min(n, 4), figsize=(5*min(n,4), 4))
            if n == 1:
                axes = [axes]
            for i, col in enumerate(num_cols[:4]):
                axes[i].hist(df[col].dropna(), bins=20, color='#059669', edgecolor='white')
                axes[i].set_title(col)
            plt.tight_layout(); plt.show()

        return {'shape': df.shape, 'missing': missing.sum(), 'num_cols': len(num_cols)}

    # Usage: quick_eda(pd.read_csv('any_dataset.csv'))
3.19 End-of-Chapter Exercises
3.20 Multiple Choice Questions
What does df.groupby('State')['Sales'].transform('mean') return?
transform() returns a result with the same shape as the input, broadcasting each group's aggregate back to all members. Unlike agg(), which reduces rows, transform() preserves the original index.
What does fit_transform(X) do?
fit_transform(X) is equivalent to fit(X).transform(X) but potentially more efficient. It learns parameters (e.g., mean/std for StandardScaler) from X and then applies the transformation.
@tf.function to opt into graph mode for production performance.
What is the result of np.array([1,2,3]) * np.array([1,2,3])?
The * operator performs element-wise multiplication (Hadamard product). For dot product, use np.dot() or @. For outer product, use np.outer().
df.fillna(df.mean()) replaces NaN values in each numeric column with that column's mean. df.mean() calculates column-wise means, and fillna() broadcasts appropriately. If the DataFrame also has non-numeric columns, recent pandas versions need df.mean(numeric_only=True).
What does plt.tight_layout() do?
tight_layout() automatically adjusts padding between and around subplots so that labels, titles, and tick marks don't overlap. Always call it before show() or savefig().
What does df.pivot_table(values='Sales', index='State', columns='Quarter', aggfunc='sum') create?
pivot_table() creates a spreadsheet-style pivot table. Each cell contains the aggregate (sum) of Sales for that specific State-Quarter combination. Missing combinations show NaN.
3.21 Interview Questions
Explain the difference between loc and iloc in Pandas.
Answer: loc uses label-based indexing (e.g., df.loc['row_label', 'col_name']), while iloc uses integer position-based indexing (e.g., df.iloc[0, 2]). Key difference: loc includes the end of the slice, iloc excludes it (like Python ranges).
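The slicing difference is easy to check on a toy frame. The sketch below uses made-up values and state abbreviations as the index; it is only an illustration of the rule stated in the answer.

```python
import pandas as pd

# Hypothetical toy data: three states indexed by abbreviation
df = pd.DataFrame(
    {"Literacy": [67.7, 82.3, 61.8], "Sex_Ratio": [912, 929, 918]},
    index=["UP", "MH", "BR"],
)

# Label-based slice: the end label "MH" is included
by_label = df.loc["UP":"MH", "Literacy"]

# Position-based slice: position 1 is excluded, like range(0, 1)
by_position = df.iloc[0:1, 0]

print(len(by_label))      # number of rows selected by loc
print(len(by_position))   # number of rows selected by iloc
```

Here the loc slice returns two rows and the iloc slice returns one, even though both slices look like "start to second item".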
Answer: Vectorized operations (NumPy/Pandas) leverage: (1) contiguous memory access (cache-friendly), (2) SIMD CPU instructions that process multiple elements per clock cycle, (3) C-level loops that avoid Python's interpreter overhead (often cited as ~100× faster, though the actual speedup depends on the operation and array size), (4) optimized BLAS/LAPACK libraries for linear algebra.
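A rough way to see the gap is to time the same computation both ways. This is a minimal sketch on random data; the array size and the sum-of-squares task are arbitrary choices, and the measured ratio will vary by machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
values = rng.random(200_000)

def loop_sum_of_squares(arr):
    """Pure-Python loop: one interpreter step per element."""
    total = 0.0
    for x in arr:
        total += x * x
    return total

start = time.perf_counter()
loop_result = loop_sum_of_squares(values)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vector_result = float(np.dot(values, values))  # single C-level call
vector_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vector_seconds:.6f}s")
print("same answer:", np.isclose(loop_result, vector_result))
```

Both versions compute the same number; only the time taken differs.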
Answer: Strategy depends on context: (1) If the data is MCAR (Missing Completely at Random), consider dropping the feature only if more than 50% is missing, (2) Create a binary "is_missing" indicator feature, (3) Use model-based imputation (KNN, IterativeImputer), (4) For categorical: add "Unknown" category, (5) For time-series: forward/backward fill. Never just drop 40% of data without analysis.
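Strategies (2), (3), and (4) can be sketched in a few lines. The column names and values below are invented for illustration, and KNNImputer with two neighbours is just one reasonable choice of model-based imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "income": [52.0, np.nan, 61.0, np.nan, 58.0, 49.0],
    "age": [31, 45, 38, 29, 41, 35],
    "city": ["Pune", None, "Delhi", "Pune", None, "Chennai"],
})

# (2) Record where values were missing before filling them
df["income_is_missing"] = df["income"].isna().astype(int)

# (4) Categorical gaps become an explicit "Unknown" category
df["city"] = df["city"].fillna("Unknown")

# (3) Model-based imputation: fill income from the nearest rows by age
num_cols = ["income", "age"]
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df[num_cols]),
    columns=num_cols,
    index=df.index,
)
df["income"] = imputed["income"]

print(df)
```

In a real project the imputer would be fitted on the training split only, for the leakage reasons discussed in the next answer.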
Answer: Data leakage occurs when test data information "leaks" into training. Example: scaling the entire dataset before train/test split means the scaler learned test statistics. Pipelines prevent this by ensuring fit() is called only on training data, and transform() applies the same learned parameters to test data.
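The point about pipelines can be verified directly: after fitting, the scaler inside the pipeline holds statistics of the training split only. The sketch below uses synthetic data and an arbitrary labelling rule; the split ratio and random seeds are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and a simple rule-based label (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] + X[:, 1] > 100).astype(int)

# Split first, so the test rows never influence any fitted parameter
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The pipeline calls fit() on the scaler with training data only
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

scaler = model.named_steps["standardscaler"]
learned_from_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
test_accuracy = model.score(X_test, y_test)

print("scaler mean equals training mean:", learned_from_train_only)
print(f"test accuracy: {test_accuracy:.3f}")
```

Calling model.score(X_test, y_test) reuses the training mean and standard deviation to transform the test rows, which is the behaviour the answer describes.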
Answer: Broadcasting allows arithmetic between arrays of different shapes by automatically expanding the smaller array. Rules: (1) Align shapes from the right, (2) Dimensions must match or be 1, (3) Size-1 dimensions are "stretched." Example: shape (3,4) + shape (4,) → (4,) becomes (1,4) → stretched to (3,4).
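The (3,4) + (4,) case from the answer, plus the case that needs an explicit reshape, looks like this in NumPy. The numbers are arbitrary.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])          # shape (3, 4)
col_offsets = np.array([10, 20, 30, 40])    # shape (4,)

# (4,) is treated as (1, 4) and stretched down the 3 rows
result = grid + col_offsets

# A per-row offset has shape (3,), which does not align from the right
# with (3, 4); adding a trailing axis gives (3, 1), which does
row_offsets = np.array([100, 200, 300])
result_rows = grid + row_offsets[:, np.newaxis]

print(result.shape, result_rows.shape)
print(result[0], result_rows[2])
```

Adding grid + row_offsets without the reshape would raise a ValueError, because the trailing dimensions 4 and 3 neither match nor equal 1.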
Answer: (1) Use vectorized operations instead of apply(), (2) Use appropriate dtypes (category for repeated strings, int32 instead of int64 when the value range allows), (3) Use pd.eval() or query() for complex expressions, (4) Consider chunked reading with the chunksize parameter, (5) Switch to Polars or Dask for larger-than-memory computation, (6) Use numba or swifter for custom functions.
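Points (1) and (2) are the easiest to measure. The following sketch builds a synthetic frame (column names and sizes are assumptions) and compares memory before and after changing dtypes, then checks that a vectorized expression matches the apply() version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100_000
df = pd.DataFrame({
    "state": rng.choice(["UP", "MH", "TN", "KA"], size=n),  # repeated strings
    "units": rng.integers(0, 1000, size=n),                 # int64 by default
})

bytes_before = df.memory_usage(deep=True).sum()

# (2) Cheaper dtypes: few distinct strings -> category, small ints -> int32
df["state"] = df["state"].astype("category")
df["units"] = df["units"].astype("int32")

bytes_after = df.memory_usage(deep=True).sum()

# (1) Vectorized arithmetic gives the same values as a row-wise apply()
vectorized = df["units"] * 2
applied = df["units"].apply(lambda u: u * 2)
same_values = bool((vectorized.to_numpy() == applied.to_numpy()).all())

print(f"memory: {bytes_before:,} -> {bytes_after:,} bytes")
print("vectorized matches apply:", same_values)
```

The category conversion helps most when a column has few distinct values relative to its length; for nearly unique strings it can cost more memory, not less.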
What is the difference between tf.constant and tf.Variable?
Answer: tf.constant creates an immutable tensor (cannot be changed after creation), used for input data and fixed hyperparameters. tf.Variable creates a mutable tensor, used for trainable parameters like weights and biases that are updated during gradient descent.
Answer: All Scikit-Learn estimators follow a consistent API: fit(X, y) learns parameters from training data, predict(X) makes predictions (classifiers/regressors), transform(X) transforms data (preprocessors), fit_transform(X) combines fit+transform, score(X, y) evaluates performance. This consistency allows pipeline chaining.
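To make the contract concrete, here is a minimal sketch of a custom preprocessor that follows it. MeanCenterer is a hypothetical class written for this example, not part of Scikit-Learn; it relies on TransformerMixin to supply fit_transform().

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: subtracts the column means learned in fit()."""

    def fit(self, X, y=None):
        # Learned parameters get a trailing underscore by convention
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self allows fit(X).transform(X) chaining

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_

X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_new = np.array([[2.0, 20.0]])

centerer = MeanCenterer()
train_centered = centerer.fit_transform(X_train)  # fit + transform in one call
new_centered = centerer.transform(X_new)          # reuses the learned means

print(centerer.mean_)
print(train_centered)
print(new_centered)
```

Because the class exposes fit() and transform(), it can be placed in a Pipeline next to built-in steps such as StandardScaler.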
Answer: (1) Create per-sqft price from total price/area, (2) Extract floor category (ground, low, mid, high), (3) Distance to nearest metro station, (4) Encode locality as mean target encoding, (5) Age of building from year built, (6) Create boolean features: has_parking, has_gym, near_hospital, (7) Season of listing (festive season premium in India).
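A few of these features can be sketched with Pandas. The listings, column names, floor bins, and the reference year 2024 below are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical listings; every value here is made up
listings = pd.DataFrame({
    "price_lakh": [95.0, 150.0, 60.0, 120.0],
    "area_sqft": [650, 1000, 500, 800],
    "floor": [0, 3, 12, 25],
    "year_built": [2005, 2018, 2012, 2021],
    "locality": ["Andheri", "Bandra", "Andheri", "Bandra"],
})

# (1) Price per square foot in rupees (1 lakh = 100,000)
listings["price_per_sqft"] = listings["price_lakh"] * 100_000 / listings["area_sqft"]

# (2) Floor category from assumed bin edges
listings["floor_category"] = pd.cut(
    listings["floor"],
    bins=[-1, 0, 4, 15, 200],
    labels=["ground", "low", "mid", "high"],
).astype(str)

# (5) Building age relative to an assumed reference year
REFERENCE_YEAR = 2024
listings["building_age"] = REFERENCE_YEAR - listings["year_built"]

# (4) Mean target encoding of locality (mean price per locality)
listings["locality_mean_price"] = (
    listings.groupby("locality")["price_lakh"].transform("mean")
)

print(listings)
```

In practice the target encoding in step (4) should be computed on training data only and then mapped onto validation and test rows; otherwise it leaks the target.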
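What is the difference between deep copy and shallow copy in Python? Why does it matter for DataFrames?
Answer: Plain assignment (df2 = df) makes no copy at all: both names refer to the same object, so a change made through one is visible through the other. A shallow copy (df.copy(deep=False)) creates a new DataFrame object that shares the underlying data with the original, so in-place changes to values can show up in both; under pandas Copy-on-Write, which the pandas documentation describes as the default from pandas 3.0, a write triggers a real copy instead. A deep copy (df.copy(), where deep=True is the default) creates an entirely independent copy of the data and index. This matters in data science when you create modified versions of datasets: use a deep copy whenever the original must stay untouched.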
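The two behaviours that hold across pandas versions, aliasing by assignment and independence of a deep copy, can be checked as follows. The values are made up.

```python
import pandas as pd

original = pd.DataFrame({"Literacy": [67.7, 82.3]})

alias = original               # no copy: a second name for the same object
independent = original.copy()  # deep copy (deep=True is the default)

# Editing the deep copy leaves the original untouched
independent.loc[0, "Literacy"] = 0.0

# Editing through the alias edits the original, because it is the original
alias.loc[1, "Literacy"] = 99.9

print(alias is original)
print(original["Literacy"].tolist())
print(independent["Literacy"].tolist())
```

Shallow copies (deep=False) are left out of the sketch on purpose, since whether a write propagates depends on the pandas version and the Copy-on-Write setting.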
3.22 Research Problems
Research Problem 1: Benchmarking Python Data Science Libraries
Objective: Systematically compare the performance of Pandas vs Polars vs DuckDB vs PySpark for common EDA operations on Indian government datasets of varying sizes (100K, 1M, 10M, 100M rows).
Tasks: (1) Benchmark: read_csv, groupby-aggregate, join, sort, filter, and pivot operations, (2) Measure memory usage, CPU utilization, and wall-clock time, (3) Test with real data from data.gov.in, (4) Identify crossover points where one library becomes better than another, (5) Publish results with reproducible Jupyter notebooks.
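A starting point for task (1) might be a small timing harness like the one below. It covers Pandas only, on synthetic data at small sizes; time_operation and make_dataset are hypothetical helpers, and a real study would add the other libraries, memory measurements, and actual data.gov.in files.

```python
import time
import numpy as np
import pandas as pd

def time_operation(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

def make_dataset(n_rows, seed=0):
    """Synthetic stand-in for a crop production table (assumed columns)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "State": rng.choice(["UP", "MH", "TN", "KA", "WB"], size=n_rows),
        "Production": rng.random(n_rows) * 100,
    })

results = []
for n_rows in (1_000, 10_000, 100_000):
    df = make_dataset(n_rows)
    results.append({
        "rows": n_rows,
        "groupby_s": time_operation(lambda: df.groupby("State")["Production"].sum()),
        "sort_s": time_operation(lambda: df.sort_values("Production")),
        "filter_s": time_operation(lambda: df[df["Production"] > 50]),
    })

report = pd.DataFrame(results)
print(report)
```

Taking the minimum over several repeats reduces noise from other processes; reporting the median as well would give a fuller picture.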
Impact: Help Indian data scientists choose the right tool for their data scale. Could be published in venues like PyCon India or the Journal of Statistical Software.
Research Problem 2: Auto-EDA for Indian Multilingual Datasets
Objective: Build an automated EDA tool that handles datasets with Indian language text (Hindi, Tamil, Telugu, etc.) alongside English and numeric data. Current auto-EDA tools (pandas-profiling, sweetviz) fail on mixed-script data.
Challenges: Unicode handling for Devanagari/Dravidian scripts, mixed-language column detection, appropriate NLP preprocessing (tokenization differs by language), and culturally relevant statistical summaries (e.g., Indian number system: lakhs, crores).
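Two of these challenges, script detection and Indian digit grouping, can be prototyped with the standard library alone. The helpers below are hypothetical and deliberately rough: dominant_script only looks at the first word of each character's Unicode name, and format_indian handles non-negative integers only.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Guess the dominant script from Unicode character names (rough heuristic)."""
    counts = Counter()
    for ch in text:
        # Count letters and combining marks (vowel signs, virama, etc.)
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1  # e.g. "DEVANAGARI", "TAMIL", "LATIN"
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

def format_indian(n):
    """Format a non-negative integer with Indian grouping, e.g. 12,34,567."""
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])  # groups of two after the last three digits
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])

for sample in ["नमस्ते", "வணக்கம்", "hello"]:
    print(sample, "->", dominant_script(sample))

print(format_indian(1234567))
print(format_indian(10000000))  # one crore
```

A production tool would need a proper script database and per-language tokenizers; this sketch only shows that mixed-script columns can be flagged cheaply before heavier NLP steps run.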
Research Problem 3: Energy-Efficient ML Pipeline Design
Objective: Measure and optimize the carbon footprint of Python ML training pipelines. Given India's growing AI compute, understanding energy consumption is critical.
Approach: Use the CodeCarbon library to measure CO₂ emissions per training run. Compare: (1) NumPy vs CuPy (GPU) for preprocessing, (2) TensorFlow vs PyTorch energy efficiency, (3) Impact of batch size, model size, and data pipeline design on energy. Propose a "green ML" scoring system for Indian AI labs.
3.23 Key Takeaways
3.24 References & Further Reading
Textbooks
- McKinney, W. (2022). Python for Data Analysis, 3rd Edition. O'Reilly Media. The definitive Pandas reference by its creator.
- VanderPlas, J. (2023). Python Data Science Handbook, 2nd Edition. O'Reilly Media. Covers NumPy, Pandas, Matplotlib, and Scikit-Learn comprehensively.
- Géron, A. (2023). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition. O'Reilly. Best practical ML book.
- Raschka, S., Liu, Y. & Mirjalili, V. (2022). Machine Learning with PyTorch and Scikit-Learn. Packt Publishing.
Online Resources
- NumPy Official Documentation: numpy.org/doc
- Pandas Documentation: pandas.pydata.org/docs
- Scikit-Learn User Guide: scikit-learn.org/stable
- TensorFlow Tutorials: tensorflow.org/tutorials
- Kaggle Learn (free micro-courses): kaggle.com/learn
- Google Colab: colab.research.google.com
Indian Data Sources
- Open Government Data Platform India: data.gov.in
- Census of India: censusindia.gov.in
- Reserve Bank of India Data: rbi.org.in
- India Meteorological Department: mausam.imd.gov.in
- NSE India: nseindia.com
- NITI Aayog Data Portal: niti.gov.in
Global Data Sources
- Kaggle Datasets: kaggle.com/datasets
- UCI Machine Learning Repository: archive.ics.uci.edu
- World Bank Open Data: data.worldbank.org
- WHO Global Health Observatory: who.int/data/gho
- Google Dataset Search: datasetsearch.research.google.com
Research Papers
- Harris, C. R. et al. (2020). "Array programming with NumPy." Nature, 585(7825), 357-362.
- Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." JMLR, 12, 2825-2830.
- Abadi, M. et al. (2016). "TensorFlow: A System for Large-Scale Machine Learning." OSDI, 265-283.
- McKinney, W. (2011). "pandas: a Foundational Python Library for Data Analysis and Statistics." PyHPC.
Chapter 3 Complete!
You now have the complete Python data science toolkit. In Chapter 4, we'll apply these tools to build your first ML models: Linear Regression and Classification from scratch.