Neural Networks & Deep Learning

Chapter 21: Applied Deep Learning — NLP and Sequence Projects

From Hindi Tweets to Supreme Court Judgments — Building NLP Systems for Bharat

⏱️ Reading Time: ~4 hours | 📖 Unit VII: Applications & Industry | 🧠 Project-Based Chapter

📋 Prerequisites: Chapter 14 (LSTMs & GRUs), Chapter 15 (Transformers & Attention), Chapter 17 (Transfer Learning)

Bloom's Taxonomy Map for This Chapter

Bloom's Level	What You'll Achieve
🔵 Remember	Recall BERT architecture, tokenization strategies (BPE, WordPiece, SentencePiece), evaluation metrics (BLEU, ROUGE, F1), and the key differences between code-mixed and monolingual NLP
🔵 Understand	Explain why multilingual BERT outperforms monolingual models on Hinglish, how attention mechanisms enable alignment in NMT, and why abstractive summarization requires a decoder
🟢 Apply	Implement 5 complete NLP projects using HuggingFace Transformers: sentiment analysis, summarization, chatbot, fake news detection, and machine translation
🟡 Analyze	Diagnose tokenization failures in Devanagari text, analyze attention maps to find translation alignment errors, and compare ROUGE scores across extractive vs. abstractive summaries
🟠 Evaluate	Assess whether a BLEU score of 22 is acceptable for Hindi→English translation, judge model bias in fake news classifiers, and critique chatbot response quality using human evaluation
🔴 Create	Design and build an end-to-end multilingual NLP pipeline that handles code-mixed Indian language input, including data preprocessing, model selection, and deployment considerations

Section 1

Learning Objectives

By the end of this chapter, you will be able to:

Fine-tune multilingual BERT (mBERT) and IndicBERT for Hindi/Hinglish sentiment analysis on code-mixed social media text
Build an abstractive summarization pipeline for Indian legal documents using encoder-decoder Transformers
Implement a Seq2Seq chatbot with attention mechanism for handling bilingual booking queries (IRCTC-style)
Train a BERT-based fake news classifier for Indian WhatsApp misinformation with domain-specific preprocessing
Construct a Hindi↔English Neural Machine Translation system using Transformers, compute BLEU scores, and visualize cross-attention alignments
Compare Indian multilingual NLP challenges with English-only equivalents in terms of data availability, tokenization, and evaluation metrics
Evaluate model performance using BLEU, ROUGE-L, F1, precision, recall, and conduct error analysis on failure cases
Design deployment strategies for Indian language NLP models, considering script diversity, code-switching, and resource constraints

Section 2

Opening Hook — The Language of a Billion

🗣️ "Yeh train kitne baje aayegi?" — When AI Meets India's Languages

India has 22 scheduled languages and 19,500 dialects. More than 500 million Indians access the internet primarily in languages other than English. When Ravi in Varanasi tweets "Bahut bakwas movie thi 👎 waste of money," he's writing in Hinglish — a fluid mix of Hindi and English, with Devanagari sometimes sprinkled in: "बहुत बकवास movie थी." No standard spelling. No clear language boundary. No neat grammar rules.

Building NLP that works for India is one of the hardest problems in AI. This chapter tackles it head-on.

In 2020, when COVID misinformation flooded WhatsApp groups across India — "गर्म पानी पीने से corona ठीक होता है" (drinking hot water cures corona) — there was no reliable Hindi fake news detector. The tools built for English Twitter couldn't handle code-mixed Devanagari-Roman text. People died because of information gaps.

Meanwhile, India's Supreme Court produces 30,000+ judgments annually in a mix of English and Hindi legal prose so dense that even lawyers struggle to extract key holdings. Could a Transformer summarize a 50-page judgment into 5 crisp paragraphs?

This chapter is not a toy exercise. You will build 5 real NLP systems that solve real Indian problems — and learn to compare them with their English-only global counterparts. By the end, you'll understand why multilingual NLP is fundamentally harder, mathematically richer, and more impactful than monolingual NLP.

BERT | HuggingFace | IndicNLP | mBERT | Seq2Seq | BLEU | ROUGE

Section 3

The Intuition First — Why Indian NLP is Hard Mode

The Postman Analogy

Imagine you're a postman in London. Every letter has a clearly written English address. Simple. Now imagine you're a postman in Mumbai. Some addresses are in English, some in Hindi, some in Marathi, some are a wild mix — "Flat 302, तीसरी मंजिल, Sunshine Apartments, Andheri (W)" — and half of them have creative spellings because there's no standardized transliteration. That's what an NLP model faces when processing Indian text.

The Five Core Challenges

Why Monolingual NLP Breaks on Indian Text

1. Code-Mixing (Code-Switching)

Indians don't separate languages cleanly. A single sentence freely mixes Hindi words, English words, and sometimes Devanagari script mid-sentence: "Main kal meeting mein bahut nervous tha." (I was very nervous in the meeting yesterday.) A model trained only on English associates "Main" with the English adjective meaning "primary," not with the Hindi pronoun "I."

2. Script Diversity

Hindi alone appears in two scripts: Devanagari (मैं) and Roman transliteration (main). Same word, completely different byte sequences. A model trained on one won't recognize the other without explicit handling.
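To make the "different byte sequences" point concrete, the short sketch below compares the UTF-8 encodings of the two spellings. The variable names are illustrative only.

```python
# Same Hindi pronoun in two scripts: compare the underlying UTF-8 bytes.
devanagari = "मैं"   # Devanagari spelling
roman = "main"       # Roman transliteration

dev_bytes = devanagari.encode("utf-8")
rom_bytes = roman.encode("utf-8")

# Code points vs. bytes: each Devanagari code point takes 3 bytes in UTF-8,
# each ASCII letter takes 1.
print(len(devanagari), len(dev_bytes))
print(len(roman), len(rom_bytes))

# The two encodings share no byte values at all, so nothing at the
# character or byte level tells a model that these are the same word.
shared = set(dev_bytes) & set(rom_bytes)
print(shared)
```

Any link between the two spellings therefore has to come from training data or from an explicit transliteration step.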

3. Morphological Richness

Hindi is morphologically richer than English. The verb "खाना" (to eat) inflects as: खाता, खाती, खाते, खाऊँगा, खाऊँगी, खाया, खायी, खाये... English has five forms (eat, eats, ate, eaten, eating). Hindi has dozens, because its verbs agree in gender, number, and person.

4. Resource Scarcity

English has billions of labeled NLP samples. Hindi has orders of magnitude less. Hinglish has almost none in clean, curated form. This makes transfer learning not just useful — it's essential.

5. No Standard Spelling

When writing Hindi in Roman script, "bahut" = "bohot" = "boht" = "bhot". There's no ISO standard for Hindi-in-Roman. Every user invents their own spelling.
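One rough way to group such variants is to compare a "consonant skeleton" of each word. The sketch below is a hypothetical heuristic (the name `skeleton_key` is ours, not a library function). It is lossy and will also merge unrelated words that share consonants, so treat it as an illustration rather than a production normalizer.

```python
import re

def skeleton_key(word: str) -> str:
    """Hypothetical normalizer: drop vowels, then collapse repeated letters."""
    word = word.lower()
    consonants = re.sub(r"[aeiou]", "", word)
    return re.sub(r"(.)\1+", r"\1", consonants)

variants = ["bahut", "bohot", "boht", "bhot"]
keys = {v: skeleton_key(v) for v in variants}
print(keys)  # every variant maps to the same key
```

A lookup table built from such keys could send all four spellings to one canonical form before tokenization.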

"Aha" Question

If you train BERT on English text and then ask it to classify "Bahut bakwas movie thi" as positive or negative, what will happen?

Answer: It will perform poorly. English BERT's WordPiece vocabulary has no entries for Hindi words, so "bahut" is split into subword fragments such as ["ba", "##hu", "##t"]. The model has never seen these fragments carry Hindi meaning, so it has no useful semantic representation for them. This is why we need multilingual models, and why this chapter exists.

🇮🇳 Indian NLP Landscape

Languages: 22 scheduled + 19,500 dialects
Scripts: 13 distinct scripts (Devanagari, Tamil, Bengali, etc.)
Code-mixing: ~30% of Indian social media is code-mixed
Key models: IndicBERT, MuRIL, IndicTrans
Key datasets: IndicNLPSuite, SAIL, IIT-P Hinglish
Tokenizer needs: SentencePiece, IndicNLP tokenizer
Biggest challenge: No standard transliteration

🇺🇸 US/Global NLP Landscape

Languages: Primarily English (+ Spanish, Chinese)
Scripts: Latin script dominates
Code-mixing: Rare in mainstream NLP benchmarks
Key models: BERT, RoBERTa, GPT, T5
Key datasets: GLUE, SQuAD, SST, WMT
Tokenizer needs: WordPiece, BPE (well-established)
Biggest challenge: Bias, fairness, hallucination

Google's MuRIL (Multilingual Representations for Indian Languages) was trained on 17 languages (16 Indian languages plus English) and outperforms mBERT on Indian language tasks by 2-5% on average. It was designed specifically because mBERT's 104-language vocabulary dilutes Indian language representation: Hindi gets only ~1,700 of roughly 120,000 WordPiece tokens.
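The dilution figure is simple arithmetic: 1,700 / 120,000 is about 1.4% of the vocabulary. The sketch below shows how such a share could be measured by counting vocabulary entries written entirely in Devanagari. The helper names and the toy vocabulary are assumptions for illustration. Note that Devanagari is also used for Marathi, Nepali, and other languages, so a script count only approximates "Hindi tokens".

```python
def is_devanagari_token(token: str) -> bool:
    """True if the token (minus a WordPiece '##' prefix) is entirely Devanagari."""
    piece = token[2:] if token.startswith("##") else token
    return bool(piece) and all("\u0900" <= ch <= "\u097f" for ch in piece)

def devanagari_share(vocab):
    """Return (count, fraction) of Devanagari-only entries in a vocabulary."""
    hits = sum(1 for tok in vocab if is_devanagari_token(tok))
    return hits, hits / len(vocab)

# Toy vocabulary for illustration only; this is NOT mBERT's real vocabulary.
toy_vocab = ["[CLS]", "the", "##ing", "movie", "बहुत", "##ी", "फिल्म", "train"]
count, share = devanagari_share(toy_vocab)
print(count, f"{share:.1%}")

# The ratio quoted in the text:
print(f"{1700 / 120000:.1%}")
```

With a downloaded tokenizer, `list(tokenizer.get_vocab().keys())` could be passed in place of `toy_vocab`.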

NLP engineers working on Indian languages are among the most in-demand AI professionals in India today. These are the roles this chapter prepares you for:

NLP Engineer (Indian Languages) | Applied ML Scientist | Conversational AI Developer | Legal Tech AI Engineer | Trust & Safety ML Engineer | Machine Translation Researcher

Top Hirers (India): Google India, Microsoft India, Flipkart, Sharechat, Koo, Jugalbandi, Bhashini, CDAC

Top Hirers (US/Global): Google, Meta, OpenAI, Amazon, Apple, DeepL, Translated

Hindi/Hinglish Sentiment Analysis

BERT Fine-tuning on Code-Mixed Tweets

Problem Statement

Given a tweet or social media post written in Hindi, English, or Hinglish (code-mixed Hindi-English), classify it as Positive, Negative, or Neutral. The input may use Devanagari script, Roman script, or a mix of both.
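A first practical step is detecting which script(s) a post uses. The sketch below is a minimal helper based on Unicode ranges. The function name `detect_script` and its four labels are our own illustrative choices.

```python
def detect_script(text: str) -> str:
    """Label text as 'devanagari', 'roman', 'mixed', or 'other'."""
    has_dev = any("\u0900" <= ch <= "\u097f" for ch in text)
    has_rom = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_dev and has_rom:
        return "mixed"
    if has_dev:
        return "devanagari"
    if has_rom:
        return "roman"
    return "other"

examples = [
    "Worst movie ever",
    "बहुत बकवास फिल्म थी",
    "Bahut bakwas movie thi 👎",
    "बहुत बकवास movie थी",
]
for tweet in examples:
    print(detect_script(tweet), "|", tweet)
```

Script is not language: the English tweet and the Roman-script Hinglish tweet both come out as "roman", which is why script detection alone cannot route Hinglish to the right model.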

Why This Is Hard

Consider these three tweets — all expressing negativity:

"Worst movie ever" — Pure English. Easy.
"बहुत बकवास फिल्म थी" — Pure Hindi (Devanagari). Needs Hindi-aware tokenizer.
"Bahut bakwas movie thi 👎" — Hinglish (Roman). Neither English nor Hindi tokenizer handles this well.

Dataset

We use the SemEval 2020 Task 9 Hinglish Sentiment dataset as the core corpus and augment it with two further sources:

Dataset	Size	Languages	Labels
SemEval 2020 Task 9	~15,000 tweets	Hinglish (code-mixed)	Positive, Negative, Neutral
SAIL 2015 Hindi	~12,000 tweets	Hindi (Devanagari)	Positive, Negative, Neutral
Custom scraped	~5,000 tweets	Mixed	Manual annotation

Tokenization Challenges

Step 1: Why English BERT Tokenizer Fails on Hindi

Let's trace what happens when you feed "बहुत अच्छी फिल्म" to the standard BERT tokenizer:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("बहुत अच्छी फिल्म")
print(tokens)
# Expect [UNK] tokens and/or isolated single-character pieces rather than
# meaningful Hindi subwords; the exact output depends on the tokenizer version.

Step 2: Why mBERT Tokenizer Is Better (But Not Great)

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
tokens = tokenizer.tokenize("बहुत अच्छी फिल्म")
print(tokens)
# Output (illustrative): ['बहुत', 'अच्छी', 'फिल्म']  ← Common words survive, but with only ~1,700 Hindi tokens rarer words still fragment

Step 3: Hinglish Breaks Both

tokens = tokenizer.tokenize("bahut acchi movie thi yaar")
print(tokens)
# Output: ['ba', '##hu', '##t', 'ac', '##chi', 'movie', 'th', '##i', 'ya', '##ar']
# "bahut" → 3 subwords, "acchi" → 2 subwords — heavy fragmentation!
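Fragmentation can be quantified as "fertility", the average number of subword tokens per word. In the output shown above, 5 words become 10 subwords, a fertility of 2.0. The sketch below computes this for any tokenizer function. `toy_tokenize` is a stand-in we invented so the example runs without downloading a model.

```python
def fertility(tokenize, sentence: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    words = sentence.split()
    n_tokens = sum(len(tokenize(w)) for w in words)
    return n_tokens / len(words)

# Stand-in tokenizer for illustration: cuts each word into 2-character chunks.
def toy_tokenize(word):
    return [word[i:i + 2] for i in range(0, len(word), 2)]

score = fertility(toy_tokenize, "bahut acchi movie thi yaar")
print(score)
```

With a HuggingFace tokenizer loaded, `tokenizer.tokenize` can be passed as the first argument, which makes it easy to compare mBERT, MuRIL, and XLM-R on the same Hinglish sentences.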

Step 4: The Solution — Use MuRIL or IndicBERT

Google's MuRIL (Multilingual Representations for Indian Languages) is trained on transliterated text too. It handles Roman-script Hindi much better because its training data included romanized Indian language text.

Model Architecture

┌─────────────────────────────────────────────────────┐
│             Hinglish Sentiment Pipeline             │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Input: "bahut bakwas movie thi 👎"                 │
│               │                                     │
│               ▼                                     │
│  ┌─────────────────────────────┐                    │
│  │ Preprocessing               │                    │
│  │ • Remove emojis → features  │                    │
│  │ • Normalize spelling        │                    │
│  │ • Handle hashtags           │                    │
│  └────────────┬────────────────┘                    │
│               ▼                                     │
│  ┌─────────────────────────────┐                    │
│  │ MuRIL / mBERT Tokenizer     │                    │
│  │ [CLS] tok₁ tok₂ ... [SEP]   │                    │
│  └────────────┬────────────────┘                    │
│               ▼                                     │
│  ┌─────────────────────────────┐                    │
│  │ MuRIL Encoder (12 layers)   │                    │
│  │ Hidden size: 768            │                    │
│  │ Attention heads: 12         │                    │
│  └────────────┬────────────────┘                    │
│               ▼                                     │
│  ┌─────────────────────────────┐                    │
│  │ [CLS] token embedding       │                    │
│  │ + Dropout(0.3)              │                    │
│  │ + Linear(768 → 3)           │                    │
│  │ + Softmax                   │                    │
│  └────────────┬────────────────┘                    │
│               ▼                                     │
│  Output: Positive / Negative / Neutral              │
└─────────────────────────────────────────────────────┘

HuggingFace Implementation

Python
# Project 1: Hindi/Hinglish Sentiment Analysis with MuRIL
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from sklearn.metrics import classification_report, f1_score
import pandas as pd
import re
import numpy as np

# ─── Step 1: Preprocessing for Hinglish ───
def preprocess_hinglish(text):
    """Clean Hinglish tweet text."""
    text = re.sub(r'http\S+', '', text)           # Remove URLs
    text = re.sub(r'@\w+', '', text)              # Remove mentions
    text = re.sub(r'#(\w+)', r'\1', text)        # Keep hashtag text
    # Replace punctuation and emojis with a space (so "ठीक-ठाक" stays two words);
    # keep Devanagari + alphanumeric characters
    text = re.sub(r'[^\w\s\u0900-\u097F]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    text = text.lower()
    # Normalize common Hinglish spelling variants
    spelling_map = {
        'bohot': 'bahut', 'boht': 'bahut', 'bhot': 'bahut',
        'accha': 'acha', 'acha': 'acha', 'achha': 'acha',
        'kya': 'kya', 'kyaa': 'kya', 'kia': 'kya',
    }
    words = text.split()
    words = [spelling_map.get(w, w) for w in words]
    return ' '.join(words)

# ─── Step 2: Dataset class ───
class HinglishDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

# ─── Step 3: Load model — MuRIL for Indian languages ───
MODEL_NAME = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3  # Positive, Negative, Neutral
)

# ─── Step 4: Prepare data ───
# (Assuming df has columns: 'text', 'label'  where label ∈ {0,1,2})
# df = pd.read_csv('hinglish_sentiment.csv')
# For demo, we create sample data:
sample_texts = [
    preprocess_hinglish("bahut acchi movie thi! loved it"),
    preprocess_hinglish("worst film ever dekhi maine"),
    preprocess_hinglish("ठीक ठाक थी, nothing special"),
    preprocess_hinglish("kya bakwas acting thi yaar"),
    preprocess_hinglish("mast movie hai bhai 🔥"),
]
sample_labels = [0, 1, 2, 1, 0]  # 0=Positive, 1=Negative, 2=Neutral

# ─── Step 5: Training configuration ───
training_args = TrainingArguments(
    output_dir='./hinglish_sentiment_model',
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",  # Renamed to eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_weighted",
    fp16=torch.cuda.is_available(),  # Mixed precision requires a CUDA GPU
)

# ─── Step 6: Custom metrics ───
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=1)
    f1 = f1_score(labels, preds, average='weighted')
    return {'f1_weighted': f1}

# ─── Step 7: Train ───
train_dataset = HinglishDataset(sample_texts, sample_labels, tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # Use proper val split in practice!
    compute_metrics=compute_metrics,
)
# trainer.train()  # Uncomment to actually train

# ─── Step 8: Inference ───
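def predict_sentiment(text):
    text = preprocess_hinglish(text)
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=128)
    # Keep inputs on the same device as the model (GPU after training)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    model.eval()  # Disable dropout for deterministic predictions
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]  # Drop the batch dimension
    label = ["Positive", "Negative", "Neutral"][probs.argmax().item()]
    return label, probs.tolist()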

# Test it
print(predict_sentiment("bahut mast movie thi!"))
print(predict_sentiment("bakwas film, time waste"))
print(predict_sentiment("ठीक-ठाक थी"))

Expected Metrics

Model	F1
mBERT	0.71
MuRIL	0.76
XLM-R	0.68
English BERT	0.59

Error Analysis

Error Type	Example	Why Model Fails	Frequency
Sarcasm	"waah kya acting thi 🤣" (sarcastic praise)	Surface-level positive words, but negative intent	~18%
Spelling variation	"bhot bdia" vs "bahut badiya"	Same meaning, different tokens	~15%
Script mixing	"बहुत bad movie"	Half Devanagari, half English in one phrase	~12%
Negation in Hindi	"nahi acchi thi" (was not good)	Model focuses on "acchi" (good), misses "nahi" (not)	~10%
Cultural context	"timepass movie" (mediocre)	"timepass" is Indian slang, not in training data	~8%

❌ MYTH: "Just use English BERT with Google Translate — translate Hindi to English first, then classify."

✅ TRUTH: Translation destroys sentiment signals. "Bahut bakwas" literally translates to "very nonsense," losing the emotional intensity. Sarcasm, cultural slang, and code-mixed nuance are untranslatable. Always use multilingual models directly on the original text.

🔍 WHY IT MATTERS: In production sentiment systems (ShareChat, Koo), translate-then-classify underperforms direct multilingual models by 8-12% F1.

Indian Legal Document Summarization

Transformer Abstractive Summarization on Court Judgments

Problem Statement

Given a full Indian Supreme Court or High Court judgment (typically 5,000–50,000 words), generate a concise abstractive summary (300–800 words) that captures the facts, issues, arguments, reasoning, and holding.

Why This Matters

India's judiciary has a backlog of 50+ million pending cases. Lawyers and judges spend hours reading lengthy judgments to find relevant precedents. Resources such as the Indian Legal Documents Corpus (ILDC), released by researchers at IIT Kanpur, make it possible to train models that speed up legal research. This is not an academic exercise: it is a tool that could accelerate justice delivery in India.

Dataset

Dataset	Size	Avg. Document Length	Avg. Summary Length
ILDC (IIT Kanpur)	~35,000 judgments	~4,200 words	~250 words
IN-Abs (FIRE 2022)	~7,000 judgments	~5,500 words	~350 words
Custom SC judgments	~2,000 curated	~8,000 words	~500 words (manual)

Tokenization Challenges for Legal Text

Extreme length: A 10,000-word judgment exceeds BERT's 512-token limit by roughly 20× even at one token per word, and by more once subword splitting is counted. You need either a Longformer (4,096 tokens) or a chunking strategy.
Legal vocabulary: Terms like "suo motu," "certiorari," "mandamus," "res judicata" are Latin loan-words used in Indian courts but absent from standard NLP vocabularies.
Hindi legal terms: "याचिका" (petition), "अपील" (appeal), "निर्णय" (judgment) appear when courts cite Hindi-language proceedings.
Section references: "Section 302 IPC," "Article 21," "CrPC 482" — these are not natural language but structured legal codes.
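The chunking strategy mentioned in the first point can be sketched as a sliding window with overlap, so that a sentence cut at one boundary still appears whole in a neighbouring chunk. The function name `chunk_tokens` and the stride of 384 are illustrative assumptions. In practice the usable window is slightly under 512 because [CLS] and [SEP] take two positions.

```python
def chunk_tokens(tokens, window=512, stride=384):
    """Split a token list into overlapping windows of at most `window` tokens."""
    if window <= 0 or stride <= 0 or stride > window:
        raise ValueError("need 0 < stride <= window")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the document
        start += stride
    return chunks

# Stand-in for a tokenized judgment: 10,000 dummy token ids.
tokens = list(range(10_000))
chunks = chunk_tokens(tokens)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Each chunk can then be encoded separately, with per-chunk outputs (sentence scores or partial summaries) merged afterwards.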

Model Architecture: Two Approaches

┌─────────────────────────────────────────────────────────┐
│  Approach A: Extractive-then-Abstractive (Recommended)  │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Full Judgment (10,000 words)                           │
│              │                                          │
│              ▼                                          │
│  ┌──────────────────────────┐                           │
│  │ STAGE 1: Extractive      │                           │
│  │ LegalBERT → score each   │                           │
│  │ sentence by importance   │                           │
│  │ Keep top 1,500 words     │                           │
│  └───────────┬──────────────┘                           │
│              ▼                                          │
│  ┌──────────────────────────┐                           │
│  │ STAGE 2: Abstractive     │                           │
│  │ BART / mT5 encoder-      │                           │
│  │ decoder on extracted     │                           │
│  │ text → final summary     │                           │
│  └───────────┬──────────────┘                           │
│              ▼                                          │
│  Summary (300-500 words)                                │
│                                                         │
├─────────────────────────────────────────────────────────┤
│  Approach B: Long-Document Transformer                  │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Full Judgment (10,000 words)                           │
│              │                                          │
│              ▼                                          │
│  ┌──────────────────────────┐                           │
│  │ LED (Longformer Encoder  │                           │
│  │ Decoder): handles up     │                           │
│  │ to 16,384 tokens         │                           │
│  └───────────┬──────────────┘                           │
│              ▼                                          │
│  Summary (300-500 words)                                │
└─────────────────────────────────────────────────────────┘

HuggingFace Implementation

Python
# Project 2: Indian Legal Document Summarization
import torch
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
from datasets import load_dataset
import evaluate
import numpy as np
import re

# ─── Step 1: Legal text preprocessing ───
def preprocess_legal_text(text):
    """Clean Indian legal judgment text."""
    # Normalize section references
    text = re.sub(r'Section\s+(\d+)', r'Section_\1', text)
    text = re.sub(r'Article\s+(\d+)', r'Article_\1', text)
    # Remove page numbers and formatting artifacts
    text = re.sub(r'\n\d+\n', '\n', text)
    text = re.sub(r'\s+', ' ', text).strip()
    # Truncate to manageable length for encoder
    words = text.split()
    if len(words) > 3000:
        # Keep first 1500 + last 1500 words (intro + conclusion)
        words = words[:1500] + ['[...]'] + words[-1500:]
    return ' '.join(words)

# ─── Step 2: Extractive pre-filtering ───
def extractive_filter(text, top_k=30):
    """Score sentences with a keyword heuristic; keep the top_k in document order."""
    # Naive split on '.'; a legal-aware sentence splitter would be more robust
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    legal_keywords = [
        'held', 'order', 'appeal', 'petitioner', 'respondent',
        'court', 'judgment', 'section', 'article', 'contention',
        'submitted', 'evidence', 'guilty', 'acquit', 'convicted',
        'constitutional', 'fundamental', 'right', 'violation'
    ]
    scored = []
    for i, sent in enumerate(sentences):
        score = sum(1 for kw in legal_keywords if kw in sent.lower())
        # Boost first and last 20% of document (facts + holding)
        if i < len(sentences) * 0.2 or i > len(sentences) * 0.8:
            score += 2
        scored.append((score, i, sent))
    # Pick the top_k highest-scoring sentences, then restore document order
    # so the abstractive stage reads facts before the holding
    top = sorted(scored, key=lambda t: t[0], reverse=True)[:top_k]
    top.sort(key=lambda t: t[1])
    return '. '.join(sent for _, _, sent in top) + '.' if top else ''

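# ─── Step 3: Load mT5 for multilingual summarization ───
# Note: the public mT5 checkpoints are pre-trained only with span corruption,
# so the model must be fine-tuned on (judgment, summary) pairs before it
# produces usable summaries.
MODEL_NAME = "google/mt5-base"  # or "ai4bharat/IndicBART"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)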

# ─── Step 4: Summarization function ───
def summarize_judgment(judgment_text, max_input=1024, max_output=256):
    # Stage 1: Extract key sentences
    filtered = extractive_filter(judgment_text)
    # Stage 2: Abstractive summarization
    prefix = "summarize: "
    inputs = tokenizer(
        prefix + filtered,
        max_length=max_input,
        truncation=True,
        return_tensors="pt"
    )
    summary_ids = model.generate(
        **inputs,  # Pass input_ids and attention_mask together
        max_length=max_output,
        min_length=60,
        num_beams=4,
        length_penalty=2.0,
        no_repeat_ngram_size=3,
        early_stopping=True
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# ─── Step 5: Evaluation with ROUGE ───
rouge = evaluate.load("rouge")

def evaluate_summary(prediction, reference):
    results = rouge.compute(
        predictions=[prediction],
        references=[reference]
    )
    print(f"ROUGE-1: {results['rouge1']:.4f}")
    print(f"ROUGE-2: {results['rouge2']:.4f}")
    print(f"ROUGE-L: {results['rougeL']:.4f}")
    return results

# Example usage
sample_judgment = """
The petitioner filed a writ petition under Article 32 of the Constitution 
challenging the constitutional validity of Section 66A of the Information 
Technology Act, 2000. The petitioner contended that the provision was vague 
and overbroad, violating the fundamental right to free speech under 
Article 19(1)(a). The respondent Union of India argued that reasonable 
restrictions under Article 19(2) were applicable. After hearing both sides, 
this Court held that Section 66A is unconstitutional as it creates a 
chilling effect on free speech and is not saved by Article 19(2). 
The impugned section is struck down.
"""

summary = summarize_judgment(sample_judgment)
print("Generated Summary:", summary)

Metrics: Extractive vs. Abstractive on ILDC

Method	Metric	Score
Extractive	ROUGE-1	0.42
mT5	ROUGE-1	0.38
Hybrid	ROUGE-1	0.46
Hybrid	ROUGE-L	0.21
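ROUGE-L is built on the longest common subsequence (LCS) between the generated and reference summaries, which is why it can sit far below ROUGE-1 when a summary has the right words in a different order. The sketch below is a simplified version using lowercase whitespace tokens. It is not the official scorer (which applies its own tokenization and optional stemming), and the example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(prediction: str, reference: str):
    """Return (precision, recall, F1) based on the LCS of whitespace tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

ref = "the court struck down section 66A as unconstitutional"
pred = "section 66A was struck down by the court"
print(rouge_l(pred, ref))
```

Here six of the eight predicted words occur in the reference, yet the longest in-order match is only two tokens long, so the ROUGE-L F1 is 0.25.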

Error Analysis

Error Type	Example	Root Cause
Hallucinated sections	Summary mentions "Section 420 IPC" which doesn't appear in the judgment	Decoder generating plausible-sounding but incorrect legal citations
Missing holding	Summary describes facts and arguments but omits the final decision	Holding usually appears at the end; truncation drops it
Party confusion	Attributes petitioner's argument to respondent	Long-range dependency: parties are defined 2000+ words before their arguments
Hindi terms dropped	"याचिका खारिज" (petition dismissed) not captured	mT5 underrepresents Hindi legal vocabulary
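The first error type, hallucinated sections, lends itself to a cheap automatic check: every provision cited in the summary should also appear in the source judgment. The sketch below is a hypothetical post-processing guard (function names and regex are ours). It only covers "Section N" and "Article N" patterns and would need extending for forms like "CrPC 482".

```python
import re

# Matches e.g. "Section 66A", "Article 19", and the "Section_302" form
# produced by the preprocessing step.
CITATION = re.compile(r"\b(Section|Article)[\s_]+(\d+[A-Z]?)", re.IGNORECASE)

def cited_provisions(text: str) -> set:
    """Return normalized (kind, number) pairs such as ('section', '66A')."""
    return {(kind.lower(), num.upper()) for kind, num in CITATION.findall(text)}

def unsupported_citations(summary: str, judgment: str) -> set:
    """Provisions cited in the summary that never appear in the judgment."""
    return cited_provisions(summary) - cited_provisions(judgment)

judgment = ("The petitioner challenged Section 66A of the IT Act under "
            "Article 19(1)(a). The Court held Section 66A unconstitutional.")
summary = ("The Court struck down Section 66A and Section 420 "
           "as violating Article 19.")
flagged = unsupported_citations(summary, judgment)
print(flagged)
```

A non-empty result could trigger regeneration or a human review flag before the summary is shown to a lawyer.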

"ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation" (Malik et al., ACL 2021)

This landmark paper introduced the first large-scale corpus of Indian Supreme Court judgments (~35K cases) with decision labels, plus expert-annotated explanations for a subset. The authors showed that LegalBERT pre-trained on Indian legal text outperforms generic BERT by 4.2% accuracy on judgment prediction. The dataset is freely available and has become the standard benchmark for Indian legal NLP.

Key insight: Legal summarization is harder than news summarization because the "answer" (the holding) is structurally unpredictable — it could be in the first paragraph, the last paragraph, or buried in the middle.

IRCTC Chatbot — Seq2Seq with Attention

Bilingual Booking Query Handler

Problem Statement

Build a conversational AI system that handles Indian Railways booking queries in Hindi, English, and Hinglish. The system must understand queries like:

"Delhi se Mumbai ka train kitne baje hai?" (What time is the train from Delhi to Mumbai?)
"मुझे कल दिल्ली से जयपुर जाना है" (I need to go from Delhi to Jaipur tomorrow)
"PNR status check karo 4521876543" (Check PNR status for 4521876543)
"Cancel my ticket from last week"

Architecture: Intent + Slot Filling + Response Generation

┌────────────────────────────────────────────────────┐
│             IRCTC Chatbot Architecture             │
├────────────────────────────────────────────────────┤
│                                                    │
│  User: "Delhi se Mumbai 3 tareekh ko train dikhao" │
│                  │                                 │
│                  ▼                                 │
│  ┌──────────────┐  ┌──────────────────┐            │
│  │ Language ID  │  │ Script Detection │            │
│  │ → Hinglish   │  │ → Roman          │            │
│  └──────┬───────┘  └────────┬─────────┘            │
│         └────────┬──────────┘                      │
│                  ▼                                 │
│    ┌────────────────────────────┐                  │
│    │ Intent Classifier (BERT)   │                  │
│    │ Intent: SEARCH_TRAIN       │                  │
│    └─────────────┬──────────────┘                  │
│                  ▼                                 │
│    ┌────────────────────────────┐                  │
│    │ Slot Filler (Token Clf.)   │                  │
│    │ source: Delhi              │                  │
│    │ dest: Mumbai               │                  │
│    │ date: 3 tareekh (3rd)      │                  │
│    └─────────────┬──────────────┘                  │
│                  ▼                                 │
│    ┌────────────────────────────┐                  │
│    │ Response Generator (Seq2Seq│                  │
│    │ + Attention / Template)    │                  │
│    └─────────────┬──────────────┘                  │
│                  ▼                                 │
│  Bot: "Delhi → Mumbai, 3 तारीख को ये trains        │
│  available हैं: [Rajdhani 12:05, Duronto...]"      │
└────────────────────────────────────────────────────┘

Part A: Intent Classification

Python
# Part A: Intent classification for IRCTC queries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Define intents
INTENTS = [
    'SEARCH_TRAIN',       # Find trains between stations
    'CHECK_PNR',          # PNR status inquiry
    'BOOK_TICKET',        # Book a ticket
    'CANCEL_TICKET',      # Cancel booking
    'SEAT_AVAILABILITY',  # Check seat availability
    'TRAIN_STATUS',       # Live running status
    'GENERAL_QUERY',      # General help
]

# Training data examples (would be 1000+ in practice)
training_examples = [
    ("Delhi se Mumbai train dikhao", 0),
    ("दिल्ली से मुंबई की ट्रेन बताओ", 0),
    ("Show me trains from Delhi to Mumbai", 0),
    ("PNR check karo 4521876543", 1),
    ("mera PNR status kya hai", 1),
    ("ticket book karna hai kal ke liye", 2),
    ("मुझे टिकट बुक करनी है", 2),
    ("cancel my ticket please", 3),
    ("meri ticket cancel karo", 3),
    ("kitni seat available hai?", 4),
    ("train kahan tak pahunchi?", 5),
    ("help chahiye", 6),
]

# Load MuRIL for intent classification
model_name = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
intent_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(INTENTS)
)

def classify_intent(query):
    inputs = tokenizer(query, return_tensors="pt",
                       truncation=True, max_length=64)
    with torch.no_grad():
        logits = intent_model(**inputs).logits
    pred = logits.argmax(dim=-1).item()
    confidence = torch.softmax(logits, dim=-1).max().item()
    return INTENTS[pred], confidence

Part B: Slot Extraction (Named Entity Recognition)

Python
# Part B: Slot filling using token classification
from transformers import AutoModelForTokenClassification

# Slot labels (BIO format)
SLOT_LABELS = [
    'O',          # Outside
    'B-SOURCE',   # Source station start
    'I-SOURCE',   # Source station continuation
    'B-DEST',     # Destination start
    'I-DEST',     # Destination continuation
    'B-DATE',     # Date start
    'I-DATE',     # Date continuation
    'B-PNR',      # PNR number
    'I-PNR',
    'B-CLASS',    # Travel class (Sleeper, AC, etc.)
    'I-CLASS',
    'B-TRAIN',    # Train name/number
    'I-TRAIN',
]

# Example annotation
# Example annotation (one label per word):
# Delhi     se  Mumbai  3       tareekh  ko  AC       mein  dikhao
# B-SOURCE  O   B-DEST  B-DATE  I-DATE   O   B-CLASS  O     O

slot_model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(SLOT_LABELS)
)

def extract_slots(query):
    inputs = tokenizer(query, return_tensors="pt",
                       truncation=True, max_length=64,
                       return_offsets_mapping=True)  # Needs a fast tokenizer
    # (start, end) character span of every token in the original query
    offsets = inputs.pop("offset_mapping")[0].tolist()
    word_ids = inputs.word_ids(0)  # None for [CLS] / [SEP]
    with torch.no_grad():
        logits = slot_model(**inputs).logits
    preds = logits.argmax(dim=-1)[0].tolist()

    slots = {}
    current_slot, span_start, span_end = None, 0, 0
    prev_word = None

    for word_id, (start, end), pred_id in zip(word_ids, offsets, preds):
        if word_id is None:            # Skip special tokens
            continue
        if word_id == prev_word:       # Later subword of the same word
            if current_slot:
                span_end = end         # Extend the open span over the whole word
            continue
        prev_word = word_id
        label = SLOT_LABELS[pred_id]   # Use the label of the word's first subword
        if label.startswith('I-') and current_slot == label[2:]:
            span_end = end
            continue
        if current_slot:               # Close the open span
            slots[current_slot] = query[span_start:span_end]
            current_slot = None
        if label.startswith('B-'):
            current_slot, span_start, span_end = label[2:], start, end

    if current_slot:
        slots[current_slot] = query[span_start:span_end]
    return slots

# ─── Part C: Response Generation ───
def generate_response(intent, slots):
    """Template-based response with bilingual output."""
    if intent == 'SEARCH_TRAIN':
        src = slots.get('SOURCE', '?')
        dst = slots.get('DEST', '?')
        date = slots.get('DATE', 'today')
        return (f"{src} → {dst} ke liye {date} ko trains:\n"
                f"1. Rajdhani Express (12:05 PM)\n"
                f"2. Duronto Express (11:20 PM)\n"
                f"3. Mumbai Mail (09:00 PM)")
    elif intent == 'CHECK_PNR':
        pnr = slots.get('PNR', 'N/A')
        return f"PNR {pnr} ka status: Confirmed (S5, Berth 42)"
    else:
        return "Main aapki kaise madad kar sakta hoon?"

# ─── Full pipeline ───
def chatbot_pipeline(user_query):
    intent, confidence = classify_intent(user_query)
    slots = extract_slots(user_query)
    response = generate_response(intent, slots)
    return {
        'query': user_query,
        'intent': intent,
        'confidence': confidence,
        'slots': slots,
        'response': response
    }

# Test
result = chatbot_pipeline("Delhi se Mumbai kal AC mein train dikhao")
print(result)

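Illustrative output once both models have been fine-tuned (the freshly initialized classification heads above return arbitrary intents and slots):

{'query': 'Delhi se Mumbai kal AC mein train dikhao', 'intent': 'SEARCH_TRAIN', 'confidence': 0.94, 'slots': {'SOURCE': 'Delhi', 'DEST': 'Mumbai', 'DATE': 'kal', 'CLASS': 'AC'}, 'response': 'Delhi → Mumbai ke liye kal ko trains:\n1. Rajdhani Express...'}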
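In the output above the DATE slot is still the raw string "kal", which a booking backend cannot use directly. The sketch below shows one possible normalization step. The lexicon, the function name `normalize_date_slot`, and the roll-forward rule are all assumptions for illustration. Note that "kal" can mean either yesterday or tomorrow in Hindi; a booking context is assumed here, so it is read as tomorrow.

```python
import re
from datetime import date, timedelta

# Hypothetical lookup of relative-day words; a real system needs a larger lexicon.
RELATIVE_DAYS = {"aaj": 0, "today": 0, "kal": 1, "tomorrow": 1, "parson": 2}

def normalize_date_slot(value: str, today: date):
    """Map a raw DATE slot value to a concrete date, or None if unrecognized."""
    value = value.strip().lower()
    if value in RELATIVE_DAYS:
        return today + timedelta(days=RELATIVE_DAYS[value])
    match = re.fullmatch(r"(\d{1,2})(?:\s*(?:tareekh|tarikh))?", value)
    if not match:
        return None
    day = int(match.group(1))
    # Try this month first, then next month, and keep the first future date.
    for offset in (0, 1):
        month_index = today.month - 1 + offset
        year = today.year + month_index // 12
        month = month_index % 12 + 1
        try:
            candidate = date(year, month, day)
        except ValueError:      # e.g. day 31 in a 30-day month
            continue
        if candidate >= today:
            return candidate
    return None

today = date(2024, 1, 10)
print(normalize_date_slot("kal", today))
print(normalize_date_slot("3 tareekh", today))
```

Passing `today` explicitly keeps the function deterministic and easy to test.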

Metrics

Metric	Value
Intent Accuracy	92.3%
Slot F1	87.6%
End-to-End Accuracy	84.1%
Human Rating	3.8/5

The following code is supposed to extract the PNR number from a Hinglish query, but it has 3 bugs. Can you find them?

def extract_pnr(query):
    import re
    # Bug 1: Pattern issue
    pnr_pattern = r'\d{5}'    # PNR is 10 digits!
    match = re.search(pnr_pattern, query)
    if match:
        # Bug 2: Wrong method
        return match.groups()  # .group() not .groups()
    # Bug 3: Missing return
    # If no match, function returns None implicitly
    # Should return empty string or raise exception

result = extract_pnr("PNR check karo 4521876543")
print("PNR:", result)  # Gives wrong output!

Fixed version:
Bug 1: Change \d{5} to \d{10} — PNR numbers are 10 digits
Bug 2: Change match.groups() to match.group() — .groups() returns captured groups (empty tuple since no groups), .group() returns the full match
Bug 3: Add return "" at the end for the no-match case
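Applying the three fixes gives a version along these lines. This is a minimal sketch: the name extract_pnr_fixed and the word-boundary anchors (\b) are additions for illustration and are not part of the original exercise, and returning an empty string is only one of the two options mentioned under Bug 3.

```python
import re

def extract_pnr_fixed(query):
    """Return the first standalone 10-digit number in the query, or "" if none."""
    # The \b anchors are an assumption beyond the three listed fixes: they stop
    # the pattern from matching the first 10 digits of a longer number.
    match = re.search(r'\b\d{10}\b', query)
    if match:
        return match.group()  # full matched text, not the tuple from .groups()
    return ""  # explicit result for the no-match case

print("PNR:", extract_pnr_fixed("PNR check karo 4521876543"))
```

Without the anchors, a 12-digit phone number in the query would yield a false PNR, which is why the sketch adds them.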

Fake News Detection for Indian WhatsApp

BERT Classifier for Hindi Misinformation

Problem Statement

Given a message forwarded on WhatsApp (in Hindi, English, or Hinglish), classify it as REAL or FAKE. The model must handle health misinformation, political propaganda, communal rumors, and scientific hoaxes common in Indian WhatsApp groups.

Why WhatsApp Fake News Is Unique to India

India has more than 500 million WhatsApp users, the largest user base in the world. Unlike posts on Twitter, which are public, WhatsApp messages are end-to-end encrypted and forwarded privately, which makes misinformation much harder to monitor and detect. During COVID-19, messages like "5G towers spread corona" and "cow urine cures COVID" led to real-world harm. In 2018, WhatsApp misinformation led to mob lynchings in multiple Indian states, prompting WhatsApp to limit forwarding. Automated detection can therefore help prevent physical harm, not just correct the record.

Dataset

Dataset	Size	Languages	Source
CONSTRAINT 2021 (Hindi)	~21,000	Hindi (Devanagari)	Fact-checking websites + WhatsApp
FakeNewsNet (English)	~23,000	English	PolitiFact + GossipCop
Fake-News-Hindi	~9,500	Hindi + Hinglish	Indian fact-checkers (AltNews, BoomLive)
Custom WhatsApp	~3,000	Mixed	Crowdsourced + manual annotation

Feature Engineering for WhatsApp Text

Linguistic Features Unique to Fake Messages

1. Sensationalism markers

Fake messages use excessive exclamation marks, ALL CAPS, and urgency words: "URGENT!!! जल्दी forward करो!!!" Real news rarely uses this tone.

2. Source absence

Fake messages rarely cite verifiable sources. Real news mentions "according to PTI," "as per MoHFW data." We encode this as a binary feature: has_source = 1 if message contains source indicators.

3. Forward indicators

Phrases like "Forwarded as received," "Doctor ne bataya," "NASA ne confirm kiya" (NASA confirmed) — these are markers of viral forwards, not original content.

4. Emotional manipulation

Fake messages appeal to fear, anger, or nationalism: "Desh ke liye share karo!" (Share for the nation!). Sentiment extremity (very positive or very negative) correlates with fakeness.

HuggingFace Implementation

Python
# Project 4: Fake News Detection for Indian WhatsApp Messages
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import classification_report
import re
import numpy as np

# ─── Step 1: Feature extraction ───
def extract_meta_features(text):
    """Extract surface-level features (punctuation, casing, cue words, length) that signal fakeness."""
    features = []
    # Exclamation density
    features.append(text.count('!') / max(len(text), 1))
    # CAPS ratio
    caps = sum(1 for c in text if c.isupper())
    features.append(caps / max(len(text), 1))
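    # Has source citation. Generic cue words are matched case-insensitively.
    # Agency acronyms are matched case-sensitively as whole words: comparing
    # 'PTI' against lowercased text never matches, and lowercasing 'WHO'
    # would confuse it with the ordinary word 'who'.
    source_words = ['according', 'source', 'study', 'report', 'research']
    source_acronyms = r'\b(?:PTI|ANI|WHO)\b'
    lowered = text.lower()
    has_source = (any(w in lowered for w in source_words)
                  or re.search(source_acronyms, text) is not None)
    features.append(1.0 if has_source else 0.0)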
    # Forward indicators
    fwd_words = ['forward', 'share', 'viral', 'forwarded',
                 'भेजो', 'शेयर', 'forward करो']
    features.append(1.0 if any(w in text.lower() for w in fwd_words) else 0.0)
    # Message length (normalized)
    features.append(min(len(text) / 1000, 1.0))
    return features

# ─── Step 2: BERT + Meta Features Model ───
class FakeNewsDetector(nn.Module):
    def __init__(self, bert_model_name, n_meta_features=5, n_classes=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(0.3)
        # BERT hidden size (768 for base models) + meta features (5) → classifier
        hidden_size = self.bert.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size + n_meta_features, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, n_classes)
        )

    def forward(self, input_ids, attention_mask, meta_features):
        bert_out = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        cls_embedding = bert_out.last_hidden_state[:, 0, :]  # [CLS] token
        cls_embedding = self.dropout(cls_embedding)
        # Concatenate BERT [CLS] with meta features
        combined = torch.cat([cls_embedding, meta_features], dim=1)
        logits = self.classifier(combined)
        return logits

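# ─── Step 3: Initialize ───
# Only the MuRIL encoder is pretrained; the classifier head starts from random
# weights, so fine-tune on labeled data before trusting any prediction below.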
MODEL_NAME = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = FakeNewsDetector(MODEL_NAME)

# ─── Step 4: Inference function ───
def detect_fake_news(text):
    # Tokenize
    encoding = tokenizer(
        text, max_length=256, padding='max_length',
        truncation=True, return_tensors='pt'
    )
    # Extract meta features
    meta = torch.tensor([extract_meta_features(text)], dtype=torch.float)

    model.eval()
    with torch.no_grad():
        logits = model(
            encoding['input_ids'],
            encoding['attention_mask'],
            meta
        )
    probs = torch.softmax(logits, dim=-1)
    pred = probs.argmax(dim=-1).item()
    label = "FAKE" if pred == 1 else "REAL"
    return label, probs[0].tolist()

# Test
test_messages = [
    "BREAKING: गर्म पानी पीने से COVID ठीक होता है!!! SHARE करो सबको!!!",
    "According to WHO, COVID-19 vaccines have undergone rigorous clinical trials.",
    "NASA ne confirm kiya hai ki kal raat 2 baje earthquake aayega India mein!!!",
]
for msg in test_messages:
    label, probs = detect_fake_news(msg)
    print(f"[{label}] (conf: {max(probs):.2f}) → {msg[:50]}...")

Expected Metrics

Metric	Value
F1 Score	0.89
Precision	0.91
Recall	0.87
Accuracy	0.90
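As a quick consistency check on these numbers, F1 is the harmonic mean of precision and recall. The sketch below (the helper name f1_from_precision_recall is hypothetical) plugs in the precision and recall listed above.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall taken from the expected metrics above.
f1 = f1_from_precision_recall(0.91, 0.87)
print(f"F1 = {f1:.2f}")
```

The result rounds to 0.89, which agrees with the reported F1 score.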

Error Analysis

Error Type	Example	Challenge	Freq
Satire misclassified	"Modi ji ne chand pe airport banane ka announcement kiya" (satirical)	Looks fake but is intentional humor	~14%
Partial truth	Real event + fabricated details	Contains real names/dates mixed with fiction	~12%
Old news reshared	Real news from 2019 shared as "breaking" in 2024	Content is factually true but contextually misleading	~9%
Regional language	Fake news in Tamil/Telugu transliterated to Roman	Model trained on Hindi/English, not Dravidian	~7%

Topic: Text Classification using BERT

Key Equation: P(class | text) = softmax(W · h_[CLS] + b), where h_[CLS] is the [CLS] token's final hidden state

Input format: [CLS] token₁ token₂ ... tokenₙ [SEP]

Fine-tuning: Unfreeze all layers + add classification head; lr = 2e-5 to 5e-5

GATE trap: BERT is encoder-only → cannot generate text. Use for classification, NER, QA (extractive). For generation, use GPT (decoder-only) or T5 (encoder-decoder)

Quick recall: BERT-base = 12 layers, 768 hidden, 12 heads, 110M params
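The key equation can be traced by hand on a toy example. The sketch below uses a made-up 3-dimensional vector in place of the 768-dimensional [CLS] state, and the weights, bias, and class order are hypothetical values chosen only to show the arithmetic.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h_cls, W, b):
    """Compute softmax(W · h_cls + b) for a single example."""
    logits = [sum(w_i * h_i for w_i, h_i in zip(row, h_cls)) + b_k
              for row, b_k in zip(W, b)]
    return softmax(logits)

# Toy numbers (hypothetical): a 3-dimensional stand-in for h_[CLS], 2 classes.
h_cls = [0.5, -1.0, 2.0]
W = [[0.1, 0.2, -0.3],   # weights for class 0
     [-0.2, 0.1, 0.4]]   # weights for class 1
b = [0.0, 0.1]

probs = classify(h_cls, W, b)
print(probs)  # two probabilities that sum to 1
```

Here the logits are -0.75 and 0.7, so class 1 receives the larger probability.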

Neural Machine Translation: Hindi ↔ English

Transformer with Attention Visualization & BLEU Scoring

Problem Statement

Build a Hindi→English and English→Hindi translation system using the Transformer architecture. Implement attention visualization to understand word alignment, and evaluate using BLEU score.

Mathematical Foundation: BLEU Score

Deriving BLEU from First Principles

BLEU (Bilingual Evaluation Understudy) measures how close a machine translation is to human reference translations. Let's derive it step by step.

Step 1: Modified n-gram Precision

For each n-gram size (1, 2, 3, 4), we compute:

pₙ = Σ_sentence Σ_ngram min(count_candidate(ngram), count_reference(ngram))
     ────────────────────────────────────────────────────────────────────────
     Σ_sentence Σ_ngram count_candidate(ngram)

The min is the "clipping" — it prevents gaming by repeating common words.

Step 2: Example Calculation

Reference: "The cat sat on the mat"
Candidate: "The the the the the the"

Without clipping: p₁ = 6/6 = 1.0 (all "the" match!)
With clipping: "the" appears at most 2 times in the reference (ignoring case), so p₁ = 2/6 ≈ 0.33 ✓
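The clipping step can be reproduced in a few lines. This is a minimal sketch that assumes whitespace tokenization and case-insensitive matching; the function name clipped_unigram_precision is hypothetical.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Return (clipped match count, candidate length) for unigrams."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_counts = Counter(cand)
    ref_counts = Counter(ref)
    # Each candidate word is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return clipped, len(cand)

clipped, total = clipped_unigram_precision(
    "The the the the the the", "The cat sat on the mat"
)
print(f"p1 = {clipped}/{total} = {clipped / total:.2f}")
```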

Step 3: Brevity Penalty

Short translations have artificially high precision. The brevity penalty (BP) penalizes them:

BP = exp(1 - r/c)  if c < r
BP = 1              if c ≥ r

where c = candidate length, r = reference length
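A small sketch of the brevity penalty shows how quickly it shrinks as the candidate gets shorter. The function name brevity_penalty and the zero-length guard are additions for illustration.

```python
import math

def brevity_penalty(c, r):
    """BP = exp(1 - r/c) if c < r, else 1. c = candidate length, r = reference length."""
    if c == 0:
        return 0.0  # assumption: an empty candidate scores zero
    return math.exp(1 - r / c) if c < r else 1.0

for c, r in [(10, 10), (8, 10), (5, 10)]:
    print(f"c={c}, r={r}, BP={brevity_penalty(c, r):.4f}")
```

A candidate half the length of the reference keeps only about 37% of its precision-based score.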

Step 4: Final BLEU

BLEU = BP · exp(Σₙ wₙ · log(pₙ))

Typically: w₁ = w₂ = w₃ = w₄ = 0.25  (equal weights)

BLEU = BP · exp(¼ · Σₙ₌₁⁴ log pₙ), where BP = min(1, exp(1 − r/c))

Tokenization: SentencePiece for Hindi

Python
# Why we need SentencePiece for Hindi
# Hindi is morphologically rich: verbs inflect for tense, person, number, and gender
# "खाऊँगी" = खा (eat) + ऊँगी (will, feminine) → "I will eat" (said by a woman)

import sentencepiece as spm

# Train SentencePiece model on Hindi corpus
# spm.SentencePieceTrainer.train(
#     input='hindi_corpus.txt',
#     model_prefix='hindi_sp',
#     vocab_size=32000,
#     character_coverage=0.9995,  # High coverage for Devanagari
#     model_type='bpe'
# )

# Compare tokenization approaches
from transformers import AutoTokenizer

sentence = "भारत के प्रधानमंत्री ने आज एक महत्वपूर्ण बैठक की।"

# mBERT tokenizer (WordPiece)
tok1 = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print("mBERT:", tok1.tokenize(sentence))
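# Illustrative output (exact splits depend on the tokenizer vocabulary):
# ['भारत', 'के', 'प्रधानमंत्री', 'ने', 'आज', 'एक', 'महत्व', '##पूर्ण', 'बैठक', 'की', '।']
# Note: "महत्वपूर्ण" (important) split into "महत्व" + "##पूर्ण"

# IndicTrans tokenizer (SentencePiece)
# The checkpoint ships custom tokenizer code, so trust_remote_code=True is needed.
# It also expects preprocessed, language-tagged input, so treat the raw-text
# call below as a rough comparison only.
tok2 = AutoTokenizer.from_pretrained(
    "ai4bharat/indictrans2-en-indic-1B", trust_remote_code=True
)
print("IndicTrans:", tok2.tokenize(sentence))
# Better handling of compound Hindi words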

Full Implementation: Hindi↔English NMT

Python
# Project 5: Hindi↔English Neural Machine Translation
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
from datasets import load_dataset
import evaluate
import numpy as np
import torch

# ─── Step 1: Load IndicTrans2 (State-of-the-art for Indian languages) ───
MODEL_NAME = "ai4bharat/indictrans2-en-indic-1B"
# For a lighter model, use Helsinki-NLP/opus-mt-en-hi
LITE_MODEL = "Helsinki-NLP/opus-mt-en-hi"

tokenizer = AutoTokenizer.from_pretrained(LITE_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(LITE_MODEL)

# ─── Step 2: Translation function ───
def translate_en_to_hi(text, max_length=128):
    inputs = tokenizer(text, return_tensors="pt",
                       max_length=max_length, truncation=True)
    translated = model.generate(
        **inputs,
        max_length=max_length,
        num_beams=5,
        length_penalty=1.0,
        no_repeat_ngram_size=2,
        early_stopping=True
    )
    result = tokenizer.decode(translated[0], skip_special_tokens=True)
    return result

# ─── Step 3: BLEU Score Computation ───
import sacrebleu

def compute_bleu(predictions, references):
    """Compute BLEU score using sacrebleu."""
    bleu = sacrebleu.corpus_bleu(predictions, [references])
    print(f"BLEU Score: {bleu.score:.2f}")
    print(f"Precisions: {[f'{p:.1f}' for p in bleu.precisions]}")
    print(f"Brevity Penalty: {bleu.bp:.4f}")
    print(f"Length Ratio: {bleu.sys_len}/{bleu.ref_len} = {bleu.sys_len/bleu.ref_len:.3f}")
    return bleu

# ─── Step 4: By-hand BLEU calculation (for understanding) ───
from collections import Counter

def bleu_from_scratch(candidate, reference, max_n=4):
    """Compute BLEU score from scratch — see the math in action."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()

    # Step 1: Compute clipped n-gram precisions
    precisions = []
    for n in range(1, max_n + 1):
        # Extract n-grams
        cand_ngrams = [tuple(cand_tokens[i:i+n]) for i in range(len(cand_tokens) - n + 1)]
        ref_ngrams = [tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens) - n + 1)]

        # Count
        cand_counts = Counter(cand_ngrams)
        ref_counts = Counter(ref_ngrams)

        # Clipped count
        clipped = 0
        total = 0
        for ngram, count in cand_counts.items():
            clipped += min(count, ref_counts.get(ngram, 0))
            total += count

        precision = clipped / max(total, 1)
        precisions.append(precision)
        print(f"  p{n} = {clipped}/{total} = {precision:.4f}")

    # Step 2: Brevity penalty
    import math
    c = len(cand_tokens)
    r = len(ref_tokens)
    if c == 0:
        return 0.0  # empty candidate: avoid dividing by zero in r/c
    bp = math.exp(1 - r/c) if c < r else 1.0
    print(f"  BP = {bp:.4f} (candidate: {c}, reference: {r})")

    # Step 3: BLEU = BP × geometric mean of precisions
    log_avg = sum(math.log(max(p, 1e-10)) for p in precisions) / max_n
    bleu = bp * math.exp(log_avg)
    print(f"  BLEU = {bleu:.4f}")
    return bleu

# Example
ref = "India is a diverse country with many languages"
cand = "India is a country with many diverse languages"
print("=== By-hand BLEU ===")
bleu_from_scratch(cand, ref)

# ─── Step 5: Attention Visualization ───
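def visualize_attention(source, target_prefix=""):
    """Extract cross-attention weights for a greedy translation of `source`.

    `target_prefix` is kept for interface compatibility and is not used.
    """
    inputs = tokenizer(source, return_tensors="pt")

    # Generate with attention outputs (greedy search keeps a single hypothesis)
    outputs = model.generate(
        **inputs,
        max_length=64,
        num_beams=1,
        output_attentions=True,
        return_dict_in_generate=True,
    )

    # Decode tokens
    src_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    tgt_tokens = tokenizer.convert_ids_to_tokens(outputs.sequences[0])

    print(f"Source:  {' '.join(src_tokens)}")
    print(f"Target:  {' '.join(tgt_tokens)}")

    # outputs.cross_attentions has one entry per generated step; each entry is a
    # tuple over decoder layers of tensors shaped (batch, heads, query_len, src_len).
    # Use the last layer, average over heads, and keep the newest query position.
    attention_matrix = torch.stack(
        [step[-1][0].mean(dim=0)[-1] for step in outputs.cross_attentions]
    )  # (generated_steps, src_len); row i aligns with tgt_tokens[i + 1]
    print(f"Attention matrix shape: {tuple(attention_matrix.shape)}")

    # To plot, use a matplotlib heatmap:
    # plt.imshow(attention_matrix, cmap='viridis')
    # plt.xticks(range(len(src_tokens)), src_tokens, rotation=45)
    # plt.yticks(range(len(tgt_tokens) - 1), tgt_tokens[1:])
    return src_tokens, tgt_tokens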

# Test translations
test_pairs = [
    ("The Supreme Court upheld the fundamental right to privacy.",
     "सर्वोच्च न्यायालय ने निजता के मौलिक अधिकार को बरकरार रखा।"),
    ("Please book a ticket from Delhi to Mumbai.",
     "कृपया दिल्ली से मुंबई तक का टिकट बुक करें।"),
    ("Artificial intelligence is transforming healthcare in India.",
     "कृत्रिम बुद्धिमत्ता भारत में स्वास्थ्य सेवा को बदल रही है।"),
]

for en, hi in test_pairs:
    predicted_hi = translate_en_to_hi(en)
    print(f"EN: {en}")
    print(f"HI (pred): {predicted_hi}")
    print(f"HI (ref):  {hi}")
    print()

Attention Visualization (ASCII)

Cross-Attention Map: English → Hindi Translation

"The Supreme Court upheld the right to privacy" → "सर्वोच्च न्यायालय ने निजता के अधिकार को बरकरार रखा"

	The	Supreme	Court	upheld	the	right	to	privacy
सर्वोच्च	·	████	··	·	·	·	·	·
न्यायालय	·	··	████	·	·	·	·	·
ने	·	·	·	··	·	·	·	·
निजता	·	·	·	·	·	·	·	████
के	·	·	·	·	███	·	·	·
अधिकार	·	·	·	·	·	████	·	·
को	·	·	·	·	·	·	███	·
बरकरार	·	·	·	████	·	·	·	·
रखा	·	·	·	██	··	·	·	·

Legend: ████ = high attention, ·· = medium, · = low

Key insight: Hindi word order is SOV (Subject-Object-Verb) while English is SVO. The attention learns this reordering! "upheld" (verb, position 4) → "बरकरार रखा" (verb, position 8-9)

Expected Metrics

Metric	Value
BLEU (opus-mt)	24.3
BLEU (IndicTrans2)	33.5
BLEU (En→Fr, same arch)	41.2
BLEU (zero-shot GPT)	8.7

Why is the Hindi BLEU score so much lower than French? It is not because the model is worse. Hindi and English belong to distant branches of the Indo-European family (Indo-Aryan vs. Germanic), use different scripts, follow different word orders (SOV vs. SVO), and differ in morphological complexity. English-French scores run higher because French shares the Latin script, a large stock of vocabulary, and SVO word order with English, so n-gram overlap with the reference is easier to achieve. Always compare BLEU scores within the same language pair, never across language pairs.

Error Analysis: Where Hindi NMT Struggles

Error Type	Source (EN)	Predicted (HI)	Correct (HI)
Honorific loss	"You should go to the doctor"	"तुम डॉक्टर के पास जाओ"	"आप डॉक्टर के पास जाइए" (formal)
Gender mismatch	"The teacher went home"	"शिक्षक घर गया" (masculine)	Could be "शिक्षिका घर गयी" (feminine)
Idiom translation	"It's raining cats and dogs"	"बिल्लियां और कुत्ते बरस रहे हैं" (literal!)	"मूसलाधार बारिश हो रही है"
Named entity	"Modi visited Washington"	"मोदी ने वाशिंग्टन का दौरा किया" ✓	Usually correct for famous entities

Section 4

Visual Aids — The NLP Project Pipeline

Master Pipeline: From Raw Text to Deployed Model

NLP Project Pipeline (Indian Languages)

Stage	Activities
1. Data Collection	Scrape; API; Crowdsource; Annotate
2. Clean & Prep	Script detection; Normalize spelling; Handle emoji; Dedup
3. Tokenization	WordPiece; BPE; SentencePiece; IndicNLP
4. Model Choice	BERT (classification); T5 (generation); Seq2Seq (MT); Custom
5. Fine-Tune	HuggingFace; lr=2e-5; 3-5 epochs; FP16
6. Eval Metrics	F1/Acc; BLEU/ROUGE; Human eval; Per-language
7. Error Analysis	Confusion matrix; Error types; Failure cases; Bias audit
8. Deploy	ONNX export; API serve; Monitor; A/B test

Stages run in order from 1 to 8.

Key Decision Point: Multilingual vs. Language-Specific?

Multilingual (mBERT, MuRIL, XLM-R):
✅ One model for all languages
❌ Diluted per-language performance

Language-specific (HindiBERT, IndicBERT):
✅ Better per-language performance
❌ Need separate model per language

Model Selection Guide

Which Model Do You Need?

Task	Sub-question	Recommended models	Typical score
Classification	English?	BERT / RoBERTa	F1≈0.92
Classification	Indian?	MuRIL / IndicBERT	F1≈0.76
Generation	Summarize?	mT5 / IndicBART	ROUGE≈0.46
Generation	Translate?	IndicTrans2 / opus-mt	BLEU≈33.5

Rule of thumb for Indian NLP:

1. Classification → MuRIL (best for Indic)
2. Generation → mT5 or IndicBART
3. Translation → IndicTrans2 (SOTA)
4. If code-mixed → MuRIL > mBERT > XLM-R
5. If resource-constrained → DistilmBERT

Tokenization Comparison Across Scripts

Input: "मुझे दिल्ली से मुंबई का ticket चाहिए" (I need a ticket from Delhi to Mumbai)

Tokenizer	Output	Comment
English BERT (bert-base-uncased)	[UNK] [UNK] [UNK] [UNK] [UNK] ticket [UNK]	85% of tokens are [UNK]! Completely useless.
mBERT (bert-base-multilingual-cased)	मुझे | दिल्ली | से | मुंबई | का | ticket | चाहिए	Works! But limited Hindi vocab (1,700/120K tokens)
MuRIL (google/muril-base-cased)	मुझे | दिल्ली | से | मुंबई | का | ticket | चाहिए	Same output but richer contextual embeddings for Hindi
IndicTrans SentencePiece	▁मुझे | ▁दिल्ली | ▁से | ▁मुंबई | ▁का | ▁ticket | ▁चाहिए	Clean subword segmentation with script-aware BPE

Section 5

Common Misconceptions

❌ MYTH: "BLEU score of 30 means the translation is 30% correct."

✅ TRUTH: BLEU is NOT a percentage of correctness. A BLEU of 30 for Hindi→English is actually quite good (approaching professional quality for some domains), while a BLEU of 30 for French→English might be mediocre. BLEU measures n-gram overlap with reference translations, not semantic accuracy.

🔍 WHY IT MATTERS: In interviews and papers, comparing BLEU across language pairs is meaningless. Always compare within the same language pair and test set.

❌ MYTH: "Fine-tuning BERT means you only train the classification head."

✅ TRUTH: Fine-tuning means updating ALL parameters — both the pretrained Transformer layers AND the new classification head. If you freeze the Transformer and only train the head, that's called feature extraction (or linear probing), not fine-tuning. True fine-tuning typically gives 3-8% better F1 than feature extraction.

🔍 WHY IT MATTERS: GATE questions specifically test this distinction. In practice, with small datasets (<1K samples), feature extraction may actually be better to avoid overfitting.

❌ MYTH: "Multilingual BERT (mBERT) was explicitly trained on parallel (aligned) multilingual data."

✅ TRUTH: mBERT was trained on the concatenation of Wikipedia articles from 104 languages with NO parallel alignment and NO cross-lingual objectives. Yet it somehow learns cross-lingual representations! This is one of the biggest surprises of modern NLP — it emerges from shared subword tokens and structural similarities.

🔍 WHY IT MATTERS: Understanding this explains why mBERT works on some language pairs (similar structures) better than others, and why models like XLM (with explicit translation objectives) can outperform it.

❌ MYTH: "ROUGE and BLEU measure the same thing."

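✅ TRUTH: BLEU is precision-oriented (what fraction of the candidate's n-grams appear in the reference?). ROUGE is recall-oriented (what fraction of the reference's n-grams appear in the candidate?). BLEU's precision drops when the output adds n-grams that are not in the reference, and its brevity penalty handles outputs that are too short; ROUGE recall drops when the output leaves out reference content. Use BLEU for translation, ROUGE for summarization.

🔍 WHY IT MATTERS: Using the wrong metric leads to wrong model selection. A model that copies the entire source document can score ROUGE-1 recall ≈ 1.0 against a reference summary whose words all come from the document, yet its BLEU will be very low because most of its n-grams are not in the reference.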
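The copy-everything case can be checked with unigram counts alone. The sketch below is a simplified stand-in for ROUGE-1 recall and BLEU's unigram precision, assuming whitespace tokenization; the document and reference strings are invented for illustration.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Return (clipped overlap, candidate length, reference length)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[t], ref[t]) for t in cand)
    return overlap, sum(cand.values()), sum(ref.values())

# A "summarizer" that simply returns the whole document (hypothetical text).
document = ("the court heard the appeal on monday and after long arguments "
            "from both sides the court dismissed the appeal with costs")
reference_summary = "the court dismissed the appeal"

overlap, cand_len, ref_len = unigram_overlap(document, reference_summary)
recall = overlap / ref_len       # ROUGE-1 style: share of the reference covered
precision = overlap / cand_len   # BLEU style: share of the candidate supported
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

Recall is perfect because every reference word occurs in the copied document, while precision falls below 0.25 because most of the copied words are not in the reference.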

❌ MYTH: "Code-mixed (Hinglish) text is just Hindi with some English words. Tokenize each word by its language."

✅ TRUTH: Code-mixing happens at every level — word, morpheme, even character. "Main meeting mein gaya tha" has Hindi syntax wrapping English nouns. Some words are "language-ambiguous": "time" is used in both Hindi and English conversations. Language-tagging each token is itself a hard NLP problem (Language Identification) that must precede other tasks.

🔍 WHY IT MATTERS: Pipelines that assume clean language boundaries fail catastrophically on real Indian social media text.
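One way to see why clean language boundaries are an unsafe assumption is to tag tokens by script, which is the easiest signal available. The sketch below (the function name script_of is hypothetical) uses Unicode character names and labels script only, not language.

```python
import unicodedata

def script_of(token):
    """Label a token by the script of its letters (script, not language)."""
    scripts = set()
    for ch in token:
        if ch.isalpha():  # skips digits, punctuation, and combining vowel signs
            name = unicodedata.name(ch, "")
            if name.startswith("DEVANAGARI"):
                scripts.add("DEVANAGARI")
            elif name.startswith("LATIN"):
                scripts.add("LATIN")
            else:
                scripts.add("OTHER")
    if not scripts:
        return "NONE"
    return scripts.pop() if len(scripts) == 1 else "MIXED"

for sentence in ["मुझे ticket chahiye", "Main meeting mein gaya tha"]:
    print([(tok, script_of(tok)) for tok in sentence.split()])
```

Every token of the second sentence comes back as LATIN, so script detection cannot tell the Hindi words "Main", "mein", "gaya", and "tha" from the English word "meeting". Separating them needs a trained language-identification model.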

Section 6

GATE / Exam Corner

Key Formulas Quick Reference

BLEU = BP · exp(¼ Σ log p_n) | ROUGE-N = Σ count_match(n-gram) / Σ count(n-gram)_ref | F1 = 2PR/(P+R)

GATE Previous Year Questions (Adapted)

GATE CSE 2023 — Adapted

In BERT, the [CLS] token representation is typically used for:

A. Predicting the next word
B. Sentence-level classification tasks
C. Generating text autoregressively
D. Computing positional encodings

Answer: B — The [CLS] (classification) token's final hidden state is a fixed-size representation of the entire input, used as input to a classification head for tasks like sentiment analysis, fake news detection, etc. BERT is encoder-only and does not generate text (eliminating C). Positional encodings are added at input, not derived from [CLS] (eliminating D). BERT uses masked language modeling, not next-word prediction (eliminating A — that's GPT).

Remember | BERT Architecture

GATE DA 2024 — Adapted

Which of the following correctly describes the Brevity Penalty in BLEU score?

A. It penalizes translations that are too long
B. It penalizes translations that are too short
C. It penalizes translations with too many unique n-grams
D. It is always equal to 1.0

Answer: B — BP = exp(1 - r/c) when c < r (candidate shorter than reference), and BP = 1 when c ≥ r. Short translations achieve high precision trivially (e.g., translating a long sentence as just "The" gives perfect unigram precision for "The"). The brevity penalty counteracts this by penalizing candidates shorter than the reference.

Understand | BLEU Score

UGC NET 2024 — Adapted

In a Seq2Seq model with attention for Hindi→English translation, the attention mechanism primarily helps with:

A. Handling the different word order (SOV → SVO)
B. Learning grammar rules explicitly
C. Reducing the vocabulary size
D. Eliminating the need for a decoder

Answer: A. Hindi follows Subject-Object-Verb order while English follows Subject-Verb-Object. The attention mechanism allows the decoder to "look back" at any position in the source sequence, enabling it to reorder information. When translating "राम ने सेब खाया" (literally "Ram [ergative marker] apple ate") to "Ram ate an apple," attention lets the decoder attend to "खाया" (ate, source position 4) when generating "ate" (position 2 in English).
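The "look back at any position" idea can be illustrated with scaled dot-product attention on toy vectors. The 2-dimensional encoder states and the decoder query below are hand-picked, hypothetical numbers, not values from a trained model.

```python
import math

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

source_tokens = ["राम", "ने", "सेब", "खाया"]
# Hypothetical encoder states, one per source token.
keys = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
# Hypothetical decoder state at the step that produces "ate".
query_for_ate = [1.5, 1.5]

weights = attention_weights(query_for_ate, keys)
for tok, w in zip(source_tokens, weights):
    print(f"{tok}: {w:.3f}")
```

With these numbers most of the weight lands on the final source token खाया, even though "ate" is the second word of the English output. That is the reordering the answer describes.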

Understand | Attention Mechanism

Prediction Table: Expected GATE Topics (2025-26)

Topic	Probability	Question Type	Key Concept
BLEU computation	🟢 High	Numerical	Hand-calculate BLEU for given candidate/reference
BERT architecture	🟢 High	MCQ	[CLS] token, encoder-only, MLM objective
ROUGE vs BLEU	🟡 Medium	MCQ	Precision vs recall orientation
Transformer attention	🟢 High	MCQ + NAT	Self vs cross attention, complexity O(n²)
Seq2Seq + attention	🟡 Medium	MCQ	Encoder-decoder, alignment
Tokenization (BPE/WP)	🟡 Medium	MCQ	Subword segmentation methods

Section 7

Interview Prep

Conceptual Questions

Q1: Why does multilingual BERT (mBERT) work across languages despite never seeing parallel data?

Expected Answer (3-4 sentences):

mBERT's cross-lingual ability emerges from three factors: (1) Shared subword tokens: words like "information" appear in many languages (English, French, etc.), creating anchor points; (2) Similar word order: SVO languages share structural patterns that the model learns; (3) Parameter sharing: all languages share the same Transformer weights, forcing the model to learn language-agnostic features. Pires et al. (2019) found that transfer is strongest between typologically similar languages and weakens when word order differs (as with SVO English and SOV Hindi), and that mBERT copes poorly with transliterated (Romanized) Hindi. This is one reason MuRIL, whose training data includes transliterated text, outperforms mBERT on Indian languages.

Q2: You're building a Hinglish sentiment analyzer. Your F1 is 0.68. How would you improve it?

Expected Answer (structured, 5+ strategies):

Data-side: (1) Augment with spelling normalization (map "bohot"→"bahut"); (2) Back-translation augmentation — translate Hindi to English and back to generate paraphrases; (3) Add emoji as features (👎=negative, 🔥=positive).

Model-side: (4) Switch from mBERT to MuRIL (trained on transliterated Indian text); (5) Add a Hinglish language model pre-training step on unlabeled tweets before fine-tuning.

Architecture-side: (6) Ensemble MuRIL + XLM-R + a CNN-based classifier; (7) Use focal loss instead of cross-entropy to handle class imbalance (neutral tweets often dominate).

Evaluation-side: (8) Stratify by language ratio (pure Hindi vs. mixed vs. pure English) to find where the model fails worst.
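Strategy (7) is easiest to understand numerically. The sketch below computes focal loss for a single example in its basic form, -(1 - p_t)^γ · log(p_t), assuming γ = 2 and omitting the optional per-class weighting term. The function names are hypothetical.

```python
import math

def cross_entropy(p_true):
    """Cross-entropy for one example, given the probability of the true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

for p in (0.9, 0.3):  # an easy example and a hard example
    ce, fl = cross_entropy(p), focal_loss(p)
    print(f"p_t={p}: CE={ce:.4f}, focal={fl:.4f}, ratio={fl / ce:.2f}")
```

The easy example (p_t = 0.9) keeps only 1% of its cross-entropy loss, while the hard example (p_t = 0.3) keeps 49%. This is how focal loss stops a dominant, easily classified class such as neutral tweets from swamping the gradient.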

Coding Question

Q3: Implement BLEU score from scratch (no libraries). [Frequently asked at Google, Microsoft India]

The complete solution is in Project 5's bleu_from_scratch() function above. Key points interviewers look for:

Do you implement clipping? (min of candidate count vs. reference count)
Do you include the brevity penalty?
Do you use geometric mean of n-gram precisions (not arithmetic)?
Can you handle edge cases? (zero n-gram matches → avoid log(0))

Case Study Question

Q4: Design a multilingual content moderation system for ShareChat (Indian social media)

Context (India-specific):

ShareChat supports 15 Indian languages. Users post in code-mixed text, and hate speech often uses euphemisms, misspellings, and transliteration to evade filters. Design a system that detects hate speech, misinformation, and graphic violence descriptions across all supported languages.

Expected Design:

Layer 1 — Language Detection: fastText-based language identifier to route text to the right pipeline.

Layer 2 — Multilingual Classifier: MuRIL fine-tuned on labeled hate speech data from each language. Use a unified model (not 15 separate ones) to handle code-mixing.

Layer 3 — Keyword Filter: Maintain a curated list of slurs, dog-whistles, and euphemisms in each language (updated weekly). This catches what the ML model misses.

Layer 4 — Human Review: Route borderline cases (confidence 0.4-0.7) to human moderators who speak the relevant language.

Key challenge: New slang evolves faster than you can retrain. Solution: use retrieval-augmented classification — embed new flagged terms and find nearest neighbors in the hate speech embedding space.
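The hand-off between the classifier and Layer 4 can be sketched as a simple threshold rule. Only the 0.4-0.7 review band comes from the design above. Treating scores above 0.7 as automatic flags and scores below 0.4 as allowed is an assumption for illustration, as are the function name and the returned labels.

```python
def route_post(confidence, low=0.4, high=0.7):
    """Decide what happens to a post, given the classifier's violation confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence > high:
        return "auto_flag"     # assumption: confident violations are flagged directly
    if confidence >= low:
        return "human_review"  # borderline band from the design: 0.4 to 0.7
    return "allow"             # assumption: low scores pass through

for score in (0.95, 0.55, 0.10):
    print(score, route_post(score))
```

In practice the two thresholds would be tuned per language against moderator capacity, since the review queue grows with the width of the band.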

🇮🇳 Interview Focus (India)

Hinglish/code-mixing handling
Low-resource language techniques
WhatsApp-scale systems
MuRIL, IndicBERT, IndicTrans
GATE numerical: BLEU calculation
Companies: Flipkart, ShareChat, Google India, Microsoft India

🇺🇸 Interview Focus (US/Global)

BERT/GPT architecture details
Scaling laws and efficiency
Bias and fairness in NLP
Prompt engineering and few-shot
System design: search ranking, recommendation
Companies: Google, Meta, OpenAI, Amazon

Section 8

Hands-On Lab / Mini-Project

🔬 Mini-Project: End-to-End Multilingual News Classifier for Indian Languages

Objective

Build a news article topic classifier that works on Hindi, English, and Hinglish text. Classify articles into 5 categories: Sports, Politics, Technology, Entertainment, Business.

Requirements

Component	Specification
Dataset	BBC Hindi + BBC English + scraped Hinglish news (total ~10K articles)
Model	MuRIL or mBERT, fine-tuned with HuggingFace Trainer
Preprocessing	Handle Devanagari + Roman, normalize spelling, handle code-mixing
Training	80/10/10 train/val/test split, stratified by language AND category
Evaluation	Per-language F1, confusion matrix, error analysis on 50 misclassified samples
Deliverable	Jupyter notebook + 2-page report + working inference function
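The training requirement asks for a split stratified by language and category at the same time. One way to do this is to split each (language, category) group separately, as in the standard-library sketch below. The function name, the row layout, and the synthetic data are assumptions for illustration only.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split rows into train/val/test so every key group keeps the same ratios."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    train, val, test = [], [], []
    for _, members in sorted(groups.items()):
        rng.shuffle(members)
        n_train = int(len(members) * ratios[0])
        n_val = int(len(members) * ratios[1])
        train += members[:n_train]
        val += members[n_train:n_train + n_val]
        test += members[n_train + n_val:]  # remainder goes to test
    return train, val, test

# Synthetic rows: 3 languages x 5 categories x 20 articles each.
languages = ["hi", "en", "hinglish"]
categories = ["Sports", "Politics", "Technology", "Entertainment", "Business"]
rows = [{"lang": l, "label": c, "text": f"{l}-{c}-{i}"}
        for l in languages for c in categories for i in range(20)]

train, val, test = stratified_split(rows, key=lambda r: (r["lang"], r["label"]))
print(len(train), len(val), len(test))
```

With scikit-learn, a similar effect is usually obtained by passing a combined language-plus-category column to the stratify argument of train_test_split.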

Rubric (100 points)

Criterion	Points	Excellent (90-100%)	Good (70-89%)	Needs Work (<70%)
Data Preprocessing	20	Handles all three languages, normalizes spelling, removes noise intelligently	Basic cleaning, handles two languages	Minimal cleaning, English-only
Model Training	25	MuRIL fine-tuned, proper hyperparameter search, FP16 training	mBERT fine-tuned, reasonable hyperparams	No fine-tuning, uses pre-trained directly
Evaluation	20	Per-language F1 breakdown, confusion matrix, statistical significance	Overall F1 and accuracy reported	Only accuracy reported
Error Analysis	20	50+ samples analyzed, error categories identified, improvement strategies proposed	20+ samples analyzed	No error analysis
Code Quality	15	Clean, commented, reproducible, modular functions	Functional but messy	Hardcoded, not reproducible

Starter Code

Python
# Mini-Project Starter Code
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
from sklearn.model_selection import train_test_split
import pandas as pd

# TODO: Step 1 — Load and preprocess your dataset
# df = pd.read_csv('multilingual_news.csv')
# df['text'] = df['text'].apply(preprocess_multilingual)

# TODO: Step 2 — Choose model
MODEL_NAME = "google/muril-base-cased"  # Your choice!

# TODO: Step 3 — Tokenize, create datasets

# TODO: Step 4 — Train with Trainer API

# TODO: Step 5 — Evaluate per-language

# TODO: Step 6 — Error analysis on misclassified samples

# TODO: Step 7 — Write report

Section 9

Exercises

Section A — Conceptual Questions (5)

Explain why tokenizing "बहुत अच्छी movie थी" with English BERT's tokenizer produces mostly [UNK] tokens, while mBERT produces meaningful subwords. What structural difference in their vocabularies causes this, and what changes when the same sentence is romanized as "bahut acchi movie thi"?

Answer: English BERT's vocabulary contains only ~30,000 WordPiece tokens derived from English text. It has almost no coverage of Devanagari, so nearly every Devanagari word maps to [UNK]. mBERT's vocabulary has ~120,000 tokens derived from 104 languages' Wikipedia, including ~1,700 Hindi/Devanagari tokens, so the same words become meaningful subwords. The romanized form behaves differently: both vocabularies contain Latin characters, so "bahut" is not [UNK] but gets fragmented (ba + ##hu + ##t), because neither vocabulary has many Roman-Hindi subwords. mBERT's Hindi data came from Devanagari Wikipedia, not Roman transliterations.
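Both behaviours, [UNK] and fragmentation, follow from WordPiece's greedy longest-match rule. The sketch below implements that rule for a single word. The two toy vocabularies are hypothetical and contain only a handful of entries, whereas real vocabularies have tens of thousands.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabularies (hypothetical).
vocab_without_devanagari = {"movie", "ba", "##hu", "##t"}
vocab_with_devanagari = vocab_without_devanagari | {"बहुत", "थी"}

print(wordpiece("बहुत", vocab_without_devanagari))
print(wordpiece("बहुत", vocab_with_devanagari))
print(wordpiece("bahut", vocab_with_devanagari))
```

A word becomes [UNK] only when no vocabulary entry can start or continue it. A word whose characters are covered but whose larger units are missing is split into short pieces instead.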

Compare extractive and abstractive summarization for Indian legal judgments. Which is more appropriate and why?

Answer: Extractive selects existing sentences verbatim — safe but verbose, and legal sentences are often 100+ words long and full of references. Abstractive generates new sentences — more concise but risks hallucinating incorrect legal citations (e.g., inventing "Section 420 IPC" when the case involves Section 302). For legal documents, a hybrid approach is best: extractive filtering to identify key passages, followed by abstractive summarization on the extracted text. This limits hallucination while producing readable summaries.

Why is ROUGE more appropriate than BLEU for evaluating summarization systems?

Answer: ROUGE is recall-oriented: it measures what fraction of the reference summary's content is captured in the generated summary. For summarization, we care about coverage: did the summary include all key points? BLEU is precision-oriented: it measures what fraction of the generated text matches the reference. A very short summary could have high n-gram precision (all its words match) but miss critical information, and BLEU's brevity penalty only partly corrects for this. Additionally, ROUGE-L uses the longest common subsequence, which tolerates gaps and inserted words between matches, whereas BLEU requires contiguous n-gram matches.

What is "code-mixing" and why does it make NLP harder? Give 3 specific examples from Indian languages.

Answer: Code-mixing is the fluid interleaving of two or more languages within a single utterance. It makes NLP harder because: (1) Tokenizers designed for one language fragment the other language's words; (2) Sentiment/syntax models trained on monolingual data don't understand mixed-language patterns; (3) There's no standard orthography for transliterated text. Examples: (i) "Main kal meeting mein bahut nervous tha" (Hindi+English+Hindi); (ii) "Arre yaar, this movie was bakwas, complete waste" (Hindi interjections + English); (iii) "Can you book मुझे एक ticket to Delhi?" (English+Hindi+English with Devanagari mid-sentence).

Explain the difference between MuRIL, mBERT, and XLM-RoBERTa for Indian language tasks. When would you choose each?

Answer: mBERT: Trained on 104 languages' Wikipedia. Good general multilingual model, but Hindi gets only ~1.4% of the vocabulary. Best when you need one model for many languages including non-Indian. XLM-RoBERTa: Trained on 2.5TB CommonCrawl data from 100 languages. Stronger than mBERT due to more diverse training data and no NSP objective. Best for cross-lingual transfer from English. MuRIL: Trained specifically on 17 languages (16 Indian languages plus English), including transliterated text. Best for Indian language tasks: it outperforms both by 2-5% F1 on Indian benchmarks. Choose MuRIL for India-focused tasks; XLM-R for multilingual tasks; mBERT only if resource-constrained.

Section B — Mathematical Questions (8)

Intermediate

Calculate the BLEU score for: Reference: "the cat is on the mat" | Candidate: "the the the the"

Answer: p₁: "the" appears 4 times in candidate, but max count in reference is 2. Clipped count = 2. Total candidate unigrams = 4. p₁ = 2/4 = 0.5. p₂: Bigram "the the" appears 3 times in candidate, 0 times in reference. p₂ = 0/3 = 0. Since p₂ = 0, log(p₂) = -∞, so BLEU = 0. (Once any pₙ = 0, the entire BLEU becomes 0 due to the geometric mean.)
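The clipping step in this answer can be checked with a few lines of Python. The sketch below is a minimal illustration, assuming whitespace tokenization and a single reference; the helper names `ngrams` and `clipped_precision` are hypothetical and not part of any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(reference, candidate, n):
    """Modified n-gram precision: each candidate n-gram count is clipped
    to the number of times that n-gram occurs in the reference."""
    ref_counts = Counter(ngrams(reference.split(), n))
    cand_counts = Counter(ngrams(candidate.split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        # Candidate is too short to contain any n-gram of this order.
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

reference = "the cat is on the mat"
candidate = "the the the the"
p1 = clipped_precision(reference, candidate, 1)  # 2 clipped matches out of 4 unigrams
p2 = clipped_precision(reference, candidate, 2)  # 0 matches out of 3 bigrams
print(p1, p2)
```

Without clipping, the unigram precision would be 4/4, which is exactly the degenerate case the clipped count is designed to prevent.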

Intermediate

Reference: "India is a beautiful country" (5 words). Candidate: "India is beautiful" (3 words). Calculate the brevity penalty.

Answer: c = 3 (candidate length), r = 5 (reference length). Since c < r: BP = exp(1 - r/c) = exp(1 - 5/3) = exp(-2/3) = exp(-0.667) ≈ 0.513. The translation is heavily penalized for being too short.

Intermediate

Compute ROUGE-1, ROUGE-2, and ROUGE-L for: Reference: "the cat sat on the mat" | System summary: "the cat on the mat"

Answer: ROUGE-1 (unigram recall): Reference unigrams: {the:2, cat:1, sat:1, on:1, mat:1} = 6 tokens. System matches: {the:2, cat:1, on:1, mat:1} = 5 matches. ROUGE-1 = 5/6 ≈ 0.833. ROUGE-2 (bigram recall): Reference bigrams: {the cat, cat sat, sat on, on the, the mat} = 5. System bigrams: {the cat, cat on, on the, the mat} = 4. Matches: {the cat, on the, the mat} = 3. ROUGE-2 = 3/5 = 0.600. ROUGE-L (LCS): LCS of "the cat sat on the mat" and "the cat on the mat" = "the cat on the mat" (length 5). ROUGE-L recall = 5/6 ≈ 0.833.
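The ROUGE-1 and ROUGE-2 figures above can be reproduced with a short script. This is a minimal sketch, assuming whitespace tokenization and one reference; `rouge_n_recall` is a hypothetical helper name, and real ROUGE toolkits add stemming and other options not modeled here.

```python
from collections import Counter

def rouge_n_recall(reference, candidate, n):
    """ROUGE-N recall: overlapping n-grams divided by the number of
    n-grams in the reference (overlap is clipped by candidate counts)."""
    ref = reference.split()
    cand = candidate.split()
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / total

reference = "the cat sat on the mat"
system = "the cat on the mat"
rouge_1 = rouge_n_recall(reference, system, 1)  # 5 of 6 reference unigrams
rouge_2 = rouge_n_recall(reference, system, 2)  # 3 of 5 reference bigrams
print(round(rouge_1, 3), round(rouge_2, 3))
```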

Advanced

In BERT's self-attention, if the input sequence has length n = 512 and hidden dimension d = 768 with h = 12 heads, what is (a) the dimension of each head's Q, K, V matrices, and (b) the total FLOPs for one self-attention layer?

Answer: (a) Each head dimension dₖ = d/h = 768/12 = 64. So Q, K, V for each head are n × 64 = 512 × 64. (b) Counting one multiply-accumulate as one operation: computing QKᵀ is (512 × 64) × (64 × 512) = 512 × 512 × 64 ≈ 16.8M multiplications per head, × 12 heads ≈ 201M. Computing attention × V: another 201M. Total ≈ 402M for the attention computation. Including the linear projections (W_Q, W_K, W_V, W_O): 4 × n × d × d = 4 × 512 × 768² ≈ 1.2B. Grand total ≈ 1.6B per layer. If multiplies and adds are counted as separate FLOPs, double each figure (≈ 3.2B FLOPs per layer).
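The arithmetic in this answer is easy to get wrong by a factor of h or n, so a quick numeric check may help. The sketch below counts multiply-accumulate operations, as the answer does, and ignores softmax, biases, and layer norm; `attention_macs` is a hypothetical helper name.

```python
def attention_macs(n, d, h):
    """Count multiply-accumulates in one self-attention layer.

    n: sequence length, d: hidden size, h: number of heads.
    Softmax, scaling, biases and layer norm are ignored.
    """
    d_k = d // h                    # per-head dimension
    scores = h * n * n * d_k        # Q K^T for every head
    context = h * n * n * d_k       # attention weights times V for every head
    projections = 4 * n * d * d     # W_Q, W_K, W_V, W_O
    return {
        "d_k": d_k,
        "scores": scores,
        "context": context,
        "projections": projections,
        "total": scores + context + projections,
    }

counts = attention_macs(n=512, d=768, h=12)
for name, value in counts.items():
    print(f"{name:12s} {value:,}")
```

Note that the projections dominate at n = 512; the quadratic attention terms only overtake them once n exceeds 2d.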

Intermediate

A fake news classifier has: TP=180, FP=20, FN=30, TN=770. Compute precision, recall, F1, and accuracy. Is this model production-ready?

Answer: Precision = TP/(TP+FP) = 180/200 = 0.90. Recall = TP/(TP+FN) = 180/210 = 0.857. F1 = 2×0.9×0.857/(0.9+0.857) = 1.543/1.757 = 0.878. Accuracy = (TP+TN)/(Total) = 950/1000 = 0.95. Assessment: F1 of 0.878 is good but the recall of 0.857 means ~14% of fake news goes undetected. For a safety-critical application like fake news, we need higher recall (>0.95) even at the cost of precision. Use a lower classification threshold.
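The same four metrics can be computed directly from the confusion-matrix counts. This is a minimal sketch; `classification_metrics` is a hypothetical helper, and it assumes "fake" is the positive class.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

metrics = classification_metrics(tp=180, fp=20, fn=30, tn=770)
for name, value in metrics.items():
    print(f"{name:10s} {value:.3f}")

# Share of fake items the classifier misses (1 - recall).
print(f"missed     {1 - metrics['recall']:.3f}")
```

Accuracy looks strong here largely because 790 of the 1,000 samples are true news; recall on the minority "fake" class is the number that matters for the production decision.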

Beginner

If mBERT has a vocabulary of 120,000 tokens across 104 languages, approximately how many tokens are allocated to Hindi? Why is this a problem?

Answer: If tokens were distributed equally: 120,000/104 ≈ 1,154 per language. In practice, Hindi gets ~1,700 tokens (slightly more due to Wikipedia size). This is a problem because Hindi has rich morphology — "खाऊँगी" is a single word requiring ~3-4 subword tokens. With only 1,700 tokens, many Hindi words get over-fragmented into 5+ subword pieces, leading to: (1) longer sequences (hitting the 512-token limit faster), (2) loss of semantic coherence (each subword loses meaning), (3) slower inference.

Advanced

Derive the gradient of the cross-entropy loss with respect to the logits for a 3-class sentiment classifier. Show that ∂L/∂zᵢ = pᵢ - yᵢ where pᵢ = softmax(z)ᵢ.

Answer: Loss L = -Σⱼ yⱼ log(pⱼ) where pⱼ = exp(zⱼ)/Σₖexp(zₖ). For ∂L/∂zᵢ: ∂L/∂zᵢ = -Σⱼ yⱼ · (1/pⱼ) · ∂pⱼ/∂zᵢ. We need ∂pⱼ/∂zᵢ: If j=i: ∂pᵢ/∂zᵢ = pᵢ(1-pᵢ). If j≠i: ∂pⱼ/∂zᵢ = -pⱼpᵢ. Substituting: ∂L/∂zᵢ = -yᵢ(1-pᵢ) + Σⱼ≠ᵢ yⱼpᵢ = -yᵢ + yᵢpᵢ + pᵢΣⱼ≠ᵢyⱼ = -yᵢ + pᵢ(yᵢ + Σⱼ≠ᵢyⱼ) = -yᵢ + pᵢ·1 = pᵢ - yᵢ, where the last step uses Σⱼ yⱼ = 1 (the label is one-hot or, more generally, a probability distribution). ∎ This elegant result means the gradient is simply the difference between the predicted probability and the true label, which is the foundation of backpropagation through softmax.
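One way to build confidence in the result is a finite-difference check: perturb each logit slightly and compare the measured slope of the loss with pᵢ - yᵢ. The sketch below uses only the standard library; the logits and label are arbitrary illustrative values, not taken from the chapter.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """Cross-entropy between label distribution y and softmax(z)."""
    p = softmax(z)
    return -sum(yj * math.log(pj) for yj, pj in zip(y, p))

def numerical_grad(z, y, eps=1e-6):
    """Central-difference estimate of dL/dz_i for every logit."""
    grad = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        grad.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps))
    return grad

z = [2.0, -1.0, 0.5]   # arbitrary logits for 3 sentiment classes
y = [0.0, 1.0, 0.0]    # one-hot label: class 1 is correct

analytic = [p - t for p, t in zip(softmax(z), y)]  # p_i - y_i
numeric = numerical_grad(z, y)
for a, n in zip(analytic, numeric):
    print(f"analytic {a:+.6f}   numeric {n:+.6f}")
```

The analytic gradient also sums to zero across classes, since both p and y sum to 1.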

Intermediate

A Hindi→English NMT model produces translations with average length 15 words when references average 20 words. If the n-gram precisions are p₁ = 0.85, p₂ = 0.55, p₃ = 0.35, p₄ = 0.20, what is the BLEU score?

Answer: BP = exp(1 - 20/15) = exp(1 - 1.333) = exp(-0.333) ≈ 0.717. log-precision average = (log 0.85 + log 0.55 + log 0.35 + log 0.20)/4 = (-0.163 + (-0.598) + (-1.050) + (-1.609))/4 = -3.420/4 = -0.855. BLEU = 0.717 × exp(-0.855) = 0.717 × 0.425 = 0.305 or 30.5. This is a reasonable BLEU score for Hindi→English, with the brevity penalty reducing it from what would otherwise be ~42.5 without BP.
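The calculation above combines two pieces: the brevity penalty and the geometric mean of the n-gram precisions. A minimal sketch, assuming the precisions are already known and uniform weights are used; `brevity_penalty` and `bleu_from_precisions` are hypothetical helper names.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if the candidate is at least as long as the reference,
    otherwise exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

def bleu_from_precisions(precisions, candidate_len, reference_len):
    """BLEU with uniform weights, given precomputed n-gram precisions."""
    if min(precisions) == 0:
        # log(0) is undefined, so the geometric mean (and BLEU) is 0.
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(candidate_len, reference_len) * math.exp(log_avg)

precisions = [0.85, 0.55, 0.35, 0.20]
score = bleu_from_precisions(precisions, candidate_len=15, reference_len=20)
no_bp = bleu_from_precisions(precisions, candidate_len=20, reference_len=20)
print(f"BLEU with BP:    {score:.3f}")
print(f"BLEU without BP: {no_bp:.3f}")
```

Calling `brevity_penalty(3, 5)` reproduces the penalty from the earlier "India is beautiful" exercise.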

Section C — Coding Questions (4)

Intermediate

Write a function normalize_hinglish(text) that handles at least 10 common Hinglish spelling variants. For example: "bohot" → "bahut", "achha" → "acha", "kese" → "kaise", etc.

Hint: Use a dictionary mapping + regex for pattern-based normalization (repeated letters like "hiii" → "hi", "pleaseee" → "please"). Also handle: nhi/nahin → nahi, h/he → hai, me/mei → mein. Keep "main" (I) and "mein" (in) as separate words, since merging them changes the meaning; note too that "he" collides with the English pronoun in code-mixed text.
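One possible starting point is sketched below. The variant table and the choice of canonical spellings are illustrative assumptions, not a standard; the sketch deliberately leaves "main"/"mein" and "he" untouched because those strings are also distinct words ("I" vs. "in", and the English pronoun), so a context-free mapping would corrupt them.

```python
import re

# Illustrative variant -> canonical table (assumed spellings, extend as needed).
HINGLISH_VARIANTS = {
    "bohot": "bahut", "bohut": "bahut", "bhut": "bahut",
    "achha": "acha", "accha": "acha", "achcha": "acha",
    "kese": "kaise",
    "nhi": "nahi", "nahin": "nahi", "nai": "nahi",
    "h": "hai",
    "kyu": "kyun", "kyon": "kyun",
    "thik": "theek", "thk": "theek",
}

# Three or more repeats of the same Latin letter ("hiii", "pleaseee").
_REPEATED_LETTER = re.compile(r"([a-z])\1{2,}")
_LATIN_WORD = re.compile(r"[a-z]+")

def normalize_hinglish(text):
    """Lowercase, collapse elongated letters, then map known variants."""
    text = text.lower()
    # Collapse elongations to a single letter; digits such as "1000" are untouched.
    text = _REPEATED_LETTER.sub(r"\1", text)
    # Replace whole Latin-script words that appear in the variant table.
    return _LATIN_WORD.sub(
        lambda m: HINGLISH_VARIANTS.get(m.group(0), m.group(0)), text
    )

print(normalize_hinglish("Bohot achha movie thi, pleaseee dekho"))
print(normalize_hinglish("ticket ka price 1000 h"))
```

Collapsing to a single letter is a blunt rule: it also turns "coool" into "col", so a fuller solution might collapse to two letters and check both forms against a word list.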

Advanced

Implement a function compute_rouge_l(reference, candidate) from scratch that computes ROUGE-L using the longest common subsequence algorithm. Do NOT use any NLP libraries.

Hint: Use dynamic programming to find LCS length. ROUGE-L recall = LCS_length / reference_length. ROUGE-L precision = LCS_length / candidate_length. F1-ROUGE-L = 2·P·R/(P+R). The DP table is (m+1) × (n+1) where m = len(reference), n = len(candidate).
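The dynamic-programming table described in the hint can be sketched as follows. This is one possible solution outline, assuming whitespace tokenization and a single reference; it returns precision, recall, and F1 in a dictionary, which is a design choice rather than a requirement of the exercise.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def compute_rouge_l(reference, candidate):
    """ROUGE-L precision, recall and F1 from the LCS of the token sequences."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(ref, cand)
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = compute_rouge_l("the cat sat on the mat", "the cat on the mat")
print({name: round(value, 3) for name, value in scores.items()})
```

The table costs O(m·n) time and memory; for long legal documents, keeping only two rows of the table reduces memory to O(n).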

Intermediate

Using HuggingFace, write a script that loads MuRIL, fine-tunes it on 100 labeled Hinglish sentiment samples, and evaluates on 20 test samples. Report per-class precision, recall, and F1.

Hint: Use AutoModelForSequenceClassification.from_pretrained("google/muril-base-cased", num_labels=3). Create a custom Dataset class. Use Trainer with TrainingArguments(num_train_epochs=3, learning_rate=2e-5). Use sklearn's classification_report for per-class metrics.

Advanced

Build a simple language detection function that classifies a given text as "Hindi (Devanagari)", "Hindi (Roman/Hinglish)", or "English" using Unicode ranges. No ML — just rule-based.

Hint: Devanagari Unicode range: U+0900–U+097F. Count Devanagari characters vs. Latin characters. If >50% Devanagari → "Hindi (Devanagari)". If >50% Latin and contains common Hindi words (main, hai, kya, bahut, nahi) → "Hinglish". Otherwise → "English". Handle mixed scripts by checking ratios.
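The rule set in the hint might be sketched like this. The hint-word list is the one given above; the "Unknown" label for input with no letters is an added assumption, and `detect_language` is a hypothetical function name.

```python
import re

# Common romanized Hindi words from the hint; a real list would be much longer.
HINDI_HINT_WORDS = {"main", "hai", "kya", "bahut", "nahi"}

def detect_language(text):
    """Rule-based script/language guess using Unicode ranges and a word list."""
    # Devanagari block: U+0900 to U+097F (includes vowel signs and digits).
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097f")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    letters = devanagari + latin
    if letters == 0:
        return "Unknown"  # assumption: no letters to judge (digits, emoji, etc.)
    if devanagari / letters > 0.5:
        return "Hindi (Devanagari)"
    # Mostly Latin script: look for romanized Hindi function words.
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & HINDI_HINT_WORDS:
        return "Hindi (Roman/Hinglish)"
    return "English"

for sample in ["यह फिल्म बहुत अच्छी थी", "movie bahut acchi thi", "The movie was great"]:
    print(sample, "->", detect_language(sample))
```

A word list this small produces false positives: "main" is also an English word, so "the main reason" would be labeled Hinglish. Requiring two or more hint words, or a larger list, is one way to reduce that.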

Section D — Critical Thinking Questions (3)

Advanced

A company deploys a fake news detection model for Indian WhatsApp. The model has 92% accuracy on the test set but users report it frequently flags legitimate political opinions as fake news. Analyze what might be going wrong and propose 3 concrete solutions.

Answer: (1) Training data bias: The model was likely trained on data where political content was disproportionately labeled as fake. Fix: Re-audit training labels, ensure balanced representation of legitimate political discourse. (2) Feature leakage: The model might be using political keywords as proxies for fakeness (e.g., any message mentioning a political party gets flagged). Fix: Use adversarial debiasing: train the encoder so that a secondary classifier cannot predict political party from its hidden representations. (3) Threshold issue: 92% overall accuracy says little about borderline cases, and opinion pieces near the decision boundary may be flagged with only moderate confidence. Fix: Add an "uncertain" category for predictions with confidence 0.4-0.7 and route these to human review. Also consider: is the test set representative of the real WhatsApp distribution?

Advanced

Your Hindi→English NMT model gets BLEU = 24 on news text but BLEU = 8 on poetry. Explain why, and propose how to improve poetry translation without hurting news performance.

Answer: News text is factual with standard vocabulary and grammar — close to the training data. Poetry has metaphors ("आँखों में तूफ़ान" literally "storm in the eyes"), non-standard word order for meter, cultural references, and multiple valid translations (one metaphor can be translated 10 different ways). BLEU penalizes creative paraphrasing because it matches exact n-grams. Solutions: (1) Use multiple references (5+ translations) for BLEU — a single reference unfairly penalizes valid alternatives. (2) Fine-tune a separate poetry model on a poetry parallel corpus. (3) Use semantic similarity metrics (BERTScore) alongside BLEU for evaluation. (4) Domain-adaptive pretraining: continue pretraining the LM on Hindi poetry before fine-tuning on translation pairs.

Advanced

India's judiciary wants to deploy your legal summarization system. What ethical, legal, and technical concerns would you raise before deployment? How would you address each?

Answer: Ethical: (1) Hallucinated legal citations could lead to wrongful legal advice; add a confidence score and the disclaimer "This is an AI summary, not legal advice." (2) Bias: if the model is trained on English-language judgments, it may poorly summarize Hindi judgments, disadvantaging courts that operate in Hindi. Legal: (3) Under the IT Act, AI-generated legal summaries might create liability: who is responsible if a wrong summary leads to a bad legal decision? (4) Data privacy: judgments may contain victims' names and minors' details. Technical: (5) Long document handling: 50-page judgments exceed model limits, so robust chunking is needed. (6) Evaluation: ROUGE doesn't measure factual accuracy, so human evaluation by lawyers is needed. (7) Adversarial: a party could craft documents designed to confuse the summarizer. Mitigation: deploy as an "assistive" tool, not a replacement. Always show the source text alongside the summary.

★ Starred Research Questions (2)

★ R1

Advanced

Read the MuRIL paper (Khanuja et al., 2021). How does MuRIL's pretraining differ from mBERT? Why does including transliterated text in pretraining improve performance on code-mixed tasks? Design an experiment to test whether transliteration augmentation helps IndicBERT as well.

★ R2

Advanced

The IndicTrans2 paper (Gala et al., 2023) claims state-of-the-art Hindi↔English translation. Read the paper and answer: (a) How does their "script unification" approach work? (b) Why is SentencePiece with character coverage=0.9995 crucial for Indic scripts? (c) Design a study comparing IndicTrans2 vs. fine-tuned mBART on legal document translation (not general text).

Section 10

Connections

How This Chapter Connects to the Rest of the Book

← Builds On

Chapter 14 (LSTMs & GRUs): The Seq2Seq chatbot (P3) uses an LSTM encoder-decoder architecture. Its attention mechanism builds on the hidden state sequences from Chapter 14.

Chapter 15 (Transformers & Attention): Every project uses Transformers — BERT for classification (P1, P4), T5/BART for generation (P2), and full encoder-decoder Transformers for translation (P5). Self-attention and cross-attention from Ch. 15 are directly applied here.

Chapter 17 (Transfer Learning): All 5 projects use transfer learning. Pre-trained models (MuRIL, mT5, opus-mt) are fine-tuned on domain-specific data. The concepts of feature extraction vs. fine-tuning from Ch. 17 are critical.

→ Enables

Chapter 22 (Future & Ethics): The ethical concerns raised in P4 (fake news bias) and P2 (legal AI responsibility) connect directly to AI ethics discussions.

MLOps: Deploying these models in production requires the model serving, monitoring, and retraining pipelines covered in the MLOps chapter.

Capstone projects: Students can extend any of these 5 projects into a full capstone with real data.

🔬 Research Frontier

Large Language Models for Indian Languages: IndicLLM, Sarvam AI's models, and Krutrim are attempting to build GPT-scale models specifically for Indian languages. The fundamental challenges (tokenization, code-mixing, morphological richness) discussed in this chapter explain why simply scaling English LLMs doesn't solve Indian language AI.

🏭 Industry Implementation

Bhashini (Government of India): India's national platform for language translation and NLP services uses many of the same techniques covered here — IndicTrans for translation, IndicBERT for understanding, and ASR models for speech. The AI4Bharat consortium (IIT Madras) has open-sourced many of these models.

Section 11

Chapter Summary

🎯 Key Takeaways

Indian NLP is fundamentally harder than English NLP due to code-mixing, script diversity (13 scripts), morphological richness, spelling variation, and severe data scarcity. Pretending it's "just another language" leads to catastrophic failures.
MuRIL > mBERT > English BERT for Indian language classification tasks. MuRIL's inclusion of transliterated text in pretraining gives it a 2-5% F1 advantage on code-mixed tasks.
Legal summarization requires a hybrid approach: extractive filtering (to fit within token limits) followed by abstractive generation (for readable summaries). Watch for hallucinated legal citations.
Task-oriented chatbots decompose into Intent → Slots → Response: Use a classifier for intent, token-level NER for slot filling, and either templates or Seq2Seq for response generation. Bilingual handling is the key Indian challenge.
BLEU ≠ quality percentage. BLEU measures n-gram overlap with clipping and brevity penalty. Never compare BLEU scores across language pairs. Use BLEU for translation, ROUGE for summarization, F1 for classification.
Attention visualization reveals word alignment between source and target in NMT. Hindi (SOV) → English (SVO) word order changes are visible in cross-attention maps.
Domain-specific preprocessing is critical: Legal text needs section normalization, social media needs emoji handling, WhatsApp messages need forward-indicator detection. There is no one-size-fits-all pipeline.

Key Equations:
BLEU = BP · exp(¼ Σₙ₌₁⁴ log pₙ) | ROUGE-N = Σ match(n-gram) / Σ count(n-gram)_ref | F1 = 2PR/(P+R)
BP = min(1, exp(1 − r/c)) | Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V

Key Intuition: Building NLP for a multilingual country like India isn't about translating English NLP — it's about reimagining NLP from the ground up for a world where languages don't have neat boundaries.

Section 12

Paper	Year	Relevance
Vaswani et al. — "Attention Is All You Need"	2017	Foundation for all Transformer-based projects
Devlin et al. — "BERT"	2019	Pre-training + fine-tuning paradigm (P1, P3, P4)
Pires et al. — "How Multilingual is mBERT?"	2019	Why mBERT transfers across languages
Khanuja et al. — "MuRIL"	2021	Best model for Indian NLP tasks
Malik et al. — "ILDC for CJPE"	2021	Indian legal document corpus (P2)
Gala et al. — "IndicTrans2"	2023	SOTA Hindi↔English translation (P5)
Patwa et al. — "SemEval-2020 Task 9"	2020	Hinglish sentiment benchmark (P1)
Papineni et al. — "BLEU"	2002	The original BLEU metric paper