Neural Networks & Deep Learning
Chapter 21: Applied Deep Learning - NLP and Sequence Projects
From Hindi Tweets to Supreme Court Judgments - Building NLP Systems for Bharat
Reading Time: ~4 hours | Unit VII: Applications & Industry | Project-Based Chapter
Prerequisites: Chapter 14 (LSTMs & GRUs), Chapter 15 (Transformers & Attention), Chapter 17 (Transfer Learning)
Bloom's Taxonomy Map for This Chapter
| Bloom's Level | What You'll Achieve |
|---|---|
| Remember | Recall BERT architecture, tokenization strategies (BPE, WordPiece, SentencePiece), evaluation metrics (BLEU, ROUGE, F1), and the key differences between code-mixed and monolingual NLP |
| Understand | Explain why multilingual BERT outperforms monolingual models on Hinglish, how attention mechanisms enable alignment in NMT, and why abstractive summarization requires a decoder |
| Apply | Implement 5 complete NLP projects using HuggingFace Transformers: sentiment analysis, summarization, chatbot, fake news detection, and machine translation |
| Analyze | Diagnose tokenization failures in Devanagari text, analyze attention maps to find translation alignment errors, and compare ROUGE scores across extractive vs. abstractive summaries |
| Evaluate | Assess whether a BLEU score of 22 is acceptable for Hindi→English translation, judge model bias in fake news classifiers, and critique chatbot response quality using human evaluation |
| Create | Design and build an end-to-end multilingual NLP pipeline that handles code-mixed Indian language input, including data preprocessing, model selection, and deployment considerations |
Learning Objectives
By the end of this chapter, you will be able to:
- Fine-tune multilingual BERT (mBERT) and IndicBERT for Hindi/Hinglish sentiment analysis on code-mixed social media text
- Build an abstractive summarization pipeline for Indian legal documents using encoder-decoder Transformers
- Implement a Seq2Seq chatbot with attention mechanism for handling bilingual booking queries (IRCTC-style)
- Train a BERT-based fake news classifier for Indian WhatsApp misinformation with domain-specific preprocessing
- Construct a Hindi-English Neural Machine Translation system using Transformers, compute BLEU scores, and visualize cross-attention alignments
- Compare Indian multilingual NLP challenges with English-only equivalents in terms of data availability, tokenization, and evaluation metrics
- Evaluate model performance using BLEU, ROUGE-L, F1, precision, recall, and conduct error analysis on failure cases
- Design deployment strategies for Indian language NLP models, considering script diversity, code-switching, and resource constraints
Opening Hook: The Language of a Billion
"Yeh train kitne baje aayegi?" - When AI Meets India's Languages
India has 22 scheduled languages and 19,500 dialects. More than 500 million Indians access the internet primarily in languages other than English. When Ravi in Varanasi tweets "Bahut bakwas movie thi, waste of money," he's writing in Hinglish, a fluid mix of Hindi and English, with Devanagari sometimes sprinkled in: "बहुत बकवास movie थी." No standard spelling. No clear language boundary. No neat grammar rules.
Building NLP that works for India is one of the hardest problems in AI. This chapter tackles it head-on.
In 2020, when COVID misinformation flooded WhatsApp groups across India, for example "गर्म पानी पीने से corona ठीक होता है" (drinking hot water cures corona), there was no reliable Hindi fake news detector. The tools built for English Twitter couldn't handle code-mixed Devanagari-Roman text. People died because of information gaps.
Meanwhile, India's Supreme Court produces 30,000+ judgments annually in a mix of English and Hindi legal prose so dense that even lawyers struggle to extract key holdings. Could a Transformer summarize a 50-page judgment into 5 crisp paragraphs?
This chapter is not a toy exercise. You will build 5 real NLP systems that solve real Indian problems, and learn to compare them with their English-only global counterparts. By the end, you'll understand why multilingual NLP is fundamentally harder, mathematically richer, and more impactful than monolingual NLP.
The Intuition First: Why Indian NLP is Hard Mode
The Postman Analogy
Imagine you're a postman in London. Every letter has a clearly written English address. Simple. Now imagine you're a postman in Mumbai. Some addresses are in English, some in Hindi, some in Marathi, some are a wild mix ("Flat 302, तीसरी मंजिल, Sunshine Apartments, Andheri (W)"), and half of them have creative spellings because there's no standardized transliteration. That's what an NLP model faces when processing Indian text.
The Five Core Challenges
Why Monolingual NLP Breaks on Indian Text
1. Code-Mixing: Indians don't separate languages cleanly. A single sentence freely mixes Hindi words, English words, and sometimes Devanagari script mid-sentence: "Main kal meeting mein bahut nervous tha." Standard English NLP tokenizers see "Main" as the English word for primary importance, not the Hindi pronoun "I."
2. Script Diversity: Hindi alone appears in two scripts: Devanagari (मैं) and Roman transliteration (main). Same word, completely different byte sequences. A model trained on one won't recognize the other without explicit handling.
3. Morphological Richness: Hindi is a morphologically richer language than English. The verb "खाना" (to eat) inflects as: खाता, खाती, खाते, खाऊंगा, खाऊंगी, खाया, खायी, खाये... English has eat/eats/ate/eaten. Hindi has dozens of forms.
4. Resource Scarcity: English has billions of labeled NLP samples. Hindi has orders of magnitude less. Hinglish has almost none in clean, curated form. This makes transfer learning not just useful but essential.
5. No Standard Spelling: When writing Hindi in Roman script, "bahut" = "bohot" = "boht" = "bhot". There's no ISO standard for Hindi-in-Roman. Every user invents their own spelling.
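To make the code-mixing and script-diversity challenges concrete, here is a minimal sketch that labels each whitespace token by script, using only the Devanagari Unicode block (U+0900 to U+097F) and ASCII letters. The function names `token_script` and `script_profile` are illustrative, not part of any library, and the four-way labeling is a simplifying assumption.

```python
def token_script(token):
    """Return 'devanagari', 'latin', 'mixed', or 'other' for one token."""
    scripts = set()
    for ch in token:
        if '\u0900' <= ch <= '\u097F':          # Devanagari Unicode block
            scripts.add('devanagari')
        elif ch.isascii() and ch.isalpha():     # plain Roman letters
            scripts.add('latin')
    if not scripts:
        return 'other'                          # digits, punctuation, emoji
    if len(scripts) > 1:
        return 'mixed'
    return scripts.pop()


def script_profile(text):
    """Count whitespace-separated tokens per script label."""
    counts = {}
    for tok in text.split():
        label = token_script(tok)
        counts[label] = counts.get(label, 0) + 1
    return counts


print(script_profile("बहुत बकवास movie थी"))
# The same Hindi pronoun in two scripts shares no bytes at all:
print("मैं".encode("utf-8"), "main".encode("utf-8"))
```

A profile with more than one script label is a cheap signal that a sentence is code-mixed at the script level. It says nothing about Roman-script Hindi, which still looks like "latin" here.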
"Aha" Question
If you train BERT on English text and then ask it to classify "Bahut bakwas movie thi" as positive or negative, what will happen?
Answer: It will fail completely. English BERT's vocabulary doesn't contain Hindi tokens. "Bahut" gets split into meaningless subwords like ["Ba", "##hu", "##t"]. The model has no semantic representation for these fragments. This is why we need multilingual models, and why this chapter exists.
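The fragmentation described in the answer follows from how WordPiece splits a word at inference time: greedy longest-match-first against a fixed vocabulary. The sketch below uses a tiny made-up vocabulary (`TOY_VOCAB` is an assumption for illustration, not BERT's real vocabulary) to show why a word the vocabulary has never seen falls apart into pieces.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece      # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1                      # shrink the candidate from the right
        if match is None:
            return [unk]                  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary: knows the English word, only fragments of the Hindi one
TOY_VOCAB = {"movie", "ba", "##hu", "##t", "##k", "##was", "th", "##i"}

for w in ["movie", "bahut", "bakwas", "thi", "yaar"]:
    print(w, wordpiece(w, TOY_VOCAB))
```

The English word survives as a single token because the vocabulary was built from English text. The Hindi words are covered only by accident, through short fragments that carry no Hindi meaning.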
| Aspect | Indian NLP | English-centric (global) NLP |
|---|---|---|
| Languages | 22 scheduled + 19,500 dialects | Primarily English (+ Spanish, Chinese) |
| Scripts | 13 distinct scripts (Devanagari, Tamil, Bengali, etc.) | Latin script dominates |
| Code-mixing | ~30% of Indian social media is code-mixed | Rare in mainstream NLP benchmarks |
| Key models | IndicBERT, MuRIL, IndicTrans | BERT, RoBERTa, GPT, T5 |
| Key datasets | IndicNLPSuite, SAIL, IIT-P Hinglish | GLUE, SQuAD, SST, WMT |
| Tokenizer needs | SentencePiece, IndicNLP tokenizer | WordPiece, BPE (well-established) |
| Biggest challenge | No standard transliteration | Bias, fairness, hallucination |
Google's MuRIL (Multilingual Representations for Indian Languages) was trained on 17 languages (16 Indian languages plus English) and outperforms mBERT on Indian language tasks by 2-5% on average. It was specifically designed because mBERT's 104-language vocabulary dilutes Indian language representation: Hindi gets only ~1,700 out of ~120,000 WordPiece tokens.
NLP engineers working on Indian languages are among the most in-demand AI professionals in India today. These are the roles this chapter prepares you for:
NLP Engineer (Indian Languages), Applied ML Scientist, Conversational AI Developer, Legal Tech AI Engineer, Trust & Safety ML Engineer, Machine Translation Researcher
Top Hirers (India): Google India, Microsoft India, Flipkart, ShareChat, Koo, Jugalbandi, Bhashini, CDAC
Top Hirers (US/Global): Google, Meta, OpenAI, Amazon, Apple, DeepL, Translated
Hindi/Hinglish Sentiment Analysis
Problem Statement
Given a tweet or social media post written in Hindi, English, or Hinglish (code-mixed Hindi-English), classify it as Positive, Negative, or Neutral. The input may use Devanagari script, Roman script, or a mix of both.
Why This Is Hard
Consider these three tweets, all expressing negativity:
- "Worst movie ever": Pure English. Easy.
- "बहुत बकवास फिल्म थी": Pure Hindi (Devanagari). Needs Hindi-aware tokenizer.
- "Bahut bakwas movie thi": Hinglish (Roman). Neither English nor Hindi tokenizer handles this well.
Dataset
We use the SemEval 2020 Task 9 Hinglish Sentiment dataset and augment with:
| Dataset | Size | Languages | Labels |
|---|---|---|---|
| SemEval 2020 Task 9 | ~15,000 tweets | Hinglish (code-mixed) | Positive, Negative, Neutral |
| SAIL 2015 Hindi | ~12,000 tweets | Hindi (Devanagari) | Positive, Negative, Neutral |
| Custom scraped | ~5,000 tweets | Mixed | Manual annotation |
Tokenization Challenges
Step 1: Why English BERT Tokenizer Fails on Hindi
Let's trace what happens when you feed "बहुत अच्छी फिल्म" to the standard BERT tokenizer:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("बहुत अच्छी फिल्म")
print(tokens)
# Illustrative result: mostly '[UNK]' (exact tokens depend on the vocabulary).
# The English vocabulary cannot represent Devanagari words.
Step 2: Why mBERT Tokenizer Is Better (But Not Great)
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
tokens = tokenizer.tokenize("बहुत अच्छी फिल्म")
print(tokens)
# Illustrative output: ['बहुत', 'अच्छी', 'फिल्म'] - works, but only ~1,700 Hindi tokens in the vocabulary
Step 3: Hinglish Breaks Both
tokens = tokenizer.tokenize("bahut acchi movie thi yaar")
print(tokens)
# Illustrative output: ['ba', '##hu', '##t', 'ac', '##chi', 'movie', 'th', '##i', 'ya', '##ar']
# "bahut" -> 3 subwords, "acchi" -> 2 subwords: heavy fragmentation!
Step 4: The Solution - Use MuRIL or IndicBERT
Google's MuRIL (Multilingual Representations for Indian Languages) is trained on transliterated text too. It handles Roman-script Hindi much better because its training data included romanized Indian language text.
Model Architecture
HuggingFace Implementation
Python
# Project 1: Hindi/Hinglish Sentiment Analysis with MuRIL
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
AutoTokenizer, AutoModelForSequenceClassification,
TrainingArguments, Trainer
)
from sklearn.metrics import classification_report, f1_score
import pandas as pd
import re
import numpy as np
# --- Step 1: Preprocessing for Hinglish ---
def preprocess_hinglish(text):
    """Clean Hinglish tweet text."""
    text = re.sub(r'http\S+', '', text)                # Remove URLs
    text = re.sub(r'@\w+', '', text)                   # Remove mentions
    text = re.sub(r'#(\w+)', r'\1', text)              # Keep hashtag text
    # Keep Devanagari + alphanumeric (this also drops punctuation and emoji)
    text = re.sub(r'[^\w\s\u0900-\u097F]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    text = text.lower()
    # Normalize common Hinglish spelling variants
    spelling_map = {
        'bohot': 'bahut', 'boht': 'bahut', 'bhot': 'bahut',
        'accha': 'acha', 'acha': 'acha', 'achha': 'acha',
        'kya': 'kya', 'kyaa': 'kya', 'kia': 'kya',
    }
    words = text.split()
    words = [spelling_map.get(w, w) for w in words]
    return ' '.join(words)
# --- Step 2: Dataset class ---
class HinglishDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            # squeeze(0) drops the batch dimension added by return_tensors='pt'
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }
# --- Step 3: Load model: MuRIL for Indian languages ---
MODEL_NAME = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3  # Positive, Negative, Neutral
)
# --- Step 4: Prepare data ---
# (Assuming df has columns: 'text', 'label' where label ∈ {0,1,2})
# df = pd.read_csv('hinglish_sentiment.csv')
# For demo, we create sample data:
sample_texts = [
    preprocess_hinglish("bahut acchi movie thi! loved it"),
    preprocess_hinglish("worst film ever dekhi maine"),
    preprocess_hinglish("ठीक ठाक थी, nothing special"),
    preprocess_hinglish("kya bakwas acting thi yaar"),
    preprocess_hinglish("mast movie hai bhai 🔥"),
]
sample_labels = [0, 1, 2, 1, 0]  # 0=Positive, 1=Negative, 2=Neutral
# --- Step 5: Training configuration ---
training_args = TrainingArguments(
    output_dir='./hinglish_sentiment_model',
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",   # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_weighted",
    fp16=torch.cuda.is_available(),  # Mixed precision needs a GPU; stay in fp32 on CPU
)
# --- Step 6: Custom metrics ---
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=1)
    f1 = f1_score(labels, preds, average='weighted')
    return {'f1_weighted': f1}
# --- Step 7: Train ---
train_dataset = HinglishDataset(sample_texts, sample_labels, tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # Use proper val split in practice!
    compute_metrics=compute_metrics,
)
# trainer.train()  # Uncomment to actually train
# --- Step 8: Inference ---
def predict_sentiment(text):
    text = preprocess_hinglish(text)
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=128)
    # Move inputs to wherever the model lives (CPU or GPU)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    label = ["Positive", "Negative", "Neutral"][probs.argmax().item()]
    return label, probs.tolist()

# Test it (predictions are meaningless until the model has been fine-tuned)
print(predict_sentiment("bahut mast movie thi!"))
print(predict_sentiment("bakwas film, time waste"))
print(predict_sentiment("ठीक-ठाक थी"))
Expected Metrics
Error Analysis
| Error Type | Example | Why Model Fails | Frequency |
|---|---|---|---|
| Sarcasm | "waah kya acting thi 🤣" (sarcastic praise) | Surface-level positive words, but negative intent | ~18% |
| Spelling variation | "bhot bdia" vs "bahut badiya" | Same meaning, different tokens | ~15% |
| Script mixing | "बहुत bad movie" | Half Devanagari, half English in one phrase | ~12% |
| Negation in Hindi | "nahi acchi thi" (was not good) | Model focuses on "acchi" (good), misses "nahi" (not) | ~10% |
| Cultural context | "timepass movie" (mediocre) | "timepass" is Indian slang, not in training data | ~8% |
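To see which sentiment classes drive errors like those in the table above, per-class precision, recall, and F1 can be computed directly from the predictions. The sketch below is a from-scratch version for illustration (the helper names `per_class_prf` and `weighted_f1` are assumptions); the weighted F1 it computes is intended to match the definition behind `f1_score(..., average='weighted')` used earlier.

```python
def per_class_prf(y_true, y_pred, labels):
    """Precision, recall, F1 and support for each class label."""
    report = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {'precision': precision, 'recall': recall,
                         'f1': f1, 'support': tp + fn}
    return report


def weighted_f1(report):
    """Average the per-class F1 scores, weighted by class support."""
    total = sum(r['support'] for r in report.values())
    if total == 0:
        return 0.0
    return sum(r['f1'] * r['support'] for r in report.values()) / total


# Toy example: 0=Positive, 1=Negative, 2=Neutral
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
report = per_class_prf(y_true, y_pred, labels=[0, 1, 2])
for label, scores in report.items():
    print(label, scores)
print("weighted F1:", round(weighted_f1(report), 4))
```

A large gap between a class's precision and recall points to a systematic confusion (for example, sarcastic praise predicted as Positive) that a single aggregate score hides.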
MYTH: "Just use English BERT with Google Translate: translate Hindi to English first, then classify."
TRUTH: Translation destroys sentiment signals. "Bahut bakwas" literally translates to "very nonsense," losing the emotional intensity. Sarcasm, cultural slang, and code-mixed nuance are untranslatable. Always use multilingual models directly on the original text.
WHY IT MATTERS: In production sentiment systems (ShareChat, Koo), translate-then-classify underperforms direct multilingual models by 8-12% F1.
Indian Legal Document Summarization
Problem Statement
Given a full Indian Supreme Court or High Court judgment (typically 5,000 to 50,000 words), generate a concise abstractive summary (300 to 800 words) that captures the facts, issues, arguments, reasoning, and holding.
Why This Matters
India's judiciary has a backlog of 50+ million pending cases. Lawyers and judges spend hours reading lengthy judgments to find relevant precedents. The Indian Legal Documents Corpus (ILDC) project by IIT Kanpur aims to make legal research faster with AI. This is not an academic exercise; it's a tool that could accelerate justice delivery in India.
Dataset
| Dataset | Size | Avg. Document Length | Avg. Summary Length |
|---|---|---|---|
| ILDC (IIT Kanpur) | ~35,000 judgments | ~4,200 words | ~250 words |
| IN-Abs (FIRE 2022) | ~7,000 judgments | ~5,500 words | ~350 words |
| Custom SC judgments | ~2,000 curated | ~8,000 words | ~500 words (manual) |
Tokenization Challenges for Legal Text
- Extreme length: A 10,000-word judgment exceeds BERT's 512-token limit by 20×. You need either a Longformer (4,096 tokens) or a chunking strategy.
- Legal vocabulary: Terms like "suo motu," "certiorari," "mandamus," "res judicata" are Latin loan-words used in Indian courts but absent from standard NLP vocabularies.
- Hindi legal terms: "याचिका" (petition), "अपील" (appeal), "निर्णय" (judgment) appear when courts cite Hindi-language proceedings.
- Section references: "Section 302 IPC," "Article 21," "CrPC 482" are not natural language but structured legal codes.
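The chunking strategy mentioned in the first bullet can be sketched as a sliding window with overlap, so that a sentence cut at one chunk boundary still appears whole in the next chunk. This is a minimal illustration: it counts whitespace words rather than model tokens, and the function name `chunk_words` and the default sizes are assumptions to be tuned against the real tokenizer.

```python
def chunk_words(text, chunk_size=400, overlap=50):
    """Split text into overlapping windows of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break                      # this window already reached the end
    return chunks


# Stand-in for a long judgment: 1,000 numbered words
judgment = ' '.join(f"w{i}" for i in range(1000))
chunks = chunk_words(judgment, chunk_size=400, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk can then be encoded or summarized separately and the partial results merged. The overlap trades a little redundant computation for fewer broken sentences at the boundaries.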
Model Architecture: Two Approaches
HuggingFace Implementation
Python
# Project 2: Indian Legal Document Summarization
import torch
from transformers import (
AutoTokenizer, AutoModelForSeq2SeqLM,
Seq2SeqTrainingArguments, Seq2SeqTrainer,
DataCollatorForSeq2Seq
)
from datasets import load_dataset
import evaluate
import numpy as np
import re
# --- Step 1: Legal text preprocessing ---
def preprocess_legal_text(text):
    """Clean Indian legal judgment text."""
    # Normalize section references
    text = re.sub(r'Section\s+(\d+)', r'Section_\1', text)
    text = re.sub(r'Article\s+(\d+)', r'Article_\1', text)
    # Remove page numbers and formatting artifacts
    text = re.sub(r'\n\d+\n', '\n', text)
    text = re.sub(r'\s+', ' ', text).strip()
    # Truncate to manageable length for encoder
    words = text.split()
    if len(words) > 3000:
        # Keep first 1500 + last 1500 words (intro + conclusion)
        words = words[:1500] + ['[...]'] + words[-1500:]
    return ' '.join(words)
# --- Step 2: Extractive pre-filtering ---
def extractive_filter(text, top_k=30):
    """Use keyword scoring to extract key sentences, kept in document order."""
    # Naive sentence split on '.'; drop empty fragments
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    # Simple heuristic: score sentences with legal keywords
    legal_keywords = [
        'held', 'order', 'appeal', 'petitioner', 'respondent',
        'court', 'judgment', 'section', 'article', 'contention',
        'submitted', 'evidence', 'guilty', 'acquit', 'convicted',
        'constitutional', 'fundamental', 'right', 'violation'
    ]
    scored = []
    for i, sent in enumerate(sentences):
        score = sum(1 for kw in legal_keywords if kw in sent.lower())
        # Boost first and last 20% of document (facts + holding)
        if i < len(sentences) * 0.2 or i > len(sentences) * 0.8:
            score += 2
        scored.append((score, i, sent))
    # Pick the top_k highest-scoring sentences, then restore document order
    # so the abstractive model sees facts before the holding
    top = sorted(scored, key=lambda t: t[0], reverse=True)[:top_k]
    top.sort(key=lambda t: t[1])
    return '. '.join(sent for _, _, sent in top)
# --- Step 3: Load mT5 for multilingual summarization ---
# Note: the public mT5 checkpoint is pre-trained only. Fine-tune it on
# judgment/summary pairs before expecting usable summaries.
MODEL_NAME = "google/mt5-base"  # or "ai4bharat/IndicBART"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
# --- Step 4: Summarization function ---
def summarize_judgment(judgment_text, max_input=1024, max_output=256):
    # Stage 1: Extract key sentences
    filtered = extractive_filter(judgment_text)
    # Stage 2: Abstractive summarization
    prefix = "summarize: "
    inputs = tokenizer(
        prefix + filtered,
        max_length=max_input,
        truncation=True,
        return_tensors="pt"
    )
    summary_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_output,
        min_length=60,
        num_beams=4,
        length_penalty=2.0,
        no_repeat_ngram_size=3,
        early_stopping=True
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary
# --- Step 5: Evaluation with ROUGE ---
rouge = evaluate.load("rouge")

def evaluate_summary(prediction, reference):
    results = rouge.compute(
        predictions=[prediction],
        references=[reference]
    )
    print(f"ROUGE-1: {results['rouge1']:.4f}")
    print(f"ROUGE-2: {results['rouge2']:.4f}")
    print(f"ROUGE-L: {results['rougeL']:.4f}")
    return results
# Example usage
sample_judgment = """
The petitioner filed a writ petition under Article 32 of the Constitution
challenging the constitutional validity of Section 66A of the Information
Technology Act, 2000. The petitioner contended that the provision was vague
and overbroad, violating the fundamental right to free speech under
Article 19(1)(a). The respondent Union of India argued that reasonable
restrictions under Article 19(2) were applicable. After hearing both sides,
this Court held that Section 66A is unconstitutional as it creates a
chilling effect on free speech and is not saved by Article 19(2).
The impugned section is struck down.
"""
summary = summarize_judgment(sample_judgment)
print("Generated Summary:", summary)
Metrics: Extractive vs. Abstractive on ILDC
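When comparing extractive and abstractive systems, it helps to know what ROUGE-L actually measures: the longest common subsequence (LCS) of words shared by the candidate and the reference, turned into precision, recall, and F1. The sketch below is a simplified from-scratch version (lowercased whitespace tokens, no stemming, a single reference); the numbers it gives may differ slightly from the `evaluate` package, which applies its own tokenization.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L precision, recall and F1 on lowercased whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'precision': precision, 'recall': recall, 'f1': f1}


print(rouge_l("the court struck down the section",
              "the court held the section unconstitutional"))
```

Because the LCS rewards words that appear in the same order, extractive summaries that copy whole sentences tend to score well on ROUGE-L even when an abstractive summary is easier to read.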
Error Analysis
| Error Type | Example | Root Cause |
|---|---|---|
| Hallucinated sections | Summary mentions "Section 420 IPC" which doesn't appear in the judgment | Decoder generating plausible-sounding but incorrect legal citations |
| Missing holding | Summary describes facts and arguments but omits the final decision | Holding usually appears at the end; truncation drops it |
| Party confusion | Attributes petitioner's argument to respondent | Long-range dependency: parties are defined 2000+ words before their arguments |
| Hindi terms dropped | "याचिका खारिज" (petition dismissed) not captured | mT5 underrepresents Hindi legal vocabulary |
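The first error type in the table, hallucinated sections, lends itself to a cheap automatic check: collect every "Section N" or "Article N" citation in the summary and flag those that never occur in the source judgment. The following is a rough sketch under simple assumptions: the regex covers only these two citation forms, it runs on the raw text (before any rewriting of "Section 66" into "Section_66"), and the helper names are illustrative.

```python
import re

# Matches e.g. "Section 302", "Section 66A", "Article 19"
CITATION_RE = re.compile(r'\b(Section|Article)\s+(\d+[A-Z]?)\b')


def citations(text):
    """Set of (kind, number) pairs cited in the text."""
    return set(CITATION_RE.findall(text))


def unsupported_citations(summary, source):
    """Citations that appear in the summary but nowhere in the source."""
    return sorted(citations(summary) - citations(source))


source = ("The petitioner filed a writ petition under Article 32 challenging "
          "Section 66A of the Information Technology Act, 2000, as violating "
          "Article 19(1)(a).")
summary = ("The Court struck down Section 66A and also relied on "
           "Section 420 of the IPC.")
print(unsupported_citations(summary, source))
```

A non-empty result does not prove the summary is wrong, but it marks exactly the sentences a human reviewer should verify first.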
"ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation" (Malik et al., ACL 2021)
This landmark paper introduced the first large-scale corpus of Indian Supreme Court judgments (~35K cases) with human-annotated summaries and prediction labels. The authors showed that LegalBERT pre-trained on Indian legal text outperforms generic BERT by 4.2% accuracy on judgment prediction. The dataset is freely available and has become the standard benchmark for Indian legal NLP.
Key insight: Legal summarization is harder than news summarization because the "answer" (the holding) is structurally unpredictable: it could be in the first paragraph, the last paragraph, or buried in the middle.
IRCTC Chatbot โ Seq2Seq with Attention
Problem Statement
Build a conversational AI system that handles Indian Railways booking queries in Hindi, English, and Hinglish. The system must understand queries like:
- "Delhi se Mumbai ka train kitne baje hai?" (What time is the train from Delhi to Mumbai?)
- "मुझे कल दिल्ली से जयपुर जाना है" (I need to go from Delhi to Jaipur tomorrow)
- "PNR status check karo 4521876543" (Check PNR status for 4521876543)
- "Cancel my ticket from last week"
Architecture: Intent + Slot Filling + Response Generation
Part A: Intent Classification
Python
# Part A: Intent classification for IRCTC queries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Define intents
INTENTS = [
    'SEARCH_TRAIN',       # Find trains between stations
    'CHECK_PNR',          # PNR status inquiry
    'BOOK_TICKET',        # Book a ticket
    'CANCEL_TICKET',      # Cancel booking
    'SEAT_AVAILABILITY',  # Check seat availability
    'TRAIN_STATUS',       # Live running status
    'GENERAL_QUERY',      # General help
]
# Training data examples (would be 1000+ in practice)
training_examples = [
    ("Delhi se Mumbai train dikhao", 0),
    ("दिल्ली से मुंबई की ट्रेन बताओ", 0),
    ("Show me trains from Delhi to Mumbai", 0),
    ("PNR check karo 4521876543", 1),
    ("mera PNR status kya hai", 1),
    ("ticket book karna hai kal ke liye", 2),
    ("मुझे टिकट बुक करनी है", 2),
    ("cancel my ticket please", 3),
    ("meri ticket cancel karo", 3),
    ("kitni seat available hai?", 4),
    ("train kahan tak pahunchi?", 5),
    ("help chahiye", 6),
]
# Load MuRIL for intent classification
model_name = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
intent_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(INTENTS)
)

def classify_intent(query):
    inputs = tokenizer(query, return_tensors="pt",
                       truncation=True, max_length=64)
    with torch.no_grad():
        logits = intent_model(**inputs).logits
    pred = logits.argmax(dim=-1).item()
    confidence = torch.softmax(logits, dim=-1).max().item()
    return INTENTS[pred], confidence
Part B: Slot Extraction (Named Entity Recognition)
Python
# Part B: Slot filling using token classification
from transformers import AutoModelForTokenClassification
# Slot labels (BIO format)
SLOT_LABELS = [
    'O',         # Outside
    'B-SOURCE',  # Source station start
    'I-SOURCE',  # Source station continuation
    'B-DEST',    # Destination start
    'I-DEST',    # Destination continuation
    'B-DATE',    # Date start
    'I-DATE',    # Date continuation
    'B-PNR',     # PNR number
    'I-PNR',
    'B-CLASS',   # Travel class (Sleeper, AC, etc.)
    'I-CLASS',
    'B-TRAIN',   # Train name/number
    'I-TRAIN',
]
# Example annotation (one label per word)
# "Delhi     se  Mumbai  3       tareekh  ko  AC       mein  dikhao"
#  B-SOURCE  O   B-DEST  B-DATE  I-DATE   O   B-CLASS  O     O
slot_model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(SLOT_LABELS)
)
def extract_slots(query):
    inputs = tokenizer(query, return_tensors="pt",
                       truncation=True, max_length=64)
    with torch.no_grad():
        logits = slot_model(**inputs).logits
    preds = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    special = set(tokenizer.all_special_tokens)

    slots = {}
    current_slot = None
    current_value = []

    def flush():
        # Store the finished slot, merging WordPiece fragments back into text
        if current_slot:
            slots[current_slot] = tokenizer.convert_tokens_to_string(current_value)

    for token, pred_id in zip(tokens, preds):
        if token in special:  # skip [CLS], [SEP], [PAD]
            continue
        label = SLOT_LABELS[pred_id]
        if label.startswith('B-'):
            flush()
            current_slot = label[2:]
            current_value = [token]
        elif label.startswith('I-') and current_slot:
            current_value.append(token)
        else:
            flush()
            current_slot = None
            current_value = []
    flush()
    return slots
# --- Part C: Response Generation ---
def generate_response(intent, slots):
    """Template-based response with bilingual output."""
    if intent == 'SEARCH_TRAIN':
        src = slots.get('SOURCE', '?')
        dst = slots.get('DEST', '?')
        date = slots.get('DATE', 'today')
        return (f"{src} → {dst} ke liye {date} ko trains:\n"
                f"1. Rajdhani Express (12:05 PM)\n"
                f"2. Duronto Express (11:20 PM)\n"
                f"3. Mumbai Mail (09:00 PM)")
    elif intent == 'CHECK_PNR':
        pnr = slots.get('PNR', 'N/A')
        return f"PNR {pnr} ka status: Confirmed (S5, Berth 42)"
    else:
        return "Main aapki kaise madad kar sakta hoon?"

# --- Full pipeline ---
def chatbot_pipeline(user_query):
    intent, confidence = classify_intent(user_query)
    slots = extract_slots(user_query)
    response = generate_response(intent, slots)
    return {
        'query': user_query,
        'intent': intent,
        'confidence': confidence,
        'slots': slots,
        'response': response
    }

# Test (outputs are arbitrary until both models have been fine-tuned)
result = chatbot_pipeline("Delhi se Mumbai kal AC mein train dikhao")
print(result)
Metrics
The following code is supposed to extract the PNR number from a Hinglish query, but it has 3 bugs. Can you find them?
def extract_pnr(query):
    import re
    # Bug 1: Pattern issue
    pnr_pattern = r'\d{5}'  # PNR is 10 digits!
    match = re.search(pnr_pattern, query)
    if match:
        # Bug 2: Wrong method
        return match.groups()  # .group() not .groups()
    # Bug 3: Missing return
    # If no match, function returns None implicitly
    # Should return empty string or raise exception

result = extract_pnr("PNR check karo 4521876543")
print("PNR:", result)  # Gives wrong output!
Bug 1: Change \d{5} to \d{10}. PNR numbers are 10 digits.
Bug 2: Change match.groups() to match.group(). .groups() returns the captured groups (an empty tuple here, since the pattern has no groups); .group() returns the full match.
Bug 3: Add return "" at the end for the no-match case.
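Putting the three fixes together gives one possible corrected version. The digit lookarounds go beyond the three listed fixes: they are an extra safeguard, added here as an assumption, so that a longer digit run such as a 12-digit phone number is not mistaken for a PNR.

```python
import re

# Exactly 10 digits, not embedded in a longer run of digits
PNR_RE = re.compile(r'(?<!\d)\d{10}(?!\d)')


def extract_pnr(query):
    """Return the 10-digit PNR found in the query, or "" if there is none."""
    match = PNR_RE.search(query)
    return match.group() if match else ""


print("PNR:", extract_pnr("PNR check karo 4521876543"))
print("PNR:", repr(extract_pnr("mera PNR status kya hai")))
```

Returning an empty string keeps the caller's code simple (`if pnr:`), which suits a chatbot that should ask a follow-up question instead of crashing when the number is missing.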
Fake News Detection for Indian WhatsApp
Problem Statement
Given a message forwarded on WhatsApp (in Hindi, English, or Hinglish), classify it as REAL or FAKE. The model must handle health misinformation, political propaganda, communal rumors, and scientific hoaxes common in Indian WhatsApp groups.
Why WhatsApp Fake News Is Unique to India
India has 500+ million WhatsApp users, the largest user base in the world. Unlike Twitter (public), WhatsApp messages are end-to-end encrypted and forwarded privately, making misinformation detection fundamentally harder. During COVID-19, messages like "5G towers spread corona" and "cow urine cures COVID" led to real-world harm. In 2018, WhatsApp misinformation led to mob lynchings in multiple Indian states, prompting WhatsApp to limit forwarding. Building automated detection is literally a life-saving application.
Dataset
| Dataset | Size | Languages | Source |
|---|---|---|---|
| CONSTRAINT 2021 (Hindi) | ~21,000 | Hindi (Devanagari) | Fact-checking websites + WhatsApp |
| FakeNewsNet (English) | ~23,000 | English | PolitiFact + GossipCop |
| Fake-News-Hindi | ~9,500 | Hindi + Hinglish | Indian fact-checkers (AltNews, BoomLive) |
| Custom WhatsApp | ~3,000 | Mixed | Crowdsourced + manual annotation |
Feature Engineering for WhatsApp Text
Linguistic Features Unique to Fake Messages
1. Fake messages use excessive exclamation marks, ALL CAPS, and urgency words: "URGENT!!! जल्दी forward करो!!!" Real news rarely uses this tone.
2. Source absence: Fake messages rarely cite verifiable sources. Real news mentions "according to PTI," "as per MoHFW data." We encode this as a binary feature: has_source = 1 if message contains source indicators.
3. Phrases like "Forwarded as received," "Doctor ne bataya," "NASA ne confirm kiya" (NASA confirmed) are markers of viral forwards, not original content.
4. Emotional manipulation: Fake messages appeal to fear, anger, or nationalism: "Desh ke liye share karo!" (Share for the nation!). Sentiment extremity (very positive or very negative) correlates with fakeness.
HuggingFace Implementation
Python
# Project 4: Fake News Detection for Indian WhatsApp Messages
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import classification_report
import re
import numpy as np
# --- Step 1: Feature extraction ---
def extract_meta_features(text):
    """Extract non-textual features that signal fakeness."""
    features = []
    lowered = text.lower()
    # Exclamation density
    features.append(text.count('!') / max(len(text), 1))
    # CAPS ratio
    caps = sum(1 for c in text if c.isupper())
    features.append(caps / max(len(text), 1))
    # Has source citation. Acronyms are matched case-sensitively as whole
    # words, so "WHO" fires on the agency but not on the pronoun "who".
    source_words = ['according', 'source', 'study', 'report', 'research']
    source_acronyms = ['PTI', 'ANI', 'WHO']
    has_source = (any(w in lowered for w in source_words)
                  or any(re.search(rf'\b{a}\b', text) for a in source_acronyms))
    features.append(1.0 if has_source else 0.0)
    # Forward indicators
    fwd_words = ['forward', 'share', 'viral', 'forwarded',
                 'भेजो', 'शेयर', 'forward करो']
    features.append(1.0 if any(w in lowered for w in fwd_words) else 0.0)
    # Message length (normalized)
    features.append(min(len(text) / 1000, 1.0))
    return features
# --- Step 2: BERT + Meta Features Model ---
class FakeNewsDetector(nn.Module):
    def __init__(self, bert_model_name, n_meta_features=5, n_classes=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(0.3)
        # Encoder hidden size (768 for base models) + meta features (5) -> classifier
        hidden = self.bert.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(hidden + n_meta_features, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, n_classes)
        )

    def forward(self, input_ids, attention_mask, meta_features):
        bert_out = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        cls_embedding = bert_out.last_hidden_state[:, 0, :]  # [CLS] token
        cls_embedding = self.dropout(cls_embedding)
        # Concatenate BERT [CLS] with meta features
        combined = torch.cat([cls_embedding, meta_features], dim=1)
        logits = self.classifier(combined)
        return logits
# --- Step 3: Initialize ---
MODEL_NAME = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = FakeNewsDetector(MODEL_NAME)
# --- Step 4: Inference function ---
def detect_fake_news(text):
    # Tokenize
    encoding = tokenizer(
        text, max_length=256, padding='max_length',
        truncation=True, return_tensors='pt'
    )
    # Extract meta features
    meta = torch.tensor([extract_meta_features(text)], dtype=torch.float)
    model.eval()
    with torch.no_grad():
        logits = model(
            encoding['input_ids'],
            encoding['attention_mask'],
            meta
        )
    probs = torch.softmax(logits, dim=-1)
    pred = probs.argmax(dim=-1).item()
    label = "FAKE" if pred == 1 else "REAL"
    return label, probs[0].tolist()
# Test
test_messages = [
"BREAKING: เคเคฐเฅเคฎ เคชเคพเคจเฅ เคชเฅเคจเฅ เคธเฅ COVID เค เฅเค เคนเฅเคคเคพ เคนเฅ!!! SHARE เคเคฐเฅ เคธเคฌเคเฅ!!!",
"According to WHO, COVID-19 vaccines have undergone rigorous clinical trials.",
"NASA ne confirm kiya hai ki kal raat 2 baje earthquake aayega India mein!!!",
]
for msg in test_messages:
label, probs = detect_fake_news(msg)
print(f"[{label}] (conf: {max(probs):.2f}) โ {msg[:50]}...")
Expected Metrics
Error Analysis
| Error Type | Example | Challenge | Freq |
|---|---|---|---|
| Satire misclassified | "Modi ji ne chand pe airport banane ka announcement kiya" (satirical) | Looks fake but is intentional humor | ~14% |
| Partial truth | Real event + fabricated details | Contains real names/dates mixed with fiction | ~12% |
| Old news reshared | Real news from 2019 shared as "breaking" in 2024 | Content is factually true but contextually misleading | ~9% |
| Regional language | Fake news in Tamil/Telugu transliterated to Roman | Model trained on Hindi/English, not Dravidian | ~7% |
Neural Machine Translation: Hindi ↔ English
Problem Statement
Build a Hindi→English and English→Hindi translation system using the Transformer architecture. Implement attention visualization to understand word alignment, and evaluate using BLEU score.
Mathematical Foundation: BLEU Score
Deriving BLEU from First Principles
BLEU (Bilingual Evaluation Understudy) measures how close a machine translation is to human reference translations. Let's derive it step by step.
Step 1: Modified n-gram Precision
For each n-gram size (1, 2, 3, 4), we compute:
$$
p_n = \frac{\sum_{\text{sentence}} \sum_{\text{n-gram}} \min\big(\text{count}_{\text{candidate}}(\text{n-gram}),\ \text{count}_{\text{reference}}(\text{n-gram})\big)}{\sum_{\text{sentence}} \sum_{\text{n-gram}} \text{count}_{\text{candidate}}(\text{n-gram})}
$$
The min is the "clipping": it prevents gaming the metric by repeating common words.
Step 2: Example Calculation
Reference: "The cat sat on the mat"
Candidate: "The the the the the the"
Without clipping: p₁ = 6/6 = 1.0 (all "the" match!)
With clipping: max count of "the" in reference = 2, so p₁ = 2/6 = 0.33 ✓
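The clipping example can be checked in a few lines. This is a minimal sketch for a single candidate/reference pair; the helper name `clipped_precision` and the choice to lowercase both sentences (so that "The" and "the" count as the same word, as the worked example assumes) are illustrative assumptions, not part of any library.

```python
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Return (clipped matches, total candidate n-grams) for one sentence pair."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped, total

clipped, total = clipped_precision("The the the the the the", "The cat sat on the mat")
print(f"p1 = {clipped}/{total} = {clipped / total:.2f}")
```

For the example above this reports 2 clipped matches out of 6 candidate unigrams, the same 0.33 obtained by hand.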
Step 3: Brevity Penalty
Short translations have artificially high precision. The brevity penalty (BP) penalizes them:
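$$
\mathrm{BP} = \begin{cases} e^{\,1 - r/c} & \text{if } c < r \\ 1 & \text{if } c \ge r \end{cases}
$$
where c = candidate length, r = reference length
Step 4: Final BLEU
$$
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
$$
Typically N = 4 and w₁ = w₂ = w₃ = w₄ = 0.25 (equal weights), so the exponential term is the geometric mean of the four precisions.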
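To see how the brevity penalty and the weighted log-precisions combine, here is a small sketch that takes already-computed precisions and lengths. The function names and the example numbers are made up for illustration; returning 0 when any precision is zero is one common convention, assumed here to avoid log(0).

```python
import math

def brevity_penalty(c, r):
    """BP = 1 if the candidate is at least as long as the reference, else exp(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)

def combine_bleu(precisions, c, r):
    """BLEU = BP * exp(sum of w_n * log p_n) with equal weights w_n = 1/N."""
    if min(precisions) <= 0:
        return 0.0  # assumption: no smoothing, so a zero precision gives BLEU = 0
    weight = 1.0 / len(precisions)
    log_mean = sum(weight * math.log(p) for p in precisions)
    return brevity_penalty(c, r) * math.exp(log_mean)

# Hypothetical system: candidate of 18 tokens against a reference of 20 tokens
example_precisions = [0.8, 0.6, 0.4, 0.3]
print(f"BP   = {brevity_penalty(18, 20):.4f}")
print(f"BLEU = {combine_bleu(example_precisions, 18, 20):.4f}")
```

Because the precisions are multiplied (geometric mean), one weak n-gram order drags the whole score down much more than it would under an arithmetic mean.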
Tokenization: SentencePiece for Hindi
Python
# Why we need SentencePiece for Hindi
# Hindi is morphologically rich: a single verb form combines several morphemes
# "खाऊँगी" = खा (eat) + ऊँगी (will, feminine) -> "I will eat" (said by a woman)
import sentencepiece as spm
# Train SentencePiece model on Hindi corpus
# spm.SentencePieceTrainer.train(
#     input='hindi_corpus.txt',
#     model_prefix='hindi_sp',
#     vocab_size=32000,
#     character_coverage=0.9995,  # High coverage for Devanagari
#     model_type='bpe'
# )
# Compare tokenization approaches
from transformers import AutoTokenizer
sentence = "भारत के प्रधानमंत्री ने आज एक महत्वपूर्ण बैठक की।"
# mBERT tokenizer (WordPiece)
tok1 = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print("mBERT:", tok1.tokenize(sentence))
# ['भारत', 'के', 'प्रधानमंत्री', 'ने', 'आज', 'एक', 'महत्व', '##पूर्ण', 'बैठक', 'की', '।']
# Note: "महत्वपूर्ण" (important) split into "महत्व" + "##पूर्ण"
# IndicTrans tokenizer (SentencePiece)
# The IndicTrans2 checkpoints ship custom tokenizer code on the Hub,
# so loading them requires trust_remote_code=True.
tok2 = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B",
                                     trust_remote_code=True)
print("IndicTrans:", tok2.tokenize(sentence))
# Better handling of compound Hindi words
Full Implementation: Hindi↔English NMT
Python
# Project 5: Hindi↔English Neural Machine Translation
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
from datasets import load_dataset
import evaluate
import numpy as np
import torch
# --- Step 1: Choose a model ---
# IndicTrans2 is the state-of-the-art option for Indian languages
MODEL_NAME = "ai4bharat/indictrans2-en-indic-1B"
# This walkthrough loads the lighter Helsinki-NLP/opus-mt-en-hi (English -> Hindi)
LITE_MODEL = "Helsinki-NLP/opus-mt-en-hi"
tokenizer = AutoTokenizer.from_pretrained(LITE_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(LITE_MODEL)
# --- Step 2: Translation function ---
def translate_en_to_hi(text, max_length=128):
    inputs = tokenizer(text, return_tensors="pt",
                       max_length=max_length, truncation=True)
    translated = model.generate(
        **inputs,
        max_length=max_length,
        num_beams=5,
        length_penalty=1.0,
        no_repeat_ngram_size=2,
        early_stopping=True
    )
    result = tokenizer.decode(translated[0], skip_special_tokens=True)
    return result
# --- Step 3: BLEU Score Computation ---
import sacrebleu
def compute_bleu(predictions, references):
    """Compute corpus-level BLEU with sacrebleu (one reference per prediction)."""
    bleu = sacrebleu.corpus_bleu(predictions, [references])
    print(f"BLEU Score: {bleu.score:.2f}")
    print(f"Precisions: {[f'{p:.1f}' for p in bleu.precisions]}")
    print(f"Brevity Penalty: {bleu.bp:.4f}")
    print(f"Length Ratio: {bleu.sys_len}/{bleu.ref_len} = {bleu.sys_len/bleu.ref_len:.3f}")
    return bleu
# --- Step 4: By-hand BLEU calculation (for understanding) ---
from collections import Counter
import math
def bleu_from_scratch(candidate, reference, max_n=4):
    """Compute sentence-level BLEU from scratch to see the math in action."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    # Step 1: Compute clipped n-gram precisions
    precisions = []
    for n in range(1, max_n + 1):
        # Extract n-grams
        cand_ngrams = [tuple(cand_tokens[i:i+n]) for i in range(len(cand_tokens) - n + 1)]
        ref_ngrams = [tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens) - n + 1)]
        # Count
        cand_counts = Counter(cand_ngrams)
        ref_counts = Counter(ref_ngrams)
        # Clipped count
        clipped = 0
        total = 0
        for ngram, count in cand_counts.items():
            clipped += min(count, ref_counts.get(ngram, 0))
            total += count
        precision = clipped / max(total, 1)
        precisions.append(precision)
        print(f"  p{n} = {clipped}/{total} = {precision:.4f}")
    # Step 2: Brevity penalty
    c = len(cand_tokens)
    r = len(ref_tokens)
    bp = math.exp(1 - r/c) if c < r else 1.0
    print(f"  BP = {bp:.4f} (candidate: {c}, reference: {r})")
    # Step 3: BLEU = BP × geometric mean of precisions
    # (a zero precision is floored at 1e-10 to avoid log(0))
    log_avg = sum(math.log(max(p, 1e-10)) for p in precisions) / max_n
    bleu = bp * math.exp(log_avg)
    print(f"  BLEU = {bleu:.4f}")
    return bleu
# Example
# Note: this pair shares no 4-gram, so p4 = 0 and the unsmoothed
# sentence-level BLEU collapses toward 0 despite the good unigram overlap.
ref = "India is a diverse country with many languages"
cand = "India is a country with many diverse languages"
print("=== By-hand BLEU ===")
bleu_from_scratch(cand, ref)
# --- Step 5: Attention Visualization ---
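def visualize_attention(source, target_prefix=""):
    """Extract cross-attention weights for a greedy translation of `source`."""
    inputs = tokenizer(source, return_tensors="pt")
    # Generate with attention outputs
    outputs = model.generate(
        **inputs,
        max_length=64,
        num_beams=1,
        output_attentions=True,
        return_dict_in_generate=True,
    )
    # Decode tokens
    src_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    tgt_tokens = tokenizer.convert_ids_to_tokens(outputs.sequences[0])
    print(f"Source: {' '.join(src_tokens)}")
    print(f"Target: {' '.join(tgt_tokens)}")
    # outputs.cross_attentions holds one entry per generated token; each entry
    # is a tuple over decoder layers of tensors [batch, heads, query_len, src_len].
    # Here: last decoder layer, averaged over heads, one row per generated token.
    # Rows align with tgt_tokens[1:] (index 0 is the decoder start token).
    attention_matrix = torch.stack([
        step[-1][0].mean(dim=0)[-1] for step in outputs.cross_attentions
    ])
    # In practice, you'd use matplotlib heatmap here
    # plt.imshow(attention_matrix, cmap='viridis')
    # plt.xticks(range(len(src_tokens)), src_tokens, rotation=45)
    # plt.yticks(range(len(tgt_tokens) - 1), tgt_tokens[1:])
    return src_tokens, tgt_tokens, attention_matrix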
# Test translations
test_pairs = [
    ("The Supreme Court upheld the fundamental right to privacy.",
     "सर्वोच्च न्यायालय ने निजता के मौलिक अधिकार को बरकरार रखा।"),
    ("Please book a ticket from Delhi to Mumbai.",
     "कृपया दिल्ली से मुंबई तक का टिकट बुक करें।"),
    ("Artificial intelligence is transforming healthcare in India.",
     "कृत्रिम बुद्धिमत्ता भारत में स्वास्थ्य सेवा को बदल रही है।"),
]
for en, hi in test_pairs:
    predicted_hi = translate_en_to_hi(en)
    print(f"EN: {en}")
    print(f"HI (pred): {predicted_hi}")
    print(f"HI (ref): {hi}")
    print()
Attention Visualization (ASCII)
Expected Metrics
Why is the Hindi BLEU score so much lower than French? It's NOT because the model is worse. Hindi and English sit in distant branches of the Indo-European family (Indo-Aryan vs. Germanic), use different scripts, follow different word orders (SOV vs. SVO), and differ in morphological complexity. French BLEU scores run higher largely because French shares a script, much vocabulary, and SVO word order with English. Always compare BLEU scores within the same language pair, never across language pairs.
Error Analysis: Where Hindi NMT Struggles
| Error Type | Source (EN) | Predicted (HI) | Correct (HI) |
|---|---|---|---|
| Honorific loss | "You should go to the doctor" | "तुम डॉक्टर के पास जाओ" | "आप डॉक्टर के पास जाएँ" (formal) |
| Gender mismatch | "The teacher went home" | "शिक्षक घर गया" (masculine) | Could be "शिक्षिका घर गयी" (feminine) |
| Idiom translation | "It's raining cats and dogs" | "बिल्लियाँ और कुत्ते बरस रहे हैं" (literal!) | "मूसलाधार बारिश हो रही है" |
| Named entity | "Modi visited Washington" | "मोदी ने वाशिंग्टन का दौरा किया" ✓ | Usually correct for famous entities |
Visual Aids: The NLP Project Pipeline
Master Pipeline: From Raw Text to Deployed Model
Model Selection Guide
Tokenization Comparison Across Scripts
Common Misconceptions
✗ MYTH: "BLEU score of 30 means the translation is 30% correct."
✓ TRUTH: BLEU is NOT a percentage of correctness. A BLEU of 30 for Hindi→English is actually quite good (approaching professional quality for some domains), while a BLEU of 30 for French→English might be mediocre. BLEU measures n-gram overlap with reference translations, not semantic accuracy.
WHY IT MATTERS: In interviews and papers, comparing BLEU across language pairs is meaningless. Always compare within the same language pair and test set.
✗ MYTH: "Fine-tuning BERT means you only train the classification head."
✓ TRUTH: Fine-tuning means updating ALL parameters: both the pretrained Transformer layers AND the new classification head. If you freeze the Transformer and only train the head, that's called feature extraction (or linear probing), not fine-tuning. True fine-tuning typically gives 3-8% better F1 than feature extraction.
WHY IT MATTERS: GATE questions specifically test this distinction. In practice, with small datasets (<1K samples), feature extraction may actually be better to avoid overfitting.
✗ MYTH: "Multilingual BERT (mBERT) was explicitly trained on parallel (aligned) multilingual data."
✓ TRUTH: mBERT was trained on the concatenation of Wikipedia articles from 104 languages with NO parallel alignment and NO cross-lingual objectives. Yet it still learns cross-lingual representations! This is one of the biggest surprises of modern NLP: the ability emerges from shared subword tokens and structural similarities.
WHY IT MATTERS: Understanding this explains why mBERT works on some language pairs (similar structures) better than others, and why models like XLM (with explicit translation objectives) can outperform it.
✗ MYTH: "ROUGE and BLEU measure the same thing."
✓ TRUTH: BLEU is precision-oriented (what fraction of the candidate's n-grams appear in the reference?). ROUGE is recall-oriented (what fraction of the reference's n-grams appear in the candidate?). Because BLEU is precision-based, padding the output with extra words lowers the score, and its brevity penalty separately guards against outputs that are too short. Because ROUGE is recall-based, short outputs that omit reference content score low. Use BLEU for translation, ROUGE for summarization.
WHY IT MATTERS: Using the wrong metric leads to wrong model selection. A model that copies the entire document would get ROUGE-1 recall ≈ 1.0 but a very low BLEU.
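The precision/recall contrast is easy to see on unigrams alone. The sketch below is a simplified illustration, not full BLEU or ROUGE: `unigram_overlap` is a hypothetical helper, and the sentences are invented for the example.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Return (precision, recall) of clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection = clipped matches
    precision = overlap / max(sum(cand.values()), 1)  # BLEU-style view
    recall = overlap / max(sum(ref.values()), 1)      # ROUGE-style view
    return precision, recall

reference = "the court upheld the right to privacy"
short_output = "the court upheld"
padded_output = ("the court upheld the right to privacy "
                 "and also many other unrelated things were said")

for name, text in [("short", short_output), ("padded", padded_output)]:
    p, r = unigram_overlap(text, reference)
    print(f"{name:>6}: precision = {p:.2f}, recall = {r:.2f}")
```

The short output keeps precision perfect while recall drops; the padded output does the opposite. That asymmetry is why a recall-oriented metric suits summarization and a precision-oriented one (with a brevity penalty) suits translation.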
✗ MYTH: "Code-mixed (Hinglish) text is just Hindi with some English words. Tokenize each word by its language."
✓ TRUTH: Code-mixing happens at every level: word, morpheme, even character. "Main meeting mein gaya tha" has Hindi syntax wrapping English nouns. Some words are "language-ambiguous": "time" is used in both Hindi and English conversations. Language-tagging each token is itself a hard NLP problem (Language Identification) that must precede other tasks.
WHY IT MATTERS: Pipelines that assume clean language boundaries fail catastrophically on real Indian social media text.
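A quick way to see why token-level language tagging is hard: tagging by script alone works for Devanagari but tells you nothing about Romanized Hindi. The sketch below uses only Unicode ranges; `script_of` and `tag_tokens` are hypothetical helper names, and the label set is an assumption made for illustration.

```python
def script_of(token):
    """Classify a token by the scripts of its characters (no language model involved)."""
    has_devanagari = any('\u0900' <= ch <= '\u097F' for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_latin:
        return "roman"
    return "other"

def tag_tokens(text):
    """Whitespace-tokenize and attach a script label to each token."""
    return [(tok, script_of(tok)) for tok in text.split()]

print(tag_tokens("मैं meeting में गया था"))
print(tag_tokens("Main meeting mein gaya tha"))
```

In the second sentence every token comes back as "roman", so "meeting" (English) and "mein" (Hindi) are indistinguishable by script. Separating them needs a trained language identifier, which is the point the myth overlooks.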
GATE / Exam Corner
Key Formulas Quick Reference
GATE Previous Year Questions (Adapted)
In BERT, the [CLS] token representation is typically used for:
- Predicting the next word
- Sentence-level classification tasks
- Generating text autoregressively
- Computing positional encodings
Which of the following correctly describes the Brevity Penalty in BLEU score?
- It penalizes translations that are too long
- It penalizes translations that are too short
- It penalizes translations with too many unique n-grams
- It is always equal to 1.0
In a Seq2Seq model with attention for Hindi→English translation, the attention mechanism primarily helps with:
- Handling the different word order (SOV → SVO)
- Learning grammar rules explicitly
- Reducing the vocabulary size
- Eliminating the need for a decoder
Prediction Table: Expected GATE Topics (2025-26)
| Topic | Probability | Question Type | Key Concept |
|---|---|---|---|
| BLEU computation | 🟢 High | Numerical | Hand-calculate BLEU for given candidate/reference |
| BERT architecture | 🟢 High | MCQ | [CLS] token, encoder-only, MLM objective |
| ROUGE vs BLEU | 🟡 Medium | MCQ | Precision vs recall orientation |
| Transformer attention | 🟢 High | MCQ + NAT | Self vs cross attention, complexity O(n²) |
| Seq2Seq + attention | 🟡 Medium | MCQ | Encoder-decoder, alignment |
| Tokenization (BPE/WP) | 🟡 Medium | MCQ | Subword segmentation methods |
Interview Prep
Conceptual Questions
Q1: Why does multilingual BERT (mBERT) work across languages despite never seeing parallel data?
mBERT's cross-lingual ability emerges from three factors: (1) Shared subword tokens: words like "information" appear in many languages (English, French, etc.), creating anchor points; (2) Similar word order: SVO languages share structural patterns that the model learns; (3) Parameter sharing: all languages share the same Transformer weights, forcing the model to learn language-agnostic features. Research by Pires et al. (2019) showed that transfer works best between typologically similar languages and weakens when word order differs (for example, SVO English vs. SOV Hindi), even though some transfer survives across scripts. This is part of why MuRIL (which includes transliterated text) outperforms mBERT on Indian languages.
Q2: You're building a Hinglish sentiment analyzer. Your F1 is 0.68. How would you improve it?
Data-side: (1) Augment with spelling normalization (map "bohot" → "bahut"); (2) Back-translation augmentation: translate Hindi to English and back to generate paraphrases; (3) Add emoji as features (for example, 😞 = negative, 🔥 = positive).
Model-side: (4) Switch from mBERT to MuRIL (trained on transliterated Indian text); (5) Add a Hinglish language model pre-training step on unlabeled tweets before fine-tuning.
Architecture-side: (6) Ensemble MuRIL + XLM-R + a CNN-based classifier; (7) Use focal loss instead of cross-entropy to handle class imbalance (neutral tweets often dominate).
Evaluation-side: (8) Stratify by language ratio (pure Hindi vs. mixed vs. pure English) to find where the model fails worst.
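Point (7) is easier to judge with the formula in hand. Focal loss scales cross-entropy by (1 - p_t)^γ, where p_t is the probability assigned to the true class, so confidently correct examples contribute little. The sketch below is a plain-Python, single-example version for intuition only; the function names are illustrative, the per-class α weighting is omitted, and real training would use a batched tensor implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = [4.0, 0.0, 0.0]   # model is confident in class 0 (say, the dominant neutral class)
hard = [0.2, 0.0, 0.1]   # model is unsure
for name, logits in [("easy", easy), ("hard", hard)]:
    ce = focal_loss(logits, target=0, gamma=0.0)  # gamma = 0 recovers cross-entropy
    fl = focal_loss(logits, target=0, gamma=2.0)
    print(f"{name}: cross-entropy = {ce:.4f}, focal = {fl:.4f}")
```

The easy example's loss shrinks by orders of magnitude under focal loss while the hard example's shrinks far less, which shifts the gradient budget toward the examples the model still gets wrong.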
Coding Question
Q3: Implement BLEU score from scratch (no libraries). [Frequently asked at Google, Microsoft India]
The complete solution is in Project 5's bleu_from_scratch() function above. Key points interviewers look for:
- Do you implement clipping? (min of candidate count vs. reference count)
- Do you include the brevity penalty?
- Do you use geometric mean of n-gram precisions (not arithmetic)?
- Can you handle edge cases? (zero n-gram matches → avoid log(0))
Case Study Question
Q4: Design a multilingual content moderation system for ShareChat (Indian social media)
ShareChat supports 15 Indian languages. Users post in code-mixed text, and hate speech often uses euphemisms, misspellings, and transliteration to evade filters. Design a system that detects hate speech, misinformation, and graphic violence descriptions across all supported languages.
Expected Design:
Layer 1: Language Detection. fastText-based language identifier to route text to the right pipeline.
Layer 2: Multilingual Classifier. MuRIL fine-tuned on labeled hate speech data from each language. Use a unified model (not 15 separate ones) to handle code-mixing.
Layer 3: Keyword Filter. Maintain a curated list of slurs, dog-whistles, and euphemisms in each language (updated weekly). This catches what the ML model misses.
Layer 4: Human Review. Route borderline cases (confidence 0.4-0.7) to human moderators who speak the relevant language.
Key challenge: New slang evolves faster than you can retrain. Solution: use retrieval-augmented classification, which embeds newly flagged terms and finds their nearest neighbors in the hate speech embedding space.
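The hand-off between Layers 2, 3, and 4 can be sketched as a small routing rule. Only the 0.4-0.7 review band comes from the design above; the action names, the treatment of scores at or above 0.7 as automatic flags, and the rule that a keyword hit forces at least a human review are assumptions made for this illustration.

```python
def route_post(p_violation, keyword_hit=False, low=0.4, high=0.7):
    """Map a classifier probability (Layer 2) and a keyword flag (Layer 3) to an action."""
    if p_violation >= high:
        action = "auto_flag"
    elif p_violation >= low:
        action = "human_review"   # Layer 4: borderline confidence band
    else:
        action = "allow"
    # Assumption: a keyword-filter hit never lets a post through unreviewed.
    if keyword_hit and action == "allow":
        action = "human_review"
    return action

for p, hit in [(0.92, False), (0.55, False), (0.10, False), (0.10, True)]:
    print(f"p = {p:.2f}, keyword_hit = {hit} -> {route_post(p, hit)}")
```

Keeping the thresholds as parameters lets them be tuned per language, which matters when moderator capacity differs across the supported languages.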
- Hinglish/code-mixing handling
- Low-resource language techniques
- WhatsApp-scale systems
- MuRIL, IndicBERT, IndicTrans
- GATE numerical: BLEU calculation
- Companies: Flipkart, ShareChat, Google India, Microsoft India
- BERT/GPT architecture details
- Scaling laws and efficiency
- Bias and fairness in NLP
- Prompt engineering and few-shot
- System design: search ranking, recommendation
- Companies: Google, Meta, OpenAI, Amazon
Hands-On Lab / Mini-Project
🔬 Mini-Project: End-to-End Multilingual News Classifier for Indian Languages
Objective
Build a news article topic classifier that works on Hindi, English, and Hinglish text. Classify articles into 5 categories: Sports, Politics, Technology, Entertainment, Business.
Requirements
| Component | Specification |
|---|---|
| Dataset | BBC Hindi + BBC English + scraped Hinglish news (total ~10K articles) |
| Model | MuRIL or mBERT, fine-tuned with HuggingFace Trainer |
| Preprocessing | Handle Devanagari + Roman, normalize spelling, handle code-mixing |
| Training | 80/10/10 train/val/test split, stratified by language AND category |
| Evaluation | Per-language F1, confusion matrix, error analysis on 50 misclassified samples |
| Deliverable | Jupyter notebook + 2-page report + working inference function |
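The "stratified by language AND category" requirement means every (language, category) cell should keep the same 80/10/10 proportions. One way to sketch this without extra libraries is to split each cell separately. The function name, the row fields (`lang`, `category`), and the synthetic rows are hypothetical stand-ins for your dataset.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, percents=(80, 10, 10), seed=13):
    """Split rows into train/val/test so that each key(row) group keeps the same ratios."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    rng = random.Random(seed)  # fixed seed for reproducibility
    train, val, test = [], [], []
    for group_key in sorted(groups):
        items = groups[group_key]
        rng.shuffle(items)
        n_train = len(items) * percents[0] // 100
        n_val = len(items) * percents[1] // 100
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]   # remainder goes to test
    return train, val, test

# Synthetic stand-in: 3 languages x 5 categories x 10 articles each
languages = ["hi", "en", "hinglish"]
categories = ["Sports", "Politics", "Technology", "Entertainment", "Business"]
rows = [{"lang": l, "category": c, "id": i}
        for l in languages for c in categories for i in range(10)]

train, val, test = stratified_split(rows, key=lambda r: (r["lang"], r["category"]))
print(len(train), len(val), len(test))
```

With real data, very small cells (for example, few Hinglish Business articles) may leave the validation set empty for that cell; checking the per-cell counts before training is worthwhile.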
Rubric (100 points)
| Criterion | Points | Excellent (90-100%) | Good (70-89%) | Needs Work (<70%) |
|---|---|---|---|---|
| Data Preprocessing | 20 | Handles all three languages, normalizes spelling, removes noise intelligently | Basic cleaning, handles two languages | Minimal cleaning, English-only |
| Model Training | 25 | MuRIL fine-tuned, proper hyperparameter search, FP16 training | mBERT fine-tuned, reasonable hyperparams | No fine-tuning, uses pre-trained directly |
| Evaluation | 20 | Per-language F1 breakdown, confusion matrix, statistical significance | Overall F1 and accuracy reported | Only accuracy reported |
| Error Analysis | 20 | 50+ samples analyzed, error categories identified, improvement strategies proposed | 20+ samples analyzed | No error analysis |
| Code Quality | 15 | Clean, commented, reproducible, modular functions | Functional but messy | Hardcoded, not reproducible |
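For the "per-language F1 breakdown" in the rubric, the idea is to group predictions by language before scoring. Below is a dependency-free sketch using macro-averaged F1; the function names and the toy records are illustrative, and in a notebook you could equally compute the same numbers with scikit-learn.

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def f1_by_language(records):
    """records: iterable of (language, gold_label, predicted_label) triples."""
    gold, pred = defaultdict(list), defaultdict(list)
    for lang, g, p in records:
        gold[lang].append(g)
        pred[lang].append(p)
    return {lang: macro_f1(gold[lang], pred[lang]) for lang in gold}

records = [
    ("en", "Sports", "Sports"), ("en", "Politics", "Politics"),
    ("hi", "Sports", "Sports"), ("hi", "Business", "Business"),
    ("hinglish", "Sports", "Politics"), ("hinglish", "Politics", "Politics"),
]
for lang, score in f1_by_language(records).items():
    print(f"{lang}: macro-F1 = {score:.2f}")
```

An overall score would hide the Hinglish failures in this toy data; the per-language view surfaces them immediately.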
Starter Code
Python
# Mini-Project Starter Code
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
from sklearn.model_selection import train_test_split
import pandas as pd
# TODO: Step 1: Load and preprocess your dataset
# df = pd.read_csv('multilingual_news.csv')
# df['text'] = df['text'].apply(preprocess_multilingual)
# TODO: Step 2: Choose model
MODEL_NAME = "google/muril-base-cased"  # Your choice!
# TODO: Step 3: Tokenize, create datasets
# TODO: Step 4: Train with Trainer API
# TODO: Step 5: Evaluate per-language
# TODO: Step 6: Error analysis on misclassified samples
# TODO: Step 7: Write report
Exercises
Section A: Conceptual Questions (5)
Explain why tokenizing "bahut acchi movie thi" with English BERT's tokenizer produces fragmented, largely meaningless subwords (and why Devanagari input fares even worse), while mBERT produces more meaningful subwords. What structural difference in their vocabularies causes this?
Compare extractive and abstractive summarization for Indian legal judgments. Which is more appropriate and why?
Why is ROUGE more appropriate than BLEU for evaluating summarization systems?
What is "code-mixing" and why does it make NLP harder? Give 3 specific examples from Indian languages.
Explain the difference between MuRIL, mBERT, and XLM-RoBERTa for Indian language tasks. When would you choose each?
Section B: Mathematical Questions (8)
Calculate the BLEU score for: Reference: "the cat is on the mat" | Candidate: "the the the the"
Reference: "India is a beautiful country" (5 words). Candidate: "India is beautiful" (3 words). Calculate the brevity penalty.
Compute ROUGE-1, ROUGE-2, and ROUGE-L for: Reference: "the cat sat on the mat" | System summary: "the cat on the mat"
In BERT's self-attention, if the input sequence has length n = 512 and hidden dimension d = 768 with h = 12 heads, what is (a) the dimension of each head's Q, K, V matrices, and (b) the total FLOPs for one self-attention layer?
A fake news classifier has: TP=180, FP=20, FN=30, TN=770. Compute precision, recall, F1, and accuracy. Is this model production-ready?
If mBERT has a vocabulary of 120,000 tokens across 104 languages, approximately how many tokens are allocated to Hindi? Why is this a problem?
Derive the gradient of the cross-entropy loss with respect to the logits for a 3-class sentiment classifier. Show that ∂L/∂zᵢ = pᵢ - yᵢ where pᵢ = softmax(z)ᵢ.
A Hindi→English NMT model produces translations with average length 15 words when references average 20 words. If the n-gram precisions are p₁ = 0.85, p₂ = 0.55, p₃ = 0.35, p₄ = 0.20, what is the BLEU score?
Section C: Coding Questions (4)
Write a function normalize_hinglish(text) that handles at least 10 common Hinglish spelling variants. For example: "bohot" → "bahut", "achha" → "acha", "kaise" / "kese" → "kaise", etc.
Implement a function compute_rouge_l(reference, candidate) from scratch that computes ROUGE-L using the longest common subsequence algorithm. Do NOT use any NLP libraries.
Using HuggingFace, write a script that loads MuRIL, fine-tunes it on 100 labeled Hinglish sentiment samples, and evaluates on 20 test samples. Report per-class precision, recall, and F1.
Hint: AutoModelForSequenceClassification.from_pretrained("google/muril-base-cased", num_labels=3). Create a custom Dataset class. Use Trainer with TrainingArguments(num_train_epochs=3, learning_rate=2e-5). Use sklearn's classification_report for per-class metrics.
Build a simple language detection function that classifies a given text as "Hindi (Devanagari)", "Hindi (Roman/Hinglish)", or "English" using Unicode ranges. No ML, just rule-based.
Section D: Critical Thinking Questions (3)
A company deploys a fake news detection model for Indian WhatsApp. The model has 92% accuracy on the test set but users report it frequently flags legitimate political opinions as fake news. Analyze what might be going wrong and propose 3 concrete solutions.
Your Hindi→English NMT model gets BLEU = 24 on news text but BLEU = 8 on poetry. Explain why, and propose how to improve poetry translation without hurting news performance.
India's judiciary wants to deploy your legal summarization system. What ethical, legal, and technical concerns would you raise before deployment? How would you address each?
★ Starred Research Questions (2)
Read the MuRIL paper (Khanuja et al., 2021). How does MuRIL's pretraining differ from mBERT? Why does including transliterated text in pretraining improve performance on code-mixed tasks? Design an experiment to test whether transliteration augmentation helps IndicBERT as well.
The IndicTrans2 paper (Gala et al., 2023) claims state-of-the-art Hindi↔English translation. Read the paper and answer: (a) How does their "script unification" approach work? (b) Why is SentencePiece with character coverage=0.9995 crucial for Indic scripts? (c) Design a study comparing IndicTrans2 vs. fine-tuned mBART on legal document translation (not general text).
Connections
How This Chapter Connects to the Rest of the Book
Chapter 14 (LSTMs & GRUs): The Seq2Seq chatbot (P3) uses LSTM encoder-decoder architecture. Attention mechanism builds on the hidden state sequences from Chapter 14.
Chapter 15 (Transformers & Attention): Every project uses Transformers: BERT for classification (P1, P4), T5/BART for generation (P2), and full encoder-decoder Transformers for translation (P5). Self-attention and cross-attention from Ch. 15 are directly applied here.
Chapter 17 (Transfer Learning): All 5 projects use transfer learning. Pre-trained models (MuRIL, mT5, opus-mt) are fine-tuned on domain-specific data. The concepts of feature extraction vs. fine-tuning from Ch. 17 are critical.
→ Enables
Chapter 22 (Future & Ethics): The ethical concerns raised in P4 (fake news bias) and P2 (legal AI responsibility) connect directly to AI ethics discussions.
MLOps (Chapter 21): Deploying these models in production requires model serving, monitoring, and retraining pipelines covered in MLOps.
Capstone projects: Students can extend any of these 5 projects into a full capstone with real data.
🔬 Research Frontier
Large Language Models for Indian Languages: IndicLLM, Sarvam AI's models, and Krutrim are attempting to build GPT-scale models specifically for Indian languages. The fundamental challenges (tokenization, code-mixing, morphological richness) discussed in this chapter explain why simply scaling English LLMs doesn't solve Indian language AI.
🏭 Industry Implementation
Bhashini (Government of India): India's national platform for language translation and NLP services uses many of the same techniques covered here: IndicTrans for translation, IndicBERT for understanding, and ASR models for speech. The AI4Bharat consortium (IIT Madras) has open-sourced many of these models.
Chapter Summary
🎯 Key Takeaways
- Indian NLP is fundamentally harder than English NLP due to code-mixing, script diversity (13 scripts), morphological richness, spelling variation, and severe data scarcity. Pretending it's "just another language" leads to catastrophic failures.
- MuRIL > mBERT > English BERT for Indian language classification tasks. MuRIL's inclusion of transliterated text in pretraining gives it a 2-5% F1 advantage on code-mixed tasks.
- Legal summarization requires a hybrid approach: extractive filtering (to fit within token limits) followed by abstractive generation (for readable summaries). Watch for hallucinated legal citations.
- Task-oriented chatbots decompose into Intent → Slots → Response: Use a classifier for intent, token-level NER for slot filling, and either templates or Seq2Seq for response generation. Bilingual handling is the key Indian challenge.
- BLEU ≠ quality percentage. BLEU measures n-gram overlap with clipping and brevity penalty. Never compare BLEU scores across language pairs. Use BLEU for translation, ROUGE for summarization, F1 for classification.
- Attention visualization reveals word alignment between source and target in NMT. Hindi (SOV) ↔ English (SVO) word order changes are visible in cross-attention maps.
- Domain-specific preprocessing is critical: Legal text needs section normalization, social media needs emoji handling, WhatsApp messages need forward-indicator detection. There is no one-size-fits-all pipeline.
$$
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\tfrac{1}{4} \sum_{n=1}^{4} \log p_n\right) \qquad \text{ROUGE-N} = \frac{\sum \text{match}(\text{n-gram})}{\sum \text{count}_{\text{ref}}(\text{n-gram})} \qquad F_1 = \frac{2PR}{P + R}
$$
$$
\mathrm{BP} = \min\left(1,\ e^{\,1 - r/c}\right) \qquad \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
$$
Key Intuition: Building NLP for a multilingual country like India isn't about translating English NLP. It's about reimagining NLP from the ground up for a world where languages don't have neat boundaries.
Further Reading
🇮🇳 Indian Resources
- NPTEL, NLP by Prof. Pushpak Bhattacharyya (IIT Bombay): The definitive Indian lecture series covering Hindi NLP, WordNet, and Indian language processing. nptel.ac.in
- AI4Bharat (IIT Madras): Open-source models (IndicTrans2, IndicBERT), datasets, and benchmarks for 22 Indian languages. ai4bharat.iitm.ac.in
- Bhashini Platform: Government of India's national language translation mission, where you can see the production deployment of the techniques from this chapter. bhashini.gov.in
- GATE DA Syllabus, NLP Section: Official syllabus covering text classification, sequence models, attention mechanisms, and evaluation metrics
- ILDC Dataset (IIT Kanpur): 35K Indian Supreme Court judgments with labels and summaries; the benchmark for Indian legal NLP
Global Resources
- "Attention Is All You Need" (Vaswani et al., 2017): The original Transformer paper. Every model in this chapter is built on this architecture.
- "BERT: Pre-training of Deep Bidirectional Transformers" (Devlin et al., 2019): The foundation for Projects 1, 3, 4.
- HuggingFace Course: Free interactive course covering all the APIs used in this chapter. huggingface.co/course
- "MuRIL: Multilingual Representations for Indian Languages" (Khanuja et al., 2021): The Google Research India paper behind the best-performing model for Indian NLP.
- "IndicTrans2: Towards High-Quality and Accessible Machine Translation Models" (Gala et al., 2023): State-of-the-art Indian language translation.
- Jay Alammar's "The Illustrated Transformer": The single best visual explanation of attention and Transformers. jalammar.github.io
- 3Blue1Brown, Attention in Transformers (YouTube): Visual, intuition-first explanation of attention mechanism
- Distill.pub, "Attention and Augmented Recurrent Neural Networks": Interactive article on attention mechanisms with beautiful visualizations
Key Papers for Deep Dive
| Paper | Year | Relevance |
|---|---|---|
| Vaswani et al., "Attention Is All You Need" | 2017 | Foundation for all Transformer-based projects |
| Devlin et al., "BERT" | 2019 | Pre-training + fine-tuning paradigm (P1, P3, P4) |
| Pires et al., "How Multilingual is Multilingual BERT?" | 2019 | Why mBERT transfers across languages |
| Khanuja et al., "MuRIL" | 2021 | Best model for Indian NLP tasks |
| Malik et al., "ILDC for CJPE" | 2021 | Indian legal document corpus (P2) |
| Gala et al., "IndicTrans2" | 2023 | SOTA Hindi↔English translation (P5) |
| Patwa et al., "SemEval-2020 Task 9" | 2020 | Hinglish sentiment benchmark (P1) |
| Papineni et al., "BLEU" | 2002 | The original BLEU metric paper |
"IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages" (Gala et al., 2023)
This AI4Bharat paper introduced IndicTrans2, trained on the Bharat Parallel Corpus (230M sentence pairs across 22 languages). Key innovations: (1) Script unification: converting all Indic scripts to a common representation before tokenization; (2) Language-tagged SentencePiece: prefixing each sentence with a language tag; (3) Two-stage fine-tuning: first on general data, then on high-quality human-translated data. It achieves BLEU scores 5-8 points above Google Translate for most Indian language pairs. The models are freely available on HuggingFace.
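The idea behind script unification can be illustrated with a rough sketch. Many Indic Unicode blocks share a parallel layout, so a character can often be mapped to Devanagari by shifting its code point. This is a simplified approximation for intuition only and is not the IndicTrans2 implementation: the function name and the block table are assumptions, and a real pipeline needs a proper transliteration library because not every character has a counterpart at the shifted position.

```python
DEVANAGARI_START = 0x0900
# Start of each 128-code-point Unicode block (illustrative subset)
INDIC_BLOCK_STARTS = {
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
}

def to_devanagari(text, script):
    """Shift characters from the given script's block onto the Devanagari block."""
    start = INDIC_BLOCK_STARTS[script]
    out = []
    for ch in text:
        cp = ord(ch)
        if start <= cp < start + 0x80:
            out.append(chr(cp - start + DEVANAGARI_START))
        else:
            out.append(ch)  # digits, punctuation, Latin text pass through unchanged
    return "".join(out)

print(to_devanagari("বাংলা", "bengali"))
```

Once several languages share one script, a single subword vocabulary can reuse pieces across them, which is the motivation for unifying scripts before tokenization.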