Neural Networks & Deep Learning

Chapter 21: Applied Deep Learning (NLP and Sequence Projects)

From Hindi Tweets to Supreme Court Judgments: Building NLP Systems for Bharat

โฑ๏ธ Reading Time: ~4 hours  |  ๐Ÿ“– Unit VII: Applications & Industry  |  ๐Ÿง  Project-Based Chapter

๐Ÿ“‹ Prerequisites: Chapter 14 (LSTMs & GRUs), Chapter 15 (Transformers & Attention), Chapter 17 (Transfer Learning)

Bloom's Taxonomy Map for This Chapter

Bloom's Level | What You'll Achieve
🔵 Remember | Recall BERT architecture, tokenization strategies (BPE, WordPiece, SentencePiece), evaluation metrics (BLEU, ROUGE, F1), and the key differences between code-mixed and monolingual NLP
🔵 Understand | Explain why multilingual BERT outperforms monolingual models on Hinglish, how attention mechanisms enable alignment in NMT, and why abstractive summarization requires a decoder
🟢 Apply | Implement 5 complete NLP projects using HuggingFace Transformers: sentiment analysis, summarization, chatbot, fake news detection, and machine translation
🟡 Analyze | Diagnose tokenization failures in Devanagari text, analyze attention maps to find translation alignment errors, and compare ROUGE scores across extractive vs. abstractive summaries
🟠 Evaluate | Assess whether a BLEU score of 22 is acceptable for Hindi→English translation, judge model bias in fake news classifiers, and critique chatbot response quality using human evaluation
🔴 Create | Design and build an end-to-end multilingual NLP pipeline that handles code-mixed Indian language input, including data preprocessing, model selection, and deployment considerations
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Fine-tune multilingual BERT (mBERT) and IndicBERT for Hindi/Hinglish sentiment analysis on code-mixed social media text
  • Build an abstractive summarization pipeline for Indian legal documents using encoder-decoder Transformers
  • Implement a Seq2Seq chatbot with attention mechanism for handling bilingual booking queries (IRCTC-style)
  • Train a BERT-based fake news classifier for Indian WhatsApp misinformation with domain-specific preprocessing
  • Construct a Hindi↔English Neural Machine Translation system using Transformers, compute BLEU scores, and visualize cross-attention alignments
  • Compare Indian multilingual NLP challenges with English-only equivalents in terms of data availability, tokenization, and evaluation metrics
  • Evaluate model performance using BLEU, ROUGE-L, F1, precision, recall, and conduct error analysis on failure cases
  • Design deployment strategies for Indian language NLP models, considering script diversity, code-switching, and resource constraints
Section 2

Opening Hook: The Language of a Billion

🗣️ "Yeh train kitne baje aayegi?": When AI Meets India's Languages

India has 22 scheduled languages, and the 2011 census recorded roughly 19,500 languages and dialects reported as mother tongues. More than 500 million Indians access the internet primarily in languages other than English. When Ravi in Varanasi tweets "Bahut bakwas movie thi 👎 waste of money," he's writing in Hinglish, a fluid mix of Hindi and English, with Devanagari sometimes sprinkled in: "बहुत बकवास movie थी." No standard spelling. No clear language boundary. No neat grammar rules.

Building NLP that works for India is one of the hardest problems in AI. This chapter tackles it head-on.

In 2020, when COVID misinformation flooded WhatsApp groups across India, with messages like "गर्म पानी पीने से corona ठीक होता है" (drinking hot water cures corona), there was no reliable Hindi fake news detector. The tools built for English Twitter couldn't handle code-mixed Devanagari-Roman text. People died because of information gaps.

Meanwhile, India's Supreme Court produces 30,000+ judgments annually in a mix of English and Hindi legal prose so dense that even lawyers struggle to extract key holdings. Could a Transformer summarize a 50-page judgment into 5 crisp paragraphs?

This chapter is not a toy exercise. You will build 5 real NLP systems that solve real Indian problems, and learn to compare them with their English-only global counterparts. By the end, you'll understand why multilingual NLP is fundamentally harder, mathematically richer, and more impactful than monolingual NLP.

BERT · HuggingFace · IndicNLP · mBERT · Seq2Seq · BLEU · ROUGE
Section 3

The Intuition First: Why Indian NLP is Hard Mode

The Postman Analogy

Imagine you're a postman in London. Every letter has a clearly written English address. Simple. Now imagine you're a postman in Mumbai. Some addresses are in English, some in Hindi, some in Marathi, some are a wild mix, such as "Flat 302, तीसरी मंजिल, Sunshine Apartments, Andheri (W)", and half of them have creative spellings because there's no standardized transliteration. That's what an NLP model faces when processing Indian text.

The Five Core Challenges

Why Monolingual NLP Breaks on Indian Text

1. Code-Mixing (Code-Switching)

Indians don't separate languages cleanly. A single sentence freely mixes Hindi words, English words, and sometimes Devanagari script mid-sentence: "Main kal meeting mein bahut nervous tha." A model trained only on English will read "Main" as the English word meaning "primary," not as the Hindi pronoun "I."

2. Script Diversity

Hindi alone appears in two scripts: Devanagari (मैं) and Roman transliteration (main). Same word, completely different byte sequences. A model trained on one won't recognize the other without explicit handling.
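To make the "different byte sequences" point concrete, here is a minimal sketch that encodes both spellings of the same pronoun as UTF-8 and checks whether they share any bytes. It uses only the standard library; the variable names are illustrative.

```python
# The same Hindi pronoun ("I") written in two scripts.
devanagari = "\u092e\u0948\u0902"   # the Devanagari spelling: three code points
roman = "main"                      # the Roman transliteration: four ASCII letters

dev_bytes = devanagari.encode("utf-8")
rom_bytes = roman.encode("utf-8")

# Each Devanagari code point takes 3 bytes in UTF-8; each ASCII letter takes 1.
print(len(devanagari), len(dev_bytes))
print(len(roman), len(rom_bytes))

# If no byte is shared, a byte- or character-level model sees zero overlap
# between the two spellings of the same word.
shared = set(dev_bytes) & set(rom_bytes)
print(shared)
```

The two spellings have nothing in common at the byte level, which is why transliteration-aware training data or an explicit normalization step is needed.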

3. Morphological Richness

Hindi is a morphologically richer language than English. The verb "खाना" (to eat) inflects as: खाता, खाती, खाते, खाऊँगा, खाऊँगी, खाया, खायी, खाये... English has eat/eats/ate/eaten. Hindi has dozens of forms.

4. Resource Scarcity

English has billions of labeled NLP samples. Hindi has orders of magnitude less. Hinglish has almost none in clean, curated form. This makes transfer learning not just useful but essential.

5. No Standard Spelling

When writing Hindi in Roman script, "bahut" = "bohot" = "boht" = "bhot". Formal romanization schemes exist (ISO 15919, for example), but almost nobody follows them in casual writing, so every user invents their own spelling.

"Aha" Question

If you train BERT on English text and then ask it to classify "Bahut bakwas movie thi" as positive or negative, what will happen?

Answer: It will perform poorly. English BERT's WordPiece vocabulary was built from English text, so "Bahut" is split into fragments such as ["ba", "##hu", "##t"] (illustrative; the exact split depends on the vocabulary). The model has no useful semantic representation for these fragments. This is why we need multilingual models, and why this chapter exists.

🇮🇳 Indian NLP Landscape
  • Languages: 22 official + 19,500 dialects
  • Scripts: 13 distinct scripts (Devanagari, Tamil, Bengali, etc.)
  • Code-mixing: ~30% of Indian social media is code-mixed
  • Key models: IndicBERT, MuRIL, IndicTrans
  • Key datasets: IndicNLPSuite, SAIL, IIT-P Hinglish
  • Tokenizer needs: SentencePiece, IndicNLP tokenizer
  • Biggest challenge: No standard transliteration
🇺🇸 US/Global NLP Landscape
  • Languages: Primarily English (+ Spanish, Chinese)
  • Scripts: Latin script dominates
  • Code-mixing: Rare in mainstream NLP benchmarks
  • Key models: BERT, RoBERTa, GPT, T5
  • Key datasets: GLUE, SQuAD, SST, WMT
  • Tokenizer needs: WordPiece, BPE (well-established)
  • Biggest challenge: Bias, fairness, hallucination

Google's MuRIL (Multilingual Representations for Indian Languages) was trained on 17 languages (16 Indian languages plus English) and outperforms mBERT on Indian language tasks by 2-5% on average. It was designed specifically because mBERT's 104-language vocabulary dilutes Indian language representation: Hindi gets only ~1,700 out of roughly 120,000 WordPiece tokens.
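A quick back-of-the-envelope check puts the vocabulary figures in perspective. The numbers below are the approximate values quoted in the paragraph above, not measurements taken from the actual vocabulary file.

```python
hindi_tokens = 1_700     # approximate figure quoted above
vocab_size = 120_000     # approximate mBERT WordPiece vocabulary size quoted above
languages = 104          # languages sharing that vocabulary

share = hindi_tokens / vocab_size
even_split = vocab_size / languages

print(f"Hindi share of the vocabulary: {share:.1%}")
print(f"Tokens per language under an even split: {even_split:.0f}")
```

Even a perfectly even split would leave each language with only a little over a thousand tokens, so the shortage comes from sharing one fixed-size vocabulary across 104 languages rather than from Hindi being singled out.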

NLP engineers working on Indian languages are among the most in-demand AI professionals in India today. These are the roles this chapter prepares you for:

NLP Engineer (Indian Languages) · Applied ML Scientist · Conversational AI Developer · Legal Tech AI Engineer · Trust & Safety ML Engineer · Machine Translation Researcher

Top Hirers (India): Google India, Microsoft India, Flipkart, ShareChat, Koo, Jugalbandi, Bhashini, C-DAC

Top Hirers (US/Global): Google, Meta, OpenAI, Amazon, Apple, DeepL, Translated

P1

Hindi/Hinglish Sentiment Analysis

BERT Fine-tuning on Code-Mixed Tweets

Problem Statement

Given a tweet or social media post written in Hindi, English, or Hinglish (code-mixed Hindi-English), classify it as Positive, Negative, or Neutral. The input may use Devanagari script, Roman script, or a mix of both.

Why This Is Hard

Consider these three tweets, all expressing negativity:

  • "Worst movie ever": Pure English. Easy.
  • "बहुत बकवास फिल्म थी": Pure Hindi (Devanagari). Needs Hindi-aware tokenizer.
  • "Bahut bakwas movie thi 👎": Hinglish (Roman). Neither English nor Hindi tokenizer handles this well.

Dataset

We use the SemEval 2020 Task 9 Hinglish Sentiment dataset as the core corpus and augment it with two additional sources:

Dataset | Size | Languages | Labels
SemEval 2020 Task 9 | ~15,000 tweets | Hinglish (code-mixed) | Positive, Negative, Neutral
SAIL 2015 Hindi | ~12,000 tweets | Hindi (Devanagari) | Positive, Negative, Neutral
Custom scraped | ~5,000 tweets | Mixed | Manual annotation

Tokenization Challenges

Step 1: Why English BERT Tokenizer Fails on Hindi

Let's trace what happens when you feed "बहुत अच्छी फिल्म" to the standard BERT tokenizer:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokens = tokenizer.tokenize("बहुत अच्छी फिल्म")
print(tokens)
# Illustrative result: mostly '[UNK]' or single-character fragments.
# The English vocabulary has no whole-word entries for Devanagari text,
# so nothing meaningful survives tokenization.

Step 2: Why mBERT Tokenizer Is Better (But Not Great)

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
tokens = tokenizer.tokenize("बहुत अच्छी फिल्म")
print(tokens)
# Illustrative output: ['बहुत', 'अच्छी', 'फिल्म']  ← Works! But only ~1,700 Hindi tokens

Step 3: Hinglish Breaks Both

tokens = tokenizer.tokenize("bahut acchi movie thi yaar")
print(tokens)
# Illustrative output: ['ba', '##hu', '##t', 'ac', '##chi', 'movie', 'th', '##i', 'ya', '##ar']
# "bahut" → 3 subwords, "acchi" → 2 subwords: heavy fragmentation!
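One way to put a number on this fragmentation is tokenizer "fertility": the average number of subword tokens per word. The sketch below computes it directly from a WordPiece token list such as the illustrative one above, so it needs no model download. The function name `fertility` is a hypothetical helper, not part of any library.

```python
def fertility(tokens):
    """Average number of WordPiece tokens per word.

    A token that starts with '##' continues the previous word,
    so the number of words equals the number of tokens without that prefix.
    """
    words = sum(1 for t in tokens if not t.startswith("##"))
    if words == 0:
        return 0.0
    return len(tokens) / words

# Token list copied from the illustrative Hinglish example above.
hinglish = ['ba', '##hu', '##t', 'ac', '##chi', 'movie', 'th', '##i', 'ya', '##ar']
print(fertility(hinglish))   # 10 tokens over 5 words
```

A fertility close to 1.0 means most words are in the vocabulary as whole units. Comparing this number across candidate tokenizers on a sample of your own data is a cheap first check before any fine-tuning.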

Step 4: The Solution: Use MuRIL or IndicBERT

Google's MuRIL (Multilingual Representations for Indian Languages) is trained on transliterated text too. It handles Roman-script Hindi much better because its training data included romanized Indian language text.

Model Architecture

Hinglish Sentiment Pipeline

Input: "bahut bakwas movie thi 👎"
    ▼
Preprocessing
    • Remove emojis → features
    • Normalize spelling
    • Handle hashtags
    ▼
MuRIL / mBERT Tokenizer
    [CLS] tok₁ tok₂ ... [SEP]
    ▼
MuRIL Encoder (12 layers)
    Hidden size: 768
    Attention heads: 12
    ▼
[CLS] token embedding
    + Dropout(0.3)
    + Linear(768 → 3)
    + Softmax
    ▼
Output: Positive / Negative / Neutral

HuggingFace Implementation

Python
# Project 1: Hindi/Hinglish Sentiment Analysis with MuRIL
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer
)
from sklearn.metrics import classification_report, f1_score
import pandas as pd
import re
import numpy as np

# --- Step 1: Preprocessing for Hinglish ---
def preprocess_hinglish(text):
    """Clean Hinglish tweet text."""
    text = re.sub(r'http\S+', '', text)           # Remove URLs
    text = re.sub(r'@\w+', '', text)              # Remove mentions
    text = re.sub(r'#(\w+)', r'\1', text)        # Keep hashtag text
    text = re.sub(r'[^\w\s\u0900-\u097F]', '', text)  # Keep Devanagari + alphanumeric
    text = re.sub(r'\s+', ' ', text).strip()
    text = text.lower()
    # Normalize common Hinglish spelling variants
    spelling_map = {
        'bohot': 'bahut', 'boht': 'bahut', 'bhot': 'bahut',
        'accha': 'acha', 'achha': 'acha',
        'kyaa': 'kya', 'kia': 'kya',
    }
    words = text.split()
    words = [spelling_map.get(w, w) for w in words]
    return ' '.join(words)

# --- Step 2: Dataset class ---
class HinglishDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),       # drop the batch dim only
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }

# --- Step 3: Load model: MuRIL for Indian languages ---
MODEL_NAME = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3  # Positive, Negative, Neutral
)

# --- Step 4: Prepare data ---
# (Assuming df has columns: 'text', 'label'  where label is in {0, 1, 2})
# df = pd.read_csv('hinglish_sentiment.csv')
# For demo, we create sample data:
sample_texts = [
    preprocess_hinglish("bahut acchi movie thi! loved it"),
    preprocess_hinglish("worst film ever dekhi maine"),
    preprocess_hinglish("ठीक ठाक थी, nothing special"),
    preprocess_hinglish("kya bakwas acting thi yaar"),
    preprocess_hinglish("mast movie hai bhai 🔥"),
]
sample_labels = [0, 1, 2, 1, 0]  # 0=Positive, 1=Negative, 2=Neutral

# --- Step 5: Training configuration ---
training_args = TrainingArguments(
    output_dir='./hinglish_sentiment_model',
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_weighted",
    fp16=torch.cuda.is_available(),  # Mixed precision only when a GPU is present
)

# --- Step 6: Custom metrics ---
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=1)
    f1 = f1_score(labels, preds, average='weighted')
    return {'f1_weighted': f1}

# --- Step 7: Train ---
train_dataset = HinglishDataset(sample_texts, sample_labels, tokenizer)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # Use proper val split in practice!
    compute_metrics=compute_metrics,
)
# trainer.train()  # Uncomment to actually train

# --- Step 8: Inference ---
LABEL_NAMES = ["Positive", "Negative", "Neutral"]

def predict_sentiment(text):
    text = preprocess_hinglish(text)
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=128)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    model.eval()  # disable dropout so predictions are deterministic
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]  # shape: (3,)
    label = LABEL_NAMES[probs.argmax().item()]
    return label, probs.tolist()

# Test it (predictions are meaningless until trainer.train() has been run)
print(predict_sentiment("bahut mast movie thi!"))
print(predict_sentiment("bakwas film, time waste"))
print(predict_sentiment("ठीक-ठाक थी"))

Expected Metrics

Model | F1
mBERT | 0.71
MuRIL | 0.76
XLM-R | 0.68
English BERT | 0.59
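Assuming these scores are weighted F1, as computed by `compute_metrics` in the training code, the following sketch shows what that average does under the hood: compute F1 per class, then weight each class by how many true examples it has. The function and the toy labels are illustrative, and in practice `sklearn.metrics.f1_score(..., average='weighted')` does the same job.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)          # number of true examples per class
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n                # weight the class F1 by its support
    return total / len(y_true)

# Toy labels: 0=Positive, 1=Negative, 2=Neutral
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because classes are weighted by support, a model can score well here while still doing badly on a rare class. Reporting per-class F1 alongside the weighted figure avoids that blind spot.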

Error Analysis

Error Type | Example | Why Model Fails | Frequency
Sarcasm | "waah kya acting thi 🤣" (sarcastic praise) | Surface-level positive words, but negative intent | ~18%
Spelling variation | "bhot bdia" vs "bahut badiya" | Same meaning, different tokens | ~15%
Script mixing | "बहुत bad movie" | Half Devanagari, half English in one phrase | ~12%
Negation in Hindi | "nahi acchi thi" (was not good) | Model focuses on "acchi" (good), misses "nahi" (not) | ~10%
Cultural context | "timepass movie" (mediocre) | "timepass" is Indian slang, not in training data | ~8%

โŒ MYTH: "Just use English BERT with Google Translate โ€” translate Hindi to English first, then classify."

โœ… TRUTH: Translation destroys sentiment signals. "Bahut bakwas" literally translates to "very nonsense," losing the emotional intensity. Sarcasm, cultural slang, and code-mixed nuance are untranslatable. Always use multilingual models directly on the original text.

๐Ÿ” WHY IT MATTERS: In production sentiment systems (ShareChat, Koo), translate-then-classify underperforms direct multilingual models by 8-12% F1.

P2

Indian Legal Document Summarization

Transformer Abstractive Summarization on Court Judgments

Problem Statement

Given a full Indian Supreme Court or High Court judgment (typically 5,000–50,000 words), generate a concise abstractive summary (300–800 words) that captures the facts, issues, arguments, reasoning, and holding.

Why This Matters

India's judiciary has a backlog of 50+ million pending cases. Lawyers and judges spend hours reading lengthy judgments to find relevant precedents. Projects such as the Indian Legal Documents Corpus (ILDC), released by researchers at IIT Kanpur, aim to make legal research faster with NLP. This is not an academic exercise: it's a tool that could accelerate justice delivery in India.

Dataset

Dataset | Size | Avg. Document Length | Avg. Summary Length
ILDC (IIT Kanpur) | ~35,000 judgments | ~4,200 words | ~250 words
IN-Abs (FIRE 2022) | ~7,000 judgments | ~5,500 words | ~350 words
Custom SC judgments | ~2,000 curated | ~8,000 words | ~500 words (manual)

Tokenization Challenges for Legal Text

  • Extreme length: A 10,000-word judgment exceeds BERT's 512-token limit by 20× or more, since each word becomes one or more subword tokens. You need either a Longformer (4,096 tokens) or a chunking strategy.
  • Legal vocabulary: Terms like "suo motu," "certiorari," "mandamus," "res judicata" are Latin loan-words used in Indian courts but rare in general-purpose training text, so tokenizers break them into uninformative fragments.
  • Hindi legal terms: "याचिका" (petition), "अपील" (appeal), "निर्णय" (judgment) appear when courts cite Hindi-language proceedings.
  • Section references: "Section 302 IPC," "Article 21," "CrPC 482" are not natural language but structured legal codes.
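The chunking strategy mentioned in the first bullet can be sketched as a sliding window with overlap, so that a sentence cut at one boundary still appears whole in the neighbouring chunk. The window and stride values below are illustrative assumptions; in practice you would leave room for special tokens such as [CLS] and [SEP].

```python
def chunk_tokens(tokens, window=512, stride=384):
    """Split a token list into overlapping windows of at most `window` tokens.

    Consecutive windows start `stride` tokens apart, so they overlap
    by `window - stride` tokens.
    """
    if stride <= 0 or stride > window:
        raise ValueError("stride must be between 1 and window")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break                      # this window reached the end of the document
        start += stride
    return chunks

tokens = list(range(1300))             # stand-in for the token ids of a long judgment
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each chunk is then encoded separately and the per-chunk outputs are combined, for example by averaging sentence scores or by summarizing each chunk and merging the results.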

Model Architecture: Two Approaches

Approach A: Extractive-then-Abstractive (Recommended)

Full Judgment (10,000 words)
    ▼
STAGE 1: Extractive
    LegalBERT → score each sentence by importance
    Keep top 1,500 words
    ▼
STAGE 2: Abstractive
    BART / mT5 encoder-decoder on extracted text → final summary
    ▼
Summary (300-500 words)

Approach B: Long-Document Transformer

Full Judgment (10,000 words)
    ▼
LED (Longformer Encoder Decoder), handles up to 16,384 tokens
    ▼
Summary (300-500 words)

HuggingFace Implementation

Python
# Project 2: Indian Legal Document Summarization
import torch
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
from datasets import load_dataset
import evaluate
import numpy as np
import re

# --- Step 1: Legal text preprocessing ---
def preprocess_legal_text(text):
    """Clean Indian legal judgment text."""
    # Normalize section references
    text = re.sub(r'Section\s+(\d+)', r'Section_\1', text)
    text = re.sub(r'Article\s+(\d+)', r'Article_\1', text)
    # Remove page numbers and formatting artifacts
    text = re.sub(r'\n\d+\n', '\n', text)
    text = re.sub(r'\s+', ' ', text).strip()
    # Truncate to manageable length for encoder
    words = text.split()
    if len(words) > 3000:
        # Keep first 1500 + last 1500 words (intro + conclusion)
        words = words[:1500] + ['[...]'] + words[-1500:]
    return ' '.join(words)

# --- Step 2: Extractive pre-filtering ---
def extractive_filter(text, top_k=30):
    """Keep the top_k highest-scoring sentences, in original document order."""
    # Naive split on '.': this also breaks on abbreviations such as "v." or "No."
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    # Simple heuristic: score sentences with legal keywords
    legal_keywords = [
        'held', 'order', 'appeal', 'petitioner', 'respondent',
        'court', 'judgment', 'section', 'article', 'contention',
        'submitted', 'evidence', 'guilty', 'acquit', 'convicted',
        'constitutional', 'fundamental', 'right', 'violation'
    ]
    n = len(sentences)
    scored = []
    for i, sent in enumerate(sentences):
        lowered = sent.lower()
        score = sum(1 for kw in legal_keywords if kw in lowered)
        # Boost first and last 20% of document (facts + holding)
        if i < n * 0.2 or i > n * 0.8:
            score += 2
        scored.append((score, i, sent))
    # Pick the best sentences, then restore document order so the
    # abstractive stage sees facts before arguments before the holding
    top = sorted(scored, key=lambda t: t[0], reverse=True)[:top_k]
    top.sort(key=lambda t: t[1])
    return '. '.join(sent for _, _, sent in top)

# --- Step 3: Load mT5 for multilingual summarization ---
# Note: the public mT5 checkpoint is pre-trained only. It must be fine-tuned
# on (judgment, summary) pairs before generate() produces usable summaries.
MODEL_NAME = "google/mt5-base"  # or "ai4bharat/IndicBART"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# --- Step 4: Summarization function ---
def summarize_judgment(judgment_text, max_input=1024, max_output=256):
    # Stage 1: Extract key sentences
    filtered = extractive_filter(judgment_text)
    # Stage 2: Abstractive summarization
    prefix = "summarize: "
    inputs = tokenizer(
        prefix + filtered,
        max_length=max_input,
        truncation=True,
        return_tensors="pt"
    )
    summary_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # lets the encoder ignore padding
        max_length=max_output,
        min_length=60,
        num_beams=4,
        length_penalty=2.0,
        no_repeat_ngram_size=3,
        early_stopping=True
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# --- Step 5: Evaluation with ROUGE ---
rouge = evaluate.load("rouge")

def evaluate_summary(prediction, reference):
    results = rouge.compute(
        predictions=[prediction],
        references=[reference]
    )
    print(f"ROUGE-1: {results['rouge1']:.4f}")
    print(f"ROUGE-2: {results['rouge2']:.4f}")
    print(f"ROUGE-L: {results['rougeL']:.4f}")
    return results

# Example usage
sample_judgment = """
The petitioner filed a writ petition under Article 32 of the Constitution 
challenging the constitutional validity of Section 66A of the Information 
Technology Act, 2000. The petitioner contended that the provision was vague 
and overbroad, violating the fundamental right to free speech under 
Article 19(1)(a). The respondent Union of India argued that reasonable 
restrictions under Article 19(2) were applicable. After hearing both sides, 
this Court held that Section 66A is unconstitutional as it creates a 
chilling effect on free speech and is not saved by Article 19(2). 
The impugned section is struck down.
"""

summary = summarize_judgment(sample_judgment)
print("Generated Summary:", summary)

Metrics: Extractive vs. Abstractive on ILDC

Metric (System) | Score
ROUGE-1 (Extractive) | 0.42
ROUGE-1 (mT5) | 0.38
ROUGE-1 (Hybrid) | 0.46
ROUGE-L (Hybrid) | 0.21
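The gap between ROUGE-1 and ROUGE-L is easier to interpret once you see what ROUGE-L measures: the longest common subsequence (LCS) of words between prediction and reference, which rewards matching word order rather than just matching words. The sketch below is a simplified version that uses whitespace tokenization and no stemming, so its numbers will not exactly match the `evaluate` implementation used earlier.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 from LCS-based precision and recall."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = "the court struck down section 66a"
pred = "section 66a was struck down by the court"
print(lcs_length(pred.split(), ref.split()))
print(round(rouge_l_f1(pred, ref), 4))
```

In this toy pair every reference word appears in the prediction, so unigram recall is perfect, yet the reordering leaves an LCS of only two words. A summary that paraphrases and reorders heavily will show the same pattern: a respectable ROUGE-1 with a much lower ROUGE-L.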

Error Analysis

Error Type | Example | Root Cause
Hallucinated sections | Summary mentions "Section 420 IPC" which doesn't appear in the judgment | Decoder generating plausible-sounding but incorrect legal citations
Missing holding | Summary describes facts and arguments but omits the final decision | Holding usually appears at the end; truncation drops it
Party confusion | Attributes petitioner's argument to respondent | Long-range dependency: parties are defined 2000+ words before their arguments
Hindi terms dropped | "याचिका खारिज" (petition dismissed) not captured | mT5 underrepresents Hindi legal vocabulary
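Hallucinated citations are one of the few error types that can be caught mechanically. As a minimal sketch, the check below extracts "Section N" and "Article N" references with a regular expression and flags any that appear in the summary but not in the source judgment. The pattern and function names are illustrative assumptions and cover only the simplest citation formats.

```python
import re

# Matches e.g. "Section 66A", "section 302", "Article 21"
CITATION = re.compile(r"\b(Section|Article)\s+(\d+[A-Z]?)", re.IGNORECASE)

def citations(text):
    """Return the set of (kind, number) citations, e.g. ('section', '66A')."""
    return {(kind.lower(), num.upper()) for kind, num in CITATION.findall(text)}

def unsupported_citations(summary, judgment):
    """Citations that appear in the summary but nowhere in the source judgment."""
    return citations(summary) - citations(judgment)

judgment = "The petitioner challenged Section 66A under Article 19 of the Constitution."
summary = "The Court struck down Section 66A and discussed Section 420 IPC."
print(unsupported_citations(summary, judgment))
```

A non-empty result is a strong signal that the summary should be regenerated or sent for human review. The check does not catch citations that exist in the judgment but are attached to the wrong claim.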

"ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation" (Malik et al., ACL 2021)

This landmark paper introduced the first large-scale corpus of Indian Supreme Court judgments (~35K cases) with decision labels for prediction and expert-annotated explanations on a subset. The authors showed that LegalBERT pre-trained on Indian legal text outperforms generic BERT by 4.2% accuracy on judgment prediction. The dataset is freely available and has become the standard benchmark for Indian legal NLP.

Key insight: Legal summarization is harder than news summarization because the "answer" (the holding) is structurally unpredictable: it could be in the first paragraph, the last paragraph, or buried in the middle.

P3

IRCTC Chatbot: Seq2Seq with Attention

Bilingual Booking Query Handler

Problem Statement

Build a conversational AI system that handles Indian Railways booking queries in Hindi, English, and Hinglish. The system must understand queries like:

  • "Delhi se Mumbai ka train kitne baje hai?" (What time is the train from Delhi to Mumbai?)
  • "मुझे कल दिल्ली से जयपुर जाना है" (I need to go from Delhi to Jaipur tomorrow)
  • "PNR status check karo 4521876543" (Check PNR status for 4521876543)
  • "Cancel my ticket from last week"

Architecture: Intent + Slot Filling + Response Generation

IRCTC Chatbot Architecture

User: "Delhi se Mumbai 3 tareekh ko train dikhao"
    ▼
Language ID → Hinglish   |   Script Detection → Roman
    ▼
Intent Classifier (BERT)
    Intent: SEARCH_TRAIN
    ▼
Slot Filler (Token Clf.)
    source: Delhi
    dest: Mumbai
    date: 3 tareekh (3rd)
    ▼
Response Generator (Seq2Seq + Attention / Template)
    ▼
Bot: "Delhi → Mumbai, 3 तारीख को ये trains available हैं: [Rajdhani 12:05, Duronto...]"
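The script detection step at the top of this architecture can be approximated without any model. As a rough sketch, the function below counts characters in the Devanagari Unicode block (U+0900 to U+097F) against ASCII letters. The function name and the four return labels are illustrative assumptions, not part of the chapter's pipeline.

```python
def detect_script(text):
    """Classify text as 'devanagari', 'roman', 'mixed', or 'other'."""
    dev = sum(1 for ch in text if "\u0900" <= ch <= "\u097F")
    rom = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if dev and rom:
        return "mixed"
    if dev:
        return "devanagari"
    if rom:
        return "roman"
    return "other"   # digits, punctuation, or another script

print(detect_script("Delhi se Mumbai 3 tareekh ko train dikhao"))
print(detect_script("मुझे कल जाना है"))
print(detect_script("3 तारीख को trains"))
```

Script is not the same as language: the first query is Roman script but Hinglish, which is why the architecture keeps language identification as a separate step.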

Part A: Intent Classification

Python
# Part A: Intent classification for IRCTC queries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Define intents
INTENTS = [
    'SEARCH_TRAIN',       # Find trains between stations
    'CHECK_PNR',          # PNR status inquiry
    'BOOK_TICKET',        # Book a ticket
    'CANCEL_TICKET',      # Cancel booking
    'SEAT_AVAILABILITY',  # Check seat availability
    'TRAIN_STATUS',       # Live running status
    'GENERAL_QUERY',      # General help
]

# Training data examples (would be 1000+ in practice)
training_examples = [
    ("Delhi se Mumbai train dikhao", 0),
    ("दिल्ली से मुंबई की ट्रेन बताओ", 0),
    ("Show me trains from Delhi to Mumbai", 0),
    ("PNR check karo 4521876543", 1),
    ("mera PNR status kya hai", 1),
    ("ticket book karna hai kal ke liye", 2),
    ("मुझे टिकट बुक करनी है", 2),
    ("cancel my ticket please", 3),
    ("meri ticket cancel karo", 3),
    ("kitni seat available hai?", 4),
    ("train kahan tak pahunchi?", 5),
    ("help chahiye", 6),
]

# Load MuRIL for intent classification
model_name = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
intent_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(INTENTS)
)

def classify_intent(query):
    inputs = tokenizer(query, return_tensors="pt",
                       truncation=True, max_length=64)
    with torch.no_grad():
        logits = intent_model(**inputs).logits
    pred = logits.argmax(dim=-1).item()
    confidence = torch.softmax(logits, dim=-1).max().item()
    return INTENTS[pred], confidence
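
Because `classify_intent` returns a confidence next to the label, a caller can decline to act on uncertain predictions. The following is a minimal sketch, assuming a hypothetical `route_intent` helper and an illustrative threshold of 0.6; neither comes from the chapter's code, and the threshold would need tuning on validation data.

```python
# Hypothetical helper: fall back to GENERAL_QUERY when the classifier is unsure.
# The 0.6 threshold is an illustrative assumption, not a recommended value.
FALLBACK_INTENT = 'GENERAL_QUERY'

def route_intent(intent, confidence, threshold=0.6):
    """Return the predicted intent, or the fallback if confidence is too low."""
    if confidence < threshold:
        return FALLBACK_INTENT
    return intent

print(route_intent('BOOK_TICKET', 0.94))    # confident prediction is kept
print(route_intent('CANCEL_TICKET', 0.41))  # uncertain prediction falls back
```

Routing low-confidence queries to a help prompt is usually safer than, say, cancelling a ticket on a misread intent.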

Part B: Slot Extraction (Named Entity Recognition)

Python
# Part B: Slot filling using token classification
from transformers import AutoModelForTokenClassification

# Slot labels (BIO format)
SLOT_LABELS = [
    'O',          # Outside
    'B-SOURCE',   # Source station start
    'I-SOURCE',   # Source station continuation
    'B-DEST',     # Destination start
    'I-DEST',     # Destination continuation
    'B-DATE',     # Date start
    'I-DATE',     # Date continuation
    'B-PNR',      # PNR number
    'I-PNR',
    'B-CLASS',    # Travel class (Sleeper, AC, etc.)
    'I-CLASS',
    'B-TRAIN',    # Train name/number
    'I-TRAIN',
]

# Example annotation (tags use the exact names from SLOT_LABELS)
# "Delhi     se  Mumbai  3       tareekh  ko  AC       mein  dikhao"
#  B-SOURCE  O   B-DEST  B-DATE  I-DATE   O   B-CLASS  O     O

slot_model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(SLOT_LABELS)
)

def extract_slots(query):
    inputs = tokenizer(query, return_tensors="pt",
                       truncation=True, max_length=64,
                       return_offsets_mapping=True)
    # Character offsets map each subword back to a span of the original query
    offsets = inputs.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        logits = slot_model(**inputs).logits
    preds = logits.argmax(dim=-1)[0].tolist()

    slots = {}
    current_slot = None   # slot type currently being collected
    span = None           # [start, end) character span of that slot

    for (start, end), pred_id in zip(offsets, preds):
        if start == end:
            continue  # special tokens such as [CLS]/[SEP] have empty offsets
        label = SLOT_LABELS[pred_id]
        if label.startswith('B-'):
            if current_slot:
                slots[current_slot] = query[span[0]:span[1]]
            current_slot = label[2:]
            span = [start, end]
        elif label.startswith('I-') and current_slot == label[2:]:
            span[1] = end  # extend the slot across subwords and following words
        else:
            if current_slot:
                slots[current_slot] = query[span[0]:span[1]]
            current_slot = None

    if current_slot:
        slots[current_slot] = query[span[0]:span[1]]
    return slots

# ─── Part C: Response Generation ───
def generate_response(intent, slots):
    """Template-based response with bilingual output."""
    if intent == 'SEARCH_TRAIN':
        src = slots.get('SOURCE', '?')
        dst = slots.get('DEST', '?')
        date = slots.get('DATE', 'today')
        return (f"{src} → {dst} ke liye {date} ko trains:\n"
                f"1. Rajdhani Express (12:05 PM)\n"
                f"2. Duronto Express (11:20 PM)\n"
                f"3. Mumbai Mail (09:00 PM)")
    elif intent == 'CHECK_PNR':
        pnr = slots.get('PNR', 'N/A')
        return f"PNR {pnr} ka status: Confirmed (S5, Berth 42)"
    else:
        return "Main aapki kaise madad kar sakta hoon?"

# ─── Full pipeline ───
def chatbot_pipeline(user_query):
    intent, confidence = classify_intent(user_query)
    slots = extract_slots(user_query)
    response = generate_response(intent, slots)
    return {
        'query': user_query,
        'intent': intent,
        'confidence': confidence,
        'slots': slots,
        'response': response
    }

# Test (the output below assumes both models have been fine-tuned;
# freshly loaded classification heads are randomly initialized)
result = chatbot_pipeline("Delhi se Mumbai kal AC mein train dikhao")
print(result)
{'query': 'Delhi se Mumbai kal AC mein train dikhao', 'intent': 'SEARCH_TRAIN', 'confidence': 0.94, 'slots': {'SOURCE': 'Delhi', 'DEST': 'Mumbai', 'DATE': 'kal', 'CLASS': 'AC'}, 'response': 'Delhi → Mumbai ke liye kal ko trains:\n1. Rajdhani Express...'}
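
The BIO decoding step of the slot extractor can be checked in isolation, without loading a model. The sketch below assumes word-level labels are already available (here, the hand-annotated example from Part B). `decode_bio` is a hypothetical helper name, not part of the pipeline above.

```python
def decode_bio(words, labels):
    """Group BIO-tagged words into a {slot_type: value} dictionary."""
    slots, current, value = {}, None, []
    for word, label in zip(words, labels):
        if label.startswith('B-'):
            if current:                      # close the previous slot
                slots[current] = ' '.join(value)
            current, value = label[2:], [word]
        elif label.startswith('I-') and current == label[2:]:
            value.append(word)               # continue the open slot
        else:
            if current:
                slots[current] = ' '.join(value)
            current, value = None, []
    if current:                              # flush a slot that ends the sentence
        slots[current] = ' '.join(value)
    return slots

words = "Delhi se Mumbai 3 tareekh ko AC mein dikhao".split()
labels = ['B-SOURCE', 'O', 'B-DEST', 'B-DATE', 'I-DATE', 'O', 'B-CLASS', 'O', 'O']
print(decode_bio(words, labels))
# {'SOURCE': 'Delhi', 'DEST': 'Mumbai', 'DATE': '3 tareekh', 'CLASS': 'AC'}
```

Testing the decoder on gold labels first separates decoding bugs from model errors.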

Metrics

Intent Accuracy: 92.3%
Slot F1: 87.6%
End-to-End Accuracy: 84.1%
Human Rating: 3.8/5

The following code is supposed to extract the PNR number from a Hinglish query, but it has 3 bugs. Can you find them?

def extract_pnr(query):
    import re
    # Bug 1: Pattern issue
    pnr_pattern = r'\d{5}'    # PNR is 10 digits!
    match = re.search(pnr_pattern, query)
    if match:
        # Bug 2: Wrong method
        return match.groups()  # .group() not .groups()
    # Bug 3: Missing return
    # If no match, function returns None implicitly
    # Should return empty string or raise exception

result = extract_pnr("PNR check karo 4521876543")
print("PNR:", result)  # Gives wrong output!
Fixed version:
Bug 1: Change \d{5} to \d{10}, because PNR numbers are 10 digits
Bug 2: Change match.groups() to match.group(). .groups() returns only the captured groups (an empty tuple here, since the pattern has none), while .group() returns the full match
Bug 3: Add return "" at the end for the no-match case
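
Putting the three fixes together gives a version along these lines. The digit lookarounds are an extra assumption beyond the three listed fixes: they stop the pattern from matching the first 10 digits of a longer number, such as a 12-digit phone number.

```python
import re

def extract_pnr(query):
    """Return the first standalone 10-digit number in the query, or "" if none."""
    # (?<!\d) and (?!\d) reject matches that are part of a longer digit run
    match = re.search(r'(?<!\d)\d{10}(?!\d)', query)
    if match:
        return match.group()   # full match, not the (empty) tuple of groups
    return ""                  # explicit result for the no-match case

print("PNR:", extract_pnr("PNR check karo 4521876543"))
```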
P4

Fake News Detection for Indian WhatsApp

BERT Classifier for Hindi Misinformation

Problem Statement

Given a message forwarded on WhatsApp (in Hindi, English, or Hinglish), classify it as REAL or FAKE. The model must handle health misinformation, political propaganda, communal rumors, and scientific hoaxes common in Indian WhatsApp groups.

Why WhatsApp Fake News Is Unique to India

India has 500+ million WhatsApp users, the largest user base in the world. Unlike Twitter (public), WhatsApp messages are end-to-end encrypted and forwarded privately, which makes misinformation much harder to detect at the platform level. During COVID-19, messages like "5G towers spread corona" and "cow urine cures COVID" led to real-world harm. In 2018, WhatsApp misinformation led to mob lynchings in multiple Indian states, prompting WhatsApp to limit forwarding. Automated detection can therefore help prevent real-world harm.

Dataset

Dataset | Size | Languages | Source
CONSTRAINT 2021 (Hindi) | ~21,000 | Hindi (Devanagari) | Fact-checking websites + WhatsApp
FakeNewsNet (English) | ~23,000 | English | PolitiFact + GossipCop
Fake-News-Hindi | ~9,500 | Hindi + Hinglish | Indian fact-checkers (AltNews, BoomLive)
Custom WhatsApp | ~3,000 | Mixed | Crowdsourced + manual annotation

Feature Engineering for WhatsApp Text

Linguistic Features Unique to Fake Messages

1. Sensationalism markers

Fake messages use excessive exclamation marks, ALL CAPS, and urgency words: "URGENT!!! जल्दी forward करो!!!" Real news rarely uses this tone.

2. Source absence

Fake messages rarely cite verifiable sources. Real news mentions "according to PTI," "as per MoHFW data." We encode this as a binary feature: has_source = 1 if message contains source indicators.

3. Forward indicators

Phrases like "Forwarded as received," "Doctor ne bataya," and "NASA ne confirm kiya" (NASA confirmed) are markers of viral forwards, not original content.

4. Emotional manipulation

Fake messages appeal to fear, anger, or nationalism: "Desh ke liye share karo!" (Share for the nation!). Sentiment extremity (very positive or very negative) correlates with fakeness.

HuggingFace Implementation

Python
# Project 4: Fake News Detection for Indian WhatsApp Messages
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import classification_report
import re
import numpy as np

# ─── Step 1: Feature extraction ───
def extract_meta_features(text):
    """Extract surface-level features that signal fakeness."""
    features = []
    lowered = text.lower()
    # Exclamation density
    features.append(text.count('!') / max(len(text), 1))
    # CAPS ratio
    caps = sum(1 for c in text if c.isupper())
    features.append(caps / max(len(text), 1))
    # Has source citation: keywords are matched case-insensitively, acronyms
    # case-sensitively as whole words (so "WHO" does not fire on "who")
    source_words = ['according', 'source', 'study', 'report', 'research']
    source_acronyms = ['PTI', 'ANI', 'WHO']
    has_source = (any(w in lowered for w in source_words)
                  or any(re.search(rf'\b{a}\b', text) for a in source_acronyms))
    features.append(1.0 if has_source else 0.0)
    # Forward indicators
    fwd_words = ['forward', 'share', 'viral', 'forwarded',
                 'भेजो', 'शेयर', 'forward करो']
    features.append(1.0 if any(w in lowered for w in fwd_words) else 0.0)
    # Message length (normalized)
    features.append(min(len(text) / 1000, 1.0))
    return features

# ─── Step 2: BERT + Meta Features Model ───
class FakeNewsDetector(nn.Module):
    def __init__(self, bert_model_name, n_meta_features=5, n_classes=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(0.3)
        # BERT hidden (768 for base models) + meta features (5) → classifier
        hidden_size = self.bert.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size + n_meta_features, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, n_classes)
        )

    def forward(self, input_ids, attention_mask, meta_features):
        bert_out = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        cls_embedding = bert_out.last_hidden_state[:, 0, :]  # [CLS] token
        cls_embedding = self.dropout(cls_embedding)
        # Concatenate BERT [CLS] with meta features
        combined = torch.cat([cls_embedding, meta_features], dim=1)
        logits = self.classifier(combined)
        return logits

# ─── Step 3: Initialize ───
MODEL_NAME = "google/muril-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = FakeNewsDetector(MODEL_NAME)

# ─── Step 4: Inference function ───
def detect_fake_news(text):
    # Tokenize
    encoding = tokenizer(
        text, max_length=256, padding='max_length',
        truncation=True, return_tensors='pt'
    )
    # Extract meta features
    meta = torch.tensor([extract_meta_features(text)], dtype=torch.float)

    model.eval()
    with torch.no_grad():
        logits = model(
            encoding['input_ids'],
            encoding['attention_mask'],
            meta
        )
    probs = torch.softmax(logits, dim=-1)
    pred = probs.argmax(dim=-1).item()
    label = "FAKE" if pred == 1 else "REAL"
    return label, probs[0].tolist()

# Test (labels are only meaningful after training; the classifier head
# above starts from random weights)
test_messages = [
    "BREAKING: गर्म पानी पीने से COVID ठीक होता है!!! SHARE करो सबको!!!",
    "According to WHO, COVID-19 vaccines have undergone rigorous clinical trials.",
    "NASA ne confirm kiya hai ki kal raat 2 baje earthquake aayega India mein!!!",
]
for msg in test_messages:
    label, probs = detect_fake_news(msg)
    print(f"[{label}] (conf: {max(probs):.2f}) → {msg[:50]}...")

Expected Metrics

F1 Score: 0.89
Precision: 0.91
Recall: 0.87
Accuracy: 0.90
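
As a quick consistency check, the reported F1 should be the harmonic mean of the reported precision and recall. A small sketch (the helper name `f1_score` is ours, not the scikit-learn function):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision 0.91 and recall 0.87 from the metrics above
print(round(f1_score(0.91, 0.87), 2))  # 0.89
```

The same check is worth running on any results table you read, since an F1 outside the range spanned by precision and recall signals a reporting error.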

Error Analysis

Error Type | Example | Challenge | Freq
Satire misclassified | "Modi ji ne chand pe airport banane ka announcement kiya" (satirical) | Looks fake but is intentional humor | ~14%
Partial truth | Real event + fabricated details | Contains real names/dates mixed with fiction | ~12%
Old news reshared | Real news from 2019 shared as "breaking" in 2024 | Content is factually true but contextually misleading | ~9%
Regional language | Fake news in Tamil/Telugu transliterated to Roman | Model trained on Hindi/English, not Dravidian | ~7%
Topic: Text Classification using BERT
Key Equation: P(class | text) = softmax(W · h_[CLS] + b), where h_[CLS] is the [CLS] token's final hidden state
Input format: [CLS] token₁ token₂ ... tokenₙ [SEP]
Fine-tuning: Unfreeze all layers + add classification head; lr = 2e-5 to 5e-5
GATE trap: BERT is encoder-only → cannot generate text. Use for classification, NER, QA (extractive). For generation, use GPT (decoder-only) or T5 (encoder-decoder)
Quick recall: BERT-base = 12 layers, 768 hidden, 12 heads, 110M params
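
The key equation's last step, turning the classification head's logits into class probabilities, can be traced by hand. A minimal sketch in plain Python; the logit values are made up for illustration and stand in for W · h_[CLS] + b.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                            # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes [REAL, FAKE]
logits = [0.2, 2.3]
probs = softmax(logits)
print([round(p, 3) for p in probs])  # [0.109, 0.891]
```

Only the difference between logits matters: adding the same constant to both leaves the probabilities unchanged.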
P5

Neural Machine Translation: Hindi ↔ English

Transformer with Attention Visualization & BLEU Scoring

Problem Statement

Build a Hindi→English and English→Hindi translation system using the Transformer architecture. Implement attention visualization to understand word alignment, and evaluate using BLEU score.

Mathematical Foundation: BLEU Score

Deriving BLEU from First Principles

BLEU (Bilingual Evaluation Understudy) measures how close a machine translation is to human reference translations. Let's derive it step by step.

Step 1: Modified n-gram Precision

For each n-gram size (1, 2, 3, 4), we compute:

$$p_n = \frac{\sum_{\text{sentence}} \sum_{\text{ngram}} \min\big(\text{count}_{\text{candidate}}(\text{ngram}),\ \text{count}_{\text{reference}}(\text{ngram})\big)}{\sum_{\text{sentence}} \sum_{\text{ngram}} \text{count}_{\text{candidate}}(\text{ngram})}$$

The min is the "clipping": it prevents gaming the metric by repeating common words.

Step 2: Example Calculation

Reference: "The cat sat on the mat"
Candidate: "The the the the the the"

Without clipping: p₁ = 6/6 = 1.0 (all "the" match!)
With clipping: max count of "the" in reference = 2, so p₁ = 2/6 = 0.33 ✓
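
The clipping example can be reproduced in a few lines. This sketch assumes whitespace tokenization and lowercasing (so that "The" and "the" count as the same unigram, as the worked example does); real BLEU implementations apply their own tokenizers.

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Return (clipped matches, total candidate unigrams)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Each candidate word is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref[tok]) for tok, count in cand.items())
    return clipped, sum(cand.values())

clipped, total = clipped_unigram_precision(
    "The the the the the the", "The cat sat on the mat")
print(f"p1 = {clipped}/{total} = {clipped / total:.2f}")  # p1 = 2/6 = 0.33
```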

Step 3: Brevity Penalty

Short translations have artificially high precision. The brevity penalty (BP) penalizes them:

$$\mathrm{BP} = \begin{cases} \exp(1 - r/c) & \text{if } c < r \\ 1 & \text{if } c \ge r \end{cases}$$

where c = candidate length, r = reference length
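
A short numeric check of the brevity penalty, written directly from the definition above. The guard for an empty candidate is our own addition to avoid division by zero.

```python
import math

def brevity_penalty(c, r):
    """BP = exp(1 - r/c) if the candidate is shorter than the reference, else 1."""
    if c == 0:
        return 0.0  # assumption: an empty candidate gets no credit
    return math.exp(1 - r / c) if c < r else 1.0

print(round(brevity_penalty(5, 6), 4))  # 0.8187 (candidate one word short)
print(brevity_penalty(6, 6))            # 1.0 (equal length, no penalty)
print(brevity_penalty(9, 6))            # 1.0 (longer candidates are not penalized here)
```

Note that BP never exceeds 1: overly long candidates are penalized through lower n-gram precision instead.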

Step 4: Final BLEU

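$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{4} w_n \log p_n\Big)$$

Typically the weights are equal, w_1 = w_2 = w_3 = w_4 = 0.25, which gives:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\tfrac{1}{4} \sum_{n=1}^{4} \log p_n\Big), \qquad \mathrm{BP} = \min\big(1,\ e^{\,1 - r/c}\big)$$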
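
The final combination step is a brevity-weighted geometric mean of the four precisions. A minimal sketch with made-up precision values; `combine_bleu` is a hypothetical helper name, and returning 0 when any precision is zero is one common convention (unsmoothed BLEU).

```python
import math

def combine_bleu(precisions, bp=1.0):
    """BLEU = BP * exp(mean of log precisions), with equal weights."""
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    log_avg = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_avg)

# Illustrative precisions p1..p4 (assumed values, not from a real system)
print(round(combine_bleu([0.8, 0.6, 0.4, 0.2]), 4))          # 0.4427
print(round(combine_bleu([0.8, 0.6, 0.4, 0.2], bp=0.8187), 4))
```

Because the mean is geometric, one weak precision (typically p₄) pulls the score down far more than it would in an arithmetic average.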

Tokenization: SentencePiece for Hindi

Python
# Why we need SentencePiece for Hindi
# Hindi is morphologically rich: a single verb form packs tense, person, and gender
# "खाऊँगी" = खा (eat) + ऊँगी (will, feminine) → "I will eat" (said by a woman)

import sentencepiece as spm

# Train SentencePiece model on Hindi corpus
# spm.SentencePieceTrainer.train(
#     input='hindi_corpus.txt',
#     model_prefix='hindi_sp',
#     vocab_size=32000,
#     character_coverage=0.9995,  # High coverage for Devanagari
#     model_type='bpe'
# )

# Compare tokenization approaches
from transformers import AutoTokenizer

sentence = "भारत के प्रधानमंत्री ने आज एक महत्वपूर्ण बैठक की।"

# mBERT tokenizer (WordPiece)
tok1 = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print("mBERT:", tok1.tokenize(sentence))
# Illustrative output; exact splits depend on the tokenizer vocabulary:
# ['भारत', 'के', 'प्रधानमंत्री', 'ने', 'आज', 'एक', 'महत्व', '##पूर्ण', 'बैठक', 'की', '।']
# Note: "महत्वपूर्ण" (important) split into "महत्व" + "##पूर्ण"

# IndicTrans tokenizer (SentencePiece)
# IndicTrans2 checkpoints ship custom tokenizer code, hence trust_remote_code=True
tok2 = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-1B",
                                     trust_remote_code=True)
print("IndicTrans:", tok2.tokenize(sentence))
# Better handling of compound Hindi words
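
One reason Devanagari tokenization needs care is that a written word is a sequence of base letters plus combining vowel signs and viramas, so "characters" on screen do not map one-to-one to code points. The sketch below uses only the standard library to inspect the word दिल्ली (Delhi); it is meant as a diagnostic aid, not as part of any tokenizer.

```python
import unicodedata

# "दिल्ली" (Delhi), written with escapes so the code points are unambiguous
word = "\u0926\u093f\u0932\u094d\u0932\u0940"

for ch in word:
    # Category 'Lo' = base letter, 'Mc'/'Mn' = combining marks
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")

marks = sum(1 for ch in word if unicodedata.category(ch).startswith('M'))
print(f"{len(word)} code points, {marks} of them combining marks")
```

A tokenizer that splits between a consonant and its vowel sign, or drops rare marks because character coverage is set too low, produces pieces that no longer render as valid text.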

Full Implementation: Hindi↔English NMT

Python
# Project 5: Hindi↔English Neural Machine Translation
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
    DataCollatorForSeq2Seq
)
from datasets import load_dataset
import evaluate
import numpy as np
import torch

# ─── Step 1: Choose a model ───
# IndicTrans2 (state-of-the-art for Indian languages), listed for reference
MODEL_NAME = "ai4bharat/indictrans2-en-indic-1B"
# This walkthrough loads the lighter Helsinki-NLP/opus-mt-en-hi (English → Hindi only)
LITE_MODEL = "Helsinki-NLP/opus-mt-en-hi"

tokenizer = AutoTokenizer.from_pretrained(LITE_MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(LITE_MODEL)

# ─── Step 2: Translation function ───
def translate_en_to_hi(text, max_length=128):
    inputs = tokenizer(text, return_tensors="pt",
                       max_length=max_length, truncation=True)
    translated = model.generate(
        **inputs,
        max_length=max_length,
        num_beams=5,
        length_penalty=1.0,
        no_repeat_ngram_size=2,
        early_stopping=True
    )
    result = tokenizer.decode(translated[0], skip_special_tokens=True)
    return result

# ─── Step 3: BLEU Score Computation ───
import sacrebleu

def compute_bleu(predictions, references):
    """Compute BLEU score using sacrebleu."""
    bleu = sacrebleu.corpus_bleu(predictions, [references])
    print(f"BLEU Score: {bleu.score:.2f}")
    print(f"Precisions: {[f'{p:.1f}' for p in bleu.precisions]}")
    print(f"Brevity Penalty: {bleu.bp:.4f}")
    print(f"Length Ratio: {bleu.sys_len}/{bleu.ref_len} = {bleu.sys_len/bleu.ref_len:.3f}")
    return bleu

# ─── Step 4: By-hand BLEU calculation (for understanding) ───
from collections import Counter

def bleu_from_scratch(candidate, reference, max_n=4):
    """Compute BLEU score from scratch to see the math in action."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()

    # Step 1: Compute clipped n-gram precisions
    precisions = []
    for n in range(1, max_n + 1):
        # Extract n-grams
        cand_ngrams = [tuple(cand_tokens[i:i+n]) for i in range(len(cand_tokens) - n + 1)]
        ref_ngrams = [tuple(ref_tokens[i:i+n]) for i in range(len(ref_tokens) - n + 1)]

        # Count
        cand_counts = Counter(cand_ngrams)
        ref_counts = Counter(ref_ngrams)

        # Clipped count
        clipped = 0
        total = 0
        for ngram, count in cand_counts.items():
            clipped += min(count, ref_counts.get(ngram, 0))
            total += count

        precision = clipped / max(total, 1)
        precisions.append(precision)
        print(f"  p{n} = {clipped}/{total} = {precision:.4f}")

    # Step 2: Brevity penalty
    import math
    c = len(cand_tokens)
    r = len(ref_tokens)
    bp = math.exp(1 - r/c) if c < r else 1.0
    print(f"  BP = {bp:.4f} (candidate: {c}, reference: {r})")

    # Step 3: BLEU = BP × geometric mean of precisions
    # (max(p, 1e-10) avoids log(0); a single zero precision still drives BLEU to ~0)
    log_avg = sum(math.log(max(p, 1e-10)) for p in precisions) / max_n
    bleu = bp * math.exp(log_avg)
    print(f"  BLEU = {bleu:.4f}")
    return bleu

# Example
ref = "India is a diverse country with many languages"
cand = "India is a country with many diverse languages"
print("=== By-hand BLEU ===")
bleu_from_scratch(cand, ref)

# ─── Step 5: Attention Visualization ───
def visualize_attention(source, target_prefix=""):
    """Extract and visualize cross-attention weights."""
    inputs = tokenizer(source, return_tensors="pt")

    # Generate with attention outputs
    outputs = model.generate(
        **inputs,
        max_length=64,
        num_beams=1,
        output_attentions=True,
        return_dict_in_generate=True,
    )

    # Decode tokens
    src_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    tgt_tokens = tokenizer.convert_ids_to_tokens(outputs.sequences[0])

    print(f"Source:  {' '.join(src_tokens)}")
    print(f"Target:  {' '.join(tgt_tokens)}")

    # In practice, you'd use matplotlib heatmap here
    # plt.imshow(attention_matrix, cmap='viridis')
    # plt.xticks(range(len(src_tokens)), src_tokens, rotation=45)
    # plt.yticks(range(len(tgt_tokens)), tgt_tokens)
    return src_tokens, tgt_tokens

# Test translations
test_pairs = [
    ("The Supreme Court upheld the fundamental right to privacy.",
     "सर्वोच्च न्यायालय ने निजता के मौलिक अधिकार को बरकरार रखा।"),
    ("Please book a ticket from Delhi to Mumbai.",
     "कृपया दिल्ली से मुंबई तक का टिकट बुक करें।"),
    ("Artificial intelligence is transforming healthcare in India.",
     "कृत्रिम बुद्धिमत्ता भारत में स्वास्थ्य सेवा को बदल रही है।"),
]

for en, hi in test_pairs:
    predicted_hi = translate_en_to_hi(en)
    print(f"EN: {en}")
    print(f"HI (pred): {predicted_hi}")
    print(f"HI (ref):  {hi}")
    print()
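
The by-hand example pair above is a useful warning about sentence-level BLEU. Counting the clipped n-gram matches directly shows that the candidate shares every word with the reference but not a single 4-gram, so the geometric mean collapses. A self-contained sketch (the helper name `ngram_matches` is ours):

```python
from collections import Counter

def ngram_matches(candidate, reference, n):
    """Return (clipped matches, total candidate n-grams) for one sentence pair."""
    c, r = candidate.split(), reference.split()
    cand = Counter(tuple(c[i:i + n]) for i in range(len(c) - n + 1))
    ref = Counter(tuple(r[i:i + n]) for i in range(len(r) - n + 1))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped, sum(cand.values())

ref = "India is a diverse country with many languages"
cand = "India is a country with many diverse languages"
results = [ngram_matches(cand, ref, n) for n in range(1, 5)]
print(results)  # [(8, 8), (4, 7), (2, 6), (0, 5)]
```

With p₄ = 0, unsmoothed BLEU is 0 for a translation a human would call acceptable. This is why BLEU is normally reported at corpus level, or with smoothing when single sentences must be scored.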

Attention Visualization (ASCII)

Cross-Attention Map: English → Hindi Translation

"The Supreme Court upheld the right to privacy" → "सर्वोच्च न्यायालय ने निजता के अधिकार को बरकरार रखा"

            The   Supreme  Court  upheld  the   right  to    privacy
सर्वोच्च      ·     ████     ··     ·       ·     ·      ·     ·
न्यायालय     ·     ··       ████   ·       ·     ·      ·     ·
ने           ·     ·        ·      ··      ·     ·      ·     ·
निजता        ·     ·        ·      ·       ·     ·      ·     ████
के           ·     ·        ·      ·       ███   ·      ·     ·
अधिकार      ·     ·        ·      ·       ·     ████   ·     ·
को           ·     ·        ·      ·       ·     ·      ███   ·
बरकरार      ·     ·        ·      ████    ·     ·      ·     ·
रखा          ·     ·        ·      ██      ··    ·      ·     ·

████ = high attention    ·· = medium    · = low

Key insight: Hindi word order is SOV (Subject-Object-Verb) while English is SVO. The attention learns this reordering! "upheld" (verb, position 4) → "बरकरार रखा" (verb, position 8-9)
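
A text rendering like the map above can be produced from any attention matrix with a few lines of plain Python. The sketch below is an assumption-laden illustration: `render_attention` is a hypothetical helper, the thresholds 0.6 and 0.3 are arbitrary, and the weights are made up instead of being taken from a model.

```python
def render_attention(rows, cols, matrix, high=0.6, medium=0.3):
    """Render attention weights as text: '████' high, '··' medium, '·' low."""
    def cell(weight):
        if weight >= high:
            return '████'
        if weight >= medium:
            return '··'
        return '·'

    width = max(len(c) for c in cols + ['████']) + 2   # column width incl. padding
    lines = [' ' * 12 + ''.join(c.ljust(width) for c in cols)]
    for name, weights in zip(rows, matrix):
        lines.append(name.ljust(12) + ''.join(cell(w).ljust(width) for w in weights))
    return '\n'.join(lines)

# Toy example: two target tokens attending over three source tokens
print(render_attention(
    rows=['nyayalay', 'rakha'],
    cols=['Court', 'upheld', 'the'],
    matrix=[[0.85, 0.10, 0.05],
            [0.05, 0.55, 0.40]],
))
```

Each row of the matrix is one target token's attention distribution over the source tokens, so rows should sum to roughly 1 when they come from a real model.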

Expected Metrics

BLEU (opus-mt): 24.3
BLEU (IndicTrans2): 33.5
BLEU (En→Fr, same arch): 41.2
BLEU (zero-shot GPT): 8.7

Why is the Hindi BLEU score so much lower than French? It's NOT because the model is worse. Hindi and English belong to different branches of the Indo-European family (Indo-Aryan vs. Germanic), use different scripts, have different word orders (SOV vs. SVO), and differ in morphological complexity. French scores higher because French shares far more vocabulary and structure with English. Always compare BLEU scores within the same language pair, never across language pairs.

Error Analysis: Where Hindi NMT Struggles

Error Type | Source (EN) | Predicted (HI) | Correct (HI)
Honorific loss | "You should go to the doctor" | "तुम डॉक्टर के पास जाओ" | "आप डॉक्टर के पास जाइए" (formal)
Gender mismatch | "The teacher went home" | "शिक्षक घर गया" (masculine) | Could be "शिक्षिका घर गयी" (feminine)
Idiom translation | "It's raining cats and dogs" | "बिल्लियां और कुत्ते बरस रहे हैं" (literal!) | "मूसलाधार बारिश हो रही है"
Named entity | "Modi visited Washington" | "मोदी ने वाशिंग्टन का दौरा किया" ✓ | Usually correct for famous entities
Section 4

Visual Aids: The NLP Project Pipeline

Master Pipeline: From Raw Text to Deployed Model

NLP PROJECT PIPELINE (Indian Languages)

1. DATA COLLECT → 2. CLEAN & PREP → 3. TOKENIZATION → 4. MODEL CHOICE → 5. FINE-TUNE → 6. EVAL METRICS → 7. ERROR ANALYSIS → 8. DEPLOY

1. DATA COLLECT: Scrape • API • Crowdsource • Annotate
2. CLEAN & PREP: Script det. • Norm. spell • Handle emoji • Dedup
3. TOKENIZATION: WordPiece • BPE • SentencePiece • IndicNLP
4. MODEL CHOICE: BERT (clf) • T5 (gen) • Seq2Seq (MT) • Custom
5. FINE-TUNE: HuggingFace • lr=2e-5 • 3-5 epochs • FP16
6. EVAL METRICS: F1/Acc • BLEU/ROUGE • Human eval • Per-lang
7. ERROR ANALYSIS: Confusion mat • Error types • Failure cases • Bias audit
8. DEPLOY: ONNX export • API serve • Monitor • A/B test

KEY DECISION POINT: Multilingual vs. Language-Specific?

Multilingual (mBERT, MuRIL, XLM-R):
  ✅ One model for all languages
  ❌ Diluted per-language performance

Language-specific (HindiBERT, IndicBERT):
  ✅ Better per-language performance
  ❌ Need separate model per language

Model Selection Guide

Which Model Do You Need?

Classification?
  English? → BERT / RoBERTa (F1≈0.92)
  Indian? → MuRIL / IndicBERT (F1≈0.76)
Generation?
  Summarize? → mT5 / IndicBART (ROUGE≈0.46)
  Translate? → IndicTrans2 / opus-mt (BLEU≈33.5)

Rule of thumb for Indian NLP:
1. Classification → MuRIL (best for Indic)
2. Generation → mT5 or IndicBART
3. Translation → IndicTrans2 (SOTA)
4. If code-mixed → MuRIL > mBERT > XLM-R
5. If resource-constrained → DistilmBERT

Tokenization Comparison Across Scripts

Input: "मुझे दिल्ली से मुंबई का ticket चाहिए" (I need a ticket from Delhi to Mumbai)

English BERT Tokenizer (bert-base-uncased):
  [UNK] [UNK] [UNK] [UNK] [UNK] ticket [UNK]
  → 85% of tokens are [UNK]! Completely useless.

mBERT Tokenizer (bert-base-multilingual-cased):
  मुझे | दिल्ली | से | मुंबई | का | ticket | चाहिए
  → Works! But limited Hindi vocab (1,700/120K tokens)

MuRIL Tokenizer (google/muril-base-cased):
  मुझे | दिल्ली | से | मुंबई | का | ticket | चाहिए
  → Same output but richer contextual embeddings for Hindi

IndicTrans SentencePiece:
  ▁मुझे | ▁दिल्ली | ▁से | ▁मुंबई | ▁का | ▁ticket | ▁चाहिए
  → Clean subword segmentation with script-aware BPE
Section 5

Common Misconceptions

โŒ MYTH: "BLEU score of 30 means the translation is 30% correct."

โœ… TRUTH: BLEU is NOT a percentage of correctness. A BLEU of 30 for Hindiโ†’English is actually quite good (approaching professional quality for some domains), while a BLEU of 30 for Frenchโ†’English might be mediocre. BLEU measures n-gram overlap with reference translations, not semantic accuracy.

๐Ÿ” WHY IT MATTERS: In interviews and papers, comparing BLEU across language pairs is meaningless. Always compare within the same language pair and test set.

โŒ MYTH: "Fine-tuning BERT means you only train the classification head."

โœ… TRUTH: Fine-tuning means updating ALL parameters โ€” both the pretrained Transformer layers AND the new classification head. If you freeze the Transformer and only train the head, that's called feature extraction (or linear probing), not fine-tuning. True fine-tuning typically gives 3-8% better F1 than feature extraction.

๐Ÿ” WHY IT MATTERS: GATE questions specifically test this distinction. In practice, with small datasets (<1K samples), feature extraction may actually be better to avoid overfitting.

โŒ MYTH: "Multilingual BERT (mBERT) was explicitly trained on parallel (aligned) multilingual data."

โœ… TRUTH: mBERT was trained on the concatenation of Wikipedia articles from 104 languages with NO parallel alignment and NO cross-lingual objectives. Yet it somehow learns cross-lingual representations! This is one of the biggest surprises of modern NLP โ€” it emerges from shared subword tokens and structural similarities.

๐Ÿ” WHY IT MATTERS: Understanding this explains why mBERT works on some language pairs (similar structures) better than others, and why models like XLM (with explicit translation objectives) can outperform it.

โŒ MYTH: "ROUGE and BLEU measure the same thing."

โœ… TRUTH: BLEU is precision-oriented (what fraction of the candidate's n-grams appear in the reference?). ROUGE is recall-oriented (what fraction of the reference's n-grams appear in the candidate?). BLEU penalizes long outputs; ROUGE penalizes short outputs. Use BLEU for translation, ROUGE for summarization.

๐Ÿ” WHY IT MATTERS: Using the wrong metric leads to wrong model selection. A model that copies the entire document would get ROUGE-1 โ‰ˆ 1.0 but BLEU โ‰ˆ 0.0.

โŒ MYTH: "Code-mixed (Hinglish) text is just Hindi with some English words. Tokenize each word by its language."

โœ… TRUTH: Code-mixing happens at every level โ€” word, morpheme, even character. "Main meeting mein gaya tha" has Hindi syntax wrapping English nouns. Some words are "language-ambiguous": "time" is used in both Hindi and English conversations. Language-tagging each token is itself a hard NLP problem (Language Identification) that must precede other tasks.

๐Ÿ” WHY IT MATTERS: Pipelines that assume clean language boundaries fail catastrophically on real Indian social media text.

Section 6

GATE / Exam Corner

Key Formulas Quick Reference

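$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\tfrac{1}{4} \sum_{n=1}^{4} \log p_n\Big) \qquad \text{ROUGE-N} = \frac{\sum \text{count}_{\text{match}}(\text{n-gram})}{\sum \text{count}_{\text{ref}}(\text{n-gram})} \qquad F_1 = \frac{2PR}{P+R}$$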

GATE Previous Year Questions (Adapted)

GATE CSE 2023 (Adapted)

In BERT, the [CLS] token representation is typically used for:

  A. Predicting the next word
  B. Sentence-level classification tasks
  C. Generating text autoregressively
  D. Computing positional encodings
Answer: B. The [CLS] (classification) token's final hidden state is a fixed-size representation of the entire input, used as input to a classification head for tasks like sentiment analysis, fake news detection, etc. BERT is encoder-only and does not generate text (eliminating C). Positional encodings are added at input, not derived from [CLS] (eliminating D). BERT uses masked language modeling, not next-word prediction (eliminating A; next-word prediction is GPT's objective).
Remember | BERT Architecture
GATE DA 2024 โ€” Adapted

Which of the following correctly describes the Brevity Penalty in BLEU score?

  A. It penalizes translations that are too long
  B. It penalizes translations that are too short
  C. It penalizes translations with too many unique n-grams
  D. It is always equal to 1.0
Answer: B. BP = exp(1 - r/c) when c < r (candidate shorter than reference), and BP = 1 when c ≥ r. Short translations achieve high precision trivially (e.g., translating a long sentence as just "The" gives perfect unigram precision for "The"). The brevity penalty counteracts this by penalizing candidates shorter than the reference.
Understand | BLEU Score
UGC NET 2024 (Adapted)

In a Seq2Seq model with attention for Hindi→English translation, the attention mechanism primarily helps with:

  A. Handling the different word order (SOV → SVO)
  B. Learning grammar rules explicitly
  C. Reducing the vocabulary size
  D. Eliminating the need for a decoder
Answer: A. Hindi follows Subject-Object-Verb order while English follows Subject-Verb-Object. The attention mechanism allows the decoder to "look back" at any position in the source sequence, enabling it to reorder information. When translating "राम ने सेब खाया" (word for word: Ram [ergative marker] apple ate) to "Ram ate an apple," attention lets the decoder attend to "खाया" (ate, the fourth and last source token) when generating "ate" (the second word in English).
Understand | Attention Mechanism

Prediction Table: Expected GATE Topics (2025-26)

Topic | Probability | Question Type | Key Concept
BLEU computation | 🟢 High | Numerical | Hand-calculate BLEU for given candidate/reference
BERT architecture | 🟢 High | MCQ | [CLS] token, encoder-only, MLM objective
ROUGE vs BLEU | 🟡 Medium | MCQ | Precision vs recall orientation
Transformer attention | 🟢 High | MCQ + NAT | Self vs cross attention, complexity O(n²)
Seq2Seq + attention | 🟡 Medium | MCQ | Encoder-decoder, alignment
Tokenization (BPE/WP) | 🟡 Medium | MCQ | Subword segmentation methods
Section 7

Interview Prep

Conceptual Questions

Q1: Why does multilingual BERT (mBERT) work across languages despite never seeing parallel data?

Expected Answer (3-4 sentences):

mBERT's cross-lingual ability emerges from three factors: (1) Shared subword tokens: words like "information" appear in many languages (English, French, etc.), creating anchor points; (2) Similar word order: SVO languages share structural patterns that the model learns; (3) Parameter sharing: all languages share the same Transformer weights, forcing the model to learn language-agnostic features. Pires et al. (2019) showed that this transfer is strongest between typologically similar languages and drops sharply on transliterated (Romanized) text, which is why MuRIL (whose pretraining includes transliterated text) outperforms mBERT on Indian languages.

Q2: You're building a Hinglish sentiment analyzer. Your F1 is 0.68. How would you improve it?

Expected Answer (structured, 5+ strategies):

Data-side: (1) Augment with spelling normalization (map "bohot" → "bahut"); (2) Back-translation augmentation: translate Hindi to English and back to generate paraphrases; (3) Add emoji as features (👎 = negative, 🔥 = positive).

Model-side: (4) Switch from mBERT to MuRIL (trained on transliterated Indian text); (5) Add a Hinglish language model pre-training step on unlabeled tweets before fine-tuning.

Architecture-side: (6) Ensemble MuRIL + XLM-R + a CNN-based classifier; (7) Use focal loss instead of cross-entropy to handle class imbalance (neutral tweets often dominate).

Evaluation-side: (8) Stratify by language ratio (pure Hindi vs. mixed vs. pure English) to find where the model fails worst.
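Strategy (7) is easier to judge with numbers in hand. The sketch below assumes the common focal loss form FL = -(1 - p_t)^γ · log(p_t), where p_t is the predicted probability of the true class. The default γ = 2 and the omission of per-class weights are simplifying assumptions for illustration, not settings taken from this chapter.

```python
import math

def cross_entropy(p_true: float) -> float:
    """Standard cross-entropy for one example, given the probability of the true class."""
    return -math.log(p_true)

def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss for one example: shrinks the loss of examples the model already gets right."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

# Compare the two losses for an easy, a borderline, and a hard example.
for p in (0.9, 0.5, 0.1):
    print(f"p_true={p}: CE={cross_entropy(p):.4f}  focal={focal_loss(p):.4f}")
```

With γ = 2, an easy example (p_true = 0.9) contributes about 1% of its cross-entropy loss, so the abundant, easily classified neutral tweets stop dominating the gradient.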

Coding Question

Q3: Implement BLEU score from scratch (no libraries). [Frequently asked at Google, Microsoft India]

The complete solution is in Project 5's bleu_from_scratch() function above. Key points interviewers look for:

  • Do you implement clipping? (min of candidate count vs. reference count)
  • Do you include the brevity penalty?
  • Do you use geometric mean of n-gram precisions (not arithmetic)?
  • Can you handle edge cases? (zero n-gram matches → avoid log(0))
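The four checkpoints above can be seen together in one compact sketch. It is not the Project 5 function: `bleu_sketch` and its helpers are hypothetical names, and it assumes a single reference, whitespace tokenization, and no smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Modified precision: a candidate n-gram counts at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # candidate is shorter than n tokens
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

def brevity_penalty(c, r):
    """BP = 1 if the candidate is at least as long as the reference, else exp(1 - r/c)."""
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1.0 - r / c)

def bleu_sketch(candidate, reference, max_n=4):
    """Single-reference, unsmoothed sentence BLEU; returns 0.0 if any precision is zero."""
    cand, ref = candidate.split(), reference.split()
    precisions = [clipped_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # avoids log(0); production toolkits usually smooth instead
    log_mean = sum(math.log(p) for p in precisions) / max_n  # geometric mean in log space
    return brevity_penalty(len(cand), len(ref)) * math.exp(log_mean)

print(bleu_sketch("the the the the", "the cat is on the mat"))
print(bleu_sketch("the cat is on the mat", "the cat is on the mat"))
```

Returning 0.0 when any precision is zero follows the strict definition; whether to smooth instead is a design choice worth raising with the interviewer.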

Case Study Question

Q4: Design a multilingual content moderation system for ShareChat (Indian social media)

Context (India-specific):

ShareChat supports 15 Indian languages. Users post in code-mixed text, and hate speech often uses euphemisms, misspellings, and transliteration to evade filters. Design a system that detects hate speech, misinformation, and graphic violence descriptions across all supported languages.

Expected Design:

Layer 1 (Language Detection): fastText-based language identifier to route text to the right pipeline.

Layer 2 (Multilingual Classifier): MuRIL fine-tuned on labeled hate speech data from each language. Use a unified model (not 15 separate ones) to handle code-mixing.

Layer 3 (Keyword Filter): Maintain a curated list of slurs, dog-whistles, and euphemisms in each language (updated weekly). This catches what the ML model misses.

Layer 4 (Human Review): Route borderline cases (confidence 0.4-0.7) to human moderators who speak the relevant language.

Key challenge: New slang evolves faster than you can retrain. Solution: use retrieval-augmented classification, that is, embed new flagged terms and find nearest neighbors in the hate speech embedding space.
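How Layers 2 to 4 might combine at decision time is sketched below. Everything named here is hypothetical: `BLOCKLIST` stands in for the curated keyword list, `hate_probability` stands in for the Layer 2 classifier's output, and removing a post automatically on a keyword hit is an assumed policy, not one stated in the design.

```python
from dataclasses import dataclass

# Hypothetical blocklist; a real deployment would load a curated, per-language list.
BLOCKLIST = {"slur_a", "slur_b"}

@dataclass
class Decision:
    action: str   # "remove", "human_review", or "allow"
    reason: str

def route_post(text: str, hate_probability: float,
               review_low: float = 0.4, review_high: float = 0.7) -> Decision:
    """Combine the keyword filter (Layer 3) with confidence-based routing (Layer 4)."""
    tokens = set(text.lower().split())
    if tokens & BLOCKLIST:
        return Decision("remove", "keyword filter match")
    if hate_probability > review_high:
        return Decision("remove", "classifier confident the post is harmful")
    if hate_probability >= review_low:
        return Decision("human_review", "borderline classifier score")
    return Decision("allow", "classifier confident the post is benign")

print(route_post("ordinary cricket banter", 0.55))
```

Keeping the thresholds as parameters makes it easy to widen the human-review band for languages where the classifier is less reliable.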

🇮🇳 Interview Focus (India)
  • Hinglish/code-mixing handling
  • Low-resource language techniques
  • WhatsApp-scale systems
  • MuRIL, IndicBERT, IndicTrans
  • GATE numerical: BLEU calculation
  • Companies: Flipkart, ShareChat, Google India, Microsoft India
🇺🇸 Interview Focus (US/Global)
  • BERT/GPT architecture details
  • Scaling laws and efficiency
  • Bias and fairness in NLP
  • Prompt engineering and few-shot
  • System design: search ranking, recommendation
  • Companies: Google, Meta, OpenAI, Amazon
Section 8

Hands-On Lab / Mini-Project

🔬 Mini-Project: End-to-End Multilingual News Classifier for Indian Languages

Objective

Build a news article topic classifier that works on Hindi, English, and Hinglish text. Classify articles into 5 categories: Sports, Politics, Technology, Entertainment, Business.

Requirements

Component | Specification
Dataset | BBC Hindi + BBC English + scraped Hinglish news (total ~10K articles)
Model | MuRIL or mBERT, fine-tuned with HuggingFace Trainer
Preprocessing | Handle Devanagari + Roman, normalize spelling, handle code-mixing
Training | 80/10/10 train/val/test split, stratified by language AND category
Evaluation | Per-language F1, confusion matrix, error analysis on 50 misclassified samples
Deliverable | Jupyter notebook + 2-page report + working inference function
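The "stratified by language AND category" requirement amounts to splitting within each (language, category) pair. Here is a minimal standard-library sketch: the field names `lang` and `label`, the toy rows, and the function name are assumptions for illustration. In the notebook you may prefer a library splitter applied to a combined key.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split rows into train/val/test so each stratum (given by `key`) is divided in similar proportions."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[key(row)].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for stratum_rows in buckets.values():
        rng.shuffle(stratum_rows)
        n = len(stratum_rows)
        n_train = int(n * fractions[0])
        n_val = int(n * fractions[1])
        train += stratum_rows[:n_train]
        val += stratum_rows[n_train:n_train + n_val]
        test += stratum_rows[n_train + n_val:]  # remainder goes to test
    return train, val, test

# Toy data (hypothetical): 3 languages x 5 categories x 20 articles each.
languages = ["hi", "en", "hinglish"]
categories = ["Sports", "Politics", "Technology", "Entertainment", "Business"]
rows = [{"lang": l, "label": c, "text": f"{l}-{c}-{i}"}
        for l in languages for c in categories for i in range(20)]

train, val, test = stratified_split(rows, key=lambda r: (r["lang"], r["label"]))
print(len(train), len(val), len(test))
```

Stratifying on the pair rather than on category alone keeps small strata, such as Hinglish Business articles, from ending up entirely in one split.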

Rubric (100 points)

Criterion | Points | Excellent (90-100%) | Good (70-89%) | Needs Work (<70%)
Data Preprocessing | 20 | Handles all three languages, normalizes spelling, removes noise intelligently | Basic cleaning, handles two languages | Minimal cleaning, English-only
Model Training | 25 | MuRIL fine-tuned, proper hyperparameter search, FP16 training | mBERT fine-tuned, reasonable hyperparams | No fine-tuning, uses pre-trained directly
Evaluation | 20 | Per-language F1 breakdown, confusion matrix, statistical significance | Overall F1 and accuracy reported | Only accuracy reported
Error Analysis | 20 | 50+ samples analyzed, error categories identified, improvement strategies proposed | 20+ samples analyzed | No error analysis
Code Quality | 15 | Clean, commented, reproducible, modular functions | Functional but messy | Hardcoded, not reproducible

Starter Code

Python
# Mini-Project Starter Code
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
from sklearn.model_selection import train_test_split
import pandas as pd

# TODO: Step 1: Load and preprocess your dataset
# df = pd.read_csv('multilingual_news.csv')
# df['text'] = df['text'].apply(preprocess_multilingual)

# TODO: Step 2: Choose model
MODEL_NAME = "google/muril-base-cased"  # Your choice!

# TODO: Step 3: Tokenize, create datasets

# TODO: Step 4: Train with Trainer API

# TODO: Step 5: Evaluate per-language

# TODO: Step 6: Error analysis on misclassified samples

# TODO: Step 7: Write report
Section 9

Exercises

Section A: Conceptual Questions (5)

A1

Explain why English BERT's tokenizer handles Hindi so poorly, shredding Romanized text such as "bahut acchi movie thi" into short, meaningless pieces and mapping most Devanagari text to [UNK], while mBERT produces meaningful subwords for Devanagari. What structural difference in their vocabularies causes this?

Answer: English BERT's vocabulary contains only ~30,000 WordPiece tokens derived from English text. It has no Romanized Hindi words and almost no Devanagari coverage, so Romanized Hindi is split into short pieces that carry no meaning and most Devanagari words map to [UNK]. mBERT's vocabulary has ~120,000 tokens derived from 104 languages' Wikipedia, including ~1,700 Hindi/Devanagari tokens. However, Romanized Hindi like "bahut" still gets fragmented (ba + ##hu + ##t) because the mBERT vocabulary has few Roman-Hindi subwords: it was trained on Devanagari Wikipedia, not Roman transliterations.
A2

Compare extractive and abstractive summarization for Indian legal judgments. Which is more appropriate and why?

Answer: Extractive selects existing sentences verbatim: safe but verbose, and legal sentences are often 100+ words long and full of references. Abstractive generates new sentences: more concise, but it risks hallucinating incorrect legal citations (e.g., inventing "Section 420 IPC" when the case involves Section 302). For legal documents, a hybrid approach is best: extractive filtering to identify key passages, followed by abstractive summarization on the extracted text. This limits hallucination while producing readable summaries.
A3

Why is ROUGE more appropriate than BLEU for evaluating summarization systems?

Answer: ROUGE is recall-oriented: it measures what fraction of the reference summary's content is captured in the generated summary. For summarization, we care about coverage: did the summary include all key points? BLEU is precision-oriented: it measures what fraction of the generated text matches the reference. A very short summary could have high n-gram precision (all its words match) yet miss critical information, and BLEU's brevity penalty only partly compensates for this. Additionally, ROUGE-L uses longest common subsequence, which is more forgiving of paraphrasing than BLEU's exact n-gram matching.
A4

What is "code-mixing" and why does it make NLP harder? Give 3 specific examples from Indian languages.

Answer: Code-mixing is the fluid interleaving of two or more languages within a single utterance. It makes NLP harder because: (1) Tokenizers designed for one language fragment the other language's words; (2) Sentiment/syntax models trained on monolingual data don't understand mixed-language patterns; (3) There's no standard orthography for transliterated text. Examples: (i) "Main kal meeting mein bahut nervous tha" (Hindi+English+Hindi); (ii) "Arre yaar, this movie was bakwas, complete waste" (Hindi interjections + English); (iii) "Can you book मुझे एक ticket to Delhi?" (English+Hindi+English with Devanagari mid-sentence).
A5

Explain the difference between MuRIL, mBERT, and XLM-RoBERTa for Indian language tasks. When would you choose each?

Answer: mBERT: Trained on 104 languages' Wikipedia. Good general multilingual model, but Hindi gets only ~1.4% of the vocabulary. Best when you need one model for many languages including non-Indian. XLM-RoBERTa: Trained on 2.5TB CommonCrawl data from 100 languages. Stronger than mBERT due to more diverse training data and no NSP objective. Best for cross-lingual transfer from English. MuRIL: Trained specifically on 17 languages (16 Indian languages plus English), including transliterated text. Best for Indian language tasks; it outperforms both by 2-5% F1 on Indian benchmarks. Choose MuRIL for India-focused tasks; XLM-R for multilingual tasks; mBERT only if resource-constrained.

Section B: Mathematical Questions (8)

B1
Intermediate

Calculate the BLEU score for: Reference: "the cat is on the mat" | Candidate: "the the the the"

Answer: p₁: "the" appears 4 times in candidate, but max count in reference is 2. Clipped count = 2. Total candidate unigrams = 4. p₁ = 2/4 = 0.5. p₂: Bigram "the the" appears 3 times in candidate, 0 times in reference. p₂ = 0/3 = 0. Since p₂ = 0, log(p₂) = -∞, so BLEU = 0. (Once any pₙ = 0, the entire BLEU becomes 0 due to the geometric mean.)
B2
Intermediate

Reference: "India is a beautiful country" (5 words). Candidate: "India is beautiful" (3 words). Calculate the brevity penalty.

Answer: c = 3 (candidate length), r = 5 (reference length). Since c < r: BP = exp(1 - r/c) = exp(1 - 5/3) = exp(-2/3) = exp(-0.667) ≈ 0.513. The translation is heavily penalized for being too short.
B3
Intermediate

Compute ROUGE-1, ROUGE-2, and ROUGE-L for: Reference: "the cat sat on the mat" | System summary: "the cat on the mat"

Answer: ROUGE-1 (unigram recall): Reference unigrams: {the:2, cat:1, sat:1, on:1, mat:1} = 6 tokens. System matches: {the:2, cat:1, on:1, mat:1} = 5 matches. ROUGE-1 = 5/6 ≈ 0.833. ROUGE-2 (bigram recall): Reference bigrams: {the cat, cat sat, sat on, on the, the mat} = 5. System bigrams: {the cat, cat on, on the, the mat} = 4. Matches: {the cat, on the, the mat} = 3. ROUGE-2 = 3/5 = 0.600. ROUGE-L (LCS): LCS of "the cat sat on the mat" and "the cat on the mat" = "the cat on the mat" (length 5). ROUGE-L recall = 5/6 ≈ 0.833.
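The ROUGE-1 and ROUGE-2 figures in this answer can be reproduced with a few lines of Python. This is a sketch of recall-only ROUGE-N under whitespace tokenization; the function names are illustrative, and ROUGE-L is deliberately left out because it is the subject of exercise C2.

```python
from collections import Counter

def ngram_counts(text: str, n: int) -> Counter:
    """Count the n-grams of a whitespace-tokenized string."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference: str, system: str, n: int) -> float:
    """ROUGE-N recall: matched reference n-grams divided by total reference n-grams."""
    ref, sys_counts = ngram_counts(reference, n), ngram_counts(system, n)
    matched = sum(min(count, sys_counts[gram]) for gram, count in ref.items())
    return matched / sum(ref.values())

reference = "the cat sat on the mat"
system = "the cat on the mat"
print(round(rouge_n_recall(reference, system, 1), 3))  # 0.833
print(round(rouge_n_recall(reference, system, 2), 3))  # 0.6
```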
B4
Advanced

In BERT's self-attention, if the input sequence has length n = 512 and hidden dimension d = 768 with h = 12 heads, what is (a) the dimension of each head's Q, K, V matrices, and (b) the total FLOPs for one self-attention layer?

Answer: (a) Each head dimension dₖ = d/h = 768/12 = 64. So Q, K, V for each head are n × 64 = 512 × 64. (b) Computing QKᵀ: (512 × 64) × (64 × 512) takes 512 × 512 × 64 ≈ 16.8M multiply-accumulate operations per head, × 12 heads ≈ 201M. Computing attention × V: another 201M. Total ≈ 402M for the attention computation. Including the linear projections (W_Q, W_K, W_V, W_O): 4 × n × d × d = 4 × 512 × 768² ≈ 1.2B. Grand total ≈ 1.6B multiply-accumulates per layer, or about 3.2B FLOPs if each multiply-accumulate is counted as two floating-point operations.
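The arithmetic in part (b) is easy to get wrong by hand, so a quick check helps. The sketch counts multiply-accumulate operations (MACs), which is what the figures in the answer correspond to. Doubling them to get FLOPs is a common convention and is stated here as an assumption; softmax, bias, and normalization costs are ignored.

```python
n, d, h = 512, 768, 12        # sequence length, hidden size, heads (from the question)
d_k = d // h                  # per-head dimension

qk_macs = h * n * n * d_k     # Q K^T scores, all heads
av_macs = h * n * n * d_k     # attention weights times V, all heads
proj_macs = 4 * n * d * d     # W_Q, W_K, W_V, W_O projections

attention_macs = qk_macs + av_macs
total_macs = attention_macs + proj_macs

print(f"d_k = {d_k}")
print(f"attention: {attention_macs / 1e6:.0f}M MACs")
print(f"projections: {proj_macs / 1e9:.2f}B MACs")
print(f"total: {total_macs / 1e9:.2f}B MACs (about {2 * total_macs / 1e9:.2f}B FLOPs at 2 FLOPs per MAC)")
```

At n = 512 the projections account for roughly three quarters of the total; the quadratic attention term only dominates once n grows well beyond d.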
B5
Intermediate

A fake news classifier has: TP=180, FP=20, FN=30, TN=770. Compute precision, recall, F1, and accuracy. Is this model production-ready?

Answer: Precision = TP/(TP+FP) = 180/200 = 0.90. Recall = TP/(TP+FN) = 180/210 = 0.857. F1 = 2 × 0.9 × 0.857/(0.9 + 0.857) = 1.543/1.757 = 0.878. Accuracy = (TP+TN)/(Total) = 950/1000 = 0.95. Assessment: F1 of 0.878 is good but the recall of 0.857 means ~14% of fake news goes undetected. For a safety-critical application like fake news, we need higher recall (>0.95) even at the cost of precision. Use a lower classification threshold.
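The same numbers fall out of a small helper, which is handy when comparing several thresholds. The function name is illustrative, and the sketch assumes the denominators are non-zero.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1, and accuracy from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

metrics = classification_metrics(tp=180, fp=20, fn=30, tn=770)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy looks strongest here largely because true negatives dominate the 1,000 examples, which is one reason the assessment focuses on recall instead.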
B6
Beginner

If mBERT has a vocabulary of 120,000 tokens across 104 languages, approximately how many tokens are allocated to Hindi? Why is this a problem?

Answer: If tokens were distributed equally: 120,000/104 ≈ 1,154 per language. In practice, Hindi gets ~1,700 tokens (slightly more due to Wikipedia size). This is a problem because Hindi has rich morphology: "खाऊँगी" is a single word requiring ~3-4 subword tokens. With only 1,700 tokens, many Hindi words get over-fragmented into 5+ subword pieces, leading to: (1) longer sequences (hitting the 512-token limit faster), (2) loss of semantic coherence (each subword loses meaning), (3) slower inference.
B7
Advanced

Derive the gradient of the cross-entropy loss with respect to the logits for a 3-class sentiment classifier. Show that ∂L/∂zᵢ = pᵢ - yᵢ where pᵢ = softmax(z)ᵢ.

Answer: Loss L = -Σⱼ yⱼ log(pⱼ) where pⱼ = exp(zⱼ)/Σₖ exp(zₖ). For ∂L/∂zᵢ: ∂L/∂zᵢ = -Σⱼ yⱼ · (1/pⱼ) · ∂pⱼ/∂zᵢ. We need ∂pⱼ/∂zᵢ: If j = i: ∂pᵢ/∂zᵢ = pᵢ(1 - pᵢ). If j ≠ i: ∂pⱼ/∂zᵢ = -pⱼpᵢ. Substituting: ∂L/∂zᵢ = -yᵢ(1 - pᵢ) + Σⱼ≠ᵢ yⱼpᵢ = -yᵢ + yᵢpᵢ + pᵢ Σⱼ≠ᵢ yⱼ = -yᵢ + pᵢ(yᵢ + Σⱼ≠ᵢ yⱼ) = -yᵢ + pᵢ · 1 = pᵢ - yᵢ. ∎ The second-to-last step uses Σⱼ yⱼ = 1, which holds for a one-hot label. This elegant result means the gradient is simply the difference between the predicted probability and the true label, the foundation of backpropagation through softmax.
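A derivation like this is worth sanity-checking numerically. The sketch below compares the analytic gradient p - y with a central finite-difference estimate. The logits and label are made-up values for a three-class case, and the helper names are illustrative.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """L = -sum_j y_j log p_j with p = softmax(z)."""
    p = softmax(z)
    return -sum(yj * math.log(pj) for yj, pj in zip(y, p))

def numerical_gradient(z, y, eps=1e-6):
    """Central finite differences of the loss with respect to each logit."""
    grad = []
    for i in range(len(z)):
        z_plus = z[:i] + [z[i] + eps] + z[i + 1:]
        z_minus = z[:i] + [z[i] - eps] + z[i + 1:]
        grad.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps))
    return grad

z = [2.0, -1.0, 0.5]   # hypothetical logits: negative / neutral / positive
y = [0.0, 0.0, 1.0]    # one-hot label: positive

analytic = [p - t for p, t in zip(softmax(z), y)]
numeric = numerical_gradient(z, y)
print("analytic:", [round(g, 6) for g in analytic])
print("numeric: ", [round(g, 6) for g in numeric])
```

The two gradients should agree to several decimal places, and the analytic components sum to zero because both p and y sum to one.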
B8
Intermediate

A Hindi→English NMT model produces translations with average length 15 words when references average 20 words. If the n-gram precisions are p₁ = 0.85, p₂ = 0.55, p₃ = 0.35, p₄ = 0.20, what is the BLEU score?

Answer: BP = exp(1 - 20/15) = exp(1 - 1.333) = exp(-0.333) ≈ 0.717. log-precision average = (log 0.85 + log 0.55 + log 0.35 + log 0.20)/4 = (-0.163 + (-0.598) + (-1.050) + (-1.609))/4 = -3.420/4 = -0.855. BLEU = 0.717 × exp(-0.855) = 0.717 × 0.425 = 0.305 or 30.5. This is a reasonable BLEU score for Hindi→English, with the brevity penalty reducing it from what would otherwise be ~42.5 without BP.
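The hand calculation can be confirmed directly from the given precisions and lengths. This is only a check of the arithmetic; the variable names are illustrative.

```python
import math

precisions = [0.85, 0.55, 0.35, 0.20]  # p1..p4 from the question
c, r = 15, 20                          # average candidate and reference lengths

bp = 1.0 if c >= r else math.exp(1 - r / c)
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
bleu = bp * geo_mean

print(f"BP = {bp:.3f}, geometric mean = {geo_mean:.3f}, BLEU = {bleu:.3f}")
```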

Section C: Coding Questions (4)

C1
Intermediate

Write a function normalize_hinglish(text) that handles at least 10 common Hinglish spelling variants. For example: "bohot" → "bahut", "achha" → "acha", "kaise" / "kese" → "kaise", etc.

Hint: Use a dictionary mapping + regex for pattern-based normalization (repeated vowels like "hiii" → "hi", "pleaseee" → "please"). Also handle: nhi/nahi/nahin → nahi, hai/h/he → hai, me/mein → mein. Keep "main" ("I") separate from "mein" ("in"): they are different words, as in "Main meeting mein gaya tha".
C2
Advanced

Implement a function compute_rouge_l(reference, candidate) from scratch that computes ROUGE-L using the longest common subsequence algorithm. Do NOT use any NLP libraries.

Hint: Use dynamic programming to find LCS length. ROUGE-L recall = LCS_length / reference_length. ROUGE-L precision = LCS_length / candidate_length. F1-ROUGE-L = 2·P·R/(P+R). The DP table is (m+1) × (n+1) where m = len(reference), n = len(candidate).
C3
Intermediate

Using HuggingFace, write a script that loads MuRIL, fine-tunes it on 100 labeled Hinglish sentiment samples, and evaluates on 20 test samples. Report per-class precision, recall, and F1.

Hint: Use AutoModelForSequenceClassification.from_pretrained("google/muril-base-cased", num_labels=3). Create a custom Dataset class. Use Trainer with TrainingArguments(num_train_epochs=3, learning_rate=2e-5). Use sklearn's classification_report for per-class metrics.
C4
Advanced

Build a simple language detection function that classifies a given text as "Hindi (Devanagari)", "Hindi (Roman/Hinglish)", or "English" using Unicode ranges. No ML, just rule-based.

Hint: Devanagari Unicode range: U+0900 to U+097F. Count Devanagari characters vs. Latin characters. If >50% Devanagari → "Hindi (Devanagari)". If >50% Latin and contains common Hindi words (main, hai, kya, bahut, nahi) → "Hinglish". Otherwise → "English". Handle mixed scripts by checking ratios.

Section D: Critical Thinking Questions (3)

D1
Advanced

A company deploys a fake news detection model for Indian WhatsApp. The model has 92% accuracy on the test set but users report it frequently flags legitimate political opinions as fake news. Analyze what might be going wrong and propose 3 concrete solutions.

Answer: (1) Training data bias: The model was likely trained on data where political content was disproportionately labeled as fake. Fix: Re-audit training labels, ensure balanced representation of legitimate political discourse. (2) Feature leakage: The model might be using political keywords as proxies for fakeness (e.g., any message mentioning a political party gets flagged). Fix: Use adversarial debiasing, that is, train a secondary classifier to NOT predict political party from hidden representations. (3) Threshold issue: 92% accuracy might mean the model is overconfident on borderline cases. Fix: Add an "uncertain" category for predictions with confidence 0.4-0.7, route to human review. Also consider: is the test set representative of real WhatsApp distribution?
D2
Advanced

Your Hindi→English NMT model gets BLEU = 24 on news text but BLEU = 8 on poetry. Explain why, and propose how to improve poetry translation without hurting news performance.

Answer: News text is factual with standard vocabulary and grammar, close to the training data. Poetry has metaphors ("आँखों में तूफ़ान", literally "storm in the eyes"), non-standard word order for meter, cultural references, and multiple valid translations (one metaphor can be translated 10 different ways). BLEU penalizes creative paraphrasing because it matches exact n-grams. Solutions: (1) Use multiple references (5+ translations) for BLEU, since a single reference unfairly penalizes valid alternatives. (2) Fine-tune a separate poetry model on a poetry parallel corpus. (3) Use semantic similarity metrics (BERTScore) alongside BLEU for evaluation. (4) Domain-adaptive pretraining: continue pretraining the LM on Hindi poetry before fine-tuning on translation pairs.
D3
Advanced

India's judiciary wants to deploy your legal summarization system. What ethical, legal, and technical concerns would you raise before deployment? How would you address each?

Answer: Ethical: (1) Hallucinated legal citations could lead to wrongful legal advice, so add a confidence score and the disclaimer "This is an AI summary, not legal advice." (2) Bias: if the model is trained on English-language judgments, it may poorly summarize Hindi judgments, disadvantaging courts that operate in Hindi. Legal: (3) Under the IT Act, AI-generated legal summaries might create liability: who is responsible if a wrong summary leads to a bad legal decision? (4) Data privacy: judgments may contain victim names and minors' details. Technical: (5) Long document handling: 50-page judgments exceed model limits and need robust chunking. (6) Evaluation: ROUGE doesn't measure factual accuracy, so human evaluation by lawyers is needed. (7) Adversarial: a party could craft judgments designed to confuse the summarizer. Mitigation: deploy as an "assistive" tool, not a replacement. Always show the source text alongside the summary.
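Point (5) mentions chunking without showing it. One simple approach is overlapping windows, sketched below under clear assumptions: window sizes are counted in whitespace-separated words as a proxy for model tokens, the defaults of 400 and 50 are arbitrary, and a production pipeline would count tokens with the model's own tokenizer and prefer to cut at paragraph or section boundaries.

```python
def chunk_words(text: str, chunk_size: int = 400, overlap: int = 50) -> list:
    """Split text into overlapping word windows so each chunk fits a model's input limit."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the document
    return chunks

judgment = " ".join(f"w{i}" for i in range(1000))  # stand-in for a long judgment
chunks = chunk_words(judgment)
print(len(chunks), [len(c.split()) for c in chunks])
```

The overlap means a sentence cut at one chunk boundary still appears whole in the neighbouring chunk, at the cost of summarizing some text twice.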

★ Starred Research Questions (2)

★ R1
Advanced

Read the MuRIL paper (Khanuja et al., 2021). How does MuRIL's pretraining differ from mBERT? Why does including transliterated text in pretraining improve performance on code-mixed tasks? Design an experiment to test whether transliteration augmentation helps IndicBERT as well.

★ R2
Advanced

The IndicTrans2 paper (Gala et al., 2023) claims state-of-the-art Hindi↔English translation. Read the paper and answer: (a) How does their "script unification" approach work? (b) Why is SentencePiece with character coverage=0.9995 crucial for Indic scripts? (c) Design a study comparing IndicTrans2 vs. fine-tuned mBART on legal document translation (not general text).

Section 10

Connections

How This Chapter Connects to the Rest of the Book

← Builds On

Chapter 14 (LSTMs & GRUs): The Seq2Seq chatbot (P3) uses LSTM encoder-decoder architecture. Attention mechanism builds on the hidden state sequences from Chapter 14.

Chapter 15 (Transformers & Attention): Every project uses Transformers: BERT for classification (P1, P4), T5/BART for generation (P2), and full encoder-decoder Transformers for translation (P5). Self-attention and cross-attention from Ch. 15 are directly applied here.

Chapter 17 (Transfer Learning): All 5 projects use transfer learning. Pre-trained models (MuRIL, mT5, opus-mt) are fine-tuned on domain-specific data. The concepts of feature extraction vs. fine-tuning from Ch. 17 are critical.

→ Enables

Chapter 22 (Future & Ethics): The ethical concerns raised in P4 (fake news bias) and P2 (legal AI responsibility) connect directly to AI ethics discussions.

MLOps chapter: Deploying these models in production requires model serving, monitoring, and retraining pipelines covered in MLOps.

Capstone projects: Students can extend any of these 5 projects into a full capstone with real data.

🔬 Research Frontier

Large Language Models for Indian Languages: IndicLLM, Sarvam AI's models, and Krutrim are attempting to build GPT-scale models specifically for Indian languages. The fundamental challenges (tokenization, code-mixing, morphological richness) discussed in this chapter explain why simply scaling English LLMs doesn't solve Indian language AI.

🏭 Industry Implementation

Bhashini (Government of India): India's national platform for language translation and NLP services uses many of the same techniques covered here: IndicTrans for translation, IndicBERT for understanding, and ASR models for speech. The AI4Bharat consortium (IIT Madras) has open-sourced many of these models.

Section 11

Chapter Summary

🎯 Key Takeaways

  1. Indian NLP is fundamentally harder than English NLP due to code-mixing, script diversity (13 scripts), morphological richness, spelling variation, and severe data scarcity. Pretending it's "just another language" leads to catastrophic failures.
  2. MuRIL > mBERT > English BERT for Indian language classification tasks. MuRIL's inclusion of transliterated text in pretraining gives it a 2-5% F1 advantage on code-mixed tasks.
  3. Legal summarization requires a hybrid approach: extractive filtering (to fit within token limits) followed by abstractive generation (for readable summaries). Watch for hallucinated legal citations.
  4. Task-oriented chatbots decompose into Intent → Slots → Response: Use a classifier for intent, token-level NER for slot filling, and either templates or Seq2Seq for response generation. Bilingual handling is the key Indian challenge.
  5. BLEU ≠ quality percentage. BLEU measures n-gram overlap with clipping and brevity penalty. Never compare BLEU scores across language pairs. Use BLEU for translation, ROUGE for summarization, F1 for classification.
  6. Attention visualization reveals word alignment between source and target in NMT. Hindi (SOV) → English (SVO) word order changes are visible in cross-attention maps.
  7. Domain-specific preprocessing is critical: Legal text needs section normalization, social media needs emoji handling, WhatsApp messages need forward-indicator detection. There is no one-size-fits-all pipeline.
Key Equations:
BLEU = BP · exp(¼ Σₙ log pₙ), n = 1 to 4  |  ROUGE-N = Σ match(n-gram) / Σ count_ref(n-gram)  |  F1 = 2PR/(P+R)
BP = min(1, exp(1 - r/c))  |  Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
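To make the attention equation concrete, here is a dependency-free sketch of scaled dot-product attention on a toy input. The vectors are invented so that the single query lines up with the second key. Real implementations use batched tensor operations and learned projections, which are omitted here.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # one weight per key, summing to 1
        all_weights.append(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return outputs, all_weights

# Toy example (hypothetical vectors): one query that matches the second key best.
Q = [[0.0, 4.0]]
K = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = attention(Q, K, V)
print("weights:", [round(w, 4) for w in weights[0]])
print("output: ", [round(o, 4) for o in outputs[0]])
```

Nearly all the weight lands on the second key, so the output is close to the second value vector. The same mechanism lets a decoder step pick out the source position it needs.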

Key Intuition: Building NLP for a multilingual country like India isn't about translating English NLP; it's about reimagining NLP from the ground up for a world where languages don't have neat boundaries.

Section 12

Further Reading

🇮🇳 Indian Resources

  • NPTEL, NLP by Prof. Pushpak Bhattacharyya (IIT Bombay): The definitive Indian lecture series covering Hindi NLP, WordNet, and Indian language processing. nptel.ac.in
  • AI4Bharat (IIT Madras): Open-source models (IndicTrans2, IndicBERT), datasets, and benchmarks for 22 Indian languages. ai4bharat.iitm.ac.in
  • Bhashini Platform: Government of India's national language translation mission; see the production deployment of the techniques from this chapter. bhashini.gov.in
  • GATE DA Syllabus, NLP Section: Official syllabus covering text classification, sequence models, attention mechanisms, and evaluation metrics
  • ILDC Dataset (IIT Kanpur): 35K Indian Supreme Court judgments with court-decision labels, plus expert explanations for a subset; the benchmark for Indian legal NLP

🌍 Global Resources

  • "Attention Is All You Need" (Vaswani et al., 2017): The original Transformer paper. Every model in this chapter is built on this architecture.
  • "BERT: Pre-training of Deep Bidirectional Transformers" (Devlin et al., 2019): The foundation for Projects 1, 3, 4.
  • HuggingFace Course: Free interactive course covering all the APIs used in this chapter. huggingface.co/course
  • "MuRIL: Multilingual Representations for Indian Languages" (Khanuja et al., 2021): The paper behind the best-performing model for Indian NLP.
  • "IndicTrans2: Towards High-Quality and Accessible Machine Translation Models" (Gala et al., 2023): State-of-the-art Indian language translation.
  • Jay Alammar's "The Illustrated Transformer": The single best visual explanation of attention and Transformers. jalammar.github.io
  • 3Blue1Brown, Attention in Transformers (YouTube): Visual, intuition-first explanation of attention mechanism
  • Distill.pub, "Attention and Augmented Recurrent Neural Networks": Interactive article on attention mechanisms with beautiful visualizations

📝 Key Papers for Deep Dive

Paper | Year | Relevance
Vaswani et al., "Attention Is All You Need" | 2017 | Foundation for all Transformer-based projects
Devlin et al., "BERT" | 2019 | Pre-training + fine-tuning paradigm (P1, P3, P4)
Pires et al., "How Multilingual is Multilingual BERT?" | 2019 | Why mBERT transfers across languages
Khanuja et al., "MuRIL" | 2021 | Best model for Indian NLP tasks
Malik et al., "ILDC for CJPE" | 2021 | Indian legal document corpus (P2)
Gala et al., "IndicTrans2" | 2023 | SOTA Hindi↔English translation (P5)
Patwa et al., "SemEval-2020 Task 9" | 2020 | Hinglish sentiment benchmark (P1)
Papineni et al., "BLEU" | 2002 | The original BLEU metric paper

"IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages" (Gala et al., 2023)

This AI4Bharat paper introduced IndicTrans2, trained on the Bharat Parallel Corpus (230M sentence pairs across 22 languages). Key innovations: (1) Script unification: converting all Indic scripts to a common representation before tokenization; (2) Language-tagged SentencePiece: prefixing each sentence with a language tag; (3) Two-stage fine-tuning: first on general data, then on high-quality human-translated data. It achieves BLEU scores 5-8 points above Google Translate for most Indian language pairs. The models are freely available on HuggingFace.