Build AI from Zero • EduArtha
Build Your Own AI: From Zero to ChatBot
A hands-on journey from counting patterns to fine-tuning your own chatbot. Build every piece yourself!
5-6 hours | 7 Chapters | 35+ Exercises | 25+ Code Files | Indian Context
How to Run the Project
Read the lesson.md in each level first, then run the scripts in order. Each cd below assumes you are in the project root (build-ai-from-zero), so run cd .. before moving on to the next level.
Step 0: Install (one time)
cd build-ai-from-zero
pip install -r requirements.txt
Level 1: Prediction (Pure Python)
cd level_1_prediction
python step1_bigram.py # Count character patterns
python step2_generate.py # Generate text from patterns
Level 2: Neural Network (NumPy)
cd level_2_neural_network
python step1_neuron.py # Build a single neuron
python step2_network.py # Build a multi-layer network
python step3_train.py # Train with backprop from scratch
python step4_visualize.py # Plot loss curves
Level 3: Transformer (PyTorch)
cd level_3_transformer
python step1_embedding.py # Token + positional embeddings
python step2_attention.py # Self-attention mechanism
python step3_transformer_block.py # Full transformer block
python step4_put_it_together.py # Complete Mini-Transformer
Level 4: Train Your Mini-GPT
cd level_4_mini_gpt
python train.py # Train on Indian stories (~5-10 min)
python generate.py # Chat with YOUR model!
Level 5: Fine-Tune a Real Model
cd level_5_real_finetune
python prepare_data.py # Prepare education Q&A data
python finetune.py # Fine-tune DistilGPT-2 with LoRA
python chat.py # Chat with your fine-tuned bot!
Tip: Each level has a lesson.md. Read it first to understand the concepts before running the code. Every script prints colorful output explaining what's happening!
Foundations
Understanding AI and why it matters
What is Artificial Intelligence?
"The science of today is the technology of tomorrow." - Edward Teller
Learning Objectives
- Define Artificial Intelligence in simple, intuitive terms
- Trace the key milestones in AI history, from Turing to ChatGPT
- Distinguish between Narrow AI, General AI, and Super AI
- Explain, at a high level, how language models like ChatGPT work
- Understand the roadmap of what you'll build in this book
- Feel confident that you can learn AI, no PhD required!
1.1 Welcome to the AI Revolution
Open your phone right now. Chances are, you've already used AI today, maybe without even realising it.
- UPI & fraud detection: Every time you send money through PhonePe, Google Pay, or Paytm, AI is silently scanning the transaction in milliseconds, checking whether it looks suspicious, whether your location makes sense, and whether the amount is unusual. Billions of UPI transactions happen every month in India, and AI helps keep them safe.
- Aadhaar biometrics: India's Aadhaar system is one of the largest biometric databases in the world, covering over 1.3 billion people. When you press your thumb at a ration shop or bank, AI-powered pattern recognition verifies your identity in seconds.
- Google Translate for Hindi: Try typing a sentence in Hindi and translating it to English. A few years ago, the translations were laughably bad. Today? They're remarkably good. That improvement is AI: specifically, neural networks learning the patterns of language.
- IRCTC and recommendations: Ever noticed how shopping apps seem to know what you want? Or how YouTube suggests that next video you can't resist? That's AI predicting your behaviour based on patterns.
- Voice assistants: "OK Google, aaj mausam kaisa hai?" When you speak to your phone in Hindi and it understands you, that's AI converting sound waves into text, understanding the meaning, and generating a response.
AI isn't some distant, futuristic technology. It's here, it's now, and it's deeply woven into India's digital fabric.
Important
You don't have to stay just a consumer of AI. This book will make you a creator of AI. By the end, you'll understand exactly how these systems work, and you'll build one yourself.
1.2 What is Artificial Intelligence?
Let's start with the simplest possible definition:
Artificial Intelligence is the science of making computers do things that normally require human intelligence.
That's it. No jargon. No mystification.
But what does "human intelligence" mean? Think about what you do every day:
- You recognise faces (your friend in a crowd)
- You understand language ("Pass the chai" means something different from "Chai pass ho gaya")
- You predict outcomes (dark clouds = rain coming)
- You learn from experience (you touched a hot tawa once, and never again!)
- You make decisions (should I take the metro or an auto?)
Natural intelligence, the kind you have, was shaped by millions of years of evolution. Your brain has roughly 86 billion neurons connected in impossibly complex ways.
Artificial intelligence tries to replicate some of these abilities in a computer. Not by copying the brain exactly, but by using mathematics and data to achieve similar results.
Tip
The School Analogy
Think of AI like teaching a new student:
- Data = the textbooks and examples the student studies
- Algorithm = the method of studying (rote learning, understanding concepts, practice problems)
- Model = the knowledge the student builds in their brain
- Prediction = when the student answers a question they've never seen before
A well-trained student (model) who studied good textbooks (data) using effective methods (algorithms) can answer new questions (predictions) accurately. That's exactly how AI works!
The key difference? A human student might need 10 examples to understand a concept. A computer might need 10 million. But once it learns, it can apply that knowledge millions of times per second, without getting tired, without forgetting, without asking for chai breaks.
1.3 A Brief History of AI
AI didn't appear overnight. It's been a journey of over 70 years, full of breakthroughs, disappointments (called "AI winters"), and spectacular comebacks.
Let's walk through each milestone:
1950: The Turing Test
Alan Turing, a British mathematician, asked a simple but profound question: "Can machines think?" He proposed a test: if a human talks to a machine and can't tell it's not human, the machine is "intelligent." This question launched the entire field of AI.
1957: The Perceptron
Frank Rosenblatt built one of the first artificial neural networks: a simple device that could learn to recognise patterns. It was inspired by how biological neurons work. The media went wild: "A machine that thinks!" But the excitement faded when researchers discovered its limitations. (We'll build our own version in Chapter 3!)
1997: Deep Blue Beats Kasparov
IBM's chess computer defeated the world champion Garry Kasparov. This was huge, because chess was considered the ultimate test of intelligence. But Deep Blue worked by brute force (searching millions of positions per second), not by "understanding" chess the way a human does.
2012: AlexNet and the Deep Learning Revolution
A neural network called AlexNet crushed the competition in the ImageNet image recognition contest, cutting the top-5 error rate to about 15%, roughly ten points better than the runner-up. The secret? Deep learning: neural networks with many layers, trained on massive datasets using powerful GPUs. This was the moment everything changed.
2017: The Transformer
A team at Google published a paper called "Attention Is All You Need." It introduced the Transformer architecture, a new way for AI to process language. This single paper is the foundation of GPT, BERT, Gemini, and almost every modern language model. (We'll build a mini Transformer in this book!)
2022: ChatGPT Changes Everything
OpenAI released ChatGPT, and within 5 days, it had 1 million users. For the first time, ordinary people (teachers, students, shopkeepers, everyone) could talk to AI and get useful, coherent responses. AI was no longer just for researchers.
2024: The AI Explosion
Google's Gemini, Anthropic's Claude, Meta's LLaMA, and dozens of open-source models made AI accessible to everyone. Indian startups began building AI for Indian languages. Students started learning to build their own models. That's exactly what you're about to do.
1.4 Types of AI
Not all AI is created equal. Scientists categorise AI into three types:
| Type | Description | Examples | Status |
|---|---|---|---|
| Narrow AI (ANI) | AI that does ONE thing well | Google Translate, Siri, chess engines, spam filters | Exists today |
| General AI (AGI) | AI that can do ANY intellectual task a human can | A machine that can cook, write poetry, do surgery, AND play cricket | Doesn't exist yet |
| Super AI (ASI) | AI that surpasses human intelligence in every way | Science fiction (for now) | Theoretical only |
Note
Think About It
ChatGPT can write essays, code, poetry, and answer questions. Does that make it "General AI"? Not quite! It can't drive a car, cook dinner, or even see the physical world. It's an incredibly capable Narrow AI: extraordinary at language tasks, but only language tasks. We're still far from true AGI.
Where are we today? We live in the age of very powerful Narrow AI. These systems can beat humans at specific tasks (chess, Go, image recognition, language) but can't generalise across domains the way a 5-year-old child can.
1.5 How Do Language Models Work? (The 30,000-Foot View)
Here's the single most important idea in this book. Ready?
Language models work by predicting the next word (or token).
That's it. That's the secret. Let me show you what I mean.
Complete this sentence:
"मेरा नाम ___ है" ("My name is ___")
Your brain instantly filled in a name, maybe your own name, maybe "Rahul" or "Priya." How did you do it? You've seen thousands of sentences with this pattern. Your brain predicts what comes next based on what it has seen before.
Now try this one in English:
"The capital of India is ___"
You thought "New Delhi", not because you reasoned through geography, but because you've seen this pattern so many times that the prediction is automatic.
ChatGPT works on the same principle, just at an incredible scale:
- It was trained on billions of sentences from the internet
- It learned the patterns: what words typically follow what other words
- When you ask it a question, it predicts the most likely next word, then the next, then the next...
- It does this with such sophistication that the output looks like "thinking"
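The predict-then-feed-back loop described in the list above can be sketched at toy scale. This is only an illustration: the three-sentence corpus and the helper name predict_next are made up for the example, and a real language model uses a neural network rather than a lookup table of counts.

```python
from collections import Counter, defaultdict

# Toy corpus: a hypothetical stand-in for "billions of sentences".
corpus = (
    "the capital of india is new delhi . "
    "the capital of india is big . "
    "the capital of france is paris ."
)
words = corpus.split()

# Learn the patterns: count which word follows which.
following = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the follower of `word` seen most often in the corpus."""
    return following[word].most_common(1)[0][0]

# Predict one word, append it, and use it as the new context.
sentence = ["capital"]
for _ in range(3):
    sentence.append(predict_next(sentence[-1]))

print(" ".join(sentence))  # capital of india is
```

Each step only asks "what usually comes next?", yet chaining the answers already produces a sensible phrase.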
Tip
Key Insight
The difference between your bigram model (Chapter 2) and ChatGPT isn't the core idea, since both predict what comes next. What differs is scale:
- Your model looks at 1 character of context
- ChatGPT can look at up to 128,000 tokens of context
- Your model has a few hundred parameters
- GPT-4 is widely reported, though not officially confirmed, to have over 1 trillion parameters
Same idea. Different scale. And you're about to understand both!
1.6 What You'll Build in This Book ๐๏ธ
This book takes you from zero knowledge to building your own chatbot, step by step. No magic, no black boxes: you'll understand every line of code.
Here's the roadmap. Each level builds on the previous one. By the end:
- Level 1: You'll have a model that generates text by counting character patterns
- Level 2: You'll build a neural network from scratch, with no TensorFlow and no PyTorch, just NumPy and math
- Level 3: You'll understand how computers represent words as numbers (embeddings) and learn the "attention" mechanism
- Level 4: You'll build a mini Transformer, the same architecture that powers ChatGPT and Gemini
- Level 5: You'll fine-tune a real model and build a working chatbot interface
Important
No shortcuts. Levels 1 and 2 use nothing beyond pure Python and NumPy, PyTorch only appears from Level 3, and a pretrained model waits until Level 5. You'll build everything from scratch so you truly understand it. This is like learning to drive with a manual car before switching to automatic: harder at first, but you'll be a much better driver.
1.7 Prerequisites ๐
Here's what you need to start:
What You Need
- Basic Python knowledge: variables, loops, functions, lists, dictionaries. If you can write a function that takes a list and returns the largest number, you're ready.
- A computer: any laptop or desktop. The code in Levels 1-3 runs on even the most basic hardware.
- Curiosity: the most important ingredient!
What You DON'T Need
- A PhD in mathematics
- A powerful GPU (until Level 4-5)
- Prior knowledge of machine learning
- Experience with TensorFlow or PyTorch
Free Resources
- Google Colab (colab.research.google.com): free cloud computing with Python, NumPy, and even GPUs. You can run all the code in this book for free!
- Python (python.org): if you prefer running code locally
Tip
If you're new to Python, spend a weekend going through a basic tutorial. Focus on: variables, if-else statements, for loops, functions, lists, and dictionaries. That's all you need!
1.8 Discussion Questions
Take a moment to think about (or discuss with a friend):
AI in Indian classrooms: How could AI change the way students learn in government schools? Could an AI tutor help bridge the gap between urban and rural education?
Language diversity: India has 22 constitutionally scheduled languages and hundreds of dialects. Why is building AI for Indian languages harder than building it for English? What challenges exist?
Ethical concerns: If an AI system trained on internet data learns biases (gender, caste, region), who is responsible? The programmer? The company? The data?
Jobs and AI: Some people say "AI will take all our jobs." Others say "AI will create new jobs we can't imagine." What do you think will happen in India over the next 10 years?
Creativity and AI: If an AI writes a poem in Hindi, is it "creative"? Can a machine that predicts the next word ever truly be creative, or is it just very sophisticated pattern matching?
Chapter Summary
- AI is the science of making computers do things that require human intelligence: recognising images, understanding language, making decisions, and learning from experience.
- AI has a rich history spanning over 70 years, from Turing's question in 1950 to ChatGPT taking the world by storm in 2022.
- There are three types of AI: Narrow AI (what we have today), General AI (still theoretical), and Super AI (science fiction for now).
- Language models work by predicting the next word/token. This simple idea, scaled up with massive data and clever architectures, produces the remarkable behaviour we see in ChatGPT, Gemini, and other models.
- This book will take you from zero to building your own chatbot, through 5 levels of increasing complexity, and you'll understand every step along the way.
- You don't need a PhD. Just Python basics and curiosity. Let's go!
What's Next?
In Chapter 2: Prediction with Patterns, you'll write your very first AI model in pure Python, with no external libraries. You'll teach a computer to learn patterns from text and generate new text, character by character.
It starts with the simplest possible question: "Given this character, what character is most likely to come next?"
Simple question. Surprisingly powerful answer. Let's build it!
"A journey of a thousand miles begins with a single step." - Lao Tzu
Your step starts in Chapter 2. Turn the page.
Building Blocks
From counting patterns to learning neural networks
Prediction with Patterns: Your First AI Model
"All models are wrong, but some are useful." - George Box
Learning Objectives
- Explain what a bigram is and why it's useful for text prediction
- Build a bigram model from scratch in pure Python, no libraries!
- Understand probability distributions and how they drive text generation
- Generate new text character by character using your model
- Recognise the limitations of bigram models and why more context matters
- Connect this simple model to how ChatGPT works at a fundamental level
2.1 The Simplest AI: Counting Patterns
Before we write any code, let's do an experiment with your own brain.
Complete these sentences:
"मेरा नाम ___ है" ("My name is ___")
"The capital of India is ___"
"Virat Kohli is a great ___"
How did you do that? You didn't "reason" through the answer. Your brain instantly predicted the most likely next word based on patterns you've seen thousands of times before. You've heard "मेरा नाम" followed by a name so often that the prediction is automatic.
Now here's the profound insight: This is exactly how AI language models work.
They look at the text so far, and predict what comes next. The only difference is scale:
- You've read maybe a few thousand books in your life
- ChatGPT was trained on billions of pages of text
But the core idea? Prediction based on patterns. And we're about to build the simplest version of this idea.
Note
Think About It
When you text a friend on WhatsApp and your keyboard suggests the next word, that's a prediction model! It learned from YOUR texting patterns. Our bigram model works on the same principle, just at the character level.
2.2 What is a Bigram?
A bigram is simply a pair of two consecutive characters (or words).
Let's look at the word "namaste":
n-a a-m m-a a-s s-t t-e
These are the bigrams: na, am, ma, as, st, te.
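Extracting those pairs takes one line of Python. The helper name char_bigrams is just an illustrative choice, not something from the project files.

```python
def char_bigrams(text):
    """Return every pair of consecutive characters in `text`."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(char_bigrams("namaste"))  # ['na', 'am', 'ma', 'as', 'st', 'te']
```

A word with n characters always yields n - 1 bigrams, which is why "namaste" (7 letters) gives 6 pairs.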
Now think about English. If I give you the letter 't', what letter do you think comes next?
- th: very common! ("the", "that", "this", "three")
- to: common ("to", "top", "today")
- tr: fairly common ("tree", "train", "true")
- tz: very rare in English!
You intuitively know that 'h' is much more likely to follow 't' than 'z' is. You know this because you've seen millions of English words. A bigram model learns the same thing, by counting!
Tip
The Chai Stall Analogy
Imagine you sit at a chai stall and count how many people order what after chai:
- 80 people ordered chai with samosa
- 15 people ordered chai with biscuit
- 5 people ordered chai with pakora
Now if someone orders chai, you'd predict: "They'll probably want a samosa!" That's exactly what a bigram model does: count what follows what, then predict.
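The chai stall tally translates directly into code. The sketch below uses the counts from the analogy; the variable names are only for illustration.

```python
from collections import Counter

# Orders observed alongside chai, using the numbers from the analogy.
orders_after_chai = Counter({"samosa": 80, "biscuit": 15, "pakora": 5})

# Turn raw tallies into probabilities by dividing by the total.
total = sum(orders_after_chai.values())
probabilities = {item: count / total for item, count in orders_after_chai.items()}

# The prediction is simply the most probable follower.
prediction = max(probabilities, key=probabilities.get)

print(probabilities)  # {'samosa': 0.8, 'biscuit': 0.15, 'pakora': 0.05}
print(prediction)     # samosa
```

Replace "chai" with a character and the snacks with the characters that follow it, and you have a bigram model.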
2.3 Building Your First Model ๐๏ธ
Now let's look at actual code! Our first script, step1_bigram.py, takes a piece of text and counts every pair of consecutive characters.
The Training Text
The code uses a paragraph about Indian scientific achievement as its training data:
Python
SAMPLE_TEXT = """India has a rich history of scientific achievement that spans thousands of years.
Ancient Indian mathematicians gave the world the concept of zero, which transformed
mathematics forever. Aryabhata, born in 476 CE, calculated the value of pi to four
decimal places and proposed that the earth rotates on its axis. Charaka and Sushruta
were pioneers of medicine and surgery in ancient India. The decimal number system that
the whole world uses today originated in India. Ramanujan, one of the greatest
mathematical geniuses, made extraordinary contributions to number theory and infinite
series. Chandrasekhar won the Nobel Prize for his work on the structure and evolution
of stars. ISRO, the Indian Space Research Organisation, successfully launched missions
to the Moon and Mars, making India one of the few nations to achieve this feat. From
the ancient universities of Nalanda and Takshashila to modern research institutions,
India has always been a land of knowledge and discovery."""
The Core Function: Counting Bigrams
This is the heart of our "AI model", and it's just counting! The function reads through the text and tallies how many times each character follows each other character:
Python
def build_bigram_counts(text):
    """
    Count how often each character follows each other character.
    This is the CORE of our "AI model"!

    Think of it like this:
    - We read the text one character at a time
    - For each pair of consecutive characters, we make a tally mark
    - At the end, we know exactly which characters tend to follow which
    """
    # A nested dictionary: counts[char_a][char_b] = number of times
    # char_b appeared right after char_a
    # Example: counts['t']['h'] = 42 means 'h' followed 't' 42 times
    counts = {}

    # We look at pairs: text[i] and text[i+1]
    # If text has 100 characters (indices 0-99), the last valid pair is (98, 99)
    # So we go from 0 to 98, which is range(99) = range(len(text) - 1)
    for i in range(len(text) - 1):
        current_char = text[i]       # The character we're looking at now
        next_char = text[i + 1]      # The character that comes right after it

        # If we haven't seen current_char before, create an empty dict for it
        if current_char not in counts:
            counts[current_char] = {}

        # If we haven't seen this specific pair before, start its count at 0
        if next_char not in counts[current_char]:
            counts[current_char][next_char] = 0

        # Add one to the count!
        # This is literally the entire "learning" process of our AI model.
        # That's it. Just counting.
        counts[current_char][next_char] += 1

    return counts
Let's walk through this with a tiny example. Suppose our text is "chai":
| Step | i | current_char | next_char | What happens |
|---|---|---|---|---|
| 1 | 0 | 'c' | 'h' | counts['c']['h'] = 1 |
| 2 | 1 | 'h' | 'a' | counts['h']['a'] = 1 |
| 3 | 2 | 'a' | 'i' | counts['a']['i'] = 1 |
After processing, our model "knows" that after 'c', 'h' appeared once; after 'h', 'a' appeared once, etc. With a longer text, these counts build up into meaningful patterns.
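To check the walkthrough yourself, here is a compact variant of the same counting idea. It is a sketch rather than the project's code: count_bigrams is a hypothetical name, and it uses defaultdict and zip instead of the explicit index loop.

```python
from collections import defaultdict

def count_bigrams(text):
    """Compact variant of build_bigram_counts using defaultdict."""
    counts = defaultdict(lambda: defaultdict(int))
    # zip(text, text[1:]) pairs each character with the one after it.
    for current_char, next_char in zip(text, text[1:]):
        counts[current_char][next_char] += 1
    # Convert to plain dicts so the result prints cleanly.
    return {char: dict(followers) for char, followers in counts.items()}

print(count_bigrams("chai"))
# {'c': {'h': 1}, 'h': {'a': 1}, 'a': {'i': 1}}
print(count_bigrams("namaste")["a"])
# {'m': 1, 's': 1}
```

Note that 'i' never appears as a key for "chai": it is the last character, so nothing ever follows it.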
Important
Key Insight: The entire "learning" process of this AI model is on one line: counts[current_char][next_char] += 1. That's it! Just counting. Yet this simple idea, learning patterns from data, is the foundation of ALL language models, including ChatGPT.
Converting Counts to Probabilities
Raw counts aren't enough. We need probabilities: "After seeing 't', what percentage of the time does 'h' come next?"
Python
def build_bigram_probabilities(counts):
    """
    Convert raw counts into probabilities.

    WHY probabilities instead of counts?
    Because we need to know: "After seeing 't', what PERCENTAGE of the time
    does 'h' come next?", not just "how many times."

    Example:
        If 't' is followed by 'h' 40 times, 'e' 10 times, and 'o' 5 times:
        Total = 55
        P('h' | 't') = 40/55 = 0.727 (72.7% of the time!)
        P('e' | 't') = 10/55 = 0.182 (18.2%)
        P('o' | 't') = 5/55 = 0.091 (9.1%)
    """
    probabilities = {}
    for char, next_chars in counts.items():
        # Total count of all characters that followed this character
        total = sum(next_chars.values())
        probabilities[char] = {}
        for next_char, count in next_chars.items():
            # P(next | current) = count(current, next) / count(current)
            probabilities[char][next_char] = count / total
    return probabilities
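The docstring's arithmetic is easy to verify. The sketch below normalises a single character's follower counts; normalise is a hypothetical helper that does for one row what build_bigram_probabilities does for the whole table.

```python
def normalise(next_counts):
    """Turn one character's follower counts into probabilities."""
    total = sum(next_counts.values())
    return {char: count / total for char, count in next_counts.items()}

after_t = normalise({"h": 40, "e": 10, "o": 5})
for char, prob in after_t.items():
    print(f"P({char!r} | 't') = {prob:.3f}")
# P('h' | 't') = 0.727
# P('e' | 't') = 0.182
# P('o' | 't') = 0.091

# The probabilities for one context always sum to 1 (up to rounding).
print(abs(sum(after_t.values()) - 1.0) < 1e-9)  # True
```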
2.4 Understanding Probability Distributions
Let's pause and understand a crucial concept: probability distributions.
The Dice Analogy
When you roll a fair die, each of the six faces has the same probability: 1/6, or about 16.7%.
This is a uniform distribution: all outcomes are equally likely.
The Loaded Dice
Now imagine a loaded die where 6 comes up more often than the other faces.
This is a non-uniform distribution. The probabilities still sum to 1, but some outcomes are more likely than others.
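Both dice can be written down as small probability tables. The fair die follows from the definition; the loaded die's weights below are purely illustrative assumptions (6 gets half the probability, the rest is shared equally), since the original figures are not reproduced here.

```python
from fractions import Fraction

# Fair die: six faces, equal probability each.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Loaded die: illustrative weights only (an assumption for this sketch).
# Faces 1-5 get 1/10 each, and 6 gets the remaining 1/2.
loaded_die = {face: Fraction(1, 10) for face in range(1, 6)}
loaded_die[6] = Fraction(1, 2)

for name, die in [("fair", fair_die), ("loaded", loaded_die)]:
    as_floats = {face: round(float(p), 3) for face, p in die.items()}
    print(name, as_floats, "sum =", sum(die.values()))
```

Using Fraction keeps the sums exact, so both tables add up to exactly 1 with no floating-point noise.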
Characters as Loaded Dice
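Our bigram model creates a loaded die for each character. For the character 't', the faces are the characters that can follow it. With the counts from the earlier example (40, 10 and 5 out of 55), 'h' takes about 73% of the die, 'e' about 18% and 'o' about 9%.
The formula is beautifully simple: P(next | current) = count(current, next) / count(current).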
Note
Think About It
Why do we need probabilities and not just pick the most common character every time? Because that would give us the SAME output every time! "th" → "the" → "the" → forever repeating. Probabilities + randomness = variety. Just like how you don't say the exact same sentences every day, even though you use the same language.
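You can watch the repetition problem happen. The sketch below compares always picking the most likely character ("greedy") with sampling; the three-character probability table is made up for the demonstration, and generate is a hypothetical helper.

```python
import random

# Hypothetical probability table for a three-character alphabet.
probs = {
    "t": {"h": 0.7, "e": 0.3},
    "h": {"e": 0.6, "t": 0.4},
    "e": {"t": 0.8, "h": 0.2},
}

def generate(start, length, greedy, rng):
    text = start
    for _ in range(length - 1):
        options = probs[text[-1]]
        if greedy:
            # Always take the single most likely follower.
            nxt = max(options, key=options.get)
        else:
            # Sample in proportion to the probabilities.
            nxt = rng.choices(list(options), weights=list(options.values()))[0]
        text += nxt
    return text

rng = random.Random(0)
print(generate("t", 12, greedy=True, rng=rng))  # thethethethe
print([generate("t", 12, greedy=False, rng=rng) for _ in range(3)])
```

The greedy run is stuck in the same loop no matter how often you call it, while the sampled runs differ from one another.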
2.5 Generating Text
Now for the exciting part: making the model generate new text! The code in step2_generate.py uses our probability table to create text character by character.
Weighted Random Choice
First, we need a function that picks a random character based on probabilities. This is like throwing a dart at a ruler where each character occupies space proportional to its probability:
Python
import random  # needed for random.random() below


def weighted_random_choice(probability_dict):
    """
    Choose a random character based on probability weights.

    Imagine a ruler from 0 to 1:
    |---'h'(0.45)---|--'e'(0.20)--|--' '(0.15)--|--'a'(0.10)--|-others-|
    0               0.45          0.65          0.80          0.90     1.0

    We throw a dart at a random point on this ruler.
    Characters with bigger sections are more likely to be hit!
    """
    # random.random() gives us a random float between 0 and 1
    # This is our "dart throw" on the probability ruler
    r = random.random()

    # Walk along the ruler, accumulating probabilities
    cumulative = 0.0
    for char, prob in probability_dict.items():
        cumulative += prob
        # If our random number falls within this character's section, pick it!
        if r <= cumulative:
            return char

    # Fallback for floating-point precision edge cases
    return list(probability_dict.keys())[-1]
Example walkthrough: Suppose after 't', the probabilities are:
- 'h': 0.45 (occupies 0.00 to 0.45 on the ruler)
- 'e': 0.20 (occupies 0.45 to 0.65)
- ' ': 0.15 (occupies 0.65 to 0.80)
- 'a': 0.10 (occupies 0.80 to 0.90)
- others: 0.10 (occupies 0.90 to 1.00)
If random.random() returns 0.37, we hit the 'h' section. If it returns 0.72, we hit the ' ' (space) section. Characters with higher probabilities have bigger sections, so they get picked more often!
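A quick experiment confirms that bigger sections really are hit more often. This sketch does not call the chapter's function. Instead it uses random.choices from the standard library, which performs the same kind of weighted draw, and lumps "others" into a placeholder character "?" (an assumption made for the demo).

```python
import random
from collections import Counter

# The example distribution from the walkthrough; "?" stands in for "others".
probs = {"h": 0.45, "e": 0.20, " ": 0.15, "a": 0.10, "?": 0.10}

rng = random.Random(42)  # fixed seed so the run is repeatable
draws = rng.choices(list(probs), weights=list(probs.values()), k=10_000)
observed = Counter(draws)

for char, p in probs.items():
    share = observed[char] / len(draws)
    print(f"{char!r}: expected {p:.2f}, observed {share:.3f}")
```

With 10,000 throws the observed shares land close to the expected probabilities, which is the "dart on a ruler" picture in action.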
The Generation Function
Now we chain these choices together to build text character by character:
Python
def generate_text(probabilities, start_char, length):
    """
    Generate text character by character using the bigram model.

    Algorithm:
    1. Start with a character
    2. Look up: "what characters can follow this one, and with what probability?"
    3. Randomly pick the next character (weighted by probability)
    4. Use THAT character as the new current character
    5. Repeat!
    """
    # Start with our seed character
    result = start_char
    current_char = start_char

    for _ in range(length - 1):
        # Check if we have data for this character
        if current_char not in probabilities:
            # Pick a random known character to continue
            current_char = random.choice(list(probabilities.keys()))

        # Use our weighted random choice to pick the next character
        next_char = weighted_random_choice(probabilities[current_char])

        # Add it to our result
        result += next_char

        # The next character becomes the current character
        # This is the "bigram" part: we only look at the LAST character!
        current_char = next_char

    return result
Step-by-Step Generation Example
Let's trace through generating text starting with 't':
| Step | Current Char | Options (top 3) | Random Choice | Text So Far |
|---|---|---|---|---|
| 1 | 't' | 'h'=73%, 'e'=18%, 'o'=9% | 'h' | "th" |
| 2 | 'h' | 'e'=45%, 'a'=25%, 'i'=15% | 'e' | "the" |
| 3 | 'e' | ' '=40%, 'r'=15%, 'n'=12% | ' ' | "the " |
| 4 | ' ' | 'a'=12%, 't'=10%, 'o'=9% | 'i' | "the i" |
| 5 | 'i' | 'n'=35%, 's'=15%, 't'=12% | 'n' | "the in" |
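Notice how the model captured real English patterns: "the" is a real word! In step 4, the sampled 'i' isn't even among the top three options, which is exactly what weighted randomness allows. But after a few characters, the model starts generating gibberish because it can only see ONE character of context.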
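Here is the whole pipeline (count, normalise, sample) condensed into one runnable sketch. It is not the project's code: train and generate are hypothetical names, the training sentence is made up, and at a dead end it simply stops instead of jumping to a random known character as generate_text does.

```python
import random
from collections import Counter, defaultdict

def train(text):
    """Count followers for each character, then normalise to probabilities."""
    counts = defaultdict(Counter)
    for current_char, next_char in zip(text, text[1:]):
        counts[current_char][next_char] += 1
    return {
        char: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for char, followers in counts.items()
    }

def generate(probabilities, start_char, length, rng):
    """Sample up to `length` characters, conditioning only on the previous one."""
    result = start_char
    for _ in range(length - 1):
        options = probabilities.get(result[-1])
        if not options:  # dead end: this character never had a follower
            break
        result += rng.choices(list(options), weights=list(options.values()))[0]
    return result

text = "the theory that the earth rotates on its axis"
model = train(text)
print(generate(model, "t", 40, random.Random(7)))
```

Every pair of neighbouring characters in the output occurs somewhere in the training sentence, yet the output as a whole is usually not a sentence. That is the one-character context window at work.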
Tip
Key Insight
This is the fundamental trade-off in language models: more context = better predictions. Our bigram model sees 1 character. A trigram sees 2. GPT-4 sees up to 128,000 tokens. That's why GPT-4 can write coherent essays and our bigram model cannot: same idea, vastly different context windows.
2.6 Discussion: Why Does This Work?
Our bigram model generates text that has some English-like qualities:
- Common pairs like "th", "he", "in", "an" appear frequently
- Spaces appear at reasonable intervals (because they follow common letters)
- Some short words occasionally form by chance
But it also produces mostly gibberish. Why?
Because the model has zero memory beyond the last character. When it sees a space, it doesn't know if the previous word was "the" or "elephant", so it treats them identically. It can't form words consistently, let alone sentences or paragraphs.
Important
The context window is everything. A bigram model has a context window of 1 character. Much of the progress in language models from here to ChatGPT is about making the context window larger and using it more effectively.
2.7 From Bigrams to N-grams
What if instead of looking at just the last character, we looked at the last TWO characters? That's a trigram model.
| Model | Context | Looks At | Example |
|---|---|---|---|
| Bigram | 1 char | 't' → predict next | Knows 'h' often follows 't' |
| Trigram | 2 chars | 'th' → predict next | Knows 'e' often follows 'th' |
| 4-gram | 3 chars | 'the' → predict next | Knows ' ' often follows 'the' |
| 5-gram | 4 chars | 'the ' → predict next | Knows common words starting after 'the ' |
More context = better predictions. But there's a problem: the number of possible combinations explodes. With 50 unique characters:
- Bigrams: 50 × 50 = 2,500 possible pairs
- Trigrams: 50³ = 125,000 possible triples
- 5-grams: 50⁵ = 312,500,000 possible combos!
You'd need enormous amounts of text to see enough of each combination. This is called the curse of dimensionality, and it's one reason why simple counting doesn't scale, and why we need neural networks (Chapter 3!).
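The gap between "possible" and "actually seen" is easy to measure. This sketch uses a short lowercase excerpt in the spirit of the training text (an arbitrary choice for the demo) and compares how many n-grams could exist with how many really occur.

```python
text = (
    "india has a rich history of scientific achievement "
    "that spans thousands of years"
)
alphabet_size = len(set(text))  # distinct characters in this excerpt

for n in range(2, 6):
    # Every combination of n characters that could exist in principle.
    possible = alphabet_size ** n
    # The distinct n-character windows that actually occur in the text.
    seen = len({text[i:i + n] for i in range(len(text) - n + 1)})
    print(f"{n}-grams: {possible:>12,} possible, {seen:>3} seen in {len(text)} characters")
```

The number seen can never exceed the length of the text, while the number possible multiplies by the alphabet size with every extra character of context.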
2.8 Key Concepts Summary
| Concept | Definition |
|---|---|
| Bigram | A pair of two consecutive characters (or words). The building block of our first model. |
| Probability Distribution | A set of probabilities that tells us how likely each outcome is. All probabilities sum to 1. |
| Sampling | Randomly choosing an outcome based on a probability distribution. This is how we generate text. |
| Tokenization | Breaking text into units (characters, words, or subwords). Our model uses character-level tokenization. |
| Context Window | How much previous text the model can "see" when making a prediction. Bigram = 1 character. |
2.9 Exercises
Trigram Model: Modify build_bigram_counts to look at the last TWO characters instead of one. How does the generated text quality change?
Train on Hindi: Find a short Hindi text (try a Kabir doha or a Premchand excerpt) and train the bigram model on it. What patterns does it learn? Does it generate recognisable Hindi syllables?
Visualise Frequencies: Use matplotlib to create a bar chart of the 20 most common bigrams in the training text. Which patterns dominate?
Compare Models: Train two separate bigram models, one on the Indian science text and one on a cricket commentary. Compare the generated text. How are they different?
Tiny Data Experiment: What happens if you train the model on just 10 characters of text? What about 50? At what point does the model start generating somewhat reasonable output?
2.10 Discussion Questions
Why can't bigrams write essays? What fundamental capability is missing? (Hint: think about what "understanding" means.)
Counting vs learning: Our model doesn't "learn" anything, it just counts. Is counting a form of learning? Where does counting end and real learning begin?
The vocabulary problem: Our model works on individual characters. What would change if we used whole words instead? What are the advantages and disadvantages?
Indian languages: Hindi uses Devanagari script, which has different character patterns than English. How would a bigram model trained on Hindi differ from one trained on English? What about Tamil or Bengali?
Scale and quality: If you had a bigram model trained on ALL the text on the internet, would it generate good text? Why or why not?
Chapter Summary
- You learned that prediction based on patterns is the fundamental idea behind all language models, from our simple bigram to ChatGPT.
- You built a bigram model that counts character pairs and converts them into probabilities. The entire "learning" is just counting: counts[current][next] += 1.
- You understood probability distributions: how counts become probabilities, and how we sample from them to generate text.
- You generated new text character by character, watching the model pick each character based on what it saw before.
- You recognised the limitations: a context window of 1 character means the model can capture some local patterns (like "th") but can't form coherent words or sentences.
- You saw the path forward: more context = better predictions, but counting doesn't scale. We need something smarter.
What's Next?
Our bigram model has a fundamental problem: it can only count what it's explicitly seen. It can't generalise. It can't figure out that "th" and "sh" have something in common (both are followed by vowels).
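The "can only count what it has seen" problem shows up as hard zeros. In this sketch (toy text and the helper name probability are assumptions for illustration), any pair missing from the training text gets probability 0, however plausible it is in real English.

```python
from collections import Counter, defaultdict

text = "the ship then shut"
counts = defaultdict(Counter)
for current_char, next_char in zip(text, text[1:]):
    counts[current_char][next_char] += 1

def probability(current_char, next_char):
    """P(next | current) from raw counts; 0.0 for anything never observed."""
    followers = counts.get(current_char, Counter())
    total = sum(followers.values())
    return followers[next_char] / total if total else 0.0

print(probability("h", "e"))  # 0.5 ('he' appears in "the" and "then")
print(probability("h", "a"))  # 0.0 ('ha' never appears in this text)
```

A counting model has no way to guess that 'a' after 'h' is reasonable just because 'e', 'i' and 'u' were. Generalising from similar cases is what the next chapter's neural networks add.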
What if, instead of counting, the computer could learn these patterns by itself? What if it could discover relationships we never told it about?
That's exactly what neural networks do. In Chapter 3: Neural Networks from Scratch, you'll build a single neuron, then connect many neurons into a network, and teach it to learn patterns through backpropagation, the algorithm used to train almost every modern neural network.
Get ready: things are about to get really interesting! 🧠
"I hear and I forget. I see and I remember. I do and I understand." (commonly attributed to Confucius)
You just DID it. You built an AI model. Now you understand.
Complete Source Code - Chapter 2
Below are the complete, runnable source files for this chapter. Every line is included.
Complete Code: step1_bigram.py
Python
#!/usr/bin/env python3
"""
================================================================
 LEVEL 1 - STEP 1: BUILDING A BIGRAM MODEL

 What we're doing:
   1. Take a piece of text
   2. Count which characters appear after which
   3. Build a probability table (our first "AI model"!)

 NO LIBRARIES NEEDED - Pure Python only!
================================================================
"""
# ============================================================================
# WHY: We don't use ANY external libraries in Level 1.
# This proves that AI concepts are simple enough to build from scratch.
# Python's built-in features are all we need!
# ============================================================================
import os
import sys
# ============================================================================
# ANSI COLOR CODES
# ============================================================================
# WHY: We use ANSI escape codes to make terminal output colorful and readable.
# These are special character sequences that terminals interpret as formatting.
# The format is: \033[CODEm where CODE controls the color/style.
# \033 is the "escape" character: it tells the terminal "what follows is a command"
# ============================================================================
class Colors:
    """ANSI color codes for beautiful terminal output."""
    # Text colors
    RED = '\033[91m'       # Bright red: for errors or important warnings
    GREEN = '\033[92m'     # Bright green: for success messages
    YELLOW = '\033[93m'    # Bright yellow: for highlights and emphasis
    BLUE = '\033[94m'      # Bright blue: for headers and section titles
    MAGENTA = '\033[95m'   # Bright magenta: for special information
    CYAN = '\033[96m'      # Bright cyan: for data and values
    WHITE = '\033[97m'     # Bright white: for regular text
    # Text styles
    BOLD = '\033[1m'       # Bold text: makes text thicker/heavier
    DIM = '\033[2m'        # Dim text: makes text lighter/faded
    UNDERLINE = '\033[4m'  # Underlined text
    # Reset. IMPORTANT: always reset after coloring, or the whole terminal stays colored!
    RESET = '\033[0m'
    # Background colors
    BG_GREEN = '\033[42m'    # Green background
    BG_BLUE = '\033[44m'     # Blue background
    BG_MAGENTA = '\033[45m'  # Magenta background
def print_header():
    """Print a beautiful header for this script."""
    print(f"\n{Colors.BOLD}{Colors.CYAN}{'═' * 64}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.YELLOW} 🧠 LEVEL 1 - STEP 1: BUILDING A BIGRAM MODEL{Colors.RESET}")
    print(f"{Colors.DIM}{Colors.WHITE} Your first step into the world of AI!{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.CYAN}{'═' * 64}{Colors.RESET}\n")

def print_section(emoji, title, description=""):
    """Print a formatted section header."""
    print(f"\n{Colors.BOLD}{Colors.BLUE}{'─' * 60}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.GREEN} {emoji} {title}{Colors.RESET}")
    if description:
        print(f"{Colors.DIM}{Colors.WHITE} {description}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.BLUE}{'─' * 60}{Colors.RESET}\n")

def print_footer():
    """Print a beautiful footer for this script."""
    print(f"\n{Colors.BOLD}{Colors.CYAN}{'═' * 64}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.GREEN} ✅ Step 1 Complete! Your bigram model is ready.{Colors.RESET}")
    print(f"{Colors.DIM}{Colors.WHITE} Next: Run step2_generate.py to generate text!{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.CYAN}{'═' * 64}{Colors.RESET}\n")
# ============================================================================
# THE SAMPLE TEXT
# ============================================================================
# WHY: We need a piece of text to analyze. We embed it directly in the code
# so this script runs without needing any external files.
# We chose a paragraph about Indian science & history โ something meaningful
# and interesting to read while learning!
# ============================================================================
SAMPLE_TEXT = """India has a rich history of scientific achievement that spans thousands of years.
Ancient Indian mathematicians gave the world the concept of zero, which transformed
mathematics forever. Aryabhata, born in 476 CE, calculated the value of pi to four
decimal places and proposed that the earth rotates on its axis. Charaka and Sushruta
were pioneers of medicine and surgery in ancient India. The decimal number system that
the whole world uses today originated in India. Ramanujan, one of the greatest
mathematical geniuses, made extraordinary contributions to number theory and infinite
series. Chandrasekhar won the Nobel Prize for his work on the structure and evolution
of stars. ISRO, the Indian Space Research Organisation, successfully launched missions
to the Moon and Mars, making India one of the few nations to achieve this feat. From
the ancient universities of Nalanda and Takshashila to modern research institutions,
India has always been a land of knowledge and discovery."""
def build_bigram_counts(text):
    """
    Count how often each character follows each other character.
    This is the CORE of our "AI model"!

    Think of it like this:
    - We read the text one character at a time
    - For each pair of consecutive characters, we make a tally mark
    - At the end, we know exactly which characters tend to follow which

    Args:
        text (str): The input text to analyze
    Returns:
        dict: A nested dictionary where counts[char_a][char_b] = number of times
              char_b appeared right after char_a
    """
    # WHY a nested dictionary?
    # We want to look up: "Given character A, how many times did character B follow?"
    # A nested dict lets us do: counts['t']['h'] -> 42 (meaning 'h' followed 't' 42 times)
    counts = {}
    # WHY range(len(text) - 1)?
    # Because we look at pairs: text[i] and text[i+1]
    # If text has 100 characters (indices 0-99), the last valid pair is (98, 99)
    # So we go from 0 to 98, which is range(99) = range(len(text) - 1)
    for i in range(len(text) - 1):
        current_char = text[i]     # The character we're looking at now
        next_char = text[i + 1]    # The character that comes right after it
        # If we haven't seen current_char before, create an empty dict for it
        # WHY: We can't add to a dict that doesn't exist yet
        if current_char not in counts:
            counts[current_char] = {}
        # If we haven't seen this specific pair before, start its count at 0
        if next_char not in counts[current_char]:
            counts[current_char][next_char] = 0
        # Add one to the count!
        # This is literally the entire "learning" process of our AI model.
        # That's it. Just counting.
        counts[current_char][next_char] += 1
    return counts
def build_bigram_probabilities(counts):
    """
    Convert raw counts into probabilities.

    WHY probabilities instead of counts?
    Because we need to know: "After seeing 't', what PERCENTAGE of the time
    does 'h' come next?", not just "how many times."

    Example:
        If 't' is followed by 'h' 40 times, 'e' 10 times, and 'o' 5 times:
        Total = 55
        P('h' | 't') = 40/55 = 0.727 (72.7% of the time!)
        P('e' | 't') = 10/55 = 0.182 (18.2%)
        P('o' | 't') = 5/55 = 0.091 (9.1%)

    Args:
        counts (dict): The bigram counts from build_bigram_counts()
    Returns:
        dict: A nested dictionary where probs[char_a][char_b] = probability
              that char_b follows char_a
    """
    probabilities = {}
    for char, next_chars in counts.items():
        # WHY sum all counts?
        # To calculate probability, we need: (count of this pair) / (total of all pairs starting with this char)
        total = sum(next_chars.values())
        probabilities[char] = {}
        for next_char, count in next_chars.items():
            # This is a conditional probability estimated by relative frequency
            # (it is not Bayes' theorem): P(next | current) = count(current, next) / count(current)
            probabilities[char][next_char] = count / total
    return probabilities
def display_text_info(text):
    """Display information about the sample text."""
    # Count unique characters
    unique_chars = sorted(set(text))
    num_unique = len(unique_chars)
    total_chars = len(text)
    total_bigrams = total_chars - 1  # WHY -1? A text of N characters has N - 1 consecutive pairs
    print(f" {Colors.CYAN}📏 Text length:{Colors.RESET} {Colors.BOLD}{total_chars}{Colors.RESET} characters")
    print(f" {Colors.CYAN}🔤 Unique characters:{Colors.RESET} {Colors.BOLD}{num_unique}{Colors.RESET}")
    print(f" {Colors.CYAN}🔗 Total bigrams:{Colors.RESET} {Colors.BOLD}{total_bigrams}{Colors.RESET}")
    print()
    # Show the unique characters in a nice format
    print(f" {Colors.YELLOW}Characters found:{Colors.RESET}")
    # WHY do we display characters? So the student can see exactly what the model will learn from
    line = " "
    for ch in unique_chars:
        # Replace invisible characters with readable names
        if ch == ' ':
            display = '␣'   # Space symbol
        elif ch == '\n':
            display = '↵'   # Newline symbol
        elif ch == '\t':
            display = '→'   # Tab symbol
        else:
            display = ch
        line += f" {Colors.GREEN}[{display}]{Colors.RESET}"
        if len(line) > 200:  # Prevent super-long lines in terminal
            print(line)
            line = " "
    if line.strip():
        print(line)
def display_top_bigrams(counts, top_n=25):
    """
    Display the most frequent bigrams in a beautiful table.

    WHY show this?
    This helps students SEE the patterns that the model is learning.
    When they see that 'th' appears 30 times, they'll understand why
    the model generates 'th' so often!
    """
    # Flatten the nested dict into a list of (char_a, char_b, count) tuples
    # WHY flatten? So we can sort ALL bigrams together by count
    all_bigrams = []
    for char_a, next_chars in counts.items():
        for char_b, count in next_chars.items():
            all_bigrams.append((char_a, char_b, count))
    # Sort by count, highest first
    # WHY key=lambda x: x[2]? Because x[2] is the count (third element of tuple)
    # WHY reverse=True? We want the MOST common bigrams first
    all_bigrams.sort(key=lambda x: x[2], reverse=True)
    # Print table header
    print(f" {Colors.BOLD}{Colors.WHITE}{'Rank':<6} {'Bigram':<12} {'Visual':<16} {'Count':<8} {'Bar'}{Colors.RESET}")
    print(f" {Colors.DIM}{'─' * 56}{Colors.RESET}")
    # WHY do we only show top_n? Because there could be hundreds of bigrams,
    # and showing all of them would be overwhelming. The top ones tell the story.
    max_count = all_bigrams[0][2] if all_bigrams else 1
    for rank, (char_a, char_b, count) in enumerate(all_bigrams[:top_n], 1):
        # Make invisible characters readable
        display_a = '␣' if char_a == ' ' else ('↵' if char_a == '\n' else char_a)
        display_b = '␣' if char_b == ' ' else ('↵' if char_b == '\n' else char_b)
        # Create a visual bar with length proportional to count
        # WHY? Visual bars make it MUCH easier to compare frequencies at a glance
        bar_length = int((count / max_count) * 20)
        bar = '█' * bar_length
        # Color the top 5 differently to highlight them
        if rank <= 5:
            color = Colors.YELLOW
        elif rank <= 10:
            color = Colors.CYAN
        else:
            color = Colors.WHITE
        bigram_str = f"'{display_a}{display_b}'"
        arrow_str = f"'{display_a}' → '{display_b}'"
        print(f" {color}{rank:<6} {bigram_str:<12} {arrow_str:<16} {count:<8} {Colors.GREEN}{bar}{Colors.RESET}")
    print(f"\n {Colors.DIM}(Showing top {top_n} of {len(all_bigrams)} unique bigrams){Colors.RESET}")
def display_character_analysis(counts):
    """
    For a few interesting characters, show what typically follows them.

    WHY?
    This helps students build intuition about language patterns.
    They can see, for example, how often 'h' follows 't' in this text,
    just like in real English!
    """
    # Pick some interesting characters to analyze
    interesting = ['t', 'a', 'e', ' ', 'i', 's']
    for char in interesting:
        if char not in counts:
            continue
        next_chars = counts[char]
        total = sum(next_chars.values())
        # Sort followers by count
        sorted_followers = sorted(next_chars.items(), key=lambda x: x[1], reverse=True)
        display_char = '␣ (space)' if char == ' ' else f"'{char}'"
        print(f" {Colors.BOLD}{Colors.YELLOW}After {display_char}:{Colors.RESET} ", end="")
        # Show top 5 followers
        parts = []
        for next_char, count in sorted_followers[:5]:
            pct = (count / total) * 100
            display_next = '␣' if next_char == ' ' else ('↵' if next_char == '\n' else next_char)
            parts.append(f"{Colors.CYAN}'{display_next}'{Colors.RESET}={Colors.GREEN}{pct:.0f}%{Colors.RESET}")
        print(" ".join(parts))
def display_probability_matrix(counts, top_chars=10):
    """
    Display a small probability matrix: a grid showing character relationships.

    WHY a matrix?
    This is how we VISUALIZE the model's "knowledge." Each cell shows how likely
    character B is to follow character A. This is literally the model's brain!
    """
    # Find the most common characters overall
    # WHY? We can't show ALL characters (too many), so we pick the most frequent ones
    char_frequency = {}
    for char_a, next_chars in counts.items():
        for char_b, count in next_chars.items():
            char_frequency[char_a] = char_frequency.get(char_a, 0) + count
            char_frequency[char_b] = char_frequency.get(char_b, 0) + count
    # Get top characters by frequency
    top = sorted(char_frequency.items(), key=lambda x: x[1], reverse=True)[:top_chars]
    top_char_list = [ch for ch, _ in top]
    # Print header row
    header = f" {Colors.BOLD}{Colors.WHITE}{'':>6}"
    for ch in top_char_list:
        display = '␣' if ch == ' ' else ('↵' if ch == '\n' else ch)
        header += f" {Colors.CYAN}{display:>5}{Colors.RESET}"
    print(header)
    print(f" {'':>6}{Colors.DIM}{'─' * (6 * len(top_char_list))}{Colors.RESET}")
    # Print each row
    for ch_a in top_char_list:
        display_a = '␣' if ch_a == ' ' else ('↵' if ch_a == '\n' else ch_a)
        row = f" {Colors.CYAN}{display_a:>5}{Colors.RESET} │"
        for ch_b in top_char_list:
            if ch_a in counts and ch_b in counts[ch_a]:
                total = sum(counts[ch_a].values())
                prob = counts[ch_a][ch_b] / total
                # Color-code by probability
                if prob > 0.3:
                    color = Colors.GREEN
                elif prob > 0.1:
                    color = Colors.YELLOW
                else:
                    color = Colors.DIM
                row += f" {color}{prob:>4.0%}{Colors.RESET}"
            else:
                row += f" {Colors.DIM}{' · ':>5}{Colors.RESET}"
        print(row)
    print(f"\n {Colors.DIM}(Cells show: probability that column character follows row character){Colors.RESET}")
# ============================================================================
# MAIN EXECUTION
# ============================================================================
# WHY if __name__ == '__main__'?
# This is a Python convention that means: "Only run this code if this file
# is being executed directly, NOT if it's being imported by another file."
# This matters if another script ever imports these functions. (step2_generate.py
# rebuilds the model itself, so it runs without importing this file.)
# ============================================================================
if __name__ == '__main__':
    # ── Print the header ──
    print_header()
    # ── Step 1: Show the text we'll analyze ──
    print_section("📖", "Step 1: Reading the text...",
                  "This is the text our AI will learn from")
    # Show a preview of the text (first 200 chars)
    preview = SAMPLE_TEXT[:200].replace('\n', ' ')
    print(f" {Colors.WHITE}\"{preview}...\"{Colors.RESET}")
    print()
    # Show text statistics
    display_text_info(SAMPLE_TEXT)
    # ── Step 2: Count bigram frequencies ──
    print_section("🔢", "Step 2: Counting bigram patterns...",
                  "For every character, we count what comes after it")
    # WHY lowercase? To treat 'T' and 't' as the same character.
    # Otherwise 'The' and 'the' would create different patterns, splitting our data.
    text_lower = SAMPLE_TEXT.lower()
    # This is where the magic happens! 🪄
    print(f" {Colors.YELLOW}⏳ Counting patterns...{Colors.RESET}", end=" ", flush=True)
    bigram_counts = build_bigram_counts(text_lower)
    print(f"{Colors.GREEN}Done!{Colors.RESET}")
    # How many unique patterns did we find?
    total_unique = sum(len(v) for v in bigram_counts.values())
    print(f" {Colors.CYAN}Found {Colors.BOLD}{total_unique}{Colors.RESET}{Colors.CYAN} unique bigram patterns!{Colors.RESET}")
    # ── Step 3: Display the results ──
    print_section("📊", "Step 3: Top Bigram Patterns",
                  "The most common character pairs in our text")
    display_top_bigrams(bigram_counts, top_n=25)
    # ── Step 4: Character analysis ──
    print_section("🔬", "Step 4: Character Analysis",
                  "What typically follows each character?")
    display_character_analysis(bigram_counts)
    # ── Step 5: Probability matrix ──
    print_section("🧮", "Step 5: The Probability Matrix (Our AI's Brain!)",
                  "Each cell = probability that column follows row")
    display_probability_matrix(bigram_counts, top_chars=10)
    # ── Step 6: Build probabilities ──
    print_section("📈", "Step 6: Converting Counts to Probabilities",
                  "Probabilities let us make weighted random choices")
    probs = build_bigram_probabilities(bigram_counts)
    # Show a couple of examples
    example_chars = ['t', 'a', 'i']
    for ch in example_chars:
        if ch in probs:
            sorted_probs = sorted(probs[ch].items(), key=lambda x: x[1], reverse=True)
            print(f" {Colors.BOLD}After '{ch}':{Colors.RESET}")
            for next_ch, prob in sorted_probs[:5]:
                display_next = '␣' if next_ch == ' ' else ('↵' if next_ch == '\n' else next_ch)
                bar = '█' * int(prob * 30)
                print(f" '{display_next}': {Colors.CYAN}{prob:.1%}{Colors.RESET} {Colors.GREEN}{bar}{Colors.RESET}")
            remaining = len(sorted_probs) - 5
            if remaining > 0:
                print(f" {Colors.DIM}... and {remaining} more possible characters{Colors.RESET}")
            print()
    # ── Summary ──
    print_section("🎓", "What You Just Built!",
                  "Let's recap what happened")
    print(f""" {Colors.WHITE}1. You took a piece of text ({len(SAMPLE_TEXT)} characters){Colors.RESET}
 {Colors.WHITE}2. You counted every pair of consecutive characters{Colors.RESET}
 {Colors.WHITE}3. You converted those counts into probabilities{Colors.RESET}
 {Colors.WHITE}4. You now have a PROBABILITY TABLE: this IS the model!{Colors.RESET}
 {Colors.BOLD}{Colors.YELLOW}🧠 This probability table is your AI's "brain."{Colors.RESET}
 {Colors.WHITE}It "knows" that after 'e', a space is most common.{Colors.RESET}
 {Colors.WHITE}It "knows" that after 't', 'h' appears often.{Colors.RESET}
 {Colors.WHITE}It "knows" some character-level patterns of English!{Colors.RESET}
 {Colors.MAGENTA}➡️ Now run step2_generate.py to generate text using this model!{Colors.RESET}""")
    # Print footer
    print_footer()
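The explicit "if not in dict" checks in build_bigram_counts() are there to make every step visible. For comparison, here is a more compact sketch of the same tally using the standard library's `collections` module, plus a quick sanity check that every row of probabilities adds up to 1. The names `build_bigram_counts_compact` and `rows_sum_to_one` are hypothetical helpers introduced here; they are not part of step1_bigram.py.

```python
from collections import Counter, defaultdict

def build_bigram_counts_compact(text):
    """Tally character pairs; Counter starts every missing pair at 0 for us."""
    counts = defaultdict(Counter)
    # zip(text, text[1:]) yields each pair of consecutive characters
    for current_char, next_char in zip(text, text[1:]):
        counts[current_char][next_char] += 1
    return counts

def rows_sum_to_one(counts, tolerance=1e-9):
    """Check that each character's follower probabilities add up to 1."""
    for next_chars in counts.values():
        total = sum(next_chars.values())
        row_sum = sum(count / total for count in next_chars.values())
        if abs(row_sum - 1.0) > tolerance:
            return False
    return True

demo_counts = build_bigram_counts_compact("that thing")
print(dict(demo_counts["t"]))          # followers of 't' with their counts
print(rows_sum_to_one(demo_counts))    # every row should be a valid distribution
```

Both versions produce the same numbers; the chapter's longer version simply spells out what `defaultdict` and `Counter` do behind the scenes.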
Complete Code: step2_generate.py
Python
#!/usr/bin/env python3
"""
================================================================
 LEVEL 1 - STEP 2: GENERATING TEXT WITH BIGRAMS

 What we're doing:
   1. Rebuild our bigram model (so this script runs alone)
   2. Use probability distributions to pick next characters
   3. Generate brand new text, character by character!

 NO LIBRARIES NEEDED - Pure Python only!
================================================================
"""
# ============================================================================
# WHY: This script does NOT import from step1_bigram.py. It redefines
# everything it needs (colors, sample text, model building), so it runs
# on its own even if step1_bigram.py is missing or not importable.
# ============================================================================
import os
import sys
import random  # WHY: We need random number generation to sample from probabilities
import time    # WHY: We use time.sleep() to create a dramatic text generation effect
# ============================================================================
# ANSI COLOR CODES (same as step1, duplicated so this script is self-contained)
# ============================================================================
class Colors:
    """ANSI color codes for beautiful terminal output."""
    RED = '\033[91m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    BLUE = '\033[94m'
    MAGENTA = '\033[95m'
    CYAN = '\033[96m'
    WHITE = '\033[97m'
    BOLD = '\033[1m'
    DIM = '\033[2m'
    UNDERLINE = '\033[4m'
    RESET = '\033[0m'
    BG_GREEN = '\033[42m'
    BG_BLUE = '\033[44m'
    BG_YELLOW = '\033[43m'

def print_header():
    """Print a beautiful header for this script."""
    print(f"\n{Colors.BOLD}{Colors.MAGENTA}{'═' * 64}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.YELLOW} 🚀 LEVEL 1 - STEP 2: GENERATING TEXT WITH BIGRAMS{Colors.RESET}")
    print(f"{Colors.DIM}{Colors.WHITE} Watch your AI create text character by character!{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.MAGENTA}{'═' * 64}{Colors.RESET}\n")

def print_section(emoji, title, description=""):
    """Print a formatted section header."""
    print(f"\n{Colors.BOLD}{Colors.BLUE}{'─' * 60}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.GREEN} {emoji} {title}{Colors.RESET}")
    if description:
        print(f"{Colors.DIM}{Colors.WHITE} {description}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.BLUE}{'─' * 60}{Colors.RESET}\n")

def print_footer():
    """Print a beautiful footer for this script."""
    print(f"\n{Colors.BOLD}{Colors.MAGENTA}{'═' * 64}{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.GREEN} 🎉 Congratulations! You just built your first AI model!{Colors.RESET}")
    print(f"{Colors.DIM}{Colors.WHITE} It's simple, but the core idea is the same as ChatGPT.{Colors.RESET}")
    print(f"{Colors.DIM}{Colors.WHITE} Next: Level 2 - Neural Networks!{Colors.RESET}")
    print(f"{Colors.BOLD}{Colors.MAGENTA}{'═' * 64}{Colors.RESET}\n")
# ============================================================================
# THE SAME SAMPLE TEXT FROM STEP 1
# ============================================================================
# WHY duplicate the text? So this script is 100% self-contained and runnable
# even if step1_bigram.py isn't in the same directory.
# ============================================================================
SAMPLE_TEXT = """India has a rich history of scientific achievement that spans thousands of years.
Ancient Indian mathematicians gave the world the concept of zero, which transformed
mathematics forever. Aryabhata, born in 476 CE, calculated the value of pi to four
decimal places and proposed that the earth rotates on its axis. Charaka and Sushruta
were pioneers of medicine and surgery in ancient India. The decimal number system that
the whole world uses today originated in India. Ramanujan, one of the greatest
mathematical geniuses, made extraordinary contributions to number theory and infinite
series. Chandrasekhar won the Nobel Prize for his work on the structure and evolution
of stars. ISRO, the Indian Space Research Organisation, successfully launched missions
to the Moon and Mars, making India one of the few nations to achieve this feat. From
the ancient universities of Nalanda and Takshashila to modern research institutions,
India has always been a land of knowledge and discovery."""
def build_bigram_model(text):
    """
    Build a complete bigram model: counts + probabilities.

    WHY rebuild instead of importing?
    While we COULD import from step1, rebuilding here makes this script
    completely self-contained. A student can run this file alone without
    worrying about imports or file paths.

    Args:
        text (str): Input text to learn from
    Returns:
        tuple: (counts dict, probabilities dict)
    """
    # Step A: Count bigrams (same logic as step1)
    counts = {}
    for i in range(len(text) - 1):
        current = text[i]
        next_ch = text[i + 1]
        if current not in counts:
            counts[current] = {}
        if next_ch not in counts[current]:
            counts[current][next_ch] = 0
        counts[current][next_ch] += 1
    # Step B: Convert to probabilities
    probabilities = {}
    for char, next_chars in counts.items():
        total = sum(next_chars.values())
        probabilities[char] = {}
        for next_char, count in next_chars.items():
            probabilities[char][next_char] = count / total
    return counts, probabilities
def weighted_random_choice(probability_dict):
    """
    Choose a random character based on probability weights.

    WHY not just random.choice()?
    random.choice() picks uniformly: every option is equally likely.
    But we want WEIGHTED randomness: 'h' after 't' should be picked more often
    than 'z' after 't', because 'th' is way more common than 'tz'!

    HOW IT WORKS:
    Imagine a ruler from 0 to 1:
    |---'h'(0.45)---|--'e'(0.20)--|--' '(0.15)--|--'a'(0.10)--|-others-|
    0               0.45          0.65          0.80          0.90     1.0
    We throw a dart at a random point on this ruler.
    Characters with bigger sections are more likely to be hit! 🎯

    Args:
        probability_dict (dict): {character: probability}
    Returns:
        str: The randomly chosen character
    """
    # WHY random.random()? It gives us a random float in the range [0.0, 1.0).
    # This is our "dart throw" on the probability ruler.
    r = random.random()
    # Walk along the ruler, accumulating probabilities
    cumulative = 0.0
    for char, prob in probability_dict.items():
        cumulative += prob
        # If our random number falls within this character's section, pick it!
        # WHY "<" and not "<="? r can be exactly 0.0 but never 1.0, so a strict
        # comparison gives each character exactly its share of the ruler.
        if r < cumulative:
            return char
    # WHY this fallback? Due to floating-point precision, cumulative might not
    # reach exactly 1.0. If we somehow get past all entries, return the last one.
    return list(probability_dict.keys())[-1]
def generate_text(probabilities, start_char, length):
    """
    Generate text character by character using the bigram model.
    This is the moment of truth! 🎬

    Algorithm:
    1. Start with a character
    2. Look up: "what characters can follow this one, and with what probability?"
    3. Randomly pick the next character (weighted by probability)
    4. Use THAT character as the new current character
    5. Repeat!

    Args:
        probabilities (dict): Our bigram probability model
        start_char (str): The first character to start with
        length (int): How many characters to generate
    Returns:
        str: The generated text
    """
    # Start with our seed character
    result = start_char
    current_char = start_char
    for _ in range(length - 1):
        # Check if we have data for this character
        # WHY might we not? If a character only appears at the END of the text,
        # we never saw what follows it, so it's not in our probability table
        if current_char not in probabilities:
            # Pick a random known character to continue
            current_char = random.choice(list(probabilities.keys()))
        # Use our weighted random choice to pick the next character
        next_char = weighted_random_choice(probabilities[current_char])
        # Add it to our result
        result += next_char
        # The next character becomes the current character for the next iteration
        # This is the "bigram" part: we only look at the LAST character!
        current_char = next_char
    return result
def generate_text_animated(probabilities, start_char, length, delay=0.03):
    """
    Generate text with an animated display: watch it appear character by character!

    WHY animation?
    It helps students FEEL how the model works: each character is chosen
    one at a time, based only on the previous character. The slight delay
    makes each decision visible and tangible.

    Args:
        probabilities (dict): Our bigram probability model
        start_char (str): The first character to start with
        length (int): How many characters to generate
        delay (float): Seconds between characters (for dramatic effect!)
    Returns:
        str: The generated text
    """
    result = start_char
    current_char = start_char
    # Print the first character
    sys.stdout.write(f" {Colors.GREEN}")
    sys.stdout.write(start_char)
    sys.stdout.flush()
    for i in range(length - 1):
        if current_char not in probabilities:
            current_char = random.choice(list(probabilities.keys()))
        next_char = weighted_random_choice(probabilities[current_char])
        result += next_char
        # Print each character with a tiny delay for dramatic effect ✨
        sys.stdout.write(next_char)
        sys.stdout.flush()
        time.sleep(delay)
        current_char = next_char
    sys.stdout.write(f"{Colors.RESET}\n")
    return result
def show_generation_process(probabilities, start_char='t', steps=10):
    """
    Show the step-by-step decision process of text generation.

    WHY show this?
    This is the most important educational part! Students can see EXACTLY
    how the model "thinks": what options it considers and which one it picks.
    """
    current = start_char
    generated = start_char
    print(f" {Colors.BOLD}Starting character: '{start_char}'{Colors.RESET}")
    print()
    for step in range(steps):
        if current not in probabilities:
            current = random.choice(list(probabilities.keys()))
        # Get the options, most likely first
        options = sorted(probabilities[current].items(), key=lambda x: x[1], reverse=True)
        # Show the decision process
        display_current = '␣' if current == ' ' else ('↵' if current == '\n' else current)
        print(f" {Colors.YELLOW}Step {step + 1}:{Colors.RESET} Current char = '{Colors.CYAN}{display_current}{Colors.RESET}'")
        # Show top 4 options
        print(f" {Colors.DIM}Options:", end="")
        for ch, p in options[:4]:
            display = '␣' if ch == ' ' else ('↵' if ch == '\n' else ch)
            print(f" '{display}'={p:.0%}", end="")
        if len(options) > 4:
            print(f" ...+{len(options)-4} more", end="")
        print(f"{Colors.RESET}")
        # Make the choice
        next_char = weighted_random_choice(probabilities[current])
        display_next = '␣' if next_char == ' ' else ('↵' if next_char == '\n' else next_char)
        print(f" {Colors.GREEN}→ Chose: '{display_next}'{Colors.RESET}")
        generated += next_char
        current = next_char
        # Show the text so far
        print(f" {Colors.DIM}Text so far: \"{generated}\"{Colors.RESET}")
        print()
    return generated
# ============================================================================
# MAIN EXECUTION
# ============================================================================
if __name__ == '__main__':
# Set random seed for reproducibility (optional โ remove for truly random output)
# WHY a seed? So students get consistent results when first running the code.
# They can remove this line later to get different results each time!
# random.seed(42) # Uncomment this line for reproducible results
# โโ Print the header โโ
print_header()
# โโ Step 1: Rebuild the model โโ
print_section("๐ง", "Step 1: Rebuilding the bigram model...",
"Training our AI on the sample text")
text_lower = SAMPLE_TEXT.lower()
counts, probs = build_bigram_model(text_lower)
total_unique = sum(len(v) for v in counts.values())
print(f" {Colors.GREEN}โ{Colors.RESET} Model built! Learned {Colors.BOLD}{total_unique}{Colors.RESET} patterns")
print(f" {Colors.GREEN}โ{Colors.RESET} Vocabulary: {Colors.BOLD}{len(counts)}{Colors.RESET} unique characters")
# โโ Step 2: Show the generation process โโ
print_section("๐ฌ", "Step 2: The Generation Process (Step by Step)",
"Watch how the AI 'decides' each character")
print(f" {Colors.YELLOW}Let's see exactly how text generation works:{Colors.RESET}")
print(f" {Colors.DIM}The model looks at the current character, checks its{Colors.RESET}")
print(f" {Colors.DIM}probability table, and randomly picks the next one.{Colors.RESET}\n")
show_generation_process(probs, start_char='t', steps=10)
# โโ Step 3: Generate multiple samples โโ
print_section("๐", "Step 3: Generating Text Samples",
"Let's generate text of different lengths")
samples = [
('t', 30, "Short (30 chars, starting with 't')"),
('i', 60, "Medium (60 chars, starting with 'i')"),
('a', 100, "Longer (100 chars, starting with 'a')"),
('t', 200, "Full paragraph (200 chars, starting with 't')"),
]
for start, length, description in samples:
print(f" {Colors.BOLD}{Colors.YELLOW}๐ {description}:{Colors.RESET}")
generated = generate_text(probs, start, length)
# Clean up for display (replace newlines with spaces)
display_text = generated.replace('\n', ' ')
print(f" {Colors.CYAN}\"{display_text}\"{Colors.RESET}")
print()
# โโ Step 4: Animated generation โโ
print_section("๐ฌ", "Step 4: Live Text Generation (Animated!)",
"Watch text appear character by character...")
print(f" {Colors.YELLOW}Generating 150 characters starting with 'i'...{Colors.RESET}")
print(f" {Colors.DIM}(Each character is chosen one at a time based on the previous one){Colors.RESET}\n")
animated_text = generate_text_animated(probs, 'i', 150, delay=0.02)
print()
# โโ Step 5: Compare real vs generated โโ
print_section("โ๏ธ", "Step 5: Real Text vs Generated Text",
"Can you spot the difference?")
# Get a chunk of real text
real_chunk = text_lower[50:200].replace('\n', ' ')
generated_chunk = generate_text(probs, 't', 150).replace('\n', ' ')
print(f" {Colors.BOLD}{Colors.GREEN}๐ REAL TEXT:{Colors.RESET}")
print(f" {Colors.WHITE}\"{real_chunk}\"{Colors.RESET}")
print()
print(f" {Colors.BOLD}{Colors.MAGENTA}๐ค GENERATED TEXT:{Colors.RESET}")
print(f" {Colors.WHITE}\"{generated_chunk}\"{Colors.RESET}")
print()
print(f" {Colors.YELLOW}Analysis:{Colors.RESET}")
print(f" {Colors.WHITE}• The real text makes perfect sense: coherent sentences{Colors.RESET}")
print(f" {Colors.WHITE}• The generated text has SOME English patterns (common pairs like{Colors.RESET}")
print(f" {Colors.WHITE}  'th', 'he', 'in') but is mostly gibberish{Colors.RESET}")
print(f" {Colors.WHITE}• WHY? Our model only looks at ONE previous character!{Colors.RESET}")
print(f" {Colors.WHITE}  It has no idea about words, grammar, or meaning.{Colors.RESET}")
print()
# โโ Step 6: Multiple runs show randomness โโ
print_section("🎲", "Step 6: Randomness in Action",
              "Same starting character, different results each time!")
print(f" {Colors.YELLOW}Three different generations, all starting with 't':{Colors.RESET}\n")
for run in range(1, 4):
    gen = generate_text(probs, 't', 80).replace('\n', ' ')
    print(f" {Colors.CYAN}Run {run}:{Colors.RESET} \"{gen}\"")
print()
print(f" {Colors.DIM}Each run is different because we use weighted RANDOM choices!{Colors.RESET}")
print(f" {Colors.DIM}The probabilities are the same, but the random dice rolls differ.{Colors.RESET}")
# โโ Key Insights โโ
print_section("💡", "Key Insights: What Did We Learn?")
print(f"""  {Colors.WHITE}1. {Colors.BOLD}A bigram model is just a LOOKUP TABLE{Colors.RESET}
     {Colors.DIM}For each character, it stores what might come next{Colors.RESET}
  {Colors.WHITE}2. {Colors.BOLD}Generation = repeated random sampling{Colors.RESET}
     {Colors.DIM}Pick a char → look up options → randomly choose → repeat{Colors.RESET}
  {Colors.WHITE}3. {Colors.BOLD}Context window = 1 character{Colors.RESET}
     {Colors.DIM}Our model only looks at the LAST character, which is why{Colors.RESET}
     {Colors.DIM}it can't form real words or sentences{Colors.RESET}
  {Colors.WHITE}4. {Colors.BOLD}More context = better predictions{Colors.RESET}
     {Colors.DIM}GPT-4 Turbo can look at up to 128,000 tokens of context, one reason it is so capable!{Colors.RESET}
  {Colors.WHITE}5. {Colors.BOLD}The CORE IDEA is the same across all language models:{Colors.RESET}
     {Colors.YELLOW}→ Learn patterns from data{Colors.RESET}
     {Colors.YELLOW}→ Use those patterns to predict what comes next{Colors.RESET}
  {Colors.BOLD}{Colors.GREEN}╔════════════════════════════════════════════════════════╗
  ║  You've gone from ZERO knowledge to building a model   ║
  ║  that generates text! That's incredible!               ║
  ╚════════════════════════════════════════════════════════╝{Colors.RESET}
  {Colors.MAGENTA}⏭️ Ready for the next level? In Level 2, we'll build a{Colors.RESET}
  {Colors.MAGENTA}   NEURAL NETWORK that actually learns and improves!{Colors.RESET}""")
# โโ Print footer with celebration โโ
print_footer()
# Final celebration! ๐
print(f" {Colors.BOLD}{Colors.YELLOW}CONGRATULATIONS! You just built your first AI model!{Colors.RESET}")
print(f" {Colors.WHITE}You now understand:{Colors.RESET}")
print(f" {Colors.GREEN} ✓ What prediction means{Colors.RESET}")
print(f" {Colors.GREEN} ✓ How bigrams capture patterns{Colors.RESET}")
print(f" {Colors.GREEN} ✓ How counting = learning{Colors.RESET}")
print(f" {Colors.GREEN} ✓ How sampling = generation{Colors.RESET}")
print(f" {Colors.WHITE}These are the building blocks of EVERY language model.{Colors.RESET}\n")
Neural Networks from Scratch
"The brain is a world consisting of a number of unexplored continents and great stretches of unknown territory." - Santiago Ramón y Cajal, Nobel Prize-winning neuroscientist
Learning Objectives
- Explain why counting patterns (bigrams) has fundamental limitations
- Describe how a biological neuron works and how we model it mathematically
- Build a single artificial neuron from scratch and train it to learn logic gates
- Understand activation functions (sigmoid, ReLU) and why non-linearity matters
- Build a multi-layer neural network with forward pass computation
- Explain backpropagation, the algorithm that makes neural networks learn
- Train a character-level neural network to generate text
- Interpret training loss curves and understand what they tell us about learning
3.1 From Counting to Learning
In Chapter 2, you built something remarkable: a model that generates text by counting character patterns. But you also saw its limitations. The bigram model:
- Can only look at one character of context
- Has no ability to generalise: if it hasn't seen a pattern, it can't predict it
- Treats every character as unrelated to every other (it doesn't know 'a' and 'e' are both vowels)
- Scales badly as we add context: the table of counts grows exponentially with the context length, so most entries are never observed (the curse of dimensionality)
The bigram model does exactly what we tell it: count. But what if the computer could learn patterns by itself? What if, instead of us defining the rules, the machine could discover them?
That's exactly what neural networks do. And the beautiful thing is that the core idea is surprisingly simple.
Important
The Fundamental Shift: A bigram model is programmed (we tell it exactly what to count). A neural network is trained (we show it data and it figures out the patterns on its own). This shift from programming to training is the most important idea in modern AI.
3.2 What is a Neuron? 🧠
The Biological Inspiration
Your brain contains roughly 86 billion neurons. Each neuron:
- Receives signals from other neurons through dendrites
- Processes these signals in the cell body
- Sends output through the axon if the total signal is strong enough
- Connects to other neurons at synapses, with varying connection strengths
The key insight: some connections are stronger than others. When you learn something new, the connections between specific neurons get stronger or weaker. That's learning!
The Mathematical Model
We simplify this into a mathematical model:
inputs × weights → sum → activation → output
Let's use an analogy every Indian student will understand:
Tip
The Cricket Decision Analogy
Imagine a student deciding whether to go play cricket. Three factors matter:
| Factor | Value | Weight (Importance) |
|---|---|---|
| Is homework done? | 1 (yes) or 0 (no) | 0.8 (very important!) |
| Is weather good? | 1 (yes) or 0 (no) | 0.3 (nice but not critical) |
| Are friends going? | 1 (yes) or 0 (no) | 0.6 (good motivation) |
The student multiplies each factor by its importance, adds them up, and makes a decision:
Score = (homework × 0.8) + (weather × 0.3) + (friends × 0.6) + bias
If the score is high → "Let's go play!" If low → "Better stay home."
That's exactly how an artificial neuron works!
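The analogy translates almost directly into code. The sketch below uses the three weights from the table; the bias of -1.0 and the `decide` helper are assumptions made for illustration, since the text does not fix a bias value.

```python
import math

# Weights from the cricket table: homework, weather, friends
WEIGHTS = [0.8, 0.3, 0.6]
BIAS = -1.0  # hypothetical value: the weighted sum must exceed 1.0 to lean towards "go"


def sigmoid(x):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decide(homework, weather, friends):
    """Return (score, probability of going to play)."""
    inputs = [homework, weather, friends]
    score = sum(w * x for w, x in zip(WEIGHTS, inputs)) + BIAS
    return score, sigmoid(score)


for case in [(1, 1, 1), (1, 0, 1), (1, 0, 0), (0, 1, 1), (0, 0, 0)]:
    score, prob = decide(*case)
    verdict = "go play" if prob > 0.5 else "stay home"
    print(f"inputs={case} score={score:+.2f} p(go)={prob:.2f} -> {verdict}")
```

With this assumed bias, finished homework alone is not enough (score -0.2), but homework plus friends is (score +0.4). Changing the bias shifts how easily the "student" says yes, which is exactly the role the bias plays in a neuron.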
3.3 Building a Single Neuron
Let's look at the actual Neuron class from our code. This is a complete artificial neuron built from scratch with NumPy:
Initialization: Starting with Random Guesses
Python
import numpy as np

class Neuron:
    """
    A single artificial neuron, the fundamental building block of neural networks.
    Mathematical formula:
        output = sigmoid(w1*x1 + w2*x2 + ... + bias)
    """
    def __init__(self, num_inputs, learning_rate=0.5):
        """
        Initialize the neuron with random weights and zero bias.
        """
        # Initialize weights randomly between -1 and 1
        # Each weight represents how much the neuron "trusts" each input
        self.weights = np.random.uniform(-1, 1, num_inputs)
        # The bias is like the neuron's default tendency
        # Positive bias = tends to fire even without input
        # Negative bias = needs strong input to fire
        self.bias = 0.0
        # Controls how big each adjustment step is during learning
        self.learning_rate = learning_rate
Why random weights? Random values give the neuron an arbitrary starting "opinion" to adjust from. Think of it as a student making initial guesses before learning the correct answers. A single neuron could also learn from all-zero weights; random initialisation becomes essential in multi-layer networks, where identical starting weights would make every hidden neuron learn exactly the same thing.
The Sigmoid Activation Function
The sigmoid function is the neuron's "decision maker." It squashes any number into the range (0, 1):
Python
    def sigmoid(self, x):
        """
        Sigmoid activation: σ(x) = 1 / (1 + e^(-x))
        - Large negative x → output close to 0 ("no")
        - Large positive x → output close to 1 ("yes")
        - x = 0 → output = 0.5 ("uncertain")
        """
        x = np.clip(x, -500, 500)  # Prevent overflow
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, sigmoid_output):
        """
        Derivative of sigmoid: σ'(x) = σ(x) * (1 - σ(x))
        The derivative tells us the SLOPE: how sensitive the output
        is to changes in input. Essential for backpropagation!
        Note: this method expects the sigmoid OUTPUT, not the raw input x.
        """
        return sigmoid_output * (1 - sigmoid_output)
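Here's the sigmoid curve visualised:
[Figure: the sigmoid curve, with x from -5 to +5 on the horizontal axis and the output from 0.0 to 1.0 on the vertical axis. The curve rises in an S shape from near 0.0 on the left, through 0.5 at x = 0 ("uncertain"), to near 1.0 on the right.]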
Note
🤔 Think About It
Why do we need the sigmoid function? Why not just use the raw weighted sum? Because without it, a neuron would just be a linear function, and stacking linear functions gives you... another linear function. Sigmoid introduces non-linearity, which allows neural networks to learn complex, curved patterns. Without activation functions, neural networks would be no more powerful than simple linear regression!
The Forward Pass: Making a Prediction
Python
    def forward(self, inputs, verbose=False):
        """
        Forward pass: compute the neuron's output.
        Steps:
        1. Compute weighted sum: Σ(wi * xi) + bias
        2. Apply sigmoid activation
        3. Return the output
        """
        # Step 1: Weighted sum, which combines all inputs into a single number
        # np.dot computes: w1*x1 + w2*x2 + ... + wn*xn
        weighted_sum = np.dot(inputs, self.weights) + self.bias
        # Step 2: Sigmoid squashes it to (0, 1)
        output = self.sigmoid(weighted_sum)
        return output, weighted_sum
The Training Step: Learning from Mistakes
This is where the magic happens. The neuron learns by adjusting its weights after each mistake:
Python
    def train_step(self, inputs, target):
        """
        One step of training: forward → compute error → update weights.
        Like a teacher correcting a student:
        1. Student answers (forward pass)
        2. Teacher checks (compute error)
        3. Teacher gives feedback (compute gradient)
        4. Student adjusts (update weights)
        """
        # Step 1: Forward pass: make a prediction
        output, weighted_sum = self.forward(inputs)
        # Step 2: Compute error: how wrong are we?
        error = target - output
        # Step 3: Compute gradient: which direction to adjust
        sigmoid_deriv = self.sigmoid_derivative(output)
        gradient = error * sigmoid_deriv
        # Step 4: Update weights: nudge them to reduce error
        # weight_new = weight_old + learning_rate × gradient × input
        self.weights += self.learning_rate * gradient * inputs
        self.bias += self.learning_rate * gradient
        return error
Demo: Teaching a Neuron Logic Gates
Let's see this in action! We can teach a single neuron to learn the AND gate, a simple logic operation where the output is 1 only when BOTH inputs are 1:
| Input 1 | Input 2 | AND Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
Python
# Assumes the Neuron class defined above and `import numpy as np`
# Create a neuron with 2 inputs
neuron = Neuron(num_inputs=2, learning_rate=0.5)
# AND gate training data
inputs_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 0, 0, 1], dtype=float)
# Train for 5000 epochs (showing the data 5000 times)
for epoch in range(5000):
    for i in range(len(inputs_data)):
        neuron.train_step(inputs_data[i], targets[i])
After 5000 epochs of training, the neuron learns to correctly implement the AND gate! The weights converge to values that make the neuron output ≈0 for all inputs except [1, 1], where it outputs ≈1.
Tip
Key Insight: A single neuron can learn any linearly separable problem. AND and OR are linearly separable. XOR is NOT, and that's why we need networks of neurons (coming up next!).
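The claim about linear separability can be checked with a small experiment. The sketch below is a pure-Python stand-in for the NumPy Neuron class (the `train_neuron` and `predict` helpers and the fixed seed are assumptions for illustration); it applies the same update rule to AND, OR, and XOR.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_neuron(samples, epochs=5000, learning_rate=0.5, seed=0):
    """Train one sigmoid neuron with the same update rule as train_step."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    weights = [rng.uniform(-1, 1) for _ in range(2)]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            output = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            gradient = (target - output) * output * (1 - output)
            weights = [w + learning_rate * gradient * x for w, x in zip(weights, inputs)]
            bias += learning_rate * gradient
    return weights, bias


def predict(weights, bias, inputs):
    """Threshold the neuron's output at 0.5 to get a 0/1 answer."""
    return 1 if sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias) > 0.5 else 0


POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
GATES = {
    "AND": [0, 0, 0, 1],
    "OR": [0, 1, 1, 1],
    "XOR": [0, 1, 1, 0],
}

results = {}
for name, targets in GATES.items():
    weights, bias = train_neuron(list(zip(POINTS, targets)))
    results[name] = [predict(weights, bias, p) for p in POINTS]
    print(f"{name}: targets={targets} predictions={results[name]}")
```

AND and OR end up fully correct, while the XOR predictions never match all four targets, however long the neuron trains: no single straight line separates XOR's 1s from its 0s.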
3.4 Activation Functions: The Non-Linearity Secret
Why Non-Linearity Matters
Without activation functions, a neural network is just a fancy way of doing linear algebra. No matter how many layers you stack, the result is always a linear function:
$$W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\,x + (W_2 b_1 + b_2) = W' x + b'$$
Multiple linear layers collapse into a single linear layer! That's useless: we can't learn curves, boundaries, or complex patterns.
Activation functions like sigmoid break this linearity, allowing networks to learn curved, non-linear patterns.
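The collapse of stacked linear layers is easy to verify by hand. The following sketch uses small integer matrices (all values are arbitrary and chosen only for illustration) and plain Python lists, so the arithmetic is exact.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add(u, v):
    return [x + y for x, y in zip(u, v)]


# Two "layers" with no activation function (illustrative integer values)
W1, b1 = [[1, 2], [3, 4]], [1, -1]
W2, b2 = [[2, 0], [1, -1]], [0, 5]

# The single equivalent layer: W' = W2 W1 and b' = W2 b1 + b2
W_combined = matmul(W2, W1)
b_combined = add(matvec(W2, b1), b2)

x = [7, -3]
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)
one_layer = add(matvec(W_combined, x), b_combined)
print("two layers:", two_layers)
print("one layer: ", one_layer)
```

Both computations give the same vector for any input `x`, so the second layer added no expressive power. Inserting a sigmoid between the two layers breaks this equality.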
Two Key Activation Functions
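Sigmoid (used in our code):
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Sigmoid derivative (crucial for backpropagation):
$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
ReLU (Rectified Linear Unit), used in modern deep learning:
$$\text{ReLU}(x) = \max(0, x)$$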
Note
🤔 Think About It
Modern networks usually use ReLU or one of its variants instead of sigmoid in their hidden layers. Why? Sigmoid has a problem called vanishing gradients: for large positive or large negative inputs, the derivative is nearly zero, so learning slows to a crawl. ReLU avoids this for positive inputs, where its derivative is exactly 1 (for negative inputs it is 0). But for our educational examples, sigmoid works perfectly and is easier to understand!
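To put numbers on the vanishing-gradient problem, the sketch below compares the two slopes at a few inputs. The helper names are hypothetical, and the final calculation deliberately ignores the weight terms that also appear in a real backward pass.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_slope(x):
    """Derivative of sigmoid at x."""
    s = sigmoid(x)
    return s * (1 - s)


def relu_slope(x):
    """Derivative of max(0, x); the value at exactly 0 is a convention."""
    return 1.0 if x > 0 else 0.0


for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_slope(x):.6f}  relu'={relu_slope(x):.1f}")

# Backpropagation multiplies one slope per layer, so small slopes compound.
layers = 10
best_case = 0.25 ** layers  # 0.25 is the largest slope sigmoid can ever have
print(f"largest possible product of {layers} sigmoid slopes: {best_case:.2e}")
```

Even in the best case (every input exactly 0), ten sigmoid layers shrink the signal by a factor of about a million, whereas a chain of active ReLU units passes it through unchanged.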
3.5 Building a Neural Network 🏗️
A single neuron can only learn simple patterns. To learn complex patterns, we connect many neurons in layers. Let's look at the NeuralNetwork class from step2_network.py:
Network Architecture
The Network Class
Python
class NeuralNetwork:
    """
    A simple feedforward neural network with one hidden layer.
    Architecture:
        Input (n) → Hidden (16, sigmoid) → Output (m, softmax)
    WHY non-linearity matters:
    - Without activation functions, stacking layers is pointless
    - Multiple linear layers collapse into a single linear layer
    - Non-linearity allows the network to learn CURVES, not just lines
    """
    def __init__(self, input_size, hidden_size=16, output_size=4):
        """
        Initialize with Xavier-scaled random weights.
        WHY Xavier initialization?
        - Random weights too large → outputs explode
        - Random weights too small → outputs vanish
        - Xavier scales by sqrt(2 / (fan_in + fan_out)) to keep values reasonable
        """
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        # Weights: Input → Hidden (every input connects to every hidden neuron)
        scale_1 = np.sqrt(2.0 / (input_size + hidden_size))
        self.weights_input_hidden = np.random.randn(input_size, hidden_size) * scale_1
        self.biases_hidden = np.zeros(hidden_size)
        # Weights: Hidden → Output
        scale_2 = np.sqrt(2.0 / (hidden_size + output_size))
        self.weights_hidden_output = np.random.randn(hidden_size, output_size) * scale_2
        self.biases_output = np.zeros(output_size)
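As a sanity check on the scaling used in `__init__`, the sketch below draws a weight matrix with the standard library and compares its spread with the target value. The layer sizes (27 inputs, 16 hidden neurons) and the helper name `xavier_scale` are assumptions for illustration.

```python
import math
import random
import statistics


def xavier_scale(fan_in, fan_out):
    """Standard deviation used above: sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))


rng = random.Random(42)  # fixed seed so the run is repeatable
fan_in, fan_out = 27, 16  # hypothetical sizes: 27 input characters, 16 hidden neurons
scale = xavier_scale(fan_in, fan_out)

# Same recipe as np.random.randn(fan_in, fan_out) * scale, without NumPy
weights = [[rng.gauss(0.0, 1.0) * scale for _ in range(fan_out)] for _ in range(fan_in)]

flat = [w for row in weights for w in row]
print(f"target std = {scale:.4f}, sample std = {statistics.pstdev(flat):.4f}")
```

The sample standard deviation lands close to the target. Wider layers get smaller weights, which keeps the weighted sums feeding the sigmoid in a range where its slope is not vanishingly small.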
The Forward Pass
Python
    def forward(self, inputs, verbose=False):
        """
        Forward pass: push input through all layers to get output.
        Flow:
            Input → (weights × input + bias) → sigmoid → Hidden
            Hidden → (weights × hidden + bias) → softmax → Output
        """
        # LAYER 1: Input → Hidden
        # Matrix multiplication computes ALL weighted sums at once!
        hidden_raw = np.dot(inputs, self.weights_input_hidden) + self.biases_hidden
        hidden_activated = self.sigmoid(hidden_raw)  # Apply non-linearity
        # LAYER 2: Hidden → Output
        output_raw = np.dot(hidden_activated, self.weights_hidden_output) + self.biases_output
        output_probs = self.softmax(output_raw)  # Convert to probabilities
        return output_probs, hidden_raw, hidden_activated, output_raw
Softmax: Converting Scores to Probabilities
Python
    def softmax(self, x):
        """
        Softmax: converts raw scores to probabilities that sum to 1.
            softmax(xi) = e^(xi) / Σ(e^(xj))
        Example: Raw scores    [2.0, 1.0, 0.5, 0.1]
               → Probabilities [0.57, 0.21, 0.13, 0.09]  (sum = 1.0)
        """
        x_shifted = x - np.max(x)  # Prevent overflow (e^1000 overflows to inf)
        exp_x = np.exp(x_shifted)
        return exp_x / np.sum(exp_x)
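The softmax arithmetic can be worked through with the standard library alone. The sketch below is a stand-in for the NumPy method rather than the course's own code; it evaluates the scores [2.0, 1.0, 0.5, 0.1] and then shows why the max-subtraction step matters.

```python
import math


def softmax(scores):
    """Numerically stable softmax for a list of raw scores."""
    shift = max(scores)  # subtracting the max leaves the result unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.5, 0.1])
print([round(p, 2) for p in probs], "sum =", round(sum(probs), 6))

# Without the shift, math.exp(1000) raises OverflowError; with it, this works.
big = softmax([1000.0, 1001.0])
print([round(p, 4) for p in big])
```

The four probabilities come out as roughly 0.57, 0.21, 0.13 and 0.09, which sum to 1. Only the differences between scores matter, which is why shifting every score by the same amount is safe.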
Tip
The Layers Analogy 🏫
Think of the layers like a school processing system:
- Input layer = raw information (exam answers on paper)
- Hidden layer = teachers detecting patterns ("this student knows algebra but struggles with geometry")
- Output layer = final decision (grade: A, B, C, or D)
The first layer sees raw data. Middle layers find patterns. The output layer makes the final call.
3.6 The Magic of Backpropagation ✨
This is THE most important section in this chapter. Backpropagation is the algorithm that makes neural networks learn. Without it, training a network with thousands of weights would be impractical.
The Big Picture
Tip
The Teacher Grading Papers Analogy
Imagine a teacher (the loss function) grading a student's (the network's) exam:
1. Forward Pass: The student answers the questions
2. Compute Loss: The teacher marks the answers: "You got 40% wrong"
3. Backward Pass (Chain Rule): The teacher traces back: "You got question 5 wrong BECAUSE you don't understand fractions, which is BECAUSE you didn't learn multiplication tables"
4. Update Weights: The teacher tells the student: "Practice multiplication tables more" (strengthen those connections)
Repeat this process thousands of times, and the student (network) masters the subject!
The Loss Function: How Wrong Are We?
We use cross-entropy loss, a standard choice for classification tasks:
$$L = -\sum_i y_i \log(\hat{y}_i)$$
Where $y_i$ is the true label (1 for the correct class, 0 otherwise) and $\hat{y}_i$ is the predicted probability for class $i$.
Python
    def compute_loss(self, predicted, target):
        """
        Cross-entropy loss: heavily penalizes confident WRONG predictions.
        - Predict right character with high probability → low loss
        - Predict wrong character → high loss
        """
        predicted_clipped = np.clip(predicted, 1e-15, 1.0)  # Prevent log(0)
        loss = -np.sum(target * np.log(predicted_clipped))
        return loss
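A few concrete numbers make the behaviour of this loss easier to see. The sketch below is a pure-Python version of the same formula; the three prediction vectors are made up for illustration.

```python
import math


def cross_entropy(predicted, target):
    """-sum(y_i * log(y_hat_i)), with clipping to avoid log(0)."""
    return -sum(t * math.log(min(max(p, 1e-15), 1.0)) for p, t in zip(predicted, target))


target = [0, 1, 0, 0]  # the correct class is index 1

confident_right = cross_entropy([0.03, 0.90, 0.04, 0.03], target)
unsure = cross_entropy([0.25, 0.25, 0.25, 0.25], target)
confident_wrong = cross_entropy([0.90, 0.01, 0.05, 0.04], target)

print(f"confident and right: {confident_right:.3f}")
print(f"uniform guess:       {unsure:.3f}")
print(f"confident and wrong: {confident_wrong:.3f}")
```

Because the target is one-hot, only the probability given to the correct class matters. A uniform guess over k classes always costs ln(k); assuming a vocabulary of about 27 symbols, that is ln(27) ≈ 3.3, which is why an untrained character model starts with a loss near 3.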
The Backward Pass: Tracing the Error
This is the backward method from CharLevelNetwork in step3_train.py. Let's go through it line by line:
Python
    def backward(self, target):
        """
        Backward pass: compute gradients using the chain rule.
        THIS IS BACKPROPAGATION, the core of neural network learning!
        For output layer:
            ∂Loss/∂W2 = a1ᵀ · (predicted - target)
            ∂Loss/∂b2 = predicted - target
        For hidden layer:
            ∂Loss/∂W1 = xᵀ · δ_hidden
            ∂Loss/∂b1 = δ_hidden
            where δ_hidden = ((predicted - target) · W2ᵀ) × a1 × (1 - a1)
            (a1 × (1 - a1) is the sigmoid derivative at the hidden layer)
        """
        # OUTPUT LAYER GRADIENT
        # The derivative of softmax + cross-entropy simplifies beautifully!
        delta_output = self.a2 - target  # Shape: (vocab_size,)
        # Gradient for W2: how much each hidden→output weight contributed
        dW2 = np.outer(self.a1, delta_output)  # Shape: (hidden, vocab)
        db2 = delta_output
        # HIDDEN LAYER GRADIENT
        # Step 1: Propagate error back through W2 (before W2 is updated below)
        delta_hidden = np.dot(delta_output, self.W2.T)  # Shape: (hidden,)
        # Step 2: Multiply by sigmoid derivative (chain rule!)
        delta_hidden *= self.sigmoid_derivative(self.a1)
        # Gradient for W1
        dW1 = np.outer(self.x, delta_hidden)  # Shape: (vocab, hidden)
        db1 = delta_hidden
        # UPDATE WEIGHTS (Gradient Descent)
        # We move OPPOSITE to the gradient (gradient points toward MORE error)
        self.W2 -= self.learning_rate * dW2
        self.b2 -= self.learning_rate * db2
        self.W1 -= self.learning_rate * dW1
        self.b1 -= self.learning_rate * db1
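Let's break down the math:
Step 1: The output error is simply predicted minus target:
$$\delta_{\text{output}} = \hat{y} - y$$
Step 2: The gradient for $W_2$ tells us how each weight contributed:
$$\frac{\partial L}{\partial W_2} = a_1^{\top}\,\delta_{\text{output}}, \qquad \frac{\partial L}{\partial b_2} = \delta_{\text{output}}$$
Step 3: We propagate the error backwards through the weights:
$$\delta_{\text{hidden}} = \left(\delta_{\text{output}}\,W_2^{\top}\right) \odot a_1 \odot (1 - a_1)$$
The $\odot$ symbol means element-wise multiplication. We multiply by the sigmoid derivative, $a_1 \odot (1 - a_1)$, because the sigmoid "squashed" values during the forward pass and we need to account for that squashing. The first-layer gradients then follow the same pattern as Step 2: $\frac{\partial L}{\partial W_1} = x^{\top}\,\delta_{\text{hidden}}$ and $\frac{\partial L}{\partial b_1} = \delta_{\text{hidden}}$.
Step 4: Update every weight by moving opposite to the gradient:
$$W \leftarrow W - \eta\,\frac{\partial L}{\partial W}$$
Where $\eta$ is the learning rate.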
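A standard way to gain confidence in formulas like those in Steps 1 to 3 is a gradient check: nudge one weight slightly, measure how the loss changes, and compare with the analytic gradient. The sketch below does this for a toy network in pure Python. The layer sizes, seed, and helper names are assumptions for illustration, not part of step3_train.py.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W1, b1, W2, b2):
    """Return hidden activations a1 and output probabilities y_hat."""
    a1 = [sigmoid(sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
          for j in range(len(b1))]
    y_hat = softmax([sum(a1[j] * W2[j][k] for j in range(len(a1))) + b2[k]
                     for k in range(len(b2))])
    return a1, y_hat


def loss(x, y, W1, b1, W2, b2):
    _, y_hat = forward(x, W1, b1, W2, b2)
    return -sum(t * math.log(p) for t, p in zip(y, y_hat))


rng = random.Random(1)
n_in, n_hidden, n_out = 3, 2, 3  # toy sizes chosen for illustration
W1 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_in)]
W2 = [[rng.uniform(-1, 1) for _ in range(n_out)] for _ in range(n_hidden)]
b1, b2 = [0.0] * n_hidden, [0.0] * n_out
x, y = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]  # one-hot input and target

# Analytic gradients, following Steps 1 to 3
a1, y_hat = forward(x, W1, b1, W2, b2)
delta_out = [p - t for p, t in zip(y_hat, y)]
dW2 = [[a1[j] * delta_out[k] for k in range(n_out)] for j in range(n_hidden)]
delta_hidden = [sum(delta_out[k] * W2[j][k] for k in range(n_out)) * a1[j] * (1 - a1[j])
                for j in range(n_hidden)]
dW1 = [[x[i] * delta_hidden[j] for j in range(n_hidden)] for i in range(n_in)]


def numeric_grad(matrix, row, col, eps=1e-6):
    """Central finite difference of the loss with respect to one weight."""
    original = matrix[row][col]
    matrix[row][col] = original + eps
    up = loss(x, y, W1, b1, W2, b2)
    matrix[row][col] = original - eps
    down = loss(x, y, W1, b1, W2, b2)
    matrix[row][col] = original  # restore the weight
    return (up - down) / (2 * eps)


max_err_W2 = max(abs(dW2[j][k] - numeric_grad(W2, j, k))
                 for j in range(n_hidden) for k in range(n_out))
max_err_W1 = max(abs(dW1[i][j] - numeric_grad(W1, i, j))
                 for i in range(n_in) for j in range(n_hidden))
print(f"max |analytic - numeric| for W2: {max_err_W2:.2e}")
print(f"max |analytic - numeric| for W1: {max_err_W1:.2e}")
```

Both differences should be tiny (limited only by floating-point rounding), which is good evidence that the chain-rule formulas are right. The same check is a useful debugging tool whenever you modify the backward pass, for example in the ReLU exercise.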
Warning
Common Pitfall: Learning Rate
- Too high (e.g., 5.0): The network overshoots, bouncing wildly. Like a student who overcorrects every mistake.
- Too low (e.g., 0.001): The network learns agonizingly slowly. Like a student who barely adjusts after feedback.
- Just right (e.g., 0.5): Steady improvement. The sweet spot!
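The three regimes can be reproduced on the simplest possible problem: minimising f(w) = w², whose gradient is 2w. This toy function is an assumption chosen for illustration; on it, 0.5 happens to be the ideal rate, whereas for a real network the sweet spot depends on the data and architecture.

```python
def descend(learning_rate, steps=50, start=1.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w


finals = {lr: descend(lr) for lr in (0.001, 0.5, 5.0)}
for lr, w in finals.items():
    print(f"learning_rate={lr:<5} final w = {w:.3e}")
```

After 50 steps, the rate 0.001 has barely moved from the start (w is still about 0.9), 0.5 has reached the minimum at 0, and 5.0 has exploded: each step multiplies w by -9, so it overshoots the minimum by more every time.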
3.7 Training Loop Explained
The training loop in step3_train.py ties everything together. Here's the core of the train_network function:
Python
for epoch in range(epochs):  # epochs = 1500
    total_loss = 0.0
    # Shuffle training data each epoch
    # WHY: Prevents learning the ORDER of examples instead of patterns
    shuffle_idx = np.random.permutation(num_samples)
    for i in shuffle_idx:
        # 1. Forward pass: predict next character
        predicted = net.forward(inputs[i])
        # 2. Compute loss: how wrong is the prediction?
        loss = net.compute_loss(predicted, targets[i])
        total_loss += loss
        # 3. Backward pass: compute gradients and update weights
        net.backward(targets[i])
    avg_loss = total_loss / num_samples
Key Concepts in the Training Loop:
Epoch: One complete pass through ALL training examples. If you have 300 training pairs and train for 1500 epochs, the network sees each example 1500 times!
Shuffling: We randomise the order each epoch so the network doesn't memorise the sequence. Just like how a good teacher mixes up practice problems.
One-Hot Encoding: Each character is represented as a binary vector. If our vocabulary is ['a', 'b', 'c', 'd'], then:
'a' = [1, 0, 0, 0]
'b' = [0, 1, 0, 0]
'c' = [0, 0, 1, 0]
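A minimal sketch of one-hot encoding, and of how it turns text into training pairs, might look like this. The `one_hot` helper and the toy string are assumptions for illustration; the course scripts may build their vectors differently.

```python
def one_hot(char, vocab):
    """Return a list with 1 at the character's index and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[vocab.index(char)] = 1
    return vector


vocab = ['a', 'b', 'c', 'd']
for ch in vocab:
    print(ch, one_hot(ch, vocab))

# Training pairs for next-character prediction: (current char, next char)
text = "abcd"  # toy string for illustration
pairs = [(one_hot(cur, vocab), one_hot(nxt, vocab)) for cur, nxt in zip(text, text[1:])]
print(len(pairs), "training pairs")
```

Each input vector selects exactly one row of the first weight matrix, so in a character-level network that row acts as the learned representation of the character.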
The Loss Going Down: When the network starts, its weights are random, so the loss is high (around 3.0). As it trains, the loss steadily decreases, which shows the network is learning!
Epoch 0 | Loss: 3.2814 | Still learning...
Epoch 100 | Loss: 2.7651 | Still learning...
Epoch 500 | Loss: 1.8203 | Getting better...
Epoch 1000 | Loss: 1.2145 | Almost there!
Epoch 1499 | Loss: 0.8932 | Mastered!
3.8 Seeing the Learning
The step4_visualize.py script creates visualisations of the training process. The most important is the loss curve:
[Figure: training loss curve. Vertical axis: loss, from about 3.5 down to about 0.5. Horizontal axis: training epoch, from 0 to 1500. The curve falls steeply at first and then flattens.]
What does this tell us?
- High loss at the start → the network is making random guesses
- Rapidly decreasing loss → the network is learning the easiest patterns first
- Slowly decreasing loss → the network is fine-tuning, learning subtler patterns
- Flat loss at the end → the network has learned as much as it can from this data
Before vs After Comparison
🔴 BEFORE training (random weights):
"xkq.pzmw bfvnj tglydc" ← complete gibberish!
🟢 AFTER training (1500 epochs):
"the network learns from data. the brain has" ← recognisable English!
The network went from outputting random characters to generating text that resembles the training data. It learned which characters follow which, but using a neural network instead of simple counting!
Important
The Key Difference from Chapter 2: Our bigram model counted exact patterns: "after 't', 'h' appeared 42 times." The neural network learns distributed patterns: it represents characters as numbers in a hidden layer and discovers abstract relationships between them. This is why neural networks can generalise better.
💭 3.9 Discussion: What Can Neural Networks Learn?
The Universal Approximation Theorem
Here's one of the most beautiful results in mathematics:
"With enough hidden neurons, a neural network with a single hidden layer can approximate ANY continuous function on a bounded input range to arbitrary accuracy."
In plain language: for any reasonably smooth pattern, some network exists that can represent it. The theorem does not promise that training will find the right weights, or say how many neurons and how much data you will need, but it explains why neural networks are such flexible pattern learners.
Deep vs Shallow Networks
Our network has one hidden layer (a "shallow" network). Modern language models stack many layers (a "deep" network); the largest GPT-3 model, for example, has 96. Why does depth matter?
Think of it like this:
- Layer 1 might learn: "these character pairs are common"
- Layer 2 might learn: "these sequences of common pairs form syllables"
- Layer 3 might learn: "these syllables form words"
- Layer 4 might learn: "these words form phrases"
Each layer builds more abstract representations on top of the previous layer. Depth allows the network to learn hierarchical patterns, from simple to complex.
Note
🤔 Think About It
Is this how the human brain works too? In some ways, yes! The visual cortex processes information in layers: the first layer detects edges, the next detects shapes, then objects, then faces. Each layer builds on the one before it. The analogy isn't perfect, but the principle of hierarchical feature extraction is shared.
Key Concepts Summary
| Concept | Definition |
|---|---|
| Neuron | The fundamental unit of a neural network. Takes inputs, multiplies by weights, adds a bias, and applies an activation function. |
| Weight | A number that controls how much influence an input has on the neuron's output. Adjusted during training. |
| Bias | A number added to the weighted sum before activation. Shifts the activation threshold. |
| Activation Function | A non-linear function (like sigmoid or ReLU) that allows the network to learn complex patterns. |
| Forward Pass | The process of feeding input through the network to get a prediction. |
| Backpropagation | The algorithm that computes how much each weight contributed to the error, enabling learning. |
| Loss Function | Measures how wrong the network's predictions are. We try to minimise this. |
| Gradient | The direction and magnitude of the steepest increase in loss. We move opposite to it. |
| Learning Rate | Controls the step size of weight updates. Too high = unstable, too low = slow. |
| Epoch | One complete pass through all training data. |
| One-Hot Encoding | Representing categories (characters) as binary vectors. 'a' = [1, 0, 0, ...] |
3.11 Exercises
1. Hidden neurons experiment: Change the number of hidden neurons in CharLevelNetwork from 64 to 16, then to 128. How does it affect:
   - Training speed?
   - Final loss value?
   - Quality of generated text?
2. Learning rate experiment: Try these learning rates: 0.01, 0.5, and 5.0. What happens with each?
   - 0.01: Does it converge? How long does it take?
   - 0.5: This is the default. Does it work well?
   - 5.0: What goes wrong? (Hint: look at the loss curve)
3. Different training text: Replace SAMPLE_TEXT with a Hindi paragraph or a Bollywood dialogue. Run training and generate text. Does the network capture Hindi character patterns?
4. ReLU activation: Replace the sigmoid function with ReLU:
Python
    def relu(self, x):
        return np.maximum(0, x)
   Remember to change the backward pass too: the ReLU derivative is 1 where the input was positive and 0 elsewhere, so it replaces sigmoid_derivative. Does it train faster? Does the generated text quality change?
5. Overfitting experiment: Train for 50,000 epochs instead of 1,500. Does the loss keep going down? At some point, does the network start memorising the training text perfectly? (This is called overfitting: the network learns the data by heart instead of learning general patterns.)
💭 3.12 Discussion Questions
1. Brain vs network: Our neural network has ~5,000 parameters. The human brain has ~86 billion neurons with ~100 trillion connections. What can a brain do that our tiny network can't? What fundamental capabilities are we missing?
2. Can networks be creative? When our trained network generates text, is it being "creative"? It's producing sequences it has never seen before, but based entirely on patterns from training data. Is human creativity any different?
3. The role of data: Our network was trained on a few hundred characters. ChatGPT was trained on billions of pages. How does the quantity and quality of training data affect what a network can learn?
4. Ethical questions: If a neural network learns to write like a famous Indian poet by training on their work, who owns the generated text? The programmer? The network? The poet?
5. The future of education: Could neural networks one day replace teachers? What can a human teacher do that an AI tutor cannot (or should not)?
Chapter Summary
- From counting to learning: Bigram models count explicitly. Neural networks learn patterns through training, a fundamental paradigm shift.
- Neurons: The building block of neural networks. Each neuron computes a weighted sum of inputs, adds a bias, and applies an activation function.
- Activation functions: Sigmoid squashes values to (0, 1). ReLU keeps positive values and zeros out negatives. Both introduce the non-linearity that makes neural networks powerful.
- Neural networks: Multiple neurons connected in layers. Input layer → hidden layer(s) → output layer. Each layer transforms the data into more abstract representations.
- Backpropagation: THE algorithm of deep learning. It uses the chain rule to trace errors backward through the network, computing how much each weight contributed to the error, then adjusting weights to reduce it.
- Training: We showed the network thousands of examples, and it gradually reduced its loss from ~3.0 (random guessing) to ~0.9 (meaningful predictions).
- The result: A network that started generating gibberish and learned to produce recognisable English text, all from scratch with only NumPy and no deep learning frameworks!
⏭️ What's Next?
Our neural network works, but it has a major limitation: it looks at only one character at a time (just like the bigram model!). It doesn't understand sequences. It can't grasp that "the" is a word, or that "machine learning" is a phrase.
In Chapter 4: Embeddings and Attention, you'll learn:
- How to represent characters and words as vectors (embeddings) in a continuous space
- The revolutionary attention mechanism, which asks "which parts of the input should I focus on?"
- How attention allows models to understand relationships between distant words
This is where we start building toward the Transformer, the architecture behind ChatGPT and Gemini. The foundation you built in this chapter is exactly what you need. Let's go!
"It always seems impossible until it's done." - Nelson Mandela
You just built a neural network from scratch. Nothing is impossible now. 💪
Complete Source Code - Chapter 3
Below are the complete, runnable source files for this chapter. Every line is included.
Complete Code: step1_neuron.py
Python
"""
================================================================================
🧠 LEVEL 2 - STEP 1: A SINGLE NEURON
================================================================================
Build a single neuron from scratch using only Python + NumPy.
We'll see how a neuron learns to be an AND gate and an OR gate!
KEY CONCEPTS:
- A neuron takes inputs, multiplies by weights, adds a bias, then activates
- Sigmoid squashes any number into range (0, 1)
- Learning = adjusting weights to reduce error
NO DEEP LEARNING FRAMEWORKS: just NumPy and math!
================================================================================
"""
# ============================================================================
# IMPORTS
# ============================================================================
import numpy as np # NumPy gives us fast math operations on arrays
import os # For file path operations
import sys # For system-level operations
# ============================================================================
# ANSI COLOR CODES
# ============================================================================
# WHY: We use ANSI escape codes to make terminal output colorful and readable.
# These codes tell the terminal to change text color/style.
# Format: \033[<code>m where <code> is the color number.
class Colors:
    """ANSI color codes for beautiful terminal output."""
    RESET = "\033[0m"    # Reset to default color
    BOLD = "\033[1m"     # Bold text
    DIM = "\033[2m"      # Dimmed text
    # Regular colors
    RED = "\033[31m"     # For errors or wrong outputs
    GREEN = "\033[32m"   # For correct outputs / success
    YELLOW = "\033[33m"  # For warnings / highlights
    BLUE = "\033[34m"    # For information
    MAGENTA = "\033[35m" # For special highlights
    CYAN = "\033[36m"    # For data values
    WHITE = "\033[37m"   # For regular text
    # Bright colors
    BRIGHT_RED = "\033[91m"
    BRIGHT_GREEN = "\033[92m"
    BRIGHT_YELLOW = "\033[93m"
    BRIGHT_BLUE = "\033[94m"
    BRIGHT_MAGENTA = "\033[95m"
    BRIGHT_CYAN = "\033[96m"
# ============================================================================
# HELPER FUNCTIONS
# ============================================================================
def print_header():
    """Print a beautiful header for this script."""
    print(f"\n{Colors.BRIGHT_CYAN}{'='*70}")
    print(f" 🧠 LEVEL 2 - STEP 1: A SINGLE NEURON FROM SCRATCH")
    print(f"{'='*70}{Colors.RESET}")
    print(f"{Colors.DIM} Building the smallest unit of a neural network...{Colors.RESET}")
    print(f"{Colors.DIM} Using only Python + NumPy. No frameworks!{Colors.RESET}\n")
def print_footer():
    """Print a beautiful footer for this script."""
    print(f"\n{Colors.BRIGHT_CYAN}{'='*70}")
    print(f" ✅ STEP 1 COMPLETE! You now understand a single neuron!")
    print(f" Next: step2_network.py - Build a full network!")
    print(f"{'='*70}{Colors.RESET}\n")
def print_section(title, emoji="๐"):
"""Print a section header."""
print(f"\n{Colors.BRIGHT_YELLOW}{'โ'*70}")
print(f" {emoji} {title}")
print(f"{'โ'*70}{Colors.RESET}\n")
# ============================================================================
# THE NEURON CLASS
# ============================================================================
class Neuron:
    """
    A single artificial neuron - the fundamental building block of neural networks.

    Think of it like a student making a decision:
    - It receives multiple INPUTS (pieces of information)
    - Each input has a WEIGHT (how much the student trusts that info)
    - It adds everything up (WEIGHTED SUM)
    - It passes through an ACTIVATION function (the decision threshold)
    - It produces an OUTPUT (the decision)

    Mathematical formula:
        output = sigmoid(w1*x1 + w2*x2 + ... + bias)
    """

    def __init__(self, num_inputs, learning_rate=0.5):
        """
        Initialize the neuron with random weights and zero bias.

        WHY random weights?
        - If all weights start at zero, the neuron has no starting "opinion"
        - Random weights give it a random starting point to learn from
        - Think of it as the student having some initial guesses

        WHY learning_rate?
        - Controls how big each adjustment step is
        - Too high = overshoots (student changes mind too drastically)
        - Too low = learns too slowly (student barely adjusts)
        - 0.5 is a reasonable starting point for simple problems
        """
        # Initialize weights randomly between -1 and 1
        # WHY: Each weight represents how much the neuron "trusts" each input
        # Values drawn from [-1, 1] keep the starting opinions modest
        self.weights = np.random.uniform(-1, 1, num_inputs)
        # Initialize bias to zero
        # WHY: The bias is like the neuron's default tendency
        # A positive bias means the neuron tends to fire even without input
        # A negative bias means the neuron needs strong input to fire
        self.bias = 0.0
        # Store learning rate for weight updates
        # WHY: This controls the "step size" when adjusting weights
        self.learning_rate = learning_rate
    def sigmoid(self, x):
        """
        The Sigmoid activation function: σ(x) = 1 / (1 + e^(-x))

        WHY sigmoid?
        - Squashes ANY number into the range (0, 1)
        - This is perfect for "yes/no" decisions (probability)
        - It's smooth and differentiable (we can calculate gradients for learning)
        - Negative x → output close to 0 ("no")
        - Positive x → output close to 1 ("yes")
        - x = 0 → output = 0.5 ("uncertain")

        WHY clip x?
        - Very large values of x can cause overflow in e^(-x)
        - Clipping to [-500, 500] prevents numerical errors
        """
        x = np.clip(x, -500, 500)  # Prevent overflow in exponential
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, sigmoid_output):
        """
        Derivative of sigmoid: σ'(x) = σ(x) * (1 - σ(x))

        WHY do we need the derivative?
        - The derivative tells us the SLOPE (rate of change)
        - During learning, we need to know: "If I change the input slightly,
          how much does the output change?"
        - This is essential for backpropagation (learning from mistakes)

        WHY use sigmoid_output directly?
        - Beautiful math trick: sigmoid's derivative can be computed from
          the sigmoid value itself, saving computation!
        """
        return sigmoid_output * (1 - sigmoid_output)
    def forward(self, inputs, verbose=False):
        """
        Forward pass: compute the neuron's output for given inputs.

        Steps:
        1. Compute weighted sum: Σ(wi * xi) + bias
        2. Apply sigmoid activation
        3. Return the output

        This is like the student:
        1. Gathering all information and weighing it
        2. Making a decision based on the total
        """
        # Step 1: Weighted sum
        # WHY: Each input is multiplied by its weight, then all are summed
        # This combines all inputs into a single number
        # Think of it as: "How strong is the total evidence?"
        weighted_sum = np.dot(inputs, self.weights) + self.bias
        # Step 2: Activation (sigmoid)
        # WHY: The raw sum could be any number (-inf to +inf)
        # Sigmoid converts it to a probability between 0 and 1
        output = self.sigmoid(weighted_sum)
        # Verbose printing for educational purposes
        if verbose:
            print(f" {Colors.CYAN}Inputs: {inputs}{Colors.RESET}")
            print(f" {Colors.MAGENTA}Weights: {np.round(self.weights, 4)}{Colors.RESET}")
            print(f" {Colors.BLUE}Bias: {self.bias:.4f}{Colors.RESET}")
            print(f" {Colors.YELLOW}Weighted Sum: {weighted_sum:.4f}{Colors.RESET}")
            print(f" {Colors.BRIGHT_GREEN}Sigmoid Output: {output:.4f}{Colors.RESET}")
            print()
        return output, weighted_sum
    def train_step(self, inputs, target):
        """
        One step of training: forward → compute error → update weights.

        This is the neuron LEARNING from one example.
        Like a teacher correcting a student:
        1. Student answers (forward pass)
        2. Teacher checks (compute error)
        3. Teacher gives feedback (compute gradient)
        4. Student adjusts (update weights)

        Parameters:
            inputs: the input values (what the neuron sees)
            target: the correct answer (what we WANT the neuron to output)
        Returns:
            error: how wrong the neuron was
        """
        # Step 1: Forward pass - make a prediction
        output, weighted_sum = self.forward(inputs)
        # Step 2: Compute error - how wrong are we?
        # WHY simple subtraction: For a single neuron, this works fine
        # For networks, we'd use more sophisticated loss functions
        error = target - output
        # Step 3: Compute gradient
        # WHY: The gradient tells us "which direction to adjust each weight"
        # Full per-weight step = error × sigmoid_derivative × input
        # - error: how wrong we are (magnitude and direction)
        # - sigmoid_derivative: how sensitive the output is to changes
        # - input: which inputs contributed to the error (applied in Step 4)
        # `gradient` below holds the part shared by every weight and the bias.
        # Together with Step 4 this is gradient descent on the squared error
        # 0.5 * (target - output)^2.
        sigmoid_deriv = self.sigmoid_derivative(output)
        gradient = error * sigmoid_deriv
        # Step 4: Update weights and bias
        # WHY: We move each weight in the direction that reduces the error
        # weight_new = weight_old + learning_rate × gradient × input
        # The learning_rate controls how big each step is
        self.weights += self.learning_rate * gradient * inputs
        self.bias += self.learning_rate * gradient
        return error
# ============================================================================
# DEMONSTRATION FUNCTIONS
# ============================================================================
def demo_single_neuron():
    """
    Demonstrate how a single neuron processes inputs step by step.
    """
    print_section("DEMO 1: How a Single Neuron Works", "🔬")
    print(f" {Colors.WHITE}A neuron is like a student deciding whether to go to cricket:{Colors.RESET}")
    print(f" {Colors.DIM}  Input 1: Is homework done? (1 = yes, 0 = no){Colors.RESET}")
    print(f" {Colors.DIM}  Input 2: Is weather good? (1 = yes, 0 = no){Colors.RESET}")
    print(f" {Colors.DIM}  Input 3: Are friends going? (1 = yes, 0 = no){Colors.RESET}\n")
    # Create a neuron with 3 inputs
    # WHY 3: We have 3 pieces of information to consider
    neuron = Neuron(num_inputs=3)
    # Set meaningful weights manually for demonstration
    # WHY these values: They represent how much the student cares about each factor
    neuron.weights = np.array([0.8, 0.3, 0.6])  # Homework most important!
    neuron.bias = -0.5  # Slight tendency to NOT go (responsible student!)
    print(f" {Colors.BRIGHT_MAGENTA}Neuron Configuration:{Colors.RESET}")
    print(f" {Colors.MAGENTA}  Weight for homework: 0.8 (most important!){Colors.RESET}")
    print(f" {Colors.MAGENTA}  Weight for weather: 0.3 (nice but not critical){Colors.RESET}")
    print(f" {Colors.MAGENTA}  Weight for friends: 0.6 (important motivation){Colors.RESET}")
    print(f" {Colors.BLUE}  Bias: -0.5 (tends to stay home){Colors.RESET}\n")
    # Test different scenarios
    scenarios = [
        ([1, 1, 1], "Homework ✓, Weather ✓, Friends ✓"),
        ([1, 0, 1], "Homework ✓, Weather ✗, Friends ✓"),
        ([0, 1, 1], "Homework ✗, Weather ✓, Friends ✓"),
        ([0, 0, 0], "Homework ✗, Weather ✗, Friends ✗"),
    ]
    for inputs_list, description in scenarios:
        inputs = np.array(inputs_list, dtype=float)
        print(f" {Colors.BRIGHT_YELLOW}Scenario: {description}{Colors.RESET}")
        output, _ = neuron.forward(inputs, verbose=True)
        # Interpret the decision
        if output > 0.5:
            print(f" {Colors.BRIGHT_GREEN}  → Decision: GO to cricket! "
                  f"(confidence: {output:.1%}){Colors.RESET}\n")
        else:
            print(f" {Colors.BRIGHT_RED}  → Decision: STAY home. "
                  f"(confidence: {1-output:.1%}){Colors.RESET}\n")
def demo_and_gate():
    """
    Test a neuron on the AND gate truth table.
    AND gate: output is 1 ONLY when BOTH inputs are 1.
    """
    print_section("DEMO 2: Neuron as AND Gate (Before Training)", "🔌")
    print(f" {Colors.WHITE}AND Gate Truth Table:{Colors.RESET}")
    print(f" {Colors.DIM}  0 AND 0 = 0")
    print(f"   0 AND 1 = 0")
    print(f"   1 AND 0 = 0")
    print(f"   1 AND 1 = 1{Colors.RESET}\n")
    # Create neuron with random weights
    np.random.seed(42)  # WHY: Makes results reproducible for teaching
    neuron = Neuron(num_inputs=2, learning_rate=0.5)
    # AND gate data
    # WHY binary inputs: a logic gate only ever sees 0s and 1s,
    # so the four rows below cover every possible case
    inputs_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    targets = np.array([0, 0, 0, 1], dtype=float)
    print(f" {Colors.BRIGHT_MAGENTA}Initial Random Weights: "
          f"{np.round(neuron.weights, 4)}{Colors.RESET}")
    print(f" {Colors.BLUE}Initial Bias: {neuron.bias:.4f}{Colors.RESET}\n")
    # Test BEFORE training
    print(f" {Colors.BRIGHT_RED}Before Training (random guesses):{Colors.RESET}")
    for i in range(len(inputs_data)):
        output, _ = neuron.forward(inputs_data[i])
        expected = targets[i]
        correct = "✓" if round(output) == expected else "✗"
        color = Colors.GREEN if correct == "✓" else Colors.RED
        print(f"   {Colors.CYAN}{inputs_data[i]}{Colors.RESET} → "
              f"{Colors.YELLOW}{output:.4f}{Colors.RESET} "
              f"(expected: {expected:.0f}) {color}{correct}{Colors.RESET}")
    return neuron, inputs_data, targets
def demo_or_gate():
    """
    Test a neuron on the OR gate truth table.
    OR gate: output is 1 when AT LEAST ONE input is 1.
    """
    print_section("DEMO 3: Neuron as OR Gate (Before Training)", "🔌")
    print(f" {Colors.WHITE}OR Gate Truth Table:{Colors.RESET}")
    print(f" {Colors.DIM}  0 OR 0 = 0")
    print(f"   0 OR 1 = 1")
    print(f"   1 OR 0 = 1")
    print(f"   1 OR 1 = 1{Colors.RESET}\n")
    np.random.seed(123)
    neuron = Neuron(num_inputs=2, learning_rate=0.5)
    inputs_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    targets = np.array([0, 1, 1, 1], dtype=float)
    print(f" {Colors.BRIGHT_MAGENTA}Initial Random Weights: "
          f"{np.round(neuron.weights, 4)}{Colors.RESET}")
    print(f" {Colors.BLUE}Initial Bias: {neuron.bias:.4f}{Colors.RESET}\n")
    print(f" {Colors.BRIGHT_RED}Before Training (random guesses):{Colors.RESET}")
    for i in range(len(inputs_data)):
        output, _ = neuron.forward(inputs_data[i])
        expected = targets[i]
        correct = "✓" if round(output) == expected else "✗"
        color = Colors.GREEN if correct == "✓" else Colors.RED
        print(f"   {Colors.CYAN}{inputs_data[i]}{Colors.RESET} → "
              f"{Colors.YELLOW}{output:.4f}{Colors.RESET} "
              f"(expected: {expected:.0f}) {color}{correct}{Colors.RESET}")
    return neuron, inputs_data, targets
def train_neuron(neuron, inputs_data, targets, gate_name, epochs=5000):
    """
    Train a neuron to learn a logic gate.

    This is the LEARNING process:
    - We show the neuron each example many times (epochs)
    - Each time, it adjusts its weights to reduce the error
    - Over time, it learns the correct behavior

    Like a student practicing math problems:
    - First attempts are mostly wrong
    - With practice, accuracy improves
    - Eventually, the student masters it!
    """
    print_section(f"TRAINING: Neuron Learning {gate_name} Gate", "🎓")
    print(f" {Colors.WHITE}Training for {epochs} epochs...{Colors.RESET}")
    print(f" {Colors.DIM}(Each epoch = showing ALL examples once){Colors.RESET}\n")
    # Track errors for display
    # WHY: We want to see the neuron improving over time
    milestone_epochs = [0, 10, 50, 100, 500, 1000, 2000, epochs-1]
    for epoch in range(epochs):
        total_error = 0
        # Train on each example
        # WHY shuffle? In real training, shuffling prevents the network
        # from learning the ORDER of examples instead of the patterns.
        # For this simple demo, we keep it ordered for clarity.
        for i in range(len(inputs_data)):
            error = neuron.train_step(inputs_data[i], targets[i])
            total_error += abs(error)
        # Print progress at milestones
        if epoch in milestone_epochs:
            avg_error = total_error / len(inputs_data)
            # Color based on error level
            if avg_error < 0.05:
                color = Colors.BRIGHT_GREEN
                bar = "█" * 20
                status = "🎉 Mastered!"
            elif avg_error < 0.1:
                color = Colors.GREEN
                bar_len = int(20 * (1 - avg_error))
                bar = "█" * bar_len + "░" * (20 - bar_len)
                status = "Almost there!"
            elif avg_error < 0.2:
                color = Colors.YELLOW
                bar_len = int(20 * (1 - avg_error))
                bar = "█" * bar_len + "░" * (20 - bar_len)
                status = "Getting better..."
            else:
                color = Colors.RED
                bar_len = int(20 * (1 - min(avg_error, 1.0)))
                bar = "█" * bar_len + "░" * (20 - bar_len)
                status = "Still learning..."
            print(f" {Colors.DIM}Epoch {epoch:>5}{Colors.RESET} │ "
                  f"{color}Error: {avg_error:.4f} │ [{bar}] │ {status}{Colors.RESET}")
    # Show final results
    print(f"\n {Colors.BRIGHT_GREEN}{'─'*50}")
    print(f" ✅ Training Complete!{Colors.RESET}\n")
    print(f" {Colors.BRIGHT_MAGENTA}Final Weights: "
          f"{np.round(neuron.weights, 4)}{Colors.RESET}")
    print(f" {Colors.BLUE}Final Bias: {neuron.bias:.4f}{Colors.RESET}\n")
    print(f" {Colors.BRIGHT_GREEN}After Training:{Colors.RESET}")
    all_correct = True
    for i in range(len(inputs_data)):
        output, _ = neuron.forward(inputs_data[i])
        expected = targets[i]
        correct = "✓" if round(output) == expected else "✗"
        color = Colors.GREEN if correct == "✓" else Colors.RED
        if correct == "✗":
            all_correct = False
        print(f"   {Colors.CYAN}{inputs_data[i]}{Colors.RESET} → "
              f"{Colors.YELLOW}{output:.4f}{Colors.RESET} "
              f"(expected: {expected:.0f}) → rounded: {round(output)} "
              f"{color}{correct}{Colors.RESET}")
    if all_correct:
        print(f"\n {Colors.BRIGHT_GREEN}🎉 The neuron learned the {gate_name} gate "
              f"PERFECTLY!{Colors.RESET}")
    else:
        print(f"\n {Colors.YELLOW}⚠️ The neuron is still learning. "
              f"Try more epochs!{Colors.RESET}")
def explain_learning():
    """
    Print a visual explanation of how the neuron learns.
    """
    print_section("HOW DOES THE NEURON LEARN?", "💡")
    print(f"""  {Colors.WHITE}The neuron learns through a simple 4-step process:{Colors.RESET}

  {Colors.BRIGHT_CYAN}┌──────────────────────────────────────────────────────┐
  │                                                      │
  │  Step 1: FORWARD PASS                                │
  │  ────────────────────                                │
  │  Feed inputs through the neuron to get a prediction  │
  │  output = sigmoid(weights · inputs + bias)           │
  │                                                      │
  │  Step 2: COMPUTE ERROR                               │
  │  ─────────────────────                               │
  │  error = expected_output - actual_output             │
  │  "How wrong was the neuron?"                         │
  │                                                      │
  │  Step 3: COMPUTE GRADIENT                            │
  │  ────────────────────────                            │
  │  gradient = error × sigmoid_derivative(output)       │
  │  "Which direction should we adjust?"                 │
  │                                                      │
  │  Step 4: UPDATE WEIGHTS                              │
  │  ──────────────────────                              │
  │  weight += learning_rate × gradient × input          │
  │  "Nudge weights to reduce the error"                 │
  │                                                      │
  └──────────────────────────────────────────────────────┘{Colors.RESET}

  {Colors.BRIGHT_YELLOW}KEY INSIGHT:{Colors.RESET}
  {Colors.YELLOW}This is like a student learning from a teacher:
    1. Student answers a question (forward pass)
    2. Teacher says "you're wrong by X" (error)
    3. Teacher says "adjust your thinking THIS way" (gradient)
    4. Student adjusts their understanding (weight update)
  Repeat thousands of times → student masters the subject!{Colors.RESET}
""")
# ============================================================================
# MAIN EXECUTION
# ============================================================================
if __name__ == "__main__":
    """
    Main execution block - runs when you execute this script directly.

    WHY __name__ == '__main__'?
    - This is a Python convention
    - Code inside this block ONLY runs when you run this file directly
    - If someone imports this file, this code won't execute
    - This lets us use the Neuron class in other files without running demos
    """
    # Print the beautiful header
    print_header()
    # Demo 1: Show how a single neuron works
    demo_single_neuron()
    # Demo 2: AND gate (before training)
    and_neuron, and_inputs, and_targets = demo_and_gate()
    # Demo 3: OR gate (before training)
    or_neuron, or_inputs, or_targets = demo_or_gate()
    # Explain how learning works
    explain_learning()
    # Train the AND gate neuron
    train_neuron(and_neuron, and_inputs, and_targets, "AND", epochs=5000)
    # Train a fresh OR gate neuron
    np.random.seed(123)
    or_neuron_fresh = Neuron(num_inputs=2, learning_rate=0.5)
    train_neuron(or_neuron_fresh, or_inputs, or_targets, "OR", epochs=5000)
    # Print the footer
    print_footer()
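The update rule in train_step (weight += learning_rate × error × sigmoid_derivative × input) can be read as gradient descent on a squared-error loss. The sketch below is one way to check that numerically. It is not part of step1_neuron.py: the loss 0.5 * (target - output)^2 is an assumption (the script never names its loss), and the helper names squared_error, update_direction, and numerical_gradient are made up for this illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, as in Neuron.sigmoid but without the clipping."""
    return 1.0 / (1.0 + np.exp(-z))


def squared_error(weights, bias, inputs, target):
    """Loss assumed for this sketch: 0.5 * (target - output)^2."""
    output = sigmoid(np.dot(inputs, weights) + bias)
    return 0.5 * (target - output) ** 2


def update_direction(weights, bias, inputs, target):
    """What train_step adds to the weights, before the learning rate."""
    output = sigmoid(np.dot(inputs, weights) + bias)
    error = target - output
    return error * output * (1.0 - output) * inputs


def numerical_gradient(weights, bias, inputs, target, eps=1e-6):
    """Central-difference estimate of d(loss)/d(weights)."""
    grad = np.zeros_like(weights)
    for i in range(len(weights)):
        step = np.zeros_like(weights)
        step[i] = eps
        loss_plus = squared_error(weights + step, bias, inputs, target)
        loss_minus = squared_error(weights - step, bias, inputs, target)
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad


# Hypothetical example values: one AND-gate row with random starting weights.
rng = np.random.default_rng(0)
weights = rng.uniform(-1, 1, 2)
bias = 0.0
inputs = np.array([1.0, 1.0])
target = 1.0

direction = update_direction(weights, bias, inputs, target)
slope = numerical_gradient(weights, bias, inputs, target)

# If the rule is gradient descent, the step points along the negative gradient.
print("train_step direction:", direction)
print("negative gradient:   ", -slope)
print("match:", np.allclose(direction, -slope, atol=1e-6))
```

If the two printed vectors agree, adding learning_rate × direction to the weights moves them downhill on the assumed loss, which is why repeated calls to train_step shrink the error.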
Complete Code: step2_network.py
Python
"""
================================================================================
🧠 LEVEL 2 - STEP 2: BUILDING A NEURAL NETWORK
================================================================================
Build a full neural network from scratch with:
  - Input layer → Hidden layer (16 neurons) → Output layer
  - Sigmoid activation for hidden layer
  - Softmax activation for output layer
  - Forward pass that shows data flowing through each layer
NO DEEP LEARNING FRAMEWORKS - just NumPy and math!
================================================================================
"""
# ============================================================================
# IMPORTS
# ============================================================================
import numpy as np # NumPy for fast array operations
import os # For file path operations
# ============================================================================
# ANSI COLOR CODES
# ============================================================================
# WHY: Colors make terminal output easier to read and more engaging
# Each color code starts with \033[ (escape sequence) and ends with m
class Colors:
    """ANSI color codes for beautiful terminal output."""
    RESET = "\033[0m"
    BOLD = "\033[1m"
    DIM = "\033[2m"
    RED = "\033[31m"
    GREEN = "\033[32m"
    YELLOW = "\033[33m"
    BLUE = "\033[34m"
    MAGENTA = "\033[35m"
    CYAN = "\033[36m"
    WHITE = "\033[37m"
    BRIGHT_RED = "\033[91m"
    BRIGHT_GREEN = "\033[92m"
    BRIGHT_YELLOW = "\033[93m"
    BRIGHT_BLUE = "\033[94m"
    BRIGHT_MAGENTA = "\033[95m"
    BRIGHT_CYAN = "\033[96m"
    # Background colors for extra flair
    BG_BLUE = "\033[44m"
    BG_GREEN = "\033[42m"
    BG_YELLOW = "\033[43m"
# ============================================================================
# HELPER FUNCTIONS
# ============================================================================
def print_header():
    """Print a beautiful header for this script."""
    print(f"\n{Colors.BRIGHT_MAGENTA}{'='*70}")
    print(f" 🧠 LEVEL 2 - STEP 2: BUILDING A NEURAL NETWORK")
    print(f"{'='*70}{Colors.RESET}")
    print(f"{Colors.DIM} A multi-layer network with forward pass visualization!{Colors.RESET}")
    print(f"{Colors.DIM} Input → Hidden (16 neurons, sigmoid) → Output (softmax){Colors.RESET}\n")


def print_footer():
    """Print a beautiful footer for this script."""
    print(f"\n{Colors.BRIGHT_MAGENTA}{'='*70}")
    print(f" ✅ STEP 2 COMPLETE! You built a full neural network!")
    print(f" 👉 Next: step3_train.py - Train it to generate text!")
    print(f"{'='*70}{Colors.RESET}\n")


def print_section(title, emoji="📌"):
    """Print a section header with color."""
    print(f"\n{Colors.BRIGHT_YELLOW}{'─'*70}")
    print(f" {emoji} {title}")
    print(f"{'─'*70}{Colors.RESET}\n")
# ============================================================================
# NEURAL NETWORK CLASS
# ============================================================================
class NeuralNetwork:
    """
    A simple feedforward neural network with one hidden layer.

    Architecture:
        Input (n features) → Hidden (16 neurons, sigmoid) → Output (m classes, softmax)

    WHY this architecture?
    - One hidden layer is enough to learn many patterns (Universal Approximation Theorem)
    - 16 hidden neurons is enough for simple tasks and small enough to inspect by eye
    - Sigmoid in hidden layer: squashes values to (0, 1), introduces non-linearity
    - Softmax in output layer: converts raw scores into probabilities that sum to 1

    WHY non-linearity matters:
    - Without activation functions, stacking layers would be pointless
    - Multiple linear layers collapse into a single linear layer
    - Non-linearity (sigmoid) allows the network to learn CURVES, not just lines
    """

    def __init__(self, input_size, hidden_size=16, output_size=4):
        """
        Initialize the network with random weights.

        Parameters:
            input_size: Number of input features (e.g., 4 for 4 inputs)
            hidden_size: Number of neurons in hidden layer (default: 16)
            output_size: Number of output classes (default: 4)

        WHY Xavier initialization?
        - Random weights that are too large → sigmoid saturates and learning stalls
        - Random weights that are too small → signals shrink toward zero
        - Xavier initialization scales weights by sqrt(2 / (fan_in + fan_out))
          to keep values reasonable
        - Named after Xavier Glorot, who proposed it in 2010
        """
        # Store architecture info for display
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        # ── Weights connecting Input → Hidden ──
        # Shape: (input_size, hidden_size)
        # WHY this shape: Each input connects to EVERY hidden neuron
        # So we need input_size × hidden_size connections
        # Xavier initialization: scale by sqrt(2 / (fan_in + fan_out))
        scale_1 = np.sqrt(2.0 / (input_size + hidden_size))
        self.weights_input_hidden = np.random.randn(input_size, hidden_size) * scale_1
        # ── Biases for Hidden Layer ──
        # Shape: (hidden_size,) - one bias per hidden neuron
        # WHY zeros: Biases start at zero; they'll be learned during training
        self.biases_hidden = np.zeros(hidden_size)
        # ── Weights connecting Hidden → Output ──
        # Shape: (hidden_size, output_size)
        # WHY: Each hidden neuron connects to EVERY output neuron
        scale_2 = np.sqrt(2.0 / (hidden_size + output_size))
        self.weights_hidden_output = np.random.randn(hidden_size, output_size) * scale_2
        # ── Biases for Output Layer ──
        # Shape: (output_size,) - one bias per output neuron
        self.biases_output = np.zeros(output_size)
    def sigmoid(self, x):
        """
        Sigmoid activation: σ(x) = 1 / (1 + e^(-x))

        WHY sigmoid for hidden layer?
        - Squashes values into (0, 1) range
        - Smooth and differentiable (needed for backpropagation)
        - Easy to understand conceptually: "how active is this neuron?"
        - A neuron with output close to 1 is "strongly activated"
        - A neuron with output close to 0 is "barely activated"
        """
        x = np.clip(x, -500, 500)  # Prevent overflow
        return 1 / (1 + np.exp(-x))

    def softmax(self, x):
        """
        Softmax activation: converts raw scores to probabilities.
        Formula: softmax(xi) = e^(xi) / Σ(e^(xj))

        WHY softmax for output layer?
        - We want PROBABILITIES (they must sum to 1.0)
        - If we're predicting "which class?", we want: P(class1) + P(class2) + ... = 1
        - Softmax naturally does this!

        WHY subtract max(x)?
        - Numerical stability! e^(large number) overflows to infinity,
          and infinity / infinity gives NaN instead of a probability
        - Subtracting max makes the largest value 0, preventing overflow
        - Math still works: softmax(x) = softmax(x - max(x)), because the
          common factor e^(-max(x)) cancels between numerator and denominator
        """
        # Subtract max for numerical stability (prevents e^1000 = infinity)
        x_shifted = x - np.max(x)
        exp_x = np.exp(x_shifted)
        return exp_x / np.sum(exp_x)
    def forward(self, inputs, verbose=False):
        """
        Forward pass: push input through all layers to get output.

        Flow:
            Input → (weights × input + bias) → sigmoid → Hidden
            Hidden → (weights × hidden + bias) → softmax → Output

        This is like a relay race:
        - Input layer PASSES information to hidden layer
        - Hidden layer PROCESSES it (sigmoid squashes it)
        - Hidden layer PASSES processed info to output layer
        - Output layer CONVERTS it to probabilities (softmax)

        Returns:
            output_probs: probability distribution over output classes
            hidden_raw: raw weighted sums before sigmoid (for visualization)
            hidden_activated: hidden values after sigmoid (for visualization)
            output_raw: raw weighted sums before softmax (for visualization)
        """
        # ── LAYER 1: Input → Hidden ──
        # Matrix multiplication: each hidden neuron computes its weighted sum
        # WHY np.dot: This efficiently computes all weighted sums at once
        # Instead of looping over each neuron, matrix math does it in one shot!
        hidden_raw = np.dot(inputs, self.weights_input_hidden) + self.biases_hidden
        # Shape: (hidden_size,) - one value per hidden neuron
        # Apply sigmoid activation to hidden layer
        # WHY: Without this, the network is just a linear function
        # Sigmoid introduces the non-linearity that makes neural networks powerful
        hidden_activated = self.sigmoid(hidden_raw)
        # ── LAYER 2: Hidden → Output ──
        # Same process: matrix multiply hidden activations by output weights
        output_raw = np.dot(hidden_activated, self.weights_hidden_output) + self.biases_output
        # Apply softmax to get probabilities
        # WHY softmax here: We want the output to be a probability distribution
        # Each output value represents: "how likely is this class?"
        output_probs = self.softmax(output_raw)
        # Verbose output for educational purposes
        if verbose:
            self._print_forward_pass(inputs, hidden_raw, hidden_activated,
                                     output_raw, output_probs)
        return output_probs, hidden_raw, hidden_activated, output_raw
    def _print_forward_pass(self, inputs, hidden_raw, hidden_activated,
                            output_raw, output_probs):
        """
        Print the complete forward pass with beautiful formatting.
        Shows exactly what happens at each stage of the computation.
        """
        print(f" {Colors.BRIGHT_CYAN}┌──────────────────────────────────────────────────────┐")
        print(f" │              FORWARD PASS VISUALIZATION              │")
        print(f" └──────────────────────────────────────────────────────┘{Colors.RESET}\n")
        # ── Input Layer ──
        print(f" {Colors.BRIGHT_GREEN}▸ INPUT LAYER{Colors.RESET} "
              f"{Colors.DIM}({self.input_size} values){Colors.RESET}")
        print(f"   {Colors.CYAN}", end="")
        for i, val in enumerate(inputs):
            print(f"x{i}={val:.2f} ", end="")
        print(f"{Colors.RESET}\n")
        print(f" {Colors.DIM}  │ Matrix multiply by weights "
              f"({self.input_size}×{self.hidden_size}){Colors.RESET}")
        print(f" {Colors.DIM}  │ Add biases ({self.hidden_size}){Colors.RESET}")
        print(f" {Colors.DIM}  ▼{Colors.RESET}\n")
        # ── Hidden Layer (raw) ──
        print(f" {Colors.BRIGHT_YELLOW}▸ HIDDEN LAYER - Raw Weighted Sums{Colors.RESET} "
              f"{Colors.DIM}(before activation){Colors.RESET}")
        self._print_neuron_values(hidden_raw, "h", Colors.YELLOW)
        print(f"\n {Colors.DIM}  │ Apply sigmoid: σ(x) = 1/(1+e^(-x)){Colors.RESET}")
        print(f" {Colors.DIM}  │ Squash each value to (0, 1){Colors.RESET}")
        print(f" {Colors.DIM}  ▼{Colors.RESET}\n")
        # ── Hidden Layer (activated) ──
        print(f" {Colors.BRIGHT_GREEN}▸ HIDDEN LAYER - After Sigmoid{Colors.RESET} "
              f"{Colors.DIM}(activated values){Colors.RESET}")
        self._print_neuron_values(hidden_activated, "a", Colors.GREEN, show_bar=True)
        print(f"\n {Colors.DIM}  │ Matrix multiply by weights "
              f"({self.hidden_size}×{self.output_size}){Colors.RESET}")
        print(f" {Colors.DIM}  │ Add biases ({self.output_size}){Colors.RESET}")
        print(f" {Colors.DIM}  ▼{Colors.RESET}\n")
        # ── Output Layer (raw) ──
        print(f" {Colors.BRIGHT_MAGENTA}▸ OUTPUT LAYER - Raw Scores{Colors.RESET} "
              f"{Colors.DIM}(before softmax){Colors.RESET}")
        self._print_neuron_values(output_raw, "o", Colors.MAGENTA)
        print(f"\n {Colors.DIM}  │ Apply softmax: convert to probabilities{Colors.RESET}")
        print(f" {Colors.DIM}  │ All values sum to 1.0{Colors.RESET}")
        print(f" {Colors.DIM}  ▼{Colors.RESET}\n")
        # ── Output Layer (probabilities) ──
        print(f" {Colors.BRIGHT_CYAN}▸ OUTPUT LAYER - Probabilities{Colors.RESET} "
              f"{Colors.DIM}(after softmax){Colors.RESET}")
        self._print_probability_bars(output_probs)
        # Sum check
        print(f"\n {Colors.DIM}Sum of probabilities: {Colors.BRIGHT_GREEN}"
              f"{np.sum(output_probs):.6f}{Colors.RESET} "
              f"{Colors.DIM}(should be 1.000000) ✓{Colors.RESET}")
        # Prediction
        predicted_class = np.argmax(output_probs)
        print(f"\n {Colors.BRIGHT_GREEN} 🎯 Predicted Class: {predicted_class} "
              f"(probability: {output_probs[predicted_class]:.4f}){Colors.RESET}")
    def _print_neuron_values(self, values, prefix, color, show_bar=False):
        """Print neuron values in a formatted grid."""
        # Print 4 values per row for readability
        for i in range(0, len(values), 4):
            row_vals = values[i:i+4]
            row_str = "   "
            for j, val in enumerate(row_vals):
                idx = i + j
                if show_bar:
                    # Show a mini bar chart for activated values (0-1 range)
                    bar_len = int(val * 10)
                    bar = "█" * bar_len + "░" * (10 - bar_len)
                    row_str += f"{color}{prefix}{idx:>2}={val:>6.3f} [{bar}] {Colors.RESET}"
                else:
                    row_str += f"{color}{prefix}{idx:>2}={val:>8.4f} {Colors.RESET}"
            print(row_str)

    def _print_probability_bars(self, probs):
        """Print probability values as horizontal bar charts."""
        max_idx = np.argmax(probs)
        for i, prob in enumerate(probs):
            bar_len = int(prob * 40)  # Scale to 40 characters wide
            bar = "█" * bar_len + "░" * (40 - bar_len)
            # Highlight the highest probability
            if i == max_idx:
                print(f"   {Colors.BRIGHT_GREEN}Class {i}: {prob:.4f} [{bar}] ← WINNER{Colors.RESET}")
            else:
                print(f"   {Colors.CYAN}Class {i}: {prob:.4f} [{bar}]{Colors.RESET}")
# ============================================================================
# VISUALIZATION FUNCTIONS
# ============================================================================
def print_network_architecture(net):
"""
Print a visual ASCII art representation of the network architecture.
WHY visualize?
- Seeing the structure helps understand what's happening
- You can see how many connections there are
- It makes the concept tangible
"""
print_section("NETWORK ARCHITECTURE", "๐๏ธ")
# Count total parameters
total_params = (net.input_size * net.hidden_size + # InputโHidden weights
net.hidden_size + # Hidden biases
net.hidden_size * net.output_size + # HiddenโOutput weights
net.output_size) # Output biases
total_weights = (net.input_size * net.hidden_size +
net.hidden_size * net.output_size)
total_biases = net.hidden_size + net.output_size
print(f" {Colors.BRIGHT_CYAN}Network Configuration:{Colors.RESET}")
print(f" {Colors.CYAN} โข Input size: {net.input_size} neurons{Colors.RESET}")
print(f" {Colors.CYAN} โข Hidden size: {net.hidden_size} neurons (sigmoid){Colors.RESET}")
print(f" {Colors.CYAN} โข Output size: {net.output_size} neurons (softmax){Colors.RESET}")
print(f" {Colors.YELLOW} โข Total weights: {total_weights}{Colors.RESET}")
print(f" {Colors.YELLOW} โข Total biases: {total_biases}{Colors.RESET}")
print(f" {Colors.BRIGHT_GREEN} โข Total params: {total_params}{Colors.RESET}\n")
# ASCII art network diagram
print(f" {Colors.BRIGHT_MAGENTA}Visual Architecture:{Colors.RESET}\n")
# Determine how many neurons to show (max display for readability)
in_show = min(net.input_size, 5)
hid_show = min(net.hidden_size, 6)
out_show = min(net.output_size, 5)
# Build the visual layer by layer
in_labels = [f"x{i}" for i in range(in_show)]
if net.input_size > in_show:
in_labels.append("...")
in_labels.append(f"x{net.input_size-1}")
hid_labels = [f"h{i}" for i in range(hid_show)]
if net.hidden_size > hid_show:
hid_labels.append("...")
hid_labels.append(f"h{net.hidden_size-1}")
out_labels = [f"y{i}" for i in range(out_show)]
if net.output_size > out_show:
out_labels.append("...")
out_labels.append(f"y{net.output_size-1}")
# Calculate layout
max_rows = max(len(in_labels), len(hid_labels), len(out_labels))
# Pad lists to same length
def pad_list(lst, target_len):
while len(lst) < target_len:
lst.append("")
return lst
in_labels = pad_list(in_labels, max_rows)
hid_labels = pad_list(hid_labels, max_rows)
out_labels = pad_list(out_labels, max_rows)
# Print header
print(f" {Colors.BRIGHT_GREEN} INPUT {Colors.BRIGHT_YELLOW} HIDDEN {Colors.BRIGHT_CYAN} OUTPUT{Colors.RESET}")
print(f" {Colors.BRIGHT_GREEN} LAYER {Colors.BRIGHT_YELLOW} LAYER {Colors.BRIGHT_CYAN} LAYER{Colors.RESET}")
print(f" {Colors.BRIGHT_GREEN} ({net.input_size}) "
f"{Colors.BRIGHT_YELLOW} ({net.hidden_size}, sigmoid) "
f"{Colors.BRIGHT_CYAN} ({net.output_size}, softmax){Colors.RESET}")
print()
for i in range(max_rows):
in_val = in_labels[i]
hid_val = hid_labels[i]
out_val = out_labels[i]
# Input neuron
if in_val and in_val != "...":
in_part = f"{Colors.BRIGHT_GREEN} ( {in_val:>3} ){Colors.RESET}"
elif in_val == "...":
in_part = f"{Colors.DIM} ... {Colors.RESET}"
else:
in_part = " "
# Connection lines
if in_val and in_val != "..." and hid_val and hid_val != "...":
conn1 = f"{Colors.DIM}โโโโโโบ{Colors.RESET}"
elif in_val and in_val != "...":
conn1 = f"{Colors.DIM}โโโ {Colors.RESET}"
elif hid_val and hid_val != "...":
conn1 = f"{Colors.DIM} โโโโบ{Colors.RESET}"
else:
conn1 = " "
# Hidden neuron
if hid_val and hid_val != "...":
hid_part = f"{Colors.BRIGHT_YELLOW}( {hid_val:>3} ){Colors.RESET}"
elif hid_val == "...":
hid_part = f"{Colors.DIM} ... {Colors.RESET}"
else:
hid_part = " "
# Connection lines 2
if hid_val and hid_val != "..." and out_val and out_val != "...":
conn2 = f"{Colors.DIM}─────►{Colors.RESET}"
elif hid_val and hid_val != "...":
conn2 = f"{Colors.DIM}───   {Colors.RESET}"
elif out_val and out_val != "...":
conn2 = f"{Colors.DIM}  ───►{Colors.RESET}"
else:
conn2 = "      "
# Output neuron
if out_val and out_val != "...":
out_part = f"{Colors.BRIGHT_CYAN}( {out_val:>3} ){Colors.RESET}"
elif out_val == "...":
out_part = f"{Colors.DIM} ... {Colors.RESET}"
else:
out_part = " "
print(f" {in_part}{conn1}{hid_part}{conn2}{out_part}")
print(f"\n {Colors.DIM} Note: In reality, EVERY input neuron connects to EVERY hidden neuron,")
print(f" and EVERY hidden neuron connects to EVERY output neuron.{Colors.RESET}")
print(f" {Colors.DIM} That's {net.input_size}×{net.hidden_size} + {net.hidden_size}×{net.output_size} = {total_weights} connection weights!{Colors.RESET}")
def print_weight_statistics(net):
"""
Print statistics about the network's weights.
WHY: Weight distributions are a quick health check for a network.
- If weights are too large: sigmoid saturates and updates become unstable (exploding gradients)
- If weights are too small: the signal shrinks layer by layer (vanishing gradients)
- Well-initialized weights should have mean ≈ 0 and a small standard deviation
"""
print_section("WEIGHT STATISTICS", "📊")
layers = [
("Input → Hidden", net.weights_input_hidden),
("Hidden Biases", net.biases_hidden),
("Hidden → Output", net.weights_hidden_output),
("Output Biases", net.biases_output),
]
print(f" {Colors.WHITE}{'Layer':<20} {'Shape':<15} {'Mean':>8} {'Std':>8} "
f"{'Min':>8} {'Max':>8}{Colors.RESET}")
print(f" {Colors.DIM}{'─'*20} {'─'*15} {'─'*8} {'─'*8} {'─'*8} {'─'*8}{Colors.RESET}")
for name, weights in layers:
mean = np.mean(weights)
std = np.std(weights)
wmin = np.min(weights)
wmax = np.max(weights)
shape = str(weights.shape)
# Color code based on health
if abs(mean) < 0.1 and std < 1.0:
color = Colors.GREEN # Healthy
elif abs(mean) < 0.5:
color = Colors.YELLOW # Okay
else:
color = Colors.RED # Concerning
print(f" {color}{name:<20} {shape:<15} {mean:>8.4f} {std:>8.4f} "
f"{wmin:>8.4f} {wmax:>8.4f}{Colors.RESET}")
print(f"\n {Colors.BRIGHT_GREEN}✓ Green rows are healthy (|mean| < 0.1 and std < 1.0), "
f"which is what Xavier initialization should produce.{Colors.RESET}")
def demo_forward_pass(net):
"""
Run a sample forward pass and display everything.
"""
print_section("FORWARD PASS DEMO", "🚀")
# Create a sample input
# WHY this input: We use values between 0 and 1 to simulate typical features
sample_input = np.random.rand(net.input_size)
print(f" {Colors.WHITE}Feeding a sample input through the network...{Colors.RESET}")
print(f" {Colors.DIM}(Using random input values between 0 and 1){Colors.RESET}\n")
# Run forward pass with verbose output
output_probs, hidden_raw, hidden_activated, output_raw = net.forward(
sample_input, verbose=True
)
def demo_multiple_inputs(net):
"""
Show how the network responds to different inputs.
"""
print_section("MULTIPLE INPUTS, SAME NETWORK", "🔄")
print(f" {Colors.WHITE}Let's see how the same network responds to different inputs:{Colors.RESET}\n")
# Generate several different inputs
test_inputs = [
("All zeros", np.zeros(net.input_size)),
("All ones", np.ones(net.input_size)),
("Random #1", np.random.rand(net.input_size)),
("Random #2", np.random.rand(net.input_size)),
("Alternating", np.array([1 if i % 2 == 0 else 0 for i in range(net.input_size)], dtype=float)),
]
print(f" {Colors.WHITE}{'Input Type':<15} ", end="")
for i in range(net.output_size):
print(f"{'Class '+str(i):>10} ", end="")
print(f"{'Prediction':>12}{Colors.RESET}")
print(f" {Colors.DIM}{'─'*15} ", end="")
for i in range(net.output_size):
print(f"{'─'*10} ", end="")
print(f"{'─'*12}{Colors.RESET}")
for name, inp in test_inputs:
output_probs, _, _, _ = net.forward(inp)
predicted = np.argmax(output_probs)
print(f" {Colors.CYAN}{name:<15} ", end="")
for prob in output_probs:
# Color intensity based on probability
if prob > 0.5:
color = Colors.BRIGHT_GREEN
elif prob > 0.25:
color = Colors.YELLOW
else:
color = Colors.DIM
print(f"{color}{prob:>10.4f} ", end="")
print(f"{Colors.BRIGHT_GREEN}→ Class {predicted}{Colors.RESET}")
print(f"\n {Colors.BRIGHT_YELLOW}💡 Notice: Without training, the network gives "
f"random-looking predictions!{Colors.RESET}")
print(f" {Colors.YELLOW} This is because the weights are random. "
f"Training will fix this!{Colors.RESET}")
def explain_concepts():
"""
Print educational explanation of key concepts.
"""
print_section("KEY CONCEPTS EXPLAINED", "💡")
print(f"""  {Colors.BRIGHT_CYAN}┌──────────────────────────────────────────────────────────┐
  │                   WHAT JUST HAPPENED?                    │
  ├──────────────────────────────────────────────────────────┤
  │                                                          │
  │  1. INPUT LAYER receives raw data                        │
  │    → Just passes values through, no processing           │
  │                                                          │
  │  2. HIDDEN LAYER does the heavy lifting                  │
  │    → Multiplies inputs by weights (matrix multiplication)│
  │    → Adds biases (shifts the activation)                 │
  │    → Applies sigmoid (introduces non-linearity)          │
  │    → Each neuron detects a different PATTERN             │
  │                                                          │
  │  3. OUTPUT LAYER makes the final decision                │
  │    → Combines hidden neurons' opinions                   │
  │    → Applies softmax (converts to probabilities)         │
  │    → Highest probability = the prediction                │
  │                                                          │
  ├──────────────────────────────────────────────────────────┤
  │                                                          │
  │  🔑 KEY INSIGHT:                                         │
  │  Right now the network is UNTRAINED: its weights are     │
  │  random, so predictions are random too!                  │
  │  In step3_train.py, we'll train it to learn patterns.    │
  │                                                          │
  └──────────────────────────────────────────────────────────┘{Colors.RESET}
""")
# Activation function comparison
print(f" {Colors.BRIGHT_YELLOW}Activation Functions Used:{Colors.RESET}\n")
print(f" {Colors.GREEN} SIGMOID (Hidden Layer):{Colors.RESET}")
print(f" {Colors.DIM} σ(x) = 1 / (1 + e^(-x)){Colors.RESET}")
print(f" {Colors.DIM} Output range: (0, 1){Colors.RESET}")
print(f" {Colors.DIM} Used for: internal feature detection{Colors.RESET}\n")
# ASCII sigmoid curve
print(f" {Colors.GREEN}1.0 ┤                ╭──────{Colors.RESET}")
print(f" {Colors.GREEN}    │              ╭─╯{Colors.RESET}")
print(f" {Colors.GREEN}    │             ╭╯{Colors.RESET}")
print(f" {Colors.GREEN}0.5 ┤            ╭╯{Colors.RESET}")
print(f" {Colors.GREEN}    │           ╭╯{Colors.RESET}")
print(f" {Colors.GREEN}    │         ╭─╯{Colors.RESET}")
print(f" {Colors.GREEN}0.0 ┤   ──────╯{Colors.RESET}")
print(f" {Colors.DIM}    └───┬───┬───┬───┬───┬───┬──{Colors.RESET}")
print(f" {Colors.DIM}       -5  -3  -1  +1  +3  +5{Colors.RESET}\n")
print(f" {Colors.MAGENTA} SOFTMAX (Output Layer):{Colors.RESET}")
print(f" {Colors.DIM} softmax(x_i) = e^(x_i) / Σ_j e^(x_j){Colors.RESET}")
print(f" {Colors.DIM} Output range: (0, 1) per class, sum = 1.0{Colors.RESET}")
print(f" {Colors.DIM} Used for: final classification / probability distribution{Colors.RESET}\n")
print(f" {Colors.MAGENTA} Raw scores: [2.0, 1.0, 0.5, 0.1]{Colors.RESET}")
print(f" {Colors.DIM} ↓ softmax{Colors.RESET}")
print(f" {Colors.MAGENTA} Probabilities: [0.57, 0.21, 0.13, 0.09] "
f"{Colors.DIM}(sum ≈ 1.0){Colors.RESET}\n")
# ============================================================================
# MAIN EXECUTION
# ============================================================================
if __name__ == "__main__":
"""
Main execution block.
WHY: This pattern ensures the demo code only runs when this file
is executed directly, not when it's imported by another file.
"""
# Set random seed for reproducibility
# WHY: Same seed = same random numbers = same output every time
# This is important for teaching: students get the same results
np.random.seed(42)
# Print header
print_header()
# Create the neural network
# WHY these sizes:
# - 8 inputs: a reasonable small feature vector
# - 16 hidden: enough neurons to learn patterns, few enough to display
# - 4 outputs: simulates a 4-class classification problem
input_size = 8
hidden_size = 16
output_size = 4
print(f" {Colors.WHITE}Creating a neural network:{Colors.RESET}")
print(f" {Colors.CYAN} Input: {input_size} neurons (raw features){Colors.RESET}")
print(f" {Colors.CYAN} Hidden: {hidden_size} neurons (sigmoid activation){Colors.RESET}")
print(f" {Colors.CYAN} Output: {output_size} neurons (softmax → probabilities){Colors.RESET}\n")
net = NeuralNetwork(input_size, hidden_size, output_size)
# Show network architecture
print_network_architecture(net)
# Show weight statistics
print_weight_statistics(net)
# Run a forward pass demo
demo_forward_pass(net)
# Show multiple inputs
demo_multiple_inputs(net)
# Explain concepts
explain_concepts()
# Print footer
print_footer()
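The softmax example printed by explain_concepts() can be checked by hand. The snippet below is a minimal standalone sketch, not part of step2_network.py: the helper names `sigmoid` and `softmax` are illustrative, and it only assumes NumPy is installed. It recomputes the probabilities for the raw scores [2.0, 1.0, 0.5, 0.1].

```python
import numpy as np


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def softmax(scores):
    """Turn a 1-D array of raw scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)


raw_scores = np.array([2.0, 1.0, 0.5, 0.1])
probs = softmax(raw_scores)

print(np.round(probs, 2))  # [0.57 0.21 0.13 0.09]
print(probs.sum())         # 1.0, up to floating-point rounding
print(sigmoid(0.0))        # 0.5, the midpoint of the sigmoid curve
```

The largest raw score always gets the largest probability, but softmax exaggerates the gap: a score that is only twice as large (2.0 vs 1.0) ends up with almost three times the probability.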
Complete Code: step3_train.py
Python
"""
================================================================================
🧠 LEVEL 2 - STEP 3: TRAINING A NEURAL NETWORK
================================================================================
Train a character-level neural network to predict the next character!
This script:
1. Takes a sample text and creates training data (character pairs)
2. One-hot encodes characters (binary representation)
3. Builds a neural network with backpropagation FROM SCRATCH
4. Trains the network to predict: given a character, what comes next?
5. Generates text using the trained network
NO DEEP LEARNING FRAMEWORKS: everything is built from scratch with NumPy!
Training completes in under 60 seconds.
================================================================================
"""
# ============================================================================
# IMPORTS
# ============================================================================
import numpy as np # For fast math operations on arrays
import os # For file path operations (saving training history)
import json # For saving/loading training history
import time # For measuring training duration
import sys # For system operations
# ============================================================================
# ANSI COLOR CODES
# ============================================================================
class Colors:
"""ANSI color codes for beautiful terminal output."""
RESET = "\033[0m"
BOLD = "\033[1m"
DIM = "\033[2m"
RED = "\033[31m"
GREEN = "\033[32m"
YELLOW = "\033[33m"
BLUE = "\033[34m"
MAGENTA = "\033[35m"
CYAN = "\033[36m"
WHITE = "\033[37m"
BRIGHT_RED = "\033[91m"
BRIGHT_GREEN = "\033[92m"
BRIGHT_YELLOW = "\033[93m"
BRIGHT_BLUE = "\033[94m"
BRIGHT_MAGENTA = "\033[95m"
BRIGHT_CYAN = "\033[96m"
# ============================================================================
# HELPER FUNCTIONS
# ============================================================================
def print_header():
"""Print a beautiful header for this script."""
print(f"\n{Colors.BRIGHT_BLUE}{'='*70}")
print(f" 🧠 LEVEL 2 - STEP 3: TRAINING A NEURAL NETWORK")
print(f"{'='*70}{Colors.RESET}")
print(f"{Colors.DIM} Character-level text generation with backpropagation!{Colors.RESET}")
print(f"{Colors.DIM} Everything from scratch, no frameworks!{Colors.RESET}\n")
def print_footer():
"""Print a beautiful footer."""
print(f"\n{Colors.BRIGHT_BLUE}{'='*70}")
print(f" ✅ STEP 3 COMPLETE! The network learned to generate text!")
print(f" 👉 Next: step4_visualize.py - Visualize the training!")
print(f"{'='*70}{Colors.RESET}\n")
def print_section(title, emoji="📌"):
"""Print a section header."""
print(f"\n{Colors.BRIGHT_YELLOW}{'─'*70}")
print(f" {emoji} {title}")
print(f"{'─'*70}{Colors.RESET}\n")
# ============================================================================
# TRAINING DATA PREPARATION
# ============================================================================
# WHY this text: A short, meaningful paragraph with enough variety of characters
# to learn patterns. We keep it short so training finishes quickly.
# The text has repeated patterns (common English letter sequences) that the
# network can learn to reproduce.
SAMPLE_TEXT = (
"the quick brown fox jumps over the lazy dog. "
"a neural network learns from data. "
"the brain has billions of neurons. "
"each neuron connects to many others. "
"learning happens when connections change. "
"the network adjusts weights to reduce error. "
"practice makes perfect in learning. "
"data is the fuel for machine learning. "
)
def prepare_data(text):
"""
Prepare character-level training data from text.
Steps:
1. Find all unique characters in the text (vocabulary)
2. Create mappings: character → index and index → character
3. Create training pairs: (current_char, next_char)
4. One-hot encode everything
WHY character-level?
- Simplest form of text generation
- No need for tokenizers or word dictionaries
- Shows the core concept: predict what comes next
WHY one-hot encoding?
- Neural networks work with NUMBERS, not characters
- One-hot = a vector of 0s with a single 1 at the character's index
- Example: if vocab = ['a', 'b', 'c'], then 'b' = [0, 1, 0]
- This treats each character as equally different from every other
"""
# Step 1: Get unique characters and sort them
# WHY sort: Ensures consistent ordering across runs
chars = sorted(list(set(text)))
vocab_size = len(chars)
# Step 2: Create mappings
# WHY two mappings: We need to go both ways
# char_to_idx: 'a' → 0, 'b' → 1, etc. (for encoding input)
# idx_to_char: 0 → 'a', 1 → 'b', etc. (for decoding output)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
# Step 3: Create training pairs
# WHY pairs: We train the network to predict: "given THIS character,
# what character comes NEXT?"
# Example: "hello" → [('h','e'), ('e','l'), ('l','l'), ('l','o')]
input_indices = []
target_indices = []
for i in range(len(text) - 1):
input_indices.append(char_to_idx[text[i]])
target_indices.append(char_to_idx[text[i + 1]])
# Step 4: One-hot encode
# WHY one-hot: Each character becomes a binary vector
# This lets the network treat each character as a separate "category"
num_samples = len(input_indices)
inputs_onehot = np.zeros((num_samples, vocab_size))
targets_onehot = np.zeros((num_samples, vocab_size))
for i in range(num_samples):
inputs_onehot[i, input_indices[i]] = 1.0
targets_onehot[i, target_indices[i]] = 1.0
return (inputs_onehot, targets_onehot, input_indices, target_indices,
chars, char_to_idx, idx_to_char, vocab_size)
# ============================================================================
# NEURAL NETWORK CLASS (WITH BACKPROPAGATION)
# ============================================================================
class CharLevelNetwork:
"""
A neural network that learns to predict the next character.
Architecture:
Input (vocab_size) → Hidden (64 neurons, sigmoid) → Output (vocab_size, softmax)
This class implements:
- Forward pass (making predictions)
- Backward pass (learning from mistakes): BACKPROPAGATION!
- Weight updates (gradient descent)
- Text generation (using learned patterns)
WHY 64 hidden neurons?
- Enough to learn character-level patterns in our small text
- Not so many that training is slow
- A good balance for educational purposes
"""
def __init__(self, vocab_size, hidden_size=64, learning_rate=0.5):
"""
Initialize the network with random weights.
WHY Xavier initialization?
- Prevents gradients from exploding or vanishing
- Keeps values in a reasonable range during forward pass
"""
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.learning_rate = learning_rate
# ── Weights: Input → Hidden ──
# Shape: (vocab_size, hidden_size)
# Each input character connects to every hidden neuron
scale_1 = np.sqrt(2.0 / (vocab_size + hidden_size))
self.W1 = np.random.randn(vocab_size, hidden_size) * scale_1
self.b1 = np.zeros(hidden_size)
# ── Weights: Hidden → Output ──
# Shape: (hidden_size, vocab_size)
# Each hidden neuron connects to every output character
scale_2 = np.sqrt(2.0 / (hidden_size + vocab_size))
self.W2 = np.random.randn(hidden_size, vocab_size) * scale_2
self.b2 = np.zeros(vocab_size)
def sigmoid(self, x):
"""
Sigmoid activation: σ(x) = 1 / (1 + e^(-x))
Squashes values to (0, 1) range.
"""
x = np.clip(x, -500, 500)
return 1.0 / (1.0 + np.exp(-x))
def sigmoid_derivative(self, sigmoid_output):
"""
Derivative of sigmoid: σ'(x) = σ(x) × (1 - σ(x))
WHY we need this:
- Backpropagation requires the derivative of each function
- The derivative tells us "how sensitive is the output to input changes"
- This is used in the chain rule to compute gradients
"""
return sigmoid_output * (1.0 - sigmoid_output)
def softmax(self, x):
"""
Softmax: converts raw scores to probabilities that sum to 1.
"""
x_shifted = x - np.max(x) # Numerical stability
exp_x = np.exp(x_shifted)
return exp_x / np.sum(exp_x)
def forward(self, x):
"""
Forward pass: Input → Hidden → Output.
Saves intermediate values for backpropagation.
WHY save intermediate values?
- During backward pass, we need to know what happened at each step
- The chain rule requires values from the forward pass
- Think of it as keeping your rough work so the teacher can check it
"""
# ── Layer 1: Input → Hidden ──
# z1 = x · W1 + b1 (weighted sum)
self.x = x # Save input for backward pass
self.z1 = np.dot(x, self.W1) + self.b1 # Raw hidden values
self.a1 = self.sigmoid(self.z1) # Activated hidden values
# ── Layer 2: Hidden → Output ──
# z2 = a1 · W2 + b2 (weighted sum)
self.z2 = np.dot(self.a1, self.W2) + self.b2 # Raw output values
self.a2 = self.softmax(self.z2) # Output probabilities
return self.a2 # Return predicted probabilities
def compute_loss(self, predicted, target):
"""
Cross-entropy loss: measures how wrong our prediction is.
Formula: Loss = -Σ target_i × log(predicted_i)
WHY cross-entropy?
- Perfect for classification (predicting categories)
- Heavily penalizes confident WRONG predictions
- If we predict the right character with high probability → low loss
- If we predict the wrong character → high loss
WHY clip predicted values?
- log(0) = -infinity, which would turn the loss into inf or NaN
- Clipping to [1e-15, 1] prevents this numerical issue
"""
# Clip to prevent log(0) which is undefined
predicted_clipped = np.clip(predicted, 1e-15, 1.0)
# Cross-entropy loss
loss = -np.sum(target * np.log(predicted_clipped))
return loss
def backward(self, target):
"""
Backward pass: compute gradients using the chain rule.
THIS IS BACKPROPAGATION, the core of neural network learning!
The chain rule in action:
For output layer (Layer 2):
∂Loss/∂W2 = a1ᵀ · (predicted - target)
∂Loss/∂b2 = predicted - target
For hidden layer (Layer 1):
∂Loss/∂W1 = xᵀ · δ_hidden
∂Loss/∂b1 = δ_hidden
where δ_hidden = ((predicted - target) · W2ᵀ) × σ'(z1), with σ'(z1) = a1 × (1 - a1)
WHY this math?
- We want to know: "How much did each weight contribute to the error?"
- The chain rule lets us trace the error backwards through the network
- Each weight gets a gradient: "move this direction to reduce error"
Analogy:
- A teacher marking an exam traces back through each step
- "You got the final answer wrong BECAUSE you made an error in step 3"
- The gradient tells each weight: "you were responsible for THIS much error"
"""
# ── Output Layer Gradient ──
# The derivative of softmax + cross-entropy simplifies beautifully!
# δ_output = predicted - target
# WHY so simple: This is one of the beautiful mathematical properties
# of combining softmax with cross-entropy loss
delta_output = self.a2 - target # Shape: (vocab_size,)
# Gradient for W2: how much each hidden→output weight contributed to error
# WHY outer product: We need gradient for EVERY weight in the matrix
# self.a1.reshape(-1, 1) × delta_output.reshape(1, -1) gives us the matrix
dW2 = np.outer(self.a1, delta_output) # Shape: (hidden_size, vocab_size)
db2 = delta_output # Shape: (vocab_size,)
# ── Hidden Layer Gradient ──
# Step 1: Propagate error back through W2
# WHY dot with W2.T: We're tracing the error back through the connections
# Each hidden neuron receives error proportional to its connection weight
delta_hidden = np.dot(delta_output, self.W2.T) # Shape: (hidden_size,)
# Step 2: Multiply by sigmoid derivative (chain rule!)
# WHY: The sigmoid "squashed" the values during forward pass
# We need to account for this squashing when computing gradients
delta_hidden *= self.sigmoid_derivative(self.a1)
# Gradient for W1
dW1 = np.outer(self.x, delta_hidden) # Shape: (vocab_size, hidden_size)
db1 = delta_hidden # Shape: (hidden_size,)
# ── Update Weights (Gradient Descent) ──
# WHY subtract: We move OPPOSITE to the gradient direction
# Gradient points toward INCREASING loss
# We want to DECREASE loss, so we go in the opposite direction
# learning_rate controls how big each step is
self.W2 -= self.learning_rate * dW2
self.b2 -= self.learning_rate * db2
self.W1 -= self.learning_rate * dW1
self.b1 -= self.learning_rate * db1
def generate(self, start_char_idx, length, idx_to_char, temperature=1.0):
"""
Generate text character by character.
Process:
1. Start with a character
2. Feed it through the network → get probabilities for next character
3. Sample from those probabilities → get next character
4. Feed THAT character back in → repeat
WHY temperature?
- Controls how "creative" vs "predictable" the output is
- temperature = 1.0: normal sampling
- temperature < 1.0: more conservative (picks most likely characters)
- temperature > 1.0: more random/creative
- Think of it as: low temperature = a cautious student,
high temperature = a creative student
"""
generated = []
current_idx = start_char_idx
for _ in range(length):
# Create one-hot input for current character
x = np.zeros(self.vocab_size)
x[current_idx] = 1.0
# Forward pass
probs = self.forward(x)
# Apply temperature scaling
# WHY: Adjusts the "sharpness" of the probability distribution
if temperature != 1.0:
log_probs = np.log(np.clip(probs, 1e-15, 1.0)) / temperature
log_probs -= np.max(log_probs)
probs = np.exp(log_probs)
probs = probs / np.sum(probs)
# Sample from probability distribution
# WHY sample instead of argmax: Adds variety to generated text
# If we always pick the most likely character, output is repetitive
current_idx = np.random.choice(len(probs), p=probs)
generated.append(idx_to_char[current_idx])
return ''.join(generated)
# ============================================================================
# TRAINING FUNCTION
# ============================================================================
def train_network(text, epochs=1500, print_every=100, sample_every=500):
"""
Train the character-level network on the given text.
Parameters:
text: The training text
epochs: Number of complete passes through the data
print_every: Print loss every N epochs
sample_every: Generate sample text every N epochs
Returns:
net: The trained network
history: Training history (losses, samples, etc.)
"""
print_section("PREPARING TRAINING DATA", "📦")
# Prepare data
(inputs, targets, input_indices, target_indices,
chars, char_to_idx, idx_to_char, vocab_size) = prepare_data(text)
num_samples = len(input_indices)
print(f" {Colors.WHITE}Training Text:{Colors.RESET}")
# Print text with word wrapping
for i in range(0, len(text), 60):
print(f" {Colors.CYAN} \"{text[i:i+60]}\"{Colors.RESET}")
print(f"\n {Colors.BRIGHT_GREEN}Data Statistics:{Colors.RESET}")
print(f" {Colors.GREEN} • Text length: {len(text)} characters{Colors.RESET}")
print(f" {Colors.GREEN} • Vocabulary size: {vocab_size} unique characters{Colors.RESET}")
print(f" {Colors.GREEN} • Training pairs: {num_samples}{Colors.RESET}")
print(f" {Colors.GREEN} • Characters: {repr(''.join(chars))}{Colors.RESET}")
print(f"\n {Colors.BRIGHT_YELLOW}One-Hot Encoding Example:{Colors.RESET}")
example_char = 'a'
if example_char in char_to_idx:
idx = char_to_idx[example_char]
onehot = np.zeros(vocab_size)
onehot[idx] = 1.0
# Show just first 15 values to keep it readable
display_len = min(15, vocab_size)
print(f" {Colors.YELLOW} '{example_char}' → index {idx} → "
f"[{', '.join(f'{int(v)}' for v in onehot[:display_len])}{'...' if vocab_size > display_len else ''}]{Colors.RESET}")
print(f" {Colors.DIM} (A vector of {vocab_size} numbers: all zeros except "
f"position {idx}, which is 1){Colors.RESET}")
# ── Create Network ──
print_section("CREATING NETWORK", "🏗️")
# WHY learning_rate=0.5: A reasonable starting value for this problem
# Too high → training is unstable (overshoots)
# Too low → training is too slow (doesn't converge in time)
net = CharLevelNetwork(vocab_size=vocab_size, hidden_size=64, learning_rate=0.5)
print(f" {Colors.BRIGHT_CYAN}Network Architecture:{Colors.RESET}")
print(f" {Colors.CYAN} Input: {vocab_size} neurons (one per character){Colors.RESET}")
print(f" {Colors.CYAN} Hidden: 64 neurons (sigmoid activation){Colors.RESET}")
print(f" {Colors.CYAN} Output: {vocab_size} neurons (softmax → probabilities){Colors.RESET}")
total_params = vocab_size * 64 + 64 + 64 * vocab_size + vocab_size
print(f" {Colors.YELLOW} Total parameters: {total_params}{Colors.RESET}")
# ── Generate text BEFORE training ──
print_section("BEFORE TRAINING: Random Output", "🎲")
start_idx = char_to_idx.get('t', 0) # Start with 't'
before_text = net.generate(start_idx, 100, idx_to_char, temperature=0.8)
print(f" {Colors.RED}Generated text (untrained network):{Colors.RESET}")
print(f" {Colors.BRIGHT_RED} \"{before_text}\"{Colors.RESET}")
print(f"\n {Colors.DIM} ^ This is gibberish because the weights are random!{Colors.RESET}")
# ── Training Loop ──
print_section("TRAINING", "🚀")
print(f" {Colors.WHITE}Training for {epochs} epochs...{Colors.RESET}")
print(f" {Colors.DIM} (Each epoch = one pass through ALL training pairs){Colors.RESET}\n")
# Track training history
history = {
"losses": [],
"steps": [],
"samples": [],
"before_training": before_text,
"after_training": "" # Will be filled after training
}
start_time = time.time()
# Header for training progress table
print(f" {Colors.WHITE}{'Epoch':>7} │ {'Loss':>10} │ {'Progress Bar':^25} │ {'Time':>6}{Colors.RESET}")
print(f" {Colors.DIM}{'─'*7}─┼─{'─'*10}─┼─{'─'*25}─┼─{'─'*6}{Colors.RESET}")
for epoch in range(epochs):
total_loss = 0.0
# Shuffle training data each epoch
# WHY shuffle: Prevents the network from learning the ORDER of examples
# instead of the actual patterns. Randomizing improves generalization.
shuffle_idx = np.random.permutation(num_samples)
# ── Mini training loop ──
# Process each training pair
for i in shuffle_idx:
# Forward pass: predict next character
predicted = net.forward(inputs[i])
# Compute loss: how wrong is the prediction?
loss = net.compute_loss(predicted, targets[i])
total_loss += loss
# Backward pass: compute gradients and update weights
net.backward(targets[i])
# Average loss for this epoch
avg_loss = total_loss / num_samples
# Record history
if epoch % print_every == 0 or epoch == epochs - 1:
elapsed = time.time() - start_time
history["losses"].append(float(avg_loss))
history["steps"].append(epoch)
# Create progress bar
# epoch is 0-based, so use epoch + 1 to let the final epoch fill the bar
progress = (epoch + 1) / epochs
bar_len = 20
filled = int(bar_len * progress)
bar = "█" * filled + "░" * (bar_len - filled)
# Color based on loss level
if avg_loss < 1.0:
loss_color = Colors.BRIGHT_GREEN
elif avg_loss < 2.0:
loss_color = Colors.GREEN
elif avg_loss < 3.0:
loss_color = Colors.YELLOW
else:
loss_color = Colors.RED
print(f" {Colors.CYAN}{epoch:>7}{Colors.RESET} │ "
f"{loss_color}{avg_loss:>10.4f}{Colors.RESET} │ "
f"{Colors.BLUE}[{bar}]{Colors.RESET} {progress:>4.0%} │ "
f"{Colors.DIM}{elapsed:>5.1f}s{Colors.RESET}")
# Generate sample text periodically
if epoch % sample_every == 0 and epoch > 0:
sample_text = net.generate(start_idx, 60, idx_to_char, temperature=0.8)
history["samples"].append({"step": epoch, "text": sample_text})
print(f" {Colors.BRIGHT_MAGENTA} ↳ Sample: \"{sample_text[:50]}...\"{Colors.RESET}")
elapsed_total = time.time() - start_time
# ── Generate text AFTER training ──
print_section("AFTER TRAINING: Learned Output", "✨")
after_text = net.generate(start_idx, 150, idx_to_char, temperature=0.8)
history["after_training"] = after_text
print(f" {Colors.BRIGHT_GREEN}Generated text (trained network):{Colors.RESET}")
for i in range(0, len(after_text), 60):
print(f" {Colors.GREEN} \"{after_text[i:i+60]}\"{Colors.RESET}")
# ── Comparison ──
print_section("COMPARISON: Before vs After", "📊")
print(f" {Colors.BRIGHT_RED}BEFORE (random weights → gibberish):{Colors.RESET}")
print(f" {Colors.RED} \"{before_text[:80]}\"{Colors.RESET}\n")
print(f" {Colors.BRIGHT_GREEN}AFTER ({epochs} epochs of training):{Colors.RESET}")
print(f" {Colors.GREEN} \"{after_text[:80]}\"{Colors.RESET}\n")
print(f" {Colors.BRIGHT_YELLOW}📊 Training Statistics:{Colors.RESET}")
print(f" {Colors.YELLOW} • Total time: {elapsed_total:.1f} seconds{Colors.RESET}")
print(f" {Colors.YELLOW} • Final loss: {history['losses'][-1]:.4f}{Colors.RESET}")
print(f" {Colors.YELLOW} • Starting loss: {history['losses'][0]:.4f}{Colors.RESET}")
improvement = ((history['losses'][0] - history['losses'][-1]) /
history['losses'][0] * 100)
print(f" {Colors.YELLOW} • Loss reduction: {improvement:.1f}%{Colors.RESET}")
# ── Save Training History ──
# WHY save to JSON: So step4_visualize.py can load it and create plots
# JSON is human-readable and easy to parse
script_dir = os.path.dirname(os.path.abspath(__file__))
history_path = os.path.join(script_dir, "training_history.json")
with open(history_path, 'w') as f:
json.dump(history, f, indent=2)
print(f"\n {Colors.BRIGHT_CYAN}💾 Training history saved to:{Colors.RESET}")
print(f" {Colors.CYAN} {history_path}{Colors.RESET}")
return net, history
# ============================================================================
# MAIN EXECUTION
# ============================================================================
if __name__ == "__main__":
"""
Main execution: train the character-level network and generate text.
WHY __name__ == '__main__':
- Only runs when this file is executed directly
- Allows importing the CharLevelNetwork class without running training
"""
# Print header
print_header()
# Set random seed for reproducibility
# WHY: Same seed = same results = students can verify their output matches
np.random.seed(42)
# Train the network!
# epochs=1500 keeps training under 60 seconds on most machines
net, history = train_network(
text=SAMPLE_TEXT,
epochs=1500,
print_every=100,
sample_every=500
)
# Generate a few more samples to show variety
print_section("BONUS: Multiple Generated Samples", "🎲")
# Rebuild data to get char mappings
chars = sorted(list(set(SAMPLE_TEXT)))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
# Generate with different starting characters
start_chars = ['t', 'a', 'n', 'l', 'd']
for i, start_char in enumerate(start_chars):
if start_char in char_to_idx:
start_idx = char_to_idx[start_char]
generated = net.generate(start_idx, 80, idx_to_char, temperature=0.7)
print(f" {Colors.CYAN}Starting with '{start_char}':{Colors.RESET}")
print(f" {Colors.GREEN} \"{generated}\"{Colors.RESET}\n")
# Print footer
print_footer()
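A common way to gain confidence in hand-written backpropagation is a numerical gradient check: nudge one weight at a time, measure how the loss changes, and compare that slope with the analytic gradient. The sketch below is a standalone illustration, not part of step3_train.py. It re-implements the same forward pass and the same ∂Loss/∂W1 formula as CharLevelNetwork.backward() using hypothetical helper names (`loss_fn`, `analytic_grad_W1`, `numeric_grad_W1`) and a tiny made-up network size so it runs instantly.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def loss_fn(W1, b1, W2, b2, x, target):
    """Cross-entropy loss of the two-layer network for one training pair."""
    a1 = sigmoid(x @ W1 + b1)
    probs = softmax(a1 @ W2 + b2)
    return -np.sum(target * np.log(probs))


def analytic_grad_W1(W1, b1, W2, b2, x, target):
    """Gradient of the loss w.r.t. W1, using the backprop formulas."""
    a1 = sigmoid(x @ W1 + b1)
    probs = softmax(a1 @ W2 + b2)
    delta_output = probs - target                            # softmax + cross-entropy
    delta_hidden = (delta_output @ W2.T) * a1 * (1.0 - a1)   # chain rule through sigmoid
    return np.outer(x, delta_hidden)


def numeric_grad_W1(W1, b1, W2, b2, x, target, eps=1e-5):
    """Central-difference estimate of the same gradient, one weight at a time."""
    grad = np.zeros_like(W1)
    for idx in np.ndindex(*W1.shape):
        original = W1[idx]
        W1[idx] = original + eps
        loss_plus = loss_fn(W1, b1, W2, b2, x, target)
        W1[idx] = original - eps
        loss_minus = loss_fn(W1, b1, W2, b2, x, target)
        W1[idx] = original                                   # restore the weight
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad


# Tiny illustrative network: 5 "characters", 4 hidden neurons (assumed sizes)
rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 4
W1 = rng.normal(size=(vocab_size, hidden_size)) * 0.5
b1 = np.zeros(hidden_size)
W2 = rng.normal(size=(hidden_size, vocab_size)) * 0.5
b2 = np.zeros(vocab_size)

x = np.zeros(vocab_size)
x[2] = 1.0            # one-hot input character
target = np.zeros(vocab_size)
target[4] = 1.0       # one-hot next character

analytic = analytic_grad_W1(W1, b1, W2, b2, x, target)
numeric = numeric_grad_W1(W1, b1, W2, b2, x, target)
max_diff = np.max(np.abs(analytic - numeric))
print(f"max abs difference: {max_diff:.2e}")
```

If the two gradients agree to many decimal places, the chain-rule derivation is very likely correct. Because the input is one-hot, only the row of W1 belonging to the active character receives a non-zero gradient, which is why each training pair updates just one row of W1.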
Complete Code: step4_visualize.py
Python
"""
================================================================================
🧠 LEVEL 2 - STEP 4: VISUALIZING TRAINING
================================================================================
Visualize the training results from step3_train.py!
This script:
1. Loads training_history.json (saved by step3_train.py)
2. Plots the training loss curve with matplotlib
3. Shows sample generated text at different training stages
4. Saves the plot as training_results.png
5. Also prints an ASCII loss curve for terminals without display
6. Compares Before vs After training quality
Requires: matplotlib, numpy, json
Run step3_train.py FIRST to generate the training history!
================================================================================
"""
# ============================================================================
# IMPORTS
# ============================================================================
import numpy as np # For numerical operations
import os # For file path operations
import json # For loading training history
import sys # For system operations
# We import matplotlib in a try/except block because some systems
# may not have a display (e.g., remote servers).
# WHY Agg backend: It renders to files without needing a display
try:
import matplotlib
matplotlib.use('Agg') # Use non-interactive backend (saves to file)
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
HAS_MATPLOTLIB = True
except ImportError:
HAS_MATPLOTLIB = False
print("⚠️ matplotlib not installed. Only ASCII visualization available.")
# ============================================================================
# ANSI COLOR CODES
# ============================================================================
class Colors:
"""ANSI color codes for beautiful terminal output."""
RESET = "\033[0m"
BOLD = "\033[1m"
DIM = "\033[2m"
RED = "\033[31m"
GREEN = "\033[32m"
YELLOW = "\033[33m"
BLUE = "\033[34m"
MAGENTA = "\033[35m"
CYAN = "\033[36m"
WHITE = "\033[37m"
BRIGHT_RED = "\033[91m"
BRIGHT_GREEN = "\033[92m"
BRIGHT_YELLOW = "\033[93m"
BRIGHT_BLUE = "\033[94m"
BRIGHT_MAGENTA = "\033[95m"
BRIGHT_CYAN = "\033[96m"
# ============================================================================
# HELPER FUNCTIONS
# ============================================================================
def print_header():
"""Print a beautiful header for this script."""
print(f"\n{Colors.BRIGHT_GREEN}{'='*70}")
print(f" ๐ LEVEL 2 โ STEP 4: VISUALIZING TRAINING RESULTS")
print(f"{'='*70}{Colors.RESET}")
print(f"{Colors.DIM} Plotting loss curves and comparing output quality!{Colors.RESET}\n")
def print_footer():
"""Print a beautiful footer."""
print(f"\n{Colors.BRIGHT_GREEN}{'='*70}")
    print(f" ✅ STEP 4 COMPLETE! Training visualization done!")
print(f" ๐ Level 2 is complete โ you built a neural network from scratch!")
print(f"{'='*70}{Colors.RESET}\n")
def print_section(title, emoji="๐"):
"""Print a section header."""
print(f"\n{Colors.BRIGHT_YELLOW}{'โ'*70}")
print(f" {emoji} {title}")
print(f"{'โ'*70}{Colors.RESET}\n")
# ============================================================================
# LOAD TRAINING HISTORY
# ============================================================================
def load_training_history():
"""
Load the training history saved by step3_train.py.
WHY JSON?
- Human-readable format (you can open it in any text editor)
- Easy to parse in any programming language
- Standard format for data exchange
Returns:
dict with keys: losses, steps, samples, before_training, after_training
"""
# Get the directory where THIS script is located
# WHY: We want to find training_history.json in the SAME directory
# This works regardless of where the script is run FROM
script_dir = os.path.dirname(os.path.abspath(__file__))
history_path = os.path.join(script_dir, "training_history.json")
if not os.path.exists(history_path):
print(f" {Colors.BRIGHT_RED}โ Error: training_history.json not found!{Colors.RESET}")
print(f" {Colors.RED} Please run step3_train.py first.{Colors.RESET}")
print(f" {Colors.DIM} Expected path: {history_path}{Colors.RESET}")
sys.exit(1)
with open(history_path, 'r') as f:
history = json.load(f)
print(f" {Colors.BRIGHT_GREEN}โ Loaded training history from:{Colors.RESET}")
print(f" {Colors.CYAN} {history_path}{Colors.RESET}\n")
print(f" {Colors.WHITE}History contents:{Colors.RESET}")
print(f" {Colors.CYAN} โข {len(history['losses'])} loss data points{Colors.RESET}")
print(f" {Colors.CYAN} โข {len(history['steps'])} step markers{Colors.RESET}")
print(f" {Colors.CYAN} โข {len(history['samples'])} sample generations{Colors.RESET}")
return history
# ============================================================================
# MATPLOTLIB VISUALIZATION
# ============================================================================
def create_matplotlib_plot(history):
"""
Create a beautiful matplotlib plot of the training results.
Plot contains:
1. Top panel: Training loss curve (the main visualization)
2. Bottom panel: Sample generated text at different stages
WHY matplotlib?
- Industry standard for scientific plotting in Python
- Produces publication-quality figures
- Highly customizable
"""
if not HAS_MATPLOTLIB:
print(f" {Colors.YELLOW}โ ๏ธ matplotlib not available. "
f"Skipping graphical plot.{Colors.RESET}")
return
print_section("MATPLOTLIB VISUALIZATION", "๐")
# โโ Set up the figure โโ
# WHY GridSpec: Gives us precise control over subplot layout
# figsize=(12, 8): 12 inches wide, 8 inches tall
fig = plt.figure(figsize=(12, 8))
# Use a dark background for modern look
# WHY dark: Looks professional and is easier on the eyes
fig.patch.set_facecolor('#1a1a2e')
# Create grid: top panel (loss curve) takes 60%, bottom (text samples) takes 40%
gs = GridSpec(2, 1, height_ratios=[3, 2], hspace=0.35)
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# TOP PANEL: Training Loss Curve
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
ax1 = fig.add_subplot(gs[0])
ax1.set_facecolor('#16213e')
steps = history['steps']
losses = history['losses']
# Plot the loss curve with a gradient-like effect
# WHY multiple visual elements: Makes the plot more informative and beautiful
# Fill area under curve (semi-transparent)
# WHY fill: Gives a sense of the "volume" of loss reduction
ax1.fill_between(steps, losses, alpha=0.3, color='#e94560')
# Main line
ax1.plot(steps, losses, color='#e94560', linewidth=2.5, label='Training Loss',
marker='o', markersize=4, markerfacecolor='white', markeredgecolor='#e94560')
# Add annotation for first and last loss
# WHY annotations: Help the viewer immediately understand the improvement
ax1.annotate(f'Start: {losses[0]:.2f}',
xy=(steps[0], losses[0]),
xytext=(steps[0] + (steps[-1]-steps[0])*0.1, losses[0]*0.95),
fontsize=10, color='#ff6b6b',
arrowprops=dict(arrowstyle='->', color='#ff6b6b', lw=1.5),
fontweight='bold')
ax1.annotate(f'End: {losses[-1]:.2f}',
xy=(steps[-1], losses[-1]),
xytext=(steps[-1] * 0.75, losses[-1] + (losses[0]-losses[-1])*0.15),
fontsize=10, color='#51cf66',
arrowprops=dict(arrowstyle='->', color='#51cf66', lw=1.5),
fontweight='bold')
    # Mark sample generation points with vertical lines
    # (only the step is needed; the loss at that step is not drawn)
    for sample in history.get('samples', []):
        ax1.axvline(x=sample['step'], color='#ffd93d', linestyle='--',
                    alpha=0.4, linewidth=1)
# Styling
ax1.set_title('๐ง Neural Network Training โ Loss Over Time',
fontsize=16, color='white', fontweight='bold', pad=15)
ax1.set_xlabel('Training Epoch', fontsize=12, color='#a0a0a0')
ax1.set_ylabel('Cross-Entropy Loss', fontsize=12, color='#a0a0a0')
ax1.tick_params(colors='#a0a0a0')
ax1.grid(True, alpha=0.15, color='white')
ax1.legend(fontsize=11, loc='upper right', facecolor='#16213e',
edgecolor='#444', labelcolor='white')
# Set spine colors
for spine in ax1.spines.values():
spine.set_color('#444')
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
# BOTTOM PANEL: Sample Generated Text
# โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
ax2 = fig.add_subplot(gs[1])
ax2.set_facecolor('#16213e')
ax2.set_xlim(0, 1)
ax2.set_ylim(0, 1)
ax2.axis('off')
ax2.set_title('๐ Generated Text at Different Training Stages',
fontsize=14, color='white', fontweight='bold', pad=10)
# Add Before Training text
y_pos = 0.9
ax2.text(0.02, y_pos, '๐ด Before Training:',
fontsize=11, color='#ff6b6b', fontweight='bold',
transform=ax2.transAxes, fontfamily='monospace')
before_text = history.get('before_training', 'N/A')[:70]
ax2.text(0.02, y_pos - 0.1, f'"{before_text}"',
fontsize=9, color='#ff9999',
transform=ax2.transAxes, fontfamily='monospace',
style='italic')
# Add sample texts during training
y_pos -= 0.25
samples = history.get('samples', [])
for i, sample in enumerate(samples[:3]): # Show max 3 samples
step = sample['step']
text = sample['text'][:60]
color_val = 0.4 + (i + 1) * 0.2 # Gradually greener
text_color = (1.0 - color_val, 0.5 + color_val * 0.5, color_val * 0.5)
ax2.text(0.02, y_pos, f'๐ก Epoch {step}:',
fontsize=10, color='#ffd93d', fontweight='bold',
transform=ax2.transAxes, fontfamily='monospace')
ax2.text(0.02, y_pos - 0.08, f'"{text}"',
fontsize=9, color=text_color,
transform=ax2.transAxes, fontfamily='monospace',
style='italic')
y_pos -= 0.2
# Add After Training text
after_text = history.get('after_training', 'N/A')[:70]
ax2.text(0.02, y_pos, '๐ข After Training:',
fontsize=11, color='#51cf66', fontweight='bold',
transform=ax2.transAxes, fontfamily='monospace')
ax2.text(0.02, y_pos - 0.1, f'"{after_text}"',
fontsize=9, color='#8ce99a',
transform=ax2.transAxes, fontfamily='monospace',
style='italic')
# โโ Save the plot โโ
# WHY save to same directory: Keeps all Level 2 files together
script_dir = os.path.dirname(os.path.abspath(__file__))
plot_path = os.path.join(script_dir, "training_results.png")
# WHY dpi=150: Good balance between file size and quality
# WHY bbox_inches='tight': Removes excess whitespace
plt.savefig(plot_path, dpi=150, bbox_inches='tight',
facecolor=fig.get_facecolor(), edgecolor='none')
plt.close()
print(f" {Colors.BRIGHT_GREEN}โ Plot saved to:{Colors.RESET}")
print(f" {Colors.CYAN} {plot_path}{Colors.RESET}")
return plot_path
# ============================================================================
# ASCII LOSS CURVE
# ============================================================================
def print_ascii_loss_curve(history):
"""
Print an ASCII art loss curve for terminals without graphical display.
WHY ASCII plot?
- Works in ANY terminal (no GUI needed)
- Great for remote servers / SSH sessions
- Shows the same information as the matplotlib plot
- Fun and educational!
The plot uses Unicode block characters to draw bars.
"""
print_section("ASCII LOSS CURVE", "๐")
losses = history['losses']
steps = history['steps']
if not losses:
print(f" {Colors.RED}No loss data available!{Colors.RESET}")
return
# โโ Calculate plot dimensions โโ
max_loss = max(losses)
min_loss = min(losses)
plot_height = 15 # Number of rows in the plot
plot_width = min(50, len(losses)) # Number of columns
# Resample losses if we have more data points than columns
# WHY resample: We might have 100+ data points but only 50 columns
if len(losses) > plot_width:
indices = np.linspace(0, len(losses) - 1, plot_width, dtype=int)
sampled_losses = [losses[i] for i in indices]
sampled_steps = [steps[i] for i in indices]
else:
sampled_losses = losses
sampled_steps = steps
# โโ Draw the plot โโ
print(f" {Colors.BRIGHT_CYAN} Training Loss Over Time{Colors.RESET}")
print(f" {Colors.DIM} (Each column = one recorded epoch){Colors.RESET}\n")
# Y-axis labels and grid
for row in range(plot_height, -1, -1):
# Calculate the loss value for this row
if max_loss == min_loss:
loss_at_row = max_loss
else:
loss_at_row = min_loss + (max_loss - min_loss) * (row / plot_height)
# Y-axis label (show every 3rd row)
if row % 3 == 0 or row == plot_height:
y_label = f"{loss_at_row:>6.2f}"
else:
y_label = " "
# Draw the row
row_str = f" {Colors.DIM}{y_label} โค{Colors.RESET}"
for col in range(len(sampled_losses)):
loss_val = sampled_losses[col]
# Normalize loss to plot height
if max_loss == min_loss:
bar_height = plot_height // 2
else:
bar_height = int((loss_val - min_loss) / (max_loss - min_loss) * plot_height)
# Draw bar character based on whether this row is filled
if bar_height >= row:
# Color gradient from red (high loss) to green (low loss)
if col < len(sampled_losses) * 0.3:
color = Colors.BRIGHT_RED
elif col < len(sampled_losses) * 0.6:
color = Colors.BRIGHT_YELLOW
else:
color = Colors.BRIGHT_GREEN
row_str += f"{color}โ{Colors.RESET}"
else:
row_str += " "
print(row_str)
# X-axis
print(f" {Colors.DIM} โ{'โ' * len(sampled_losses)}{Colors.RESET}")
# X-axis labels (first, middle, last)
if sampled_steps:
first = sampled_steps[0]
last = sampled_steps[-1]
mid = sampled_steps[len(sampled_steps) // 2]
label_line = f" {first:<{len(sampled_losses)//2}}"
label_line += f"{mid}"
remaining = len(sampled_losses) - len(label_line) + 8
if remaining > 0:
label_line += " " * remaining + f"{last}"
print(f" {Colors.DIM}{label_line}{Colors.RESET}")
print(f" {Colors.DIM} {'Training Epoch':^{len(sampled_losses)}}{Colors.RESET}")
# โโ Print statistics โโ
print(f"\n {Colors.BRIGHT_YELLOW}Statistics:{Colors.RESET}")
print(f" {Colors.YELLOW} โข Starting loss: {losses[0]:.4f}{Colors.RESET}")
print(f" {Colors.YELLOW} โข Final loss: {losses[-1]:.4f}{Colors.RESET}")
print(f" {Colors.YELLOW} โข Best loss: {min(losses):.4f} "
f"(at epoch {steps[losses.index(min(losses))]}){Colors.RESET}")
improvement = ((losses[0] - losses[-1]) / losses[0] * 100)
print(f" {Colors.BRIGHT_GREEN} โข Improvement: {improvement:.1f}% reduction{Colors.RESET}")
# ============================================================================
# BEFORE vs AFTER COMPARISON
# ============================================================================
def print_comparison(history):
"""
Print a detailed comparison of before vs after training.
WHY this comparison?
- This is the most dramatic demonstration of learning
- Students can SEE that the network improved
- It connects the abstract loss curve to concrete output quality
"""
print_section("BEFORE vs AFTER COMPARISON", "๐")
before = history.get('before_training', 'N/A')
after = history.get('after_training', 'N/A')
# โโ Before Training Box โโ
print(f" {Colors.BRIGHT_RED}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ{Colors.RESET}")
print(f" {Colors.BRIGHT_RED}โ ๐ด BEFORE TRAINING (Random Weights) โ{Colors.RESET}")
print(f" {Colors.BRIGHT_RED}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค{Colors.RESET}")
# Word-wrap the before text
for i in range(0, min(len(before), 120), 56):
line = before[i:i+56]
padding = 56 - len(line)
print(f" {Colors.RED}โ \"{line}\"{' ' * padding}โ{Colors.RESET}")
print(f" {Colors.BRIGHT_RED}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค{Colors.RESET}")
print(f" {Colors.RED}โ Quality: Random gibberish โ no patterns learned โ{Colors.RESET}")
print(f" {Colors.RED}โ Loss: ~{history['losses'][0]:.2f} (high = very wrong)"
f"{'':20}โ{Colors.RESET}")
print(f" {Colors.BRIGHT_RED}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ{Colors.RESET}")
print()
# โโ After Training Box โโ
print(f" {Colors.BRIGHT_GREEN}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ{Colors.RESET}")
print(f" {Colors.BRIGHT_GREEN}โ ๐ข AFTER TRAINING (Learned Weights) โ{Colors.RESET}")
print(f" {Colors.BRIGHT_GREEN}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค{Colors.RESET}")
for i in range(0, min(len(after), 120), 56):
line = after[i:i+56]
padding = 56 - len(line)
print(f" {Colors.GREEN}โ \"{line}\"{' ' * padding}โ{Colors.RESET}")
print(f" {Colors.BRIGHT_GREEN}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค{Colors.RESET}")
print(f" {Colors.GREEN}โ Quality: Recognizable English words and patterns! โ{Colors.RESET}")
print(f" {Colors.GREEN}โ Loss: ~{history['losses'][-1]:.2f} (low = getting it right!)"
f"{'':17}โ{Colors.RESET}")
print(f" {Colors.BRIGHT_GREEN}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ{Colors.RESET}")
print(f"\n {Colors.BRIGHT_YELLOW}๐ก Key Insight:{Colors.RESET}")
print(f" {Colors.YELLOW} The network learned to associate characters with what typically{Colors.RESET}")
print(f" {Colors.YELLOW} follows them in English text. It's not perfect (it's a tiny network{Colors.RESET}")
print(f" {Colors.YELLOW} with a tiny dataset), but it shows the PRINCIPLE of how language{Colors.RESET}")
print(f" {Colors.YELLOW} models work: predict the next token based on patterns in data!{Colors.RESET}")
# โโ Training Progress Through Samples โโ
samples = history.get('samples', [])
if samples:
print(f"\n {Colors.BRIGHT_MAGENTA}Training Progress (Generated Text at Each Stage):{Colors.RESET}\n")
for i, sample in enumerate(samples):
step = sample['step']
text = sample['text'][:60]
# Progress bar
if history['steps']:
max_step = history['steps'][-1]
progress = step / max_step if max_step > 0 else 0
else:
progress = 0
bar_len = 15
filled = int(bar_len * progress)
bar = "โ" * filled + "โ" * (bar_len - filled)
# Color transitions from red to green
if progress < 0.33:
color = Colors.RED
elif progress < 0.66:
color = Colors.YELLOW
else:
color = Colors.GREEN
print(f" {Colors.DIM} Epoch {step:>5}{Colors.RESET} "
f"{Colors.BLUE}[{bar}]{Colors.RESET} "
f"{color}\"{text}\"{Colors.RESET}")
print(f"\n {Colors.DIM} Notice how the text gradually improves from random to "
f"recognizable!{Colors.RESET}")
# ============================================================================
# LEARNING SUMMARY
# ============================================================================
def print_learning_summary():
"""
Print a summary of what was learned in this level.
"""
print_section("๐ LEVEL 2 COMPLETE โ WHAT YOU LEARNED", "๐")
print(f""" {Colors.BRIGHT_CYAN}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ ๐ CONGRATULATIONS! ๐ โ
โ You built a neural network from scratch! โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ โ
โ Step 1: Single Neuron โ
โ โ Weighted sum + sigmoid activation โ
โ โ Learning AND/OR gates โ
โ โ
โ Step 2: Neural Network โ
โ โ Multiple layers with matrix multiplication โ
โ โ Forward pass visualization โ
โ โ Sigmoid + Softmax activations โ
โ โ
โ Step 3: Training โ
โ โ Character-level data preparation โ
โ โ One-hot encoding โ
โ โ Backpropagation FROM SCRATCH โ
โ โ Cross-entropy loss โ
โ โ Text generation! โ
โ โ
โ Step 4: Visualization โ
โ โ Loss curve plotting โ
โ โ Before vs After comparison โ
โ โ Training progress analysis โ
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ โ
โ ๐ฎ Next Level: โ
โ Level 3 will introduce more advanced concepts! โ
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ{Colors.RESET}
""")
# ============================================================================
# MAIN EXECUTION
# ============================================================================
if __name__ == "__main__":
"""
Main execution block.
This script REQUIRES step3_train.py to have been run first,
since it loads the training_history.json file that step3 creates.
"""
# Print header
print_header()
# Load training history
print_section("LOADING TRAINING HISTORY", "๐")
history = load_training_history()
# Create matplotlib plot (saved as image)
plot_path = create_matplotlib_plot(history)
# Print ASCII loss curve (works everywhere)
print_ascii_loss_curve(history)
# Print before vs after comparison
print_comparison(history)
# Print learning summary
print_learning_summary()
# Print footer
print_footer()
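The script above expects a training_history.json with the keys listed in the load_training_history docstring. As a rough sketch of that layout, the snippet below writes a small file with the same keys and reads it back. All values are made-up placeholders, and the file goes into a temporary folder so a real history produced by step3_train.py is never overwritten.

```python
import json
import os
import tempfile

# Hypothetical placeholder values; a real file is produced by step3_train.py.
history = {
    "steps": [0, 100, 200, 300],
    "losses": [3.30, 2.71, 2.25, 1.98],
    "samples": [
        {"step": 100, "text": "placeholder sample at step 100"},
        {"step": 300, "text": "placeholder sample at step 300"},
    ],
    "before_training": "placeholder text from untrained weights",
    "after_training": "placeholder text from trained weights",
}

# Write to a temporary folder so an existing training_history.json is untouched.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "training_history.json")
with open(path, "w") as f:
    json.dump(history, f, indent=2)

# Read it back and check the structure the plotting code relies on.
with open(path) as f:
    loaded = json.load(f)

assert len(loaded["steps"]) == len(loaded["losses"])
assert all("step" in s and "text" in s for s in loaded["samples"])
print("Keys:", sorted(loaded))
```

The one constraint worth remembering is that steps and losses must have the same length, because the plot pairs them point by point.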
The Transformer Revolution
The architecture that changed everything
Transformers and the Magic of Attention
Learning Objectives
- Explain what the Transformer architecture is and why it revolutionised AI
- Convert text into numerical representations using embeddings
- Understand why positional encoding is needed and how sinusoidal waves solve it
- Implement self-attention from scratch and explain every term in the attention formula
- Describe how multi-head attention lets a model see text from multiple perspectives
- Build a complete Transformer block with residual connections, layer norm, and feed-forward networks
- Assemble a Mini-Transformer, a working language model in under 200 lines of PyTorch
4.1 The Breakthrough That Changed Everything
In June 2017, a team of eight researchers at Google published a paper with a deceptively simple title: "Attention Is All You Need." That paper didn't just introduce a new model; it rewrote the rules of artificial intelligence.
Before the Transformer, the dominant architectures for language tasks were Recurrent Neural Networks (RNNs) and their more sophisticated cousins, LSTMs (Long Short-Term Memory networks). These models processed text one word at a time, like a student reading a textbook left-to-right, never skipping ahead, never glancing back without effort. They worked, but they were painfully slow to train, because every word had to wait for the previous one to be processed, and they struggled to remember things said far earlier in a paragraph.
The Transformer threw away the conveyor belt. Instead of processing words sequentially, it processes all words at once, in parallel, and uses a mechanism called attention to figure out which words are relevant to each other. Imagine an entire classroom of students working on a problem simultaneously, each student free to glance at any other student's notes. That is the Transformer.
The impact was staggering. Within about two years, Transformer-based models such as BERT, GPT-2, and T5 were shattering records on virtually every natural language processing benchmark. Today, every large language model you've heard of, including GPT-4, Claude, Gemini, and LLaMA, is built on this architecture. When you chat with ChatGPT or use Google Translate, there is a Transformer under the hood.
Important
The Transformer is not just one breakthrough: it is the foundation of modern AI. Understanding it deeply is the single most important step in your AI journey.
Let's build one from scratch.
4.2 From Words to Numbers: Embeddings
Here is a fundamental truth: computers don't understand words. They understand numbers. So the very first step in any language model is to convert text into numbers.
The Naive Approach: One-Hot Encoding
The simplest idea is one-hot encoding. If your vocabulary has, say, 26 characters, represent each character as a vector of length 26 with a single 1 and the rest 0s. So a = [1, 0, 0, ..., 0], b = [0, 1, 0, ..., 0], and so on.
This works technically, but it has two fatal flaws:
- The vectors are enormous. If your vocabulary has 50,000 words, each vector has 50,000 dimensions. That is extremely wasteful.
- Every word is equally different from every other word. The distance between "king" and "queen" is the same as the distance between "king" and "mango." The representation carries zero information about meaning.
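The second flaw is easy to check by hand. The sketch below uses a hypothetical four-word vocabulary (the words are chosen only for illustration) and measures the distance between one-hot vectors.

```python
import math

def one_hot(index, size):
    """Return a list of length `size` with a single 1 at `index`."""
    return [1 if i == index else 0 for i in range(size)]

def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["king", "queen", "mango", "cricket"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

d_king_queen = euclidean(vectors["king"], vectors["queen"])
d_king_mango = euclidean(vectors["king"], vectors["mango"])

# Both distances are sqrt(2): one-hot vectors cannot express similarity.
print(d_king_queen, d_king_mango)
```

Any two distinct one-hot vectors differ in exactly two positions, so the distance is the same no matter which pair of words you pick.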
Dense Embeddings: Rich Descriptions
A far better idea is to represent each word (or character) as a short, dense vector (say, 64 or 128 numbers) where similar words end up with similar vectors.
Think of it like this:
A student's roll number tells you nothing about them. Roll number 42 could be anyone. But a description such as [tall, curious, loves science, good at cricket, from Lucknow] tells you a lot. You can immediately see that this student is more similar to another science-loving cricketer than to a quiet artist.
An embedding is that description. It is a learned vector of numbers that captures the meaning and relationships of a token.
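To make "similar vectors" concrete, here is a small sketch that compares hand-written descriptions using cosine similarity. The three vectors and their feature order are invented for this illustration; real embeddings are learned during training, not written by hand.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical "descriptions": [height, curiosity, science, cricket, art]
student_a = [0.9, 0.8, 0.9, 0.7, 0.1]  # science-loving cricketer
student_b = [0.7, 0.9, 0.8, 0.9, 0.2]  # another science-loving cricketer
student_c = [0.4, 0.3, 0.1, 0.0, 0.9]  # quiet artist

sim_ab = cosine_similarity(student_a, student_b)
sim_ac = cosine_similarity(student_a, student_c)

# The two cricketers score much closer to 1.0 than the cricketer and the artist.
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```

Cosine similarity is one common choice for comparing embeddings; it looks only at direction, not at vector length.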
Building a Vocabulary
Let's start at the very beginning: mapping characters to numbers. This is the simplest form of tokenization (real models like GPT use subword tokenization, but character-level is easiest to learn with).
Python
def build_vocabulary(text):
    """
    Build a character-level vocabulary from text.

    Returns:
        char_to_idx: Dictionary mapping character → number
        idx_to_char: Dictionary mapping number → character
        vocab_size: Total number of unique characters
    """
    # Get all unique characters and sort them
    chars = sorted(set(text))
    # Create the two-way mapping
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}
    return char_to_idx, idx_to_char, len(chars)


def encode(text, char_to_idx):
    """Convert a string into a list of numbers using our vocabulary."""
    return [char_to_idx[ch] for ch in text]


def decode(indices, idx_to_char):
    """Convert a list of numbers back into a string."""
    return ''.join(idx_to_char[i] for i in indices)
Feed in the sentence "The sun rises in the east and sets in the west. India is a beautiful country." and you get a mapping like ' '→0, '.'→1, 'I'→2, 'T'→3, 'a'→4, 'b'→5, .... Each unique character gets an index.
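A quick round trip shows the mapping in action. This sketch repeats the three helper functions so it runs on its own; the word "Zebra" is just an arbitrary example of text containing a character the vocabulary has never seen.

```python
def build_vocabulary(text):
    chars = sorted(set(text))
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}
    return char_to_idx, idx_to_char, len(chars)

def encode(text, char_to_idx):
    return [char_to_idx[ch] for ch in text]

def decode(indices, idx_to_char):
    return "".join(idx_to_char[i] for i in indices)

text = "The sun rises in the east and sets in the west. India is a beautiful country."
char_to_idx, idx_to_char, vocab_size = build_vocabulary(text)

ids = encode("India", char_to_idx)
print(ids)
print(decode(ids, idx_to_char))  # round-trips back to "India"

# A character that never appeared in the text has no index.
try:
    encode("Zebra", char_to_idx)
except KeyError as err:
    print("Unknown character:", err)
```

The KeyError is one practical limitation of a vocabulary built from a small text: anything outside it cannot be encoded without extra handling.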
The Token Embedding Layer
Now we need to turn each index into a vector. In PyTorch, nn.Embedding does exactly this: it is essentially a lookup table. You give it an index, and it returns the corresponding row from a learnable matrix:
Python
import torch
import torch.nn as nn


class TokenEmbedding(nn.Module):
    """
    Converts token indices into dense vectors.

    Args:
        vocab_size: Number of unique tokens
        embed_dim: Size of each embedding vector
    """

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # nn.Embedding is like a lookup table:
        # It stores a matrix of shape (vocab_size × embed_dim)
        # When you give it index 3, it returns row 3 of the matrix
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embed_dim = embed_dim

    def forward(self, x):
        # x shape: (batch_size, sequence_length), token indices
        # output shape: (batch_size, sequence_length, embed_dim), vectors
        return self.embedding(x)
If our embedding dimension is 8 (tiny, for illustration; real models use 768 or more), then each character becomes a vector of 8 numbers. The character 'T' might become [+0.312, -0.821, +0.047, ...]. These numbers are random at first and are learned during training: the model discovers what representation works best.
Tip
Think of nn.Embedding as a dictionary where the keys are token indices and the values are vectors. But unlike a regular dictionary, these vectors are trainable parameters: they get updated every time the model learns from data.
4.3 Positional Encoding: Telling the Model About Order
Here is a problem you might not have noticed: the Transformer processes all tokens at the same time. There is no concept of "first" or "second" or "last." But order matters a great deal.
Consider these two sentences:
- "कुत्ता आदमी को काटता है" (Dog bites man)
- "आदमी कुत्ते को काटता है" (Man bites dog)
Same words, completely different meanings. If the model can't tell which word came first, it cannot distinguish between these two sentences.
The solution from the original paper is elegant: add a unique positional signal to each token's embedding. The Transformer uses sinusoidal (wave-based) encoding: sine and cosine functions at different frequencies.
Why waves? Think of it like tuning into a radio station. Each station (position) has a unique combination of frequencies. Even though two stations might share one frequency, the full combination is always unique. Similarly, each position gets a unique fingerprint of sine and cosine values.
The Formulas
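PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
Here, pos is the position in the sequence, i indexes each pair of dimensions (2i and 2i+1), and d_{\text{model}} is the embedding dimension (embed_dim in the code below). Even dimensions get \sin, odd dimensions get \cos.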
Python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """
    Adds positional information to embeddings using sinusoidal patterns.
    """

    def __init__(self, embed_dim, max_seq_len=512):
        super().__init__()
        # Note: the even/odd indexing below assumes embed_dim is even
        # Create a matrix to store all positional encodings
        pe = torch.zeros(max_seq_len, embed_dim)
        # Position indices: [0, 1, 2, ..., max_seq_len-1]
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        # Division term: creates different frequencies for each dimension
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
        )
        # Apply sin to even indices (0, 2, 4, ...)
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cos to odd indices (1, 3, 5, ...)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Add batch dimension: (1, max_seq_len, embed_dim)
        pe = pe.unsqueeze(0)
        # Register as buffer (saved with model but not trained)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len, embed_dim)
        Returns:
            x + positional encoding
        """
        seq_len = x.size(1)
        # Add positional encoding to the input
        return x + self.pe[:, :seq_len, :]
Notice that the positional encoding is added to the embedding, not concatenated. After this step, the same character at different positions in a sentence will have different vector representations. The model now knows what a token is (from the embedding) and where it is (from the positional encoding).
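The "unique fingerprint" claim can be checked without PyTorch. The following is a minimal pure-Python sketch of the same sinusoidal scheme; the function name positional_encoding and the choice of 8 dimensions and 50 positions are assumptions made only for this illustration.

```python
import math

def positional_encoding(pos, embed_dim):
    """Sinusoidal encoding for one position (embed_dim assumed even)."""
    vec = []
    for i in range(embed_dim // 2):
        angle = pos / (10000 ** (2 * i / embed_dim))
        vec.append(math.sin(angle))  # even dimension 2i
        vec.append(math.cos(angle))  # odd dimension 2i + 1
    return vec

embed_dim = 8
encodings = [positional_encoding(pos, embed_dim) for pos in range(50)]

# Position 0 is [0, 1, 0, 1, ...] because sin(0) = 0 and cos(0) = 1.
print(encodings[0])

# Collect the distinct vectors (rounded to 6 decimals) across all 50 positions.
rounded = {tuple(round(x, 6) for x in vec) for vec in encodings}
print(len(rounded), "distinct encodings for", len(encodings), "positions")
```

Because the fast-changing dimensions separate nearby positions and the slow-changing ones separate distant positions, no two of these positions share the same vector.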
Note
The positional encoding is registered as a buffer, not a parameter. This means it is saved with the model but is not updated during training; it is a fixed mathematical pattern. Some modern models (including our final Mini-Transformer) use learned positional embeddings instead, which are trained just like token embeddings.
4.4 The Heart of the Transformer: Self-Attention
This is it. If you understand this one section deeply, you understand the engine that powers all of modern AI.
The Classroom Analogy
Imagine a classroom with 30 students. The teacher asks a question: "What is the role of the monsoon in Indian agriculture?"
Now, every student in the class could potentially answer. But some students are more relevant than others. The student who studied geography has the most useful answer. The student who studied mathematics? Probably less relevant, but maybe she can add something about rainfall statistics. The student who plays cricket all day? Perhaps not useful at all for this question.
Self-attention is exactly this. Each word in a sentence "asks a question" and then looks at every other word to decide: "How relevant are you to me?" The word then collects a weighted combination of information from all other words, paying more attention to the relevant ones.
Query, Key, and Value: Three Ways to Look at Every Token
Every token in the sequence gets transformed into three vectors: a Query (Q), a Key (K), and a Value (V). Let's understand these through three analogies:
Analogy 1 - The Library:
- Query = the search term you type into the library catalog ("monsoon agriculture India")
- Key = the title/tags of each book on the shelf ("Indian Climate Patterns," "History of Cricket," "Monsoon and Farming")
- Value = the actual content of each book
You match your query against all keys to find relevant books, then read the content (values) of those books.
Analogy 2 - The Classroom:
- Query = the question being asked ("What causes monsoons?")
- Key = each student's expertise tag ("geography expert," "maths nerd," "cricket captain")
- Value = the actual answer each student would give
You compare your question against everyone's expertise, then listen most carefully to the most relevant students.
Analogy 3 - Google Search:
- Query = what you type in the search bar
- Key = the title/description of each web page
- Value = the actual content of each web page
Google ranks pages by matching your query to their keys, then shows you the values of the best matches.
The Implementation
Python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """
    Single-head self-attention mechanism.
    This is the core building block of the Transformer.
    """

    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim
        # Three weight matrices: these are LEARNED during training!
        # W_q: transforms input into "what am I looking for?"
        # W_k: transforms input into "what do I contain?"
        # W_v: transforms input into "what information do I give?"
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
        # Scaling factor sqrt(d_k); with a single head, d_k equals embed_dim
        self.scale = math.sqrt(embed_dim)

    def forward(self, x, mask=None):
        """
        Args:
            x: Input tensor of shape (batch, seq_len, embed_dim)
            mask: Optional causal mask
        Returns:
            output: Attention output, same shape as input
            attention_weights: The attention matrix
        """
        batch_size, seq_len, _ = x.shape
        # STEP 1: Create Q, K, V
        Q = self.W_q(x)  # (batch, seq_len, embed_dim)
        K = self.W_k(x)  # (batch, seq_len, embed_dim)
        V = self.W_v(x)  # (batch, seq_len, embed_dim)
        # STEP 2: Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1))  # (batch, seq_len, seq_len)
        # STEP 3: Scale
        scores = scores / self.scale
        # STEP 4: Apply causal mask (optional)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        # STEP 5: Softmax → probabilities
        attention_weights = F.softmax(scores, dim=-1)
        # STEP 6: Weighted sum of values
        output = torch.matmul(attention_weights, V)
        return output, attention_weights
Step-by-Step Numerical Walkthrough
Let's trace through the computation with a tiny example. Suppose we have 3 tokens and an embedding dimension of 4.
Input matrix X (3 tokens × 4 dimensions):
Step 1: Multiply by weight matrices W_Q, W_K, W_V to get Q, K, V. (In practice these are learned; let's say the multiplication yields Q, K, V each of shape 3×4.)
Step 2: Compute raw scores: \text{scores} = Q \cdot K^T. This is a 3×3 matrix, where each entry (i, j) tells us how much token i is interested in token j.
Step 3: Scale by \frac{1}{\sqrt{d_k}} = \frac{1}{\sqrt{4}} = \frac{1}{2}:
Step 4: Apply softmax to each row (so each row sums to 1):
Step 5: Multiply weights by V. Each token's output is a weighted combination of all value vectors.
The result: token 0 pays most attention to itself (0.50), token 1 pays most attention to itself (0.43) but also notices token 2 (0.30), and so on. The model has learned to focus on what matters.
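To make the five steps concrete, here is a minimal pure-Python sketch that runs them on made-up numbers. The values of `Q`, `K` and `V` are illustrative assumptions (they are not the numbers from the walkthrough above, and they do not come from a trained model), and the helper names `matmul`, `transpose` and `softmax_rows` are hypothetical.

```python
import math

# Hypothetical Q, K, V for 3 tokens with d_k = 4 (made-up values).
Q = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 1.0],
     [1.0, 1.0, 0.0, 0.0]]
K = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0, 1.0]]
V = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0]]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


d_k = len(Q[0])
scores = matmul(Q, transpose(K))                                 # Step 2: 3x3 raw scores
scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]   # Step 3: divide by sqrt(4) = 2
weights = softmax_rows(scaled)                                   # Step 4: each row sums to 1
output = matmul(weights, V)                                      # Step 5: 3x4 weighted values

for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 2) for w in row]}")
```

Each row of `weights` is one token's attention distribution, and each row of `output` is the matching blend of value vectors.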
The Attention Formula
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
This single equation is the beating heart of every modern language model. Let's understand every part:
| Component | What it does |
|---|---|
| $Q K^T$ | Dot product between queries and keys: raw similarity scores |
| $\sqrt{d_k}$ | Scaling factor to keep gradients healthy |
| $\text{softmax}$ | Converts raw scores into probabilities (0 to 1, summing to 1) |
| $\cdot V$ | Weighted sum of values using those probabilities |
Why Scale by $\sqrt{d_k}$?
This is a subtle but critical detail. When you compute the dot product of two random vectors of dimension $d_k$ (with independent components of mean 0 and variance 1), the result has a variance of approximately $d_k$. If $d_k = 64$, the standard deviation is $\sqrt{64} = 8$, so raw scores of ±10 or more are common, and they grow as $d_k$ grows.
What happens when you feed very large numbers into softmax? The output becomes extremely "peaked": one element gets a probability near 1.0, and everything else gets near 0.0. This is essentially a hard argmax, and the gradients become vanishingly small. The model stops learning.
By dividing by $\sqrt{d_k}$, we bring the variance back to approximately 1, keeping softmax in a range where gradients flow well.
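The variance claim is easy to check empirically. The sketch below is a minimal experiment under stated assumptions: vector components are drawn independently from a standard normal distribution, and the function name `dot_product_variance` and the trial count are hypothetical choices for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the experiment is repeatable


def dot_product_variance(d_k, trials=4000, scale=False):
    """Estimate the variance of q.k for random vectors with unit-variance components."""
    values = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        values.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(values) / trials
    return sum((v - mean) ** 2 for v in values) / trials


raw_var = dot_product_variance(64)
scaled_var = dot_product_variance(64, scale=True)
print(f"raw variance    ~ {raw_var:.1f}  (theory: about 64)")
print(f"scaled variance ~ {scaled_var:.2f} (theory: about 1)")
```

The unscaled estimate should land near 64 and the scaled one near 1, which is the whole point of dividing by $\sqrt{d_k}$.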
Warning
Without the $\sqrt{d_k}$ scaling, training tends to become unstable as $d_k$ grows: softmax saturates, gradients shrink, and the model either learns very slowly or converges to poor solutions. This seemingly small detail makes a huge difference in practice.
Causal Masking: Why GPT Can't Look at the Future
When you're generating text one token at a time ("The capital of India is ___"), the model must predict the next word using only the words that came before it. It cannot peek at the answer.
This is enforced with a causal mask, a lower-triangular matrix of 1s and 0s:
Python
def create_causal_mask(seq_len):
    """
    Create a lower-triangular mask for autoregressive models.

    [[1, 0, 0, 0],   ← token 0 can only see token 0
     [1, 1, 0, 0],   ← token 1 can see tokens 0, 1
     [1, 1, 1, 0],   ← token 2 can see tokens 0, 1, 2
     [1, 1, 1, 1]]   ← token 3 can see all tokens
    """
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.unsqueeze(0)  # Add batch dimension
Where the mask is 0, we set the attention score to $-\infty$. After softmax, $e^{-\infty} = 0$, so those positions get zero attention weight. The model is effectively blind to future tokens.
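The effect of the mask on softmax can be seen in a few lines of plain Python. This is a minimal sketch, not the PyTorch implementation: `masked_softmax` and `causal_mask` are hypothetical helper names, and the scores are made-up numbers.

```python
import math


def causal_mask(seq_len):
    """Lower-triangular mask: row i has 1s in columns 0..i and 0s after."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def masked_softmax(scores, mask):
    """Softmax over one row; positions where mask == 0 are set to -inf first."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # math.exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw attention scores for a 3-token sequence.
scores = [[0.5, 1.2, -0.3],
          [0.1, 0.4, 2.0],
          [1.0, 1.0, 1.0]]
mask = causal_mask(3)
weights = [masked_softmax(row, m) for row, m in zip(scores, mask)]

for i, row in enumerate(weights):
    print(f"token {i}: {[round(w, 3) for w in row]}")
```

Token 0 can only attend to itself, so its row is `[1.0, 0.0, 0.0]`. Token 1 gives zero weight to token 2 even though that raw score (2.0) was the largest in its row.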
Note
BERT-style models (used for understanding, not generation) do not use causal masking; they attend in both directions. GPT-style models (used for generation) always use it. This is the fundamental difference between "encoder" and "decoder" Transformers.
4.5 Multi-Head Attention: Multiple Perspectives
Single-head attention is powerful, but it has a limitation: one attention pattern per token. In reality, a word can relate to other words in many different ways simultaneously.
Consider the sentence: "The student from Delhi who loves physics scored the highest marks."
The word "scored" needs to attend to:
- "student": to know who scored
- "highest": to know how much was scored
- "marks": to know what was scored
One attention head would struggle to capture all three relationships at once. The solution? Use multiple heads, each learning a different type of relationship.
Think of it like reading a poem. One head reads for meaning, another for rhyme scheme, a third for emotional tone, and a fourth for grammatical structure. Each perspective is partial, but together they form a rich understanding.
Mechanically, multi-head attention splits the embedding dimension among heads. If embed_dim = 128 and num_heads = 4, each head works with 32 dimensions. After each head computes attention independently, the results are concatenated and projected back:
Python
class MultiHeadAttention(nn.Module):
    """
    Multi-Head Self-Attention.
    Splits the input into multiple "heads", runs attention on each,
    then combines the results.
    """
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, \
            f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})"
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # One big linear layer for Q, K, V (more efficient than 3 separate ones)
        self.W_qkv = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
        # Output projection: combines all heads back together
        self.W_out = nn.Linear(embed_dim, embed_dim, bias=False)
        self.scale = math.sqrt(self.head_dim)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.shape
        # Step 1: Compute Q, K, V all at once
        qkv = self.W_qkv(x)  # (batch, seq_len, 3 * embed_dim)
        # Step 2: Split into Q, K, V and reshape for multi-head
        qkv = qkv.reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, batch, heads, seq_len, head_dim)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        # Step 3: Compute attention scores for ALL heads at once
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        # Step 4: Apply causal mask (broadcast over batch and heads)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        # Step 5: Softmax
        attention_weights = F.softmax(scores, dim=-1)
        # Step 6: Weighted sum of values
        output = torch.matmul(attention_weights, V)
        # Step 7: Combine heads back together
        output = output.transpose(1, 2).reshape(batch_size, seq_len, self.embed_dim)
        # Step 8: Final projection
        output = self.W_out(output)
        return output
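The "split, then concatenate" bookkeeping is the part of multi-head attention that most often confuses readers, so here it is in isolation. This is a simplified sketch on a single token's vector using plain lists. The class above performs the same slicing on the projected Q, K and V tensors with `reshape` and `permute`. The names `split_heads` and `merge_heads` are hypothetical.

```python
def split_heads(vector, num_heads):
    """Cut one token's embedding into num_heads equal, contiguous slices."""
    assert len(vector) % num_heads == 0, "embed_dim must be divisible by num_heads"
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


def merge_heads(heads):
    """Concatenate the per-head slices back into one embedding."""
    return [value for head in heads for value in head]


embedding = [float(i) for i in range(128)]   # stand-in for one token, embed_dim = 128
heads = split_heads(embedding, num_heads=4)  # 4 heads of 32 dimensions each
restored = merge_heads(heads)

print(f"{len(heads)} heads, {len(heads[0])} dimensions per head")
```

Splitting and merging lose nothing: the concatenated result has the original 128 dimensions, which is why the output projection `W_out` can be a plain `embed_dim × embed_dim` layer.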
Tip
Notice the efficiency trick: instead of three separate linear layers for Q, K, and V, we use one large linear layer (W_qkv) that produces all three at once. This is mathematically identical but runs faster on GPUs because it is a single matrix multiplication.
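A tiny numeric check shows why the fused layer gives the same answer. The sketch below is a minimal illustration, assuming 2-dimensional made-up weights stored as rows of output features. `matvec` and the weight values are hypothetical and unrelated to any trained model.

```python
def matvec(weight, x):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Hypothetical 2-dimensional weights (each row produces one output feature).
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[2.0, 1.0], [0.0, 3.0]]
W_v = [[0.5, 0.5], [1.0, -1.0]]
x = [3.0, 4.0]

# Separate projections: three small multiplications.
q, k, v = matvec(W_q, x), matvec(W_k, x), matvec(W_v, x)

# Fused projection: stack the rows into one (3 * embed_dim) x embed_dim matrix,
# multiply once, then slice the result into thirds.
W_qkv = W_q + W_k + W_v
qkv = matvec(W_qkv, x)
embed_dim = len(x)
q_fused, k_fused, v_fused = (qkv[i * embed_dim:(i + 1) * embed_dim] for i in range(3))

print("separate:", q, k, v)
print("fused:   ", q_fused, k_fused, v_fused)
```

Both paths print the same three vectors, because stacking weight rows only changes how the work is batched, not what is computed.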
4.6 The Transformer Block
Attention tells the model what to look at. But it also needs to process that information, stabilise its numbers, and preserve what it already knew. That is the job of the full Transformer block, which wraps attention with three additional components.
Residual Connections: "You Still Have Your Own Notes"
Imagine a student attends a lecture. Even if the lecture was confusing and the student didn't grasp much, she still has her own notes from before the lecture. A residual connection does the same thing: it adds the original input back to the output of the sub-layer:
$$\text{output} = x + \text{SubLayer}(x)$$
This means the model never loses the original information. In the worst case (the attention layer learns nothing useful), the input passes through unchanged. In practice, residual connections make deep networks much easier to train.
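The "worst case" argument can be checked directly. In this minimal sketch, `residual`, `useless_sublayer` and `helpful_sublayer` are hypothetical stand-ins for the real attention and FFN sub-layers, working on a plain list instead of a tensor.

```python
def residual(sublayer, x):
    """Return x + sublayer(x), element by element."""
    return [xi + yi for xi, yi in zip(x, sublayer(x))]


def useless_sublayer(x):
    # Hypothetical worst case: the sub-layer contributes nothing.
    return [0.0 for _ in x]


def helpful_sublayer(x):
    # Hypothetical sub-layer that proposes a small correction to each value.
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
unchanged = residual(useless_sublayer, x)
adjusted = residual(helpful_sublayer, x)

print("useless sub-layer:", unchanged)
print("helpful sub-layer:", adjusted)
```

With the useless sub-layer the input comes out untouched, so a block can only add to what it received. It never has to relearn the identity.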
Layer Normalization: "Grading on a Curve"
After many matrix multiplications, the numbers in our vectors can drift, some becoming very large, others very small. Layer normalization is like grading on a curve: it rescales each token's vector to have mean 0 and standard deviation 1, then applies a learned scale and shift.
Without normalization, training deep Transformers becomes unstable. The numbers "explode" or "vanish," and the model fails to learn.
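Here is the normalisation step on its own, as a minimal sketch. It deliberately omits the learned scale and shift that `nn.LayerNorm` also applies. The function name `layer_norm` and the example vector are assumptions for illustration. The small `eps` term guards against division by zero.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Rescale one vector to mean 0 and (approximately) standard deviation 1."""
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


drifted = [120.0, 80.0, 100.0, 100.0]   # made-up activations that have drifted upward
normed = layer_norm(drifted)

new_mean = sum(normed) / len(normed)
new_var = sum((v - new_mean) ** 2 for v in normed) / len(normed)
print([round(v, 3) for v in normed], "mean:", round(new_mean, 6), "variance:", round(new_var, 6))
```

Whatever the scale of the input, the output lands in the same well-behaved range, which is what keeps deep stacks trainable.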
Feed-Forward Network: "Processing What You Gathered"
After attention has gathered relevant information from across the sequence, the FFN processes that information. It is a simple two-layer network applied independently to each token:
$$\text{FFN}(x) = W_2 \, \text{ReLU}(W_1 x + b_1) + b_2$$
The hidden layer is typically 4× wider than the embedding dimension. This "expand then shrink" pattern gives the model a larger computational space to work in (like having scratch paper to work out a problem) before compressing the result back down.
Python
class FeedForward(nn.Module):
    """
    Position-wise Feed-Forward Network.
    Each position (token) is processed INDEPENDENTLY through the same network.
    """
    def __init__(self, embed_dim, ff_dim=None):
        super().__init__()
        ff_dim = ff_dim or 4 * embed_dim
        self.net = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),   # Expand
            nn.ReLU(),                      # Non-linearity
            nn.Linear(ff_dim, embed_dim),   # Shrink back
        )

    def forward(self, x):
        return self.net(x)
The Complete Transformer Block
Now let's put attention, FFN, residual connections, and layer norm together:
Python
class TransformerBlock(nn.Module):
    """
    A single Transformer block.
    This is the fundamental repeating unit in models like GPT.
    """
    def __init__(self, embed_dim, num_heads, ff_dim=None, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.attention = MultiHeadAttention(embed_dim, num_heads)
        self.ffn = FeedForward(embed_dim, ff_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Sub-layer 1: Attention + Residual
        normed = self.norm1(x)
        attended = self.attention(normed, mask=mask)
        attended = self.dropout(attended)
        x = x + attended  # ← Residual connection
        # Sub-layer 2: FFN + Residual
        normed = self.norm2(x)
        fed_forward = self.ffn(normed)
        fed_forward = self.dropout(fed_forward)
        x = x + fed_forward  # ← Residual connection
        return x
The beauty of this design: input and output have the same shape. This means you can stack as many blocks as you want. The output of Block 1 feeds directly into Block 2, Block 2 into Block 3, and so on. As a rough intuition (the division is not this clean in practice), deeper stacks learn more complex patterns:
- Block 1: Basic patterns (which characters tend to appear together)
- Blocks 2–3: Higher-level patterns (word structure, common phrases)
- Blocks 4+: Complex patterns (meaning, grammar, context, reasoning)
4.7 Putting It All Together: The Mini-Transformer
Now let's assemble every piece into a complete language model. This is a miniature version of GPT: same architecture, just smaller:
Python
class MiniTransformer(nn.Module):
    """
    A complete mini-Transformer language model.
    Takes character indices as input, predicts the next character.
    """
    def __init__(self, vocab_size, embed_dim=64, num_heads=4,
                 num_blocks=4, max_seq_len=256, dropout=0.1):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.max_seq_len = max_seq_len
        # Token embedding: character index → vector
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        # Positional embedding: position → vector (learned, not sinusoidal)
        self.position_embedding = nn.Embedding(max_seq_len, embed_dim)
        # Dropout after embeddings
        self.dropout = nn.Dropout(dropout)
        # Stack of Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, dropout=dropout)
            for _ in range(num_blocks)
        ])
        # Final layer normalization
        self.final_norm = nn.LayerNorm(embed_dim)
        # Output projection: vector → vocabulary scores (logits)
        self.output_head = nn.Linear(embed_dim, vocab_size, bias=False)
        # Weight tying: share weights between input embedding and output head
        self.output_head.weight = self.token_embedding.weight

    def forward(self, idx, targets=None):
        # idx: (batch, seq_len) integer tensor, with seq_len <= max_seq_len
        batch_size, seq_len = idx.shape
        # Step 1: Token embeddings
        tok_emb = self.token_embedding(idx)
        # Step 2: Positional embeddings
        positions = torch.arange(seq_len, device=idx.device)
        pos_emb = self.position_embedding(positions)
        # Step 3: Combine and apply dropout
        x = self.dropout(tok_emb + pos_emb)
        # Step 4: Causal mask
        mask = torch.tril(torch.ones(seq_len, seq_len, device=idx.device))
        mask = mask.unsqueeze(0)
        # Step 5: Pass through transformer blocks
        for block in self.blocks:
            x = block(x, mask=mask)
        # Step 6: Final normalization
        x = self.final_norm(x)
        # Step 7: Project to vocabulary size
        logits = self.output_head(x)
        # Compute loss if targets are provided
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, self.vocab_size),
                targets.view(-1)
            )
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """Generate text autoregressively.

        Call model.eval() first so that dropout is disabled while sampling.
        """
        for _ in range(max_new_tokens):
            # Keep only the most recent max_seq_len tokens as context
            idx_cond = idx[:, -self.max_seq_len:]
            logits, _ = self(idx_cond)
            # Take the last position and apply temperature
            logits = logits[:, -1, :] / temperature
            if top_k is not None:
                # Keep the top_k largest logits, mask out the rest
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_token], dim=1)
        return idx
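The two sampling knobs in `generate`, `temperature` and `top_k`, are easier to understand on a single made-up logit vector. The sketch below mirrors that logic in plain Python under stated assumptions: `next_token_probs` is a hypothetical helper, the logits are invented, and `random.choices` stands in for `torch.multinomial`.

```python
import math
import random


def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn one position's logits into sampling probabilities."""
    scaled = [value / temperature for value in logits]
    if top_k is not None:
        # Value of the k-th largest logit; everything below it is masked out.
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]   # made-up scores for a 4-token vocabulary
sharp = next_token_probs(logits, temperature=0.5)   # low temperature: more confident
flat = next_token_probs(logits, temperature=2.0)    # high temperature: more random
top2 = next_token_probs(logits, top_k=2)            # only the 2 best tokens survive

random.seed(0)
sampled = random.choices(range(len(logits)), weights=top2, k=1)[0]

print("T=0.5 :", [round(p, 3) for p in sharp])
print("T=2.0 :", [round(p, 3) for p in flat])
print("top-2 :", [round(p, 3) for p in top2], "sampled token:", sampled)
```

Lower temperature concentrates probability on the best token, higher temperature spreads it out, and top-k guarantees that unlikely tokens are never sampled at all.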
Architecture Diagram
Input: "The cat sat on the mat"
            │
            ▼
┌───────────────────────┐
│    Token Embedding    │   char index → vector (64d)
└───────────┬───────────┘
            │
            + ◄──── Position Embedding (learned, 64d)
            │
            ▼
         Dropout
            │
┌───────────┴───────────┐
│                       │
│  ┌─────────────────┐  │
│  │ LayerNorm       │  │
│  │ Multi-Head      │  │  ◄── Transformer
│  │ Attention       │  │      Block 1
│  │ + Residual      │  │
│  ├─────────────────┤  │
│  │ LayerNorm       │  │
│  │ FFN             │  │
│  │ + Residual      │  │
│  └─────────────────┘  │
│          ...          │   ×4 blocks
│  ┌─────────────────┐  │
│  │ Block 4         │  │
│  └─────────────────┘  │
│                       │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│    Final LayerNorm    │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│      Output Head      │   vector → vocab scores
│     (weight-tied)     │
└───────────┬───────────┘
            │
            ▼
Logits (65 scores per position)
→ softmax → probabilities → sample → next token
Parameter Count Breakdown
With vocab_size=65, embed_dim=64, num_heads=4, num_blocks=4, max_seq_len=128:
| Component | Parameters |
|---|---|
| Token Embedding (65 × 64) | 4,160 |
| Position Embedding (128 × 64) | 8,192 |
| Transformer Blocks (×4, 49,728 each: attention 16,384 + FFN 33,088 + LayerNorms 256) | 198,912 |
| Final LayerNorm | 128 |
| Output Head | (shared with token embedding) |
| Total | 211,392 |
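The count can be reproduced by hand from the layer shapes in the classes above. The following sketch is a plain-Python tally, not a call into PyTorch. `count_parameters` is a hypothetical helper, and it assumes the same choices as the code: no biases in the attention projections, biases in both FFN layers, a 4× FFN width, and a weight-tied output head.

```python
def count_parameters(vocab_size, embed_dim, num_blocks, max_seq_len, ff_mult=4):
    """Tally trainable parameters of the MiniTransformer from its layer shapes."""
    ff_dim = ff_mult * embed_dim
    # W_qkv (embed_dim -> 3 * embed_dim) and W_out (embed_dim -> embed_dim), no biases
    attention = embed_dim * 3 * embed_dim + embed_dim * embed_dim
    # Two Linear layers, each with weights and biases
    ffn = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
    # norm1 and norm2, each with a weight and a bias vector
    layer_norms = 2 * (2 * embed_dim)
    per_block = attention + ffn + layer_norms
    return {
        "token_embedding": vocab_size * embed_dim,
        "position_embedding": max_seq_len * embed_dim,
        "blocks": num_blocks * per_block,
        "final_norm": 2 * embed_dim,
        "output_head": 0,  # tied to the token embedding, so nothing new
    }


counts = count_parameters(vocab_size=65, embed_dim=64, num_blocks=4, max_seq_len=128)
total = sum(counts.values())

for name, value in counts.items():
    print(f"{name:20s} {value:>9,}")
print(f"{'total':20s} {total:>9,}")
```

For this configuration the tally gives 49,728 parameters per block and 211,392 in total. Note that `num_heads` does not appear: splitting into heads changes how the attention weights are used, not how many there are. On a real model, `sum(p.numel() for p in model.parameters())` should agree.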
For comparison:
- Our Mini-Transformer: ~211,000 parameters
- GPT-2 (small): 124,000,000 parameters
- GPT-3: 175,000,000,000 parameters
- GPT-4 (estimated): ~1,700,000,000,000 parameters
Same architecture. Different scale.
4.8 Discussion: Why Transformers Beat Everything
Before 2017, RNNs and LSTMs ruled natural language processing. Why did Transformers replace them so completely?
### 1. Parallelisation
RNNs process tokens one at a time: word 5 must wait for word 4 to finish, which must wait for word 3, and so on. This is inherently sequential: within a single sequence, extra GPU cores cannot skip ahead to later tokens.
Transformers process all tokens of a sequence simultaneously during training. The attention computation is a matrix multiplication, the exact kind of operation that GPUs are designed to do blazingly fast. As a result, Transformers keep large GPU clusters busy far more efficiently than RNNs, whose step-by-step recurrence leaves much of that hardware idle.
### 2. Long-Range Dependencies
In an RNN, information from word 1 has to pass through every subsequent word to reach word 100. By that point, the signal has degraded. This is the vanishing gradient problem. LSTMs improved this but didn't eliminate it.
In a Transformer, word 100 can attend directly to word 1 in a single step. The attention mechanism doesn't care about distance. A word on page 1 of a document can directly influence a word on page 50, as long as both fit inside the model's context window.
### 3. Scaling Laws
Perhaps the most important advantage: Transformers scale predictably. Research has shown that as you increase model size, data size, and compute, Transformer performance improves along a smooth, predictable curve. This led to the modern paradigm: make the model bigger, give it more data, and it gets better. This insight fuelled the race from GPT-2 (1.5B parameters) to GPT-4 (size undisclosed; a widely repeated estimate is about 1.7T parameters).
> [!IMPORTANT]
> The Transformer's key advantage is not intelligence. It is scalability. The same architecture works for a toy model with roughly 211K parameters and a trillion-parameter frontier model. This universality is unprecedented in AI history.
Key Concepts Summary
| Concept | What It Does | Analogy |
|---|---|---|
| Embedding | Converts tokens into dense vectors | Roll number → student description |
| Positional Encoding | Adds position information to embeddings | Radio frequencies giving each station a unique ID |
| Self-Attention | Lets each token attend to all other tokens | Students in a class looking at each other's notes |
| Query, Key, Value | Three projections of each token | Search term, book title, book content |
| Scaling ($\sqrt{d_k}$) | Prevents softmax from saturating | Keeping exam scores in a reasonable range |
| Causal Mask | Prevents seeing future tokens | No peeking at the answer key |
| Multi-Head Attention | Multiple attention patterns in parallel | Reading a poem for meaning, rhyme, and emotion simultaneously |
| FFN | Processes gathered information | Doing homework after collecting notes from classmates |
| Residual Connection | Preserves original input | Keeping your own notes even after a confusing lecture |
| Layer Norm | Stabilises numbers | Grading on a curve |
| Transformer Block | Attention + FFN + residual + norm | One complete round of classroom discussion and homework |
4.10 Exercises
Exercise 1: Embedding Exploration
Run step1_embedding.py and observe the embedding vectors for the characters 'a' and 'b'. Are they similar? Why or why not? (Hint: the model is untrained.) Modify embed_dim from 8 to 32 and observe how the vectors change.
Exercise 2: Attention Matrix Interpretation
Run step2_attention.py and study the attention weight matrix printed for "The cat". Which character pays the most attention to which other character? Now change the sentence to "aaaaaa" (all same characters). What do you expect the attention matrix to look like, and why?
Exercise 3: Masking Experiment
In the SelfAttention class, remove the causal mask (set mask=None always). Run the model and compare the attention weights with and without the mask. Write a paragraph explaining why GPT-style models need the mask but BERT-style models don't.
Exercise 4: Multi-Head Intuition
In step3_transformer_block.py, change num_heads from 4 to 1 (keeping embed_dim the same). How does the parameter count change? Why might using more heads be better even though the total parameters stay the same?
Exercise 5: Scale the Model
Modify step4_put_it_together.py to create a larger model with embed_dim=128, num_heads=8, and num_blocks=6. Calculate the expected parameter count by hand, then verify it against the code's output. How does it compare to GPT-2?
4.11 Discussion Questions
The Attention Bottleneck: Self-attention computes a score between every pair of tokens, making its computational cost O(n^2) where n is the sequence length. If you double the sequence length, the cost quadruples. Why is this a problem for processing very long documents (like an entire novel)? What approaches might help?
Learned vs. Fixed Positional Encoding: Our final Mini-Transformer uses learned positional embeddings, while the original 2017 paper used sinusoidal (fixed) encodings. What are the trade-offs? Can a model with learned embeddings generalise to sequences longer than it was trained on?
Weight Tying: In our MiniTransformer, the token embedding matrix and the output head share the same weights (self.output_head.weight = self.token_embedding.weight). Why does this make intuitive sense? (Hint: think about what both layers represent.)
Why ReLU in the FFN? The feed-forward network uses ReLU (Rectified Linear Unit) as its non-linearity. What would happen if we removed the non-linearity entirely? Would the FFN still be useful? (Hint: think about what two consecutive linear layers reduce to.)
The Scaling Revolution: The same Transformer architecture powers models from about 211K parameters (our toy model) to an estimated 1.7 trillion parameters (GPT-4, unconfirmed). What does this tell us about the relationship between architecture and scale in modern AI? Is architecture or data more important?
Tip
What's Next? In Chapter 5, you will take this Mini-Transformer and train it on real text. You'll watch it go from producing random garbage to generating coherent English, character by character. The architecture is ready. Now it's time to teach it to think.
Complete Source Code - Chapter 4
Below are the complete, runnable source files for this chapter. Every line is included.
Complete Code: step1_embedding.py
Python
"""
Level 3, Step 1: Embeddings - Turning Words into Numbers
=============================================================
Before a Transformer can process text, it needs to convert characters (or words)
into NUMBERS. This is called "embedding".
But there's a catch: the model also needs to know the ORDER of the characters.
"cat" and "tac" have the same characters but different meanings!
That's why we add "positional encoding", a special pattern that tells the model
WHERE each character is in the sequence.
This script shows you:
  1. How to build a vocabulary (character → number)
  2. How token embedding works (number → vector)
  3. How positional encoding works (adding position information)
"""
import torch
import torch.nn as nn
import math
# ============================================================================
# ANSI Colors for beautiful terminal output
# ============================================================================
class Colors:
    HEADER = '\033[95m'
    BLUE = '\033[94m'
    CYAN = '\033[96m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    DIM = '\033[2m'
    RESET = '\033[0m'


def print_header(text):
    print(f"\n{Colors.BOLD}{Colors.HEADER}{'='*60}")
    print(f" {text}")
    print(f"{'='*60}{Colors.RESET}\n")


def print_step(num, text):
    print(f"{Colors.BOLD}{Colors.CYAN}Step {num}: {text}{Colors.RESET}")


def print_info(text):
    print(f" {Colors.DIM}{text}{Colors.RESET}")


def print_success(text):
    print(f" {Colors.GREEN}✓ {text}{Colors.RESET}")
# ============================================================================
# STEP 1: Build a Vocabulary
# ============================================================================
# A vocabulary maps each unique character to a number (index).
# For example: 'a' → 0, 'b' → 1, 'c' → 2, ...
# This is the simplest form of "tokenization".
# Real models like GPT use "subword" tokenization (BPE), which breaks words
# into pieces like "play" + "ing". But character-level is easier to understand!
# ============================================================================
def build_vocabulary(text):
    """
    Build a character-level vocabulary from text.
    Returns:
        char_to_idx: Dictionary mapping character → number
        idx_to_char: Dictionary mapping number → character
        vocab_size: Total number of unique characters
    """
    # Get all unique characters and sort them
    # sorted() ensures the mapping is consistent every time
    chars = sorted(list(set(text)))
    # Create the two-way mapping
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    idx_to_char = {i: ch for i, ch in enumerate(chars)}
    return char_to_idx, idx_to_char, len(chars)


def encode(text, char_to_idx):
    """Convert a string into a list of numbers using our vocabulary."""
    return [char_to_idx[ch] for ch in text]


def decode(indices, idx_to_char):
    """Convert a list of numbers back into a string."""
    return ''.join([idx_to_char[i] for i in indices])
# ============================================================================
# STEP 2: Token Embedding
# ============================================================================
# An embedding turns each character INDEX into a VECTOR (list of numbers).
#
# Why? Because a single number (like 5) doesn't carry much meaning.
# But a vector (like [0.2, -0.5, 0.8, 0.1]) can represent complex relationships:
#   - Similar characters will have similar vectors
#   - The model LEARNS these vectors during training!
#
# Think of it like this:
#   Index 5 → just a label, like a student's roll number
#   Vector [0.2, -0.5, 0.8] → the student's actual abilities/personality
# ============================================================================
class TokenEmbedding(nn.Module):
    """
    Converts token indices into dense vectors.
    Args:
        vocab_size: Number of unique tokens
        embed_dim: Size of each embedding vector
    """
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        # nn.Embedding is like a lookup table:
        # It stores a matrix of shape (vocab_size × embed_dim)
        # When you give it index 3, it returns row 3 of the matrix
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embed_dim = embed_dim

    def forward(self, x):
        # x shape: (batch_size, sequence_length), holding indices
        # output shape: (batch_size, sequence_length, embed_dim), holding vectors
        return self.embedding(x)
# ============================================================================
# STEP 3: Positional Encoding
# ============================================================================
# The Transformer processes ALL tokens at once (not one-by-one like RNNs).
# This means it has NO idea about word order!
# "The cat sat on the mat" and "mat the on sat cat the" look the same to it.
#
# Positional encoding ADDS a unique pattern to each position:
#   Position 0: add pattern [sin(0), cos(0), sin(0), cos(0), ...]
#   Position 1: add pattern [sin(1), cos(1), sin(0.1), cos(0.1), ...]
#   Position 2: add pattern [sin(2), cos(2), sin(0.2), cos(0.2), ...]
#
# The patterns use sin/cos waves at different frequencies so each position
# gets a UNIQUE fingerprint. The model can then learn to use this info!
# ============================================================================
class PositionalEncoding(nn.Module):
    """
    Adds positional information to embeddings using sinusoidal patterns.
    The famous formula from "Attention Is All You Need":
        PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
        PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    def __init__(self, embed_dim, max_seq_len=512):
        super().__init__()
        # Create a matrix to store all positional encodings
        pe = torch.zeros(max_seq_len, embed_dim)
        # Position indices: [0, 1, 2, ..., max_seq_len-1]
        position = torch.arange(0, max_seq_len, dtype=torch.float).unsqueeze(1)
        # Division term: creates different frequencies for each dimension
        # Even dimensions use sin, odd dimensions use cos
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2).float() * (-math.log(10000.0) / embed_dim)
        )
        # Apply sin to even indices (0, 2, 4, ...)
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cos to odd indices (1, 3, 5, ...)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Add batch dimension: (1, max_seq_len, embed_dim)
        pe = pe.unsqueeze(0)
        # Register as buffer (saved with model but not trained)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Args:
            x: Tensor of shape (batch_size, seq_len, embed_dim)
        Returns:
            x + positional encoding
        """
        seq_len = x.size(1)
        # Add positional encoding to the input
        # The position info is ADDED to the embedding, not concatenated
        return x + self.pe[:, :seq_len, :]
# ============================================================================
# MAIN: Let's see it all in action!
# ============================================================================
if __name__ == '__main__':
    print_header("Level 3, Step 1: Embeddings & Positional Encoding")

    # --- Step 1: Build Vocabulary ---
    print_step(1, "Building the Vocabulary")
    sample_text = "The sun rises in the east and sets in the west. India is a beautiful country."
    print_info(f'Sample text: "{sample_text}"')
    char_to_idx, idx_to_char, vocab_size = build_vocabulary(sample_text)
    print(f"\n {Colors.YELLOW}Vocabulary ({vocab_size} unique characters):{Colors.RESET}")
    for ch, idx in sorted(char_to_idx.items(), key=lambda x: x[1]):
        display_ch = repr(ch) if ch == ' ' else f"'{ch}'"
        print(f" {display_ch:6s} → {idx}")
    print_success(f"Built vocabulary with {vocab_size} characters")

    # --- Step 2: Encode Text ---
    print_step(2, "Encoding Text into Numbers")
    test_text = "The sun"
    encoded = encode(test_text, char_to_idx)
    print(f"\n Text: \"{test_text}\"")
    print(f" Encoded: {encoded}")
    print(f" Decoded: \"{decode(encoded, idx_to_char)}\"")
    print_success("Text successfully converted to numbers!")

    # --- Step 3: Token Embedding ---
    print_step(3, "Converting Numbers to Vectors (Token Embedding)")
    embed_dim = 8  # Small for visualization (real models use 768+)
    token_emb = TokenEmbedding(vocab_size, embed_dim)
    # Convert our encoded text to a tensor
    input_tensor = torch.tensor([encoded])  # Shape: (1, 7), i.e. batch=1, seq=7
    print(f"\n Input shape: {list(input_tensor.shape)} (batch_size=1, seq_len={len(encoded)})")
    # Get embeddings
    embedded = token_emb(input_tensor)
    print(f" Output shape: {list(embedded.shape)} (batch_size=1, seq_len={len(encoded)}, embed_dim={embed_dim})")
    print(f"\n {Colors.YELLOW}Embedding vectors for each character:{Colors.RESET}")
    for i, ch in enumerate(test_text):
        vec = embedded[0, i].detach().numpy()
        vec_str = ', '.join([f'{v:+.3f}' for v in vec])
        display_ch = 'SPC' if ch == ' ' else ch
        print(f" '{display_ch}' → [{vec_str}]")
    print_success("Each character is now a rich vector of numbers!")

    # --- Step 4: Positional Encoding ---
    print_step(4, "Adding Positional Information")
    pos_enc = PositionalEncoding(embed_dim, max_seq_len=100)
    # Show the positional encoding patterns
    print(f"\n {Colors.YELLOW}Positional encoding patterns:{Colors.RESET}")
    for pos in range(min(5, len(test_text))):
        pe_vals = pos_enc.pe[0, pos].numpy()
        pe_str = ', '.join([f'{v:+.3f}' for v in pe_vals])
        print(f" Position {pos}: [{pe_str}]")
    # Apply positional encoding
    embedded_with_pos = pos_enc(embedded)
    print(f"\n {Colors.YELLOW}Before vs After positional encoding:{Colors.RESET}")
    for i, ch in enumerate(test_text[:4]):
        before = embedded[0, i].detach().numpy()
        after = embedded_with_pos[0, i].detach().numpy()
        display_ch = 'SPC' if ch == ' ' else ch
        before_str = ', '.join([f'{v:+.3f}' for v in before[:4]])
        after_str = ', '.join([f'{v:+.3f}' for v in after[:4]])
        print(f" '{display_ch}' before: [{before_str}, ...]")
        print(f" '{display_ch}' after: [{after_str}, ...]")
        print()
    print_success("Position information added! Same character at different positions now has different vectors.")

    # --- Summary ---
    print_header("Summary")
    print(f""" The embedding pipeline:
 Text: "The sun"
   │
   ▼
 {Colors.CYAN}Tokenize{Colors.RESET}: Convert characters to indices
   │ 'T'→{char_to_idx.get('T', '?')}, 'h'→{char_to_idx.get('h', '?')}, 'e'→{char_to_idx.get('e', '?')}, ...
   │
   ▼
 {Colors.CYAN}Embed{Colors.RESET}: Look up vector for each index
   │ Index {char_to_idx.get('T', '?')} → [{', '.join([f'{v:.2f}' for v in embedded[0, 0].detach().numpy()[:3]])}, ...]
   │
   ▼
 {Colors.CYAN}Add Position{Colors.RESET}: Add sinusoidal position pattern
   │ Vector + Position Pattern = Final Embedding
   │
   ▼
 Ready for Attention! → Go to step2_attention.py
""")
    print(f" {Colors.BOLD}{Colors.GREEN}✅ Step 1 Complete! Next: python step2_attention.py{Colors.RESET}\n")
Complete Code: step2_attention.py
Python
"""
Level 3, Step 2: Self-Attention - The Core of Transformers
===============================================================
This is THE most important mechanism in modern AI.
Self-attention allows each token in a sequence to "look at" every other token
and decide how much to pay attention to it.
We'll build it from scratch:
  1. Create Query (Q), Key (K), Value (V) matrices
  2. Compute attention scores
  3. Apply causal mask (no peeking at the future!)
  4. Softmax to get probabilities
  5. Weighted sum of values
By the end, you'll understand the formula:
    Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# ============================================================================
# Colors
# ============================================================================
class Colors:
    HEADER = '\033[95m'
    BLUE = '\033[94m'
    CYAN = '\033[96m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    DIM = '\033[2m'
    RESET = '\033[0m'


def print_header(text):
    print(f"\n{Colors.BOLD}{Colors.HEADER}{'='*60}")
    print(f" {text}")
    print(f"{'='*60}{Colors.RESET}\n")


def print_step(num, text):
    print(f"{Colors.BOLD}{Colors.CYAN}Step {num}: {text}{Colors.RESET}")


def print_info(text):
    print(f" {Colors.DIM}{text}{Colors.RESET}")


def print_success(text):
    print(f" {Colors.GREEN}✓ {text}{Colors.RESET}")


def print_matrix(name, matrix, row_labels=None, col_labels=None):
    """Pretty-print a 2D matrix with labels."""
    print(f"\n {Colors.YELLOW}{name}:{Colors.RESET}")
    rows, cols = matrix.shape
    # Column headers
    if col_labels:
        header = " " + " ".join([f"{l:>7s}" for l in col_labels])
        print(f" {header}")
        print(f" {'─' * len(header)}")
    for i in range(rows):
        label = f" {row_labels[i]:>6s} │ " if row_labels else f" Row {i}: "
        vals = " ".join([f"{matrix[i,j]:>7.3f}" for j in range(cols)])
        print(f"{label}{vals}")
# ============================================================================
# ๐ง SELF-ATTENTION FROM SCRATCH
# ============================================================================
class SelfAttention(nn.Module):
"""
Single-head self-attention mechanism.
This is the core building block of the Transformer.
How it works:
1. Take input X (sequence of vectors)
2. Create three versions: Q (Query), K (Key), V (Value)
3. Compute attention = softmax(Q·K^T / √d) · V
4. Return attention output
"""
def __init__(self, embed_dim):
super().__init__()
self.embed_dim = embed_dim
# Three weight matrices: these are LEARNED during training!
# W_q: transforms input into "what am I looking for?"
# W_k: transforms input into "what do I contain?"
# W_v: transforms input into "what information do I give?"
self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
# Scaling factor to prevent dot products from getting too large
self.scale = math.sqrt(embed_dim)
def forward(self, x, mask=None, verbose=False):
"""
Args:
x: Input tensor of shape (batch, seq_len, embed_dim)
mask: Optional causal mask
verbose: If True, print intermediate values
Returns:
output: Attention output, same shape as input
attention_weights: The attention matrix (for visualization)
"""
batch_size, seq_len, _ = x.shape
# ===== STEP 1: Create Q, K, V =====
# Each is a different "view" of the same input
Q = self.W_q(x) # (batch, seq_len, embed_dim)
K = self.W_k(x) # (batch, seq_len, embed_dim)
V = self.W_v(x) # (batch, seq_len, embed_dim)
if verbose:
print_step("A", "Computed Q (Query), K (Key), V (Value)")
print_info(f"Q shape: {list(Q.shape)}")
print_info(f"K shape: {list(K.shape)}")
print_info(f"V shape: {list(V.shape)}")
# ===== STEP 2: Compute Attention Scores =====
# Score = Q · K^T (dot product between queries and keys)
# High score = this query is very interested in this key
scores = torch.matmul(Q, K.transpose(-2, -1)) # (batch, seq_len, seq_len)
if verbose:
print_step("B", "Computed raw attention scores (Q · K^T)")
print_info(f"Scores shape: {list(scores.shape)} (each token has a score for every token, itself included)")
# ===== STEP 3: Scale =====
# Divide by √d_k to prevent scores from getting too large
# Large scores → softmax becomes too "peaked" (one token gets all attention)
# Scaled scores → softer distribution → better learning
scores = scores / self.scale
if verbose:
print_step("C", f"Scaled scores by 1/√{self.embed_dim} = 1/{self.scale:.2f}")
# ===== STEP 4: Apply Causal Mask (Optional) =====
# In GPT-style models, each token can only attend to itself and the tokens BEFORE it
# We set future positions to -infinity so softmax turns them to 0
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
if verbose:
print_step("D", "Applied causal mask (future tokens set to -∞)")
print_info("This prevents the model from 'cheating' by looking ahead!")
# ===== STEP 5: Softmax =====
# Convert scores to probabilities (0 to 1, summing to 1)
attention_weights = F.softmax(scores, dim=-1)
if verbose:
print_step("E", "Applied softmax → attention weights (probabilities)")
print_info("Each row sums to 1.0: it's a probability distribution!")
# ===== STEP 6: Weighted Sum of Values =====
# Multiply attention weights by V to get the final output
# Each token's output is a weighted combination of ALL values
output = torch.matmul(attention_weights, V) # (batch, seq_len, embed_dim)
if verbose:
print_step("F", "Computed output = attention_weights × V")
print_info(f"Output shape: {list(output.shape)} (same as input!)")
print_success("Each token now contains information from tokens it attended to!")
return output, attention_weights
# ============================================================================
# 🎭 Create Causal Mask
# ============================================================================
def create_causal_mask(seq_len):
"""
Create a lower-triangular mask for autoregressive (GPT-style) models.
The mask looks like:
[[1, 0, 0, 0], ← token 0 can only see token 0
[1, 1, 0, 0], ← token 1 can see tokens 0, 1
[1, 1, 1, 0], ← token 2 can see tokens 0, 1, 2
[1, 1, 1, 1]] ← token 3 can see all tokens
"""
mask = torch.tril(torch.ones(seq_len, seq_len))
return mask.unsqueeze(0) # Add batch dimension
# ============================================================================
# MAIN: Interactive Demo
# ============================================================================
if __name__ == '__main__':
print_header("Level 3, Step 2: Self-Attention from Scratch")
# --- Setup ---
torch.manual_seed(42) # For reproducible results
# Our example sentence (character-level)
sentence = "The cat"
tokens = list(sentence)
seq_len = len(tokens)
embed_dim = 8 # Small for visualization
print(f" {Colors.BOLD}Example sentence: \"{sentence}\"{Colors.RESET}")
print(f" Tokens: {tokens}")
print(f" Sequence length: {seq_len}")
print(f" Embedding dimension: {embed_dim}")
# Create random embeddings (in real model, these come from Step 1)
x = torch.randn(1, seq_len, embed_dim)
# --- Build Attention ---
print_header("Building Self-Attention")
attention = SelfAttention(embed_dim)
# --- Without Mask (bidirectional) ---
print_header("Attention WITHOUT Causal Mask (Bidirectional)")
print_info("Every token can see every other token")
output_bi, weights_bi = attention(x, mask=None, verbose=True)
# Show attention matrix
print_matrix(
"Attention Weights (who pays attention to whom?)",
weights_bi[0].detach(),
row_labels=tokens,
col_labels=tokens
)
# Visual attention grid
print(f"\n {Colors.YELLOW}Visual Attention Grid:{Colors.RESET}")
print(f" (█ = high attention, ░ = low attention)\n")
header = " " + " ".join([f"{t:>3s}" for t in tokens])
print(f" {header}")
for i, token in enumerate(tokens):
row = f" {token:>6s} │ "
for j in range(seq_len):
w = weights_bi[0, i, j].item()
if w > 0.3:
row += f" {Colors.GREEN}██{Colors.RESET} "
elif w > 0.15:
row += f" {Colors.YELLOW}▓▓{Colors.RESET} "
else:
row += f" {Colors.DIM}░░{Colors.RESET} "
print(row)
# --- With Causal Mask ---
print_header("🎭 Attention WITH Causal Mask (Autoregressive / GPT-style)")
print_info("Each token can only see itself and tokens BEFORE it")
causal_mask = create_causal_mask(seq_len)
print(f"\n {Colors.YELLOW}Causal Mask:{Colors.RESET}")
for i, token in enumerate(tokens):
row = f" {token:>6s} │ "
for j in range(seq_len):
if causal_mask[0, i, j] == 1:
row += f" {Colors.GREEN}✓{Colors.RESET} "
else:
row += f" {Colors.RED}✗{Colors.RESET} "
print(row)
output_causal, weights_causal = attention(x, mask=causal_mask, verbose=True)
print_matrix(
"Causal Attention Weights",
weights_causal[0].detach(),
row_labels=tokens,
col_labels=tokens
)
# Visual grid for causal attention
print(f"\n {Colors.YELLOW}Visual Causal Attention Grid:{Colors.RESET}")
print(f" (█ = high attention, ░ = low, ─ = masked)\n")
header = " " + " ".join([f"{t:>3s}" for t in tokens])
print(f" {header}")
for i, token in enumerate(tokens):
row = f" {token:>6s} │ "
for j in range(seq_len):
if causal_mask[0, i, j] == 0:
row += f" {Colors.RED}──{Colors.RESET} "
else:
w = weights_causal[0, i, j].item()
if w > 0.3:
row += f" {Colors.GREEN}██{Colors.RESET} "
elif w > 0.15:
row += f" {Colors.YELLOW}▓▓{Colors.RESET} "
else:
row += f" {Colors.DIM}░░{Colors.RESET} "
print(row)
# --- Summary ---
print_header("Summary")
print(f""" Self-Attention in 6 steps:
1. {Colors.CYAN}Create Q, K, V{Colors.RESET} from input using learned weight matrices
2. {Colors.CYAN}Score{Colors.RESET} = Q · K^T (how much does each token care about others?)
3. {Colors.CYAN}Scale{Colors.RESET} by 1/√d_k (keep numbers reasonable)
4. {Colors.CYAN}Mask{Colors.RESET} future tokens (for GPT-style models)
5. {Colors.CYAN}Softmax{Colors.RESET} to get probabilities
6. {Colors.CYAN}Output{Colors.RESET} = attention_weights × V (weighted combination)
{Colors.BOLD}The Formula:{Colors.RESET}
Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V
{Colors.BOLD}{Colors.GREEN}✅ Step 2 Complete! Next: python step3_transformer_block.py{Colors.RESET}
""")
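The six steps of the script above can also be traced by hand on a tiny example. The following is a minimal, dependency-free sketch rather than the course's implementation: the function names `softmax` and `causal_attention` and the toy `Q`, `K`, `V` values are assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats (entries may be -inf)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V with a lower-triangular mask."""
    d_k = len(Q[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if j > i:
                # Future position: masked out before the softmax
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy inputs (hypothetical values): 3 tokens, 2 dimensions each
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = causal_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row sums to 1, and every entry above the diagonal is exactly 0 because `exp(-inf)` is 0. The first token can only attend to itself, so its output is simply the first value vector.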
Complete Code: step3_transformer_block.py
Python
"""
Level 3, Step 3: The Transformer Block
============================================
A Transformer Block combines several components into one powerful unit:
1. Multi-Head Self-Attention: look at the sequence from multiple perspectives
2. Feed-Forward Network: process the information
3. Layer Normalization: keep numbers stable
4. Residual Connections: preserve original information
This is the building block that gets stacked to make GPT, Claude, etc.
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# ============================================================================
# 🎨 Colors
# ============================================================================
class Colors:
HEADER = '\033[95m'
BLUE = '\033[94m'
CYAN = '\033[96m'
GREEN = '\033[92m'
YELLOW = '\033[93m'
RED = '\033[91m'
BOLD = '\033[1m'
DIM = '\033[2m'
RESET = '\033[0m'
def print_header(text):
print(f"\n{Colors.BOLD}{Colors.HEADER}{'='*60}")
print(f" {text}")
print(f"{'='*60}{Colors.RESET}\n")
def print_step(num, text):
print(f"{Colors.BOLD}{Colors.CYAN}Step {num}: {text}{Colors.RESET}")
def print_info(text):
print(f" {Colors.DIM}{text}{Colors.RESET}")
def print_success(text):
print(f" {Colors.GREEN}✓ {text}{Colors.RESET}")
def count_parameters(module, name=""):
"""Count and print the number of parameters in a module."""
total = sum(p.numel() for p in module.parameters())
trainable = sum(p.numel() for p in module.parameters() if p.requires_grad)
if name:
print(f" {name:30s}: {trainable:>8,} parameters")
return trainable
# ============================================================================
# MULTI-HEAD ATTENTION
# ============================================================================
# Instead of one attention head, we use MULTIPLE heads.
# Each head learns to focus on different types of relationships:
# Head 1 might focus on: "what word comes before me?"
# Head 2 might focus on: "what is the subject of this sentence?"
# Head 3 might focus on: "is there a negation word nearby?"
#
# We split the embedding dimension among heads:
# embed_dim=128, num_heads=4 → each head works with 32 dimensions
# ============================================================================
class MultiHeadAttention(nn.Module):
"""
Multi-Head Self-Attention.
Splits the input into multiple "heads", runs attention on each,
then combines the results.
"""
def __init__(self, embed_dim, num_heads):
super().__init__()
assert embed_dim % num_heads == 0, \
f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})"
self.embed_dim = embed_dim
self.num_heads = num_heads
self.head_dim = embed_dim // num_heads # Dimension per head
# One big linear layer for Q, K, V (more efficient than 3 separate ones)
self.W_qkv = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
# Output projection: combines all heads back together
self.W_out = nn.Linear(embed_dim, embed_dim, bias=False)
self.scale = math.sqrt(self.head_dim)
def forward(self, x, mask=None):
"""
Args:
x: (batch, seq_len, embed_dim)
mask: Optional causal mask
Returns:
output: (batch, seq_len, embed_dim)
"""
batch_size, seq_len, _ = x.shape
# Step 1: Compute Q, K, V all at once
qkv = self.W_qkv(x) # (batch, seq_len, 3 * embed_dim)
# Step 2: Split into Q, K, V
qkv = qkv.reshape(batch_size, seq_len, 3, self.num_heads, self.head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4) # (3, batch, heads, seq_len, head_dim)
Q, K, V = qkv[0], qkv[1], qkv[2]
# Step 3: Compute attention scores for ALL heads at once
scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
# scores shape: (batch, heads, seq_len, seq_len)
# Step 4: Apply causal mask
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
# Step 5: Softmax
attention_weights = F.softmax(scores, dim=-1)
# Step 6: Weighted sum of values
output = torch.matmul(attention_weights, V)
# output shape: (batch, heads, seq_len, head_dim)
# Step 7: Combine heads back together
output = output.transpose(1, 2) # (batch, seq_len, heads, head_dim)
output = output.reshape(batch_size, seq_len, self.embed_dim)
# Step 8: Final projection
output = self.W_out(output)
return output
# ============================================================================
# FEED-FORWARD NETWORK
# ============================================================================
# After attention figures out RELATIONSHIPS between tokens,
# the FFN PROCESSES that information.
#
# It's a simple 2-layer network:
# Input → Expand (4x bigger) → ReLU → Shrink (back to original) → Output
#
# The "expand then shrink" pattern gives the model a larger space to
# compute in, then compresses the result back down.
# ============================================================================
class FeedForward(nn.Module):
"""
Position-wise Feed-Forward Network.
Each position (token) is processed INDEPENDENTLY through the same network.
It's like giving each student the same worksheet to fill out.
"""
def __init__(self, embed_dim, ff_dim=None):
super().__init__()
# Default: expand to 4x the embedding dimension
if ff_dim is None:
ff_dim = 4 * embed_dim
self.net = nn.Sequential(
nn.Linear(embed_dim, ff_dim), # Expand
nn.ReLU(), # Non-linearity (the "thinking" part)
nn.Linear(ff_dim, embed_dim), # Shrink back
)
def forward(self, x):
return self.net(x)
# ============================================================================
# 🧱 THE COMPLETE TRANSFORMER BLOCK
# ============================================================================
# This combines everything:
#
#   Input
#     │
#     ├────────────────────┐ (Residual)
#     ▼                    │
#   LayerNorm              │
#     ▼                    │
#   Multi-Head Attention   │
#     ▼                    │
#    ADD ◄─────────────────┘
#     │
#     ├────────────────────┐ (Residual)
#     ▼                    │
#   LayerNorm              │
#     ▼                    │
#   Feed-Forward           │
#     ▼                    │
#    ADD ◄─────────────────┘
#     │
#   Output
# ============================================================================
class TransformerBlock(nn.Module):
"""
A single Transformer block.
This is the fundamental repeating unit in models like GPT.
Stack many of these together to get a full Transformer model.
"""
def __init__(self, embed_dim, num_heads, ff_dim=None, dropout=0.1):
super().__init__()
# Layer Normalization: keeps values in a reasonable range
# Think of it as "grading on a curve": it normalizes each student's scores
self.norm1 = nn.LayerNorm(embed_dim)
self.norm2 = nn.LayerNorm(embed_dim)
# Multi-Head Self-Attention
self.attention = MultiHeadAttention(embed_dim, num_heads)
# Feed-Forward Network
self.ffn = FeedForward(embed_dim, ff_dim)
# Dropout: randomly "turns off" some neurons during training
# Prevents the model from memorizing (overfitting) the training data
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
"""
Args:
x: (batch, seq_len, embed_dim)
mask: Optional causal mask
Returns:
output: (batch, seq_len, embed_dim), same shape as input!
"""
# === Sub-layer 1: Attention with Residual Connection ===
# 1. Normalize
normed = self.norm1(x)
# 2. Apply attention
attended = self.attention(normed, mask=mask)
# 3. Dropout (only during training)
attended = self.dropout(attended)
# 4. Residual connection: ADD original input back
# This ensures information isn't lost through the attention layer
x = x + attended
# === Sub-layer 2: FFN with Residual Connection ===
# 1. Normalize
normed = self.norm2(x)
# 2. Apply feed-forward
fed_forward = self.ffn(normed)
# 3. Dropout
fed_forward = self.dropout(fed_forward)
# 4. Residual connection
x = x + fed_forward
return x
# ============================================================================
# MAIN: See it in action
# ============================================================================
if __name__ == '__main__':
print_header("Level 3, Step 3: The Transformer Block")
torch.manual_seed(42)
# Configuration
embed_dim = 32 # Embedding dimension
num_heads = 4 # Number of attention heads
seq_len = 8 # Sequence length
batch_size = 1
print(f" {Colors.BOLD}Configuration:{Colors.RESET}")
print(f" Embedding dimension: {embed_dim}")
print(f" Number of heads: {num_heads}")
print(f" Head dimension: {embed_dim // num_heads}")
print(f" Sequence length: {seq_len}")
print(f" Feed-forward dim: {4 * embed_dim}")
# --- Build Components ---
print_header("Building Components")
print_step(1, "Multi-Head Attention")
mha = MultiHeadAttention(embed_dim, num_heads)
count_parameters(mha, "Multi-Head Attention")
print_info(f" → {num_heads} heads, each with dim={embed_dim // num_heads}")
print()
print_step(2, "Feed-Forward Network")
ffn = FeedForward(embed_dim)
count_parameters(ffn, "Feed-Forward Network")
print_info(f" → Expand: {embed_dim} → {4*embed_dim} → {embed_dim}")
print()
print_step(3, "Layer Normalization")
ln = nn.LayerNorm(embed_dim)
count_parameters(ln, "Layer Norm (each of 2)")
print_info(" → Normalizes each token vector to mean=0, std=1, then applies a learned scale and shift")
# --- Build Full Transformer Block ---
print_header("🧱 Complete Transformer Block")
block = TransformerBlock(embed_dim, num_heads)
total_params = count_parameters(block, "Total Transformer Block")
print(f"\n {Colors.YELLOW}Architecture:{Colors.RESET}")
print(f"""
    ┌────────────────────────────────────┐
    │           INPUT ({embed_dim}d)             │
    └────────────────┬───────────────────┘
                     │
                     ├──────────────┐ (residual)
                     ▼              │
              ┌─────────────┐       │
              │  LayerNorm  │       │
              └──────┬──────┘       │
                     ▼              │
              ┌─────────────┐       │
              │ Multi-Head  │       │
              │  Attention  │       │
              │  ({num_heads} heads)  │       │
              └──────┬──────┘       │
                     ▼              │
                    ADD ◄───────────┘
                     │
                     ├──────────────┐ (residual)
                     ▼              │
              ┌─────────────┐       │
              │  LayerNorm  │       │
              └──────┬──────┘       │
                     ▼              │
              ┌─────────────┐       │
              │ Feed-Forward│       │
              │ {embed_dim}→{4*embed_dim}→{embed_dim}   │       │
              └──────┬──────┘       │
                     ▼              │
                    ADD ◄───────────┘
                     │
    ┌────────────────┴───────────────────┐
    │           OUTPUT ({embed_dim}d)            │
    └────────────────────────────────────┘
""")
# --- Forward Pass ---
print_header("Running a Forward Pass")
# Create causal mask
mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
# Random input (simulating embedded tokens)
x = torch.randn(batch_size, seq_len, embed_dim)
print_step(1, "Input")
print_info(f"Shape: {list(x.shape)} (batch={batch_size}, seq={seq_len}, dim={embed_dim})")
print_info(f"First token vector (first 8 dims): [{', '.join(f'{v:.3f}' for v in x[0,0,:8].tolist())}]")
print()
print_step(2, "Processing through Transformer Block...")
output = block(x, mask=mask)
print()
print_step(3, "Output")
print_info(f"Shape: {list(output.shape)} → Same as input! ✓")
print_info(f"First token vector (first 8 dims): [{', '.join(f'{v:.3f}' for v in output[0,0,:8].tolist())}]")
print(f"\n {Colors.YELLOW}Notice:{Colors.RESET}")
print(f" → Input shape = {list(x.shape)}")
print(f" → Output shape = {list(output.shape)}")
print(f" → {Colors.GREEN}Shapes are identical!{Colors.RESET} This means we can STACK blocks.")
print(f" The output of Block 1 becomes the input to Block 2!")
# --- Stacking Demo ---
print_header("Stacking Multiple Blocks")
num_blocks = 4
blocks = nn.ModuleList([
TransformerBlock(embed_dim, num_heads) for _ in range(num_blocks)
])
# Pass through all blocks
current = x
for i, b in enumerate(blocks):
current = b(current, mask=mask)
print(f" Block {i+1}: {list(current.shape)} ✓")
stack_params = sum(count_parameters(b) for b in blocks)
print(f"\n Total parameters in {num_blocks}-block stack: {Colors.BOLD}{stack_params:,}{Colors.RESET}")
# --- Summary ---
print_header("Summary")
print(f""" A Transformer Block contains:
1. {Colors.CYAN}Multi-Head Attention{Colors.RESET}: learns relationships between tokens
2. {Colors.CYAN}Feed-Forward Network{Colors.RESET}: processes the information
3. {Colors.CYAN}Layer Normalization{Colors.RESET}: keeps numbers stable
4. {Colors.CYAN}Residual Connections{Colors.RESET}: preserve original information
Key insight: {Colors.BOLD}Input and output shapes are the same!{Colors.RESET}
This means we can stack as many blocks as we want.
More blocks = deeper understanding (a rough intuition, not a strict rule):
Block 1: Basic patterns (which characters go together)
Block 2-3: Higher-level patterns (word structure)
Block 4+: Complex patterns (meaning, context)
{Colors.BOLD}{Colors.GREEN}✅ Step 3 Complete! Next: python step4_put_it_together.py{Colors.RESET}
""")
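The parameter counts that the script prints can be checked by hand. The sketch below assumes the layer shapes used in the listing above: bias-free attention projections (`W_qkv`, `W_out`), a feed-forward network with biases and a 4x expansion, and two LayerNorms with a scale and a shift each. The helper name `block_param_count` is hypothetical and not part of the course files.

```python
def block_param_count(embed_dim, ff_mult=4):
    """Count trainable parameters in one Transformer block, layer by layer."""
    # Attention: W_qkv is (embed_dim x 3*embed_dim), W_out is (embed_dim x embed_dim), no biases
    attention = 3 * embed_dim * embed_dim + embed_dim * embed_dim
    # Feed-forward: two Linear layers, each with a weight matrix and a bias vector
    ff_dim = ff_mult * embed_dim
    ffn = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
    # Two LayerNorms, each with a learned scale and shift of size embed_dim
    layernorm = 2 * (2 * embed_dim)
    # Dropout has no parameters
    return {
        "attention": attention,
        "ffn": ffn,
        "layernorm": layernorm,
        "total": attention + ffn + layernorm,
    }

counts = block_param_count(32)
print(counts)
print("4-block stack:", 4 * counts["total"])
```

With `embed_dim=32` this gives 4,096 attention parameters, 8,352 feed-forward parameters, and 128 LayerNorm parameters, so 12,576 per block. Note that the feed-forward network holds roughly two thirds of the block's weights.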
Complete Code: step4_put_it_together.py
Python
"""
Level 3, Step 4: Putting It All Together, a Complete Mini-Transformer
==========================================================================
Now we assemble all the pieces from Steps 1-3 into a COMPLETE model:
Text → Tokenize → Embed → Position → [Transformer Blocks] → Output Logits
This is essentially a tiny version of GPT!
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# ============================================================================
# 🎨 Colors
# ============================================================================
class Colors:
HEADER = '\033[95m'
BLUE = '\033[94m'
CYAN = '\033[96m'
GREEN = '\033[92m'
YELLOW = '\033[93m'
RED = '\033[91m'
BOLD = '\033[1m'
DIM = '\033[2m'
RESET = '\033[0m'
def print_header(text):
print(f"\n{Colors.BOLD}{Colors.HEADER}{'='*60}")
print(f" {text}")
print(f"{'='*60}{Colors.RESET}\n")
def print_step(num, text):
print(f"{Colors.BOLD}{Colors.CYAN}Step {num}: {text}{Colors.RESET}")
def print_info(text):
print(f" {Colors.DIM}{text}{Colors.RESET}")
def print_success(text):
print(f" {Colors.GREEN}✓ {text}{Colors.RESET}")
# ============================================================================
# Multi-Head Attention (from Step 3)
# ============================================================================
class MultiHeadAttention(nn.Module):
def __init__(self, embed_dim, num_heads):
super().__init__()
self.embed_dim = embed_dim
self.num_heads = num_heads
self.head_dim = embed_dim // num_heads
self.W_qkv = nn.Linear(embed_dim, 3 * embed_dim, bias=False)
self.W_out = nn.Linear(embed_dim, embed_dim, bias=False)
self.scale = math.sqrt(self.head_dim)
def forward(self, x, mask=None):
B, T, C = x.shape
qkv = self.W_qkv(x).reshape(B, T, 3, self.num_heads, self.head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4)
Q, K, V = qkv[0], qkv[1], qkv[2]
scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
if mask is not None:
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)
out = torch.matmul(weights, V)
out = out.transpose(1, 2).reshape(B, T, C)
return self.W_out(out)
# ============================================================================
# Feed-Forward Network (from Step 3)
# ============================================================================
class FeedForward(nn.Module):
def __init__(self, embed_dim, ff_dim=None):
super().__init__()
ff_dim = ff_dim or 4 * embed_dim
self.net = nn.Sequential(
nn.Linear(embed_dim, ff_dim),
nn.ReLU(),
nn.Linear(ff_dim, embed_dim),
)
def forward(self, x):
return self.net(x)
# ============================================================================
# 🧱 Transformer Block (from Step 3)
# ============================================================================
class TransformerBlock(nn.Module):
def __init__(self, embed_dim, num_heads, dropout=0.1):
super().__init__()
self.norm1 = nn.LayerNorm(embed_dim)
self.norm2 = nn.LayerNorm(embed_dim)
self.attention = MultiHeadAttention(embed_dim, num_heads)
self.ffn = FeedForward(embed_dim)
self.dropout = nn.Dropout(dropout)
def forward(self, x, mask=None):
x = x + self.dropout(self.attention(self.norm1(x), mask))
x = x + self.dropout(self.ffn(self.norm2(x)))
return x
# ============================================================================
# 🏗️ THE COMPLETE MINI-TRANSFORMER MODEL
# ============================================================================
# This is it! The full model that can:
# 1. Take in a sequence of character indices
# 2. Process them through embeddings + transformer blocks
# 3. Output probabilities for the NEXT character
#
# Architecture:
# Input indices → Token Embedding → + Positional Embedding
# → TransformerBlock × N
# → LayerNorm → Linear → Logits
# ============================================================================
class MiniTransformer(nn.Module):
"""
A complete mini-Transformer language model.
This is a simplified version of GPT:
- Takes character indices as input
- Predicts the next character
- Can generate text autoregressively
"""
def __init__(self, vocab_size, embed_dim=64, num_heads=4,
num_blocks=4, max_seq_len=256, dropout=0.1):
super().__init__()
self.vocab_size = vocab_size
self.embed_dim = embed_dim
self.max_seq_len = max_seq_len
# Token embedding: character index → vector
self.token_embedding = nn.Embedding(vocab_size, embed_dim)
# Positional embedding: position → vector
# (Using learned positional embeddings instead of sinusoidal: simpler!)
self.position_embedding = nn.Embedding(max_seq_len, embed_dim)
# Dropout after embeddings
self.dropout = nn.Dropout(dropout)
# Stack of Transformer blocks โ this is the "brain" of the model
self.blocks = nn.ModuleList([
TransformerBlock(embed_dim, num_heads, dropout)
for _ in range(num_blocks)
])
# Final layer normalization
self.final_norm = nn.LayerNorm(embed_dim)
# Output projection: vector → vocabulary scores (logits)
# This tells us: for each position, how likely is each character?
self.output_head = nn.Linear(embed_dim, vocab_size, bias=False)
# Weight tying: share weights between input embedding and output head
# A common trick: it removes vocab_size × embed_dim parameters and often helps small models
self.output_head.weight = self.token_embedding.weight
def forward(self, idx, targets=None):
"""
Args:
idx: Input token indices, shape (batch, seq_len)
targets: Optional target indices for computing loss
Returns:
logits: Prediction scores, shape (batch, seq_len, vocab_size)
loss: Cross-entropy loss (if targets provided)
"""
batch_size, seq_len = idx.shape
assert seq_len <= self.max_seq_len, \
f"Sequence length {seq_len} exceeds max {self.max_seq_len}"
# Step 1: Get token embeddings
tok_emb = self.token_embedding(idx) # (batch, seq_len, embed_dim)
# Step 2: Get positional embeddings
positions = torch.arange(seq_len, device=idx.device)
pos_emb = self.position_embedding(positions) # (seq_len, embed_dim)
# Step 3: Combine token + position embeddings
x = self.dropout(tok_emb + pos_emb)
# Step 4: Create causal mask
mask = torch.tril(torch.ones(seq_len, seq_len, device=idx.device))
mask = mask.unsqueeze(0) # (1, seq_len, seq_len)
# Step 5: Pass through transformer blocks
for block in self.blocks:
x = block(x, mask=mask)
# Step 6: Final normalization
x = self.final_norm(x)
# Step 7: Project to vocabulary size
logits = self.output_head(x) # (batch, seq_len, vocab_size)
# Compute loss if targets are provided
loss = None
if targets is not None:
# Reshape for cross-entropy: (batch*seq_len, vocab_size) and (batch*seq_len,)
loss = F.cross_entropy(
logits.view(-1, self.vocab_size),
targets.view(-1)
)
return logits, loss
@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
"""
Generate text autoregressively.
Args:
idx: Starting token indices, shape (1, seq_len)
max_new_tokens: How many tokens to generate
temperature: Controls randomness (lower = more deterministic)
top_k: Only consider top-k most likely tokens
"""
for _ in range(max_new_tokens):
# Crop to max sequence length
idx_cond = idx[:, -self.max_seq_len:]
# Get predictions
logits, _ = self(idx_cond)
# Take only the last position's predictions
logits = logits[:, -1, :] / temperature
# Optional: top-k filtering
if top_k is not None:
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
logits[logits < v[:, [-1]]] = float('-inf')
# Convert to probabilities
probs = F.softmax(logits, dim=-1)
# Sample from the distribution
next_token = torch.multinomial(probs, num_samples=1)
# Append to sequence
idx = torch.cat([idx, next_token], dim=1)
return idx
# ============================================================================
# MAIN
# ============================================================================
if __name__ == '__main__':
print_header("Level 3, Step 4: Complete Mini-Transformer")
torch.manual_seed(42)
# --- Configuration ---
vocab_size = 65 # Character vocabulary size for this demo (printable ASCII has 95 characters)
embed_dim = 64 # Embedding dimension
num_heads = 4 # Attention heads
num_blocks = 4 # Transformer blocks (layers)
max_seq_len = 128 # Maximum sequence length
print(f" {Colors.BOLD}Model Configuration:{Colors.RESET}")
print(f" Vocabulary size: {vocab_size} characters")
print(f" Embedding dimension: {embed_dim}")
print(f" Attention heads: {num_heads}")
print(f" Transformer blocks: {num_blocks}")
print(f" Max sequence length: {max_seq_len}")
# --- Build Model ---
print_header("🏗️ Building the Model")
model = MiniTransformer(
vocab_size=vocab_size,
embed_dim=embed_dim,
num_heads=num_heads,
num_blocks=num_blocks,
max_seq_len=max_seq_len
)
# Count parameters by component
print(f" {Colors.YELLOW}Parameter count by component:{Colors.RESET}")
tok_params = sum(p.numel() for p in model.token_embedding.parameters())
pos_params = sum(p.numel() for p in model.position_embedding.parameters())
block_params = sum(p.numel() for p in model.blocks.parameters())
norm_params = sum(p.numel() for p in model.final_norm.parameters())
print(f" {'Token Embedding':30s}: {tok_params:>8,}")
print(f" {'Position Embedding':30s}: {pos_params:>8,}")
print(f" {'Transformer Blocks (×'+str(num_blocks)+')':30s}: {block_params:>8,}")
print(f" {'Final LayerNorm':30s}: {norm_params:>8,}")
print(f" {'Output Head (tied)':30s}: {'(shared)'}")
total_params = sum(p.numel() for p in model.parameters())
print(f"\n {Colors.BOLD}{'TOTAL':30s}: {total_params:>8,} parameters{Colors.RESET}")
# Compare with real models
print(f"\n {Colors.YELLOW}For comparison:{Colors.RESET}")
print(f" Your Mini-Transformer: {total_params:>12,} parameters")
print(f" GPT-2 (small): 124,000,000 parameters")
print(f" GPT-3: 175,000,000,000 parameters")
print(f" GPT-4 (unofficial estimate): ~1,700,000,000,000 parameters")
# --- Print Architecture ---
print_header("Model Architecture")
print(model)
# --- Forward Pass ---
print_header("Forward Pass Demo")
# Create dummy input (batch of 2, seq length 10)
batch_size = 2
seq_len = 10
dummy_input = torch.randint(0, vocab_size, (batch_size, seq_len))
dummy_targets = torch.randint(0, vocab_size, (batch_size, seq_len))
print_step(1, f"Input shape: {list(dummy_input.shape)}")
print_info(f"(batch_size={batch_size}, seq_len={seq_len})")
print_info(f"Sample input: {dummy_input[0].tolist()}")
# Forward pass
logits, loss = model(dummy_input, targets=dummy_targets)
print()
print_step(2, f"Output logits shape: {list(logits.shape)}")
print_info(f"(batch_size={batch_size}, seq_len={seq_len}, vocab_size={vocab_size})")
print_info("Each position outputs a score for every possible next character!")
print()
print_step(3, f"Loss: {loss.item():.4f}")
print_info(f"Loss of a uniform random guess: -ln(1/{vocab_size}) = {-math.log(1/vocab_size):.4f}")
print_info("(An untrained model can score well above this: with tied N(0,1) embeddings the initial logits are large. Training brings the loss down.)")
# --- Probability Distribution ---
print_header("Output Probability Distribution")
# Show probabilities for the last position
probs = F.softmax(logits[0, -1, :], dim=-1)
top_probs, top_indices = torch.topk(probs, 10)
print(f" Top 10 predicted next characters (for position {seq_len}):")
print(f" {'Character':>10s} {'Probability':>12s} {'Bar'}")
print(f" {'─'*10} {'─'*12} {'─'*30}")
for prob, idx in zip(top_probs, top_indices):
char = chr(idx.item() + 32) if 32 <= idx.item() + 32 <= 126 else '?'
bar_len = min(int(prob.item() * 200), 30) # Cap the bar at the column width
bar = '█' * bar_len
# Convert the fraction to a percentage so the % sign is accurate
print(f" {repr(char):>10s} {prob.item() * 100:>10.2f}% {Colors.GREEN}{bar}{Colors.RESET}")
print_info("(These probabilities are arbitrary: the model is untrained)")
# --- Generation Demo ---
print_header("✨ Text Generation (Untrained)")
# Generate from a simple start
model.eval() # Switch off dropout while sampling
start = torch.zeros((1, 1), dtype=torch.long) # Start with token 0
generated = model.generate(start, max_new_tokens=50, temperature=1.0)
# Convert to "characters" (just ASCII mapping for demo)
gen_chars = ''.join([chr(min(t.item() + 32, 126)) for t in generated[0]])
print(f" Generated text (random, untrained):")
print(f" {Colors.DIM}\"{gen_chars}\"{Colors.RESET}")
print()
print_info("This is garbage because the model hasn't been trained yet!")
print_info("In Level 4, you'll train this model and watch it learn to write!")
# --- Summary ---
print_header("Summary: What You've Built!")
print(f""" You now have a complete {Colors.BOLD}Mini-Transformer{Colors.RESET} with:
✅ Token Embedding: turns characters into vectors
✅ Position Embedding: adds position information
✅ {num_blocks} Transformer Blocks: the "brain" (attention + FFN)
✅ Output Head: predicts the next character
✅ Generate method: creates new text!
Total: {Colors.BOLD}{total_params:,} parameters{Colors.RESET}
This is the SAME architecture as GPT, just much smaller.
The only difference? GPT has more blocks, bigger embeddings,
and was trained on MUCH more data.
{Colors.BOLD}🎯 You understand how AI language models work from the ground up!{Colors.RESET}
{Colors.BOLD}{Colors.GREEN}✅ Level 3 Complete! Next: Level 4, train your own Mini-GPT!{Colors.RESET}
{Colors.DIM}Run: python ../level_4_mini_gpt/train.py{Colors.RESET}
""")
Creating Your Own AI
Building and training your own language models
Building Your Own GPT
Learning Objectives
- Explain what GPT actually does (spoiler: it predicts the next token)
- Understand the full training loop, from raw text to a learning model
- Read and modify a complete GPT model architecture in PyTorch
- Train your own Mini-GPT on a text dataset
- Generate text from your trained model using different sampling strategies
- Explain temperature, top-k, and how they shape the model's "creativity"
- Critically discuss what a language model learns, and what it does not
5.1 GPT: Just a Transformer with a Job
Let's clear up something that confuses a lot of people. GPT (Generative Pre-trained Transformer) is not magic. It's not a thinking machine. It's not even, fundamentally, a new invention. GPT is simply the Transformer architecture we built in Chapter 4, but given a very specific job:
Predict the next token.
That's it. That is the core idea behind every GPT model, from your tiny Mini-GPT to OpenAI's GPT-3 with its 175 billion parameters and its successors.
Think of it like this. Suppose you're reading a Hindi sentence:
"आज मौसम बहुत ___" ("Today the weather is very ___")
What comes next? Your brain immediately suggests candidates: अच्छा (good), गर्म (hot), ठंडा (cold), खराब (bad). You're doing next-word prediction! GPT does the same thing, but with mathematics.
Given a sequence of tokens $[t_1, t_2, \ldots, t_n]$, GPT learns the conditional probability of the token that follows:
$P(t_{n+1} \mid t_1, t_2, \ldots, t_n)$
It doesn't predict just one word. It produces a probability distribution over the entire vocabulary. For every possible next token, it says: "This is how likely I think this token comes next."
Note
GPT is an autoregressive model. It generates text one token at a time, feeding each generated token back as input to predict the next one. It's like a cricket commentator: each sentence builds on what was said before.
What Makes GPT Different from BERT?
If you've heard of BERT, here's the key difference. BERT is bidirectional: it looks at context from both the left and the right. GPT is unidirectional: it can only look at what came before. This is enforced by the causal mask we'll see in the code.
Why the restriction? Because GPT's job is generation. When you're writing the next word, you can't peek at words that haven't been written yet. The causal mask ensures the model plays fair: it only uses past context to predict the future.
5.2 The Training Process
Training a GPT model involves four steps that repeat thousands of times. Let's walk through each one carefully.
Step 1: Data Preparation
Our model works at the character level: each character is a token. This is simpler than word-level or subword tokenization (such as the BPE used in production GPTs), but the principles are identical.
We take a text file (say, a collection of short stories) and do the following:
- Build a vocabulary: Find all unique characters in the text
- Create mappings: char_to_idx (character → number) and idx_to_char (number → character)
- Encode the text: Convert the entire text into a sequence of integers
- Create training pairs: For every sequence of characters, the target is the same sequence shifted by one position
For example, if our text is "namaste":
| Input | n | a | m | a | s | t |
|---|---|---|---|---|---|---|
| Target | a | m | a | s | t | e |
Every character learns to predict the character that follows it.
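The preparation steps above can be sketched in a few lines of plain Python. This is a minimal illustration on the single word "namaste", not the course's `CharDataset`; the variable names `inputs` and `targets` are chosen here for the example.

```python
text = "namaste"

# 1. Vocabulary: every unique character, sorted for a stable ordering
chars = sorted(set(text))

# 2. Mappings between characters and integer ids
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

# 3. Encode the whole text as a list of integers
encoded = [char_to_idx[ch] for ch in text]

# 4. Training pair: the target is the input shifted by one position
inputs, targets = encoded[:-1], encoded[1:]

for x, y in zip(inputs, targets):
    print(f"{idx_to_char[x]!r} -> {idx_to_char[y]!r}")
```

Each printed pair corresponds to one column of the table: the model sees the left character (plus everything before it) and is asked to predict the right one.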
Step 2: Forward Pass
A batch of input sequences goes through the model:
- Token Embedding: Each character index becomes a 128-dimensional vector
- Position Embedding: Position information is added (so the model knows the order of the characters)
- Transformer Blocks: The combined embeddings pass through 4 transformer blocks, each with multi-head attention and a feed-forward network
- Output Projection: The final layer produces logits, the raw scores for every character in the vocabulary
The output shape is (batch_size, sequence_length, vocab_size). For each position, we get a score for every possible next character.
Step 3: Cross-Entropy Loss
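We need to measure how wrong the model is. For this, we use cross-entropy loss:
$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid x_1, \ldots, x_i)$
Here $y_i$ is the correct next character at position $i$, $x_1, \ldots, x_i$ are the characters seen so far, and $N$ is the number of positions in the batch.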
In plain language: for each position, we look at the probability the model assigned to the correct next character. If the model was confident and correct, the loss is low. If it was confident and wrong, the loss is high.
Tip
At the start of training, the model assigns roughly equal probability to all characters. With a vocabulary of 65 characters, the initial loss should be around -\log(1/65) \approx 4.17. If you see this value at step 0, your model is initialized correctly!
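To make the loss values concrete, here is a small sketch that computes the per-position cross-entropy by hand for three cases. It uses only the standard library; the probability vectors are made up for illustration and do not come from a trained model.

```python
import math

def cross_entropy(probs, target):
    """Loss for one position: negative log of the probability given to the correct token."""
    return -math.log(probs[target])

vocab_size = 65

# Untrained model: every character gets probability 1/65
uniform = [1.0 / vocab_size] * vocab_size
baseline = cross_entropy(uniform, target=0)
print(f"uniform guess:       {baseline:.4f}")   # equals ln(65)

# Confident distribution: 0.90 on character 0, the rest shared equally
confident = [0.10 / (vocab_size - 1)] * vocab_size
confident[0] = 0.90
print(f"confident + correct: {cross_entropy(confident, target=0):.4f}")

# Same distribution, but the true character is one of the unlikely ones
print(f"confident + wrong:   {cross_entropy(confident, target=1):.4f}")
```

The uniform case reproduces the starting loss quoted in the tip, the confident-and-correct case falls well below it, and the confident-and-wrong case rises well above it.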
Step 4: Backward Pass
This is where the learning happens. PyTorch computes the gradient of the loss with respect to every parameter in the model using backpropagation. Then the optimizer (AdamW) updates each parameter in the direction that reduces the loss. In its simplest form, the update is:
$\theta \leftarrow \theta - \eta \, \nabla_{\theta} \mathcal{L}$
where $\theta$ stands for the model parameters and $\eta$ is the learning rate. (AdamW additionally rescales each parameter's step using running averages of its gradients and applies weight decay, but the core idea is the same.) We also apply gradient clipping: if the gradients become too large (which can happen with transformer models), we scale them down. This prevents "exploding gradients" from destabilising training.
This four-step cycle (forward pass, compute loss, backward pass, update weights) repeats 3,000 times. Each repetition is one training step.
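The clipping and update steps can be sketched without PyTorch. The code below is a simplified illustration: it uses a plain gradient-descent update rather than AdamW, and `clip_by_global_norm` is a hypothetical helper that mirrors the idea behind `clip_grad_norm_` (one shared scaling factor for all gradients) without the small epsilon PyTorch adds.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

def sgd_step(params, grads, lr):
    """Plain gradient descent update: theta <- theta - lr * grad."""
    return [p - lr * g for p, g in zip(params, grads)]

params = [0.5, -1.2, 2.0]
grads = [3.0, 4.0, 0.0]           # global norm is 5.0, above the limit of 1.0

clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
new_params = sgd_step(params, clipped, lr=3e-4)
print("norm before clipping:", norm_before)
print("clipped gradients:   ", clipped)
print("updated parameters:  ", new_params)
```

Because every gradient is multiplied by the same factor, clipping shrinks the step without changing its direction.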
5.3 The Model Architecture
Now let's look at the actual code. Our Mini-GPT lives in model.py and consists of four classes stacked together. Think of it as building a temple: you lay the foundation first, then add pillars, then the dome.
The Configuration: GPTConfig
Every model begins with its hyperparameters:
Python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    """All hyperparameters in one place, easy to experiment!"""
    # Model architecture
    vocab_size: int = 65          # Will be set based on training data
    embed_dim: int = 128          # Size of token vectors
    num_heads: int = 4            # Number of attention heads
    num_blocks: int = 4           # Number of transformer blocks
    max_seq_len: int = 256        # Maximum context length
    dropout: float = 0.1          # Dropout rate for regularization
    # Training
    batch_size: int = 32          # Samples per training step
    learning_rate: float = 3e-4   # How fast the model learns
    max_steps: int = 3000         # Total training steps
    eval_interval: int = 100      # Evaluate every N steps
    eval_steps: int = 20          # Steps per evaluation
    sample_interval: int = 500    # Generate sample every N steps
Using a dataclass keeps everything tidy. Want to experiment with 8 attention heads? Just set num_heads = 8. Want a deeper model? Set num_blocks = 6. This is how real ML research works: you tweak hyperparameters and observe what happens.
Important
The embed_dim must be divisible by num_heads. Each attention head works on a slice of the embedding: head_dim = embed_dim // num_heads. With 128 dimensions and 4 heads, each head operates on 32 dimensions.
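A quick way to catch an invalid combination before building the model is a small check like the one below. `head_dim` is a hypothetical helper written for this example; it is not part of the course's model.py.

```python
def head_dim(embed_dim, num_heads):
    """Return the per-head dimension, or raise if the embedding cannot be split evenly."""
    if embed_dim % num_heads != 0:
        raise ValueError(
            f"embed_dim={embed_dim} is not divisible by num_heads={num_heads}"
        )
    return embed_dim // num_heads

print(head_dim(128, 4))   # the default config
print(head_dim(128, 8))   # more heads, each working on a smaller slice

try:
    head_dim(128, 6)
except ValueError as err:
    print("rejected:", err)
```

Without such a check, a setting like 6 heads would only fail later, inside the attention layer's reshape, because 3 × 6 × 21 = 378 values cannot be laid out from the 384 that the Q/K/V projection produces.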
Multi-Head Self-Attention
This is the heart of the Transformer. We covered the theory in Chapter 4 โ now see it in code:
Python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_heads
        self.head_dim = config.embed_dim // config.num_heads
        self.embed_dim = config.embed_dim
        # Single linear layer projects to Q, K, V simultaneously
        self.W_qkv = nn.Linear(config.embed_dim, 3 * config.embed_dim, bias=False)
        self.W_out = nn.Linear(config.embed_dim, config.embed_dim, bias=False)
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.scale = math.sqrt(self.head_dim)

    def forward(self, x, mask=None):
        B, T, C = x.shape  # Batch, Time (sequence length), Channels (embed_dim)
        # Project to Q, K, V in one shot, then split
        qkv = self.W_qkv(x).reshape(B, T, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)  # (3, B, heads, T, head_dim)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        # Scaled dot-product attention
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        weights = self.attn_dropout(weights)
        # Weighted sum of values
        out = torch.matmul(weights, V)
        out = out.transpose(1, 2).reshape(B, T, C)  # Recombine heads
        return self.resid_dropout(self.W_out(out))
Let's trace through this carefully:
- W_qkv: Instead of three separate linear layers for Q, K, and V, we use one big layer that produces all three at once. This is a common efficiency trick: one matrix multiplication instead of three.
- Reshape and Permute: We split the output into num_heads separate attention heads. Each head gets its own slice of the embedding to work with independently.
- Scaled Dot-Product Attention: $\text{scores} = \frac{QK^T}{\sqrt{d_k}}$, where $d_k$ is head_dim. The scaling by $\sqrt{d_k}$ prevents the dot products from becoming too large, which would push softmax into regions with tiny gradients.
- Causal Mask: The mask argument is a lower-triangular matrix. Position $i$ can only attend to positions $\leq i$. This is what makes GPT autoregressive.
- Output Projection: After attention, all heads are concatenated and projected back to the embedding dimension through W_out.
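To see the scaling and the causal mask at work on numbers small enough to check by hand, here is a single-head sketch in plain Python. It is an illustration only: it leaves out the learned projections, dropout, and multiple heads, and the tiny Q, K, V values are invented for the example.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.
    Q, K, V are lists of T vectors, each of length d_k."""
    T, d_k = len(Q), len(Q[0])
    out, all_weights = [], []
    for i in range(T):
        # Score query i against every key, scaled by sqrt(d_k)
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
                  for j in range(T)]
        # Causal mask: positions j > i become -inf, so softmax gives them weight 0
        scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        all_weights.append(weights)
        # Weighted sum of the value vectors
        out.append([sum(w * V[j][c] for j, w in enumerate(weights))
                    for c in range(len(V[0]))])
    return out, all_weights

Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower-triangular: the first position can only attend to itself, the second to the first two, and so on, which is exactly the restriction the mask is meant to enforce.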
Feed-Forward Network
After attention, each position's representation is processed independently:
Python
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.embed_dim, 4 * config.embed_dim),
            nn.GELU(),
            nn.Linear(4 * config.embed_dim, config.embed_dim),
            nn.Dropout(config.dropout),
        )

    def forward(self, x):
        return self.net(x)
The feed-forward network expands the dimension by 4×, applies a non-linearity (GELU), then projects it back down. Think of this as the "thinking" step: attention gathers information from other positions, and the FFN processes that gathered information.
Note
Why GELU instead of ReLU? GELU (Gaussian Error Linear Unit) is smoother than ReLU: it doesn't have a hard cutoff at zero. Many transformer models (GPT-2, BERT, etc.) use GELU, and it often trains slightly better in practice.
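The difference between the two activations is easiest to see by evaluating them at a few points. The sketch below uses the exact erf-based definition of GELU, which is also the default behaviour of `nn.GELU()`; the sample points are arbitrary.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]:
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}")
```

For large positive inputs the two are nearly identical. The difference shows up around zero, where GELU lets small negative values through with a small negative output instead of cutting them to exactly zero.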
Transformer Block
A transformer block combines attention and feed-forward with residual connections and layer normalization:
Python
class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.norm1 = nn.LayerNorm(config.embed_dim)
        self.norm2 = nn.LayerNorm(config.embed_dim)
        self.attention = MultiHeadAttention(config)
        self.ffn = FeedForward(config)

    def forward(self, x, mask=None):
        x = x + self.attention(self.norm1(x), mask)  # Residual + Attention
        x = x + self.ffn(self.norm2(x))              # Residual + FFN
        return x
Notice the Pre-Norm design: we apply LayerNorm before attention and FFN, not after. This has been found to train more stably than the original "Post-Norm" design from the 2017 Transformer paper. The residual connection (x + ...) ensures that gradients can flow directly through the network without degradation, like an express lane on a highway.
The Complete MiniGPT Model
Now we assemble everything:
Python
class MiniGPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Token and position embeddings
        self.token_emb = nn.Embedding(config.vocab_size, config.embed_dim)
        self.pos_emb = nn.Embedding(config.max_seq_len, config.embed_dim)
        self.dropout = nn.Dropout(config.dropout)
        # Stack of transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.num_blocks)
        ])
        # Final layer norm and output projection
        self.final_norm = nn.LayerNorm(config.embed_dim)
        self.output_head = nn.Linear(config.embed_dim, config.vocab_size, bias=False)
        # Weight tying: reuse token embedding weights for output
        self.output_head.weight = self.token_emb.weight
        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        """Initialize weights for better training."""
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        # Embeddings
        tok = self.token_emb(idx)                                # (B, T, embed_dim)
        pos = self.pos_emb(torch.arange(T, device=idx.device))   # (T, embed_dim)
        x = self.dropout(tok + pos)
        # Causal mask
        mask = torch.tril(torch.ones(T, T, device=idx.device)).unsqueeze(0)
        # Pass through transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        # Project to vocabulary
        x = self.final_norm(x)
        logits = self.output_head(x)                             # (B, T, vocab_size)
        # Compute loss if targets provided
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, self.config.vocab_size),
                targets.view(-1)
            )
        return logits, loss
Let's highlight three important design choices:
- Weight Tying: self.output_head.weight = self.token_emb.weight makes the input embedding and output projection share the same weight matrix. The intuition: the embedding maps characters into the vector space, and the output head maps vectors back to characters. These are roughly inverse operations, so sharing weights makes sense. It also removes one vocab_size × embed_dim matrix from the parameter count: a small saving with our 65-character vocabulary, but a large one for models whose vocabularies hold tens of thousands of tokens.
- Causal Mask: torch.tril(torch.ones(T, T)) creates a lower-triangular matrix of ones. When applied in attention, position $i$ can only attend to positions $0, 1, \ldots, i$. This ensures the model can't "cheat" by looking at future tokens.
- Weight Initialization: Weights are drawn from a normal distribution with mean 0 and standard deviation 0.02. This is the standard deviation used in the original GPT paper. Too large and training is unstable; too small and the model learns too slowly.
Tip
With the default GPTConfig shown above, our Mini-GPT has roughly 0.8 million parameters. For perspective, GPT-2 Small has 124 million, and GPT-3 has 175 billion. Despite being tiny, our model can still learn interesting patterns from text!
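A parameter count is easy to cross-check by hand from the layer definitions in this section. The sketch below does the arithmetic for the default GPTConfig; `count_parameters` is a hypothetical helper written for this example, and it assumes exactly the layers shown above (bias-free attention projections, biased feed-forward layers, tied output head).

```python
def count_parameters(vocab_size=65, embed_dim=128, num_blocks=4, max_seq_len=256):
    """Count trainable parameters for the layer layout shown in this section."""
    # Attention: W_qkv and W_out, both without bias
    attention = embed_dim * 3 * embed_dim + embed_dim * embed_dim
    # Feed-forward: two linear layers with bias, hidden size 4 * embed_dim
    hidden = 4 * embed_dim
    ffn = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    # Two LayerNorms per block, each with a weight and a bias vector
    norms = 2 * (2 * embed_dim)
    per_block = attention + ffn + norms

    token_emb = vocab_size * embed_dim      # shared with the output head (weight tying)
    pos_emb = max_seq_len * embed_dim
    final_norm = 2 * embed_dim
    return num_blocks * per_block + token_emb + pos_emb + final_norm

print(f"{count_parameters():,}")            # 832,384 for the default config
print(f"{count_parameters(embed_dim=256):,}")
```

On a real model the same number should come out of `sum(p.numel() for p in model.parameters())`. Note that almost all parameters sit in the transformer blocks, so doubling embed_dim roughly quadruples the total.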
5.4 Training Your Mini-GPT
Let's look at the training script (train.py). We'll break it into logical stages.
Stage 1: Data Loading and Tokenization
Python
import torch

class CharDataset:
    """Character-level dataset for language modeling."""
    def __init__(self, text, config):
        self.config = config
        # Build vocabulary from the text
        chars = sorted(list(set(text)))
        self.char_to_idx = {ch: i for i, ch in enumerate(chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(chars)}
        self.vocab_size = len(chars)
        # Encode entire text
        self.data = torch.tensor(
            [self.char_to_idx[ch] for ch in text], dtype=torch.long
        )
        # Train/validation split (90/10)
        n = int(0.9 * len(self.data))
        self.train_data = self.data[:n]
        self.val_data = self.data[n:]

    def encode(self, text):
        return [self.char_to_idx.get(ch, 0) for ch in text]

    def decode(self, indices):
        return ''.join([self.idx_to_char.get(i, '?') for i in indices])

    def get_batch(self, split='train'):
        """Get a random batch of training or validation data."""
        data = self.train_data if split == 'train' else self.val_data
        seq_len = self.config.max_seq_len
        batch_size = self.config.batch_size
        # Random starting positions
        ix = torch.randint(len(data) - seq_len - 1, (batch_size,))
        # Input and target sequences (target is shifted by 1)
        x = torch.stack([data[i:i+seq_len] for i in ix])
        y = torch.stack([data[i+1:i+seq_len+1] for i in ix])
        return x, y
The get_batch method is where the training data comes from. Each call:
- Picks batch_size random starting positions in the text
- Extracts sequences of length max_seq_len starting from each position
- Creates target sequences that are shifted by one character
Because the starting positions are drawn at random, the model is very unlikely to see exactly the same batch twice. The overlapping windows over the same text act as a cheap source of variety built right into the sampling process.
Stage 2: The Evaluation Function
Python
@torch.no_grad()
def estimate_loss(model, dataset, config):
    """Estimate average loss on train and validation sets."""
    model.eval()
    losses = {}
    for split in ['train', 'val']:
        total_loss = 0.0
        for _ in range(config.eval_steps):
            x, y = dataset.get_batch(split)
            _, loss = model(x, targets=y)
            total_loss += loss.item()
        losses[split] = total_loss / config.eval_steps
    model.train()
    return losses
We evaluate on both training and validation data. If the training loss keeps going down but the validation loss starts going up, that's overfitting: the model is memorising the training data instead of learning general patterns. Think of a student who memorises answers without understanding concepts. They do great on practice papers but fail on unseen questions.
Warning
The @torch.no_grad() decorator is important during evaluation. Without it, PyTorch would build and keep a computation graph for every evaluation step, wasting memory and slowing things down. Note that model.eval() does a different job: it switches layers such as dropout to inference behaviour but does not turn off gradient tracking. Use both when you're not training.
Stage 3: The Training Loop
Here's the core of train.py, the actual training loop:
Python
# Setup (`text` is the training corpus, already loaded as one string)
config = GPTConfig()
dataset = CharDataset(text, config)
config.vocab_size = dataset.vocab_size
model = MiniGPT(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
model.train()
best_val_loss = float('inf')

for step in range(config.max_steps):
    # Get a random batch
    x, y = dataset.get_batch('train')
    # Forward pass: compute predictions and loss
    logits, loss = model(x, targets=y)
    # Backward pass: compute gradients
    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping: prevent exploding gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    # Update weights
    optimizer.step()

    # Evaluate periodically
    if (step + 1) % config.eval_interval == 0:
        losses = estimate_loss(model, dataset, config)
        # Save best model
        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            torch.save({
                'model_state_dict': model.state_dict(),
                'config': config,
                'char_to_idx': dataset.char_to_idx,
                'idx_to_char': dataset.idx_to_char,
                'vocab_size': dataset.vocab_size,
                'step': step + 1,
                'val_loss': best_val_loss,
            }, 'mini_gpt_model.pt')

    # Generate sample text periodically
    if (step + 1) % config.sample_interval == 0:
        model.eval()
        start_tokens = torch.zeros((1, 1), dtype=torch.long)
        generated = model.generate(start_tokens, max_new_tokens=150, temperature=0.8)
        gen_text = dataset.decode(generated[0].tolist())
        model.train()
Let's unpack the key choices:
- AdamW Optimizer: Adam with weight decay. It's the go-to optimizer for transformers: it adapts the step size for each parameter individually, which usually works much better than plain SGD for these architectures.
- Gradient Clipping (clip_grad_norm_ with max norm 1.0): Transformers can occasionally produce very large gradients. Clipping prevents these from causing catastrophic weight updates.
- Model Checkpointing: We save the model whenever the validation loss improves. This means even if training gets worse later (overfitting), we keep the best version. It's like taking a photo of the scoreboard when your team is winning, just in case!
5.5 Watching It Learn
One of the most magical moments in AI is watching your model go from producing complete garbage to generating coherent text. Here's what you'll see at different stages:
Step 0 (Before Training)
"xK&mQ!zP;yWjR#3nL@fT$8vUoC*1bHi^9dAe"
The model knows nothing. It assigns roughly equal probability to every character, so the output is pure random noise, like a monkey typing on a keyboard.
Step 500 (Early Training)
"the the the and the was a the of the"
The model has learned the most basic pattern: common words exist. It produces recognisable English words, but just repeats them with no structure. It's like a toddler who knows a few words but can't form sentences.
Step 1500 (Mid Training)
"The king was a great and the people of the village were happy."
Now we see grammar emerging! The model has learned that sentences start with capital letters, contain subjects and verbs, and end with periods. The sentences make superficial sense, even if the overall narrative is disjointed.
Step 3000 (Final)
"The old woman lived in a small village near the river. She would
walk every morning to collect water and bring it back to her home."
At this stage, the model produces text that reads like coherent prose. It maintains a topic across multiple sentences, uses proper punctuation, and even shows a sense of narrative flow.
Note
The quality of generated text depends heavily on your training data. If you train on stories, the model writes stories. If you train on code, it writes code. If you train on Bollywood song lyrics, it will write lyrics! The model mirrors whatever patterns exist in its training data.
5.6 Chatting with Your Model
Once training is complete, you can have an interactive conversation with your model using generate.py:
Loading the Trained Model
Python
import os

import torch

from model import MiniGPT  # model.py from Section 5.3

def load_model():
    """Load the trained model from checkpoint."""
    model_path = os.path.join(
        os.path.dirname(os.path.abspath(__file__)), 'mini_gpt_model.pt'
    )
    # weights_only=False is needed because the checkpoint stores a GPTConfig object;
    # only load checkpoint files you created yourself or otherwise trust.
    checkpoint = torch.load(model_path, map_location='cpu', weights_only=False)
    config = checkpoint['config']
    char_to_idx = checkpoint['char_to_idx']
    idx_to_char = checkpoint['idx_to_char']
    model = MiniGPT(config)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()
    return model, char_to_idx, idx_to_char
The checkpoint file contains everything needed to reconstruct the model: the architecture configuration, the trained weights, and the character vocabulary mappings. This is why we saved all of these during training: without the char_to_idx mapping, we wouldn't know which number corresponds to which character.
Generating Text
Python
def generate_text(model, prompt, char_to_idx, idx_to_char,
                  max_tokens=200, temperature=0.8, top_k=20):
    """Generate text from a prompt."""
    # Encode prompt characters to indices (unknown characters map to index 0)
    encoded = [char_to_idx.get(ch, 0) for ch in prompt]
    input_ids = torch.tensor([encoded], dtype=torch.long)
    # Generate autoregressively
    with torch.no_grad():
        output = model.generate(
            input_ids, max_new_tokens=max_tokens,
            temperature=temperature, top_k=top_k
        )
    # Decode indices back to characters
    generated = ''.join([idx_to_char.get(i, '?') for i in output[0].tolist()])
    return generated
And here's the autoregressive generation loop inside the model:
Python
# Method of the MiniGPT class
@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
    self.eval()
    for _ in range(max_new_tokens):
        # Crop to max context length
        idx_crop = idx[:, -self.config.max_seq_len:]
        # Forward pass: get logits for all positions
        logits, _ = self(idx_crop)
        logits = logits[:, -1, :] / temperature  # Only the last position matters
        # Top-k filtering
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float('-inf')
        # Sample from the distribution
        probs = F.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_tok], dim=1)
    return idx
Notice the generation loop: at each step, we feed the entire sequence so far into the model, but we only care about the logits at the last position (logits[:, -1, :]). That last position's output contains the model's prediction for what comes next. We sample from this distribution, append the sampled token, and repeat.
Tip
The idx_crop line is important. Our model has a maximum context length of 256 characters. If the generated sequence grows beyond this, we crop it to the last 256 characters. This means the model "forgets" the very beginning of long sequences, a fundamental limitation of fixed-context-length models.
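The cropping itself is just a negative slice. The toy example below shows the same idea on a Python list with a window of 8 instead of 256, so the effect is visible; `crop_context` is a hypothetical helper named for this illustration.

```python
def crop_context(tokens, max_seq_len):
    """Keep only the most recent max_seq_len tokens (the model's context window)."""
    return tokens[-max_seq_len:]

max_seq_len = 8                      # tiny window so the effect is visible
tokens = list("Once upon")           # 9 characters, one more than the window

window = crop_context(tokens, max_seq_len)
print("".join(window))               # the leading 'O' has been dropped

# A sequence shorter than the window is returned unchanged
print("".join(crop_context(list("Once"), max_seq_len)))
```

In the model, `idx[:, -self.config.max_seq_len:]` does the same thing along the time dimension of a (batch, time) tensor.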
The Interactive Loop
The interactive chat allows you to type prompts and adjust settings on the fly:
Python
# Settings
temperature = 0.8
top_k = 20
max_tokens = 200

while True:
    prompt = input("You > ")
    if prompt.strip().lower() == 'quit':
        break
    # Handle setting commands
    if prompt.startswith('temp '):
        temperature = float(prompt.split()[1])
        continue
    if prompt.startswith('topk '):
        top_k = int(prompt.split()[1])
        continue
    # Generate and display
    generated = generate_text(
        model, prompt, char_to_idx, idx_to_char,
        max_tokens=max_tokens, temperature=temperature, top_k=top_k
    )
    print(generated)
Type any text and the model continues it. Type temp 0.3 for more conservative output, or temp 1.5 for wilder creativity. But what do these settings actually mean? Let's find out.
5.7 Temperature, Top-k, and Sampling Strategies
When your model produces logits for the next character, how do we choose which character to actually use? This is where sampling strategies come in, and they have a dramatic effect on the output.
Temperature
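Temperature controls the "sharpness" of the probability distribution. The logits are divided by the temperature before applying softmax:
$p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}$
where $z_i$ are the raw logits, $\tau$ is the temperature, and $p_i$ is the resulting probability of token $i$.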
Consider a model predicting the next character after "Ind". Suppose the raw logits produce the following probabilities at each temperature (illustrative, rounded values):
| Character | \tau = 0.3 (Low) | \tau = 0.8 (Default) | \tau = 1.0 (Normal) | \tau = 1.5 (High) |
|---|---|---|---|---|
| i | 0.85 | 0.45 | 0.35 | 0.24 |
| u | 0.10 | 0.20 | 0.20 | 0.18 |
| e | 0.04 | 0.15 | 0.18 | 0.17 |
| o | 0.01 | 0.10 | 0.12 | 0.14 |
| a | ~0.00 | 0.05 | 0.08 | 0.12 |
| others | ~0.00 | 0.05 | 0.07 | 0.15 |
At low temperature ($\tau = 0.3$), the model is very confident: it almost always picks "i" (making "Indi" → "India"). The output is predictable, repetitive, but safe.
At high temperature ($\tau = 1.5$), probabilities are spread out. The model might pick "u" (making "Indu" → "Industry") or even "e" (making "Inde" → "Indeed"). The output is more creative but also more prone to nonsense.
Important
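Temperature = 0 is a special case. Dividing the logits by zero is undefined, so it is handled separately as greedy decoding: always pick the most likely token. (The generate method shown earlier has no such branch, so pass a small positive value such as 0.1 rather than 0.) Greedy decoding produces the most predictable output but often leads to repetitive, boring text ("the the the the..."). In practice, a temperature between 0.7 and 0.9 is a sensible starting range for a model like this.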
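The effect of temperature can be reproduced with a few lines of plain Python. The sketch below is an illustration, not the model's own code: the logits are hypothetical values for five candidate characters, and temperature 0 is treated explicitly as greedy decoding instead of dividing by zero.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; temperature 0 is handled as greedy (argmax)."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for five candidate characters
logits = [2.0, 1.4, 1.3, 0.9, 0.5]

for tau in [0, 0.3, 0.8, 1.0, 1.5]:
    probs = softmax_with_temperature(logits, tau)
    print(f"tau={tau:<4} " + "  ".join(f"{p:.2f}" for p in probs))
```

Reading the rows from top to bottom, the probability of the leading candidate shrinks and the tail grows as the temperature rises, which is the pattern the table describes.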
Top-k Sampling
Top-k restricts the model to only consider the k most likely characters, setting all other probabilities to zero. This prevents the model from ever choosing very unlikely characters (which tend to be nonsensical).
Continuing our example with "Ind" and top-k = 3:
| Character | Original Probability | After Top-3 Filtering | After Renormalisation |
|---|---|---|---|
| i | 0.35 | 0.35 | 0.48 |
| u | 0.20 | 0.20 | 0.27 |
| e | 0.18 | 0.18 | 0.25 |
| o | 0.12 | ~~0.00~~ | 0.00 |
| a | 0.08 | ~~0.00~~ | 0.00 |
| others | 0.07 | ~~0.00~~ | 0.00 |
Only "i", "u", and "e" survive. The probabilities are renormalised to sum to 1, and we sample from this filtered distribution.
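The renormalised column can be checked with a short script. This sketch works directly on probabilities and keeps exactly k entries; the model's generate method reaches the same result by setting the filtered logits to negative infinity before softmax (and keeps ties at the threshold). `top_k_filter` is a hypothetical helper for this example, and the aggregate "others" row is left out.

```python
def top_k_filter(probs, k):
    """Zero out everything except the k largest probabilities, then renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

chars = ["i", "u", "e", "o", "a"]
probs = [0.35, 0.20, 0.18, 0.12, 0.08]   # values from the table above

for ch, p in zip(chars, top_k_filter(probs, k=3)):
    print(f"{ch}: {p:.2f}")
```

The three surviving probabilities are each divided by their sum (0.35 + 0.20 + 0.18 = 0.73), which gives the 0.48, 0.27, and 0.25 shown in the table.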
Combining Temperature and Top-k
In practice, we use both together. Temperature controls the shape of the distribution, and top-k provides a safety net against low-probability nonsense. Our default settings (temperature=0.8, top_k=20) give a good balance between creativity and coherence.
A practical guide:
| Use Case | Temperature | Top-k | Result |
|---|---|---|---|
| Factual / predictable | 0.3 | 5 | Very conservative output |
| Story continuation | 0.7-0.8 | 20 | Balanced and coherent |
| Creative brainstorming | 1.0 | 40 | Diverse and surprising |
| Experimental / chaotic | 1.5 | None | Wild, often nonsensical |
Tip
When experimenting, change one parameter at a time. Set temp 0.3 and generate. Then set temp 1.5 and generate the same prompt. Compare the results. This builds intuition much faster than reading about it!
5.8 Discussion: What Did Your Model Actually Learn?
After training, your Mini-GPT can produce text that looks surprisingly coherent. But let's be honest with ourselves: what has it actually learned?
What It HAS Learned
Character Frequencies: The model knows that 'e' is the most common letter in English, that spaces appear between words, and that 'q' is almost always followed by 'u'.
Word Structure: It has internalised the spelling of common words such as "the", "and", "was", "village", "morning". It rarely produces non-words after sufficient training.
Grammar Patterns: Subject-verb-object ordering, article-noun pairs ("the king", "a village"), and verb tenses are all captured. It doesn't know grammar rules; it has learned statistical patterns that happen to align with grammar.
Punctuation and Formatting: Sentences start with capitals and end with periods. Dialogue uses quotation marks. Paragraphs have line breaks.
Thematic Coherence: Within a short span, the model can maintain a topic. If it starts writing about a king, the next few sentences will likely continue about the king.
What It Has NOT Learned
True Understanding: The model doesn't know what a "king" is. It doesn't know that kings rule kingdoms, wear crowns, or exist in the physical world. It only knows that the character sequence "king" tends to appear near sequences like "queen", "throne", "kingdom".
Logic and Reasoning: Ask your model to solve "2 + 3" and it might output "5", not because it understands arithmetic, but because it has seen "2 + 3 = 5" in text. Ask "2847 + 9283" and it will likely fail.
Factual Knowledge: Your model might write "Delhi is the capital of India", but only if something similar appeared in the training data. It doesn't know facts; it reproduces patterns.
Long-Range Coherence: Our model's context is 256 characters, roughly 40-50 words. It cannot maintain a plot across paragraphs or remember a character introduced 1,000 tokens ago. Larger models with longer contexts do better, but even they struggle with book-length coherence.
Think of it like a very talented mimic. A mimic can perfectly reproduce the accent, rhythm, and vocabulary of a native Hindi speaker without understanding a single word of Hindi. Your GPT is doing the same thing with text: reproducing the form of language without grasping its meaning.
Note
This is one of the deepest debates in AI today. Some researchers argue that sufficiently large language models do develop a form of understanding. Others insist they remain "stochastic parrots": impressive pattern matchers, nothing more. Where you stand on this question will shape how you think about AI's future.
5.9 Key Concepts Summary
| Concept | Definition |
|---|---|
| GPT | A Transformer model trained to predict the next token in a sequence |
| Autoregressive | Generating one token at a time, feeding each output back as input |
| Causal Mask | Lower-triangular matrix that prevents attending to future positions |
| Cross-Entropy Loss | Measures how well the predicted probability matches the true next token |
| Weight Tying | Sharing weights between the input embedding and output projection |
| Pre-Norm | Applying LayerNorm before (not after) attention and FFN sublayers |
| AdamW | Adam optimizer with decoupled weight decay; standard for transformers |
| Gradient Clipping | Capping gradient magnitudes to prevent training instability |
| Temperature | Controls the sharpness of the sampling distribution ($\tau < 1$ = conservative, $\tau > 1$ = creative) |
| Top-k Sampling | Restricting sampling to the k most probable tokens |
| Overfitting | When train loss decreases but validation loss increases; the model is memorising, not learning |
| Checkpoint | A saved snapshot of the model's weights and configuration |
5.10 Exercises
Exercise 1: Trace the Dimensions (Pen and Paper)
Take an input batch of shape (batch_size=2, seq_len=8) with vocab_size=65 and embed_dim=128. Trace the shape of the tensor through every layer of the model: token embedding → position embedding → transformer block → final norm → output logits. Write down the shape at each stage.
Exercise 2: Experiment with Hyperparameters
Modify GPTConfig and retrain the model. Try each of these independently and record the best validation loss:
- embed_dim = 64 (smaller model)
- embed_dim = 256 (larger model)
- num_blocks = 2 (shallower)
- num_blocks = 8 (deeper)
- learning_rate = 1e-3 (faster learning)
- learning_rate = 1e-4 (slower learning)
Which change helps the most? Which hurts? Why do you think that is?
Exercise 3: Temperature Explorer
Write a script that generates text from the same prompt ("Once upon a time") at temperatures 0.1, 0.5, 0.8, 1.0, 1.5, and 2.0. Print all outputs side by side. At what temperature does the output become unreadable?
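
Before running the full experiment, it can help to see what temperature does to a probability distribution on its own. The sketch below uses three made-up logits (an assumption for illustration, not values from the model) and the hypothetical helper `softmax_with_temperature`; it mirrors the `logits / temperature` step in `generate`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]                     # made-up scores for three tokens
results = {}
for t in (0.1, 0.5, 1.0, 2.0):
    results[t] = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in results[t]]}")
```

At low temperature almost all probability lands on the top token; as the temperature rises the distribution flattens, so unlikely characters get sampled more often.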
Exercise 4: Train on Your Own Data
Find a text file of your choice: it could be a collection of Panchatantra stories in English, Bollywood movie dialogues, or even your own writing. Train the model on it. How does the domain of training data affect what the model generates?
Exercise 5: Implement Top-p (Nucleus) Sampling
Top-k has a limitation: sometimes the top-5 tokens capture 99% of the probability, and sometimes they capture only 40%. Top-p sampling (also called nucleus sampling) is an alternative: instead of keeping the top-k tokens, keep the smallest set of tokens whose cumulative probability exceeds p (e.g., p = 0.9).
Implement top-p sampling in the generate method. Compare its output with top-k. Which do you prefer?
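
As a starting point, the filtering rule can be sketched on a plain list of probabilities. This is a sketch under stated assumptions, not the finished exercise: `top_p_filter` is a hypothetical helper, it treats "reaches or exceeds p" as the stopping condition, and wiring it into `generate` with tensors is left to you.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches p, zero out the rest, and renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered

flat = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)     # needs three tokens to reach 0.9
peaked = top_p_filter([0.97, 0.01, 0.01, 0.01], p=0.9)  # one token is already enough
print(flat)
print(peaked)
```

Unlike top-k, the number of surviving tokens adapts to the shape of the distribution: three tokens survive in the flat case and only one in the peaked case.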
Exercise 6: The Perplexity Metric
Perplexity is a common metric for language models, defined as:
$\text{PPL} = e^{\mathcal{L}}$
where $\mathcal{L}$ is the cross-entropy loss. A perplexity of 10 means the model is, on average, as confused as if it were choosing between 10 equally likely options. Write a function that computes your model's perplexity on the validation set. What value do you get? How does it change with different hyperparameters?
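
A quick sanity check of the formula, as a sketch: it assumes the loss is measured in nats (natural logarithm), which is what `F.cross_entropy` reports. The helper name `perplexity` is hypothetical, and averaging the loss over the validation set is still the exercise.

```python
import math

def perplexity(cross_entropy_loss):
    """PPL = e^L, with the loss L measured in nats."""
    return math.exp(cross_entropy_loss)

# A model that guesses uniformly over 65 characters has loss ln(65).
uniform_loss = math.log(65)
print(f"loss={uniform_loss:.3f} -> perplexity={perplexity(uniform_loss):.1f}")

# A loss of ln(10) corresponds to choosing between 10 equally likely options.
print(f"loss={math.log(10):.3f} -> perplexity={perplexity(math.log(10)):.1f}")
```

This gives a useful baseline: with a 65-character vocabulary, an untrained model should start near a perplexity of 65, and training should push it well below that.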
Exercise 7: Attention Visualisation
Modify the MultiHeadAttention class to return the attention weights along with the output. Write a script that:
- Feeds a short sentence into the model
- Extracts attention weights from each head and each layer
- Plots a heatmap showing which characters attend to which other characters
What patterns do you observe? Do different heads learn different patterns?
> [!IMPORTANT]
> What's Next? You've now built a complete language model, from architecture to training to generation. But our model is tiny and trains on a small dataset. In the next chapter, we'll explore how to scale up: larger models, better data, and the techniques that make billion-parameter models possible. We'll also discuss the ethical implications of large language models, a topic that every AI practitioner in India and globally must grapple with.

"The measure of intelligence is the ability to change." - Albert Einstein
Your Mini-GPT changes its weights 3,000 times during training. Whether that constitutes intelligence is a question we'll keep exploring.
Complete Source Code - Chapter 5
Below are the complete, runnable source files for this chapter. Every line is included.
Complete Code: model.py
Python
"""
🔴 Level 4: Mini-GPT Model Definition
========================================
A complete, self-contained GPT model for character-level language modeling.
This file defines the model architecture and can be imported by train.py and generate.py.

Architecture:
  - Character-level tokenization
  - Learned positional embeddings
  - 4 Transformer blocks with 4 attention heads
  - 128-dimensional embeddings
  - ~0.8M parameters (with the default vocab_size of 65)
"""
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass
# ============================================================================
# ⚙️ CONFIGURATION
# ============================================================================
@dataclass
class GPTConfig:
    """All hyperparameters in one place, so they are easy to experiment with."""
    # Model architecture
    vocab_size: int = 65          # Will be set based on training data
    embed_dim: int = 128          # Size of token vectors
    num_heads: int = 4            # Number of attention heads
    num_blocks: int = 4           # Number of transformer blocks
    max_seq_len: int = 256        # Maximum context length
    dropout: float = 0.1          # Dropout rate for regularization
    # Training
    batch_size: int = 32          # Samples per training step
    learning_rate: float = 3e-4   # How fast the model learns
    max_steps: int = 3000         # Total training steps
    eval_interval: int = 100      # Evaluate every N steps
    eval_steps: int = 20          # Steps per evaluation
    sample_interval: int = 500    # Generate sample every N steps

    def __str__(self):
        lines = [f" {k:20s}: {v}" for k, v in self.__dict__.items()]
        return "\n".join(lines)
# ============================================================================
# Multi-Head Self-Attention
# ============================================================================
class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_heads
        self.head_dim = config.embed_dim // config.num_heads
        self.embed_dim = config.embed_dim
        # One projection produces queries, keys and values together
        self.W_qkv = nn.Linear(config.embed_dim, 3 * config.embed_dim, bias=False)
        self.W_out = nn.Linear(config.embed_dim, config.embed_dim, bias=False)
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        self.scale = math.sqrt(self.head_dim)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        qkv = self.W_qkv(x).reshape(B, T, 3, self.num_heads, self.head_dim)
        qkv = qkv.permute(2, 0, 3, 1, 4)         # (3, B, heads, T, head_dim)
        Q, K, V = qkv[0], qkv[1], qkv[2]
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale   # (B, heads, T, T)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        weights = self.attn_dropout(weights)
        out = torch.matmul(weights, V)            # (B, heads, T, head_dim)
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.resid_dropout(self.W_out(out))

# ============================================================================
# Feed-Forward Network
# ============================================================================
class FeedForward(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(config.embed_dim, 4 * config.embed_dim),
            nn.GELU(),  # GELU is smoother than ReLU and is used in modern models
            nn.Linear(4 * config.embed_dim, config.embed_dim),
            nn.Dropout(config.dropout),
        )

    def forward(self, x):
        return self.net(x)

# ============================================================================
# 🧱 Transformer Block
# ============================================================================
class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.norm1 = nn.LayerNorm(config.embed_dim)
        self.norm2 = nn.LayerNorm(config.embed_dim)
        self.attention = MultiHeadAttention(config)
        self.ffn = FeedForward(config)

    def forward(self, x, mask=None):
        # Pre-norm: LayerNorm is applied before each sublayer, then the residual is added
        x = x + self.attention(self.norm1(x), mask)
        x = x + self.ffn(self.norm2(x))
        return x
# ============================================================================
# MINI-GPT MODEL
# ============================================================================
class MiniGPT(nn.Module):
    """
    A complete GPT-style language model for character-level text generation.

    This model:
      - Takes a sequence of character indices
      - Processes them through embedding + transformer blocks
      - Predicts the probability of the next character
      - Can generate new text autoregressively
    """
    def __init__(self, config):
        super().__init__()
        self.config = config
        # Token and position embeddings
        self.token_emb = nn.Embedding(config.vocab_size, config.embed_dim)
        self.pos_emb = nn.Embedding(config.max_seq_len, config.embed_dim)
        self.dropout = nn.Dropout(config.dropout)
        # Stack of transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(config) for _ in range(config.num_blocks)
        ])
        # Final layer norm and output projection
        self.final_norm = nn.LayerNorm(config.embed_dim)
        self.output_head = nn.Linear(config.embed_dim, config.vocab_size, bias=False)
        # Weight tying: the output projection shares the token embedding matrix
        self.output_head.weight = self.token_emb.weight
        # Initialize weights
        self.apply(self._init_weights)
        # Cache the parameter count (the tied weight is counted once)
        n_params = sum(p.numel() for p in self.parameters())
        self._param_count = n_params

    def _init_weights(self, module):
        """Initialize weights for better training."""
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        """
        Forward pass.

        Args:
            idx: (batch, seq_len) input token indices
            targets: (batch, seq_len) target token indices (optional)
        Returns:
            logits: (batch, seq_len, vocab_size)
            loss: scalar (if targets provided)
        """
        B, T = idx.shape
        # Embeddings
        tok = self.token_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = self.dropout(tok + pos)
        # Causal mask
        mask = torch.tril(torch.ones(T, T, device=idx.device)).unsqueeze(0)
        # Transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        # Output
        x = self.final_norm(x)
        logits = self.output_head(x)
        # Loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, self.config.vocab_size), targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Generate new tokens autoregressively.

        Note: this switches the model to eval mode; call model.train() again
        before resuming training.

        Args:
            idx: (1, seq_len) starting tokens
            max_new_tokens: how many tokens to generate
            temperature: creativity control (0.1=safe, 1.0=normal, 1.5=creative)
            top_k: only consider top-k most likely tokens
        Returns:
            idx: (1, seq_len + max_new_tokens)
        """
        self.eval()
        for _ in range(max_new_tokens):
            # Crop to max context
            idx_crop = idx[:, -self.config.max_seq_len:]
            # Forward pass
            logits, _ = self(idx_crop)
            logits = logits[:, -1, :] / temperature
            # Top-k filtering
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float('-inf')
            # Sample
            probs = F.softmax(logits, dim=-1)
            next_tok = torch.multinomial(probs, num_samples=1)
            idx = torch.cat([idx, next_tok], dim=1)
        return idx

    def count_parameters(self):
        """Return total number of parameters."""
        return self._param_count
if __name__ == '__main__':
    # Quick test
    print("\n\033[1m\033[95m" + "="*50)
    print(" 🔴 Mini-GPT Model Test")
    print("="*50 + "\033[0m\n")
    config = GPTConfig(vocab_size=65)
    model = MiniGPT(config)
    print(f" \033[1mConfiguration:\033[0m")
    print(config)
    n_params = model.count_parameters()
    print(f"\n \033[1m\033[93mTotal parameters: {n_params:,}\033[0m")
    print(f" That's {n_params/1e6:.2f}M parameters, tiny compared to GPT-2 (124M)!\n")
    # Test forward pass
    x = torch.randint(0, 65, (2, 32))
    logits, loss = model(x, targets=x)
    print(f" Forward pass test:")
    print(f" Input: {list(x.shape)}")
    print(f" Output: {list(logits.shape)}")
    print(f" Loss: {loss.item():.4f}")
    # Test generation
    start = torch.zeros((1, 1), dtype=torch.long)
    gen = model.generate(start, max_new_tokens=20)
    print(f" Generated tokens: {gen[0].tolist()}")
    print(f"\n \033[1m\033[92m✅ Model works! Ready for training.\033[0m\n")
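
The parameter total that model.py prints can be checked by hand. The sketch below is a back-of-the-envelope count, not part of the project files: the function name `count_mini_gpt_params` is hypothetical, and it assumes the layer layout shown above (no biases in attention, biases in the feed-forward layers, output head tied to the token embedding).

```python
def count_mini_gpt_params(vocab_size=65, embed_dim=128, num_blocks=4, max_seq_len=256):
    """Count trainable parameters for the layer layout used in model.py."""
    # Attention: W_qkv (embed_dim -> 3*embed_dim) and W_out, both without bias
    attention = embed_dim * 3 * embed_dim + embed_dim * embed_dim
    # Feed-forward: two linear layers with biases and a 4x hidden expansion
    hidden = 4 * embed_dim
    ffn = (embed_dim * hidden + hidden) + (hidden * embed_dim + embed_dim)
    # Each LayerNorm has a weight and a bias vector
    layer_norm = 2 * embed_dim
    block = attention + ffn + 2 * layer_norm
    # Token and position embeddings; the output head reuses the token embedding
    embeddings = vocab_size * embed_dim + max_seq_len * embed_dim
    return num_blocks * block + embeddings + layer_norm   # + final LayerNorm

total = count_mini_gpt_params()
print(f"{total:,} parameters ({total / 1e6:.2f}M)")
```

By this count the default configuration has about 0.83M parameters. Note that the number of heads does not appear: splitting the same projection into more heads changes how it is used, not how many weights it has.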
Complete Code: train.py
Python
"""
🔴 Level 4: Train Your Mini-GPT!
====================================
This script trains your Mini-GPT model on the stories dataset.
Watch it go from random gibberish to coherent text!

Usage:
    python train.py

The training takes ~5-10 minutes on CPU.
"""
import os
import sys
import time
import math
import torch
# Add this script's directory to the path so we can import model
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from model import MiniGPT, GPTConfig
# ============================================================================
# 🎨 Colors
# ============================================================================
class Colors:
    HEADER = '\033[95m'
    BLUE = '\033[94m'
    CYAN = '\033[96m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    DIM = '\033[2m'
    RESET = '\033[0m'

def print_header(text):
    print(f"\n{Colors.BOLD}{Colors.HEADER}{'='*60}")
    print(f" {text}")
    print(f"{'='*60}{Colors.RESET}\n")

def print_step(num, text):
    print(f"{Colors.BOLD}{Colors.CYAN}Step {num}: {text}{Colors.RESET}")

def print_info(text):
    print(f" {Colors.DIM}{text}{Colors.RESET}")

def print_success(text):
    print(f" {Colors.GREEN}✓ {text}{Colors.RESET}")
# ============================================================================
# 📦 Data Loading & Tokenization
# ============================================================================
class CharDataset:
    """Character-level dataset for language modeling."""
    def __init__(self, text, config):
        self.config = config
        # Build vocabulary from the text
        chars = sorted(list(set(text)))
        self.char_to_idx = {ch: i for i, ch in enumerate(chars)}
        self.idx_to_char = {i: ch for i, ch in enumerate(chars)}
        self.vocab_size = len(chars)
        # Encode entire text
        self.data = torch.tensor([self.char_to_idx[ch] for ch in text], dtype=torch.long)
        # Train/validation split (90/10)
        n = int(0.9 * len(self.data))
        self.train_data = self.data[:n]
        self.val_data = self.data[n:]

    def encode(self, text):
        # Unknown characters fall back to index 0
        return [self.char_to_idx.get(ch, 0) for ch in text]

    def decode(self, indices):
        return ''.join([self.idx_to_char.get(i, '?') for i in indices])

    def get_batch(self, split='train'):
        """Get a random batch from the training or validation split."""
        data = self.train_data if split == 'train' else self.val_data
        seq_len = self.config.max_seq_len
        batch_size = self.config.batch_size
        # Random starting positions
        ix = torch.randint(len(data) - seq_len - 1, (batch_size,))
        # Input sequences, and targets shifted one character to the right
        x = torch.stack([data[i:i+seq_len] for i in ix])
        y = torch.stack([data[i+1:i+seq_len+1] for i in ix])
        return x, y
# ============================================================================
# Evaluation
# ============================================================================
@torch.no_grad()
def estimate_loss(model, dataset, config):
    """Estimate average loss on train and validation sets."""
    model.eval()
    losses = {}
    for split in ['train', 'val']:
        total_loss = 0.0
        for _ in range(config.eval_steps):
            x, y = dataset.get_batch(split)
            _, loss = model(x, targets=y)
            total_loss += loss.item()
        losses[split] = total_loss / config.eval_steps
    model.train()
    return losses

# ============================================================================
# 🎯 Progress Bar
# ============================================================================
def progress_bar(current, total, width=40, loss=None, extra=""):
    """Simple progress bar without tqdm dependency."""
    filled = int(width * current / total)
    bar = '█' * filled + '░' * (width - filled)
    percent = 100 * current / total
    # Compare against None so that a loss of exactly 0.0 is still shown
    loss_str = f" loss={loss:.4f}" if loss is not None else ""
    print(f"\r [{bar}] {percent:5.1f}% ({current}/{total}){loss_str} {extra}", end='', flush=True)
# ============================================================================
# MAIN TRAINING LOOP
# ============================================================================
if __name__ == '__main__':
    print_header("🔴 Level 4: Training Your Mini-GPT!")
    # --- Step 1: Load Data ---
    print_step(1, "Loading training data")
    data_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'data', 'stories.txt')
    if not os.path.exists(data_path):
        print(f" {Colors.RED}Error: {data_path} not found!{Colors.RESET}")
        print(f" Make sure stories.txt is in the data/ folder.")
        sys.exit(1)
    with open(data_path, 'r', encoding='utf-8') as f:
        text = f.read()
    print_info(f"Loaded {len(text):,} characters")
    print_info(f"First 100 chars: \"{text[:100]}...\"")

    # --- Step 2: Create Dataset ---
    print_step(2, "Creating character-level dataset")
    config = GPTConfig()
    dataset = CharDataset(text, config)
    # Update config with actual vocab size
    config.vocab_size = dataset.vocab_size
    print_info(f"Vocabulary size: {dataset.vocab_size} unique characters")
    print_info(f"Training data: {len(dataset.train_data):,} characters")
    print_info(f"Validation data: {len(dataset.val_data):,} characters")
    print(f"\n {Colors.YELLOW}Character vocabulary:{Colors.RESET}")
    chars_display = ''.join([dataset.idx_to_char[i] for i in range(min(dataset.vocab_size, 50))])
    print(f" {repr(chars_display)}")

    # --- Step 3: Create Model ---
    print_step(3, "Building Mini-GPT model")
    model = MiniGPT(config)
    n_params = model.count_parameters()
    print_info(f"Model parameters: {n_params:,} ({n_params/1e6:.2f}M)")
    print(f"\n {Colors.YELLOW}Configuration:{Colors.RESET}")
    print(config)

    # --- Step 4: Setup Optimizer ---
    print_step(4, "Setting up optimizer")
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.learning_rate)
    print_info(f"Optimizer: AdamW, lr={config.learning_rate}")

    # --- Step 5: TRAINING! ---
    print_header("Training Loop Starting!")
    print(f" Training for {config.max_steps} steps...")
    print(f" Evaluating every {config.eval_interval} steps")
    print(f" Generating sample every {config.sample_interval} steps\n")
    # Generate BEFORE training to show how bad it is
    print(f" {Colors.YELLOW}Sample BEFORE training:{Colors.RESET}")
    start_tokens = torch.zeros((1, 1), dtype=torch.long)
    generated = model.generate(start_tokens, max_new_tokens=100, temperature=1.0)
    gen_text = dataset.decode(generated[0].tolist())
    print(f" {Colors.DIM}\"{gen_text[:100]}\"{Colors.RESET}")
    print(f" {Colors.RED}^ Complete garbage! The model knows nothing yet.{Colors.RESET}\n")
    model.train()
    start_time = time.time()
    best_val_loss = float('inf')
    train_losses = []
    for step in range(config.max_steps):
        # Get batch
        x, y = dataset.get_batch('train')
        # Forward pass
        logits, loss = model(x, targets=y)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        # Gradient clipping (prevents exploding gradients)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update weights
        optimizer.step()
        train_losses.append(loss.item())
        # Progress bar
        progress_bar(step + 1, config.max_steps, loss=loss.item())

        # Evaluate periodically
        if (step + 1) % config.eval_interval == 0:
            losses = estimate_loss(model, dataset, config)
            elapsed = time.time() - start_time
            print()  # New line after progress bar
            print(f"\n {Colors.BOLD}Step {step+1}/{config.max_steps}{Colors.RESET}")
            print(f" Train loss: {Colors.CYAN}{losses['train']:.4f}{Colors.RESET}")
            print(f" Val loss: {Colors.CYAN}{losses['val']:.4f}{Colors.RESET}")
            print(f" Time: {elapsed:.1f}s")
            # Save best model
            if losses['val'] < best_val_loss:
                best_val_loss = losses['val']
                save_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'mini_gpt_model.pt')
                torch.save({
                    'model_state_dict': model.state_dict(),
                    'config': config,
                    'char_to_idx': dataset.char_to_idx,
                    'idx_to_char': dataset.idx_to_char,
                    'vocab_size': dataset.vocab_size,
                    'step': step + 1,
                    'val_loss': best_val_loss,
                }, save_path)
                print(f" {Colors.GREEN}💾 Best model saved! (val_loss={best_val_loss:.4f}){Colors.RESET}")
            print()

        # Generate sample periodically
        if (step + 1) % config.sample_interval == 0:
            model.eval()
            start_tokens = torch.zeros((1, 1), dtype=torch.long)
            generated = model.generate(start_tokens, max_new_tokens=150, temperature=0.8)
            gen_text = dataset.decode(generated[0].tolist())
            model.train()
            print(f" {Colors.YELLOW}Sample at step {step+1}:{Colors.RESET}")
            print(f" {Colors.GREEN}\"{gen_text[:150]}\"{Colors.RESET}\n")
    # --- Training Complete ---
    total_time = time.time() - start_time
    print_header("Training Complete!")
    print(f" Total training time: {Colors.BOLD}{total_time:.1f} seconds{Colors.RESET}")
    print(f" ({total_time/60:.1f} minutes)")
    print(f" Best validation loss: {Colors.BOLD}{best_val_loss:.4f}{Colors.RESET}")
    # Final generation
    print(f"\n {Colors.YELLOW}Final generated text:{Colors.RESET}")
    model.eval()
    prompts = ["The ", "A ", "Once "]
    for prompt in prompts:
        encoded = dataset.encode(prompt)
        start_tokens = torch.tensor([encoded], dtype=torch.long)
        generated = model.generate(start_tokens, max_new_tokens=200, temperature=0.7, top_k=20)
        gen_text = dataset.decode(generated[0].tolist())
        print(f"\n {Colors.CYAN}Prompt: \"{prompt}\"{Colors.RESET}")
        print(f" {Colors.GREEN}{gen_text[:200]}{Colors.RESET}")
    print(f"\n\n {Colors.BOLD}{Colors.GREEN}✅ Training complete!{Colors.RESET}")
    print(f" {Colors.DIM}Model saved to: mini_gpt_model.pt{Colors.RESET}")
    print(f" {Colors.DIM}Run 'python generate.py' to chat with your model!{Colors.RESET}\n")
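
The way `get_batch` pairs inputs with targets is easy to miss, so here is a small pure-Python sketch of the same slicing on a toy string. The helper `make_example` and the sample text are assumptions for illustration; train.py does the equivalent with tensors and random start positions.

```python
def make_example(encoded, start, seq_len):
    """Return (input, target) where the target is the input shifted by one position."""
    x = encoded[start:start + seq_len]
    y = encoded[start + 1:start + seq_len + 1]
    return x, y

text = "The sun rises"                            # toy stand-in for stories.txt
chars = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}
encoded = [char_to_idx[ch] for ch in text]

x, y = make_example(encoded, start=0, seq_len=8)
x_text = "".join(idx_to_char[i] for i in x)
y_text = "".join(idx_to_char[i] for i in y)
print(repr(x_text))   # 'The sun '
print(repr(y_text))   # 'he sun r'
```

At every position the target is simply the next character, so one sequence of length 8 yields 8 next-character predictions. It also shows why each split must be longer than seq_len + 1 characters: otherwise there is no valid start position to sample.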
Complete Code: generate.py
Python
"""
🔴 Level 4: Chat with Your Mini-GPT!
========================================
Interactive text generation with your trained model.

Usage:
    python generate.py

Commands:
    Type any text → model generates continuation
    temp 0.5      → change temperature (creativity level)
    topk 20       → change top-k sampling
    tokens 300    → change max tokens
    quit          → exit
"""
import os
import sys
import torch
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from model import MiniGPT, GPTConfig
# ============================================================================
# 🎨 Colors
# ============================================================================
class Colors:
    HEADER = '\033[95m'
    BLUE = '\033[94m'
    CYAN = '\033[96m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    DIM = '\033[2m'
    RESET = '\033[0m'
def load_model():
    """Load the trained model from checkpoint."""
    model_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'mini_gpt_model.pt')
    if not os.path.exists(model_path):
        print(f"\n {Colors.RED}❌ Model not found!{Colors.RESET}")
        print(f" The file '{model_path}' does not exist.")
        print(f"\n You need to train the model first:")
        print(f" {Colors.CYAN}python train.py{Colors.RESET}")
        print(f"\n Training takes about 5-10 minutes on CPU.")
        return None, None, None
    print(f" {Colors.DIM}Loading model from {model_path}...{Colors.RESET}")
    # weights_only=False because the checkpoint stores the GPTConfig object;
    # only load checkpoint files you created yourself.
    checkpoint = torch.load(model_path, map_location='cpu', weights_only=False)
    config = checkpoint['config']
    char_to_idx = checkpoint['char_to_idx']
    idx_to_char = checkpoint['idx_to_char']
    model = MiniGPT(config)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.eval()
    print(f" {Colors.GREEN}✓ Model loaded! ({model.count_parameters():,} parameters){Colors.RESET}")
    # Default to NaN so the :.4f format cannot fail if 'val_loss' is missing
    print(f" {Colors.DIM}Trained to step {checkpoint.get('step', '?')}, val_loss={checkpoint.get('val_loss', float('nan')):.4f}{Colors.RESET}")
    return model, char_to_idx, idx_to_char

def generate_text(model, prompt, char_to_idx, idx_to_char,
                  max_tokens=200, temperature=0.8, top_k=20):
    """Generate text from a prompt."""
    # Encode prompt (unknown characters fall back to index 0)
    encoded = [char_to_idx.get(ch, 0) for ch in prompt]
    input_ids = torch.tensor([encoded], dtype=torch.long)
    # Generate
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=max_tokens,
                                temperature=temperature, top_k=top_k)
    # Decode
    generated = ''.join([idx_to_char.get(i, '?') for i in output[0].tolist()])
    return generated
if __name__ == '__main__':
    # Banner
    print(f"""
  {Colors.BOLD}{Colors.HEADER}╔══════════════════════════════════════════════╗
  ║                                              ║
  ║   🔴 Mini-GPT Interactive Text Generator     ║
  ║                                              ║
  ║   Your very own language model!              ║
  ║                                              ║
  ╚══════════════════════════════════════════════╝{Colors.RESET}
""")
    # Load model
    model, char_to_idx, idx_to_char = load_model()
    if model is None:
        sys.exit(1)
    # Settings
    temperature = 0.8
    top_k = 20
    max_tokens = 200
    print(f"""
  {Colors.YELLOW}Settings:{Colors.RESET}
    Temperature: {temperature} (creativity level)
    Top-k: {top_k} (diversity control)
    Max tokens: {max_tokens}
  {Colors.YELLOW}Commands:{Colors.RESET}
    Type any text → model generates continuation
    {Colors.CYAN}temp 0.5{Colors.RESET} → change temperature
    {Colors.CYAN}topk 10{Colors.RESET} → change top-k
    {Colors.CYAN}tokens 300{Colors.RESET} → change max tokens
    {Colors.CYAN}quit{Colors.RESET} → exit
""")
    # Interactive loop
    while True:
        try:
            prompt = input(f" {Colors.BOLD}{Colors.CYAN}You > {Colors.RESET}")
        except (EOFError, KeyboardInterrupt):
            print(f"\n\n {Colors.DIM}Goodbye!{Colors.RESET}\n")
            break
        if not prompt.strip():
            continue
        # Handle commands
        if prompt.strip().lower() == 'quit':
            print(f"\n {Colors.DIM}Goodbye!{Colors.RESET}\n")
            break
        if prompt.strip().lower().startswith('temp '):
            try:
                temperature = float(prompt.strip().split()[1])
                print(f" {Colors.GREEN}✓ Temperature set to {temperature}{Colors.RESET}\n")
            except (ValueError, IndexError):
                print(f" {Colors.RED}Usage: temp 0.5{Colors.RESET}\n")
            continue
        if prompt.strip().lower().startswith('topk '):
            try:
                top_k = int(prompt.strip().split()[1])
                print(f" {Colors.GREEN}✓ Top-k set to {top_k}{Colors.RESET}\n")
            except (ValueError, IndexError):
                print(f" {Colors.RED}Usage: topk 20{Colors.RESET}\n")
            continue
        if prompt.strip().lower().startswith('tokens '):
            try:
                max_tokens = int(prompt.strip().split()[1])
                print(f" {Colors.GREEN}✓ Max tokens set to {max_tokens}{Colors.RESET}\n")
            except (ValueError, IndexError):
                print(f" {Colors.RED}Usage: tokens 300{Colors.RESET}\n")
            continue
        # Generate text
        try:
            generated = generate_text(
                model, prompt, char_to_idx, idx_to_char,
                max_tokens=max_tokens, temperature=temperature, top_k=top_k
            )
            # Display: prompt in cyan, generated part in green
            prompt_len = len(prompt)
            print(f"\n {Colors.CYAN}{generated[:prompt_len]}{Colors.GREEN}{generated[prompt_len:]}{Colors.RESET}")
            print(f"\n {Colors.DIM}[temp={temperature}, top_k={top_k}, tokens={len(generated)-prompt_len}]{Colors.RESET}\n")
        except Exception as e:
            print(f" {Colors.RED}Error: {e}{Colors.RESET}\n")
The Training Data: stories.txt
Text
Once upon a time, in a small village near the river, there lived a wise old farmer. He worked hard every day in his fields. The farmer grew rice, wheat, and vegetables. He shared his food with everyone in the village. People loved him because he was kind and generous.
The sun rises in the east and sets in the west. Every morning, the birds sing beautiful songs. The flowers open their petals to welcome the sunlight. The trees provide shade and fresh air. Nature is beautiful and full of wonders.
A clever fox lived in a forest near a village. One hot summer day, the fox was very thirsty. He searched for water everywhere but could not find any. Then he saw a pot with some water at the bottom. The fox put small stones into the pot one by one. Slowly the water came up to the top. The fox drank the water happily. This story teaches us that intelligence solves problems.
The river Ganga flows from the Himalayas to the Bay of Bengal. It is one of the longest rivers in India. Many cities and towns are built along its banks. People use the river water for drinking and farming. The Ganga is very important for the people of India.
A kind king ruled a beautiful kingdom. His people were happy and peaceful. The king built schools for children and hospitals for the sick. He made sure everyone had food to eat and a place to live. The kingdom prospered under his wise rule.
The moon shines brightly in the night sky. Stars twinkle like tiny diamonds above us. The sky changes color from blue to orange during sunset. Clouds float gently across the sky like cotton balls. Looking at the sky fills our hearts with wonder.
A small boy named Arjun loved to read books. He would sit under the banyan tree and read for hours. His favorite books were about science and adventure. One day he read about the solar system and the planets. He dreamed of becoming a scientist when he grew up.
Water is essential for all living things. Plants need water to grow and make food. Animals drink water to stay alive and healthy. The water cycle keeps water moving around the earth. Rain fills the rivers and lakes with fresh water.
A poor woodcutter lived at the edge of a forest. Every day he would cut wood and sell it in the market. One day his axe fell into the river. He sat by the river and cried because he was very poor. A kind spirit appeared and asked him what happened. The spirit dove into the water and brought up a golden axe. The woodcutter said that was not his axe. The spirit brought up a silver axe. Again the woodcutter said it was not his. Finally the spirit brought up his old iron axe. The woodcutter was happy and said yes that is mine. The spirit was pleased with his honesty and gave him all three axes.
The earth goes around the sun in one year. The moon goes around the earth in about one month. The earth spins on its axis once every day. This spinning gives us day and night. When our part of the earth faces the sun it is daytime. When it faces away from the sun it is nighttime.
A beautiful peacock lived in a garden near the palace. It had colorful feathers of blue and green. When it danced in the rain everyone would stop and watch. The peacock was proud of its beautiful feathers. It spread its tail like a magnificent fan.
Trees are very important for our planet. They give us oxygen to breathe and clean the air. Trees provide fruits and nuts for us to eat. Birds build their nests in the branches of trees. We should plant more trees and take care of them.
A young girl named Priya wanted to learn music. She practiced singing every day after school. Her teacher said she had a beautiful voice. Priya worked very hard and never missed a practice session. After many months she sang in a concert and everyone clapped.
The heart pumps blood through our body. Blood carries oxygen and food to every part of the body. The brain controls all our movements and thoughts. Our bones give shape to our body and protect our organs. The human body is an amazing machine.
An old tortoise and a young rabbit decided to have a race. The rabbit ran very fast and went far ahead. He thought he had plenty of time so he took a nap. The tortoise kept walking slowly but steadily. When the rabbit woke up the tortoise had already crossed the finish line. Slow and steady wins the race.
India has many beautiful festivals throughout the year. Diwali is the festival of lights celebrated with joy and happiness. Holi is the festival of colors where people play with colored powder. Eid brings people together for prayers and feasts. Christmas is celebrated with decorations and gifts.
A magnet has two poles called north and south. Like poles repel each other and unlike poles attract. Magnets can attract things made of iron and steel. The earth itself is like a giant magnet. A compass needle points north because of the earth magnetic field.
There was a merchant who traveled from town to town selling goods. He carried silk cloths and precious spices on his camel. One day he got lost in the desert during a sandstorm. He prayed for help and soon the storm passed away. He followed the stars in the night sky and found his way home.
Light travels in straight lines very fast. When light passes through a prism it splits into seven colors. These colors are violet indigo blue green yellow orange and red. We can see a rainbow after rain because water drops act like tiny prisms. Light is a form of energy that helps us see the world.
A mother bird built a nest in a tall tree. She laid three small eggs in the nest. She sat on the eggs to keep them warm for many days. Soon the eggs cracked and three baby birds came out. The mother bird brought food for her babies every day until they learned to fly.
Plants make their own food through photosynthesis. They use sunlight water and carbon dioxide for this process. The green color in leaves comes from a substance called chlorophyll. Chlorophyll captures sunlight to make food for the plant. Plants give out oxygen during photosynthesis which we breathe.
A brave soldier named Ravi protected his village from danger. He stood guard at the border day and night without complaint. The villagers respected him and treated him like a hero. Ravi taught the young boys how to be brave and strong. He said courage means doing the right thing even when you are afraid.
The seasons change throughout the year in India. Summer is hot and dry with temperatures rising very high. The monsoon brings heavy rains and cools the land. Winter is cold and pleasant in most parts of the country. Spring brings new flowers and green leaves on the trees.
A fisherman went to the sea every morning in his small boat. He would throw his net into the water and wait patiently. Sometimes he caught many fish and sometimes very few. One day he caught a beautiful golden fish. The golden fish spoke and asked to be set free. The kind fisherman released it back into the sea.
Electricity flows through wires like water flows through pipes. We use electricity to power lights fans and computers. A battery stores electrical energy for later use. Switches control the flow of electricity in a circuit. We should use electricity wisely and not waste it.
Two friends were walking through a forest one day. Suddenly they saw a large bear coming toward them. One friend quickly climbed a tree to save himself. The other friend lay down on the ground and pretended to be dead. The bear came close and smelled him then walked away. When the bear left the friend in the tree came down. He asked what the bear whispered in his ear. The friend on the ground said the bear told me not to trust a friend who runs away in danger.
Mountains are the tallest landforms on the earth. The Himalayas are the highest mountains in the world. Mount Everest is the tallest peak standing at eight thousand meters. Many rivers begin from the glaciers in the mountains. Mountains affect the weather and rainfall in nearby areas.
A little ant worked hard all summer long. It collected food and stored it carefully in its home. A grasshopper spent the whole summer singing and dancing. When winter came the ant had plenty of food to eat. The grasshopper had nothing and was cold and hungry. The ant shared some food with the grasshopper and said it is wise to prepare for the future.
Sound is a form of energy that travels in waves. We hear sounds when these waves reach our ears. Sound travels faster through water than through air. It travels fastest through solid objects like metal. Very loud sounds can damage our hearing so we should protect our ears.
A teacher loved her students very much. She came to school early every day to prepare her lessons. She explained difficult topics in simple and easy ways. Her students always performed well in their examinations. She believed that every child can learn if given the right guidance.
Fine-Tuning: From Generic AI to Your Personal ChatBot
Learning Objectives
- Explain why fine-tuning is more practical than training from scratch, and how it saves crores of rupees.
- Distinguish between pre-training and fine-tuning with clear mental models.
- Describe LoRA (Low-Rank Adaptation), including the math behind it and why it's revolutionary.
- Trace the full RLHF pipeline used to build ChatGPT, Claude, and Gemini.
- Prepare a custom Q&A dataset for fine-tuning using Python.
- Fine-tune DistilGPT-2 with LoRA on your own data, on a regular laptop.
- Chat with your fine-tuned model and compare it against the base model.
- Critically evaluate the ethical dimensions of fine-tuning, especially in the Indian context.
6.1 Standing on the Shoulders of Giants
Imagine you want to start a chai stall. Would you plant tea bushes, wait three years for them to grow, build a factory to process leaves, and then start making chai? Of course not! You buy ready-made tea powder from Tata or Brooke Bond and focus on what makes your chai special: the perfect ratio of adrak, elaichi, and sugar.
Fine-tuning works exactly the same way.
Training a large language model from scratch (what we call pre-training) is staggeringly expensive. GPT-3 cost an estimated $4.6 million (roughly ₹38 crore) in compute alone. GPT-4's training cost is rumoured to exceed $100 million (₹830+ crore). These models are trained on trillions of tokens of internet text for weeks on clusters of thousands of GPUs.
Now compare that to fine-tuning. When you fine-tune, you take a pre-trained model that already understands language (grammar, facts, reasoning patterns) and you teach it your specific task. A fine-tuning run on a small model like DistilGPT-2 can cost anywhere from $0 (free, on your laptop's CPU) to about $10 (a few hours on a cloud GPU). Even fine-tuning a 7-billion-parameter model on an A100 GPU can cost under $50.
| Approach | Cost | Time | Data Needed | Hardware |
|---|---|---|---|---|
| Pre-training from scratch | ₹38–830 crore | Weeks–months | Trillions of tokens | Thousands of GPUs |
| Fine-tuning (full) | ₹400–₹40,000 | Hours–days | Thousands of examples | 1–8 GPUs |
| Fine-tuning (LoRA) | ₹0–₹4,000 | Minutes–hours | Hundreds of examples | 1 GPU or CPU |
Important
Fine-tuning is why AI is democratized today. You don't need the budget of Google or OpenAI. A student at IIT Bombay or a teacher in Jaipur can build a specialized AI chatbot with a laptop and a few hundred well-crafted examples.
The giants (Google, Meta, OpenAI, Mistral) have done the expensive work of pre-training. We stand on their shoulders and specialize their models for our needs.
6.2 Pre-training vs Fine-tuning
Here's the analogy that makes this click:
Pre-training is like going to school for 12 years. From Class 1 to Class 12, you learn Hindi, English, Maths, Science, Social Studies, Art, and everything in between. By the time you finish school, you're a well-rounded person who knows a little about a lot.
Fine-tuning is like taking admission in a B.Sc. or B.Tech programme. You pick one subject, say Computer Science, and spend 3–4 years going deep into it. You don't forget what you learned in school; you build on it.
Now consider what happens at each stage for a language model:
Pre-training
During pre-training, the model reads massive amounts of text from the internet (Wikipedia articles, books, news, code, forums) and learns to predict the next word. Through billions of these next-word predictions, it picks up:
- Grammar and syntax: how sentences are structured
- Facts and knowledge: who was the first Prime Minister of India, what is photosynthesis
- Reasoning patterns: if X then Y, cause and effect
- Multiple languages: Hindi, English, Tamil, and hundreds more
The pre-trained model is a generalist. Ask it anything and it will produce grammatically correct, somewhat relevant text. But it won't follow instructions well, it won't stay on topic, and it won't have the personality or expertise you want.
Fine-tuning
During fine-tuning, you take this generalist model and train it further on a small, curated dataset specific to your task. For our chatbot, we'll use education Q&A pairs: questions about science, maths, Indian history, and study tips, paired with clear, helpful answers.
After fine-tuning, the model:
- Follows the Q&A format: it knows when to stop answering and doesn't ramble
- Stays on topic: it gives education-relevant responses
- Matches the tone of your training data: helpful, clear, encouraging
Tip
Think of it this way: pre-training gives the model its IQ (general intelligence). Fine-tuning gives it its specialization and personality (like a teacher who explains things simply, or a doctor who speaks with compassion).
6.3 What is LoRA?
Here's the problem with fine-tuning: even though the dataset is small, you still need to update the model's weights. A model like LLaMA-2 7B has 7 billion parameters. Storing a full copy of those 7 billion updated parameters requires ~28 GB of memory (in FP32). Training them requires even more, because gradients and optimizer state have to be held in memory as well. For most of us, that's out of reach.
LoRA (Low-Rank Adaptation) solves this brilliantly.
The Sticky Note Analogy
Imagine you have a massive NCERT textbook, say the Class 12 Physics textbook. It has 500 pages of printed text that you cannot change (those are the frozen pre-trained weights). But you can add small sticky notes (Post-it notes) on specific pages with your own handwritten additions, corrections, or summaries.
That's LoRA. Instead of rewriting the entire textbook (full fine-tuning), you add tiny, targeted modifications. The original textbook stays intact. Your additions are small, maybe 50 sticky notes in total, but they dramatically change how you use the book.
The Math Behind LoRA
Let's go deeper. In a Transformer, the key computation happens in weight matrices: large matrices like $W$ that transform input vectors into output vectors:
$$h = W x$$
where $W$ is, say, a $768 \times 768$ matrix (589,824 parameters) in DistilGPT-2.
In standard fine-tuning, you'd update every single one of those 589,824 values. LoRA instead says: "The change $\Delta W$ that fine-tuning makes is probably low-rank." That is, the update matrix doesn't need all 589,824 degrees of freedom; it can be approximated by the product of two much smaller matrices.
LoRA decomposes $\Delta W$ into two small matrices:
$$\Delta W = B A$$
where:
- $A$ is a matrix of shape $r \times 768$ (low-rank "down-projection")
- $B$ is a matrix of shape $768 \times r$ (low-rank "up-projection")
- $r$ is the rank, a tiny number like 4, 8, or 16
So instead of storing $768 \times 768 = 589,824$ values for $\Delta W$, you store:
$$r \times 768 + 768 \times r = 2 \times 768 \times r$$
With $r = 8$:
$$2 \times 768 \times 8 = 12,288$$
That's a 98% reduction in trainable parameters! The forward pass becomes (the scaling factor $\frac{\alpha}{r}$ is explained just below):
$$h = W x + \frac{\alpha}{r} B A x$$
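The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `lora_param_counts` is made up for illustration, and it only counts matrix entries for a single weight matrix.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora, reduction_pct) for one d_out x d_in weight matrix."""
    full = d_in * d_out            # entries in a dense update Delta W
    lora = r * d_in + d_out * r    # entries in A (r x d_in) plus B (d_out x r)
    reduction_pct = 100.0 * (1 - lora / full)
    return full, lora, reduction_pct


# Compare a few ranks for the 768 x 768 example used in the text.
for r in (4, 8, 16, 64):
    full, lora, pct = lora_param_counts(768, 768, r)
    print(f"r={r:>2}: full={full:,}  lora={lora:,}  reduction={pct:.1f}%")
```

Note how the saving shrinks as the rank grows: the LoRA count scales linearly with $r$, while the dense count stays fixed.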
Understanding Rank and Alpha
Rank ($r$): This controls the "capacity" of your adaptation. Think of it as how many sticky notes you're allowed to add. A rank of 4 means very focused changes; a rank of 64 gives more expressive power but requires more memory. For most tasks, $r = 8$ or $r = 16$ works beautifully.
Alpha ($\alpha$): This is a scaling factor that controls how much the LoRA adaptation influences the output. The effective scaling is $\frac{\alpha}{r}$. A common practice is to set $\alpha = 2 \times r$ (e.g., $r = 8$, $\alpha = 16$), so the scaling factor is 2.
Note
The beauty of LoRA is that at inference time, you can merge the scaled update into $W$ to get $W' = W + \frac{\alpha}{r} B A$, so there's zero additional latency. The sticky notes get permanently written into the textbook, and the book is the same size as before.
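To see why merging costs nothing at inference time, here is a small sketch in plain Python with a toy 3-dimensional layer and rank 1. All the numbers are invented for illustration; the point is only that the adapter path and the merged matrix give the same output.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X, Y):
    """Multiply matrix X by matrix Y (both lists of rows)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


# Toy values (hypothetical): frozen W is 3x3, A is r x 3, B is 3 x r, with r = 1.
W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0], [2.0, 0.0, 1.0]]
A = [[0.5, -1.0, 0.25]]
B = [[1.0], [0.0], [-2.0]]
alpha, r = 2.0, 1
scale = alpha / r
x = [1.0, 2.0, 3.0]

# Adapter path: h = W x + (alpha / r) * B (A x)
delta = matvec(B, matvec(A, x))
h_adapter = [wx + scale * d for wx, d in zip(matvec(W, x), delta)]

# Merged path: W' = W + (alpha / r) * B A, then h = W' x
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]
h_merged = matvec(W_merged, x)

print(h_adapter)
print(h_merged)
```

Both paths produce the same vector, and `W_merged` has exactly the same shape as `W`, which is why a merged model is no slower than the original.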
Why LoRA is Revolutionary
| Feature | Full Fine-tuning | LoRA |
|---|---|---|
| Parameters updated | All (100%) | 0.1–2% |
| Memory required | Very high | Very low |
| Training speed | Slow | Fast |
| Storage per task | Full model copy | Tiny adapter (~MB) |
| Switch between tasks | Load entire model | Swap adapter file |
This last point is especially powerful. Imagine you fine-tune the same base model for three different subjects: Physics, History, and Mathematics. With LoRA, you store one base model and three tiny adapter files. Swapping subjects is as fast as loading a small file.
6.4 The RLHF Pipeline: How ChatGPT, Claude, and Gemini Are Trained
You might wonder: if fine-tuning is so simple, why did it take until 2022 for ChatGPT to feel truly magical? The answer lies in a multi-stage pipeline that goes far beyond basic fine-tuning.
Here's the full journey:
Stage 1: Pre-training
The model (GPT-4, Gemini, Claude, LLaMA) is trained on trillions of tokens of internet text. This produces a base model: a powerful text predictor that can complete any sentence but doesn't follow instructions or behave helpfully.
Stage 2: Supervised Fine-Tuning (SFT)
Human annotators write thousands of high-quality (instruction, response) pairs. The model is fine-tuned on these examples to learn the format of being helpful. After SFT, the model can follow instructions, but its responses vary in quality.
Stage 3: RLHF (Reinforcement Learning from Human Feedback)
This is the magic sauce. Here's how it works:
- Generate: The SFT model generates multiple responses to the same prompt.
- Rank: Human raters rank these responses from best to worst. ("Response A is more helpful and accurate than Response B.")
- Train a Reward Model: A separate neural network learns to predict which responses humans will prefer. It assigns a reward score to any (prompt, response) pair.
- Optimize with RL: Using Proximal Policy Optimization (PPO), the language model is trained to generate responses that maximize the reward model's score while staying close to the SFT model, typically through a KL-divergence penalty (to prevent it from "hacking" the reward).
This is analogous to a teacher training a student:
- SFT = giving the student model answers to copy
- RLHF = having the student write their own answers, then giving grades and feedback
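The "Train a Reward Model" step can be made concrete with a small sketch. A common formulation, assumed here, is a pairwise (Bradley-Terry style) loss: the reward model is penalised whenever it scores the human-preferred response lower than the rejected one. The function name and the reward values are hypothetical, and real systems compute the rewards with a neural network rather than passing them in as numbers.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Loss for one human comparison: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) is an equivalent form of -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))


# The loss is small when the preferred response gets the higher score...
print(pairwise_reward_loss(2.0, 0.0))
# ...and large when the reward model ranks the pair the wrong way round.
print(pairwise_reward_loss(0.0, 2.0))
```

Minimising this loss over many ranked pairs pushes the reward model to agree with the human raters.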
Stage 4: Constitutional AI and DPO
Anthropic's Constitutional AI (used in Claude) adds another layer: instead of relying solely on human raters, the model critiques its own responses against a set of principles (a "constitution") and revises them. This is like a student doing self-correction based on a rubric.
DPO (Direct Preference Optimization) is a newer, simpler alternative to RLHF. Instead of training a separate reward model and using PPO, DPO directly optimizes the language model using preference pairs. Under its modelling assumptions it targets the same objective as the reward-model-plus-PPO pipeline, but it is much simpler to implement.
Pre-training (trillions of tokens, months) → SFT (thousands of examples, hours) → RLHF/DPO (thousands of comparisons, days) → Deployed model (ChatGPT, Claude, Gemini)
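As a rough illustration of how DPO skips the reward model, the sketch below computes the DPO loss for a single preference pair from four log-probabilities: the chosen and rejected responses under the model being trained and under a frozen reference model (usually the SFT model). The function name, the numbers, and the value of `beta` are illustrative assumptions, not values from this course's scripts.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more (or less) likely each response became relative to the reference.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(logits)), written in a numerically convenient form.
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: loss is log(2), about 0.693.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The only trainable object is the language model itself; the preference data plays the role that the reward model plays in RLHF.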
Tip
What we're doing in this chapter (fine-tuning DistilGPT-2 on Q&A pairs) is equivalent to Stage 2 (SFT). In a production system, you'd add RLHF or DPO on top. But even SFT alone produces a dramatic improvement!
6.5 Preparing Your Data
Before fine-tuning, we need to transform raw question-answer pairs into tokenized, model-ready training data. Our data lives in a JSONL file (education_qa.jsonl) where each line is a JSON object:
JSON
{"instruction": "What is photosynthesis?", "response": "Photosynthesis is the process by which green plants convert sunlight into food..."}
{"instruction": "Who wrote the Indian national anthem?", "response": "The Indian national anthem 'Jana Gana Mana' was written by Rabindranath Tagore..."}
Our data preparation pipeline has five clear steps. Let's walk through each one.
Step 1: Load Raw Data from JSONL
Python
import os
import json
import random
from pathlib import Path

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
DATA_DIR = os.path.join(SCRIPT_DIR, 'data')
PROCESSED_DIR = os.path.join(DATA_DIR, 'processed')
DATASET_FILE = os.path.join(DATA_DIR, 'education_qa.jsonl')

def load_jsonl(filepath):
    """Load data from a JSONL file."""
    if not os.path.exists(filepath):
        print(f"Error: Dataset file not found: {filepath}")
        return None
    data = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
                data.append(entry)
            except json.JSONDecodeError as e:
                print(f" Skipping malformed line {line_num}: {e}")
    print(f"Loaded {len(data)} examples")
    return data
We use JSONL (JSON Lines) instead of a single JSON file because JSONL is streaming-friendly: you can process one line at a time without loading the entire file into memory. This matters when your dataset grows to millions of examples.
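To make the streaming point concrete, here is a minimal sketch of a generator that yields one example at a time instead of building a list. The names `iter_jsonl` and `count_examples` are hypothetical and are not part of prepare_data.py; the demo file contents are made up.

```python
import json
import os
import tempfile


def iter_jsonl(filepath):
    """Yield one parsed JSON object per non-empty line, never holding the whole file."""
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_examples(filepath):
    """Count examples while keeping only one line in memory at a time."""
    return sum(1 for _ in iter_jsonl(filepath))


# Demo on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"instruction": "q1", "response": "a1"}\n')
        f.write("\n")
        f.write('{"instruction": "q2", "response": "a2"}\n')
    n_examples = count_examples(demo_path)

print(n_examples)
```

For the few hundred examples used in this chapter, loading everything into a list as `load_jsonl` does is perfectly fine; the generator only pays off on much larger files.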
Step 2: Format into Instruction-Response Pairs
Python
def format_examples(data):
    """Format each example into instruction-response format."""
    formatted = []
    for entry in data:
        instruction = entry.get('instruction', entry.get('question', ''))
        response = entry.get('response', entry.get('answer', entry.get('output', '')))
        if not instruction or not response:
            continue
        text = f"### Question:\n{instruction}\n\n### Answer:\n{response}"
        formatted.append(text)
    print(f"Formatted {len(formatted)} examples")
    return formatted
Notice the format: ### Question:\n...\n\n### Answer:\n.... This is a prompt template: a consistent structure that teaches the model to recognize where questions end and answers begin. The ### markers act as clear delimiters.
Note
The .get() calls with fallbacks ('question', 'answer', 'output') make the function robust to different dataset formats. Whether your data uses instruction/response or question/answer keys, it works.
Step 3: Tokenize
Python
def tokenize_texts(texts):
    """Tokenize all formatted texts using DistilGPT-2 tokenizer."""
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
    # GPT-2 doesn't have a pad token, so use eos_token instead
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenized_data = []
    for text in texts:
        encoded = tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=256,
            return_tensors=None
        )
        tokenized_data.append({
            'input_ids': encoded['input_ids'],
            'attention_mask': encoded['attention_mask'],
            'text': text
        })
    print(f"Tokenized {len(tokenized_data)} examples (max_length=256)")
    return tokenized_data, tokenizer
Key choices here:
- max_length=256: We cap each example at 256 tokens. This is enough for most education Q&A pairs while keeping memory usage manageable. Longer texts get truncated; shorter texts get padded.
- padding='max_length': All sequences are padded to exactly 256 tokens so they can be batched together.
- attention_mask: A binary mask that tells the model which tokens are real (1) and which are padding (0). The model ignores padding tokens.
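The three choices above can be illustrated without loading a tokenizer. The sketch below mimics truncation, right-padding, and the attention mask on made-up token IDs; `pad_and_mask` is a hypothetical helper, and 50256 is used as the pad ID because that is GPT-2's end-of-text token.

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad token_ids to max_length and build the attention mask."""
    ids = token_ids[:max_length]            # truncation: drop anything past the cap
    n_real = len(ids)
    n_pad = max_length - n_real
    input_ids = ids + [pad_id] * n_pad      # padding: fill up to the fixed length
    attention_mask = [1] * n_real + [0] * n_pad
    return input_ids, attention_mask


PAD_ID = 50256  # GPT-2's end-of-text token, reused as the pad token

# A short example gets padded (token IDs are made up).
short_ids, short_mask = pad_and_mask([101, 102, 103], max_length=6, pad_id=PAD_ID)
print(short_ids)
print(short_mask)

# A long example gets truncated.
long_ids, long_mask = pad_and_mask(list(range(10)), max_length=6, pad_id=PAD_ID)
print(long_ids)
print(long_mask)
```

Every example ends up with the same length, which is what allows several of them to be stacked into one batch.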
Step 4: Split into Train and Validation Sets
Python
def split_data(tokenized_data, train_ratio=0.9, seed=42):
    """Split data into train and validation sets."""
    random.seed(seed)
    indices = list(range(len(tokenized_data)))
    random.shuffle(indices)
    split_idx = int(len(indices) * train_ratio)
    train_indices = indices[:split_idx]
    val_indices = indices[split_idx:]
    train_data = [tokenized_data[i] for i in train_indices]
    val_data = [tokenized_data[i] for i in val_indices]
    print(f"Train: {len(train_data)} examples")
    print(f"Val: {len(val_data)} examples")
    return train_data, val_data
We use a 90/10 split: 90% for training, 10% for validation. The validation set is crucial: it tells us whether the model is actually learning or just memorizing the training data (overfitting). Setting seed=42 ensures reproducibility, so you get the same split every time you run the script.
Step 5: Save Processed Data
Python
def save_processed_data(train_data, val_data):
    """Save processed data to JSON files."""
    os.makedirs(PROCESSED_DIR, exist_ok=True)
    train_path = os.path.join(PROCESSED_DIR, 'train.json')
    val_path = os.path.join(PROCESSED_DIR, 'val.json')
    with open(train_path, 'w', encoding='utf-8') as f:
        json.dump(train_data, f, indent=2)
    with open(val_path, 'w', encoding='utf-8') as f:
        json.dump(val_data, f, indent=2)
    print(f"Saved train data: {train_path}")
    print(f"Saved val data: {val_path}")
Tip
Run this script with: python prepare_data.py. It will create a data/processed/ directory with train.json and val.json, ready for the fine-tuning step.
6.6 Fine-Tuning DistilGPT-2 with LoRA
Now the exciting part: we train the model. This script loads DistilGPT-2, wraps it with LoRA adapters, and trains it on our education data using Hugging Face's Trainer API.
Step 1: Load the Base Model
Python
import os  # used below to build file paths
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import Dataset

MODEL_NAME = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# GPT-2 doesn't have a pad token by default
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.eos_token_id
AutoModelForCausalLM loads the model in causal language modelling mode: the model predicts the next token given all previous tokens. This is the standard mode for GPT-style text generation.
DistilGPT-2 has about 82 million parameters, 6 Transformer layers, 12 attention heads, and an embedding dimension of 768. It's a distilled (compressed) version of GPT-2: smaller and faster, perfect for learning.
Step 2: Apply LoRA
Python
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # LoRA rank
    lora_alpha=16,              # Scaling factor
    lora_dropout=0.05,          # Small dropout for regularization
    target_modules=["c_attn"],  # Target attention layers in GPT-2
    bias="none",
)
model = get_peft_model(model, lora_config)
# Check: how many parameters are trainable?
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in model.parameters())
trainable_pct = 100 * trainable_params / all_params
print(f"Trainable parameters: {trainable_params:,} ({trainable_pct:.2f}% of {all_params:,})")
Let's unpack each parameter:
- task_type=TaskType.CAUSAL_LM: We're doing causal (autoregressive) language modelling, not classification or sequence-to-sequence.
- r=8: Rank 8. In GPT-2, c_attn fuses the query, key, and value projections into one layer that maps 768 inputs to 2304 outputs, so $A$ is $8 \times 768$ and $B$ is $2304 \times 8$. This gives us enough capacity to adapt the model while keeping the adapter tiny.
- lora_alpha=16: Scaling factor. The LoRA update is scaled by $\frac{\alpha}{r} = \frac{16}{8} = 2$.
- target_modules=["c_attn"]: We only add LoRA to the attention projection layers in GPT-2 (called c_attn). This is where the query, key, and value projections live, which makes it the most impactful place to adapt.
- lora_dropout=0.05: A tiny 5% dropout on the LoRA layers to prevent overfitting.
- bias="none": We don't train bias terms, keeping the adapter even smaller.
After applying LoRA, you'll see that only a tiny fraction of the model's parameters are trainable: roughly 0.2% with this configuration (about 147,000 out of roughly 82 million). The rest are frozen. That's the power of LoRA.
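The trainable-parameter count can be estimated by hand before running the script. The sketch below assumes the configuration described above (rank 8, LoRA on c_attn only, 6 layers, c_attn mapping 768 inputs to 3 × 768 = 2304 outputs) and rounds the base model to 82 million parameters as stated in the text; the number printed by finetune.py is the authoritative one.

```python
def lora_trainable_params(n_layers, d_in, d_out, r):
    """Trainable LoRA parameters when one d_in -> d_out layer per block gets an adapter."""
    per_layer = r * d_in + d_out * r   # A is r x d_in, B is d_out x r
    return n_layers * per_layer


N_LAYERS = 6                 # DistilGPT-2 Transformer blocks
D_MODEL = 768                # embedding dimension
C_ATTN_OUT = 3 * D_MODEL     # c_attn produces query, key, and value together
BASE_PARAMS = 82_000_000     # "about 82 million", rounded (assumption)

trainable = lora_trainable_params(N_LAYERS, D_MODEL, C_ATTN_OUT, r=8)
trainable_pct = 100 * trainable / (BASE_PARAMS + trainable)

print(f"Trainable LoRA parameters: {trainable:,}")
print(f"Share of all parameters:   {trainable_pct:.2f}%")
```

At 4 bytes per parameter, an adapter of this size takes well under 1 MB on disk.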
Step 3: Load and Prepare the Dataset
Python
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
DATASET_FILE = os.path.join(SCRIPT_DIR, "data", "education_qa.jsonl")

# Load from JSONL and format.
# load_dataset_from_jsonl is not shown in this excerpt; it is assumed to return
# a list of {"text": ...} dicts built with the same prompt template as before.
raw_data = load_dataset_from_jsonl(DATASET_FILE)

# Create a Hugging Face Dataset
dataset = Dataset.from_list(raw_data)

# Tokenize
def tokenize_function(examples):
    tokenized = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=256,
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"],
    desc="Tokenizing",
)

# 90/10 train/validation split
split = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
val_dataset = split["test"]
Important
Notice tokenized["labels"] = tokenized["input_ids"].copy(). In causal language modelling, the labels are the input tokens themselves, and the target for each position is the token that comes next. The model learns to predict each token given all preceding tokens. You don't shift the labels yourself: the model's forward pass applies the one-position shift internally when it computes the loss.
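The "shifted by one position" idea is easy to see on a toy sequence. This sketch uses words instead of real token IDs, and `next_token_pairs` is a hypothetical helper written only to show which token each position is asked to predict.

```python
def next_token_pairs(tokens):
    """Pair each position with the token the model must predict there."""
    inputs = tokens[:-1]    # every token except the last is a context position
    targets = tokens[1:]    # every token except the first is a prediction target
    return list(zip(inputs, targets))


tokens = ["###", "Question", ":", "What", "is", "light", "?"]
for current, target in next_token_pairs(tokens):
    print(f"after {current!r:>12} predict {target!r}")
```

A sequence of n tokens therefore yields n - 1 training signals, one per position, all computed in a single forward pass.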
Step 4: Configure and Launch Training
Python
OUTPUT_DIR = os.path.join(SCRIPT_DIR, "output", "edu-chatbot-lora")

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-4,
    warmup_steps=50,
    weight_decay=0.01,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    fp16=torch.cuda.is_available(),
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Causal LM, not masked LM
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

# Train!
train_result = trainer.train()
Key training decisions explained:
- 3 epochs: The model sees each training example 3 times. More epochs risk overfitting on a small dataset.
- Batch size 4: Process 4 examples at once. Small enough for CPU/low-memory GPU, large enough for stable gradients.
- Learning rate 2e-4: Higher than typical full fine-tuning learning rates (1e-5 to 5e-5) because we're only training the LoRA parameters, which need larger updates.
- Warmup 50 steps: The learning rate starts at 0 and linearly increases to 2e-4 over 50 steps. This prevents early instability.
- fp16=torch.cuda.is_available(): Uses half-precision (16-bit floats) on GPU, which is typically faster and uses less memory. Falls back to FP32 on CPU.
- load_best_model_at_end=True: After training, automatically loads the checkpoint with the lowest validation loss. This prevents using an overfit model.
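The warmup behaviour can be sketched in a few lines. This assumes the Trainer's default linear schedule (linear warmup from 0, then linear decay back to 0), and the total of 135 steps is a made-up figure that would correspond to about 180 training examples at batch size 4 for 3 epochs; your own step count depends on your dataset size.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


BASE_LR = 2e-4
WARMUP_STEPS = 50
TOTAL_STEPS = 135   # hypothetical: 180 examples / batch size 4 * 3 epochs

for step in (0, 10, 25, 50, 90, 135):
    lr = linear_schedule_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>3}: lr = {lr:.6f}")
```

With a very small dataset, 50 warmup steps can be a large share of training, so it is worth checking the total step count that the Trainer reports when it starts.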
Step 5: Save the LoRA Adapter
Python
# Save LoRA adapter and tokenizer
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
Note
The saved adapter is tiny, typically just a few hundred kilobytes. The base DistilGPT-2 model is ~350 MB. Your LoRA adapter adds less than 1 MB on top. This is like saving just the sticky notes, not the entire textbook.
6.7 Chatting with Your Fine-Tuned Model
Now let's build an interactive chat interface to talk to our creation!
Loading the Fine-Tuned Model
Python
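import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
MODEL_DIR = os.path.join(SCRIPT_DIR, "output", "edu-chatbot-lora")
BASE_MODEL = "distilgpt2"

def load_model(model_path, base_model_name):
    """Load the fine-tuned model (base + LoRA adapter) and an untouched base model."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # Untouched copy of the base model, kept only for comparison
    base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
    base_model.eval()
    # PEFT injects the LoRA layers into the model object it is given,
    # so the adapter gets its own separate copy of the base weights
    adapter_base = AutoModelForCausalLM.from_pretrained(base_model_name)
    model = PeftModel.from_pretrained(adapter_base, model_path)
    model.eval()
    return model, tokenizer, base_model
Notice the two-step loading process:
- Load the base model (distilgpt2): the original, unmodified weights.
- Apply the LoRA adapter on top using PeftModel.from_pretrained().
We also load a second, untouched copy of the base model so we can compare outputs later. A separate copy is needed because PEFT modifies the model it wraps in place; reusing the same object for both would make the "base" responses come from the adapted model too.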
Generating Responses
Python
def generate_response(model, tokenizer, prompt, temperature=0.7,
                      max_new_tokens=200, top_k=50, top_p=0.9):
    """Generate a response from the model."""
    # Format using the same template as training
    formatted_prompt = f"### Question:\n{prompt}\n\n### Answer:\n"
    inputs = tokenizer(formatted_prompt, return_tensors="pt",
                       truncation=True, max_length=256)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]
    with torch.no_grad():
        outputs = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            temperature=max(temperature, 0.01),
            top_k=top_k,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2,
            no_repeat_ngram_size=3,
        )
    # Decode only the newly generated tokens
    generated_ids = outputs[0][input_ids.shape[1]:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
    # Stop at prompt markers to prevent the model from generating new Q&A pairs
    for stop_marker in ["### Question:", "### Answer:", "\n\n\n"]:
        if stop_marker in response:
            response = response[:response.index(stop_marker)].strip()
    return response
Let's understand the generation parameters:
- temperature=0.7: Controls randomness. Lower values (0.1) make the model nearly deterministic and focused. Higher values (1.5) make it creative but potentially incoherent.
- top_k=50: At each step, only consider the top 50 most likely next tokens.
- top_p=0.9: Nucleus sampling: consider the smallest set of tokens whose cumulative probability exceeds 90%.
- repetition_penalty=1.2: Penalize tokens that have already appeared, reducing repetitive output.
- no_repeat_ngram_size=3: Never repeat the same sequence of 3 tokens. Prevents loops like "the the the" or "is very very very important."
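To show how temperature, top-k, and top-p interact, here is a simplified sampler in plain Python. It is a sketch of the idea only: the function name and the toy logits are invented, and the exact order of filtering and renormalisation inside `model.generate()` may differ from what is shown here.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Pick a token index from raw logits; also return the candidates that survived."""
    # 1. Temperature: dividing by a small value sharpens the distribution.
    scaled = [value / temperature for value in logits]
    # 2. Softmax (subtract the max for numerical stability).
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 3. Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # 4. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # 5. Sample among the survivors, weighted by their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept


toy_logits = [5.0, 1.0, 0.0, -1.0]   # made-up scores for a 4-token vocabulary
_, kept_sharp = sample_next_token(toy_logits, temperature=1.0)
_, kept_flat = sample_next_token(toy_logits, temperature=100.0)
print(len(kept_sharp), len(kept_flat))
```

At a normal temperature the first token dominates and is the only candidate left after top-p filtering; at a very high temperature the distribution flattens and every token stays in play, which is why high-temperature output wanders.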
Warning
The formatted_prompt must use the exact same template as the training data (### Question:\n...\n\n### Answer:\n). If you change this format, the model won't recognize the pattern and will produce poor output. Consistency between training and inference is critical.
The Comparison Function
Python
def compare_models(base_model, finetuned_model, tokenizer, prompt, temperature=0.7):
    """Compare responses from base model vs fine-tuned model."""
    base_response = generate_response(base_model, tokenizer, prompt, temperature)
    ft_response = generate_response(finetuned_model, tokenizer, prompt, temperature)
    print(f"Base Model (no fine-tuning):\n{base_response}\n")
    print(f"Fine-Tuned Model:\n{ft_response}\n")
This function sends the same prompt to both the base model and the fine-tuned model, displaying their responses side by side.
6.8 Comparing Base vs Fine-Tuned
Here's what you'll typically see when you run the comparison:
Prompt: "What is photosynthesis?"
| | Base DistilGPT-2 | Fine-Tuned DistilGPT-2 |
|---|---|---|
| Response | "The main purpose of this post is to explain how a new generation..." (random, off-topic text) | "Photosynthesis is the process by which green plants use sunlight, water, and carbon dioxide to make their own food (glucose) and release oxygen..." |
| Quality | Incoherent, random | Clear, educational |
| Format | No structure | Follows Q&A format |
| Relevance | 0% relevant | Highly relevant |
Prompt: "Give me tips to prepare for board exams."
| | Base DistilGPT-2 | Fine-Tuned DistilGPT-2 |
|---|---|---|
| Response | "I'm not sure what you're talking about but the first thing I'd say is..." | "Here are some effective tips: 1) Make a study timetable and stick to it. 2) Focus on NCERT textbooks first. 3) Practice previous years' question papers..." |
Tip
Try the compare command in chat.py to see this in action with your own questions. It's the most convincing demonstration of why fine-tuning works!
The base model is like a Class 12 topper who answers every question with random Wikipedia trivia. The fine-tuned model is like a dedicated tuition teacher who understands exactly what you asked and gives a clear, structured answer.
💭 6.9 Discussion: Ethics of Fine-Tuning
Fine-tuning is powerful, and with power comes responsibility. Let's discuss the ethical dimensions, especially as they apply to India.
### Bias Amplification
Every dataset carries the biases of its creators. If your education Q&A data is written from a particular perspective (say, a North Indian, upper-caste, English-medium viewpoint), the fine-tuned model will reflect those biases. It might:
- Give examples only from CBSE, ignoring state board syllabi
- Use English explanations that aren't accessible to Hindi-medium or regional-medium students
- Present history from a single perspective, ignoring diverse regional narratives
Mitigation: Actively include diverse examples. Have reviewers from different states, languages, and backgrounds evaluate your training data.
### Misinformation
A model fine-tuned on incorrect or outdated information will confidently produce wrong answers. In education, this is particularly dangerous: imagine a student trusting an AI that says "India became independent in 1948" or gives wrong formulas for Physics.
Mitigation: Rigorously fact-check your training data. Include citations where possible. Add disclaimers that the AI can make mistakes.
### Language and Accessibility
India has 22 officially recognized languages and hundreds more spoken across the country. An AI trained only on English education data excludes the vast majority of Indian students. A student in Madurai studying in Tamil medium, or a student in Assam studying in Assamese medium, deserves the same quality of AI assistance.
Mitigation: Build multilingual datasets. Fine-tune models that support Hindi, Tamil, Telugu, Bengali, Marathi, and other Indian languages. Organizations like AI4Bharat are doing pioneering work in this space.
### The Digital Divide
Fine-tuning requires computing resources, technical knowledge, and data, all of which are unevenly distributed. There's a risk that AI-powered education tools benefit urban, English-speaking, well-connected students while leaving rural India behind.
Mitigation: Design tools that work offline, on low-end devices. Partner with government schools and NGOs. Make your models and data open-source so others can build on them.
### Privacy
Education datasets might contain student questions that reveal personal information such as learning difficulties, family situations, or mental health struggles. Fine-tuning on such data without consent is a serious privacy violation.
Mitigation: Anonymize all data. Get informed consent. Follow India's Digital Personal Data Protection Act (DPDPA) guidelines.
> [!CAUTION]
> Never deploy a fine-tuned education AI without human oversight. AI should assist teachers, not replace them. A wrong answer from a textbook can be corrected in the next edition; a wrong answer from an AI can be given to thousands of students simultaneously before anyone notices.
Key Concepts Summary
| Concept | Definition |
|---|---|
| Pre-training | Training a model from scratch on massive text data to learn general language understanding. Extremely expensive. |
| Fine-tuning | Adapting a pre-trained model to a specific task using a small, curated dataset. Cheap and fast. |
| LoRA (Low-Rank Adaptation) | A parameter-efficient fine-tuning method that decomposes weight updates into two small matrices ($B \times A$), reducing trainable parameters by about 98% or more. |
| Rank ($r$) | The inner dimension of LoRA matrices. Controls adaptation capacity. Typical values: 4, 8, 16. |
| Alpha ($\alpha$) | LoRA scaling factor. The update is scaled by $\frac{\alpha}{r}$. Common choice: $\alpha = 2r$. |
| SFT (Supervised Fine-Tuning) | Fine-tuning on human-written (instruction, response) pairs. |
| RLHF | Reinforcement Learning from Human Feedback. Uses human preference rankings to train a reward model, then optimizes the LM with PPO. |
| DPO | Direct Preference Optimization. A simpler alternative to RLHF that optimizes directly from preference data without a separate reward model. |
| Prompt Template | A consistent format (e.g., ### Question:\n...\n\n### Answer:\n) used during both training and inference. |
| Temperature | A generation parameter controlling randomness. Low = focused; high = creative. |
| Attention Mask | A binary mask indicating real tokens (1) vs. padding (0). |
| Data Collator | A utility that dynamically batches and pads tokenized examples for training. |
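To make the LoRA rows of the table concrete, here is a small arithmetic sketch of how many parameters the two low-rank matrices add and how the $\frac{\alpha}{r}$ scale behaves. The layer shapes (768 inputs, 2304 outputs, 6 layers) are assumed values for DistilGPT-2's fused attention projection, and the helper names are illustrative only.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two LoRA matrices: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scale(alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return alpha / r


# Assumed shapes for the c_attn projection in DistilGPT-2:
# 768 inputs -> 2304 outputs (query, key, value concatenated), 6 layers.
D_IN, D_OUT, N_LAYERS = 768, 2304, 6

full = D_IN * D_OUT * N_LAYERS  # weights a full update of these layers would touch
for r in (4, 8, 16):
    lora = lora_trainable_params(D_IN, D_OUT, r) * N_LAYERS
    print(f"r={r:>2}: {lora:>8,} LoRA params vs {full:,} full "
          f"({100 * lora / full:.2f}%), scale at alpha=2r: {lora_scale(2 * r, r)}")
```

Doubling the rank doubles the adapter size, while keeping $\alpha = 2r$ holds the scale at 2.0 regardless of rank.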
6.11 Exercises
Exercise 1: Experiment with LoRA Hyperparameters
Modify finetune.py to try different LoRA configurations:
- Change the rank from 8 to 4 and to 16. How does the validation loss change? Does higher rank always mean better performance?
- Change lora_alpha to 8 (same as the rank) and to 32 (4× the rank). How does this affect training stability?
- Add "c_proj" to target_modules alongside "c_attn". Does targeting more layers improve results?
Exercise 2: Build a Hindi Q&A Dataset
Create a JSONL file with at least 50 question-answer pairs in Hindi (or your regional language). Topics can include: Indian history, geography, civics, or any school subject. Run the full pipeline (prepare_data.py → finetune.py → chat.py) on your dataset and evaluate the results. Does the model respond in Hindi?
Exercise 3: Temperature Exploration
Using chat.py, ask the same question ("Explain Newton's third law") at five different temperatures: 0.1, 0.3, 0.7, 1.0, and 1.5. Copy the responses and analyze:
- At what temperature does the response become incoherent?
- Which temperature produces the most "textbook-like" answer?
- Which temperature produces the most "creative" answer?
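Before running the experiment, it can help to see what temperature does to a probability distribution. The sketch below divides a set of made-up logits by the temperature before applying softmax; the logits and the function name are illustrative assumptions, not values taken from the model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At low temperature almost all probability sits on the top token; at high temperature the distribution flattens, which is why sampled text becomes more varied and eventually incoherent.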
Exercise 4: Measure Overfitting
Modify finetune.py to train for 10 epochs instead of 3. Plot the training loss and validation loss for each epoch. At what epoch does the validation loss start increasing while training loss keeps decreasing? This is the overfitting point. What strategies could you use to prevent it?
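Once you have the per-epoch losses, the overfitting point can be located programmatically. This is a minimal sketch using made-up loss values; the function name is hypothetical and the numbers are not results from the chapter's model.

```python
def first_overfit_epoch(train_losses, val_losses):
    """Return the 1-based epoch where validation loss first rises
    while training loss is still falling, or None if that never happens."""
    for i in range(1, len(val_losses)):
        val_up = val_losses[i] > val_losses[i - 1]
        train_down = train_losses[i] < train_losses[i - 1]
        if val_up and train_down:
            return i + 1
    return None


# Made-up curves for illustration only.
train = [3.2, 2.5, 2.0, 1.6, 1.3, 1.0]
val = [3.3, 2.8, 2.6, 2.7, 2.9, 3.2]
print("Overfitting starts at epoch:", first_overfit_epoch(train, val))
```

One common remedy is early stopping: keep the checkpoint with the lowest validation loss and stop once it has not improved for a few evaluations. The Transformers library provides an `EarlyStoppingCallback` for this purpose.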
Exercise 5: Compare Prompt Templates
The current prompt template uses ### Question: and ### Answer:. Try these alternatives and compare output quality:
- Q: ... A: ...
- Student: ... Teacher: ...
- <question> ... </question> <answer> ... </answer>
Remember: you must use the same template in both finetune.py and chat.py.
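One way to keep the two scripts in sync is to define every template once and derive the inference prompt from it. The sketch below is an assumption about how you might organize this; the names `TEMPLATES`, `build_training_text`, and `build_inference_prompt` do not exist in the chapter's code.

```python
# Each template contains {question} and {answer} placeholders.
TEMPLATES = {
    "hash": "### Question:\n{question}\n\n### Answer:\n{answer}",
    "qa": "Q: {question}\nA: {answer}",
    "dialogue": "Student: {question}\nTeacher: {answer}",
    "tags": "<question>{question}</question>\n<answer>{answer}</answer>",
}


def build_training_text(name, question, answer):
    """Full example text, as it would be shown to the model during training."""
    return TEMPLATES[name].format(question=question, answer=answer)


def build_inference_prompt(name, question):
    """Same template, cut off exactly where the answer is supposed to begin."""
    template = TEMPLATES[name]
    prefix = template[: template.index("{answer}")]
    return prefix.format(question=question)


print(repr(build_inference_prompt("hash", "What is gravity?")))
print(repr(build_training_text("dialogue", "What is gravity?", "A force of attraction.")))
```

Because the inference prompt is always a prefix of the training text, the model sees at chat time exactly the context it learned to complete.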
Exercise 6: Ethical Audit
Take your fine-tuned model and ask it 10 questions about Indian history from different regional perspectives (e.g., the independence movement from a South Indian perspective, tribal history, Northeast Indian history). Document:
- Which questions does the model answer well?
- Where does it show bias or gaps?
- How would you improve the training data to address these gaps?
Exercise 7: Adapter Arithmetic
You created one LoRA adapter for education. Now create a second adapter for a different domain (e.g., cooking recipes, cricket commentary, or Bollywood trivia). Can you load them separately on the same base model? How quickly can you switch between "teacher mode" and "cricket commentator mode"?
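As a starting point, the sketch below attaches several named adapters to one base model using PEFT's `PeftModel.from_pretrained`, `load_adapter`, and `set_adapter`. The function name and the second adapter path are hypothetical; only `output/edu-chatbot-lora` comes from this chapter's scripts.

```python
def load_switchable_model(base_name, adapters):
    """Load one base model and attach several named LoRA adapters.

    `adapters` maps an adapter name to the directory where it was saved, e.g.
    {"teacher": "output/edu-chatbot-lora", "cricket": "output/cricket-lora"}.
    The "cricket" path is a hypothetical example.
    """
    if not adapters:
        raise ValueError("Provide at least one adapter directory")

    # Imported lazily so this file can be imported without the heavy libraries.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    names = list(adapters)
    base = AutoModelForCausalLM.from_pretrained(base_name)
    # The first adapter creates the PeftModel wrapper...
    model = PeftModel.from_pretrained(base, adapters[names[0]], adapter_name=names[0])
    # ...and the remaining adapters are attached to the same wrapper.
    for name in names[1:]:
        model.load_adapter(adapters[name], adapter_name=name)
    model.eval()
    return model


# Usage (requires both adapters to have been trained and saved first):
#   model = load_switchable_model("distilgpt2", {"teacher": "...", "cricket": "..."})
#   model.set_adapter("cricket")   # switch to cricket commentator mode
#   model.set_adapter("teacher")   # switch back to teacher mode
```

Switching only changes which small set of LoRA weights is active, so the base model does not need to be reloaded.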
In the next chapter, we'll bring everything together by building a complete, deployable chatbot with a web interface that your friends, students, and colleagues can actually use. The journey from theory to product begins!
Complete Source Code - Chapter 6
Below are the complete, runnable source files for this chapter. Every line is included.
Complete Code: prepare_data.py
Python
#!/usr/bin/env python3
"""Level 5: Prepare Education Q&A Dataset for Fine-Tuning
This script loads raw education Q&A data from JSONL format,
formats it into instruction-response pairs, tokenizes using
the DistilGPT-2 tokenizer, and splits into train/validation sets.
"""
import os
import json
import random
from pathlib import Path
# ANSI Colors
GREEN = '\033[92m'
CYAN = '\033[96m'
YELLOW = '\033[93m'
MAGENTA = '\033[95m'
BOLD = '\033[1m'
RESET = '\033[0m'
BLUE = '\033[94m'
RED = '\033[91m'
DIM = '\033[2m'
# Get script directory
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
DATA_DIR = os.path.join(SCRIPT_DIR, 'data')
PROCESSED_DIR = os.path.join(DATA_DIR, 'processed')
DATASET_FILE = os.path.join(DATA_DIR, 'education_qa.jsonl')
def print_banner():
"""Print the data preparation banner."""
banner = f"""
{CYAN}{BOLD}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ
โ ๐ฆ Level 5: Data Preparation Pipeline โ
โ โ
โ Transforming raw Q&A data into tokenized training data โ
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ{RESET}
"""
print(banner)
def load_jsonl(filepath):
    """Load data from a JSONL file (one JSON object per line)."""
    print(f"{BLUE}{BOLD}[Step 1/5]{RESET} Loading dataset from {DIM}{filepath}{RESET}")
    if not os.path.exists(filepath):
        print(f"{RED}{BOLD}✗ Error:{RESET} Dataset file not found: {filepath}")
        print(f"{YELLOW}  → Make sure 'education_qa.jsonl' exists in the data/ directory{RESET}")
        return None
    data = []
    with open(filepath, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue  # Skip blank lines
            try:
                entry = json.loads(line)
                data.append(entry)
            except json.JSONDecodeError as e:
                # Report the bad line but keep loading the rest of the file
                print(f"{YELLOW}  ⚠ Skipping malformed line {line_num}: {e}{RESET}")
    print(f"{GREEN}  ✓ Loaded {BOLD}{len(data)}{RESET}{GREEN} examples{RESET}")
    return data
def format_examples(data):
    """Format each example into instruction-response format."""
    print(f"\n{BLUE}{BOLD}[Step 2/5]{RESET} Formatting examples into Q&A pairs")
    formatted = []
    for entry in data:
        # Accept several common key names for the question and the answer
        instruction = entry.get('instruction', entry.get('question', ''))
        response = entry.get('response', entry.get('answer', entry.get('output', '')))
        if not instruction or not response:
            continue  # Skip incomplete examples
        text = f"### Question:\n{instruction}\n\n### Answer:\n{response}"
        formatted.append(text)
    print(f"{GREEN}  ✓ Formatted {BOLD}{len(formatted)}{RESET}{GREEN} examples{RESET}")
    return formatted
def tokenize_texts(texts):
    """Tokenize all formatted texts using the DistilGPT-2 tokenizer."""
    print(f"\n{BLUE}{BOLD}[Step 3/5]{RESET} Loading tokenizer and tokenizing texts")
    try:
        from transformers import AutoTokenizer
    except ImportError:
        print(f"{RED}{BOLD}✗ Error:{RESET} 'transformers' library not installed.")
        print(f"{YELLOW}  → Run: pip install transformers{RESET}")
        return None, None
    tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
    # Set pad token to eos token (GPT-2 doesn't have a pad token by default)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    print(f"{DIM}  Using tokenizer: distilgpt2 (vocab size: {tokenizer.vocab_size}){RESET}")
    tokenized_data = []
    for text in texts:
        # Pad or truncate every example to exactly 256 tokens
        encoded = tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=256,
            return_tensors=None  # Plain Python lists, so the result can be saved as JSON
        )
        tokenized_data.append({
            'input_ids': encoded['input_ids'],
            'attention_mask': encoded['attention_mask'],
            'text': text
        })
    print(f"{GREEN}  ✓ Tokenized {BOLD}{len(tokenized_data)}{RESET}{GREEN} examples (max_length=256){RESET}")
    return tokenized_data, tokenizer
def split_data(tokenized_data, train_ratio=0.9, seed=42):
    """Split data into train and validation sets."""
    # Report the ratio actually in use rather than a hard-coded 90/10
    train_pct = round(train_ratio * 100)
    print(f"\n{BLUE}{BOLD}[Step 4/5]{RESET} Splitting data ({train_pct}/{100 - train_pct} train/val, seed={seed})")
    # A local generator gives the same shuffle without touching global random state
    rng = random.Random(seed)
    indices = list(range(len(tokenized_data)))
    rng.shuffle(indices)
    split_idx = int(len(indices) * train_ratio)
    train_indices = indices[:split_idx]
    val_indices = indices[split_idx:]
    train_data = [tokenized_data[i] for i in train_indices]
    val_data = [tokenized_data[i] for i in val_indices]
    print(f"{GREEN}  ✓ Train: {BOLD}{len(train_data)}{RESET}{GREEN} examples{RESET}")
    print(f"{GREEN}  ✓ Val: {BOLD}{len(val_data)}{RESET}{GREEN} examples{RESET}")
    return train_data, val_data
def print_statistics(tokenized_data, train_data, val_data, tokenizer):
"""Print colorful dataset statistics."""
print(f"\n{MAGENTA}{BOLD}{'โ' * 50}")
print(f" ๐ Dataset Statistics")
print(f"{'โ' * 50}{RESET}\n")
# Calculate token lengths (non-padding tokens)
token_lengths = []
for entry in tokenized_data:
non_pad = sum(entry['attention_mask'])
token_lengths.append(non_pad)
avg_len = sum(token_lengths) / len(token_lengths) if token_lengths else 0
max_len = max(token_lengths) if token_lengths else 0
min_len = min(token_lengths) if token_lengths else 0
stats = [
("Total examples", f"{len(tokenized_data)}", CYAN),
("Train split", f"{len(train_data)}", GREEN),
("Validation split", f"{len(val_data)}", GREEN),
("Avg token length", f"{avg_len:.1f}", YELLOW),
("Max token length", f"{max_len}", YELLOW),
("Min token length", f"{min_len}", YELLOW),
("Vocabulary size", f"{tokenizer.vocab_size:,}", MAGENTA),
]
for label, value, color in stats:
print(f" {color}{BOLD}{'โข':>3} {label:<22}{RESET} {color}{value}{RESET}")
print(f"\n{MAGENTA}{BOLD}{'โ' * 50}{RESET}")
def save_processed_data(train_data, val_data):
"""Save processed data to JSON files."""
print(f"\n{BLUE}{BOLD}[Step 5/5]{RESET} Saving processed data")
os.makedirs(PROCESSED_DIR, exist_ok=True)
train_path = os.path.join(PROCESSED_DIR, 'train.json')
val_path = os.path.join(PROCESSED_DIR, 'val.json')
with open(train_path, 'w', encoding='utf-8') as f:
json.dump(train_data, f, indent=2)
with open(val_path, 'w', encoding='utf-8') as f:
json.dump(val_data, f, indent=2)
# Calculate file sizes
train_size = os.path.getsize(train_path) / (1024 * 1024)
val_size = os.path.getsize(val_path) / (1024 * 1024)
print(f"{GREEN} โ Saved train data: {DIM}{train_path}{RESET} ({train_size:.2f} MB)")
print(f"{GREEN} โ Saved val data: {DIM}{val_path}{RESET} ({val_size:.2f} MB)")
def main():
"""Main data preparation pipeline."""
print_banner()
# Step 1: Load raw data
data = load_jsonl(DATASET_FILE)
if data is None:
return
if len(data) == 0:
print(f"{RED}{BOLD}โ Error:{RESET} No valid examples found in the dataset.")
return
# Step 2: Format examples
formatted_texts = format_examples(data)
if not formatted_texts:
print(f"{RED}{BOLD}โ Error:{RESET} No examples could be formatted.")
return
# Step 3: Tokenize
tokenized_data, tokenizer = tokenize_texts(formatted_texts)
if tokenized_data is None:
return
# Step 4: Split data
train_data, val_data = split_data(tokenized_data)
# Print statistics
print_statistics(tokenized_data, train_data, val_data, tokenizer)
# Step 5: Save
save_processed_data(train_data, val_data)
# Final success message
print(f"\n{GREEN}{BOLD}{'═' * 50}")
print(f"  ✅ Data preparation complete!")
print(f"  → Next step: Run finetune.py to train the model")
print(f"{'═' * 50}{RESET}\n")
if __name__ == '__main__':
main()
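The script expects data/education_qa.jsonl to hold one JSON object per line. The following sketch writes and reads back two made-up records in a temporary directory to show the shape. The key names follow what format_examples accepts (instruction/response, with question/answer as fallbacks); the record contents and the helper names `write_jsonl` and `read_jsonl` are illustrative assumptions.

```python
import json
import os
import tempfile

# Made-up records: the first uses the primary keys, the second the fallback keys.
examples = [
    {"instruction": "What is photosynthesis?",
     "response": "Photosynthesis is how green plants make food from sunlight, "
                 "water, and carbon dioxide."},
    {"question": "What is the capital of India?", "answer": "New Delhi"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line; ensure_ascii=False keeps Indic scripts readable."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "education_qa.jsonl")
write_jsonl(path, examples)
rows = read_jsonl(path)
print(f"{len(rows)} rows; keys of first row: {sorted(rows[0])}")
```

Using ensure_ascii=False matters for the Hindi and regional-language exercise, since it stores the text as readable characters rather than escape sequences.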
Complete Code: finetune.py
Python
#!/usr/bin/env python3
"""
๐ Fine-Tune DistilGPT-2 with LoRA
====================================
Fine-tunes a pre-trained DistilGPT-2 model on the education Q&A dataset
using LoRA (Low-Rank Adaptation) for parameter-efficient training.
Part of: ๐ง Build Your Own AI โ From Zero to ChatBot (Level 5)
"""
import os
import sys
import json
import math
# โโโ ANSI Color Codes โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
CYAN = "\033[96m"
GREEN = "\033[92m"
YELLOW = "\033[93m"
RED = "\033[91m"
MAGENTA = "\033[95m"
BLUE = "\033[94m"
BOLD = "\033[1m"
DIM = "\033[2m"
RESET = "\033[0m"
# โโโ Paths (relative to this script) โโโโโโโโโโโโโโโโโโโโโโโโโโโ
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
DATA_DIR = os.path.join(SCRIPT_DIR, "data")
DATASET_FILE = os.path.join(DATA_DIR, "education_qa.jsonl")
OUTPUT_DIR = os.path.join(SCRIPT_DIR, "output", "edu-chatbot-lora")
def print_banner():
"""Print the fine-tuning banner."""
banner = f"""
{CYAN}{BOLD}โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ
โ ๐ Level 5: Fine-Tune DistilGPT-2 with LoRA โ
โ โ
โ Training a real AI model on education Q&A data โ
โ Using parameter-efficient LoRA adaptation โ
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ{RESET}
"""
print(banner)
def check_dependencies():
"""Check if all required libraries are installed."""
print(f"{BLUE}{BOLD}[Pre-Check]{RESET} Verifying dependencies...\n")
required = {
"torch": "PyTorch",
"transformers": "Hugging Face Transformers",
"peft": "PEFT (LoRA)",
"datasets": "Hugging Face Datasets",
}
all_ok = True
for module, name in required.items():
try:
__import__(module)
print(f" {GREEN}โ{RESET} {name} ({module})")
except ImportError:
print(f" {RED}โ{RESET} {name} ({module}) โ {RED}not installed{RESET}")
all_ok = False
if not all_ok:
print(f"\n{RED}{BOLD}โ Missing dependencies!{RESET}")
print(f" {YELLOW}Run: pip install -r requirements.txt{RESET}")
return False
print(f"\n  {GREEN}{BOLD}✅ All dependencies satisfied!{RESET}\n")
return True
def load_dataset_from_jsonl(filepath):
    """Load training data from the raw JSONL file and apply the prompt template."""
    if not os.path.exists(filepath):
        print(f"{RED}{BOLD}✗ Error:{RESET} Dataset not found: {filepath}")
        print(f"  {YELLOW}Make sure education_qa.jsonl exists in the data/ directory.{RESET}")
        return None
    data = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # Skip malformed lines
            # Accept the same key names as prepare_data.py
            instruction = entry.get("instruction", entry.get("question", ""))
            response = entry.get("response", entry.get("answer", entry.get("output", "")))
            if instruction and response:
                text = f"### Question:\n{instruction}\n\n### Answer:\n{response}"
                data.append({"text": text})
    return data
def format_number(n):
"""Format a number with commas for readability."""
return f"{n:,}"
def main():
"""Main fine-tuning pipeline."""
print_banner()
# โโโ Check Dependencies โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
if not check_dependencies():
sys.exit(1)
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling,
)
from peft import LoraConfig, get_peft_model, TaskType
from datasets import Dataset
# โโโ Step 1: Load Base Model โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
MODEL_NAME = "distilgpt2"
print(f"{BLUE}{BOLD}[Step 1/5]{RESET} Loading base model: {CYAN}{MODEL_NAME}{RESET}")
print(f" {DIM}(Downloading from Hugging Face Hub if not cached...){RESET}\n")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# Set pad token
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id
# โโโ Print Model Info โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
total_params = sum(p.numel() for p in model.parameters())
print(f"{MAGENTA}{BOLD}{'โ' * 55}")
print(f" ๐ Model Information")
print(f"{'โ' * 55}{RESET}\n")
print(f" {CYAN}{'โข':>3} Model Name {RESET} {MODEL_NAME}")
print(f" {CYAN}{'โข':>3} Architecture {RESET} GPT-2 (Decoder-only Transformer)")
print(f" {CYAN}{'โข':>3} Total Parameters {RESET} {format_number(total_params)}")
print(f" {CYAN}{'โข':>3} Model Size {RESET} ~{total_params * 4 / (1024**2):.1f} MB (FP32)")
print(f" {CYAN}{'โข':>3} Layers {RESET} {model.config.n_layer}")
print(f" {CYAN}{'โข':>3} Attention Heads {RESET} {model.config.n_head}")
print(f" {CYAN}{'โข':>3} Embedding Dim {RESET} {model.config.n_embd}")
print(f" {CYAN}{'โข':>3} Vocabulary Size {RESET} {format_number(model.config.vocab_size)}")
print(f"\n{MAGENTA}{BOLD}{'โ' * 55}{RESET}\n")
# โโโ Step 2: Apply LoRA โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
print(f"{BLUE}{BOLD}[Step 2/5]{RESET} Applying LoRA (Low-Rank Adaptation)\n")
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=8, # LoRA rank
lora_alpha=16, # LoRA alpha (scaling factor)
lora_dropout=0.05, # Small dropout for regularization
target_modules=["c_attn"], # Target attention layers in GPT-2
bias="none",
)
model = get_peft_model(model, lora_config)
# Print LoRA info
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in model.parameters())
trainable_pct = 100 * trainable_params / all_params
print(f" {GREEN}LoRA Configuration:{RESET}")
print(f" {DIM}โโ{RESET} Rank (r): {YELLOW}8{RESET}")
print(f" {DIM}โโ{RESET} Alpha: {YELLOW}16{RESET}")
print(f" {DIM}โโ{RESET} Target Modules: {YELLOW}c_attn (attention layers){RESET}")
print(f" {DIM}โโ{RESET} Dropout: {YELLOW}0.05{RESET}")
print(f" {DIM}โโ{RESET} Bias: {YELLOW}none{RESET}")
print()
print(f" {GREEN}{BOLD}Trainable parameters: {YELLOW}{format_number(trainable_params)}{GREEN} ({YELLOW}{trainable_pct:.2f}%{GREEN} of total {format_number(all_params)}){RESET}")
print(f" {DIM} โ Training only {trainable_pct:.2f}% of the model โ like adding sticky notes to a textbook! ๐{RESET}\n")
# โโโ Step 3: Load and Prepare Data โโโโโโโโโโโโโโโโโโโโโโโโโโ
print(f"{BLUE}{BOLD}[Step 3/5]{RESET} Loading and preparing training data\n")
raw_data = load_dataset_from_jsonl(DATASET_FILE)
if raw_data is None:
return
print(f" {GREEN}โ Loaded {BOLD}{len(raw_data)}{RESET}{GREEN} training examples{RESET}")
# Create Hugging Face Dataset
dataset = Dataset.from_list(raw_data)
# Tokenize
def tokenize_function(examples):
tokenized = tokenizer(
examples["text"],
truncation=True,
padding="max_length",
max_length=256,
)
tokenized["labels"] = tokenized["input_ids"].copy()
return tokenized
tokenized_dataset = dataset.map(
tokenize_function,
batched=True,
remove_columns=["text"],
desc="Tokenizing",
)
# Split into train/val
split = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
val_dataset = split["test"]
print(f" {GREEN}โ Training examples: {BOLD}{len(train_dataset)}{RESET}")
print(f" {GREEN}โ Validation examples: {BOLD}{len(val_dataset)}{RESET}\n")
# โโโ Step 4: Training โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
print(f"{BLUE}{BOLD}[Step 4/5]{RESET} Starting LoRA fine-tuning\n")
training_args = TrainingArguments(
output_dir=OUTPUT_DIR,
overwrite_output_dir=True,
num_train_epochs=3,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
learning_rate=2e-4,
warmup_steps=50,
weight_decay=0.01,
logging_steps=10,
eval_strategy="epoch",
save_strategy="epoch",
save_total_limit=2,
fp16=torch.cuda.is_available(),
report_to="none", # Disable wandb/tensorboard
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
greater_is_better=False,
)
data_collator = DataCollatorForLanguageModeling(
tokenizer=tokenizer,
mlm=False, # Causal LM, not masked LM
)
# Custom callback for colored output
from transformers import TrainerCallback
class ColoredLoggingCallback(TrainerCallback):
"""Custom callback for beautiful colored training logs."""
def on_log(self, args, state, control, logs=None, **kwargs):
if logs is None:
return
step = state.global_step
epoch = logs.get("epoch", 0)
if "loss" in logs:
loss = logs["loss"]
lr = logs.get("learning_rate", 0)
# Color code loss: green if low, yellow if medium, red if high
if loss < 2.0:
loss_color = GREEN
elif loss < 4.0:
loss_color = YELLOW
else:
loss_color = RED
print(f" {DIM}Step {step:>4}{RESET} โ "
f"Epoch {CYAN}{epoch:.2f}{RESET} โ "
f"Loss {loss_color}{BOLD}{loss:.4f}{RESET} โ "
f"LR {DIM}{lr:.2e}{RESET}")
if "eval_loss" in logs:
eval_loss = logs["eval_loss"]
perplexity = math.exp(eval_loss) if eval_loss < 100 else float("inf")
print(f"\n {MAGENTA}{BOLD}๐ Evaluation:{RESET} "
f"Loss = {YELLOW}{eval_loss:.4f}{RESET}, "
f"Perplexity = {YELLOW}{perplexity:.2f}{RESET}\n")
print(f" {GREEN}Training Configuration:{RESET}")
print(f" {DIM}โโ{RESET} Epochs: {YELLOW}3{RESET}")
print(f" {DIM}โโ{RESET} Batch size: {YELLOW}4{RESET}")
print(f" {DIM}โโ{RESET} Learning rate: {YELLOW}2e-4{RESET}")
print(f" {DIM}โโ{RESET} Warmup steps: {YELLOW}50{RESET}")
print(f" {DIM}โโ{RESET} Logging every: {YELLOW}10 steps{RESET}")
print(f" {DIM}โโ{RESET} Device: {YELLOW}{'CUDA (GPU) ๐' if torch.cuda.is_available() else 'CPU ๐ป'}{RESET}")
print(f" {DIM}โโ{RESET} Output: {YELLOW}{OUTPUT_DIR}{RESET}")
print()
print(f" {CYAN}{BOLD}{'โ' * 55}{RESET}")
print(f" {CYAN}{BOLD} Training Progress{RESET}")
print(f" {CYAN}{BOLD}{'โ' * 55}{RESET}\n")
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=val_dataset,
data_collator=data_collator,
callbacks=[ColoredLoggingCallback()],
)
# Train!
train_result = trainer.train()
print(f"\n {CYAN}{BOLD}{'โ' * 55}{RESET}")
print(f" {GREEN}{BOLD}✅ Training Complete!{RESET}")
print(f" {CYAN}{BOLD}{'โ' * 55}{RESET}\n")
# Print training summary
metrics = train_result.metrics
print(f" {GREEN}Training Summary:{RESET}")
# global_step is the number of optimizer steps; 'total_flos' would report floating-point operations instead
print(f" {DIM}├─{RESET} Total steps: {YELLOW}{train_result.global_step}{RESET}")
# Default to NaN so the :.4f format cannot fail on a missing metric
print(f" {DIM}├─{RESET} Training loss: {YELLOW}{metrics.get('train_loss', float('nan')):.4f}{RESET}")
print(f" {DIM}├─{RESET} Training time: {YELLOW}{metrics.get('train_runtime', 0):.1f}s{RESET}")
samples_per_sec = metrics.get('train_samples_per_second', 0)
print(f" {DIM}└─{RESET} Samples/sec: {YELLOW}{samples_per_sec:.2f}{RESET}")
# โโโ Step 5: Save Model โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
print(f"\n{BLUE}{BOLD}[Step 5/5]{RESET} Saving fine-tuned model\n")
# Save LoRA adapter
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
# Calculate saved model size
total_size = 0
for f_name in os.listdir(OUTPUT_DIR):
f_path = os.path.join(OUTPUT_DIR, f_name)
if os.path.isfile(f_path):
total_size += os.path.getsize(f_path)
print(f" {GREEN}โ LoRA adapter saved to:{RESET} {DIM}{OUTPUT_DIR}{RESET}")
print(f" {GREEN}โ Tokenizer saved to:{RESET} {DIM}{OUTPUT_DIR}{RESET}")
print(f" {GREEN}โ Adapter size:{RESET} {YELLOW}{total_size / 1024:.1f} KB{RESET}")
print(f" {DIM} โ The adapter is tiny because LoRA only saves the changed parameters!{RESET}")
# โโโ Final Message โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
print(f"""
{GREEN}{BOLD}{'โ' * 55}
๐ Your model is ready! Run chat.py to talk to it.
Commands:
python chat.py โ Start chatting
python chat.py --compare โ Compare base vs fine-tuned
{'โ' * 55}{RESET}
""")
if __name__ == "__main__":
main()
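The logging callback in finetune.py reports perplexity alongside the evaluation loss. As a brief illustration of that relationship, perplexity is the exponential of the mean cross-entropy loss. The loss values below are made up, and the standalone function is a sketch rather than part of the script.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """exp(loss), guarded against overflow for very large losses."""
    return math.exp(mean_cross_entropy) if mean_cross_entropy < 100 else float("inf")


# Made-up loss values, for illustration only.
for loss in (4.0, 2.0, 1.0):
    print(f"loss={loss:.1f} -> perplexity={perplexity(loss):.2f}")
```

A perplexity of about 7.4 (loss 2.0) means the model is, on average, as uncertain as if it were choosing uniformly among roughly seven tokens at each step.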
Complete Code: chat.py
Python
#!/usr/bin/env python3
"""
๐ฌ Interactive Chat with Your Fine-Tuned AI
=============================================
Chat with the DistilGPT-2 model fine-tuned on education Q&A data.
Supports temperature control, comparison mode, and beautiful terminal UI.
Part of: ๐ง Build Your Own AI โ From Zero to ChatBot (Level 5)
"""
import os
import sys
# โโโ ANSI Color Codes โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
CYAN = "\033[96m"
GREEN = "\033[92m"
YELLOW = "\033[93m"
RED = "\033[91m"
MAGENTA = "\033[95m"
BLUE = "\033[94m"
BOLD = "\033[1m"
DIM = "\033[2m"
RESET = "\033[0m"
BG_CYAN = "\033[46m"
BG_GREEN = "\033[42m"
WHITE = "\033[97m"
# โโโ Paths (relative to this script) โโโโโโโโโโโโโโโโโโโโโโโโโโโ
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
MODEL_DIR = os.path.join(SCRIPT_DIR, "output", "edu-chatbot-lora")
BASE_MODEL = "distilgpt2"
def print_welcome():
"""Print a beautiful welcome banner with ASCII art."""
banner = f"""
{CYAN}{BOLD}
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ
โ โโโโโโโโโโ โโโ โโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโ โโโโโโโโโโ
โ โโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โโโ โโโโโโโโโโโโโโโโ โโโ โโโโโโโโโโโ โโโ โโโ โ
โ โโโ โโโโโโโโโโโโโโโโ โโโ โโโโโโโโโโโ โโโ โโโ โ
โ โโโโโโโโโโโ โโโโโโ โโโ โโโ โโโโโโโโโโโโโโโโโ โโโ โ
โ โโโโโโโโโโ โโโโโโ โโโ โโโ โโโโโโโ โโโโโโโ โโโ โ
โ โ
โ ๐ง Your AI Education Assistant ๐ โ
โ Fine-tuned on Indian education Q&A โ
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ{RESET}
{YELLOW}{BOLD}Commands:{RESET}
{DIM}โโ{RESET} {CYAN}quit{RESET} Exit the chat
{DIM}โโ{RESET} {CYAN}temp 0.8{RESET} Set temperature (0.1 = focused, 1.5 = creative)
{DIM}โโ{RESET} {CYAN}compare{RESET} Compare base model vs fine-tuned model
{DIM}โโ{RESET} {CYAN}help{RESET} Show this help message
{DIM}โโ{RESET} {CYAN}clear{RESET} Clear the screen
{GREEN}Ask me anything about Science, Math, Indian History, or Study Tips!{RESET}
{DIM}{'โ' * 60}{RESET}
"""
print(banner)
def print_help():
"""Print help message with available commands."""
print(f"""
{YELLOW}{BOLD}๐ Available Commands:{RESET}
{DIM}โโ{RESET} {CYAN}quit / exit{RESET} Exit the chat
{DIM}โโ{RESET} {CYAN}temp <value>{RESET} Set generation temperature (default: 0.7)
{DIM}โ{RESET} {DIM}Low (0.1-0.3) = deterministic, focused answers{RESET}
{DIM}โ{RESET} {DIM}Med (0.5-0.8) = balanced, natural responses{RESET}
{DIM}โ{RESET} {DIM}High (1.0-1.5) = creative, varied outputs{RESET}
{DIM}โโ{RESET} {CYAN}compare{RESET} Compare base vs fine-tuned model responses
{DIM}โโ{RESET} {CYAN}help{RESET} Show this help message
{DIM}โโ{RESET} {CYAN}clear{RESET} Clear the screen
""")
def load_model(model_path, base_model_name):
    """Load the fine-tuned model (base + LoRA adapter) and an untouched base model."""
    try:
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer
        from peft import PeftModel
    except ImportError as e:
        print(f"{RED}{BOLD}✗ Error:{RESET} Missing dependency: {e}")
        print(f"  {YELLOW}Run: pip install transformers peft torch{RESET}")
        return None, None, None
    # Check if fine-tuned model exists
    if not os.path.exists(model_path):
        print(f"{RED}{BOLD}✗ Fine-tuned model not found!{RESET}")
        print(f"  {DIM}Expected at: {model_path}{RESET}")
        print(f"\n  {YELLOW}You need to train the model first:{RESET}")
        print(f"  {CYAN}  1. python prepare_data.py{RESET}")
        print(f"  {CYAN}  2. python finetune.py{RESET}")
        print(f"  {CYAN}  3. python chat.py  (then come back here!){RESET}")
        return None, None, None
    print(f"  {DIM}Loading base model ({base_model_name})...{RESET}")
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
        base_model.eval()
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        print(f"  {DIM}Applying LoRA adapter...{RESET}")
        # PeftModel.from_pretrained injects the LoRA layers into the model it
        # is given, so wrap a second copy and keep base_model unmodified for
        # the "compare" command.
        adapter_host = AutoModelForCausalLM.from_pretrained(base_model_name)
        model = PeftModel.from_pretrained(adapter_host, model_path)
        model.eval()
        return model, tokenizer, base_model
    except Exception as e:
        print(f"{RED}{BOLD}✗ Error loading model:{RESET} {e}")
        print(f"  {YELLOW}The model files may be corrupted. Try running finetune.py again.{RESET}")
        return None, None, None
def generate_response(model, tokenizer, prompt, temperature=0.7, max_new_tokens=200,
                      top_k=50, top_p=0.9):
    """Generate a response from the model."""
    import torch
    # Format the prompt with the same template used during training
    formatted_prompt = f"### Question:\n{prompt}\n\n### Answer:\n"
    inputs = tokenizer(formatted_prompt, return_tensors="pt", truncation=True, max_length=256)
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]
    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,  # Keyword form also works with older PEFT wrappers
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            temperature=max(temperature, 0.01),  # Avoid division by zero
            top_k=top_k,
            top_p=top_p,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2,
            no_repeat_ngram_size=3,
        )
    # Decode only the generated part (everything after the prompt tokens)
    generated_ids = outputs[0][input_ids.shape[1]:]
    response = tokenizer.decode(generated_ids, skip_special_tokens=True)
    # Clean up the response
    response = response.strip()
    # Cut the text off if the model starts a new question or answer block
    for stop_marker in ["### Question:", "### Answer:", "\n\n\n"]:
        if stop_marker in response:
            response = response[:response.index(stop_marker)].strip()
    return response
def compare_models(base_model, finetuned_model, tokenizer, prompt, temperature=0.7):
"""Compare responses from base model vs fine-tuned model."""
print(f"\n {MAGENTA}{BOLD}๐ฌ Comparison Mode{RESET}")
print(f" {MAGENTA}{'โ' * 55}{RESET}")
print(f" {DIM}Prompt: \"{prompt}\"{RESET}\n")
# Base model response
print(f" {RED}{BOLD}โโ ๐ Base Model (distilgpt2 โ no fine-tuning){RESET}")
print(f" {RED}{BOLD}โ{RESET}")
base_response = generate_response(base_model, tokenizer, prompt, temperature)
for line in base_response.split("\n"):
print(f" {RED}โ{RESET} {DIM}{line}{RESET}")
print(f" {RED}{BOLD}โ{'โ' * 50}{RESET}\n")
# Fine-tuned model response
print(f" {GREEN}{BOLD}โโ ๐ Fine-Tuned Model (trained on education Q&A){RESET}")
print(f" {GREEN}{BOLD}โ{RESET}")
ft_response = generate_response(finetuned_model, tokenizer, prompt, temperature)
for line in ft_response.split("\n"):
print(f" {GREEN}โ{RESET} {line}")
print(f" {GREEN}{BOLD}โ{'โ' * 50}{RESET}\n")
print(f" {YELLOW}๐ก Notice the difference? The fine-tuned model gives more relevant,")
print(f" education-focused answers!{RESET}\n")
def main():
"""Main interactive chat loop."""
print_welcome()
# โโโ Load Model โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
print(f" {BLUE}{BOLD}๐ Loading your fine-tuned AI model...{RESET}\n")
finetuned_model, tokenizer, base_model = load_model(MODEL_DIR, BASE_MODEL)
if finetuned_model is None:
print(f"\n {RED}Cannot start chat without a trained model.{RESET}")
sys.exit(1)
print(f"\n {GREEN}{BOLD}✅ Model loaded successfully!{RESET}")
print(f" {DIM}{'โ' * 60}{RESET}\n")
# โโโ Chat Settings โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
temperature = 0.7
compare_prompt = None
# โโโ Chat Loop โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
while True:
try:
# Get user input with colored prompt
user_input = input(f" {CYAN}{BOLD}You > {RESET}").strip()
if not user_input:
continue
# โโโ Handle Commands โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
lower_input = user_input.lower()
# Quit command
if lower_input in ("quit", "exit", "q"):
print(f"\n {YELLOW}{BOLD}๐ Goodbye! Keep learning and exploring AI!{RESET}")
print(f" {DIM}\"The best way to understand AI is to build one yourself.\"{RESET}\n")
break
# Help command
if lower_input == "help":
print_help()
continue
# Clear command
if lower_input == "clear":
os.system("cls" if os.name == "nt" else "clear")
print_welcome()
continue
# Temperature command
if lower_input.startswith("temp "):
try:
new_temp = float(lower_input.split()[1])
if 0.01 <= new_temp <= 2.0:
temperature = new_temp
# Describe the temperature
if new_temp < 0.3:
desc = "very focused & deterministic"
elif new_temp < 0.6:
desc = "balanced & reliable"
elif new_temp < 1.0:
desc = "natural & varied"
else:
desc = "creative & experimental"
print(f" {GREEN}๐ก๏ธ Temperature set to {BOLD}{temperature}{RESET}{GREEN} ({desc}){RESET}\n")
else:
print(f" {RED}โ Temperature must be between 0.01 and 2.0{RESET}\n")
except (ValueError, IndexError):
print(f" {RED}โ Usage: temp 0.8{RESET}\n")
continue
# Compare command
if lower_input == "compare":
compare_input = input(f" {MAGENTA}Enter a question to compare > {RESET}").strip()
if compare_input:
compare_models(base_model, finetuned_model, tokenizer, compare_input, temperature)
else:
print(f" {YELLOW}โ Please enter a question for comparison.{RESET}\n")
continue
# โโโ Generate Response โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
print(f" {DIM}Thinking...{RESET}", end="\r")
response = generate_response(
finetuned_model, tokenizer, user_input,
temperature=temperature
)
# Clear "Thinking..." and print response
print(f" {' ' * 30}", end="\r") # Clear line
if response:
print(f" {GREEN}{BOLD}AI > {RESET}{GREEN}{response}{RESET}\n")
else:
print(f" {YELLOW}AI > {DIM}(The model didn't generate a response. Try rephrasing your question.){RESET}\n")
except KeyboardInterrupt:
print(f"\n\n {YELLOW}{BOLD}๐ Goodbye! (Ctrl+C detected){RESET}\n")
break
except EOFError:
print(f"\n\n {YELLOW}{BOLD}๐ Goodbye!{RESET}\n")
break
except Exception as e:
print(f" {RED}โ Error: {e}{RESET}\n")
if __name__ == "__main__":
main()
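The generate_response function passes top_k and top_p to the model alongside the temperature. The sketch below illustrates the idea behind those two filters on a made-up five-token distribution. It is a simplified pure-Python illustration and not the Transformers library's implementation; the function name is hypothetical.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.9):
    """Keep the top_k most likely tokens, renormalize, then keep the smallest
    prefix whose cumulative probability reaches top_p, and renormalize again."""
    # Step 1: top-k. Rank token ids by probability and keep the best top_k.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]
    k_total = sum(p for _, p in ranked)

    # Step 2: top-p (nucleus). Walk down the ranking until enough mass is covered.
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p / k_total
        if cumulative >= top_p:
            break

    # Step 3: renormalize so the surviving probabilities sum to 1.
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}


probs = [0.5, 0.3, 0.1, 0.05, 0.05]  # made-up next-token probabilities
filtered = top_k_top_p_filter(probs, top_k=4, top_p=0.85)
print({token_id: round(p, 3) for token_id, p in filtered.items()})
```

Sampling then draws only from the surviving tokens, which removes the long tail of unlikely tokens that tends to produce incoherent text at higher temperatures.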
Looking Ahead
The future of AI and your next steps
The Future of AI
"The best way to predict the future is to invent it." - Alan Kay
Learning Objectives
- Understand where AI technology is heading in the next 5-10 years
- Know about the major research frontiers in language models
- Appreciate the role AI will play in Indian education
- Think critically about the ethical implications of AI
- Have a clear roadmap for your own AI learning journey
7.1 You've Come a Long Way!
Let's take a moment to appreciate what you've accomplished:
| Chapter | What You Built | Key Concept |
|---|---|---|
| 2 | Bigram text generator | Prediction = counting patterns |
| 3 | Neural network from scratch | Learning = adjusting weights |
| 4 | Transformer & self-attention | Attention = understanding context |
| 5 | Mini-GPT trained on stories | Language model = next token prediction |
| 6 | Fine-tuned chatbot | Fine-tuning = specializing a pre-trained model |
You now understand the core pipeline behind assistants such as ChatGPT, Claude, and Gemini: next-token prediction, the Transformer, pre-training, and fine-tuning.
7.2 Where AI is Heading
7.2.1 Bigger Models, Smarter Reasoning
The trend in AI is clear: scale brings capabilities.
GPT-2 (2019): 1.5 billion parameters
GPT-3 (2020): 175 billion parameters
GPT-4 (2023): parameter count not disclosed by OpenAI → Could reason, analyze, create
GPT-5+ (2025+): ??? parameters → ???
But it's not just about size. The frontier is moving toward:
- Chain-of-Thought Reasoning: Models that "think step by step" before answering (like you saw in Claude's thinking mode!)
- Tool Use: Models that can search the web, run code, and use calculators, not just generate text
- Multimodal AI: Models that see images, hear audio, AND process text simultaneously
- Long Context: From 4K tokens to 1M+ tokens, so models can now read entire books at once!
7.2.2 Smaller, Faster, On-Device Models
The opposite trend is also happening:
- Quantization: Compressing models to run on phones and laptops
- Distillation: Training small models to mimic large ones
- Edge AI: Running models directly on your device without internet
- Gemma, Phi, LLaMA: Open-source models small enough for your laptop
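To make quantization concrete, here is a minimal sketch of symmetric 8-bit quantization in pure Python. The helper names `quantize_int8` and `dequantize` and the sample weights are made up for illustration; real toolchains use more elaborate schemes (per-channel scales, 4-bit formats, calibration data).

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats to integers in [-127, 127]."""
    largest = max(abs(w) for w in weights)
    # One shared scale for the whole list; fall back to 1.0 for all-zero input.
    scale = largest / 127 if largest > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.0, 0.4]  # hypothetical float weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("integers:", q)
print("restored:", [round(w, 3) for w in restored])
```

Each weight now fits in one byte instead of the four bytes of a 32-bit float, at the price of a small rounding error (at most half of `scale` per weight in this scheme).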
Tip
Key Insight: The future isn't just "bigger is better". It's "smart enough for the task, small enough for the device."
7.2.3 AI Agents
The next big leap is from chat models to AI agents:
Today: You ask ChatGPT a question → It gives an answer
Tomorrow: You tell an AI agent a goal → It plans, acts, and delivers
Example:
"Plan a 3-day trip to Rajasthan for my family of 4,
budget ₹50,000, book hotels, and create an itinerary."
The agent would:
1. Research destinations
2. Compare hotel prices
3. Book rooms
4. Create a day-by-day plan
5. Send you a WhatsApp summary
7.3 AI in Indian Education
7.3.1 The Opportunity
India has:
- 250+ million students in schools
- 1.5 million schools, many with teacher shortages
- 22 scheduled languages to teach in
- Vast rural-urban education gap
AI can help address ALL of these challenges:
| Challenge | AI Solution |
|---|---|
| Teacher shortage | AI tutors that explain concepts 24/7 |
| Language barrier | Real-time translation to any Indian language |
| Quality gap | Same quality of education in Delhi and a village in Bihar |
| Personalization | Each student learns at their own pace |
| Assessment | Instant, detailed feedback on assignments |
7.3.2 What's Already Happening
- DIKSHA (by NCERT): AI-powered learning platform for Indian students
- Byju's, Vedantu: Personalized learning using AI recommendations
- Google Translate: Now handles Hindi, Tamil, Bengali, and more
- Bhashini: Government of India's AI translation platform for all 22 scheduled languages
- ChatGPT/Gemini in Hindi: Students using AI assistants in their own language
7.3.3 What You Could Build
With what you've learned in this book, you could build:
- A Subject Tutor Bot: Fine-tune a model on Class 6-10 NCERT content
- A Question Paper Generator: Train on past papers to generate new questions
- A Doubt-Solver: A chatbot that explains concepts in simple language
- A Language Tutor: Practice English speaking with an AI partner
- A Study Planner: AI that creates personalized study schedules
Important
You have the skills now! Level 5 taught you how to fine-tune models. You can create specialized education AI tools for Indian students TODAY.
7.4 The Ethics of AI
As someone who now understands HOW AI works, you have a responsibility to think about these issues:
7.4.1 Bias in AI
AI models learn from data. If the data has biases, the model will too.
Example: If an AI is trained mostly on English text from Western countries, it might:
- Not understand Indian cultural context
- Give advice that doesn't apply to Indian families
- Reinforce stereotypes about gender, caste, religion
What you can do:
- Fine-tune models on diverse, inclusive data
- Always test your models for bias
- Include data from multiple Indian languages and cultures
7.4.2 AI and Jobs
Common fear: "AI will take all our jobs!"
Reality: AI will change jobs, not eliminate all of them.
| Jobs at risk | Jobs that will grow | Jobs AI can't do |
|---|---|---|
| Data entry | AI trainers | Creative leadership |
| Basic translation | Prompt engineers | Emotional support |
| Simple coding | AI ethics experts | Complex problem-solving |
| Form filling | AI-augmented teachers | Physical craftsmanship |
| | Healthcare + AI | Community building |
Note
Think About It: The printing press didn't eliminate writers; it created millions more. AI won't eliminate thinkers; it will empower them.
7.4.3 AI Safety
As AI gets more powerful, we need to think about:
- Misinformation: AI can generate fake but convincing text, images, videos
- Privacy: AI models trained on personal data
- Dependence: Over-relying on AI for critical thinking
- Access: Ensuring AI benefits aren't limited to rich countries/people
7.4.4 The Responsible AI Developer
As someone building AI, follow these principles:
- Transparency: Be clear about what your AI can and cannot do
- Fairness: Test for biases across genders, languages, communities
- Privacy: Don't train on personal data without consent
- Safety: Add guardrails to prevent harmful outputs
- Accessibility: Build for all users, including those with disabilities
7.5 Your Learning Roadmap: What's Next?
You've completed this book. Here's where to go next:
🟢 Immediate Next Steps (This Week)
- Experiment with your Mini-GPT: Try different training data, model sizes, hyperparameters
- Fine-tune on YOUR data: Create a chatbot for your specific use case
- Share your work: Show your friends and teachers what you built!
🟡 Short-Term Goals (Next 1-3 Months)
| Goal | How |
|---|---|
| Learn PyTorch deeply | PyTorch tutorials |
| Understand larger models | Study nanoGPT by Andrej Karpathy |
| Try bigger open-source models | Fine-tune Gemma 2B or Phi-3 |
| Learn about vision models | Explore CLIP and image generation |
| Build a real project | Create an AI tool for your school or community |
🟠 Medium-Term Goals (3-6 Months)
- Take structured courses:
  - fast.ai: Practical Deep Learning (FREE)
  - CS229 Stanford: Machine Learning theory
  - Andrej Karpathy's YouTube: Build GPT from scratch
- Read key papers:
  - "Attention Is All You Need" (2017): The Transformer
  - "Language Models are Few-Shot Learners" (2020): GPT-3
  - "Training Language Models to Follow Instructions" (2022): InstructGPT/RLHF
  - "LoRA: Low-Rank Adaptation" (2021): Efficient fine-tuning
- Contribute to open source:
  - Hugging Face Transformers
  - Indian language NLP projects
  - AI4Bharat (Indian language AI)
🔴 Long-Term Vision (6+ Months)
- Specialize in one area:
  - NLP (language models, chatbots)
  - Computer Vision (image recognition, generation)
  - Reinforcement Learning (game AI, robotics)
  - AI Safety & Ethics
- Build something impactful:
  - An education tool for rural Indian schools
  - A healthcare assistant in Indian languages
  - A farming advisory AI for Indian farmers
  - A legal aid chatbot for common citizens
- Consider AI research:
  - Apply to IIT, IIIT, or ISI for AI/ML programs
  - Look into Google AI India, Microsoft Research India
  - Contribute to cutting-edge research papers
7.6 Final Words
When you started this book, AI seemed like magic, something only Google and OpenAI could build.
Now you know the truth:
AI is not magic. It's math, patterns, and a lot of training data.
You've built every piece yourself:
- A model that counts patterns (Chapter 2)
- A network that learns from mistakes (Chapter 3)
- An attention mechanism that understands context (Chapter 4)
- A GPT that generates coherent text (Chapter 5)
- A fine-tuned chatbot that answers questions (Chapter 6)
You are no longer just a user of AI. You are a builder.
The world needs more people like you: people who understand how AI works, who can build it responsibly, and who can use it to solve real problems.
India, with its 1.4 billion people, 22 languages, and incredible diversity, needs AI solutions built BY Indians, FOR Indians.
You have the knowledge. You have the tools. Now go build something amazing.
Discussion Questions
If you could build any AI tool for India, what would it be and why?
Do you think AI will ever truly "understand" language, or will it always be "just predicting the next word"? What's the difference?
How should India approach AI regulation: strict rules like the EU, or open innovation like the US?
If AI can write essays, code, and solve math problems, what should schools focus on teaching?
You've built a language model from scratch. Does knowing how it works change how you interact with ChatGPT and similar tools?
Key Concepts Summary
| Chapter | Core Insight |
|---|---|
| 1 | AI is not magic; it's math and patterns |
| 2 | The simplest AI just counts what comes after what |
| 3 | Neural networks learn by adjusting weights to reduce error |
| 4 | Attention lets models understand which parts of input matter |
| 5 | GPT = Transformer + next-token prediction + lots of data |
| 6 | Fine-tuning = specializing a pre-trained model for your task |
| 7 | The future is yours to build! |
"I hear and I forget. I see and I remember. I do and I understand."
You didn't just hear about AI. You didn't just see AI. You built AI. You understand AI.
Congratulations on completing this journey!
7.7 Production AI: Key Terms Explained
As you move from building toy models to understanding production AI systems like ChatGPT, Claude, and Gemini, you'll encounter these critical terms:
Agentic Frameworks
What it means: AI systems that don't just answer one question; they take multiple autonomous steps to complete a complex task.
Comparison
Simple AI: User asks → AI answers → Done
Agentic AI: User asks → AI plans → reads files → writes code →
tests it → fixes bugs → reports back → Done
Example: When an AI reads 10 markdown files, converts them to HTML, updates a React component, and starts a dev server, that's agentic behavior: multi-step workflows with tool use, planning, and self-correction.
Why It Matters
Agentic AI is the frontier of AI development in 2024-25. Companies like Google (Gemini), Anthropic (Claude), and OpenAI (GPT) are racing to build agents that can autonomously code, research, and build entire applications.
Multi-file Code Understanding
What it means: The AI can understand how multiple files relate to each other (imports, function calls, data flow, and architectural patterns) across an entire codebase.
Example
Not just: "What does model.py do?"
But: "train.py imports MiniGPT from model.py,
trains it on stories.txt data,
saves checkpoints to disk,
and generate.py loads those checkpoints
to run interactive chat."
Real projects have 10 to 1,000+ files. Understanding one file in isolation is rarely enough; you need to understand how the whole system fits together.
Low Hallucination Rates
What it means: The AI doesn't make things up. When unsure, it acknowledges uncertainty instead of confidently generating false information.
| Type | Example | Problem |
|---|---|---|
| Hallucination ❌ | "Python was created in 1985" | Wrong year (it was 1991) |
| Low Hallucination ✅ | "Python was created in 1991" | Correct fact |
| Honest Uncertainty ✅ | "I'm not certain; please verify" | Transparent about limits |
Critical for Education
In education, wrong answers are worse than no answer. A tutor that confidently tells a student "water boils at 90°C" is dangerous. Low hallucination rates are essential for any educational AI.
Prompt Caching
What it means: When you send the same context repeatedly (like a system prompt or a large document), the provider reuses its already-processed form instead of re-processing it on every request. This can save up to 90% of input costs on the cached portion.
| Scenario | Tokens Processed | Cost |
|---|---|---|
| Without caching, Request 1 | [50K system prompt] + "What is AI?" | ₹5.00 |
| Without caching, Request 2 | [50K system prompt] + "What is ML?" | ₹5.00 |
| Without caching, Request 3 | [50K system prompt] + "What is DL?" | ₹5.00 |
| Total without caching | | ₹15.00 |
| With caching, Request 1 | [50K system prompt] + "What is AI?" | ₹5.00 |
| With caching, Request 2 | [CACHED ✅] + "What is ML?" | ₹0.50 |
| With caching, Request 3 | [CACHED ✅] + "What is DL?" | ₹0.50 |
| Total with caching | | ₹6.00 (60% savings!) |
How It Works
The AI provider stores computed representations (KV-cache) of your repeated prefix. On subsequent requests, it skips re-computing those tokens and charges only for the new portion. Especially powerful for chatbots with long system prompts or RAG pipelines.
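The arithmetic in the table can be reproduced in a few lines. This is a minimal sketch using the table's illustrative rupee figures; the function name `total_cost` and the flat per-request prices are assumptions for illustration, not any provider's actual billing formula.

```python
def total_cost(num_requests, full_cost, cached_cost=None):
    """Total cost of num_requests calls that share the same long prefix.

    full_cost   -- cost of a request whose prefix is processed from scratch
    cached_cost -- cost of a request whose prefix is served from the cache
                   (None means caching is off, so every request pays full_cost)
    """
    if num_requests <= 0:
        return 0.0
    if cached_cost is None:
        return num_requests * full_cost
    # The first request always pays full price because it fills the cache.
    return full_cost + (num_requests - 1) * cached_cost


without_cache = total_cost(3, full_cost=5.00)
with_cache = total_cost(3, full_cost=5.00, cached_cost=0.50)
savings_pct = 100 * (without_cache - with_cache) / without_cache

print(f"Without caching: ₹{without_cache:.2f}")
print(f"With caching:    ₹{with_cache:.2f}")
print(f"Savings:         {savings_pct:.0f}%")
```

With only three requests the overall saving is 60%, because the first request still pays full price. As the number of requests sharing the prefix grows, the overall saving moves toward the 90% per-request discount.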
Safety & Fallbacks
What it means: Built-in guardrails that detect dangerous queries (bioweapons, hacking, harmful content) and either refuse or transparently redirect to a safer model.
How It Works
Student asks: "Explain cell division in biology"
Result: ✅ Normal answer, no safety concerns
Attacker asks: "How to create a dangerous pathogen"
Result: 🛑 Safety filter triggered!
→ Query blocked or rerouted to safe model
→ User NOT charged premium rates
→ Incident logged for review
Production AI systems have multiple safety layers:
- Input filtering: classify incoming queries before processing
- Output filtering: scan generated responses before delivery
- Constitutional AI: model trained to self-evaluate and refuse harmful requests
- Human review: flagged interactions reviewed by safety teams
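The first two layers in the list above (input and output filtering) can be sketched as a wrapper around a model call. This is a toy illustration only: the keyword list, the function names, and `toy_model` are all hypothetical, and production systems rely on trained classifiers rather than keyword matching.

```python
BLOCKED_TOPICS = ("pathogen", "bioweapon", "explosive")  # hypothetical placeholder list
REFUSAL = "Sorry, I can't help with that request."


def looks_safe(text):
    """Toy check: True if none of the blocked topics appear in the text."""
    lowered = text.lower()
    return not any(topic in lowered for topic in BLOCKED_TOPICS)


def answer_safely(query, model_fn):
    """Run a query through an input filter, the model, then an output filter."""
    if not looks_safe(query):
        return REFUSAL  # blocked before the model ever runs
    response = model_fn(query)
    if not looks_safe(response):
        return REFUSAL  # blocked after generation, before delivery
    return response


def toy_model(query):
    """Stand-in for a real model call; it just echoes the question."""
    return f"Here is an explanation of: {query}"


print(answer_safely("Explain cell division in biology", toy_model))
print(answer_safely("How to create a dangerous pathogen", toy_model))
```

Keeping the filters outside the model means they still run even if a clever prompt changes how the model itself behaves.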
Jailbreaking
What it means: Techniques to bypass AI safety filters and trick the model into producing content it's designed to refuse.
| Type | Technique | Example |
|---|---|---|
| Role-play attack | Ask AI to "pretend" to be unrestricted | "You are DAN (Do Anything Now)..." |
| Encoding trick | Encode harmful requests in code/base64 | Obfuscating the real intent |
| Prompt injection | Override system instructions via user input | "Ignore all previous instructions..." |
| Indirect attack | Slowly escalate through innocent-seeming steps | Gradual boundary pushing |
Defenses against jailbreaking:
- Multi-layer safety checks that are harder to fool with simple prompt tricks
- Constitutional AI: model trained to recognize and refuse manipulation
- Input/output filtering independent of the model itself
- Red-teaming: dedicated teams that try to break the system to find vulnerabilities
- Continuous updates: safety systems updated as new attack vectors are discovered
Why This Matters for Education
When deploying AI chatbots for students, jailbreaking resistance is critical. Students are naturally curious, and they WILL try to make the bot say unexpected things. Your safety layers must be robust enough to handle this while still being helpful for genuine educational questions.
Key Terms Summary
| Term | One-Line Meaning | Book Connection |
|---|---|---|
| Agentic Framework | AI that takes multiple autonomous steps | Ch 8 → Level 7 (AI Agents) |
| Multi-file Understanding | AI reads entire codebases, not just one file | How this book was built! |
| Low Hallucination | AI doesn't confidently make things up | Ch 8 → Safety & Guardrails |
| Prompt Caching | Reuse repeated context, save up to 90% of input cost | API cost optimization |
| Safety & Fallbacks | Block dangerous queries, redirect safely | Ch 8 → Safety & Guardrails |
| Jailbreaking | Tricks to bypass AI safety rules | Ch 6 → Ethics of Fine-Tuning |
7.8 How to Compare AI Models: Key Metrics Explained
When choosing an AI model for your project, you'll see comparison tables with metrics like cost, context, and benchmarks. Here's what each column really means:
The Comparison Table Columns
| Metric | What It Means | Why It Matters |
|---|---|---|
| Model | Name and version of the AI | Different models have different strengths |
| Best for | Primary use case | Coding vs writing vs reasoning vs chat |
| Input / MTok | Cost per 1 million input tokens | How much you pay to SEND text to the model |
| Output / MTok | Cost per 1 million output tokens | How much you pay for the model's RESPONSE |
| Context | Max tokens it can read at once | How much text it can "see" simultaneously |
| Max output | Max tokens in one response | How long its single reply can be |
| SWE-bench | Software Engineering benchmark | How well it writes and fixes real code |
Input / MTok & Output / MTok (Cost)
MTok = Million Tokens ≈ 750,000 words ≈ 750 pages of dense text (about 1,000 words per page, the page size the tables in this section assume)
Input cost: 5,000 tokens × ($3 / 1M tokens) = $0.015
Output cost: 2,000 tokens × ($15 / 1M tokens) = $0.030
Total: $0.045 per question (≈ ₹3.75)
| Model | Input / MTok | Output / MTok | ~Monthly (1,000 queries/day, about 1,000 input + 1,000 output tokens each) |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | ~$375 |
| Claude Sonnet 4 | $3.00 | $15.00 | ~$540 |
| Claude Opus 4 | $15.00 | $75.00 | ~$2,700 |
| Gemini 2.5 Pro | $1.25 | $10.00 | ~$338 |
| GPT-4o mini | $0.15 | $0.60 | ~$22 |
| DeepSeek V3 | $0.27 | $1.10 | ~$41 |
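The per-query and monthly figures can be checked with a short script. The prices are copied from the table above; the monthly column is consistent with roughly 1,000 input and 1,000 output tokens per query over a 30-day month, so those are the defaults here. The function names are illustrative, and real prices change often, so treat this as a sketch rather than a billing tool.

```python
PRICES_PER_MTOK = {
    # model: (input $ per million tokens, output $ per million tokens)
    "GPT-4o": (2.50, 10.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "Claude Opus 4": (15.00, 75.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "DeepSeek V3": (0.27, 1.10),
}


def query_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request for the given token counts."""
    input_price, output_price = PRICES_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def monthly_cost(model, queries_per_day=1000, input_tokens=1000,
                 output_tokens=1000, days=30):
    """Dollar cost of a month of traffic at a fixed request size."""
    return query_cost(model, input_tokens, output_tokens) * queries_per_day * days


# The worked example: 5,000 input tokens and 2,000 output tokens.
print(f"One query on Claude Sonnet 4: ${query_cost('Claude Sonnet 4', 5000, 2000):.3f}")

for model in PRICES_PER_MTOK:
    print(f"{model:<16} ${monthly_cost(model):>9,.2f} per month")
```

Changing `input_tokens` and `output_tokens` to match your own chatbot's typical request is the quickest way to get a realistic estimate.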
Rule of Thumb
Output tokens typically cost more than input tokens (4-8x in the table above) because output is generated one token at a time, with a forward pass for each new token, while the whole input can be processed in a single parallel pass.
Context Window
The context window determines how much text the model can "see" at once — think of it as the model's "desk size":
| Context Size | Equivalent | Models |
|---|---|---|
| 4K tokens | ≈3 pages | Old GPT-3.5 |
| 32K tokens | ≈24 pages | GPT-4 original |
| 128K tokens | ≈96 pages | GPT-4o |
| 200K tokens | ≈150 pages | Claude Sonnet, Claude Opus |
| 1M tokens | ≈750 pages | Gemini 2.5 Pro |
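For your project: Your Mini-GPT (Level 4) has a context of 256 characters. ChatGPT (GPT-4o) has 128,000 tokens: that's 500x more positions, and because one token covers roughly 4 characters of English, closer to 2,000x more text. For RAG (Chapter 8), a bigger context means more textbook pages per query, which usually means better answers.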
Max Output
How long the model's single response can be:
| Max Output | Equivalent | Can It... |
|---|---|---|
| 4K tokens | ≈3 pages | ✓ Answer questions, short essays |
| 8K tokens | ≈6 pages | ✓ Write detailed explanations |
| 16K tokens | ≈12 pages | ✓ Generate full chapters |
| 64K tokens | ≈48 pages | ✓ Write entire documents in one go |
Why Max Output Matters
If you ask a model with 4K max output to "write a 50-page book chapter" (65,000 tokens), it will get cut off mid-sentence. You'd need to generate it in multiple chunks. Models with larger max output (like 64K) can write much longer responses in a single call.
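A quick way to see how many chunks a long document needs is to divide its size by the output limit and round up. The helper name `calls_needed` is made up for this sketch, and it ignores the extra tokens you would spend re-sending context with each chunk.

```python
import math


def calls_needed(total_output_tokens, max_output_tokens):
    """Minimum number of model calls to produce total_output_tokens."""
    if max_output_tokens <= 0:
        raise ValueError("max_output_tokens must be positive")
    return math.ceil(total_output_tokens / max_output_tokens)


chapter_tokens = 65_000  # the 50-page chapter from the example above
for limit in (4_000, 8_000, 16_000):
    print(f"max output {limit:>6,} tokens -> {calls_needed(chapter_tokens, limit)} calls")
```

In practice each follow-up call also needs a short summary or the tail of the previous chunk in its prompt so the text stays coherent across chunk boundaries.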
SWE-bench (Software Engineering Benchmark)
What: Tests if AI can fix REAL bugs in real GitHub repositories. The model gets a bug report and must find the file, understand the code, and write a working fix.
| Score Range | Rating | What It Means |
|---|---|---|
| < 20% | Basic | Can write simple code snippets |
| 20-30% | Good | Can fix straightforward bugs |
| 30-40% | Strong | Can handle complex multi-file fixes |
| 40-50% | Excellent | Near human-level debugging |
| > 50% | Elite | Better than most junior developers |
| Model | SWE-bench Score | Rating |
|---|---|---|
| Claude Opus 4 | ~72% | ⭐ Elite |
| Claude Sonnet 4 | ~65% | ⭐ Elite |
| Gemini 2.5 Pro | ~63% | ⭐ Elite |
| DeepSeek V3 | ~42% | Excellent |
| GPT-4o | ~38% | Strong |
| GPT-4o mini | ~24% | Good |
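The rating bands from the score-range table can be written as a small lookup function. The function name is illustrative. The table does not say which band a score of exactly 30% or 40% belongs to, so placing those in the higher band is an assumption of this sketch.

```python
def swe_bench_rating(score_pct):
    """Map a SWE-bench score (in percent) to the rating bands used above."""
    if not 0 <= score_pct <= 100:
        raise ValueError("score must be between 0 and 100")
    if score_pct > 50:
        return "Elite"
    if score_pct >= 40:   # assumption: exactly 40 counts as Excellent
        return "Excellent"
    if score_pct >= 30:   # assumption: exactly 30 counts as Strong
        return "Strong"
    if score_pct >= 20:
        return "Good"
    return "Basic"


for score in (72, 42, 38, 24, 15):
    print(f"{score:>3}% -> {swe_bench_rating(score)}")
```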
Other Important Benchmarks
| Benchmark | What It Tests | Real-World Meaning |
|---|---|---|
| MMLU | General knowledge (57 subjects) | "How much does it know?" |
| HumanEval | Code generation from docstrings | "Can it write functions?" |
| MATH | Competition-level math problems | "Can it solve hard math?" |
| GPQA | PhD-level science questions | "How deep is its knowledge?" |
| Arena Elo | Human preference rankings | "Which model do people prefer?" |
| Aider | Code editing & multi-file changes | "Can it refactor a codebase?" |
How to Choose the Right Model
Decision Guide
| Your Priority | Best Model Choice | Why |
|---|---|---|
| Low Cost | GPT-4o mini, DeepSeek V3 | 10-50x cheaper than premium models |
| Best Reasoning | Claude Opus 4, o3 | Highest scores on complex reasoning |
| Best Coding | Claude Sonnet 4, Opus 4 | Highest SWE-bench scores |
| Long Documents | Gemini 2.5 Pro | 1M context: reads entire textbooks |
| Speed | GPT-4o mini, Claude Haiku | Fastest response times |
| Education Bot | Claude Sonnet 4 | Best quality-to-cost ratio |
| 🇮🇳 Indian Languages | Gemini 2.5 Pro | Best Hindi/Tamil/Telugu support |
| Privacy / Self-hosted | LLaMA, Mistral, DeepSeek | Run on your own servers |
For Your Chatbot
For the education chatbot you built in this book, DeepSeek V3 or GPT-4o mini would be the most cost-effective choices for deployment. If you need the best quality and can afford it, Claude Sonnet 4 gives excellent results at a reasonable price. For serving Indian language students, Gemini 2.5 Pro has the best multilingual support.
Reference Material
Math foundations, glossary, resources, and more
Appendix A: Mathematical Foundations
This appendix covers the key math concepts used throughout the book. You don't need to master all of this to understand the code, but it helps to know what's happening under the hood.
A.1 Vectors and Matrices
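A vector is an ordered list of numbers, for example $\mathbf{v} = [2, -1, 3]$.
A matrix is a 2D grid of numbers, for example $M = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$.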
In our models:
- Each character/word is represented as a vector (embedding)
- Weight matrices transform vectors from one representation to another
- Attention scores form a matrix showing which tokens attend to which
A.2 Dot Product
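The dot product of two vectors measures how "similar" they are:
$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$
Example: $[1, 2, 3] \cdot [4, 5, 6] = (1)(4) + (2)(5) + (3)(6) = 32$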
In attention: we use the dot product of Query and Key vectors to compute attention scores. A high dot product means "this query is very interested in this key."
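A small pure-Python sketch makes the "similarity" reading concrete. The vectors below are made-up examples, not values from the book's models.

```python
def dot(a, b):
    """Dot product of two equal-length vectors given as lists of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 0.0, 1.0]            # hypothetical query vector
key_similar = [0.9, 0.1, 0.8]      # points in nearly the same direction
key_different = [-1.0, 0.5, -0.7]  # points roughly the opposite way

print("similar key:  ", dot(query, key_similar))
print("different key:", dot(query, key_different))
```

The key that points the same way as the query gets a positive score and the opposing key gets a negative one, which is exactly the signal attention uses to decide where to look.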
A.3 Matrix Multiplication
When we multiply matrices, each element of the result is a dot product of a row of the first matrix with a column of the second:
$$C_{ij} = \sum_{k} A_{ik} B_{kj}$$
In our code, torch.matmul(Q, K.transpose(-2, -1)) computes attention scores by multiplying the Query matrix with the transposed Key matrix.
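For a single pair of small, unbatched matrices, the same computation can be written out with plain lists. The helper names and the numbers below are illustrative only; they show that every entry of the score matrix is one query-key dot product.

```python
def transpose(M):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    """Multiply matrix A (m x n) by matrix B (n x p), both lists of rows."""
    if len(A[0]) != len(B):
        raise ValueError("inner dimensions must match")
    columns_of_B = list(zip(*B))  # each column of B as a tuple
    return [[sum(a * b for a, b in zip(row, col)) for col in columns_of_B]
            for row in A]


Q = [[1, 0], [0, 1], [1, 1]]  # 3 tokens, each with a 2-dimensional query
K = [[1, 2], [3, 4], [5, 6]]  # 3 tokens, each with a 2-dimensional key

scores = matmul(Q, transpose(K))  # 3 x 3: scores[i][j] = Q[i] . K[j]
for row in scores:
    print(row)
```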
A.4 Softmax Function
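Softmax converts a vector of arbitrary numbers into a probability distribution (all positive, summing to 1):
$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
Example: $\text{softmax}([2.0, 1.0, 0.1]) \approx [0.659, 0.242, 0.099]$
- Larger inputs → larger probabilities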
- All outputs are between 0 and 1
- They sum to 1.0
Used in:
- Attention weights (which tokens to attend to)
- Output probabilities (which character comes next)
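Softmax is short enough to write by hand. This sketch uses only the standard library and subtracts the largest input before exponentiating, a common trick that leaves the result unchanged but avoids overflow for large inputs.

```python
import math


def softmax(logits):
    """Convert a list of numbers into probabilities that sum to 1."""
    peak = max(logits)  # shifting by the max keeps exp() from overflowing
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
print("sum:", sum(probs))
```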
A.5 Cross-Entropy Loss
Cross-entropy measures how different the predicted probabilities are from the actual answer:
$$L = -\sum_{i} y_i \log(\hat{y}_i)$$
Where:
- $y_i$ = true label (1 for the correct class, 0 for others)
- $\hat{y}_i$ = predicted probability for class $i$
If the model is very confident and correct → low loss
If the model is wrong → high loss
This is the loss function used throughout Chapters 3-6.
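A quick numerical check shows the behavior described above, assuming the standard definition $L = -\sum_i y_i \log(\hat{y}_i)$ with a one-hot true label. The probability lists are invented for illustration.

```python
import math


def cross_entropy(true_index, predicted_probs):
    """Cross-entropy loss for one example with a one-hot true label.

    With a one-hot label, every term of the sum is zero except the one for
    the correct class, so the loss reduces to -log(p_correct).
    """
    return -math.log(predicted_probs[true_index])


confident_right = cross_entropy(0, [0.90, 0.05, 0.05])
unsure = cross_entropy(0, [0.40, 0.30, 0.30])
confident_wrong = cross_entropy(0, [0.01, 0.90, 0.09])

print(f"confident and correct: {confident_right:.3f}")
print(f"unsure:                {unsure:.3f}")
print(f"confident and wrong:   {confident_wrong:.3f}")
```

The loss grows sharply as the probability assigned to the correct class approaches zero, which is why confident mistakes are punished the most.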
A.6 Gradient and Chain Rule
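The gradient of a function points in the direction of steepest increase, so moving against it decreases the function's value:
$$\nabla_w L = \left( \frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_n} \right), \qquad w \leftarrow w - \eta \, \nabla_w L$$
where $\eta$ is the learning rate.
The chain rule lets us compute gradients through multiple layers:
$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$$
This is exactly what backpropagation does: it applies the chain rule from the output layer all the way back to the input.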
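The idea can be seen on the smallest possible "network": one weight, one input, and a squared-error loss. Everything here (the model $y = w \cdot x$, the input, the target, the learning rate) is a made-up toy chosen so the numbers are easy to follow.

```python
def loss(w, x=2.0, target=10.0):
    """Squared error of a one-weight model y = w * x."""
    y = w * x
    return (y - target) ** 2


def grad_loss(w, x=2.0, target=10.0):
    """dL/dw via the chain rule: dL/dy times dy/dw."""
    y = w * x
    dL_dy = 2 * (y - target)  # derivative of (y - target)^2 with respect to y
    dy_dw = x                 # derivative of w * x with respect to w
    return dL_dy * dy_dw


w = 0.0
learning_rate = 0.05
for step in range(20):
    w -= learning_rate * grad_loss(w)  # step against the gradient

print(f"learned w = {w:.4f}, loss = {loss(w):.8f}")
```

The weight moves toward 5, the value for which $w \cdot 2 = 10$. Backpropagation in a real network repeats the same multiply-the-local-derivatives step once per layer.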
A.7 The Attention Formula
The most important formula in this entire book:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Breaking it down:
- $QK^T$: compute similarity between queries and keys
- $/ \sqrt{d_k}$: scale down to prevent extreme values
- $\text{softmax}(\cdot)$: convert to probabilities
- $\cdot V$: weighted sum of values
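The four steps can be traced in a short pure-Python sketch of scaled dot-product attention for a single head with no causal mask. The tiny Q, K, and V matrices are invented so the result is easy to inspect; the book's own code uses PyTorch tensors instead of lists.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head, without masking.

    Q and K are lists of d_k-dimensional vectors, V is a list of value
    vectors. Returns (outputs, weights).
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Step 1: similarity of this query with every key (one row of Q K^T)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Step 2: scale by sqrt(d_k)
        scaled = [s / math.sqrt(d_k) for s in scores]
        # Step 3: softmax turns the row into attention weights
        weights.append(softmax(scaled))
    # Step 4: each output is a weighted sum of the value vectors
    outputs = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

outputs, weights = attention(Q, K, V)
print("weights:", [[round(w, 3) for w in row] for row in weights])
print("outputs:", [[round(x, 3) for x in row] for row in outputs])
```

Each query puts most of its weight on the key it matches, so each output row is pulled toward the corresponding value vector.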
A.8 Sigmoid Function
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Properties:
- Output is always between 0 and 1
- Used in Chapter 3 as the neuron activation function
- S-shaped curve
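These properties are easy to confirm numerically, assuming the standard logistic definition $\sigma(x) = 1 / (1 + e^{-x})$. The two-branch form below is a common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to stay stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x is negative here, so exp(x) cannot overflow
    return z / (1.0 + z)


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"sigmoid({x:+.1f}) = {sigmoid(x):.4f}")
```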
A.9 ReLU and GELU
ReLU (Rectified Linear Unit):
$$\text{ReLU}(x) = \max(0, x)$$
Simple and effective. Used in earlier models.
GELU (Gaussian Error Linear Unit):
$$\text{GELU}(x) = x \cdot \Phi(x)$$
Where $\Phi(x)$ is the standard normal CDF. Used in GPT-2 and modern models. Smoother than ReLU.
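Both activations fit in a few lines of standard-library Python, assuming the usual definitions $\text{ReLU}(x) = \max(0, x)$ and $\text{GELU}(x) = x \cdot \Phi(x)$. The exact GELU below computes $\Phi$ with `math.erf`; some implementations use a tanh-based approximation instead.

```python
import math


def relu(x):
    """Pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x = {x:+.1f}   relu = {relu(x):.4f}   gelu = {gelu(x):+.4f}")
```

Unlike ReLU, GELU lets small negative inputs through as small negative outputs, which is where its smoother shape comes from.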
Glossary of Terms
Appendix B: Glossary
| Term | Definition |
|---|---|
| Activation Function | A function applied after a neuron's weighted sum to introduce non-linearity (e.g., sigmoid, ReLU, GELU) |
| Attention | A mechanism that lets each token in a sequence "look at" other tokens and decide how much to focus on each |
| Autoregressive | Generating output one token at a time, using previous outputs as input for the next prediction |
| Backpropagation | The algorithm for computing gradients by applying the chain rule backwards through a neural network |
| Batch | A group of training examples processed together for efficiency |
| Bigram | A pair of consecutive characters or words; the simplest language model |
| Causal Mask | A triangular mask that prevents a token from attending to future tokens |
| Cross-Entropy | A loss function that measures the difference between predicted and actual probability distributions |
| Dropout | Randomly turning off neurons during training to prevent overfitting |
| Embedding | A dense vector representation of a token, learned during training |
| Epoch | One complete pass through the entire training dataset |
| Feed-Forward Network (FFN) | A simple neural network with one or two hidden layers, used within transformer blocks |
| Fine-Tuning | Adapting a pre-trained model to a specific task by training on task-specific data |
| Gradient | The direction and magnitude of change that would reduce the loss function |
| Gradient Descent | An optimization algorithm that updates weights in the direction that reduces loss |
| Head (Attention) | One independent attention computation within multi-head attention |
| Hidden Layer | A layer of neurons between the input and output layers |
| Inference | Using a trained model to make predictions (as opposed to training) |
| Layer Normalization | Normalizing the values within a layer to have mean 0 and standard deviation 1 |
| Learning Rate | A hyperparameter controlling how much weights change in each update step |
| Logits | The raw, unnormalized scores output by a model before softmax |
| LoRA | Low-Rank Adaptation: an efficient fine-tuning method that only updates small adapter matrices |
| Loss | A number measuring how wrong the model's predictions are |
| Multi-Head Attention | Running multiple attention computations in parallel, each focusing on different aspects |
| Neuron | The basic unit of a neural network: computes weighted sum + activation |
| N-gram | A sequence of N consecutive tokens (bigram = 2-gram, trigram = 3-gram) |
| One-Hot Encoding | Representing a token as a vector of all zeros except for a 1 at the token's index |
| Overfitting | When a model memorizes training data instead of learning general patterns |
| Parameters | The learnable weights and biases of a model |
| PEFT | Parameter-Efficient Fine-Tuning: methods like LoRA that update only a fraction of parameters |
| Perplexity | A measure of how well a model predicts text; lower = better |
| Positional Encoding | Information added to embeddings to tell the model about token positions |
| Pre-training | Training a model on a large, general dataset before fine-tuning |
| Query (Q) | In attention: "what am I looking for?" |
| Key (K) | In attention: "what do I contain?" |
| Value (V) | In attention: "what information do I give?" |
| Residual Connection | Adding the input of a layer to its output (skip connection) |
| RLHF | Reinforcement Learning from Human Feedback: training models using human preferences |
| Sampling | Randomly selecting the next token based on probability distribution |
| Self-Attention | Attention applied within a single sequence (each token attends to all other tokens) |
| Softmax | A function that converts logits into a probability distribution |
| Temperature | A parameter controlling the randomness of sampling (low = predictable, high = creative) |
| Tokenization | Converting text into a sequence of tokens (characters, subwords, or words) |
| Top-k Sampling | Only considering the k most probable tokens when sampling |
| Transformer | The neural network architecture based on self-attention, introduced in 2017 |
| Vocabulary | The set of all unique tokens a model knows |
| Weight | A learnable parameter that determines how much influence one input has |
| Weight Tying | Sharing weights between the input embedding and output projection layers |
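Three glossary entries (Sampling, Temperature, and Top-k Sampling) work together at generation time, and a short sketch shows how. The function name `sample_next` and the logits are hypothetical; this is one common way to combine the ideas, not necessarily how the book's `generate.py` does it.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index from raw logits using temperature and top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Rank candidates by score and keep only the top k (all if top_k is None).
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]
    # Low temperature sharpens the distribution, high temperature flattens it.
    scaled = [logits[i] / temperature for i in indices]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalized softmax
    return rng.choices(indices, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)
for temp in (0.1, 1.0, 2.0):
    picks = [sample_next(logits, temperature=temp, top_k=3, rng=rng)
             for _ in range(1000)]
    share = picks.count(0) / len(picks)
    print(f"temperature {temp}: top token chosen {share:.0%} of the time")
```

With `top_k=3` the lowest-scoring token is never chosen, and raising the temperature spreads choices more evenly across the three that remain.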
Appendix C: Resources for Further Learning
Video Courses (Free)
| Resource | What You'll Learn | Link |
|---|---|---|
| 3Blue1Brown: Neural Networks | Beautiful visual explanations of neural networks | YouTube |
| Andrej Karpathy: Let's Build GPT | Build GPT from scratch (2 hours) | YouTube |
| Andrej Karpathy: Neural Networks: Zero to Hero | Complete deep learning series | YouTube |
| fast.ai | Practical deep learning for coders | fast.ai |
| CS231n (Stanford) | Computer vision & deep learning | YouTube |
| CS224n (Stanford) | NLP with deep learning | YouTube |
Books
| Book | Level | Focus |
|---|---|---|
| Deep Learning (Goodfellow et al.) | Advanced | Complete theory reference |
| Hands-On Machine Learning (Géron) | Intermediate | Practical ML with scikit-learn & TF |
| Natural Language Processing with Transformers (Tunstall et al.) | Intermediate | Hugging Face ecosystem |
| The Hundred-Page Machine Learning Book (Burkov) | Beginner | Concise overview |
Key Papers
| Paper | Year | Why It Matters |
|---|---|---|
| "Attention Is All You Need" | 2017 | Introduced the Transformer |
| "BERT: Pre-training of Deep Bidirectional Transformers" | 2018 | Bidirectional understanding |
| "Language Models are Few-Shot Learners" (GPT-3) | 2020 | Scaling and in-context learning |
| "Training Language Models to Follow Instructions" | 2022 | RLHF and InstructGPT |
| "LoRA: Low-Rank Adaptation" | 2021 | Efficient fine-tuning |
| "Constitutional AI" | 2022 | AI safety approach |
Tools and Libraries
| Tool | Purpose |
|---|---|
| PyTorch | Deep learning framework (used in this book) |
| Hugging Face Transformers | Pre-trained models and fine-tuning |
| Hugging Face PEFT | Parameter-efficient fine-tuning (LoRA) |
| Hugging Face Datasets | Easy dataset loading |
| Google Colab | Free GPU for training |
| Weights & Biases | Experiment tracking |
🇮🇳 Indian AI Resources
| Resource | Focus |
|---|---|
| AI4Bharat | NLP for Indian languages |
| IIT Madras NPTEL | Free AI/ML courses in Hindi and English |
| Bhashini | Government translation platform |
| IndicNLP | NLP tools for Indic languages |
Appendix D: Setting Up Your Environment
D.1 Installing Python
Windows:
- Download Python from python.org
- During installation, check "Add Python to PATH"
- Open Command Prompt and verify:
python --version
Linux/Mac:
Bash
# Usually pre-installed. Check with:
python3 --version
# If not installed:
# Ubuntu: sudo apt install python3 python3-pip
# Mac: brew install python3
D.2 Setting Up a Virtual Environment (Recommended)
Bash
# Create a virtual environment
python -m venv ai-env
# Activate it
# Windows:
ai-env\Scripts\activate
# Linux/Mac:
source ai-env/bin/activate
# Install dependencies
pip install -r requirements.txt
D.3 Using Google Colab (No Installation Needed!)
If you don't want to install anything locally:
- Go to colab.research.google.com
- Create a new notebook
- Upload the Python files or copy-paste the code
- Run! (Colab gives you free GPU access)
D.4 Troubleshooting Common Issues
| Problem | Solution |
|---|---|
| ModuleNotFoundError: No module named 'torch' | Run pip install torch |
| CUDA out of memory | Reduce batch_size in config |
| Training is very slow | Use Google Colab for GPU access |
| PermissionError on Windows | Run terminal as Administrator |
| Model generates gibberish | Train for more steps or check data quality |
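Before digging into individual errors, it can help to check which packages Python can actually see. The sketch below uses only the standard library. The package list is an assumption based on the tools this book mentions, so check requirements.txt for the real list.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
for name in ("numpy", "torch", "transformers", "peft"):  # assumed package list
    status = "installed" if is_installed(name) else f"MISSING (try: pip install {name})"
    print(f"{name:<14} {status}")
```

If a package shows as missing even after installing it, the usual cause is that `pip` and `python` point to different environments. Activating the virtual environment from D.2 first normally fixes that.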
D.5 Hardware Recommendations
| Level | Minimum | Recommended |
|---|---|---|
| Level 1-2 | Any computer | Any computer |
| Level 3 | 4GB RAM | 8GB RAM |
| Level 4 | 4GB RAM, CPU OK | 8GB RAM, GPU preferred |
| Level 5 | 8GB RAM | 16GB RAM, NVIDIA GPU |
Tip
If you don't have a GPU, use Google Colab (free) for Levels 4 and 5. It provides a free NVIDIA T4 GPU that's more than enough!
End of Appendices
The Training Stories
This is the complete training dataset used in Chapter 5 (Level 4) to train your Mini-GPT model. These 30 stories were carefully chosen to give the model a mix of narrative styles, scientific knowledge, and Indian cultural context, all in simple English.
Why These Stories?
When training a language model, the training data determines what the model learns. We chose stories that cover these categories:
| Category | Purpose | Examples |
|---|---|---|
| Indian folk tales | Cultural context, moral lessons | The fox and the pot, the honest woodcutter |
| Science paragraphs | Factual knowledge | Photosynthesis, water cycle, magnets |
| Nature descriptions | Vocabulary, descriptive language | Sunrise, moon, seasons |
| Character stories | Narrative structure | Arjun the reader, Priya the singer |
| Moral stories | Story patterns, cause-effect | Tortoise and rabbit, ant and grasshopper |
Important
Key Design Decisions:
- Simple vocabulary (suitable for Class 6-8 level)
- Short sentences (easier for a small model to learn)
- Repetitive patterns (helps the model learn grammar faster)
- Mix of topics (gives the model breadth)
- ~8,800 characters total (small but sufficient for a demo model)
The Complete Dataset
Below is every story the model trains on. Read through them: when you later see the model generating text, you'll recognize the patterns it learned from these stories!
Story 1: The Wise Farmer
Once upon a time, in a small village near the river, there lived a wise old farmer. He worked hard every day in his fields. The farmer grew rice, wheat, and vegetables. He shared his food with everyone in the village. People loved him because he was kind and generous.
What the model learns: Opening phrases ("Once upon a time"), character introductions, village/farming vocabulary.
Story 2: Nature's Beauty
The sun rises in the east and sets in the west. Every morning, the birds sing beautiful songs. The flowers open their petals to welcome the sunlight. The trees provide shade and fresh air. Nature is beautiful and full of wonders.
What the model learns: Descriptive language, nature vocabulary, present tense patterns.
Story 3: The Clever Fox (Panchatantra-style)
A clever fox lived in a forest near a village. One hot summer day, the fox was very thirsty. He searched for water everywhere but could not find any. Then he saw a pot with some water at the bottom. The fox put small stones into the pot one by one. Slowly the water came up to the top. The fox drank the water happily. This story teaches us that intelligence solves problems.
What the model learns: Problem-solving narratives, sequential actions, moral conclusions.
Story 4: The River Ganga
The river Ganga flows from the Himalayas to the Bay of Bengal. It is one of the longest rivers in India. Many cities and towns are built along its banks. People use the river water for drinking and farming. The Ganga is very important for the people of India.
What the model learns: Indian geography, factual sentences, proper nouns.
Story 5: The Kind King
A kind king ruled a beautiful kingdom. His people were happy and peaceful. The king built schools for children and hospitals for the sick. He made sure everyone had food to eat and a place to live. The kingdom prospered under his wise rule.
What the model learns: Governance vocabulary, cause-effect relationships.
Story 6: The Night Sky
The moon shines brightly in the night sky. Stars twinkle like tiny diamonds above us. The sky changes color from blue to orange during sunset. Clouds float gently across the sky like cotton balls. Looking at the sky fills our hearts with wonder.
What the model learns: Similes ("like tiny diamonds"), poetic descriptions, visual imagery.
Story 7: Arjun the Reader
A small boy named Arjun loved to read books. He would sit under the banyan tree and read for hours. His favorite books were about science and adventure. One day he read about the solar system and the planets. He dreamed of becoming a scientist when he grew up.
What the model learns: Character development, Indian names, aspirational narratives.
Story 8: Water, Essential for Life
Water is essential for all living things. Plants need water to grow and make food. Animals drink water to stay alive and healthy. The water cycle keeps water moving around the earth. Rain fills the rivers and lakes with fresh water.
What the model learns: Scientific facts, cause-effect, ecosystem vocabulary.
Story 9: The Honest Woodcutter
A poor woodcutter lived at the edge of a forest. Every day he would cut wood and sell it in the market. One day his axe fell into the river. He sat by the river and cried because he was very poor. A kind spirit appeared and asked him what happened. The spirit dove into the water and brought up a golden axe. The woodcutter said that was not his axe. The spirit brought up a silver axe. Again the woodcutter said it was not his. Finally the spirit brought up his old iron axe. The woodcutter was happy and said yes that is mine. The spirit was pleased with his honesty and gave him all three axes.
What the model learns: Longer narratives, dialogue patterns, honesty theme, repetitive structure (which helps small models learn!).
Story 10: Day and Night
The earth goes around the sun in one year. The moon goes around the earth in about one month. The earth spins on its axis once every day. This spinning gives us day and night. When our part of the earth faces the sun it is daytime. When it faces away from the sun it is nighttime.
What the model learns: Astronomical facts, cause-effect explanations.
Story 11: The Proud Peacock
A beautiful peacock lived in a garden near the palace. It had colorful feathers of blue and green. When it danced in the rain everyone would stop and watch. The peacock was proud of its beautiful feathers. It spread its tail like a magnificent fan.
Story 12: The Importance of Trees
Trees are very important for our planet. They give us oxygen to breathe and clean the air. Trees provide fruits and nuts for us to eat. Birds build their nests in the branches of trees. We should plant more trees and take care of them.
Story 13: Priya the Singer
A young girl named Priya wanted to learn music. She practiced singing every day after school. Her teacher said she had a beautiful voice. Priya worked very hard and never missed a practice session. After many months she sang in a concert and everyone clapped.
Story 14: The Human Body
The heart pumps blood through our body. Blood carries oxygen and food to every part of the body. The brain controls all our movements and thoughts. Our bones give shape to our body and protect our organs. The human body is an amazing machine.
Story 15: Tortoise and the Rabbit
An old tortoise and a young rabbit decided to have a race. The rabbit ran very fast and went far ahead. He thought he had plenty of time so he took a nap. The tortoise kept walking slowly but steadily. When the rabbit woke up the tortoise had already crossed the finish line. Slow and steady wins the race.
Story 16: Indian Festivals
India has many beautiful festivals throughout the year. Diwali is the festival of lights celebrated with joy and happiness. Holi is the festival of colors where people play with colored powder. Eid brings people together for prayers and feasts. Christmas is celebrated with decorations and gifts.
Story 17: Magnets
A magnet has two poles called north and south. Like poles repel each other and unlike poles attract. Magnets can attract things made of iron and steel. The earth itself is like a giant magnet. A compass needle points north because of the earth magnetic field.
Story 18: The Merchant's Journey
There was a merchant who traveled from town to town selling goods. He carried silk cloths and precious spices on his camel. One day he got lost in the desert during a sandstorm. He prayed for help and soon the storm passed away. He followed the stars in the night sky and found his way home.
Story 19: Light and Colors
Light travels in straight lines very fast. When light passes through a prism it splits into seven colors. These colors are violet indigo blue green yellow orange and red. We can see a rainbow after rain because water drops act like tiny prisms. Light is a form of energy that helps us see the world.
Story 20: The Mother Bird
A mother bird built a nest in a tall tree. She laid three small eggs in the nest. She sat on the eggs to keep them warm for many days. Soon the eggs cracked and three baby birds came out. The mother bird brought food for her babies every day until they learned to fly.
Story 21: Photosynthesis
Plants make their own food through photosynthesis. They use sunlight water and carbon dioxide for this process. The green color in leaves comes from a substance called chlorophyll. Chlorophyll captures sunlight to make food for the plant. Plants give out oxygen during photosynthesis which we breathe.
Story 22: The Brave Soldier
A brave soldier named Ravi protected his village from danger. He stood guard at the border day and night without complaint. The villagers respected him and treated him like a hero. Ravi taught the young boys how to be brave and strong. He said courage means doing the right thing even when you are afraid.
Story 23: Indian Seasons
The seasons change throughout the year in India. Summer is hot and dry with temperatures rising very high. The monsoon brings heavy rains and cools the land. Winter is cold and pleasant in most parts of the country. Spring brings new flowers and green leaves on the trees.
Story 24: The Kind Fisherman
A fisherman went to the sea every morning in his small boat. He would throw his net into the water and wait patiently. Sometimes he caught many fish and sometimes very few. One day he caught a beautiful golden fish. The golden fish spoke and asked to be set free. The kind fisherman released it back into the sea.
Story 25: Electricity
Electricity flows through wires like water flows through pipes. We use electricity to power lights fans and computers. A battery stores electrical energy for later use. Switches control the flow of electricity in a circuit. We should use electricity wisely and not waste it.
Story 26: The Two Friends and the Bear
Two friends were walking through a forest one day. Suddenly they saw a large bear coming toward them. One friend quickly climbed a tree to save himself. The other friend lay down on the ground and pretended to be dead. The bear came close and smelled him then walked away. When the bear left the friend in the tree came down. He asked what the bear whispered in his ear. The friend on the ground said the bear told me not to trust a friend who runs away in danger.
Story 27: Mountains
Mountains are the tallest landforms on the earth. The Himalayas are the highest mountains in the world. Mount Everest is the tallest peak standing at eight thousand meters. Many rivers begin from the glaciers in the mountains. Mountains affect the weather and rainfall in nearby areas.
Story 28: The Ant and the Grasshopper
A little ant worked hard all summer long. It collected food and stored it carefully in its home. A grasshopper spent the whole summer singing and dancing. When winter came the ant had plenty of food to eat. The grasshopper had nothing and was cold and hungry. The ant shared some food with the grasshopper and said it is wise to prepare for the future.
Story 29: Sound
Sound is a form of energy that travels in waves. We hear sounds when these waves reach our ears. Sound travels faster through water than through air. It travels fastest through solid objects like metal. Very loud sounds can damage our hearing so we should protect our ears.
Story 30: The Dedicated Teacher
A teacher loved her students very much. She came to school early every day to prepare her lessons. She explained difficult topics in simple and easy ways. Her students always performed well in their examinations. She believed that every child can learn if given the right guidance.
Data Analysis
Total stories: 30
Total characters: 8,867
Total words: ~1,530
Unique characters: ~55
Average story length: ~295 characters (~51 words)
Topic distribution:
Indian folk/moral tales: 10 stories (33%)
Science facts: 10 stories (33%)
Nature/descriptions: 5 stories (17%)
Character stories: 5 stories (17%)
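To check figures like these yourself, a few lines of Python are enough. The sketch below assumes the dataset file holds one story per non-empty line; that layout is an assumption, so adapt the splitting rule if your `stories.txt` is organized differently.

```python
def dataset_stats(text):
    """Compute basic statistics for a story file with one story per non-empty line (assumed layout)."""
    stories = [line.strip() for line in text.splitlines() if line.strip()]
    total_chars = len(text)
    total_words = len(text.split())
    return {
        "stories": len(stories),
        "characters": total_chars,
        "words": total_words,
        "unique_characters": len(set(text)),
        "avg_chars_per_story": total_chars / len(stories) if stories else 0.0,
    }


if __name__ == "__main__":
    sample = "The sun rises in the east.\nWater is essential for life.\n"
    for key, value in dataset_stats(sample).items():
        print(f"{key}: {value}")
    # For the real dataset (path assumed from the project layout):
    # with open("data/stories.txt", encoding="utf-8") as f:
    #     print(dataset_stats(f.read()))
```

The unique-character count matters most for a character-level model, because it sets the vocabulary size.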
Tip
Experiment Idea: Try adding your own stories to stories.txt and re-training the model. Does the generated text change? Does more data improve quality? This is how real AI researchers iterate on data!
Appendix F: Project Walkthrough & Task Tracker
The Building Journey
This project was built as a learning experience, and the process of building it is educational too! Here's how the project came together:
Task Checklist (All Completed)
Level 1: Prediction (Pure Python)
- [x] lesson.md: What is prediction? Bigrams explained with Indian context
- [x] step1_bigram.py: Count character patterns from sample text
- [x] step2_generate.py: Generate text using bigram probabilities
Level 2: Neural Network (NumPy)
- [x] lesson.md: Neurons, weights, activation, backpropagation
- [x] step1_neuron.py: Build a single neuron, test on AND/OR gates
- [x] step2_network.py: Multi-layer neural network with forward pass
- [x] step3_train.py: Full training with backpropagation from scratch
- [x] step4_visualize.py: Loss curves and training visualization
Level 3: Transformer (PyTorch)
- [x] lesson.md: Attention mechanism with classroom analogy
- [x] step1_embedding.py: Token and positional embeddings
- [x] step2_attention.py: Self-attention from scratch
- [x] step3_transformer_block.py: Multi-head attention + FFN + residual
- [x] step4_put_it_together.py: Complete Mini-Transformer model
Level 4: Mini-GPT
- [x] lesson.md: Autoregressive generation, temperature, top-k
- [x] model.py: Complete MiniGPT model class
- [x] train.py: Training loop with progress and evaluation
- [x] generate.py: Interactive text generation
- [x] data/stories.txt: 30 training stories
Level 5: Real Fine-Tune
- [x] lesson.md: Pre-training, fine-tuning, LoRA explained
- [x] prepare_data.py: Data loading and tokenization
- [x] finetune.py: LoRA fine-tuning with Hugging Face
- [x] chat.py: Interactive chat with comparison mode
- [x] data/education_qa.jsonl: 60+ education Q&A pairs
Documentation
- [x] README.md: Complete project guide
- [x] requirements.txt: All Python dependencies
Project Statistics
| Metric | Value |
|---|---|
| Total files | 25+ |
| Total lines of code | ~3,500+ |
| Total documentation | ~15,000+ words |
| Total training data | 30 stories + 60+ Q&A pairs |
| Languages used | Python, Markdown |
| Libraries used | NumPy, Matplotlib, PyTorch, Transformers, PEFT |
| Estimated learning time | 5-6 hours (all levels) |
Key Features
- Rich colored terminal output: every script visually shows what's happening
- Extensive comments: every code block explains WHY it exists
- Indian context: Panchatantra stories, NCERT science, Hindi terms
- Lesson files: conceptual primers before diving into code
- Interactive chat: talk to your trained model in Levels 4 & 5
- 60+ education Q&A pairs: covering Class 6-8 science, math, GK
- Backpropagation from scratch: Level 2 implements it without autograd
- Self-attention from scratch: Level 3 builds the core transformer mechanism
- LoRA fine-tuning: Level 5 uses industry-standard tools
Tip
Start from Level 1 even if you know some ML: the progression builds understanding layer by layer!
What's Next?
Future improvements, extensions, and your roadmap forward
Future Modifications & Extensions
What You'll Explore
- 10 concrete ways to improve and extend the chatbot
- Architecture upgrades from character-level to production-grade
- Multi-language support for Indian languages
- RAG (Retrieval-Augmented Generation) for accurate answers
- Deployment options: web, WhatsApp, mobile, classroom
- New levels to continue your AI journey
8.1 Model Improvements
Your current Mini-GPT is a great learning tool, but there's a huge gap between it and production models like ChatGPT. Here's how to bridge that gap:
| Area | Current | Improvement |
|---|---|---|
| Tokenization | Character-level | Switch to BPE (Byte-Pair Encoding), as GPT models use, for a better vocabulary and faster training |
| Model Size | ~1.5M params, 4 blocks | Scale to 10-50M params, 6-8 blocks for much more coherent output |
| Training Data | 30 stories (9KB) | Use Project Gutenberg, Wikipedia, NCERT textbooks: 10MB+ |
| Context Window | 256 characters | Expand to 512-1024 tokens for longer coherent passages |
| Architecture | Basic transformer | Add RoPE, SwiGLU activation, GQA (like LLaMA) |
Key Insight
For a model this small, the biggest improvement usually comes from more data, not a bigger model. Going from 9KB to 10MB of training data is likely to improve your model's output quality dramatically, even with the same architecture!
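To make the Tokenization row of the table more concrete, here is a minimal sketch of the core idea behind BPE: start from single characters and repeatedly merge the most frequent adjacent pair into a new token. It is a teaching toy, not the tokenizer GPT models ship with; production BPE works on bytes and pre-split words and stores its merge table for reuse. The function names are illustrative.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there are fewer than two tokens."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(text, num_merges):
    """Start from characters and apply up to `num_merges` greedy merges."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


if __name__ == "__main__":
    text = "the farmer and the river and the village"
    tokens, merges = learn_bpe(text, num_merges=10)
    print(f"{len(text)} characters -> {len(tokens)} tokens")
    print("Learned merges:", merges)
```

Each merge shortens the sequence, so the same 256-position context window covers more text than it does at character level.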
8.2 Multi-Language Support
India's Constitution lists 22 scheduled languages, but our model currently handles only English. Here's how to make it multilingual:
Python
# Future: Train on multiple Indian languages
LANGUAGES = {
    "hindi": "Add Hindi stories and NCERT content",
    "tamil": "Tamil literature and textbooks",
    "telugu": "Telugu educational content",
    "bengali": "Bengali stories and science",
}
# How to implement:
# 1. Use SentencePiece tokenizer (supports all scripts)
# 2. Mix multilingual data during training
# 3. Fine-tune with language-specific LoRA adapters
# 4. Use AI4Bharat's IndicNLP resources
Indian Language AI Resources
- AI4Bharat: NLP tools and datasets for Indian languages
- Bhashini: Government of India's translation platform
- IndicTrans2: Open-source translation model for 22 Indian languages
- Sangraha: Large-scale Indic language dataset
8.3 Better Chat Experience
The current terminal-based chat works, but students deserve a modern UI:
| Feature | How to Build |
|---|---|
| Web UI | Add an /ai-tutor route to EduArtha with a React chat interface |
| Streaming | Show tokens appearing one-by-one (like ChatGPT) using Server-Sent Events |
| Chat History | Save conversations to SQLite/PostgreSQL for review |
| Voice Input | Speech-to-text using Web Speech API or OpenAI Whisper |
| Voice Output | Text-to-speech for answers, great for younger students! |
| Markdown Rendering | Render math formulas, code blocks, and tables in responses |
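For the Streaming row, the server side of Server-Sent Events is mostly a matter of formatting: each message is a `data:` line followed by a blank line. The generator below is a small sketch of that framing. The JSON payload shape and the final `done` event are illustrative choices, not something the project defines.

```python
import json


def sse_events(tokens):
    """Yield one Server-Sent Events message per generated token, then a final 'done' event."""
    for token in tokens:
        # JSON-encode the token so newlines inside it cannot break the SSE framing.
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    yield "event: done\ndata: {}\n\n"


if __name__ == "__main__":
    # Stand-in for tokens coming out of your model one at a time.
    for message in sse_events(["Plants", " make", " food", "."]):
        print(message, end="")
```

A web framework route would return this generator with the content type `text/event-stream`, and the browser can read it with the standard `EventSource` API.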
8.4 RAG: Smarter Education Bot
The most impactful upgrade: Retrieval-Augmented Generation (RAG). Instead of relying only on what the model memorized, RAG searches a knowledge base for relevant information before answering:
1. Search NCERT textbook database (vector similarity)
2. Retrieve relevant paragraphs from Class 7 Science Ch.1
3. Feed retrieved text + question to the LLM as context
4. Generate accurate, sourced answer with page references
Python
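# RAG pipeline (conceptual sketch; `ncert_chapters` and `model` are placeholders you must define)
import numpy as np
from sentence_transformers import SentenceTransformer
# Step 1: Index your textbooks (ncert_chapters: a list of paragraph strings)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(ncert_chapters, normalize_embeddings=True)
# Step 2: When a student asks a question
def answer_with_rag(question, k=3):
    query_vector = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector  # cosine similarity, since vectors are unit length
    top_ids = np.argsort(scores)[::-1][:k]  # indices of the k best-matching paragraphs
    context = "\n".join(ncert_chapters[i] for i in top_ids)
    prompt = f"Based on this textbook content:\n{context}\n\nAnswer: {question}"
    return model.generate(prompt)  # placeholder for your fine-tuned model's generate call
# For larger collections, replace the NumPy search with ChromaDB or FAISS (optionally via LangChain)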
Why RAG Matters
Fine-tuning teaches the model how to answer. RAG gives it what to answer with. Together, they create a chatbot that is both fluent AND accurate, the holy grail of educational AI.
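The retrieval step can be tried without installing anything. The sketch below swaps the sentence-embedding model for plain word counts, a deliberately crude substitute, so that the search-then-prompt flow runs with only the standard library. The passages and function names are illustrative.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase word counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, passages, k=2):
    """Return the k passages most similar to the question."""
    query = bag_of_words(question)
    ranked = sorted(passages, key=lambda p: cosine(query, bag_of_words(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, passages, k=2):
    """Assemble the retrieved context and the question into one prompt string."""
    context = "\n".join(retrieve(question, passages, k))
    return f"Based on this textbook content:\n{context}\n\nAnswer: {question}"


PASSAGES = [
    "Plants make their own food through photosynthesis using sunlight water and carbon dioxide.",
    "A magnet has two poles called north and south.",
    "The heart pumps blood through our body.",
]

if __name__ == "__main__":
    print(build_prompt("How do plants make food?", PASSAGES, k=1))
```

Word counts only match exact words, which is why real systems use embeddings: a question about "how leaves produce energy" would miss the photosynthesis passage here.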
8.5 Subject-Specific Fine-Tuning
Instead of one generic bot, create specialized tutors, each using the same base model but a different LoRA adapter:
Python
# One model, many experts: just swap LoRA adapters!
from transformers import AutoModelForCausalLM
from peft import PeftModel
SUBJECT_TUTORS = {
    "science_tutor": "Fine-tune on NCERT Science Class 6-10",
    "math_tutor": "Fine-tune on solved problems + step-by-step",
    "history_tutor": "Fine-tune on Indian history with timelines",
    "english_tutor": "Fine-tune on grammar rules + essay examples",
    "coding_tutor": "Fine-tune on Python exercises + explanations",
}
# At runtime (the adapter folder is a hypothetical path to a LoRA adapter you saved earlier):
base_model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model = PeftModel.from_pretrained(base_model, "adapters/science_tutor")  # Swap this!
# Each adapter is small (a few MB or less, depending on rank), so you can store dozens of experts cheaply!
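Why are adapters so small? LoRA leaves the original weight matrix frozen and trains two thin matrices of rank r beside it, so each adapted layer adds only r × (d_in + d_out) parameters. The arithmetic below is a sketch under stated assumptions: it adapts only DistilGPT-2's fused attention projection (768 inputs, 2304 outputs) in each of its 6 blocks, uses rank 8, and stores weights as 32-bit floats. Other ranks or target layers change the result.

```python
def lora_param_count(d_in, d_out, rank):
    """LoRA adds two matrices, A (rank x d_in) and B (d_out x rank), per adapted layer."""
    return rank * d_in + d_out * rank


def adapter_size_bytes(d_in, d_out, rank, num_layers, bytes_per_param=4):
    """Approximate adapter size when one weight matrix per layer is adapted."""
    return lora_param_count(d_in, d_out, rank) * num_layers * bytes_per_param


if __name__ == "__main__":
    # Assumed setup: 768 -> 2304 attention projection, 6 blocks, rank 8, float32.
    params = lora_param_count(768, 2304, rank=8) * 6
    size_kb = adapter_size_bytes(768, 2304, rank=8, num_layers=6) / 1024
    print(f"Trainable LoRA parameters: {params:,}")
    print(f"Approximate adapter size: {size_kb:.0f} KB")
```

Under these assumptions the adapter has 147,456 trainable parameters, well under 1 MB of weights, compared with tens of millions of parameters in the base model.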
8.6 Training Improvements
| Technique | What It Does | Difficulty |
|---|---|---|
| Learning Rate Scheduler | Warmup + cosine decay for smoother training | ⭐ Easy |
| Mixed Precision (FP16) | Often close to 2x faster training with lower memory use on supported GPUs | ⭐ Easy |
| Gradient Accumulation | Train with larger effective batch size on small GPU | ⭐ Easy |
| DPO (Direct Preference Optimization) | Align model to prefer good answers over bad ones | ⭐⭐ Medium |
| Quantization (4-bit / 8-bit) | Run larger models on small GPUs, for example with QLoRA | ⭐⭐ Medium |
| Flash Attention | Faster, more memory-efficient attention (the speedup depends on GPU and sequence length) | ⭐⭐ Medium |
| Distributed Training | Train across multiple GPUs with DeepSpeed/FSDP | ⭐⭐⭐ Hard |
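The first row of the table, warmup followed by cosine decay, is just a formula for the learning rate at each step. Here is a minimal sketch of one common variant: linear warmup, then a half-cosine curve down to a minimum. The parameter names and example values are illustrative.

```python
import math


def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so early, noisy gradients take small steps.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress runs from 0.0 (end of warmup) to 1.0 (end of training).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 9, 10, 55, 99, 100):
        lr = lr_at_step(step, max_lr=1e-3, warmup_steps=10, total_steps=100)
        print(f"step {step:3d}: lr = {lr:.6f}")
```

In a PyTorch training loop, you could apply the value each step by assigning it to `group["lr"]` for every group in `optimizer.param_groups`.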
8.7 Safety & Guardrails
Before deploying to real students, add these safety layers:
Python
# Safety layers to add to chat.py
# (is_inappropriate, check_confidence, and find_sources are helper methods you still need to write)
class SafeChatBot:
    def generate_safe(self, question):
        # 1. Input filtering: block inappropriate questions
        if self.is_inappropriate(question):
            return "I can only help with educational topics!"
        # 2. Generate answer
        answer = self.model.generate(question)
        # 3. Factuality check: flag uncertain answers
        confidence = self.check_confidence(answer)
        if confidence < 0.5:
            answer += "\n⚠️ I'm not very sure. Please verify with your teacher!"
        # 4. Source attribution
        sources = self.find_sources(answer)
        if sources:
            answer += f"\nSource: {sources}"
        return answer
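The outline above leaves its helper methods open. As one possible starting point, here is a runnable sketch of the first three layers with deliberately simple stand-ins: a tiny keyword blocklist for input filtering, and the average token probability as a confidence score. Both are assumptions for illustration. Token probability is only a weak proxy for factual accuracy, and a real deployment needs a far more careful filter.

```python
import math

# Hypothetical blocklist for illustration only.
BLOCKED_WORDS = {"violence", "gambling"}


def is_inappropriate(question):
    """Very rough input filter: flag questions containing a blocked word."""
    words = {w.strip(".,!?").lower() for w in question.split()}
    return bool(words & BLOCKED_WORDS)


def confidence_from_logprobs(token_logprobs):
    """Geometric-mean token probability: exp of the average log-probability."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def generate_safe(question, generate_fn, threshold=0.5):
    """Wrap `generate_fn`, which is assumed to return (answer, token_logprobs)."""
    if is_inappropriate(question):
        return "I can only help with educational topics!"
    answer, token_logprobs = generate_fn(question)
    if confidence_from_logprobs(token_logprobs) < threshold:
        answer += "\nI'm not very sure. Please verify with your teacher!"
    return answer


if __name__ == "__main__":
    # Stand-in for a real model: returns a fixed answer with fairly confident tokens.
    def fake_model(question):
        return "Plants make food using sunlight.", [math.log(0.9)] * 5

    print(generate_safe("What is photosynthesis?", fake_model))
    print(generate_safe("Tell me about gambling?", fake_model))
```

Passing the model in as `generate_fn` keeps the safety logic testable without loading any weights.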
8.8 Deployment Options
Your chatbot doesn't have to live in the terminal forever. Here are four deployment paths:
| Option | Description | Best For | Cost |
|---|---|---|---|
| EduArtha Web | Add /ai-tutor route to Next.js app with WebSocket streaming | Online students | Free (existing server) |
| WhatsApp Bot | Twilio API integration: students chat on WhatsApp directly | Rural students with basic phones | ~₹500/month |
| Mobile App | React Native app with offline quantized model | Students without internet | Free (open source) |
| Classroom Kiosk | Raspberry Pi + screen: students walk up and ask questions | Government schools | ~₹5,000 one-time |
Build a WhatsApp Education Bot
The most impactful deployment for India: a WhatsApp bot that any student can message. Steps:
- Set up a Twilio account (free trial available)
- Create a Flask/FastAPI webhook endpoint
- Load your fine-tuned model on the server
- When a WhatsApp message arrives → generate response → send back
- Students text questions like "What is photosynthesis?" and get instant answers!
This works on any phone that runs WhatsApp, with no extra app to download. Perfect for rural India.
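The webhook step in the list above can be sketched without a web framework. Twilio delivers each incoming message to your endpoint as a form-encoded POST, with the text in the `Body` field and the sender in `From`, and it accepts a TwiML XML document as the reply. The function below handles just that translation. `generate_fn` is a placeholder for your model, and wiring the function into a Flask or FastAPI route is left out.

```python
from urllib.parse import parse_qs
from xml.sax.saxutils import escape


def handle_incoming(form_body, generate_fn):
    """Turn a form-encoded webhook body into a TwiML reply.

    `generate_fn` is a placeholder for your model's question -> answer function.
    """
    fields = parse_qs(form_body)
    question = fields.get("Body", [""])[0].strip()
    if not question:
        reply = "Please send a question, for example: What is photosynthesis?"
    else:
        reply = generate_fn(question)
    # escape() keeps characters like < and & from breaking the XML.
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f"<Response><Message>{escape(reply)}</Message></Response>"
    )


if __name__ == "__main__":
    # Example request body in the shape Twilio sends (the phone number is made up).
    body = "Body=What+is+photosynthesis%3F&From=whatsapp%3A%2B911234567890"
    print(handle_incoming(body, lambda q: f"You asked: {q}"))
```

In a web framework, the route would pass the raw request body to this function and return the result with the content type `application/xml`.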
8.9 Analytics & Feedback Loop
Track student interactions to continuously improve your bot:
- Question frequency: What topics do students ask about most?
- Struggle patterns: Where do they ask the same question repeatedly?
- Accuracy tracking: Teacher reviews a sample of answers weekly
- Student ratings: "Was this answer helpful?" thumbs up/down
- Data flywheel: Use real student questions to create better training data!
This is the AI data flywheel!
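The first few signals in the list can be computed from a simple interaction log. The sketch below assumes each logged interaction is a dictionary with `topic`, `question`, and `helpful` keys. That schema is hypothetical, so rename the fields to match whatever your chat application stores.

```python
from collections import Counter


def summarize_feedback(interactions):
    """Summarize a log of interactions.

    Each interaction is a dict with hypothetical keys:
    'topic' (str), 'question' (str), and 'helpful' (True, False, or None if unrated).
    """
    topic_counts = Counter(item["topic"] for item in interactions)
    question_counts = Counter(item["question"].strip().lower() for item in interactions)
    rated = [item["helpful"] for item in interactions if item["helpful"] is not None]
    return {
        "top_topics": topic_counts.most_common(3),
        "repeated_questions": sorted(q for q, n in question_counts.items() if n > 1),
        "helpful_rate": sum(rated) / len(rated) if rated else None,
    }


if __name__ == "__main__":
    log = [
        {"topic": "science", "question": "What is photosynthesis?", "helpful": True},
        {"topic": "science", "question": "what is photosynthesis?", "helpful": False},
        {"topic": "math", "question": "What is a prime number?", "helpful": None},
    ]
    print(summarize_feedback(log))
```

Repeated questions with low helpful ratings are good candidates for new Q&A pairs in the next fine-tuning round.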
8.10 New Levels: Continue Your Journey
This book covered Levels 1-5. Here's where to go next:
| Level | Topic | What You'll Build |
|---|---|---|
| Level 6 | RAG System | Chatbot that searches NCERT textbooks before answering |
| Level 7 | AI Agent | Agent that uses tools: calculator, web search, code execution |
| Level 8 | Vision Model | Image recognition: identify plants, animals, diagrams |
| Level 9 | Speech Model | Voice chatbot: speak questions, hear answers in Hindi |
| Level 10 | Production Deploy | Docker, API server, load balancing, monitoring |
The Bigger Picture
You've built something remarkable: a complete AI system from zero. But this is just the beginning. The techniques in this book are the same foundations used by Google, OpenAI, and Anthropic. The difference is scale.
India needs AI builders who understand these foundations deeply. With 250 million students and 22 languages, the opportunity to build impactful AI tools is enormous.
You have the knowledge. You have the code. Now go build something that matters.
Chapter Summary: Future Roadmap
- Model improvements: BPE tokenization, more data, larger architecture → dramatically better output
- Multi-language: SentencePiece + multilingual data → support all Indian languages
- RAG: Search textbooks before answering → accurate, sourced answers
- Subject tutors: Multiple LoRA adapters on one base model → specialized experts
- Training upgrades: FP16, gradient accumulation, DPO → faster, better training
- Safety: Content filtering, confidence scores, source attribution → trustworthy bot
- Deployment: Web, WhatsApp, mobile, classroom kiosk → reach every student
- Analytics: Track questions, accuracy, ratings → continuous improvement
- New levels: RAG, agents, vision, speech, production → your learning never stops!