EduArtha Interactive Books
Understanding AI: From Zero to nanoCodeGPT
An interactive journey building a multi-language coding AI from scratch using Karpathy's nanoGPT architecture.
Credit to Andrej Karpathy for the original foundation
nanoGPT: Project Overview & Extensibility to Code Models
What is nanoGPT?
nanoGPT is an open-source project by Andrej Karpathy, described in its README as the simplest, fastest repository for training and fine-tuning medium-sized GPTs. It is a minimal, educational, and highly hackable implementation of the GPT (Generative Pre-trained Transformer) architecture.
NOTE
The entire codebase is intentionally small (~300 lines for the model + ~300 lines for training), making it an ideal starting point for learning and experimentation.
Key Files
| File | Purpose |
|---|---|
| model.py | GPT model definition (GPT-2 architecture) |
| train.py | Training loop with distributed training support |
| sample.py | Text generation / inference script |
| configurator.py | Configuration management |
| data/ | Data preparation scripts (Shakespeare, OpenWebText) |
Architecture
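nanoGPT implements a standard GPT-2-style transformer: token and position embeddings feed a stack of identical blocks (each combining causal self-attention and an MLP, wrapped in residual connections and LayerNorm), followed by a final LayerNorm and an output head that predicts the next token.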
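To make the layer stack concrete, here is a rough sketch that counts the parameters of the smallest GPT-2 configuration component by component. The hyperparameters (12 layers, 768-dimensional embeddings, a 50,257-token vocabulary, 1,024 positions) are the published GPT-2 small values. The helper name `gpt2_param_count` is made up for this illustration, and it assumes bias terms everywhere and an output head that shares its weights with the token embedding.

```python
def gpt2_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024):
    """Count parameters of a GPT-2-style model (biases on, tied output head)."""
    wte = vocab_size * n_embd                  # token embeddings (shared with the output head)
    wpe = block_size * n_embd                  # learned position embeddings
    layer_norm = 2 * n_embd                    # gain + bias
    attn = n_embd * 3 * n_embd + 3 * n_embd    # fused Q, K, V projection
    attn += n_embd * n_embd + n_embd           # attention output projection
    mlp = n_embd * 4 * n_embd + 4 * n_embd     # expand 4x
    mlp += 4 * n_embd * n_embd + n_embd        # project back down
    block = 2 * layer_norm + attn + mlp        # one transformer block
    return wte + wpe + n_layer * block + layer_norm  # last term: final LayerNorm

total = gpt2_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
# prints: 124,439,808 parameters (~124M)
```

Roughly two thirds of the total sits in the 12 blocks, and within each block the MLP holds about twice as many weights as the attention layer.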
Key Capabilities
- Train GPT models from scratch on any text dataset
- Fine-tune pre-trained GPT-2 checkpoints (124M, 355M, 774M, 1.5B params)
- Supports single GPU, multi-GPU (DDP), and Apple Silicon (MPS)
- Highly configurable (block size, embedding size, number of heads/layers, dropout, etc.)
- Clean PyTorch implementation that is easy to modify
Can nanoGPT Be Extended to a Coding Model?
Yes, absolutely!
nanoGPT is a general-purpose text generation model. Since programming languages (Python, Java, C, C++) are structured text, you can train or fine-tune nanoGPT to generate code. Here's a practical roadmap:
Step 1: Prepare a Code Dataset
You need a large corpus of source code. Here are recommended sources:
| Source | Description | Size |
|---|---|---|
| The Stack | Deduplicated, permissively-licensed code from GitHub | ~3TB+ |
| CodeParrot | Python code from GitHub | ~50GB |
| GitHub Archive | Raw GitHub event data | Varies |
| Custom scraping | Your own curated repositories | Varies |
Create a data preparation script similar to data/openwebtext/prepare.py:
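Python
# data/code/prepare.py (example structure)
import os
import numpy as np
import tiktoken
from datasets import load_dataset
# Stream the Python subset of The Stack (gated: log in and accept the terms first).
# data_dir="data/python" follows the layout of "bigcode/the-stack"; other releases are organized differently.
dataset = load_dataset("bigcode/the-stack",
                       data_dir="data/python",
                       split="train",
                       streaming=True)
# Tokenize using the GPT-2 tokenizer (or a code-specific one)
enc = tiktoken.get_encoding("gpt2")
ids = []
for example in dataset.take(10_000):  # small sample; raise this for real training
    ids.extend(enc.encode_ordinary(example["content"]))
    ids.append(enc.eot_token)  # mark the file boundary
# Save 90% / 10% as train.bin / val.bin (uint16 is enough for GPT-2's 50,257 token ids)
ids = np.array(ids, dtype=np.uint16)
split = int(0.9 * len(ids))
out_dir = os.path.dirname(os.path.abspath(__file__))
ids[:split].tofile(os.path.join(out_dir, "train.bin"))
ids[split:].tofile(os.path.join(out_dir, "val.bin"))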
TIP
For multi-language support (Python + Java + C + C++), prepend each file with a language tag like <|python|>, <|java|>, <|c|>, <|cpp|> so the model learns to differentiate languages.
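The tagging step can be sketched as a small helper that picks the tag from the file extension. The extension mapping and the function name `tag_source` are assumptions made for this example, not part of nanoGPT.

```python
import os

# Hypothetical mapping from file extension to language tag.
LANGUAGE_TAGS = {
    ".py": "<|python|>",
    ".java": "<|java|>",
    ".c": "<|c|>",
    ".h": "<|c|>",
    ".cpp": "<|cpp|>",
    ".hpp": "<|cpp|>",
}

def tag_source(path, source):
    """Prepend the language tag for `path`, or return None for unknown extensions."""
    ext = os.path.splitext(path)[1].lower()
    tag = LANGUAGE_TAGS.get(ext)
    if tag is None:
        return None  # skip files we cannot classify
    return f"{tag}\n{source}"

print(tag_source("Main.java", "class Main {}"))
```

For the tags to work well, the tokenizer should map each tag to a single token; the stock GPT-2 BPE would split a string like <|python|> into several ordinary tokens.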
Step 2: Choose a Tokenizer Strategy
The default GPT-2 tokenizer works but is not optimal for code. Consider:
| Tokenizer | Pros | Cons |
|---|---|---|
| GPT-2 BPE (default) | Works out of the box, compatible with fine-tuning | Wastes tokens on whitespace/indentation |
| Code-specific BPE (e.g., from CodeLlama) | Better code compression, fewer tokens per file | Requires model changes for vocab size |
| Custom trained BPE | Optimal for your specific codebase | Requires training a new tokenizer |
IMPORTANT
If you change the tokenizer vocabulary size, you must update vocab_size in the model config, and you can no longer fine-tune directly from GPT-2 weights, because the embedding and output layers would not match the new vocabulary.
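When picking the new vocab_size, it can help to round up to a hardware-friendly multiple. nanoGPT's default training config does this for GPT-2, padding 50,257 up to 50,304 (a multiple of 64). The helper below is a sketch of that arithmetic; the function name and the 32,000-token tokenizer in the second call are hypothetical.

```python
def padded_vocab_size(base_vocab, num_special_tokens=0, multiple=64):
    """Round the total vocabulary size up to the next multiple of `multiple`."""
    total = base_vocab + num_special_tokens
    return ((total + multiple - 1) // multiple) * multiple

print(padded_vocab_size(50257))      # GPT-2 BPE alone -> 50304
print(padded_vocab_size(32000, 7))   # hypothetical 32k code tokenizer + 7 special tokens -> 32064
```

The padded slots are never produced by the tokenizer; they only make the embedding and output matrices a more convenient shape.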
Step 3: Training Approaches
Option A: Fine-tune GPT-2 on Code (Easiest)
Bash
# Download GPT-2 weights and fine-tune on your code dataset
python train.py config/finetune_code.py
Example config:
Python
# config/finetune_code.py
init_from = 'gpt2' # Start from GPT-2 124M
out_dir = 'out-code'
dataset = 'code' # Points to data/code/
batch_size = 8
block_size = 1024
max_iters = 50000
learning_rate = 3e-5 # Lower LR for fine-tuning
warmup_iters = 500
eval_interval = 500
Option B: Train from Scratch (More compute, better results)
Bash
python train.py config/train_code.py
Python
# config/train_code.py
init_from = 'scratch'
out_dir = 'out-code-scratch'
dataset = 'code'
n_layer = 12
n_head = 12
n_embd = 768
block_size = 2048 # Longer context for code
batch_size = 12
max_iters = 600000
learning_rate = 6e-4
Step 4: Code-Specific Enhancements
To make nanoGPT better for code, consider these modifications:
4a. Increase Context Length
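Code files are typically longer than natural language paragraphs. When training from scratch, increase block_size from 1024 to 2048 or 4096. When fine-tuning GPT-2 weights, block_size stays capped at 1024, because the pre-trained position embeddings cover only 1,024 positions.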
4b. Add Fill-in-the-Middle (FIM) Training
Modern code models use FIM to support code completion:
Python
# During data preparation, randomly apply the FIM transformation:
# Original:    prefix + middle + suffix
# Transformed: <|fim_prefix|>prefix<|fim_suffix|>suffix<|fim_middle|>middle
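A minimal character-level sketch of that transformation follows. It assumes the marker strings from the special-token list in 4c and a hypothetical `fim_rate` parameter that controls how often a document is rearranged. Real pipelines usually apply the split at the token level, which this example does not attempt.

```python
import random

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def fim_transform(text, rng, fim_rate=0.5):
    """Return text unchanged, or rearranged into prefix / suffix / middle order."""
    if len(text) < 2 or rng.random() >= fim_rate:
        return text  # leave this document as ordinary left-to-right text
    # Pick two cut points that split the text into prefix | middle | suffix.
    lo, hi = sorted(rng.sample(range(len(text) + 1), 2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

rng = random.Random(0)  # fixed seed so the example is reproducible
sample = "def add(a, b):\n    return a + b\n"
print(fim_transform(sample, rng, fim_rate=1.0))
```

Because the middle comes last, ordinary next-token training teaches the model to generate the missing span given both the code before and the code after it.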
4c. Add Special Tokens
Add tokens for code-specific tasks:
- <|file_start|>, <|file_end|>: file boundaries
- <|python|>, <|java|>, <|c|>, <|cpp|>: language tags
- <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>: FIM markers
4d. Use RoPE (Rotary Positional Embeddings)
Replace learned positional embeddings with RoPE for better length generalization; many community forks already implement this.
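Here is a minimal sketch of the rotation at the heart of RoPE, written in plain Python so it runs without PyTorch. It uses the consecutive-pair formulation with the commonly used base of 10000, and the function names are illustrative. A real integration would apply the same rotation to the query and key tensors inside the attention block instead of adding position embeddings to the input.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with the position."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # each pair gets its own frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# The attention score depends only on the distance between positions (2 in both cases).
print(dot(rope(q, 3), rope(k, 1)))
print(dot(rope(q, 10), rope(k, 8)))
```

Both printed scores agree up to floating-point error. That relative-position property is why RoPE tends to cope better with sequence lengths it was not trained on.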
Step 5: Sampling / Inference
Bash
python sample.py --out_dir=out-code \
--start="def fibonacci(n):" \
--num_samples=3 \
--max_new_tokens=256
Limitations & Considerations
| Aspect | Limitation | Mitigation |
|---|---|---|
| Model Size | nanoGPT typically trains 124M to 1.5B params; real code models are 7B to 70B+ | Still useful for learning & prototyping |
| Context Length | GPT-2 default is 1024 tokens (~20-40 lines of code) | Increase block_size, use RoPE |
| No Instruction Following | nanoGPT generates continuations, not instruction-following responses | Add instruction fine-tuning stage (SFT) |
| No RLHF | No human feedback alignment | Can add DPO/RLHF post-training |
| Compute | Training a competitive code model needs many GPUs | Fine-tuning GPT-2 is feasible on 1 GPU |
Recommended Architecture for a Code Model Extension
Summary
| Question | Answer |
|---|---|
| Can nanoGPT be extended to code? | Yes |
| Which languages? | Any: Python, Java, C, C++, and more |
| Easiest approach? | Fine-tune GPT-2 on code data |
| Best results? | Train from scratch with code-specific tokenizer + FIM |
| Production-ready? | Not directly, but an excellent learning/prototyping tool |
| Similar projects that did this? | CodeParrot, SantaCoder, and StarCoder all build on GPT-2-style decoder-only transformers |
The Project Roadmap
To bring this coding model to life, here is our comprehensive implementation plan:
Comprehensive Implementation
- Modified GPT Architecture: Infused with code-specific enhancements.
- Multi-Language Data Pipeline: Sourcing and preparing Python, Java, C, and C++ datasets.
- Code-Optimized Tokenizer: Built with dedicated special tokens.
- Training Infrastructure: Rigorous FIM support for code completion tasks.
- Interactive UI: A beautiful web interface for code generation and direct interaction.
Dataset
The dataset bigcode/the-stack-smol on HuggingFace is gated (it requires you to log in to HuggingFace and accept their terms of use before downloading).
Since this is just a demonstration to check that the entire pipeline works end to end, the data_prep.py script falls back to generating dummy code snippets (e.g. def hello_world_python_1(): print("Hello")) if the download fails.
Either way, the script outputs train.bin and val.bin (containing the dummy data when the fallback is used). Once it finishes, you can start testing with python train.py.
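The fallback logic might look roughly like the sketch below. This is not the actual data_prep.py: `load_snippets`, `make_dummy_snippets`, and `gated_download` are hypothetical names, and the failing download is simulated so the example runs without network access.

```python
def make_dummy_snippets(n=100):
    """Generate trivial placeholder functions so the pipeline has something to tokenize."""
    return [f'def hello_world_python_{i}(): print("Hello")' for i in range(1, n + 1)]

def load_snippets(download, n_dummy=100):
    """Try the real download; on any failure fall back to dummy data."""
    try:
        return list(download())
    except Exception as err:  # e.g. missing login for a gated dataset
        print(f"Download failed ({err}); using dummy snippets instead.")
        return make_dummy_snippets(n_dummy)

def gated_download():
    # Stand-in for the real dataset call, which fails until the terms are accepted.
    raise PermissionError("dataset is gated")

snippets = load_snippets(gated_download)
print(snippets[0])
```

A model trained on this dummy data will not write useful code, but it is enough to confirm that tokenization, training, and sampling all run.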
The Heart of AI (Self-Attention)
The absolute core of a Transformer is the Causal Self-Attention block. Everything else exists to support it.
The Q, K, V Transformation
Computers can't do math on letters, so first, we turn every word into a list of numbers (an embedding). But in a sentence, words need context. To get context, we multiply every word vector by three different mathematical weight matrices to create three new vectors:
- Query (Q): What am I looking for? (e.g., a verb looking for a noun)
- Key (K): What do I have? (e.g., "I am a noun")
- Value (V): What is my underlying meaning?
The Math: Dot Product (Q · K^T)
We use the Dot Product to mathematically find matching concepts. If the word "sat" acts as a Query, we multiply it against the Keys of all the words before it.
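Python
# Inside CausalSelfAttention.forward; x has shape (B, T, n_embd)
B, T, C = x.size()
# Calculate Q, K, V for all tokens at once
qkv = self.c_attn(x)
q, k, v = qkv.split(self.n_embd, dim=2)
# Split each into attention heads: (B, n_head, T, head_dim)
q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
# The core Attention Formula: softmax(Q K^T / sqrt(head_dim)) V, with a causal mask
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)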
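If "sat" and "cat" match mathematically, their Dot Product is a large score. The scores are divided by the square root of the head dimension to keep them in a stable range, then passed through a Softmax function, which turns them into weights that sum to 1 (e.g. 0.98 for "cat"). Each token's output is the average of the Value vectors under those weights.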
The Causal Mask
Notice the is_causal=True flag? That applies a mathematical "mask" (a triangle of negative infinities) over the attention matrix. It strictly forces the model to only look at the past and the present. If "cat" is allowed to "cheat" and look at "sat" during training, it will memorize the future instead of learning to predict it!
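To see what happens inside that single fused call, here is a slow, single-head version in plain Python. It is a teaching sketch under simplifying assumptions (no batching, no multiple heads, lists instead of tensors), not how PyTorch implements it.

```python
import math

def causal_attention(q, k, v):
    """q, k, v: lists of T vectors (lists of floats). Returns (outputs, weights)."""
    T, d = len(q), len(q[0])
    outputs, all_weights = [], []
    for t in range(T):
        # Scaled dot products; future positions get -inf (the causal mask).
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d) if s <= t else float("-inf")
            for s in range(T)
        ]
        # Softmax: exp(-inf) is 0, so masked positions receive zero weight.
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The output is the weighted average of the value vectors.
        outputs.append([sum(w * v[s][j] for s, w in enumerate(weights)) for j in range(len(v[0]))])
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors used as Q, K, and V
outputs, weights = causal_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower triangular: the first token can only attend to itself, the second to the first two, and so on.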
The Brain (MLP & GELU)
Attention is only communication. The actual "thinking" happens in the Multi-Layer Perceptron (MLP).
The 4x Expansion
Python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Expand the vector size by 4x
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU()
        # Squeeze it back down
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        # expand -> non-linearity -> project back
        return self.c_proj(self.gelu(self.c_fc(x)))
Notice that the first layer multiplies the embedding size by 4 (e.g., from 768 to 3,072). This temporarily gives the neural network a massive mathematical "scratchpad" to unfold complex logic before squeezing the final answer back down into the main pipeline.
The Magic of GELU (Fixing the Dying ReLU)
Older AI models used a math function called ReLU, which has a sharp corner at zero: any negative input is clamped to exactly 0, and the derivative there is also 0. If a neuron's input ends up negative for every training example, no gradient flows back through it, its weights stop updating, and the neuron can stay stuck at zero permanently ("The Dying ReLU Problem").
GELU (Gaussian Error Linear Unit) eases this. Instead of a sharp corner, it follows a smooth curve near zero, so small negative inputs still produce a small nonzero output and pass a nonzero gradient; the gradient only fades out for strongly negative inputs. The neuron keeps receiving feedback and can adjust itself in the next training iteration.
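The difference is easy to check numerically. The sketch below uses the exact GELU definition, x · Φ(x) with Φ the standard normal CDF, and compares its derivative with ReLU's at a slightly negative input. The function names are chosen for this example.

```python
import math

def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    """Derivative of GELU: CDF(x) + x * PDF(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

x = -0.5
print(f"ReLU gradient at {x}: {relu_grad(x)}")
print(f"GELU gradient at {x}: {gelu_grad(x):.4f}")
```

At x = -0.5 ReLU passes no gradient at all, while GELU still passes a small positive one, so the weights feeding that neuron keep getting updated.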
The Physics (Residuals)
Before residual connections, researchers who tried to build very deep networks (like 100 layers deep) ran into the Vanishing Gradient Problem. During backpropagation the gradient is multiplied by one factor per layer, and when those factors are small decimals, the signal that reaches the early layers is crushed toward zero, so those layers barely learn.
The Residual Highway
The solution is the simple + sign, called a Residual Connection.
Python
def forward(self, x, ...):
    x = x + self.attn(self.ln_1(x))
    x = x + self.mlp(self.ln_2(x))
    return x
Imagine a massive, straight highway running from the start of the neural network to the end (the x variable). The model copies the token vector off the highway, does the complex Attention math, and then adds the new context back onto the original highway (x = x + ...). Because addition passes gradients through unchanged, the highway gives them a direct route back to the earliest layers, no matter how deep the network is.
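A toy calculation shows the effect. Assume each layer is just a scalar function f(x) = w · x with a small w, which is a deliberate oversimplification of a real block. Without the residual, the gradient through the stack is w multiplied by itself once per layer; with it, each layer contributes a factor of 1 + w instead.

```python
def gradient_through_stack(n_layers, local_grad, residual):
    """Gradient of output w.r.t. input for a stack of scalar layers f(x) = local_grad * x.

    With residual=True each layer computes x + f(x) instead of f(x).
    """
    grad = 1.0
    for _ in range(n_layers):
        grad *= (1.0 + local_grad) if residual else local_grad
    return grad

plain = gradient_through_stack(12, 0.01, residual=False)
skip = gradient_through_stack(12, 0.01, residual=True)
print(f"without residuals: {plain:.1e}, with residuals: {skip:.3f}")
```

With 12 layers and w = 0.01, the plain stack leaves a gradient of about 1e-24, while the residual stack keeps it slightly above 1.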
Layer Normalization
If you keep adding vectors onto the highway 12 times in a row, the numbers can drift to very different scales from layer to layer and, in bad cases, overflow and crash training with a NaN error. LayerNorm (ln_1, ln_2) rescales each token vector so its mean is 0 and its variance is 1 (followed by a learned scale and shift) before it enters the Attention and MLP blocks, keeping the math stable.
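The normalization itself is only a few lines. This sketch leaves out the learned scale and shift that a full LayerNorm applies afterwards, and the epsilon of 1e-5 is the usual small constant that guards against division by zero.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to mean 0 and variance 1 (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

normed = layer_norm([3.0, 300.0, -150.0, 42.0])
print([round(x, 3) for x in normed])
```

Whatever scale the inputs had, the outputs land in a small, predictable range, which is what the Attention and MLP layers see.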
Scaling (Grokking)
Memorization vs Grokking
In the early stages of training (Iteration 1,000 to 5,000), the AI behaves like a giant statistical parrot: it mostly memorizes exact strings from the dataset (Overfitting). With enough examples and enough training, it can no longer store everything verbatim and is pushed to compress the information by discovering the underlying abstract rules of the data (variables, loops, syntax). It stops parroting and starts generalizing. Researchers use the term Grokking for the striking case where this generalization arrives suddenly, long after the model has already memorized its training set.
The Chinchilla Scaling Laws
How do researchers decide how big to make a model? In 2022, DeepMind published the Chinchilla Scaling Laws, which found that for a fixed compute budget, model size and training data should be scaled up in roughly equal proportion. The rule of thumb: for every 1 parameter in a model, you should train it on roughly 20 tokens of data.
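Applied to the GPT-2 sizes that nanoGPT can load, the 20-tokens-per-parameter rule of thumb gives the following rough data budgets. This is back-of-the-envelope arithmetic, not a prescription.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb

def chinchilla_tokens(n_params):
    """Approximate compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params

for name, n_params in [("GPT-2 124M", 124_000_000), ("GPT-2 1.5B", 1_500_000_000)]:
    print(f"{name}: ~{chinchilla_tokens(n_params) / 1e9:.1f}B tokens")
```

So even the smallest model wants roughly 2.5 billion tokens of code, and the 1.5B model about 30 billion.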
The Sweet Spot
Embedding sizes are constrained by the architecture and by GPU hardware: n_embd must be divisible by the number of attention heads, and sizes are usually chosen as multiples of 64 or 128 so that matrix multiplications map efficiently onto the GPU. Large modern language models typically use between 4,096 and 12,288 dimensions. Going wider is not free: the parameter count grows roughly with the square of the embedding size, so a wider model needs correspondingly more data and compute, or it ends up memorizing instead of learning.
The Future (Mamba, MoE)
If we want to upgrade nanoCodeGPT from a learning tool to a production-grade architecture, we can look to recent research directions:
- State Space Models (Mamba): The biggest flaw in Attention is that its cost scales quadratically with sequence length (O(N^2)). Mamba replaces Self-Attention with a selective state-space model that processes tokens in linear time (O(N)), which in principle makes very long inputs, such as a 100,000-line codebase, far cheaper to read.
- Mixture of Experts (MoE): Instead of having one giant MLP (Brain), models like DeepSeek split the MLP into smaller "Experts" and add a Router that sends each token to only a few of them. Most of the brain is asleep for any given token, so inference costs far less compute than a dense model of the same total size.
- Test-Time Compute (o1 Paradigm): Injecting "Reasoning Traces" into the dataset so the model writes a hidden <thought> block to plan the architecture and spot bugs before outputting the final code.
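For the Mixture of Experts idea, the routing step can be sketched in a few lines. This assumes the common top-k scheme, where each token goes to the k highest-scoring experts and their outputs are mixed with softmax weights. The function name and the example scores are made up, and production routers add details such as load balancing that are omitted here.

```python
import math

def route(router_logits, k=2):
    """Pick the top-k experts for one token and return {expert_index: weight}."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Hypothetical router scores for one token over four experts.
weights = route([0.1, 2.0, -1.0, 1.5], k=2)
print({i: round(w, 3) for i, w in weights.items()})
```

Only experts 1 and 3 run for this token; the other two are skipped entirely, which is where the compute saving comes from.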
Deploying the Web UI (Flask)
Once you have successfully trained your coding model, the next step is putting it into the hands of users! In the EDUARTHA_FLY project, we built a beautiful Web UI that connects directly to our trained nanoCodeGPT model using a Flask backend.
The Backend Architecture
We use Python and Flask to serve the model over a simple REST API. The core of this system lives in app.py.
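Python
from flask import Flask, request, jsonify
import tiktoken
import torch
from model import GPTConfig, GPT

app = Flask(__name__)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load your custom trained model
checkpoint = torch.load('out-code/ckpt.pt', map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
# Checkpoints saved from a torch.compile'd model prefix their keys with '_orig_mod.'
state_dict = {k.removeprefix('_orig_mod.'): v for k, v in checkpoint['model'].items()}
model.load_state_dict(state_dict)
model.eval().to(device)
enc = tiktoken.get_encoding('gpt2')  # assumes the model was trained with the GPT-2 tokenizer

@app.route('/generate', methods=['POST'])
def generate():
    data = request.get_json(silent=True) or {}
    prompt = data.get('prompt', '')
    # Tokenize, Generate, Decode
    ids = enc.encode(prompt, allowed_special={'<|endoftext|>'}) or [enc.eot_token]
    x = torch.tensor(ids, dtype=torch.long, device=device)[None, ...]
    with torch.no_grad():
        y = model.generate(x, max_new_tokens=256, temperature=0.8, top_k=200)
    output_text = enc.decode(y[0].tolist())
    return jsonify({"generated": output_text})

if __name__ == '__main__':
    app.run(port=5000)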
Avoiding Pickling Security Risks
When loading PyTorch checkpoints (.pt files), be cautious of untrusted models, as standard unpickling can execute arbitrary code. The EDUARTHA_FLY pipeline only loads our self-trained models, but if you load community checkpoints, pass weights_only=True to torch.load() so that only tensors and plain Python containers are deserialized.
Running the EDUARTHA_FLY Application
To launch the interactive coding companion, you simply execute the Flask application. It automatically binds to port 5000 and serves both the API and the static web assets.
Bash
# Navigate to the EDUARTHA_FLY directory
cd C:\Users\hp\PycharmProjects\EDUARTHA_FLY
# Run the backend and web server
python app.py
You can then visit http://localhost:5000 in your browser to interact with the AI directly! It includes support for temperature adjustment, language selection, and Fill-in-the-Middle (FIM) operations.