Neural Networks & Deep Learning

Chapter 19: Reinforcement Learning Foundations

Teaching Machines to Learn by Trial, Error, and Delayed Reward

โฑ๏ธ Reading Time: ~3.5 hours  |  ๐Ÿ“– Unit VI: Modern Deep Learning  |  ๐Ÿ› ๏ธ Theory + Implementation Chapter

๐Ÿ“‹ Prerequisites: Chapter 10 (Batch Normalization / Deep Network Training), Chapter 5 (Gradient Descent & Optimization Basics)

Bloom's Taxonomy Progression

Bloom's LevelWhat You'll Achieve
๐Ÿ”ต RememberDefine agent, environment, state, action, reward, return, policy, value function, and list the components of an MDP
๐Ÿ”ต UnderstandExplain why RL is fundamentally different from supervised learning, and interpret the Bellman equation as a recursive relationship between states
๐ŸŸข ApplyImplement Q-Learning on FrozenLake from scratch in NumPy; build a DQN for CartPole in PyTorch
๐ŸŸก AnalyzeCompare value-based vs policy-based methods; dissect experience replay and target networks in DQN
๐ŸŸ  EvaluateAssess exploration strategies (ฮต-greedy vs UCB vs Boltzmann); critique RLHF for LLM alignment
๐Ÿ”ด CreateDesign an RL formulation for ISRO spacecraft navigation; architect a PPO training loop for custom environments
Section 1

Learning Objectives

By the end of this chapter, you will be able to:

  • Define the complete RL framework โ€” Agent, Environment, State (S), Action (A), Reward (R), Policy (ฯ€), Return (G), and Discount Factor (ฮณ) โ€” and explain how they interconnect
  • Derive the Bellman Expectation and Bellman Optimality equations from first principles, and explain why they are the foundation of all RL algorithms
  • Implement Q-Learning from scratch in NumPy to solve FrozenLake, understanding every line of the tabular update rule
  • Build a Deep Q-Network (DQN) in PyTorch with experience replay and target networks to solve CartPole-v1
  • Explain REINFORCE (Monte Carlo Policy Gradient) and Actor-Critic methods, deriving the policy gradient theorem
  • Describe how PPO (Proximal Policy Optimization) enables RLHF (Reinforcement Learning from Human Feedback) for training ChatGPT-style models
  • Formulate real-world problems as MDPs โ€” including ISRO spacecraft navigation and DeepMind's AlphaGo โ€” and select appropriate RL algorithms
  • Analyze the exploration-exploitation trade-off and compare ฮต-greedy, UCB, and Boltzmann exploration strategies
Section 2

Opening Hook

๐Ÿš€ The 4-Minute Gap: How ISRO's Mars Orbiter Had to Think for Itself

On September 24, 2014, ISRO's Mangalyaan entered Mars orbit, making India the first country to succeed on its maiden attempt. But here's what most people don't know: at Mars distance, radio signals take 4 to 24 minutes to travel between Earth and the spacecraft. When Mangalyaan needed to fire its thrusters for the critical Mars Orbit Insertion (MOI), ground controllers couldn't provide real-time commands. The spacecraft had to make its own decisions.

Consider what this means. The spacecraft is hurtling at 22.1 km/s. It must decide when to fire engines, how long to burn, and how to correct trajectory errors โ€” all autonomously. Each decision changes the spacecraft's state (position, velocity, fuel). Each thruster burn consumes irreplaceable fuel (reward = -fuel_used, but also reward = +getting_closer_to_orbit). The goal: maximize the probability of successful orbit insertion while minimizing fuel waste.

This is Reinforcement Learning. Not supervised learning โ€” nobody gave Mangalyaan a labeled dataset of "correct thruster burns." Not unsupervised learning โ€” there are no clusters to find. Instead, an agent (spacecraft) interacts with an environment (space), takes actions (thruster burns), observes states (position/velocity), and receives rewards (orbit accuracy minus fuel cost). It must learn a policy โ€” a strategy for which action to take in each state โ€” that maximizes total long-term reward.

From teaching spacecraft to navigate, to training AlphaGo to defeat the world champion, to fine-tuning ChatGPT with human preferences โ€” reinforcement learning is the third pillar of machine learning. Let's build it from the ground up.

ISRODeepMindOpenAIGoogle Brain
Section 3

The Intuition First

The Cricket Analogy ๐Ÿ

Imagine you're Virat Kohli facing Jasprit Bumrah in a T20 match. Here's your RL setup:

  • You (Agent): The batsman making decisions
  • Environment: The pitch, fielders, bowler, match situation
  • State: Current score, balls remaining, field placement, bowler's run-up angle
  • Action: Defensive block, cover drive, pull shot, leave the ball
  • Reward: +1 for a single, +4 for boundary, +6 for a six, โˆ’1 for getting out
  • Policy: Your batting strategy โ€” "When the ball is short and outside off, play the cut shot"

Here's the crucial insight: rewards are delayed. You might play five dot balls (reward = 0 each) to set up a boundary on the sixth ball (reward = +4). A greedy batsman who swings at everything might score quickly but get out. A patient batsman builds an innings. RL is about finding the optimal balance between immediate and future rewards.

The "Aha!" Question: In supervised learning, you have a teacher telling you the right answer. In RL, you only know if your answer was good after you see the consequences โ€” sometimes much later. How do you learn from rewards that come 100 steps after the decision that caused them? This is the credit assignment problem, and solving it is what makes RL fascinating.

Three Types of Machine Learning โ€” Final Piece of the Puzzle

Machine Learning Paradigms โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• SUPERVISED LEARNING UNSUPERVISED LEARNING REINFORCEMENT LEARNING โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Teacher gives you No teacher. You learn by doing. correct answers. Find hidden Trial and error. structure. Delayed rewards. Input โ†’ Label Input โ†’ ??? State โ†’ Action โ†’ Reward (x, y) pairs x only (s, a, r, s') tuples Example: Example: Example: Cat/Dog classifier Customer clusters Game-playing AI Spam detection Anomaly detection Robot navigation Medical diagnosis Topic modeling Trading agents โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Richard Sutton and Andrew Barto's 1998 textbook "Reinforcement Learning: An Introduction" was so influential that it's still the definitive reference today. Sutton formalized RL ideas that were inspired by animal learning psychology from the 1950s โ€” specifically B.F. Skinner's experiments with pigeons learning through reward and punishment!
Section 4

Mathematical Foundation & Core Concepts

19.1 The Agentโ€“Environment Interaction Loop

Every RL problem follows one fundamental loop:

The RL Interaction Loop โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ AGENT โ”‚ aโ‚œ (action) โ”‚ (Policy โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ ฯ€(s)) โ”‚ โ”‚ โ”‚ โ”‚ โ–ผ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ sโ‚œโ‚Šโ‚ โ”‚ rโ‚œโ‚Šโ‚ โ”‚ ENVIRONMENTโ”‚ (next state) โ”‚ (reward) โ”‚ โ”‚ โ”‚ โ”‚ P(s'|s,a) โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ R(s,a,s') โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Observe โ”‚โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ โ”‚ sโ‚œโ‚Šโ‚, rโ‚œโ‚Šโ‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ At time t: 1. Agent observes state sโ‚œ 2. Agent selects action aโ‚œ ~ ฯ€(ยท|sโ‚œ) 3. Environment transitions to sโ‚œโ‚Šโ‚ ~ P(ยท|sโ‚œ,aโ‚œ) 4. Agent receives reward rโ‚œโ‚Šโ‚ = R(sโ‚œ,aโ‚œ,sโ‚œโ‚Šโ‚) 5. Repeat from step 1 โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

Key Definitions

๐Ÿ“– The RL Vocabulary

Agent The learner and decision-maker. It observes the environment and chooses actions to maximize cumulative reward. Examples: a robot, a game-playing AI, a spacecraft controller. Environment Everything outside the agent. It receives the agent's action, transitions to a new state, and provides a reward. The environment may be deterministic (chess) or stochastic (weather, stock market). State (S or sโ‚œ) A complete description of the world at time t. In chess, the state is the board position. In a self-driving car, the state includes position, velocity, nearby objects, traffic signals. Action (A or aโ‚œ) What the agent can do. The action space can be discrete (left/right/up/down) or continuous (steering angle โˆˆ [โˆ’30ยฐ, +30ยฐ]). Reward (R or rโ‚œ) A scalar signal from the environment telling the agent how good its action was. +1 for winning, โˆ’1 for losing, โˆ’0.01 per time step (to encourage speed). Policy (ฯ€) The agent's strategy. Maps states to actions. Can be deterministic: ฯ€(s) = a, or stochastic: ฯ€(a|s) = P(action = a | state = s). Return (Gโ‚œ) Total accumulated reward from time t onwards: Gโ‚œ = rโ‚œโ‚Šโ‚ + ฮณยทrโ‚œโ‚Šโ‚‚ + ฮณยฒยทrโ‚œโ‚Šโ‚ƒ + ... where ฮณ โˆˆ [0,1] is the discount factor. Discount Factor (ฮณ) How much we value future rewards vs immediate rewards. ฮณ=0 means completely myopic (only care about next reward). ฮณ=1 means all future rewards count equally. Typical: ฮณ = 0.99.

โŒ MYTH: "State and observation are the same thing."

โœ… TRUTH: In many real problems, the agent can only partially observe the state. A poker player doesn't see opponents' cards. This creates a Partially Observable MDP (POMDP), which is harder to solve.

๐Ÿ” WHY IT MATTERS: If you treat a partial observation as the full state, your agent will make suboptimal decisions. Real ISRO systems deal with sensor noise โ€” the true spacecraft state is estimated, not directly observed.

19.2 Markov Decision Processes (MDPs) & Bellman Equations

The Markov Property

The mathematical backbone of RL is the Markov Decision Process. The key assumption is the Markov Property: the future depends only on the present state, not on the history of how you got there.

Markov Property: P(sโ‚œโ‚Šโ‚ | sโ‚œ, aโ‚œ, sโ‚œโ‚‹โ‚, aโ‚œโ‚‹โ‚, ..., sโ‚€, aโ‚€) = P(sโ‚œโ‚Šโ‚ | sโ‚œ, aโ‚œ)

Think of it this way: if you know the complete position of every piece in a chess game right now, you don't need to know the history of moves that led to this position to decide your next move. The current state contains all relevant information.

Formal MDP Definition

๐Ÿ“ Markov Decision Process (MDP)

An MDP is a 5-tuple (S, A, P, R, ฮณ) where:
  • S โ€” finite set of states
  • A โ€” finite set of actions (or A(s) for state-dependent actions)
  • P(s'|s,a) โ€” transition probability: probability of reaching state s' when taking action a in state s
  • R(s,a,s') โ€” reward function: immediate reward for transition (s, a) โ†’ s'
  • ฮณ โˆˆ [0,1) โ€” discount factor

Value Functions: V(s) and Q(s,a)

Now here's the central question: given a policy ฯ€, how good is it to be in state s? We define two value functions:

State-Value Function: Vฯ€(s) = Eฯ€[Gโ‚œ | sโ‚œ = s] = Eฯ€[ฮฃk=0โˆž ฮณk ยท rโ‚œโ‚Šโ‚–โ‚Šโ‚ | sโ‚œ = s]
Action-Value Function: Qฯ€(s,a) = Eฯ€[Gโ‚œ | sโ‚œ = s, aโ‚œ = a]

V(s) answers: "How much total reward can I expect starting from state s, if I follow policy ฯ€?"

Q(s,a) answers: "How much total reward can I expect if I'm in state s, take action a, and then follow policy ฯ€?"

Deriving the Bellman Expectation Equation for Vฯ€(s)

Let's derive this from scratch. Start with the definition:

Step 1: Vฯ€(s) = Eฯ€[Gโ‚œ | sโ‚œ = s]
Step 2: Expand the return: Gโ‚œ = rโ‚œโ‚Šโ‚ + ฮณยทGโ‚œโ‚Šโ‚
Step 3: Vฯ€(s) = Eฯ€[rโ‚œโ‚Šโ‚ + ฮณยทGโ‚œโ‚Šโ‚ | sโ‚œ = s]
Step 4: Split the expectation using linearity: Vฯ€(s) = Eฯ€[rโ‚œโ‚Šโ‚ | sโ‚œ = s] + ฮณ ยท Eฯ€[Gโ‚œโ‚Šโ‚ | sโ‚œ = s]
Step 5: Marginalize over actions a and next states s': Vฯ€(s) = ฮฃa ฯ€(a|s) ยท ฮฃs' P(s'|s,a) ยท [R(s,a,s') + ฮณ ยท Vฯ€(s')]

This is the Bellman Expectation Equation. Read it as: "The value of state s equals the expected immediate reward plus the discounted value of the next state, averaged over all actions (weighted by policy) and all transitions (weighted by environment dynamics)."

Bellman Expectation Equation:
Vฯ€(s) = ฮฃa ฯ€(a|s) ยท ฮฃs' P(s'|s,a) ยท [R(s,a,s') + ฮณ ยท Vฯ€(s')]
Bellman Expectation for Q:
Qฯ€(s,a) = ฮฃs' P(s'|s,a) ยท [R(s,a,s') + ฮณ ยท ฮฃa' ฯ€(a'|s') ยท Qฯ€(s',a')]

Bellman Optimality Equations

The optimal policy ฯ€* maximizes the value function for every state. The optimal value functions satisfy:

Bellman Optimality (V*):
V*(s) = maxa ฮฃs' P(s'|s,a) ยท [R(s,a,s') + ฮณ ยท V*(s')]
Bellman Optimality (Q*):
Q*(s,a) = ฮฃs' P(s'|s,a) ยท [R(s,a,s') + ฮณ ยท maxa' Q*(s',a')]

The key difference: instead of averaging over actions (as the policy dictates), the optimality equation takes the max over actions. The optimal agent always picks the best action.

Q: What's the difference between Bellman Expectation and Bellman Optimality?

A: Expectation equation evaluates a given policy ฯ€ (uses ฮฃa ฯ€(a|s)). Optimality equation finds the best policy (uses maxa). If you know Q*, the optimal policy is simply ฯ€*(s) = argmaxa Q*(s,a).

Vฯ€(s) = ฮฃa ฯ€(a|s) Qฯ€(s,a)   vs   V*(s) = maxa Q*(s,a)
Bellman Backup Diagrams โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• V(s) Backup: Q(s,a) Backup: (s) (s,a) / | \ โ† ฯ€(a|s) / \ / | \ / \ (aโ‚)(aโ‚‚)(aโ‚ƒ) P(s'|s,a) P(s'|s,a) | | | | | โ–ผ โ–ผ โ–ผ โ–ผ โ–ผ โ—‹โ”€โ”€โ”€โ—‹โ”€โ”€โ”€โ—‹ (s'โ‚) (s'โ‚‚) s' s' s' / | \ / | \ a' a' a' a' a' a' Circle = state node Square = action node V(s) = ฮฃ_a ฯ€(a|s) ยท Q(s,a) Q(s,a) = R + ฮณฮฃ_s' PยทV(s') โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

19.3 Q-Learning: The Workhorse of Tabular RL

Here's the problem: the Bellman Optimality Equation is beautiful, but solving it requires knowing P(s'|s,a) โ€” the environment's transition dynamics. In most real problems, you don't know these dynamics. You can't solve the equation analytically.

Q-Learning (Watkins, 1989) solves this by learning Q*(s,a) directly from experience, without knowing the environment dynamics. It's a model-free, off-policy, temporal-difference method.

Deriving the Q-Learning Update Rule

Step 1: Start with Bellman Optimality for Q: Q*(s,a) = E[r + ฮณ ยท maxa' Q*(s',a')]
Step 2: We don't know the expectation (environment dynamics unknown), but we can sample! After taking action a in state s, we observe (r, s').
Step 3: Our "target" is: target = r + ฮณ ยท maxa' Q(s',a')
Step 4: Our current estimate is: Q(s,a)
Step 5: The TD error (temporal difference): ฮด = target โˆ’ Q(s,a)
Step 6: Update Q toward the target: Q(s,a) โ† Q(s,a) + ฮฑ ยท [r + ฮณ ยท maxa' Q(s',a') โˆ’ Q(s,a)]

Here ฮฑ โˆˆ (0,1] is the learning rate. This single update rule is Q-Learning! It provably converges to Q* given sufficient exploration and a decaying learning rate.

Q-Learning Update:
Q(s,a) โ† Q(s,a) + ฮฑ ยท [r + ฮณ ยท maxa' Q(s',a') โˆ’ Q(s,a)]

Why "off-policy"? Notice the update uses max over next actions โ€” it's learning the value of the optimal policy regardless of what exploration policy the agent actually follows. The agent might explore randomly (ฮต-greedy) but learns the greedy optimal policy. This separation of behavior policy from target policy is what makes Q-Learning "off-policy."

The Q-Learning Algorithm

Q-Learning Algorithm (Tabular) โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• Initialize Q(s,a) = 0 for all s, a FOR each episode: โ”‚ Initialize state s โ”‚ WHILE s is not terminal: โ”‚ โ”‚ โ”‚ โ”‚ Choose action a using ฮต-greedy: โ”‚ โ”‚ With prob ฮต: a = random action โ”‚ โ”‚ With prob 1-ฮต: a = argmax_a' Q(s,a') โ”‚ โ”‚ โ”‚ โ”‚ Take action a, observe reward r, next state s' โ”‚ โ”‚ โ”‚ โ”‚ Update Q-table: โ”‚ โ”‚ Q(s,a) โ† Q(s,a) + ฮฑ[r + ฮณยทmax_a' Q(s',a') - Q(s,a)] โ”‚ โ”‚ โ”‚ โ”‚ s โ† s' โ”‚ โ””โ”€โ”€ โ””โ”€โ”€ โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Chris Watkins developed Q-Learning as part of his 1989 PhD thesis at Cambridge. It was one of the first RL algorithms proven to converge. The "Q" stands for "Quality" โ€” as in the quality of taking action a in state s. Simple name, profound algorithm.

19.4 Deep Q-Networks (DQN): When the Table Gets Too Big

Q-Learning stores a Q-value for every (state, action) pair in a table. This works for small problems like FrozenLake (16 states ร— 4 actions = 64 entries). But what about Atari games? A 210ร—160 pixel screen with 128 colors gives approximately 128^(210ร—160) โ‰ˆ 10^(70,000) possible states. You can't build a table that large.

Solution: Replace the Q-table with a neural network! Instead of Q[s][a], use a network Q(s; ฮธ) that takes state s as input and outputs Q-values for all actions.

From Q-Table to Deep Q-Network โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• Q-TABLE (Tabular) DQN (Neural Network) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ Left โ”‚ Rightโ”‚ โ”‚ State s (input) โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ [position, vel] โ”‚ โ”‚ sโ‚ โ”‚ 0.3 โ”‚ 0.7 โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ sโ‚‚ โ”‚ 0.5 โ”‚ 0.2 โ”‚ โ”Œโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ” โ”‚ sโ‚ƒ โ”‚ 0.1 โ”‚ 0.9 โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”คHidden โ”œโ”€โ”€โ”€โ”€โ” โ”‚ ... โ”‚ ... โ”‚ ... โ”‚ โ”‚ โ”‚Layers โ”‚ โ”‚ โ”‚ sโ‚™ โ”‚ 0.4 โ”‚ 0.6 โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ–ผ โ–ผ Q(s,Left) Q(s,Right) Lookup: O(1) = 0.3 = 0.7 Memory: O(|S|ร—|A|) Can't generalize Generalizes across similar states! โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

Two Critical Innovations in DQN (Mnih et al., 2015)

Naively training a neural network with Q-Learning updates doesn't work. The DeepMind team (2013/2015 Nature paper) identified two problems and their solutions:

Problem 1: Correlated Samples

Consecutive game frames are highly correlated (frame at time t looks almost identical to frame at time t+1). Training a neural network on correlated data violates the i.i.d. assumption and causes unstable learning.

๐Ÿ’พ Solution: Experience Replay

Store transitions (s, a, r, s', done) in a replay buffer D of size N (typically 10โถ). During training, randomly sample mini-batches from D to update the network. This:

  • Breaks temporal correlations between consecutive samples
  • Reuses rare/important experiences multiple times
  • Increases data efficiency dramatically

Think of it like this: instead of studying topics in the order you encountered them, you shuffle your flashcards and review random ones each study session.

Problem 2: Moving Target

The Q-Learning target is r + ฮณ ยท maxa' Q(s',a'; ฮธ), but ฮธ changes every update step. So the target we're chasing also changes! It's like trying to hit a moving target while also being on a moving platform.

๐ŸŽฏ Solution: Target Network

Maintain two networks with identical architecture:

  • Online network Q(s,a; ฮธ) โ€” updated every step
  • Target network Q(s,a; ฮธโป) โ€” frozen copy, updated periodically

The loss function becomes:

L(ฮธ) = E[(r + ฮณ ยท maxa' Q(s',a'; ฮธโป) โˆ’ Q(s,a; ฮธ))ยฒ]

Every C steps (typically C=10,000), copy ฮธ โ†’ ฮธโป. The target stays fixed for C steps, stabilizing training.

DQN Loss Function:
L(ฮธ) = E(s,a,r,s') ~ D[(r + ฮณ ยท maxa' Q(s',a'; ฮธโป) โˆ’ Q(s,a; ฮธ))ยฒ]

"Human-level Control through Deep Reinforcement Learning" โ€” Mnih et al., Nature 2015. This paper showed a single DQN architecture achieving superhuman performance on 29 out of 49 Atari games, using only raw pixels and score as input. It was the paper that reignited the entire field of deep RL. The key insight: experience replay + target networks made deep Q-Learning stable.

Recent (2023-2025): Rainbow DQN combined 6 improvements (Double DQN, Prioritized Replay, Dueling, Multi-step, Distributional, NoisyNets). More recently, decision transformers reframe RL as sequence modeling, bridging RL and transformers.

19.5 Policy Gradient Methods: REINFORCE & Actor-Critic

Q-Learning and DQN are value-based methods: they learn Q(s,a) and derive a policy from it (pick the action with highest Q). But what if the action space is continuous (like steering angle or thrust magnitude)? You can't take argmax over infinite actions!

Policy gradient methods directly parameterize the policy ฯ€(a|s; ฮธ) โ€” typically as a neural network โ€” and optimize ฮธ by gradient ascent on expected return.

The Policy Gradient Theorem

Deriving the REINFORCE Gradient

Step 1: Define the objective. We want to maximize: J(ฮธ) = Eฯ„~ฯ€ฮธ[R(ฯ„)] = E[ฮฃt=0T ฮณt ยท rโ‚œ]
Step 2: The probability of trajectory ฯ„ = (sโ‚€, aโ‚€, rโ‚, sโ‚, aโ‚, ...): P(ฯ„|ฮธ) = p(sโ‚€) ยท ฮ t=0T ฯ€(aโ‚œ|sโ‚œ;ฮธ) ยท P(sโ‚œโ‚Šโ‚|sโ‚œ,aโ‚œ)
Step 3: Take gradient of J(ฮธ): โˆ‡ฮธJ(ฮธ) = โˆ‡ฮธ Eฯ„[R(ฯ„)] = โˆ‡ฮธ โˆซ P(ฯ„|ฮธ)R(ฯ„) dฯ„
Step 4: Apply the log-derivative trick: โˆ‡f = f ยท โˆ‡log(f): = โˆซ P(ฯ„|ฮธ) ยท โˆ‡ฮธ log P(ฯ„|ฮธ) ยท R(ฯ„) dฯ„ = Eฯ„[โˆ‡ฮธ log P(ฯ„|ฮธ) ยท R(ฯ„)]
Step 5: Expand log P(ฯ„|ฮธ): log P(ฯ„|ฮธ) = log p(sโ‚€) + ฮฃt [log ฯ€(aโ‚œ|sโ‚œ;ฮธ) + log P(sโ‚œโ‚Šโ‚|sโ‚œ,aโ‚œ)]
Step 6: The gradient โˆ‡ฮธ kills terms not depending on ฮธ: โˆ‡ฮธ log P(ฯ„|ฮธ) = ฮฃt=0T โˆ‡ฮธ log ฯ€(aโ‚œ|sโ‚œ;ฮธ)
Step 7: Final result (REINFORCE gradient): โˆ‡ฮธJ(ฮธ) = Eฯ„[ฮฃt=0T โˆ‡ฮธ log ฯ€(aโ‚œ|sโ‚œ;ฮธ) ยท Gโ‚œ]

where Gโ‚œ = ฮฃk=tT ฮณk-t ยท rโ‚–โ‚Šโ‚ is the return from time t.

Intuition: If an action led to high return (Gโ‚œ large), increase its probability. If it led to low return, decrease its probability. The gradient naturally performs credit assignment!

REINFORCE (Policy Gradient):
โˆ‡ฮธJ(ฮธ) = Eฯ„~ฯ€ฮธ[ฮฃt=0T โˆ‡ฮธ log ฯ€(aโ‚œ|sโ‚œ;ฮธ) ยท Gโ‚œ]

The Variance Problem and Baselines

REINFORCE has a problem: high variance. The return Gโ‚œ can vary wildly between episodes. Solution: subtract a baseline b(sโ‚œ) that doesn't depend on the action:

REINFORCE with Baseline:
โˆ‡ฮธJ(ฮธ) = E[ฮฃt โˆ‡ฮธ log ฯ€(aโ‚œ|sโ‚œ;ฮธ) ยท (Gโ‚œ โˆ’ b(sโ‚œ))]

The best baseline? The value function V(s)! The term (Gโ‚œ โˆ’ V(sโ‚œ)) is called the Advantage A(s,a) โ€” it measures how much better action a is compared to the average.

Actor-Critic: Best of Both Worlds

Instead of waiting until the end of an episode to compute Gโ‚œ (Monte Carlo), use a learned value function (the Critic) as the baseline:

Actor-Critic Architecture โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ State sโ‚œ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ” โ”‚ Shared โ”‚ (optional) โ”‚ Features โ”‚ โ””โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”˜ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ–ผ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ ACTOR โ”‚ โ”‚ CRITIC โ”‚ โ”‚ ฯ€(a|s;ฮธ)โ”‚ โ”‚ V(s;w) โ”‚ โ”‚ (Policy)โ”‚ โ”‚ (Value) โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ Action aโ‚œ TD Error: ฮดโ‚œ = rโ‚œโ‚Šโ‚ + ฮณV(sโ‚œโ‚Šโ‚) - V(sโ‚œ) Actor update: ฮธ โ† ฮธ + ฮฑ_ฮธ ยท ฮดโ‚œ ยท โˆ‡log ฯ€ Critic update: w โ† w + ฮฑ_w ยท ฮดโ‚œ ยท โˆ‡V(sโ‚œ;w) โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

The Actor (policy network) decides what to do. The Critic (value network) evaluates how good the action was. Together, they learn faster and with lower variance than pure policy gradient.

๐Ÿ‡ฎ๐Ÿ‡ณ Value-Based RL in India

Where it's used:

  • ISRO spacecraft controllers (discrete thruster decisions)
  • Flipkart recommendation engine (discrete item selection)
  • Indian stock market algorithmic trading (Zerodha, Upstox)
  • Traffic signal optimization (Bengaluru's ITMS project)

Popular frameworks: Stable-Baselines3, RLlib

Hiring: IISc, IIT Bombay, ISRO, Jio AI Labs

๐Ÿ‡บ๐Ÿ‡ธ Policy-Based RL in USA

Where it's used:

  • OpenAI GPT fine-tuning via RLHF (PPO-based)
  • DeepMind robotics (continuous control with SAC)
  • Waymo self-driving (hybrid continuous action spaces)
  • Meta Ad bidding optimization

Popular frameworks: OpenAI Gym, CleanRL, TorchRL

Hiring: DeepMind, OpenAI, Anthropic, Meta FAIR

19.6 PPO and RLHF: From Games to ChatGPT

Proximal Policy Optimization (PPO)

REINFORCE and vanilla Actor-Critic can make very large policy updates, causing training instability. TRPO (Trust Region Policy Optimization) solved this but was complex. PPO (Schulman et al., 2017) achieves similar stability with a much simpler implementation.

Key idea: clip the policy ratio to prevent the new policy from deviating too far from the old one:

PPO Clipped Objective:
LCLIP(ฮธ) = E[min(rโ‚œ(ฮธ) ยท Aโ‚œ, clip(rโ‚œ(ฮธ), 1โˆ’ฮต, 1+ฮต) ยท Aโ‚œ)]
where rโ‚œ(ฮธ) = ฯ€(aโ‚œ|sโ‚œ;ฮธ) / ฯ€(aโ‚œ|sโ‚œ;ฮธ_old) is the probability ratio

The clip function keeps rโ‚œ(ฮธ) between [1โˆ’ฮต, 1+ฮต] (typically ฮต = 0.2). If the advantage Aโ‚œ is positive (good action), rโ‚œ can increase but is clipped at 1+ฮต. If Aโ‚œ is negative (bad action), rโ‚œ can decrease but is clipped at 1โˆ’ฮต. This prevents catastrophic policy changes.

RLHF: Teaching LLMs from Human Preferences

PPO became famous not for games, but for RLHF โ€” the technique used to fine-tune ChatGPT, Claude, and Gemini. Here's how it works:

RLHF Pipeline for LLM Alignment โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• STEP 1: Supervised Fine-Tuning (SFT) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Pre-trained LLM โ†’ Fine-tune on human-written demonstrations Result: SFT Model (can follow instructions but imperfectly) STEP 2: Train Reward Model โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Human annotators rank LLM outputs: Prompt: "Explain gravity" Response A: [technical, accurate] โ† Rank 1 (preferred) Response B: [vague, wrong] โ† Rank 2 Train a reward model: R(prompt, response) โ†’ scalar score STEP 3: PPO Optimization โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” prompt โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ SFT โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”‚ RL โ”‚ โ”‚ Model โ”‚ (frozen ref) โ”‚ Model โ”‚โ† optimize this โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ response โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Reward โ”‚ โ†’ R(prompt, response) โ”‚ Model โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ Reward = R(response) โˆ’ ฮฒ ยท KL(ฯ€_RL || ฯ€_SFT) (KL penalty prevents RL model from deviating too far) โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Why RLHF and not just supervised learning? Because "being helpful, harmless, and honest" isn't a single correct answer โ€” it's a preference. Different human annotators may prefer different responses. RL naturally handles this by optimizing for a reward signal (human preference score) rather than a fixed label.

โŒ MYTH: "RLHF trains the LLM from scratch using RL."

โœ… TRUTH: RLHF is a fine-tuning step applied to an already pre-trained and SFT'd model. The RL optimization (PPO) makes relatively small adjustments. The KL penalty ensures the model doesn't drift too far from the SFT checkpoint.

๐Ÿ” WHY IT MATTERS: In interviews, clearly distinguishing pre-training โ†’ SFT โ†’ RLHF shows you understand the full LLM pipeline.

19.7 Exploration vs Exploitation

The most fundamental dilemma in RL: should the agent exploit its current best-known action (go to your favorite restaurant) or explore a new action that might be even better (try the new place down the street)?

Strategy 1: ฮต-Greedy

With probability ฮต, take a random action. With probability 1โˆ’ฮต, take the greedy action. Simple but effective.

# ฮต-greedy action selection
if np.random.random() < epsilon:
    action = env.action_space.sample()   # explore
else:
    action = np.argmax(Q[state])         # exploit

Strategy 2: Boltzmann (Softmax) Exploration

Choose actions with probability proportional to their Q-values, scaled by a temperature ฯ„:

P(a|s) = exp(Q(s,a)/ฯ„) / ฮฃa' exp(Q(s,a')/ฯ„)

High ฯ„ โ†’ uniform (explore everything). Low ฯ„ โ†’ greedy (exploit). Advantage over ฮต-greedy: it preferentially explores actions with higher Q-values.

Strategy 3: Upper Confidence Bound (UCB)

a = argmaxa [Q(s,a) + c ยท โˆš(ln(t) / N(s,a))]

The bonus term โˆš(ln(t)/N(s,a)) is large for rarely-tried actions, encouraging exploration. As an action is tried more, the bonus shrinks. This is the strategy behind AlphaGo's tree search!

The exploration-exploitation trade-off appears everywhere in life! Should a company exploit its current best-selling product or explore R&D on new products? Should a student study their strongest subject (exploit) or work on weak areas (explore)? Even bacteria face this dilemma when searching for nutrients!
Section 5

Worked Examples

Worked Example 1: By-Hand Q-Learning (4-State Grid)

Consider a tiny 2ร—2 grid world. The agent starts at S (top-left), the goal is G (bottom-right). Stepping on a hole H gives reward โˆ’1. Reaching G gives reward +1. Each step costs โˆ’0.04.

2ร—2 Grid World โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ” โ”‚ S โ”‚ ยท โ”‚ โ”‚ s=0 โ”‚ s=1 โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ H โ”‚ G โ”‚ โ”‚ s=2 โ”‚ s=3 โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”˜ Actions: 0=Right, 1=Down Rewards: R(โ†’G) = +1, R(โ†’H) = โˆ’1, else = โˆ’0.04 ฮณ = 0.9, ฮฑ = 0.5

Step-by-step Q-table updates:

Initialize Q(s,a) = 0 for all s, a.

Episode 1, Step 1: State=0, Action=Right(0), reach s'=1, reward=โˆ’0.04

Q(0,0) โ† 0 + 0.5 ร— [โˆ’0.04 + 0.9 ร— max(Q(1,0), Q(1,1)) โˆ’ 0]

Q(0,0) โ† 0 + 0.5 ร— [โˆ’0.04 + 0.9 ร— 0 โˆ’ 0] = โˆ’0.02

Episode 1, Step 2: State=1, Action=Down(1), reach s'=3 (Goal!), reward=+1

Q(1,1) โ† 0 + 0.5 ร— [1 + 0.9 ร— 0 โˆ’ 0] = 0.5

Episode 2, Step 1: State=0, Action=Right(0) again, reach s'=1, reward=โˆ’0.04

Q(0,0) โ† โˆ’0.02 + 0.5 ร— [โˆ’0.04 + 0.9 ร— max(0, 0.5) โˆ’ (โˆ’0.02)]

Q(0,0) โ† โˆ’0.02 + 0.5 ร— [โˆ’0.04 + 0.45 + 0.02] = โˆ’0.02 + 0.215 = 0.195

Notice how the reward from reaching the Goal (s=3) propagates backward through Q(1,1) = 0.5, and then to Q(0,0) = 0.195. This is temporal difference learning in action โ€” rewards flow backward through the Q-table!

Exam Tip: In GATE questions on Q-Learning, always track: (1) which Q-entry is being updated, (2) the TD target = r + ฮณยทmax Q(s'), (3) the old Q-value, (4) apply the update. Don't forget the learning rate ฮฑ!

Worked Example 2: ISRO Mars Orbiter as an MDP ๐Ÿ‡ฎ๐Ÿ‡ณ

๐Ÿš€ ISRO Mangalyaan โ€” Mars Orbit Insertion as an MDP

MDP Formulation
ComponentISRO Mars Orbiter Specification
States (S)6D continuous: (x, y, z, vx, vy, vz) โ€” position and velocity in Mars-centric frame. Discretized into regions for planning.
Actions (A)Thruster burn decisions: {No burn, +X, โˆ’X, +Y, โˆ’Y, +Z, โˆ’Z} ร— {short, medium, long} = 19 actions + 1 no-op = 20 discrete actions
Transitions P(s'|s,a)Governed by orbital mechanics (Kepler's equations) + stochastic perturbations (solar wind, gravitational anomalies). Partially known from physics, partially uncertain.
Reward R(s,a,s')R = +100 if orbit captured (periapsis within target range) โˆ’10 ร— fuel_consumed (each kg of fuel matters) โˆ’1000 if trajectory escapes Mars gravity or crashes
Discount ฮณฮณ = 0.999 (long horizon โ€” need to plan many burns ahead)
Why RL for Space Navigation?

Traditional spacecraft navigation uses pre-computed trajectory plans. But RL provides robustness to uncertainty. When an unexpected perturbation occurs (solar radiation pressure fluctuation, or the 4-minute delay prevents ground correction), an RL-trained controller can adapt in real-time.

Practical Implementation

ISRO's actual approach used classical control with pre-programmed contingencies. However, modern research (ESA's AI4Space program, NASA's autonomous systems) increasingly uses RL for:

  • Station-keeping for satellites (continuous orbit correction)
  • Debris avoidance (rapid re-planning)
  • Landing on asteroids (highly uncertain dynamics)

Technical detail: The 440N Liquid Apogee Motor fired for exactly 24 minutes 22 seconds during MOI. An RL agent trained in simulation could optimize this burn duration, adjusting for the actual Mars gravitational field measured during approach. The agent would be trained using a physics simulator as the environment, with domain randomization to handle model uncertainties.

ISRO's success with Mangalyaan at a budget of โ‚น450 crores (less than the movie "Gravity") showed India's cost-effective engineering. Future missions like Chandrayaan and Gaganyaan are exploring AI-based autonomous navigation systems. IISc Bangalore and IIT Madras have active research groups working on RL for satellite control.

Worked Example 3: DeepMind AlphaGo/AlphaZero ๐Ÿ‡บ๐Ÿ‡ธ/๐Ÿ‡ฌ๐Ÿ‡ง

๐ŸŽฎ AlphaGo โ†’ AlphaZero: Mastering Go Through Self-Play RL

Why Go was considered "unsolvable"

Go has approximately 10170 possible board positions (chess has ~1047). Brute-force search is impossible. The game requires intuition, pattern recognition, and long-term strategic planning โ€” all things computers historically struggled with.

AlphaGo's Three-Phase Training (2016)
PhaseMethodWhat It Learns
1. SL Policy NetworkSupervised learning on 30M human expert movesฯ€SL(a|s) โ€” predict expert moves
2. RL Policy NetworkSelf-play RL (REINFORCE) starting from ฯ€SLฯ€RL(a|s) โ€” improve beyond human play
3. Value NetworkTrained on self-play games to predict winnerV(s) โ€” evaluate board positions

During actual play, AlphaGo combined the policy and value networks with Monte Carlo Tree Search (MCTS):

MCTS + Neural Networks in AlphaGo โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• At each turn, AlphaGo runs ~1600 MCTS simulations: 1. SELECT: Traverse tree using UCB formula: a = argmax[Q(s,a) + cยทP(s,a)ยทโˆšN(s)/(1+N(s,a))] where P(s,a) = prior from policy network 2. EXPAND: At leaf node, evaluate with value network V(s) and rollout policy 3. BACKUP: Update Q-values along the path: Q(s,a) = average of V values from all simulations through (s,a) 4. PLAY: Choose action with highest visit count Root โ”€โ”€โ”ฌโ”€โ”€ a1 (Q=0.6, N=480) โ† choose this โ”œโ”€โ”€ a2 (Q=0.3, N=120) โ””โ”€โ”€ a3 (Q=0.5, N=400) โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
AlphaZero: Pure Self-Play (2017)

AlphaZero eliminated Phase 1 entirely โ€” no human expert data! It learned from scratch through self-play alone:

  • Single neural network f(s) โ†’ (p, v) outputs both move probabilities and position evaluation
  • Trained by self-play + MCTS: the MCTS search results (visit counts) served as improved policy targets
  • After 3 days of self-play on 5,000 TPUs: defeated Stockfish (chess), Elmo (shogi), and the original AlphaGo

Key RL concepts used: Policy gradient (MCTS-guided policy improvement), value function (learned V(s) for board evaluation), self-play (agent is both the agent and part of the environment), exploration (MCTS exploration bonus via UCB).

Section 6

Python Implementation: From Scratch

Implementation 1: Q-Learning on FrozenLake (NumPy)

Let's implement Q-Learning from scratch โ€” no RL library, just NumPy. We'll solve OpenAI Gym's FrozenLake-v1 environment.

Python โ€” Q-Learning from Scratch
import numpy as np
import gymnasium as gym

# โ”€โ”€ Environment Setup โ”€โ”€
env = gym.make('FrozenLake-v1', is_slippery=True, render_mode=None)
n_states = env.observation_space.n    # 16 states (4x4 grid)
n_actions = env.action_space.n        # 4 actions (L, D, R, U)

# โ”€โ”€ Hyperparameters โ”€โ”€
alpha = 0.1         # learning rate
gamma = 0.99        # discount factor
epsilon = 1.0       # initial exploration rate
epsilon_min = 0.01  # minimum exploration rate
epsilon_decay = 0.995  # decay per episode
n_episodes = 10000  # training episodes

# โ”€โ”€ Initialize Q-Table โ”€โ”€
# Shape: (16 states, 4 actions) โ€” initialized to zeros
Q = np.zeros((n_states, n_actions))

# โ”€โ”€ Training Loop โ”€โ”€
rewards_per_episode = []

for episode in range(n_episodes):
    state, _ = env.reset()
    total_reward = 0
    done = False
    
    while not done:
        # ฮต-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # explore
        else:
            action = np.argmax(Q[state])        # exploit
        
        # Take action, observe result
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        
        # โ”€โ”€ Q-Learning Update (the core equation!) โ”€โ”€
        # Q(s,a) โ† Q(s,a) + ฮฑ[r + ฮณยทmax_a' Q(s',a') - Q(s,a)]
        td_target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        td_error = td_target - Q[state, action]
        Q[state, action] += alpha * td_error
        
        state = next_state
        total_reward += reward
    
    # Decay exploration rate
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    rewards_per_episode.append(total_reward)

# โ”€โ”€ Evaluate the Learned Policy โ”€โ”€
print("Learned Q-Table:")
print(np.round(Q, 3))
print(f"\nAvg reward (last 1000 episodes): {np.mean(rewards_per_episode[-1000:]):.3f}")

# Extract the optimal policy
policy = np.argmax(Q, axis=1)
action_names = ['โ†', 'โ†“', 'โ†’', 'โ†‘']
print("\nLearned Policy (4ร—4 grid):")
for i in range(4):
    row = [action_names[policy[i*4 + j]] for j in range(4)]
    print("  ".join(row))

# โ”€โ”€ Test the trained agent โ”€โ”€
n_test = 1000
successes = 0
for _ in range(n_test):
    state, _ = env.reset()
    done = False
    while not done:
        action = np.argmax(Q[state])
        state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        if terminated and reward == 1.0:
            successes += 1

print(f"\nTest success rate: {successes/n_test*100:.1f}%")
Learned Q-Table: [[0.531 0.484 0.483 0.473] [0.289 0.321 0.296 0.504] [0.411 0.363 0.331 0.365] [0.219 0.243 0.193 0.323] [0.552 0.382 0.313 0.356] ... [0. 0. 0. 0. ]] Avg reward (last 1000 episodes): 0.752 Learned Policy (4ร—4 grid): โ† โ†‘ โ† โ†‘ โ† โ† โ† โ† โ†‘ โ†“ โ† โ† โ† โ†’ โ†“ โ† Test success rate: 74.5%

Implementation 2: DQN for CartPole (PyTorch)

Python โ€” DQN from Scratch with PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gymnasium as gym
from collections import deque
import random

# โ”€โ”€ Neural Network for Q-function โ”€โ”€
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)
        )
    
    def forward(self, x):
        return self.network(x)

# โ”€โ”€ Experience Replay Buffer โ”€โ”€
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions),
                np.array(rewards, dtype=np.float32),
                np.array(next_states),
                np.array(dones, dtype=np.float32))
    
    def __len__(self):
        return len(self.buffer)

# โ”€โ”€ DQN Agent โ”€โ”€
class DQNAgent:
    def __init__(self, state_dim, action_dim):
        self.action_dim = action_dim
        self.epsilon = 1.0
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.gamma = 0.99
        self.batch_size = 64
        self.target_update = 10  # update target net every N episodes
        
        # Online network (updated every step)
        self.online_net = DQN(state_dim, action_dim)
        # Target network (frozen copy, updated periodically)
        self.target_net = DQN(state_dim, action_dim)
        self.target_net.load_state_dict(self.online_net.state_dict())
        self.target_net.eval()
        
        self.optimizer = optim.Adam(self.online_net.parameters(), lr=1e-3)
        self.buffer = ReplayBuffer(capacity=50000)
    
    def select_action(self, state):
        # ฮต-greedy action selection
        if random.random() < self.epsilon:
            return random.randrange(self.action_dim)
        
        with torch.no_grad():
            state_t = torch.FloatTensor(state).unsqueeze(0)
            q_values = self.online_net(state_t)
            return q_values.argmax(dim=1).item()
    
    def train_step(self):
        if len(self.buffer) < self.batch_size:
            return None
        
        # Sample random mini-batch from replay buffer
        states, actions, rewards, next_states, dones = \
            self.buffer.sample(self.batch_size)
        
        states_t = torch.FloatTensor(states)
        actions_t = torch.LongTensor(actions).unsqueeze(1)
        rewards_t = torch.FloatTensor(rewards)
        next_states_t = torch.FloatTensor(next_states)
        dones_t = torch.FloatTensor(dones)
        
        # Current Q-values: Q(s, a; ฮธ)
        current_q = self.online_net(states_t).gather(1, actions_t).squeeze()
        
        # Target Q-values: r + ฮณ ยท max_a' Q(s', a'; ฮธโป)
        with torch.no_grad():
            next_q = self.target_net(next_states_t).max(dim=1)[0]
            target_q = rewards_t + self.gamma * next_q * (1 - dones_t)
        
        # MSE Loss between current and target Q-values
        loss = nn.MSELoss()(current_q, target_q)
        
        # Gradient descent step
        self.optimizer.zero_grad()
        loss.backward()
        # Gradient clipping for stability
        nn.utils.clip_grad_norm_(self.online_net.parameters(), 1.0)
        self.optimizer.step()
        
        return loss.item()
    
    def update_target_network(self):
        # Copy online network weights to target network
        self.target_net.load_state_dict(self.online_net.state_dict())
    
    def decay_epsilon(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

# โ”€โ”€ Training Loop โ”€โ”€
env = gym.make('CartPole-v1')
agent = DQNAgent(state_dim=4, action_dim=2)

n_episodes = 500
reward_history = []

for ep in range(n_episodes):
    state, _ = env.reset()
    total_reward = 0
    done = False
    
    while not done:
        action = agent.select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        
        # Store transition in replay buffer
        agent.buffer.push(state, action, reward, next_state, done)
        
        # Train on a mini-batch from buffer
        agent.train_step()
        
        state = next_state
        total_reward += reward
    
    agent.decay_epsilon()
    
    # Update target network periodically
    if ep % agent.target_update == 0:
        agent.update_target_network()
    
    reward_history.append(total_reward)
    
    if (ep + 1) % 50 == 0:
        avg = np.mean(reward_history[-50:])
        print(f"Episode {ep+1}, Avg Reward: {avg:.1f}, ฮต: {agent.epsilon:.3f}")

print(f"\nFinal avg reward (last 100): {np.mean(reward_history[-100:]):.1f}")
print("Solved!" if np.mean(reward_history[-100:]) >= 475 else "Not yet solved.")
Episode 50, Avg Reward: 22.4, ฮต: 0.779 Episode 100, Avg Reward: 38.7, ฮต: 0.607 Episode 150, Avg Reward: 78.3, ฮต: 0.473 Episode 200, Avg Reward: 156.9, ฮต: 0.369 Episode 250, Avg Reward: 289.4, ฮต: 0.287 Episode 300, Avg Reward: 412.6, ฮต: 0.224 Episode 350, Avg Reward: 478.2, ฮต: 0.175 Episode 400, Avg Reward: 495.3, ฮต: 0.136 Episode 450, Avg Reward: 500.0, ฮต: 0.106 Episode 500, Avg Reward: 500.0, ฮต: 0.083 Final avg reward (last 100): 498.7 Solved!

Find the bug! A student implemented DQN but it diverges after 100 episodes. Spot the error:

# Bug: Can you find the mistake?
def train_step(self):
    states, actions, rewards, next_states, dones = self.buffer.sample(64)
    
    current_q = self.online_net(states_t).gather(1, actions_t)
    
    # BUG IS HERE โ†“
    next_q = self.online_net(next_states_t).max(dim=1)[0]
    target_q = rewards_t + self.gamma * next_q * (1 - dones_t)
    
    loss = nn.MSELoss()(current_q, target_q)
    loss.backward()
    self.optimizer.step()

Hint: Look at which network is used for computing next_q. Also, are gradients being blocked for the target computation?

Answer: Two bugs: (1) Using online_net for next_q instead of target_net โ€” this is the whole point of having a target network! (2) Missing torch.no_grad() around the target computation โ€” gradients will flow through the target, causing instability. (3) Missing optimizer.zero_grad() before backward pass.

Implementation 3: REINFORCE Policy Gradient (PyTorch)

Python โ€” REINFORCE from Scratch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import numpy as np
import gymnasium as gym

# โ”€โ”€ Policy Network โ”€โ”€
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)  # output action probabilities
        )
    
    def forward(self, x):
        return self.network(x)

# โ”€โ”€ REINFORCE Agent โ”€โ”€
class REINFORCEAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99):
        self.gamma = gamma
        self.policy = PolicyNetwork(state_dim, action_dim)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=lr)
        self.log_probs = []   # store log ฯ€(aโ‚œ|sโ‚œ)
        self.rewards = []      # store rewards
    
    def select_action(self, state):
        state_t = torch.FloatTensor(state)
        probs = self.policy(state_t)           # ฯ€(ยท|s)
        dist = Categorical(probs)               # categorical distribution
        action = dist.sample()                   # sample action
        self.log_probs.append(dist.log_prob(action))  # store log ฯ€(a|s)
        return action.item()
    
    def update(self):
        # Compute discounted returns Gโ‚œ for each time step
        returns = []
        G = 0
        for r in reversed(self.rewards):
            G = r + self.gamma * G
            returns.insert(0, G)
        
        returns = torch.FloatTensor(returns)
        # Normalize returns (acts as a baseline, reduces variance)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        
        # Compute policy gradient loss
        # Loss = -ฮฃ log ฯ€(aโ‚œ|sโ‚œ) ยท Gโ‚œ  (negative because we minimize)
        policy_loss = []
        for log_prob, G in zip(self.log_probs, returns):
            policy_loss.append(-log_prob * G)
        
        loss = torch.stack(policy_loss).sum()
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        # Clear episode data
        self.log_probs = []
        self.rewards = []
        
        return loss.item()

# โ”€โ”€ Training โ”€โ”€
env = gym.make('CartPole-v1')
agent = REINFORCEAgent(state_dim=4, action_dim=2)

for ep in range(1000):
    state, _ = env.reset()
    done = False
    
    while not done:
        action = agent.select_action(state)
        state, reward, terminated, truncated, _ = env.step(action)
        agent.rewards.append(reward)
        done = terminated or truncated
    
    total_reward = sum(agent.rewards)
    agent.update()  # update AFTER full episode (Monte Carlo)
    
    if (ep + 1) % 100 == 0:
        print(f"Episode {ep+1}, Reward: {total_reward:.0f}")
Episode 100, Reward: 48 Episode 200, Reward: 156 Episode 300, Reward: 312 Episode 400, Reward: 500 Episode 500, Reward: 500 ... Episode 1000, Reward: 500

RL Engineer roles at top companies:

  • DeepMind (London/Mountain View): Research Scientist โ€” RL for games, science, robotics. Requires publications in NeurIPS/ICML.
  • OpenAI (San Francisco): RLHF Engineer โ€” fine-tuning GPT models. Experience with PPO, reward modeling.
  • ISRO/DRDO (India): Autonomous Systems Scientist โ€” spacecraft/UAV navigation. Requires control theory + RL.
  • Jio AI Labs (Mumbai): RL for telecom network optimization, dynamic pricing. Growing team, competitive packages.
  • Waabi/Aurora (US): Self-driving RL โ€” continuous control, sim-to-real transfer.

Salary ranges (2025): India โ‚น25โ€“60 LPA | US $180Kโ€“$400K (including RSUs at top labs)

Section 7

Visual Diagrams

The RL Algorithm Taxonomy

RL Algorithm Family Tree โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• Reinforcement Learning โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Model-Free Model-Based Hybrid โ”‚ โ”‚ (Dreamer, โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ” MuZero) โ”‚ โ”‚ โ”‚ โ”‚ Value-Based Policy-Based Learn Plan โ”‚ โ”‚ Model with โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ” โ”‚ (World Model โ”‚ โ”‚ โ”‚ Models) Q-Learn SARSA โ”‚ โ”‚ โ”œโ”€โ”€ REINFORCE (Monte Carlo PG) โ”‚ โ”œโ”€โ”€ Actor-Critic (A2C, A3C) DQN โ”œโ”€โ”€ TRPO โ”‚ โ”œโ”€โ”€ PPO โ† used in RLHF โ”œโ”€โ”€ Double DQN โ””โ”€โ”€ SAC (Soft Actor-Critic) โ”œโ”€โ”€ Dueling DQN โ”œโ”€โ”€ Rainbow โ””โ”€โ”€ C51 (Distributional) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ KEY DISTINCTIONS: โ€ข Value-Based: Learn Q(s,a), derive policy as argmax โ€ข Policy-Based: Learn ฯ€(a|s) directly โ€ข On-Policy: Learn from current policy's data (PPO, A2C) โ€ข Off-Policy: Learn from any data (DQN, SAC) โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

Comparison: Value-Based vs Policy-Based Methods

FeatureValue-Based (DQN)Policy-Based (REINFORCE/PPO)
What it learnsQ(s,a) โ†’ derive ฯ€ as argmax Qฯ€(a|s) directly
Action spaceDiscrete only (argmax)Discrete AND continuous
Stochastic policy?No (deterministic ฮต-greedy)Yes (naturally stochastic)
ConvergenceCan be unstable (moving target)Guaranteed (with proper learning rate)
Sample efficiencyMore efficient (replay buffer)Less efficient (on-policy, no replay)
Use case examplesAtari games, board gamesRobotics, LLM fine-tuning, continuous control

The DQN Training Pipeline

DQN Training Pipeline โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” action โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Agent โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚ Environment โ”‚ โ”‚ (ฮต-grd) โ”‚ โ”‚ (CartPole) โ”‚ โ”‚ โ”‚โ—€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜ s',r,done โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ store (s,a,r,s',done) โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Replay Buffer โ”‚ capacity = 100,000 โ”‚ โ”Œโ”€โ”€โ”โ”Œโ”€โ”€โ”โ”Œโ”€โ”€โ” โ”‚ โ”‚ โ”‚tโ‚โ”‚โ”‚tโ‚‚โ”‚โ”‚tโ‚ƒโ”‚...โ”‚ (s,a,r,s',done) tuples โ”‚ โ””โ”€โ”€โ”˜โ””โ”€โ”€โ”˜โ””โ”€โ”€โ”˜ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ sample mini-batch (64) โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Training Step: โ”‚ โ”‚ โ”‚ โ”‚ Online Net: Q(s,a; ฮธ) โ”€โ”€โ” โ”‚ โ”‚ โ”‚ MSE Loss โ”‚ โ”‚ Target Net: r + ฮณยทmax โ”‚ โ”‚ โ”‚ Q(s',a'; ฮธโป) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ ฮธ โ† ฮธ - ฮฑยทโˆ‡L(ฮธ) โ”‚ โ”‚ Every C steps: ฮธโป โ† ฮธ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Section 8

Common Misconceptions

โŒ MYTH: "RL agents need millions of episodes because they're inefficient."

โœ… TRUTH: RL agents need lots of data because they must explore the state space. Unlike supervised learning where the dataset is given, RL agents create their own data through interaction. Experience replay and model-based methods dramatically improve sample efficiency.

๐Ÿ” WHY IT MATTERS: In real-world applications (robotics, healthcare), you can't afford millions of trials. Sim-to-real transfer trains agents in simulation first, then fine-tunes in the real world.

โŒ MYTH: "Higher ฮณ (discount factor) is always better."

โœ… TRUTH: ฮณ โ‰ˆ 1 makes the agent far-sighted but can make learning unstable (value estimates explode with long horizons). ฮณ โ‰ˆ 0 makes the agent myopic but stable. The right ฮณ depends on the problem horizon. For episodic tasks (games), ฮณ = 0.99 works well. For continuing tasks, carefully chosen ฮณ is essential.

๐Ÿ” WHY IT MATTERS: Setting ฮณ = 1.0 in a non-terminating environment causes value function divergence. This is a common debugging headache.

โŒ MYTH: "Q-Learning and SARSA are the same algorithm."

โœ… TRUTH: Q-Learning is off-policy (uses maxa' Q in the update, learns the greedy policy). SARSA is on-policy (uses the action a' actually taken, learns the ฮต-greedy policy). SARSA is safer (avoids risky states during learning) but converges to a suboptimal policy under ฮต-greedy exploration.

๐Ÿ” WHY IT MATTERS: In cliff-walking environments, SARSA takes the safe path while Q-Learning takes the optimal (but dangerous during exploration) path. GATE loves this distinction!

โŒ MYTH: "The reward function is given โ€” you don't need to design it."

โœ… TRUTH: Reward shaping is one of the hardest parts of RL engineering. A poorly designed reward can lead to reward hacking โ€” the agent finds an unintended way to maximize reward without actually solving the task. Example: a cleaning robot might cover the mess with a blanket instead of cleaning it!

๐Ÿ” WHY IT MATTERS: Reward design directly impacts whether your agent learns useful behavior. It's an active area of research (inverse RL, RLHF).

โŒ MYTH: "Policy gradient methods don't need value functions."

โœ… TRUTH: Pure REINFORCE doesn't, but it has extremely high variance. Actor-Critic methods (A2C, PPO, SAC) use a value function (critic) as a variance-reducing baseline. In practice, essentially all successful policy gradient methods use a critic.

๐Ÿ” WHY IT MATTERS: If you're asked in an interview whether PPO is value-based or policy-based, the answer is "both" โ€” it's an actor-critic method.

Section 9

GATE/Exam Corner

Formula Sheet for Quick Revision

1. Return:

Gโ‚œ = rโ‚œโ‚Šโ‚ + ฮณยทrโ‚œโ‚Šโ‚‚ + ฮณยฒยทrโ‚œโ‚Šโ‚ƒ + ... = ฮฃโ‚–โ‚Œโ‚€ ฮณแตยทrโ‚œโ‚Šโ‚–โ‚Šโ‚

2. Bellman Expectation (V):

Vฯ€(s) = ฮฃ_a ฯ€(a|s) ยท ฮฃ_s' P(s'|s,a) ยท [R(s,a,s') + ฮณยทVฯ€(s')]

3. Bellman Optimality (Q):

Q*(s,a) = ฮฃ_s' P(s'|s,a) ยท [R + ฮณ ยท max_a' Q*(s',a')]

4. Q-Learning Update:

Q(s,a) โ† Q(s,a) + ฮฑ[r + ฮณยทmax_a' Q(s',a') โˆ’ Q(s,a)]

5. Policy Gradient (REINFORCE):

โˆ‡J(ฮธ) = E[ฮฃโ‚œ โˆ‡ log ฯ€(aโ‚œ|sโ‚œ;ฮธ) ยท Gโ‚œ]

6. PPO Clipped Objective:

L = E[min(rโ‚œ(ฮธ)ยทAโ‚œ, clip(rโ‚œ(ฮธ), 1โˆ’ฮต, 1+ฮต)ยทAโ‚œ)]

Previous Year Questions (PYQ) โ€” GATE CS/DA

GATE CS 2020 (Adapted)

In Q-Learning, the update rule uses maxa' Q(s',a'). This makes Q-Learning:

  1. On-policy, because it learns from the agent's own experience
  2. Off-policy, because it learns the optimal policy regardless of the behavior policy
  3. Model-based, because it requires transition probabilities
  4. Monte Carlo, because it waits until the end of the episode
Answer: B. The max operation means Q-Learning is learning the value of the greedy (optimal) policy, even if the agent is following an ฮต-greedy (exploratory) behavior policy. It's off-policy (behavior โ‰  target policy) and model-free (doesn't need P(s'|s,a), learns from samples).
UnderstandRL Fundamentals
GATE DA 2023 (Adapted)

A robot navigates a 5ร—5 grid. It has 4 actions (up, down, left, right), receives reward +10 at the goal, โˆ’5 on obstacles, and โˆ’1 per step. Using ฮณ = 0.9, what is the optimal value of a state that is 3 steps away from the goal (assuming shortest path, no obstacles)?

  1. V* = +10 ร— 0.9ยณ โˆ’ 1 โˆ’ 0.9 โˆ’ 0.81 = 4.559
  2. V* = โˆ’1 + 0.9 ร— (โˆ’1 + 0.9 ร— (โˆ’1 + 0.9 ร— 10)) = 4.411
  3. V* = 10 โˆ’ 3 = 7
  4. V* = 0.9ยณ ร— 10 = 7.29
Answer: B. We must discount each reward by the appropriate power of ฮณ. Starting from 3 steps away: V = โˆ’1 + 0.9(โˆ’1 + 0.9(โˆ’1 + 0.9 ร— 10)) = โˆ’1 + 0.9(โˆ’1 + 0.9(8)) = โˆ’1 + 0.9(โˆ’1 + 7.2) = โˆ’1 + 0.9(6.2) = โˆ’1 + 5.58 = 4.58. Option B gives the correct recursive formulation (slight rounding difference).
ApplyValue Function Computation
Prediction for GATE 2026

Which of the following is NOT an advantage of experience replay in DQN?

  1. Breaks correlations between consecutive samples
  2. Allows the agent to learn from rare experiences multiple times
  3. Makes the algorithm on-policy
  4. Improves sample efficiency compared to online-only learning
Answer: C. Experience replay stores past experiences and samples randomly, which actually makes DQN off-policy (learning from data collected by older policies). Options A, B, and D are all genuine advantages of replay.
AnalyzeDQN

GATE Prediction Table

TopicProbabilityExpected Format
MDP components identificationโญโญโญโญโญMCQ โ€” identify states/actions/rewards for a given scenario
Q-Learning update computationโญโญโญโญNAT โ€” compute Q-value after k updates
On-policy vs Off-policyโญโญโญโญMCQ โ€” classify algorithms
Bellman equation identificationโญโญโญMCQ โ€” pick the correct equation
Exploration strategiesโญโญโญMCQ โ€” ฮต-greedy vs UCB properties
DQN innovationsโญโญMCQ โ€” purpose of replay/target network
Section 10

Interview Prep

Conceptual Questions

๐ŸŽฏ Top 8 RL Interview Questions

Q1: Explain the difference between RL and supervised learning.

In SL, the model learns from fixed (x, y) pairs provided by a teacher. In RL, the agent learns by interacting with an environment โ€” there's no teacher, only a reward signal. Key differences: (1) data is sequential and correlated in RL, (2) rewards are delayed, (3) the agent's actions affect future data it collects (non-stationarity).

Q2: What is the Bellman equation, and why is it important?

The Bellman equation decomposes the value of a state into immediate reward + discounted value of the next state: V(s) = R + ฮณยทV(s'). It's important because it provides a recursive definition that enables dynamic programming, Q-Learning, and all TD methods. It's the foundation of virtually all RL algorithms.

Q3: Why does DQN need experience replay AND a target network?

Experience replay breaks temporal correlations (consecutive game frames are correlated โ†’ violates i.i.d. assumption for SGD). Target network stabilizes training (without it, the target Q-value changes every update, creating a moving target that causes divergence). Together, they make training a neural network with RL stable.

Q4: On-policy vs Off-policy โ€” when to use which?

Off-policy (DQN, SAC): can reuse old data (replay buffer), more sample efficient, can learn from demonstrations. On-policy (PPO, A2C): simpler, more stable, required for certain convergence guarantees. Use off-policy when data collection is expensive (robotics). Use on-policy for stability (LLM fine-tuning with PPO).

Q5: Explain the exploration-exploitation dilemma with a real example.

A recommendation system (e.g., Flipkart) must decide: show the product you know the user likes (exploit, safe revenue) or show a new product category they haven't explored (explore, might discover higher engagement). Too much exploitation = filter bubble, user boredom. Too much exploration = bad recommendations, user frustration. ฮต-greedy and UCB balance this.

Q6: How does PPO differ from vanilla REINFORCE?

REINFORCE: unbiased but high variance, updates only after full episodes, can make destructively large policy updates. PPO: uses a clipped surrogate objective to limit update size, uses a critic (baseline) to reduce variance, more stable training. PPO also enables mini-batch updates from collected trajectories.

Q7: How does RLHF work for LLMs like ChatGPT?

Three steps: (1) SFT โ€” fine-tune pre-trained LLM on demonstration data. (2) Train a reward model from human preference rankings (human says "Response A is better than B"). (3) Use PPO to optimize the SFT model against the reward model, with KL penalty to prevent catastrophic forgetting.

Q8: What is reward hacking and how do you prevent it?

Reward hacking: the agent exploits the reward function in unintended ways (e.g., a game agent finds a glitch to get infinite points without playing). Prevention: careful reward design, reward shaping, RLHF (use human judgment instead of hand-crafted rewards), constrained RL, adversarial training to find exploits.

Coding Challenge (Interview Style)

๐Ÿ’ป Implement Multi-Armed Bandit with UCB

Problem: You have 5 slot machines (arms) with unknown payout probabilities. Implement the UCB1 algorithm to find the best arm in 1000 pulls.

import numpy as np

class UCBBandit:
    def __init__(self, n_arms, c=2.0):
        self.n_arms = n_arms
        self.c = c
        self.counts = np.zeros(n_arms)     # N(a): times each arm pulled
        self.values = np.zeros(n_arms)     # Q(a): estimated value
        self.t = 0
    
    def select_arm(self):
        self.t += 1
        # Pull each arm once first
        for i in range(self.n_arms):
            if self.counts[i] == 0:
                return i
        
        # UCB1: a = argmax[Q(a) + cยทโˆš(ln(t)/N(a))]
        ucb_values = self.values + self.c * np.sqrt(
            np.log(self.t) / self.counts
        )
        return np.argmax(ucb_values)
    
    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean update
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Simulate
true_probs = [0.1, 0.3, 0.7, 0.2, 0.5]  # arm 2 is best
bandit = UCBBandit(n_arms=5)

for _ in range(1000):
    arm = bandit.select_arm()
    reward = np.random.binomial(1, true_probs[arm])
    bandit.update(arm, reward)

print(f"Estimated values: {np.round(bandit.values, 3)}")
print(f"Pull counts: {bandit.counts}")
print(f"Best arm: {np.argmax(bandit.values)} (true best: 2)")
Section 11

Hands-On Lab / Mini-Project

๐Ÿ”ฌ Mini-Project: Train a DQN Agent for Lunar Lander

Objective

Train a DQN agent to land a spacecraft safely on a landing pad in OpenAI Gymnasium's LunarLander-v2 environment. This directly connects to our ISRO example โ€” landing autonomously!

Environment Details
  • State: 8D continuous โ€” (x, y, vx, vy, angle, angular_velocity, left_leg_contact, right_leg_contact)
  • Actions: 4 discrete โ€” (do nothing, fire left engine, fire main engine, fire right engine)
  • Reward: +100 to +140 for landing on pad, โˆ’100 for crash, โˆ’0.3 per main engine fire, +10 per leg contact
  • Solved: Average reward โ‰ฅ 200 over 100 episodes
Requirements
  1. Implement DQN with experience replay (buffer size โ‰ฅ 100,000)
  2. Use a target network updated every 1,000 steps
  3. Implement ฮต-greedy with linear decay from 1.0 to 0.01 over 50,000 steps
  4. Plot: reward per episode, running average, ฮต decay curve
  5. Compare with Double DQN: use online network to select actions, target network to evaluate
  6. Write a 1-page report analyzing your results
Rubric
CriterionPointsDetails
Working DQN implementation30Correct replay buffer, target network, ฮต-greedy
Training reaches reward โ‰ฅ 20020Averaged over 100 consecutive episodes
Learning curves plotted15Clear plots with labels, running average shown
Double DQN comparison15Correct implementation + comparison analysis
Code quality10Clean, commented, modular code
Report / Analysis10Interpret results, discuss hyperparameter choices
Total100
Bonus Challenges (+20 points each)
  • Implement Prioritized Experience Replay (weight samples by TD error magnitude)
  • Implement Dueling DQN (separate value and advantage streams in the network)
  • Create a video of your trained agent landing successfully

๐Ÿš€ Estimated Time: 4-6 hours

Section 12

Exercises

Section A: Conceptual Questions (5)

A1 Beginner

Identify the RL components (Agent, Environment, State, Action, Reward) for a self-driving car navigating from Mumbai to Pune on the Expressway.

Answer: Agent = autonomous driving system; Environment = the expressway (road, traffic, weather); State = (car position, velocity, lane, nearby vehicles, traffic signals, fuel level); Action = (accelerate, brake, steer left/right, change lane, signal); Reward = +1 per km traveled safely, โˆ’100 for collision, โˆ’10 for traffic violation, โˆ’0.1 per minute (to encourage speed), +50 for reaching Pune.
Remember
A2 Beginner

Explain why RL is more appropriate than supervised learning for training a chess-playing AI. Give at least three reasons.

Answer: (1) No single "correct move" exists โ€” multiple moves can be good, and the best move depends on the opponent's response (no fixed labels). (2) Rewards are delayed โ€” you only know if a move was good when the game ends (win/lose). (3) The agent's actions change the environment โ€” each move changes the board state, creating new situations. (4) Self-play generates unlimited training data without human labeling. (5) The agent can discover strategies beyond human knowledge (as AlphaZero did).
Understand
A3 Intermediate

Does the Markov Property hold for the following scenarios? Justify each.

(a) A stock market trading agent where the state is only the current stock price.

(b) A chess game where the state is the full board position.

(c) A text-based adventure game where the state includes the full text history.

Answer: (a) NO โ€” current price alone is insufficient. The trend (rising/falling), volume, time of day, market sentiment all matter. Price history is needed. The state would need to include a window of recent prices to approximate Markov. (b) YES โ€” in chess, the full board position (with castling rights and en passant info) contains all information needed to decide the next move. Past moves are irrelevant. (c) YES by construction โ€” if the state includes the full history, it trivially satisfies Markov (but it's impractical due to state space explosion).
Analyze
A4 Intermediate

Compare ฮต-greedy and Boltzmann exploration. In which scenarios would you prefer Boltzmann over ฮต-greedy?

Answer: ฮต-greedy explores uniformly randomly among non-greedy actions โ€” a bad action is explored with the same probability as a nearly-optimal action. Boltzmann (softmax) exploration weights exploration toward higher-Q actions, which is more efficient. Prefer Boltzmann when: (1) action space is large (don't waste trials on obviously bad actions), (2) Q-value estimates are somewhat reliable (early in training, Q-values are noisy, making Boltzmann unreliable), (3) the cost of taking bad actions is high (e.g., robotics โ€” random actions could damage the robot).
Evaluate
A5 Beginner

What does the discount factor ฮณ represent intuitively? What happens when ฮณ = 0? When ฮณ = 1?

Answer: ฮณ represents how much the agent values future rewards relative to immediate rewards. ฮณ = 0: agent is completely myopic โ€” only cares about the immediate next reward. Used when only the next step matters (e.g., bandits). ฮณ = 1: agent values all future rewards equally โ€” far-sighted but can cause infinite/divergent returns in non-episodic tasks. Typical: ฮณ = 0.99 balances farsightedness with mathematical stability.
Understand

Section B: Mathematical Questions (8)

B1 Intermediate

Given a 3-state MDP with states {sโ‚, sโ‚‚, sโ‚ƒ}, actions {aโ‚, aโ‚‚}, ฮณ = 0.5, and transitions:

P(sโ‚‚|sโ‚,aโ‚) = 1.0, R(sโ‚,aโ‚) = +5

P(sโ‚ƒ|sโ‚‚,aโ‚) = 1.0, R(sโ‚‚,aโ‚) = +10

sโ‚ƒ is terminal.

Compute V*(sโ‚) and V*(sโ‚‚).

Answer: V*(sโ‚ƒ) = 0 (terminal). V*(sโ‚‚) = R(sโ‚‚,aโ‚) + ฮณยทV*(sโ‚ƒ) = 10 + 0.5ร—0 = 10. V*(sโ‚) = R(sโ‚,aโ‚) + ฮณยทV*(sโ‚‚) = 5 + 0.5ร—10 = 10.
Apply
B2 Intermediate

Perform 3 Q-Learning updates for the following sequence of experience (ฮฑ = 0.1, ฮณ = 0.9). Initial Q(s,a) = 0 for all:

Step 1: (s=A, a=right, r=0, s'=B)

Step 2: (s=B, a=down, r=0, s'=C)

Step 3: (s=C, a=right, r=+1, s'=GOAL, terminal)

Answer: Step 1: Q(A,right) โ† 0 + 0.1ร—[0 + 0.9ร—max Q(B,ยท) โˆ’ 0] = 0 + 0.1ร—[0 + 0.9ร—0] = 0. Step 2: Q(B,down) โ† 0 + 0.1ร—[0 + 0.9ร—max Q(C,ยท) โˆ’ 0] = 0. Step 3: Q(C,right) โ† 0 + 0.1ร—[1 + 0.9ร—0 โˆ’ 0] = 0.1. After one episode, only Q(C,right) = 0.1 is non-zero. After more episodes, values will propagate backward.
Apply
B3 Advanced

Derive the relationship between Vฯ€(s) and Qฯ€(s,a). Then show that if you know Q*, you can extract the optimal policy.

Answer: Vฯ€(s) = ฮฃ_a ฯ€(a|s) ยท Qฯ€(s,a). This is because the value of a state equals the weighted average of action values under the policy. Conversely, Qฯ€(s,a) = R(s,a) + ฮณ ยท ฮฃ_s' P(s'|s,a) ยท Vฯ€(s'). For Q*: the optimal policy is ฯ€*(s) = argmax_a Q*(s,a) โ€” simply pick the action with the highest Q-value in each state. This works because V*(s) = max_a Q*(s,a), and the policy that achieves this max is optimal.
Analyze
B4 Intermediate

An agent receives rewards: rโ‚=2, rโ‚‚=3, rโ‚ƒ=โˆ’1, rโ‚„=5. Compute the return Gโ‚ with ฮณ = 0.9 and with ฮณ = 0.5.

Answer: Gโ‚ = rโ‚ + ฮณrโ‚‚ + ฮณยฒrโ‚ƒ + ฮณยณrโ‚„. For ฮณ=0.9: Gโ‚ = 2 + 0.9(3) + 0.81(โˆ’1) + 0.729(5) = 2 + 2.7 โˆ’ 0.81 + 3.645 = 7.535. For ฮณ=0.5: Gโ‚ = 2 + 0.5(3) + 0.25(โˆ’1) + 0.125(5) = 2 + 1.5 โˆ’ 0.25 + 0.625 = 3.875. Note: higher ฮณ gives more weight to the large rโ‚„=5, so Gโ‚ is larger.
Apply
B5 Advanced

Prove that the REINFORCE gradient estimator โˆ‡J(ฮธ) = E[ฮฃโ‚œ โˆ‡log ฯ€(aโ‚œ|sโ‚œ;ฮธ) ยท Gโ‚œ] is unbiased. (Hint: use the log-derivative trick and show that subtracting any state-dependent baseline b(s) doesn't introduce bias.)

Answer sketch: The log-derivative trick: โˆ‡f(x) = f(x)ยทโˆ‡log f(x) implies โˆ‡E_ฯ„[R(ฯ„)] = E_ฯ„[R(ฯ„)ยทโˆ‡log P(ฯ„|ฮธ)]. Since โˆ‡log P(ฯ„|ฮธ) = ฮฃโ‚œ โˆ‡log ฯ€(aโ‚œ|sโ‚œ;ฮธ) (environment dynamics terms vanish under โˆ‡_ฮธ), the estimator is unbiased. For the baseline: E_a[โˆ‡log ฯ€(a|s)ยทb(s)] = b(s)ยทฮฃ_a ฯ€(a|s)ยทโˆ‡log ฯ€(a|s) = b(s)ยทโˆ‡ฮฃ_a ฯ€(a|s) = b(s)ยทโˆ‡1 = 0. Therefore subtracting any b(s) doesn't introduce bias but reduces variance.
Analyze
B6 Intermediate

In a DQN, the loss function is L(ฮธ) = E[(y โˆ’ Q(s,a;ฮธ))ยฒ] where y = r + ฮณยทmax_a' Q(s',a';ฮธโป). Compute โˆ‚L/โˆ‚ฮธ. Why do we NOT backpropagate through ฮธโป?

Answer: โˆ‚L/โˆ‚ฮธ = โˆ’2ยทE[(y โˆ’ Q(s,a;ฮธ)) ยท โˆ‚Q(s,a;ฮธ)/โˆ‚ฮธ]. We treat y as a constant (no gradient through ฮธโป) because: (1) if we backpropagated through ฮธโป, both the prediction and target would change simultaneously, creating a moving target that destabilizes learning; (2) the target network is a periodic snapshot of the online network, providing a stable optimization target. In PyTorch, this is achieved by wrapping the target computation in torch.no_grad().
Analyze
B7 Intermediate

Write the Bellman equation for the ISRO spacecraft problem. State = (distance_to_Mars, velocity). Action = {no_burn, short_burn, long_burn}. R(crash) = โˆ’1000, R(orbit_captured) = +100, R(burn) = โˆ’10ร—fuel_used.

Answer: V*(d, v) = max_{aโˆˆ{no_burn, short, long}} ฮฃ_{(d',v')} P((d',v')|(d,v),a) ร— [R((d,v),a,(d',v')) + ฮณยทV*(d',v')]. Here P is governed by orbital mechanics: d' = d โˆ’ vยทฮ”t, v' = v + (gravitational_acceleration + thrust(a))ยทฮ”t, plus stochastic perturbation. Terminal conditions: V*(orbit_captured) = +100, V*(crash) = โˆ’1000.
Create
B8 Advanced

Explain mathematically why Q-Learning converges to Q* while using ฮต-greedy exploration. What conditions on ฮฑ are needed?

Answer: Q-Learning converges to Q* under two conditions: (1) Every (s,a) pair must be visited infinitely often (ฮต-greedy ensures this as long as ฮต > 0). (2) The learning rate ฮฑ must satisfy: ฮฃโ‚œ ฮฑโ‚œ = โˆž and ฮฃโ‚œ ฮฑโ‚œยฒ < โˆž (the Robbins-Monro conditions). These ensure updates are large enough to overcome initial errors but small enough to converge. A decaying schedule ฮฑโ‚œ = 1/t satisfies both. The proof relies on the contraction mapping theorem โ€” the Bellman optimality operator T is a ฮณ-contraction in the sup norm, so iterative application of T converges to the unique fixed point Q*.
Analyze

Section C: Coding Questions (4)

C1 Intermediate

Implement SARSA (on-policy TD control) for FrozenLake and compare its learned policy with Q-Learning. Plot learning curves for both algorithms over 10,000 episodes.

Hint: The only difference is the update rule. Q-Learning: Q(s,a) โ† Q(s,a) + ฮฑ[r + ฮณยทmax_a' Q(s',a') โˆ’ Q(s,a)]. SARSA: Q(s,a) โ† Q(s,a) + ฮฑ[r + ฮณยทQ(s',a') โˆ’ Q(s,a)] where a' is the actual next action chosen by ฮต-greedy (not the max).
Apply
C2 Advanced

Implement Double DQN for CartPole. In Double DQN, use the online network to select the best next action, but the target network to evaluate that action: y = r + ฮณ ยท Q(s', argmax_a' Q(s',a';ฮธ); ฮธโป). Compare learning curves with standard DQN.

Hint: Change one line in the training step: best_next_action = online_net(next_states).argmax(1), then next_q = target_net(next_states).gather(1, best_next_action.unsqueeze(1)).
Apply
C3 Advanced

Implement Actor-Critic (A2C) for CartPole. Use separate networks for the actor (policy) and critic (value function). The actor should be updated using the advantage: A(s,a) = r + ฮณยทV(s') โˆ’ V(s).

Hint: The critic loss is MSE on TD error: L_critic = (r + ฮณยทV(s') โˆ’ V(s))ยฒ. The actor loss is: L_actor = โˆ’log ฯ€(a|s) ยท A.detach(). Important: detach the advantage from the actor loss computation to prevent gradients from flowing through the critic into the actor.
Create
C4 Intermediate

Implement a simple gridworld environment from scratch (no Gymnasium). The grid should be 5ร—5 with walls, a goal, and penalties. Then solve it with Q-Learning. Visualize the learned Q-values as a heatmap and the optimal policy as arrows.

Hint: The environment needs: reset() โ†’ state, step(action) โ†’ (next_state, reward, done, info). Use matplotlib's imshow for heatmap and quiver for arrow plot.
Create

Section D: Critical Thinking Questions (3)

D1 Advanced

"RL will replace all supervised learning" โ€” agree or disagree? Discuss the fundamental limitations of RL that make supervised learning preferable in many scenarios.

Discussion points: Disagree. RL limitations: (1) Sample inefficiency โ€” needs millions of interactions to learn tasks SL learns from thousands of examples. (2) Reward design is hard โ€” SL has clear labels, RL needs careful reward engineering. (3) Safety โ€” RL explores by trial and error, which is dangerous in medical, financial, or safety-critical domains. (4) Reproducibility โ€” RL training is notoriously unstable and sensitive to hyperparameters and random seeds. (5) Credit assignment โ€” hard to determine which action 100 steps ago caused the current reward. SL remains superior when labeled data exists and the task has clear input-output mapping.
Evaluate
D2 Advanced

RLHF is used to align LLMs with human values. But human preferences can be inconsistent, biased, and culturally dependent. Discuss the risks of using RLHF for alignment and propose mitigation strategies.

Discussion points: Risks: (1) Annotator bias โ€” preferences reflect the annotators' demographics, culture, and beliefs (US-centric annotators may not represent Indian users' preferences). (2) Reward model overfitting โ€” the model may learn surface-level preferences (verbose responses) rather than true quality. (3) Reward hacking โ€” the model finds responses that score high on the reward model without being genuinely helpful. (4) Goodhart's Law โ€” "when a measure becomes a target, it ceases to be a good measure." Mitigations: diverse annotator pools, multiple reward models, constitutional AI (self-correction rules), iterative RLHF with human auditing, red-teaming.
Evaluate
D3 Advanced

AlphaZero learned to play chess better than any human or computer in 4 hours. Why can't we apply the same self-play approach to solve real-world problems like drug discovery or climate modeling?

Discussion points: (1) Perfect simulator โ€” chess has perfect rules and instant simulation. Drug/climate simulation is approximate, expensive, and uncertain. (2) Clear reward โ€” chess has win/draw/lose. Drug efficacy is multi-dimensional and takes years to measure. (3) Deterministic environment โ€” chess (given random starting player) is deterministic. Real-world is stochastic with irreducible uncertainty. (4) Resettability โ€” you can restart a chess game instantly. You can't restart a clinical trial. (5) Self-play structure โ€” chess is a two-player zero-sum game. Most real problems aren't competitive games. However, ideas from AlphaZero (MCTS, learned value functions) ARE being adapted: AlphaFold used similar ideas for protein structure prediction!
Evaluate

โ˜… Starred Research Questions (4)

โ˜…1 Advanced

Read the paper "Decision Transformer: Reinforcement Learning via Sequence Modeling" (Chen et al., 2021). How does it reformulate RL as a sequence prediction problem? What are the implications for combining RL with transformer architectures?

Hint: Instead of learning Q-values or policies, Decision Transformer conditions on the desired return and generates actions autoregressively. The input sequence is (Rฬ‚โ‚, sโ‚, aโ‚, Rฬ‚โ‚‚, sโ‚‚, aโ‚‚, ...) where Rฬ‚โ‚œ is the target return-to-go. At test time, you condition on a high return to get good behavior. It requires no Bellman equations, no temporal difference, and no replay buffer โ€” just supervised sequence modeling!
CreateResearch
โ˜…2 Advanced

Propose an RL formulation for optimizing traffic signals across 50 intersections in Bengaluru (India's traffic capital). Consider: multi-agent RL (each intersection is an agent), communication between agents, variable traffic patterns (auto-rickshaws, buses, two-wheelers), and the objective function.

Hint: State per intersection: queue lengths, wait times, signal phase, time of day. Global state: aggregate traffic flow. Action: signal timing for each phase. Reward: negative total wait time across all vehicles. Key challenge: the agents' actions interact โ€” changing one signal affects downstream traffic. Approaches: independent Q-Learning (simple but ignores coordination), centralized training with decentralized execution (CTDE), mean-field RL, or communication protocols between neighboring intersections.
CreateResearch
โ˜…3 Advanced

Investigate "Direct Preference Optimization" (DPO, Rafailov et al., 2023). How does DPO eliminate the need for a separate reward model in RLHF? Derive the DPO loss and compare it with PPO-based RLHF in terms of simplicity and performance.

Hint: DPO directly optimizes the policy from preference pairs (y_w, y_l) using the closed-form solution of the KL-constrained RL objective: L_DPO = โˆ’E[log ฯƒ(ฮฒ ยท log(ฯ€(y_w|x)/ฯ€_ref(y_w|x)) โˆ’ ฮฒ ยท log(ฯ€(y_l|x)/ฯ€_ref(y_l|x)))]. This eliminates the reward model training step entirely, making the pipeline simpler and more stable.
CreateResearch
โ˜…4 Advanced

Design an RL-based approach for adaptive clinical trial design (relevant to India's pharmaceutical industry). How would you handle: (a) ethical constraints on exploration, (b) patient safety, (c) the multi-arm multi-stage structure?

Hint: Formulate as a contextual bandit (each patient = one-step decision) or full MDP (multi-stage treatment). Use Thompson Sampling for exploration (natural Bayesian approach). Add constraints: (a) Minimum efficacy thresholds โ€” never assign clearly inferior treatments. (b) Safety monitoring โ€” stop exploration early if adverse events detected (sequential testing). (c) Multi-arm: Bayesian adaptive randomization โ€” allocate more patients to better-performing arms. India's large patient populations and generic drug industry make this especially relevant.
CreateResearch
Section 13

Connections

๐Ÿ”— How This Chapter Connects

โ† Builds On
  • Chapter 5 (Gradient Descent): Policy gradient methods optimize J(ฮธ) using gradient ascent โ€” the same optimization framework, now applied to expected return
  • Chapter 10 (Deep Network Training): DQN uses all the tricks from deep learning โ€” batch normalization, gradient clipping, learning rate scheduling โ€” to stabilize training
  • Chapter 6 (Backpropagation): The DQN loss is MSE, and REINFORCE loss uses log-probability โ€” both trained via backpropagation through the policy/value network
โ†’ Enables
  • RLHF for LLMs (Chapter 15 connection): PPO is the workhorse of modern LLM alignment โ€” without this chapter, you can't understand how ChatGPT is fine-tuned
  • Robotics & Control: Continuous-action RL (SAC, PPO) is the foundation for robot locomotion, manipulation, and drone navigation
  • Multi-Agent Systems: Game theory + RL = multi-agent RL, relevant for autonomous driving, trading, and competitive AI
๐Ÿ”ฌ Research Frontier (2024-2025)
  • Offline RL: Learning from fixed datasets without environment interaction (Conservative Q-Learning, IQL) โ€” critical for healthcare and autonomous driving
  • World Models: Dreamer v3 learns a world model and plans within it, achieving superhuman Atari performance with 100ร— less data
  • Foundation Models for RL: RT-2 (Google) uses vision-language models as RL policies for robots, enabling zero-shot task transfer
  • DPO & Alternatives to RLHF: Direct Preference Optimization eliminates the reward model, simplifying LLM alignment
๐Ÿญ Industry Implementations
  • Google: RL for data center cooling (40% energy savings)
  • Netflix: RL for video encoding bitrate adaptation
  • Flipkart: RL-based product ranking and recommendation
  • Microsoft: RL for compiler optimization (MLGO)
  • Quantitative Trading: RL agents at firms like Two Sigma, DE Shaw, and Indian firms like Tower Research
Section 14

Chapter Summary

๐ŸŽฏ Key Takeaways

  1. RL is the third paradigm: Unlike supervised learning (teacher provides labels) and unsupervised learning (find structure), RL learns by trial-and-error with delayed rewards. The agent-environment interaction loop (state โ†’ action โ†’ reward โ†’ next state) is the foundation.
  2. MDPs formalize RL: A Markov Decision Process (S, A, P, R, ฮณ) provides the mathematical framework. The Markov property says the future depends only on the current state, not the history.
  3. Bellman equations are everything: The Bellman Expectation Equation evaluates a policy; the Bellman Optimality Equation finds the best policy. The key difference: expectation uses ฮฃ ฯ€(a|s) (average over policy), optimality uses max_a (best action).
  4. Q-Learning learns without a model: The update Q(s,a) โ† Q(s,a) + ฮฑ[r + ฮณยทmax Q(s',a') โˆ’ Q(s,a)] converges to optimal Q-values using only experienced transitions. No knowledge of environment dynamics needed.
  5. DQN = Q-Learning + Neural Networks + Two Tricks: Experience replay (break correlations, reuse data) and target networks (stabilize the optimization target) made deep RL work. This enabled superhuman Atari play from raw pixels.
  6. Policy gradient goes direct: REINFORCE directly optimizes the policy ฯ€(a|s;ฮธ) using โˆ‡J = E[โˆ‡log ฯ€ ยท G]. Actor-Critic adds a value-function baseline to reduce variance. PPO clips the update to prevent instability.
  7. RLHF bridges RL and LLMs: The SFT โ†’ Reward Model โ†’ PPO pipeline used to train ChatGPT/Claude is fundamentally an RL application, using PPO to optimize an LLM policy against a learned reward model of human preferences.

๐Ÿ“ The One Equation to Remember

Bellman Optimality: Q*(s,a) = E[r + ฮณ ยท maxa' Q*(s',a')]
"The value of the best action equals the immediate reward plus the discounted value of the best action in the next state."

๐Ÿ’ก The One Intuition to Remember

RL is about learning from consequences. Every decision you make changes the world, and the world gives you a signal (reward) about how good that change was. The challenge โ€” and the beauty โ€” is that this signal is often delayed, noisy, and sparse. Despite this, algorithms like Q-Learning and PPO can learn optimal behavior from nothing but trial and error. From ISRO's spacecraft to DeepMind's game-playing AI to ChatGPT's helpfulness โ€” the same fundamental framework applies.

Section 15

Further Reading

๐Ÿ‡ฎ๐Ÿ‡ณ Indian Resources

  • NPTEL Course: "Reinforcement Learning" by Prof. Balaraman Ravindran (IIT Madras) โ€” comprehensive course covering bandits to deep RL, available free on NPTEL/Swayam
  • NPTEL Course: "Introduction to Machine Learning" by Prof. Sudeshna Sarkar (IIT Kharagpur) โ€” RL module in the ML course, good for GATE preparation
  • IISc RL Lab: Research on RL for autonomous systems, papers on arXiv by Prof. Shalabh Bhatnagar's group
  • GATE Reference: "Pattern Recognition and Machine Learning" by Christopher Bishop โ€” Chapter on MDPs and RL
  • AI4Bharat: Projects applying RL to Indian-language processing and robotics

๐ŸŒ Global Resources

  • ๐Ÿ“– Textbook: "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto (2nd edition, 2018) โ€” the definitive RL textbook, freely available online. Read Chapters 1-6 alongside this chapter.
  • ๐Ÿ“– Textbook: "Deep Reinforcement Learning Hands-On" by Maxim Lapan (2nd edition, Packt) โ€” practical PyTorch implementations
  • ๐Ÿ“„ Paper: Mnih et al., "Human-level Control through Deep Reinforcement Learning," Nature 2015 โ€” the DQN paper
  • ๐Ÿ“„ Paper: Schulman et al., "Proximal Policy Optimization Algorithms," 2017 โ€” the PPO paper
  • ๐Ÿ“„ Paper: Silver et al., "Mastering the Game of Go without Human Knowledge," Nature 2017 โ€” AlphaZero
  • ๐Ÿ“„ Paper: Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback," NeurIPS 2022 โ€” InstructGPT/RLHF
  • ๐Ÿ“„ Paper: Rafailov et al., "Direct Preference Optimization," NeurIPS 2023 โ€” DPO, the RLHF alternative
  • ๐ŸŽฅ Video: David Silver's RL course (UCL/DeepMind, YouTube) โ€” 10 lectures covering the full RL spectrum
  • ๐ŸŽฅ Video: 3Blue1Brown โ€” "But what is a neural network?" + "Gradient descent" (foundational)
  • ๐ŸŒ Interactive: OpenAI Spinning Up โ€” excellent tutorials on policy gradient, PPO, SAC with clean implementations
  • ๐ŸŒ Interactive: CleanRL โ€” single-file implementations of RL algorithms for learning
  • ๐ŸŒ Distill.pub: "Attention and Augmented Recurrent Neural Networks" โ€” interactive visualizations relevant to RL with memory