Computer Organization & Architecture

Unit 6: Memory Unit

From registers to hard drives — master the memory hierarchy, cache mapping techniques, virtual memory, and solve GATE-level numericals with confidence.

⏱️ 8 hrs theory + 5 hrs lab | 🎯 GATE ~4 marks | 🖥️ Snapdragon Cache

💼 Jobs this unlocks: VLSI Design Engineer (₹6–12 LPA) | Embedded Systems Developer (₹5–10 LPA) | SoC Verification Engineer (₹8–15 LPA)

Section A

Opening Hook — Why Does Your Laptop Slow Down with 100 Tabs?

🖥️ The Mystery of the 100-Tab Slowdown

You've done it. We all have. You open Chrome, start with 5 tabs, then 20, then 50… and by the time you hit 100 tabs, your laptop turns into a space heater that can barely scroll. Your fancy 16 GB RAM machine is now slower than a ₹5,000 phone. Why?

The answer lies in the memory hierarchy. Your CPU doesn't just grab data from RAM. It first checks its tiny ultra-fast L1 cache (32 KB, ~1 ns). Miss? It checks the L2 cache (256 KB, ~5 ns). Still miss? L3 cache (8 MB, ~20 ns). All misses? It finally goes to RAM (16 GB, ~100 ns). But with 100 tabs, even RAM fills up, and the OS starts using your SSD as virtual memory, which at ~50 μs per access is roughly 500× slower than RAM. That's the slowdown.

Qualcomm's Snapdragon 8 Gen 3 chip (inside your Samsung Galaxy S24) has a 12 MB L3 cache designed by Indian engineers in Hyderabad. Apple's M3 has a 36 MB L2. Every nanosecond saved in cache design translates to billions of dollars in market advantage. This chapter teaches you exactly how that works.

🇮🇳 Qualcomm Hyderabad | 🇮🇳 Samsung Semiconductor | Intel | Apple Silicon | AMD | 🇮🇳 ISRO NavIC

A single L1 cache access (~1 ns) vs a hard disk access (~10 ms) is a 10,000,000× speed difference. If an L1 cache access took as long as a blink of your eye (300 ms), then waiting for a hard disk would take about 35 days (300 ms × 10,000,000 = 3,000,000 s). That's why cache design is among the most performance-critical jobs at chip companies like Qualcomm India.

Section B

Learning Outcomes — Bloom's Taxonomy Mapped

Bloom's Level	Learning Outcome
🔵 Remember	List the levels of the memory hierarchy with access times, sizes, and cost per bit
🔵 Remember	Define cache memory, hit ratio, miss penalty, TLB, and page fault
🟢 Understand	Explain how direct mapping, fully associative, and set-associative mapping work with tag/line/word fields
🟢 Understand	Describe virtual memory organisation including page tables, TLB, and demand paging
🟡 Apply	Compute tag, line, and word bits for a given cache configuration and calculate AMAT
🟡 Apply	Trace a reference string through cache using FIFO replacement and calculate hit rate
🟠 Analyze	Compare write-through vs write-back policies and analyse their performance trade-offs
🟠 Analyze	Analyse why set-associative mapping is preferred over direct and fully associative in modern CPUs
🔴 Evaluate	Evaluate the cache design trade-offs in Snapdragon vs Apple Silicon processors
🔴 Evaluate	Assess the impact of page size on TLB miss rate and internal fragmentation
🟣 Create	Design a 2-level cache hierarchy for a given workload with AMAT constraints
🟣 Create	Simulate a cache replacement algorithm for a given reference string and propose optimisations

Section C

Concept Explanation — Memory Unit from Scratch

1. Memory Hierarchy — The Speed-Size-Cost Pyramid

Imagine a library. You keep your most-used notes on your desk (registers — fastest, tiny). Books you need today are on the shelf beside you (cache). The library room has thousands of books (RAM). The basement archive has millions (SSD/HDD). Each level is bigger but slower. A computer's memory works exactly the same way.

🔺 The Complete Memory Hierarchy Pyramid

        ┌───────────┐
        │ REGISTERS │  ← 0.3 ns | 256 B–2 KB | ₹₹₹₹₹ (on-chip)
        │  (CPU)    │     Flip-flops, zero latency for ALU
        ├───────────┤
        │  L1 CACHE │  ← 1 ns   | 32–64 KB   | ₹₹₹₹  (on-chip SRAM)
        │ (per core)│     Split: I-cache + D-cache
        ├───────────┤
        │  L2 CACHE │  ← 5 ns   | 256 KB–1 MB| ₹₹₹   (on-chip SRAM)
        │ (per core)│     Unified instruction + data
        ├───────────┤
        │  L3 CACHE │  ← 20 ns  | 4–36 MB    | ₹₹    (shared SRAM)
        │ (shared)  │     Shared across all cores
        ├───────────┤
        │ MAIN MEM  │  ← 100 ns | 4–64 GB    | ₹     (DRAM)
        │  (RAM)    │     Volatile, row/column addressing
        ├───────────┤
        │   SSD     │  ← 50 μs  | 256 GB–4 TB| ₹/10  (NAND Flash)
        │(secondary)│     Non-volatile, no moving parts
        ├───────────┤
        │   HDD     │  ← 10 ms  | 1–20 TB    | ₹/100 (magnetic)
        │(secondary)│     Spinning platters, mechanical arm
        └───────────┘

  Speed:  ◄────── FASTEST ──────────────────── SLOWEST ──────►
  Size:   ◄────── SMALLEST ─────────────────── LARGEST ──────►
  Cost/b: ◄────── MOST EXPENSIVE ───────────── CHEAPEST ─────►

Level	Technology	Access Time	Typical Size	Cost/GB (approx.)	Volatile?
Registers	Flip-flops	0.3 ns	~1 KB	—	Yes
L1 Cache	SRAM	1 ns	32–64 KB	~₹5,00,000	Yes
L2 Cache	SRAM	5 ns	256 KB–1 MB	~₹2,00,000	Yes
L3 Cache	SRAM	20 ns	4–36 MB	~₹50,000	Yes
RAM	DRAM	100 ns	4–64 GB	~₹250	Yes
SSD	NAND Flash	50 μs	256 GB–4 TB	~₹5	No
HDD	Magnetic	10 ms	1–20 TB	~₹2	No

Qualcomm's Snapdragon 8 Gen 3 (designed in Hyderabad & Bangalore) features: 64 KB L1 I-cache + 64 KB L1 D-cache per Cortex-X4 core, 1 MB L2 per core, and a 12 MB shared L3 cache. This cache hierarchy is what lets your Samsung/OnePlus phone run games at 120 FPS without desktop-class RAM.

GATE Favourite: The key principle is locality of reference. Temporal locality = if you accessed data now, you'll likely access it again soon (loops). Spatial locality = if you accessed address X, you'll likely access X+1 soon (arrays). Caches exploit both.
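
To see spatial locality in numbers, here is a minimal sketch of a toy direct-mapped cache that counts hits. It is not a model of any real CPU: the function name `hit_rate`, the 64-line cache and the 64-byte block size are all illustrative assumptions.

```python
def hit_rate(addresses, num_lines=64, block_size=64):
    """Toy direct-mapped cache: return the fraction of accesses that hit."""
    lines = [None] * num_lines          # block number currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // block_size      # memory block that contains this byte
        line = block % num_lines        # direct mapping: block mod number of lines
        if lines[line] == block:
            hits += 1
        else:
            lines[line] = block         # miss: bring the whole block in
    return hits / len(addresses)

# Sequential walk over 1024 four-byte integers (good spatial locality)
sequential = [i * 4 for i in range(1024)]
# Same number of accesses, but every access lands in a different 64-byte block
strided = [i * 64 for i in range(1024)]

print(f"sequential: {hit_rate(sequential):.4f}")
print(f"strided:    {hit_rate(strided):.4f}")
```

The sequential walk misses once per 64-byte block and then hits on the next 15 integers (15/16 = 0.9375), while the strided walk never reuses a block and scores 0.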

2. Cache Memory — Structure & Organisation

Cache memory is a small, fast SRAM buffer between the CPU and main memory. Its job: keep the most frequently accessed data close to the CPU so the processor doesn't waste 100 ns waiting for RAM every time.

🏗️ Cache Memory Block Diagram

   CPU                           CACHE                        MAIN MEMORY
  ┌─────┐                   ┌──────────────┐               ┌──────────────┐
  │     │ ── Address ──────►│              │               │              │
  │ CPU │                   │  Tag  Array  │── Miss ──────►│     RAM      │
  │     │◄── Data ──────────│  Data Array  │◄── Block ─────│   (DRAM)     │
  │     │                   │  Valid Bits  │               │              │
  └─────┘                   │  Dirty Bits  │               └──────────────┘
                            └──────────────┘

  Cache Line Structure:
  ┌───────┬───────┬──────────────────────────────────────┐
  │ Valid │  Tag  │     Data Block (B bytes)              │
  │  (1b) │(t bits)│  Word₀ │ Word₁ │ Word₂ │ ... │ Wₙ  │
  └───────┴───────┴──────────────────────────────────────┘

Key terms:

• Cache Line (Block): The smallest unit of data transferred between cache and RAM. Typical: 32 or 64 bytes.

• Tag: Identifies which main memory block is currently stored in this cache line.

• Valid bit: 1 = line has valid data, 0 = empty/invalid.

• Dirty bit: (Write-back only) 1 = line modified, needs to be written back to RAM.

• Hit: Requested data found in cache. Miss: Not found → fetch from RAM.

• Hit Ratio (h): h = (Number of hits) / (Total accesses). Typical: 0.90–0.99.

Modern Intel CPUs achieve L1 hit rates of 95–97%. This means out of every 100 memory accesses, only 3–5 actually need to go to L2 or beyond. That's the magic of good cache design + spatial/temporal locality.

3. Direct Mapping — [Tag | Line | Word]

Analogy: Think of a hostel with 8 rooms. Each student is assigned a fixed room based on their roll number: Room = Roll % 8. Student 0, 8, 16, 24 all map to Room 0. If Student 0 is in Room 0 and Student 8 arrives, Student 0 gets kicked out. No choice — it's direct mapping.

📐 Direct Mapped Cache — Address Breakdown

  Given: Main Memory = 2ⁿ bytes, Cache Lines = 2ˡ, Block Size = 2ʷ bytes

  CPU Address (n bits):
  ┌──────────────┬──────────────┬──────────────┐
  │     TAG      │  LINE/INDEX  │  WORD OFFSET │
  │  (n-l-w) bits│   (l bits)   │   (w bits)   │
  └──────────────┴──────────────┴──────────────┘

  Mapping Formula:
    Cache Line Number = (Main Memory Block Number) mod (Number of Cache Lines)
    Line Number = Block Address mod 2ˡ

  Example: 32-bit address, 512 lines, 4 words/block (16 bytes)
  ┌──────────────────┬───────────┬──────┐
  │     TAG (19)     │ LINE (9)  │ W(4) │
  │   19 bits        │  9 bits   │ 4 bits│
  └──────────────────┴───────────┴──────┘
  
  Total = 19 + 9 + 4 = 32 bits ✓

  Direct Mapped Cache Layout:
  ┌───────┬─────┬──────────────────────────────────┐
  │ Valid │ Tag │ Word₀  │ Word₁  │ Word₂  │ Word₃ │  ← Line 0
  ├───────┼─────┼────────┼────────┼────────┼───────┤
  │ Valid │ Tag │ Word₀  │ Word₁  │ Word₂  │ Word₃ │  ← Line 1
  ├───────┼─────┼────────┼────────┼────────┼───────┤
  │  ...  │ ... │  ...   │  ...   │  ...   │  ...  │
  ├───────┼─────┼────────┼────────┼────────┼───────┤
  │ Valid │ Tag │ Word₀  │ Word₁  │ Word₂  │ Word₃ │  ← Line 511
  └───────┴─────┴────────┴────────┴────────┴───────┘

How a lookup works:

1. Extract the LINE bits from the CPU address → go to that cache line
2. Compare the TAG field of that line with the TAG bits from the address
3. If TAG matches AND Valid=1 → HIT! Use the WORD OFFSET to pick the right word
4. If TAG doesn't match or Valid=0 → MISS! Fetch block from RAM, replace this line

Students confuse "word offset" with "byte offset." If the block has 4 words (each word = 4 bytes), the word offset needs 2 bits (to select which word). But if the question says "byte-addressable," you need 2 extra bits to select within a word. Always check: is the address in words or bytes?
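
The word-vs-byte trap is easy to check with a few lines of Python. This is a small sketch under the assumption that all sizes are powers of two; the helper name `field_widths` and its parameters are made up for illustration.

```python
from math import log2

def field_widths(address_bits, num_lines, words_per_block,
                 bytes_per_word=4, byte_addressable=True):
    """Return (tag, line, offset) bit widths for a direct-mapped cache."""
    offset_bits = int(log2(words_per_block))        # selects the word in the block
    if byte_addressable:
        offset_bits += int(log2(bytes_per_word))    # extra bits select the byte in the word
    line_bits = int(log2(num_lines))
    tag_bits = address_bits - line_bits - offset_bits
    return tag_bits, line_bits, offset_bits

print(field_widths(32, 512, 4, byte_addressable=True))    # byte-addressable memory
print(field_widths(32, 512, 4, byte_addressable=False))   # word-addressable memory
```

For the 512-line, 4-words-per-block example the byte-addressable case gives (19, 9, 4), while the word-addressable case gives (21, 9, 2): the two bits saved in the offset move into the tag.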

4. Fully Associative Mapping — [Tag | Word]

Analogy: Unlike the hostel (direct mapping), think of a parking lot with 8 spots. Any car can park in any spot. When a new car arrives and the lot is full, you use a replacement policy (kick out the oldest = FIFO, kick out least recently used = LRU). Maximum flexibility, but you need to check all spots simultaneously.

📐 Fully Associative Cache — Address Breakdown

  CPU Address (n bits):
  ┌──────────────────────────┬──────────────┐
  │          TAG             │  WORD OFFSET │
  │      (n - w) bits        │   (w bits)   │
  └──────────────────────────┴──────────────┘

  NO LINE/INDEX field! Any block can go in ANY cache line.

  Example: 32-bit address, Block = 16 bytes (w = 4)
  ┌────────────────────────────┬──────┐
  │         TAG (28)           │ W(4) │
  └────────────────────────────┴──────┘

  Lookup: CPU sends tag → ALL lines compare simultaneously (parallel comparators)
  
  ┌───────┬─────────────┬─────────────────────────┐
  │ Valid │   Tag (28)  │  Data Block (16 bytes)   │  ← Line 0  ─┐
  ├───────┼─────────────┼─────────────────────────┤              │
  │ Valid │   Tag (28)  │  Data Block (16 bytes)   │  ← Line 1   ├─ All compared
  ├───────┼─────────────┼─────────────────────────┤              │  in PARALLEL
  │  ...  │    ...      │        ...               │              │
  ├───────┼─────────────┼─────────────────────────┤              │
  │ Valid │   Tag (28)  │  Data Block (16 bytes)   │  ← Line N  ─┘
  └───────┴─────────────┴─────────────────────────┘
              ▲ Compare with incoming tag

Advantage: No conflict misses — any block can go anywhere.

Disadvantage: Expensive! Needs a comparator for every cache line. Hardware cost scales with cache size.

Used for: TLBs (small, needs high hit rate), small L1 caches in some designs.

5. Set-Associative Mapping — The Best of Both Worlds

Analogy: Compromise! Instead of one fixed room (direct) or any room (associative), we have hostels (sets), each with a few rooms (ways). A student must go to their assigned hostel but can pick any room inside it. This gives flexibility within a set while keeping hardware cost reasonable.

📐 K-Way Set-Associative Cache (2-Way Example)

  CPU Address (n bits):
  ┌──────────────┬──────────────┬──────────────┐
  │     TAG      │  SET INDEX   │  WORD OFFSET │
  │(n - s - w) b │   (s bits)   │   (w bits)   │
  └──────────────┴──────────────┴──────────────┘

  Number of Sets = Total Lines / K  (where K = associativity)
  s = log₂(Number of Sets)

  2-Way Set-Associative Cache Layout (4 sets, 8 total lines):

         Way 0                    Way 1
  ┌───────┬─────┬──────┐  ┌───────┬─────┬──────┐
  │ V│Tag │Data │      │  │ V│Tag │Data │      │  ← Set 0
  ├───────┼─────┼──────┤  ├───────┼─────┼──────┤
  │ V│Tag │Data │      │  │ V│Tag │Data │      │  ← Set 1
  ├───────┼─────┼──────┤  ├───────┼─────┼──────┤
  │ V│Tag │Data │      │  │ V│Tag │Data │      │  ← Set 2
  ├───────┼─────┼──────┤  ├───────┼─────┼──────┤
  │ V│Tag │Data │      │  │ V│Tag │Data │      │  ← Set 3
  └───────┴─────┴──────┘  └───────┴─────┴──────┘

  Lookup Process:
  1. Use SET INDEX → go to that set
  2. Compare TAG with BOTH Way 0 and Way 1 simultaneously
  3. If either matches (and Valid=1) → HIT
  4. Both miss → MISS → replace one way (FIFO/LRU)

  Special Cases:
  • K = 1 (1-way)  → Direct Mapped
  • K = N (N-way)  → Fully Associative
  • K = 2 or 4     → Most common in modern CPUs

GATE Trick: In a K-way set-associative cache with C total lines: Number of Sets = C/K, Set Index bits = log₂(C/K). The tag bits increase as K increases (because set index bits decrease). Direct mapping has the most index bits; fully associative has zero.
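
A quick way to convince yourself of this trick is to keep the cache fixed and vary only K. The sketch below assumes power-of-two sizes; `set_assoc_fields` is a hypothetical helper name, and the 512-line, 16-byte-block cache is borrowed from the direct-mapping example.

```python
from math import log2

def set_assoc_fields(address_bits, total_lines, block_bytes, ways):
    """Return (tag, set_index, offset) bit widths for a K-way cache."""
    num_sets = total_lines // ways
    offset_bits = int(log2(block_bytes))
    set_bits = int(log2(num_sets))      # 0 when there is only one set
    tag_bits = address_bits - set_bits - offset_bits
    return tag_bits, set_bits, offset_bits

for k in (1, 2, 4, 512):
    tag, index, offset = set_assoc_fields(32, 512, 16, k)
    print(f"K={k:>3}: tag={tag} index={index} offset={offset}")
```

K = 1 reproduces the direct-mapped split (19 | 9 | 4), and K = 512 reproduces the fully associative split (28 | 0 | 4). Every doubling of K moves one bit from the index to the tag.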

6. Cache Hit/Miss Trace — Reference String with FIFO

Let's trace how a cache handles a sequence of memory references. This is a classic GATE question type.

📊 Worked Example: 4-Line Cache, Direct-Mapped vs 2-Way Set-Associative (FIFO)

Setup: 4 cache lines (lines 0–3), direct-mapped, block size = 1 word.

Reference String (block numbers): 0, 8, 0, 6, 8, 2, 0, 6

Mapping: Cache Line = Block Number mod 4

  Block → Cache Line:
  Block 0 → Line 0 (0 mod 4 = 0)
  Block 8 → Line 0 (8 mod 4 = 0)  ← CONFLICT with Block 0!
  Block 6 → Line 2 (6 mod 4 = 2)
  Block 2 → Line 2 (2 mod 4 = 2)  ← CONFLICT with Block 6!

  Trace Table:
  ┌──────┬─────────┬────────┬────────┬────────┬────────┬──────────┐
  │ Step │ Request │ Line 0 │ Line 1 │ Line 2 │ Line 3 │ Hit/Miss │
  ├──────┼─────────┼────────┼────────┼────────┼────────┼──────────┤
  │  1   │  Blk 0  │  [0]   │   —    │   —    │   —    │  MISS    │
  │  2   │  Blk 8  │  [8]   │   —    │   —    │   —    │  MISS    │
  │  3   │  Blk 0  │  [0]   │   —    │   —    │   —    │  MISS    │
  │  4   │  Blk 6  │  [0]   │   —    │  [6]   │   —    │  MISS    │
  │  5   │  Blk 8  │  [8]   │   —    │  [6]   │   —    │  MISS    │
  │  6   │  Blk 2  │  [8]   │   —    │  [2]   │   —    │  MISS    │
  │  7   │  Blk 0  │  [0]   │   —    │  [2]   │   —    │  MISS    │
  │  8   │  Blk 6  │  [0]   │   —    │  [6]   │   —    │  MISS    │
  └──────┴─────────┴────────┴────────┴────────┴────────┴──────────┘

  Hits = 0, Misses = 8
  Hit Rate = 0/8 = 0% (Terrible! 4 compulsory misses on first touch + 4 conflict misses)

This is the worst case for direct mapping — all references map to just 2 lines, causing constant thrashing. A 2-way set-associative cache would dramatically improve this.

Now with 2-Way Set-Associative (2 sets, 2 ways each):

  Set = Block mod 2
  Block 0 → Set 0 | Block 8 → Set 0 | Block 6 → Set 0 | Block 2 → Set 0

  ┌──────┬─────────┬────────────────┬────────────────┬──────────┐
  │ Step │ Request │ Set 0 (W0, W1) │ Set 1 (W0, W1) │ Hit/Miss │
  ├──────┼─────────┼────────────────┼────────────────┼──────────┤
  │  1   │  Blk 0  │  [0, —]        │  [—, —]        │  MISS    │
  │  2   │  Blk 8  │  [0, 8]        │  [—, —]        │  MISS    │
  │  3   │  Blk 0  │  [0, 8]        │  [—, —]        │  HIT ✅  │
  │  4   │  Blk 6  │  [6, 8] FIFO   │  [—, —]        │  MISS    │
  │  5   │  Blk 8  │  [6, 8]        │  [—, —]        │  HIT ✅  │
  │  6   │  Blk 2  │  [6, 2] FIFO   │  [—, —]        │  MISS    │
  │  7   │  Blk 0  │  [0, 2] FIFO   │  [—, —]        │  MISS    │
  │  8   │  Blk 6  │  [0, 6] FIFO   │  [—, —]        │  MISS    │
  └──────┴─────────┴────────────────┴────────────────┴──────────┘

  Hits = 2, Misses = 6
  Hit Rate = 2/8 = 25% (Better than 0% with direct mapping!)

In FIFO replacement, the block that entered earliest is replaced — NOT the least recently used. FIFO and LRU can give different results. GATE often tests this distinction. With LRU, the block that was accessed least recently is replaced, even if it entered later.
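
The FIFO-vs-LRU distinction can be checked on the same reference string with a short simulation. This is one possible sketch, not the only way to model it: each set is kept as a Python list ordered from "next to evict" to "newest", and the function name `simulate` is an assumption.

```python
def simulate(refs, num_sets, ways, policy="FIFO"):
    """Count hits for a set-associative cache; policy is 'FIFO' or 'LRU'."""
    # Each set is a list ordered from "next to evict" (front) to "newest" (back).
    sets = [[] for _ in range(num_sets)]
    hits = 0
    for block in refs:
        s = sets[block % num_sets]
        if block in s:
            hits += 1
            if policy == "LRU":
                # A hit refreshes the block under LRU; FIFO ignores hits.
                s.remove(block)
                s.append(block)
        else:
            if len(s) == ways:
                s.pop(0)                # evict the block at the front
            s.append(block)
    return hits

refs = [0, 8, 0, 6, 8, 2, 0, 6]
print("direct-mapped:", simulate(refs, num_sets=4, ways=1))
print("2-way FIFO:   ", simulate(refs, num_sets=2, ways=2, policy="FIFO"))
print("2-way LRU:    ", simulate(refs, num_sets=2, ways=2, policy="LRU"))
```

On this particular string the direct-mapped cache gets 0 hits, FIFO gets 2 and LRU gets only 1, so the two policies really do diverge. On strings with stronger temporal locality LRU usually comes out ahead.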

7. Write-Through vs Write-Back — Comparison

Feature	Write-Through	Write-Back
Mechanism	Every write updates both cache AND main memory simultaneously	Write only to cache; update main memory when line is evicted
Speed	Slower (every write goes to RAM)	Faster (writes are buffered in cache)
Consistency	Cache and RAM always consistent	Can be inconsistent; needs dirty bit tracking
Dirty Bit	Not needed	Required (1 = modified, needs writeback)
Write Buffer	Often uses a write buffer to avoid CPU stalls	Not needed on every write; a small buffer may still hold evicted dirty lines
Complexity	Simpler hardware	More complex (needs dirty bit logic + writeback)
Best For	Multiprocessor systems (coherency), I/O devices	Single-processor, performance-critical systems
Used In	L1 D-cache (some ARM designs)	L2/L3 caches, most modern CPUs
Miss Policy	Write-allocate or Write-no-allocate	Usually write-allocate (fetch block then write)

GATE Tip: "Write-allocate" = on a write miss, fetch the block into cache first, then write. "Write-no-allocate (write-around)" = on a write miss, write directly to main memory, don't fetch into cache. Write-through usually pairs with write-no-allocate. Write-back usually pairs with write-allocate.
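
To get a feel for the traffic difference between the two policies, the sketch below counts how many writes actually reach RAM for a stream of CPU writes. It is deliberately simplified: the cache is direct-mapped, both policies use write-allocate (an assumption made only to keep the code short), and `ram_writes` is a hypothetical name.

```python
def ram_writes(write_refs, num_lines, policy):
    """Count writes that reach RAM for a stream of CPU writes (block numbers).

    policy: 'write-through' or 'write-back'. Both use write-allocate here
    to keep the sketch short (an assumption, not a rule).
    """
    lines = [None] * num_lines      # block number held by each line
    dirty = [False] * num_lines     # only meaningful for write-back
    count = 0
    for block in write_refs:
        line = block % num_lines
        if lines[line] != block:                    # write miss
            if policy == "write-back" and dirty[line]:
                count += 1                          # evicted dirty line goes to RAM
            lines[line] = block
            dirty[line] = False
        if policy == "write-through":
            count += 1                              # every write also updates RAM
        else:
            dirty[line] = True                      # defer the RAM update
    if policy == "write-back":
        count += sum(dirty)                         # flush what is still dirty
    return count

writes = [0] * 10 + [4]     # ten writes to block 0, then block 4 (same line)
print("write-through:", ram_writes(writes, 4, "write-through"))
print("write-back:   ", ram_writes(writes, 4, "write-back"))
```

Write-through sends all 11 writes to RAM. Write-back sends only 2: one when the dirty block 0 is evicted and one for the final flush of block 4.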

8. Virtual Memory — Page Table, TLB, Demand Paging

Analogy: Imagine you're a teacher with 60 students but only 30 chairs. You give each student a "virtual seat number" (1–60). When a student comes to class, you assign them a real chair. If all chairs are full, the least active student goes to the "waiting room" (disk). That's virtual memory — every process gets its own full address space, but physical RAM is shared.

📐 Virtual Memory Address Translation

  Virtual Address (from CPU):
  ┌───────────────────┬──────────────┐
  │  Virtual Page No. │  Page Offset │
  │    (VPN)          │   (d bits)   │
  └───────────────────┴──────────────┘
           │
           ▼
  ┌────────────────────────────┐
  │       PAGE TABLE           │
  │ ┌─────┬───────┬──────────┐ │
  │ │Valid│ Dirty │Frame No. │ │
  │ │  1  │   0   │  0x3A    │ │ ← VPN 0
  │ │  1  │   1   │  0x2F    │ │ ← VPN 1
  │ │  0  │   0   │   —      │ │ ← VPN 2 (PAGE FAULT!)
  │ │  1  │   0   │  0x71    │ │ ← VPN 3
  │ │ ... │  ...  │   ...    │ │
  │ └─────┴───────┴──────────┘ │
  └────────────────────────────┘
           │
           ▼
  Physical Address:
  ┌───────────────────┬──────────────┐
  │  Physical Frame   │  Page Offset │
  │   Number (PFN)    │   (d bits)   │
  └───────────────────┴──────────────┘

  Page Fault: Valid=0 → page not in RAM → OS fetches from disk (VERY slow: ~10 ms)

Translation Lookaside Buffer (TLB):

  CPU ──VPN──► ┌─────┐ Hit ──PFN──► Physical Address
               │ TLB │ (fast: ~1 ns, fully associative)
               └─────┘
                 │ Miss
                 ▼
               ┌────────────┐
               │ Page Table  │ (in RAM: ~100 ns)
               │  Walk       │
               └────────────┘
                 │ Page Fault
                 ▼
               ┌────────────┐
               │    Disk     │ (10 ms — catastrophic!)
               └────────────┘

TLB is a small, fast cache (typically 32–128 entries, fully associative) that stores recent VPN→PFN translations. TLB hit rate is typically 99%+ in well-designed systems.
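
The translation flow can be sketched in a few lines of Python. Assumptions are labelled in the comments: 4 KB pages, a page table held as a dictionary whose entries mirror the page-table diagram above, and an unbounded dictionary standing in for the TLB. The names `translate` and `PageFault` are illustrative only.

```python
PAGE_OFFSET_BITS = 12               # assumes 4 KB pages

# VPN -> PFN for valid pages only; values mirror the page-table diagram above
page_table = {0: 0x3A, 1: 0x2F, 3: 0x71}
tlb = {}                            # recent translations; unbounded in this sketch

class PageFault(Exception):
    """Raised when the requested page is not resident in RAM."""

def translate(virtual_address):
    """Return (physical_address, source) where source is 'TLB' or 'page table'."""
    vpn = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn in tlb:                              # fast path: TLB hit
        return (tlb[vpn] << PAGE_OFFSET_BITS) | offset, "TLB"
    if vpn not in page_table:                   # Valid = 0 in the page table
        raise PageFault(f"VPN {vpn} is not in RAM")
    tlb[vpn] = page_table[vpn]                  # cache the translation for next time
    return (tlb[vpn] << PAGE_OFFSET_BITS) | offset, "page table"

pa, source = translate(0x1ABC)
print(hex(pa), source)
pa, source = translate(0x1010)
print(hex(pa), source)
```

The first call walks the page table and prints 0x2fabc; the second finds VPN 1 already in the TLB and prints 0x2f010. Any address in VPN 2 raises `PageFault`, which is where a real OS would go to disk.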

Demand Paging: Pages are loaded into RAM only when accessed (not pre-loaded). This saves RAM — most of a process's pages are never touched.

A page fault takes ~10 ms. At a CPU clock of 3 GHz, that's ~30 million wasted cycles. If page faults happened on even 1% of accesses, the average access time would balloon to ~100 μs, about 1,000× slower than a 100 ns RAM access (and 100,000× slower than a 1 ns cache hit). That's why the OS works incredibly hard to keep the page fault rate below 0.0001%.
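
The cost of page faults follows from a weighted average, which is easy to compute. The sketch below assumes every non-faulting access costs one 100 ns RAM access and every fault costs 10 ms, ignoring caches and the TLB; `effective_access_ns` is a hypothetical helper name.

```python
def effective_access_ns(fault_rate, ram_ns=100, fault_ns=10_000_000):
    """Average access time in ns when a fraction fault_rate of accesses page-fault."""
    return (1 - fault_rate) * ram_ns + fault_rate * fault_ns

for rate in (0.01, 0.000001):
    eat = effective_access_ns(rate)
    print(f"fault rate {rate:.6%}: {eat:,.1f} ns ({eat / 100:,.1f}x a RAM access)")
```

At a 1% fault rate the average is about 100,099 ns, roughly 1,000 RAM accesses' worth of time. At 0.0001% it drops to about 110 ns, only 10% above plain RAM.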

9. Content Addressable Memory (CAM)

Normal memory (RAM): you give an address, it returns data. CAM is the reverse: you give data (a search key), it returns the address/location where that data is stored — in a single clock cycle.

📐 CAM vs RAM — Fundamental Difference

  RAM (Address → Data):               CAM (Data → Address):
  ┌─────────┬──────────┐              ┌──────────┬──────────────┐
  │ Address │   Data   │              │ Search   │   Match?     │
  │    0    │  0xAB    │              │ Key:0xCD │              │
  │    1    │  0xCD ◄──│── Read       │          │ Line 0: No   │
  │    2    │  0xEF    │              │          │ Line 1: YES ◄┤── Found!
  │    3    │  0x12    │              │          │ Line 2: No   │
  └─────────┴──────────┘              │          │ Line 3: No   │
   Input: Address                     └──────────┴──────────────┘
   Output: Data                        Input: Data (search key)
                                       Output: Location (address)

Where is CAM used?

• TLB: search by VPN, get PFN in one cycle
• Network routers: search by IP address for routing table lookup
• Fully associative caches: all tags compared simultaneously = CAM behaviour

TCAM (Ternary CAM): Each bit can be 0, 1, or X (don't care). Used in firewalls and routers for wildcard matching.
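
In software terms, a CAM lookup is "find the index that holds this value". The sketch below models only the result, not the hardware: a real CAM compares all lines in parallel, whereas this loop checks them one by one. The function names and the (value, care_mask) encoding of TCAM entries are assumptions for illustration.

```python
def cam_search(cam, key):
    """Return the index of the line that stores key, or None if absent."""
    # Hardware compares every line at once; this loop only models the result.
    for index, stored in enumerate(cam):
        if stored == key:
            return index
    return None

def tcam_search(entries, key):
    """entries: list of (value, care_mask); a 0 bit in care_mask means 'X'."""
    for index, (value, care_mask) in enumerate(entries):
        if (key & care_mask) == (value & care_mask):
            return index                # first match wins, like a priority encoder
    return None

cam = [0xAB, 0xCD, 0xEF, 0x12]
print(cam_search(cam, 0xCD))

# Entry 0 matches keys of the form 1010XXXX, entry 1 matches anything
tcam = [(0b10100000, 0b11110000), (0b00000000, 0b00000000)]
print(tcam_search(tcam, 0b10101111))
print(tcam_search(tcam, 0b01010101))
```

Searching the CAM for 0xCD returns line 1, matching the diagram. In the TCAM, the key 10101111 matches the wildcard entry 0, and 01010101 falls through to the catch-all entry 1.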

10. DRAM, SSD & HDD — Main & Secondary Storage

DRAM (Dynamic RAM)

DRAM stores each bit as a charge on a tiny capacitor. The charge leaks, so DRAM needs periodic refresh (every ~64 ms). It's cheaper and denser than SRAM (1 transistor + 1 capacitor per bit vs 6 transistors for SRAM), which is why we use DRAM for main memory.

Feature	SRAM (Cache)	DRAM (RAM)
Storage Element	6 transistors (flip-flop)	1 transistor + 1 capacitor
Speed	~1–20 ns	~100 ns
Refresh Needed?	No	Yes (every ~64 ms)
Density	Low (6T per bit)	High (1T1C per bit)
Cost/bit	High	Low
Used For	L1/L2/L3 cache, registers	Main memory (DDR4/DDR5)

SSD (Solid State Drive)

Uses NAND flash memory. No moving parts, so it's shock-resistant and faster than HDD. Data stored in floating-gate transistors that trap electrons. Typical read latency: ~50 μs. Limited write endurance (cells wear out after ~3,000–100,000 write cycles).

HDD (Hard Disk Drive)

Magnetic storage on spinning platters. A mechanical arm moves to the right track (seek time ~5 ms) and waits for the right sector to rotate under it (rotational latency ~4 ms at 7200 RPM). Total access time: ~10 ms. Cheapest ₹/GB but slowest.

  HDD Access Time Breakdown:
  ┌─────────────────────────────────────────────────────────────┐
  │  Seek Time        Rotational Latency     Transfer Time     │
  │  (~5 ms)          (~4.2 ms @ 7200 RPM)   (~0.01 ms)       │
  │  ◄── Arm moves ──►◄── Platter spins ──►◄── Data read ──►  │
  │                                                             │
  │  Total ≈ 9–10 ms per random access                         │
  │  Rotational Latency = (1/2) × (60/RPM) seconds             │
  │  For 7200 RPM: (1/2) × (60/7200) = 4.17 ms                │
  └─────────────────────────────────────────────────────────────┘
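
The breakdown above translates directly into a small calculator. This is a sketch that ignores controller overhead and queueing; the function name `hdd_access_ms`, the 512-byte sector and the 200 MB/s transfer rate are assumed values chosen for illustration.

```python
def hdd_access_ms(seek_ms, rpm, bytes_to_read, transfer_mb_per_s):
    """Average random-access time in ms: seek + half a rotation + transfer."""
    rotational_ms = 0.5 * (60 / rpm) * 1000                     # half a revolution on average
    transfer_ms = bytes_to_read / (transfer_mb_per_s * 1e6) * 1000
    return seek_ms + rotational_ms + transfer_ms

print(f"{hdd_access_ms(5, 7200, 512, 200):.2f} ms")
```

For a 7200 RPM drive with a 5 ms seek this prints 9.17 ms, in line with the 9–10 ms estimate. Seek and rotation dominate; the transfer itself is only a few microseconds.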

Samsung Semiconductor India (Noida & Bangalore) employs 3,000+ engineers working on DRAM and NAND flash design. Samsung is the world's #1 memory chip maker. Their latest DDR5-7200 modules are designed in collaboration with Indian R&D teams and power servers in Google's, Amazon's, and Azure's Indian data centres.

📝 Worked Numerical — Complete GATE-Style Problem

🧮 Problem: 512-Line Direct-Mapped Cache, 32-bit Address

Given:

Cache: 512 lines, direct-mapped
Block size: 4 words (1 word = 4 bytes → block = 16 bytes)
Address: 32-bit, byte-addressable
Cache hit time = 1 ns, miss penalty = 100 ns, hit rate = 0.95

Find: (a) Tag, Line, Word bits (b) Cache data size (c) Total cache size (d) AMAT

Solution:

(a) Address Field Breakdown:

  Block size = 16 bytes → Word Offset = log₂(16) = 4 bits
  Lines = 512 = 2⁹  → Line Index  = 9 bits
  Tag = 32 - 9 - 4   = 19 bits

  ┌────────────┬──────────┬──────────┐
  │  Tag (19)  │ Line (9) │ Word (4) │
  └────────────┴──────────┴──────────┘

(b) Cache Data Size:

  Data = Number of Lines × Block Size
       = 512 × 16 bytes
       = 8,192 bytes = 8 KB

(c) Total Cache Size (including overhead):

  Each line stores: 1 valid bit + 19 tag bits + 128 data bits (16 bytes)
                  = 1 + 19 + 128 = 148 bits per line
  Total = 512 × 148 = 75,776 bits = 9,472 bytes ≈ 9.25 KB

  Overhead = Total - Data = 9.25 KB - 8 KB = 1.25 KB (for tags + valid bits)

(d) Average Memory Access Time (AMAT):

  AMAT = Hit Time + Miss Rate × Miss Penalty
       = 1 + (1 - 0.95) × 100
       = 1 + 0.05 × 100
       = 1 + 5
       = 6 ns

  Without cache: 100 ns. With cache: 6 ns → 16.7× speedup!
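
The whole problem can be re-checked in Python. The sketch below assumes a direct-mapped cache with one valid bit per line and no dirty bit, exactly as in the solution above; `cache_report` is a hypothetical helper name.

```python
from math import log2

def cache_report(address_bits, num_lines, block_bytes,
                 hit_time_ns, miss_penalty_ns, hit_rate):
    """Recompute parts (a) to (d) for a direct-mapped cache."""
    offset_bits = int(log2(block_bytes))
    line_bits = int(log2(num_lines))
    tag_bits = address_bits - line_bits - offset_bits
    data_bytes = num_lines * block_bytes
    bits_per_line = 1 + tag_bits + block_bytes * 8      # valid + tag + data
    total_bits = num_lines * bits_per_line
    amat_ns = hit_time_ns + (1 - hit_rate) * miss_penalty_ns
    return {
        "tag_bits": tag_bits, "line_bits": line_bits, "offset_bits": offset_bits,
        "data_bytes": data_bytes, "total_bits": total_bits, "amat_ns": amat_ns,
    }

r = cache_report(32, 512, 16, hit_time_ns=1, miss_penalty_ns=100, hit_rate=0.95)
print(f"tag/line/offset = {r['tag_bits']}/{r['line_bits']}/{r['offset_bits']}")
print(f"data = {r['data_bytes']} bytes, total = {r['total_bits']} bits")
print(f"AMAT = {r['amat_ns']:.1f} ns")
```

Changing `num_lines` or `block_bytes` is a fast way to generate fresh practice problems and check your hand calculations.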

Cache = Chai Shop at College Gate! ☕ Think of it this way:
• Registers = The chai cup already in your hand (instant access)
• L1 Cache = The chai shop right at the college gate (10 seconds walk)
• L2 Cache = The canteen inside campus (2 minutes walk)
• L3 Cache = The CCD/Starbucks on the main road (10 minutes)
• RAM = Going home to make chai (30 minutes travel)
• HDD = Ordering chai leaves from Amazon and waiting 2 days
You always try the nearest shop first. If it has your favorite Cutting Chai — HIT! If not — MISS, go to the next level. That's exactly how CPU cache works!

Section D

Learn by Doing — 3-Tier Lab Structure

🟢 Tier 1 — GUIDED: Cache Address Decoder (Python)

⏱️ 45–60 minutes | Beginner | Zero prior knowledge assumed

Objective:

Write a Python program that takes a memory address, cache configuration, and outputs the Tag, Line/Set, and Word offset fields.

Step 1: Get User Inputs

Python
# Cache Address Decoder
address_bits = int(input("Enter address width (bits): "))       # e.g., 32
num_lines    = int(input("Enter number of cache lines: "))       # e.g., 512
block_size   = int(input("Enter block size (bytes): "))          # e.g., 16
address_hex  = input("Enter memory address (hex, e.g. 0x1A3F): ")

Step 2: Calculate Bit Fields

Python
import math

word_bits = int(math.log2(block_size))
line_bits = int(math.log2(num_lines))
tag_bits  = address_bits - line_bits - word_bits

print(f"Tag: {tag_bits} bits | Line: {line_bits} bits | Word: {word_bits} bits")

Step 3: Decode the Address

Python
address = int(address_hex, 16)
word_offset = address & ((1 << word_bits) - 1)
line_index  = (address >> word_bits) & ((1 << line_bits) - 1)
tag_value   = address >> (word_bits + line_bits)

print(f"Address: {address_hex} → Tag={tag_value} | Line={line_index} | Word={word_offset}")

Sample run:

Enter address width (bits): 32
Enter number of cache lines: 512
Enter block size (bytes): 16
Enter memory address (hex, e.g. 0x1A3F): 0x0001CAFE
Tag: 19 bits | Line: 9 bits | Word: 4 bits
Address: 0x0001CAFE → Tag=14 | Line=175 | Word=14

🟡 Tier 2 — SEMI-GUIDED: Cache Simulator with Hit/Miss Tracking

⏱️ 90–120 minutes | Intermediate | Hints provided

Mission:

Build a Python cache simulator that takes a reference string and reports hits, misses, and hit rate for direct-mapped and set-associative caches.

Hints:

• Create a list of None values to represent cache lines: cache = [None] * num_lines
• For each reference: compute line = ref % num_lines
• Check if cache[line] == ref → HIT, else → MISS and replace
• For set-associative: use a list of lists. Each set is a list with K slots
• Track hits and misses in counters. Print hit rate at the end

Python
# Starter code: the direct-mapped case works as-is; extend it for set-associative caches
def simulate_direct(refs, num_lines):
    cache = [None] * num_lines
    hits = 0
    for ref in refs:
        line = ref % num_lines
        if cache[line] == ref:
            hits += 1
            print(f"Ref {ref} → Line {line} → HIT")
        else:
            cache[line] = ref
            print(f"Ref {ref} → Line {line} → MISS")
    print(f"Hit Rate: {hits}/{len(refs)} = {hits/len(refs)*100:.1f}%")

refs = [0, 8, 0, 6, 8, 2, 0, 6]
simulate_direct(refs, 4)

Stretch Goal: Extend to 2-way set-associative with FIFO and LRU replacement. Compare hit rates for the same reference string.

🔴 Tier 3 — OPEN CHALLENGE: Full Cache Hierarchy Analyzer

⏱️ 2–3 hours | Advanced | No instructions, design from scratch

The Brief:

Build a complete cache hierarchy simulator that models L1 → L2 → RAM access with:

L1 Cache: Direct-mapped, 64 lines, 4-word blocks
L2 Cache: 4-way set-associative, 256 lines, 8-word blocks, LRU replacement
Input: Read a reference string from a file (at least 100 addresses)
Output: L1 hit rate, L2 hit rate, overall AMAT, total access time
Bonus: Generate a visual trace table showing L1/L2 hits/misses per access

AMAT Formula for 2-level cache:

  AMAT = Hit_Time_L1 + Miss_Rate_L1 × (Hit_Time_L2 + Miss_Rate_L2 × Miss_Penalty_RAM)
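
As a sanity check for your simulator's output, the formula can be written as a small function. This is only a sketch: `amat_two_level` is a hypothetical name, and the L2 miss rate is taken to be the local miss rate (misses in L2 divided by accesses that reach L2).

```python
def amat_two_level(l1_hit_ns, l1_miss_rate, l2_hit_ns, l2_miss_rate, ram_ns):
    """AMAT for L1 -> L2 -> RAM; l2_miss_rate is the local miss rate of L2."""
    l1_miss_penalty = l2_hit_ns + l2_miss_rate * ram_ns     # cost of going past L1
    return l1_hit_ns + l1_miss_rate * l1_miss_penalty

print(f"{amat_two_level(1, 0.05, 10, 0.20, 100):.2f} ns")
```

With L1 = 1 ns at a 5% miss rate, L2 = 10 ns at a 20% miss rate and RAM = 100 ns, this prints 2.50 ns.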

This project is resume-worthy. A working cache simulator with trace output demonstrates deep understanding of computer architecture. Include it in your GitHub with a README and screenshots — VLSI/embedded systems companies actively look for this.

Section E

Practice Problems — Diagrams, Numericals, Industry & GATE

📊 Diagram-Based Questions (3)

Draw the complete memory hierarchy pyramid for a modern smartphone (Snapdragon 8 Gen 3). Label each level with: technology, size, access time, and one real-world example of data stored at that level.

Remember | Diagram

Refer to Section C.1 pyramid. Registers: current instruction operands. L1: loop variable. L2: function's local arrays. L3: shared data structures. RAM: open application data. Flash storage: installed apps, photos and videos (a phone has no HDD, so its pyramid ends at flash).

Draw a detailed block diagram of a 2-way set-associative cache with 8 sets. Show the address field breakdown for a 32-bit address with 64-byte blocks. Label all comparators, MUX, valid bits, tag arrays, and data arrays.

Apply | Diagram

Word offset = log₂(64) = 6 bits. Sets = 8 → Set index = 3 bits. Tag = 32 - 3 - 6 = 23 bits. Two comparators (one per way), both compare tag with incoming 23 bits. Outputs go through OR → HIT signal. MUX selects data from the matching way.

Draw the virtual memory address translation flow diagram showing: CPU → TLB → Page Table → Physical Memory, with the page fault handler path to disk. Include all timing labels.

Understand | Diagram

CPU sends VPN → TLB check (~1 ns). TLB hit: PFN directly → physical address. TLB miss: Page table walk in RAM (~100 ns). Valid=1: get PFN, update TLB. Valid=0: Page fault → OS interrupt → fetch from disk (~10 ms) → update page table → update TLB → retry.

🧮 Numerical Problems (6)

A direct-mapped cache has 1024 lines, block size = 8 words (1 word = 4 bytes), address = 32 bits. Find: (a) Tag, Line, and Byte Offset bits (b) Total cache data storage in KB (c) Total cache size including tag and valid bits.

Apply | Intermediate

(a) Block = 8×4 = 32 bytes. Byte offset = log₂(32) = 5 bits. Lines = 1024 = 2¹⁰ → Line = 10 bits. Tag = 32 - 10 - 5 = 17 bits. (b) Data = 1024 × 32 = 32,768 bytes = 32 KB. (c) Per line: 1 + 17 + 256 = 274 bits. Total = 1024 × 274 = 280,576 bits = 35,072 bytes = 34.25 KB.

A 4-way set-associative cache has 256 total lines, block size = 64 bytes, address = 32 bits. Find: (a) Number of sets (b) Tag, Set, Offset bits (c) Number of tag comparators needed.

Apply | Intermediate

(a) Sets = 256/4 = 64. (b) Offset = log₂(64) = 6 bits. Set index = log₂(64) = 6 bits. Tag = 32 - 6 - 6 = 20 bits. (c) 4 comparators (one per way, all compare in parallel within the selected set).

A system has: L1 hit time = 1 ns, L1 miss rate = 5%, L2 hit time = 10 ns, L2 miss rate = 20%, RAM access time = 100 ns. Calculate the Average Memory Access Time (AMAT).

Apply | GATE

AMAT = 1 + 0.05 × (10 + 0.20 × 100) = 1 + 0.05 × (10 + 20) = 1 + 0.05 × 30 = 1 + 1.5 = 2.5 ns

A virtual memory system has: virtual address = 32 bits, physical address = 28 bits, page size = 4 KB. Find: (a) Number of virtual pages (b) Number of physical frames (c) Page table entries (d) Size of page table if each entry is 4 bytes.

Apply | GATE

(a) Page offset = log₂(4K) = 12 bits. Virtual pages = 2^(32-12) = 2²⁰ = 1,048,576. (b) Physical frames = 2^(28-12) = 2¹⁶ = 65,536. (c) Page table entries = 2²⁰ (one per virtual page). (d) Size = 2²⁰ × 4 = 4 MB.

An HDD spins at 10,000 RPM. Average seek time = 4 ms. Sector size = 512 bytes, transfer rate = 200 MB/s. Calculate average access time for one sector.

Apply | Intermediate

Rotational latency = (1/2) × (60/10000) s = 3 ms. Transfer time = 512 / (200 × 10⁶) s ≈ 0.0026 ms ≈ 0. Access time = Seek + Rotational + Transfer = 4 + 3 + 0 ≈ 7 ms.
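A short sketch of the same calculation, assuming the transfer rate is given in decimal bytes per second (200 MB/s = 200 × 10⁶ B/s). The function name `disk_access_ms` is illustrative.

```python
def disk_access_ms(rpm, seek_ms, sector_bytes, transfer_bytes_per_s):
    """Average access time for one sector: seek + half a rotation + transfer."""
    rotational_ms = 0.5 * 60_000 / rpm                    # half a rotation, in ms
    transfer_ms = sector_bytes / transfer_bytes_per_s * 1000
    return seek_ms + rotational_ms + transfer_ms

print(round(disk_access_ms(10_000, 4, 512, 200e6), 3), "ms")
```

The transfer term is tiny for a single sector, which is why seek time and rotational latency dominate HDD access time.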

A CPU generates 64-bit addresses. The cache is fully associative with 128 lines, block size = 32 bytes. (a) How many tag bits per line? (b) If hit rate = 0.92, hit time = 2 ns, miss penalty = 80 ns, find AMAT. (c) How many comparators are needed?

Apply | Advanced

(a) Byte offset = log₂(32) = 5 bits. No line bits (fully associative). Tag = 64 - 5 = 59 bits. (b) Miss rate = 1 - 0.92 = 0.08. AMAT = 2 + 0.08 × 80 = 2 + 6.4 = 8.4 ns. (c) 128 comparators (one per cache line, all compared in parallel).

🏭 Industry Application Questions (3)

Qualcomm's Snapdragon 8 Gen 3 has a 12 MB L3 cache shared across 8 cores. If each core generates 2 billion memory accesses per second and the L3 hit rate is 70% (for accesses that miss L1+L2), calculate how many RAM accesses per second the L3 cache prevents.

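Analyze | Industry

Simplifying assumption: treat all 8 × 2B = 16 billion accesses/sec as reaching L3 (in practice only the L1+L2 misses do, so this is an upper bound). L3 prevents 0.70 × 16B = 11.2 billion RAM accesses/sec, leaving 0.30 × 16B = 4.8 billion/sec for RAM. Without L3, the memory bus would need to handle 16B accesses/sec instead of 4.8B and would become a bottleneck.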

Samsung's DDR5-7200 has a peak bandwidth of 57.6 GB/s per channel. A server motherboard has 8 channels. If a database workload requires 400 GB/s bandwidth, is this configuration sufficient? What would you recommend?

Evaluate | Industry

Peak bandwidth = 8 × 57.6 = 460.8 GB/s. On paper this is sufficient (400 < 460.8) with ~13% headroom. However, sustained bandwidth is typically 60–70% of peak, giving ~276–322 GB/s, which falls short of the 400 GB/s requirement. Recommendation: use 12 channels (691.2 GB/s peak, ~415–484 GB/s sustained) or add HBM (High Bandwidth Memory) for the most demanding queries.

ISRO's NavIC satellite navigation system needs to store ephemeris data for 7 satellites with 1 ms update rate. Each update is 256 bytes. Design the cache requirements if data must be accessed within 10 ns with 99.9% hit rate.

Create | Industry

Data rate per satellite: 256 bytes/ms = 256 KB/s. Total for 7 satellites: 7 × 256 KB/s = 1,792 KB/s ≈ 1.79 MB/s. Working set: 7 × 256 = 1,792 bytes per update cycle. An L2-level SRAM cache of ~32 KB with fully associative mapping would easily hold all active ephemeris records, and SRAM at this level meets the 10 ns requirement. Pin 7 fixed entries (one per satellite) so the primary data always hits, which comfortably satisfies the 99.9% hit-rate target.

🎯 GATE Previous Year Style Questions (5)

G1 GATE

A direct-mapped cache has 2¹⁴ bytes of data and 2⁶ byte blocks. The address is 32 bits. What is the tag field size in bits?

Apply | GATE CS

✅ Answer: (C) 18. Lines = 2¹⁴/2⁶ = 2⁸ = 256 lines. Byte offset = 6 bits. Line index = 8 bits. Tag = 32 - 8 - 6 = 18 bits.

G2 GATE

Consider a 2-way set-associative cache with 256 cache lines and block size of 4 words (word = 4 bytes). The address length is 32 bits. The size of the tag field is:

18 bits
19 bits
20 bits
21 bits

Apply | GATE CS

✅ Answer: (D) 21 bits. Block = 4×4 = 16 bytes → Offset = 4 bits. Sets = 256/2 = 128 = 2⁷ → Set index = 7 bits. Tag = 32 - 7 - 4 = 21 bits.

G3 GATE

The effective access time of a memory system with cache hit rate h, cache access time t₁, and main memory access time t₂ (using simultaneous access) is:

h × t₁ + (1-h) × t₂
t₁ + (1-h) × t₂
h × t₁ + (1-h) × (t₁ + t₂)
h × (t₁ + t₂) + (1-h) × t₂

Understand | GATE CS

✅ Answer: (A). With simultaneous access (cache and memory accessed at the same time): on a hit, time = t₁ (the memory request is cancelled). On a miss, time = t₂. Effective = h×t₁ + (1-h)×t₂. Note: with hierarchical access (check cache first, then memory), the formula is t₁ + (1-h)×t₂, which is option (B); option (C) expands to the same expression.
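The two access models differ only in whether a miss pays the cache time as well. A minimal sketch with illustrative values (h = 0.9, cache = 10 ns, memory = 100 ns, chosen here as an assumption for the demo):

```python
def eat_simultaneous(h, t_cache, t_mem):
    """Cache and memory are probed in parallel; a miss costs only t_mem."""
    return h * t_cache + (1 - h) * t_mem

def eat_hierarchical(h, t_cache, t_mem):
    """Cache is checked first; a miss costs t_cache + t_mem."""
    return t_cache + (1 - h) * t_mem

print(eat_simultaneous(0.9, 10, 100))   # about 19 ns
print(eat_hierarchical(0.9, 10, 100))   # about 20 ns
```

The gap between the two is exactly (1-h) × t_cache, so it shrinks as the hit ratio rises.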

G4 GATE

In a virtual memory system with page size of 4 KB, a process has a virtual address space of 2³² bytes. The physical memory is 2²⁸ bytes. How many entries does the page table have?

2¹⁶
2²⁰
2²⁴
2²⁸

Apply | GATE CS

✅ Answer: (B) 2²⁰. Page table has one entry per virtual page. Pages = 2³²/2¹² = 2²⁰ = 1,048,576 entries. Physical memory size determines the number of bits in each entry (frame number), not the number of entries.

G5 GATE

A CPU generates 20-bit addresses. The main memory access time is 100 ns. The cache access time is 10 ns with a hit ratio of 0.9. Using hierarchical access, the effective memory access time is:

20 ns
19 ns
110 ns
91 ns

Apply | GATE CS

✅ Answer: (A) 20 ns. Hierarchical: Effective = t_cache + (1-h) × t_memory = 10 + (1-0.9) × 100 = 10 + 10 = 20 ns.

Section F

MCQ Assessment Bank — 30 Questions (Bloom's Mapped)

Remember / Identify (Q1–Q5)

Which memory is fastest in the memory hierarchy?

DRAM
Cache (SRAM)
CPU Registers
SSD

Remember

✅ Answer: (C) CPU Registers — Access time ~0.3 ns, directly inside the CPU with zero latency for ALU operations.

SRAM is used in cache memory because:

It is cheaper than DRAM
It is faster and doesn't need refresh
It has higher density
It uses capacitors for storage

Remember

✅ Answer: (B) — SRAM uses flip-flops (6 transistors), doesn't need refresh, and is faster (~1-20 ns) than DRAM (~100 ns). It's more expensive but speed is the priority for cache.

In a direct-mapped cache, the address is divided into:

Tag and Offset
Tag, Line, and Word Offset
Tag and Set
Page Number and Offset

Remember

✅ Answer: (B) — Direct-mapped: [Tag | Line/Index | Word/Byte Offset]. The Line bits select which cache line, Tag identifies the block, Offset selects the byte within the block.

Which memory needs periodic refresh?

SRAM
DRAM
ROM
Flash

Remember

✅ Answer: (B) DRAM — Stores bits as charges on capacitors that leak over time. Must be refreshed every ~64 ms to retain data.

TLB stands for:

Translation Lookaside Buffer
Table Lookup Block
Transfer Line Buffer
Tag Line Base

Remember

✅ Answer: (A) Translation Lookaside Buffer — A small, fast cache that stores recent virtual-to-physical address translations to speed up memory access.

Understand / Explain (Q6–Q10)

Why does increasing cache associativity reduce conflict misses?

It increases cache size
It allows a block to be placed in multiple locations
It makes the cache faster
It reduces the block size

Understand

✅ Answer: (B) — Higher associativity means more "ways" per set. A block has more placement options, reducing the chance of two blocks fighting for the same slot (conflict miss).

What is the principle of temporal locality?

If a memory location is accessed, nearby locations will also be accessed
If a memory location is accessed, it will likely be accessed again soon
Memory should be accessed in sequential order
Frequently accessed data should be stored on disk

Understand

✅ Answer: (B) — Temporal locality: recently accessed data will likely be accessed again soon (e.g., loop counters, frequently called functions). Option (A) describes spatial locality.

In write-back policy, when is data written to main memory?

On every write operation
Only when the cache line is evicted (replaced)
At fixed time intervals
When the CPU is idle

Understand

✅ Answer: (B) — Write-back writes to cache only. The dirty bit tracks modifications. Data is written to main memory only when the line is evicted and dirty bit = 1.

What happens during a page fault?

Cache line is replaced
TLB is flushed
Required page is loaded from disk to RAM by the OS
CPU clock speed is reduced

Understand

✅ Answer: (C) — Page fault = referenced page is not in RAM (valid bit = 0). OS handles the interrupt: finds the page on disk, loads it into a free frame in RAM, updates the page table, and resumes the instruction.

Q10

Why is fully associative mapping expensive to implement?

It needs more cache lines
It requires a comparator for every cache line
It needs larger block sizes
It requires more address bits

Understand

✅ Answer: (B) — Every incoming tag must be compared with ALL cache lines simultaneously, requiring N comparators for N lines. Hardware cost is proportional to cache size.

Apply / Calculate (Q11–Q20)

Q11

A direct-mapped cache has 256 lines, block size = 32 bytes, address = 32 bits. The number of tag bits is:

Apply

✅ Answer: (B) 19. Byte offset = log₂(32) = 5. Line index = log₂(256) = 8. Tag = 32 - 8 - 5 = 19 bits.

Q12

A cache has hit rate = 0.96, hit time = 2 ns, miss penalty = 50 ns. The AMAT is:

4 ns
3 ns
5 ns
4 ns

Apply

✅ Answer: (A) 4 ns. AMAT = 2 + (1-0.96) × 50 = 2 + 0.04 × 50 = 2 + 2 = 4 ns.

Q13

In a 4-way set-associative cache with 512 total lines, the number of sets is:

Apply

✅ Answer: (B) 128. Sets = Total Lines / Associativity = 512 / 4 = 128.

Q14

A virtual memory system has 20-bit virtual addresses, page size = 1 KB. The number of page table entries is:

512
1024
2048
4096

Apply

✅ Answer: (B) 1024. Page offset = log₂(1K) = 10 bits. Virtual pages = 2^(20-10) = 2¹⁰ = 1024.

Q15

A fully associative cache has 64 lines, block size = 16 bytes, address = 32 bits. The tag size is:

24 bits
26 bits
28 bits
30 bits

Apply

✅ Answer: (C) 28 bits. Byte offset = log₂(16) = 4. No line index (fully associative). Tag = 32 - 4 = 28 bits.

Q16

A 2-level cache system has: L1 access = 1 ns (miss rate 10%), L2 access = 10 ns (miss rate 5%), RAM access = 200 ns. What is the AMAT?

2 ns
3 ns
2 ns
12 ns

Apply | GATE

✅ Answer: (B) 3 ns. AMAT = 1 + 0.10 × (10 + 0.05 × 200) = 1 + 0.10 × (10+10) = 1 + 0.10 × 20 = 1 + 2 = 3 ns.

Q17

An HDD rotates at 7200 RPM. The average rotational latency is approximately:

2.08 ms
4.17 ms
8.33 ms
16.67 ms

Apply

✅ Answer: (B) 4.17 ms. One rotation = 60/7200 = 8.33 ms. Average rotational latency = half a rotation = 8.33/2 = 4.17 ms.

Q18

Cache size = 64 KB, block size = 64 bytes. The number of cache lines is:

512
1024
2048
4096

Apply

✅ Answer: (B) 1024. Lines = Cache Size / Block Size = 64 KB / 64 B = 65536/64 = 1024.

Q19

The effective memory access time with hit ratio h=0.9, cache time=10ns, memory time=100ns (hierarchical access) is:

19 ns
20 ns
28 ns
100 ns

Apply

✅ Answer: (B) 20 ns. Hierarchical: T_eff = t_cache + (1-h)×t_memory = 10 + 0.1×100 = 10 + 10 = 20 ns.

Q20

A page table has 2²⁰ entries, each entry is 4 bytes. The total page table size is:

1 MB
2 MB
4 MB
8 MB

Apply

✅ Answer: (C) 4 MB. Size = 2²⁰ × 4 bytes = 4 × 2²⁰ = 4 MB.

Analyze / Compare (Q21–Q25)

Q21

Which cache mapping has the highest conflict miss rate for a given cache size?

Direct mapped
2-way set-associative
4-way set-associative
Fully associative

Analyze

✅ Answer: (A) Direct mapped — each block has exactly one possible location, so conflict misses are maximum. Fully associative has zero conflict misses.

Q22

Increasing block size in a cache initially reduces miss rate but then increases it. This increase is due to:

Increased hit time
Increased conflict misses and reduced number of lines
Decreased tag bits
Increased write-back overhead

Analyze

✅ Answer: (B) — Larger blocks exploit spatial locality (reducing compulsory misses), but for a fixed cache size, fewer lines means more conflicts. Also, larger blocks mean more unused data brought in (pollution).

Q23

In a multiprocessor system, which write policy simplifies cache coherence?

Write-back
Write-through
Write-allocate
Write-no-allocate

Analyze

✅ Answer: (B) Write-through — main memory always has the latest data, making it easier for other processors to see consistent values. Write-back requires coherence protocols (MESI, MOESI).

Q24

Which replacement policy can suffer from Bélády's anomaly?

LRU
FIFO
Optimal
Random

Analyze | GATE

✅ Answer: (B) FIFO — Bélády's anomaly: increasing the number of cache lines can actually increase the miss rate with FIFO replacement. LRU and Optimal are stack algorithms and immune to this anomaly.
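The anomaly is easy to reproduce. The sketch below uses the classic textbook reference string 1 2 3 4 1 2 5 1 2 3 4 5 (an assumption chosen for the demo, not taken from this chapter) and counts misses for FIFO and LRU with 3 and 4 frames. The function names are illustrative.

```python
from collections import OrderedDict, deque

def fifo_faults(refs, frames):
    """Count misses with FIFO replacement."""
    resident, queue, faults = set(), deque(), 0
    for page in refs:
        if page in resident:
            continue                                  # hit: FIFO order is unchanged
        faults += 1
        if len(resident) == frames:
            resident.discard(queue.popleft())         # evict the oldest arrival
        resident.add(page)
        queue.append(page)
    return faults

def lru_faults(refs, frames):
    """Count misses with LRU replacement."""
    resident, faults = OrderedDict(), 0
    for page in refs:
        if page in resident:
            resident.move_to_end(page)                # hit: mark as most recently used
            continue
        faults += 1
        if len(resident) == frames:
            resident.popitem(last=False)              # evict the least recently used
        resident[page] = True
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print("FIFO:", fifo_faults(refs, 3), fifo_faults(refs, 4))   # FIFO: 9 10
print("LRU: ", lru_faults(refs, 3), lru_faults(refs, 4))     # LRU:  10 8
```

FIFO gets worse when a frame is added (9 → 10 misses), while LRU improves (10 → 8), matching the stack-algorithm argument above.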

Q25

Why is the TLB typically fully associative despite the high hardware cost?

It needs to store large pages
It has very few entries and must maximize hit rate
It operates at disk speed
It replaces the page table entirely

Analyze

✅ Answer: (B) — TLB is small (32–128 entries), so the hardware cost of full associativity is manageable. But a TLB miss is extremely expensive (page table walk in RAM), so maximizing hit rate is critical.

Evaluate / Create (Q26–Q30)

Q26

A system architect must choose between a 16 KB direct-mapped cache and an 8 KB 2-way set-associative cache. Assuming the workload has significant conflict misses, which is likely better?

16 KB direct-mapped (bigger is always better)
8 KB 2-way set-associative (fewer conflicts)
Both perform identically
Cannot determine without the workload

Evaluate

✅ Answer: (D) — While 2-way reduces conflicts, the 16 KB direct-mapped has 2× more lines. The answer depends on the specific access pattern. This is why cache simulation is essential in real design.

Q27

If a system's page fault rate increases from 0.001% to 0.01%, and each page fault costs 10 ms, the impact on effective access time is:

Negligible (< 1% change)
Significant (~10× increase in fault overhead)
System crashes
Only affects disk performance

Evaluate

✅ Answer: (B). Page fault overhead at 0.001%: 0.00001 × 10 ms = 0.0001 ms = 100 ns per access. At 0.01%: 0.0001 × 10 ms = 0.001 ms = 1000 ns = 1 μs. That's a 10× increase, which is very significant for high-performance systems.
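The percent-to-fraction conversion is where this calculation usually goes wrong, so it helps to script it. A small sketch (the function name `fault_overhead_ns` is a made-up helper; the 10 ms default is the fault cost from the question):

```python
def fault_overhead_ns(fault_rate_percent, fault_cost_ms=10):
    """Average page-fault overhead added to every memory access, in ns."""
    fault_fraction = fault_rate_percent / 100         # 0.001% -> 0.00001
    return fault_fraction * fault_cost_ms * 1_000_000  # ms -> ns

low = fault_overhead_ns(0.001)
high = fault_overhead_ns(0.01)
print(round(low), "ns ->", round(high), "ns, ratio", round(high / low))
```

Compared with a RAM access of about 100 ns, even the lower fault rate roughly doubles the average access cost.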

Q28

To design a cache that eliminates all conflict misses, you would choose:

Direct-mapped with large blocks
Fully associative mapping
Set-associative with 2 ways
Write-back policy

Create

✅ Answer: (B) — Fully associative has zero conflict misses because any block can go anywhere. Only compulsory and capacity misses remain.

Q29

Which approach would most effectively reduce TLB misses for a workload with a 2 GB working set?

Increase TLB entries from 64 to 128
Use larger page sizes (2 MB instead of 4 KB)
Add another TLB level
Both (B) and (C)

Evaluate | GATE

✅ Answer: (D) — Larger pages: 2 GB / 2 MB = 1024 pages vs 2 GB / 4 KB = 524,288 pages. Fewer pages = TLB can cover more of the working set. A second-level TLB catches misses from L1 TLB without going to the page table in RAM.
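The effect of page size is clearest when expressed as "TLB reach", the amount of memory the TLB can map at once. A hedged sketch using the 64-entry TLB from option (A); the helper names are illustrative:

```python
KB, MB, GB = 1024, 1024**2, 1024**3

def tlb_reach_bytes(entries, page_bytes):
    """Memory covered by the TLB when every entry maps one page."""
    return entries * page_bytes

def pages_needed(working_set_bytes, page_bytes):
    """Number of pages required to hold the working set (rounded up)."""
    return -(-working_set_bytes // page_bytes)

for page in (4 * KB, 2 * MB):
    print(page, pages_needed(2 * GB, page), tlb_reach_bytes(64, page) // KB, "KB reach")
```

With 4 KB pages a 64-entry TLB covers only 256 KB of the 2 GB working set; with 2 MB pages it covers 128 MB.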

Q30

A chip designer has a transistor budget of 500K transistors for cache. SRAM uses 6 transistors/bit. What is the maximum data capacity of the cache?

~10 KB
~64 KB
~128 KB
~83 KB

Create

✅ Answer: (A) ~10 KB. Total bits = 500,000 / 6 ≈ 83,333 bits ≈ 10,416 bytes ≈ 10 KB. Note: this is data only; tags, valid bits, and control logic need additional transistors.

Section G

Short Answer Questions (8)

SA1

Define the hit ratio and explain why a hit ratio of 0.95 vs 0.90 can make a significant difference in AMAT. Provide a numerical example.

Answer: Hit ratio h = (Number of cache hits) / (Total memory accesses). Example: Cache time = 1 ns, RAM = 100 ns. At h=0.90: AMAT = 1 + 0.10×100 = 11 ns. At h=0.95: AMAT = 1 + 0.05×100 = 6 ns. A 5-percentage-point improvement in hit rate gives a ~45% reduction in AMAT (11→6 ns). This is because each miss is extremely expensive (100× the hit time), so even small improvements in hit rate yield large performance gains.

SA2

Explain the difference between compulsory, capacity, and conflict misses (the 3 C's of cache misses).

Answer: Compulsory (cold-start) misses: First access to a block — it has never been in cache. Unavoidable. Capacity misses: The working set is larger than cache — blocks get evicted and re-fetched. Solved by increasing cache size. Conflict misses: Multiple blocks map to the same line/set (in direct-mapped or set-associative). Solved by increasing associativity. A fully associative cache has zero conflict misses.

SA3

Distinguish between SRAM and DRAM with respect to storage cell structure, speed, cost, refresh requirement, and usage.

Answer: SRAM: 6 transistors (flip-flop), ~1-20 ns, expensive, no refresh, used for cache. DRAM: 1 transistor + 1 capacitor, ~100 ns, cheap, needs refresh every ~64 ms (charge leaks), used for main memory (RAM). SRAM is ~5-10× faster but ~10-20× more expensive per bit than DRAM.

SA4

What is a TLB and why is it typically fully associative? What happens on a TLB miss?

Answer: TLB (Translation Lookaside Buffer) is a small, fast cache (32-128 entries) that stores recent virtual page → physical frame translations. It's fully associative because: (1) it's small (manageable hardware cost), and (2) TLB misses are very expensive (page table walk in RAM: ~100 ns). Maximizing hit rate is critical. On TLB miss: the hardware/OS walks the page table in RAM to find the mapping. If the page is valid → PFN found, TLB updated. If page is invalid → page fault, OS loads page from disk.

SA5

Write the AMAT formula for a 2-level cache system and calculate AMAT for: L1 time=1ns (miss rate 8%), L2 time=10ns (miss rate 20%), RAM=200ns.

Answer: AMAT = T_L1 + MR_L1 × (T_L2 + MR_L2 × T_RAM) = 1 + 0.08 × (10 + 0.20 × 200) = 1 + 0.08 × (10 + 40) = 1 + 0.08 × 50 = 1 + 4 = 5 ns. Without any cache: 200 ns. The 2-level cache provides a 40× speedup.

SA6

Explain demand paging and how it differs from pre-paging. What are the advantages of demand paging?

Answer: Demand paging: Pages are loaded into RAM only when referenced (on first access → page fault → load). Pre-paging: Pages are loaded before they are needed (prefetching). Advantages of demand paging: (1) Saves memory — only needed pages are in RAM, (2) Faster program startup — don't wait to load entire program, (3) Allows running programs larger than physical memory. Disadvantage: initial page faults cause delays.

SA7

Compare write-through and write-back cache policies with respect to: speed, consistency, dirty bit usage, and suitability for multiprocessor systems.

Answer: Write-through: Writes to both cache and RAM simultaneously. Slower (every write hits RAM), but cache and RAM are always consistent. No dirty bit needed. Better for multiprocessor (coherence is simpler). Write-back: Writes only to cache. Faster (no RAM access on every write). Needs dirty bit to track modified lines. Inconsistent until eviction. Preferred for single-processor performance. Most modern CPUs use write-back for L1/L2 with coherence protocols (MESI) for multiprocessor.
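The speed difference comes down to how many writes actually reach RAM. The sketch below counts them for a stream of CPU writes under simplifying assumptions that are not from the text: a direct-mapped cache, write-allocate on a miss, a writes-only stream (so every resident line is dirty), and a final flush of dirty lines at the end. The function name `ram_writes` is illustrative.

```python
def ram_writes(write_blocks, num_lines, policy):
    """Count writes that reach RAM for a stream of CPU writes (one block id each)."""
    if policy == "write-through":
        return len(write_blocks)          # every CPU write also goes to RAM
    if policy != "write-back":
        raise ValueError(f"unknown policy: {policy}")
    lines = {}                            # line index -> block id held (dirty)
    count = 0
    for block in write_blocks:
        index = block % num_lines
        if index in lines and lines[index] != block:
            count += 1                    # evicting a dirty line writes it back
        lines[index] = block              # write-allocate, line becomes dirty
    return count + len(lines)             # flush the remaining dirty lines

stream = [0] * 1000 + [4] * 1000          # two hot blocks that share line 0
print(ram_writes(stream, 4, "write-through"))   # 2000
print(ram_writes(stream, 4, "write-back"))      # 2
```

Repeated writes to the same block are absorbed by the cache under write-back, which is why it is preferred when write traffic has good temporal locality.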

SA8

What is Content Addressable Memory (CAM)? How does it differ from conventional RAM? Where is it used?

Answer: RAM: Input = address → Output = data at that address. CAM: Input = data (search key) → Output = address/location where that data exists. CAM compares the search key against all stored entries simultaneously (parallel search) in a single clock cycle. Used in: TLBs (search by virtual page number), fully associative caches (parallel tag comparison), network routers (IP lookup tables), and firewalls. TCAM (Ternary CAM) adds "don't care" bits for wildcard matching.
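The contrast in lookup direction can be shown with a small software model. This is only a behavioural sketch: the Python loops scan entries one by one, whereas real CAM hardware compares all entries in parallel. The function names and the (value, care_mask) encoding for TCAM entries are assumptions made for illustration.

```python
def ram_read(memory, address):
    """RAM: address in, data out."""
    return memory[address]

def cam_search(memory, key):
    """CAM: data in, matching locations out."""
    return [i for i, word in enumerate(memory) if word == key]

def tcam_search(entries, key):
    """TCAM: entries are (value, care_mask); a 0 bit in care_mask is "don't care"."""
    return [i for i, (value, mask) in enumerate(entries)
            if (key & mask) == (value & mask)]

memory = [7, 3, 9, 3]
print(ram_read(memory, 2))       # 9
print(cam_search(memory, 3))     # [1, 3]

# Two prefix-style rules: match on the top 4 bits, or on the top 2 bits.
rules = [(0b1010_0000, 0b1111_0000), (0b1100_0000, 0b1100_0000)]
print(tcam_search(rules, 0b1010_0110))   # [0]
```

A TLB lookup corresponds to `cam_search` with the virtual page number as the key.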

Section H

Long Answer Questions (3)

📝 LA1: Compare all three cache mapping techniques with diagrams, formulas, advantages, disadvantages, and real-world usage (15 marks)

Model Answer Structure:

1. Direct Mapping

Address: [Tag | Line Index | Word Offset]

Formula: Cache Line = Block Number mod (Number of Lines)

Each block has exactly ONE possible cache line.

Advantages: Simple hardware (1 comparator), fast lookup, cheap.

Disadvantages: High conflict miss rate — two blocks mapping to the same line cause thrashing.

Used in: Simple embedded systems, L1 cache in some older designs.

2. Fully Associative Mapping

Address: [Tag | Word Offset] — NO line index field.

Any block can go in ANY cache line.

Advantages: Zero conflict misses, highest flexibility.

Disadvantages: Needs N comparators (one per line), expensive hardware, slower for large caches.

Used in: TLBs (small, need high hit rate), small special-purpose caches.

3. Set-Associative Mapping

Address: [Tag | Set Index | Word Offset]

Formula: Set = Block Number mod (Number of Sets). Within a set, block can go in any way.

K-way: Each set has K lines. Needs K comparators (manageable).

Advantages: Balance of conflict reduction and hardware cost. Optimal for most workloads.

Disadvantages: More complex than direct, slightly slower than direct (K-way comparison).

Used in: L1/L2/L3 caches in virtually all modern CPUs (typically 2-way to 16-way).

  Comparison Summary:
  ┌────────────────┬────────┬──────────────┬───────────────────┐
  │ Feature        │ Direct │ Fully Assoc. │ K-Way Set-Assoc.  │
  ├────────────────┼────────┼──────────────┼───────────────────┤
  │ Placement      │ Fixed  │ Anywhere     │ Within a set      │
  │ Comparators    │ 1      │ N            │ K (per set)       │
  │ Conflict Miss  │ High   │ None         │ Low               │
  │ Hardware Cost  │ Low    │ Very High    │ Medium            │
  │ Flexibility    │ Low    │ Very High    │ High              │
  │ Hit Rate       │ Lower  │ Highest      │ Near-highest      │
  └────────────────┴────────┴──────────────┴───────────────────┘
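One function can model all three mappings, because direct-mapped is 1-way and fully associative is N-way with a single set. The sketch below is a minimal miss counter, assuming LRU replacement within each set and a trace of block numbers (both are illustrative choices; the name `count_misses` is hypothetical).

```python
from collections import OrderedDict

def count_misses(block_trace, total_lines, ways):
    """Count misses for a K-way set-associative cache with LRU replacement.

    ways = 1 gives direct mapping; ways = total_lines gives fully associative.
    """
    num_sets = total_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_trace:
        cache_set = sets[block % num_sets]     # Set = Block Number mod Number of Sets
        if block in cache_set:
            cache_set.move_to_end(block)       # hit: refresh LRU position
        else:
            misses += 1
            if len(cache_set) == ways:
                cache_set.popitem(last=False)  # evict the LRU block in this set
            cache_set[block] = True
    return misses

trace = [0, 8] * 10                            # two blocks that collide in an 8-line cache
for ways in (1, 2, 8):
    print(ways, "way:", count_misses(trace, total_lines=8, ways=ways))
```

With this trace the direct-mapped cache thrashes (20 misses), while the 2-way and fully associative caches see only the 2 compulsory misses.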

📝 LA2: Explain virtual memory organisation with page table, TLB, and demand paging. Include a complete address translation diagram (15 marks)

Model Answer should cover:

Virtual Memory Concept: Each process gets its own virtual address space (e.g., 4 GB for 32-bit). Physical RAM is shared. The OS + hardware translate virtual → physical addresses transparently.
Page Table: A data structure (one per process) stored in RAM. Maps virtual page numbers (VPN) to physical frame numbers (PFN). Each entry has: valid bit, dirty bit, frame number, permission bits.
TLB: A fast hardware cache (fully associative, ~32–128 entries) that stores recent VPN→PFN translations. Hit rate typically 99%+. Prevents expensive page table walks for most accesses.
Demand Paging: Pages loaded only when accessed. Page fault → OS interrupt → load from disk → update page table → resume. Process starts with zero pages in RAM.
Address Translation Flow: CPU sends virtual address → TLB check (1 ns). Hit → PFN directly. Miss → Page table walk (100 ns). Valid → get PFN, update TLB. Invalid → Page fault → Disk (10 ms) → load page → update page table → update TLB → retry.
Page Replacement: When RAM is full: LRU, FIFO, or Clock algorithm selects a victim page. If dirty → write back to disk first.

Include the full address translation diagram from Section C.8.

📝 LA3: Solve a comprehensive cache design problem with AMAT calculation for a 2-level cache system (15 marks)

Problem: A processor has a 2-level cache system:

L1: Direct-mapped, 128 lines, 32-byte blocks, hit time = 1 ns, miss rate = 10%
L2: 4-way set-associative, 1024 lines, 64-byte blocks, hit time = 8 ns, miss rate = 5% (local)
RAM access time: 100 ns. Address: 32-bit, byte-addressable.

Find: (a) L1 address breakdown (tag/line/offset) (b) L2 address breakdown (c) AMAT (d) Speedup over no cache (e) If L1 miss rate improves to 5%, new AMAT and % improvement.

Solution:

(a) L1: Block = 32 bytes → offset = 5 bits
    Lines = 128 = 2⁷ → index = 7 bits
    Tag = 32 - 7 - 5 = 20 bits → [20|7|5]

(b) L2: Block = 64 bytes → offset = 6 bits
    Sets = 1024/4 = 256 = 2⁸ → set index = 8 bits
    Tag = 32 - 8 - 6 = 18 bits → [18|8|6]

(c) AMAT = T_L1 + MR_L1 × (T_L2 + MR_L2 × T_RAM)
         = 1 + 0.10 × (8 + 0.05 × 100)
         = 1 + 0.10 × (8 + 5)
         = 1 + 0.10 × 13
         = 1 + 1.3 = 2.3 ns

(d) Speedup = RAM_time / AMAT = 100 / 2.3 = 43.5×

(e) New AMAT = 1 + 0.05 × 13 = 1 + 0.65 = 1.65 ns
    Improvement = (2.3 - 1.65) / 2.3 × 100 = 28.3%
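Parts (c) to (e) can be double-checked with a few lines of Python. A minimal sketch using the problem's numbers; the function name `two_level_amat` is illustrative.

```python
def two_level_amat(t_l1, mr_l1, t_l2, mr_l2, t_ram):
    """AMAT = T_L1 + MR_L1 x (T_L2 + MR_L2 x T_RAM), with a local L2 miss rate."""
    return t_l1 + mr_l1 * (t_l2 + mr_l2 * t_ram)

base = two_level_amat(1, 0.10, 8, 0.05, 100)       # part (c)
speedup = 100 / base                               # part (d): RAM time / AMAT
improved = two_level_amat(1, 0.05, 8, 0.05, 100)   # part (e): L1 miss rate 5%
gain_percent = (base - improved) / base * 100

print(f"AMAT {base:.2f} ns, speedup {speedup:.1f}x")
print(f"new AMAT {improved:.2f} ns, improvement {gain_percent:.1f}%")
```

Halving the L1 miss rate halves only the miss term (1.3 → 0.65 ns); the fixed 1 ns hit time is why the overall gain is about 28% and not 50%.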

Section I

Industry Spotlight — A Day in the Life

👨‍💻 Vikram Sahu, 32 — Cache Design Engineer at Samsung Semiconductor, Bangalore

Background: B.Tech (ECE) from NIT Bhopal. M.Tech from IIT Madras (VLSI). Joined Samsung Semiconductor India (SSIR) as a campus hire. Now leads a team of 6 engineers designing L2 cache controllers for Exynos mobile processors.

A Typical Day:

8:30 AM — Morning sync with the Seoul (Korea) team. Review overnight simulation results for the new Exynos 2500 L2 cache design. A corner-case coherence bug was found — discuss fix approaches.

9:30 AM — Write RTL (Register Transfer Level) code in SystemVerilog for a new cache replacement algorithm. Samsung is exploring RRIP (Re-Reference Interval Prediction) to replace LRU.

11:00 AM — Run synthesis and timing analysis using Synopsys Design Compiler. Target: L2 hit time ≤ 4 ns at 3.5 GHz. Current design meets timing with 200 ps slack.

1:00 PM — Lunch at Samsung's Bangalore campus. Discuss power-performance trade-offs with the power management team. Every picojoule per cache access matters for phone battery life.

2:00 PM — Run cache trace simulations using SPEC CPU2017 benchmarks. Compare 4-way vs 8-way L2 on workload mix: Chrome, WhatsApp, games, camera app. 8-way gives 2% higher hit rate but 15% more power.

4:30 PM — Code review for a junior engineer's TLB prefetcher design. Suggest optimisations for reducing TLB miss penalty from 20 ns to 14 ns.

6:00 PM — Write a technical report comparing the Exynos cache hierarchy with Snapdragon 8 Gen 3. Present findings to the architecture team in Seoul next week.

Detail	Info
Tools Used Daily	SystemVerilog, Synopsys VCS, Design Compiler, GEM5 simulator, Python (scripting), Perforce (version control)
Entry Salary (India)	₹10–15 LPA (M.Tech) / ₹6–8 LPA (B.Tech)
Mid-Level (5–8 yrs)	₹20–35 LPA
Senior (10+ yrs)	₹40–80 LPA + RSUs
Companies Hiring (India)	Samsung SSIR, Qualcomm Hyderabad, Intel Bangalore, AMD Hyderabad, ARM Bangalore, Texas Instruments, MediaTek Noida, NVIDIA

Section J

Earn With It — Memory Optimization Skills

💰 Your Earning Path After This Chapter

Portfolio Piece: A working cache simulator (Python) with trace output + a technical blog post explaining cache mapping with diagrams — hosted on GitHub.

Skill Paths Unlocked:

• Embedded Systems (Immediate): Optimise memory usage in Arduino/ESP32 projects. Freelance IoT gigs: ₹3,000–₹10,000/project

• VLSI/SoC Design (After GATE/M.Tech): Cache controller design at Samsung, Qualcomm, Intel. Entry: ₹10–15 LPA

• Systems Programming: Write cache-friendly C/C++ code. Performance optimisation gigs: ₹5,000–₹20,000/project

• Technical Content Writing: Write COA tutorials for GeeksforGeeks, Naukri, or Unstop. ₹500–₹2,000/article

Opportunity	Skills Needed	Platform	Earning Potential
COA Tutorial Writer	Cache concepts + writing	GeeksforGeeks, Medium	₹500–₹2,000/article
Embedded IoT Projects	C/C++, memory optimization	Freelancer, Internshala	₹3,000–₹10,000/project
GATE Coaching Assistant	COA + numerical solving	Unacademy, Physics Wallah	₹5,000–₹15,000/month
Performance Tuning	Cache-aware coding	Upwork, Toptal	$25–$75/hour

Start with technical writing. Write 5 well-explained COA articles with diagrams on Medium or Dev.to. Apply to GeeksforGeeks as a technical writer (₹500–₹2,000/article). This builds your portfolio AND earns money while you're still learning.

Section K

Chapter Summary — Memory Unit at a Glance

🧠 Key Takeaways

Memory Hierarchy: Registers → L1 → L2 → L3 → RAM → SSD → HDD. Faster = smaller = costlier.
Locality of Reference: Temporal (reuse recently accessed) + Spatial (access nearby addresses). The foundation of caching.
Cache Mapping: Direct (simple, conflict-prone), Fully Associative (flexible, expensive), Set-Associative (practical balance).
Address Fields: Direct: [Tag|Line|Offset]. Associative: [Tag|Offset]. Set-Assoc: [Tag|Set|Offset].
AMAT = Hit Time + Miss Rate × Miss Penalty. For multi-level: recurse into each level.
Write Policies: Write-through (consistent, slow) vs Write-back (fast, needs dirty bit).
Virtual Memory: Gives each process its own address space. Page table maps VPN→PFN. TLB caches translations.
Page Fault: Referenced page not in RAM → OS loads from disk (~10 ms). Must be extremely rare (<0.001%).
SRAM vs DRAM: SRAM = fast/expensive (cache). DRAM = slow/cheap/needs refresh (main memory).
CAM: Searches by content, not address. Used in TLBs and fully associative caches.

📋 Essential Formulas

  ┌────────────────────────────────────────────────────────────────┐
  │ AMAT = Hit_Time + Miss_Rate × Miss_Penalty                   │
  │                                                                │
  │ 2-Level AMAT = T₁ + MR₁ × (T₂ + MR₂ × T_RAM)               │
  │                                                                │
  │ Tag bits (Direct)    = n - log₂(Lines) - log₂(Block)          │
  │ Tag bits (Assoc)     = n - log₂(Block)                        │
  │ Tag bits (Set-Assoc) = n - log₂(Sets) - log₂(Block)           │
  │                                                                │
  │ Sets = Total_Lines / Associativity                             │
  │ Cache_Data = Lines × Block_Size                                │
  │                                                                │
  │ Virtual Pages = 2^(VA_bits - Offset_bits)                      │
  │ Physical Frames = 2^(PA_bits - Offset_bits)                    │
  │ Page Table Size = Num_Virtual_Pages × Entry_Size               │
  │                                                                │
  │ Rotational Latency = (1/2) × (60/RPM) seconds                 │
  │ Disk Access = Seek + Rotational_Latency + Transfer_Time        │
  │                                                                │
  │ Effective Access Time (Hierarchical) = t_c + (1-h) × t_m      │
  │ Effective Access Time (Simultaneous) = h×t_c + (1-h) × t_m    │
  └────────────────────────────────────────────────────────────────┘

Section L

Earning Checkpoint — Self-Assessment

Skill / Concept	Tool / Method	Deliverable	Earning Ready?
Memory Hierarchy	Conceptual	—	✅ Yes — interview ready
Cache Address Decoding	Python script	Address Decoder tool on GitHub	✅ Yes — useful for GATE coaching content
Cache Mapping (3 types)	Diagrams + calculations	Blog post with ASCII diagrams	✅ Yes — technical writing gigs
Hit/Miss Trace Simulation	Python simulator	Cache Simulator on GitHub	✅ Yes — portfolio piece
AMAT Calculation	Formula application	Solved numericals set	✅ Yes — GATE coaching assistance
Virtual Memory Concepts	Conceptual + calculations	—	✅ Yes — interview ready
Cache Hierarchy Design	Full simulator (Tier 3)	L1+L2 Hierarchy Simulator	✅ Yes — resume-worthy project
VLSI/SoC Cache Design	SystemVerilog (beyond chapter)	—	⬜ Not yet — needs M.Tech/advanced courses

Minimum Viable Earning Setup after this chapter: A GitHub profile with (1) Cache Address Decoder (Python), (2) Cache Hit/Miss Simulator, (3) 2–3 well-written technical blog posts explaining cache mapping with diagrams = you can earn ₹3,000–₹10,000/month from technical writing + GATE coaching content while still in college.

✅ Unit 6 complete. You've mastered the Memory Unit — from registers to virtual memory!

[QR: Link to EduArtha video tutorial — COA Unit 6: Memory Unit]