Week 3 Day 2: Paged Attention, Part 2
In this chapter, we move from paged KV storage to the runtime metadata and execution path needed for real paged attention.
Part 1 introduced fixed-size pages, a model-owned page pool shared by layer caches, and per-layer page metadata. That change already improves the storage abstraction, but it does not yet remove the dense gather before attention. To get the full benefit, the attention path itself must understand how to read from pages.
Paged KV Cache vs Paged Attention
These two ideas are related, but they are not the same:
- Paged KV cache KV is stored in fixed-size pages.
- Paged attention The attention path reads KV directly from those pages via metadata such as a page table.
You can implement the first one without the second one, but the real serving payoff comes when both are present.
The Metadata a Paged Runtime Needs
Once KV is paged, dense B x H x S x D tensors are no longer the natural runtime representation. Instead, the runtime should prepare metadata like:
block_table: [B, max_pages_per_request]
context_lens: [B]
For the current layer being executed:
block_table[b, i]gives the page id for requestb’s current-layer logical pageicontext_lens[b]gives the valid token count for requestb
This is the bridge between the scheduler and the attention kernel.
A production runtime often also carries write-side metadata such as slot_mapping.
For this chapter, we keep the write side inside the cache and focus on the read-side
metadata needed by attention.
Why block_table Matters
Suppose one layer cache for request A has:
page_ids = [12, 5, 3]
context_len = 10
page_size = 4
Then the logical sequence positions map to physical storage like this:
logical 0..3 -> page 12
logical 4..7 -> page 5
logical 8..9 -> page 3
The attention runtime does not need a fully gathered dense tensor if it already knows:
- which current-layer page each logical block lives in,
- how long the context is,
- and where the current query positions are.
That is exactly what block_table and context_lens encode.
The Real Attention API
At this point, the runtime should grow a new attention entry point:
paged_attention(
query,
key_pages,
value_pages,
block_table,
context_lens,
page_size,
scale=None,
mask="causal",
)
With shapes like:
query: B, H_q, L, D
key_pages[i]: 1, H_kv, page_size, D
value_pages[i]: 1, H_kv, page_size, D
block_table: B, max_pages
context_lens: B
Compared with the Week 2 dense path, the important difference is that the source length is no longer represented as one contiguous tensor dimension. It is reconstructed logically from the page table.
In this chapter, paged_attention should read pages directly from a GPU
kernel. The runtime contract is now: model code and batching code pass pages
plus metadata, and the attention kernel walks that metadata without first
rebuilding dense K/V.
Prefill Metadata
During prefill, a chunk may span multiple pages. The runtime needs to know:
- which current-layer pages already existed,
- which new pages were allocated,
- how many valid tokens are in the tail page,
- how to map incoming K/V rows into page storage.
In this teaching implementation, the cache still owns the write-side bookkeeping. The attention path only needs the block table after the write is done.
Decode Metadata
During decode, each active request typically writes one token.
The runtime should be able to:
- append the token’s K/V to the current tail page,
- allocate a new page only if the tail page is full,
- update the current layer cache’s
context_len, - run attention over the full logical context using
block_table
This is the point where decode stops paying the repeated dense-repack cost from Week 2.
How This Maps to tiny-llm
src/tiny_llm/attention.py
Add a new function:
def paged_attention(...):
...
For this course implementation, make it a FlashAttention-style page-walking Metal kernel:
- use
block_table[b]to find the physical pages for requestb, - use
context_lens[b]to ignore unused tail capacity, - visit K/V in small tiles instead of materializing dense K/V,
- merge each tile into the output with online softmax.
The important change from Week 2 is the K/V address calculation. Week 2 can
advance through dense K/V by pointer arithmetic. Week 3 must translate each
logical key position through block_table first:
logical key position -> logical page -> physical page id -> slot in page
After that lookup, the online-softmax update is the same idea as Week 2 FlashAttention. We still avoid a dense K/V gather before attention.
The page pool should therefore expose contiguous physical storage:
key_pages: P, H_kv, page_size, D
value_pages: P, H_kv, page_size, D
A Python list of page tensors is convenient for teaching the allocator, but a
GPU kernel needs a single buffer so page_id can be turned into an address.
src/tiny_llm/qwen3_week3.py
The attention module should call the paged runtime directly:
metadata = cache.update_and_fetch_paged(...)
x = paged_attention(...)
Week 3 cache handles are expected to provide paged metadata. If a dense cache is passed to the Week 3 model, that is a programming error rather than a signal to silently fall back to Week 2 attention.
src/tiny_llm/batch.py
The scheduler now needs to prepare runtime metadata instead of only dense K/V:
- per-layer page tables for each active request
- padded batch
block_table context_lens
This is where continuous batching and paged attention finally connect. In Week 2, batching worked by repacking tensors. In Week 3, batching should work by reusing page tables and updating only the new slots.
Recommended Incremental Rollout
The safest implementation order is:
- paged storage
block_table/context_lensplumbing- FlashAttention-style page-walking GPU attention
- model and batch dispatch
This order matters because it gives us a clean correctness baseline at each step.
Correctness Invariants
These are the invariants worth checking in tests:
context_lenalways equals the number of written logical token positions.block_tablereconstructs the same logical KV order as the dense baseline.- the allocator never hands the same page to two live cache handles unless explicit sharing is implemented.
- releasing a request returns all pages owned by all of its layer caches exactly once.
- decode allocates a new page only when the tail page overflows.
Task 1: Add Batch Metadata
src/tiny_llm/paged_kv_cache.py
src/tiny_llm/kv_cache.py
src/tiny_llm/batch.py
Extend the batch cache and scheduler so they can prepare:
block_tablecontext_lens
for all active requests.
Task 2: Define paged_attention
src/tiny_llm/attention.py
src/extensions/src/paged_attention.cpp
src/extensions/src/paged_attention.metal
Add a paged attention interface whose inputs come from the paged runtime rather than a dense reconstructed S dimension.
The reference solution walks every request’s block table and keeps online softmax state:
running_max = max(previous_max, page_max)
running_sum = previous_sum * exp(previous_max - running_max) + page_sum
output = previous_output * exp(previous_max - running_max) + page_output
After all visible pages are consumed, divide output by running_sum.
This is the key idea that lets the kernel avoid materializing dense K/V while
still producing the same result as dense attention.
Task 3: Dispatch from the Model
src/tiny_llm/qwen3_week3.py
Update the model so it can route to paged attention when the cache provides paged runtime metadata.
Task 4: Connect It to Continuous Batching
src/tiny_llm/batch.py
Update request admission, slot reuse, and request removal so that:
- finished requests free their pages,
- in this teaching implementation, that means freeing pages from every layer cache,
- new requests allocate from the shared pool,
- active decode steps reuse page metadata instead of rebuilding dense K/V.
After this chapter, the serving stack has the right structure for a real high-throughput runtime: paging is no longer just a storage trick, but part of the execution model itself.
Your feedback is greatly appreciated. Welcome to join our Discord Community.
Found an issue? Create an issue / pull request on github.com/skyzh/tiny-llm.
tiny-llm-book © 2025 by Alex Chi Z is licensed under CC BY-NC-SA 4.0.