
The vLLM tour

Seven ideas that make vLLM fast: paged attention, continuous batching, prefix caching, speculative decoding, chunked prefill, tensor parallelism, and disaggregated prefill/decode. One diagram each.

legend: token · KV block · shared / cached · generated · recompute / rejected
concept 1

paged attention splits the KV cache into blocks

Just as an operating system pages physical memory, vLLM stores each sequence's K/V in fixed-size blocks (default BLOCK_SIZE = 16 tokens). A block table maps the sequence's logical token positions to physical block ids - so the cache doesn't have to be one contiguous slab per request.

Why this matters: classical attention implementations allocate a contiguous KV buffer sized to the request's maximum length, and most of that buffer sits empty. With paging, vLLM allocates one block at a time as tokens arrive, packs many sequences into the same physical KV pool, and wastes at most one partially filled block per sequence - no external fragmentation.
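A minimal sketch of the bookkeeping in plain Python - the class, pool size, and method names here are illustrative, not vLLM's actual data structures:

```python
# Toy block table for a paged KV cache (illustrative, not vLLM's real classes).
BLOCK_SIZE = 16          # tokens per KV block (vLLM's default)
NUM_PHYSICAL_BLOCKS = 8  # size of the shared physical pool (toy value)

free_blocks = list(range(NUM_PHYSICAL_BLOCKS))  # physical block ids not yet in use

class Sequence:
    def __init__(self):
        self.num_tokens = 0
        self.block_table = []  # logical block index -> physical block id

    def append_token(self):
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(free_blocks.pop())
        self.num_tokens += 1

    def physical_slot(self, pos):
        # Translate a logical token position to (physical block, offset),
        # exactly like a page table translates virtual -> physical addresses.
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

seq = Sequence()
for _ in range(40):            # 40 tokens -> 3 physical blocks, pulled on demand
    seq.append_token()
print(seq.block_table)         # e.g. [7, 6, 5]
print(seq.physical_slot(35))   # (third physical block, offset 3)
```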

Add tokens with the slider. Watch new physical blocks get pulled from the free pool and stitched into the sequence's block table.

concept 2

continuous batching schedules at the iteration level

Instead of waiting for a whole batch to finish before starting the next (static batching), vLLM runs one decode step at a time across whatever requests are alive. Finished requests drop out immediately; new ones can join on the very next step.

The result is much higher GPU utilization: the model is never blocked waiting for the slowest request in a batch to finish. It also makes latency-vs-throughput tuning a scheduler problem rather than a bucketing problem.
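A toy version of the scheduling loop, assuming hypothetical request objects with an is_finished() method and a model step that decodes one token per live request (not vLLM's real scheduler):

```python
# Iteration-level scheduling: admit, step once, evict - every single iteration.
from collections import deque

waiting = deque()   # requests that have arrived but not yet been admitted
running = []        # requests currently generating, one token per step
MAX_RUNNING = 4     # illustrative capacity (in reality bounded by free KV blocks)

def step(model):
    # 1. Admit new requests the moment a slot is free - there is no batch
    #    boundary to wait for.
    while waiting and len(running) < MAX_RUNNING:
        running.append(waiting.popleft())

    # 2. One decode iteration across whatever is alive right now.
    model.decode_one_token(running)   # hypothetical model step

    # 3. Finished requests leave immediately, freeing their slot (and KV blocks)
    #    for the very next iteration.
    running[:] = [r for r in running if not r.is_finished()]
```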

Step the simulator forward. Each request finishes at a different time; a new one slots in as soon as a slot frees up.

concept 3

prefix caching reuses KV blocks across requests

Block-level KV state is content-addressed by the hash of its tokens (and parent block). Two requests that begin with the same prompt - a shared system prompt, a reused tool description, a long document - produce identical early block hashes and share the cached KV blocks directly.

The first request fills the cache. The second request just walks the hash chain, gets cache hits on the shared prefix, and only runs the model on the divergent tail.
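A rough sketch of the hash chain, assuming a toy in-memory cache; vLLM's real block hash also folds in extra metadata beyond the raw tokens:

```python
# Content-addressed KV blocks: a block's hash covers its tokens AND its parent,
# so a hit implies the entire prefix up to that block is identical.
import hashlib

BLOCK_SIZE = 16
kv_cache = {}   # block hash -> (stand-in for) the computed KV tensors

def block_hash(parent_hash, token_ids):
    h = hashlib.sha256()
    h.update(repr((parent_hash, tuple(token_ids))).encode())
    return h.hexdigest()

def prefill(prompt_token_ids):
    parent, hits, to_compute = None, 0, []
    for i in range(0, len(prompt_token_ids), BLOCK_SIZE):
        block = prompt_token_ids[i:i + BLOCK_SIZE]
        if len(block) < BLOCK_SIZE:
            to_compute.append(block)          # partial tail block is never shared
            break
        parent = block_hash(parent, block)
        if parent in kv_cache:
            hits += 1                         # reuse the cached KV directly
        else:
            kv_cache[parent] = f"KV for {parent[:8]}"
            to_compute.append(block)          # only this part runs the model
    return hits, to_compute
```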

Type a different suffix below. The shared prefix lights up green (cached); only the new tail is computed.

concept 4

speculative decoding drafts cheap, verifies correct

A small, cheap draft model proposes the next k tokens. The big target model runs one forward pass to score all k, accepts the longest matching prefix, and resamples at the first rejection. The output distribution is identical to plain decoding from the target (with greedy sampling, the tokens are bit-identical).

The win: when the draft is right, you get up to k tokens (plus one bonus token from the target's own prediction) for the cost of one big-model forward pass. When it's wrong, you fall back to the safe sample at the rejection point - never worse than non-speculative decoding.
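A greedy-sampling sketch of one round, assuming hypothetical draft_model / target_model objects; vLLM's real implementation uses rejection sampling so the distribution matches the target under any sampling settings, not just argmax:

```python
def speculative_round(target_model, draft_model, context, k=4):
    # 1. Draft: k cheap autoregressive steps.
    draft_tokens, draft_ctx = [], list(context)
    for _ in range(k):
        t = draft_model.next_token(draft_ctx)   # hypothetical argmax-style API
        draft_tokens.append(t)
        draft_ctx.append(t)

    # 2. Verify: ONE target forward over context + draft yields the target's
    #    own prediction at each of the k positions, plus one past the end.
    target_preds = target_model.next_tokens(context, draft_tokens)  # length k + 1

    # 3. Accept the longest prefix where draft and target agree.
    accepted = []
    for i, t in enumerate(draft_tokens):
        if t == target_preds[i]:
            accepted.append(t)
        else:
            accepted.append(target_preds[i])    # replace the first mismatch
            return accepted                     # rejected suffix is thrown away
    accepted.append(target_preds[k])            # all k accepted: free bonus token
    return accepted
```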

Step through one round: draft proposes, target verifies, accepted span is committed, rejected suffix is thrown away.

concept 5

chunked prefill splits long prompts so decodes don't starve

A naive scheduler runs each prefill to completion before any decode can run on the same step - a 4k-token prefill blocks every other request's decode for many iterations. Chunked prefill breaks the prefill into fixed-size chunks (e.g. 512 tokens) and lets each chunk ride alongside the decodes already in flight.

No FLOPs are saved - the prefill still costs the same total compute. The wins are: (1) latency-sensitive decodes don't starve, and (2) prefill is compute-bound while decode is memory-bandwidth-bound, so packing a prefill chunk into the same iteration as the in-flight decodes uses different GPU resources at the same time - net throughput goes up.
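A toy per-iteration token budget that produces this behavior; the budget value is illustrative (the real knob in vLLM is the scheduler's max_num_batched_tokens together with the chunked-prefill settings):

```python
TOKEN_BUDGET = 512   # max tokens the model runs per iteration (toy value)

def build_iteration(decodes, prefills):
    batch, budget = [], TOKEN_BUDGET

    # Decodes go first: each costs exactly 1 token, so they are never starved
    # behind a long prompt.
    for req in decodes:
        if budget == 0:
            break
        batch.append((req, 1))
        budget -= 1

    # Whatever budget remains is spent on a chunk of the next prefill.
    for req in prefills:
        if budget == 0:
            break
        chunk = min(budget, req.remaining_prompt_tokens)
        batch.append((req, chunk))
        req.remaining_prompt_tokens -= chunk
        budget -= chunk

    return batch   # one mixed prefill+decode forward pass
```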

Slide the chunk size. The top track is the unchunked baseline; the bottom track is what chunked prefill produces. Watch when the green decode steps get to run.

concept 6

tensor parallelism shards weight matrices across GPUs

Slice the weights. The input x to this block is the output of the previous block (e.g. the attention block right before this MLP). That previous block ended with an all-reduce, which already left an identical copy of its output on every rank - so this block starts with x already replicated, no broadcast required.

Within a block, each rank chains a column-parallel GEMM (replicated input → sharded intermediate) into a row-parallel GEMM (sharded input → partial-sum output); the intermediate between the two linears is sharded, not replicated. Only the final all-reduce at the block's exit makes ranks wait on each other - one collective per attention block, one per MLP block.
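A single-process NumPy sketch of one tensor-parallel MLP block, emulating the ranks with a loop; real vLLM shards across GPUs and replaces the final sum with a distributed all-reduce:

```python
import numpy as np

TP = 2                       # tensor-parallel degree
d_model, d_ff = 8, 32        # toy sizes; d_ff is the sharded dimension

rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff))   # column-parallel: split along d_ff
W2 = rng.standard_normal((d_ff, d_model))   # row-parallel: split along d_ff
x = rng.standard_normal((1, d_model))       # replicated input on every rank

shard = d_ff // TP
partials = []
for rank in range(TP):
    W1_shard = W1[:, rank * shard:(rank + 1) * shard]   # this rank's weight slices
    W2_shard = W2[rank * shard:(rank + 1) * shard, :]
    h_shard = np.maximum(x @ W1_shard, 0)    # sharded intermediate, no comms needed
    partials.append(h_shard @ W2_shard)      # partial sum of the block's output

y = sum(partials)            # the all-reduce: the one collective per MLP block
assert np.allclose(y, np.maximum(x @ W1, 0) @ W2)   # matches the unsharded MLP
```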

Step through one MLP block. Watch x flow: replicated (from the previous block's all-reduce) → col-parallel → sharded intermediate → row-parallel → partial-sum → all-reduce → replicated, ready for the next block.

concept 7

disaggregated P/D: prefill and decode want different hardware

Prefill is compute-bound (one big forward pass over thousands of tokens). Decode is memory-bandwidth-bound (one token at a time, dominated by KV reads). Running both on the same pool means each starves the other. Disaggregated serving runs them on separate worker pools and ships the KV cache between them.

The request lifecycle: arrive at a prefill worker → run the big prefill → transfer the resulting KV blocks (over NVLink, RDMA, or NIXL) to a decode worker → stream output tokens from the decode worker. The two pools can be sized and tuned independently.
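Schematically, with hypothetical pool and transport objects standing in for the real workers and KV connector:

```python
def serve(request, prefill_pool, decode_pool, kv_transport):
    # 1. Compute-bound phase: one big forward over the whole prompt.
    p_worker = prefill_pool.pick_worker(request)          # hypothetical API
    kv_blocks, first_token = p_worker.prefill(request.prompt_tokens)

    # 2. Ship the prompt's KV blocks to a decode worker
    #    (NVLink / RDMA / NIXL in a real deployment).
    d_worker = decode_pool.pick_worker(request)
    kv_transport.send(kv_blocks, src=p_worker, dst=d_worker)

    # 3. Memory-bandwidth-bound phase: stream tokens one at a time.
    yield first_token
    yield from d_worker.decode(request, kv_blocks)
```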

Step through one request's lifecycle.