vllm · paged attention · kv cache

Illustrated vLLM

A small set of interactive, visual walk-throughs of how vLLM serves large language models - the moving parts behind paged attention, continuous batching, prefix caching, and the position-invariant KV cache (PIC) work in vLLM's V1 engine.

Click on any diagram to interact.

start here

The vLLM tour

Seven core ideas that make vLLM fast - one diagram per concept, each one interactive (a small code sketch of the paged-attention bookkeeping follows below).

/vllm
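
Paged attention is easy to preview in code. The sketch below is a toy model under assumed names (BLOCK_SIZE, BlockTable, slot_for are illustrative, not vLLM's API): it shows the one data structure the diagram animates, a per-request block table that maps logical token positions onto physical KV blocks drawn from a shared pool, so a sequence's cache never needs contiguous memory.

```python
# Minimal sketch of paged-attention bookkeeping; all names are illustrative,
# not vLLM internals. Each request's KV cache lives in fixed-size blocks, and
# a per-request table maps logical block indices to physical block ids.

BLOCK_SIZE = 16  # tokens stored per KV block

class BlockTable:
    """Per-request mapping: logical block index -> physical block id."""

    def __init__(self) -> None:
        self.physical: list[int] = []

    def slot_for(self, token_pos: int, free_pool: list[int]) -> tuple[int, int]:
        """Return (physical_block, offset) for a token, allocating on demand."""
        logical = token_pos // BLOCK_SIZE
        while len(self.physical) <= logical:
            self.physical.append(free_pool.pop())  # grab any free block
        return self.physical[logical], token_pos % BLOCK_SIZE


free_pool = list(range(1024))         # ids of free physical blocks (no tensors here)
table = BlockTable()
print(table.slot_for(0, free_pool))   # (1023, 0): first block, offset 0
print(table.slot_for(17, free_pool))  # (1022, 1): second block, offset 1
```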
deep dive

PIC / Legolink

A long, test-by-test walk-through of position-invariant chunks, span_starts, and the Legolink gap policy that re-prefills KV in place. Six interactive visualizations, one per pinning test (a toy code sketch of the span-and-gap idea follows below).

/pic
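
For a taste of what the /pic walk-through covers, here is a toy model of the central idea: cached chunks carry KV for spans of tokens, and when a prompt reuses those chunks at new offsets, only the uncovered gaps between spans need a fresh prefill. Apart from the span_starts name, everything here (gaps_to_prefill and its signature) is hypothetical, not the Legolink implementation.

```python
# Toy model (not vLLM code) of position-invariant chunk reuse: given where
# cached spans land in a new prompt (span_starts) and how long they are,
# compute the [start, end) token ranges that still need a fresh prefill.
# gaps_to_prefill is a hypothetical helper invented for this sketch.

def gaps_to_prefill(span_starts: list[int], lengths: list[int],
                    prompt_len: int) -> list[tuple[int, int]]:
    """Return the token ranges not covered by any cached span."""
    gaps, cursor = [], 0
    for start, length in sorted(zip(span_starts, lengths)):
        if start > cursor:
            gaps.append((cursor, start))      # hole before this span: prefill it
        cursor = max(cursor, start + length)  # spans may overlap; keep the max
    if cursor < prompt_len:
        gaps.append((cursor, prompt_len))     # tail after the last cached span
    return gaps


# Two cached chunks land at offsets 4 and 20 of a 32-token prompt:
print(gaps_to_prefill([4, 20], [8, 8], 32))  # [(0, 4), (12, 20), (28, 32)]
```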

Why this site

vLLM is one of the most-deployed LLM inference engines, and a lot of its cleverness is structural: how memory is laid out, how requests are scheduled, how the KV cache is reused. Diagrams help.
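
One concrete example of that structural cleverness: prefix caching keys each full KV block by a hash chained over everything before it, so two requests that share a prompt prefix resolve to the same cached blocks. A minimal sketch of that keying scheme, with an illustrative helper name (block_keys) rather than vLLM's internals:

```python
# Minimal sketch of chained block hashing for prefix caching; block_keys is an
# illustrative helper, not vLLM's API. key_i depends on key_{i-1} and the
# tokens of block i, so equal keys imply an identical prefix up to that block.

BLOCK = 4  # tokens per block, kept small for the example

def block_keys(token_ids: list[int]) -> list[int]:
    """One key per *full* block; a partial trailing block gets no key."""
    keys, parent = [], 0
    full = len(token_ids) - len(token_ids) % BLOCK
    for i in range(0, full, BLOCK):
        parent = hash((parent, tuple(token_ids[i:i + BLOCK])))
        keys.append(parent)
    return keys


a = block_keys([1, 2, 3, 4, 5, 6, 7, 8])
b = block_keys([1, 2, 3, 4, 9, 9, 9, 9])
print(a[0] == b[0])  # True: same first block -> cached KV is reusable
print(a[1] == b[1])  # False: the prompts diverge in the second block
```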

The two pages here have very different shapes. /vllm is a short tour for someone who has heard of paged attention and wants to actually see it. /pic is a long-form walk-through of one specific piece of work - the spans / Legolink tests in vLLM's V1 KV cache - visualized one test at a time.

Inspired by poloclub/transformer-explainer. Source on GitHub.