vLLM · spans · Legolink
An interactive walk-through of the six tests that pin down how position-invariant chunks, span boundaries, and the Legolink gap policy behave inside vLLM's V1 KV cache.
Hover any term to highlight it across the page.
vLLM stores KV in fixed-size blocks of BLOCK_SIZE = 16 tokens. A 64-token prompt is 4 blocks.
A position-invariant chunk: a block whose hash is computed without its parent. Same content → same hash, no matter where it appears in the prompt.
An array of token positions where the hash chain is reset. Each entry marks the first block of a new PIC chunk.
On a prefix-cache hit, the SpanAwareGapPolicy opens a
gap interval that re-prefills tokens, overwriting the
cached K/V in place.
Structural: test_pic_chunk_hash_invariant_across_positions -
pins the hash function (no LLM).
E2E: test_pic_chunk_hash_invariant_across_positions_e2e -
warms up the chunk alone, then verifies the warmup K/V slot is hit
by two requests that place the chunk at different positions
(position 16 in req_A, position 48 in req_B).
Request A places the chunk after a 1-block prefix (position 16).
Request B places the same chunk after a 3-block prefix (position 48).
Without PIC, the chunk's block hash would chain through different parents
and the two would differ. With span_starts,
the chunk's parent is dropped, so the hashes collide.
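The chaining behavior above can be sketched with a toy model. This is not vLLM's actual code: the hash function is a stand-in for hash_block_tokens, and only the parent-reset logic mirrors what the page describes.

```python
# Toy model of chained block hashing with a PIC reset (not vLLM's real code).
import hashlib

BLOCK_SIZE = 16
NONE_HASH = "none"

def block_hash(parent: str, tokens: tuple) -> str:
    # Stand-in hash: depends on the parent hash and this block's tokens.
    return hashlib.sha256(f"{parent}|{tokens}".encode()).hexdigest()[:12]

def hash_prompt(tokens: list, span_starts=()) -> list:
    """Hash a prompt block by block, chaining parents; reset at span starts."""
    hashes, parent = [], NONE_HASH
    for start in range(0, len(tokens), BLOCK_SIZE):
        if start in span_starts:      # PIC: drop the parent at a span boundary
            parent = NONE_HASH
        h = block_hash(parent, tuple(tokens[start:start + BLOCK_SIZE]))
        hashes.append(h)
        parent = h
    return hashes

chunk = list(range(100, 116))                # one 16-token PIC chunk
req_a = list(range(16)) + chunk + [0] * 32   # chunk at position 16 (block 1)
req_b = list(range(48)) + chunk              # chunk at position 48 (block 3)

# Without span_starts, the chunk's hash chains through different parents...
assert hash_prompt(req_a)[1] != hash_prompt(req_b)[3]
# ...with a span start at the chunk, the parent is dropped and the hashes collide.
assert hash_prompt(req_a, span_starts=(16,))[1] == hash_prompt(req_b, span_starts=(48,))[3]
```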
test_span_boundary_resets_block_hash_chain_e2e
Warm up a chunk's tokens alone, then send a 64-token prompt two ways:
baseline (no span_starts) and marked (span_starts=[32]).
Only the marked run can hit the warmup slot for block 2 - because only
its hash chain is reset to NONE_HASH at the span boundary.
The same 64 tokens are hashed twice: once as a baseline, once with
span_starts=[32]. Slide the span position to see which
block hashes change.
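A minimal sketch of the sliding-span comparison, again with a toy hash rather than vLLM's: hashing the same 64 tokens with and without span_starts=[32] shows that blocks before the span keep their hashes, while the span block and everything downstream of it change.

```python
# Toy per-block hashing: baseline vs span_starts=[32] on the same 64 tokens.
import hashlib

BLOCK_SIZE, NONE_HASH = 16, "none"

def hashes(tokens, span_starts=()):
    out, parent = [], NONE_HASH
    for i in range(0, len(tokens), BLOCK_SIZE):
        if i in span_starts:
            parent = NONE_HASH           # reset the chain at the span boundary
        h = hashlib.sha256(f"{parent}|{tokens[i:i+BLOCK_SIZE]}".encode()).hexdigest()[:8]
        out.append(h)
        parent = h
    return out

prompt = list(range(64))                 # 4 blocks
base, marked = hashes(prompt), hashes(prompt, span_starts=(32,))

assert base[:2] == marked[:2]            # blocks 0-1: before the span, unchanged
assert base[2] != marked[2]              # block 2: parent dropped, hash changes
assert base[3] != marked[3]              # block 3: chains through block 2, changes too
```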
test_same_pic_chunk_reuse_across_prefixes_e2e
Warm up a chunk alone, then run two requests with different 1-block
prefixes but the same chunk. Both must HIT the warmup K/V slot via
PIC fan-in; the two requests' prefix slots must differ (no
cross-prefix reuse on prefix blocks). Two requests with different prefixes but the same PIC chunk hash the
chunk identically - the prefix blocks differ.
This is the user-visible promise of PIC: drop the same document, function, or system prompt into any conversation and its KV cache entry is reusable across requests. Hash collision on the chunk → cache hit.
test_pic_chunk_warmup_then_three_requests (e2e):
pre-warm the PIC chunk, then run req_A / req_B / req_C and check
that the warmup slot(s) survive every request, B fully reuses A,
and req_C only adds the prefix_Y blocks - chunk bytes are reused
across A and C.
What's tested is block-hash equality - the hash that the prefix
cache uses as a lookup index (a dictionary key, not
the attention K tensor). The K/V tensors stored under that hash are not
compared by this test. req_A = prefix_X + chunk + suffix;
req_B is byte-identical to A with a different request_id;
req_C = prefix_Y + chunk + suffix swaps the prefix.
Assertions check which block_hashes collide across the three.
Mechanism (hash_block_tokens,
kv_cache_utils.py:555): when is_span_start=True
for the block at span_starts[i], the parent hash is
replaced with NONE_HASH. The chunk's hash then depends
only on (NONE_HASH, chunk_tokens, extra_keys). Every
downstream block chains from that prefix-free root, so the entire
post-span tail hash-collides across A and C as long as
chunk + tail are byte-equal.
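The fan-in described above can be sketched as follows. The helper name echoes hash_block_tokens but is a toy stand-in (the real function in kv_cache_utils.py has a different signature); the layout (1-block prefixes, a 1-block chunk, a 2-block tail) is illustrative.

```python
# Toy demonstration: once the span block's parent is NONE_HASH, the whole
# chunk+tail hash sequence collides across requests with different prefixes.
import hashlib

BLOCK_SIZE, NONE_HASH = 16, "none"

def hash_block_tokens(parent, tokens, is_span_start=False):
    if is_span_start:
        parent = NONE_HASH               # the PIC reset: drop the parent hash
    return hashlib.sha256(f"{parent}|{tokens}".encode()).hexdigest()[:8]

def prompt_hashes(prefix, chunk_and_tail, span_at):
    parent, out = NONE_HASH, []
    tokens = prefix + chunk_and_tail
    for i in range(0, len(tokens), BLOCK_SIZE):
        h = hash_block_tokens(parent, tuple(tokens[i:i + BLOCK_SIZE]),
                              is_span_start=(i == span_at))
        out.append(h)
        parent = h
    return out

chunk_and_tail = list(range(200, 248))       # chunk (1 block) + tail (2 blocks)
a = prompt_hashes(list(range(16)),     chunk_and_tail, span_at=16)  # prefix_X
c = prompt_hashes(list(range(50, 66)), chunk_and_tail, span_at=16)  # prefix_Y

assert a[0] != c[0]      # prefix blocks differ
assert a[1:] == c[1:]    # chunk + entire tail hash-collide across A and C
```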
This test currently FAILS by design.
Only the chunk is marked PIC; tail blocks inherit a fan-in hash
because they chain through the chunk's hash. vLLM treats that as a
cache hit and silently hands req_C the tail K/V that req_A wrote,
even though req_C's tail tokens attend over a different prefix. The
test asserts |C \ A| >= 5 (2 prefix_Y + 3 fresh tail
blocks); current vLLM gives 3.
test_legolink_partial_recompute_within_gap_interval
Mode LL-32 sets gap_length = 2 · BLOCK_SIZE.
With span_starts = [32] on a 96-token prompt, the policy
opens the gap (32, 64): only the chunk + the
first downstream block can be recomputed. Blocks 4 and 5 (further
along the span chain) must keep their cached K/V.
Two parts. Structural - call
SpanAwareGapPolicy(gap_length=32).get_gaps(...) on a
Request with span_starts=[32] and pin the returned
interval to [(32, 64)]. End-to-end -
run the same tokenized prompt twice through an LL-32 LLM, snapshot
the KV cache between runs, and bound the byte-diff: the gap can
recompute at most 2 prompt blocks, so the number of K/V byte-hashes
that change is bounded by 2 + a small decode-step budget.
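The structural half's interval arithmetic can be sketched with a stand-in for the policy. SpanAwareGapPolicy and get_gaps are vLLM-side names from the test description; this toy only reproduces the behavior the page pins down: each span start opens a gap of gap_length tokens, clipped to the prompt.

```python
# Stand-in for the gap computation (real SpanAwareGapPolicy lives in vLLM).
BLOCK_SIZE = 16

def get_gaps(span_starts, gap_length, prompt_len):
    """Return (start, end) token intervals to re-prefill, one per span."""
    return [(s, min(s + gap_length, prompt_len)) for s in span_starts]

# LL-32: gap_length = 2 * BLOCK_SIZE, 96-token prompt, span_starts=[32]
gaps = get_gaps([32], gap_length=2 * BLOCK_SIZE, prompt_len=96)
assert gaps == [(32, 64)]    # chunk block + first downstream block

# Which blocks fall inside the gap (and so may be recomputed)?
covered = {b for s, e in gaps
           for b in range(s // BLOCK_SIZE, -(-e // BLOCK_SIZE))}
assert covered == {2, 3}     # blocks 4 and 5 keep their cached K/V
```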
test_legolink_gap_huge_equals_full_recompute
Same prompt, no padding, 4 blocks (64 tokens). Five configurations
compared: FR, SPANS+no span_starts,
SPANS+span_starts=[0], LL-FULL+no span_starts,
LL-FULL+span_starts=[0]. All five must produce
bit-identical text and near-identical top-K logprobs - because none
of them actually does cross-prefix PIC reuse; they only exercise
different code paths that should all reduce to the same K/V.
The five modes intentionally cover every "should be a no-op" lever in the spans machinery: the env flag on with no span_starts; the entire prompt declared as one span (a PIC reset at block 0 is a no-op, since block 0's parent was already None); the gap policy with no span_starts to fire on; and the gap policy firing but covering the entire prompt, so the re-prefill sees the same context as a cold prefill. Output text is bit-exact and the top-1 token matches; lower-rank logprobs may drift by ~0.04 because the cold-prefill and gap-recompute code paths are not numerically identical.
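The comparison discipline above can be sketched as a small checker. The numbers and dictionary shape are made up for illustration; only the policy follows the text: text bit-exact, top-1 token exact, lower-rank logprobs allowed a small drift budget.

```python
# Toy checker: exact text and top-1 token, tolerance on lower-rank logprobs.
def compare(run_a, run_b, logprob_tol=0.05):
    assert run_a["text"] == run_b["text"]              # bit-exact output text
    for lp_a, lp_b in zip(run_a["top_logprobs"], run_b["top_logprobs"]):
        assert lp_a[0][0] == lp_b[0][0]                # top-1 token id matches
        for (t1, p1), (t2, p2) in zip(lp_a[1:], lp_b[1:]):
            assert abs(p1 - p2) <= logprob_tol         # lower ranks may drift

# Hypothetical outputs from two of the five configurations:
fr      = {"text": "ok", "top_logprobs": [[(7, -0.10), (3, -2.30)]]}
ll_full = {"text": "ok", "top_logprobs": [[(7, -0.10), (3, -2.33)]]}
compare(fr, ll_full)    # passes: same text, same top-1, 0.03 drift on rank 2
```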