vLLM · spans · Legolink
An interactive walk-through of the six tests that pin down how position-invariant chunks, span boundaries, and the Legolink gap policy behave inside vLLM's V1 KV cache.
Hover any term to highlight it across the page.
vLLM stores KV in fixed-size blocks of BLOCK_SIZE = 16 tokens. A 64-token prompt is 4 blocks.
A position-invariant chunk: a block whose hash is computed without its parent. Same content → same hash, no matter where it appears in the prompt.
An array of token positions where the hash chain is reset. Each entry marks the first block of a new PIC chunk.
On a prefix-cache hit, the SpanAwareGapPolicy opens a
gap interval that re-prefills tokens, overwriting the
cached K/V in place.
Structural: test_pic_chunk_hash_invariant_across_positions -
pins the hash function (no LLM).
E2E: test_pic_chunk_hash_invariant_across_positions_e2e -
warms up the chunk alone, then verifies the warmup K/V slot is hit
by two requests that place the chunk at different positions
(position 16 in req_A, position 48 in req_B).
Request A places the chunk after a 1-block prefix (position 16).
Request B places the same chunk after a 3-block prefix (position 48).
Without PIC, the chunk's block hash would chain through different parents
and the two would differ. With span_starts,
the chunk's parent is dropped, so the hashes collide.
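The chaining behavior above can be sketched with a toy model. This is not vLLM's actual code: the hash function is a stand-in for hash_block_tokens, and only the parent-reset logic mirrors what the page describes.

```python
# Toy model of chained block hashing with a PIC reset (not vLLM's real code).
import hashlib

BLOCK_SIZE = 16
NONE_HASH = "none"

def block_hash(parent: str, tokens: tuple) -> str:
    # Stand-in hash: depends on the parent hash and this block's tokens.
    return hashlib.sha256(f"{parent}|{tokens}".encode()).hexdigest()[:12]

def hash_prompt(tokens: list, span_starts=()) -> list:
    """Hash a prompt block by block, chaining parents; reset at span starts."""
    hashes, parent = [], NONE_HASH
    for start in range(0, len(tokens), BLOCK_SIZE):
        if start in span_starts:      # PIC: drop the parent at a span boundary
            parent = NONE_HASH
        h = block_hash(parent, tuple(tokens[start:start + BLOCK_SIZE]))
        hashes.append(h)
        parent = h
    return hashes

chunk = list(range(100, 116))                # one 16-token PIC chunk
req_a = list(range(16)) + chunk + [0] * 32   # chunk at position 16 (block 1)
req_b = list(range(48)) + chunk              # chunk at position 48 (block 3)

# Without span_starts, the chunk's hash chains through different parents...
assert hash_prompt(req_a)[1] != hash_prompt(req_b)[3]
# ...with a span start at the chunk, the parent is dropped and the hashes collide.
assert hash_prompt(req_a, span_starts=(16,))[1] == hash_prompt(req_b, span_starts=(48,))[3]
```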
test_span_boundary_resets_block_hash_chain_e2e
Warm up a chunk's tokens alone, then send a 64-token prompt two ways:
baseline (no span_starts) and marked (span_starts=[32]).
Only the marked run can hit the warmup slot for block 2 - because only
its hash chain is reset to NONE_HASH at the span boundary.
The same 64 tokens are hashed twice: once as a baseline, once with
span_starts=[32]. Slide the span position to see which
block hashes change.
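A minimal sketch of the sliding-span comparison, again with a toy hash rather than vLLM's: hashing the same 64 tokens with and without span_starts=[32] shows that blocks before the span keep their hashes, while the span block and everything downstream of it change.

```python
# Toy per-block hashing: baseline vs span_starts=[32] on the same 64 tokens.
import hashlib

BLOCK_SIZE, NONE_HASH = 16, "none"

def hashes(tokens, span_starts=()):
    out, parent = [], NONE_HASH
    for i in range(0, len(tokens), BLOCK_SIZE):
        if i in span_starts:
            parent = NONE_HASH           # reset the chain at the span boundary
        h = hashlib.sha256(f"{parent}|{tokens[i:i+BLOCK_SIZE]}".encode()).hexdigest()[:8]
        out.append(h)
        parent = h
    return out

prompt = list(range(64))                 # 4 blocks
base, marked = hashes(prompt), hashes(prompt, span_starts=(32,))

assert base[:2] == marked[:2]            # blocks 0-1: before the span, unchanged
assert base[2] != marked[2]              # block 2: parent dropped, hash changes
assert base[3] != marked[3]              # block 3: chains through block 2, changes too
```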
test_same_pic_chunk_reuse_across_prefixes_e2e
Warm up a chunk alone, then run two requests with different 1-block
prefixes but the same chunk. Both must HIT the warmup K/V slot via
PIC fan-in; the two requests' prefix slots must differ (no
cross-prefix reuse on prefix blocks). Two requests with different prefixes but the same PIC chunk hash the
chunk identically - the prefix blocks differ.
This is the user-visible promise of PIC: drop the same document, function, or system prompt into any conversation and its KV cache entry is reusable across requests. Hash collision on the chunk → cache hit.
test_pic_chunk_warmup_then_three_requests (e2e):
pre-warm the PIC chunk, then run req_A / req_B / req_C and check
that the warmup slot(s) survive every request, B fully reuses A,
and req_C only adds the prefix_Y blocks - chunk bytes are reused
across A and C.
What's tested is block-hash equality - the hash that the prefix
cache uses as a lookup index (a dictionary key, not
the attention K tensor). The K/V tensors stored under that hash are not
compared by this test. req_A = prefix_X + chunk + suffix;
req_B is byte-identical to A with a different request_id;
req_C = prefix_Y + chunk + suffix swaps the prefix.
Assertions check which block_hashes collide across the three.
Mechanism (hash_block_tokens,
kv_cache_utils.py:555): when is_span_start=True
for the block at span_starts[i], the parent hash is
replaced with NONE_HASH. The chunk's hash then depends
only on (NONE_HASH, chunk_tokens, extra_keys). Every
downstream block chains from that prefix-free root, so the entire
post-span tail hash-collides across A and C as long as
chunk + tail are byte-equal.
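The fan-in described above can be sketched as follows. The helper name echoes hash_block_tokens but is a toy stand-in (the real function in kv_cache_utils.py has a different signature); the layout (1-block prefixes, a 1-block chunk, a 2-block tail) is illustrative.

```python
# Toy demonstration: once the span block's parent is NONE_HASH, the whole
# chunk+tail hash sequence collides across requests with different prefixes.
import hashlib

BLOCK_SIZE, NONE_HASH = 16, "none"

def hash_block_tokens(parent, tokens, is_span_start=False):
    if is_span_start:
        parent = NONE_HASH               # the PIC reset: drop the parent hash
    return hashlib.sha256(f"{parent}|{tokens}".encode()).hexdigest()[:8]

def prompt_hashes(prefix, chunk_and_tail, span_at):
    parent, out = NONE_HASH, []
    tokens = prefix + chunk_and_tail
    for i in range(0, len(tokens), BLOCK_SIZE):
        h = hash_block_tokens(parent, tuple(tokens[i:i + BLOCK_SIZE]),
                              is_span_start=(i == span_at))
        out.append(h)
        parent = h
    return out

chunk_and_tail = list(range(200, 248))       # chunk (1 block) + tail (2 blocks)
a = prompt_hashes(list(range(16)),     chunk_and_tail, span_at=16)  # prefix_X
c = prompt_hashes(list(range(50, 66)), chunk_and_tail, span_at=16)  # prefix_Y

assert a[0] != c[0]      # prefix blocks differ
assert a[1:] == c[1:]    # chunk + entire tail hash-collide across A and C
```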
This test currently FAILS by design.
Only the chunk is marked PIC; tail blocks inherit a fan-in hash
because they chain through the chunk's hash. vLLM treats that as a
cache hit and silently hands req_C the tail K/V that req_A wrote,
even though req_C's tail tokens attend over a different prefix. The
test asserts |C \ A| >= 5 (2 prefix_Y + 3 fresh tail
blocks); current vLLM gives 3.
test_legolink_partial_recompute_within_gap_interval
Mode LL-32 sets gap_length = 2 · BLOCK_SIZE.
With span_starts = [32] on a 96-token prompt, the policy
opens the gap (32, 64): only the chunk + the
first downstream block can be recomputed. Blocks 4 and 5 (further
along the span chain) must keep their cached K/V.
Two parts. Structural - call
SpanAwareGapPolicy(gap_length=32).get_gaps(...) on a
Request with span_starts=[32] and pin the returned
interval to [(32, 64)]. End-to-end -
run the same tokenized prompt twice through an LL-32 LLM, snapshot
the KV cache between runs, and bound the byte-diff: the gap can
recompute at most 2 prompt blocks, so the number of K/V byte-hashes
that change is bounded by 2 + a small decode-step budget.
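The structural half's interval arithmetic can be sketched with a stand-in for the policy. SpanAwareGapPolicy and get_gaps are vLLM-side names from the test description; this toy only reproduces the behavior the page pins down: each span start opens a gap of gap_length tokens, clipped to the prompt.

```python
# Stand-in for the gap computation (real SpanAwareGapPolicy lives in vLLM).
BLOCK_SIZE = 16

def get_gaps(span_starts, gap_length, prompt_len):
    """Return (start, end) token intervals to re-prefill, one per span."""
    return [(s, min(s + gap_length, prompt_len)) for s in span_starts]

# LL-32: gap_length = 2 * BLOCK_SIZE, 96-token prompt, span_starts=[32]
gaps = get_gaps([32], gap_length=2 * BLOCK_SIZE, prompt_len=96)
assert gaps == [(32, 64)]    # chunk block + first downstream block

# Which blocks fall inside the gap (and so may be recomputed)?
covered = {b for s, e in gaps
           for b in range(s // BLOCK_SIZE, -(-e // BLOCK_SIZE))}
assert covered == {2, 3}     # blocks 4 and 5 keep their cached K/V
```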
test_legolink_gap_huge_equals_full_recompute
Same prompt, no padding, 4 blocks (64 tokens). Five configurations
compared: FR, SPANS+no span_starts,
SPANS+span_starts=[0], LL-FULL+no span_starts,
LL-FULL+span_starts=[0]. All five must produce
bit-identical text and near-identical top-K logprobs - because none
of them actually does cross-prefix PIC reuse; they only exercise
different code paths that should all reduce to the same K/V.
The five modes intentionally cover every "should be a no-op" lever in the spans machinery: the env flag on with no span_starts; the entire prompt declared as one span (a PIC reset at block 0 is a no-op, since block 0's parent was already None); the gap policy with no span_starts to fire on; and the gap policy firing but covering the entire prompt, so the re-prefill sees the same context as a cold prefill. Output text is bit-exact and the top-1 token matches; lower-rank logprobs may drift by ~0.04 because the cold-prefill and gap-recompute code paths are not numerically identical.
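The comparison discipline above can be sketched as a small checker. The numbers and dictionary shape are made up for illustration; only the policy follows the text: text bit-exact, top-1 token exact, lower-rank logprobs allowed a small drift budget.

```python
# Toy checker: exact text and top-1 token, tolerance on lower-rank logprobs.
def compare(run_a, run_b, logprob_tol=0.05):
    assert run_a["text"] == run_b["text"]              # bit-exact output text
    for lp_a, lp_b in zip(run_a["top_logprobs"], run_b["top_logprobs"]):
        assert lp_a[0][0] == lp_b[0][0]                # top-1 token id matches
        for (t1, p1), (t2, p2) in zip(lp_a[1:], lp_b[1:]):
            assert abs(p1 - p2) <= logprob_tol         # lower ranks may drift

# Hypothetical outputs from two of the five configurations:
fr      = {"text": "ok", "top_logprobs": [[(7, -0.10), (3, -2.30)]]}
ll_full = {"text": "ok", "top_logprobs": [[(7, -0.10), (3, -2.33)]]}
compare(fr, ll_full)    # passes: same text, same top-1, 0.03 drift on rank 2
```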