Compressing the KV cache without retraining, and what mostly doesn't work
A long-context LLM in production spends most of its memory on the KV cache. Per request, per layer, per head: every prior token’s K and V vectors sit in memory waiting to be read by the next decode step. At 2k context a small model’s KV cache is the dominant working-memory consumer after the model weights. At 16k context on Qwen2.5-1.5B, the cache I measured was 881 MB per request — bigger than the model itself.
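A back-of-envelope check on that kind of figure, as a sketch only: the cache stores one K and one V vector per token, per layer, per KV head. The config values below (28 layers, 2 KV heads, head dimension 128) and the 4-byte element size are assumptions about the model and cache dtype, not numbers taken from the measurements in this post.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to hold K and V for `tokens` positions of one request."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V
    return tokens * per_token

# Assumed config (not stated in the post): 28 layers, 2 KV heads via
# grouped-query attention, head_dim 128, cache held in 4-byte floats.
ASSUMED = dict(layers=28, kv_heads=2, head_dim=128, bytes_per_elem=4)

for tokens in (4096, 16384):
    mib = kv_cache_bytes(tokens, **ASSUMED) / 2**20
    print(f"{tokens:>6} tokens: {mib:.0f} MiB")
```

Under these assumptions a 4,096-token context comes to 224 MiB and a full 16,384-token context to 896 MiB, which is in the same range as the measured figures. The useful part is the shape of the formula: the cost is linear in context length and in every architectural factor.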
So you want to compress it. The intuition is the same one CSA exploited: not every cached entry contributes meaningfully to the next-token output, so a lot of those bytes are slack. Throw most of them away, keep enough that the model can still answer, and your batch size, context length, and concurrent-request capacity all scale together.
The plan I started with was hierarchical: keep coarse summaries over the full context for awareness, plus exact token-level entries for whatever the question actually needs. Conceptually clean. In practice, most of the intuitive ways to pick which entries to keep don’t work — including some I’d have bet on. The thing that did work is structurally weirder than the recipe I expected to find.
This is what I learned across a couple of weeks of experiments on Qwen2.5-1.5B at 4–16k context.
What didn’t work
Three failure modes, all instructive.
Per-token importance, no matter how cleverly aggregated. The cheap signal you reach for first is “how much attention did this token receive during prefill?” You sum it, you take the max, you weight the tail higher. All variants have the same structural ceiling on multi-content tasks because the prefill-time signal has no idea which content the question will eventually target. A token can be important for some possible question and irrelevant for the actual one; the importance ranking averages across questions you’ll never be asked. On single-content tasks (one needle in the haystack), per-token importance is fine. On anything that retrieves over multiple stored entities, it hits a ceiling well below what a smarter approach should be able to reach.
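The three aggregations can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names, the `tail_decay` parameter, and the random attention map are all hypothetical. Note that nothing in the scoring takes the question as an input, which is the structural limit described above.

```python
import numpy as np

def token_importance(attn, mode="sum", tail_decay=0.99):
    """Score each key position from a prefill attention map.

    attn: array of shape (num_queries, num_keys); each row is a softmax output.
    """
    if mode == "sum":
        return attn.sum(axis=0)
    if mode == "max":
        return attn.max(axis=0)
    if mode == "tail":
        # Weight later query rows more heavily; the last row gets weight 1.
        n = attn.shape[0]
        weights = tail_decay ** np.arange(n - 1, -1, -1)
        return weights @ attn
    raise ValueError(f"unknown mode: {mode}")

def keep_top(scores, budget):
    """Indices of the `budget` highest-scoring positions, in original order."""
    keep = np.argpartition(scores, -budget)[-budget:]
    return np.sort(keep)

# Random causal attention map as stand-in data (hypothetical, not measured).
rng = np.random.default_rng(0)
n = 64
logits = rng.normal(size=(n, n))
logits[np.triu_indices(n, k=1)] = -np.inf  # causal mask: no attention to the future
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

kept = keep_top(token_importance(attn, "sum"), budget=16)
print(kept)
```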
Question-aware importance is actively worse. The intuitive fix is to make importance question-conditional: re-weight using the question tokens themselves as the importance signal. This should help — it directly addresses the “which content matters now” gap. It doesn’t. The question tokens encode a lot of structural signal (sentence framing, grammar, “what is the…” phrasing) and relatively little content signal. So the importance ranking picks up the framing tokens and drops the content digits of the answer.
The failure mode is specific and disturbing: crossover hallucinations. The model is asked about one entity in the context and returns an answer like AMBER-FOX-7261, with the prefix lifted from the entity it’s asked about and the digits silently pulled from a different fact entirely. The importance signal kept the entity name (high-attention, structurally framed by the question) and dropped the actual digit positions (low-attention, content-bearing), so at generation time the model filled the missing digits from whatever other content was still in the cache. Question-aware importance isn’t just neutral; it produces a class of confident, structurally-plausible, completely-wrong answers that the question-blind per-token version doesn’t have.
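A hand-built toy makes the eviction pattern concrete. Everything here is invented for illustration: the token labels, the digits, and the attention weights are hypothetical, chosen only to mimic a question row that attends to framing and entity-name tokens more than to digits.

```python
import numpy as np

def question_aware_importance(attn, question_rows):
    """Score key positions using only the attention rows of the question tokens."""
    return attn[question_rows].sum(axis=0)

# Hypothetical context tokens; the digits are made up for this sketch.
tokens = ["The", "code", "for", "AMBER", "FOX", "is", "4", "1", "9", "8"]

# One question-token row of attention over the context (weights sum to 1).
# Framing and name tokens get most of the mass; digit tokens get very little.
attn = np.array([[0.10, 0.12, 0.08, 0.25, 0.25, 0.10, 0.03, 0.03, 0.02, 0.02]])

scores = question_aware_importance(attn, question_rows=[0])
budget = 6
kept = np.sort(np.argsort(scores)[-budget:])
print([tokens[i] for i in kept])
```

With these made-up weights the retained set is the framing and the entity name, and every digit position is evicted. That is the precondition for a crossover: the name survives, the value does not.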
Static chunk routing without summaries. If the question is “what’s the value of X” and X is in chunk 3, the obvious thing is: route attention to chunk 3 and throw away the other chunks. Routing accuracy is high — the chunks containing the answer get picked correctly. The model still fails completely. On a multi-needle eval where summary-routing-at-the-same-compression-ratio gets 5/6, naive chunk routing gets 0/6. The cache is right; the model can’t use it.
The mechanism is that an attention layer trained on continuous K/V positions expects continuous K/V positions. When chunks get dropped, what remains has gaps — a few hundred token positions, then nothing, then a few hundred more. The model’s positional reasoning treats this as “everything in between is recent context that just happens to have low attention” — and produces a recency bias on the wrong things. Routing the right chunks isn’t enough; you have to fill the gaps with something the model recognizes as “stuff that’s there.”
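To see what the gaps look like in position space, here is a small sketch. The chunk size of 256 and the routed chunk ids are hypothetical values picked for the example, and the helper names are not from any library.

```python
import numpy as np

def routed_positions(seq_len, chunk_size, chunk_ids):
    """Token positions that survive when only the chunks in `chunk_ids` are kept."""
    spans = [
        np.arange(c * chunk_size, min((c + 1) * chunk_size, seq_len))
        for c in sorted(chunk_ids)
    ]
    return np.concatenate(spans)

def position_gaps(positions):
    """(first_missing, last_missing) ranges that lie between kept positions."""
    jumps = np.flatnonzero(np.diff(positions) > 1)
    return [(int(positions[i]) + 1, int(positions[i + 1]) - 1) for i in jumps]

kept = routed_positions(seq_len=4096, chunk_size=256, chunk_ids=[3, 11])
print(len(kept), position_gaps(kept))
```

Keeping chunks 3 and 11 leaves 512 exact tokens with nothing at all between positions 1024 and 2815, plus the dropped spans before the first and after the last kept chunk, which this helper does not report.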
What did work — and why it works isn’t intuitive
The thing that worked is keeping per-chunk summaries for the entire context, plus routed exact-token attention to the chunks the question actually needs. The summary primitive is the simplest possible: mean-pool of K and V vectors within the chunk (after RoPE, since we want the summary to live in a comparable embedding space to the exact tokens).
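A minimal sketch of that cache layout for a single head, under stated assumptions: the function name, the chunk size, and the choice to place summaries first and exact tokens after them are all hypothetical, and the chunk routing itself is taken as given. Because the keys are pooled after RoPE, the order of entries in the compressed cache does not affect the attention output at decode time.

```python
import numpy as np

def compress_kv(k, v, chunk_size, routed_chunks):
    """Build a compressed cache for one head.

    k, v: arrays of shape (seq_len, head_dim); k is assumed to be post-RoPE.
    Returns (k_small, v_small): one mean-pooled entry per chunk, followed by
    the exact entries of every chunk listed in `routed_chunks`.
    """
    seq_len = k.shape[0]
    starts = range(0, seq_len, chunk_size)
    k_sum = np.stack([k[s:s + chunk_size].mean(axis=0) for s in starts])
    v_sum = np.stack([v[s:s + chunk_size].mean(axis=0) for s in starts])
    spans = [
        np.arange(c * chunk_size, min((c + 1) * chunk_size, seq_len))
        for c in sorted(routed_chunks)
    ]
    exact = np.concatenate(spans) if spans else np.arange(0)
    return np.concatenate([k_sum, k[exact]]), np.concatenate([v_sum, v[exact]])

# Stand-in data: random K and V for a 4096-token context (not real activations).
rng = np.random.default_rng(0)
k = rng.normal(size=(4096, 128))
v = rng.normal(size=(4096, 128))
k_small, v_small = compress_kv(k, v, chunk_size=256, routed_chunks=[3])
print(k.shape[0], "->", k_small.shape[0], "entries")
```

In this toy setup, 16 summaries plus one routed chunk of 256 exact tokens gives 272 entries in place of 4096. The ratios in the experiments depend on the actual chunking and routing settings, which this sketch does not try to reproduce.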
The numbers, on Qwen2.5-1.5B:
| Task | Context | Approach | KV cache | Reduction | Match |
|---|---|---|---|---|---|
| single needle | 16k | baseline | 881 MB | 1× | 2/3 |
| single needle | 16k | summary k=3 r=1024 | 87 MB | 10× | 2/3 (parity) |
| multi-hop | 4k | baseline | 224 MB | 1× | 6/6 |
| multi-hop | 4k | summary k=3 r=1024 | 67 MB | 3.4× | 6/6 (parity) |
| multi-needle | 4k | baseline | 223 MB | 1× | 6/6 |
| multi-needle | 4k | summary k=1 r=256 | 22 MB | 10× | 5/6 (one failure) |
10× cache reduction on single needle with zero accuracy loss. 3.4× on a genuinely compositional multi-hop task with zero accuracy loss. 10× on multi-needle with one failure that the importance signal can’t fix at any retention level — a noise-floor effect specific to a single problematic prompt arrangement.
The interesting part is the mechanism. The textbook framing of summaries is that they preserve information at lower fidelity — mean-pooled K and V are a lossy compression that carries semantic content. The model attends to the summary and retrieves an approximate version of what’s there.
That isn’t what happens. Direct measurement: the Q·K similarity between a query and a mean-pooled summary is weaker than the Q·K similarity between the query and the original exact tokens, because mean-pooling reduces the norm. The summary doesn’t carry the content well; it carries it badly. If summaries worked by preserving information, they wouldn’t work.
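The norm argument can be checked on stand-in data. Two facts hold for any keys: the score of a mean-pooled key is exactly the average of the per-token scores, so it can never exceed the best token's score, and the norm of the mean is at most the mean of the norms. The random Gaussian keys below are an assumption made for the sketch; real keys are correlated, so the amount of shrinkage will differ.

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim, chunk_size = 128, 256

# Stand-in for one chunk's post-RoPE keys and one decode query (not real activations).
keys = rng.normal(size=(chunk_size, head_dim))
query = rng.normal(size=head_dim)

summary = keys.mean(axis=0)          # mean-pooled summary key
token_scores = keys @ query          # Q.K against every exact token
summary_score = summary @ query      # Q.K against the summary

print("mean token norm: ", np.linalg.norm(keys, axis=1).mean())
print("summary norm:    ", np.linalg.norm(summary))
print("best token score:", token_scores.max())
print("summary score:   ", summary_score)
```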
What they actually do is prevent the gap-induced recency bias that destroys static chunk routing. Summaries-everywhere give the attention layer a continuous-feeling cache. The model treats the summaries as background context flow — low-attention, but present at every position — and uses the routed exact tokens for the content recall it actually needs. The summary is structural scaffolding, not a content store. The exact-token routing does the real retrieval. Both are required; neither works alone.
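One way to probe the "low-attention, but present" claim on a real cache is to measure how the softmax mass splits between summary entries and exact entries for a given query. The helper below is a hypothetical diagnostic, not part of the experiments, and it assumes a layout where the summary rows come first. The random inputs only show that it runs; the numbers they produce say nothing about a real model.

```python
import numpy as np

def attention_mass_split(query, k_small, num_summaries):
    """Fraction of softmax attention landing on summary vs exact entries.

    Assumes the first `num_summaries` rows of k_small are chunk summaries and
    the remaining rows are exact tokens (a layout chosen for this sketch).
    """
    scores = k_small @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()
    on_summaries = float(weights[:num_summaries].sum())
    return on_summaries, 1.0 - on_summaries

# Stand-in data: 16 summary rows followed by 256 exact rows.
rng = np.random.default_rng(0)
k_small = rng.normal(size=(272, 128))
query = rng.normal(size=128)

on_summaries, on_exact = attention_mass_split(query, k_small, num_summaries=16)
print(f"summaries: {on_summaries:.3f}  exact: {on_exact:.3f}")
```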
I didn’t predict this when I started. I expected to be tuning the summary primitive: would mean-pool work, or would I need something smarter, or a learned compressor? The answer turned out to be that the summary primitive almost doesn’t matter, because carrying content isn’t its job; it just has to occupy the position. The expensive intuition was wrong; the right mechanism was nearly free.
When this earns its keep
Three guardrails on the result.
It’s on a 1.5B model, on synthetic tasks. Single-needle, multi-needle, and multi-hop are intentionally clean — they have known ground truth, known compositional structure, and known failure modes. Real long-context benchmarks (RULER, LongBench) might reveal failures the synthetics don’t catch. The 10× number won’t survive intact at every model and every task; it’s a starting hypothesis for production-shaped evaluations, not a finished claim.
It’s prefill-time compression. All compression decisions happen once after prefill, based on a single importance pass over the full context. The cache is then fixed for decode. The literature’s strongest contemporary work (StreamingLLM, H2O, Scissorhands) operates at decode time instead, revising what stays in the cache as each token is generated. I’d expect decode-time routing, re-routing per generated token based on the running query, to close the remaining accuracy gap on the failure cases (multi-needle’s one stuck failure, for instance), at the cost of a more invasive implementation. The static prefill-time recipe is the cheap version that already lands at 10×.
The mechanism finding is what generalizes. The exact compression ratios are about the model and tasks tested. The mechanism — that summaries-everywhere work by preventing gap-induced recency bias rather than by preserving content — should be a property of any model whose attention assumes continuous K/V positions. Which is all of them. If you’re designing a sparse KV scheme and you’re spending effort on better summary primitives, this suggests the marginal effort is mostly wasted — the dumb summary works as well as anything fancier, because it isn’t really storing content.
The pattern
This is the third post in a row where the unexpected finding is the same shape: the cheap static recipe works as well as anything more elaborate, and the win-by-routing intuition turns out to under-deliver. CSA found a learned per-query router matched a static per-(layer, head) allocation rule. Projection sparsity found a learned per-token router lost to static iterative-greedy head pruning. Here, per-token and question-aware importance signals get beaten — or hallucinate badly — relative to a flat summary-everywhere approach that doesn’t try to be selective about which content to preserve, only about whether to keep continuity at all.
I don’t think this means cleverness in routing is permanently dead. It might mean the model scales tested are too small for routing overhead to pay off, or the training budgets too short, or the architectures too thin. But it does mean the burden of proof has shifted: a routing-style intervention should now have to beat its static baseline directly, not just demonstrate that it learned something. The static recipes are stronger than the recent architecture literature suggests, and the right reflex on a new sparsity proposal is to ask what the boring static version looks like before you decide whether the clever version is doing real work.