90% of attention is slack in pretrained transformers
Run a long-context LLM for a while and you’ll notice something: the longer the conversation gets, the slower each new token comes back. That slowness is mostly one thing — attention. Every new token the model generates has to read every prior token in its cache, across every head and every layer. At long context, that reading is the dominant cost of generation.
So a question worth asking: how much of it does the model actually need?
If you look at what attention is doing inside the model — the probabilities the softmax produces, head by head, layer by layer — the answer looks suspiciously low. On most heads, on most queries, the attention mass concentrates on a small handful of keys. A dozen or so out of however many are in the context. The model is reading every key, computing a score for every one, and most of those scores end up near-zero by the time the softmax shrinks them.
So the obvious experiment: just don’t compute the parts the model wasn’t going to use anyway. Take a pretrained transformer, run it normally, but per (layer, head, query) keep only the top-k keys and zero out the rest. See what happens to perplexity — the standard language-modeling quality metric.
What happens is that across every model and corpus I’ve tested in the 100M–3B range, there’s a per-(layer, head) budget, totalling roughly 10% of the dense reads or less, at which perplexity barely moves. Not just on average: uniformly across every position in the context window. I’m calling the procedure that finds and applies that budget Calibrated Sparse Attention (CSA).
| Model | Context | Corpus | Dense ppl | % of dense reads | Per-position max Δ |
|---|---|---|---|---|---|
| GPT-2 small (124M) | 1024 | WikiText | 26.49 | 5.7% | ±1.2 ppl |
| Pythia 410M | 1024 | WikiText | 16.26 | 9.6% | +0.13 ppl |
| Pythia 410M | 2048 | WikiText | 15.27 | 9.3% | ±0.07 ppl |
| Pythia 410M | 1024 | Code | 3.78 | 9.4% | ±0.01 ppl |
| Qwen 2.5 1.5B | 1024 | WikiText | 8.76 | 9.9% | ±0.02 ppl |
| Llama 3.2 1B | 1024 | WikiText | 9.71 | 9.4% | ±0.04 ppl |
| Llama 3.2 1B | 2048 | WikiText | 9.12 | 8.9% | ±0.06 ppl |
| Llama 3.2 3B | 1024 | WikiText | 7.79 | 9.6% | ±0.04 ppl |
Four model families. Five scales. Two context lengths. Two corpora. The band sits at 5.7–9.9% across all of them. No fine-tuning, no labels, no gradients, no architecture changes. Source: github.com/jnormore/csa.
The rest of this post is what the procedure is, why it works, what was surprising while running it, and what the headline number does and doesn’t translate to in practice.
Where existing work doesn’t reach
Sparse attention isn’t new. The work I’m familiar with splits into two camps:
- Sparse-from-pretraining. BigBird, Longformer, Mistral’s sliding-window attention, DeepSeek’s Native Sparse Attention. These bake a fixed sparse pattern into the model at training time. Strong results, but the model has to be trained to expect the sparse pattern. You can’t drop these onto a densely trained Llama or Qwen and expect quality to hold.
- KV-cache eviction. StreamingLLM, H2O, Scissorhands. These run on top of an unchanged dense model and decide which keys to drop from the cache. They reduce memory and bandwidth by shrinking what stays cached; they don’t reduce per-query attention compute against the keys that do stay.
There’s a third cut that neither addresses: take a densely trained model, leave the cache alone, and just don’t compute most of the per-query attention. That’s the slot CSA fills.
The premise is that a densely trained model already concentrates most of its attention mass on a small subset of keys per (layer, head, query), so the dense computation does a lot of work to produce probabilities that are arithmetically near-zero once they come out of the softmax. If that’s right, those near-zero terms are computational slack, and you can skip them without changing what the model does.
The non-obvious part is figuring out which terms to skip, cheaply, without per-query training.
The procedure
CSA is four steps.
1. Calibrate. Run the dense model on a small representative sample — 8k to 50k tokens is plenty — and record softmaxed attention probabilities from every (layer, head). For each (layer, head), compute the mean effective rank at threshold p: the smallest integer k such that the top-k attention probabilities sum to at least p (we use 0.9), averaged across query positions. The result is a table effective_rank[L][H] — one number per attention head.
This is the diffuseness signal. A head whose attention concentrates on a handful of keys has a small effective rank. A head that spreads attention broadly has a large one. The variance across heads matters more than the absolute numbers — most heads cluster around 30–80 keys, but a few diffuse heads need 150+.
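A minimal NumPy sketch of the effective-rank measurement for a single head, assuming the attention probabilities have already been collected into a (queries × keys) array. The function names and the small floating-point tolerance are illustrative choices, not taken from the CSA repository.

```python
import numpy as np

def effective_rank(probs: np.ndarray, p: float = 0.9) -> np.ndarray:
    """Per query row: smallest k such that the top-k probabilities sum to >= p.

    probs has shape (num_queries, num_keys); each row is a softmax output,
    with masked (future) keys carrying probability 0.
    """
    # Sort each row from largest to smallest and accumulate the mass.
    sorted_desc = -np.sort(-probs, axis=-1)
    cum = np.cumsum(sorted_desc, axis=-1)
    # Number of prefixes still below p, plus one, is the smallest k reaching p.
    # The 1e-12 slack guards against float round-off right at the threshold.
    k = (cum < p - 1e-12).sum(axis=-1) + 1
    return np.minimum(k, probs.shape[-1])

def mean_effective_rank(probs: np.ndarray, p: float = 0.9) -> float:
    """Average effective rank over query positions: one table entry."""
    return float(effective_rank(probs, p).mean())
```

Running `mean_effective_rank` once per (layer, head) over the calibration sample would fill the table described above.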
2. Allocate. Pick a target per-query total budget B. Allocate B across the L × H cells in proportion to effective_rank[l][h], with a per-cell floor (we use 2) so peaked heads don’t get starved to zero. Largest-remainder apportionment keeps the integer sum exact. The result is a table k[L][H].
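One way to sketch the allocation step in NumPy. How the floor interacts with the proportional share is not spelled out here, so this version makes an assumption: every cell gets the floor first, and only the remaining budget is split in proportion to effective rank. `allocate_budget` is a hypothetical name.

```python
import numpy as np

def allocate_budget(eff_rank: np.ndarray, budget: int, floor: int = 2) -> np.ndarray:
    """Split an integer budget across (layer, head) cells.

    Assumption: each cell receives `floor` up front, and the rest is shared
    in proportion to eff_rank using largest-remainder rounding.
    """
    weights = np.asarray(eff_rank, dtype=np.float64)
    n = weights.size
    if budget < floor * n:
        raise ValueError("budget is too small to give every cell its floor")
    remaining = budget - floor * n
    # Ideal (fractional) share of the remaining budget per cell.
    ideal = weights.ravel() / weights.sum() * remaining
    base = np.floor(ideal).astype(np.int64)
    leftover = remaining - int(base.sum())
    # Largest remainder: give the leftover units to the biggest fractional parts.
    order = np.argsort(-(ideal - base), kind="stable")
    base[order[:leftover]] += 1
    return (base + floor).reshape(weights.shape)
```

The integer sum equals `budget` exactly, and no cell drops below the floor.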
3. Cap-sweep. Sweep a small set of caps K_max ∈ {32, 64, 128, 256, 512}. For each, evaluate against dense on a held-out sample — both aggregate perplexity and per-position perplexity. Pick the smallest cap where both are within tolerance.
The per-position check is the part that matters most and is the thing aggregate-only evaluations miss. More on this below.
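The selection rule in the cap-sweep can be sketched as a small function over already-measured results. The data layout (a dict from cap to aggregate and per-bin perplexity) and the tolerance parameters are assumptions made for illustration; the numbers in the usage lines are toy values, not measurements.

```python
def pick_cap(results, dense_ppl, dense_bins, agg_tol, bin_tol):
    """Return the smallest cap whose aggregate AND worst-bin gaps are in tolerance.

    results: {cap: (aggregate_ppl, [ppl per position bin])}
    dense_ppl / dense_bins: the same quantities for the dense model.
    Returns None if no cap in the sweep qualifies.
    """
    for cap in sorted(results):
        agg, bins = results[cap]
        agg_gap = abs(agg - dense_ppl)
        worst_bin_gap = max(abs(s - d) for s, d in zip(bins, dense_bins))
        if agg_gap <= agg_tol and worst_bin_gap <= bin_tol:
            return cap
    return None

# Toy numbers for illustration only.
toy = {
    32: (20.50, [20.1, 25.0]),
    64: (20.05, [20.0, 20.3]),   # fine on aggregate, fails on the last bin
    128: (20.01, [20.0, 20.05]),
}
chosen = pick_cap(toy, dense_ppl=20.0, dense_bins=[20.0, 20.0],
                  agg_tol=0.1, bin_tol=0.1)
```

In the toy data, cap 64 passes the aggregate check but fails on the last bin, so the rule moves on to 128.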
4. Apply. In each attention layer at inference, compute Q · K^T as usual, then mask all but the top-k(l, h, K_max) logits per query, where k(l, h, K_max) = min(k[l][h], K_max). Renormalize via softmax. Proceed normally.
The deployed policy is L · H + 1 integers — 145 for GPT-2 small, 385 for Pythia 410M. Tiny JSON file. The calibration runs in tens of seconds; the cap-sweep takes minutes.
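For concreteness, here is a single-head NumPy sketch of the apply step: compute all logits, keep the top min(k, K_max) per query, and softmax over what remains. It mirrors the mask-after-scoring approach, so it saves no K reads. The function name, the causal mask, and the 1/sqrt(d) scaling are standard choices assumed here rather than copied from the repository.

```python
import numpy as np

def sparse_attention_head(q, k, v, k_budget, k_max):
    """Causal single-head attention keeping only the top logits per query.

    q, k, v: arrays of shape (T, d). Returns (output, attention_weights).
    """
    T, d = q.shape
    keep = min(k_budget, k_max)          # k(l, h, K_max) = min(k[l][h], K_max)
    logits = q @ k.T / np.sqrt(d)
    # Causal mask: query i may only see keys 0..i.
    causal = np.tril(np.ones((T, T), dtype=bool))
    logits = np.where(causal, logits, -np.inf)
    if keep < T:
        # The keep-th largest logit in each row acts as the threshold.
        # Rows with fewer than `keep` visible keys get a -inf threshold and
        # therefore keep everything they can see. Exact ties may keep extras.
        kth = np.partition(logits, T - keep, axis=-1)[:, T - keep][:, None]
        logits = np.where(logits >= kth, logits, -np.inf)
    # Softmax over the surviving logits (masked entries become exactly 0).
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

A kernel that realizes the savings would have to select key positions before fetching K and V; this sketch only shows the arithmetic.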
Per-position is the metric that matters
Aggregate perplexity is necessary but not sufficient. The failure mode of aggressive sparse attention is concentrated at the long-range positions: the ones where you actually need a lot of context, which is also the use case that justifies the work in the first place. A policy that preserves average perplexity but degrades the last few position bins by 50 ppl is “good on aggregate” and useless in practice.
At the dense-matching cap, CSA preserves quality uniformly across the in-window position bins — the longest-position bin is within ±0.13 ppl of dense on every configuration except GPT-2 small (which is within ±1.2 ppl). Below the dense-matching cap, position-conditional degradation snaps on immediately. On GPT-2 small at 3.5% of dense reads, the mid-window positions are within ~3 ppl of dense; the last bin (positions 945–1008) is +19 ppl. At 2.0% of dense reads, the last bin is +99 ppl.
This is why the cap-sweep step is per-position rather than aggregate. A policy chosen on aggregate perplexity alone would happily sit below the dense-matching cap and silently hand back garbage on long-range queries.
The methodological point: sparse-attention work that reports only aggregate perplexity is reporting half the story. The cheap fix is to bin by in-window position and look at the worst bin.
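The binning check takes only a few lines. This sketch assumes you have per-token negative log-likelihoods (in nats) for both the dense and sparse runs, shaped (sequences × positions). The bin width of 64 and the function names are illustrative assumptions.

```python
import numpy as np

def per_bin_ppl(nll: np.ndarray, bin_width: int = 64) -> np.ndarray:
    """Perplexity per in-window position bin.

    nll: (num_sequences, T) per-token negative log-likelihood in nats.
    Trailing positions that do not fill a whole bin are dropped.
    """
    n_seq, T = nll.shape
    n_bins = T // bin_width
    trimmed = nll[:, : n_bins * bin_width]
    # Mean NLL over sequences and over positions inside each bin.
    mean_nll = trimmed.reshape(n_seq, n_bins, bin_width).mean(axis=(0, 2))
    return np.exp(mean_nll)

def worst_bin_delta(nll_sparse, nll_dense, bin_width: int = 64):
    """Index of the bin with the largest |sparse - dense| ppl gap, and the gap."""
    delta = per_bin_ppl(nll_sparse, bin_width) - per_bin_ppl(nll_dense, bin_width)
    worst = int(np.argmax(np.abs(delta)))
    return worst, float(delta[worst])
```

Reporting the worst-bin gap next to the aggregate number is enough to expose a policy that only fails at long range.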
The surprising finding: per-query routing doesn’t help
The natural next question is whether you can do better than a static per-(layer, head) allocation by routing per query — different queries surely need different budgets, right?
To test this I trained a small MLP — about 35k parameters — to predict per-(layer, head, query) oracle k* from the query vector, its norm, position, and a learned (layer, head) embedding. The router achieves validation Pearson correlation 0.92 on log(k*). Looks great.
Then the ablation: zero out the query vector and the norm, leaving only (layer, head, position). The router still gets Pearson 0.905. Query content adds under 0.02 Pearson (0.92 vs. 0.905) over the static prior.
Plugged in as a per-query budget, the trained router produces perplexity statistically indistinguishable from CSA’s static rule, across cap settings from 32 to 512. It matches CSA. It does not beat CSA.
The implication is the part I keep coming back to: at these model scales, attention bandwidth is essentially a static structural property of the (layer, head, position) triple. Per-query content adds vanishingly little signal. The model’s heads have a budget that’s set during training and stays approximately constant across whatever the query happens to be.
That’s not what I expected going in. I expected query content to matter, and the only question to be how cheaply you could predict it. Instead the predictability comes from somewhere much simpler.
Cross-cutting holds, including across corpora
Three observations across the table at the top of this post are worth pulling out:
Doubling context doesn’t move the fraction. Pythia 410M at 1024 lands at 9.6%; at 2048 it lands at 9.3%. Same model, same training, twice the context. The kept reads, and therefore the slack, scale linearly with context, which is the structurally clean outcome: the fraction is a property of the per-(layer, head) attention behavior, not of any particular context length.
Switching corpus type doesn’t move the fraction. Same Pythia 410M model, switching from WikiText prose to Python source code: dense perplexity goes from 16.26 to 3.78 (code is way more predictable). The dense-matching fraction goes from 9.6% to 9.4%. The per-position bound actually tightens on code (±0.01 vs +0.13). Whatever sets the budget is the model, not the calibration corpus.
Switching family and scale doesn’t move the fraction much. Llama 3.2 1B → 3B (3× parameters within the same family): 9.4% → 9.6%. Qwen 2.5 1.5B with grouped-query attention: 9.9%. The band stays narrow across model families, scales, and attention architectures.
Three coordinates that you might think would matter — context, data distribution, model — and none of them move the budget fraction meaningfully. That suggests there’s a structural reason for the ~10% number, not a coincidence-of-models reason. I don’t have a clean theoretical claim for what that reason is. It’s probably worth more attention than I’ve given it.
Resource accounting
The above is a measurement of attention reads. The interesting question is how that translates to actual cost reduction. Attention is one component of model compute; FFN is the other. The headline 10× attention FLOPs reduction does not become a 10× end-to-end speedup.
The Amdahl math, for Pythia 410M with d=1024 at the two context lengths tested:
| Context | Attention fraction (decode) | Speedup at 10× attention reduction |
|---|---|---|
| 1024 | 20% | 1.22× |
| 2048 | 33% | 1.43× |
That’s the decode case, which is what dominates generation past a few thousand tokens. Prefill speedups are smaller (attention fraction is lower because each prefill token sees less context on average than each decode token). End-to-end gains improve as context grows because attention’s per-token cost grows linearly with T while FFN cost stays constant, so attention’s share of total compute keeps rising.
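The speedup column follows from Amdahl's law: if attention is a fraction f of per-token compute and gets r times cheaper, the end-to-end speedup is 1 / ((1 − f) + f / r). A quick check in Python, taking the attention fractions from the table as given (the 33% entry is treated as 1/3, which is what reproduces 1.43×):

```python
def end_to_end_speedup(attn_fraction: float, attn_reduction: float) -> float:
    """Amdahl's law: only the attention share of compute gets cheaper."""
    return 1.0 / ((1.0 - attn_fraction) + attn_fraction / attn_reduction)

for context, fraction in [(1024, 0.20), (2048, 1.0 / 3.0)]:
    speedup = end_to_end_speedup(fraction, attn_reduction=10.0)
    print(f"context {context}: {speedup:.2f}x")
# prints:
# context 1024: 1.22x
# context 2048: 1.43x
```

Even with attention made free (r → ∞), the ceiling is 1 / (1 − f): 1.25× at 20% and 1.5× at one third.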
The other thing the headline number doesn’t realize directly: KV-cache memory. CSA reads sparsely per query but doesn’t evict — the cache stays full. The natural composition is a permanent eviction policy at the per-(layer, head) cap, which would shrink memory by the same factor. The numbers above are attention compute; the KV-memory win is gated on actually evicting.
And neither number is realized today by the patched eager-attention implementation that runs the experiments. The eager path reads all of K to compute the score matrix, then masks afterward — only V reads are saved. The full reduction needs a Triton or FlashAttention block-sparse kernel that selects K and V positions before fetching them. That’s straightforward engineering work that wasn’t the point of this round.
Where this fits
CSA’s specific bet is: keep the model, keep the cache, just don’t compute the parts of attention the model wasn’t going to use anyway. That bet stacks with the other things people do for inference cost — KV quantization, speculative decoding, permanent eviction policies, sparse kernels. It doesn’t compete with any of them.
The thing I find most interesting is the routing finding above: the predictable structure in attention is dominated by static per-(layer, head) properties, not per-query routing. If that holds at larger scale, a lot of the architectural complexity people are designing into adaptive attention mechanisms — sophisticated routers, learned gating, mixture-of-attention — is solving a problem that may not exist at the resolution they assume. Most of the win is available from a 145-integer policy you compute once.
There’s a natural follow-up question: if attention reads have an order of magnitude of slack, what about the projection machinery that produces Q, K, V in the first place? Same model, same densely trained heads — but the W_Q, W_K, W_V, W_O matrices might be similarly overallocated. I tried that. It went less well. That’s the next post.
Source for CSA: github.com/jnormore/csa. The whitepaper has the full per-experiment writeup and resource analysis.