Searching for tiny-cache LLMs

TL;DR: A genetic algorithm is the right tool when search-space choices interact and you don’t know which combination works. I pointed one at the LLM architecture space (40 genes covering attention type, recurrence type, block patterns, and layer counts; 200 evaluations) to test whether any hybrid configuration closes the long-context recall gap to a plain transformer while shrinking the KV cache, the per-token attention state that dominates LLM inference memory at long context. The motivating bet: replace some attention layers with Mamba (a state-space layer whose hidden state stays fixed-size instead of growing) and trade a bit of quality for a much smaller cache. The result: no hybrid in the searched space matches plain attention on recall at this training scale. The best one cuts cache 366× (607 MB → 1.66 MB at 16k tokens of context) for 1.2% worse pretraining loss, but retrieves planted facts at about 40% of the reliability of a plain transformer. For cache-bound serving, picking a point on the curve is fine; for retrieval-heavy work, plain attention still wins.

A 152M-parameter GPT-2-small at 16k tokens of context uses 607 MB on its KV cache — the per-token attention state every transformer accumulates as it reads tokens, so that the next token can attend back to them. The model’s own weights are 305 MB. At 65k tokens the cache crosses 2.4 GB. The cache grows linearly in context length (L, in the rest of the post); the weights stay flat. For long-context serving, you’re not paying for a model that’s a few hundred MB — you’re paying for one whose memory footprint is dominated by cache, and that cost compounds with concurrent users.
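To make the "compounds with concurrent users" point concrete, here is a rough sketch of the arithmetic. The 305 MB and 607 MB figures are the ones quoted above; the function name, the batch sizes, and the simplification that serving memory is just weights plus one cache per sequence (no activations, no allocator overhead) are assumptions for illustration.

```python
WEIGHTS_MB = 305        # GPT-2-small weights, as quoted in the post
CACHE_MB_AT_16K = 607   # KV cache per sequence at 16k tokens, as quoted in the post

def serving_memory_mb(concurrent_sequences, weights_mb=WEIGHTS_MB,
                      cache_mb=CACHE_MB_AT_16K):
    """Weights are loaded once; every concurrent sequence carries its own cache."""
    return weights_mb + concurrent_sequences * cache_mb

# Hypothetical batch sizes, chosen only to show the trend.
for users in (1, 8, 64):
    total = serving_memory_mb(users)
    cache_share = 100 * (total - WEIGHTS_MB) / total
    print(f"{users:>3} sequences: {total:>6} MB total, {cache_share:.0f}% of it cache")
```

Even at a single sequence the cache is already the larger term, and it only becomes more dominant as concurrency rises.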

There are a lot of ways to compress that cache after training. I’ve written about one (compressing the KV cache without retraining). What I wanted to test in this post is the upstream version: can you train an architecture that doesn’t produce most of the cache in the first place?

Going in, I figured: probably not. Hybrid architectures — attention mixed with cheaper non-attention layers like Mamba, RWKV, or RetNet, all of which keep a fixed-size hidden state instead of an ever-growing per-token cache — usually look fine on aggregate benchmarks but lose on direct long-context retrieval (finding a specific fact buried somewhere in the context). That’s the pattern in every hybrid I’ve trained — competitive on aggregate, behind on retrieval. The takeaway has been “you need attention for recall.” But that takeaway comes from a handful of hand-picked architectures. The hybrid space is much bigger than what anyone’s tried — block patterns, attention kinds, recurrence kinds, GQA ratios, FFN choices, all interacting. The question I wanted to test is whether that takeaway still holds when you actually search the space rather than sample it.

A genetic algorithm is the right tool for this. The gene choices interact — window-attention works with Mamba differently than full attention does, alternating-block patterns behave differently from bookend ones — so hill-climbing rejects useful combinations because each piece looks neutral in isolation. Population search holds many candidates in mind at once and crossover assembles configurations no sequential trace would build. Same shape as cevolve on code-optimization choices.

The short answer the search returned: it still holds. No hybrid in the space I searched closes the recall gap to full attention. The longer answer is the Pareto curve the search traced on the way there.

How the search is set up

The gene vocabulary is 40 values. d_model in {384, 512, 768, 1024}. n_layers in {8, 12, 16, 20, 24}. Per-block attention kind in {full, local-window, MLA, none}. Per-block recurrence kind in {Mamba, none}. Block-mixing pattern in {uniform, alternating, attention-bookend, state-then-attention, attention-then-state}. Usual hyperparams underneath: FFN ratio, activation, RoPE / NoPE / ALiBi, GQA ratio, LayerScale, drop-path. A tier-0 filter rejects candidates above 500M params or 2 GB of cache at L=16k before they ever get trained.
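As a rough illustration of what a genome and the tier-0 filter could look like, here is a minimal sketch. The gene choices are copied from the paragraph above; everything else is assumed. In particular, this version uses one attention kind and one recurrence kind per genome rather than per block, leaves out the hyperparameter genes, treats 2 GB as 2,000 MB, and takes parameter count and cache size as precomputed inputs. The names `GENE_SPACE`, `sample_genome`, and `passes_tier0` are hypothetical.

```python
import random

# Choices listed in the post; hyperparameter genes (FFN ratio, activation,
# positional encoding, GQA ratio, LayerScale, drop-path) are omitted here.
GENE_SPACE = {
    "d_model": [384, 512, 768, 1024],
    "n_layers": [8, 12, 16, 20, 24],
    "attention": ["full", "local-window", "MLA", "none"],
    "recurrence": ["Mamba", "none"],
    "pattern": ["uniform", "alternating", "attention-bookend",
                "state-then-attention", "attention-then-state"],
}

MAX_PARAMS = 500_000_000    # tier-0 limit from the post
MAX_CACHE_MB_16K = 2_000    # 2 GB at L=16k, assuming decimal megabytes

def sample_genome(rng):
    """Draw one value per gene, uniformly at random."""
    return {gene: rng.choice(values) for gene, values in GENE_SPACE.items()}

def passes_tier0(n_params, cache_mb_16k):
    """Reject candidates that are too large before any training happens."""
    return n_params <= MAX_PARAMS and cache_mb_16k <= MAX_CACHE_MB_16K

genome = sample_genome(random.Random(0))
```

The point of the filter is cost: a candidate that fails it never consumes a training run.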

Each evaluation in the population trains 500M tokens of FineWeb-Edu at seq_len=2048 on a single H100 NVL pod and reports val_bpb, HellaSwag, ARC, and a small needle-in-haystack probe at L=2k. Fitness combines bpb, recall, log cache MB, and log decode throughput. There’s a soft structural penalty on architectures with zero attention layers — at this training scale the needle probe is mostly noise, and a model with no attention isn’t going to do recall whatever the measurement says. The penalty does the gating the measurement can’t.
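The post does not give the actual fitness formula, so the following is only a sketch of one plausible shape: a weighted sum of the four terms named above, plus a fixed deduction when a candidate has no attention layers. Every weight and the penalty size are placeholders I made up for illustration, and the function name is hypothetical.

```python
import math

def fitness(val_bpb, recall, cache_mb, decode_tok_per_s, n_attention_layers,
            w_bpb=1.0, w_recall=1.0, w_cache=0.1, w_speed=0.1,
            no_attention_penalty=0.5):
    """Higher is better. All weights are illustrative, not the post's values."""
    score = (
        -w_bpb * val_bpb                        # lower loss is better
        + w_recall * recall                     # higher needle recall is better
        - w_cache * math.log(cache_mb)          # smaller cache is better
        + w_speed * math.log(decode_tok_per_s)  # faster decoding is better
    )
    if n_attention_layers == 0:
        # Soft structural penalty: gates pure-recurrent candidates even when
        # the noisy needle probe would not.
        score -= no_attention_penalty
    return score
```

Using logs for cache and throughput means a 10× change counts the same anywhere on the scale, which suits quantities that span orders of magnitude.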

The search itself runs 200 evaluations on H100 NVL pods, with tournament-3 selection, 2 elites, and steady-state replacement. No LLM in the proposer loop — the gene vocabulary is small and discrete enough that crossover does the imaginative work just fine. Each generation, the GA selects parents from the current population, builds children by combining genes, and dispatches them to workers. The Pareto front is updated as new evaluations land.
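The loop described above can be sketched in a few functions. This is one reading of "tournament-3 selection, 2 elites, steady-state replacement", not the code that ran the search: uniform crossover, the 10% mutation rate, and replacing a random non-elite member are all assumptions, and `evaluate` stands in for the training run that produces a fitness score.

```python
import random

def tournament(population, rng, k=3):
    """Return the fittest of k randomly drawn (genome, fitness) pairs."""
    return max(rng.sample(population, k), key=lambda ind: ind[1])

def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one parent at random."""
    return {gene: rng.choice([a[gene], b[gene]]) for gene in a}

def mutate(genome, gene_space, rng, rate=0.1):
    """Resample each gene with probability `rate` (an assumed value)."""
    return {g: (rng.choice(gene_space[g]) if rng.random() < rate else v)
            for g, v in genome.items()}

def steady_state_step(population, gene_space, evaluate, rng, n_elites=2):
    """Breed one child and let it replace a random non-elite member."""
    mother = tournament(population, rng)[0]
    father = tournament(population, rng)[0]
    child = mutate(crossover(mother, father, rng), gene_space, rng)
    ranked = sorted(population, key=lambda ind: ind[1], reverse=True)
    victim = rng.randrange(n_elites, len(ranked))  # the top n_elites are protected
    ranked[victim] = (child, evaluate(child))
    return ranked
```

Because the two best individuals are never replaced, the best fitness in the population cannot decrease from one step to the next, while crossover is free to assemble gene combinations that neither parent held.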

What the search returned

The search converged on a Pareto front with real structure: multiple distinct architectures spanning two orders of magnitude on cache, all hybrids by virtue of the structural penalty against pure-recurrent designs. I picked the four most distinct points and scaled each up to 1.5B ClimbMix tokens × 3 seeds, with the GPT-2-small anchor for comparison:

[Figure: val_bpb (ClimbMix, 1.5B tokens) against KV cache at L=16k (MB, log scale). Plotted points: GPT-2-small (1.092, 607 MB), ind-0007 (1.105, 1.66 MB), ind-0092 (1.125, 10.5 MB), ind-0052 (1.158, 2.7 MB), ind-0054 (1.243, 1.25 MB).]
Five points trained at 1.5B ClimbMix tokens, 3 seeds each (ind-0054 is n=1: a pod failed mid-run and I didn’t retry). The dashed line is the rough lower envelope. The four hybrids span roughly one order of magnitude on cache (1.25 MB to 10.5 MB) and ~0.15 bpb on quality. The curve is real curvature, not noise: each point is a meaningfully different architecture.
| arch | n (seeds) | val_bpb | HellaSwag | arc_e | CORE (3-task) | cache @ L=16k |
|---|---|---|---|---|---|---|
| GPT-2-small | 3 | 1.092 ± 0.001 | 0.312 ± 0.012 | 0.303 ± 0.005 | 0.416 ± 0.014 | 607 MB |
| ind-0007 | 2 | 1.105 ± 0.002 | 0.301 ± 0.004 | 0.319 ± 0.004 | 0.466 ± 0.012 | 1.66 MB |
| ind-0092 | 3 | 1.125 ± 0.001 | 0.289 ± 0.004 | 0.309 ± 0.021 | 0.447 ± 0.010 | 10.5 MB |
| ind-0052 | 3 | 1.158 ± 0.005 | 0.287 ± 0.010 | 0.290 ± 0.004 | 0.426 ± 0.019 | 2.7 MB |
| ind-0054 | 1 | 1.243 | 0.294 | 0.304 | 0.383 | 1.25 MB |

The search picked out architectures that span the front meaningfully. ind-0007 (1024-wide, 8 layers, Mamba in every layer with local-window-256 attention — attention that only looks at the last 256 tokens — in the middle 6) sits in the top-left corner: 366× less cache than GPT-2-small at L=16k, 1.2% worse pretraining loss, slightly better on the 3-task CORE average. The other points trade cache for various combinations of loss and recall — ind-0054 (384-wide) pushes the cache axis to 1.25 MB.

If the search returned these and you stopped at val_bpb and aggregate benchmarks, the hybrids look competitive. The actual test the GA was trying to run, though, is whether any of them retrieves like full attention does — which the RULER results below answer directly.

How the cache stays small

For full attention with GQA, per-layer cache is 2 (K, V) × n_kv_heads × d_head × L × dtype_bytes. Every term except L is fixed by the architecture. The cache scales 1:1 with the prompt.
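The formula translates directly into a few lines of Python. The shape plugged in below (12 layers, 12 KV heads, head dimension 64, fp16) is the standard GPT-2-small configuration and is my assumption; the post does not list the baseline’s exact shape. The function name is hypothetical.

```python
def full_attention_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, dtype_bytes=2):
    """KV cache for a stack of full-attention layers, per sequence."""
    per_layer = 2 * n_kv_heads * d_head * seq_len * dtype_bytes  # 2 = K and V
    return n_layers * per_layer

# Assumed standard GPT-2-small shape in fp16, at L = 16,384 tokens.
cache = full_attention_cache_bytes(n_layers=12, n_kv_heads=12, d_head=64,
                                   seq_len=16_384)
print(f"{cache / 1e6:.0f} MB")
```

Under those assumptions the formula gives roughly 604 MB, close to the 607 MB measured on the checkpoint. Doubling `seq_len` doubles the result, which is the 1:1 scaling described above.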

A Mamba layer caches a per-channel selective-state matrix whose size depends only on d_model × d_state × expand, with no L term: constant cost in context length. Local-window attention caches 2 (K, V) × n_kv_heads × d_head × min(L, window) × dtype_bytes, so it caps at the window size and never grows past it. Combine the two and you can build an architecture whose cache is bounded regardless of context length.
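A small sketch shows why the combined cache stops growing. The layer counts, width, and window (8 Mamba layers, 6 windowed layers, d_model 1024, window 256) follow the description of ind-0007; d_state, expand, n_kv_heads, and d_head are placeholder values I picked, so this will not reproduce the measured 1.66 MB. The sketch counts both K and V and the dtype size for the windowed layers, to stay consistent with the full-attention formula, and ignores Mamba’s small convolution buffer.

```python
def window_attention_cache_bytes(n_kv_heads, d_head, seq_len, window, dtype_bytes=2):
    """K and V for one local-window layer: capped once seq_len reaches the window."""
    return 2 * n_kv_heads * d_head * min(seq_len, window) * dtype_bytes

def mamba_state_bytes(d_model, d_state, expand, dtype_bytes=2):
    """Recurrent state for one Mamba layer: no dependence on sequence length."""
    return d_model * expand * d_state * dtype_bytes

def hybrid_cache_bytes(seq_len, n_mamba_layers=8, n_window_layers=6,
                       d_model=1024, window=256,
                       d_state=16, expand=2, n_kv_heads=2, d_head=64):
    # d_state, expand, n_kv_heads and d_head are illustrative assumptions.
    mamba = n_mamba_layers * mamba_state_bytes(d_model, d_state, expand)
    attn = n_window_layers * window_attention_cache_bytes(
        n_kv_heads, d_head, seq_len, window)
    return mamba + attn

for seq_len in (128, 1_024, 16_384, 65_536):
    print(seq_len, hybrid_cache_bytes(seq_len))
```

The total grows only while `seq_len` is below the window; past 256 tokens it is identical at every context length.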

I measured ind-0007’s actual checkpoint:

KV cache vs context length, measured on the trained checkpoints

| model | L=1k | L=4k | L=16k | L=65k |
|---|---|---|---|---|
| GPT-2-small | 41 MB | 154 MB | 607 MB | 2,419 MB |
| ind-0007 (1024×8 Mamba + local-window-256 hybrid) | 1.66 MB | 1.66 MB | 1.66 MB | 1.66 MB |

Ratio at L=16k: 366×; at L=65k: 1,457×.
The arithmetic is on the model file: GPT-2-small's cache grows linearly in L; ind-0007's is bounded. This is a fact about capacity. It does not tell you the model can use the 16k or 65k tokens you give it.

That distinction — capacity vs use — is what the next section is about.

Does the search’s bet pay off on recall?

This is the question the GA was set up to answer. The implicit hypothesis was: somewhere in the hybrid space there’s a configuration where Mamba state + sparse attention reads from long context as reliably as full attention does. The four points the search picked out are the candidates. The test is RULER — a long-context retrieval benchmark. S-NIAH-1 plants a single fact (key + 10-digit value) in a long stream of unrelated text and asks the model to retrieve the value via free generation; multi-NIAH does the same with several facts at once. n=64 trials per setting, 95% Wilson CIs.
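For readers who want to check the intervals in the results, here is a minimal sketch of the 95% Wilson score interval. It assumes each reported rate is a simple success count out of 64 trials (for example, 0.25 is 16/64); the function name is hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

low, high = wilson_interval(16, 64)
print(f"16/64 -> [{low:.2f}, {high:.2f}]")
```

With 16 hits out of 64 this gives [0.16, 0.37]. At n=64 the intervals are wide, roughly ±0.1 around a rate of 0.25, which is why many of the comparisons in this section overlap.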

| arch | L=2k S-NIAH-1 | L=2k multi-NIAH | L=4k S-NIAH-1 | L=8k S-NIAH-1 |
|---|---|---|---|---|
| GPT-2-small | 0.25 [0.16, 0.37] | 0.59 [0.47, 0.71] | 0.06 [0.02, 0.15] | 0.09 [0.04, 0.19] |
| ind-0007 | 0.13 [0.06, 0.23] | 0.25 [0.16, 0.37] | 0.09 [0.04, 0.19] | 0.11 [0.05, 0.21] |
| ind-0092 | 0.19 [0.11, 0.30] | 0.13 [0.06, 0.23] | 0.06 [0.02, 0.15] | 0.05 [0.02, 0.13] |
| ind-0052 | 0.08 [0.03, 0.17] | 0.41 [0.29, 0.53] | 0.08 [0.03, 0.17] | 0.05 [0.02, 0.13] |
| ind-0054 | 0.11 [0.05, 0.21] | 0.19 [0.11, 0.30] | 0.08 [0.03, 0.17] | 0.11 [0.05, 0.21] |

Two readouts.

At training length (L=2k), GPT-2-small genuinely retrieves better than any hybrid found in the search. On multi-NIAH the gap is clear: three of the four hybrids have CIs that end below GPT-2-small’s lower bound of 0.47, and only ind-0052’s [0.29, 0.53] overlaps it. On S-NIAH-1 the CIs overlap, but the point estimates uniformly favor full attention. The hybrids do retrieve (every multi-NIAH CI excludes zero); they just retrieve at 20–70% the rate of full attention. On multi-NIAH, the closest hybrid to the baseline is ind-0052, which keeps full attention in half its layers; the worst-recall hybrid is ind-0092, the one whose attention layers use MLA (DeepSeek-style compressed-cache full attention).

At L > 2k, every model collapses. GPT-2-small drops from 0.59 → 0.16 at L=4k on multi-NIAH. ind-0007 drops from 0.25 → 0.13. By L=8k everyone is at noise. This is not a hybrid problem — it’s a training-distribution problem. All five models were trained at seq_len=2048; the positional encodings for the attention layers don’t extrapolate and the Mamba states never learned to compress >2k contexts. The 366× cache reduction at L=16k is real arithmetic, and none of these models actually uses 16k tokens.

So the search’s bet didn’t pay off. Full attention still wins on direct in-distribution retrieval, even against the best hybrid configurations the GA could assemble from a 40-gene vocabulary.

What the search teaches

Even with the bet lost, the GA produced a useful artifact. Three observations from the curve worth pulling out.

Hybrid recall is real but weaker. All four hybrids do non-zero retrieval at L=2k. The Pareto curve in (cache, recall) is a real curve, not a cliff. If you can tolerate 30–80% lower retrieval reliability (the multi-NIAH range at L=2k) in exchange for 50–500× less cache at training length, the hybrids are usable. Chat history with batch=128, edge inference where 80 MB doesn’t fit but 2 MB does, any deployment where serving capacity is the binding constraint: in each of these, picking a point on this curve is a legitimate engineering choice.

Whether window-limited attention matters is testable. ind-0052 has full attention in half its layers but is worse on S-NIAH-1 than ind-0092 (MLA, also full reach) and similar to ind-0007 (window-256). The hybrid’s recall gap isn’t about window limits per se — it’s about having Mamba in the stack at all. That said, ind-0007 (windowed) drops less than GPT-2-small at L=4k on the cache it actually uses — windowed attention has no positional encoding to extrapolate, so it degrades more gracefully outside training length.

Cross-layer KV sharing is a different lever the search couldn’t reach. The gene vocabulary didn’t include cross-layer K/V sharing (YOCO, CLA — published techniques that share K/V across groups of full-attention layers for ~4–12× cache reduction with minimal quality loss). Those configurations would add a different region of the Pareto curve — recall-preserving, modest cache reduction — but they wouldn’t close the recall gap the hybrids show, because they keep attention intact. They’re a separate cache-reduction lever, not an answer to the question this search asked.

One follow-up I tested before walking away

The obvious next question is whether the recall gap is a training-distribution artifact rather than a structural limit. Frontier models handle long context in two stages: pretrain at moderate context (4k or 8k), then continue pretraining at extended context (32k, 128k+) on 1–5% of the original token budget. The hybrid’s O(L) training scaling makes that extension stage cheap — where pure attention costs O(L²) and quickly becomes intractable. If the recall gap closes after extension, hybrids are usable; if it doesn’t, the Mamba state has a real ceiling.

I ran a cheap smoke version on ind-0007: resumed the checkpoint and trained another 1B tokens at seq_len=8192 with peak_lr=5e-5. The post-extension RULER numbers at L=8k fell within the same Wilson CIs as pre-extension — multi-NIAH went from 4/64 to 6/64, S-NIAH-1 from 7/64 to 5/64. Aggregate benchmarks slightly regressed (CORE 0.466 → 0.455, HellaSwag 0.301 → 0.294), suggesting the LR/budget combination was bad for the model overall, not just useless for long context. The Mamba state didn’t learn to compress 8k tokens with this much extension. A larger run — 5–10B tokens, tuned LR schedule, multiple d_state choices — might tell a different story. But the cheap version of the test didn’t deliver, and I’m not the person who’s going to run the expensive version.

Closing

The premise was: full attention seems hard to replace, but maybe nobody had searched the hybrid space carefully enough. So I ran a GA over it. At 1.5B ClimbMix tokens, seq_len=2048, with this gene vocabulary, full attention still wins on recall. The GA’s actual product is the cache-vs-recall curve and the result that no point on it closes the recall gap. For some serving applications, picking a point on the curve and accepting the recall cost is the right move; for anything where exact retrieval matters, the answer remains attention.