Jason Normore

Software developer. Founder & CTO at Mantle. Writing about LLM systems, commerce infrastructure, and the craft of building products.
https://jasonnormore.com/

Compressing the KV cache without retraining, and what mostly doesn't work
https://jasonnormore.com/blog/compressing-the-kv-cache/
Sun, 17 May 2026 00:00:00 GMT
10× KV reduction without retraining is reachable on a pretrained transformer. The mechanism isn't what you'd expect: summaries work not because they preserve information but because they prevent gap-induced recency bias.

Trying to push sparse attention further, and what didn't work
https://jasonnormore.com/blog/projection-sparsity-and-what-didnt-work/
Fri, 15 May 2026 00:00:00 GMT
The natural follow-up to CSA: if 90% of attention reads are skippable, can the projection machinery be trimmed too? The hypothesis survives, the obvious method doesn't, and the gap between them is the interesting part.

90% of attention is slack in pretrained transformers
https://jasonnormore.com/blog/calibrated-sparse-attention/
Wed, 13 May 2026 00:00:00 GMT
A small training-free procedure that measures per-(layer, head) attention diffuseness, allocates a budget proportional to it, and applies it as a top-k mask at inference. Result: dense quality at ~10% of the attention reads, across seven model/context/corpus combinations.

Knowledge in weights, not in prompts
https://jasonnormore.com/blog/knowledge-in-weights-not-prompts/
Mon, 11 May 2026 00:00:00 GMT
The frontier-agent answer to personalization is more prompt: RAG, memory injection, larger context windows. A hybrid Mamba+attention base plus a small LoRA adapter lets the model just learn you instead.

A runtime for agent-authored apps
https://jasonnormore.com/blog/a-runtime-for-agent-authored-apps/
Thu, 23 Apr 2026 00:00:00 GMT
Agents are great at one-shot work and bad at durable work. cue is a small daemon that closes the gap: actions, triggers, addressable URLs, each invocation sandboxed in a fresh unikernel.

Sandboxing agent-generated code with disposable unikernels
https://jasonnormore.com/blog/sandboxing-agent-code-with-disposable-unikernels/
Wed, 22 Apr 2026 00:00:00 GMT
Coding agents need to run code. The options today are all bad. unitask is a small tool that runs each call in a fresh unikernel under declarative policy: code in, runs, returns, destroyed.

Matching frontier LLMs with diverse small ensembles
https://jasonnormore.com/blog/matching-frontier-llms-with-diverse-small-ensembles/
Wed, 08 Apr 2026 00:00:00 GMT
An OpenAI-compatible ensemble proxy that lands at GPT-5 accuracy on a 150-case benchmark, at 13× less cost. The catch is that diversity, not model count, does the work.

Code optimization with LLM-imagined ideas and evolutionary search
https://jasonnormore.com/blog/code-optimization-with-llms-and-evolution/
Mon, 06 Apr 2026 00:00:00 GMT
Pairing LLM idea generation with genetic algorithms for code optimization, and the synergies hill-climbing can't reach.