Knowledge in weights, not in prompts

The frontier-agent answer to making an LLM personal is, almost without exception, more prompt. The system prompt grows (“the user prefers terse responses”). Memory features inject retrieved facts at the top of each turn. Context windows expand to absorb more session history. RAG pipelines fetch and crowbar in whatever documents look relevant. Every call hauls along a junk drawer of user-specific context that the model has to re-read every time.

It works, sort of. It also has obvious costs that compound: the per-call token bill grows with how much the system knows about you, latency grows with it, privacy is bounded by what your vendor sees in those prompts, and personalization quality degrades once the prompt fills up. There’s also something philosophically weird about a learning system that doesn’t actually learn from its users.

I’ve been building a small local model called cognit to chase the inverse approach: personalization through weights, not through prompts. A frozen shared base model. A small LoRA adapter that captures everything cognit has learned about you. Marked turns and corrected responses get committed to the adapter as background work between sessions. The prompt stays minimal. The model already knows you.
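To make the "frozen base plus small adapter" idea concrete, here is a minimal sketch of a LoRA-style linear layer in plain Python. It is not cognit's code: the class name `LoRALinear`, the `alpha` scaling, and the initialization constants are assumptions, following the usual LoRA recipe in which the base weight W stays fixed and only two low-rank matrices A and B are trained.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen base weight W plus a low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration, not taken from the cognit repository.
    """

    def __init__(self, base_weight, rank, alpha=None, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(base_weight), len(base_weight[0])
        self.base_weight = base_weight  # shared base, never updated
        # A starts small and random, B starts at zero, so a fresh adapter
        # leaves the base model's output unchanged.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = (alpha if alpha is not None else rank) / rank

    def forward(self, x):
        base = matvec(self.base_weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def adapter_param_count(self):
        """Number of per-user parameters: rank * (d_in + d_out)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
x = [1.0, 2.0, 3.0]

layer = LoRALinear(W, rank=1)
fresh_output = layer.forward(x)  # identical to the base model's output

# Pretend training has moved the adapter; the base weight is untouched.
layer.A = [[1.0, 0.0, 0.0]]
layer.B = [[1.0], [0.0]]
adapted_output = layer.forward(x)
```

The point of the sketch is the split: everything user-specific lives in `A` and `B`, which is what would be saved per profile, while `base_weight` can be shared by every profile on the machine.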

What cognit is, structurally, is what falls out when you decide that personal knowledge belongs in weights rather than prompts, and then ask where each kind of knowledge actually goes. Three different timescales of memory, three different mechanisms — none of them the prompt.

The three timescales of memory

Cognit’s answer is that personal knowledge lives at three different timescales, and each gets a different mechanism. They are not interchangeable, and the value of the architecture is that nobody is doing the wrong job.

| Channel | Stores | Persists | Cost |
|---|---|---|---|
| Attention KV | Verbatim tokens, exact positions | Within turn | Grows with context length |
| SSM hidden state | Compressed running summary | Within and across sessions | Constant per layer (~KB) |
| LoRA adapter | Durable learned patterns | Across all sessions | Tiny (~MB), free at inference |
[Figure: persistence scope of each channel on a timeline from “ages ago” to “now”: attention KV (within turn), SSM state (across sessions), LoRA adapter.]
Three timescales, three mechanisms. Attention KV holds the most recent tokens exactly. SSM state compresses everything from the conversation so far. LoRA holds patterns learned across every session you've ever had with this profile.

Three jobs:

The clean division of labor is the part of cognit I keep coming back to. The frontier-agent pattern crams all three jobs into the prompt channel — the model has to re-read its “memory” every turn, every channel of personalization growing the same context bill, every distinction between “what I just said” and “what I taught you last week” collapsed into one bag of text. The three-channel model lets each job find its natural home.

The SSM half is what makes this work at long context

What the SSM half buys you is the only reason this is a viable architecture for a daily-driver local model: decode cost that stays nearly flat as context grows. Decode cost per generated token, simulated at the layer level on a small hidden-dim configuration (d_model=128, d_inner=256, d_state=16, 6 layers):

| Architecture | past=1K | past=16K | past=65K | past=100K |
|---|---|---|---|---|
| flat-mamba (0 attn) | 0.22 | 0.22 | 0.13 | 0.12 |
| sparse-jamba (1 of 6 attn) | 0.17 | 0.23 | 0.43 | 0.58 |
| medium-jamba (2 of 6 attn) | 0.14 | 0.26 | 0.75 | 1.08 |
| dense-jamba (3 of 6 attn) | 0.20 | 0.33 | 1.06 | 1.53 |
| pure-attention (6 of 6) | 0.27 | 0.52 | 1.97 | 2.89 |

ms per decoded token, MPS / Apple Silicon, fp32.

[Figure: decode latency at past=100K context (ms per token, lower is better): flat-mamba 0.12, sparse-jamba 0.58, medium-jamba 1.08, dense-jamba 1.53, pure-attention 2.89. Pure attention is about 25× slower per token than pure Mamba at the same context.]

The shape is what matters. At short context, the architectures cluster — attention is fast because there isn’t much to attend to. At 100K tokens of context, pure-attention is ~25× slower than pure-Mamba per generated token, and ~5× slower than the sparse hybrid. Cache memory scales similarly: pure-attention’s KV cache at 100K is 614 MB; the hybrid with one attention layer is ~100 MB; flat Mamba is essentially zero (the constant-size SSM state weighs a fraction of an MB).
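The cache figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes plain multi-head attention where keys and values are each d_model wide (no grouped-query sharing), fp32 storage, 100K meaning 100,000 tokens, decimal megabytes, and an SSM state of shape (d_inner, d_state) per layer with the small convolution buffer ignored. The function names are illustrative, not cognit's.

```python
def kv_cache_bytes(n_attn_layers, past_tokens, d_model, bytes_per_elem=4):
    """One key and one value vector of width d_model per token per attention layer."""
    return 2 * n_attn_layers * past_tokens * d_model * bytes_per_elem


def ssm_state_bytes(n_ssm_layers, d_inner, d_state, bytes_per_elem=4):
    """Recurrent state of shape (d_inner, d_state) per SSM layer.

    Note that past_tokens does not appear: the state size is independent
    of context length.
    """
    return n_ssm_layers * d_inner * d_state * bytes_per_elem


MB = 1_000_000
PAST = 100_000  # assumed meaning of "100K"

pure_attention_mb = kv_cache_bytes(6, PAST, d_model=128) / MB     # 614.4
sparse_hybrid_mb = kv_cache_bytes(1, PAST, d_model=128) / MB      # 102.4
flat_mamba_mb = ssm_state_bytes(6, d_inner=256, d_state=16) / MB  # 0.098304
```

Under those assumptions the arithmetic lands on the quoted numbers: about 614 MB for six attention layers, about 100 MB for one, and roughly a tenth of a megabyte for the SSM state of six Mamba layers.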

You don’t get context that long without giving something up. Mamba’s constant-cost state isn’t free: it’s the lossy half of the model, the part that gives up exact recall of tokens outside the immediate attention window. The hybrid lets you decide how much exact recall you need. One attention layer is enough for length generalization on a synthetic copy task at the scale I tested (success at length 1024 goes from 0.29 with pure Mamba to 1.00 with one attention layer added anywhere in the stack); more attention layers buy sharper recall at the cost of higher decode latency at long context.

For a personal local model on a laptop, the right slice is something like sparse-jamba: one or two attention layers, the rest Mamba. You get exact recall over the recent attention window, compressed long-range state through the SSM hidden state, and per-token cost that grows slowly as conversations grow (0.17 ms at 1K to 0.58 ms at 100K for sparse-jamba in the table above, against 0.27 ms to 2.89 ms for pure attention). The KV cache stays modest (~100 MB at 100K with one attention layer); the SSM state stays small; nothing balloons.
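As a rough illustration of what "one or two attention layers, the rest Mamba" means as a configuration, the sketch below builds a per-layer layout for a stack. The helper name `hybrid_layout` and the even spacing of attention layers are assumptions made for the example; the copy-task result above suggests placement did not matter at the scale tested.

```python
def hybrid_layout(n_layers, n_attn):
    """Return a per-layer list of 'mamba' / 'attn', spreading attention evenly.

    Hypothetical helper: the even spacing is an illustrative choice,
    not a claim about how cognit places its attention layers.
    """
    if not 0 <= n_attn <= n_layers:
        raise ValueError("n_attn must be between 0 and n_layers")
    layout = ["mamba"] * n_layers
    for k in range(n_attn):
        # Integer midpoint of the k-th of n_attn equal segments of the stack.
        idx = (2 * k + 1) * n_layers // (2 * n_attn)
        layout[idx] = "attn"
    return layout


# The six-layer configurations named in the latency table.
flat_mamba = hybrid_layout(6, 0)
sparse_jamba = hybrid_layout(6, 1)
medium_jamba = hybrid_layout(6, 2)
pure_attention = hybrid_layout(6, 6)
```

Only the layers marked `attn` contribute a KV cache that grows with context, which is why the count of attention layers is the main knob for both memory and long-context decode latency.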

Adapter drift, and a cheap fix

The half of cognit that isn’t architecture is the LoRA training loop — and it has one engineering risk you have to take seriously.

In the prompt-inflation paradigm, forgetting is impossible. The prompt is the source of truth; what’s not in the prompt isn’t remembered. That guarantee is the entire reason RAG-style systems can claim “perfect recall.”

In a weights-as-memory system, earlier learned patterns in the LoRA adapter can get overwritten by later training passes as the adapter is reused across sessions. The base model is frozen and can’t degrade — this is adapter interference, not classical catastrophic forgetting of a full fine-tune — but it’s real. In a measurement on Zamba2-1.2B with no protection, a single previously-trained pattern showed a roughly 50× loss increase after nine drift training steps on unrelated content.

The fix is the kind of fix you hope for: cheap and obvious in retrospect. Whenever you train on new captures, also sample N previously trained captures from the queue and mix them into the batch. All the captures are already sitting in a queue (the capture queue is how marked turns get committed to LoRA in the first place); replay just reads from it. With replay_size=5, the loss increase on the old pattern in the same drift experiment drops from +2.0 to +0.025, about 99% effective. L2 regularization toward pre-pass weights at typical LoRA scales gives nowhere near that protection (a measured 4% reduction).
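A minimal sketch of the replay step, assuming captures are hashable items (strings here) and the queue is a plain list. The function names, the queue representation, and the shuffling are illustrative assumptions; only the idea of sampling `replay_size` old captures into each new batch comes from the description above.

```python
import random


def build_training_batch(new_captures, capture_queue, replay_size=5, rng=None):
    """Mix new captures with a random sample of previously trained ones."""
    rng = rng or random.Random()
    new_set = set(new_captures)
    # Only captures that are not part of this pass count as replay candidates.
    candidates = [c for c in capture_queue if c not in new_set]
    k = min(replay_size, len(candidates))  # early on, the queue may be short
    batch = list(new_captures) + rng.sample(candidates, k)
    rng.shuffle(batch)  # avoid a fixed new-then-old ordering inside the batch
    return batch


def drift_reduction(unprotected_delta, protected_delta):
    """Fraction of the unprotected loss increase that replay removed."""
    return 1.0 - protected_delta / unprotected_delta


old = [f"capture-{i:02d}" for i in range(20)]
new = ["capture-20", "capture-21", "capture-22"]
batch = build_training_batch(new, old + new, replay_size=5, rng=random.Random(0))

# The reported numbers: +2.0 loss without replay, +0.025 with replay_size=5.
reduction = drift_reduction(2.0, 0.025)  # 0.9875, i.e. about 99%
```

Uniform sampling is the simplest choice; weighting replay toward captures that have not been revisited recently would be a natural variation, but nothing in the measurements above speaks to it.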

This is the part where I’d normally hedge that the result is small-scale and needs to be re-run at production model sizes. It is, and it does. But the mechanism is clear enough that I expect the shape to hold: old patterns survive new training when they’re sampled into the new training, and they don’t when they aren’t.

What cognit is not

Three things this is not, in increasing order of “this would defeat the point if I let it in”:

These belong in agents that consume cognit, not in cognit itself. cognit is a model that knows you. Tool use, retrieval, prompt construction — those happen at the layer above.

What success looks like

A user runs cognit chat for a few weeks. They never edit a system prompt. They never maintain a memory file. They mark a few responses as good or wrong, occasionally ingest a folder of notes. The model gets sharper at their specific work. Switching --profile work to --profile personal feels like switching to a different model — different knowledge, different tone, no cross-contamination. The adapter file after a year of personal use is a few hundred MB and lives on their disk, not on a vendor’s server.

That’s the shape I think a local personal LLM is supposed to be. The current frontier-agent reliance on prompt inflation is a design choice, not a feature of the problem. cognit is one shape of an alternative; there are presumably others.

Code: github.com/jnormore/cognit. The interesting bits live in the LoRA training loop and the SSM state save/load.