Knowledge in weights, not in prompts
The frontier-agent answer to making an LLM personal is, almost without exception, more prompt. The system prompt grows (“the user prefers terse responses”). Memory features inject retrieved facts at the top of each turn. Context windows expand to absorb more session history. RAG pipelines fetch and crowbar in whatever documents look relevant. Every call hauls along a junk drawer of user-specific context that the model has to re-read every time.
It works, sort of. It also has obvious costs that compound: the per-token bill grows with how much the system knows about you, latency grows with it, privacy is bounded by what your vendor sees in those prompts, and personalization quality degrades once the prompt fills up. There’s also something philosophically weird about a learning system that doesn’t actually learn from its users.
I’ve been building a small local model called cognit to chase the inverse approach: personalization through weights, not through prompts. A frozen shared base model. A small LoRA adapter that captures everything cognit has learned about you. Marked turns and corrected responses get committed to the adapter as background work between sessions. The prompt stays minimal. The model already knows you.
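To make the frozen-base-plus-adapter split concrete, here is a minimal sketch of a standard LoRA linear layer in plain Python. It is not taken from cognit's code: the class name `LoRALinear`, the helper `matvec`, and the `rank`, `alpha`, and `seed` values are all illustrative assumptions. It only shows the general shape of the technique, where the base weight is never updated and a small low-rank pair carries everything learned.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class LoRALinear:
    """Frozen base weight W plus a low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration of standard LoRA, not cognit's implementation.
    """

    def __init__(self, w, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(w), len(w[0])
        self.w = w  # shared base weight: frozen, never trained
        # A starts small and random, B starts at zero, so the adapter
        # is a no-op until training moves B away from zero.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.w, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def adapter_params(self):
        """Number of trainable values: rank * (d_in + d_out)."""
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=2)
print(layer.forward([1.0, 2.0, 3.0, 4.0]))  # [1.0, 2.0, 3.0, 4.0]: untrained adapter changes nothing
print(layer.adapter_params())               # 16
```

The per-user state is only `a` and `b`. Their size grows with the rank, not with the full weight matrix, which is why an adapter file can stay small while the base model is shared.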
What cognit is, structurally, is what falls out when you decide that personal knowledge belongs in weights rather than prompts, and then ask where each kind of knowledge actually goes. Three different timescales of memory, three different mechanisms — none of them the prompt.
The three timescales of memory
Cognit’s answer is that personal knowledge lives at three different timescales, and each gets a different mechanism. They are not interchangeable, and the value of the architecture is that nobody is doing the wrong job.
| Channel | Stores | Persists | Cost |
|---|---|---|---|
| Attention KV | Verbatim tokens, exact positions | Within turn | Grows with context length |
| SSM hidden state | Compressed running summary | Within and across sessions | Constant per layer (~KB) |
| LoRA adapter | Durable learned patterns | Across all sessions | Tiny (~MB), free at inference |
Three jobs:
- Attention is for “what token came where, three sentences ago.” Exact recall on a small recent window. The KV cache lets the model quote you, count you, paste back the structure of what you just said. That’s the part you can’t compress.
- SSM state is for “what kind of thing were we talking about.” Compressed conversational gist that survives across sessions. You save it to disk on session exit, load it the next time you continue, and the model picks up the thread without you having to re-explain. The size of this file is measured in KB, not GB, because the SSM dimension is fixed regardless of how long the conversation has been.
- LoRA weights are for “this user prefers terse responses; this is their authentication flow; this is their writing style.” Durable patterns that you want the model to keep learning over weeks and months. When you mark a turn as positive or correct a response, that goes to LoRA, not to a memory file that gets injected into the next prompt.
The clean division of labor is the part of cognit I keep coming back to. The frontier-agent pattern crams all three jobs into the prompt channel — the model has to re-read its “memory” every turn, every channel of personalization growing the same context bill, every distinction between “what I just said” and “what I taught you last week” collapsed into one bag of text. The three-channel model lets each job find its natural home.
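As a rough illustration of why the saved SSM state is a KB-sized file, here is a sketch of saving and restoring a fixed-size state with the Python standard library. The on-disk layout, the function names, and the file name are assumptions made for this example, not cognit's actual format. The dimensions are borrowed from the small benchmark configuration later in the post, and the sketch ignores the small convolution state that real Mamba implementations also carry.

```python
import os
import tempfile
from array import array

# Assumed dimensions (small benchmark config); layout is hypothetical.
N_LAYERS, D_INNER, D_STATE = 6, 256, 16

def new_state():
    """One flat float buffer of d_inner * d_state values per SSM layer."""
    return [array("f", [0.0] * (D_INNER * D_STATE)) for _ in range(N_LAYERS)]

def save_state(state, path):
    """Write each layer's buffer back to back; size does not depend on chat length."""
    with open(path, "wb") as f:
        for layer in state:
            layer.tofile(f)

def load_state(path):
    """Read the buffers back in the same order they were written."""
    state = []
    with open(path, "rb") as f:
        for _ in range(N_LAYERS):
            layer = array("f")
            layer.fromfile(f, D_INNER * D_STATE)
            state.append(layer)
    return state

path = os.path.join(tempfile.mkdtemp(), "session.state")
state = new_state()
state[0][0] = 1.5            # pretend the conversation updated the state
save_state(state, path)      # on session exit
restored = load_state(path)  # on the next session
print(os.path.getsize(path))  # 98304 bytes (about 96 KB) with 4-byte floats
```

The file is the same size after ten turns or ten thousand, because the state dimensions are fixed by the architecture rather than by the conversation.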
The SSM half is what makes this work at long context
The SSM half is the only reason this is a viable architecture for a daily-driver local model: it keeps decode cost close to flat as context grows. Decode cost per generated token, simulated at the layer level on a small configuration (d_model=128, d_inner=256, d_state=16, 6 layers):
| Architecture | past=1K | past=16K | past=65K | past=100K |
|---|---|---|---|---|
| flat-mamba (0 attn) | 0.22 | 0.22 | 0.13 | 0.12 |
| sparse-jamba (1 of 6 attn) | 0.17 | 0.23 | 0.43 | 0.58 |
| medium-jamba (2 of 6 attn) | 0.14 | 0.26 | 0.75 | 1.08 |
| dense-jamba (3 of 6 attn) | 0.20 | 0.33 | 1.06 | 1.53 |
| pure-attention (6 of 6) | 0.27 | 0.52 | 1.97 | 2.89 |
ms per decoded token, MPS / Apple Silicon, fp32.
The shape is what matters. At short context, the architectures cluster, because attention is fast when there isn’t much to attend to. At 100K tokens of context, pure-attention is ~24× slower than flat-mamba per generated token (2.89 ms vs 0.12 ms), and ~5× slower than the sparse hybrid (2.89 ms vs 0.58 ms). Cache memory scales similarly: pure-attention’s KV cache at 100K is 614 MB; the hybrid with one attention layer is ~100 MB; flat-mamba is essentially zero (the constant-size SSM state weighs a fraction of an MB).
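The cache figures can be reproduced with back-of-the-envelope arithmetic. The sketch below assumes each attention layer caches one key and one value vector of width d_model per token in fp32, that "100K" means 100,000 tokens, and that 1 MB is 10^6 bytes. These are assumptions on my part, although they do land on the numbers quoted above. The function names are hypothetical.

```python
def kv_cache_bytes(past_tokens, n_attn_layers, d_model=128, bytes_per_value=4):
    """Keys plus values: two d_model-wide vectors per token per attention layer."""
    return 2 * d_model * bytes_per_value * past_tokens * n_attn_layers

def ssm_state_bytes(n_ssm_layers, d_inner=256, d_state=16, bytes_per_value=4):
    """Fixed-size recurrent state per SSM layer, independent of context length."""
    return n_ssm_layers * d_inner * d_state * bytes_per_value

MB = 1_000_000
print(kv_cache_bytes(100_000, 6) / MB)  # 614.4   (pure-attention, 6 of 6 layers)
print(kv_cache_bytes(100_000, 1) / MB)  # 102.4   (one attention layer)
print(ssm_state_bytes(6) / MB)          # 0.098304 (flat-mamba, all 6 layers SSM)
```

The KV term has `past_tokens` in it and the SSM term does not, which is the whole difference between the two curves.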
You don’t get that flat cost curve without giving something up. Mamba’s constant-cost state isn’t free: it’s the lossy half of the model, the part that gives up exact recall on tokens that aren’t in the immediate attention window. The hybrid is what lets you decide how much exact recall you need. One attention layer is enough for length-generalization on a synthetic copy task at the scale I tested (going from 0.29 success at length 1024 with pure Mamba to 1.00 with one attention layer added anywhere in the stack); more attention layers buy you sharper recall at the cost of higher decode latency at long context.
For a personal local model on a laptop, the right slice is something like sparse-jamba: one or two attention layers, the rest Mamba. You get exact recall over the recent attention window, compressed long-range state through the SSM hidden state, and per-token cost that grows slowly as conversations grow (0.58 ms at 100K tokens in the table above, against 2.89 ms for pure-attention). The KV cache is confined to those one or two layers; the SSM state stays small; nothing balloons.
Adapter drift, and a cheap fix
The half of cognit that isn’t architecture is the LoRA training loop — and it has one engineering risk you have to take seriously.
In the prompt-inflation paradigm, forgetting is impossible by construction. The prompt is the source of truth: whatever is in the prompt is available verbatim, and whatever is not in the prompt isn’t remembered at all. That guarantee is the entire reason RAG-style systems can claim “perfect recall.”
In a weights-as-memory system, earlier learned patterns in the LoRA adapter can get overwritten by later training passes as the adapter is reused across sessions. The base model is frozen and can’t degrade — this is adapter interference, not classical catastrophic forgetting of a full fine-tune — but it’s real. In a measurement on Zamba2-1.2B with no protection, a single previously-trained pattern showed a roughly 50× loss increase after nine drift training steps on unrelated content.
The fix is the kind of fix you hope for: cheap and obvious in retrospect. Whenever you train on new captures, also sample N previously-trained captures from the queue and mix them into the batch. All the captures are already sitting in a queue (the capture queue is how marked turns get committed to LoRA in the first place); replay just reads from it. With N set by replay_size=5, the same drift experiment moves from +2.0 loss to +0.025, a reduction of about 99%. L2 regularization toward pre-pass weights at typical LoRA scales gives nowhere near that protection (a measured 4% reduction).
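The replay step can be sketched in a few lines. This is an illustration of the idea only: the function name `build_training_batch`, the representation of captures as plain list items, and the uniform sampling are assumptions, and cognit's real training loop may select and weight replayed captures differently.

```python
import random

def build_training_batch(new_captures, trained_captures, replay_size=5, rng=None):
    """Mix new captures with a random sample of previously-trained ones.

    Hypothetical sketch: uniform sampling without replacement from the
    already-trained part of the capture queue.
    """
    if rng is None:
        rng = random.Random()
    # Never ask for more replay items than the queue actually holds.
    k = min(replay_size, len(trained_captures))
    replayed = rng.sample(list(trained_captures), k)
    batch = list(new_captures) + replayed
    rng.shuffle(batch)  # avoid always presenting new items first
    return batch

new = ["new-1", "new-2"]
old = [f"old-{i}" for i in range(20)]
batch = build_training_batch(new, old, replay_size=5, rng=random.Random(0))
print(len(batch))  # 7: two new captures plus five replayed ones
```

Each training pass then costs a few extra examples, and no extra storage, since the old captures are read from the queue that already exists.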
This is the part where I’d normally hedge that the result is small-scale and needs to be re-run at production model sizes. It is, and it does. But the mechanism is clear enough that I expect the shape to hold: old patterns survive new training when they’re sampled into the new training, and they don’t when they aren’t.
What cognit is not
Three things this is not, in increasing order of “this would defeat the point if I let it in”:
- Not RAG. If you want a fact stuffed into a prompt, that’s the agent’s job. cognit’s premise is that you mostly don’t need to.
- Not memory injection. Auto-prepending “remember the user likes X” is exactly the prompt-inflation pattern this is built to replace.
- Not long-context tricks. cognit doesn’t need to support 1M tokens. The durable stuff lives in weights; the recent stuff lives in the attention window; nothing in between needs a context-length workaround.
These belong in agents that consume cognit, not in cognit itself. cognit is a model that knows you. Tool use, retrieval, prompt construction — those happen at the layer above.
What success looks like
A user runs cognit chat for a few weeks. They never edit a system prompt. They never maintain a memory file. They mark a few responses as good or wrong, occasionally ingest a folder of notes. The model gets sharper at their specific work. Switching --profile work to --profile personal feels like switching to a different model — different knowledge, different tone, no cross-contamination. The adapter file after a year of personal use is a few hundred MB and lives on their disk, not on a vendor’s server.
That’s the shape I think a local personal LLM is supposed to be. The current frontier-agent reliance on prompt inflation is a design choice, not a feature of the problem. cognit is one shape of an alternative; there are presumably others.
Code: github.com/jnormore/cognit. The interesting bits live in the LoRA training loop and the SSM state save/load.