
Vektor Memory supersession chains: BM25 threshold trap and a minimum schema

Ikesan

You teach an AI “I like coffee,” and three months later you say “I switched to tea.”
The next week you ask about your preference, and the AI brings up coffee again.
This is not a search-quality problem. It is just that there is no step to take the old memory out of active circulation.

A DEV Community post about Vektor Memory tells the story of chasing this bug down through five layers of Node.js.
The supersession chains added in v1.5.4 do not delete old memories. They mark them as “replaced by a new memory” and exclude them from normal recall.

You don’t forget, you misremember

This kind of failure is not “the AI forgot.”
It is the opposite. Both old and new information are still in storage.

“User prefers dark mode” and “User switched to light mode” both sit in the same DB and both come back from search.
The model can’t pick one, so it asks a clarifying question, or it just happens to use the older one.
You’d expect more memories to make it smarter, but in practice old correctness becomes noise.

In real use, the case that bites you isn’t the toy “coffee to tea” example. It’s life-state updates.

  • Moving: address, nearest station, commute time all change
  • Job change: company, role, tech stack all change
  • Tool switch: Notion → Obsidian, VSCode → Cursor
  • Settings change: dark mode → light mode, font, shell
  • Pulling back a decision (“Actually, let’s not go with that architecture”)
  • A simple correction (you gave the wrong year for a birthday)

In every case, you want “keep the old info around, but don’t mix it into current answers.”
Deleting kills the audit trail, and decay alone leaves the old info popping up for days or weeks.
This is exactly where supersession chains earn their keep.

Where this fits among past memory articles

Recent agent-memory articles each solved a different sub-problem.
Vektor Memory’s supersession chains slot in as another layer for “semantic overwrite of free-form memories.”

| Approach | How it handles old memories | Where it shines | Where it doesn’t |
| --- | --- | --- | --- |
| Time decay (YourMemory) | Score weakens over time | Miscellaneous “will eventually be irrelevant” memories | Weakens rare-but-important memories you only touch every few months |
| Key overwrite (Cloudflare Agent Memory) | Same key, explicit replace | Structured attributes (address, preferences) | Free-form memories need key design upfront |
| Retrieval-by-source (CTX) | Switch source per type | Optimizes per git/code/chat history | “Old vs new” judgment is a separate layer |
| OCR-ization (OCR-Memory) | Pull the relevant region from images | Compressing long execution logs | Consistency management is left for another layer |
| Memo philosophy (Contextual Agentic Memory paper) | Treat as a search shelf, not memory | Framing the discussion | Not an implementation by itself |
| Semantic replacement (Vektor Memory supersession chains) | Find neighbors via embedding, retire with superseded_by | Free-form preference / setting / retraction | You need to understand score types and thresholds |

The short version: YourMemory “weakens by time,” Cloudflare “overwrites by key,” Vektor “replaces by meaning.”
Key overwrite is the most solid when it applies, but designing a key for every free-form memory (“I’m into Rust lately,” “Cutting back on ramen at lunch”) is tedious.
Time decay is easy to throw on, but it can’t keep up with short-term overwrites like “I retracted that three days ago.”
Semantic replacement sits between them: for free-form text, judge each insert against “is there an old memory close in meaning?” and retire it.

Where superseded_by IS NULL does its work

The original implementation is straightforward.
When a new memory comes in, search existing memories for the semantically closest ones.
If a replacement target is found, write the new memory’s ID into the old row’s superseded_by and record the replacement timestamp.

Normal active recall filters with WHERE superseded_by IS NULL to drop the old ones.
History is preserved, so you can trace what the agent used to believe.
But current answers don’t get the retired memories mixed in.

flowchart TD
    A[January<br/>Likes coffee] --> B[March<br/>Switched to tea]
    B --> C[Set superseded_by<br/>on old memory]
    C --> D[Active recall returns<br/>only the new memory]
    C --> E[Audit can still trace<br/>the old memory]

The point is that “retire from active recall” and “delete” are different.
If you delete, you can’t later reconstruct facts like “the user switched from coffee to tea over six months.”
Supersession chains are chains: when preferences shift A → B → C, the current value is C, and following the superseded_by links back from C recovers B and then A.

Why key matching alone is not enough

If your memories are attribute-shaped — “address,” “favorite drink,” “editor in use” — Cloudflare-style key overwrite is the most solid choice.
But the utterances an agent receives are mostly not attribute-shaped.

  • “Yeah I dropped coffee, I’m doing tea now”
  • “Trying to cut down on coffee these days”
  • “Tea is better in the morning”

Normalizing these into key=preferred_drink ends up requiring an LLM extraction step.
And if the LLM could reliably tag every utterance as “this is the favorite-drink key,” we wouldn’t have a problem in the first place.
Mix English and Japanese, indirect phrasing, or a new attribute name, and it slips through easily.

Putting free-form text directly into a vector DB and detecting collisions by semantic neighborhood is a different bargain: skip normalization, pick up rough collisions by embedding distance.
User likes coffee and User prefers tea now don’t overlap as strings, but in embedding space they’re close along the “drink preference” axis.
You skip the key design and do “find the close one and replace it” after the fact.

But vector similarity has its own traps.

  1. It can’t distinguish opposition from similarity. Loves Python and Hates Python are close in embedding space because they share co-occurring words. Judging supersession purely by distance lets affirmation overwrite negation
  2. Memories of different granularity collide. Likes drinks and Likes tea are close, but the latter is a specialization of the former, not a replacement
  3. “Score” means different things across retrievers. Cosine 0.9 and BM25 0.9 are entirely different events (more on this below)

In practice, use similarity for fetching neighbors, but for the actual replacement decision, ask the LLM “is this a replacement, an addition, or unrelated?” The original implementation only uses a similarity threshold; if you build it yourself, slot in an LLM judge.
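A minimal sketch of that judge, assuming a generic `llm` callable (any chat-completion client slots in; the function name, prompt, and fallback are illustrative, not from the original implementation):

```python
def judge_supersession(llm, new_text: str, old_text: str) -> str:
    """Ask an LLM whether new_text replaces old_text.

    Returns one of: "replace", "add", "unrelated".
    `llm` is any callable prompt -> str; swap in your own client.
    """
    prompt = (
        "A new memory arrived:\n"
        f"  NEW: {new_text}\n"
        "It is semantically close to an existing memory:\n"
        f"  OLD: {old_text}\n"
        "Does NEW replace OLD, add detail to OLD, or is it unrelated?\n"
        "Answer with exactly one word: replace, add, or unrelated."
    )
    answer = llm(prompt).strip().lower()
    # Fail open: an unparseable answer keeps both memories
    return answer if answer in ("replace", "add", "unrelated") else "add"

# With a stub in place of a real model:
fake_llm = lambda prompt: "replace"
decision = judge_supersession(fake_llm, "User prefers tea now", "User likes coffee")
```

Failing open to “add” is deliberate: keeping both memories is recoverable, while a wrong “replace” retires a live memory.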

The right code was dying behind a different entry point

The bug itself was about wiring, not about the supersession idea.
The dedup logic was implemented in vektor-dedup.js, wrapping the memory object with a Proxy and intercepting remember() calls.
There was schema migration, a threshold, a module export — all of it.

But the MCP server that Claude Desktop ran was booting from a different DXT entry point, not the vektor.mjs that was being edited.
That boot sequence didn’t include the dedup wrapper.
The feature existed in the right place, but it wasn’t on the execution path.

This is very familiar territory in MCP work.
The CLI entry point, the Claude Desktop entry point, and the packaged DXT entry point are all different.
“I fixed the file but nothing changed” almost always means you’re misreading which entry point is actually running.

On top of that, an optional-module load error was being swallowed by an empty catch.
MCP subprocesses under Claude Desktop don’t surface stderr cleanly, and from the PowerShell side it looked like nothing was happening.
Only after logging to a file did the SyntaxError show up — a broken escape inside a template literal.

This is a memory-feature article on the surface, but on the implementation side it reads as an MCP server debugging article.
You think you’re investigating why the AI keeps repeating itself, but what you actually need to look at is the wrap chain, the entry point, and how optional-module errors are handled.

A BM25 score of 0.12 isn’t “too low” if you read it right

The last piece of friction was the meaning of the score.
The dedup recall fetches close candidates from existing memories.
The threshold was set to 0.95.

A 0.95 threshold makes sense for embedding cosine similarity.
Two genuinely close sentences land somewhere between 0.85 and 1.0.
But what was actually coming back through this path was a BM25 score.
BM25 is a keyword-overlap-based ranking score, not an intuitive 0-to-1 similarity like cosine.
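To see why, here is a bare Okapi BM25 scorer over a toy corpus (an illustrative sketch; real engines add tokenization, stemming, and tuned k1/b):

```python
import math

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of `doc` for `query`; all three are lists of tokens."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # +1 variant keeps idf positive
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "user likes node js to code".split(),
    "user enjoys hiking on weekends".split(),
    "user drinks tea in the morning".split(),
]
q = "user prefers node js for coding".split()
scores = [bm25_score(q, d, corpus) for d in corpus]
# The matching doc scores highest, but the value scales with query length
# and term rarity; it is not confined to [0, 1]
```

With this corpus the best match scores above 1 and the others well below it, so a cosine-style 0.95 cutoff filters on the wrong scale entirely.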

| Score type | Typical range | “Close” reference point | Note |
| --- | --- | --- | --- |
| Cosine similarity (embedding) | 0.0 to 1.0 | ≥ 0.85 ≈ near-synonymous | Baseline shifts by model |
| BM25 | Unnormalized, roughly 0 to a few dozen | Depends on query/corpus; read relative | Absolute thresholds aren’t portable |
| RRF (reciprocal rank fusion) | Often normalized near 0 to 1 | Implementation-dependent | Original-score meaning is lost |
| LLM reranker confidence | 0 to 1 | ≥ 0.7 often counted as relevant | Often uncalibrated |

In the original case, candidate scores were lining up around 0.10 to 0.12, and a 0.95 threshold dropped all of them.
After lowering the threshold to 0.09, “User prefers Node.js for coding” finally superseded the existing “User likes Node.js to code.”

This is the part that hits hardest in practice.
A field labeled score in RAG or memory-search results does not mean the same thing across retrievers.
Treat embedding cosine, BM25, post-RRF rank fusion, and LLM reranker confidence with the same threshold, and it breaks.
Either re-calibrate the threshold every time you swap retriever, or normalize internally (min-max or rank-based) before comparing.
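A sketch of both normalization options (function names are mine; the RRF constant k = 60 is the conventional default):

```python
def minmax(scores):
    """Squash raw retriever scores into [0, 1] per result set."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def rank_normalize(scores, k=60):
    """RRF-style: use position, not magnitude; immune to score-type changes."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    out = [0.0] * len(scores)
    for rank, i in enumerate(order):
        out[i] = 1.0 / (k + rank)
    return out

bm25_scores = [0.12, 0.10, 0.11]      # the "all dropped by 0.95" case
normalized = minmax(bm25_scores)      # top candidate becomes 1.0
ranked = rank_normalize(bm25_scores)  # top candidate becomes 1/60
```

Min-max is per-query and degenerates with a single candidate, which is exactly why rank-based fusion exists as the sturdier alternative.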

In the CTX article, CTX deliberately changed retrieval per source — git log, code, chat history.
Not pushing every memory into a single vector DB isn’t a stylistic choice; it’s also because the meaning of “score” varies by what you’re retrieving.

A minimum schema for your own agent memory

If you’re adding supersession chains to your own agent memory, five fields are enough.

CREATE TABLE memory (
  id            TEXT PRIMARY KEY,
  text          TEXT NOT NULL,
  embedding     VECTOR(768),       -- pgvector / DuckDB VSS / sqlite-vss, whichever you use
  created_at    TIMESTAMPTZ NOT NULL,
  superseded_by TEXT REFERENCES memory(id),
  superseded_at TIMESTAMPTZ
);

CREATE INDEX active_recall_idx ON memory (id) WHERE superseded_by IS NULL;
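The schema above runs unchanged on SQLite, which treats VECTOR(768) and TIMESTAMPTZ as loosely typed columns, so the retire-vs-delete behavior can be checked in a few lines (a sketch; the embedding column is left unused here):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE memory (
  id            TEXT PRIMARY KEY,
  text          TEXT NOT NULL,
  embedding     VECTOR(768),
  created_at    TIMESTAMPTZ NOT NULL,
  superseded_by TEXT REFERENCES memory(id),
  superseded_at TIMESTAMPTZ
);
CREATE INDEX active_recall_idx ON memory (id) WHERE superseded_by IS NULL;
""")

db.execute("INSERT INTO memory (id, text, created_at) VALUES ('m1', 'User likes coffee', '2025-01-10')")
db.execute("INSERT INTO memory (id, text, created_at) VALUES ('m2', 'User switched to tea', '2025-03-02')")

# Retire, don't delete: point the old row at its replacement
db.execute("UPDATE memory SET superseded_by = 'm2', superseded_at = '2025-03-02' WHERE id = 'm1'")

active = [r[0] for r in db.execute("SELECT text FROM memory WHERE superseded_by IS NULL")]
history = [r[0] for r in db.execute("SELECT text FROM memory")]
# active recall sees only tea; the audit trail still has both rows
```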

The write-path flow looks like this.

flowchart TD
    A[Extract new utterance] --> B[Embed]
    B --> C[Fetch top-N from<br/>active recall by similarity]
    C --> D{Neighbor found?}
    D -- No --> E[Plain INSERT]
    D -- Yes --> F[Ask LLM:<br/>replace / add / unrelated]
    F -- replace --> G[Set superseded_by<br/>on the old memory]
    F -- add --> E
    F -- unrelated --> E
    G --> E

What to actually nail down:

  1. Always go through WHERE superseded_by IS NULL for active recall. Forget this and the whole thing is pointless. A partial index on this clause makes it fast
  2. Use a separate threshold per retriever. Don’t share constants between cosine, BM25, and hybrid scores. For testing, prepare around 5 examples each of “obvious replacement,” “obvious unrelated,” and “gray area” to calibrate
  3. Run the replacement decision through an LLM. Distance alone confuses “loves Python” with “hates Python.” The original implementation only uses distance, but if you’re rolling your own, a single short prompt is worth it

Add an audit endpoint like memory/{id}/history that walks the superseded_by chain, and you can later answer “how many times has this user’s preference changed?”
That’s a property time decay and key overwrite can’t give you — it’s specific to supersession chains.
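That history walk is one recursive query. A sketch on SQLite with a hand-seeded A → B → C chain (table trimmed to the relevant columns):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE memory (
  id TEXT PRIMARY KEY,
  text TEXT NOT NULL,
  superseded_by TEXT REFERENCES memory(id)
);
INSERT INTO memory VALUES
  ('a', 'Likes coffee',          'b'),
  ('b', 'Switched to tea',       'c'),
  ('c', 'Back to coffee, decaf', NULL);
""")

# Walk from the current memory back through everything it replaced:
# each step finds the row whose superseded_by points at the one we have
chain = [row[0] for row in db.execute("""
WITH RECURSIVE history(id, text, depth) AS (
  SELECT id, text, 0 FROM memory WHERE id = ?
  UNION ALL
  SELECT m.id, m.text, h.depth + 1
  FROM memory m JOIN history h ON m.superseded_by = h.id
)
SELECT text FROM history ORDER BY depth
""", ("c",))]
# chain lists the current belief first, then each older one in turn
```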

What it looks like dropped into Heartbeat

The Heartbeat memory in my own kana-chat currently runs entirely on mechanical cutoffs.
today keeps 14 days by date, task_signals is capped at 80 entries, profile at 40.
It only pushes oldest entries out, so there’s no way to express an overwrite relationship.

At the end of the YourMemory article I noted that task_signals keeps a two-week-old murmur like “I want to look into auto-generating OGP images” around as a suggestion candidate long after I’ve lost interest, and that this quietly degrades the experience.
Decay would weaken it naturally, but there are also moments when the user explicitly says “this is no longer relevant,” and you want to take it out immediately.

Adding supersession chains would look like this:

| Area | Current | After supersession |
| --- | --- | --- |
| today | Delete after 14 days by date | Unchanged (diary semantics: a sequence, not replacements) |
| task_signals | Cap at 80, oldest deleted | On INSERT, similarity-judge against existing signals. If it’s a replacement, supersede the old one. Explicit retraction utterances (“forget about that”) also supersede explicitly |
| profile | Cap at 40, oldest deleted | Absorb overwrites like “I’m into Rust lately” → “These days it’s all Go” via semantic replacement. Old interests stay in history |

today should stay separate.
A diary’s identity is “things accumulate within the same date,” not replacements.
Trying to apply supersession would produce weird behavior like “evening overwrites morning.”

For task_signals and profile, the safe move is to layer supersession on top while keeping the existing mechanical cutoffs.
Superseded signals fall out of active recall, so even within the 80-entry cap, only “live signals” line up as suggestion candidates.
If you also factor in a recall_count-style reference frequency, it combines with YourMemory-style reinforcement to separate “knowledge that keeps being useful” from “one-off thoughts.”

The implementation hump is where to put the replacement decision.
Heartbeat already runs on a job queue, so slipping in one short LLM call before task_signals INSERT is realistic.
The prompt is just “does this new signal replace any of the existing signals below, or is it an unrelated addition?”
Cost is one small classification call per INSERT, which should pay for itself in suggestion quality.


Writing all this out, what strikes me is that none of it is “the AI learned.” It is “we wrote it somewhere the AI can look up.”
Human memory is fused with learning, but AI memory only touches the context, not the weights. Even Reflexion-style self-critique loops leave the model parameters θ untouched.

The natural question is why nobody just runs auto-fine-tuning. There are reasons, though: catastrophic forgetting breaks other capabilities, you’d need per-user weights, unlearning is fundamentally hard, audits stop working, and training costs an order of magnitude more than inference. A two-layer setup — bake a personal LoRA in a nightly batch, then layer memos (including supersession) on top — feels like the realistic middle ground for now.

Not unifying “memory = learning” isn’t laziness. There’s a deliberate side to keeping them separate, for safety and reversibility.
Human memory is inseparable from learning because biology had no choice. For AI, the separation might actually be the better operational design.

References