
Agent memory is just lookup: reading arXiv:2604.27707 with CTX and OCR-Memory in mind


I read Contextual Agentic Memory is a Memo, Not True Memory, a preprint from April 30, 2026 by Binyan Xu, Xilin Dai, and Kehuan Zhang.

The claim is blunt.
Calling vector stores, RAG, scratchpads, and long context windows “memory” is a category error — what they actually do is lookup, retrieving memos you wrote earlier.
That sounds provocative, but it’s a useful frame for thinking about tools like CTX and OCR-Memory.

CTX pulls relevant fragments from git log, code, and chat history, then injects them before the prompt.
OCR-Memory converts long execution histories into images, picks the relevant positions, and retrieves the original logs.
Both are useful, but neither changes the model’s weights from continued use.
The paper says: call that “memo,” not “memory.”

Changing C vs. changing theta

The paper’s framing is simple.
There are two paths to changing an LLM agent’s output.

One is changing the context.
Prompt, RAG results, MCP tool output, scratchpad, skill files, memory store — all go into the context window.
The paper calls this C-engineering.

The other is changing the model weights.
Pre-training, fine-tuning, reinforcement learning, and continual learning update theta directly.
If the model distills rules from experience into weights, it can apply them to inputs it hasn’t seen before.

flowchart TD
    A[Past experience] --> B[Save to external store]
    B --> C[Retrieve and inject into context]
    C --> D[Same-weights model reads it]

    A --> E[Distill experience]
    E --> F[Integrate into weights]
    F --> G[Rules apply to new inputs too]
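
To make the contrast concrete, here is a toy sketch in Python: a naive keyword retriever stands in for the C-engineering path, and the theta path is only described in a comment because it has no in-process equivalent. Every name here is invented for illustration, not taken from the paper, CTX, or OCR-Memory.

```python
# Toy illustration of C-engineering vs. theta-learning.
# The store, retriever, and prompt format are all made up for this sketch.

MEMO_STORE = [
    "2026-04-12: decided to keep API errors as problem+json",
    "2026-04-15: payments service uses idempotency keys on POST",
]

def retrieve(query: str, store: list[str], top_k: int = 2) -> list[str]:
    """Rank memos by naive keyword overlap with the query."""
    def score(memo: str) -> int:
        return len(set(query.lower().split()) & set(memo.lower().split()))
    return sorted(store, key=score, reverse=True)[:top_k]

def build_context(query: str) -> str:
    """Path 1: change C. The weights never move; we only prepend memos."""
    return "\n".join(retrieve(query, MEMO_STORE)) + "\n\nUser: " + query

# Path 2: change theta. There is no one-liner for it here: it means fine-tuning
# or RL on collected experience, producing new weights that apply the learned
# rule even when no stored memo matches the query.

print(build_context("how should the payments API report errors?"))
```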

Current agent memory is almost entirely the first path.
MemGPT moves information in and out of context.
Generative Agents writes observations to a memory stream.
Reflexion stores self-critiques in an episodic buffer.
Voyager saves skills as code in a vector database.
All of these are “write it down, look it up later.” None of them change the model itself through experience.

This connects to what I wrote in context window degradation in Claude Code.
Even with a 1M context window, the model doesn’t read every token with equal precision.
The more old logs and tool output you stuff in, the more the model gets dragged by past logs rather than gaining experience.

Retrieval is strong on similar cases, weak on unseen combinations

The paper’s central point is that retrieval-based memory and weight-based memory generalize differently.
Retrieval-based memory is strong when the input resembles something already stored.
Weight-based memory can handle unseen combinations by applying rules abstracted from past cases.

Say you have internal business rules A, B, and C.
The agent has seen each one separately.
But “what to do when A and C apply simultaneously” doesn’t exist in the logs.
Retrieval-based memory can fetch examples of A and examples of C, but it can’t guarantee the correct combined rule.
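
A toy version of that failure, assuming a memo store keyed by the exact combination of rules seen in past cases (the rules and actions are invented):

```python
# Hypothetical memo store: each entry maps the set of rules that applied in a
# past, logged case to the action that was taken.
memo_store = {
    frozenset({"A"}): "apply discount, log to ledger-A",
    frozenset({"B"}): "require manager sign-off",
    frozenset({"C"}): "route invoice to audit queue",
}

def act_from_memos(active_rules: set[str]) -> str:
    """Retrieval-style agent: answers only if this exact combination was stored."""
    key = frozenset(active_rules)
    if key in memo_store:
        return memo_store[key]
    # The store has examples of A and of C separately, but nothing guarantees
    # the correct joint behaviour when both apply at once.
    raise LookupError(f"no stored example for combination {sorted(active_rules)}")

print(act_from_memos({"A"}))        # works: this exact case was stored
try:
    act_from_memos({"A", "C"})      # unseen combination of known rules
except LookupError as err:
    print(err)
```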

The paper formalizes this as compositional generalization.
With $k$ concepts and a correct output for each pair, retrieval-based memory basically needs a stored example for each pair.
Required examples grow proportionally to the number of concept pairs.
If weight updates can learn the operation rule itself, required examples depend on the complexity of the hypothesis class instead.

The strength of the formal argument is in the problem setup, not the constants.
It closes off the escape hatches of “just improve retrieval quality” and “just increase the context window.”
Even with better example selection, whether the system can treat an unstored combination as a rule is a separate question.

“Frozen novice” — an uncomfortable label

The paper calls an agent with only retrieval-based memory a frozen novice.
More memos accumulate, but the novice never becomes an expert.

This hits if you use Claude Code or Codex daily.
No matter how carefully you maintain CLAUDE.md, past logs, handoffs, and git history, a new session starts from the same weights.
The stock of frequently referenced information grows, but the model doesn’t internalize “in this project, make this judgment.”

In the CTX article, I looked at how a UserPromptSubmit hook injects relevant context before each prompt.
That’s genuinely practical for solo development.
But in the paper’s taxonomy, CTX is a good implementation of C-engineering, not theta-learning.

Mixing these up leads to wrong expectations.
Retrieval-based memory works for “find yesterday’s decision,” “recall the file I touched before,” “restore the user’s preferences.”
But stacking those won’t necessarily make the model stronger at unseen combinations.

Memory poisoning doesn’t stop at one injection

The security angle is pointed.
External-store memory turns prompt injection from a transient problem into a persistent one.

A web page, issue, email, or PDF contains a malicious instruction.
The agent reads it once and saves a summary or “lesson” to the memory store.
Next session, it gets injected as relevant context.
By then the agent is no longer reading the original source — only the poisoned memo remains.

flowchart TD
    A[Malicious instruction in external data] --> B[Agent reads it]
    B --> C[Saved as summary or reflection]
    C --> D[Retrieved in next session]
    D --> E[Re-injected into context]
    E --> F[Transient attack becomes persistent]

This is harder to handle than the instruction hierarchy issue in GitHub and OpenAI’s prompt injection defense.
Treating the current tool message as low-trust isn’t enough.
Low-trust external data comes back later as “useful past memory.”

If you maintain a memory store, you need trust boundaries at write time, provenance tracking, write reasons, and deletion mechanisms.
Just dumping into a vector database and running similarity search lets an attacker plant instructions that persist into future sessions.
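
A minimal sketch of what write-time trust boundaries, provenance, and deletion could look like; the fields and policy here are my assumptions, not something the paper or any particular tool specifies:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class MemoryRecord:
    text: str                # the memo itself
    source: str              # provenance: URL, file path, chat id, etc.
    trust: str               # e.g. "user", "internal", "external-untrusted"
    reason: str              # why the agent decided to store it
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def write_memory(store: list, record: MemoryRecord) -> bool:
    """Trust boundary at write time: don't silently persist low-trust content."""
    if record.trust == "external-untrusted":
        return False  # park it for review instead of letting it re-enter context later
    store.append(record)
    return True

def purge_by_source(store: list, source: str) -> list:
    """Deletion mechanism: drop everything that traces back to one source."""
    return [r for r in store if r.source != source]
```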

Weight integration doesn’t solve everything

The paper advocates strongly for weight-based memory, but practical answers aren’t readily available.
Safely running continuous fine-tuning per user or per project in production is expensive.
Baking wrong experience into weights is harder to undo than clearing an external store.
Authorization, auditing, rollback, forgetting, and data isolation all become issues.

So in practice, this doesn’t mean abandoning retrieval-based memory.
The paper itself proposes coexistence of fast episodic lookup and slow weight consolidation.
Recent logs, tool output, user settings, and well-grounded decisions are fine in external stores.
Only procedures that repeatedly succeed or rules extractable from multiple cases get consolidated into weights or adapters through a separate pipeline.
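
A sketch of how that split could be wired up, with a promotion threshold I chose arbitrarily; the actual consolidation step (fine-tuning or adapter training) happens offline and only shows up as a queue here:

```python
from collections import Counter

episodic_log = []           # fast path: cheap to write, cheap to inspect, cheap to clear
success_counts = Counter()  # how often each procedure has succeeded so far
consolidation_queue = []    # slow path: candidates for weight/adapter training

PROMOTE_AFTER = 5  # arbitrary threshold for "repeatedly succeeds"

def record_outcome(procedure: str, succeeded: bool) -> None:
    """Log every outcome; promote a procedure only after repeated success."""
    episodic_log.append({"procedure": procedure, "ok": succeeded})
    if succeeded:
        success_counts[procedure] += 1
        if success_counts[procedure] == PROMOTE_AFTER:
            consolidation_queue.append(procedure)

for _ in range(5):
    record_outcome("run migrations before deploy", True)

print(consolidation_queue)  # ['run migrations before deploy']
```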

For personal dev environments that aren’t ready for that, getting the naming right matters more.
CLAUDE.md, CTX, YourMemory, OCR-Memory, WUPHF — they’re all useful.
But thinking of them as “better-organized work notes” rather than “a smarter model” leads to fewer surprises.

Benchmarks need more than recall

If the paper is right, memory benchmarks need rethinking.
Can the agent recall a preference from a past conversation? Can it retrieve the correct fact? Can it find the needle?
Those tests are necessary but skewed toward retrieval-based memory’s strengths.

To measure whether an agent genuinely improves from experience, you need to check whether unseen combination tasks get better after accumulating the same work logs.
For example: an agent that learned Company A’s API conventions and Company B’s audit rules separately — can it handle Company A’s API under Company B’s audit rules as a new task?
If the logs don’t contain that exact combination, retrieval alone struggles.
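
One way to build that into a benchmark is to hold out whole concept pairs and score only on them. A sketch with placeholder concepts and a retrieval-only baseline that memorizes exactly what it experienced:

```python
import itertools
import random

concepts = ["api-A", "audit-B", "billing-C", "auth-D"]   # placeholders
all_pairs = list(itertools.combinations(concepts, 2))

random.seed(0)
held_out = set(random.sample(all_pairs, k=2))             # never shown as experience
experience = [p for p in all_pairs if p not in held_out]

def evaluate(agent, truth) -> float:
    """Score only on held-out pairs: the compositional part of the benchmark."""
    hits = sum(agent(a, b) == truth(a, b) for a, b in held_out)
    return hits / len(held_out)

truth = lambda a, b: f"rule({a},{b})"                      # toy ground truth
seen = {p: truth(*p) for p in experience}
agent = lambda a, b: seen.get((a, b), "unknown")           # pure lookup, no rule

print(evaluate(agent, truth))   # 0.0: lookup alone misses every held-out pair
```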

Research like OCR-Memory improves the ability to find the right spot in a long history.
That matters.
But “where to look” in the history and “how to judge” from experience are separate capabilities.

Inside the theorem

I mentioned “the strength of the formal argument” above without showing the theorem, so here it is.

Start with $k$ concepts.
If a codebase has 10 distinct conventions, $k = 10$.
Let the concept set be $F = \{f_1, \ldots, f_k\}$, and let the operation that returns the correct output for a pair of concepts be $\oplus: F \times F \to Y$.
Think of it as a lookup table: “when convention A and convention C apply together, the correct code looks like this.”

The total number of concept pairs is $N = \binom{k}{2}$.
$k = 10$ gives 45 pairs; $k = 100$ gives 4,950.
A dataset $D$ contains $n$ known pairs; the rest are unseen.
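
The pair count itself is quick to check:

```python
from math import comb

for k in (10, 100):
    print(k, comb(k, 2))   # 10 -> 45, 100 -> 4950
```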

CGC (Compositional Generalization Capacity) is “the probability of answering correctly on unseen concept pairs”:

$$\mathrm{CGC}(M, D) = \Pr_{(f_i, f_j)} \bigl[ M(f_i, f_j) = \oplus(f_i, f_j) \bigr]$$

Pairs are drawn uniformly at random.
High CGC means the model handles combinations not in $D$.
Low CGC means it only gets the stored pairs right.
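
Measured empirically, CGC is just accuracy restricted to the pairs outside $D$; a minimal sketch where the model and the operation are arbitrary callables:

```python
def cgc(model, oplus, all_pairs, known_pairs) -> float:
    """Fraction of unseen concept pairs the model answers correctly."""
    unseen = [p for p in all_pairs if p not in known_pairs]
    if not unseen:
        return 1.0
    hits = sum(model(a, b) == oplus(a, b) for a, b in unseen)
    return hits / len(unseen)
```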

Now consider a frozen model (weights unchanged) inferring rules from $K$ few-shot examples in context.
Its accuracy has an upper bound $\bar{\alpha} < 1$.

This bound isn’t an assumption — it’s derived from Fano’s inequality.
Fano’s inequality says: if the information from observed data isn’t enough to distinguish candidates, accuracy can’t reach 1.
When the hypothesis class $H$ is large and $\log|H| > K \log|Y|$, $K$ examples can't narrow it down to one.
$K$ examples with $|Y|$ possible outputs yield at most $K \log|Y|$ bits.
If $\log|H|$ bits are needed to distinguish candidates, the shortfall becomes unavoidable error.
Even filling the context window with few-shot examples, a large candidate space means perfect accuracy is out of reach.
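
The bit counting behind that ceiling, with sizes for $|H|$, $|Y|$, and $K$ that I made up for illustration:

```python
import math

H_size = 2 ** 40          # hypothetical number of candidate rule systems
Y_size = 16               # hypothetical number of distinct outputs
K = 8                     # few-shot examples that fit in context

needed = math.log2(H_size)         # 40.0 bits to pin down one hypothesis
available = K * math.log2(Y_size)  # 32.0 bits carried by the examples
print(needed > available)          # True: the examples cannot identify the rule
```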

Theorem 1 shows that the number of examples needed for target accuracy $1 - \delta$ (smaller $\delta$ = higher accuracy demand) separates structurally between retrieval and weight-based approaches.
Under the condition $\delta < 1 - \bar{\alpha}$ — meaning you want accuracy beyond the in-context ceiling:

Retrieval-based needs $\Omega(k^2)$ examples:

$$n_R \geq \frac{1 - \delta - \bar{\alpha}}{1 - \bar{\alpha}} \binom{k}{2}$$

Retrieval works by “finding a similar stored pair as reference,” so accuracy on unseen pairs is capped by the in-context ceiling $\bar{\alpha}$.
To guarantee accuracy above that ceiling, you end up needing most of the pairs stored.
$\binom{k}{2}$ is roughly $k^2/2$, so required storage grows quadratically with the number of concepts.
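
Plugging illustrative numbers into the bound (the values of $\delta$ and $\bar{\alpha}$ are mine, not the paper's):

```python
from math import comb

k, delta, alpha_bar = 100, 0.05, 0.6   # illustrative values only
n_R_lower = (1 - delta - alpha_bar) / (1 - alpha_bar) * comb(k, 2)
print(round(n_R_lower))   # 4331 of the 4950 pairs would already need to be stored
```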

Weight-based needs examples proportional to the VC dimension $d$ of the hypothesis class:

$$n_P = O\!\left(\frac{d + \log(1/\delta)}{\delta}\right)$$

VC dimension measures “how complex is this rule system” in learning theory, determined independently of $k$.
A structured system like “adopt the higher-priority concept” might have $d = O(k)$; simpler ones could have $d = O(1)$.
Weight updates depend only on rule-system complexity, so the sample count doesn’t balloon as fast as the pair count.

The gap:

$$\frac{n_R}{n_P} = \Omega\!\left(\frac{k^2}{d}\right)$$

For structured operations with $d = O(k)$, the gap is $\Omega(k)$.
For $d = O(1)$, it's $\Omega(k^2)$.
As concepts grow, retrieval-based storage grows quadratically while weight-based grows with VC dimension only.
Improving retrieval accuracy or expanding the context window doesn’t shrink this quadratic gap.

The security side has its own formula.
With per-interaction injection probability $p_0$ and cumulative interactions $N(t)$:

$$P(\text{compromised by } t) = 1 - (1 - p_0)^{N(t)}$$

If each interaction is independently compromised with probability $p_0$, surviving all $N(t)$ rounds has probability $(1 - p_0)^{N(t)}$; the complement is “at least one hit.”
As $N(t) \to \infty$, this converges to 1.
This is the quantified version of “transient becomes persistent” from the poisoning section — the paper calls it evil-squared.

Plugging in numbers:

$\frac{n_R}{n_P} = \Omega(k^2/d)$ with $d = O(k)$ gives a gap of $\Omega(k)$.
$k = 10$ means roughly 10x; $k = 50$ means 50x; $k = 100$ means 100x.

Projects with 100+ conventions, API specs, and audit rules are common.
Each additional concept adds about $k$ new combination examples to store, so the gap widens with project longevity.

For the security formula:
$p_0 = 0.01$ (1% per interaction), 10 interactions per day for 30 days ($N(t) = 300$):

$$P(\text{compromised by 30d}) = 1 - 0.99^{300} \approx 0.951$$

Even at 1%, 300 interactions reach 95% compromise probability.
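
The same arithmetic, spelled out:

```python
p0, N = 0.01, 300            # 1% per interaction, 300 interactions
print(1 - (1 - p0) ** N)     # ~0.951 compromise probability
```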

References