
OCR-Memory Lets Agents Recall History as Images


I read OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory on arXiv.
Accepted to the ACL 2026 main conference, it stores long agent execution history as images and searches over those images rather than summarizing or indexing the text.

My first reaction was “OCR for memory?” but this isn’t about character recognition tools.
It brings the Contexts Optical Compression idea from the DeepSeek-OCR article into long-term agent memory.
Feeding raw text back into the context is expensive. Summarization loses detail.
So the idea is to compress history into images and retrieve only the relevant segments from the original logs.

Separating the Model That Reads History from the Model That Reasons

The key insight in OCR-Memory is that the memory retrieval model never generates answers.
Each segment of the history is rendered into an image with red bounding boxes and numbers.
At retrieval time, a VLM only picks which numbers are relevant to the current query.
Then the original text corresponding to those numbers is pulled verbatim from a database.

```mermaid
flowchart TD
    A[Past agent history] --> B[Split into segments]
    B --> C[Render as images with red boxes and numbers]
    C --> D[Visual Memory Bank]
    E[New task] --> F[Optical Retriever]
    D --> F
    F --> G[Select relevant numbers]
    G --> H[Retrieve verbatim text from original logs]
    H --> I[Inject into reasoning agent context]
```

This separation matters a lot.
If you ask a VLM to “read the relevant text from the image,” it may hallucinate plausible content where the resolution is low or the text is blurry.
OCR-Memory narrows the VLM’s job to a binary relevance decision and retrieves the evidence text deterministically from the original logs.
The paper calls this approach Locate-and-Transcribe.

Despite the name “Transcribe,” the model doesn’t freely transcribe anything.
Separating “locate the position” from “return the original text” makes it easier to avoid generation-induced hallucination.
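
As a minimal sketch of this separation (the `vlm` callable stands in for the fine-tuned retriever; all names here are illustrative, not from the paper):

```python
import re
from typing import Callable

def locate_and_transcribe(
    vlm: Callable[[str, str], str],  # (image_path, prompt) -> raw model output
    image_path: str,
    query: str,
    log_db: dict[int, str],          # segment number -> verbatim original text
) -> list[str]:
    # Locate: the retriever only names segment numbers; it never writes evidence.
    raw = vlm(image_path, f"List the segment numbers relevant to: {query}")
    picked = [int(n) for n in re.findall(r"\d+", raw)]
    # Transcribe: deterministic verbatim lookup, so a blurry render can't
    # hallucinate evidence -- at worst it misses or mis-selects a segment.
    return [log_db[n] for n in picked if n in log_db]
```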

Images Work as an Index, Not Compression

In the DeepSeek-OCR context, the main story was compressing text into visual tokens by rendering it as an image.
The DeepSeek-OCR paper reported ~97% OCR accuracy when the text-to-visual-token ratio is under 10x, dropping to ~60% at 20x.
OCR-Memory doesn’t use this idea for full-text recovery. Instead, it uses the image as an index to find relevant segments.

The implementation uses DeepSeek-OCR 3B as the base model.
The visual encoder is frozen, and only the language decoder is fine-tuned with LoRA.
Training data comes from HotpotQA supporting facts, reformulated as a task of predicting which segment numbers are relevant to a question.

| Component | Role |
|---|---|
| DeepSeek-OCR 3B | Base model for reading rendered history images |
| Set-of-Mark | Red boxes and numbers to make segments addressable |
| LoRA fine-tuning | Adds the ability to select relevant numbers |
| Original log DB | Returns verbatim text for selected numbers |
| Reasoning agent | Solves the task using retrieved evidence |

Under this design, memory images serve as a visual index for retrieval.
What ultimately reaches the agent is not images but the selected original text.
The image’s role is to cheaply determine where to look in a massive history.

Blur Old History, Restore on Hit

The interesting part is that older history gets lower resolution.
The paper keeps the most recent 5 steps at 1024x1024 and downgrades everything before that to 512x512.
In DeepSeek-OCR’s visual token budget, 1024x1024 costs 256 tokens and 512x512 costs 64 tokens.

When a low-resolution segment gets a retrieval hit, the system re-renders a high-resolution image from the original log and keeps it at high resolution for the rest of that episode.
The paper calls this Active Recall Upscaling.
The metaphor is “old memories are blurry, but become vivid once recalled as important.”
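
As a minimal sketch, the whole policy fits in a few lines (`recalled` is a hypothetical set of step indices that have had a retrieval hit this episode):

```python
RECENT_STEPS = 5            # the paper keeps the 5 most recent steps sharp
HI_RES, LO_RES = 1024, 512  # 256 vs 64 visual tokens in DeepSeek-OCR's budget

def resolution_for(step: int, current_step: int, recalled: set[int]) -> int:
    # Recent history stays at 1024x1024; older steps blur to 512x512 unless a
    # retrieval hit has promoted them (Active Recall Upscaling keeps them
    # sharp for the rest of the episode).
    if current_step - step < RECENT_STEPS or step in recalled:
        return HI_RES
    return LO_RES
```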

This connects to the forgetting curve discussed in the YourMemory article.
YourMemory decays retrieval scores over time.
OCR-Memory doesn’t remove entries from the search pool but reduces image resolution to save visual tokens.
Both share the principle of “don’t keep everything at the same fidelity,” but differ in how they degrade.

Benchmark Results Are Strong Under Token Constraints

Evaluation uses Mind2Web and AppWorld.
The memory module’s context is fixed at 4096 tokens.
OCR-Memory achieves Element Accuracy 53.8%, Step SR 46.1%, Task SR 4.8% on Mind2Web.
Average Success Rate on AppWorld is 58.1%.

| Method | Mind2Web Ele Acc | Mind2Web Step SR | Mind2Web Task SR | AppWorld Avg |
|---|---|---|---|---|
| Retrieval | 41.3 | 38.9 | 2.7 | 46.2 |
| MemoryBank | 43.8 | 39.2 | 3.3 | 52.1 |
| AWM | 49.1 | 42.6 | 4.3 | 55.0 |
| ACON | 48.2 | 41.4 | 4.1 | 56.2 |
| OCR-Memory | 53.8 | 46.1 | 4.8 | 58.1 |

The gains stand out under tight token budgets.
Across 1024, 2048, 4096, and 8192 token limits, OCR-Memory consistently outperforms Text-RAG. At 1024 tokens the gaps are +17.0 Element Accuracy, +14.8 Action F1, and +14.8 Step SR.

In Needle-in-a-Haystack long-context retrieval, Recall@1 is 98.5% at 4k and 94.1% at 32k.
Compression ratios stay roughly in the 10x range.
Searching over visually compressed history to locate relevant segments beats reading the full text when context is limited.

What Gets Cheaper and What Gets More Expensive

OCR-Memory is not universally cheap.
The efficiency profile in the paper shows Text-RAG at 18KB per episode for Mind2Web continuous logs, versus 1.47MB for OCR-Memory.
Retrieval latency is 0.3s for Text-RAG versus 1.7s for OCR-Memory.
On the other hand, text tokens fed back to the reasoning LLM drop from 3,980 to 596.

| Metric | Text-RAG | OCR-Memory |
|---|---|---|
| Disk per episode | 18KB | 1.47MB |
| Text returned to reasoning per step | 3,980 tokens | 596 tokens |
| Retrieval latency per step | 0.3s | 1.7s |

What’s being saved is primarily reasoning context.
In exchange, you pay for image rendering, an additional VLM, disk space, and retrieval latency.
It’s too heavy to drop into a lightweight local agent memory.
The payoff comes in long-running web or API agents where history keeps growing and accurate past log retrieval directly impacts success rates.

This pairs well with the GLM-5.1 Long-Horizon agent article.
Even if the model itself can endure 600+ iterations, sloppy history management means it can’t retrieve old errors or successful procedures.
Long-horizon capability depends not just on model endurance but on how history is stored, searched, and re-injected.

Still at the Research System Stage

The paper itself is clear about limitations.
OCR-Memory is not training-free; it requires fine-tuning a dedicated optical retrieval model.
Rendering history into images is heavier than text storage.
On top of that, the visual encoder needs to be loaded in memory separately from the reasoning agent.

The evaluation is also limited to benchmarks, Mind2Web and AppWorld.
Real agent logs are a mix of screenshots, HTML, tool outputs, errors, user instructions, and intermediate reasoning.
What to segment, how much to render into images, and at what granularity to map back to original logs remain design problems.

What You’d Need to Build This Locally

The paper’s code isn’t published, but the individual components are available.

The base VLM is DeepSeek-OCR 3B, available from the GitHub repository.
At 3B it’s about 6GB in fp16, runnable on an M1 Mac or RTX 3060 class GPU.
Since the visual encoder is frozen and only LoRA is applied, training fits in 8GB VRAM.

Training data reuses HotpotQA supporting facts.
The procedure for reformulating QA pairs into “which segment numbers are relevant” is straightforward.
Split text into segments, render them as numbered images with Set-of-Mark, and label the segment numbers containing the correct supporting facts.
Start with PEFT's LoraConfig at rank 8-16 and alpha 16-32.
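
A sketch of that setup with Hugging Face PEFT. The submodule attribute names and target modules are assumptions, since the paper's code isn't published; check the actual DeepSeek-OCR repository for the real names:

```python
from peft import LoraConfig, get_peft_model

# Assumption: `model` is DeepSeek-OCR 3B loaded via transformers, with separate
# vision-encoder and language-decoder submodules -- attribute names below are
# illustrative.
for p in model.vision_encoder.parameters():
    p.requires_grad = False  # the paper freezes the visual encoder

config = LoraConfig(
    r=16,                                 # rank, in the suggested 8-16 range
    lora_alpha=32,                        # alpha, in the suggested 16-32 range
    target_modules=["q_proj", "v_proj"],  # assumed decoder attention projections
    task_type="CAUSAL_LM",
)
model.language_model = get_peft_model(model.language_model, config)
```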

Pillow is enough for image rendering.
Draw text with a specified font and overlay red rectangles with numbers at segment boundaries.
The paper uses a grid layout to pack multiple segments into a single image.
Resolution is 1024x1024 for recent segments and 512x512 for older ones, switched by the episode manager.
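
A Pillow sketch of the Set-of-Mark rendering. It stacks segments vertically rather than using the paper's grid layout, and does no text wrapping, so it's a starting point rather than a faithful reproduction:

```python
from PIL import Image, ImageDraw, ImageFont

def render_segments(segments: list[str], size: int = 1024) -> Image.Image:
    # One numbered, red-boxed row per segment on a white canvas.
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    row_h = size // max(len(segments), 1)
    for i, text in enumerate(segments):
        top = i * row_h
        # Red bounding box plus a number make the segment addressable by the VLM.
        draw.rectangle([4, top + 4, size - 4, top + row_h - 4], outline="red", width=3)
        draw.text((10, top + 10), f"[{i}]", fill="red", font=font)
        draw.text((56, top + 10), text, fill="black", font=font)
    return img
```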

SQLite works for the log DB.
Key on segment ID, store original text with a timestamp and current resolution flag.
Active Recall Upscaling on a retrieval hit just re-renders at 1024x1024 from the original text and swaps the image in the memory bank.
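
A sketch of the schema and the upscale-on-hit path, reusing the `render_segments` helper from above (the database and image-bank paths are illustrative):

```python
import sqlite3

con = sqlite3.connect("memory.db")
con.execute("""CREATE TABLE IF NOT EXISTS segments (
    id         INTEGER PRIMARY KEY,   -- segment ID, keyed to the Set-of-Mark number
    text       TEXT NOT NULL,         -- verbatim original log text
    ts         REAL NOT NULL,         -- timestamp
    resolution INTEGER NOT NULL       -- current render resolution: 512 or 1024
)""")

def upscale_on_hit(seg_id: int) -> None:
    # Active Recall Upscaling: a retrieval hit re-renders the segment at high
    # resolution from the original text and swaps the image in the memory bank.
    row = con.execute("SELECT text FROM segments WHERE id = ?", (seg_id,)).fetchone()
    if row is None:
        return
    render_segments([row[0]], size=1024).save(f"bank/{seg_id}.png")
    con.execute("UPDATE segments SET resolution = 1024 WHERE id = ?", (seg_id,))
    con.commit()
```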

For agent loop integration, insert log-save, render, and memory-bank-update after each step, and insert query, search, text-retrieval, and context-injection before each step.
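
In skeleton form, with hypothetical `agent` and `memory` interfaces standing in for your own:

```python
def run_episode(agent, memory, task: str, max_steps: int = 200):
    for step in range(max_steps):
        # Before the step: query -> visual search -> verbatim text -> injection
        evidence = memory.retrieve(task)     # optical retrieval + original-log lookup
        result = agent.step(task, context=evidence)
        # After the step: log save -> render -> memory bank update
        memory.append(step, result.log)      # store verbatim text and rendered image
        if result.done:
            break
```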

To prototype without fine-tuning, skip the DeepSeek-OCR 3B LoRA training and pass Set-of-Mark images to GPT-4o or Claude Sonnet with few-shot prompting.
Just send the instruction “list all segment numbers in this image that are relevant to question X.”
API call cost goes up, but you skip training data preparation and fine-tuning iteration entirely.
Retrieval accuracy is unknown until you try, but that's enough to validate whether the OCR-Memory design works.
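
A few-shot prototype along those lines with the OpenAI API; the model choice and prompt wording are just one reasonable starting point:

```python
import base64
from openai import OpenAI

client = OpenAI()

def select_segments(image_path: str, question: str) -> str:
    # Send the Set-of-Mark image inline as a base64 data URI.
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text":
                    f"List all segment numbers in this image that are relevant "
                    f"to: {question}. Reply with comma-separated numbers only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```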

The tricky part is segment granularity.
The paper uses one agent step per segment.
In practice, agent logs vary from a few lines to hundreds of lines per step.
Too short and a single image packs too many segments to visually distinguish; too long and the VLM can’t pick out the needed information.
It requires experimentation with token-count-based splitting, tool-call-based splitting, and so on.
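
For example, a greedy token-count splitter; the 4-characters-per-token estimate is a crude heuristic, so swap in a real tokenizer (e.g. tiktoken) for accuracy:

```python
def split_by_tokens(lines: list[str], max_tokens: int = 256) -> list[str]:
    # Greedily pack log lines into segments under an estimated token budget.
    segments, buf, count = [], [], 0
    for line in lines:
        est = max(1, len(line) // 4)  # rough ~4 chars/token estimate
        if count + est > max_tokens and buf:
            segments.append("\n".join(buf))
            buf, count = [], 0
        buf.append(line)
        count += est
    if buf:
        segments.append("\n".join(buf))
    return segments
```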

1M Context Isn’t Enough for Full History

Claude 1M context GA, DeepSeek V4’s 1M support, Gemini’s 2M — context windows are growing fast.
You might think stuffing everything in eliminates the need for something like OCR-Memory. It doesn’t.

Even with a large window, accuracy drops when lots of irrelevant text is present.
Chroma Context-1 addressed this with a self-editing mechanism where the model deletes irrelevant passages, while OCR-Memory filters visually at the retrieval stage.
There’s no guarantee a model can accurately retrieve error-avoidance procedures from 200 steps ago just because the full history fits in a 1M window.
The fact that OCR-Memory includes a Needle-in-a-Haystack experiment signals that “fitting into a wide window doesn’t preserve retrieval accuracy.”

Cost is also harsh.
As investigated in the Anthropic Prompt Cache TTL article, Prompt Caching only applies to system prompts and the unchanging prefix of the conversation; agent execution history grows every step.
New input tokens are charged at full rate.
Sending full history every step over a 200-step task makes API costs grow super-linearly.
This is why the token management guide for bloated CLAUDE.md stated that “token management is core to agent design.”

Lining up existing cloud-side approaches makes OCR-Memory’s position clearer.

Compresr Context Gateway sits between the agent and the LLM API as a proxy, summarizing tool outputs and applying speculative compression to reduce tokens.
It’s text-to-text compression that works with existing API calls, but detail is lost the moment you summarize.

Cloudflare Agent Memory is a managed service that auto-extracts memories from conversations and stores them in Durable Objects/Vectorize.
No need to build your own memory infrastructure, but the extraction logic depends on Cloudflare’s implementation.

OCR-Memory starts from a different place.
It keeps all original text while offloading “where to look” to cheap visual search.
Information that would be crushed by compression survives because the original log is restored.
The tradeoff: what would be 18KB in text storage becomes 1.47MB as images, and retrieval latency goes from 0.3s to 1.7s.

Using VLMs Other Than DeepSeek-OCR for the Optical Retriever

The earlier section covered DeepSeek-OCR 3B LoRA training and GPT-4o/Claude few-shot validation, but there are more local VLM options.

The experiment extracting RPG parameters from character images with a local Vision LLM ran Qwen2.5-VL 7B on Ollama for structured extraction from images.
The task OCR-Memory’s optical retriever demands is also structured extraction — “look at the image and return relevant numbers” — not free-form generation. It runs on the same class of capability.

Qwen2.5-VL 7B is a general-purpose VLM with strong image-text reading ability, and it runs instantly on Ollama.
Around 8GB VRAM is enough for inference, so an M1/M4 Mac or RTX 3060 class GPU works.
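
Wiring it up as the optical retriever via the Ollama Python client might look like this; the model tag is an assumption, so check the Ollama registry for the exact name:

```python
import ollama

def select_segments_local(image_path: str, question: str) -> str:
    # Assumption: the qwen2.5vl:7b tag; verify against the Ollama registry.
    resp = ollama.chat(
        model="qwen2.5vl:7b",
        messages=[{
            "role": "user",
            "content": f"List all segment numbers in this image relevant to: "
                       f"{question}. Reply with numbers only.",
            "images": [image_path],  # Ollama accepts local image paths here
        }],
    )
    return resp["message"]["content"]
```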

InternVL2-8B ranks high on OCR and document understanding benchmarks, with particular strength on text-dense images.
Set-of-Mark images with red boxes are exactly text-dense input, so the fit looks good.

Moondream 2B runs in the 2GB class as the lightest VLM, but accuracy on text-dense images is largely unverified.

The paper’s ablation study shows DeepSeek-OCR 3B without LoRA dropping nearly 10 points in Element Accuracy compared to the LoRA version.
This suggests few-shot alone isn’t enough at the 3B scale, but a 7B-8B general-purpose VLM with stronger baseline visual understanding may reach accuracy close to DeepSeek-OCR 3B+LoRA without fine-tuning.
If you want to skip LoRA training, going bigger than DeepSeek-OCR 3B with a general-purpose VLM is the practical direction.

HypurA-style NVMe streaming SSD optimization would help OCR-Memory by accelerating image-storage reads.
VLM inference itself is light at 3B-7B, but each retrieval reads multiple images from disk.
If the memory bank has hundreds of 1024x1024 images being scanned every step, NVMe read speed directly impacts latency.

Distance from Kana-Chat’s Heartbeat Memory

Kana-Chat v1 launches a tmux session per task so that one task’s context doesn’t leak into another.
Heartbeat memory, added in v2, has Haiku extract a user profile, daily activities, and task signals from the most recent 60 user messages, accumulating them as structured JSON.
Injecting this JSON into the prompt at the next job launch enables cross-session context handoff.

What Heartbeat discards is the original conversation text. Once Haiku extracts “this person is knowledgeable about security articles,” the 50 conversations that led to that extraction are compressed into structured JSON and lost.
OCR-Memory does the opposite, losslessly rendering the full history into images and having a VLM locate relevant segments to restore the original text at retrieval time.
Heartbeat is suited for remembering “roughly what happened,” while OCR-Memory is suited for recovering “exactly what was written.”
They’re not mutually exclusive — a two-layer approach where Heartbeat holds rough context and OCR-Memory-style retrieval restores originals when detail is needed is plausible.

What Kana-Chat prevents is “lateral contamination” between tasks, while what OCR-Memory addresses is “vertical bloat” within a single task.
Kana-Chat tasks usually finish within a few dozen steps, while OCR-Memory's sweet spot is long-horizon runs of hundreds of steps or more. The coverage areas don't overlap.

In multi-agent setups, both problems appear simultaneously.

Claude Code multi-agent review has multiple agents reviewing PRs in parallel, pooling findings, and passing them to a verification phase.
Kana-Chat also has a setup where Claude Code workers’ results are handed to a Codex reviewer.
Both face the problem of efficiently passing one agent’s work history to another, and what to pass versus discard is currently decided by the system designer.

If OCR-Memory’s mechanism were adapted for multi-agent use, each agent’s full history could be rendered into images in a shared memory bank, with a coordinator running visual searches like “what did other agents find about this file?” and passing only the hits.

YourMemory is a local MCP that decays old memories using an Ebbinghaus forgetting curve, but it targets single-agent long-term memory and doesn’t handle cross-agent history sharing.
The pattern described in the confidence-score article on reducing human review for document extraction — “handle the bulk cheaply, spend cost only where accuracy matters” — is structurally the same as OCR-Memory’s “locate cheaply with visual search, restore accurately from original text.”
