Week 10 · May 2026

OCR-Memory: Why Text-Based Agent Memory Loses Evidence - and How Visual Encoding Fixes It

May 2, 2026 · by Satish K C · 8 min read
Deep Learning · LLMs · Agents · Efficiency

The Paper

"OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory" was published in April 2026 by Jinze Li, Yang Zhang, Xin Yang, Jiayi Qu, Jinfeng Xu, Shuo Yang, Junhua Ding, and Edith Cheuk-Han Ngai from the University of Hong Kong, University of North Texas, University of Tsukuba, and Yonsei University, accepted at ACL 2026 Main Conference. The paper argues that agent memory should shift from the text domain to the visual modality - rendering interaction trajectories as annotated images and retrieving evidence through a Locate-and-Transcribe mechanism that fetches verbatim text deterministically rather than generating it, achieving 100% retrieval faithfulness while cutting reasoning-context tokens by 6.7x compared to text-based RAG.

Read the Paper on arXiv →

The Problem Before This Paper

Long-horizon agents generate extensive interaction histories - reasoning traces, tool invocations, environment feedback - that are critical for future reference but impossible to store verbatim under finite context windows. Existing approaches force a painful trade-off. Retrieval-based systems (MemGPT, MemoryBank, Raptor) store past interactions externally and fetch relevant fragments via semantic similarity, but similarity matching is brittle for tasks that depend on causality or long-range dependencies rather than topical overlap. Experience abstraction methods (AWM, Expel, Dilu) compress trajectories into reusable skills or procedural knowledge, but discard the low-level details - exact error messages, intermediate states, nuanced dialogue turns - that are essential for debugging, faithful retrospection, and grounded decision-making. Context compression approaches (ACON, LLMLingua, MemGen) reduce the text itself via latent representations or token pruning, but text-centric compression inevitably trades compression ratio against information fidelity, especially in multimodal settings where visual layouts and structural cues are lost under pure textual summarization.

What They Built

OCR-Memory stores agent trajectories as rendered images rather than raw text, leveraging the DeepSeek-OCR (3B) vision encoder to compress dense textual content into a small number of visual tokens - achieving over 10x compression while preserving full fidelity. Each trajectory chunk is rendered into a marked image with Set-of-Mark (SoM) visual anchors: red bounding boxes annotated with unique numerical IDs that highlight individual text segments. When a new query arrives, the retrieval module scans these visual representations and outputs a binary relevance vector - predicting which segment indices are relevant - rather than generating free-form text. The corresponding original text is then deterministically fetched from an external log, completely eliminating generation-based hallucination. To handle growing history, OCR-Memory implements a multi-resolution aging policy: the five most recent interaction steps are stored at 1024x1024 (256 visual tokens), while all older history is downsampled to 512x512 (64 visual tokens). When a low-resolution memory is retrieved as relevant, an Active Recall mechanism upscales it back to high fidelity on demand - mimicking the vivid-to-fuzzy decay of human memory while preserving the ability to recover full detail when needed.

// Visual Encoding (DeepSeek-OCR):
Z = f_enc(I) ∈ R^{n(r) x d_latent}
n(r) ∈ {64, 100, 256, 400} // compressed-token budgets

// Segment Relevance Probability:
p_{i,k}(q) = exp(z_{i,k}(1)) / (exp(z_{i,k}(1)) + exp(z_{i,k}(0)))

// Adaptive Resolution Aging:
l_i = rho(delta_t_i),   I_i = phi_{l_i}(I_i^hi)

// Active Recall Upscaling (when retrieved):
if exists(i,k) in S_hat(q) s.t. l_i > l_min: I_i ← I_i^hi
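The Locate-and-Transcribe loop and the aging policy above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the names (`MemoryChunk`, `locate_and_transcribe`, and the logit inputs) are hypothetical, and the fine-tuned DeepSeek-OCR retrieval model is stubbed out as precomputed per-segment logits.

```python
import math
from dataclasses import dataclass

HI_RES, LO_RES = 1024, 512   # px; 256 vs 64 visual tokens in the paper
RECENT_WINDOW = 5            # the five most recent steps stay high-resolution

@dataclass
class MemoryChunk:
    step: int                # interaction step this chunk records
    segments: list[str]      # verbatim text, indexed by SoM anchor ID
    resolution: int = HI_RES

def age_memory(chunks: list[MemoryChunk], current_step: int) -> None:
    """Multi-resolution aging (rho in the paper): downsample old chunks."""
    for c in chunks:
        if current_step - c.step > RECENT_WINDOW:
            c.resolution = LO_RES

def segment_probs(logits: list[tuple[float, float]]) -> list[float]:
    """p_i = softmax over (irrelevant, relevant) logits per SoM segment."""
    return [math.exp(z1) / (math.exp(z1) + math.exp(z0)) for z0, z1 in logits]

def locate_and_transcribe(chunk: MemoryChunk,
                          logits: list[tuple[float, float]],
                          threshold: float = 0.5) -> list[str]:
    """Binary relevance vector -> deterministic fetch from the text log."""
    probs = segment_probs(logits)
    if chunk.resolution == LO_RES and any(p >= threshold for p in probs):
        chunk.resolution = HI_RES   # Active Recall: upscale on demand
    # Copy verbatim segments by anchor index; nothing is generated,
    # so the transcription cannot hallucinate.
    return [chunk.segments[i] for i, p in enumerate(probs) if p >= threshold]
```

Because the relevant segments are copied from the stored log rather than generated, faithfulness is deterministic by construction; the vision model only decides *which* anchor IDs to fetch.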

Key Findings

What the experiments revealed

Results

58.1% — AppWorld average success rate (best across all baselines)
53.8% — Mind2Web Element Accuracy (+4.7 points over AWM)
6.7x — token reduction (596 vs 3,980 per step)
100% — retrieval faithfulness (vs 84.3% for generative retrieval)

On Mind2Web, OCR-Memory scores 53.8% Element Accuracy, 59.2 Action F1, 46.1% Step Success Rate, and 4.8% Task Success Rate - outperforming ACON (48.2/54.1/41.4/4.1), AWM (49.1/55.7/42.6/4.3), and MemoryBank (43.8/49.5/39.2/3.3) across all metrics under the same context budget. On AppWorld, OCR-Memory reaches 58.1% average success rate (86.2% Easy, 57.4% Medium, 30.8% Hard), beating ACON's 56.2% and AWM's 55.0%. The retrieval-level evaluation on a dedicated Mind2Web subset shows 78.6% Recall@1 versus Dense Text-RAG's 52.7%, with 93.4% Recall@5 and MRR of 0.84 versus 0.61. On the NIAH benchmark adapted for agents, OCR-Memory maintains 98.5% retrieval accuracy at 4k context and sustains 94.1% at 32k, with a consistent 10x+ compression ratio across all lengths. The gains are backbone-agnostic: switching from GPT-4 to Qwen3-32B preserves the relative improvement over text-based retrieval (48.6% vs 35.2% Element Accuracy).
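For readers who want to compute the retrieval-level metrics on their own traces, Recall@k and MRR are straightforward. A minimal sketch, assuming one gold segment per query; the function names and toy rankings are mine, not the paper's:

```python
# Recall@k and MRR over ranked segment IDs, one gold segment per query.
# Sanity check on the headline figure: 3,980 / 596 ≈ 6.7x token reduction.

def recall_at_k(ranked: list[list[int]], gold: list[int], k: int) -> float:
    """Fraction of queries whose gold segment appears in the top-k results."""
    hits = sum(g in r[:k] for r, g in zip(ranked, gold))
    return hits / len(gold)

def mrr(ranked: list[list[int]], gold: list[int]) -> float:
    """Mean reciprocal rank of the gold segment (0 when never retrieved)."""
    ranks = [1.0 / (r.index(g) + 1) if g in r else 0.0
             for r, g in zip(ranked, gold)]
    return sum(ranks) / len(ranks)

# Toy example: two queries, each with a ranked list of retrieved segment IDs.
ranked = [[3, 1, 2], [2, 3, 1]]
gold = [3, 1]
print(recall_at_k(ranked, gold, 1))  # 0.5: only the first query hits at rank 1
print(recall_at_k(ranked, gold, 3))  # 1.0: both gold segments are in the top-3
```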

OCR-Memory: 596 text tokens per step · 100% retrieval faithfulness
vs
Text-RAG: 3,980 text tokens per step · 84.3% retrieval faithfulness

Why This Matters for AI and Automation

My Take

The core insight here is counterintuitive but well-supported: converting text to images and reading it back with a vision model is more token-efficient than storing the text directly. DeepSeek-OCR's optical compression achieves 10x+ compression ratios while maintaining 98.5% retrieval accuracy at 4k context, and the Locate-and-Transcribe mechanism solves the hallucination problem that plagues generative retrieval by making evidence recovery fully deterministic. The multi-resolution aging with Active Recall is the most production-relevant contribution - it is a clean answer to the "memory grows forever" problem that every persistent agent deployment faces. The main limitations are real: rendering text to images adds disk overhead (1.47 MB vs 18 KB per episode), retrieval latency increases from 0.3s to 1.7s, and the system requires fine-tuning a dedicated 3B-parameter retrieval model. The fine-tuning dependency on HotpotQA also raises questions about domain transfer - whether the learned grounding generalizes beyond web navigation and API interaction tasks. Still, in a landscape where every agent memory approach either loses information (summarization) or burns tokens (raw storage), OCR-Memory finds a genuinely novel third path by shifting the storage modality entirely.

Discussion question: OCR-Memory trades cheap storage for expensive reasoning tokens by encoding text as images - a favorable trade-off today, when LLM inference is the bottleneck. But as inference costs drop and context windows expand, does visual encoding become unnecessary overhead, or do the 100% faithfulness guarantee and multi-resolution aging keep it relevant regardless of token economics?

