Week 19 · June 2026

AtomMem: Building Long-Term Agent Memory from Atomic Facts

June 19, 2026 · by Satish K C · 8 min read
Agents LLMs RAG Memory

The Paper

"AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts" is authored by Yanyu Yao, Shangze Li, Zhi Zheng, Hui Zheng, Qi Liu, Tong Xu, and Enhong Chen from the University of Science and Technology of China (USTC) and Anhui University, submitted to arXiv in June 2026. The central claim is that the primary failure of long-term memory systems for LLM agents is not retrieval algorithm design but memory construction: existing systems either store raw conversation text, which overwhelms retrieval with redundant noise, or LLM-generated summaries, which lose fine-grained details and accumulate hallucinations as entries are rewritten across sessions. AtomMem replaces both approaches with verified atomic facts as the fundamental memory unit, then structures those facts into event memories, temporal user profiles, and an associative graph - enabling multi-hop retrieval that can chain information across sessions.

The Problem Before This Paper

Long-term memory for LLM agents sits in a representation dilemma. Raw conversation storage preserves all information but floods retrieval with noise that is difficult to rank or compress. Summary-based storage - used by MemoryBank, MemoryOS, and most commercially deployed systems - compresses conversations into LLM-written prose, but those summaries discard fine-grained details and, critically, accumulate errors: each rewrite pass is a new LLM call that can introduce hallucinations on top of previously hallucinated content, causing "uncontrolled expansion and destruction of original facts" over time. A third failure is structural: virtually all existing systems retrieve over flat, isolated memory items. They have no mechanism to connect a fact from session 3 to a fact from session 17 in order to answer a question that requires both. On temporal reasoning - questions that require tracking how user attributes change over time - the five strongest prior systems on LoCoMo peak at 51.09 Jaccard (LightMem), and most score well below 40.

What They Built

AtomMem has four components that operate in sequence. The Fact Executor is a Qwen3-14B model fine-tuned via SFT LoRA on 4,352 curated samples; it processes each new conversation turn by resolving coreferences (replacing "I" with the user's name), anchoring temporal expressions ("last Friday") to absolute dates, denoising, and decomposing the turn into self-contained atomic facts. Each fact is stored as F = {id, c, v, P, K, T, E} carrying its text, a dense embedding, participant labels, keywords, a temporal anchor, and linked event IDs. Before storage, a verification step runs a hybrid similarity check against existing memory - S_h = 0.7 * embedding_sim + 0.3 * keyword_Jaccard - to surface candidates; an LLM then resolves conflicts, extracting only residual novel content and discarding anything already captured. Verified facts are routed upward into an event memory layer (grouping related facts into narrative episode summaries) and a temporal profile layer (tracking evolving user attributes with a full version history preserved via time-stamped states). At retrieval, a three-stage pipeline executes: primary hybrid recall by participant and time filters, compensatory event-based recall that scores facts through their parent events, and Personalized PageRank (PPR) over the associative graph to expand from seed facts to connected neighbors.
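To make the record layout concrete, here is a minimal sketch of the fact tuple F = {id, c, v, P, K, T, E} as a Python dataclass, plus a toy version of the participant-and-time filter used in primary recall. The class name `AtomicFact`, the helper `primary_filter`, and the choice of field types are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class AtomicFact:
    """One memory unit, mirroring F = {id, c, v, P, K, T, E} (types are assumed)."""
    id: str
    c: str                    # self-contained fact text, coreferences resolved
    v: list[float]            # dense embedding of c
    P: frozenset[str]         # participant labels
    K: frozenset[str]         # keywords
    T: date                   # absolute temporal anchor
    E: list[str] = field(default_factory=list)  # linked event IDs


def primary_filter(facts, participant, start, end):
    """Hypothetical helper: keep facts about `participant` dated within [start, end]."""
    return [f for f in facts if participant in f.P and start <= f.T <= end]


facts = [
    AtomicFact("f1", "Alice moved to Denver.", [0.1, 0.2],
               frozenset({"Alice"}), frozenset({"denver", "move"}),
               date(2025, 3, 2), ["e1"]),
    AtomicFact("f2", "Bob adopted a dog.", [0.3, 0.1],
               frozenset({"Bob"}), frozenset({"dog", "adopt"}),
               date(2025, 3, 5)),
    AtomicFact("f3", "Alice lived in Boston.", [0.2, 0.2],
               frozenset({"Alice"}), frozenset({"boston", "live"}),
               date(2024, 1, 10)),
]
hits = primary_filter(facts, "Alice", date(2025, 1, 1), date(2025, 12, 31))
```

In this sketch the filter only narrows the candidate pool; ranking the survivors by hybrid similarity would be a separate step.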

S_h(x, y) = alpha * sim_e(v_x, v_y) + beta * Jac(K_x, K_y)
alpha = 0.7 (embedding similarity weight), beta = 0.3 (keyword Jaccard weight)
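
The score above can be sketched in a few lines of Python. One assumption to flag: the article does not say which embedding similarity sim_e is, so this sketch uses cosine similarity. The function names are illustrative.

```python
import math

ALPHA, BETA = 0.7, 0.3  # weights quoted above


def cosine(u, v):
    """Cosine similarity; assumed here as the embedding similarity sim_e."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard(a, b):
    """Jaccard overlap of two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_similarity(v_x, k_x, v_y, k_y, alpha=ALPHA, beta=BETA):
    """S_h(x, y) = alpha * sim_e(v_x, v_y) + beta * Jac(K_x, K_y)."""
    return alpha * cosine(v_x, v_y) + beta * jaccard(k_x, k_y)


# Identical embeddings, one of three distinct keywords shared:
# 0.7 * 1.0 + 0.3 * (1/3) = 0.8
score = hybrid_similarity([1.0, 0.0], {"paris", "trip"},
                          [1.0, 0.0], {"paris", "hotel"})
```

Because the embedding term carries 0.7 of the weight, two facts with near-identical meaning but disjoint keywords still score up to 0.7, which is presumably why the keyword term acts as a tie-breaker rather than a gate.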

PPR iteration: r^(t+1) = eta * p + (1 - eta) * P^T * r^(t)
eta = 0.34 (restart probability), convergence at ||r^(t+1) - r^(t)||_1 < 1e-6
Graph channels: entity edges (IDF-weighted keyword overlap), event edges (shared episode membership), temporal edges (adjacent dialogue turns)
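
A minimal power-iteration sketch of the PPR update above, in plain Python. It assumes the three edge channels have already been merged into a single non-negative weight per edge (the article does not say how they are combined), that the restart vector p is uniform over the seed facts, and that dangling nodes return their mass to the seeds. The function name and the graph format are illustrative.

```python
def personalized_pagerank(edges, seeds, eta=0.34, tol=1e-6, max_iter=1000):
    """Iterate r <- eta * p + (1 - eta) * P^T r until the L1 change is below tol.

    edges: {node: {neighbor: weight}}, with weights already merged across channels.
    seeds: non-empty collection of seed nodes; p is uniform over them.
    """
    seeds = set(seeds)
    nodes = set(edges) | {n for nbrs in edges.values() for n in nbrs} | seeds
    p = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    r = dict(p)
    for _ in range(max_iter):
        nxt = {n: eta * p[n] for n in nodes}
        for u in nodes:
            nbrs = edges.get(u, {})
            total = sum(nbrs.values())
            if total == 0:
                # Dangling node: send its mass back to the seeds (assumption).
                for n in seeds:
                    nxt[n] += (1 - eta) * r[u] * p[n]
                continue
            for v, w in nbrs.items():
                # Row-normalized transition probability from u to v.
                nxt[v] += (1 - eta) * r[u] * w / total
        delta = sum(abs(nxt[n] - r[n]) for n in nodes)
        r = nxt
        if delta < tol:
            break
    return r


# Toy chain a - b - c with the walk seeded at a.
graph = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
ranks = personalized_pagerank(graph, seeds=["a"])
```

On this toy chain the scores decay with distance from the seed, which is the behavior the retrieval stage relies on: facts one or two hops from a seed fact receive some score even when they would not match the query directly.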

Key Findings

Results

On LoCoMo, AtomMem achieves 56.66 Single-Hop F1 (vs. 54.95 for MEM0, +3.1%), 42.50 Multi-Hop F1 (vs. 37.15 for MemoryOS, +14.4%), 62.78 Temporal F1 (vs. 47.41 for LightMem, +32.4%), and 64.58 Open-Domain Jaccard (vs. 54.17 for MEM0, +19.2%); each percentage is the relative gain over the named baseline. AtomMem-Flat - the atomic facts layer alone, without event hierarchy, temporal profiles, or graph - uses only 722K tokens and still outperforms four of the five prior systems on multi-hop reasoning, which suggests that fact verification and coreference resolution account for most of the retrieval gain. On LongMemEval, AtomMem reaches 80.70 F1 on single-session user queries, 66.35 F1 on knowledge-update tasks where the system must track changing user facts, and 42.10 F1 on temporal reasoning questions - a category that most flat-retrieval systems handle near chance level. The knowledge-update score is notable: it is consistent with the temporal profile's version history superseding stale information rather than merging or averaging across conflicting states.
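
To illustrate what superseding rather than merging can look like, here is a small sketch of a time-stamped attribute store. It is an assumption-laden toy, not the paper's implementation: the class `TemporalProfile` and its methods `update`, `value_at`, and `current` are hypothetical names.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TemporalProfile:
    """Per-attribute version history; old states are kept, never overwritten."""
    history: dict[str, list[tuple[date, str]]] = field(default_factory=dict)

    def update(self, attribute, value, as_of):
        """Record a new time-stamped state and keep the history date-ordered."""
        states = self.history.setdefault(attribute, [])
        states.append((as_of, value))
        states.sort(key=lambda state: state[0])

    def value_at(self, attribute, when):
        """Return the state in force on `when`, or None if none existed yet."""
        states = self.history.get(attribute, [])
        idx = bisect_right([stamp for stamp, _ in states], when)
        return states[idx - 1][1] if idx else None

    def current(self, attribute):
        """Return the latest state, which supersedes all earlier ones."""
        states = self.history.get(attribute, [])
        return states[-1][1] if states else None


profile = TemporalProfile()
profile.update("city", "Boston", date(2024, 1, 10))
profile.update("city", "Denver", date(2025, 3, 2))
latest = profile.current("city")                       # "Denver"
earlier = profile.value_at("city", date(2024, 6, 1))   # "Boston"
```

A knowledge-update question reads the latest state, while a temporal question such as "where did the user live in mid-2024?" reads the state in force at that date; both are answerable because the earlier value is never discarded.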

Why This Matters for AI and Automation

My Take

The most consequential design decision in AtomMem is the verification step that runs before any fact is committed to storage. The paper correctly identifies hallucination accumulation as the central failure mode of summary-based memory - each rewrite is a compounding risk - and the conflict resolution mechanism, which extracts only residual novel content, is a principled counter to that. The ablation pattern supports the architecture: even stripped of the graph and profiles, the verified-fact layer alone outperforms most prior systems on multi-hop reasoning, which suggests that storage quality matters more than the sophistication of the retrieval structure. The temporal profile with version history is the component that most current agent frameworks are missing entirely; open-source memory systems either overwrite prior states on update or have no time-indexed user model at all. What the paper does not address is the bootstrap cost: the Fact Executor requires a fine-tuned 14B model trained on 4,352 curated samples, which is a non-trivial data collection and training investment before the first fact is stored. Practitioners evaluating AtomMem need to budget for that cost, or wait for a publicly released checkpoint. The benchmark scope is also narrow - LoCoMo and LongMemEval are English-language, text-based conversation datasets. How atomic fact extraction behaves on tool-use traces, code sessions, or multilingual input is untested, and those are the formats most production agent memory systems will encounter first.

Discussion Question

AtomMem's verification step resolves conflicts by extracting residual novel content before committing a new fact - but conflict resolution is itself an LLM call that can introduce hallucinations. Over thousands of sessions and hundreds of conflict events, does the verification layer reduce hallucination accumulation in memory, or does it redistribute the problem by moving it from the storage writes to the conflict resolution calls?
