Article 14 · May 2026

Two Anthropic Playbooks for Production Agents: Context Engineering and Long-Running Harnesses

May 27, 2026 · by Satish K C 9 min read
Agents LLMs RAG Engineering
Built by the Author Kravhal - autonomous agents that run your business workflows end-to-end. Pay per outcome.
Get Early Access

The Big Idea

Anthropic's engineering teams published two articles that, read together, form a complete production playbook for building AI agents that don't collapse mid-task. The first, from the Applied AI team, introduces context engineering as the discipline of managing the entire token ecosystem across multi-turn interactions - arguing that prompt engineering alone is insufficient once agents operate across time. The second, from Justin Young on the Claude SDK team, tackles the specific failure modes that emerge when agents must work across multiple context windows on long-horizon tasks - the kind measured in hours or days, not seconds. Together they address the same root problem from different levels: why agents degrade over time, and what engineering structures prevent it.

Before vs After

Naive Agent Setup

  • Single high-level prompt: "build a clone of claude.ai"
  • Context fills with verbose tool results and full message history
  • Agent tries to one-shot the entire task
  • Context window exhausted mid-implementation, feature half-built
  • Next session has no memory, guesses at state, loses time
  • Late sessions declare victory on incomplete work
  • Features marked done without end-to-end verification

Engineered Agent Setup

  • Initializer agent sets up structured environment on session 1
  • Context curated: minimal tokens, high-signal identifiers
  • Coding agent works one feature at a time
  • Git commits + progress notes bridge context windows
  • Each session: read progress, run init.sh, verify baseline, then build
  • Feature list JSON prevents premature completion claims
  • Browser automation (Puppeteer MCP) forces real end-to-end testing

How It Works


Article 1 - Context Engineering

The Token Ecosystem Problem

The foundational issue is architectural. Transformers allow every token to attend to every other token, which means n tokens create n² pairwise relationships. As context accumulates across an agent run - system prompt, tool results, message history, retrieved documents - those relationships are spread across an increasingly large space. The model's ability to maintain coherence around any specific piece of information quietly erodes. Anthropic calls this context rot: a gradual, non-catastrophic degradation that is easy to miss in testing because it does not produce a hard failure, just increasingly unreliable decisions. Compounding this, current models were trained predominantly on shorter sequences, so position encoding interpolation handles longer contexts technically but with reduced precision.

The Applied AI team frames the solution as identifying the smallest set of high-signal tokens that maximizes desired outcomes. This plays out across four design surfaces: system prompts should occupy a Goldilocks zone - specific enough to give concrete behavioral signals, flexible enough to handle variation - organized with XML tags or Markdown sections and built up incrementally from observed failures rather than specified upfront. Tools must be minimal and non-overlapping; every ambiguous tool choice burns attention budget on meta-decisions rather than the actual task, and verbose tool return values fill the context with low-signal noise. Few-shot examples should be canonical and diverse rather than exhaustive edge-case coverage. And retrieval should shift from pre-inference embedding-based loading toward just-in-time strategies, where agents maintain lightweight identifiers and pull data via tool calls only when needed - what they call progressive disclosure.

Just-in-Time Retrieval vs Pre-loaded Context
PRE-LOADED RETRIEVAL Context: 80k tokens upfront doc1 ... doc47 loaded before agent starts LLM (distracted) High context rot risk JUST-IN-TIME RETRIEVAL Context: task + lightweight identifiers only tool_call() tool_call() LLM (focused) Lean context, higher per-step latency

For tasks that exceed a single context window, Anthropic identifies three distinct techniques. Compaction summarizes the conversation as it nears the context limit and reinitializes with the summary - best for conversational continuity. Structured note-taking has agents write externally persisted notes at each checkpoint and retrieve them as needed, providing persistent memory with minimal per-turn overhead; the Claude Pokemon example demonstrated an agent maintaining strategic notes across thousands of game steps after context resets. Sub-agent architectures assign specialized agents to focused tasks in clean context windows, returning condensed summaries of 1,000-2,000 tokens to a coordinator - best for complex research with parallel workstreams.


Article 2 - Agent Harnesses

The Long-Running Agent Problem

Justin Young's article takes these principles and applies them to a concrete failure case: asking Claude Opus 4.5 to build a production-quality web app across multiple context windows with only a high-level prompt. The failures were systematic. The agent repeatedly tried to one-shot the entire application, running out of context mid-implementation and leaving features half-built and undocumented. Subsequent sessions had no memory of prior work, spent time reconstructing state, and often made the problem worse by starting new features on top of a broken baseline. Later sessions exhibited the opposite failure: looking at accumulated progress and declaring the project complete without verifying that implemented features actually worked end-to-end.

The solution is a two-agent harness that mirrors how effective engineering teams operate across shifts. An initializer agent runs only on the first session and builds the scaffolding that all future sessions depend on: an init.sh script to start the development server, a claude-progress.txt file as a running log of what each session accomplished, an initial git commit, and critically, a feature_list.json file containing every required feature with a "passes": false field that only gets flipped to true after verified end-to-end testing. For the claude.ai clone task, this meant over 200 features specified upfront. The model is instructed with strong language not to remove or edit feature entries, only to change their status - and JSON was chosen over Markdown deliberately because the model is less likely to inappropriately overwrite structured JSON files.

{ "category": "functional", "description": "New chat button creates a fresh conversation", "steps": [ "Navigate to main interface", "Click the 'New Chat' button", "Verify a new conversation is created", "Check that chat area shows welcome state", "Verify conversation appears in sidebar" ], "passes": false }

Every subsequent coding session follows a fixed startup sequence: run pwd to orient, read claude-progress.txt and git logs to understand recent work, read feature_list.json to identify the next incomplete feature, start the development server via init.sh, run a basic end-to-end smoke test using Puppeteer MCP before touching any code, fix any regressions found, then implement exactly one feature. At session end, the agent commits all changes with descriptive messages and writes a progress summary. This git-plus-notes approach gives the next session two independent recovery paths: if the progress file is ambiguous, git history provides ground truth, and git revert offers a clean rollback to any known-good state.

Long-Running Agent Session Loop
SESSION 1 Initializer Agent write init.sh write feature_list.json (200+) write claude-progress.txt initial git commit runs once only SESSION N (repeats) Coding Agent 1. pwd + read progress.txt + git log 2. read feature_list.json 3. run init.sh (start dev server) 4. smoke test (Puppeteer MCP) 5. fix regressions if found 6. implement ONE feature 7. e2e test (Puppeteer MCP) 8. flip passes:true in feature_list.json 9. git commit with descriptive message 10. update claude-progress.txt next session starts here

Key Findings

Why This Matters for AI and Automation Practitioners

Failure Mode Initializer Agent Fix Coding Agent Fix
Declares victory too early Set up feature_list.json with all features marked failing Read feature list at session start; choose one incomplete feature only
Leaves broken or undocumented state Initialize git repo + progress notes file Start by reading progress + git log; end with commit + progress update
Marks features done without real testing Set up feature_list.json with end-to-end test steps Self-verify all features via browser automation before flipping passes:true
Wastes time figuring out how to run the app Write init.sh that starts the development server Read init.sh at session start; run smoke test before implementing anything

What to apply to your own agent builds

My Take

These two posts from Anthropic are unusually candid about where current agents actually fail. Most agent framework documentation shows the happy path: the agent that successfully completes the task. What Justin Young's post documents is the empirical failure taxonomy of a frontier model on a realistic task, and then the engineering patterns that address each failure category. That is the kind of content practitioners can act on directly. The context engineering post is more conceptual, but the Goldilocks system prompt framing and the progressive disclosure retrieval approach resolve two of the most common complaints I hear from teams building on top of LLMs: their instructions are either too rigid to handle variation or too vague to produce consistent behavior, and their RAG pipelines stuff context upfront rather than retrieving on-demand. The honest limitation in both posts is measurement. The harnesses article has no benchmark numbers - it is descriptive, not empirical. The context engineering article makes qualitative claims about context rot without quantifying how much accuracy degrades at what token counts across which models. For teams trying to prioritize which techniques to implement first, that absence makes it harder to reason about expected ROI. The framework is sound; the evidence base for calibrating the tradeoffs is still thin.

Discussion Question

The agent harness solution here is essentially borrowing practices from software engineering - feature lists, git commits, progress notes, end-to-end testing. If the most effective AI agent workflows look increasingly like structured software development processes, what does that imply about who should be designing these systems? Is this a prompt engineering problem, a software architecture problem, or something genuinely new?

← Back to all papers
Share