The Big Idea
Anthropic's engineering teams published two articles that, read together, form a complete production playbook for building AI agents that don't collapse mid-task. The first, from the Applied AI team, introduces context engineering as the discipline of managing the entire token ecosystem across multi-turn interactions - arguing that prompt engineering alone is insufficient once agents operate across time. The second, from Justin Young on the Claude SDK team, tackles the specific failure modes that emerge when agents must work across multiple context windows on long-horizon tasks - the kind measured in hours or days, not seconds. Together they address the same root problem from different levels: why agents degrade over time, and what engineering structures prevent it.
Before vs After
Naive Agent Setup
- Single high-level prompt: "build a clone of claude.ai"
- Context fills with verbose tool results and full message history
- Agent tries to one-shot the entire task
- Context window exhausted mid-implementation, feature half-built
- Next session has no memory, guesses at state, loses time
- Late sessions declare victory on incomplete work
- Features marked done without end-to-end verification
Engineered Agent Setup
- Initializer agent sets up structured environment on session 1
- Context curated: minimal tokens, high-signal identifiers
- Coding agent works one feature at a time
- Git commits + progress notes bridge context windows
- Each session: read progress, run init.sh, verify baseline, then build
- Feature list JSON prevents premature completion claims
- Browser automation (Puppeteer MCP) forces real end-to-end testing
How It Works
Article 1 - Context Engineering
The Token Ecosystem Problem
The foundational issue is architectural. Transformers allow every token to attend to every other token, which means n tokens create n² pairwise relationships. As context accumulates across an agent run - system prompt, tool results, message history, retrieved documents - those relationships are spread across an increasingly large space. The model's ability to maintain coherence around any specific piece of information quietly erodes. Anthropic calls this context rot: a gradual, non-catastrophic degradation that is easy to miss in testing because it does not produce a hard failure, just increasingly unreliable decisions. Compounding this, current models were trained predominantly on shorter sequences, so position encoding interpolation handles longer contexts technically but with reduced precision.
The Applied AI team frames the solution as identifying the smallest set of high-signal tokens that maximizes desired outcomes. This plays out across four design surfaces: system prompts should occupy a Goldilocks zone - specific enough to give concrete behavioral signals, flexible enough to handle variation - organized with XML tags or Markdown sections and built up incrementally from observed failures rather than specified upfront. Tools must be minimal and non-overlapping; every ambiguous tool choice burns attention budget on meta-decisions rather than the actual task, and verbose tool return values fill the context with low-signal noise. Few-shot examples should be canonical and diverse rather than exhaustive edge-case coverage. And retrieval should shift from pre-inference embedding-based loading toward just-in-time strategies, where agents maintain lightweight identifiers and pull data via tool calls only when needed - what they call progressive disclosure.
For tasks that exceed a single context window, Anthropic identifies three distinct techniques. Compaction summarizes the conversation as it nears the context limit and reinitializes with the summary - best for conversational continuity. Structured note-taking has agents write externally persisted notes at each checkpoint and retrieve them as needed, providing persistent memory with minimal per-turn overhead; the Claude Pokemon example demonstrated an agent maintaining strategic notes across thousands of game steps after context resets. Sub-agent architectures assign specialized agents to focused tasks in clean context windows, returning condensed summaries of 1,000-2,000 tokens to a coordinator - best for complex research with parallel workstreams.
Article 2 - Agent Harnesses
The Long-Running Agent Problem
Justin Young's article takes these principles and applies them to a concrete failure case: asking Claude Opus 4.5 to build a production-quality web app across multiple context windows with only a high-level prompt. The failures were systematic. The agent repeatedly tried to one-shot the entire application, running out of context mid-implementation and leaving features half-built and undocumented. Subsequent sessions had no memory of prior work, spent time reconstructing state, and often made the problem worse by starting new features on top of a broken baseline. Later sessions exhibited the opposite failure: looking at accumulated progress and declaring the project complete without verifying that implemented features actually worked end-to-end.
The solution is a two-agent harness that mirrors how effective engineering teams operate across shifts. An initializer agent runs only on the first session and builds the scaffolding that all future sessions depend on: an init.sh script to start the development server, a claude-progress.txt file as a running log of what each session accomplished, an initial git commit, and critically, a feature_list.json file containing every required feature with a "passes": false field that only gets flipped to true after verified end-to-end testing. For the claude.ai clone task, this meant over 200 features specified upfront. The model is instructed with strong language not to remove or edit feature entries, only to change their status - and JSON was chosen over Markdown deliberately because the model is less likely to inappropriately overwrite structured JSON files.
Every subsequent coding session follows a fixed startup sequence: run pwd to orient, read claude-progress.txt and git logs to understand recent work, read feature_list.json to identify the next incomplete feature, start the development server via init.sh, run a basic end-to-end smoke test using Puppeteer MCP before touching any code, fix any regressions found, then implement exactly one feature. At session end, the agent commits all changes with descriptive messages and writes a progress summary. This git-plus-notes approach gives the next session two independent recovery paths: if the progress file is ambiguous, git history provides ground truth, and git revert offers a clean rollback to any known-good state.
Key Findings
- Context rot is gradual, not catastrophic. Accuracy erodes as tokens accumulate due to n² transformer attention relationships. The failure is invisible until you run controlled tests across long agent sequences, which is why most teams discover it in production rather than in development.
- Compaction alone is insufficient for long-horizon agents. Even with compaction enabled, Claude Opus 4.5 on the Claude Agent SDK could not reliably build a production web app from a single high-level prompt. Compaction handles context limits but does not resolve the agent's tendency to one-shot tasks or declare premature completion.
- The initializer/coding agent split is architecturally trivial but operationally significant. The two agents in the harness use identical system prompts, tools, and infrastructure - they differ only in their initial user prompt. The value is not in the architecture but in what the initializer sets up: structured scaffolding that every subsequent session can rely on.
- JSON beats Markdown for structured agent state. The model is less likely to inappropriately overwrite or edit JSON files compared to Markdown. For feature lists or any structured data that agents should update incrementally rather than replace, JSON provides better behavioral guarantees.
- Browser automation testing is the critical gap in most agent workflows. Claude made code changes and ran unit tests and curl commands but consistently failed to catch end-to-end bugs without explicit browser automation. Puppeteer MCP enabled the agent to test as a human user would, dramatically improving feature verification quality.
- Git is underutilized as an agent memory and recovery mechanism. Descriptive git commits + progress notes give each new session two independent recovery paths and eliminate the time spent reconstructing state. The ability to
git revertto a known-good state removes a major source of compounding errors in long-horizon runs.
Why This Matters for AI and Automation Practitioners
| Failure Mode | Initializer Agent Fix | Coding Agent Fix |
|---|---|---|
| Declares victory too early | Set up feature_list.json with all features marked failing | Read feature list at session start; choose one incomplete feature only |
| Leaves broken or undocumented state | Initialize git repo + progress notes file | Start by reading progress + git log; end with commit + progress update |
| Marks features done without real testing | Set up feature_list.json with end-to-end test steps | Self-verify all features via browser automation before flipping passes:true |
| Wastes time figuring out how to run the app | Write init.sh that starts the development server | Read init.sh at session start; run smoke test before implementing anything |
What to apply to your own agent builds
- Design context budgets explicitly. Before building, decide what goes in the system prompt, what stays in external storage, and what gets retrieved just-in-time. Treat context slots as a scarce resource, not an infinite buffer for everything that might be useful.
- Give agents structured state they can trust. JSON feature lists, git history, and progress notes are low-overhead mechanisms that dramatically reduce the agent's startup cost per session. Any time your agent spends reconstructing state is wasted compute and latency.
- Tool design is context engineering. Every tool that returns verbose output, every tool that overlaps with another, every ambiguous parameter name degrades agent decision quality across the entire run - not just at that tool call.
- Real testing requires real tools. Unit tests and curl commands are not substitutes for end-to-end verification. If your agent's output touches a UI, a browser, or a workflow that requires human interaction, give the agent tools that can test it the way a human would.
My Take
These two posts from Anthropic are unusually candid about where current agents actually fail. Most agent framework documentation shows the happy path: the agent that successfully completes the task. What Justin Young's post documents is the empirical failure taxonomy of a frontier model on a realistic task, and then the engineering patterns that address each failure category. That is the kind of content practitioners can act on directly. The context engineering post is more conceptual, but the Goldilocks system prompt framing and the progressive disclosure retrieval approach resolve two of the most common complaints I hear from teams building on top of LLMs: their instructions are either too rigid to handle variation or too vague to produce consistent behavior, and their RAG pipelines stuff context upfront rather than retrieving on-demand. The honest limitation in both posts is measurement. The harnesses article has no benchmark numbers - it is descriptive, not empirical. The context engineering article makes qualitative claims about context rot without quantifying how much accuracy degrades at what token counts across which models. For teams trying to prioritize which techniques to implement first, that absence makes it harder to reason about expected ROI. The framework is sound; the evidence base for calibrating the tradeoffs is still thin.
Discussion Question
The agent harness solution here is essentially borrowing practices from software engineering - feature lists, git commits, progress notes, end-to-end testing. If the most effective AI agent workflows look increasingly like structured software development processes, what does that imply about who should be designing these systems? Is this a prompt engineering problem, a software architecture problem, or something genuinely new?