Week 14 · May 2026

From Model Scaling to System Scaling: The Harness Is the New Leverage

May 28, 2026 · by Satish K C · 8 min read
Agents Optimization Deep Learning LLMs

The Paper

"From Model Scaling to System Scaling: Scaling the Harness in Agentic AI" was published on arXiv in May 2026 by Shangding Gu. The paper argues that scaling individual language models, the dominant paradigm so far, is no longer the primary lever for improving agentic AI performance. The central claim is that harness engineering - the design of the infrastructure wrapping a model, including tools, memory systems, context management, and inter-agent coordination - now delivers performance gains comparable or superior to raw model scaling, and that the field has underinvested in it relative to its impact.

The Problem Before This Paper

For the past several years, AI research has operated under a straightforward assumption: better model = better performance. More parameters, more data, more FLOPs. This was largely true when the primary task was single-turn text generation. The problem is that agentic tasks do not fit this frame. An agent running for hours across multiple context windows is not making a single inference call - it is making a coordinated sequence of decisions, each one potentially corrupted by accumulated noise, poor tool design, or missing state management. Model capability is a necessary but not sufficient condition for agent reliability. The research literature on agentic failure modes had been growing, but no prior work had formally elevated harness engineering to a first-class research concern with its own taxonomy, benchmarks, and scaling analysis.

What They Built

Gu introduces a formal framework that separates agentic AI performance into two orthogonal dimensions: model scaling (parameter count, pre-training compute, RLHF quality) and system scaling (harness design, tool integration, memory architecture, multi-agent coordination). The harness is defined as everything outside the model weights that influences agent behavior at runtime. The paper taxonomizes harness components into three pillars: context engineering (managing what goes into the context window and when), tool architecture (minimizing overlap, reducing verbose returns, scoping tool interfaces), and orchestration (multi-agent delegation, sub-agent specialization, state handoff protocols). Evaluation draws on SWE-bench, AgentBench, WebArena, and TerminalBench - all system-level benchmarks that penalize harness failures directly, not just model capability limits.
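To make two of the pillars concrete, here is a minimal sketch of my own, not code from the paper. It shows one way a harness might cap verbose tool returns (tool architecture) and keep a conversation inside a fixed token budget (context engineering). The names `Message`, `estimate_tokens`, `truncate_tool_return`, and `fit_to_budget` are hypothetical, and the four-characters-per-token estimate is a rough assumption rather than a real tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. A real harness would
    # use the model's own tokenizer instead of this heuristic.
    return max(1, len(text) // 4)


def truncate_tool_return(text: str, max_chars: int = 200) -> str:
    # Tool architecture: cap verbose tool output before it enters the context.
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return text[:max_chars] + f" [truncated {omitted} chars]"


def fit_to_budget(messages: List[Message], budget_tokens: int) -> List[Message]:
    # Context engineering: always keep system messages, then keep the newest
    # remaining messages that still fit within the token budget.
    system = [m for m in messages if m.role == "system"]
    rest = [m for m in messages if m.role != "system"]
    used = sum(estimate_tokens(m.content) for m in system)
    kept: List[Message] = []
    for m in reversed(rest):
        cost = estimate_tokens(m.content)
        if used + cost > budget_tokens:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))
```

Dropping the oldest turns first is only one possible policy; summarizing them or retrieving them on demand are alternatives that fit the same interface.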

Agent Performance = f(Model Capability, Harness Quality)
dP/dHarness >> dP/dModel for long-horizon agentic tasks
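One way to read the shorthand above, using symbols that are my own notation rather than the paper's: let P be agent performance, M model capability, and H harness quality. A first-order expansion then shows why the claim is about marginal returns, that is, which small investment moves P more.

```latex
P = f(M, H), \qquad
\Delta P \approx \frac{\partial f}{\partial M}\,\Delta M
               + \frac{\partial f}{\partial H}\,\Delta H, \qquad
\frac{\partial f}{\partial H} \gg \frac{\partial f}{\partial M}
\quad \text{(long-horizon agentic tasks)}
```

This treats M and H as if they were measured on comparable scales, which is an assumption; in practice the comparison is between the cost of a harness improvement and the cost of a model upgrade.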

Key Findings

Results

Across SWE-bench, AgentBench, WebArena, and TerminalBench, the paper reports that harness-optimized configurations of mid-tier models match or exceed baseline configurations of significantly larger models. The specific claim is that, in agentic contexts, infrastructure optimization can deliver performance gains comparable to a full model-generation upgrade. The paper does not publish a single headline number equivalent to a BLEU score improvement, but its comparative analysis across four major agentic benchmarks is the strongest empirical case made to date that system-level investment has a higher marginal return than model-level investment for production agent workflows.

Why This Matters for AI and Automation

My Take

The timing of this paper is notable. Just last week, in Article 14, I covered Anthropic's two practitioner playbooks for production agents - context engineering from the Applied AI team, and Justin Young's long-running agent harness framework built on the Claude SDK. Those articles were written from the practitioner side: here are the patterns, here is what breaks, here is what works. Gu's paper provides the academic framing for why those patterns matter. The initializer-plus-executor pattern, the JSON-over-Markdown state preference, the just-in-time retrieval approach, the explicit context budget discipline - each of these maps directly onto one of Gu's three harness pillars.

What struck me reading this paper is that the research and practitioner communities converged on the same conclusion independently: the model is no longer the bottleneck. The question for builders is not "how capable is my model?" but "how well-designed is the system around it?" The open challenge is that harness quality is currently impossible to evaluate without running the full task. There is no harness equivalent of a benchmark you can run offline. Until that tooling exists, harness engineering will remain an empirical craft rather than a principled discipline.
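As a rough illustration of the initializer-plus-executor and JSON-state patterns named above, here is a minimal sketch. It reflects my reading of those patterns, not code from the paper or from Anthropic's material. The functions `initialize_state` and `run_executor_step`, the `features` and `status` fields, and the file name are all hypothetical. The idea is that one session writes a structured task list once, and each later session reloads it from disk instead of relying on what is left in its context window.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def initialize_state(path: Path, features: List[str]) -> None:
    # Initializer session: write the task list once as structured JSON.
    state = {"features": [{"name": f, "status": "pending"} for f in features]}
    path.write_text(json.dumps(state, indent=2))


def run_executor_step(path: Path) -> Optional[str]:
    # Executor session: reload state from disk, take the first pending item,
    # mark it done, and persist the result before the session ends.
    state = json.loads(path.read_text())
    for item in state["features"]:
        if item["status"] == "pending":
            item["status"] = "done"  # placeholder for the real work
            path.write_text(json.dumps(state, indent=2))
            return item["name"]
    return None  # nothing left to do


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state_file = Path(tmp) / "state.json"
        initialize_state(state_file, ["login", "search"])
        while (name := run_executor_step(state_file)) is not None:
            print("completed:", name)
```

A plausible reason to prefer JSON here is that fixed keys such as `status` can be checked and updated by code, whereas free-form Markdown notes are easier for a model to rewrite loosely.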

Discussion question: If harness quality has a higher marginal return than model capability for long-horizon agentic tasks, should AI labs be publishing harness design standards the way they publish model cards - and would practitioners actually adopt them, or is harness design too task-specific to generalize?
