Week 16 · Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

The Paper

"Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents" was submitted to arXiv on March 31, 2026 by Aaditya Khanal, Yangyang Tao, and Junxiu Zhou from the School of Computing and Analytics at Northern Kentucky University. The central claim is that pass@1 measured on short tasks is structurally blind to the failure modes that dominate production deployments - capability and reliability rankings diverge systematically as task duration grows, and the divergence follows domain-specific patterns that are invisible to any single-metric leaderboard.

The Problem Before This Paper

Existing agent benchmarks - SWE-bench, AgentBench, WebArena, METR, tau-bench - measure whether a model completes a task on a single attempt. They do not measure whether it completes tasks consistently, whether performance degrades over longer horizons, whether failures are graceful or catastrophic, or whether the behavioral patterns that produce success on short tasks are the same ones that produce success on long tasks. In production, every one of those properties matters. A model ranked first on a 5-minute task benchmark may be operationally worse than a model ranked third when both are deployed on hour-long workflows. Prior work covered subsets of this problem - ReliabilityBench examined variance, METR studied long-horizon task completion - but no study evaluated multiple models across multiple duration buckets with both variance-aware metrics and partial-credit scoring.

What They Built

The authors constructed a benchmark of 396 tasks across three domains (Software Engineering, Web Research, Document Processing) and four duration buckets (short: under 5 min; medium: 5-30 min; long: 30-120 min; very-long: over 120 min), with 33 tasks per domain-duration cell for balance. Ten open-weight models - ranging from Llama 3.1 8B to DeepSeek V3 (671B MoE) and Kimi K2.5 (1T MoE) - were each evaluated under two scaffolds (ReAct baseline and memory-augmented ReAct) with k=3 repeats per task, producing 23,392 total episodes. They introduce four metrics absent from existing evaluations: Reliability Decay Curve (RDC), which maps pass rate as a function of duration bucket; Variance Amplification Factor (VAF), which measures whether duration amplifies outcome variance; Graceful Degradation Score (GDS), a weighted partial-credit completion measure in [0,1]; and Meltdown Onset Point (MOP), which detects behavioral collapse using sliding-window tool-call entropy thresholding.

VAF = sigma^2[pass@1 | long] / sigma^2[pass@1 | short]
RDS (slope of the reliability decay) = linear_regression_coeff(GDS vs. duration_bucket_index)
MOP = first step t where entropy(tool_calls[t-w:t]) > theta_H AND delta_entropy > delta_threshold

GDS = weighted_sum(completed_subtasks) in [0, 1]
Benchmark: 396 tasks x 10 models x 2 scaffolds x k=3 = 23,760 episodes in the full grid (23,392 reported)
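
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the variance in VAF is assumed to be taken across per-task pass rates (the summary does not say over what the variance is computed), and the example inputs are invented.

```python
from statistics import mean, pvariance

def gds(subtask_done, weights):
    """Weighted partial credit in [0, 1]: completed weight over total weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w for done, w in zip(subtask_done, weights) if done) / total

def vaf(pass_rates_long, pass_rates_short):
    """Variance of per-task pass rates on long tasks divided by that on short tasks.

    Assumption: variance is taken across tasks within a duration bucket.
    """
    var_short = pvariance(pass_rates_short)
    if var_short == 0:
        raise ValueError("short-task variance is zero; VAF is undefined")
    return pvariance(pass_rates_long) / var_short

def rds(gds_by_bucket):
    """Least-squares slope of GDS against bucket index 0, 1, 2, ..."""
    xs = list(range(len(gds_by_bucket)))
    x_bar, y_bar = mean(xs), mean(gds_by_bucket)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, gds_by_bucket))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# Invented inputs, for illustration only.
print(round(gds([True, True, False], [0.5, 0.3, 0.2]), 3))        # 0.8
print(round(vaf([1.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 1.0]), 3))  # 1.333
print(round(rds([0.90, 0.75, 0.60, 0.45]), 3))                    # -0.15
```

A VAF above 1 means outcomes are more spread out on long tasks than on short ones; a negative RDS means partial-credit completion falls as the duration bucket increases.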

Key Findings

Aggregate pass@1 declined from 76.3% on short tasks to 52.1% on very-long tasks across all 10 models - a drop of about 24 percentage points that no existing leaderboard captures.
Domain decay is wildly non-uniform: Software Engineering GDS collapsed from 0.90 to 0.44 (catastrophic), while Document Processing held at 0.74 to 0.71 (nearly flat), and Web Research fell moderately from 0.80 to 0.63.
Frontier models (DeepSeek V3, Kimi K2.5, MiniMax M2.5, GLM-4.5 Air) showed VAF above 2.37 while mid/small models stayed below 1.26 - the authors read high VAF as a capability signature, not an instability signal.
DeepSeek V3 exhibited a 19% meltdown rate at very-long tasks, MiniMax M2.5 hit 13%, yet both maintained the highest GDS (0.87-0.89) - ambitious multi-step strategies sometimes spiral but outperform conservative rote sequences overall.
Memory-augmented scaffolds never improved any of the 10 models and hurt 6 of them at long and very-long horizons, with Kimi K2.5 losing 0.14 GDS and Mistral 24B losing 0.13.
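
To put the domain figures above on a common footing, here is a small worked calculation. The endpoint GDS values are the ones quoted in the list; the helper name `decline` is illustrative, and expressing the drop relative to the short-bucket score is a choice made here rather than a metric from the paper.

```python
# (short-bucket GDS, very-long-bucket GDS) per domain, as quoted above.
GDS_ENDPOINTS = {
    "Software Engineering": (0.90, 0.44),
    "Document Processing": (0.74, 0.71),
    "Web Research": (0.80, 0.63),
}

def decline(short, very_long):
    """Return (absolute GDS drop, drop as a fraction of the short-bucket score)."""
    drop = short - very_long
    return drop, drop / short

for domain, (short, very_long) in GDS_ENDPOINTS.items():
    abs_drop, rel_drop = decline(short, very_long)
    print(f"{domain}: -{abs_drop:.2f} GDS ({rel_drop:.0%} of the short-bucket score)")
```

On these numbers Software Engineering loses roughly half of its short-bucket score, Web Research about a fifth, and Document Processing only a few percent.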

Results

GLM-4.5 Air ranked first at short-horizon tasks with 94.9% pass@1 but dropped to fourth at very-long (66.7%), while Llama 3.3 70B is reported as climbing from fifth-sixth at short (74.7%) to third-fourth at very-long (54.5%). On the pass@1 figures quoted here GLM-4.5 Air still sits above Llama 3.3 70B at very-long, so this is a reshuffling of the field rather than a literal swap between the two, but it is a shift that no short-task evaluation would reveal. The frontier cluster (DeepSeek, Kimi, MiniMax) maintained above 79.8% pass@1 even at very-long tasks, separating cleanly from the mid-tier. The SE domain proved most sensitive to duration: even the best models saw GDS fall below 0.50 on very-long SE tasks. Mistral Nemo's medium-horizon decay was 2.4x worse than the geometric i.i.d. baseline prediction, suggesting correlated failures rather than independent random errors. Across all models, observed reliability decay exceeded what would be predicted if long tasks were simply multiple short tasks chained together.
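
To make the "geometric i.i.d. baseline" concrete, here is a minimal sketch under an assumption this summary does not confirm: a longer task is modelled as n independent units, each succeeding with probability p, so the predicted pass rate is p^n. The excess-decay ratio (observed drop divided by predicted drop) is one plausible reading of "x times worse than baseline", not necessarily the paper's definition. Function names and inputs are hypothetical.

```python
def iid_baseline(p_unit: float, n_units: int) -> float:
    """Predicted pass rate if a task is n independent units, each passing with p_unit."""
    if not 0.0 <= p_unit <= 1.0 or n_units < 1:
        raise ValueError("need 0 <= p_unit <= 1 and n_units >= 1")
    return p_unit ** n_units

def excess_decay(p_unit: float, n_units: int, observed: float) -> float:
    """Observed drop from p_unit divided by the drop the i.i.d. baseline predicts."""
    predicted_drop = p_unit - iid_baseline(p_unit, n_units)
    if predicted_drop <= 0:
        raise ValueError("baseline predicts no decay; ratio is undefined")
    return (p_unit - observed) / predicted_drop

# Invented inputs: unit pass rate 0.9, two chained units, observed pass rate 0.72.
print(round(iid_baseline(0.9, 2), 2))        # 0.81
print(round(excess_decay(0.9, 2, 0.72), 2))  # 2.0
```

A ratio above 1 means the agent decays faster than independent errors alone would explain, which is the pattern the paper attributes to correlated failures.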

Why This Matters for AI and Automation

Deployment decisions based on short-task benchmark rankings can be systematically wrong: a model that leads on SWE-bench may collapse when asked to run a 2-hour autonomous software task, and nothing in current evaluation infrastructure warns you.
Memory scaffolds as default long-horizon interventions are counterproductive: the assumption that giving agents more memory access improves long-horizon performance is empirically false across all model sizes tested here.
Task decomposition is the highest-leverage reliability intervention: breaking a very-long task into short tasks exploits the 24-point performance gap between buckets without changing the model at all.
MOP entropy detection enables checkpoint-and-restart rather than outright failure: detecting behavioral collapse before the task terminates opens the door to mid-task intervention policies that current frameworks do not implement.
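
The MOP rule given earlier (window entropy above theta_H and a jump above delta_threshold) can be sketched as follows. This is an illustration under stated assumptions rather than the paper's detector: delta_entropy is taken to be the change from the previous step's window, entropy is measured in bits over tool names, and the window size and both thresholds are hypothetical defaults.

```python
import math
from collections import Counter

def window_entropy(calls):
    """Shannon entropy (bits) of the tool-name distribution in a window."""
    n = len(calls)
    return -sum((c / n) * math.log2(c / n) for c in Counter(calls).values())

def meltdown_onset(tool_calls, w=4, theta_h=1.0, delta_threshold=0.5):
    """Return the first step t where entropy(tool_calls[t-w:t]) > theta_h and the
    entropy rose by more than delta_threshold versus the previous window.
    Returns None if no such step exists."""
    prev = None
    for t in range(w, len(tool_calls) + 1):
        h = window_entropy(tool_calls[t - w:t])
        if prev is not None and h > theta_h and h - prev > delta_threshold:
            return t
        prev = h
    return None

# Invented trace: steady editing, then a burst of unrelated tool calls.
trace = ["edit"] * 6 + ["search", "shell", "browse", "read"]
print(meltdown_onset(trace))          # 8
print(meltdown_onset(["edit"] * 10))  # None
```

A harness could run this check after every step and, when it fires, roll back to the last checkpoint instead of letting the episode run to failure. Note that a healthy agent cycling through several tools also has high entropy, so the thresholds would need tuning per workload.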

My Take

This paper sits at the intersection of two threads I have been tracking here. Last week (Week 15), the Hermes paper demonstrated that a 3.3x performance gain over chain-of-thought came entirely from harness design, not model capability. The week before (Week 14), Shangding Gu's framework formalized why system-level engineering - context governance, memory hygiene, skill routing - is the new bottleneck. This paper from Khanal et al. provides the measurement infrastructure to make those claims quantitative at deployment scale. The memory scaffold finding is particularly pointed: the standard practitioner instinct to add memory when agents fail on long tasks is not just unhelpful, it is actively harmful across six out of ten models tested. The most actionable takeaway is architectural - task decomposition at the harness level recovers most of the 24-point reliability gap without touching the model. What the paper leaves open is whether these results transfer to proprietary frontier models (GPT-4o, Claude, Gemini were excluded due to cost at 23,392 episodes) and whether the domains generalize beyond SE, WR, and DP. For practitioners building production agents today, the implication is direct: if you are selecting a model for a long-horizon deployment, run it on your longest task bucket first, not the short one.

Discussion question: Memory scaffolds helped none of the 10 models tested and hurt six of them at long and very-long horizons, yet most production agent frameworks ship with memory enabled by default. If architectural interventions like task decomposition outperform memory augmentation at no model cost, what does this mean for the memory-as-infrastructure narrative being pushed by agent platform vendors - and should reliability metrics like RDC and GDS replace or complement pass@1 in standard evaluation suites?
