Week 16 · June 2026

Beyond pass@1: Why Your Benchmark Score Means Nothing in Production

June 3, 2026 · by Satish K C · 8 min read
Agents LLMs Optimization Deep Learning

The Paper

"Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents" was submitted to arXiv on March 31, 2026 by Aaditya Khanal, Yangyang Tao, and Junxiu Zhou from the School of Computing and Analytics at Northern Kentucky University. The central claim is that pass@1 measured on short tasks is structurally blind to the failure modes that dominate production deployments - capability and reliability rankings diverge systematically as task duration grows, and the divergence follows domain-specific patterns that are invisible to any single-metric leaderboard.

The Problem Before This Paper

Existing agent benchmarks - SWE-bench, AgentBench, WebArena, METR, tau-bench - measure whether a model completes a task on a single attempt. They do not measure whether it completes tasks consistently, whether performance degrades over longer horizons, whether failures are graceful or catastrophic, or whether the behavioral patterns that produce success on short tasks are the same ones that produce success on long tasks. A single-attempt score implicitly assumes that short-task success is repeatable and carries over to longer horizons; in production, none of those assumptions hold. A model ranked first on a 5-minute task benchmark may be operationally worse than a model ranked third when both are deployed on hour-long workflows. Prior work covered subsets of this problem - ReliabilityBench examined variance, METR studied long-horizon task completion - but no study had jointly evaluated multiple models across multiple duration buckets with both variance-aware metrics and partial-credit scoring.

What They Built

The authors constructed a benchmark of 396 tasks across three domains (Software Engineering, Web Research, Document Processing) and four duration buckets (short: under 5 min; medium: 5-30 min; long: 30-120 min; very-long: over 120 min), with 33 tasks per domain-duration cell for balance. Ten open-weight models - ranging from Llama 3.1 8B to DeepSeek V3 (671B MoE) and Kimi K2.5 (1T MoE) - were each evaluated under two scaffolds (ReAct baseline and memory-augmented ReAct) with k=3 repeats per task, producing 23,392 total episodes. They introduce four metrics absent from existing evaluations: Reliability Decay Curve (RDC), which maps pass rate as a function of duration bucket; Variance Amplification Factor (VAF), which measures whether duration amplifies outcome variance; Graceful Degradation Score (GDS), a weighted partial-credit completion measure in [0,1]; and Meltdown Onset Point (MOP), which detects behavioral collapse using sliding-window tool-call entropy thresholding.
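The entropy-based meltdown detector described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the window size, both threshold values, and the choice to measure the entropy jump against the previous window are all assumptions made for the example.

```python
import math
from collections import Counter


def shannon_entropy(calls):
    """Shannon entropy (in bits) of the tool-name distribution in a window."""
    counts = Counter(calls)
    total = len(calls)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def meltdown_onset(tool_calls, window=4, theta_h=1.2, delta_threshold=0.3):
    """Return the first step t at which the entropy of tool_calls[t-window:t]
    exceeds theta_h AND has risen by more than delta_threshold relative to
    the previous window. Returns None if no such step exists.

    Assumption: "delta_entropy" is taken to be the change versus the
    immediately preceding window; the post does not define it.
    """
    prev = None
    for t in range(window, len(tool_calls) + 1):
        h = shannon_entropy(tool_calls[t - window:t])
        if prev is not None and h > theta_h and (h - prev) > delta_threshold:
            return t  # t calls have been issued when the collapse is flagged
        prev = h
    return None


# Hypothetical trace: steady reads, then the agent starts flailing across tools.
trace = ["read", "read", "read", "read", "read",
         "edit", "grep", "bash", "web", "read"]
onset = meltdown_onset(trace)
```

Requiring both a high absolute entropy and a sharp rise is what separates a collapse from an agent that simply uses a wide mix of tools throughout an episode.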

VAF = sigma^2[pass@1 | long] / sigma^2[pass@1 | short]
RDS = linear_regression_coeff(GDS vs. duration_bucket_index)
MOP = first step t where entropy(tool_calls[t-w:t]) > theta_H AND delta_entropy > delta_threshold

GDS = weighted_sum(completed_subtasks) in [0, 1]
Benchmark: 396 tasks x 10 models x 2 scaffolds x k=3 = 23,760 episodes in the full grid (23,392 reported)
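To make the VAF, GDS, and RDS definitions above concrete, here is a small sketch using only the standard library. It rests on assumptions the post does not settle: variance is taken over binary per-episode outcomes, subtask weights are supplied by the caller, and duration buckets are indexed 0, 1, 2, ... in order of increasing length. The function names and the sample numbers are illustrative only.

```python
from statistics import pvariance


def vaf(long_outcomes, short_outcomes):
    """Variance Amplification Factor: variance of pass/fail outcomes on long
    tasks divided by the variance on short tasks (outcomes are 0 or 1)."""
    short_var = pvariance(short_outcomes)
    if short_var == 0:
        raise ValueError("short-task variance is zero; VAF is undefined")
    return pvariance(long_outcomes) / short_var


def gds(completed, weights):
    """Graceful Degradation Score: weight of completed subtasks over total
    weight, which keeps the score in [0, 1]."""
    total = sum(weights)
    return sum(w for done, w in zip(completed, weights) if done) / total


def rds(gds_by_bucket):
    """Ordinary least-squares slope of GDS against duration bucket index."""
    n = len(gds_by_bucket)
    x_mean = (n - 1) / 2
    y_mean = sum(gds_by_bucket) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(gds_by_bucket))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


# Hypothetical numbers, not taken from the paper.
example_vaf = vaf(long_outcomes=[1, 0, 1, 0], short_outcomes=[1, 1, 1, 0])
example_gds = gds(completed=[True, True, False], weights=[0.5, 0.3, 0.2])
example_rds = rds([0.9, 0.8, 0.6, 0.5])  # short, medium, long, very-long
```

A negative slope from `rds` means partial-credit completion falls as tasks get longer; a `vaf` above 1 means outcomes become less predictable on long tasks than on short ones.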

Key Findings

Results

GLM-4.5 Air ranked first on short-horizon tasks with 94.9% pass@1 but dropped to fourth at very-long (66.7%), while Llama 3.3 70B climbed from fifth-sixth at short (74.7%) to third-fourth at very-long (54.5%) - a full rank inversion invisible to any short-task evaluation. The frontier cluster (DeepSeek, Kimi, MiniMax) maintained above 79.8% pass@1 even at very-long tasks, separating cleanly from the mid-tier. The SE domain proved most sensitive to duration: even the best models saw GDS fall below 0.50 on very-long SE tasks. Mistral Nemo's medium-horizon decay was 2.4x worse than the geometric i.i.d. baseline prediction, suggesting correlated failures rather than independent random errors. Across all models, observed reliability decay exceeded what would be predicted if long tasks were simply multiple short tasks chained together.
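The "geometric i.i.d. baseline" can be illustrated with a short sketch. The idea is that if a long task were just n independent short tasks chained together, each succeeding with probability p, the whole chain would succeed with probability p^n. How the paper chooses n, and exactly how it computes the "2.4x worse" ratio, is not stated in this post; the ratio-of-drops definition and all numbers below are assumptions for illustration.

```python
def iid_baseline(short_pass_rate, n_chained):
    """Predicted pass rate if a long task is n independent short tasks,
    each passing with probability short_pass_rate."""
    return short_pass_rate ** n_chained


def excess_decay(short_pass_rate, observed_long_pass_rate, n_chained):
    """Observed drop in pass rate divided by the drop the i.i.d. baseline
    predicts. Values above 1 mean decay is worse than independence implies.

    Assumption: this ratio-of-drops form is one plausible reading of
    "N times worse than baseline", not the paper's stated formula.
    """
    predicted_drop = short_pass_rate - iid_baseline(short_pass_rate, n_chained)
    if predicted_drop == 0:
        raise ValueError("baseline predicts no decay; ratio is undefined")
    return (short_pass_rate - observed_long_pass_rate) / predicted_drop


# Hypothetical model: 90% on short tasks, 72% observed on tasks treated as
# two chained short tasks. The baseline predicts 0.9 ** 2 = 0.81.
predicted = iid_baseline(0.9, 2)
ratio = excess_decay(0.9, 0.72, 2)
```

A ratio well above 1 is the signature of correlated failures: an early mistake makes later steps more likely to fail, which an independence assumption cannot capture.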

Why This Matters for AI and Automation

My Take

This paper sits at the intersection of two threads I have been tracking here. Last week (Week 15), the Hermes paper demonstrated that a 3.3x performance gain over chain-of-thought came entirely from harness design, not model capability. The week before (Week 14), Shangding Gu's framework formalized why system-level engineering - context governance, memory hygiene, skill routing - is the new bottleneck. This paper from Khanal et al. provides the measurement infrastructure to make those claims quantitative at deployment scale. The memory scaffold finding is particularly pointed: the standard practitioner instinct to add memory when agents fail on long tasks is not just unhelpful, it is actively harmful across all ten models tested. The most actionable takeaway is architectural - task decomposition at the harness level recovers most of the 24-point reliability gap without touching the model. What the paper leaves open is whether these results transfer to proprietary frontier models (GPT-4o, Claude, Gemini were excluded due to cost at 23,392 episodes) and whether the domains generalize beyond SE, WR, and DP. For practitioners building production agents today, the implication is direct: if you are selecting a model for a long-horizon deployment, run it on your longest task bucket first, not the short one.

Discussion question: Memory scaffolds hurt long-horizon performance across all 10 models tested, yet most production agent frameworks ship with memory enabled by default. If architectural interventions like task decomposition outperform memory augmentation at no model cost, what does this mean for the memory-as-infrastructure narrative being pushed by agent platform vendors - and should reliability metrics like RDC and GDS replace or complement pass@1 in standard evaluation suites?
