Week 18 - SkillOpt: Teaching Agents to Improve Their Own Instructions

The Paper

"SkillOpt: Executive Strategy for Self-Evolving Agent Skills" is authored by Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, and Chong Luo, from Microsoft, Shanghai Jiao Tong University, Tongji University, and Fudan University. Published on arXiv in May 2026, the paper's central claim is that agent skills - natural-language policy documents prepended to agent context - should be treated as trainable external state rather than one-shot artifacts. SkillOpt is, by the authors' account, the first systematic text-space optimizer for agent skills: a separate model that converts scored execution trajectories into bounded, validated edits on a skill document, accepting only changes that strictly improve downstream task performance.

The Problem Before This Paper

Most agent skill engineering falls into two patterns: humans write task-specific instructions by hand, or an LLM generates a skill document in a single forward pass from a task description. Both produce static artifacts. When an agent fails on a task variant, the skill does not update. Gradient-based prompt optimizers like TextGrad operate on soft token embeddings and are model-specific; they cannot produce human-readable, transferable text. Trajectory distillation methods like Trace2Skill convert successful runs into new skills but have no mechanism for iterative refinement or quality control across runs. Evolutionary methods like GEPA and EvoSkill explore diverse skill proposals but without the bounded, monotonically-improving update discipline needed for reliable convergence. None of these methods treats skill optimization as a proper training loop with train/validation/test split discipline, explicit learning-rate analogs, and a mechanism that both learns from failure and protects stable patterns already in the skill from being overwritten.

What They Built

SkillOpt frames skill optimization as a structured loop over three data splits. In the forward pass, the current skill is prepended to agent context and rollouts are collected in batches of 40, generating trajectories and scalar scores. In the backward pass, a reflection minibatch of 8 samples separates failures from successes and a separate optimizer model proposes structured edits to the skill document - specifically add, delete, and replace operations. The edit budget L_t acts as a textual learning rate: at each step, only up to L_t edits are permitted, with L_t following a scheduled decay (constant, linear, or cosine; default cosine with L_t=4 and floor=2). A validation gate checks each candidate skill against the selection split and accepts it only if the score strictly exceeds the current best. Failed proposals go into a rejected-edit buffer that feeds back into subsequent reflection steps within the epoch. At the end of each epoch, a slow/meta update consolidates stable patterns across the epoch's accepted edits into a protected region of the skill document, preventing epoch-to-epoch regression. This mechanism turns out to be the single most important component: removing it drops SpreadsheetBench accuracy by 22.5 points.

Objective: s* = argmax E[r(s, M, h)]
where s = skill document, M = frozen model, h = execution harness
r(s) in [0,1] = trajectory scalar score on task split

Validation gate: accept s_candidate only if score(s_candidate, S_val) > score(s_current, S_val)
Edit budget: |edits per step| <= L_t, where L_t decays via cosine schedule (default: L_t=4, floor=2)

Key Findings

+23.5 points average on GPT-5.5 (direct chat): SkillOpt-optimized skills raise average accuracy from 58.8% to 82.3% across six benchmarks, including +38.9 on SpreadsheetBench and +39.0 on OfficeQA.
Competitive across all 52 evaluated cells: Across every combination of 7 models, 6 benchmarks, and 3 execution harnesses tested, SkillOpt achieves best or tied-best performance - including a +5.4 point improvement over the per-cell oracle baseline.
Skills transfer across models and harnesses: A skill optimized on GPT-5.4 transfers to GPT-5.4-mini with +9.4 gain; a Codex-harness skill transfers to Claude Code harness with +59.7 points and back with +43.6 - without retraining.
Slow/meta update is load-bearing: Removing the epoch-wise consolidation mechanism causes a 22.5-point collapse on SpreadsheetBench (77.5 to 55.0), the largest single ablation degradation in the paper.
Final skills are compact: Despite hundreds of optimizer proposals, only 1-4 edits are accepted into the best skill per benchmark. Final skills range 379-1,995 tokens, deployable at zero inference-time overhead.

Results

On GPT-5.5 in direct chat mode, SkillOpt reaches 87.3% on SearchQA (+9.6), 80.7% on SpreadsheetBench (+38.9), 72.1% on OfficeQA (+39.0), 91.2% on DocVQA (+12.4), 66.9% on LiveMathematicianBench (+29.3), and 95.5% on ALFWorld (+11.9). The gains on agentic harnesses are similarly strong: +24.8 points on the Codex loop and +19.1 on Claude Code. In ablation, learning rate L_t=4 is optimal, with SpreadsheetBench peaking at 78.2 and SearchQA at 86.5 at that setting. Batch size from 1 to 32 shows robustness: SearchQA varies only 85.9-87.1 and SpreadsheetBench 75.4-77.9. Training cost efficiency varies significantly by task: SpreadsheetBench costs 0.6M tokens per point gained while SearchQA costs 37.9M per point, reflecting different signal density across domains.

Why This Matters for AI and Automation

Reusable skill investment: Skills optimized on one model or harness do not need to be rebuilt from scratch for a new deployment target - positive cross-harness and cross-model transfer means the training cost is amortized across multiple environments.
Zero deployment overhead: The optimizer runs at training time only. At inference, the skill is a prepended text document with no additional latency or infrastructure requirements.
Observable skill artifacts: Unlike gradient-based methods, SkillOpt produces readable, auditable skill documents. The learned rules are interpretable: SearchQA learns to prefer the shortest canonical entity supported by co-occurring evidence; SpreadsheetBench learns to write evaluated static values rather than relying on formula recalculation.
Works on frozen models: The optimization loop requires no access to model weights, gradients, or fine-tuning infrastructure. It operates entirely in the text space of the model API, making it compatible with any hosted model.

My Take

The key insight in SkillOpt is the separation of concerns: the skill document is the thing being trained, not the model. This is a conceptually clean formulation that sidesteps the access and cost constraints of fine-tuning while producing an artifact that is human-readable and transferable. The slow/meta update result is the most interesting finding in the paper - a 22.5-point swing from a single ablation is unusual and suggests that protecting stable, validated knowledge from being overwritten by epoch-level noise is a more important design choice than the specifics of the edit proposal mechanism. The transferability results are the other standout: +59.7 on Codex-to-Claude-Code skill transfer is not a small margin, and it implies that SkillOpt is learning something more general than harness-specific formatting tricks. What the paper does not fully address is the cost asymmetry: the 37.9M tokens per point on SearchQA versus 0.6M on SpreadsheetBench indicates that some domains are naturally resistant to text-space skill compression. For practitioners, that means the return on optimization investment varies substantially by task type and is worth estimating before committing to a full optimization run. The single-skill-per-domain limitation is also real: tasks that require multiple disjoint procedures are a natural next frontier, and the paper acknowledges it.

Discussion Question

SkillOpt's validation gate accepts only strictly improving edits, which guarantees monotonic improvement on the selection split but also means the optimizer can only move in one direction at each step. If the optimal skill requires dismantling a well-validated pattern in order to insert a better one - a non-monotone path through skill space - would the current architecture find it, or would the validation gate systematically block the necessary intermediate regression?

Read the Paper on arXiv →