Week 18 · June 2026

SkillOpt: Teaching Agents to Improve Their Own Instructions

June 13, 2026 · by Satish K C
Agents Optimization LLMs Efficiency

The Paper

"SkillOpt: Executive Strategy for Self-Evolving Agent Skills" is authored by Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, and Chong Luo, from Microsoft, Shanghai Jiao Tong University, Tongji University, and Fudan University. The paper was published on arXiv in May 2026. Its central claim is that agent skills - natural-language policy documents prepended to agent context - should be treated as trainable external state rather than one-shot artifacts. SkillOpt is, by the authors' account, the first systematic text-space optimizer for agent skills: a separate model that converts scored execution trajectories into bounded, validated edits on a skill document, accepting only changes that strictly improve downstream task performance.

The Problem Before This Paper

Most agent skill engineering falls into two patterns: humans write task-specific instructions by hand, or an LLM generates a skill document in a single forward pass from a task description. Both produce static artifacts. When an agent fails on a task variant, the skill does not update. Gradient-based prompt optimizers like TextGrad operate on soft token embeddings and are model-specific; they cannot produce human-readable, transferable text. Trajectory distillation methods like Trace2Skill convert successful runs into new skills but have no mechanism for iterative refinement or quality control across runs. Evolutionary methods like GEPA and EvoSkill explore diverse skill proposals but lack the bounded, monotonically improving update discipline needed for reliable convergence. None of these methods treats skill optimization as a proper training loop with a disciplined train/validation/test split, explicit learning-rate analogs, and a mechanism that both learns from failure and protects stable patterns already in the skill from being overwritten.

What They Built

SkillOpt frames skill optimization as a structured loop over three data splits. In the forward pass, the current skill is prepended to agent context and rollouts are collected in batches of 40, generating trajectories and scalar scores. In the backward pass, a reflection minibatch of 8 samples separates failures from successes, and a separate optimizer model proposes structured edits to the skill document: add, delete, and replace operations. The edit budget L_t acts as a textual learning rate: at each step, at most L_t edits are permitted, and L_t follows a decay schedule (constant, linear, or cosine; default cosine with L_t=4 and floor=2). A validation gate checks each candidate skill against the selection split and accepts it only if its score strictly exceeds the current best. Rejected proposals go into a rejected-edit buffer that feeds back into subsequent reflection steps within the epoch. At the end of each epoch, a slow/meta update consolidates stable patterns from the epoch's accepted edits into a protected region of the skill document, preventing epoch-to-epoch regression. This mechanism turns out to be the single most important component: removing it drops SpreadsheetBench accuracy by 22.5 points.
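To make the bounded-edit and validation-gate ideas concrete, here is a minimal Python sketch of a single gated step. It is an illustration under stated assumptions, not the paper's implementation: the `Edit` type, `apply_edits`, `gated_step`, the line-based view of a skill document, and the `score_on_val` callback are all hypothetical names introduced here. The optimizer model that proposes edits and the slow/meta update are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Edit:
    """One structured edit (hypothetical representation, not from the paper)."""
    op: str                     # "add", "delete", or "replace"
    index: int                  # line position the edit targets
    text: Optional[str] = None  # new line content for "add" / "replace"


def apply_edits(skill: List[str], edits: List[Edit], budget: int) -> List[str]:
    """Apply at most `budget` edits, in order, to a copy of the skill's lines.

    Indices refer to the document as it stands when each edit is applied.
    """
    lines = list(skill)
    for e in edits[:budget]:
        if e.op == "add":
            lines.insert(e.index, e.text or "")
        elif e.op == "delete":
            del lines[e.index]
        elif e.op == "replace":
            lines[e.index] = e.text or ""
        else:
            raise ValueError(f"unknown op: {e.op}")
    return lines


def gated_step(
    skill: List[str],
    best_score: float,
    edits: List[Edit],
    budget: int,
    score_on_val: Callable[[List[str]], float],
    rejected: List[List[Edit]],
) -> Tuple[List[str], float]:
    """Keep the candidate only if it strictly beats the current best score."""
    candidate = apply_edits(skill, edits, budget)
    cand_score = score_on_val(candidate)
    if cand_score > best_score:
        return candidate, cand_score
    # Rejected proposals are remembered so later reflection steps can see them.
    rejected.append(edits[:budget])
    return skill, best_score
```

In this sketch a tie counts as a rejection, which mirrors the strict inequality in the gate. The caller is expected to pass the `rejected` list back into whatever builds the next reflection prompt.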

Objective: s* = argmax_s E[r(s, M, h)]
where s = skill document, M = frozen model, h = execution harness
r(s, M, h) in [0, 1] = scalar trajectory score on the task split

Validation gate: accept s_candidate only if score(s_candidate, S_val) > score(s_current, S_val)
Edit budget: |edits per step| <= L_t, where L_t decays via cosine schedule (default: L_t=4, floor=2)
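The edit budget schedule can be sketched in a few lines of Python. The exact formula is not given here, so this assumes the standard cosine-annealing shape, treats 4 as the starting budget and 2 as the floor, and rounds to a whole number of edits. The function name `edit_budget` and its parameters are hypothetical.

```python
import math


def edit_budget(step: int, total_steps: int, initial: int = 4, floor: int = 2) -> int:
    """Cosine-decayed edit budget: `initial` at step 0, `floor` at the last step.

    Assumed form: floor + 0.5 * (initial - floor) * (1 + cos(pi * progress)),
    rounded to an integer because edits are counted in whole operations.
    """
    if total_steps <= 1:
        return initial
    # Clamp the step into the valid range, then map it to progress in [0, 1].
    clamped = min(max(step, 0), total_steps - 1)
    progress = clamped / (total_steps - 1)
    raw = floor + 0.5 * (initial - floor) * (1.0 + math.cos(math.pi * progress))
    return max(floor, round(raw))


if __name__ == "__main__":
    # Print the budget for each step of a short, five-step run.
    print([edit_budget(t, 5) for t in range(5)])
```

With such a small range between the starting value and the floor, rounding leaves only a few distinct budget values, so in this sketch the schedule is effectively a short staircase from larger to smaller edits.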

Key Findings

Results

On GPT-5.5 in direct chat mode, SkillOpt reaches 87.3% on SearchQA (+9.6), 80.7% on SpreadsheetBench (+38.9), 72.1% on OfficeQA (+39.0), 91.2% on DocVQA (+12.4), 66.9% on LiveMathematicianBench (+29.3), and 95.5% on ALFWorld (+11.9). The gains on agentic harnesses are similarly strong: +24.8 points on the Codex loop and +19.1 on Claude Code. In the ablations, an edit budget of L_t=4 is optimal, with SpreadsheetBench peaking at 78.2 and SearchQA at 86.5 at that setting. Results are robust to batch sizes from 1 to 32: SearchQA varies only between 85.9 and 87.1, and SpreadsheetBench between 75.4 and 77.9. Training cost efficiency varies significantly by task: SpreadsheetBench costs 0.6M tokens per point gained while SearchQA costs 37.9M tokens per point, reflecting different signal density across domains.
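The cost figures are easier to compare with a little arithmetic. The sketch below uses only numbers quoted above. It assumes, without confirmation from the paper, that the per-point costs were measured against the same point gains reported for GPT-5.5 in direct chat mode, so the implied totals are rough estimates only. The names `reported` and `implied_total_tokens_m` are hypothetical.

```python
# Figures quoted in the text: (tokens per point gained, in millions; points gained).
# Assumption: both numbers for a benchmark come from the same run.
reported = {
    "SpreadsheetBench": (0.6, 38.9),
    "SearchQA": (37.9, 9.6),
}


def implied_total_tokens_m(tokens_per_point_m: float, points_gained: float) -> float:
    """Implied training cost in millions of tokens, under the assumption above."""
    return tokens_per_point_m * points_gained


# How many times more expensive a point is on SearchQA than on SpreadsheetBench.
ratio = reported["SearchQA"][0] / reported["SpreadsheetBench"][0]

if __name__ == "__main__":
    for name, (per_point, gain) in reported.items():
        total = implied_total_tokens_m(per_point, gain)
        print(f"{name}: ~{total:.1f}M tokens implied")
    print(f"SearchQA costs ~{ratio:.0f}x more tokens per point")
```

The ratio depends only on the two quoted per-point costs and comes to roughly 63 to 1. The implied totals hold only if the stated assumption does.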

Why This Matters for AI and Automation

My Take

The key insight in SkillOpt is the separation of concerns: the skill document is the thing being trained, not the model. This is a conceptually clean formulation that sidesteps the access and cost constraints of fine-tuning while producing an artifact that is human-readable and transferable. The slow/meta update result is the most interesting finding in the paper - a 22.5-point swing from a single ablation is unusual and suggests that protecting stable, validated knowledge from being overwritten by epoch-level noise is a more important design choice than the specifics of the edit proposal mechanism. The transferability results are the other standout: +59.7 on Codex-to-Claude-Code skill transfer is not a small margin, and it implies that SkillOpt is learning something more general than harness-specific formatting tricks. What the paper does not fully address is the cost asymmetry: the 37.9M tokens per point on SearchQA versus 0.6M on SpreadsheetBench indicates that some domains are naturally resistant to text-space skill compression. For practitioners, that means the return on optimization investment varies substantially by task type and is worth estimating before committing to a full optimization run. The single-skill-per-domain limitation is also real: tasks that require multiple disjoint procedures are a natural next frontier, and the paper acknowledges it.

Discussion Question

SkillOpt's validation gate accepts only strictly improving edits, which guarantees monotonic improvement on the selection split but also means the optimizer can only move in one direction at each step. If the optimal skill requires dismantling a well-validated pattern in order to insert a better one - a non-monotone path through skill space - would the current architecture find it, or would the validation gate systematically block the necessary intermediate regression?
