Week 15 · Hermes: Chain-of-LLM Agents for Autonomous Network Modeling

The Paper

"Hermes: A Large Language Model Framework on the Journey to Autonomous Networks" was published in November 2024 by Fadhel Ayed, Ali Maatouk, Nicola Piovesan, Antonio De Domenico (Huawei Paris Research Center), Merouane Debbah (Khalifa University, UAE), and Zhi-Quan Luo (Chinese University of Hong Kong, Shenzhen), with Maatouk also affiliated with Yale University. The central claim is that a chain of specialized LLM agents using structured "blueprints" can automatically construct Network Digital Twin instances with a success rate of up to 82.5%, a 3.3x improvement over chain-of-thought prompting on the same model (25%). The implication is that multi-agent harness design, not model capability alone, determines success in complex domain-specific automation.

The Problem Before This Paper

Cellular networks target Level 5 full autonomy on the TM Forum scale, but current operations sit around Level 3 with human experts still required for modeling network behavior and defining policies. Network Digital Twins showed promise but each use case demanded its own bespoke architecture - different KPIs, different data availability, different operator regulations - making NDTs impossible to scale. LLMs were proposed as the "telecommunications brain" to bridge this gap, but existing applications topped out at RAG chatbots for 3GPP standards lookup and simple config file translation. When asked to actually model network behavior - computing how a power change propagates through RSRP, interference, and SINR - even GPT-4o with chain-of-thought reasoning failed 75% of the time due to unit conversion errors, hallucinated formulas, and an inability to maintain coherent multi-step plans.

What They Built

Hermes separates network modeling into two specialized agent chains: a Designer that produces a YAML "blueprint" (sequence of named steps with inputs, outputs, formulas, and explanations) and a Coder that implements the blueprint as executable Python. The Designer chain runs through five stages: N coarse-grained generators produce high-level reflections, evaluators validate and challenge those reflections using the Foresee-and-Reflect framework, M fine-grained generators synthesize validated outputs into detailed strategies with formulas (mirroring genetic algorithm crossover), a blueprint editor compiles the final YAML, and a blueprint refiner catches unit mismatches and missing terms. The Coder chain follows with code generation, code refinement (using a known-issues checklist), Python interpreter execution, and iterative debugging. A feedback phase then performs sanity checks on functional blocks against ground truth data, feeding corrections back to the Designer for blueprint revision.

noise_dbm = -174 + 10 * log10(bandwidth_hz)
SINR_dB = RSRP_serving - 10*log10(sum(10^(RSRP_interferer/10)) + 10^(noise_dbm/10))
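
As a minimal sketch (not the paper's code), the SINR calculation can be written in Python as follows. The function names `dbm_to_mw`, `thermal_noise_dbm`, and `sinr_db` are hypothetical, and the sketch assumes all powers are given in dBm. The key step is that interference and noise are summed in linear milliwatts before converting back to dB, which is where direct dB arithmetic goes wrong.

```python
import math


def dbm_to_mw(power_dbm: float) -> float:
    """Convert a power level from dBm to linear milliwatts."""
    return 10 ** (power_dbm / 10)


def thermal_noise_dbm(bandwidth_hz: float) -> float:
    """Thermal noise floor in dBm for a given bandwidth (-174 dBm/Hz density)."""
    return -174 + 10 * math.log10(bandwidth_hz)


def sinr_db(rsrp_serving_dbm: float,
            rsrp_interferers_dbm: list,
            bandwidth_hz: float) -> float:
    """SINR in dB: serving power over (interference + noise), summed in mW."""
    noise_mw = dbm_to_mw(thermal_noise_dbm(bandwidth_hz))
    interference_mw = sum(dbm_to_mw(p) for p in rsrp_interferers_dbm)
    # Powers add in the linear domain; only the final ratio is expressed in dB.
    return rsrp_serving_dbm - 10 * math.log10(interference_mw + noise_mw)
```

For example, `sinr_db(-80, [-90, -90], 1e6)` comes out just under 7 dB, whereas adding the two -90 dBm interferers directly in dB would give a meaningless -180 dBm.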

Blueprint := YAML{ steps: [name, inputs, outputs, logic, explanation] }
Designer(policy, data) -> Blueprint -> Coder(blueprint) -> Python + evaluation
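
One way to picture the blueprint as a data structure is the Python sketch below. The `Step` and `Blueprint` classes and the `undefined_inputs` check are illustrative assumptions rather than the paper's schema. They show how a structured intermediate format lets a program verify that every step's inputs are either provided data or outputs of an earlier step, before any code is generated.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Step:
    name: str
    inputs: List[str]
    outputs: List[str]
    logic: str
    explanation: str = ""


@dataclass
class Blueprint:
    steps: List[Step] = field(default_factory=list)


def undefined_inputs(blueprint: Blueprint, available: List[str]) -> List[Tuple[str, str]]:
    """Return (step name, input name) pairs that no earlier step or dataset provides."""
    known = set(available)
    problems = []
    for step in blueprint.steps:
        for name in step.inputs:
            if name not in known:
                problems.append((step.name, name))
        known.update(step.outputs)  # later steps may consume these outputs
    return problems


# Hypothetical three-step blueprint with a deliberately missing interference step.
example = Blueprint(steps=[
    Step("rsrp", ["tx_power_dbm", "path_loss_db"], ["rsrp_dbm"],
         "rsrp_dbm = tx_power_dbm - path_loss_db"),
    Step("noise", ["bandwidth_hz"], ["noise_dbm"],
         "noise_dbm = -174 + 10*log10(bandwidth_hz)"),
    Step("sinr", ["rsrp_dbm", "interference_dbm", "noise_dbm"], ["sinr_db"],
         "sinr_db = rsrp_dbm - 10*log10(10^(interference_dbm/10) + 10^(noise_dbm/10))"),
])

missing = undefined_inputs(example, ["tx_power_dbm", "path_loss_db", "bandwidth_hz"])
# missing == [("sinr", "interference_dbm")]
```

A check like this is cheap to run on the blueprint alone, which is the practical benefit of separating design from code generation.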

Key Findings

Full Hermes pipeline achieves 75-85% success rate across four tasks of increasing complexity (4-7 modeling blocks), compared to 5-35% for chain-of-thought alone.
The performance gap between Hermes and CoT widens as task complexity grows - from 50 percentage points on simple tasks to 70 percentage points on complex ones.
GPT-4o with the Hermes harness (82.5%) massively outperforms Llama-3.1-405b with Hermes (45%), but providing 5 expert-designed blocks to Llama-3.1-70b (75%) nearly matches GPT-4o without expert blocks.
Blueprint-as-intermediate-representation forces the LLM to externalize its reasoning in a structured, verifiable format before any code is written.
The feedback phase catches unit conversion errors (dBm vs linear) that persist even through chain-of-thought reasoning.
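
The feedback phase can be pictured as a per-block comparison against reference values. The sketch below is an assumption about how such a check might look in Python, with a hypothetical `failing_blocks` helper and made-up numbers. It flags any block whose output deviates from its reference by more than a tolerance, so the correction can be routed back to the specific blueprint step.

```python
from typing import Dict


def failing_blocks(outputs: Dict[str, float],
                   reference: Dict[str, float],
                   tolerance: float = 0.10) -> Dict[str, str]:
    """Map each failing block name to a short description of the mismatch."""
    failures = {}
    for block, truth in reference.items():
        value = outputs.get(block)
        if value is None:
            failures[block] = "missing output"
        elif abs(value - truth) > tolerance * abs(truth):
            failures[block] = f"got {value:.2f}, expected about {truth:.2f}"
    return failures


# Illustrative values only: the interference block summed two -90 dBm
# signals directly in dB instead of converting to linear units first.
outputs = {"rsrp_dbm": -80.0, "interference_dbm": -180.0}
reference = {"rsrp_dbm": -80.0, "interference_dbm": -86.99, "sinr_db": 6.98}

report = failing_blocks(outputs, reference)
```

Here the report would name the interference block (a dB versus linear mix-up) and the SINR block (never produced), while leaving the correct RSRP block alone.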

Results

Evaluated across 20 independent runs per task on a simulated network of 10 tri-sectored base stations, Hermes with GPT-4o achieved success rates of 85% (power control), 80% (energy saving), 80% (energy saving vs SINR), and 75% (new BS deployment) - where success means estimation error below 10% relative to ground truth. The most striking result is the expert-block scaling experiment: Llama-3.1-70b jumps from 25% to 75% success as expert-designed model blocks in the repository increase from 0 to 5, while Llama-3.1-405b climbs from 45% to 80%. This demonstrates that a well-designed harness with curated tool libraries can close most of the gap between a 70B open-source model and a frontier proprietary model.
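
The success criterion above reduces to a short calculation. The Python sketch below assumes each run yields either a numeric estimate or `None` when the generated code failed to run. The function names and sample values are illustrative, not the paper's data.

```python
from typing import Optional, Sequence


def relative_error(estimate: float, truth: float) -> float:
    """Absolute error as a fraction of the ground-truth magnitude."""
    return abs(estimate - truth) / abs(truth)


def success_rate(estimates: Sequence[Optional[float]],
                 truth: float,
                 tolerance: float = 0.10) -> float:
    """Fraction of runs whose estimate is within `tolerance` of the truth."""
    successes = sum(
        1 for e in estimates
        if e is not None and relative_error(e, truth) < tolerance
    )
    return successes / len(estimates)


# Four hypothetical runs: two within 10%, one off by 20%, one that crashed.
runs = [100.0, 105.0, 120.0, None]
rate = success_rate(runs, truth=100.0)  # 0.5
```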

Why This Matters for AI and Automation

Proves the harness-over-model thesis in a real engineering domain: same GPT-4o model goes from 25% (CoT) to 82.5% (Hermes) purely through better orchestration.
Blueprint-as-YAML is a generalizable pattern for any domain requiring multi-step quantitative reasoning - not just telecom.
Expert-block repositories function as domain-specific tool libraries that make smaller models competitive, potentially reducing API cost by an estimated 5-10x compared to frontier models.
The Designer/Coder separation maps directly onto how effective engineering teams work: architects specify, implementers code, reviewers validate.

My Take

This paper lands at the right moment. Over the past two weeks on this site, I covered Anthropic's practitioner playbooks for production agents (Article 14) and Shangding Gu's formal framework proving that harness engineering outperforms model scaling (Week 14). Hermes is the domain-specific proof of both theses, independently discovered by a telecom research lab. The structural parallels are exact: Anthropic's initializer/executor split maps to Hermes' Designer/Coder split. Gu's three harness pillars (context engineering, tool architecture, orchestration) map to Hermes' blueprint structure, expert-block repository, and multi-agent chain respectively. The YAML blueprint serves the same function as Anthropic's feature_list.json - a structured external artifact that survives context boundaries and prevents drift.

What makes Hermes particularly relevant right now is the broader convergence happening in agent frameworks. The same week this paper gained traction, practitioners building personal automation agents - including OpenClaw and PicoClaw, which I covered in Article 02 - were incorporating similar multi-agent orchestration patterns at the edge. PicoClaw runs on a $17 board with under 10MB RAM, yet its architecture already separates planning from execution and uses structured state files for persistence across sessions. The pattern is fractal: whether you are orchestrating GPT-4o agents to model cellular networks or running a lightweight Go binary on cheap hardware to manage personal workflows, the winning architecture is the same - specialized agents, structured intermediate state, and feedback loops that catch errors before they compound. Hermes just proves it with hard numbers in one of the most demanding engineering domains available.

Discussion question: Hermes reaches a 75% success rate, close to GPT-4o's 82.5%, using Llama-3.1-70b plus five expert-designed blocks. As open-source models continue closing the capability gap, does this suggest that curated domain-specific tool libraries will become more competitively valuable than the models themselves - and if so, who should be building and maintaining those libraries: the model providers, the domain experts, or a new category of "agent tool vendors"?

Read the Paper on arXiv →