arXiv 2026 (v1), ~102pp survey
My reading notes
Directly relevant to Arun's MIRROR anomaly platform and agentic tooling: it gives a vocabulary and design checklist (Plan-Execute-Verify loop, deterministic sensors, sandboxed/permissioned execution, evidence-carrying verification, telemetry-driven self-evolution) for building reliable long-horizon agent systems that train on Slurm and serve on Kubernetes. The verification-and-oracle-adequacy chapter maps onto his trustworthy-ML and synthetic-ground-truth interests.
This is a large survey (over 100 pages, dozens of authors across UIUC, Meta, and Stanford) that reframes the role of code in LLM agent systems. The core thesis: code is no longer just an output to be generated, but the executable, inspectable, and stateful medium through which an agent reasons, acts, observes feedback, and verifies progress. The authors call this view "code as agent harness," where a harness is the software layer (tools, APIs, sandboxes, memory, validators, permission boundaries, execution loops, feedback channels) that turns a stateless model into a functional long-running agent. They argue the bottleneck of autonomy is not just base-model reasoning but the reliability of the system connecting model outputs to long-horizon actions and persistent state.
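To make the harness definition concrete, here is a minimal sketch, not taken from the survey. It assumes a hypothetical `Harness` class with a tool registry, a permission boundary, and a persistent log, wrapped around a stateless model. The `scripted_model` function is a stand-in for an LLM call, and every name here is illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Harness:
    """Hypothetical minimal harness: tools, a permission boundary, persistent state."""
    tools: Dict[str, Callable[[str], str]]
    allowed: Set[str]                                 # permission boundary
    memory: List[str] = field(default_factory=list)   # state that outlives each model call

    def step(self, model: Callable[[List[str]], Tuple[str, str]]) -> str:
        # The model is stateless: it only sees what the harness passes in.
        tool_name, arg = model(self.memory)
        if tool_name not in self.allowed:
            observation = f"DENIED: {tool_name}"
        elif tool_name not in self.tools:
            observation = f"UNKNOWN: {tool_name}"
        else:
            observation = self.tools[tool_name](arg)
        # Feedback channel: the observation is logged and visible on the next step.
        self.memory.append(f"{tool_name}({arg!r}) -> {observation}")
        return observation


def scripted_model(memory: List[str]) -> Tuple[str, str]:
    # Stand-in for an LLM: tries a forbidden tool first, then a permitted one.
    return ("shell", "rm -rf /") if not memory else ("upper", "hello")


harness = Harness(
    tools={"upper": str.upper, "shell": lambda cmd: "ran"},
    allowed={"upper"},
)
first = harness.step(scripted_model)    # blocked by the permission boundary
second = harness.step(scripted_model)   # permitted tool call
print(first, second, len(harness.memory))
```

The point of the sketch is that the denial, the tool result, and the running log all live in the harness rather than in the model, which is where the survey locates the reliability bottleneck.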
The survey is organized into three connected layers. The Harness Interface layer covers code for reasoning (program-delegated computation, formal/symbolic verification, iterative code-grounded reasoning), code for acting (grounded skill selection, programmatic policy generation, lifelong code-based agents), and code for environment modeling (structured world representations, execution-trace world models, code-grounded evaluation, verifiable environment construction).
The Harness Mechanisms layer covers planning (linear, structural, search-based, orchestration), memory and context engineering (working/semantic/experiential/long-term/multi-agent memory plus context compaction and state offloading), tool use (function-oriented, environment-interaction, verification-driven, workflow-orchestration), a Plan-Execute-Verify (PEV) control loop in which the harness acts as a "cybernetic governor" reading deterministic sensors (linters, tests, type checkers, fuzzers, CI), and agentic harness engineering for telemetry-driven self-optimization.
The Scaling layer extends to multi-agent orchestration over shared code artifacts: role specialization (planner/coder/reviewer/tester), interaction modes (collaborative/adversarial/red-team), workflow topologies, execution-feedback synchronization, and a position argument for a shared code-centric harness substrate with state convergence.
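The Plan-Execute-Verify loop from the Mechanisms layer can be sketched as follows. This is an illustration under stated assumptions, not the survey's implementation. The two sensors (`syntax_sensor`, `unit_test_sensor`), the target function `add`, and the scripted `plan`/`execute` stand-ins are all hypothetical. A real harness would call a model for planning and run candidates in a sandbox.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class SensorReading(NamedTuple):
    sensor: str
    passed: bool
    detail: str


def syntax_sensor(code: str) -> SensorReading:
    """Deterministic check: does the candidate compile as Python?"""
    try:
        compile(code, "<candidate>", "exec")
        return SensorReading("syntax", True, "ok")
    except SyntaxError as exc:
        return SensorReading("syntax", False, str(exc))


def unit_test_sensor(code: str) -> SensorReading:
    """Deterministic check: run the candidate and test a hypothetical add()."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # assumption: a real harness would sandbox this
        assert namespace["add"](2, 3) == 5
        return SensorReading("unit_test", True, "ok")
    except Exception as exc:
        return SensorReading("unit_test", False, type(exc).__name__)


def pev_loop(
    plan: Callable[[List[SensorReading]], str],
    execute: Callable[[str], str],
    sensors: List[Callable[[str], SensorReading]],
    max_iters: int = 3,
) -> Tuple[Optional[str], int]:
    feedback: List[SensorReading] = []
    for attempt in range(1, max_iters + 1):
        step = plan(feedback)                        # Plan, given prior readings
        candidate = execute(step)                    # Execute: produce an artifact
        feedback = [s(candidate) for s in sensors]   # Verify with deterministic sensors
        if all(r.passed for r in feedback):
            return candidate, attempt
    return None, max_iters


def plan(feedback: List[SensorReading]) -> str:
    # Stand-in planner: drafts first, switches to a fix once any feedback exists.
    return "fix" if feedback else "draft"


def execute(step: str) -> str:
    # Stand-in executor: the draft is deliberately buggy.
    op = "+" if step == "fix" else "-"
    return f"def add(a, b):\n    return a {op} b\n"


candidate, attempts = pev_loop(plan, execute, [syntax_sensor, unit_test_sensor])
print(attempts)
```

The buggy draft passes the syntax sensor but fails the unit test, so the loop only terminates once every sensor agrees. This is the "governor" role in miniature: acceptance is decided by the sensors, not by the planner's own judgment.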
The final part surveys tangible applications (coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization/recommendation, DevOps, enterprise workflows) and lays out seven open problems: harness-level evaluation beyond final task success and oracle adequacy; semantic verification beyond executable feedback (a "verification stack with explicit scope" where each check declares what it can and cannot verify, and every accepted action carries an evidence bundle); self-evolving harnesses without regression (treating each harness mutation like a safety-critical code change with a change contract and rollback); transactional shared program state with semantic (not just textual) conflict resolution; human-in-the-loop safety and accountability as harness state; multimodal code-harness systems; and a call toward a science of harness engineering. It ships a companion GitHub paper list.
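The "verification stack with explicit scope" idea from the open problems can be sketched as below. This is one possible reading of that proposal, not the authors' design. The `Check` and `EvidenceBundle` types, their field names, and both example checks are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Check:
    name: str
    verifies: str           # what this check can establish
    does_not_verify: str    # its declared blind spot
    run: Callable[[str], bool]


@dataclass
class EvidenceBundle:
    action: str
    results: List[dict] = field(default_factory=list)

    @property
    def accepted(self) -> bool:
        # An action with no evidence at all is never accepted.
        return bool(self.results) and all(r["passed"] for r in self.results)

    @property
    def unverified(self) -> List[str]:
        # Scope that the stack explicitly left unchecked.
        return [r["does_not_verify"] for r in self.results]


def verify(action: str, stack: List[Check]) -> EvidenceBundle:
    bundle = EvidenceBundle(action)
    for check in stack:
        bundle.results.append({
            "check": check.name,
            "passed": check.run(action),
            "verifies": check.verifies,
            "does_not_verify": check.does_not_verify,
        })
    return bundle


def compiles(src: str) -> bool:
    try:
        compile(src, "<action>", "exec")
        return True
    except SyntaxError:
        return False


stack = [
    Check("syntax", "source parses as Python", "runtime behavior", compiles),
    Check("no_eval", "no literal eval( call", "other unsafe calls",
          lambda src: "eval(" not in src),
]
bundle = verify("x = 1 + 1", stack)
print(bundle.accepted, bundle.unverified)
```

Carrying `does_not_verify` alongside each result means an accepted action still records what nobody checked, which is the gap between executable feedback and semantic verification that the survey flags.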