"Code as Agent Harness" protocol vs. Spec-Driven development comparison analysis

Code should be more than an output — it should be the structure that makes agents reliable

May 23, 2026

I read the paper “Code as Agent Harness” by Ning et al., and it crystallized what was missing. The authors argue that code isn’t just the output of an AI agent — it should be the harness that structures how the agent reasons, acts, and verifies itself.

This article explains what that means, why it matters, and shows an AI agent building a working FastAPI API under the harness protocol.

The Problem: Code Is the Output, Not the Structure

Most AI agent frameworks today work like this:

You give the agent a prompt
The agent calls tools (search, code execution, file reading)
The agent produces a result
You hope it’s correct

The problem? The agent has no persistent structure for its work.

As the Ning et al. paper points out, in emerging agentic systems, “code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification.” But most implementations don’t treat it that way. They treat code as a string the LLM emits, not as a structured artifact that governs the agent’s behavior.

The result is predictable:

Context loss: After a few tool calls, the agent forgets its original plan
Untested claims: The agent says, “I fixed the bug” but never ran the tests
Invisible failures: When something breaks, you can’t trace what the agent did
Cascading errors: A change in one place breaks five other places, silently

The Insight: Code as Harness

The paper proposes a different model. Instead of treating the agent as a black box that emits code, treat code as the harness that guides the agent.

Think of a harness like a safety harness for rock climbing. It doesn’t climb for you — but it keeps you attached to the rock, prevents catastrophic falls, and gives you anchors to rest on. An agent harness does the same thing for AI:

It anchors the agent to the actual codebase (not just the context window)
It prevents the agent from skipping steps (plan → execute → verify)
It creates inspectable traces of what happened and why
It propagates changes safely, so one modification doesn’t silently break everything else

The paper organizes this into three layers:

Layer 1: The Harness Interface

Code connects agents to three things:

Reasoning: The agent thinks in programs, not just natural language
Action: Every intent becomes an executable operation
Environment modeling: The agent observes the world through execution traces

Layer 2: Harness Mechanisms

The harness needs:

Planning: Breaking big tasks into smaller, verifiable units
Memory: Preserving state across steps (not just in the context window)
Tool use: Governed, verifiable interfaces for acting on the world
Feedback-driven control: Using test results, linter output, and execution traces to guide the agent

Layer 3: Scaling the Harness

For complex tasks, you need:

Multi-agent coordination: Different agents with different roles (planner, coder, tester)
Shared artifacts: Plans, diffs, and logs that all agents can read
Transactional consistency: When one agent changes something, dependent agents know about it

How the Agent Built It: Watching AHP in Action

I didn’t write this API. An AI agent did — operating under the Agent Harness Protocol. Watching it work was the best demonstration of why this matters.

Here’s how the agent progressed, step by step, using the harness.

Setup: OpenCode as CLI and Kimi K2.6 as model.
And SKILL.md for the Code as Agent Harness Protocol

Step 1: Loading the Harness

Before writing any code, the agent loaded the Agent Harness Protocol SKILL.md. This gave it a set of operating rules:

Executable: Every claim must be grounded in something you can run
Inspectable: All reasoning must be exposed as structured artifacts
Stateful: Task progress must be persisted across steps

The agent didn’t just “agree” to these principles. It literally couldn’t proceed without following them — the harness enforced the rules.

Step 2: Creating the Root Roadmap

The agent’s first action was creating a PLAN.md — the root roadmap that tracked every feature:

# Agent Harness Protocol — Root Roadmap

## Features

| Feature | Status | Read Set | Write Set |
|---|---|---|---|
| Task 1: Project Setup | **Active** | — | `pyproject.toml` |
| Task 2: Data Models | Pending | Task 1 | `models.py` |
| Task 3: Storage Layer | Pending | Task 2 | `storage.py` |
| Task 4: State Machine | Pending | Task 2 | `state_machine.py` |
| Task 5: Harness Logic | Pending | Task 3, Task 4 | `harness_logic.py` |
| ... | ... | ... | ... |

Why this matters: The agent didn’t just start coding. It created a transactional contract for every task — what it needed (Read Set) and what it would modify (Write Set). Before touching any file, the agent checked that its dependencies were converged.

Step 3: The PEV Loop in Action

For every task, the agent followed the same cycle:

Plan

The agent looked at the current PLAN.md, identified the active task, and checked its Read Set. For Task 2 (Data Models), it verified that Task 1 had converged.

Execute

The agent wrote the failing test first:

# tests/test_models.py
def test_task_creation_from_task_create():
    create_data = TaskCreate(title="Build API", role="coder")
    task = Task.from_create(create_data)
    assert task.title == "Build API"
    assert task.status == "pending"

Then it ran the test:

$ uv run pytest tests/test_models.py -v
ModuleNotFoundError: No module named 'models'

The test failed. This was expected. The harness requires verifying failure before success.

Then the agent implemented the model:

# models.py
class Task(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    title: str
    role: Literal["manager", "planner", "coder", "tester"]
    status: Literal["pending", "active", "verifying", "converged", "stale", "failed"] = "pending"
    read_set: list[UUID] = Field(default_factory=list)
    write_set: list[UUID] = Field(default_factory=list)
    memory: list[MemoryEntry] = Field(default_factory=list)
    subtasks: list[UUID] = Field(default_factory=list)

Verify

The agent re-ran the test:

$ uv run pytest tests/test_models.py -v
tests/test_models.py::test_task_creation_from_task_create PASSED

Only then did the agent update the PLAN.md:

| Task 2: Data Models | **Verified** | Task 1 | `models.py` |

This is the harness protocol in one cycle: Plan what to build, execute with a failing test first, verify it passes, and persist the state in the roadmap.

Step 4: The State Machine Enforces Discipline

When the agent reached Task 4 (State Machine), it hit the first enforcement mechanism. The agent wrote a test for an invalid transition:

def test_invalid_pending_to_converged():
    assert is_valid_transition("pending", "converged") is False

Then implemented the state machine:

VALID_TRANSITIONS = {
    "pending": {"active", "failed"},
    "active": {"verifying", "failed"},
    "verifying": {"converged", "failed", "active"},
    "converged": {"stale"},
    "stale": {"active", "failed"},
    "failed": {"active"},
}

What happened when the agent later tried to cheat?

During Task 7 (Status Updates), the agent was implementing the PUT endpoint. It could have allowed any status update. But the harness state machine rejected it:

PUT /tasks/{task_id}/status
{"status": "converged"}
# Returns: 400 Bad Request - "Invalid transition: pending -> converged"

The agent literally could not skip verification. The state machine prevented it. This is “feedback-driven control” from the paper — the harness, not the agent, decides what’s valid.

Step 5: Convergence Propagates Through Dependencies

When the agent reached Task 5 (Harness Logic), it implemented the convergence check:

def check_convergence(store, task_id):
    task = store.get(task_id)
    if task.status != "converged":
        return False
    for dep_id in task.read_set:
        dep = store.get(dep_id)
        if not dep or dep.status != "converged":
            return False
    for sub_id in task.subtasks:
        sub = store.get(sub_id)
        if not sub or sub.status != "converged":
            return False
    return True

Then, during Task 9 (Convergence & Verification), the agent tested it with a real dependency graph:

# Create a task that depends on another task
dep = create_task("Dependency", role="coder")
task = create_task("With Dep", role="planner", read_set=[dep.id])

# Mark the task as converged, but NOT its dependency
update_status(task.id, "converged")

# Verify
result = verify_task(task.id)
assert result["converged"] is False  # FAILS - dependency isn't converged

This is the core insight of the paper in one test: verification isn’t a checkbox. It’s a structural property of the dependency graph. The agent couldn’t fake success — the harness checked the entire chain.

Step 6: Stale Propagation Catches Runtime Changes

The most interesting moment came when the user said, “Use Python 3.12 and uv.”

The agent had already verified Tasks 1-5 with Python 3.9. When the environment changed, the harness detected the inconsistency:

| Task 3: Storage Layer | **Stale** | Task 1, Task 2 | `storage.py` |

Why stale? Because storage.py used Python 3.9 syntax (dict[str, Task]) that worked in 3.9 but might not in 3.12. The harness didn’t assume compatibility — it marked the task stale and forced re-verification.

The agent had to:

Re-run all previously passing tests under Python 3.12
Verify no regressions
Update the PLAN.md to mark tasks Verified again

This is “transactional consistency” from the paper. When the ground truth changes (Python version), everything that depended on the old truth becomes invalid.

Step 7: Hierarchical Planning Decomposes Complex Work

For Task 8 (Subtasks), the agent built hierarchical planning:

# Parent task (manager role)
parent = create_task("Build Feature", role="manager")

# Child tasks (specialized roles)
child1 = create_task("Design Models", role="planner")
child2 = create_task("Write Tests", role="tester")

# Link them
add_subtask(parent.id, child1.id)
add_subtask(parent.id, child2.id)

Then, during Task 10 (Integration Tests), the agent ran a full PEV loop:

def test_full_pev_loop():
    # Plan: Create parent with subtasks
    parent = create_task("Build Feature", role="manager")
    child1 = create_task("Design Models", role="planner")
    child2 = create_task("Write Tests", role="tester")
    add_subtask(parent.id, child1.id)
    add_subtask(parent.id, child2.id)
    
    # Execute: Converge children
    for child in [child1, child2]:
        update_status(child.id, "active")
        update_status(child.id, "verifying")
        update_status(child.id, "converged")
    
    # Verify: Parent converges only when all children converge
    update_status(parent.id, "active")
    update_status(parent.id, "verifying")
    update_status(parent.id, "converged")
    
    assert check_convergence(parent.id) is True

This is multi-agent coordination in miniature. Each “role” is a different agent with different responsibilities. The harness ensures they converge together.

Step 8: The Final State

When the agent finished, the PLAN.md showed:

# Convergence Report

- **Total Tests:** 51
- **Passed:** 51
- **Failed:** 0

## Project Converged

All tasks have reached Correctness Convergence.

Step 3: Expose It as an API

The final step was exposing all of this as REST endpoints so other agents (or humans) could interact with the harness:

The Results: 51 Tests, All Passing

This isn’t theoretical. I wrote 51 tests that prove every mechanism works:

pytest tests/ -v

18 integration tests (full API flows)
11 harness logic tests (convergence, stale propagation, circular deps)
13 state machine tests (valid/invalid transitions)
6 storage tests (CRUD)
4 model tests (validation)

============================== 51 passed in 0.22s ==============================

Every AHP principle has a test that demonstrates it.

The 5 Most Important Benefits of an Agent Harness

After building this, here are the concrete benefits I see:

1. Deterministic State

You always know what an agent is doing, what it did, and what it needs to do next. The status field is the single source of truth. No more “I think it’s done?”

2. Failure Is Explicit, Not Silent

When something breaks, the state machine tells you exactly where: verifying → failed. You don’t have to guess whether the agent succeeded or not.

3. Dependencies Are Enforced

The harness won’t let an agent proceed if its inputs aren’t converged. This prevents the most common failure mode in agent systems: cascading errors from stale assumptions.

4. Full Audit Trail

The memory list is a complete log of every action, observation, and error. Debugging becomes reading logs, not reverse-engineering prompts.

5. Composable Plans

Hierarchical tasks mean complex work is broken into small, independently verifiable units. Each subtask converges on its own. The paper calls this “scaling the harness from single-agent systems to multi-agent settings.”

How This Connects to the Paper

The Ning et al. paper is a survey — it identifies the principles but doesn’t give you a working implementation. The agent built a minimal, working instantiation of their framework:

Spec-Driven Development vs. Agent Harness Protocol: Complementary, Not Competing

The agent used two different methodologies to build this API, and understanding how they differ is crucial.

Spec-Driven Development (Superpowers)

Before writing any code, the agent followed a strict spec-driven process:

Brainstorming: Asked clarifying questions, proposed 3 approaches, got user approval
Design Document: Wrote a formal spec (docs/superpowers/specs/...) with data models, endpoints, and error handling
Implementation Plan: Created a task-by-task plan (docs/superpowers/plans/...) with exact file paths, code snippets, and verification commands
Execution: Implemented each task in order, running tests after every step

What Spec-Driven Development does:

Decides WHAT to build before building it
Prevents scope creep through user approval gates
Breaks complex work into small, verifiable tasks
Creates a shared understanding between human and agent

Where it falls short:

The spec doesn’t verify itself. Once you start coding, the plan can become stale.
It doesn’t track runtime state. If Task 5 fails, the spec doesn’t automatically know.
It doesn’t enforce dependencies. Nothing prevents you from skipping Task 3 and jumping to Task 7.

Agent Harness Protocol

While spec-driven development planned the work, the Agent Harness Protocol governed the work as it happened:

State Machine: Every task had a status field that enforced valid transitions
Convergence Checks: Before marking a task done, I verified all dependencies were converged
Stale Propagation: When I changed Python versions (3.9 → 3.12), dependent tasks were automatically marked stale
Memory: Every test run, every error, every state change was logged

What Agent Harness Protocol does:

Ensures correctness DURING execution
Prevents invalid state transitions (you can’t skip verification)
Propagates changes safely (when one task changes, dependents know)
Creates an audit trail of everything that happened

Where it falls short:

It doesn’t tell you what to build. It needs a plan as input.
It doesn’t ask clarifying questions or propose alternatives.
It doesn’t handle scope decisions or user approval.

The Synergy: Why You Need Both

Here’s what actually happened when the agent built this API(this is one of the approaches that we can take during feature development):

Analogy:

Spec-Driven Development is the architect drawing blueprints before construction
Agent Harness Protocol is the building inspector checking that each floor is structurally sound before the next floor goes up

The architect (spec) without the inspector (harness) builds beautiful buildings that collapse. The inspector (harness), without the architect (spec), checks that the construction is sound — but has no idea if it’s building a hospital or a shopping mall.

Key Differences

The practical workflow:

Use Spec-Driven Development when you don’t know what to build yet
Use Agent Harness Protocol when you know what to build and need to ensure it converges
Use both for production systems: plan with specs, execute with harnesses

The SKILL.md: Making AHP Reusable

The agent built the API under the Harness protocol.

But to make the Agent Harness Protocol reusable — not just for this project, but for any agentic system — I created a formal SKILL.md for the Code as Agent Harness Protocol on GitHub.

What It Contains

The SKILL.md is a structured guide that any AI agent can load at the start of a session to operate under the harness protocol. It includes:

Core Philosophy: The three principles (Executable, Inspectable, Stateful)
Harness Interface: How to use code for reasoning, action, and environment modeling
Harness Mechanisms: Planning hierarchies, memory management, and tool governance
Scaling Patterns: Multi-agent coordination through shared artifacts and transactional consistency
Safety Gates: Human-in-the-loop requirements for security-critical actions
Verification Checklist: The pre-flight checklist every agent must run before claiming convergence

How It Works in Practice

When an agent loads the SKILL.md, it receives instructions like:

"Before starting Feature N:
- [ ] Check the Root /PLAN.md for current roadmap state
- [ ] Create features/{feature_name}/PLAN.md
- [ ] Define the Read Set (Dependencies) and Write Set (Targets)
- [ ] Verify that all dependencies have reached Correctness Convergence"

This isn’t a suggestion — it’s a mandatory harness check. The agent literally cannot proceed without ticking these boxes.

Why This Matters

The Ning et al. paper describes the theory. My API demonstrates the practice. The SKILL.md makes it portable.

You can give any agent this skill file at the start of a session, and it will:

Follow the PEV loop automatically
Maintain state across context window boundaries
Verify through execution, not self-consistency
Propagate changes safely through dependency graphs

This is the difference between a one-off demo and a reusable infrastructure layer.

The Bottom Line

The “Code as Agent Harness” paper argues that code should be more than an output — it should be the structure that makes agents reliable. After building this API, I agree.

The harness doesn’t make the agent smarter. It makes the agent accountable. Every claim is grounded in executable code. Every state transition is validated. Every dependency is tracked. Every action is logged.

That’s the difference between a demo and a production system. And it’s why every AI agent needs a harness.

Dev Internals

Discussion about this post

Ready for more?