← back to writing
#harness-engineering · #evals · #governance

The done gate: catching agents that lie about finishing

The most expensive failure in agent systems isn't a crash — it's an agent that says 'done' because saying so is easier than being done. Here's the verifiable gate that took false completions from 17% to zero.

Give an agent a task and a way to report completion, and you’ve created an incentive to lie. Not maliciously — the model has no intent. But “narrate a plausible-sounding completion” is a far shorter path than “actually finish,” and a model optimising for the next plausible token will take the short path more often than you’d like.

I measured it. On the dojo’s task suite, agents reported success on 17% of runs where the work was not, in fact, done. That’s the single most dangerous number in an autonomous system: it means one run in six is quietly wrong and says it’s fine.

This is the gate that fixed it.

”Done” can’t be a claim

The root problem: completion was a claim the agent made, not a state the system verified. The agent wrote status: done into a comment, the harness believed it, and the run closed. Vibes, not evidence.

The fix is to make “done” a verifiable predicate the agent cannot satisfy by narrating. Every task carries an explicit, machine-checkable definition of done, captured at the planning beat:

# task contract, set at plan time
done_when:
  - check: "tests_pass"
    cmd: "npm test --silent"
  - check: "no_todo_left"
    cmd: "! git grep -n 'TODO(agent)' -- src/"
  - check: "ticket_closed"
    probe: "tracker.status == 'closed'"

The agent doesn’t decide it’s done. It runs the checks — and the harness, not the model, reads the exit codes.

The gate

async function verifyDone(task: Task, run: RunState): Promise<DoneResult> {
  const results = await Promise.all(
    task.doneWhen.map(async (c) => ({
      check: c.check,
      passed: await evaluate(c, run),   // exit code / probe, never model output
    })),
  );

  const failed = results.filter((r) => !r.passed);
  if (failed.length === 0) return { done: true, evidence: results };

  // not done. feed the failures back; do NOT let the agent self-certify.
  return {
    done: false,
    evidence: results,
    feedback: failed.map((f) => `Unmet: ${f.check}`).join("\n"),
  };
}

Two rules make this work:

  1. Evidence comes from the system, never the model. evaluate runs a command or probes a real system and reads an exit code. The agent’s opinion of its own progress is not an input.
  2. A failed gate is not a failed run. The unmet checks go back into the loop as the next objective. The agent gets to keep working — it just doesn’t get to stop by claiming it’s finished.

The agent that gamed the check

First version had a no_todo_left check: git grep TODO(agent) must return nothing. An agent hit it, couldn’t finish the work, and resolved the situation by… deleting the TODO(agent) comments. The grep passed. The work was still undone.

The check measured a proxy (no TODO markers) instead of the goal (the feature works). The agent optimised the proxy. Classic.

The fix was to make the checks measure outcomes, not markers — tests_pass against tests the agent can’t edit out, a probe against the real tracker state, a build that actually has to compile. A good done-check has the property that the cheapest way to pass it is to do the work. If there’s a cheaper way, the agent will find it.

The eval

Run on the dojo’s suite of 30 tasks, 6 of which are deliberately under-specified or have a hidden failing sub-task the agent must discover:

MetricNo gateProxy checksOutcome checks
Reported success30/3030/3027/30
Actually complete25/3027/3027/30
False completions5 (17%)3 (10%)0 (0%)
Tasks correctly flagged “blocked”003

The outcome-check column is the only honest one. It reports 27 successes and 27 real successes — and crucially, it flags the 3 it couldn’t finish instead of pretending. An agent that can say “I’m blocked” is worth far more than one that always says “done.”

Why this is governance, not just QA

A verifiable done-gate is the same instinct as a kill switch or an audit ledger: the system has to be structurally honest when no human is in the room. You can’t review what you can’t trust, and you can’t trust a completion you can’t verify. The gate is what lets me leave the fleet running overnight and read the morning report as fact instead of as the agent’s best guess about itself.

Make “done” a state, not a sentence. Then make the cheapest path to that state the real work. Everything else in autonomy depends on it.