#harness-engineering · #state · #reliability

Killed at 2am, resumed at 2:01: externalising agent run state

A power blip took out an eight-hour fleet run four hours in. It should have cost four hours of work and a fortune in tokens. It cost 90 seconds — because the state lived outside the process. Here's the checkpoint model that made it boring.

A demo holds its entire state in one heroic prompt. It works right up until the process dies — then it takes the whole run down with it, and you start the eight-hour job from zero. That’s fine for a demo. It is not fine for a fleet of agents working overnight while you sleep.

At 2:14am a host running four agents took a power blip and restarted. The run was four hours deep. Here’s why it cost 90 seconds instead of four hours — and the one bug that made the first version worse than no recovery at all.

State is not the conversation

The mistake almost everyone makes: treating the model’s context window as the state. The window is working memory — ephemeral, capped, gone when the process dies. The run state is something else: a durable, external record of what the task is, what’s been done, and what’s left.

In the dojo harness, run state is a small externalised document, checkpointed after every beat:

interface RunState {
  runId: string;
  task: TaskContract;          // the goal + definition of done (invariants)
  plan: Step[];                // the task tree
  cursor: number;              // which step we're on
  completed: StepResult[];     // what's done, with evidence
  sideEffects: Effect[];       // external actions taken, with idempotency keys
  memory: MemoryRef[];         // pointers, not the payload
}

The context window is rebuilt from this on resume. It is never the source of truth.
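As a rough illustration of that rebuild, here is a minimal sketch. The trimmed-down state shape, the fields inside Step and StepResult, and the five-result cap are assumptions made for the example, not the dojo harness's actual code, and the real assembleContext is async.

```typescript
// Hypothetical, trimmed-down shapes; the real RunState carries more fields.
interface Step { id: string; description: string }
interface StepResult { stepId: string; summary: string }
interface MiniRunState {
  goal: string;
  plan: Step[];
  cursor: number;
  completed: StepResult[];
}

// Rebuild the prompt purely from durable state: no hidden in-process memory.
function assembleContext(state: MiniRunState, maxRecent = 5): string {
  const current = state.plan[state.cursor];
  const recent = state.completed.slice(-maxRecent); // cap what re-enters the window
  const lines = [
    `Goal: ${state.goal}`,
    `Progress: ${state.completed.length}/${state.plan.length} steps done`,
    ...recent.map((r) => `Done ${r.stepId}: ${r.summary}`),
    `Next: ${current ? current.description : "(plan complete)"}`,
  ];
  return lines.join("\n");
}

// A state restored from JSON yields the same context as the live object.
const live: MiniRunState = {
  goal: "Triage open tickets",
  plan: [
    { id: "s1", description: "List open tickets" },
    { id: "s2", description: "Label each ticket" },
  ],
  cursor: 1,
  completed: [{ stepId: "s1", summary: "Found 12 open tickets" }],
};
const restored: MiniRunState = JSON.parse(JSON.stringify(live));
const sameContext = assembleContext(live) === assembleContext(restored);
```

Because the function reads nothing but the state document, a process that has just restarted builds the same window as one that never died.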

Checkpoint after every beat

The loop is plan → context → reason → act → observe → learn. State is persisted at the boundary of each beat, so the most you can ever lose is the single beat in flight:

async function runLoop(state: RunState, store: CheckpointStore) {
  while (state.cursor < state.plan.length) {
    const step = state.plan[state.cursor];

    const ctx = await assembleContext(state);     // rebuilds the window from state
    const action = await reason(ctx, step);
    const result = await act(action, state);       // see idempotency, below
    state.completed.push(observe(result));
    state.cursor += 1;

    await store.save(state);                        // durable checkpoint
  }
}

Resume is then trivial — there’s no special “recovery mode,” because normal operation is recovery from the last checkpoint:

const state = (await store.load(runId)) ?? initRun(task);
await runLoop(state, store);   // picks up at state.cursor, whether it's 0 or 140
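The snippets above assume a store with save and load. One minimal way to back it, sketched here with synchronous Node file APIs for brevity, is to write the checkpoint to a temporary file, flush it, and rename it over the previous one, so a crash mid-write leaves the old checkpoint in place instead of a half-written file. The class name, file layout, and Checkpoint shape are assumptions; the real store could just as well be a database row.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical minimal checkpoint shape for the example.
interface Checkpoint { runId: string; cursor: number }

class FileCheckpointStore {
  private readonly dir: string;

  constructor(dir: string) {
    this.dir = dir;
  }

  private file(runId: string): string {
    return path.join(this.dir, `${runId}.json`);
  }

  save(state: Checkpoint): void {
    const target = this.file(state.runId);
    const tmp = `${target}.tmp`;
    const fd = fs.openSync(tmp, "w");
    try {
      fs.writeSync(fd, JSON.stringify(state));
      fs.fsyncSync(fd); // flush to disk before the rename makes it visible
    } finally {
      fs.closeSync(fd);
    }
    fs.renameSync(tmp, target); // swap in the new checkpoint in one step
  }

  load(runId: string): Checkpoint | null {
    const target = this.file(runId);
    if (!fs.existsSync(target)) return null;
    return JSON.parse(fs.readFileSync(target, "utf8")) as Checkpoint;
  }
}

// A fresh store instance (standing in for a restarted process) sees the last save.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "ckpt-"));
new FileCheckpointStore(dir).save({ runId: "run-1", cursor: 140 });
const resumed = new FileCheckpointStore(dir).load("run-1");
const missing = new FileCheckpointStore(dir).load("no-such-run");
```

Returning null for an unknown run is what lets the resume line fall through to initRun(task) on a first start.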

The bug that re-ran a side effect

The first version checkpointed only after act had returned. Looks right. It isn’t.

The 2am restart happened to land between “the agent posted a comment to the tracker” and “the checkpoint saved.” On resume, state.cursor still pointed at that step — so the agent posted the comment again. One duplicate comment is harmless. The same gap on a step that creates a ticket or sends a notification is not.

The fix was to make side effects idempotent and record the intent before acting:

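async function act(action: Action, state: RunState): Promise<Result> {
  const key = idempotencyKey(state.runId, state.cursor, action);

  // Look up any intent recorded before a crash, so a resume never adds a duplicate entry.
  let effect = state.sideEffects.find((e) => e.key === key);
  if (effect?.confirmed) {
    return replay(key);          // already happened pre-crash; don't repeat it
  }

  if (!effect) {
    effect = { key, confirmed: false };
    state.sideEffects.push(effect);
    await checkpoint(state);      // record *intent* before the irreversible act
                                  // (checkpoint is assumed to wrap store.save)
  }

  // Intent recorded but unconfirmed: re-issue with the same key so downstream can de-duplicate.
  const result = await action.execute({ idempotencyKey: key });

  effect.confirmed = true;        // persisted by the loop's checkpoint for this step
  return result;
}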

Now a crash mid-act resumes into the “intent recorded but not confirmed” state, sees the idempotency key, and either replays the result or safely re-issues a call the downstream system will de-duplicate. As long as the downstream system honours the key, the external world sees each effect exactly once.
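To make the de-duplication concrete, here is a small sketch of both halves: a key derived only from the run, the step position, and the action's content, and a stand-in downstream that honours it. The internals of idempotencyKey, the Action shape, and the FakeTracker class are all assumptions for illustration; a real tracker or notification API has to support idempotency keys itself for this to hold.

```typescript
import { createHash } from "node:crypto";

// Hypothetical action shape: a tool name plus string arguments.
interface Action { tool: string; args: Record<string, string> }

// Same run + same step + same action content => same key, across restarts.
function idempotencyKey(runId: string, cursor: number, action: Action): string {
  // Sort argument names so property order in the object doesn't change the hash.
  const args = Object.keys(action.args).sort().map((k) => [k, action.args[k]]);
  const payload = JSON.stringify([runId, cursor, action.tool, args]);
  return createHash("sha256").update(payload).digest("hex");
}

// Stand-in for a downstream system that de-duplicates on the key.
class FakeTracker {
  private readonly seen = new Map<string, string>();
  comments: string[] = [];

  postComment(body: string, key: string): string {
    const prior = this.seen.get(key);
    if (prior !== undefined) return prior; // repeat call: hand back the first result
    const id = `comment-${this.comments.length + 1}`;
    this.comments.push(body);
    this.seen.set(key, id);
    return id;
  }
}

const tracker = new FakeTracker();
const body = "Build is green";
const action: Action = { tool: "comment", args: { body } };
const key = idempotencyKey("run-1", 140, action);
const first = tracker.postComment(body, key);

// Simulated resume: the same step is re-issued with a freshly computed key.
const retryKey = idempotencyKey("run-1", 140, { tool: "comment", args: { body } });
const second = tracker.postComment(body, retryKey);
```

One assumption is baked in: the key only stays stable if the resumed run produces the same action for that step. If the model might word the comment differently the second time, hashing the action content would yield a new key, and keying on the run and step position alone may be the safer choice.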

The eval

I built a chaos test that kills the process at a random beat on every task in the suite, then resumes, and asserts the final state matches an uninterrupted run:

Metric                    In-prompt state   Checkpoint, act-then-save   Checkpoint + idempotent act
Runs surviving a kill     0/30              30/30                       30/30
Duplicate side effects    n/a               4                           0
Avg work lost on resume   whole run         up to 1 beat                up to 1 beat
Resume time               n/a (restart)     1.4s                        1.5s
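A chaos test along these lines can be sketched with a toy, synchronous harness. Everything here is a hypothetical stand-in: an in-memory store that keeps only a JSON snapshot, a World that plays the downstream system, and a simulated kill thrown as an exception. Unlike the suite described above, which picks a random beat, this sketch walks every kill point in turn so the result is deterministic.

```typescript
interface Effect { key: string; confirmed: boolean }
interface State { cursor: number; completed: string[]; sideEffects: Effect[] }

class SimulatedKill {} // thrown to stand in for the process dying

// Keeps only a JSON snapshot, so unsaved in-memory changes are lost on a kill.
class MemoryStore {
  private snapshot: string | null = null;
  save(s: State): void { this.snapshot = JSON.stringify(s); }
  load(): State | null {
    return this.snapshot === null ? null : (JSON.parse(this.snapshot) as State);
  }
}

// Downstream system; with dedupe on, it ignores a repeated idempotency key.
class World {
  effects: string[] = [];
  dedupe: boolean;
  constructor(dedupe: boolean) { this.dedupe = dedupe; }
  apply(key: string): void {
    if (this.dedupe && this.effects.indexOf(key) !== -1) return;
    this.effects.push(key);
  }
}

// Runs to completion, or throws when the kill-point counter reaches killAt (-1 = never).
function run(steps: number, store: MemoryStore, world: World, killAt: number): State {
  const state: State = store.load() ?? { cursor: 0, completed: [], sideEffects: [] };
  let tick = 0;
  const maybeDie = (): void => { if (tick++ === killAt) throw new SimulatedKill(); };

  while (state.cursor < steps) {
    const key = `step-${state.cursor}`;
    let effect = state.sideEffects.find((e) => e.key === key);
    if (!effect) {
      effect = { key, confirmed: false };
      state.sideEffects.push(effect);
      store.save(state);   // intent recorded
    }
    maybeDie();            // kill point: before the act
    world.apply(key);
    maybeDie();            // kill point: after the act, before the checkpoint
    effect.confirmed = true;
    state.completed.push(key);
    state.cursor += 1;
    store.save(state);
  }
  return state;
}

// Kill once, resume from the same store, compare against an uninterrupted run.
function survivesKill(steps: number, killAt: number, dedupe = true): boolean {
  const baseline = run(steps, new MemoryStore(), new World(dedupe), -1);
  const store = new MemoryStore();
  const world = new World(dedupe);
  try {
    run(steps, store, world, killAt);
  } catch (e) {
    if (!(e instanceof SimulatedKill)) throw e;
  }
  const resumed = run(steps, store, world, -1);
  return JSON.stringify(resumed) === JSON.stringify(baseline)
    && world.effects.length === steps; // each effect exactly once
}

const STEPS = 5;
const results: boolean[] = [];
for (let k = 0; k < STEPS * 2; k++) results.push(survivesKill(STEPS, k)); // two kill points per step
const allSurvived = results.every(Boolean);
```

Calling survivesKill(5, 1, false) turns de-duplication off and kills between the act and the checkpoint, which reproduces the duplicate side effect from the first version.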

The real incident matched the test: 90 seconds from cold start to the agent picking up at step 141 of 160, no duplicated actions, no human in the loop.

The takeaway

If your agent can’t be killed and restarted mid-task, you don’t have a harness — you have a long prompt holding its breath. Externalised state turns a catastrophe into a non-event, and it’s the precondition for everything else autonomy promises: pause, hand off, recover, run all night. The model owns one beat of the loop. Surviving the loop is on you.