Compaction: cutting agent context 62% with no accuracy loss
The night-shift agents were drowning in their own history. Here's the compaction pass that more than halved token cost on long runs — and the one summary it silently corrupted before I added pinned invariants.
A long-running agent has one resource that quietly runs out before the task does: context. Every reasoning step, tool result, and observation gets appended to the window. By beat 40 of an overnight run, half the budget is history the agent no longer needs — and you’re paying full price to re-read it on every turn.
This is the compaction pass I added to the dojo harness, the number it moved, and the bug it shipped before I caught it.
The symptom
A representative backlog-triage run, before any compaction:
turns: 58
peak input tokens: 184,210
cost / run: $2.41
wall time: 7m 12s
The window was 78% history by the final turns. Most of it was resolved: closed sub-tasks, tool calls that succeeded, observations already folded into a decision. Dead weight the model re-read every single turn.
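A figure like the 78% above needs some way of measuring how much of the window is old history. Here is a minimal sketch, assuming a rough four-characters-per-token heuristic rather than a real tokenizer. The `Turn` shape, the body of `estimateTokens`, and the `historyShare` helper are illustrative assumptions, not the harness's actual code:

```typescript
// Hypothetical turn shape; the real harness type may carry more fields.
type Turn = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Rough estimate: about 4 characters per token. An assumption, not a real tokenizer.
function estimateTokens(turns: Turn[]): number {
  return turns.reduce((sum, t) => sum + Math.ceil(t.content.length / 4), 0);
}

// Fraction of the window occupied by everything older than the last `keepLast` turns.
function historyShare(turns: Turn[], keepLast = 8): number {
  const total = estimateTokens(turns);
  if (total === 0 || turns.length <= keepLast) return 0;
  const older = turns.slice(0, turns.length - keepLast);
  return estimateTokens(older) / total;
}
```

Logging a number like this per turn is one way to see when old history starts to dominate the window, before picking a compaction threshold.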
The naive fix that broke things
First attempt: at a token threshold, summarise everything older than the last N turns into a single recap block, then drop the originals.
function compact(history: Turn[], keepLast = 8): Turn[] {
  // Below the threshold there is nothing to do.
  if (estimateTokens(history) < THRESHOLD) return history;
  const recent = history.slice(-keepLast);
  const older = history.slice(0, -keepLast);
  // Anything in `older` now survives only if the recap happens to mention it.
  const recap = summarise(older); // one LLM call -> prose recap
  return [{ role: "system", content: `Earlier context:\n${recap}` }, ...recent];
}
Token cost dropped immediately. So did correctness — on one run in twelve. The summariser, asked to compress 40 turns into a paragraph, dropped a constraint the user had set on turn 3: never touch tickets labelled customer-reported. It wasn’t in the last 8 turns, so it lived only in the prose recap — and “be concise” is exactly the instruction that throws away a clause that hasn’t been relevant for 30 turns.
The agent then “tidied” three customer-reported tickets. The demo would never have caught this. The eval did.
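The failure is easy to reproduce without an LLM in the loop. The sketch below assumes a hypothetical `lossySummarise` that keeps only the three most recent older turns (a deterministic stand-in for a recap that favours recent activity) and a hypothetical `constraintSurvives` check. Neither is the harness's real code, and the token threshold is left out for brevity:

```typescript
type Turn = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Stand-in for the LLM recap call: keeps only the three most recent older turns.
// Purely illustrative; a real summariser loses information less predictably.
function lossySummarise(older: Turn[]): string {
  return older.slice(-3).map((t) => t.content).join(" ");
}

// Naive compaction: one recap block followed by the last `keepLast` turns.
function naiveCompact(turns: Turn[], keepLast = 8): Turn[] {
  if (turns.length <= keepLast) return turns;
  const cut = turns.length - keepLast;
  const recap = lossySummarise(turns.slice(0, cut));
  return [{ role: "system", content: `Earlier context:\n${recap}` }, ...turns.slice(cut)];
}

// True if the constraint text is still present somewhere in the window.
function constraintSurvives(turns: Turn[], constraint: string): boolean {
  return turns.some((t) => t.content.includes(constraint));
}

const constraintText = "never touch tickets labelled customer-reported";
const runHistory: Turn[] = [
  { role: "user", content: "Triage the backlog." },
  { role: "assistant", content: "Plan: group tickets by label, then close stale ones." },
  { role: "user", content: `Constraint: ${constraintText}.` }, // set on turn 3
  ...Array.from({ length: 40 }, (_, i): Turn => ({ role: "tool", content: `closed sub-task ${i + 1}` })),
];

const compactedWindow = naiveCompact(runHistory);
const survivesBefore = constraintSurvives(runHistory, constraintText);
const survivesAfter = constraintSurvives(compactedWindow, constraintText);
```

With this stub the constraint is present before compaction and absent afterwards, which is the same shape as the regression described above.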
Pinned invariants
The fix: separate two kinds of history. Episodic turns (what happened) are safe to summarise. Invariants (constraints, the task contract, the definition of done) are never summarised — they’re pinned verbatim and survive every compaction.
function compact(state: RunState, keepLast = 8): Turn[] {
  const { invariants, history } = state;
  // Only episodic history is measured and summarised; invariants pass through untouched.
  if (estimateTokens(history) < THRESHOLD) return [...invariants, ...history];
  const recent = history.slice(-keepLast);
  const older = history.slice(0, -keepLast);
  const recap = summarise(older, { preserve: "decisions, open threads, failures" });
  return [
    ...invariants, // pinned, verbatim, every turn
    { role: "system", content: `Earlier context:\n${recap}` },
    ...recent,
  ];
}
Invariants are captured at the planning beat and only ever appended to — a constraint, once set, is structurally incapable of being compacted away. Compaction is now allowed to forget what happened but never what’s forbidden.
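One way to make "only ever appended to" hold structurally is to give the store no way to remove or rewrite an entry. The following is a minimal sketch under that assumption. `InvariantStore` and its method names are hypothetical, and the harness may hold invariants differently:

```typescript
type Turn = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Hypothetical append-only store: no remove, update, or clear method is exposed.
class InvariantStore {
  private readonly items: Turn[] = [];

  // Record a constraint verbatim. Exact duplicates are ignored, so restating is harmless.
  add(text: string): void {
    if (!this.items.some((t) => t.content === text)) {
      this.items.push(Object.freeze({ role: "system" as const, content: text }));
    }
  }

  // Return a copy so callers cannot mutate the pinned list itself.
  all(): Turn[] {
    return [...this.items];
  }
}

const pinned = new InvariantStore();
pinned.add("never touch tickets labelled customer-reported");
pinned.add("never touch tickets labelled customer-reported"); // restated later: no duplicate
const windowPrefix = pinned.all(); // spread first into every compacted window
```

Because `all()` hands back a copy and each entry is frozen, compaction code can read the invariants but has no path to drop or shorten one.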
The eval
Run against the dojo’s golden set — 24 multi-step tasks with known-good outcomes, each with at least one constraint set early and tested late:
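The golden set's format isn't shown here, so the following is only a sketch of how a "set early, tested late" constraint could be scored. The `GoldenTask`, `Action`, and `RunResult` shapes and both functions are assumptions made for illustration:

```typescript
// Hypothetical shapes; the real golden set format is not shown in this post.
type Action = { tool: string; ticketLabels: string[] };
type GoldenTask = { id: string; forbiddenLabels: string[] };
type RunResult = { taskId: string; succeeded: boolean; actions: Action[] };

// A run violates its task if any action touched a ticket carrying a forbidden label.
function violates(task: GoldenTask, run: RunResult): boolean {
  return run.actions.some((a) =>
    a.ticketLabels.some((label) => task.forbiddenLabels.includes(label)),
  );
}

// Aggregate per-run results into success and violation counts.
function score(tasks: GoldenTask[], runs: RunResult[]) {
  const byId = new Map(tasks.map((t): [string, GoldenTask] => [t.id, t]));
  let successes = 0;
  let violatingRuns = 0;
  for (const run of runs) {
    const task = byId.get(run.taskId);
    if (!task) continue; // ignore runs with no matching golden task
    if (run.succeeded) successes += 1;
    if (violates(task, run)) violatingRuns += 1;
  }
  return { successes, total: tasks.length, violatingRuns };
}
```

Counting violating runs rather than individual actions keeps the number comparable across tasks of different lengths. That is a design choice of this sketch, not something the original eval is known to do.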
| Metric | No compaction | Naive recap | Pinned invariants |
|---|---|---|---|
| Task success | 24/24 | 22/24 | 24/24 |
| Constraint violations | 0 | 2 | 0 |
| Peak input tokens | 184,210 | 71,940 | 69,880 |
| Cost / run | $2.41 | $0.94 | $0.91 |
| Wall time | 7m 12s | 4m 02s | 3m 58s |
A 62% reduction in peak input tokens and a 45% reduction in wall time (7m 12s down to 3m 58s), with constraint violations back to zero. The pinned invariants cost almost nothing: they’re small and they replace history the agent was paying to re-read anyway.
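The token and cost reductions can be checked directly against the table. A small sketch, where `reductionPct` is a helper introduced here for illustration:

```typescript
// Percentage reduction from `before` to `after`, rounded to one decimal place.
function reductionPct(before: number, after: number): number {
  return Math.round((1 - after / before) * 1000) / 10;
}

// No compaction vs. pinned invariants, values taken from the table above.
const tokenReduction = reductionPct(184210, 69880); // peak input tokens -> 62.1
const costReduction = reductionPct(2.41, 0.91); // cost per run in USD -> 62.2
```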
What I’d do differently
I shipped the naive version to the night fleet for two runs before the eval gate caught the regression. The lesson wasn’t “summarisation is dangerous” — it’s that compaction is a lossy operation, and you have to decide what it’s allowed to lose before you turn it on. Now every harness change that touches the context beat runs the constraint-violation eval before it can merge. The gate is cheap. The customer-reported ticket it would have touched in production was not.
Compaction is one of the five beats the model doesn’t own. Get it wrong and the agent slowly forgets why it’s doing the task. Get it right and it runs all night on a fraction of the budget.