Your AI agents are untrained. The bottleneck was never capability.

We keep waiting for smarter models. But the agents we already have fail for the same reasons junior engineers do — no plan, no proof, no memory. Capability isn't the constraint. Discipline is.

I’ve spent the last couple of years putting AI coding agents in front of real work — production systems, enterprise constraints, the kind of codebases where a careless change has consequences. And I kept noticing something that didn’t fit the narrative.

The narrative says agents fail because models aren’t smart enough yet. Wait for the next release, the reasoning will improve, the failures will go away. But that’s not what I was seeing. The model was usually smart enough. What it lacked wasn’t intelligence. It was method.

An untrained agent, handed a non-trivial task, behaves in a way every engineering lead will recognize instantly. It rushes in without a plan. It writes code before it understands the problem. It declares the job done without ever proving it works. It makes the same mistake it made an hour ago, because nothing it learned then survived into now. And it floods its own context window with noise until it can no longer think straight.

None of those are intelligence problems. A brilliant new graduate does every one of them in their first month. We don’t fix that by hiring a smarter graduate. We fix it with training — with the accumulated discipline of how good engineers actually work.

The senior engineer in the room

Think about what a senior engineer actually does that a junior doesn’t. It’s rarely that they know a cleverer algorithm. It’s that they’ve internalized a set of instincts: plan the approach before touching the keyboard, isolate your work so you can’t break what’s already running, prove every change with a test, review your own work against the plan before asking anyone else to, and — crucially — remember the lesson when something goes wrong so it doesn’t go wrong the same way twice.

Those instincts are the difference between someone you can hand an ambiguous problem to and someone you have to supervise line by line. They’re also, it turns out, completely transferable to an agent — if you’re willing to encode them explicitly instead of hoping the model picks them up on its own.

That’s the whole premise behind a project I’ve been building in the open, called the Copilot Agents Dojo. The name is only half a joke. A dojo is where you go to turn raw ability into trained skill through repetition and structure. That’s exactly what an agent needs. Not more capability. More practice, made mandatory.

What “trained” actually means

When I say an agent is trained, I mean something concrete and testable, not a vibe. A trained agent does five things an untrained one won’t:

It plans before it codes. Multi-step work gets broken into bite-sized tasks, written down, and approved before a single line is touched.
It works in isolation. Every session happens on its own branch in its own workspace, so a failed experiment never contaminates the main line.
It proves its work. “Done” is not a claim. It’s a test that passes, a diff you can read, a log you can check.
It reviews itself. Before the work leaves the agent’s hands, it gets checked against the plan that justified it.
It learns from losing. Every correction becomes a logged lesson, and recurring lessons become permanent rules.
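
To make "concrete and testable" a little more tangible, here is a minimal sketch of how those five behaviors could be written down as explicit checks. It is an illustration only, not how the Copilot Agents Dojo implements them: the names `Session`, `unmet_gates`, and `LessonLog` are hypothetical, and the rule that a lesson counts as "recurring" after two occurrences is an assumption made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical record of one agent session; every field name is illustrative."""
    plan_approved: bool = False          # the plan was written down and approved
    branch: str = "main"                 # the branch the work is happening on
    tests_passed: bool = False           # evidence that the change works
    reviewed_against_plan: bool = False  # self-review against the plan is done


def unmet_gates(session: Session) -> list[str]:
    """Return the disciplines this session has not yet satisfied."""
    gates = []
    if not session.plan_approved:
        gates.append("plan")
    if session.branch == "main":
        gates.append("isolation")
    if not session.tests_passed:
        gates.append("proof")
    if not session.reviewed_against_plan:
        gates.append("review")
    return gates


@dataclass
class LessonLog:
    """Logs corrections and promotes recurring ones to permanent rules."""
    promote_after: int = 2  # assumed threshold for "recurring"
    counts: Counter = field(default_factory=Counter)
    rules: list[str] = field(default_factory=list)

    def record(self, lesson: str) -> None:
        """Count one correction; promote it once it has recurred often enough."""
        self.counts[lesson] += 1
        if self.counts[lesson] >= self.promote_after and lesson not in self.rules:
            self.rules.append(lesson)


# A session with an approved plan on its own branch, but no proof or review yet.
session = Session(plan_approved=True, branch="agent/fix-login")
print(unmet_gates(session))  # ['proof', 'review']

log = LessonLog()
log.record("run the tests before declaring done")
log.record("run the tests before declaring done")
print(log.rules)  # ['run the tests before declaring done']
```

The point of the sketch is that "done" becomes an empty list of unmet gates rather than a claim, and a lesson becomes a rule because of a counter rather than because someone remembered it.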

Notice that none of these are about the model being clever. They’re about behavior being governed. You could hand the exact same underlying model two repos — one with this structure, one without — and get two completely different collaborators. One you’d trust with a real task. One you’d babysit.

The model is the talent. The discipline is what makes the talent reliable.

Why this matters more as agents get better, not less

There’s a comfortable assumption that all of this is scaffolding we’ll throw away once models improve. I think the opposite is true. The more capable agents become, the more autonomy we hand them, and the more autonomy we grant, the more a lack of discipline costs us. A junior who makes an unproven change to one file is a small problem. An autonomous agent making unproven changes across a whole system, fast, is a very large one.

Capability and discipline aren’t competitors. Capability is the engine. Discipline is the steering and the brakes. Nobody celebrates a faster engine in a car with no brakes.

This is the same thing I spend my days on at enterprise scale — except there the stakes wear different clothes. “Did the agent write a test” becomes “can this AI system pass an audit.” “Did it plan before coding” becomes “can we govern this before it touches a regulated workload.” The vocabulary changes. The underlying truth doesn’t: the hard part of AI was never making it capable. It’s making it accountable.

So before you wait for the next model, ask a cheaper question first: have you actually trained the agent you already have? Most people haven’t. They’ve handed a capable system an open-ended task and called the resulting chaos a limitation of the technology.

It isn’t. It’s raw talent with no dojo. Give it structure, and watch what the same model can do.

Copilot Agents Dojo

The open-source framework this essay is built on. 22 skills, a mandatory workflow, MIT-licensed.
