
Evaluation & observability

You can't improve what you can't measure, and you can't trust what you can't see. Evals and traces are the difference between hoping an agent works and knowing it does.

A demo proves it can work once

The hardest question in agentic AI isn’t “can I build it” — it’s “how do I know it works, and how would I know if it stopped.” A demo proves an agent can succeed once, on a path you chose. A system proves it succeeds on paths you didn’t, and tells you the moment it regresses. The bridge between the two is measurement.

Evaluation: the numbers that tell the truth

Observability: seeing what the agent actually did

Why the two are one discipline

Evaluation tells you whether it works. Observability tells you why, and whether it still does. Apart, each is half a system: evals without traces can’t explain a failure; traces without evals can’t tell you a failure happened. Together they’re the feedback loop that lets an agent improve safely instead of drifting quietly.
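A minimal sketch of that loop in Python, under stated assumptions: `toy_agent`, `Trace`, `EvalResult`, `run_evals`, and the two cases are hypothetical names invented for illustration, and exact-match scoring stands in for whatever grader a real suite would use. The point is only the shape: each eval case keeps its trace, so a failing score arrives with the steps that explain it.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trace:
    """Ordered record of what the agent did for one input (hypothetical)."""
    steps: list[str] = field(default_factory=list)

    def log(self, step: str) -> None:
        self.steps.append(step)


@dataclass
class EvalResult:
    case_id: str
    passed: bool
    trace: Trace  # kept next to the score so a failure can be explained


def run_evals(
    agent: Callable[[str, Trace], str],
    cases: dict[str, tuple[str, str]],
) -> list[EvalResult]:
    """Run every case, scoring by exact match and keeping its trace."""
    results = []
    for case_id, (prompt, expected) in cases.items():
        trace = Trace()
        output = agent(prompt, trace)
        results.append(EvalResult(case_id, output == expected, trace))
    return results


def toy_agent(prompt: str, trace: Trace) -> str:
    """Stand-in agent: handles refunds, falls through on everything else."""
    trace.log(f"received: {prompt}")
    if "refund" in prompt:
        trace.log("tool: lookup_order")
        return "refund_issued"
    trace.log("no tool matched; fell back to default")
    return "unknown"


# Illustrative cases: prompt and expected output per case id.
cases = {
    "refund-basic": ("please refund order 17", "refund_issued"),
    "cancel-basic": ("cancel my subscription", "cancelled"),
}

results = run_evals(toy_agent, cases)
pass_rate = sum(r.passed for r in results) / len(results)
print(f"pass rate: {pass_rate:.0%}")

# The eval says *that* a case failed; the trace says *why*.
for r in results:
    if not r.passed:
        print(r.case_id, "->", r.trace.steps)
```

Here the pass rate reports that one case failed, and the attached trace shows the agent never matched a tool for it. Either signal alone would leave half the question open.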

I lean on established measures where they fit (DORA for software delivery performance, SPACE for developer productivity), but the agent-specific work is building the eval suites and trace infrastructure that turn “it seemed to work” into a number you can defend.
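One way to make “a number you can defend” concrete is to report a pass rate with an interval instead of a bare percentage. The sketch below uses a Wilson score interval. That is one common choice, assumed here purely for illustration and not something DORA or SPACE prescribes. The function name and the 47-of-50 figures are made up.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passes <= total:
        raise ValueError("passes must be between 0 and total")
    p = passes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical eval run: 47 of 50 cases passed.
low, high = wilson_interval(47, 50)
print(f"pass rate 94%, interval {low:.1%} to {high:.1%}")
```

The interval narrows as the suite grows, which is itself an argument for more eval cases: the same 94% on 500 cases supports a much stronger claim than on 50.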

What it delivers

Confidence you can act on. When the evals are green and the traces are clean, you ship. When they’re not, you know precisely where to look. That’s the discipline that separates a demo you hope holds up from a system you know does.