Chapter 07 · Automation & Agents

Knowing It's Working: Observability, Evals & Cost

The whole premise of automation is that it runs without you. That same premise demands you can see what it did, prove it's still right, and know what it costs — without watching.

What changed
  • New chapter — added in the June 2026 restructure.

Everything in this course has been about work that runs without you. The catch is the obvious one no one wants to face: if it runs without you, how do you know it's working?

An automation that fails loudly is the easy case — it pages you, you fix it. The one that quietly succeeds while producing garbage is the nightmare, because nothing tells you. A summariser that started dropping the most important line, an agent that's been “completing” tasks wrongly for a week: by the time a human notices downstream, the damage has compounded. Trusting something unattended means building the three things that let you see it without watching it.

| You watch | It catches | It misses |
|---|---|---|
| Logs | that something ran & errored | why, and across many steps |
| Traces | each step's cost, latency & tool call | whether the output was any good |
| Evals | quality & correctness on a test set | live, one-off production surprises |
| Alerts | a threshold crossed, right now | anything you didn't think to threshold |

The first move up from logs is traces: step-level visibility across a multi-step run, so you can see which tool call was slow, which one cost the most, and where an agent started looping. Logs tell you it ran; a trace tells you the story of how. The second is evals — and the key insight is to test full trajectories, not just final answers. For an agent, the right tool choice and the path it took matter as much as the output; a small reference set you run on every change is what stops a quiet regression from shipping. The third is cost & latency ceilings: a per-run budget and a P99 latency you alert on, plus loop detection so a thrashing agent trips a wire instead of a bill.
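To make "a trace tells you the story of how" concrete, here is a minimal sketch of step-level tracing in plain Python. Everything in it is an assumption for illustration: the `Trace` and `Step` classes, the step and tool names, and the cost figure are hypothetical, and a real setup would more likely hand this job to a tracing library than roll its own.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str                # hypothetical step label, e.g. "summarise"
    tool: str                # which tool the step called
    latency_s: float = 0.0
    cost_usd: float = 0.0


@dataclass
class Trace:
    run_id: str
    steps: list[Step] = field(default_factory=list)

    @contextmanager
    def step(self, name: str, tool: str):
        """Time one step; the caller sets cost_usd on the yielded Step."""
        record = Step(name=name, tool=tool)
        start = time.perf_counter()
        try:
            yield record
        finally:
            # Record the step even if the tool call raised.
            record.latency_s = time.perf_counter() - start
            self.steps.append(record)

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.steps)

    def slowest(self) -> Step:
        return max(self.steps, key=lambda s: s.latency_s)

    def costliest(self) -> Step:
        return max(self.steps, key=lambda s: s.cost_usd)


trace = Trace(run_id="demo-run")
with trace.step("fetch", tool="http_get") as s:
    s.cost_usd = 0.0
with trace.step("summarise", tool="llm_call") as s:
    time.sleep(0.01)         # stand-in for real work
    s.cost_usd = 0.004       # made-up figure for illustration

print(trace.slowest().name, trace.costliest().name, trace.total_cost())
```

The point of the shape is that each step carries its own cost, latency, and tool name, so "which call was slow" and "which call was expensive" become one-line questions rather than log archaeology.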

  • A trace on every run: step cost, latency, and tool calls, not just a final log line.
  • A small eval set on the trajectory: run it on every change, alert on a drop.
  • A per-run cost & latency ceiling, with loop detection that halts a thrashing run.
  • An alert keyed to outcome quality, not only to whether the process exited cleanly.
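The ceiling-and-loop-detection item on that checklist could be sketched roughly as follows. This is one possible shape, not a standard one: `RunGuard`, `RunHalted`, the thresholds, and the rule "the same tool called with identical arguments more than N times counts as a loop" are all assumptions chosen for illustration.

```python
import time
from collections import Counter


class RunHalted(Exception):
    """Raised when a run crosses a ceiling and should stop."""


class RunGuard:
    def __init__(self, max_cost_usd: float, max_seconds: float, max_repeats: int):
        self.max_cost_usd = max_cost_usd
        self.max_seconds = max_seconds
        self.max_repeats = max_repeats
        self.spent_usd = 0.0
        self.started = time.monotonic()
        self.seen = Counter()    # counts identical (tool, args) calls

    def check(self, tool: str, args: str, cost_usd: float) -> None:
        """Call once per step; raises RunHalted when any ceiling is crossed."""
        self.spent_usd += cost_usd
        self.seen[(tool, args)] += 1
        if self.spent_usd > self.max_cost_usd:
            raise RunHalted(f"cost ceiling exceeded: ${self.spent_usd:.2f}")
        if time.monotonic() - self.started > self.max_seconds:
            raise RunHalted("latency ceiling exceeded")
        if self.seen[(tool, args)] > self.max_repeats:
            raise RunHalted(f"loop suspected: {tool} repeated with identical arguments")


# A thrashing agent that keeps issuing the same search.
guard = RunGuard(max_cost_usd=0.50, max_seconds=60, max_repeats=3)
halted_reason = None
try:
    for _ in range(10):
        guard.check("search", "q=quarterly report", cost_usd=0.01)
except RunHalted as exc:
    halted_reason = str(exc)

print(halted_reason)
```

Here the guard is a hard stop inside a single run. A P99 latency alert is a different kind of check, computed across many runs, and is not shown.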

Where it goes wrong

Watching only the exit code. A run that “succeeded” — no error, clean exit — while quietly producing wrong output is the signature failure of unattended systems, and a green checkmark hides it perfectly. If you only alert on crashes, you're blind to the most expensive way automation fails: confidently, and on schedule.
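A rough sketch of what alerting on outcome rather than exit code might look like for the summariser example. The `RunResult` type, the `must_mention` terms, and the length threshold are hypothetical; real quality checks depend entirely on what "right" means for the automation in question.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    exit_code: int
    output: str


def outcome_problems(result: RunResult, must_mention: list[str], min_chars: int = 40) -> list[str]:
    """Return reasons to alert; an empty list means the run looks healthy."""
    problems = []
    if result.exit_code != 0:
        problems.append(f"process exited with code {result.exit_code}")
    if len(result.output.strip()) < min_chars:
        problems.append("output is suspiciously short")
    lowered = result.output.lower()
    for term in must_mention:
        if term.lower() not in lowered:
            problems.append(f"output never mentions {term!r}")
    return problems


# A "green" run that dropped the most important line.
quiet_failure = RunResult(
    exit_code=0,
    output="The team met on Tuesday and discussed several items of general interest.",
)
problems = outcome_problems(quiet_failure, must_mention=["deadline"])
print(problems)
```

An exit-code-only monitor would pass this run. The outcome check flags it because the one term the summary had to carry is missing.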

Try this

Take one automation you already trust and ask: if its output started being subtly wrong — not crashing, just wrong — how long until I'd know? If the honest answer is “a while,” you've found the gap. Add one eval on the thing that matters and one alert on outcome, not exit code. That's the difference between automation you hope is working and automation you can prove is.
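As a starting point for that "one eval", here is a minimal sketch of a trajectory check over a tiny reference set. The `Case` shape, the tool names, the `fake_agent` stand-in, and the 0.05 tolerance are all assumptions; exact-match on the tool path is also a deliberately strict choice that a real eval might relax.

```python
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected_tools: list[str]    # the tool path a correct run should take
    expected_answer: str


def score_case(case: Case, tools_used: list[str], answer: str) -> bool:
    """Pass only if both the path and the final answer match the reference."""
    return tools_used == case.expected_tools and answer.strip() == case.expected_answer


def pass_rate(cases, run_agent) -> float:
    passed = 0
    for case in cases:
        tools_used, answer = run_agent(case.prompt)
        passed += score_case(case, tools_used, answer)
    return passed / len(cases)


def regressed(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True when the pass rate has dropped by more than the tolerance."""
    return current < baseline - tolerance


def fake_agent(prompt: str):
    # Hypothetical stand-in: right final answer both times, wrong path for refunds.
    if "refund" in prompt:
        return ["search_docs"], "Refund issued"
    return ["lookup_order"], "Order shipped"


cases = [
    Case("Where is order 17?", ["lookup_order"], "Order shipped"),
    Case("Process a refund for order 17", ["lookup_order", "issue_refund"], "Refund issued"),
]
rate = pass_rate(cases, fake_agent)
alert = regressed(rate, baseline=1.0)
print(rate, alert)
```

The refund case is the interesting one: the final answer matches, so an answer-only eval would pass it, but the agent never called the tool that actually issues the refund. Scoring the path is what surfaces that.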

Grounded in current agent-observability and evaluation practice — step-level traces, trajectory evals, and cost/latency monitoring (LangSmith, OpenTelemetry, and the 2026 guardrails playbooks).
