Knowing It's Working: Observability, Evals & Cost

The whole premise of automation is that it runs without you. That same premise demands you can see what it did, prove it's still right, and know what it costs — without watching.

Last updated · June 30, 2026

What changed

June 30, 2026: New chapter, added in the June 2026 restructure.

Everything in this course has been about work that runs without you. The catch is the obvious one no one wants to face: if it runs without you, how do you know it's working?

An automation that fails loudly is the easy case — it pages you, you fix it. The one that quietly succeeds while producing garbage is the nightmare, because nothing tells you. A summariser that started dropping the most important line, an agent that's been “completing” tasks wrongly for a week: by the time a human notices downstream, the damage has compounded. Trusting something unattended means building the three things that let you see it without watching it.

You watch	It catches	It misses
Logs	that something ran & errored	why, and across many steps
Traces	each step's cost, latency & tool call	whether the output was any good
Evals	quality & correctness on a test set	live, one-off production surprises
Alerts	a threshold crossed, right now	anything you didn't think to threshold

The first move up from logs is traces: step-level visibility across a multi-step run, so you can see which tool call was slow, which one cost the most, and where an agent started looping. Logs tell you it ran; a trace tells you the story of how. The second is evals — and the key insight is to test full trajectories, not just final answers. For an agent, the right tool choice and the path it took matter as much as the output; a small reference set you run on every change is what stops a quiet regression from shipping. The third is cost & latency ceilings: a per-run budget and a P99 latency you alert on, plus loop detection so a thrashing agent trips a wire instead of a bill.
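To make the step from logs to traces concrete, here is a minimal sketch of a step-level trace in plain Python. It is not how LangSmith or OpenTelemetry work internally; the `Trace` and `Step` classes, the step names, and the dollar figures are all assumptions made up for illustration. The point is only that each step records its own latency and cost, so the slow or expensive one can be named afterwards.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str          # e.g. the tool that was called (hypothetical names below)
    latency_s: float   # wall-clock time spent in the step
    cost_usd: float    # whatever the step reports as its cost


@dataclass
class Trace:
    run_id: str
    steps: list[Step] = field(default_factory=list)

    @contextmanager
    def step(self, name: str):
        """Time one step; the caller records its cost on the yielded dict."""
        info = {"cost_usd": 0.0}
        start = time.perf_counter()
        try:
            yield info
        finally:
            # Recorded even if the step raises, so failures appear in the trace.
            elapsed = time.perf_counter() - start
            self.steps.append(Step(name, elapsed, info["cost_usd"]))

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.steps)

    def slowest(self) -> Step:
        return max(self.steps, key=lambda s: s.latency_s)

    def most_expensive(self) -> Step:
        return max(self.steps, key=lambda s: s.cost_usd)


trace = Trace(run_id="demo-run")
with trace.step("search_docs") as info:
    info["cost_usd"] = 0.002          # assumed cost, for illustration only
with trace.step("summarise") as info:
    time.sleep(0.01)                  # stand-in for a slow model call
    info["cost_usd"] = 0.015

for s in trace.steps:
    print(f"{s.name:12s} {s.latency_s * 1000:7.1f} ms  ${s.cost_usd:.3f}")
print("most expensive:", trace.most_expensive().name)
```

In a real system the same shape would more likely be emitted as spans to a tracing backend; the sketch keeps everything in memory so the idea stays visible.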

✓ A trace on every run: step cost, latency, and tool calls, not just a final log line.

✓ A small eval set on the trajectory: run it on every change, alert on a drop.

✓ A per-run cost & latency ceiling, with loop detection that halts a thrashing run.

✓ An alert keyed to outcome quality, not only to whether the process exited cleanly.
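The cost and latency ceiling with loop detection can be sketched as a small guard that every step passes through. This is one possible shape, not a standard API: `RunGuard`, `RunHalted`, the limits, and the rule that "the same tool with the same arguments more than N times counts as a loop" are all assumptions chosen for the example.

```python
import time
from collections import Counter


class RunHalted(Exception):
    """Raised when a run crosses a ceiling or appears to be looping."""


class RunGuard:
    def __init__(self, max_cost_usd: float, max_seconds: float, max_repeats: int):
        self.max_cost_usd = max_cost_usd
        self.max_seconds = max_seconds
        self.max_repeats = max_repeats
        self.spent_usd = 0.0
        self.started = time.monotonic()
        self.seen = Counter()

    def check(self, tool: str, args: str, cost_usd: float) -> None:
        """Call once per step; raises RunHalted instead of letting the run go on."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.max_cost_usd:
            raise RunHalted(f"cost ceiling exceeded: ${self.spent_usd:.2f}")
        if time.monotonic() - self.started > self.max_seconds:
            raise RunHalted("latency ceiling exceeded")
        # Assumed heuristic: identical tool + arguments repeated too often is a loop.
        self.seen[(tool, args)] += 1
        if self.seen[(tool, args)] > self.max_repeats:
            raise RunHalted(f"loop suspected: {tool}({args}) repeated")


guard = RunGuard(max_cost_usd=0.50, max_seconds=30.0, max_repeats=3)
halted_reason = None
try:
    for _ in range(10):  # a thrashing agent retrying the same call
        guard.check("search", "q=refund policy", cost_usd=0.01)
except RunHalted as exc:
    halted_reason = str(exc)
print(halted_reason)
```

Here the loop check fires on the fourth identical call, long before the cost ceiling would. Real agents often loop with slightly varying arguments, so an exact-match counter like this one is a floor, not a complete detector.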

Where it goes wrong

Watching only the exit code. A run that “succeeded” — no error, clean exit — while quietly producing wrong output is the signature failure of unattended systems, and a green checkmark hides it perfectly. If you only alert on crashes, you're blind to the most expensive way automation fails: confidently, and on schedule.
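A rough sketch of what alerting on outcome rather than exit code might look like, using the summariser case as the example. The `RunResult` shape, the `outcome_problems` function, and the specific checks (non-empty, shorter than the source, mentions required terms) are assumptions for illustration; a real check would be specific to whatever "right" means for the automation in question.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    exit_code: int
    summary: str
    source: str


def outcome_problems(result: RunResult, must_mention: list[str]) -> list[str]:
    """Return reasons to alert; an empty list means the outcome looks fine."""
    problems = []
    if result.exit_code != 0:
        problems.append(f"process exited with code {result.exit_code}")
    if not result.summary.strip():
        problems.append("summary is empty")
    elif len(result.summary) >= len(result.source):
        problems.append("summary is not shorter than the source")
    for term in must_mention:
        if term.lower() not in result.summary.lower():
            problems.append(f"summary never mentions {term!r}")
    return problems


# A clean exit that dropped the most important line of the source.
result = RunResult(
    exit_code=0,
    summary="The team met on Tuesday and discussed the roadmap.",
    source="The team met on Tuesday and discussed the roadmap. "
           "The launch is delayed to Q3 because the security audit failed.",
)
problems = outcome_problems(result, must_mention=["delayed", "audit"])
for p in problems:
    print("ALERT:", p)
```

The exit code is 0, so a crash-only alert stays silent, while the outcome check raises two alerts about the missing content.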

Try this

Take one automation you already trust and ask: if its output started being subtly wrong — not crashing, just wrong — how long until I'd know? If the honest answer is “a while,” you've found the gap. Add one eval on the thing that matters and one alert on outcome, not exit code. That's the difference between automation you hope is working and automation you can prove is.
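As a starting point for that one eval, here is a minimal sketch that scores the trajectory as well as the final answer. Everything in it is hypothetical: `fake_agent` stands in for a real agent, the tool names and cases are invented, and exact-match comparison is the simplest possible grader, not a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected_tools: list[str]   # the path we expect the agent to take
    expected_answer: str


def fake_agent(prompt: str) -> tuple[list[str], str]:
    """Hypothetical stand-in for a real agent: returns (tools called, answer)."""
    if "refund" in prompt:
        return ["lookup_order", "issue_refund"], "refund issued"
    # Right answer, wrong path: it skipped the lookup step.
    return ["send_email"], "email sent"


def run_evals(agent, cases: list[Case]) -> float:
    """Score each case on trajectory AND answer; return the pass rate."""
    passed = 0
    for case in cases:
        tools, answer = agent(case.prompt)
        ok = tools == case.expected_tools and answer == case.expected_answer
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}  {case.prompt!r}  tools={tools}")
    return passed / len(cases)


cases = [
    Case("refund order 123", ["lookup_order", "issue_refund"], "refund issued"),
    Case("email customer 7", ["lookup_customer", "send_email"], "email sent"),
]
pass_rate = run_evals(fake_agent, cases)
BASELINE = 1.0   # assumed pass rate from the last known-good version
if pass_rate < BASELINE:
    print(f"REGRESSION: pass rate {pass_rate:.0%} is below baseline {BASELINE:.0%}")
```

The second case returns the expected answer but fails anyway because the agent skipped a step, which is exactly the kind of regression a final-answer check would miss.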

Grounded in current agent-observability and evaluation practice — step-level traces, trajectory evals, and cost/latency monitoring (LangSmith, OpenTelemetry, and the 2026 guardrails playbooks).
