Keeping Agents on Rails

An agent that can act on its own can also fail on its own. Guardrails, human checkpoints, and stopping conditions are what make autonomy safe to ship, and Anthropic is direct about why.

Last updated · June 30, 2026

Worse, an agent fails at machine speed, while you sleep. Rails are what make that power safe enough to ship.

The first lesson warned that agents are unpredictable by nature: they decide their own steps. The danger isn't that an agent is malicious; it's that it acts fast, repeatedly, and without a human in the moment. A mistake a person makes once, an agent can make four hundred times before anyone notices. The modern bar isn't “stop the agent from ever erring”; it's to make every mistake observable, containable, and recoverable. Three rails do that, and serious automation uses all three at once.

Guardrails: limit what it can do. Don't rely on the agent choosing well; remove the dangerous choice. This is least privilege from the auth lesson, applied to an agent: grant only the tools the job needs, scope each one tightly, and cap every action (a refund agent that cannot issue more than £50). Keep instructions and untrusted data separate, so a poisoned input can't rewrite the goal.
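
A minimal sketch of a guardrail enforced outside the agent, assuming a single hypothetical tool named issue_refund. The names GuardrailError, RefundTool, ALLOWED_TOOLS, and call_tool are invented for illustration; only the £50 cap comes from the example above. The cap and the allowlist live in the tool layer, so they hold even if the agent asks for something else.

```python
from dataclasses import dataclass
from decimal import Decimal


class GuardrailError(Exception):
    """Raised when the agent requests something outside its granted scope."""


@dataclass(frozen=True)
class RefundTool:
    # Hypothetical cap, mirroring the £50 example in the text.
    max_amount: Decimal = Decimal("50")

    def __call__(self, order_id: str, amount: Decimal) -> str:
        if amount <= 0 or amount > self.max_amount:
            raise GuardrailError(f"refund of £{amount} is outside the allowed range")
        # A real tool would call the payment system here; this sketch only reports.
        return f"refunded £{amount} on {order_id}"


# Allowlist: the only tools this agent can reach. Anything else is refused.
ALLOWED_TOOLS = {"issue_refund": RefundTool()}


def call_tool(name: str, **kwargs) -> str:
    """Dispatch a tool call requested by the agent, enforcing the allowlist."""
    tool = ALLOWED_TOOLS.get(name)
    if tool is None:
        raise GuardrailError(f"tool {name!r} is not granted to this agent")
    return tool(**kwargs)
```

With this in place, a request such as call_tool("issue_refund", order_id="A1", amount=Decimal("51")) raises GuardrailError instead of paying out, and so does any request for a tool that is not in ALLOWED_TOOLS.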

Human checkpoints: a person in the loop for the big moves. For anything consequential or hard to undo, the agent proposes and a human approves, much as Claude Code asks for permission before a risky edit. The agent does the work; a person owns the irreversible yes.
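
One way to sketch the propose-then-approve pattern, assuming actions can be flagged as irreversible up front. ProposedAction, run_with_checkpoint, and console_approver are hypothetical names, and the console prompt stands in for whatever approval channel a real system would use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProposedAction:
    description: str
    irreversible: bool


def run_with_checkpoint(
    action: ProposedAction,
    execute: Callable[[], str],
    approve: Callable[[ProposedAction], bool],
) -> str:
    """Run reversible actions directly; ask a human before irreversible ones."""
    if action.irreversible and not approve(action):
        return f"rejected: {action.description}"
    return execute()


def console_approver(action: ProposedAction) -> bool:
    # Hypothetical approval channel; a real system might use a ticket or chat prompt.
    answer = input(f"Approve '{action.description}'? [y/N] ")
    return answer.strip().lower() == "y"
```

The approver is passed in as a function so the same gate can be backed by a console prompt in development and a review queue in production. The default answer is no: anything other than an explicit "y" rejects the action.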

Stopping conditions: a hard end to the loop. Left uncapped, an agent's loop can run forever, burning money and making things worse. Cap it with a maximum number of iterations, a cost ceiling, and a time limit, plus a kill switch you can hit. When the agent reaches a cap with the task unresolved, it halts and escalates instead of spinning.
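
The four limits can be sketched as a bounded loop. This assumes each agent step is a function returning a result (or None if unresolved) and the cost of that step; Budget, run_bounded, and the default limits are illustrative, not taken from any particular framework.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Budget:
    max_iterations: int = 10
    max_cost: float = 1.00      # hypothetical currency units
    max_seconds: float = 60.0


def run_bounded(
    step: Callable[[], Tuple[Optional[str], float]],
    budget: Budget,
    killed: Callable[[], bool] = lambda: False,
) -> str:
    """Call step() until it returns a result or one of the limits is hit.

    step() returns (result_or_None, cost_of_this_step).
    """
    started = time.monotonic()
    spent = 0.0
    for _ in range(budget.max_iterations):
        if killed():                                   # kill switch
            return "halted: kill switch"
        if time.monotonic() - started > budget.max_seconds:
            return "escalated: time limit"
        result, cost = step()
        spent += cost
        if result is not None:
            return result
        if spent >= budget.max_cost:                   # cost ceiling
            return "escalated: cost ceiling"
    return "escalated: iteration cap"                  # loop ran out
```

The limits are checked between steps, so a single long-running step can still overshoot the time limit. A production version would likely also put a timeout on the step itself.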

A worked example: the refund agent

Say you let an agent handle refunds, with all three rails at once. A guardrail caps any refund the agent can issue on its own at £50; above that, it has no ability to pay out unaided. A checkpoint routes anything above £50 and up to £500 to a human for a yes. And a stopping condition says that if the agent loops more than ten times on one case without resolving it, it halts and escalates. Notice that none of these trust the agent to behave. Each assumes it might not, and makes the worst case small. That's the whole move.
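
The refund policy can be written down as a single routing function. This is a sketch under two assumptions the text leaves open: a refund of exactly £50 counts as within the agent's cap, and anything above £500 is refused outright. Route and route_refund are hypothetical names.

```python
from decimal import Decimal
from enum import Enum


class Route(Enum):
    AUTO = "agent issues refund"
    HUMAN = "human approval required"
    REFUSE = "outside policy"
    ESCALATE = "halt and escalate"


AUTO_LIMIT = Decimal("50")     # guardrail from the example
HUMAN_LIMIT = Decimal("500")   # upper bound of the checkpoint band
MAX_ATTEMPTS = 10              # stopping condition from the example


def route_refund(amount: Decimal, attempts: int) -> Route:
    """Decide what happens to a refund request on a given attempt."""
    if attempts > MAX_ATTEMPTS:
        return Route.ESCALATE
    if amount <= 0 or amount > HUMAN_LIMIT:
        # Assumption: the text does not say what happens above £500.
        return Route.REFUSE
    if amount <= AUTO_LIMIT:
        return Route.AUTO
    return Route.HUMAN
```

The stopping condition is checked first, so a case that has already looped too many times escalates regardless of the amount.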

Where it goes wrong

Trusting a good demo. The agent behaves beautifully in testing, so you give it real reach with no rails. Then it meets its first weird input in production and acts on it confidently and at scale. Rails aren't an insult to your build; they're the standard engineering answer to anything that acts autonomously. The more agency you grant, the more they matter.

Try this

Take an agent you'd actually deploy and answer the blunt question: what's the worst it could do in an hour if it went wrong and nobody was watching? Then add one rail of each kind to shrink that answer: a guardrail to cap the damage, a checkpoint on the irreversible step, and a stop condition on the loop. When the worst case is small and survivable, you're ready to ship. Until then, you're not.
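
A back-of-envelope way to put a number on the worst hour, assuming the damage is bounded by the number of actions times the per-action cap. The function name and the figures in the note below are illustrative only.

```python
from decimal import Decimal


def worst_case_per_hour(actions_per_hour: int, cap_per_action: Decimal) -> Decimal:
    """Upper bound on hourly damage if every action hits its cap."""
    return actions_per_hour * cap_per_action
```

For instance, worst_case_per_hour(400, Decimal("50")) gives 20000: four hundred refunds at the £50 cap would add up to £20,000 in that hour. If that number is not survivable, the per-action cap may need a companion limit on how many actions the agent can take.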

Grounded in Anthropic, Building Effective Agents, on guardrails and human oversight, and 2025–2026 agent-safety practice (least privilege, containment, kill switches).
