Durable Workflows: Surviving Failure

Real automation runs in a world that fails — networks drop, APIs time out, machines restart. Durable workflow engines are built so a job can fall over and pick up exactly where it left off.

Last updated · June 30, 2026

Real automation runs in a world that fails. Networks drop, APIs time out, machines restart. The question isn't whether your job falls over — it's whether it can stand back up where it fell.

Picture a four-step job: charge the card, reserve the stock, book the courier, send the email. Now picture the courier API timing out at step three. A naive script has no memory of how far it got — so when you re-run it, it starts from step one and charges the card again. The failure wasn't the timeout; it was that recovery re-did work that should never repeat. In automation, a crash halfway through isn't an inconvenience. It's a double charge waiting to happen.

A durable workflow engine removes exactly this danger. After each step completes, it persists the result durably, outside the running process. If the process dies, the engine reads back what is already done and resumes at the next unfinished step. Underneath, every step lives in a small state machine, and that machine is what makes “run exactly once” possible.
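The persist-and-resume loop can be sketched in a few lines. This is a minimal illustration, not how any particular engine is implemented: `run_workflow`, the JSON checkpoint file, and the four stub steps are all hypothetical names chosen to mirror the order example above.

```python
import json
import os
import tempfile


def run_workflow(steps, checkpoint_path):
    """Run (name, fn) steps in order, skipping any already recorded as done."""
    # Load results persisted by an earlier, possibly crashed, run.
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)

    for name, fn in steps:
        if name in done:
            continue  # finished in a previous run: do not execute again
        done[name] = fn()
        # Persist outside the process before moving on. Write-then-rename
        # avoids leaving a half-written checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(done, f)
        os.replace(tmp, checkpoint_path)
    return done


# Hypothetical steps; the courier call times out on its first attempt only.
charges = []
attempts = {"courier": 0}


def charge_card():
    charges.append("charge")
    return "charged"


def reserve_stock():
    return "reserved"


def book_courier():
    attempts["courier"] += 1
    if attempts["courier"] == 1:
        raise TimeoutError("courier API timed out")
    return "booked"


def send_email():
    return "sent"


STEPS = [
    ("charge_card", charge_card),
    ("reserve_stock", reserve_stock),
    ("book_courier", book_courier),
    ("send_email", send_email),
]

path = os.path.join(tempfile.mkdtemp(), "order-1001.json")
try:
    run_workflow(STEPS, path)      # first run dies at step three
except TimeoutError:
    pass
final = run_workflow(STEPS, path)  # second run resumes at step three
```

The second call never reaches `charge_card`, because its result was already on disk. A local file stands in here for the database a real engine would use, and the sketch still leaves a small window between a step finishing and its result being written.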

Each step is a tiny state machine. Success is recorded and skipped on replay; failure backs off and retries; only when retries run out does it land in the dead-letter queue for a human. The card is never charged twice.
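As a rough sketch of that per-step state machine, assuming three states and exponential backoff (the `Step` class, state names, and default limits are illustrative, not taken from any engine's API):

```python
import time
from enum import Enum


class State(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    DEAD_LETTER = "dead_letter"


class Step:
    def __init__(self, name, fn, max_attempts=3, base_delay=0.5):
        self.name = name
        self.fn = fn
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.state = State.PENDING
        self.attempts = 0
        self.result = None
        self.last_error = None

    def run(self, sleep=time.sleep):
        # Terminal states are never re-executed: success is skipped on
        # replay, and a dead-lettered step waits for a human.
        if self.state is not State.PENDING:
            return self.state
        while self.attempts < self.max_attempts:
            self.attempts += 1
            try:
                self.result = self.fn()
                self.state = State.SUCCEEDED
                return self.state
            except Exception as exc:
                self.last_error = exc
                if self.attempts < self.max_attempts:
                    # Exponential backoff: base, 2x base, 4x base, ...
                    sleep(self.base_delay * 2 ** (self.attempts - 1))
        self.state = State.DEAD_LETTER
        return self.state


# A courier call that fails twice, then succeeds.
calls = {"n": 0}


def flaky_courier():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("courier API timed out")
    return "booked"


def always_down():
    raise TimeoutError("courier API is down")


delays = []  # record the backoff delays instead of actually sleeping
step = Step("book_courier", flaky_courier)
step.run(sleep=delays.append)  # retries until the third attempt succeeds
step.run(sleep=delays.append)  # replay: already succeeded, so skipped

dead = Step("book_courier", always_down, max_attempts=2)
dead.run(sleep=delays.append)  # retries run out: lands in dead-letter
```

The state here lives in memory to keep the sketch short; a durable engine would persist it so the same transitions survive a restart.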

Idempotency: the rule that makes retries safe

There's one catch the state machine hides. Even a recorded step can repeat: if the process dies after the card is charged but before the result is persisted, the engine has no record of it and runs the step again. A retry only stays safe if repeating a step has no extra effect, and “charge the card” very much does. The fix is an idempotency key: each side-effecting step carries a unique token tied to the workflow, and the downstream system promises that two calls with the same token charge once. With it, the engine can retry freely; without it, every retry is a fresh charge. Idempotency is what lets “run it again” be a safe instruction instead of a dangerous one.
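A small sketch of the idea, using an in-memory stand-in for the downstream system. `FakePaymentAPI` and `idempotency_key` are hypothetical; real payment providers expose their own way of passing such a key.

```python
class FakePaymentAPI:
    """Stand-in for a downstream service that honours idempotency keys."""

    def __init__(self):
        self.charges = []  # every charge actually made
        self._seen = {}    # idempotency key -> receipt of the first call

    def charge(self, amount_cents, idempotency_key):
        if idempotency_key in self._seen:
            # Same key as an earlier call: return the original receipt
            # instead of charging again.
            return self._seen[idempotency_key]
        receipt = {
            "id": f"ch_{len(self.charges) + 1}",
            "amount_cents": amount_cents,
        }
        self.charges.append(receipt)
        self._seen[idempotency_key] = receipt
        return receipt


def idempotency_key(workflow_id, step_name):
    # Deterministic on purpose: a retry of the same step in the same
    # workflow must produce the same key. A random key per call would
    # defeat the point.
    return f"{workflow_id}:{step_name}"


api = FakePaymentAPI()
key = idempotency_key("order-1001", "charge_card")

first = api.charge(4999, key)
retry = api.charge(4999, key)  # engine retries after a crash or timeout

# A different workflow gets a different key, so it is charged normally.
other = api.charge(4999, idempotency_key("order-1002", "charge_card"))
```

The key is derived from the workflow and the step, not generated fresh on each call, which is what lets the downstream system recognise a retry as the same request.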

You could try to build all of this by hand (a status column, a pile of “did I already do this?” checks, retry logic everywhere), but hand-rolled recovery code is where a startling share of production bugs live. This is why durable engines (Temporal, Inngest, Trigger.dev) exist as their own category: they make “resume exactly where you left off” the default, not something you bolt on and pray about. The craft course made the same point: prefer the proven, boring part for the solved problem.

Where it goes wrong

Assuming the happy path is the whole story. A script that works in the demo, where nothing fails, feels finished — then meets the real world, where the third call times out at 2am and the retry quietly bills a customer twice. Build for the interruption you can't see coming: in automation it isn't an edge case, it's Tuesday.

Try this

Take a multi-step automation you rely on and ask one question of each step: if the process died right after this, what happens when it runs again? Anywhere the answer is “it redoes something it shouldn't” — a charge, an email, an order — is a step that needs durability and an idempotency key. That list is the case for a durable engine, written in your own system's risks.

Grounded in the durable-execution model behind engines such as Temporal, Inngest, and Trigger.dev.
