AI Incident Postmortem Generator

Structured postmortem template for AI outages and cost incidents — blameless format.

100% client-side · exportable output · zero network calls

Live postmortem preview

# Postmortem: Runaway CI agent loop — $2,300 cost spike

**Date:** 2026-06-09 · **Duration:** 3h 25m · **Severity:** SEV2

> This is a blameless postmortem. It names systems, conditions and missing controls — not people at fault. The goal is that this class of incident cannot recur, not that someone is held responsible.

## Summary & impact

$2,300 unplanned API spend; CI queue degraded for ~2 hours; no customer impact.

## Timeline

| Time | Event |
|---|---|
| 14:05 | CI agent begins retrying a failing integration test, regenerating full context each attempt. |
| 16:40 | Developer notices unusually slow CI queue; agent has consumed ~$1,900 in API calls. |
| 16:55 | Agent job killed; API key rotated as a precaution. |
| 17:30 | Vendor dashboard confirms $2,300 total spend for the session; incident declared. |

## Root cause

No per-developer spend alerting or daily cap existed, so a retry loop that should have been a $30 blip ran unobserved for more than two and a half hours (14:05 to 16:40). The retry loop itself was the trigger; the missing control is the root cause.

## Contributing factors

- The CI agent harness had no max-retry limit.
- The failing test was flaky, so each retry looked plausible to the agent.
- Spend visibility was a monthly invoice, not a real-time signal.

## Action items

| Action | Owner | Due |
|---|---|---|
| Enable per-developer daily spend caps with alerts at 80% | J. Park | 2026-06-30 |
| Add max-retry limit (3) to the CI agent harness | A. Chen | 2026-06-20 |
| Document the kill procedure for runaway agent jobs | M. Osei | 2026-06-25 |

## Lessons learned

- Detection gaps, not triggers, set the size of cost incidents — real-time spend visibility bounds the blast radius.
- Action items above are reviewed at the next monthly spend report; open items past due are escalated to the budget owner.

---
*Generated with the FORG AI Incident Postmortem Generator (forg.pro/tools/incident-postmortem).*
Markdown export, no lock-in · 100% generated locally · 0 signup walls · 0 network requests per keystroke

How it works

This generator produces a complete, blameless postmortem document for AI incidents — cost spikes, provider outages, runaway agents — from a structured form. Fill in the incident basics, build the timeline row by row, state the root cause and contributing factors, and assign owned action items; the markdown document assembles live on the right, ready to copy or download into your incident repository. The form arrives prefilled with a realistic agent-loop cost spike so you can see the target shape before replacing it with your own incident. Nothing you enter leaves your browser.
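To make the form-to-document step concrete, here is a minimal Python sketch of how structured fields could be assembled into the markdown layout shown in the preview. It is not the tool's actual code: the class names, field names, and the `render_markdown` function are all hypothetical, and the blameless notice and footer are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class TimelineEvent:
    time: str   # "HH:MM", as entered in the form
    event: str


@dataclass
class ActionItem:
    action: str
    owner: str
    due: str    # ISO date string, e.g. "2026-06-20"


@dataclass
class Postmortem:
    title: str
    date: str
    duration: str
    severity: str
    summary: str
    root_cause: str
    timeline: list = field(default_factory=list)
    contributing_factors: list = field(default_factory=list)
    action_items: list = field(default_factory=list)
    lessons_learned: list = field(default_factory=list)


def cell(text: str) -> str:
    """Escape pipes so free text cannot break a markdown table row."""
    return text.replace("|", "\\|")


def render_markdown(pm: Postmortem) -> str:
    """Assemble the sections in the same order as the preview."""
    lines = [
        f"# Postmortem: {pm.title}",
        "",
        f"**Date:** {pm.date} · **Duration:** {pm.duration} · **Severity:** {pm.severity}",
        "",
        "## Summary & impact",
        "",
        pm.summary,
        "",
        "## Timeline",
        "",
        "| Time | Event |",
        "|---|---|",
    ]
    lines += [f"| {cell(e.time)} | {cell(e.event)} |" for e in pm.timeline]
    lines += ["", "## Root cause", "", pm.root_cause, ""]
    lines += ["## Contributing factors", ""]
    lines += [f"- {factor}" for factor in pm.contributing_factors]
    lines += ["", "## Action items", "", "| Action | Owner | Due |", "|---|---|---|"]
    lines += [
        f"| {cell(a.action)} | {cell(a.owner)} | {cell(a.due)} |"
        for a in pm.action_items
    ]
    lines += ["", "## Lessons learned", ""]
    lines += [f"- {lesson}" for lesson in pm.lessons_learned]
    return "\n".join(lines) + "\n"
```

Keeping the incident as plain data and rendering it in one pure function is what makes a live preview cheap: every keystroke can re-render the whole document without any network call.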

The structure is the standard one that makes postmortems comparable across incidents: summary and impact up front, a timestamped timeline, root cause separated from contributing factors, and action items with owners and due dates. That separation matters more for AI incidents than most, because the trigger (an agent retrying a failing test) is almost never the root cause (no spend alerting existed to catch it). The document template keeps asking you for the layer beneath the trigger, which is where the fixes that prevent recurrence actually live.

Blamelessness is enforced by framing, not by euphemism. The template names systems and missing controls — "no per-developer cap bounded the session" — rather than people at fault, because the engineer closest to the incident writes a precise timeline only when honesty is safe. Impact is captured in concrete terms: dollars for cost incidents, duration and affected scope for outages. Severity is declared explicitly so the ceremony matches the stakes, from async write-up to synchronous review.

One pattern you will likely recognize in your own first AI postmortem: the detection gap is nearly always the largest contributing factor. Cost incidents run for hours because nothing was watching per-developer spend in real time; the fix is alerting and hard caps, which is the control layer FORG provides out of the box. Write the postmortem with this tool, then make its most common action item — "we need spend alerts before this happens again" — true by the due date.
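As an illustration of what "alerting and hard caps" can mean in practice, the following Python sketch tracks cumulative daily spend for one developer, signals once at a warning threshold, and reports when the cap is reached. It is a simplified assumption, not FORG's implementation or API: `SpendGuard`, its fields, and the returned status strings are hypothetical, and a real system would also reset the counter each day and persist state.

```python
from dataclasses import dataclass


@dataclass
class SpendGuard:
    daily_cap_usd: float
    alert_fraction: float = 0.8   # warn at 80% of the cap
    spent_usd: float = 0.0
    alerted: bool = False

    def record(self, cost_usd: float) -> str:
        """Add one call's cost and return 'ok', 'alert', or 'blocked'."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.daily_cap_usd:
            # Cap reached: the caller should stop issuing further calls.
            return "blocked"
        threshold = self.alert_fraction * self.daily_cap_usd
        if not self.alerted and self.spent_usd >= threshold:
            # Fire the warning only once per day.
            self.alerted = True
            return "alert"
        return "ok"
```

With a hypothetical $30 daily cap, the alert fires when cumulative spend reaches $24 and the guard reports `blocked` at $30. Note that this sketch checks after recording, so the call that crosses the cap has already been paid for; a stricter variant would estimate cost and refuse before the call is made.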

Frequently asked questions

What makes a postmortem blameless, and why does it matter?

Blameless means the analysis names systems, conditions and decisions-in-context rather than people at fault. The practical reason is information quality: the engineer who knows exactly what happened will write a precise timeline if honesty is safe, and a defensive one if it is not. A postmortem culture that punishes the person closest to the incident gets vague timelines, late reports and repeat incidents. The document should read like an engineering analysis, never like a verdict.

Do AI cost spikes really deserve a full postmortem like outages do?

Yes, and the discipline pays off fast. A runaway agent loop that burns four figures in an afternoon has all the anatomy of an outage: a trigger, a detection gap, a response, and systemic causes — usually missing budget caps and alerting. Teams that postmortem their first cost spike typically fix the detection gap immediately; teams that treat it as an embarrassing one-off get the same spike again with a larger bill, because nothing structural changed.

How do I find the actual root cause instead of stopping at the trigger?

Keep asking why past the first satisfying answer. 'An agent looped on a failing test' is a trigger, not a root cause. Why could it loop for more than two and a half hours unobserved? Because no per-developer spend alerting existed. Why not? Because budget controls were planned for next quarter. The root cause is almost always a missing control or a process gap, not the keystroke that tripped it. The contributing-factors section exists because real incidents have several causes that aligned, and fixing only one leaves the others loaded.

What separates action items that get done from ones that rot in the doc?

Three properties: a named owner (a person, not a team), a due date, and a verifiable definition of done. 'Improve monitoring' rots; 'enable per-developer daily spend caps with alerts at 80%, owner J. Park, due June 30' gets done or visibly slips. Cap the list at five items, because a postmortem that produces fifteen actions usually completes none of them, and review open items at the next incident or the monthly spend report, whichever comes first.
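The three properties and the five-item cap are mechanical enough to check automatically. The Python sketch below is one hypothetical way to do that; the `ActionItem` fields (including `done_when` for the definition of done) and both function names are assumptions for illustration, not part of the generator.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

MAX_ITEMS = 5  # the cap suggested in the answer above


@dataclass
class ActionItem:
    action: str
    owner: str                 # a person's name, not a team
    due: Optional[date]
    done_when: str = ""        # verifiable definition of done


def lint_action_items(items: List[ActionItem]) -> List[str]:
    """Return a human-readable problem for each missing property."""
    problems = []
    if len(items) > MAX_ITEMS:
        problems.append(f"{len(items)} items listed; cap the list at {MAX_ITEMS}")
    for item in items:
        if not item.owner.strip():
            problems.append(f"'{item.action}': no named owner")
        if item.due is None:
            problems.append(f"'{item.action}': no due date")
        if not item.done_when.strip():
            problems.append(f"'{item.action}': no verifiable definition of done")
    return problems


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Items whose due date has passed, for the periodic review."""
    return [i for i in items if i.due is not None and i.due < today]
```

Running `lint_action_items` when the postmortem is published and `overdue` at each monthly spend review is one way to make slipping items visible rather than silent.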

Who should write the postmortem and when?

The person closest to the incident drafts it within 48 hours while the timeline is still reconstructible from memory and logs, then the team reviews it together in a short meeting focused on the action items. Severity drives ceremony: a SEV3 cost blip can be a fifteen-minute async write-up, while a SEV1 outage or five-figure spike deserves a synchronous review with the budget owner present. Publish internally where the next team can find it — postmortems compound only if they are read.

FORG tracks this automatically across every agent session — live cost attribution, budgets, and alerts.
