---
name: incident-runbook
description: >
  Steps for responding to a production incident. Use when an alert fires, a
  service is degraded/down, or the user asks for incident triage or a postmortem.
---

# Incident Runbook

## Triage (first 5 minutes)
1. Confirm scope: what's broken, who's affected, since when.
2. Declare severity; open the incident channel; assign an IC.
3. Stop the bleeding before root-causing (roll back, feature-flag off, scale).

## Stabilize
- Prefer the fastest safe mitigation; note every action with a timestamp.
- Communicate status on a fixed cadence to stakeholders.

## Resolve & learn
- Confirm recovery with the same signal that fired the alert.
- Write a blameless postmortem: timeline, root cause, contributing factors,
  action items with owners.

Keep service-specific dashboards / escalation contacts in references/ (tier-3).
