Chapter 14: 24/7 Production SRE
Learning Objectives
By the end of this chapter, you will be able to:
- Assign incident roles: Incident Commander, Primary Responder, Communications Lead, Scribe
- Apply the 4-tier severity matrix for triage decisions
- Conduct a blameless postmortem focused on system gaps rather than individual mistakes
- Convert incident findings into concrete hardening tasks
Start with the video for the concept overview, then work through each lesson section.
Recall Check
Before continuing, quickly recall:
- What guardrails must be in place before any chaos drill? (Chapter 12)
- Why can the AI guardian enrich and route alerts but never execute mutations? (Chapter 13)
- Which three signals must agree before taking action during an incident? (Chapter 10)
If you can’t answer these, revisit the corresponding chapter before proceeding.
Tooling is useless without operational discipline. When a major incident happens at 3:00 AM, the difference between a 15-minute fix and a 4-hour outage is how your team coordinates. In this final core chapter, we implement the SRE operating model to scale beyond individual heroism.
1. The Problem: “Coordination Chaos”
A critical alert (Sev1) fires outside business hours. Responders join, but roles are unclear and communication is fragmented. Uncoordinated technical actions race each other, creating further drift and uncertainty. Time is lost on coordination instead of restoration.
2. The Concept: Incident Roles & Severity
We treat an incident like a structured mission with clear ownership:
- Incident Commander (IC): Strategy and resource management.
- Primary Responder: Technical execution and verification.
- Communications Lead: Stakeholder updates and status reporting.
- Scribe: Timeline and evidence logging.
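The role list above can be sketched as a small data structure that flags any role left without an owner when an incident is declared. This is a minimal illustration, not part of the sre/ repo: the `Role` enum, the `unfilled_roles` helper, and the on-call names are all assumptions made for the example.

```python
from enum import Enum


class Role(Enum):
    # Responsibilities are taken from the role list above.
    INCIDENT_COMMANDER = "Strategy and resource management"
    PRIMARY_RESPONDER = "Technical execution and verification"
    COMMUNICATIONS_LEAD = "Stakeholder updates and status reporting"
    SCRIBE = "Timeline and evidence logging"


def unfilled_roles(assignments: dict) -> list:
    """Return the roles that have no named owner yet, in declaration order."""
    return [role for role in Role if not assignments.get(role, "").strip()]


# Hypothetical on-call names, for illustration only.
roster = {
    Role.INCIDENT_COMMANDER: "alice",
    Role.PRIMARY_RESPONDER: "bob",
    Role.SCRIBE: "",  # an empty name counts as unassigned
}

missing = unfilled_roles(roster)
print([role.name for role in missing])  # ['COMMUNICATIONS_LEAD', 'SCRIBE']
```

A check like this could run at declaration time so that an incident never proceeds with an ownerless role.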
3. The Code: Runbooks & Severity Matrix
Our sre/ repo includes the organizational templates required for high-stakes response. We use a 4-tier severity matrix to set communication expectations and response targets.
| Severity | Impact | Update Cadence |
|---|---|---|
| Sev0 | Complete Outage | Every 15 minutes |
| Sev1 | Major Degradation | Every 30 minutes |
| Sev2 | Partial Failure | Every 2 hours |
| Sev3 | Minor Defect | During working hours |
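
To make the update cadence actionable, the matrix can be encoded as data and used to compute when the next stakeholder update is due. The sketch below assumes hypothetical names (`UPDATE_CADENCE`, `next_update_due`) and treats Sev3 as having no fixed interval, since the matrix only says it is handled during working hours.

```python
from datetime import datetime, timedelta
from typing import Optional

# Cadences copied from the severity matrix. None marks Sev3, which has no
# fixed interval (assumption: it is simply handled during working hours).
UPDATE_CADENCE = {
    "Sev0": timedelta(minutes=15),
    "Sev1": timedelta(minutes=30),
    "Sev2": timedelta(hours=2),
    "Sev3": None,
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next stakeholder update is due, or None for Sev3."""
    cadence = UPDATE_CADENCE[severity]  # raises KeyError for an unknown severity
    return None if cadence is None else last_update + cadence


last = datetime(2024, 1, 1, 3, 0)  # hypothetical update sent at 3:00 AM
print(next_update_due("Sev1", last))  # 2024-01-01 03:30:00
print(next_update_due("Sev3", last))  # None
```

The Communications Lead could use a helper like this to set a reminder right after each update goes out.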
4. The Guardrail: Blameless Postmortems
The most important SRE guardrail is the blameless postmortem. We focus on system conditions and guardrail gaps rather than individual mistakes. Every major incident must result in specific, evidence-backed hardening actions.
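One way to enforce "specific, evidence-backed" is to check a draft postmortem before review. The sketch below is an assumption-laden illustration: the `ActionItem` and `Postmortem` classes, their fields, and the sample incident data are hypothetical and do not come from the sre/ templates. Note that an action item records a guardrail gap, not a person.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    description: str    # the concrete hardening task
    guardrail_gap: str  # the system condition or gap it closes
    evidence: list = field(default_factory=list)  # timeline entries, logs, graphs


@dataclass
class Postmortem:
    title: str
    action_items: list = field(default_factory=list)

    def problems(self) -> list:
        """List reasons this draft is not ready for review (empty means ready)."""
        if not self.action_items:
            return ["no hardening actions recorded"]
        return [
            f"action '{item.description}' has no supporting evidence"
            for item in self.action_items
            if not item.evidence
        ]


# Hypothetical draft from a drill, for illustration only.
draft = Postmortem(
    title="Hypothetical Sev1 drill",
    action_items=[
        ActionItem(
            "Add rollout health gate",
            "deploys proceed without health checks",
            evidence=["timeline entry 03:12"],
        ),
        ActionItem("Tighten alert routing", "alert reached no on-call owner"),
    ],
)
print(draft.problems())  # ["action 'Tighten alert routing' has no supporting evidence"]
```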
5. Verification: Did I Get It?
Run an incident simulation using your Chaos Monkey and follow the Safe Workflow:
# 1. Detect signal (Chapter 10)
# 2. Declare Severity and Assign Roles
# 3. Build Timeline and Mitigate
# 4. Confirm Recovery and Resolve
Expected Output: A clear, auditable timeline of the incident and a draft postmortem with at least one technical action item.
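The four workflow steps can also be captured in a Scribe's timeline so that the result is auditable. The following is a minimal sketch under stated assumptions: the `Phase` and `Timeline` names are hypothetical, and rejecting entries that move back to an earlier phase is a design choice made for this example rather than a rule from the chapter.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    DETECT = 1    # 1. Detect signal
    DECLARE = 2   # 2. Declare severity and assign roles
    MITIGATE = 3  # 3. Build timeline and mitigate
    RESOLVE = 4   # 4. Confirm recovery and resolve


@dataclass
class Timeline:
    entries: list = field(default_factory=list)  # (timestamp, phase, note) tuples

    def record(self, phase: Phase, note: str, at: Optional[datetime] = None) -> None:
        """Append an entry; reject one that moves back to an earlier phase."""
        if self.entries and phase < self.entries[-1][1]:
            raise ValueError(f"{phase.name} logged after {self.entries[-1][1].name}")
        self.entries.append((at or datetime.now(timezone.utc), phase, note))

    def is_complete(self) -> bool:
        """True once each of the four phases has at least one entry."""
        return {phase for _, phase, _ in self.entries} == set(Phase)


timeline = Timeline()
timeline.record(Phase.DETECT, "Signal detected")
timeline.record(Phase.DECLARE, "Severity declared, roles assigned")
timeline.record(Phase.MITIGATE, "Mitigation applied")
print(timeline.is_complete())  # False: recovery not yet confirmed
timeline.record(Phase.RESOLVE, "Recovery confirmed, incident resolved")
print(timeline.is_complete())  # True
```

The recorded entries can then seed the timeline section of the draft postmortem.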