Chapter 14: 24/7 Production SRE

Learning Objectives

By the end of this chapter, you will be able to:

Assign incident roles: Commander, Responder, Comms Lead, Scribe
Apply the 4-tier severity matrix for triage decisions
Conduct a blameless postmortem focused on system gaps, not individual blame
Convert incident findings into concrete hardening tasks

Start with the video for the concept overview, then work through each lesson section.

Recall Check

Before continuing, quickly recall:

What guardrails must be in place before any chaos drill? (Chapter 12)
Why can the AI guardian enrich and route alerts but never execute mutations? (Chapter 13)
Which three signals must agree before taking action during an incident? (Chapter 10)

If you can’t answer these, revisit the corresponding chapter before proceeding.

Tooling is useless without operational discipline. When a major incident happens at 3:00 AM, the difference between a 15-minute fix and a 4-hour outage is how your team coordinates. In this final core chapter, we implement the SRE operating model to scale beyond individual heroism.

1. The Problem: The “Coordination Chaos”

A critical alert (Sev1) fires outside of business hours. Responders join, but roles are unclear and communication is fragmented. Technical actions race each other, creating further drift and uncertainty. Time is lost on coordination instead of restoration.

2. The Concept: Incident Roles & Severity

We treat an incident like a structured mission with clear ownership:

Incident Commander (IC): Strategy and resource management.
Primary Responder: Technical execution and verification.
Communications Lead: Stakeholder updates and status reporting.
Scribe: Timeline and evidence logging.
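The four roles above can be sketched as a small roster that shows which seats are still empty when an incident is declared. This is only an illustration: the `RoleRoster` class, its methods, and the names "alice" and "bob" are assumptions, not part of the course repo.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    """The four incident roles named in this chapter."""
    INCIDENT_COMMANDER = "Incident Commander"
    PRIMARY_RESPONDER = "Primary Responder"
    COMMUNICATIONS_LEAD = "Communications Lead"
    SCRIBE = "Scribe"


@dataclass
class RoleRoster:
    """Hypothetical helper: tracks who holds each role for one incident."""
    assignments: dict = field(default_factory=dict)

    def assign(self, role: Role, person: str) -> None:
        # One person per role; reassigning replaces the previous holder.
        self.assignments[role] = person

    def unfilled(self) -> list:
        # Roles nobody has claimed yet, in declaration order.
        return [role for role in Role if role not in self.assignments]


roster = RoleRoster()
roster.assign(Role.INCIDENT_COMMANDER, "alice")  # placeholder names
roster.assign(Role.PRIMARY_RESPONDER, "bob")
print([role.value for role in roster.unfilled()])
# prints ['Communications Lead', 'Scribe']
```

Checking `unfilled()` right after declaration is one way to make unclear ownership visible before technical work starts.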

3. The Code: Runbooks & Severity Matrix

Our sre/ repo includes the organizational templates required for high-stakes response. We use a 4-tier severity matrix to set communication expectations and response targets.

Severity	Impact	Update Cadence
Sev0	Complete Outage	Every 15 minutes
Sev1	Major Degradation	Every 30 minutes
Sev2	Partial Failure	Every 2 hours
Sev3	Minor Defect	During working hours
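As a rough sketch, the matrix can be encoded as data so the Communications Lead can compute when the next stakeholder update is due. The names `UPDATE_CADENCE` and `next_update_due` are hypothetical, and treating Sev3 as "no fixed interval" is an assumption about what "working hours" means here.

```python
from datetime import datetime, timedelta
from typing import Optional

# Update cadence per severity, taken from the matrix above.
# Sev3 has no fixed interval (handled during working hours), so it maps to None.
UPDATE_CADENCE = {
    "Sev0": timedelta(minutes=15),
    "Sev1": timedelta(minutes=30),
    "Sev2": timedelta(hours=2),
    "Sev3": None,
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None if no fixed cadence."""
    cadence = UPDATE_CADENCE[severity]  # raises KeyError for an unknown severity
    if cadence is None:
        return None
    return last_update + cadence


last = datetime(2024, 1, 1, 3, 0)
print(next_update_due("Sev1", last))  # prints 2024-01-01 03:30:00
```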

4. The Guardrail: Blameless Postmortems

The most important SRE guardrail is the Postmortem. We focus on system conditions and guardrail gaps rather than individual mistakes. Every major incident must result in specific, evidence-backed hardening actions.
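One way to enforce "specific, evidence-backed hardening actions" is a small check run before a postmortem is accepted. The sketch below is an assumption about how such a check might look; the `Postmortem` and `ActionItem` shapes and the `review_gaps` function are illustrative, not the course's templates.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    """Hypothetical hardening task derived from an incident finding."""
    description: str
    evidence: list = field(default_factory=list)  # e.g. links to logs or graphs


@dataclass
class Postmortem:
    """Hypothetical postmortem record; note there is no 'who caused it' field."""
    incident_id: str
    system_gaps: list
    action_items: list = field(default_factory=list)


def review_gaps(pm: Postmortem) -> list:
    """Return a list of problems that block acceptance of the postmortem."""
    problems = []
    if not pm.action_items:
        problems.append("no hardening action items recorded")
    for item in pm.action_items:
        if not item.evidence:
            problems.append(f"action item lacks evidence: {item.description}")
    return problems


pm = Postmortem(
    incident_id="INC-EXAMPLE",  # placeholder identifier
    system_gaps=["alert had no runbook link"],
    action_items=[ActionItem("add runbook link to alert")],
)
print(review_gaps(pm))
# prints ['action item lacks evidence: add runbook link to alert']
```

The record deliberately describes system gaps and tasks only, which keeps the review on conditions rather than individuals.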

5. Verification: Did I Get It?

Run an incident simulation using your Chaos Monkey and follow the Safe Workflow:

# 1. Detect signal (Chapter 10)
# 2. Declare Severity and Assign Roles
# 3. Build Timeline and Mitigate
# 4. Confirm Recovery and Resolve

Expected Output: A clear, auditable timeline of the incident and a draft postmortem with at least one technical action item.
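The auditable timeline can be sketched as an append-only log that the Scribe fills in as the drill moves through the four workflow steps. Everything here is a hypothetical illustration: the `Timeline` class, the phase names, and the sample notes are assumptions rather than course tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Phases mirror the four workflow steps; the short names are an assumption.
PHASES = ("detect", "declare", "mitigate", "resolve")


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    phase: str
    note: str


class Timeline:
    """Append-only incident log; phases may repeat but never go backwards."""

    def __init__(self) -> None:
        self._entries = []

    def record(self, phase: str, note: str, at: Optional[datetime] = None) -> None:
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        if self._entries:
            last_phase = self._entries[-1].phase
            if PHASES.index(phase) < PHASES.index(last_phase):
                raise ValueError(f"cannot go from {last_phase} back to {phase}")
        stamp = at if at is not None else datetime.now(timezone.utc)
        self._entries.append(TimelineEntry(stamp, phase, note))

    def render(self) -> list:
        # One line per entry, suitable for pasting into a draft postmortem.
        return [f"{e.at.isoformat()} [{e.phase}] {e.note}" for e in self._entries]


t0 = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
timeline = Timeline()
timeline.record("detect", "latency alert fired", at=t0)
timeline.record("declare", "Sev1 declared, roles assigned", at=t0.replace(minute=4))
timeline.record("mitigate", "rolled back last deploy", at=t0.replace(minute=11))
timeline.record("resolve", "signals back to baseline", at=t0.replace(minute=20))
print("\n".join(timeline.render()))
```

Rejecting out-of-order phases is a design choice for the sketch; a real incident may need to reopen mitigation after a premature resolve.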

Detailed Lessons

Hands-On Materials

The Incident: Coordination Chaos

Investigation & Containment

Workflow & Operating Model

Lab & Completion