Chapter 13: 24/7 Production SRE

Why This Chapter Exists

Tooling is not enough without operational discipline. This chapter defines how teams run incidents, reduce recurrence, and harden systems continuously.

Scope

on-call operating model
incident lifecycle and severity policy
recurring-problem management
blameless postmortem workflow
AI boundary policy in production

Core Principles

Evidence first:

metrics + traces + logs before high-risk actions

Blameless response:

focus on system conditions and guardrail gaps, not individuals

Controlled escalation:

severity-based comms and ownership

AI boundary:

AI can classify and recommend
humans own decisions and execution
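
The AI boundary can be sketched as a small gate in code. This is a minimal illustration, not a prescribed implementation: `Recommendation`, `ApprovalRequired`, and `execute` are hypothetical names, and the "execution" here only records who approved the action.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Recommendation:
    """An AI-produced suggestion; it carries no authority to act."""
    action: str
    rationale: str


class ApprovalRequired(Exception):
    """Raised when execution is attempted without a named human approver."""


def execute(rec: Recommendation, approved_by: Optional[str]) -> str:
    # The AI may classify and recommend; a human must own the decision.
    if not approved_by or not approved_by.strip():
        raise ApprovalRequired(f"'{rec.action}' needs a human approver")
    # Placeholder for the real action: we only record who decided.
    return f"{rec.action} approved by {approved_by.strip()}"
```

Calling `execute(rec, None)` raises `ApprovalRequired`, so a recommendation alone can never trigger a production change in this sketch.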

Operating Model

Incident Commander (IC)
Primary Responder
Communications Owner
Scribe

Lab Files

lab.md
runbook-oncall.md
postmortem-template.md
quiz.md

Done When

learner can run a full incident timeline with roles and severity
learner can produce a complete blameless postmortem
learner can define hardening actions with owner and due date

Blameless Postmortem Template

Incident Metadata

Incident ID:
Date/Time (UTC):
Severity:
Services affected:
Incident Commander:

Summary

What happened:
Customer impact:
Duration:

Timeline (UTC)

Detection:
Triage:
Mitigation:
Recovery:
Closure:

Root Cause Analysis

Primary cause:
Contributing factors:
Why safeguards did not prevent it:

What Worked Well

What Didn’t Work

Action Items

Action:

Owner:
Due date:
Validation method:

Action:

Owner:
Due date:
Validation method:
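
One way to check that an action item is complete before the postmortem is closed is sketched below. The field names mirror the template above; `ActionItem` and `missing_fields` are hypothetical helpers, and the sample values in the usage note are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due_date: Optional[date] = None
    validation_method: Optional[str] = None


def missing_fields(item: ActionItem) -> List[str]:
    """Return the template fields that are still empty for this item."""
    missing = []
    if not item.action.strip():
        missing.append("action")
    if not item.owner:
        missing.append("owner")
    if item.due_date is None:
        missing.append("due_date")
    if not item.validation_method:
        missing.append("validation_method")
    return missing
```

For example, `missing_fields(ActionItem("Tighten rollout guardrail"))` reports that owner, due date, and validation method are all still missing.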

Prevention & Hardening

Guardrails to add or tighten:
Alerting/SLO improvements:
Runbook updates required:

AI Assistance Review

Where AI helped:
Where human judgment was critical:
Any AI recommendation rejected and why:

Lab: Full Incident Lifecycle (24/7 SRE)

Goal

Run one full lifecycle simulation:

detect
triage
mitigate
recover
postmortem

Scenario Input

Use one recent controlled scenario (recommended from Chapter 11/12):

backend crash/panic pattern, or
elevated 5xx with recurring incidents

Step 1: Incident Declaration

Define:

severity (SEV-1/SEV-2/SEV-3)
blast radius
IC and responder roles
comms channel and update cadence

Step 2: Evidence Collection

Capture:

symptom metrics
representative trace(s)
correlated log evidence
guardian incident ID (if available)

Step 3: Mitigation Decision

Choose one:

Quiz: Chapter 13 (24/7 Production SRE)

Questions

What is the first priority in the first minutes of an incident?

Which statement is correct?

A) Decide mitigation first, collect evidence later.
B) Collect evidence first, then choose mitigation.
C) Wait for AI confidence to reach 100%.

Name the minimum evidence set before high-risk action.

Why are blameless postmortems important?

Who owns final production decisions when AI is used?

What makes an action item “good” in postmortem output?

Runbook: On-Call Incident Operations

Severity Matrix
- SEV-1: active customer outage or high data risk
- SEV-2: major degradation with customer impact
- SEV-3: limited/contained issue, no major customer impact
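
The matrix above can be read as a most-severe-first decision. The sketch below assumes three boolean inputs that a responder would judge by hand; `Severity` and `classify` are illustrative names, not part of the runbook.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


def classify(customer_outage: bool, high_data_risk: bool,
             major_customer_degradation: bool) -> Severity:
    """Map observed conditions to the severity matrix, most severe first."""
    if customer_outage or high_data_risk:
        return Severity.SEV1
    if major_customer_degradation:
        return Severity.SEV2
    # Limited or contained issue with no major customer impact.
    return Severity.SEV3
```

Checking the most severe condition first means an incident that matches several rows is never under-classified.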
Standard Timeline
1. 0-5 min:
   - acknowledge alert
   - appoint IC
   - declare severity and channel
2. 5-15 min:
   - confirm symptom via metrics
   - trace/log correlation
   - first mitigation proposal
3. 15-30 min:
   - execute lowest-risk mitigation
   - status updates on cadence
4. 30+ min:
   - verify recovery
   - downgrade/close incident when stable
   - create postmortem task
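
As a rough aid, the timeline above can be encoded as a lookup from elapsed minutes to expected activities. `PHASES` and `current_phase` are hypothetical names, and assigning a boundary minute (5, 15, 30) to the later phase is an assumption of this sketch.

```python
from typing import List, Tuple

# (start minute, label, expected activities), copied from the timeline above.
PHASES: List[Tuple[int, str, List[str]]] = [
    (0, "0-5 min", ["acknowledge alert", "appoint IC",
                    "declare severity and channel"]),
    (5, "5-15 min", ["confirm symptom via metrics", "trace/log correlation",
                     "first mitigation proposal"]),
    (15, "15-30 min", ["execute lowest-risk mitigation",
                       "status updates on cadence"]),
    (30, "30+ min", ["verify recovery", "downgrade/close incident when stable",
                     "create postmortem task"]),
]


def current_phase(elapsed_minutes: float) -> Tuple[str, List[str]]:
    """Return the label and expected activities for the elapsed time."""
    if elapsed_minutes < 0:
        raise ValueError("elapsed_minutes must be non-negative")
    label, tasks = PHASES[0][1], PHASES[0][2]
    for start, name, items in PHASES:
        # Assumption: a boundary minute belongs to the later phase.
        if elapsed_minutes >= start:
            label, tasks = name, items
    return label, tasks
```

An IC could call `current_phase(12)` to confirm that a first mitigation proposal should be on the table by now.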
Communications Template
- Current status:
- Impact:
- Scope:
- Action in progress:
- Next update in:
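
A small formatter can keep status updates consistent with the template above. This is a sketch under assumptions: `format_update` is a hypothetical helper, and expressing the next update as whole minutes is a choice made here, not something the runbook specifies.

```python
def format_update(status: str, impact: str, scope: str, action: str,
                  next_update_minutes: int) -> str:
    """Render a status update using the template's field order."""
    if next_update_minutes <= 0:
        raise ValueError("next_update_minutes must be positive")
    fields = [
        ("Current status", status),
        ("Impact", impact),
        ("Scope", scope),
        ("Action in progress", action),
        ("Next update in", f"{next_update_minutes} min"),
    ]
    for label, value in fields:
        # Reject blank fields so no update goes out half-filled.
        if not value.strip():
            raise ValueError(f"'{label}' must not be empty")
    return "\n".join(f"{label}: {value}" for label, value in fields)
```

Rejecting empty fields forces the Communications Owner to state impact and scope explicitly, even when the answer is "still being assessed".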
Decision Rules
- No rollback/hotfix without evidence package.
- Prefer reversible mitigation first.
- If uncertainty remains high, reduce blast radius before deeper fixes.
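
The first two rules can be sketched as a filter followed by an ordering. The code assumes the evidence package means metrics, traces, and logs (as in the chapter's "evidence first" principle); `Evidence`, `Mitigation`, and `choose` are hypothetical names, and the third rule about blast radius is left to human judgment rather than modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Evidence:
    metrics: bool = False
    traces: bool = False
    logs: bool = False

    def complete(self) -> bool:
        return self.metrics and self.traces and self.logs


@dataclass
class Mitigation:
    name: str
    reversible: bool
    high_risk: bool  # e.g. a rollback or hotfix


def choose(candidates: List[Mitigation],
           evidence: Evidence) -> Optional[Mitigation]:
    """Pick a mitigation: gate high-risk options, then prefer reversible."""
    # Rule 1: high-risk actions are only allowed with a full evidence package.
    allowed = [m for m in candidates
               if evidence.complete() or not m.high_risk]
    # Rule 2: reversible options first; the sort is stable, so the
    # caller's original order breaks ties.
    allowed.sort(key=lambda m: not m.reversible)
    return allowed[0] if allowed else None
```

When `choose` returns `None`, every remaining option is high-risk and the evidence is incomplete, which is a signal to keep collecting evidence rather than act.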
Incident Closure Checklist
- service indicators back to baseline
- no active critical symptom for agreed window
- postmortem owner assigned
- hardening action items created
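
The checklist above can be expressed as a function that lists what still blocks closure. `ClosureState` and `closure_blockers` are illustrative names; the length of the stability window is passed in by the caller because the runbook leaves it to team agreement.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClosureState:
    indicators_at_baseline: bool
    minutes_without_critical_symptom: int
    postmortem_owner: str
    hardening_items_created: int


def closure_blockers(state: ClosureState,
                     stable_window_minutes: int) -> List[str]:
    """Return the checklist items that are not yet satisfied."""
    blockers = []
    if not state.indicators_at_baseline:
        blockers.append("service indicators not back to baseline")
    if state.minutes_without_critical_symptom < stable_window_minutes:
        blockers.append("stability window not yet met")
    if not state.postmortem_owner.strip():
        blockers.append("postmortem owner not assigned")
    if state.hardening_items_created < 1:
        blockers.append("no hardening action items created")
    return blockers
```

An empty list means every checklist item is met and the IC can close the incident.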
