Chapter 13: 24/7 Production SRE

Why This Chapter Exists

Tooling is not enough without operational discipline. This chapter defines how teams run incidents, reduce recurrence, and harden systems continuously.

Scope

  • on-call operating model
  • incident lifecycle and severity policy
  • recurring-problem management
  • blameless postmortem workflow
  • AI boundary policy in production

Core Principles

  1. Evidence first:
    • metrics + traces + logs before high-risk actions
  2. Blameless response:
    • focus on system conditions and guardrail gaps, not individuals
  3. Controlled escalation:
    • severity-based comms and ownership
  4. AI boundary:
    • AI can classify and recommend
    • humans own decisions and execution
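
The AI boundary can also be enforced in tooling rather than left to convention. The following is a minimal sketch, assuming a hypothetical `Recommendation` record and `execute_action` helper (neither comes from the course material): any proposed action, whatever its source, is refused until a named human has approved it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recommendation:
    """Hypothetical record for a proposed production action."""
    action: str
    source: str                        # "ai" or "human"; informational only
    approved_by: Optional[str] = None  # name of the approving human, if any


def execute_action(rec: Recommendation) -> str:
    """Refuse to run any action that no human has approved."""
    if rec.approved_by is None:
        raise PermissionError(f"'{rec.action}' needs human approval before execution")
    return f"executing '{rec.action}' (approved by {rec.approved_by})"


rec = Recommendation(action="restart backend pods", source="ai")
try:
    execute_action(rec)
except PermissionError as err:
    print(err)

# Once a human signs off, the same recommendation is allowed through.
rec.approved_by = "incident-commander"
print(execute_action(rec))
```

The gate checks only for an approval, not for the source, so the same path applies to AI and human proposals alike.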

Operating Model

  • Incident Commander (IC)
  • Primary Responder
  • Communications Owner
  • Scribe

Lab Files

  • lab.md
  • runbook-oncall.md
  • postmortem-template.md
  • quiz.md

Done When

  • learner can run a full incident timeline with roles and severity
  • learner can produce a complete blameless postmortem
  • learner can define hardening actions with owner and due date

Blameless Postmortem Template

Incident Metadata

  • Incident ID:
  • Date/Time (UTC):
  • Severity:
  • Services affected:
  • Incident Commander:

Summary

  • What happened:
  • Customer impact:
  • Duration:

Timeline (UTC)

  1. Detection:
  2. Triage:
  3. Mitigation:
  4. Recovery:
  5. Closure:
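
Once the five timestamps are filled in, the Duration field and related intervals can be computed instead of estimated. This is a small sketch with made-up timestamps; the stage keys and the `minutes_between` helper are illustrative names, not part of the template.

```python
from datetime import datetime, timezone

# Hypothetical timestamps for the five timeline stages (all UTC).
timeline = {
    "detection":  datetime(2024, 1, 10, 14, 2, tzinfo=timezone.utc),
    "triage":     datetime(2024, 1, 10, 14, 9, tzinfo=timezone.utc),
    "mitigation": datetime(2024, 1, 10, 14, 27, tzinfo=timezone.utc),
    "recovery":   datetime(2024, 1, 10, 14, 50, tzinfo=timezone.utc),
    "closure":    datetime(2024, 1, 10, 16, 0, tzinfo=timezone.utc),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline stages."""
    delta = timeline[end] - timeline[start]
    return int(delta.total_seconds() // 60)


time_to_mitigate = minutes_between("detection", "mitigation")
time_to_recover = minutes_between("detection", "recovery")
print(f"time to mitigate: {time_to_mitigate} min")  # 25 min
print(f"time to recover:  {time_to_recover} min")   # 48 min
```

Using timezone-aware UTC values throughout avoids mixing local times from different responders.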

Root Cause Analysis

  • Primary cause:
  • Contributing factors:
  • Why safeguards did not prevent it:

What Worked Well

What Didn’t Work

Action Items

  1. Action:
    • Owner:
    • Due date:
    • Validation method:
  2. Action:
    • Owner:
    • Due date:
    • Validation method:
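
An action item is only trackable when every field is filled in. As a rough sketch, a completeness check could look like the following; the `ActionItem` class and `missing_fields` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due_date: Optional[date] = None
    validation_method: Optional[str] = None


def missing_fields(item: ActionItem) -> List[str]:
    """Return the template fields that are still empty."""
    checks = {
        "owner": item.owner,
        "due date": item.due_date,
        "validation method": item.validation_method,
    }
    return [name for name, value in checks.items() if not value]


item = ActionItem(action="Add alert on backend restart rate", owner="team-platform")
print(missing_fields(item))  # ['due date', 'validation method']
```

A postmortem review could run this over every action item and refuse to close the document while any list is non-empty.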

Prevention & Hardening

  • Guardrails to add or tighten:
  • Alerting/SLO improvements:
  • Runbook updates required:

AI Assistance Review

  • Where AI helped:
  • Where human judgment was critical:
  • Any AI recommendation rejected and why:

Lab: Full Incident Lifecycle (24/7 SRE)

Goal

Run one full lifecycle simulation:

  • detect
  • triage
  • mitigate
  • recover
  • postmortem

Scenario Input

Use one recent controlled scenario (scenarios from Chapters 11 and 12 are recommended):

  • backend crash/panic pattern, or
  • elevated 5xx with recurring incidents

Step 1: Incident Declaration

Define:

  • severity (SEV-1/SEV-2/SEV-3)
  • blast radius
  • IC and responder roles
  • comms channel and update cadence

Step 2: Evidence Collection

Capture:

  • symptom metrics
  • representative trace(s)
  • correlated log evidence
  • guardian incident ID (if available)
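
The captured evidence can be recorded in a single structure so that gaps are visible before any mitigation is chosen. The sketch below assumes a hypothetical `EvidencePackage` class; the field names mirror the list above, and the guardian incident ID is treated as optional because the lab marks it "if available".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvidencePackage:
    symptom_metrics: List[str] = field(default_factory=list)
    traces: List[str] = field(default_factory=list)
    log_evidence: List[str] = field(default_factory=list)
    guardian_incident_id: Optional[str] = None  # optional: "if available"

    def gaps(self) -> List[str]:
        """Names of required evidence types that have not been captured yet."""
        required = {
            "symptom metrics": self.symptom_metrics,
            "traces": self.traces,
            "log evidence": self.log_evidence,
        }
        return [name for name, items in required.items() if not items]

    def is_complete(self) -> bool:
        return not self.gaps()


pkg = EvidencePackage(symptom_metrics=["5xx rate above baseline on backend"])
print(pkg.gaps())         # ['traces', 'log evidence']
print(pkg.is_complete())  # False
```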

Step 3: Mitigation Decision

Choose one:

Quiz: Chapter 13 (24/7 Production SRE)

Questions

  1. What is the first priority in the first minutes of an incident?

  2. Which statement is correct?

    • A) Decide mitigation first, collect evidence later.
    • B) Collect evidence first, then choose mitigation.
    • C) Wait for AI confidence to reach 100%.

  3. Name the minimum evidence set before high-risk action.

  4. Why are blameless postmortems important?

  5. Who owns final production decisions when AI is used?

  6. What makes an action item “good” in postmortem output?

Runbook: On-Call Incident Operations

Severity Matrix

  • SEV-1: active customer outage or high data-risk
  • SEV-2: major degradation with customer impact
  • SEV-3: limited/contained issue, no major customer impact
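
The matrix can be read as an ordered set of checks, most severe first. Here is a minimal sketch under that reading; the `classify` function and its boolean inputs are assumptions made for illustration, and a real policy would likely need finer-grained signals.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


def classify(customer_outage: bool, high_data_risk: bool,
             major_customer_degradation: bool) -> Severity:
    """Map incident signals to a severity, checking the most severe case first."""
    if customer_outage or high_data_risk:
        return Severity.SEV1
    if major_customer_degradation:
        return Severity.SEV2
    return Severity.SEV3


print(classify(customer_outage=False, high_data_risk=False,
               major_customer_degradation=True).value)  # SEV-2
```

Checking the most severe condition first means an incident that matches several rows is never under-classified.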

Standard Timeline

  1. 0-5 min:
    • acknowledge alert
    • appoint IC
    • declare severity and channel
  2. 5-15 min:
    • confirm symptom via metrics
    • correlate traces and logs
    • propose first mitigation
  3. 15-30 min:
    • execute lowest-risk mitigation
    • post status updates on cadence
  4. 30+ min:
    • verify recovery
    • downgrade/close incident when stable
    • create postmortem task
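
The timeline amounts to a lookup from elapsed time to expected actions. The sketch below shows one way to encode it; `PHASES` and `expected_actions` are hypothetical names, and a boundary minute (for example exactly 5) is assigned to the later phase, which is an assumption the runbook does not state.

```python
from typing import List, Tuple

# (phase start in minutes, expected actions), paraphrased from the standard timeline.
PHASES: List[Tuple[int, List[str]]] = [
    (0, ["acknowledge alert", "appoint IC", "declare severity and channel"]),
    (5, ["confirm symptom via metrics", "correlate traces and logs", "propose first mitigation"]),
    (15, ["execute lowest-risk mitigation", "post status updates on cadence"]),
    (30, ["verify recovery", "downgrade/close incident when stable", "create postmortem task"]),
]


def expected_actions(elapsed_minutes: float) -> List[str]:
    """Return the actions for the latest phase that has already started."""
    if elapsed_minutes < 0:
        raise ValueError("elapsed_minutes must be non-negative")
    current = PHASES[0][1]
    for start, actions in PHASES:
        if elapsed_minutes >= start:
            current = actions
    return current


print(expected_actions(12))
```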

Communications Template

  • Current status:
  • Impact:
  • Scope:
  • Action in progress:
  • Next update in:
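
Generating updates from a function keeps every message in the same field order. A small sketch, assuming a hypothetical `format_status_update` helper and a cadence expressed in minutes:

```python
def format_status_update(status: str, impact: str, scope: str,
                         action: str, next_update_minutes: int) -> str:
    """Render one status update with the template's five fields in order."""
    lines = [
        f"Current status: {status}",
        f"Impact: {impact}",
        f"Scope: {scope}",
        f"Action in progress: {action}",
        f"Next update in: {next_update_minutes} min",
    ]
    return "\n".join(lines)


update = format_status_update(
    status="investigating",
    impact="elevated 5xx on checkout requests",  # illustrative values only
    scope="backend service, one region",
    action="correlating traces with recent deploy",
    next_update_minutes=15,
)
print(update)
```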

Decision Rules

  • No rollback/hotfix without evidence package.
  • Prefer reversible mitigation first.
  • If uncertainty remains high, reduce blast radius before deeper fixes.
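
The first two rules can be expressed as a filter followed by a sort. The following is a sketch under stated assumptions: the `Mitigation` class, its `kind` labels, and the numeric `blast_radius` score are all hypothetical, and the third rule is modeled only loosely, as a tie-breaker that favors the smaller blast radius.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mitigation:
    name: str
    kind: str          # e.g. "rollback", "hotfix", "traffic" (hypothetical labels)
    reversible: bool
    blast_radius: int  # hypothetical score: lower touches fewer users/services


def rank_mitigations(options: List[Mitigation], evidence_complete: bool) -> List[Mitigation]:
    """Drop rollback/hotfix without evidence, then put reversible options first."""
    if not evidence_complete:
        options = [m for m in options if m.kind not in ("rollback", "hotfix")]
    # False sorts before True, so reversible options come first;
    # blast radius breaks ties in favor of the smaller change.
    return sorted(options, key=lambda m: (not m.reversible, m.blast_radius))


candidates = [
    Mitigation("hotfix null check", "hotfix", reversible=False, blast_radius=3),
    Mitigation("rollback release", "rollback", reversible=True, blast_radius=3),
    Mitigation("shift traffic away from bad zone", "traffic", reversible=True, blast_radius=1),
]
print([m.name for m in rank_mitigations(candidates, evidence_complete=False)])
print([m.name for m in rank_mitigations(candidates, evidence_complete=True)])
```

The ranking is only a suggestion; the choice of mitigation stays with the Incident Commander.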

Incident Closure Checklist

  • service indicators back to baseline
  • no active critical symptom for agreed window
  • postmortem owner assigned
  • hardening action items created
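
The checklist works as a gate: the incident closes only when every item is confirmed. A minimal sketch, assuming the state is tracked as a dictionary of booleans (the `open_items` and `can_close` helpers are illustrative names):

```python
from typing import Dict, List

CLOSURE_CHECKS = [
    "service indicators back to baseline",
    "no active critical symptom for agreed window",
    "postmortem owner assigned",
    "hardening action items created",
]


def open_items(state: Dict[str, bool]) -> List[str]:
    """Checklist items not yet confirmed; items missing from state count as open."""
    return [check for check in CLOSURE_CHECKS if not state.get(check, False)]


def can_close(state: Dict[str, bool]) -> bool:
    return not open_items(state)


state = {
    "service indicators back to baseline": True,
    "no active critical symptom for agreed window": True,
    "postmortem owner assigned": False,
}
print(open_items(state))  # ['postmortem owner assigned', 'hardening action items created']
print(can_close(state))   # False
```

Treating unrecorded items as open means a forgotten check blocks closure instead of passing silently.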