Chapter 11: Controlled Chaos

Why This Chapter Exists

Production resilience is not proven in calm conditions. This chapter validates behavior under controlled failures with explicit blast-radius limits.

Scope

Failure classes in this chapter:

- crash loop (/panic)
- elevated 5xx (/status/500)
- random pod termination (Chaos Monkey)

Current implementation focus:

- deterministic drills first
- Chaos Monkey in develop with kill switch and strict target allowlist

Chaos Monkey (MVP)

Flux path:

flux/infrastructure/chaos/develop/

Safety controls:

- namespace scope: develop only (RBAC Role in develop)
- target scope: app=frontend or app=backend
- schedule: every 15 minutes
- window: UTC 10-16
- kill switch: spec.suspend: true on CronJob (default)
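
The kill switch above can be flipped with a merge patch on the CronJob. The following is a minimal sketch, not the repo's own tooling. It assumes the CronJob is named chaos-monkey (the name is not stated in this chapter; adjust it to match the manifest under flux/infrastructure/chaos/develop/). The KUBECTL and CRONJOB variables are hypothetical overrides added for illustration and dry runs.

```shell
#!/bin/sh
# Sketch: toggle the Chaos Monkey kill switch in the develop namespace.
# ASSUMPTION: the CronJob is named "chaos-monkey"; override with CRONJOB=<name>.
# ASSUMPTION: KUBECTL is a hypothetical hook; KUBECTL=echo prints the command instead of running it.

chaos_set_suspend() {
  # $1 must be "true" (monkey disabled) or "false" (monkey enabled).
  case "$1" in
    true|false) ;;
    *) echo "usage: chaos_set_suspend true|false" >&2; return 2 ;;
  esac
  ${KUBECTL:-kubectl} patch cronjob "${CRONJOB:-chaos-monkey}" -n develop \
    --type merge -p "{\"spec\":{\"suspend\":$1}}"
}
```

Calling `chaos_set_suspend false` enables the monkey and `chaos_set_suspend true` disables it again. Because the CronJob is managed through Flux, a manual patch may be reverted at the next reconciliation, so a lasting change likely belongs in Git rather than in a live patch.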

Guardrails

- Never run uncontrolled chaos in staging/production.
- One failure injection per run.
- Evidence-first triage: metrics -> traces -> logs.
- Every drill must end with recovery verification and a hardening action.

Lab Files

- lab.md
- runbook-game-day.md
- scorecard.md
- quiz.md

Handoff to Chapter 12 (AI Guardian)

Chaos Monkey emits structured log events in CronJob output. In Chapter 12, Guardian watchers consume these events and classify:

- expected controlled disruption
- unexpected collateral impact
- escalation-required incident

Done When

- learner runs at least two controlled failure drills with evidence
- learner enables/disables Chaos Monkey safely
- learner captures one game-day scorecard with action items

Game Day Scorecard (Template)

Date:
Environment:
Scenario:
Driver:
Incident Commander:
Observer:

Detection

First symptom timestamp:
Detection signal:
MTTD (minutes):

Triage

Representative trace id:
Correlated log evidence:
Hypothesis quality (low/medium/high):

Recovery

Mitigation applied:
Recovery timestamp:
MTTR (minutes):

Signal Quality

Metrics usefulness (1-5):
Traces usefulness (1-5):
Logs usefulness (1-5):
Alert noise assessment:

Outcome

Blast radius within expectation: yes/no
Guardrails respected: yes/no
Follow-up hardening action:
Owner:
Due date:

Lab: Controlled Chaos with Safety Guardrails

Goal

Run one deterministic failure drill and one Chaos Monkey drill in develop:

- confirm detection
- run incident workflow
- verify recovery

Prerequisites

Confirm that the Chaos Monkey kustomization, the target deployments, and the alert rules exist:

kubectl -n flux-system get kustomization chaos-monkey-develop
kubectl -n develop get deploy frontend backend
kubectl -n observability get prometheusrule backend-alerts backend-slo-rules

Step 1: Deterministic Drill (Backend 5xx)

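Generate controlled 5xx responses from the frontend Chaos page, or call the backend directly:

# Terminal 1: forward the backend service (this command blocks until interrupted)
kubectl -n develop port-forward svc/backend 8080:8080
# Terminal 2: send 40 requests and print each HTTP status code
for i in $(seq 1 40); do curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/status/500; done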
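
Forty status codes scrolling past are easy to misread, so it can help to summarize them. The sketch below counts each distinct code from standard input. The name tally_codes is a hypothetical helper, not part of the lab files.

```shell
#!/bin/sh
# Sketch: summarize HTTP status codes, one code per input line.
# ASSUMPTION: tally_codes is an illustrative helper name, not a file in this repo.

tally_codes() {
  # Count occurrences of each non-empty line, then sort for stable output.
  awk 'NF { count[$1]++ } END { for (code in count) print code, count[code] }' | sort
}
```

Piping the request loop into it (`for ...; done | tally_codes`) prints one line per distinct status code followed by its count, which gives a compact record to attach as drill evidence.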

Expected:

Quiz: Chapter 11 (Controlled Chaos)

Questions

1. Why should chaos drills start in develop and not production?
2. Which CronJob field is the primary kill switch for Chaos Monkey?
3. In this repo, what target app labels are allowed for monkey pod deletion?
4. Which incident flow is required before mitigation decisions?
5. Which statement is correct?
   - A) Chaos drills are successful without evidence if the service recovers.
   - B) Controlled chaos requires blast-radius limits and a rollback path.
   - C) Chaos Monkey should run in all namespaces for realistic behavior.
6. What is the minimum evidence set per drill?

Runbook: Controlled Chaos Game Day
Purpose
Run controlled failure injection with strict safety boundaries and evidence-based response.
Roles
- Incident Commander: owns decision flow
- Driver: executes injection commands
- Observer: records timeline and evidence
Preflight (Required)
1. Confirm environment is develop.
2. Confirm rollback path is known.
3. Confirm monitoring/tracing access is available.
4. Confirm the Chaos Monkey CronJob has spec.suspend: true before and after the run.
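
The kill-switch check in the last preflight step can be scripted. This is a minimal sketch under two assumptions: the CronJob is named chaos-monkey (not confirmed in this chapter), and KUBECTL is a hypothetical override used only to make the function testable.

```shell
#!/bin/sh
# Sketch: preflight check that the Chaos Monkey CronJob is suspended.
# ASSUMPTION: CronJob name "chaos-monkey"; override with CRONJOB=<name>.
# ASSUMPTION: KUBECTL is a hypothetical hook for substituting a stub.

monkey_is_suspended() {
  # Read spec.suspend; return 2 if the lookup itself fails.
  state=$(${KUBECTL:-kubectl} get cronjob "${CRONJOB:-chaos-monkey}" \
    -n develop -o jsonpath='{.spec.suspend}') || return 2
  # Succeed only when the field is exactly "true".
  [ "$state" = "true" ]
}
```

A possible use before and after a run: `monkey_is_suspended && echo "kill switch engaged" || echo "WARNING: Chaos Monkey is not suspended"`.
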
Timeline Template
1. T0: baseline metrics snapshot
2. T+2m: inject failure
3. T+5m: detect symptom
4. T+10m: isolate via traces/logs
5. T+15m: mitigate/recover
6. T+25m: verify stability
7. T+35m: write scorecard + actions
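
The scorecard's MTTD and MTTR fields follow from the timeline by simple subtraction. The sketch below assumes same-day UTC clock times in HH:MM form and measures both values from the injection time. That convention is an assumption (some teams measure recovery from detection instead), and the function names are illustrative.

```shell
#!/bin/sh
# Sketch: derive scorecard timings from same-day UTC clock times (HH:MM).
# ASSUMPTION: detection and recovery are both measured from the injection time.

to_minutes() {
  # Convert HH:MM to minutes since midnight. One leading zero is stripped
  # so that values such as 08 or 09 are not read as invalid octal numbers.
  h=${1%%:*}
  m=${1##*:}
  echo $(( ${h#0} * 60 + ${m#0} ))
}

minutes_between() {
  # Elapsed minutes from $1 to $2 (both HH:MM on the same day).
  echo $(( $(to_minutes "$2") - $(to_minutes "$1") ))
}
```

With the template timeline starting at 10:00, injection lands at 10:02, detection at 10:05, and recovery at 10:15. Then `minutes_between 10:02 10:05` prints 3 (detection time) and `minutes_between 10:02 10:15` prints 13 (recovery time).
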
Injection Options
1. Deterministic:
   - GET /status/500
   - GET /panic
2. Monkey:
   - one pod deletion via chaos-monkey job
Decision Classes
- Class A: low impact, recovers automatically
- Class B: moderate impact, manual mitigation needed
- Class C: customer-impact pattern, trigger rollback/incident protocol
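
The three classes can be read as a small decision rule. The sketch below is one possible encoding, assuming two yes/no observations as inputs and assuming that customer impact takes precedence over everything else. Neither the function name nor the input format comes from the runbook.

```shell
#!/bin/sh
# Sketch: map drill observations to a decision class.
# ASSUMPTION: inputs are "yes"/"no"; customer impact outranks other signals.

decision_class() {
  # $1: customer-impact pattern observed (yes/no)
  # $2: manual mitigation needed (yes/no)
  if [ "$1" = "yes" ]; then
    echo C   # trigger rollback/incident protocol
  elif [ "$2" = "yes" ]; then
    echo B   # moderate impact, manual mitigation
  else
    echo A   # low impact, automatic recovery
  fi
}
```

For example, `decision_class no yes` prints B, while `decision_class yes no` prints C.
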
Exit Criteria
- service recovered to baseline
- no active critical alerts for drill scenario
- evidence package complete (metrics + traces + logs)
- at least one hardening issue created
Post-Run Deliverables
- filled scorecard.md
- one short blameless summary
- one backlog item with owner and due date
