Chapter 12: Controlled Chaos
Learning Objectives
By the end of this chapter, you will be able to:
- Design a bounded chaos drill with explicit kill switch and time window
- Execute deterministic failure injection before advancing to random chaos
- Capture evidence during controlled disruption for post-drill analysis
- Distinguish a chaos drill from an uncontrolled experiment
Start with the video for the concept overview, then work through each lesson section.
Recall Check
Before continuing, quickly recall:
- What is the difference between HPA bounds and PDB constraints during a node drain? (Chapter 09)
- Why does logs-only debugging fail without trace_id correlation? (Chapter 10)
- Why is a backup that has never been restored not actually a backup? (Chapter 11)
If you can’t answer these, revisit the corresponding chapter before proceeding.
Production resilience is not proven in calm conditions. The only way to know if your failover and alerts work is to break things on purpose. In this chapter, we implement a Chaos Monkey to rehearse failure under controlled conditions.
1. The Problem: Improvisation Under Stress
A real failure mode appears for the first time during a high-stakes on-call shift. Responders improvise under stress, and mitigation becomes a manual, slow process. Recovery takes longer because the system’s behavior under failure was never practiced.
2. The Concept: Failure as an Experiment
We treat chaos like a scientific experiment, not a random act of destruction.
- Blast Radius: Limit failure to the develop namespace.
- Kill Switch: Every chaos tool must have an immediate “Stop” button.
- Measurable Impact: Run a drill only when you know exactly which metrics you expect to change.
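These constraints can be encoded as a pre-drill check that runs before any chaos is injected. A minimal sketch in shell — the preflight function name and the single-namespace allowlist are illustrative, not part of the repo:

```shell
# Illustrative pre-drill guard: refuse to proceed unless the blast radius
# is the develop namespace. Extend the case branch to allow more namespaces.
preflight() {
  ns="$1"
  case "$ns" in
    develop)
      echo "blast radius ok: ${ns}"
      return 0
      ;;
    *)
      echo "refusing drill: ${ns} is outside the allowed blast radius" >&2
      return 1
      ;;
  esac
}
```

A drill script would call `preflight "$TARGET_NAMESPACE" || exit 1` before touching the cluster.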
3. The Code: The Chaos Monkey CronJob
Our sre/ repo includes a “Chaos Monkey” — a small script that randomly deletes Pods to test if our HPA and PDB settings actually work.
The develop chaos pack lives alongside the rest of the Flux configuration:
- flux/infrastructure/chaos/develop/cronjob.yaml
- flux/infrastructure/chaos/develop/kustomization.yaml
- flux/infrastructure/chaos/develop/role.yaml
- flux/infrastructure/chaos/develop/rolebinding.yaml
- flux/infrastructure/chaos/develop/serviceaccount.yaml
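Before trusting the pack, it is worth confirming that the service account can delete pods only where intended. A sketch using kubectl auth can-i — the service account name matches the manifests, but the namespaces being checked are assumptions for illustration:

```shell
# Illustrative scope check: the chaos-monkey service account should be able
# to delete pods in develop, and the answer should be "no" everywhere else.
check_scope() {
  sa="system:serviceaccount:flux-system:chaos-monkey"
  kubectl auth can-i delete pods --as="$sa" -n develop
  # Expect "no" (non-zero exit) for any namespace outside the blast radius:
  kubectl auth can-i delete pods --as="$sa" -n production
}
```

Run this after every RBAC change; a "yes" outside develop means the blast radius is no longer bounded.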
4. The Guardrail: The “Suspend” Kill Switch
The most critical safety feature of any chaos system is the ability to stop it instantly. Our Chaos Monkey ships with suspend: true, so it is suspended by default and only runs when an engineer explicitly enables it for a “Game Day” exercise.
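Flipping the switch on or off is a one-line patch. A sketch wrapped as a stop function — note that the Job-cleanup selector assumes in-flight chaos Jobs carry the chaos-monkey label, which is an assumption to verify in your cluster:

```shell
# Illustrative kill switch: re-suspend the CronJob, then delete any chaos
# Jobs still in flight (suspending alone does not stop a running Job).
chaos_stop() {
  kubectl -n flux-system patch cronjob chaos-monkey \
    --type merge -p '{"spec":{"suspend":true}}'
  # Assumes chaos Jobs carry this label; adjust the selector to your setup.
  kubectl -n flux-system delete jobs \
    -l app.kubernetes.io/name=chaos-monkey --ignore-not-found
}
```

To arm the monkey for a Game Day, apply the same patch with `"suspend":false`; keep the stop function one keystroke away for the entire exercise.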
Chaos Monkey Config
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: chaos-monkey
  namespace: flux-system
  labels:
    app.kubernetes.io/name: chaos-monkey
    app.kubernetes.io/component: chaos
spec:
  schedule: "*/15 * * * *"
  suspend: true   # kill switch: the monkey is off by default
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        metadata:
          labels:
            app.kubernetes.io/name: chaos-monkey
            app.kubernetes.io/component: chaos
        spec:
          serviceAccountName: chaos-monkey
          restartPolicy: Never
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            runAsGroup: 65532
            fsGroup: 65532
            seccompProfile:
              type: RuntimeDefault
          containers:
            - name: monkey
              image: bitnami/kubectl:1.31
              imagePullPolicy: IfNotPresent
              securityContext:
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
                runAsNonRoot: true
                runAsUser: 65532
                runAsGroup: 65532
                capabilities:
                  drop: ["ALL"]
              env:
                - name: TARGET_NAMESPACE
                  value: develop
                - name: TARGET_APPS
                  value: "frontend backend"
                - name: ALLOWED_HOURS_UTC
                  value: "10-16"
              command:
                - /bin/sh
                - -ec
                - |
                  # Guardrail: only act inside the allowed UTC window.
                  hour="$(date -u +%H)"
                  start_hour="${ALLOWED_HOURS_UTC%-*}"
                  end_hour="${ALLOWED_HOURS_UTC#*-}"
                  if [ "$hour" -lt "$start_hour" ] || [ "$hour" -ge "$end_hour" ]; then
                    echo "outside allowed UTC window ${ALLOWED_HOURS_UTC}; skipping"
                    exit 0
                  fi
                  # Pick one app at random from the space-separated list.
                  set -- ${TARGET_APPS}
                  count="$#"
                  if [ "$count" -eq 0 ]; then
                    echo "no target apps configured; skipping"
                    exit 0
                  fi
                  index="$(awk -v n="$count" 'BEGIN{srand(); print int(rand()*n)+1}')"
                  eval "app=\${$index}"
                  # Pick one running pod of that app at random.
                  pods="$(kubectl -n "${TARGET_NAMESPACE}" get pods -l "app=${app}" --field-selector=status.phase=Running -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')"
                  target="$(printf '%s\n' "$pods" | awk 'NF {a[++n]=$0} END {if (n>0) {srand(); print a[int(rand()*n)+1]}}')"
                  if [ -z "${target}" ]; then
                    echo "no running pods for app=${app}; skipping"
                    exit 0
                  fi
                  # Log a structured evidence event, then delete the pod.
                  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
                  echo "{\"event\":\"chaos_monkey_delete_pod\",\"timestamp\":\"${ts}\",\"namespace\":\"${TARGET_NAMESPACE}\",\"app\":\"${app}\",\"pod\":\"${target}\"}"
                  kubectl -n "${TARGET_NAMESPACE}" delete pod "${target}" --wait=false
              resources:
                requests:
                  cpu: 10m
                  memory: 32Mi
                  ephemeral-storage: 64Mi
                limits:
                  cpu: 100m
                  memory: 128Mi
                  ephemeral-storage: 128Mi
              volumeMounts:
                - name: tmp
                  mountPath: /tmp
          volumes:
            - name: tmp
              emptyDir: {}
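The allowed-hours guard is the part of the script most worth testing in isolation, since an off-by-one there lets chaos run outside the agreed window. It can be factored into a standalone helper (the function name is illustrative) and exercised without a cluster:

```shell
# Same comparison logic as the CronJob script: the window includes the
# start hour and excludes the end hour.
in_window() {
  hour="$1"
  range="$2"
  start_hour="${range%-*}"
  end_hour="${range#*-}"
  [ "$hour" -ge "$start_hour" ] && [ "$hour" -lt "$end_hour" ]
}
```

For example, `in_window "$(date -u +%H)" "10-16" && echo "drill window open"` mirrors the guard in the CronJob.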
5. Verification: Did I Get It?
Run your first “Game Day” by manually triggering the Chaos Monkey:
# Manually trigger the Chaos Monkey Job
kubectl create job --from=cronjob/chaos-monkey -n flux-system game-day-01
# Watch the pods being terminated and replaced
kubectl get pods -n develop -w
Expected Output: You should see the targeted pod terminate and a replacement start almost immediately, with minimal impact on overall service availability.
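During the drill, capture evidence you can analyze afterwards rather than relying on terminal scrollback. A sketch — the capture_evidence name and the /tmp file layout are illustrative:

```shell
# Illustrative evidence capture: snapshot pod state and recent events for
# the target namespace into a timestamped log for post-drill analysis.
capture_evidence() {
  ns="$1"
  drill="$2"
  evidence="/tmp/${drill}-$(date -u +%Y%m%dT%H%M%SZ).log"
  {
    echo "== pods in ${ns} during ${drill} =="
    kubectl get pods -n "${ns}" -o wide
    echo "== recent events in ${ns} =="
    kubectl get events -n "${ns}" --sort-by=.lastTimestamp
  } | tee "${evidence}"
}
```

Run `capture_evidence develop game-day-01` before and after triggering the Job, and attach both snapshots to the drill write-up.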