Chapter 12: Controlled Chaos
Learning Objectives
By the end of this chapter, you will be able to:
- Design a bounded chaos drill with explicit kill switch and time window
- Execute deterministic failure injection before advancing to random chaos
- Capture evidence during controlled disruption for post-drill analysis
- Distinguish a chaos drill from an uncontrolled experiment
Start with the video for the concept overview, then work through each lesson section.
Recall Check
Before continuing, quickly recall:
- What is the difference between HPA bounds and PDB constraints during a node drain? (Chapter 09)
- Why does logs-only debugging fail without trace_id correlation? (Chapter 10)
- Why is a backup that has never been restored not actually a backup? (Chapter 11)
If you can’t answer these, revisit the corresponding chapter before proceeding.
Production resilience is not proven in calm conditions. The only way to know if your failover and alerts work is to break things on purpose. In this chapter, we implement a Chaos Monkey to rehearse failure under controlled conditions.
1. The Problem: Improvisation Under Stress
A real failure mode appears for the first time during a high-stakes on-call shift. Responders improvise under stress, and mitigation becomes a manual, slow process. Recovery takes longer because the system’s behavior under failure was never practiced.
2. The Concept: Failure as an Experiment
We treat chaos like a scientific experiment, not a random act of destruction.
- Blast Radius: Limit failure to the develop namespace.
- Kill Switch: Every chaos tool must have an immediate “Stop” button.
- Measurable Impact: Run a drill only when you know exactly which metrics you expect to change.
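These constraints can be encoded as a pre-drill check that runs before any chaos is injected. A minimal sketch in shell — the preflight function name and the single-namespace allowlist are illustrative, not part of the repo:

```shell
# Illustrative pre-drill guard: refuse to proceed unless the blast radius
# is the develop namespace. Extend the case branch to allow more namespaces.
preflight() {
  ns="$1"
  case "$ns" in
    develop)
      echo "blast radius ok: ${ns}"
      return 0
      ;;
    *)
      echo "refusing drill: ${ns} is outside the allowed blast radius" >&2
      return 1
      ;;
  esac
}
```

A drill script would call `preflight "$TARGET_NAMESPACE" || exit 1` before touching the cluster.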
3. The Code: The Chaos Monkey CronJob
Our sre/ repo includes a “Chaos Monkey” — a small script that randomly deletes Pods to test if our HPA and PDB settings actually work.
The develop chaos pack lives alongside the rest of the Flux configuration:
- flux/infrastructure/chaos/develop/cronjob.yaml
- flux/infrastructure/chaos/develop/kustomization.yaml
- flux/infrastructure/chaos/develop/role.yaml
- flux/infrastructure/chaos/develop/rolebinding.yaml
- flux/infrastructure/chaos/develop/serviceaccount.yaml
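Before trusting the pack, it is worth confirming that the service account can delete pods only where intended. A sketch using kubectl auth can-i — the service account name matches the manifests, but the namespaces being checked are assumptions for illustration:

```shell
# Illustrative scope check: the chaos-monkey service account should be able
# to delete pods in develop, and the answer should be "no" everywhere else.
check_scope() {
  sa="system:serviceaccount:flux-system:chaos-monkey"
  kubectl auth can-i delete pods --as="$sa" -n develop
  # Expect "no" (non-zero exit) for any namespace outside the blast radius:
  kubectl auth can-i delete pods --as="$sa" -n production
}
```

Run this after every RBAC change; a "yes" outside develop means the blast radius is no longer bounded.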
4. The Guardrail: The “Suspend” Kill Switch
The most critical safety feature of any chaos system is the ability to stop it instantly. Our Chaos Monkey ships with suspend: true, so it is suspended by default and only runs when an engineer explicitly enables it for a “Game Day” exercise.
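Flipping the switch on or off is a one-line patch. A sketch wrapped as a stop function — note that the Job-cleanup selector assumes in-flight chaos Jobs carry the chaos-monkey label, which is an assumption to verify in your cluster:

```shell
# Illustrative kill switch: re-suspend the CronJob, then delete any chaos
# Jobs still in flight (suspending alone does not stop a running Job).
chaos_stop() {
  kubectl -n flux-system patch cronjob chaos-monkey \
    --type merge -p '{"spec":{"suspend":true}}'
  # Assumes chaos Jobs carry this label; adjust the selector to your setup.
  kubectl -n flux-system delete jobs \
    -l app.kubernetes.io/name=chaos-monkey --ignore-not-found
}
```

To arm the monkey for a Game Day, apply the same patch with `"suspend":false`; keep the stop function one keystroke away for the entire exercise.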
Chaos Monkey Config
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: chaos-monkey
  namespace: flux-system
  labels:
    app.kubernetes.io/name: chaos-monkey
    app.kubernetes.io/component: chaos
spec:
  schedule: "*/15 * * * *"
  suspend: true   # kill switch: the monkey is off by default
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        metadata:
          labels:
            app.kubernetes.io/name: chaos-monkey
            app.kubernetes.io/component: chaos
        spec:
          serviceAccountName: chaos-monkey
          restartPolicy: Never
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            runAsGroup: 65532
            fsGroup: 65532
            seccompProfile:
              type: RuntimeDefault
          containers:
            - name: monkey
              image: bitnami/kubectl:1.31
              imagePullPolicy: IfNotPresent
              securityContext:
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
                runAsNonRoot: true
                runAsUser: 65532
                runAsGroup: 65532
                capabilities:
                  drop: ["ALL"]
              env:
                - name: TARGET_NAMESPACE
                  value: develop
                - name: TARGET_APPS
                  value: "frontend backend"
                - name: ALLOWED_HOURS_UTC
                  value: "10-16"
              command:
                - /bin/sh
                - -ec
                - |
                  # Guardrail: only act inside the allowed UTC window.
                  hour="$(date -u +%H)"
                  start_hour="${ALLOWED_HOURS_UTC%-*}"
                  end_hour="${ALLOWED_HOURS_UTC#*-}"
                  if [ "$hour" -lt "$start_hour" ] || [ "$hour" -ge "$end_hour" ]; then
                    echo "outside allowed UTC window ${ALLOWED_HOURS_UTC}; skipping"
                    exit 0
                  fi
                  # Pick one app at random from the space-separated list.
                  set -- ${TARGET_APPS}
                  count="$#"
                  if [ "$count" -eq 0 ]; then
                    echo "no target apps configured; skipping"
                    exit 0
                  fi
                  index="$(awk -v n="$count" 'BEGIN{srand(); print int(rand()*n)+1}')"
                  eval "app=\${$index}"
                  # Pick one running pod of that app at random.
                  pods="$(kubectl -n "${TARGET_NAMESPACE}" get pods -l "app=${app}" --field-selector=status.phase=Running -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')"
                  target="$(printf '%s\n' "$pods" | awk 'NF {a[++n]=$0} END {if (n>0) {srand(); print a[int(rand()*n)+1]}}')"
                  if [ -z "${target}" ]; then
                    echo "no running pods for app=${app}; skipping"
                    exit 0
                  fi
                  # Log a structured evidence event, then delete the pod.
                  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
                  echo "{\"event\":\"chaos_monkey_delete_pod\",\"timestamp\":\"${ts}\",\"namespace\":\"${TARGET_NAMESPACE}\",\"app\":\"${app}\",\"pod\":\"${target}\"}"
                  kubectl -n "${TARGET_NAMESPACE}" delete pod "${target}" --wait=false
              resources:
                requests:
                  cpu: 10m
                  memory: 32Mi
                  ephemeral-storage: 64Mi
                limits:
                  cpu: 100m
                  memory: 128Mi
                  ephemeral-storage: 128Mi
              volumeMounts:
                - name: tmp
                  mountPath: /tmp
          volumes:
            - name: tmp
              emptyDir: {}
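The allowed-hours guard is the part of the script most worth testing in isolation, since an off-by-one there lets chaos run outside the agreed window. It can be factored into a standalone helper (the function name is illustrative) and exercised without a cluster:

```shell
# Same comparison logic as the CronJob script: the window includes the
# start hour and excludes the end hour.
in_window() {
  hour="$1"
  range="$2"
  start_hour="${range%-*}"
  end_hour="${range#*-}"
  [ "$hour" -ge "$start_hour" ] && [ "$hour" -lt "$end_hour" ]
}
```

For example, `in_window "$(date -u +%H)" "10-16" && echo "drill window open"` mirrors the guard in the CronJob.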
5. Verification: Did I Get It?
Run your first “Game Day” by manually triggering the Chaos Monkey:
# Manually trigger the Chaos Monkey Job
kubectl create job --from=cronjob/chaos-monkey -n flux-system game-day-01
# Watch the pods being terminated and replaced
kubectl get pods -n develop -w
Expected Output: You should see the targeted pod terminate and a replacement start almost immediately, with minimal impact on overall service availability.
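During the drill, capture evidence you can analyze afterwards rather than relying on terminal scrollback. A sketch — the capture_evidence name and the /tmp file layout are illustrative:

```shell
# Illustrative evidence capture: snapshot pod state and recent events for
# the target namespace into a timestamped log for post-drill analysis.
capture_evidence() {
  ns="$1"
  drill="$2"
  evidence="/tmp/${drill}-$(date -u +%Y%m%dT%H%M%SZ).log"
  {
    echo "== pods in ${ns} during ${drill} =="
    kubectl get pods -n "${ns}" -o wide
    echo "== recent events in ${ns} =="
    kubectl get events -n "${ns}" --sort-by=.lastTimestamp
  } | tee "${evidence}"
}
```

Run `capture_evidence develop game-day-01` before and after triggering the Job, and attach both snapshots to the drill write-up.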