Core Track: guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Chapters 09-11 (see the Recall Check below).

Source Code References

  • cronjob.yaml
  • develop/

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Chapter 12: Controlled Chaos

Learning Objectives

By the end of this chapter, you will be able to:

  • Design a bounded chaos drill with explicit kill switch and time window
  • Execute deterministic failure injection before advancing to random chaos
  • Capture evidence during controlled disruption for post-drill analysis
  • Distinguish a chaos drill from an uncontrolled experiment

Start with the video for the concept overview, then work through each lesson section.

Recall Check

Before continuing, quickly recall:

  • What is the difference between HPA bounds and PDB constraints during a node drain? (Chapter 09)
  • Why does logs-only debugging fail without trace_id correlation? (Chapter 10)
  • Why is a backup that has never been restored not actually a backup? (Chapter 11)

If you can’t answer these, revisit the corresponding chapter before proceeding.

Production resilience is not proven in calm conditions. The only way to know if your failover and alerts work is to break things on purpose. In this chapter, we implement a Chaos Monkey to rehearse failure under controlled conditions.


1. The Problem: Improvisation Under Stress

A real failure mode appears for the first time during a high-stakes on-call shift. Responders improvise under stress, and mitigation becomes a manual, slow process. Recovery takes longer because the system’s behavior under failure was never practiced.

2. The Concept: Failure as an Experiment

We treat chaos like a scientific experiment, not a random act of destruction.

  1. Blast Radius: Limit failure to the develop namespace.
  2. Kill Switch: Every chaos tool must have an immediate “Stop” button.
  3. Measurable Impact: Run a drill only when you know exactly which metrics you expect to change.
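These guardrails should live in code, not in a runbook's fine print. As a minimal, standalone sketch (plain POSIX sh, no cluster required), this is the shape of the allowed-hours guard the Chaos Monkey script later in this chapter uses to enforce its time window; the 10-16 range is an example value:

```shell
#!/bin/sh
# in_window HOUR RANGE: prints "run" if HOUR falls inside the
# half-open UTC window RANGE (e.g. "10-16"), otherwise "skip".
in_window() {
  hour="$1"
  start_hour="${2%-*}"   # text before the dash
  end_hour="${2#*-}"     # text after the dash
  if [ "$hour" -lt "$start_hour" ] || [ "$hour" -ge "$end_hour" ]; then
    echo "skip"
  else
    echo "run"
  fi
}

in_window 09 10-16   # skip: before the window opens
in_window 12 10-16   # run: inside the window
in_window 16 10-16   # skip: the end hour is exclusive
```

The end hour being exclusive means a 10-16 window stops new chaos at 16:00 UTC sharp, leaving the rest of the working day for analysis.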

3. The Code: The Chaos Monkey CronJob

Our sre/ repo includes a “Chaos Monkey” — a small script that randomly deletes Pods to test if our HPA and PDB settings actually work.

Develop chaos pack

The develop chaos pack consists of:
  • flux/infrastructure/chaos/develop/cronjob.yaml
  • flux/infrastructure/chaos/develop/kustomization.yaml
  • flux/infrastructure/chaos/develop/role.yaml
  • flux/infrastructure/chaos/develop/rolebinding.yaml
  • flux/infrastructure/chaos/develop/serviceaccount.yaml

4. The Guardrail: The “Suspend” Kill Switch

The most critical safety feature of any chaos system is the ability to stop it instantly. Our Chaos Monkey is suspended by default (spec.suspend: true). It only runs when an engineer explicitly enables it for a “Game Day” exercise.
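In practice the switch is flipped either with a Git commit (the durable, reviewable path) or with an ad-hoc kubectl patch. A sketch of the kubectl path; note that Flux will revert an ad-hoc patch on its next reconcile, which is exactly the fail-safe behavior you want after a drill:

```
# Arm the monkey for a Game Day
kubectl -n flux-system patch cronjob chaos-monkey \
  --type=merge -p '{"spec":{"suspend":false}}'

# KILL SWITCH: no new Jobs will be created after this
kubectl -n flux-system patch cronjob chaos-monkey \
  --type=merge -p '{"spec":{"suspend":true}}'

# suspend does not stop a Job that is already running;
# find and delete any in-flight run by hand
kubectl -n flux-system get jobs
```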

Chaos Monkey Config

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: chaos-monkey
  namespace: flux-system
  labels:
    app.kubernetes.io/name: chaos-monkey
    app.kubernetes.io/component: chaos
spec:
  schedule: "*/15 * * * *"
  suspend: true
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 2
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        metadata:
          labels:
            app.kubernetes.io/name: chaos-monkey
            app.kubernetes.io/component: chaos
        spec:
          serviceAccountName: chaos-monkey
          restartPolicy: Never
          securityContext:
            runAsNonRoot: true
            runAsUser: 65532
            runAsGroup: 65532
            fsGroup: 65532
            seccompProfile:
              type: RuntimeDefault
          containers:
            - name: monkey
              image: bitnami/kubectl:1.31
              imagePullPolicy: IfNotPresent
              securityContext:
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
                runAsNonRoot: true
                runAsUser: 65532
                runAsGroup: 65532
                capabilities:
                  drop: ["ALL"]
              env:
                - name: TARGET_NAMESPACE
                  value: develop
                - name: TARGET_APPS
                  value: "frontend backend"
                - name: ALLOWED_HOURS_UTC
                  value: "10-16"
              command:
                - /bin/sh
                - -ec
                - |
                  hour="$(date -u +%H)"
                  start_hour="${ALLOWED_HOURS_UTC%-*}"
                  end_hour="${ALLOWED_HOURS_UTC#*-}"
                  if [ "$hour" -lt "$start_hour" ] || [ "$hour" -ge "$end_hour" ]; then
                    echo "outside allowed UTC window ${ALLOWED_HOURS_UTC}; skipping"
                    exit 0
                  fi

                  set -- ${TARGET_APPS}
                  count="$#"
                  if [ "$count" -eq 0 ]; then
                    echo "no target apps configured; skipping"
                    exit 0
                  fi
                  index="$(awk -v n="$count" 'BEGIN{srand(); print int(rand()*n)+1}')"
                  eval "app=\${$index}"

                  pods="$(kubectl -n "${TARGET_NAMESPACE}" get pods -l "app=${app}" --field-selector=status.phase=Running -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')"
                  target="$(printf '%s\n' "$pods" | awk 'NF {a[++n]=$0} END {if (n>0) {srand(); print a[int(rand()*n)+1]}}')"
                  if [ -z "${target}" ]; then
                    echo "no running pods for app=${app}; skipping"
                    exit 0
                  fi

                  ts="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
                  echo "{\"event\":\"chaos_monkey_delete_pod\",\"timestamp\":\"${ts}\",\"namespace\":\"${TARGET_NAMESPACE}\",\"app\":\"${app}\",\"pod\":\"${target}\"}"
                  kubectl -n "${TARGET_NAMESPACE}" delete pod "${target}" --wait=false
              resources:
                requests:
                  cpu: 10m
                  memory: 32Mi
                  ephemeral-storage: 64Mi
                limits:
                  cpu: 100m
                  memory: 128Mi
                  ephemeral-storage: 128Mi
              volumeMounts:
                - name: tmp
                  mountPath: /tmp
          volumes:
            - name: tmp
              emptyDir: {}
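The random-target logic in the script above can be exercised on a workstation without touching a cluster. A standalone sketch of the same POSIX-sh selection trick:

```shell
#!/bin/sh
# Split the app list into positional parameters, exactly as the CronJob does
set -- frontend backend
count="$#"

# awk supplies a random integer in 1..count (srand seeds from the clock)
index="$(awk -v n="$count" 'BEGIN{srand(); print int(rand()*n)+1}')"

# Indirect lookup: expand the index-th positional parameter
eval "app=\${$index}"
echo "selected: $app"
```

Because awk's srand() seeds from the wall clock, two runs within the same second pick the same target; that is acceptable for a 15-minute schedule.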

5. Verification: Did I Get It?

Run your first “Game Day” by manually triggering the Chaos Monkey:

# Manually trigger the Chaos Monkey Job
kubectl create job --from=cronjob/chaos-monkey -n flux-system game-day-01
# Watch the pods being terminated and replaced
kubectl get pods -n develop -w

Expected Output: You should see a pod being deleted and a new one being created immediately, with minimal impact on the overall service availability.
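Beyond watching pods churn, capture evidence that recovery actually completed. A sketch, assuming the frontend and backend Deployments named in TARGET_APPS:

```
# Confirm each target Deployment converged back to its desired replica count
kubectl -n develop rollout status deployment/frontend --timeout=120s
kubectl -n develop rollout status deployment/backend --timeout=120s

# Archive the drill's log output as evidence
kubectl -n flux-system logs job/game-day-01 > game-day-01.log
```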


Detailed Lessons

Hands-On Materials

Labs, quizzes, and runbooks — available to course members.

  • Game Day Scorecard (Template)
  • Lab: Controlled Chaos with Safety Guardrails
  • Quiz: Chapter 12 (Controlled Chaos)
  • Runbook: Controlled Chaos Game Day

The Incident: Chaos Improv

Result: Uncertainty expands the blast radius and recovery time because failure response was not a practiced discipline.

Observed Symptoms (what the team sees first): A failure mode appears with no documented or practiced …

Investigation & Containment

Safe investigation sequence:

  1. Define the Drill: Choose one failure type (e.g., pod termination) and one target service.
  2. Confirm Controls: Verify the kill switch, namespace scope, and time window before starting.
  3. Capture …
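The capture step can be as simple as collecting the monkey's structured kill events alongside what the cluster itself observed. A sketch (the game-day-evidence.jsonl file name is illustrative):

```
# The script logs one JSON line per deleted pod; collect them for the scorecard
kubectl -n flux-system logs -l app.kubernetes.io/name=chaos-monkey --tail=-1 \
  | grep '"event":"chaos_monkey_delete_pod"' >> game-day-evidence.jsonl

# Cross-check against the events recorded in the target namespace
kubectl -n develop get events --sort-by=.lastTimestamp | tail -n 20
```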

Workflow & Kill Switch

  • Kill Switch: spec.suspend: true on the CronJob (the default state).
  • Time Window: Chaos is only allowed during UTC 10-16 on business days.
  • RBAC Limit: The chaos job only has delete permissions on Pods in the develop …
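The RBAC limit is verifiable before the drill starts using kubectl's impersonation check (the production namespace below is just an example of anything outside the blast radius):

```
sa="system:serviceaccount:flux-system:chaos-monkey"

# Inside the blast radius: expect "yes"
kubectl auth can-i delete pods -n develop --as="$sa"

# Outside it: both of these should print "no"
kubectl auth can-i delete pods -n production --as="$sa"
kubectl auth can-i delete deployments -n develop --as="$sa"
```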

Lab & Completion

  • What was the target? (e.g., Backend Pods)
  • Was the failure detected? (Alert/Metric signal)
  • Did the system automatically recover? (HPA/Deployment controller)
  • What is our hardening action? (e.g., Add a PDB, tune HPA, …