Chapter 01: Blast Radius & the Shape of Safety

This chapter has no scripts to run and no cluster to touch — the cluster does not exist yet; you build it in Chapter 02.

What this chapter gives you is the vocabulary you will use for the next sixteen chapters, and a map of how they fit together. Every time a later chapter introduces a new tool — Terraform plan review, SOPS-encrypted secrets, Kyverno admission policies, PITR restores — you should be able to recognize the family of guardrail it belongs to, because you will have met the family here first.

Read this once before you start Chapter 02. Refer back to the roadmap at the end whenever a later chapter feels unmoored from the larger arc.

Learning Objectives

By the end of this chapter, you will be able to:

Explain why “small but mixed” changes carry disproportionate risk during incidents
Name the four invariants that apply to every change in this course
Recognize which of the six guardrail families a new tool belongs to
Describe, from memory, how the course is divided into four blocks and what each block protects against

1. The Shape of a Bad Incident

A senior engineer opens a pull request at 16:47 on a Friday. It contains two small changes:

a backend image bump (v1.2.3 → v1.2.4, a hotfix for a log parsing bug)
an ingress manifest edit (a new annotation intended for the staging environment, but copy-pasted into the develop overlay)

Each diff is small. CI passes. A second engineer approves it. Merge.

Within three minutes, the frontend starts returning 502 Bad Gateway. The backend rollout is still in progress — half the pods are v1.2.3, half are v1.2.4. Somebody asks: is it the image, or is it the ingress?

Nobody knows. The investigation takes thirty-eight minutes.

Had the two changes been in separate PRs, the first rollback attempt — git revert on the ingress commit — would have either fixed the problem immediately or cleanly ruled the ingress change out. One revert, thirty seconds to know. Instead, two layers are simultaneously suspect and neither can be rolled back in isolation.

This is the shape that matters: not the specific bug, but correlated blast radius. Two unrelated changes share one rollback path. When something breaks, you cannot tell which change broke it without investigating both — and you cannot roll either back without touching the other.

Now picture this at AI speed

One engineer fired off one mixed PR, and the incident took thirty-eight minutes. Now imagine the same pattern when an AI agent dispatches fifty parallel tasks across your clusters — each one a small, plausible, CI-passing change. If even two of those tasks share a failure domain, you do not have one incident. You have a matrix of correlated failures, and no human operator can hold fifty rollback paths in their head at the same time.

The cost of sloppy change boundaries scales with the speed and fan-out of the thing making the changes. AI is about to make both of those much larger. That is why this course starts with blast radius, not with tools: the discipline taught here exists because the future of production operations will have fewer humans per commit, not more.
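To put a rough number on that fan-out: if any two tasks might share a failure domain, the number of pairs a responder may have to rule out grows with the square of the task count. The sketch below is plain combinatorics, not a model of any real incident, and `task_pairs` is a name made up for this illustration.

```python
from math import comb


def task_pairs(task_count):
    """Number of distinct pairs of tasks that could share a failure domain."""
    return comb(task_count, 2)


print(task_pairs(2))   # 1     (the Friday PR: one pair to untangle)
print(task_pairs(50))  # 1225  (fifty parallel agent tasks)
```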

When bundling is not the problem

The rule here is not “one change per PR.” That rule breaks on contact with real engineering: a schema migration must ship with the code that reads the new column; a feature flag must ship with the code it gates; an API change must ship with its caller updates. These belong together because they share a failure domain — if one is wrong, the other cannot work either.

The rule is:

One failure domain per PR.

An image bump and an ingress edit are different failure domains: one affects application behavior, the other affects network routing. They can fail independently and so they must be reviewed, applied, and rolled back independently.

The question to ask on every PR is not “how many files changed?” It is:

“If this breaks, is there a single clean rollback?”

If the honest answer is “well, it depends which part broke,” you are bundling failure domains. Split.
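The rollback question can be approximated mechanically. The following is a minimal sketch, assuming a repository where the top-level directory of each changed file indicates its failure domain. The directory names, the domain labels, and the file paths are all hypothetical and are not the course repository's layout.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from top-level directory to failure domain.
# These names are illustrative only.
DOMAIN_BY_PREFIX = {
    "apps": "application",
    "migrations": "application",  # a schema change ships with the code that reads it
    "ingress": "network-routing",
    "terraform": "infrastructure",
}


def failure_domains(changed_paths):
    """Return the set of failure domains touched by a list of changed paths."""
    domains = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        top = parts[0] if parts else ""
        domains.add(DOMAIN_BY_PREFIX.get(top, "unknown"))
    return domains


def has_single_rollback(changed_paths):
    """True when every changed path maps to the same known failure domain."""
    domains = failure_domains(changed_paths)
    return len(domains) == 1 and "unknown" not in domains


friday_pr = ["apps/backend/deployment.yaml", "ingress/develop/ingress.yaml"]
coupled_pr = ["migrations/0007_add_column.sql", "apps/backend/models.py"]

print(has_single_rollback(friday_pr))   # False: image bump and ingress edit
print(has_single_rollback(coupled_pr))  # True: migration plus the code that needs it
```

A path-based check like this is only a first filter. It cannot see that two files in the same directory fail independently, so the question still has to be asked by a human reviewer.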

2. The Four Invariants

Everything in this course is built on four rules that do not change between chapters, environments, or tools. Each rule exists because a specific class of incident keeps recurring when the rule is absent. You will see these invariants applied to different technologies in every block.

I. Verify context before every write

Never run a command that mutates shared state without first proving, explicitly, what state it will mutate. Which cluster. Which namespace. Which environment’s Terraform state. Which branch. The verification is mechanical, not mental: a check your workflow performs, not a habit that depends on your memory.

Why it matters: “Wrong environment” incidents typically happen when an engineer is tired, distracted, or multitasking, not when they are careless. Context verification turns a memory test into a mechanical assertion.
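One way such a mechanical assertion could look is sketched below. It assumes the write targets a Kubernetes cluster and that `kubectl` is on the PATH; `guarded_write`, `assert_context`, and `ContextMismatch` are hypothetical names chosen for this example, not helpers provided by the course.

```python
import subprocess


class ContextMismatch(Exception):
    """Raised when the live context is not the one the change was written for."""


def assert_context(expected, actual):
    """Fail loudly unless the actual context matches the expected one exactly."""
    if actual != expected:
        raise ContextMismatch(f"expected context {expected!r}, got {actual!r}")
    return actual


def current_kube_context():
    """Ask kubectl which context is active (requires kubectl on PATH)."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def guarded_write(expected, write, read_context=current_kube_context):
    """Run write() only after the context check has passed."""
    assert_context(expected, read_context())
    return write()
```

The `read_context` parameter is injectable so that the guard itself can be exercised without a cluster. The same shape applies to namespaces, Terraform workspaces, and Git branches: read the live value, compare it to the value the change declares, and refuse to continue on any difference.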

II. Plan before apply

Never apply a change without first producing, reviewing, and approving an artifact that describes exactly what that change will do. The plan is the thing that is reviewed; the apply is a mechanical replay of an already-approved plan.

Why it matters: Without a reviewed plan, every apply is a gamble on what terraform apply or kubectl apply will decide to do this time. With a reviewed plan, apply becomes a deterministic re-execution of something a human already signed off on.
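The "replay of an already-approved plan" idea can be sketched as a gate that compares the artifact about to be applied with the one that was reviewed. This assumes the plan was saved to a file (for Terraform, `terraform plan -out=<file>` writes one) and that the approval step recorded the file's digest and a timestamp. The one-hour age limit and the function names are assumptions made for illustration.

```python
import hashlib
import time


def plan_digest(plan_bytes):
    """SHA-256 hex digest of the plan artifact's bytes."""
    return hashlib.sha256(plan_bytes).hexdigest()


def may_apply(plan_bytes, approved_digest, approved_at, now=None, max_age_seconds=3600):
    """Return (ok, reason); ok is True only for the exact, still-fresh approved plan."""
    now = time.time() if now is None else now
    if plan_digest(plan_bytes) != approved_digest:
        return False, "plan differs from the approved artifact"
    if now - approved_at > max_age_seconds:
        return False, "approved plan is too old; re-plan and re-review"
    return True, "ok"


reviewed = b"bytes of the reviewed plan file"
approval = plan_digest(reviewed)

print(may_apply(reviewed, approval, approved_at=1000, now=1500))
# (True, 'ok')
print(may_apply(b"a different plan", approval, approved_at=1000, now=1500))
# (False, 'plan differs from the approved artifact')
```

The age check matters because a plan describes a change relative to the state at planning time. If the state has moved since review, the approved description may no longer match what an apply would do.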

III. One failure domain per change

Keep the blast radius of any single merge narrow enough that the rollback path is obvious before you need it.

Why it matters: You do not choose whether to roll back during an incident. You only choose whether the rollback path is ready. If you bundled failure domains in the merge, the rollback path is not ready, and the incident has already grown past the original bug.

IV. Evidence before merge

A PR is not “ready to merge” because the author says it is. It is ready when the artifacts that prove it is ready are attached: the plan output, the CI run, the policy check, the reviewer approval, the test evidence. Claims do not merge — evidence merges.

Why it matters: AI-assisted development compresses the time between “I typed a change” and “a reviewer is looking at it.” The only thing that scales with that compression is the quality of the evidence attached to each change.
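As a rough illustration, the evidence list above can be treated as a checklist that a merge gate evaluates. This is a sketch under stated assumptions: the key names and the dictionary shape are hypothetical, and a real gate would verify each artifact rather than only check that something is attached.

```python
# The five kinds of evidence named in this invariant; the key names are hypothetical.
REQUIRED_EVIDENCE = (
    "plan_output",
    "ci_run",
    "policy_check",
    "reviewer_approval",
    "test_evidence",
)


def missing_evidence(attached):
    """Return required evidence kinds that are absent or empty, in checklist order."""
    return [kind for kind in REQUIRED_EVIDENCE if not attached.get(kind)]


def ready_to_merge(attached):
    """A PR is ready only when nothing on the checklist is missing."""
    return not missing_evidence(attached)


pr_evidence = {
    "plan_output": "plan.txt attached",
    "ci_run": "green",
    "policy_check": "",            # claimed but not attached
    "reviewer_approval": "approved",
}

print(missing_evidence(pr_evidence))  # ['policy_check', 'test_evidence']
print(ready_to_merge(pr_evidence))    # False
```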

These four rules will reappear, in different clothing, in every subsequent chapter. Chapter 02 applies them to Terraform. Chapter 03 to secrets. Chapter 04 to deployments. By Chapter 16 you will see them applied at cluster admission.

3. The Six Guardrail Families

Guardrails in this course are not a pile of tools. They are six distinct families, each defending against a specific class of mistake, each applied at a specific point in the lifecycle of a change.

Learn the families now. When Chapter 15 introduces Cosign image signing, you should be able to say “that’s a delivery-time guardrail” without being told. When Chapter 16 introduces Kyverno, “admission-time.” The tools will be new; the families will not.

1. Write-time guardrails — run on your workstation before the change leaves your keyboard. Pre-commit hooks, local linters, context assertions. They catch the cheapest class of mistake at the cheapest moment. Chapter 05 is where you will build a full write-time layer end-to-end; Chapters 02 and 03 already depend on parts of it.

2. Plan-time guardrails — require a reviewed plan artifact before any apply. Terraform plan files with age limits, Flux diff output, migration previews. Chapter 02 introduces the pattern with Terraform plan-review-apply. Chapter 04 applies the same idea to Flux diffs, and Chapter 17 to data migrations.

3. Delivery-time guardrails — gate the path from merged PR to running cluster. CI pipelines, image signing, attestation checks, immutable promotion. Chapter 05 is where the full delivery gate gets assembled. You see earlier slices of it in Chapter 03 (encrypted secrets on their way to the cluster) and Chapter 04 (image promotion via GitOps), and Chapter 15 later extends it with signed and attested artifacts.

4. Admission-time guardrails — enforced by the cluster itself, at the moment a manifest is applied. Kyverno policies, webhook validators, fail-closed defaults. Chapter 16 is where this family takes its final shape, as cluster-wide policy packs in audit → enforce rollout. Chapter 15 is where it first appears, in the form of image verification policies.

5. Runtime defaults — the workload’s own configured safety properties. Network policies, security contexts, resource limits, replica counts, disruption budgets. This family is split across four chapters, each adding one layer: Chapter 06 (network isolation), Chapter 07 (process hardening), Chapter 08 (resource budgets), Chapter 09 (availability).

6. Operational guardrails — how humans and systems respond after something goes wrong. Observability, backup drills, chaos experiments, AI-assisted triage, incident response discipline. This family runs across the whole operational block of the course, in order: Chapter 10 (observe), 11 (back up and verify restores), 12 (break things safely), 13 (AI triage without mutation), 14 (respond as a team).

Notice that the families form a lifecycle: write → plan → deliver → admit → run → operate. Every change flows through all six stages. A mature platform has a guardrail at each stage — not a single guardrail everywhere, and not all guardrails at one stage.
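The lifecycle can be written down as an ordered type, which makes "a guardrail at each stage" something you can check. The sketch below is illustrative: the tool-to-family mapping uses only examples named in this section, and `uncovered_stages` is a hypothetical helper, not part of the course tooling.

```python
from enum import IntEnum


class Family(IntEnum):
    """The six guardrail families, numbered in lifecycle order."""
    WRITE = 1
    PLAN = 2
    DELIVER = 3
    ADMIT = 4
    RUN = 5
    OPERATE = 6


# One example tool per family, taken from the descriptions above.
TOOL_FAMILY = {
    "pre-commit hooks": Family.WRITE,
    "terraform plan review": Family.PLAN,
    "cosign image signing": Family.DELIVER,
    "kyverno policies": Family.ADMIT,
    "network policies": Family.RUN,
    "backup drills": Family.OPERATE,
}


def uncovered_stages(tools):
    """Lifecycle stages with no guardrail among the given tools, in lifecycle order."""
    covered = {TOOL_FAMILY[tool] for tool in tools}
    return [family for family in Family if family not in covered]


# A platform that only enforces at delivery and admission leaves four stages open.
gaps = uncovered_stages(["cosign image signing", "kyverno policies"])
print([family.name for family in gaps])  # ['WRITE', 'PLAN', 'RUN', 'OPERATE']
```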

4. The Course in Four Blocks

Chapters 02 through 17 (plus one advanced module) form four blocks. Each block is a distinct kind of safety. You do not need to memorize chapter numbers; you need to know which block a problem belongs to, and which block teaches the solution.

Block A — Delivery Pipeline (Chapters 02–05)

What it protects: the path from a code change on your workstation to a running workload in the cluster.
Guardrail families used: write-time, plan-time, delivery-time.
You will practice: context verification (02), plan-review-apply (02), encrypted secrets at rest (03), immutable image promotion (04), layered CI gates (05).
Output at Checkpoint A: a local kind cluster provisioned by Terraform, secrets encrypted with SOPS, GitOps-driven deployments via Flux, and CI that refuses to merge PRs without reviewed plans.

Block B — Runtime Safety Net (Chapters 06–09)

What it protects: the workloads that are already running in the cluster, from each other and from themselves.
Guardrail families used: runtime defaults.
You will practice: default-deny network policies (06), non-root and read-only security contexts (07), resource requests/limits and QoS (08), HPA and PDB for availability (09).
Output at Checkpoint B: every workload has network isolation, runs as non-root with a read-only filesystem, declares resource limits, and can survive a node drain.

Block C — Operational Discipline (Chapters 10–14)

What it protects: your ability to know what is happening, to recover when it goes wrong, and to respond without improvising.
Guardrail families used: operational.
You will practice: metrics-traces-logs correlation (10), backup drills validated by real restores (11), chaos experiments with kill switches (12), AI triage that proposes but does not mutate (13), structured incident response with named roles (14).
Output: an incident response model your team can run at 03:00 without improvising.

Block D — Advanced Hardening (Chapters 15–17 + Progressive Delivery module)

What it protects: the edges the core track leaves exposed (supply chain provenance, cluster-side enforcement, data migration rollback, traffic-level blast radius).
Guardrail families used: delivery-time, admission-time, operational.
You will practice: signed images with attestations (15), Kyverno policies in audit → enforce rollout (16), expand-contract migrations with PITR recovery (17), metric-driven canary releases (module).

5. What You Need Before Chapter 02

Chapter 02 is where you stop reading and start running commands. Before you begin, install and verify the following on your workstation:

Docker (or a compatible container runtime, for the local kind cluster)
Terraform — version matching the course baseline (see CURRICULUM.md)
kubectl — any recent version
Git with pre-commit installed
Make (standard on Linux/macOS)
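A quick way to check the list above is to look for each executable on your PATH. This is a minimal sketch: the executable names are assumptions about how each tool is installed (a Docker-compatible runtime may use a different name), and it checks presence only, not the Terraform version the course baseline requires.

```python
import shutil

# Executable names are assumptions; adjust "docker" if you use a compatible runtime.
REQUIRED_TOOLS = ("docker", "terraform", "kubectl", "git", "pre-commit", "make")


def missing_tools(which=shutil.which):
    """Return the required executables that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```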

If any of these are missing or unfamiliar, read appendix-local-dev before starting Chapter 02. The appendix walks through the local development environment that Chapter 02 assumes.

You do not need a cloud account, a domain, or a managed Kubernetes cluster. The course uses a local kind cluster as its hands-on environment until Block D.

6. How Every Lesson Works

Every lesson in this course follows the four questions introduced in the Intro, “AI as a Very Well-Read Junior Engineer”:

What would AI propose (the brave junior)?
What should we not allow?
What guardrail stops it?
What is the safe workflow?

Hold these four questions in mind as you read. They are the interpretive frame for every incident, every script, and every policy you will see across the rest of the course.

Knowledge Check

Before starting Chapter 02, complete the Quiz to verify your grasp of the four invariants, the six guardrail families, and the four-block structure.
