Lab: Safe Terraform Workflow for Production-Like Kubernetes

Goal

Execute a guardrails-first Terraform workflow:

plan with explicit output artifact
review and validate intent
apply only from reviewed planfile
verify resulting state

Guardrail companions:

review-checklist.md (must be completed before apply)
drift-playbook.md (required when drift is detected)

Prerequisites

Terraform installed
pre-commit installed and hooks configured (make install-hooks)
Access to the target Terraform directory
Required environment variables/secrets for the selected environment
scripts/guard-terraform-plan.sh available and executable

Target Options

Choose one:

Local: infra/terraform/kind_cluster
Hetzner: infra/terraform/hcloud_cluster

Examples below use the Hetzner path. For the local target, substitute infra/terraform/kind_cluster.

Step 1: Context and Scope Check

Confirm you are in the correct repo and directory:

pwd
ls -la

Expected:

path printed by pwd ends with sre
Terraform target directory exists
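
Both expectations can also be checked mechanically. The following is a minimal sketch, assuming the working directory for the lab is the directory named sre and that target paths are relative to it. The function name check_context is hypothetical and is not part of the lab repo.

```shell
# check_context ROOT TARGET_DIR
# Succeeds only if ROOT's last path component is "sre" and TARGET_DIR
# exists beneath it. Hypothetical helper; not part of the lab repo.
check_context() (
  root=$1
  target=$2
  if [ "$(basename "$root")" != "sre" ]; then
    echo "error: expected a working directory named sre, got: $root" >&2
    exit 1
  fi
  if [ ! -d "$root/$target" ]; then
    echo "error: missing Terraform directory: $root/$target" >&2
    exit 1
  fi
  echo "context ok: $root/$target"
)

# Usage:
#   check_context "$(pwd)" infra/terraform/hcloud_cluster
```

The body runs in a subshell, so a failed check returns non-zero without closing the calling shell.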

Run local IaC guardrails before creating a plan:

pre-commit run terraform-fmt --all-files
pre-commit run terraform-validate --all-files
pre-commit run terraform-security --all-files
pre-commit run flux-kustomize-validate --all-files

Step 2: Generate a Planfile (Guarded)

scripts/guard-terraform-plan.sh plan \
  --dir infra/terraform/hcloud_cluster \
  --out tfplan

Expected output includes:

plan created: <workdir>/tfplan
metadata created: <workdir>/tfplan.meta
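
The internals of scripts/guard-terraform-plan.sh are not shown in this lab. Purely as an illustration, a guarded plan step could pair the planfile with a metadata file recording when the plan was created and a checksum of its contents. The metadata keys (created_epoch, cksum) and the function name write_plan_meta below are assumptions, not the script's actual format.

```shell
# write_plan_meta PLANFILE
# Writes PLANFILE.meta with a creation timestamp and a checksum.
# Hypothetical format; the real guard script may record different fields.
write_plan_meta() (
  plan=$1
  [ -f "$plan" ] || { echo "error: planfile not found: $plan" >&2; exit 1; }
  # cksum is used for portability; prefer a stronger hash where available.
  sum=$(cksum < "$plan" | awk '{print $1}')
  {
    echo "created_epoch=$(date +%s)"
    echo "cksum=$sum"
  } > "$plan.meta"
  echo "metadata created: $plan.meta"
)

# Usage, after a plain `terraform plan -input=false -out=tfplan`:
#   write_plan_meta infra/terraform/hcloud_cluster/tfplan
```

Recording the checksum lets a later apply step detect a planfile that was regenerated or altered after review.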

Step 3: Review Plan Before Apply

terraform -chdir=infra/terraform/hcloud_cluster show tfplan

Now complete review-checklist.md and attach it to the PR or review notes.

Hard stop conditions (do not apply):

Any unexpected destroy action.
Changes to unrelated modules/resources.
Environment mismatch (wrong account/cluster/namespace assumptions).
Planfile older than the policy window for this change.
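
The first stop condition is easy to miss in a long human-readable plan. One way to surface it is to read the plan's JSON form and list every resource whose planned actions include a delete. This is a sketch that assumes jq is installed; the function name list_destroys is hypothetical.

```shell
# list_destroys: reads `terraform show -json` output on stdin and prints the
# address of every resource whose planned actions include "delete".
# This covers plain destroys and replacements (delete + create).
list_destroys() {
  jq -r '.resource_changes[]?
         | select(any(.change.actions[]; . == "delete"))
         | .address'
}

# Usage:
#   terraform -chdir=infra/terraform/hcloud_cluster show -json tfplan | list_destroys
```

Empty output means the plan contains no deletes. Any address printed should be matched against the intended change before continuing.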

Step 4: Apply Only the Reviewed Planfile

scripts/guard-terraform-plan.sh apply \
  --dir infra/terraform/hcloud_cluster \
  --out tfplan \
  --max-age-minutes 60

Expected:

Apply runs only if both tfplan and tfplan.meta are present and fresh.
If the metadata is stale or missing, the script blocks the apply with an explicit error.
Apply only after the review checklist has been completed and signed off.
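
To make the freshness rule concrete, here is a minimal sketch of an age check like the one --max-age-minutes implies. It assumes a hypothetical created_epoch=<seconds> line in the metadata file; the real script's metadata format and logic are not shown in this lab and may differ. The function name plan_is_fresh is also hypothetical.

```shell
# plan_is_fresh META_FILE MAX_AGE_MINUTES [NOW_EPOCH]
# Succeeds if the plan recorded in META_FILE is at most MAX_AGE_MINUTES old.
# Assumes a hypothetical "created_epoch=<seconds>" line in META_FILE.
# NOW_EPOCH defaults to the current time; passing it makes the check testable.
plan_is_fresh() (
  meta=$1
  max_minutes=$2
  now=${3:-$(date +%s)}
  [ -f "$meta" ] || { echo "error: missing plan metadata: $meta" >&2; exit 1; }
  created=$(sed -n 's/^created_epoch=\([0-9][0-9]*\)$/\1/p' "$meta" | head -n 1)
  [ -n "$created" ] || { echo "error: no created_epoch in $meta" >&2; exit 1; }
  age=$(( now - created ))
  if [ "$age" -gt $(( max_minutes * 60 )) ]; then
    echo "error: plan is $(( age / 60 ))m old, limit is ${max_minutes}m" >&2
    exit 1
  fi
)

# Usage:
#   plan_is_fresh infra/terraform/hcloud_cluster/tfplan.meta 60 && echo "fresh"
```

Missing metadata is treated as a failure rather than as "fresh", which matches the fail-closed behavior described above.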

Step 5: Verify Post-Apply State

terraform -chdir=infra/terraform/hcloud_cluster output

For cluster targets, also verify:

kubectl get nodes
kubectl get ns

Step 6: Drift Detection Drill

Run a fresh plan after apply:

terraform -chdir=infra/terraform/hcloud_cluster plan -input=false -detailed-exitcode
echo $?

Expected:

0: no changes, continue
2: drift and/or pending changes, classify using drift-playbook.md
1: tooling/state error, stop and fix before any apply
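
The exit-code mapping above can be captured in a small helper so the drill produces the same guidance every time. This is a sketch; the function name classify_plan_exit is hypothetical.

```shell
# classify_plan_exit CODE
# Maps a `terraform plan -detailed-exitcode` status to the lab's next action.
classify_plan_exit() {
  case "$1" in
    0) echo "no changes: continue" ;;
    2) echo "drift or pending changes: classify using drift-playbook.md" ;;
    1) echo "tooling/state error: stop and fix before any apply" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Usage: capture the status first, so no later command overwrites $?.
#   code=0
#   terraform -chdir=infra/terraform/hcloud_cluster plan -input=false -detailed-exitcode || code=$?
#   classify_plan_exit "$code"
```

The || code=$? form also keeps a script running under set -e from aborting on exit code 2, which signals changes rather than a failure.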

Stop criteria by drift class (from drift-playbook.md):

Class A: document evidence and proceed only after reviewer confirms benign impact.
Class B: pause apply, decide reconcile-vs-codify path, then re-plan.
Class C: block apply and escalate to incident-level review.

Step 7: Safe Destroy Practice (Dry Run Discussion)

Do not run destroy blindly. First define:

exact target environment
expected deleted resource classes
recreate path and recovery time expectation

Optional (only in an isolated test environment):

terraform -chdir=infra/terraform/hcloud_cluster plan -destroy -input=false

Destroy preflight checklist (required):

Correct target environment confirmed.
Data/state impact explicitly documented.
Recreate path documented and tested at least once in non-prod.
Stakeholder approval recorded.
Scope is explicit (-target or clearly bounded module/resource set) and reviewed.
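
As one possible way to enforce the first checklist item, a destroy wrapper could require the operator to type the target environment name before doing anything else. This is a sketch only; the function name confirm_environment and the example name hcloud-test are hypothetical and do not come from the lab repo.

```shell
# confirm_environment EXPECTED_NAME
# Reads one line from stdin and succeeds only if it matches EXPECTED_NAME
# exactly. The prompt goes to stderr so stdout stays clean for callers.
confirm_environment() (
  expected=$1
  printf 'Type the target environment name to confirm: ' >&2
  IFS= read -r answer || answer=
  if [ "$answer" != "$expected" ]; then
    echo "error: confirmation mismatch, aborting" >&2
    exit 1
  fi
)

# Usage (hypothetical environment name):
#   confirm_environment hcloud-test && echo "confirmed, continue with preflight"
```

A typed confirmation does not replace the checklist. It only makes it harder to run a destroy against the wrong target by habit.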

Failure Scenarios

Apply without plan metadata

command should fail
learner explains why guardrail blocked execution

Stale planfile

command should fail when --max-age-minutes is exceeded
learner regenerates plan and re-runs review
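
Both scenarios expect a command to fail, so the drill can assert that directly instead of relying on the learner to notice a non-zero exit. This is a minimal sketch with a hypothetical helper named expect_failure. Run it only in an isolated test environment, because a broken guardrail would let the wrapped apply go through.

```shell
# expect_failure CMD [ARGS...]
# Runs CMD and succeeds only if CMD exits non-zero, which is the desired
# outcome when a guardrail is supposed to block execution.
expect_failure() {
  if "$@"; then
    echo "FAIL: command succeeded but should have been blocked: $*" >&2
    return 1
  fi
  echo "ok: command was blocked as expected"
}

# Usage (isolated test environment only), after moving tfplan.meta aside:
#   expect_failure scripts/guard-terraform-plan.sh apply \
#     --dir infra/terraform/hcloud_cluster --out tfplan --max-age-minutes 60
```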

Done When

Learner can run guarded plan -> apply end-to-end.
Learner can explain why lock/state/plan artifacts reduce blast radius.
Learner can identify and communicate drift before applying new changes.
Learner can use and defend a concrete plan review checklist before any apply.