Guardrails-First Course Materials

This course teaches production-grade Kubernetes and SRE practice through incidents, guardrails, and repeatable workflows.

The goal is not to memorize tools. The goal is to learn how to keep systems safe when pressure, ambiguity, and AI-assisted speed all show up at the same time.

Who This Is For

  • platform engineers moving from “it works” to “it survives mistakes”
  • DevOps engineers who want stronger operating discipline, not more tooling hype
  • SREs who want concrete labs, guardrails, and incident-shaped lessons

How the Course Works

Each chapter is built around one production failure pattern:

  • what broke
  • why the shortcut looked reasonable
  • how the investigation becomes confusing
  • which guardrail restores a safe operating path

Every core lesson includes:

  • a written incident walkthrough
  • a hands-on lab
  • a quiz to confirm the operating rule
  • runbooks or scorecards where the topic needs them

This course teaches more than how to operate Kubernetes around applications. It also shows what a production-ready Kubernetes application should look like, so that rollout safety, observability, GitOps reconciliation, and incident response work correctly in the first place.

The course uses the SafeOps reference applications as concrete examples:

  • safeops-course/backend, a small production-shaped Go API with health probes, metrics, tracing hooks, chaos endpoints, and OpenAPI/Swagger support
  • safeops-course/frontend, a Vue-based frontend with container hardening, runtime config injection, and Kubernetes deployment packaging

Many of the application patterns used throughout those reference apps are inspired by Podinfo by Stefan Prodan, including:

  • readiness and liveness probes
  • graceful shutdown on interrupt signals
  • config and secret reload patterns
  • Prometheus and OpenTelemetry instrumentation
  • structured logging
  • 12-factor configuration
  • fault injection for safe drills
  • packaging and install paths with Timoni, Helm, and Kustomize
  • end-to-end validation with Kind and Helm
  • multi-arch images, signing, SBOMs, provenance, and CVE scanning
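As a minimal sketch of the first pattern in the list above, the following Go program keeps liveness and readiness as separate endpoints. The paths /healthz and /readyz, the Readiness type, and the use of an in-process test server are illustrative assumptions; they are not taken from the SafeOps backend or from Podinfo.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Readiness tracks whether the process should receive traffic.
// Hypothetical helper type for this sketch.
type Readiness struct{ ready atomic.Bool }

// SetReady flips the readiness flag.
func (r *Readiness) SetReady(v bool) { r.ready.Store(v) }

// StatusCode maps the flag to the HTTP status a readiness probe would see.
func (r *Readiness) StatusCode() int {
	if r.ready.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// newMux wires the two probe endpoints (paths are assumptions).
func newMux(r *Readiness) *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: the process is up and able to answer.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: the process is willing to take traffic right now.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(r.StatusCode())
	})
	return mux
}

func main() {
	ready := &Readiness{}
	ts := httptest.NewServer(newMux(ready))
	defer ts.Close()

	// Query both probes while not ready, then again once ready.
	for _, state := range []bool{false, true} {
		ready.SetReady(state)
		for _, path := range []string{"/healthz", "/readyz"} {
			resp, err := http.Get(ts.URL + path)
			if err != nil {
				fmt.Println("request failed:", err)
				return
			}
			resp.Body.Close()
			fmt.Printf("ready=%v %s -> %d\n", state, path, resp.StatusCode)
		}
	}
}
```

Keeping the flag separate from liveness means a service could call SetReady(false) when it receives an interrupt signal, so traffic drains before the process exits while the liveness endpoint keeps answering.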

Video assets are optional. The written lesson remains the primary source of truth, and the video should make the same lesson easier to absorb, not replace the material.

  1. Start with Intro: AI as a Very Well-Read Junior Engineer.
  2. Go through Chapters 01-14 in order.
  3. Run the lab before moving to the next chapter.
  4. Use the quiz to confirm the main guardrail rule before continuing.
  5. Move to the advanced modules only after the core path feels operationally natural.

Tracks

Core track:

  • Chapters 01-14 covering platform foundations, GitOps, CI/CD, security, observability, reliability, and on-call discipline

Advanced track:

  • Chapter 15: Supply Chain Security
  • Chapter 16: Admission Policy Guardrails
  • Chapter 17: Rollback and Data Migrations
  • Module: Progressive Delivery (Canary with Traefik + Flagger)

Reference appendices:

  • Appendix: Local Development Environment
  • Appendix: DNS and TLS Automation

References

Chapter 01: Blast Radius & the Shape of Safety

What this chapter gives you is the vocabulary you will use for the next sixteen chapters, and a map of how they fit together. Every time a later chapter introduces a new tool — Terraform plan review, SOPS-encrypted …

Chapter 02: Infrastructure as Code (IaC) with Kind

  • Describe Terraform state locking, drift detection, and the dangers of stale plans
  • Execute the plan-review-apply workflow with guard-terraform-plan.sh
  • Deploy a local Kind cluster for safe infrastructure rehearsal
  • Detect …

Chapter 03: Secrets Management (SOPS + Age)

  • Encrypt secrets with SOPS and Age before committing to Git
  • Trace the Flux decryption flow from encrypted YAML to cluster Secret
  • Execute secret rotation after a leak incident
  • Explain why git revert is not remediation for …

Chapter 04: GitOps & Version Promotion

  • Describe the Flux reconciliation model and environment overlays
  • Promote immutable images across develop, staging, and production without rebuild
  • Execute rollback by Git evidence rather than ad-hoc rebuilds
  • Configure …

Chapter 05: CI/CD & Developer Guardrails

  • Explain the Layered Defense model from workstation to cluster
  • Install and trigger pre-commit hooks to catch issues before push
  • Trace a change from workstation commit to cluster apply through all validation layers
  • Verify …

Checkpoint A: Your Delivery Pipeline

This page is a consolidation, not new material. Its purpose is to show you the assembled delivery pipeline and the guardrails you can now rely on. What You Have Built: You can now deliver code from a developer …

Chapter 06: Network Policies (Production Isolation)

  • Implement a default-deny NetworkPolicy for a namespace
  • Diagnose blocked traffic using kubectl and network policy selectors
  • Evaluate whether an allow rule follows least-privilege principles
  • Construct a rollback plan for a …

Chapter 07: Security Context & Pod Hardening

  • Configure non-root, read-only filesystem, dropped capabilities, and seccomp baseline
  • Debug permission failures without escalating to privileged mode
  • Compare a golden security baseline against an insecure manifest diff …

Chapter 08: Resource Management & QoS

  • Set requests, limits, and quotas per namespace for CPU and memory
  • Predict QoS class assignment under resource pressure
  • Distinguish OOMKilled behavior from node-pressure eviction
  • Justify scaling decisions with resource …
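To make the QoS objective concrete, here is a hedged Go sketch of how a pod's QoS class follows from its containers' requests and limits. The Resources and Container types and the QoSClass function are hypothetical names for illustration, a zero value is treated as "not set", and the sketch ignores init containers and pod-level resources.

```go
package main

import "fmt"

// Resources holds CPU (millicores) and memory (bytes) for one container.
// A zero value means "not set". Hypothetical type for this sketch.
type Resources struct {
	CPUMilli int64
	MemBytes int64
}

// Container pairs the requests and limits of a single container.
type Container struct {
	Requests Resources
	Limits   Resources
}

// QoSClass applies the documented rules:
//   - BestEffort: no container sets any request or limit
//   - Guaranteed: every container has CPU and memory limits, and
//     requests equal limits (an unset request defaults to the limit)
//   - Burstable: everything else
func QoSClass(containers []Container) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		req, lim := c.Requests, c.Limits
		if req.CPUMilli != 0 || req.MemBytes != 0 || lim.CPUMilli != 0 || lim.MemBytes != 0 {
			anySet = true
		}
		// An unset request is defaulted to the matching limit.
		if req.CPUMilli == 0 {
			req.CPUMilli = lim.CPUMilli
		}
		if req.MemBytes == 0 {
			req.MemBytes = lim.MemBytes
		}
		if lim.CPUMilli == 0 || lim.MemBytes == 0 ||
			req.CPUMilli != lim.CPUMilli || req.MemBytes != lim.MemBytes {
			guaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	full := Resources{CPUMilli: 500, MemBytes: 256 << 20}
	fmt.Println(QoSClass([]Container{{Requests: full, Limits: full}}))
	fmt.Println(QoSClass([]Container{{Requests: Resources{CPUMilli: 100}}}))
	fmt.Println(QoSClass([]Container{{}}))
}
```

The class matters under node pressure because BestEffort pods are the first eviction candidates and Guaranteed pods the last, which is why a single container without limits can quietly downgrade a whole pod.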

Chapter 09: Availability Engineering (HPA + PDB)

  • Configure HPA bounds and PDB constraints for safe scaling
  • Explain why minReplicas: 1 is a reliability regression for critical services
  • Coordinate rolling updates with node drain events using PDB
  • Design planned disruption …
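One plausible way to see why minReplicas: 1 hurts a critical service is the arithmetic a PodDisruptionBudget performs: allowed disruptions are the healthy pods minus the required minimum, floored at zero. The Go sketch below assumes a PDB expressed as an absolute minAvailable count; the function name and the scenarios are illustrative, not taken from the chapter.

```go
package main

import "fmt"

// AllowedDisruptions returns how many pods may be voluntarily evicted
// right now, given the healthy pod count and a PDB's minAvailable value.
func AllowedDisruptions(healthyPods, minAvailable int) int {
	if d := healthyPods - minAvailable; d > 0 {
		return d
	}
	return 0
}

func main() {
	scenarios := []struct {
		name         string
		healthy      int
		minAvailable int
	}{
		{"1 replica, minAvailable 1", 1, 1},
		{"2 replicas, minAvailable 1", 2, 1},
		{"3 replicas, minAvailable 2", 3, 2},
	}
	for _, s := range scenarios {
		fmt.Printf("%s: %d eviction(s) allowed\n",
			s.name, AllowedDisruptions(s.healthy, s.minAvailable))
	}
}
```

With a single replica the choice is between a PDB that blocks every node drain and no PDB at all, in which case a drain takes the service to zero pods. Two or more replicas leave room for one eviction at a time.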

Checkpoint B: Your Runtime Safety Net

This page is a consolidation, not new material. Its purpose is to show how the four hardening chapters (06-09) work together to contain failure. What You Have Built: Your workloads now run inside a layered safety net. …

Chapter 10: Observability (Metrics, Logs, Traces)

  • Follow the metrics-to-traces-to-logs investigation path during an incident
  • Correlate signals across service boundaries using trace_id
  • Configure ServiceMonitor for automatic Prometheus discovery
  • Distinguish symptom …

Chapter 11: Backup & Restore Basics

  • Verify backup success independently from restore success
  • Execute CloudNativePG backup and point-in-time restore
  • Apply the restore verification checklist as the actual safety bar
  • Explain why untested backups are not …

Chapter 12: Controlled Chaos

  • Design a bounded chaos drill with explicit kill switch and time window
  • Execute deterministic failure injection before advancing to random chaos
  • Capture evidence during controlled disruption for post-drill analysis …

Chapter 13: AI-Assisted SRE Guardian

  • Describe k8s-ai-monitor’s normalization, deduplication, and routing model
  • Configure LLM boundaries including sanitization, budget caps, and no-mutation rules
  • Use API, CLI, and MCP surfaces for safe investigation …

Chapter 14: 24/7 Production SRE

  • Assign incident roles: Commander, Responder, Comms Lead, Scribe
  • Apply the 4-tier severity matrix for triage decisions
  • Conduct a blameless postmortem focused on system gaps, not individual blame
  • Convert incident findings …

Chapter 15: Supply Chain Security

  • Generate SBOM and sign container images with attestation evidence
  • Verify image attestation at admission time before deployment
  • Plan an audit-first rollout toward enforceable supply-chain trust
  • Trace the supply chain …

Chapter 16: Admission Policy Guardrails

  • Deploy Kyverno policy packs in audit mode for safe rollout
  • Graduate policies from audit to enforce using evidence from audit logs
  • Configure break-glass exceptions with expiry and evidence requirements
  • Evaluate policy …

Chapter 17: Rollback & Data Migrations

  • Apply the expand/contract migration model for safe schema changes
  • Use feature flags to maintain rollback capability during releases
  • Configure destructive migration approval gates with evidence requirements
  • Design a …

Module: Progressive Delivery

  • Configure Flagger canary analysis with weighted rollout progression
  • Define Prometheus-driven abort criteria for canary deployments
  • Execute controlled traffic shifting via Traefik ingress-level control
  • Analyze canary …
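Weighted progression with an abort threshold can be illustrated with a small state machine. The Go sketch below is a simplification and not Flagger's implementation: the field names are loosely modeled on Flagger's stepWeight, maxWeight, and threshold analysis settings, and the success-rate inputs stand in for a Prometheus query result.

```go
package main

import "fmt"

// Canary holds the rollout state. Hypothetical type for this sketch.
type Canary struct {
	Weight     int // current share of traffic sent to the canary (percent)
	StepWeight int // how much to add after each passing check
	MaxWeight  int // weight at which the canary is promoted
	Failures   int // failed checks so far
	Threshold  int // failed checks that trigger a rollback
}

// Step evaluates one analysis interval and returns the decision taken:
// "advance", "hold", "promote", or "rollback".
func (c *Canary) Step(successRate, minSuccessRate float64) string {
	if successRate < minSuccessRate {
		c.Failures++
		if c.Failures >= c.Threshold {
			c.Weight = 0 // send all traffic back to the stable version
			return "rollback"
		}
		return "hold"
	}
	c.Weight += c.StepWeight
	if c.Weight >= c.MaxWeight {
		c.Weight = c.MaxWeight
		return "promote"
	}
	return "advance"
}

func main() {
	c := &Canary{StepWeight: 10, MaxWeight: 50, Threshold: 2}
	// Illustrative success rates, one per analysis interval.
	for _, rate := range []float64{99.8, 99.5, 97.0, 99.9, 96.5} {
		decision := c.Step(rate, 99.0)
		fmt.Printf("rate=%.1f decision=%s weight=%d%%\n", rate, decision, c.Weight)
		if decision == "rollback" || decision == "promote" {
			break
		}
	}
}
```

The point of the threshold is that a single bad interval pauses the rollout instead of aborting it, while repeated failures remove the canary from traffic without waiting for a human decision.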

Appendix: DNS and TLS Automation

Why This Appendix Exists: The main course keeps early chapters focused on platform safety and GitOps. This appendix explains the edge automation layer used by the SafeOps platform: external-dns manages DNS records from …

Appendix: Local Development Environment

Why This Appendix Exists: The main course teaches the production path first. This appendix shows the fastest safe feedback loop for local experimentation:

  • a Terraform-managed kind cluster
  • generated kubeconfig and context …

Intro: AI as a Very Well-Read Junior Engineer

This intro is about using AI in DevOps / SysOps / SRE without increasing risk or blast radius. The Mental Model: AI is the most well-read junior engineer you will ever work with:

  • Knows tooling, flags, YAML, Terraform, Helm.
  • Works …

Production-Grade Kubernetes with Guardrails & AI-Assisted SRE

Core Track (14 Chapters). AI Changes Two Things at Once (Beginner · ~2h):

  • correlated blast radius from bundling unrelated changes
  • AI as a brave junior: fast, useful, but unsafe without guardrails
  • context checks, …