Guardrails-First Course Materials

Current Status

This directory is in an active draft-delivery state: the core-track and advanced-track packs are already present.

Available now:

  • 00-intro-ai-as-junior.md - course framing and mental model.
  • CURRICULUM.md - approved 12-chapter core structure + advanced track.
  • _lesson-template.md - standard lesson structure for guardrails-first labs.
  • chapter-01-introduction/README.md - first complete guardrails lesson with demo commands.
  • chapter-02-iac/{README,lab,quiz}.md - first IaC chapter draft with guarded Terraform workflow.
  • chapter-03-secrets-management/{README,lab,quiz}.md - SOPS lesson pack for encrypt -> commit -> Flux decrypt/apply.
  • chapter-04-gitops/{README,lab,quiz}.md - GitOps promotion pack (develop -> staging -> production) with rollback drill.
  • chapter-05-network-policies/{README,lab,quiz}.md - isolation pack with default deny, DNS allow, ingress allow, and blocked-traffic debug.
  • chapter-06-security-context/{README,lab,quiz}.md - pod hardening pack (non-root, read-only root FS, dropped caps, seccomp).
  • chapter-07-resource-management/{README,lab,quiz}.md - requests/limits, quota/limitrange, QoS and OOM analysis pack.
  • chapter-08-availability-engineering/{README,lab,quiz}.md - HPA/PDB availability pack with drain preflight checks.
  • chapter-09-observability/{README,lab,runbook-incident-debug,quiz}.md - metrics/logs/traces workflow with incident debug path.
  • chapter-10-backup-restore/{README,lab,runbook,quiz}.md - CNPG backup/restore basics with simulation workflow.
  • chapter-11-controlled-chaos/{README,lab,runbook-game-day,scorecard,quiz}.md - deterministic failure drills + guarded Chaos Monkey in develop.
  • chapter-12-ai-assisted-sre-guardian/{README,lab,runbook-guardian,quiz}.md - draft advanced-track guardian chapter mapped to k8s-ai-monitor.
  • chapter-13-24-7-production-sre/{README,lab,runbook-oncall,postmortem-template,quiz}.md - on-call lifecycle and blameless operations module.
  • chapter-14-supply-chain-security/{README,lab,runbook-supply-chain,quiz}.md - advanced supply-chain guardrails pack (SBOM, signing, verification).
  • chapter-15-admission-policy-guardrails/{README,lab,runbook-admission-policy,quiz}.md - advanced policy-as-code enforcement pack (deny risky manifests).
  • chapter-16-rollback-data-migrations/{README,lab,runbook-rollback-migrations,quiz}.md - advanced rollback-safe schema migration operations pack.
  • module-linkerd-progressive-delivery/{README,lab,runbook-linkerd-progressive-delivery,quiz}.md - advanced mesh and progressive delivery module (canary/A-B).
  • Flux scaffolds for advanced modules:
    • flux/infrastructure/policy/kyverno/ + policy packs in flux/infrastructure/policy/packs/
    • flux/infrastructure/progressive-delivery/{linkerd,flagger,develop}/
    • bootstrap wiring in flux/bootstrap/flux-system/infrastructure.yaml (controllers enabled, sample canary pack opt-in)
  • Local Git guardrails:
    • .pre-commit-config.yaml includes flux-kustomize-validate
    • scripts/flux-kustomize-validate.sh (yq + kustomize + kubeconform + Flux CRD schemas)

Still in progress:

  • instructor-grade solution keys / answer guides per lab
  • chapter numbering and legacy placeholder cleanup across chapter-*
  • chapter-16 hands-on wiring to real backend DB migration flow (after backend migration module is implemented)

Course Goal

Teach practical DevOps/SRE workflows where AI increases speed without increasing production risk.

Core model:

  • AI proposes.
  • Humans decide.
  • Guardrails enforce safe execution paths.

See ../ai-code-of-conduct.md for repository-wide rules.

Planned Structure

The canonical structure is now the 12-chapter core program in CURRICULUM.md:

  1. Production Mindset & Guardrails
  2. Infrastructure as Code (IaC)
  3. Secrets Management (SOPS)
  4. GitOps & Version Promotion
  5. Network Policies (Production Isolation)
  6. Security Context & Pod Hardening
  7. Resource Management & QoS
  8. Availability Engineering (HPA + PDB)
  9. Observability
  10. Backup & Restore Basics
  11. Controlled Chaos
  12. 24/7 Production SRE

Advanced track (Part 2):

  1. Supply Chain Security
  2. Admission Policy Guardrails
  3. AI-Assisted SRE Guardian
  4. Linkerd + Progressive Delivery (Canary / A-B)
  5. Rollback and Data Migrations

Authoring Workflow

  1. Start each new chapter from _lesson-template.md.
  2. Keep each lesson tied to one failure mode and one guardrail story.
  3. Include:
    • unsafe path (what breaks and why)
    • safe path (checks, approvals, rollback)
    • reproducible demo commands
  4. Prefer deterministic labs that can run on local kind and map to Hetzner workflows.
  1. Content hygiene: align chapter numbering and clearly mark/remove legacy placeholder directories.
  2. Instructor assets: add lab solution keys and scoring rubrics for chapters 09, 11, 13, 14, 15, and 16.
  3. Advanced track enablement: add a documented non-production rollout path for policy packs (Audit -> Enforce) and keep production opt-in.
  4. Progressive delivery labs: enable develop canary sample only during lab windows and add explicit verify/rollback evidence checklist.
  5. Backend integration: wire chapter-16-rollback-data-migrations to real backend DB migration workflow once migration tooling is added.

Pending Decisions

  1. Final course duration estimate (hours).
  2. Target learner level (mid/senior split).
  3. Concrete lab depth per chapter.
  4. Opening and closing story arc for delivery impact.

Notes

If a chapter folder is present but empty, treat it as planned scope, not completed material. Current chapter-* directory numbering reflects existing draft files and may lag behind canonical curriculum ordering.

Chapter 01: AI Changes Two Things at Once

Incident Hook

A fast “AI-assisted” hotfix bundles two unrelated changes in one push:

  • a backend image tag bump for develop
  • an ingress manifest change intended for staging

The change looks harmless in review because each diff is small. In practice, the combined blast radius is larger: routing breaks while backend behavior changes at the same time, making rollback and triage slower.

What AI Would Propose (Brave Junior)

  • “Update image and ingress together to save one pipeline run.”
  • “Apply quickly to unblock the demo.”
  • “Skip context checks; it is just develop.”

Why it sounds reasonable:

Chapter 02: Infrastructure as Code (IaC)

Why This Chapter Exists

In production, infrastructure mistakes are expensive and fast-moving. IaC is not only about automation speed. It is about:

  • repeatability
  • reviewability
  • rollback paths
  • controlled blast radius

This chapter introduces a guardrails-first Terraform workflow for Kubernetes platforms.

Learning Objectives

By the end of this chapter, learners can:

  • explain module boundaries and Terraform folder structure in this repo
  • run a safe plan -> review -> apply workflow
  • explain why remote state and locking are non-negotiable in team environments
  • detect drift and decide whether to reconcile or rollback
  • execute safe destroy practices with explicit scope checks

Repo Mapping

Relevant paths:

Chapter 03: Secrets Management (SOPS)

Why This Chapter Exists

Plaintext secrets in Git are a production incident waiting to happen. This chapter establishes one safe path:

  • secrets are encrypted before commit
  • Flux decrypts in-cluster with sops-age
  • key material is never committed
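
The "encrypted before commit" rule can be checked mechanically before a push. The following is a minimal sketch, not the repo's actual hook. It assumes secrets are YAML files encrypted with SOPS, whose output carries a top-level `sops:` metadata block and `ENC[AES256_GCM,...]` values. The function name and both sample manifests are hypothetical.

```python
import re

# SOPS wraps each encrypted value as ENC[AES256_GCM,data:...,iv:...,tag:...,type:...].
ENC_VALUE = re.compile(r":\s*ENC\[AES256_GCM,")
# SOPS appends a top-level "sops:" metadata mapping to encrypted YAML files.
SOPS_BLOCK = re.compile(r"^sops:\s*$", re.MULTILINE)


def looks_sops_encrypted(yaml_text: str) -> bool:
    """Heuristic: True if the text has SOPS metadata and at least one ENC[...] value."""
    return bool(SOPS_BLOCK.search(yaml_text)) and bool(ENC_VALUE.search(yaml_text))


# Hypothetical sample inputs, for illustration only.
PLAINTEXT = """\
apiVersion: v1
kind: Secret
stringData:
  API_KEY: hunter2
"""

ENCRYPTED = """\
apiVersion: v1
kind: Secret
stringData:
  API_KEY: ENC[AES256_GCM,data:abc,iv:def,tag:ghi,type:str]
sops:
  age: []
"""
```

A text heuristic like this only catches the obvious case (a Secret committed with no encryption at all). It does not prove that every field is encrypted or that the right recipients were used.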

The Incident Hook

A teammate commits a plaintext API key to fix a failing deploy quickly. The key is now exposed in Git history, CI logs, and local clones. Reverting the commit is not enough because the secret has already leaked. The response now includes rotation, audit, and cross-team coordination under pressure.

Chapter 04: GitOps & Version Promotion

Why This Chapter Exists

Production safety depends on controlled promotion, not ad-hoc rebuilds. This chapter defines one deployment model:

  • develop deploys develop-* images
  • staging deploys staging-* images
  • production deploys production-* images from explicit promotion
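
The prefix rule above is simple enough to express as a check. The sketch below is illustrative only: the function name and the fail-closed behaviour for unknown environments are assumptions, and the repo may enforce the rule differently (for example in Flux image policies).

```python
# Required tag prefix per environment, taken from the deployment model above.
ALLOWED_PREFIX = {
    "develop": "develop-",
    "staging": "staging-",
    "production": "production-",
}


def tag_allowed(environment: str, image: str) -> bool:
    """Return True if the image tag carries the prefix required for the environment."""
    prefix = ALLOWED_PREFIX.get(environment)
    if prefix is None:
        return False  # unknown environment: fail closed (assumption)
    _name, sep, tag = image.rpartition(":")
    if not sep or "/" in tag:
        # No tag at all, or the last colon belonged to a registry port.
        return False
    return tag.startswith(prefix)
```

For example, `tag_allowed("production", "backend:staging-1a2b3c")` is rejected, which is the point: production only accepts artifacts that went through explicit promotion.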

The Incident Hook

Under incident pressure, a team rebuilds “the same” code for production. The binary differs from the one tested in staging because of dependency drift and build-time variance. Rollback is confusing because the deployed artifact is not the one that was tested. Time is lost proving artifact lineage instead of restoring service.

Chapter 05: Network Policies (Production Isolation)

Why This Chapter Exists

Without network isolation, one compromised pod can move laterally across environments. This chapter introduces a safe baseline:

  • default deny
  • explicit allow rules
  • DNS and ingress paths opened intentionally
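
As a rough illustration of the baseline, the sketch below builds two `networking.k8s.io/v1` NetworkPolicy manifests as Python dictionaries: a default deny and an explicit DNS egress allow. The policy names and helper functions are hypothetical and are not taken from the lab files. The DNS rule here allows port 53 to any destination, so a real policy would likely narrow it with a `to` selector.

```python
import json


def default_deny(namespace: str) -> dict:
    """NetworkPolicy that selects every pod in the namespace and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # empty selector matches all pods in the namespace
            "policyTypes": ["Ingress", "Egress"],  # no rules listed, so nothing is allowed
        },
    }


def allow_dns_egress(namespace: str) -> dict:
    """NetworkPolicy that re-opens DNS (port 53, UDP and TCP) for every pod."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [
                {
                    "ports": [
                        {"protocol": "UDP", "port": 53},
                        {"protocol": "TCP", "port": 53},
                    ]
                }
            ],
        },
    }


def render(manifest: dict) -> str:
    """Serialize a manifest as JSON, which kubectl accepts alongside YAML."""
    return json.dumps(manifest, indent=2)
```

Network policies are additive, so the DNS allow sits next to the default deny rather than replacing it.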

The Incident Hook

A debug pod in develop reaches internal services it should never touch. No exploit sophistication is needed, only open east-west traffic. When an incident starts, responders cannot quickly prove or limit the blast radius. Network policies turn this into an auditable allowlist model.

Chapter 06: Security Context & Pod Hardening

Why This Chapter Exists

Container defaults are not production-safe. This chapter enforces baseline pod hardening:

  • non-root execution
  • read-only root filesystem where possible
  • dropped Linux capabilities
  • runtime-default seccomp
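
The four hardening items map onto standard `securityContext` fields, so a manifest can be checked for gaps before it is applied. This is a sketch under assumptions: the function name and message wording are hypothetical, and the input is a pod spec already parsed into a dictionary. It relies on the Kubernetes rule that `runAsNonRoot` and `seccompProfile` may be set at pod level and overridden per container, while `readOnlyRootFilesystem` and `capabilities` are container-level only.

```python
def hardening_gaps(pod_spec: dict) -> list[str]:
    """List the baseline hardening settings that each container is missing."""
    gaps = []
    pod_sc = pod_spec.get("securityContext", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "?")
        sc = container.get("securityContext", {})

        # Container-level value wins; otherwise fall back to the pod-level value.
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            gaps.append(f"{name}: runAsNonRoot is not true")

        if not sc.get("readOnlyRootFilesystem", False):
            gaps.append(f"{name}: readOnlyRootFilesystem is not true")

        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            gaps.append(f"{name}: capabilities.drop does not include ALL")

        seccomp = sc.get("seccompProfile", pod_sc.get("seccompProfile", {}))
        if seccomp.get("type") != "RuntimeDefault":
            gaps.append(f"{name}: seccompProfile.type is not RuntimeDefault")
    return gaps
```

An empty result means the baseline is met. It says nothing about workloads that legitimately need a writable root filesystem, which is why the list above says "where possible".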

The Incident Hook

A container compromise gives an attacker shell access inside a pod. If the pod runs with broad privileges, escalation is fast. If the security context is hardened, attacker movement is constrained. This chapter teaches those constraints as default behavior.

Chapter 07: Resource Management & QoS

Why This Chapter Exists

Unbounded workloads create noisy-neighbor incidents and unpredictable recovery. This chapter enforces resource discipline:

  • requests/limits per container
  • namespace quotas
  • predictable QoS behavior under pressure

Guardrails

  • Every workload must define CPU/memory requests and limits.
  • Namespaces must enforce LimitRange and ResourceQuota.
  • OOM and throttling analysis must happen before scaling decisions.

Repo Mapping

  • App resources:
    • flux/apps/backend/base/deployment.yaml
    • flux/apps/frontend/base/deployment.yaml
  • Namespace quotas/limits:
    • flux/infrastructure/resource-management/develop/
    • flux/infrastructure/resource-management/staging/
    • flux/infrastructure/resource-management/production/
  • Flux wiring:
    • flux/bootstrap/flux-system/infrastructure.yaml
    • flux/bootstrap/flux-system/apps.yaml

Current Implementation (This Repo)

  • Backend and frontend define CPU/memory/ephemeral-storage requests+limits.
  • develop, staging, production have LimitRange and ResourceQuota via Flux.
  • Apps depend on resource-management Kustomizations before reconcile.

Lab Files

  • lab.md
  • quiz.md

Done When

  • learner can explain Burstable vs Guaranteed vs BestEffort with real manifests
  • learner can verify quota/limitrange enforcement in cluster
  • learner can diagnose OOM/resource pressure from pod events and metrics
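
The three QoS classes named above follow from how CPU and memory requests and limits are set. The sketch below approximates the Kubernetes assignment rules for regular containers. It is a simplification: it ignores init containers and compares quantities as plain strings (so `500m` and `0.5` would not be treated as equal). The function name is hypothetical.

```python
QOS_RESOURCES = ("cpu", "memory")  # only these two resources decide the QoS class


def qos_class(containers: list[dict]) -> str:
    """Approximate the QoS class Kubernetes would assign to a pod."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        # A missing request defaults to the limit when a limit is set.
        requests = {**limits, **resources.get("requests", {})}
        for resource in QOS_RESOURCES:
            if resource in requests:
                any_set = True
            # Guaranteed needs a limit for both resources, with request == limit.
            if resource not in limits or requests[resource] != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

A pod with one fully specified container and one container without any resources comes out as Burstable, which is a common surprise when a sidecar is added without limits.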

Chapter 08: Availability Engineering (HPA + PDB)

Why This Chapter Exists

Replicas alone do not guarantee availability during disruption. This chapter combines:

  • HPA for load-based scaling
  • PDB for controlled voluntary disruptions
  • rollout/drain awareness

Guardrails

  • staging/production start from 2 replicas for critical services.
  • each service has HPA bounds (minReplicas, maxReplicas) and resource targets.
  • each service has PDB to prevent unsafe disruption.
  • node drain or rollout is never executed without checking PDB/HPA state.
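
The last guardrail, checking PDB state before a drain, can be reduced to one number per PDB: `status.disruptionsAllowed`. The sketch below assumes PDB objects shaped like the output of `kubectl get pdb -o json`, already parsed into dictionaries. The function names are hypothetical.

```python
def disruptions_allowed(current_healthy: int, desired_healthy: int) -> int:
    """Voluntary evictions the PDB permits right now (never negative)."""
    return max(0, current_healthy - desired_healthy)


def drain_blockers(pdbs: list[dict]) -> list[str]:
    """Names of PDBs that would block an eviction, i.e. allow zero disruptions."""
    return [
        pdb["metadata"]["name"]
        for pdb in pdbs
        if pdb["status"]["disruptionsAllowed"] < 1
    ]
```

With 2 healthy replicas and a PDB that requires 2 healthy pods, `disruptions_allowed(2, 2)` is 0 and a drain would stall on that workload. Requiring only 1 healthy pod gives 1 allowed disruption, which is why replica baseline and PDB settings have to be read together.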

Repo Mapping

  • Backend overlays:
    • flux/apps/backend/develop/
    • flux/apps/backend/staging/
    • flux/apps/backend/production/
  • Frontend overlays:
    • flux/apps/frontend/overlays/develop/
    • flux/apps/frontend/overlays/staging/
    • flux/apps/frontend/overlays/production/

Current Implementation (This Repo)

  • HPA (autoscaling/v2) added for backend and frontend in all three environments.
  • PDB (policy/v1) added for backend and frontend in all three environments.
  • staging/production baseline replicas are 2 for backend and frontend.

Lab Files

  • lab.md
  • quiz.md

Done When

  • learner can verify HPA target/bounds and current scaling state
  • learner can verify PDB allowed disruptions before node drain
  • learner can explain interaction: HPA, PDB, rollout, and drain
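
To reason about the HPA side of that interaction, it helps to see the scaling calculation in isolation. The sketch below follows the formula documented for the Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the configured bounds, with the default 10% tolerance. It ignores stabilization windows, scaling policies, and unready pods. The function name and the numbers in the usage note are illustrative only, not values from this repo.

```python
import math


def hpa_desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Replica count the HPA would aim for, before behavior policies are applied."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no scaling
    else:
        desired = math.ceil(current_replicas * ratio)
    # minReplicas / maxReplicas always win over the computed value.
    return max(min_replicas, min(max_replicas, desired))
```

For instance, 2 replicas at 90% utilization against a 60% target yield 3 replicas. If the result hits `max_replicas`, the HPA cannot add capacity, and any PDB headroom during a drain has to come from the pods that already exist.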

Chapter 09: Observability (Metrics, Logs, Traces)

Why This Chapter Exists

Without correlated signals, incidents become guesswork. This chapter defines the minimum production baseline:

  • metrics for symptom detection
  • traces for path analysis
  • logs for evidence

Scope Decision (MVP)

  • No in-cluster OpenTelemetry Collector in this phase.
  • Frontend and backend export telemetry directly to Uptrace.
  • Target investigation path: frontend -> backend for now, extended to the database once the DB layer is introduced.

References:

  • docs/observability/uptrace-cloud.md
  • docs/observability/uptrace-e2e-plan.md

The Incident Hook

Users report intermittent 5xx errors and slow responses. Dashboards show elevated latency, but the root cause is unclear. Without trace correlation, the team jumps between pods and logs blindly. With baseline observability, on-call narrows down the cause in minutes.

Chapter 10: Backup & Restore Basics

Why This Chapter Exists

Backups are useful only if restore is tested and repeatable. This chapter uses CloudNativePG as a real stateful target with PVC-backed PostgreSQL.

Data Plane Choice

CloudNativePG setup in this repo:

  • operator: flux/infrastructure/data/cnpg-operator/
  • clusters: flux/infrastructure/data/cnpg-clusters/{develop,staging,production}/
  • each environment has dedicated Cluster + ScheduledBackup

Backup Credential Model

Before SOPS integration, bootstrap credentials are created by Terraform:

  • secret name: cnpg-backup-s3
  • namespaces: develop, staging, production
  • keys: ACCESS_KEY_ID, ACCESS_SECRET_KEY, BUCKET (+ optional ENDPOINT, REGION)

Terraform source:

Chapter 11: Controlled Chaos

Why This Chapter Exists

Production resilience is not proven in calm conditions. This chapter validates behavior under controlled failures with explicit blast-radius limits.

Scope

Failure classes in this chapter:

  • crash loop (/panic)
  • elevated 5xx (/status/500)
  • random pod termination (Chaos Monkey)

Current implementation focus:

  • deterministic drills first
  • Chaos Monkey in develop with kill switch and strict target allowlist

Chaos Monkey (MVP)

Flux path:

  • flux/infrastructure/chaos/develop/

Safety controls:

  • namespace scope: develop only (RBAC Role in develop)
  • target scope: app=frontend or app=backend
  • schedule: every 15 minutes
  • window: UTC 10-16
  • kill switch: spec.suspend: true on the CronJob (suspended by default)
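
Taken together, the safety controls form a gate that every termination must pass. The sketch below is not the CronJob's actual script. It is a hypothetical Python rendering of the same checks, and it assumes the window "UTC 10-16" means hours 10 through 15 inclusive (end hour exclusive).

```python
from datetime import datetime, timezone

ALLOWED_NAMESPACE = "develop"
ALLOWED_APPS = {"frontend", "backend"}
WINDOW_START_HOUR = 10  # UTC, inclusive
WINDOW_END_HOUR = 16    # UTC, treated as exclusive (assumption)


def may_terminate(
    namespace: str, labels: dict, now: datetime, suspended: bool
) -> tuple[bool, str]:
    """Decide whether a pod may be terminated; `now` must be timezone-aware."""
    if suspended:
        return False, "kill switch engaged"
    if namespace != ALLOWED_NAMESPACE:
        return False, "namespace not allowed"
    if labels.get("app") not in ALLOWED_APPS:
        return False, "target not in allowlist"
    hour = now.astimezone(timezone.utc).hour
    if not (WINDOW_START_HOUR <= hour < WINDOW_END_HOUR):
        return False, "outside chaos window"
    return True, "ok"
```

The kill switch is checked first on purpose: when the job is suspended, nothing else should matter.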

Guardrails

  • Never run uncontrolled chaos in staging/production.
  • One failure injection per run.
  • Evidence-first triage: metrics -> traces -> logs.
  • Every drill must end with recovery verification and a hardening action.

Lab Files

  • lab.md
  • runbook-game-day.md
  • scorecard.md
  • quiz.md

Handoff to Chapter 12 (AI Guardian)

Chaos Monkey emits structured log events in CronJob output. In Chapter 12, Guardian watchers consume these events and classify:

Chapter 12: AI-Assisted SRE Guardian (Draft)

Why This Chapter Exists

Chaos testing and alerts generate noise unless incidents are normalized and prioritized. This chapter introduces an AI-assisted guardian that analyzes incidents, proposes actions, and escalates safely without auto-fixing production.

Scope (Current Draft)

Implementation target is ../k8s-ai-monitor/:

  • Kopf operator handlers for events and Flux objects
  • scanner loops for pod/pvc/certificate/endpoint
  • LLM analysis with strict JSON schema
  • incident lifecycle backend (SQLite preferred)
  • confidence-based human escalation

Guardian Responsibilities

  1. Detect:
    • Kubernetes Warning events
    • Flux stalled conditions
    • periodic scanner findings
  2. Analyze:
    • collect structured context
    • sanitize sensitive data
    • enforce context budget
    • call LLM for structured root-cause hypotheses
  3. Decide:
    • create/update incident record
    • deduplicate repeated noise
    • escalate recurring/persistent incidents
  4. Notify:
    • send structured alert
    • expose incident APIs for ack/resolve
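
The Decide step (deduplicate, then escalate on recurrence) can be sketched with a fingerprint per incident source. Everything below is hypothetical and is not the k8s-ai-monitor implementation: the class and field names, the in-memory dictionary standing in for the incident store, and the escalation threshold of 3 are all assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Incident:
    fingerprint: str
    count: int = 1
    escalated: bool = False


def fingerprint(namespace: str, kind: str, name: str, reason: str) -> str:
    """Stable short ID so repeated events about the same object collapse together."""
    raw = "|".join((namespace, kind, name, reason))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


class IncidentStore:
    """In-memory stand-in for a persistent incident store."""

    def __init__(self, escalate_after: int = 3) -> None:
        self.escalate_after = escalate_after  # hypothetical threshold
        self._incidents: dict[str, Incident] = {}

    def record(self, fp: str) -> Incident:
        """Create the incident on first sight, otherwise bump its counter."""
        incident = self._incidents.get(fp)
        if incident is None:
            incident = Incident(fingerprint=fp)
            self._incidents[fp] = incident
        else:
            incident.count += 1
        if incident.count >= self.escalate_after:
            incident.escalated = True
        return incident
```

Three `record` calls with the same fingerprint produce one incident with `count == 3` and `escalated == True`, instead of three separate alerts.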

Guardrails

  • AI proposes; human approves remediation.
  • No autonomous write-back to production workloads.
  • Confidence < threshold implies explicit human review.
  • Secret/token redaction is mandatory before LLM call.
  • Rate and cost limits are mandatory.
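
Two of these guardrails, redaction before the LLM call and confidence-based review, are easy to show in miniature. The sketch below is illustrative and does not reproduce the project's sanitizer. The regular expressions cover only a few obvious patterns, and the 0.7 threshold and the route names are hypothetical.

```python
import re

# Bearer tokens, e.g. "Authorization: Bearer abc.def".
BEARER = re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/-]+=*")
# key=value or key: value pairs whose key name hints at a credential.
KEY_VALUE = re.compile(
    r"(?i)([\w-]*(?:password|token|secret|api[_-]?key)[\w-]*)(\s*[=:]\s*)(\S+)"
)


def redact(text: str) -> str:
    """Mask likely credentials before the text leaves the cluster."""
    text = BEARER.sub("Bearer [REDACTED]", text)
    return KEY_VALUE.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)


def route(confidence: float, threshold: float = 0.7) -> str:
    """Low confidence goes to explicit human review; remediation is never automatic."""
    return "human_review" if confidence < threshold else "notify_with_proposal"
```

Pattern-based redaction is a floor, not a guarantee: secrets with unusual key names or embedded in free text will pass through, so the context sent to the LLM should also be minimized at the source.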

Repository Mapping

  • Guardian config: ../k8s-ai-monitor/src/config.py
  • Event handlers: ../k8s-ai-monitor/src/handlers/events.py, ../k8s-ai-monitor/src/handlers/flux.py
  • Scanner startup loops + HTTP API: ../k8s-ai-monitor/src/handlers/startup.py
  • Processing pipeline: ../k8s-ai-monitor/src/engine/pipeline.py
  • LLM schema + cost tracking: ../k8s-ai-monitor/src/engine/llm.py
  • Sanitizer: ../k8s-ai-monitor/src/engine/sanitizer.py
  • Incident store: ../k8s-ai-monitor/src/engine/store/sqlite.py

Lab Files

  • lab.md
  • runbook-guardian.md
  • quiz.md

Done When (MVP)

  • guardian catches one Chapter 11 chaos scenario
  • incident is persisted with structured analysis and confidence
  • on-call can ack/resolve incident via API
  • one escalation scenario is demonstrated (recurring or persistent)

Chapter 13: 24/7 Production SRE

Why This Chapter Exists

Tooling is not enough without operational discipline. This chapter defines how teams run incidents, reduce recurrence, and harden systems continuously.

Scope

  • on-call operating model
  • incident lifecycle and severity policy
  • recurring-problem management
  • blameless postmortem workflow
  • AI boundary policy in production

Core Principles

  1. Evidence first:
    • metrics + traces + logs before high-risk actions
  2. Blameless response:
    • focus on system conditions and guardrail gaps, not individuals
  3. Controlled escalation:
    • severity-based comms and ownership
  4. AI boundary:
    • AI can classify and recommend
    • humans own decisions and execution

Operating Model

  • Incident Commander (IC)
  • Primary Responder
  • Communications Owner
  • Scribe

Lab Files

  • lab.md
  • runbook-oncall.md
  • postmortem-template.md
  • quiz.md

Done When

  • learner can run a full incident timeline with roles and severity
  • learner can produce a complete blameless postmortem
  • learner can define hardening actions with owner and due date

Chapter 14: Supply Chain Security (Advanced)

Why This Chapter Exists

A successful CI build does not guarantee runtime trust. This chapter enforces a production rule: only verifiable artifacts may run.

The supply-chain baseline in this course is:

  • immutable artifact identity (digest or immutable tag)
  • SBOM generation
  • image signing and attestation
  • cluster-side verification before admission
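
The first baseline item, immutable artifact identity, is the easiest to verify from a manifest alone. The sketch below checks whether an image reference is pinned by a `sha256` digest. The function name and example image names are hypothetical. It does not cover the "immutable tag" alternative, which depends on registry settings rather than on the reference string.

```python
import re

# <name>[:tag]@sha256:<64 lowercase hex characters>
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image: str) -> bool:
    """True if the image reference identifies content by digest, not only by tag."""
    return bool(DIGEST_REF.match(image))
```

A digest pins what runs, but not who built it. Signature and attestation verification at admission remain separate checks.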

Learning Objectives

By the end of this chapter, learners can:

  • explain why “build once, promote many” is required for provenance
  • generate and verify SBOM/signature evidence for one artifact
  • run policy rollout in Audit -> Enforce phases in non-production
  • document deny evidence and remediation path

The Incident Hook

An urgent fix is rebuilt from a developer workstation and pushed with a familiar tag. The deploy appears normal, but during incident triage the team cannot prove which workflow produced the binary. Dependency baseline, SBOM lineage, and signer identity are unclear. Rollback confidence drops because artifact trust is uncertain.

Chapter 15: Admission Policy Guardrails (Advanced)

Why This Chapter Exists

Local checks (pre-commit, CI, review) reduce risk but can be bypassed. Admission control is the last enforcement point before runtime.

This chapter focuses on policy-as-code guardrails that block risky workloads even when upstream checks fail.
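
The difference between the Audit and Enforce phases is what happens to a violating manifest: in Audit it is admitted and reported, in Enforce it is denied. The sketch below imitates that behaviour in plain Python for two example rules (missing CPU/memory limits, mutable or missing image tag). It is not Kyverno policy syntax, and the function names and rule wording are hypothetical.

```python
def violations(pod_spec: dict) -> list[str]:
    """Collect rule violations for each container in a parsed pod spec."""
    found = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "?")
        image = container.get("image", "")
        limits = container.get("resources", {}).get("limits", {})
        for resource in ("cpu", "memory"):
            if resource not in limits:
                found.append(f"{name}: missing {resource} limit")
        # Only a colon in the last path segment is a tag separator.
        last_segment = image.rsplit("/", 1)[-1]
        tag = image.rpartition(":")[2] if ":" in last_segment else ""
        if "@sha256:" not in image and tag in ("", "latest"):
            found.append(f"{name}: mutable or missing image tag")
    return found


def admit(pod_spec: dict, mode: str) -> tuple[bool, list[str]]:
    """Audit admits and reports; Enforce denies when any violation exists."""
    found = violations(pod_spec)
    if mode == "Audit":
        return True, found
    if mode == "Enforce":
        return not found, found
    raise ValueError(f"unknown mode: {mode}")
```

Running the same manifests through Audit first shows which existing workloads would be denied, which is the evidence needed before switching a rule to Enforce.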

Learning Objectives

By the end of this chapter, learners can:

  • explain why cluster-side policy is mandatory in production systems
  • roll out Kyverno rules with Audit -> Enforce safely
  • troubleshoot deny events and remediate manifests correctly
  • run controlled break-glass exceptions with expiry and audit trail

The Incident Hook

Under incident pressure, a workload is deployed with missing limits, mutable tags, and a weak security context. Workstation hooks were skipped and review focused on speed. The pod starts in a risky configuration and causes noisy-neighbor impact. Recovery is slowed because the team lacks clear deny/exception discipline.

Chapter 16: Rollback and Data Migrations (Advanced)

Why This Chapter Exists

Application rollback is easy only when database state is compatible. Most production rollback failures happen at the boundary between application version and schema version.

This chapter defines a safe migration discipline:

  • backward-compatible schema first
  • application rollout second
  • destructive schema changes last
  • explicit rollback windows and feature flag gates
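
The ordering rule above (expand, then rollout, then contract) can be checked against a release plan before anything runs. The sketch below assumes a plan is a list of steps, each with a `kind` of `expand`, `rollout`, or `contract`, plus a `rollback_window_closed` flag on destructive steps. These field names and the function are hypothetical and are not tied to any migration tool.

```python
PHASE_ORDER = {"expand": 0, "rollout": 1, "contract": 2}


def plan_problems(steps: list[dict]) -> list[str]:
    """Return ordering and gating problems in a release plan (empty list = OK)."""
    problems = []
    highest_seen = -1
    for index, step in enumerate(steps):
        kind = step.get("kind")
        rank = PHASE_ORDER.get(kind)
        if rank is None:
            problems.append(f"step {index}: unknown kind {kind!r}")
            continue
        if rank < highest_seen:
            problems.append(f"step {index}: {kind} scheduled after a later phase")
        highest_seen = max(highest_seen, rank)
        # Destructive changes must wait until rollback is no longer needed.
        if kind == "contract" and not step.get("rollback_window_closed", False):
            problems.append(f"step {index}: contract before rollback window closed")
    return problems
```

A plan that drops a column before the new application version is rolled out fails the ordering check, which matches the failure described in the incident hook for this chapter.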

Learning Objectives

By the end of this chapter, learners can:

  • explain expand/contract migration strategy
  • design rollback-safe deploy sequence for app + schema
  • execute a migration incident drill with evidence capture
  • define break-glass rules for failed migrations

Current Project State

  • the backend service does not currently depend on production DB reads/writes for its core flow
  • the chapter uses a migration workflow simulation on CNPG/PostgreSQL targets
  • once the backend login + DB flow is added, this chapter becomes a mandatory release gate

The Incident Hook

A release ships application code and a schema migration in one step. The migration drops or renames a column used by the previous app version. The new deployment fails health checks. Rolling back the application image succeeds, but the old app can no longer read its data. Incident duration expands because “app rollback” alone cannot recover service.

Advanced Module: Linkerd + Progressive Delivery (Canary / A-B)

Why This Module Exists

Safe delivery is not only “deploy or rollback”. This module adds service-mesh-driven progressive rollout guardrails:

  • Linkerd mTLS by default
  • canary rollout with measurable abort criteria
  • A/B routing with explicit experiment boundaries
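
"Measurable abort criteria" means the rollout decision is a function of observed metrics, not of someone watching a dashboard. The sketch below shows one analysis step for a canary. All thresholds (99% success rate, 500 ms p99 latency, 10% steps up to 50% weight) and the function name are hypothetical. A progressive delivery controller would be configured with comparable parameters rather than running code like this.

```python
def next_canary_step(
    current_weight: int,
    success_rate: float,
    p99_latency_ms: float,
    *,
    min_success_rate: float = 0.99,
    max_p99_ms: float = 500.0,
    step: int = 10,
    max_weight: int = 50,
) -> tuple[str, int]:
    """Return (action, canary traffic weight in percent) for the next interval."""
    # Abort criteria are checked first: any breach sends all traffic back to stable.
    if success_rate < min_success_rate or p99_latency_ms > max_p99_ms:
        return "abort", 0
    # Healthy at the maximum canary weight: promote to full traffic.
    if current_weight >= max_weight:
        return "promote", 100
    return "advance", min(current_weight + step, max_weight)
```

Because the abort check comes before any weight increase, a bad release is exposed to at most the current canary share of traffic for one analysis interval.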

The Incident Hook

A full rollout passes smoke checks but fails under real production traffic mix. Error rate and latency spike after deploy, and rollback starts late because detection is manual. The team needs controlled traffic progression with automatic safety checks.

Intro: AI as a Very Well-Read Junior Engineer

The course is not “how to use AI” and not “how to write prompts”.

It is about using AI in DevOps / SysOps / SRE without increasing risk or blast radius.

The Mental Model

AI is the most well-read junior engineer you will ever work with:

  • Knows tooling, flags, YAML, Terraform, Helm.
  • Works fast and in parallel.
  • Sounds confident.

And that is exactly why it is dangerous:

Production-Grade Kubernetes with Guardrails & AI-Assisted SRE

Core Course Structure (12 Chapters)

  1. Production Mindset & Guardrails
    • Kubernetes != dev playground
    • alert fatigue, blast radius, environment separation
    • AI as read-only assistant
  2. Infrastructure as Code (IaC)
    • Terraform modules and structure
    • remote state, IAM/RBAC, version pinning, drift detection, safe destroy
    • Lab: build production-ready cluster
  3. Secrets Management (SOPS)
    • encrypted secrets with SOPS + age
    • Flux + SOPS integration
    • key rotation strategy
    • Lab: encrypted secret -> deploy -> decrypt via Flux
  4. GitOps & Version Promotion
    • Flux architecture, HelmRelease/Kustomize overlays
    • dev -> stage -> prod promotion (no rebuild), rollback, immutable tags
    • Lab: real promotion workflow
  5. Network Policies (Production Isolation)
    • default deny, namespace isolation, ingress/egress controls, DNS allow patterns
    • blocked traffic debugging
    • Lab: break traffic and analyze
  6. Security Context & Pod Hardening
    • runAsNonRoot, readOnlyRootFilesystem, fsGroup, dropped capabilities
    • privileged pod risks, Pod Security Standards
    • Lab: permission failure recovery without root
  7. Resource Management & QoS
    • requests vs limits, QoS classes, OOMKilled behavior, node pressure
    • overcommit and HPA interaction
    • Lab: OOM simulation and root cause analysis
  8. Availability Engineering (HPA + PDB)
    • HPA mechanics, ScalingLimited, PDB, rolling updates, node drain
    • HPA/PDB/rollout interaction
    • Lab: drain node and analyze disruption
  9. Observability
    • metrics, logging, tracing, SLO/SLI, signal vs noise
    • Lab: baseline monitoring stack
  10. Backup & Restore Basics
    • PVC snapshots, DB dumps, object storage backups
    • restore simulation and verification
    • Lab: backup -> restore -> validation
  11. Controlled Chaos
    • controlled failure engineering (OOM, rollout break, network isolation, PVC full, cert expiry, backup job failure, node drain)
    • Lab: controlled breakage and behavior analysis
  12. 24/7 Production SRE
    • on-call mindset, incident lifecycle, recurring-problem analysis
    • blameless postmortems, continuous hardening
    • why AI should not auto-fix production

Advanced Track (Part 2)

  1. Supply Chain Security
    • SBOM generation and artifact storage
    • image signing with Cosign (OIDC/keyless and key-based models)
    • admission-time signature/attestation verification before deploy
    • Lab: unsigned image denied, signed+attested image allowed
  2. Admission Policy Guardrails
    • policy-as-code with Kyverno (Gatekeeper as advanced track)
    • enforce pod security baseline, immutable tags, trusted registries
    • deny risky manifests even when local hooks are bypassed
    • Lab: risky manifest denied, compliant manifest admitted
  3. AI-Assisted SRE Guardian
    • operator/watchers/scanners, context collectors, structured LLM JSON output
    • escalation logic, incident store, cost control, redaction, confidence calibration
    • Lab: guardian analyzes chaos scenarios
  4. Linkerd + Progressive Delivery (Canary / A-B)
    • service mesh fundamentals with mTLS-by-default
    • service-level golden metrics for rollout decisions
    • progressive delivery patterns (canary weight, header/cookie A-B routing)
    • rollback and abort criteria driven by SLO/error budget guardrails
    • Lab: canary rollout with automated abort + A-B experiment in develop
    • Module files: docs/course/module-linkerd-progressive-delivery/
  5. Rollback and Data Migrations
    • expand/contract migration strategy and compatibility windows
    • feature-flag-assisted rollback for schema-dependent releases
    • destructive migration approval gates and recovery planning
    • Lab: non-production migration drill with rollback evidence capture
    • Chapter files: docs/course/chapter-16-rollback-data-migrations/

Learning Outcome

By the end of the course, learners can: