Production-Grade Kubernetes with Guardrails & AI-Assisted SRE

Core Course Structure (12 Chapters)

Production Mindset & Guardrails

production Kubernetes is not a dev playground
alert fatigue, blast radius, environment separation
AI as read-only assistant

Infrastructure as Code (IaC)

Terraform modules and structure
remote state, IAM/RBAC, version pinning, drift detection, safe destroy
Lab: build production-ready cluster

Secrets Management (SOPS)

encrypted secrets with SOPS + age
Flux + SOPS integration
key rotation strategy
Lab: encrypted secret -> deploy -> decrypt via Flux

GitOps & Version Promotion

Flux architecture, HelmRelease/Kustomize overlays
dev -> stage -> prod promotion (no rebuild), rollback, immutable tags
Lab: real promotion workflow
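
The promotion flow in this chapter (one artifact moving dev -> stage -> prod without a rebuild, plus rollback) can be modeled in a few lines. This is an illustrative sketch, not Flux's API: `PROMOTION_ORDER`, `promote`, `rollback`, and the registry name in the usage note are hypothetical.

```python
import re

# Hypothetical environment order; promotion may only move one step forward.
PROMOTION_ORDER = ["dev", "stage", "prod"]

# A reference pinned by digest is immutable; a bare tag can be re-pointed.
DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_immutable(image_ref):
    """Return True when the reference ends in a full sha256 digest."""
    return bool(DIGEST_RE.search(image_ref))


def promote(state, history, target):
    """Copy the exact image reference from the previous environment."""
    index = PROMOTION_ORDER.index(target)
    if index == 0:
        raise ValueError("dev is deployed from CI, not promoted into")
    source = PROMOTION_ORDER[index - 1]
    image_ref = state[source]
    if not is_immutable(image_ref):
        raise ValueError(f"{image_ref} is not pinned by digest")
    # Remember what was running so rollback can restore it.
    history.setdefault(target, []).append(state.get(target))
    state[target] = image_ref  # no rebuild: same artifact, new environment


def rollback(state, history, target):
    """Restore the reference that was deployed before the last promotion."""
    state[target] = history[target].pop()
```

With `state = {"dev": "registry.example.com/app@sha256:<64 hex chars>"}` and an empty `history`, calling `promote` for `stage` and then `prod` leaves all three environments on the same digest, and `rollback` puts an environment back on whatever it ran before.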

Network Policies (Production Isolation)

default deny, namespace isolation, ingress/egress controls, DNS allow patterns
blocked traffic debugging
Lab: break traffic and analyze
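
As a starting point for the default-deny and DNS-allow patterns listed above, the sketch below builds the two NetworkPolicy manifests as Python dictionaries and prints them as JSON. The policy names and the `k8s-app: kube-dns` pod label are assumptions; the label matches many distributions' CoreDNS pods but should be checked against the actual cluster.

```python
import json


def default_deny(namespace):
    """Deny all ingress and egress for every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},  # an empty selector matches every pod
            "policyTypes": ["Ingress", "Egress"],
        },
    }


def allow_dns(namespace):
    """Re-open egress to cluster DNS only, on port 53 UDP and TCP."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [{
                "to": [{
                    # Both selectors in one entry: pods with this label
                    # inside the kube-system namespace.
                    "namespaceSelector": {"matchLabels": {
                        "kubernetes.io/metadata.name": "kube-system"}},
                    "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
                }],
                "ports": [
                    {"protocol": "UDP", "port": 53},
                    {"protocol": "TCP", "port": 53},
                ],
            }],
        },
    }


if __name__ == "__main__":
    for policy in (default_deny("demo"), allow_dns("demo")):
        print(json.dumps(policy, indent=2))
```

Applying only the first policy is a quick way to reproduce the "blocked traffic" symptom: name resolution fails before any application connection is attempted.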

Security Context & Pod Hardening

runAsNonRoot, readOnlyRootFilesystem, fsGroup, dropped capabilities
privileged pod risks, Pod Security Standards
Lab: permission failure recovery without root
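
The hardening settings above can be checked mechanically. The sketch below inspects one container spec (as a dictionary) and reports missing settings. It is a simplification: it only looks at the container-level `securityContext` and ignores pod-level defaults such as `fsGroup`; the function name and messages are hypothetical.

```python
def hardening_findings(container):
    """Return a list of hardening gaps for one container spec."""
    sc = container.get("securityContext", {})
    findings = []
    if sc.get("privileged") is True:
        findings.append("privileged must not be true")
    if sc.get("runAsNonRoot") is not True:
        findings.append("runAsNonRoot should be true")
    if sc.get("readOnlyRootFilesystem") is not True:
        findings.append("readOnlyRootFilesystem should be true")
    if sc.get("allowPrivilegeEscalation") is not False:
        findings.append("allowPrivilegeEscalation should be false")
    dropped = sc.get("capabilities", {}).get("drop", [])
    if "ALL" not in dropped:
        findings.append("capabilities.drop should include ALL")
    return findings
```

An empty result means the container meets this checklist; it does not by itself prove compliance with a given Pod Security Standards level.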

Resource Management & QoS

requests vs limits, QoS classes, OOMKilled behavior, node pressure
overcommit and HPA interaction
Lab: OOM simulation and root cause analysis
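
How requests and limits map to a QoS class can be sketched as follows. This mirrors the documented rules for CPU and memory, under two simplifying assumptions: quantities are compared as plain strings (so "500m" and "0.5" would not be treated as equal), and the input is a list of hypothetical dictionaries with `requests` and `limits` keys rather than a full pod spec.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in RESOURCES:
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

Under node memory pressure, BestEffort pods are evicted first and Guaranteed pods last, which is why the class matters when analyzing an OOM or eviction.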

Availability Engineering (HPA + PDB)

HPA mechanics, ScalingLimited, PDB, rolling updates, node drain
HPA/PDB/rollout interaction
Lab: drain node and analyze disruption
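
The HPA scaling rule and the PDB disruption budget reduce to two small formulas. The sketch below follows the documented HPA calculation (ratio of current to target metric, rounded up, with a default tolerance of 0.1) and reports when the result was capped by the replica bounds, which is the situation the ScalingLimited condition describes. Function and parameter names are hypothetical, and stabilization windows and scaling policies are left out.

```python
import math


def desired_replicas(current, metric_value, metric_target,
                     min_replicas, max_replicas, tolerance=0.1):
    """Return (replicas, limited) for one HPA evaluation."""
    ratio = metric_value / metric_target
    if abs(ratio - 1.0) <= tolerance:
        wanted = current  # close enough to target: no scaling
    else:
        wanted = math.ceil(current * ratio)
    clamped = max(min_replicas, min(max_replicas, wanted))
    return clamped, clamped != wanted


def allowed_disruptions(healthy_pods, min_available):
    """Voluntary evictions a PDB with minAvailable currently permits."""
    return max(0, healthy_pods - min_available)
```

For example, 4 replicas at 90% utilization against a 60% target yield 6 replicas. If the HPA minimum equals the PDB `minAvailable`, `allowed_disruptions` is 0 at minimum scale and a node drain blocks until something changes.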

Observability

metrics, logging, tracing, SLO/SLI, signal vs noise
Lab: baseline monitoring stack

Backup & Restore Basics

PVC snapshots, DB dumps, object storage backups
restore simulation and verification
Lab: backup -> restore -> validation
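
One way to make "restore verification" concrete is to record checksums at backup time and compare them after the restore. The sketch below works on in-memory byte strings keyed by path; the function names and the `missing:` / `corrupt:` message format are hypothetical, and a real drill would also check application-level consistency (for example, row counts after loading a DB dump).

```python
import hashlib


def manifest(files):
    """Map each path to the SHA-256 digest of its content."""
    return {path: hashlib.sha256(data).hexdigest()
            for path, data in files.items()}


def verify_restore(backup_manifest, restored_files):
    """Return a list of problems; an empty list means the restore matches."""
    restored = manifest(restored_files)
    problems = []
    for path, digest in sorted(backup_manifest.items()):
        if path not in restored:
            problems.append(f"missing: {path}")
        elif restored[path] != digest:
            problems.append(f"corrupt: {path}")
    return problems
```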

Controlled Chaos

controlled failure engineering (OOM, rollout break, network isolation, PVC full, cert expiry, backup job failure, node drain)
Lab: controlled breakage and behavior analysis

24/7 Production SRE

on-call mindset, incident lifecycle, recurring-problem analysis
blameless postmortems, continuous hardening
why AI should not auto-fix production

Advanced Track (Part 2)

Supply Chain Security

SBOM generation and artifact storage
image signing with Cosign (OIDC/keyless and key-based models)
admission-time signature/attestation verification before deploy
Lab: unsigned image denied, signed+attested image allowed

Admission Policy Guardrails

policy-as-code with Kyverno (Gatekeeper as advanced track)
enforce pod security baseline, immutable tags, trusted registries
deny risky manifests even when local hooks are bypassed
Lab: risky manifest denied, compliant manifest admitted
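
The trusted-registry and tag rules above are normally written as Kyverno policies; the Python sketch below only illustrates the decision logic an admission check applies to an image reference. `parse_image`, `admit`, and the registry names in the usage note are hypothetical, and the parser handles common reference shapes rather than the full image-reference grammar.

```python
def parse_image(ref):
    """Split an image reference into (registry, tag, digest)."""
    name, _, digest = ref.partition("@")
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    tag = ""
    if colon > last_slash:  # a colon before the last slash is a registry port
        name, tag = name[:colon], name[colon + 1:]
    first, sep, _ = name.partition("/")
    looks_like_host = "." in first or ":" in first or first == "localhost"
    registry = first if sep and looks_like_host else "docker.io"
    return registry, tag, digest


def admit(ref, trusted_registries):
    """Return (allowed, reasons) for one image reference."""
    registry, tag, digest = parse_image(ref)
    reasons = []
    if registry not in trusted_registries:
        reasons.append(f"registry {registry} is not trusted")
    if not digest and tag in ("", "latest"):
        reasons.append("image must use a fixed tag or digest, not latest")
    return (not reasons), reasons
```

With `{"registry.example.com"}` as the trusted set, `registry.example.com/team/app:1.4.2` is admitted while a bare `nginx` is denied on both counts. A fixed tag is only as immutable as the registry makes it, so a digest remains the stronger guarantee.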

AI-Assisted SRE Guardian

operator/watchers/scanners, context collectors, structured LLM JSON output
escalation logic, incident store, cost control, redaction, confidence calibration
Lab: guardian analyzes chaos scenarios
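
The guardian's redaction, structured-output, and escalation steps can be sketched without any LLM client. Everything below is an assumption made for illustration: the four-field JSON schema, the secret pattern, the 0.7 confidence threshold, and the function names. Note that the decision function never returns an "apply fix" action, in line with the read-only role the course assigns to AI.

```python
import json
import re

SEVERITIES = {"low", "medium", "high"}
SECRET_RE = re.compile(r"(?i)(password|token|secret)=\S+")


def redact(text):
    """Mask obvious key=value secrets before context is sent to the model."""
    return SECRET_RE.sub(lambda m: m.group(1) + "=[REDACTED]", text)


def parse_analysis(raw):
    """Validate the model's reply against the hypothetical schema."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field in ("summary", "suggested_action"):
        if not isinstance(data.get(field), str):
            raise ValueError(f"{field} must be a string")
    if data.get("severity") not in SEVERITIES:
        raise ValueError("severity must be low, medium, or high")
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return data


def decide(analysis, min_confidence=0.7):
    """Route the finding; a human always owns any change to production."""
    if analysis["severity"] == "high" or analysis["confidence"] < min_confidence:
        return "escalate_to_human"
    return "notify"
```

Rejecting malformed replies outright, rather than trying to repair them, keeps the incident store free of half-parsed analyses.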

Linkerd + Progressive Delivery (Canary / A-B)

service mesh fundamentals with mTLS-by-default
service-level golden metrics for rollout decisions
progressive delivery patterns (canary weight, header/cookie A-B routing)
rollback and abort criteria driven by SLO/error budget guardrails
Lab: canary rollout with automated abort + A-B experiment in develop
Module files: docs/course/module-linkerd-progressive-delivery/
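
An abort criterion driven by the error budget can be expressed as a burn rate: the canary's observed error rate divided by the error rate the SLO allows. The sketch below is one possible rule, not the module's implementation; the threshold of 2.0, the minimum sample of 100 requests, and the three return values are assumptions.

```python
def canary_decision(total, failed, slo_target,
                    max_burn_rate=2.0, min_requests=100):
    """Return "wait", "abort", or "promote" for one evaluation window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total < min_requests:
        return "wait"  # too little traffic to judge the canary
    error_rate = failed / total
    error_budget = 1.0 - slo_target  # allowed error rate under the SLO
    burn_rate = error_rate / error_budget
    return "abort" if burn_rate > max_burn_rate else "promote"
```

With a 99% SLO, 50 failures in 1000 requests is a burn rate of about 5 and aborts the rollout, while 5 failures in 1000 is about 0.5 and lets the canary weight increase.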

Rollback and Data Migrations

expand/contract migration strategy and compatibility windows
feature-flag-assisted rollback for schema-dependent releases
destructive migration approval gates and recovery planning
Lab: non-production migration drill with rollback evidence capture
Chapter files: docs/course/chapter-16-rollback-data-migrations/
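
The expand half of an expand/contract migration can be demonstrated with an in-memory SQLite database. The `users` table and its column names are hypothetical. The point of the sketch is the compatibility window: after the expand step, both the old and the new application version can read their column, so rolling back the release does not require rolling back the schema.

```python
import sqlite3


def expand(conn):
    """Additive step: add the new column and backfill it from the old one."""
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")


def read_old(conn):
    """Query used by the previous release."""
    return conn.execute("SELECT name FROM users ORDER BY id").fetchall()


def read_new(conn):
    """Query used by the new release."""
    return conn.execute("SELECT full_name FROM users ORDER BY id").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada')")
expand(conn)
```

The contract step (dropping `name`) is the destructive part and is deliberately absent here; in the chapter's terms it belongs behind an approval gate, after no running version still reads the old column.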

Learning Outcome

By the end of the course, learners can:

build and operate a production-grade Kubernetes platform
promote versions safely across environments
enforce security and isolation guardrails
manage resource behavior under pressure
implement backup/restore practices
run controlled chaos experiments safely
apply practices that keep production stable 24/7

Advanced-track learners can additionally:

verify artifact integrity before runtime admission
enforce cluster-side policy guardrails independent of local tooling
use AI as a guardrail layer (not an autonomous executor)
run rollback-safe migration workflows for schema-dependent releases

Pending Product Decisions

Final duration estimate in hours.
Target level: mid vs senior.
Exact lab scope per chapter.
Opening and closing narrative design for stronger course impact.