Chapter 08: Availability Engineering (HPA + PDB)

Why This Chapter Exists

Replicas alone do not guarantee availability during disruption. This chapter combines:

  • HPA for load-based scaling
  • PDB for controlled voluntary disruptions
  • rollout/drain awareness

Guardrails

  • staging and production run critical services with a baseline of 2 replicas.
  • each service has HPA bounds (minReplicas, maxReplicas) and resource metric targets.
  • each service has a PDB to prevent unsafe voluntary disruption.
  • no node drain or rollout is executed before PDB and HPA state has been checked.
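
The last guardrail can be turned into a small gate that runs before any drain. The sketch below is one possible shape, not the repo's tooling: the function name drain_gate is hypothetical, and the usage comment assumes a PDB named backend in the staging namespace. It reads the PDB status field disruptionsAllowed and refuses to proceed when that value is 0 or cannot be read.

```shell
# Sketch of a pre-drain gate (function name is an assumption for illustration).
# drain_gate ALLOWED
#   exit 0: at least one voluntary disruption is currently allowed
#   exit 1: the PDB allows 0 disruptions, so a drain would block or be unsafe
#   exit 2: the value is empty or not a number (PDB missing or status unknown)
drain_gate() {
  allowed="$1"
  case "$allowed" in
    ''|*[!0-9]*)
      echo "unknown PDB state: '$allowed'" >&2
      return 2
      ;;
  esac
  if [ "$allowed" -ge 1 ]; then
    echo "ok: $allowed disruption(s) allowed"
    return 0
  fi
  echo "blocked: 0 disruptions allowed, do not drain" >&2
  return 1
}

# Usage against a cluster (not executed here; <node> is a placeholder):
#   allowed=$(kubectl -n staging get pdb backend \
#     -o jsonpath='{.status.disruptionsAllowed}')
#   drain_gate "$allowed" && kubectl drain <node> --ignore-daemonsets
```

Keeping the decision in a function that takes the number as an argument makes the gate easy to exercise without a cluster.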

Repo Mapping

  • Backend overlays:
    • flux/apps/backend/develop/
    • flux/apps/backend/staging/
    • flux/apps/backend/production/
  • Frontend overlays:
    • flux/apps/frontend/overlays/develop/
    • flux/apps/frontend/overlays/staging/
    • flux/apps/frontend/overlays/production/

Current Implementation (This Repo)

  • HPA (autoscaling/v2) added for backend and frontend in all three environments.
  • PDB (policy/v1) added for backend and frontend in all three environments.
  • staging/production baseline replicas are 2 for backend and frontend.
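
For orientation, an HPA and PDB pair of the kind listed above might look like the output of the sketch below. This is not a copy of the repo's manifests. The function name render_availability, the 70% CPU target, minAvailable: 1, the maxReplicas value in the usage comment, and the app label key are all assumptions; only the API versions and the baseline of 2 replicas come from this chapter.

```shell
# Sketch only: prints an HPA (autoscaling/v2) and a PDB (policy/v1)
# for a Deployment named APP. Values marked "assumed" are illustrative.
# render_availability APP MIN_REPLICAS MAX_REPLICAS
render_availability() {
  app="$1"; min="$2"; max="$3"
  cat <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ${app}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${app}
  minReplicas: ${min}
  maxReplicas: ${max}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed target
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${app}
spec:
  minAvailable: 1                  # assumed budget
  selector:
    matchLabels:
      app: ${app}                  # assumed label key
EOF
}

# Usage (maxReplicas of 6 is an assumption):
#   render_availability backend 2 6
```

Compare the rendered fields with the real overlays under flux/apps/ rather than applying this output directly.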

Lab Files

  • lab.md
  • quiz.md

Done When

  • learner can verify HPA target/bounds and current scaling state
  • learner can verify PDB allowed disruptions before node drain
  • learner can explain interaction: HPA, PDB, rollout, and drain
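
The first check in this list can be scripted roughly as follows. The helper name baseline_ok is hypothetical, and the usage comment assumes the staging backend objects named in the lab. The function only compares numbers, so the kubectl lookups stay in the caller.

```shell
# Sketch: succeed when the ready replica count meets the baseline of 2
# and sits inside the HPA bounds. The function name is an assumption.
# baseline_ok READY_REPLICAS MIN_REPLICAS MAX_REPLICAS
baseline_ok() {
  replicas="${1:-0}"   # readyReplicas is omitted from status when it is 0
  min="$2"; max="$3"
  [ "$replicas" -ge 2 ] && [ "$replicas" -ge "$min" ] && [ "$replicas" -le "$max" ]
}

# Usage against a cluster (not executed here):
#   r=$(kubectl -n staging get deploy backend -o jsonpath='{.status.readyReplicas}')
#   min=$(kubectl -n staging get hpa backend -o jsonpath='{.spec.minReplicas}')
#   max=$(kubectl -n staging get hpa backend -o jsonpath='{.spec.maxReplicas}')
#   baseline_ok "$r" "$min" "$max" && echo "baseline ok"
```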

Lab: HPA + PDB + Node Drain Readiness

Goal

Validate availability controls in staging:

  • HPA exists and can scale within safe bounds
  • PDB constrains voluntary disruptions
  • drain simulation is evaluated through PDB/HPA signals first
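
How a PDB constrains a drain comes down to simple arithmetic. For a PDB that uses an integer minAvailable, the number of allowed disruptions is the count of healthy pods minus minAvailable, floored at 0. The sketch below works through that under the assumption of an integer minAvailable; the function name is hypothetical, and percentage values or maxUnavailable are not covered.

```shell
# Sketch: allowed voluntary disruptions for a PDB with an integer minAvailable.
# allowed_disruptions HEALTHY_PODS MIN_AVAILABLE
allowed_disruptions() {
  healthy="$1"; min_available="$2"
  n=$((healthy - min_available))
  if [ "$n" -lt 0 ]; then
    n=0
  fi
  echo "$n"
}

allowed_disruptions 2 1   # baseline of 2 pods, minAvailable 1 -> 1
allowed_disruptions 4 1   # HPA has scaled out to 4 pods      -> 3
allowed_disruptions 2 2   # minAvailable equals replica count -> 0
```

The last case is the one to watch for: a budget equal to the replica count leaves no room for eviction, so a node drain waits until more pods become healthy.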

Prerequisites

  • Metrics API available (kubectl top works)
  • backend/frontend deployed in staging

Confirm that the workloads and their HPA/PDB objects exist:

kubectl -n staging get deploy backend frontend
kubectl -n staging get hpa,pdb

Step 1: Verify Baseline

kubectl -n staging get deploy backend frontend -o wide
kubectl -n staging get hpa backend frontend
kubectl -n staging get pdb backend frontend

Expected:

Quiz: Chapter 08 (Availability Engineering)

Questions

  1. Why is HPA alone not sufficient to guarantee safe disruption handling?

  2. What does a PodDisruptionBudget control?

  3. Which signal must be checked before node drain?

  4. If Allowed disruptions = 0 for a critical service, what is the correct action?

  5. Which statement is correct?

  • A) PDB affects all pod failures including OOM and crashes.
  • B) PDB controls voluntary disruptions such as evictions/drains.
  • C) HPA ignores resource metrics.

  6. What does ScalingLimited=True typically indicate in HPA status?