Chapter 09: Availability Engineering (HPA + PDB)
Learning Objectives
By the end of this chapter, you will be able to:
- Configure HPA bounds and PDB constraints for safe scaling
- Explain why minReplicas: 1 is a reliability regression for critical services
- Coordinate rolling updates with node drain events using PDB
- Design planned disruption as an engineering scenario, not an improvisation
Start with the video for the concept overview, then work through each lesson section.
Recall Check
Before continuing, quickly recall:
- Why is default-deny the correct baseline for NetworkPolicy? (Chapter 06)
- What is the correct fix when a pod fails with a runAsNonRoot violation? (Chapter 07)
- How does QoS class affect which pod gets evicted under memory pressure? (Chapter 08)
If you can’t answer these, revisit the corresponding chapter before proceeding.
Running multiple replicas of an app doesn’t automatically guarantee availability. If all replicas reside on a single node that goes down for maintenance, your app goes down too. In this chapter we design availability through HPA and PDB.
1. The Problem: The “Stalled Drain” Incident
A routine node maintenance (drain) starts. If a critical service has only one replica (minReplicas: 1), the drain will either cause an outage or be blocked by a protective policy. Planned maintenance shouldn’t turn into a production incident due to misaligned scaling and disruption settings.
2. The Concept: Scaling and Disruption Budgets
We use two mechanisms to ensure our services stay alive during load spikes and maintenance:
- Horizontal Pod Autoscaler (HPA): Automatically adds more Pods when traffic increases.
- Pod Disruption Budget (PDB): Defines the “minimum healthy” floor that must be maintained during maintenance.
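As a minimal sketch of how the two mechanisms fit together (the name, target, and thresholds here are illustrative assumptions, not values from our repo):

```yaml
# HPA sketch: scale a Deployment between a safe floor and a cost ceiling.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend          # hypothetical service name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2         # floor: tolerates losing one pod during a drain
  maxReplicas: 6         # ceiling: bounds cost during traffic spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add pods when average CPU exceeds 70%
```

The HPA handles load; the PDB (shown later in this chapter) handles planned disruption. They only work together when the HPA floor is high enough that the PDB can spare a pod.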
3. The Code: Autoscaling & Safety
Our sre/ repo defines HPA and PDB for all applications. The hpa.yaml and pdb.yaml files in our overlays are our contracts for high availability.
Backend availability layout
- flux/apps/backend/develop/hpa.yaml
- flux/apps/backend/develop/image-automation.yaml
- flux/apps/backend/develop/image-policy.yaml
- flux/apps/backend/develop/kustomization.yaml
- flux/apps/backend/develop/patches/feature-flags.yaml
- flux/apps/backend/develop/pdb.yaml
4. The Guardrail: Minimum Replicas for Production
For critical services in staging and production, setting minReplicas: 1 is a reliability regression. We enforce a minimum of 2 replicas to provide failure tolerance during rollouts, node drains, and unexpected pod restarts.
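A PDB sketch that pairs with the two-replica floor (again with an illustrative name and label selector, assumed rather than taken from the repo):

```yaml
# PDB sketch: with 2 replicas and minAvailable: 1, a node drain may
# evict at most one pod at a time while the other keeps serving traffic.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend          # hypothetical service name
spec:
  minAvailable: 1        # never voluntarily go below one healthy pod
  selector:
    matchLabels:
      app: backend       # must match the Deployment's pod labels
```

Note the failure mode from the "Stalled Drain" incident: with minReplicas: 1, this same PDB computes zero allowed disruptions, so the drain blocks indefinitely instead of proceeding pod by pod.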
Frontend availability layout
- flux/apps/frontend/overlays/develop/hpa.yaml
- flux/apps/frontend/overlays/develop/image-automation.yaml
- flux/apps/frontend/overlays/develop/image-policy.yaml
- flux/apps/frontend/overlays/develop/kustomization.yaml
- flux/apps/frontend/overlays/develop/namespace.yaml
- flux/apps/frontend/overlays/develop/patches/deployment.yaml
- flux/apps/frontend/overlays/develop/patches/ingress.yaml
- flux/apps/frontend/overlays/develop/pdb.yaml
5. Verification: Did I Get It?
Check your HPA and PDB status to confirm they are protecting your service:
# Check scaling status
kubectl get hpa -n develop
# Check allowed disruptions
kubectl get pdb -n develop
Expected Output: ALLOWED DISRUPTIONS should be greater than 0 (meaning a drain can evict at least one pod without breaching the budget), and the HPA's TARGETS column should show current metrics against their targets rather than <unknown>.