Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • develop/


What You Will Produce

A reproducible lab result, a passed quiz, and operating evidence that your service survives planned disruption safely.

Chapter 09: Availability Engineering (HPA + PDB)

Learning Objectives

By the end of this chapter, you will be able to:

  • Configure HPA bounds and PDB constraints for safe scaling
  • Explain why minReplicas: 1 is a reliability regression for critical services
  • Coordinate rolling updates with node drain events using PDB
  • Design planned disruption as an engineering scenario, not an improvisation

Start with the video for the concept overview, then work through each lesson section.

Recall Check

Before continuing, quickly recall:

  • Why is default-deny the correct baseline for NetworkPolicy? (Chapter 06)
  • What is the correct fix when a pod fails with runAsNonRoot violation? (Chapter 07)
  • How does QoS class affect which pod gets evicted under memory pressure? (Chapter 08)

If you can’t answer these, revisit the corresponding chapter before proceeding.

Running multiple replicas of an app doesn’t automatically guarantee availability. If all replicas reside on a single node that goes down for maintenance, your app goes down too. In this chapter we design availability through HPA and PDB.


1. The Problem: The “Stalled Drain” Incident

A routine node maintenance (drain) starts. If a critical service has only one replica (minReplicas: 1), the drain will either cause an outage or be blocked by a protective policy. Planned maintenance shouldn’t turn into a production incident due to misaligned scaling and disruption settings.
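The failure mode is easy to make concrete. The manifest below is a hypothetical illustration (names and values are not the repo's actual files): one replica plus a PDB that demands one healthy pod leaves zero allowed disruptions, so the eviction API refuses the drain.

```yaml
# Hypothetical example -- not the actual repo manifests.
# One replica + a PDB requiring one healthy pod = zero allowed disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb          # hypothetical name
  namespace: develop
spec:
  minAvailable: 1             # floor: at least 1 healthy pod at all times
  selector:
    matchLabels:
      app: payments           # matches a Deployment running with replicas: 1
# allowedDisruptions = currentHealthy (1) - minAvailable (1) = 0,
# so `kubectl drain` blocks on eviction until capacity exists elsewhere.
```

Neither setting is wrong on its own; the incident comes from their combination.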

2. The Concept: Scaling and Disruption Budgets

We use two mechanisms to ensure our services stay alive during load spikes and maintenance:

  1. Horizontal Pod Autoscaler (HPA): Automatically adds more Pods when traffic increases.
  2. Pod Disruption Budget (PDB): Defines the “minimum healthy” floor that must be maintained during maintenance.
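A minimal HPA sketch of the first mechanism (the name, namespace, and targets are illustrative, not the repo's actual values):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-hpa           # illustrative name
  namespace: develop
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2              # availability floor (see the guardrail below)
  maxReplicas: 6              # cost/capacity ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

The HPA sets the replica range; the PDB then defines how many of those replicas maintenance is allowed to take away at once.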

3. The Code: Autoscaling & Safety

Our sre/ repo defines HPA and PDB for all applications. The hpa.yaml and pdb.yaml files in our overlays are our contracts for high availability.

Backend availability layout

  • flux/apps/backend/develop/hpa.yaml
  • flux/apps/backend/develop/image-automation.yaml
  • flux/apps/backend/develop/image-policy.yaml
  • flux/apps/backend/develop/kustomization.yaml
  • flux/apps/backend/develop/patches/feature-flags.yaml
  • flux/apps/backend/develop/pdb.yaml

4. The Guardrail: Minimum Replicas for Production

For critical services in staging and production, setting minReplicas: 1 is a reliability regression. We enforce a minimum of 2 replicas to provide failure tolerance during rollouts, node drains, and unexpected pod restarts.
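The guardrail works because the two objects are tuned together. A sketch of the pairing (values are illustrative, not the repo's actual manifests):

```yaml
# Illustrative pairing: with the HPA's minReplicas: 2 and this PDB,
# a drain may evict one pod at a time while the other keeps serving.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb           # illustrative name
  namespace: develop
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: backend
# allowedDisruptions = currentHealthy (2) - minAvailable (1) = 1 > 0,
# so planned evictions proceed with bounded disruption.
```

Contrast this with the stalled-drain scenario: the same PDB with only one replica yields zero allowed disruptions.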

Frontend availability layout

  • flux/apps/frontend/overlays/develop/hpa.yaml
  • flux/apps/frontend/overlays/develop/image-automation.yaml
  • flux/apps/frontend/overlays/develop/image-policy.yaml
  • flux/apps/frontend/overlays/develop/kustomization.yaml
  • flux/apps/frontend/overlays/develop/namespace.yaml
  • flux/apps/frontend/overlays/develop/patches/deployment.yaml
  • flux/apps/frontend/overlays/develop/patches/ingress.yaml
  • flux/apps/frontend/overlays/develop/pdb.yaml

5. Verification: Did I Get It?

Check your HPA and PDB status to confirm they are protecting your service:

# Check scaling status
kubectl get hpa -n develop
# Check allowed disruptions
kubectl get pdb -n develop

Expected Output: ALLOWED DISRUPTIONS should be greater than 0, and the HPA should report current metrics against its targets (not <unknown>).
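For reference, healthy status looks roughly like this (names, ages, and metric values are illustrative; exact column formatting varies slightly between kubectl versions):

```
NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
backend-pdb   1               N/A               1                     12d

NAME          REFERENCE            TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
backend-hpa   Deployment/backend   cpu: 41%/70%   2         6         2          12d
```

If ALLOWED DISRUPTIONS is 0, a drain touching this service will stall; if TARGETS shows <unknown>, the metrics pipeline needs attention before the HPA can scale.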


Detailed Lessons

Hands-On Materials

Labs, quizzes, and runbooks — available to course members.

  • Lab: HPA + PDB + Node Drain Readiness
  • Quiz: Chapter 09 (Availability Engineering)

The Incident: The Stalled Drain

Result: settings that looked reasonable in isolation fail together during a disruption, turning planned maintenance into an incident. Observed symptoms (what the team sees first): the node drain does not proceed cleanly. …

Investigation & Containment

Safe investigation sequence: first, inspect current state (the current replica count and HPA scaling status); then confirm the PDB allowance with kubectl get pdb to see how many allowed disruptions are …

Workflow & Baseline

Backend availability layout: flux/apps/backend/develop/hpa.yaml, flux/apps/backend/develop/image-automation.yaml, flux/apps/backend/develop/image-policy.yaml, flux/apps/backend/develop/kustomization.yaml, …

Lab & Completion

  • Healthy HPA + PDB: the drain proceeds smoothly with bounded disruption; the HPA adds more replicas if necessary.
  • Restrictive PDB: the drain blocks (expected protective behavior) because the service is at its minimum replica …