Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • develop/


What You Will Produce

A reproducible lab result, a passed quiz, and operating evidence that your service survives planned disruption safely.

Chapter 09: Availability Engineering (HPA + PDB)

Learning Objectives

By the end of this chapter, you will be able to:

  • Configure HPA bounds and PDB constraints for safe scaling
  • Explain why minReplicas: 1 is a reliability regression for critical services
  • Coordinate rolling updates with node drain events using PDB
  • Design planned disruption as an engineering scenario, not an improvisation

Start with the video for the concept overview, then work through each lesson section.

Recall Check

Before continuing, quickly recall:

  • Why is default-deny the correct baseline for NetworkPolicy? (Chapter 06)
  • What is the correct fix when a pod fails with runAsNonRoot violation? (Chapter 07)
  • How does QoS class affect which pod gets evicted under memory pressure? (Chapter 08)

If you can’t answer these, revisit the corresponding chapter before proceeding.

Running multiple replicas of an app doesn’t automatically guarantee availability. If all replicas reside on a single node that goes down for maintenance, your app goes down too. In this chapter we design availability through HPA and PDB.


1. The Problem: The “Stalled Drain” Incident

A routine node maintenance (drain) starts. If a critical service has only one replica (minReplicas: 1), the drain will either cause an outage or be blocked by a protective policy. Planned maintenance shouldn’t turn into a production incident due to misaligned scaling and disruption settings.
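The failure mode is easy to make concrete. The manifest below is a hypothetical illustration (names and values are not the repo's actual files): one replica plus a PDB that demands one healthy pod leaves zero allowed disruptions, so the eviction API refuses the drain.

```yaml
# Hypothetical example -- not the actual repo manifests.
# One replica + a PDB requiring one healthy pod = zero allowed disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-pdb          # hypothetical name
  namespace: develop
spec:
  minAvailable: 1             # floor: at least 1 healthy pod at all times
  selector:
    matchLabels:
      app: payments           # matches a Deployment running with replicas: 1
# allowedDisruptions = currentHealthy (1) - minAvailable (1) = 0,
# so `kubectl drain` blocks on eviction until capacity exists elsewhere.
```

Neither setting is wrong on its own; the incident comes from their combination.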

2. The Concept: Scaling and Disruption Budgets

We use two mechanisms to ensure our services stay alive during load spikes and maintenance:

  1. Horizontal Pod Autoscaler (HPA): Automatically adds more Pods when traffic increases.
  2. Pod Disruption Budget (PDB): Defines the “minimum healthy” floor that must be maintained during maintenance.
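A minimal HPA sketch of the first mechanism (the name, namespace, and targets are illustrative, not the repo's actual values):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: backend-hpa           # illustrative name
  namespace: develop
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  minReplicas: 2              # availability floor (see the guardrail below)
  maxReplicas: 6              # cost/capacity ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

The HPA sets the replica range; the PDB then defines how many of those replicas maintenance is allowed to take away at once.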

3. The Code: Autoscaling & Safety

Our sre/ repo defines HPA and PDB for all applications. The hpa.yaml and pdb.yaml files in our overlays are our contracts for high availability.

Backend availability layout

  • flux/apps/backend/develop/hpa.yaml
  • flux/apps/backend/develop/image-automation.yaml
  • flux/apps/backend/develop/image-policy.yaml
  • flux/apps/backend/develop/kustomization.yaml
  • flux/apps/backend/develop/patches/feature-flags.yaml
  • flux/apps/backend/develop/pdb.yaml

4. The Guardrail: Minimum Replicas for Production

For critical services in staging and production, setting minReplicas: 1 is a reliability regression. We enforce a minimum of 2 replicas to provide failure tolerance during rollouts, node drains, and unexpected pod restarts.
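The guardrail works because the two objects are tuned together. A sketch of the pairing (values are illustrative, not the repo's actual manifests):

```yaml
# Illustrative pairing: with the HPA's minReplicas: 2 and this PDB,
# a drain may evict one pod at a time while the other keeps serving.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb           # illustrative name
  namespace: develop
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: backend
# allowedDisruptions = currentHealthy (2) - minAvailable (1) = 1 > 0,
# so planned evictions proceed with bounded disruption.
```

Contrast this with the stalled-drain scenario: the same PDB with only one replica yields zero allowed disruptions.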

Frontend availability layout

  • flux/apps/frontend/overlays/develop/hpa.yaml
  • flux/apps/frontend/overlays/develop/image-automation.yaml
  • flux/apps/frontend/overlays/develop/image-policy.yaml
  • flux/apps/frontend/overlays/develop/kustomization.yaml
  • flux/apps/frontend/overlays/develop/namespace.yaml
  • flux/apps/frontend/overlays/develop/patches/deployment.yaml
  • flux/apps/frontend/overlays/develop/patches/ingress.yaml
  • flux/apps/frontend/overlays/develop/pdb.yaml

5. Verification: Did I Get It?

Check your HPA and PDB status to confirm they are protecting your service:

# Check scaling status
kubectl get hpa -n develop
# Check allowed disruptions
kubectl get pdb -n develop

Expected Output: ALLOWED DISRUPTIONS should be greater than 0, and the HPA should report current metrics against its targets (not <unknown>).
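For reference, healthy status looks roughly like this (names, ages, and metric values are illustrative; exact column formatting varies slightly between kubectl versions):

```
NAME          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
backend-pdb   1               N/A               1                     12d

NAME          REFERENCE            TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
backend-hpa   Deployment/backend   cpu: 41%/70%   2         6         2          12d
```

If ALLOWED DISRUPTIONS is 0, a drain touching this service will stall; if TARGETS shows <unknown>, the metrics pipeline needs attention before the HPA can scale.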


Detailed Lessons

Hands-On Materials

Labs, quizzes, and runbooks — available to course members.

  • Lab: HPA + PDB + Node Drain Readiness
  • Quiz: Chapter 09 (Availability Engineering)

The Incident: The Stalled Drain

Result: settings that looked reasonable in isolation fail together during a disruption, turning planned maintenance into an incident. Observed symptoms (what the team sees first): the node drain does not proceed cleanly. …

Investigation & Containment

Safe investigation sequence: first, inspect current state (the current replica count and HPA scaling status); then confirm the PDB allowance with kubectl get pdb to see how many allowed disruptions are …

Workflow & Baseline

Backend availability layout: flux/apps/backend/develop/hpa.yaml, flux/apps/backend/develop/image-automation.yaml, flux/apps/backend/develop/image-policy.yaml, flux/apps/backend/develop/kustomization.yaml, …

Lab & Completion

  • Healthy HPA + PDB: the drain proceeds smoothly with bounded disruption; the HPA adds more replicas if necessary.
  • Restrictive PDB: the drain blocks (expected protective behavior) because the service is at its minimum replica …