Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • deployment.yaml
  • develop/
  • resourcequota.yaml


What You Will Produce

A reproducible lab result, a passing quiz, and evidence that your workloads run with the resource guardrails needed to operate safely during an incident.

Chapter 08: Resource Management & QoS

Learning Objectives

By the end of this chapter, you will be able to:

  • Set requests, limits, and quotas per namespace for CPU and memory
  • Predict QoS class assignment under resource pressure
  • Distinguish OOMKilled behavior from node-pressure eviction
  • Justify scaling decisions with resource utilization evidence

Start with the video for the concept overview, then work through each lesson section.

In Kubernetes, applications share the same physical hardware. Without resource discipline, one “noisy neighbor” can crash every other service on the node. In this chapter, we implement strict resource guardrails to ensure predictable and stable cluster behavior.


1. The Problem: The “Noisy Neighbor” Incident

A service starts consuming memory aggressively during a traffic peak. Because it has no limits, it starves neighboring Pods of memory, forcing Kubernetes to “evict” healthy Pods to save the host node. This causes a cascading failure across the entire node, affecting unrelated production workloads.

2. The Concept: Requests, Limits, and QoS

Kubernetes uses two per-container settings and a derived Quality of Service (QoS) class to decide which Pods matter most under resource pressure (see the sketch after this list):

  1. Requests: The amount of CPU and memory the scheduler reserves for the Pod; it is only placed on a node that can honor them.
  2. Limits: The absolute maximum the container may use; exceeding the memory limit gets the container OOMKilled, exceeding the CPU limit gets it throttled.
  3. QoS Classes: Guaranteed (requests equal limits for every container), Burstable (some requests or limits set, but not meeting the Guaranteed rule), and BestEffort (no requests or limits at all).
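
A minimal sketch of how the class is derived from a single container's resources block; the values are illustrative and not taken from the course repo:

resources:
  requests:
    cpu: 100m        # reserved for the container at scheduling time
    memory: 128Mi
  limits:
    cpu: 100m        # equal to the request for every resource in every container
    memory: 128Mi    # -> the Pod is classified as Guaranteed
# Requests lower than limits (or only some fields set) -> Burstable
# No resources block on any container -> BestEffort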

3. The Code: Resource Blocks

Our sre/ repo enforces explicit resource definitions for every container. The resources block in our deployment manifest is our contract with the Kubernetes scheduler.

Backend resource block

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
  labels:
    app: backend
    app.kubernetes.io/name: backend
    app.kubernetes.io/component: api
spec:
  replicas: 1
  revisionHistoryLimit: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
        app.kubernetes.io/name: backend
        app.kubernetes.io/component: api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      imagePullSecrets:
      - name: ghcr-credentials-docker
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: backend
        image: backend:latest
        imagePullPolicy: IfNotPresent
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          runAsNonRoot: true
          runAsUser: 10001
          runAsGroup: 10001
          capabilities:
            drop:
              - ALL
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        env:
        - name: PORT
          value: "8080"
        - name: NAMESPACE
          value: "${NAMESPACE}"
        - name: ENVIRONMENT
          value: "${ENVIRONMENT}"
        - name: LOG_LEVEL
          value: "${LOG_LEVEL}"
        - name: SERVICE_NAME
          value: "backend"
        - name: SERVICE_VERSION
          value: "v1.0.0"
        - name: DEPLOYMENT_ENVIRONMENT
          value: "${ENVIRONMENT}"
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "k8s.cluster.name=${cluster_name}"
        - name: UPTRACE_DSN
          valueFrom:
            secretKeyRef:
              name: backend-secrets
              key: uptrace-dsn
        - name: OTEL_EXPORTER_OTLP_HEADERS
          valueFrom:
            secretKeyRef:
              name: backend-secrets
              key: uptrace-headers
        - name: JWT_SECRET
          valueFrom:
            secretKeyRef:
              name: backend-secrets
              key: jwt-secret
        - name: POSTGRES_USER
          valueFrom:
            secretKeyRef:
              name: app-postgres-app
              key: username
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: app-postgres-app
              key: password
        - name: POSTGRES_HOST
          value: app-postgres-rw
        - name: POSTGRES_DB
          value: app
        livenessProbe:
          httpGet:
            path: /livez
            port: http
          initialDelaySeconds: 15
          periodSeconds: 20
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /readyz
            port: http
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /healthz
            port: http
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 30
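        # Requests sit below limits for both CPU and memory, so this single-container Pod lands in the Burstable QoS class.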
        resources:
          requests:
            cpu: 10m
            memory: 32Mi
            ephemeral-storage: 64Mi
          limits:
            cpu: 100m
            memory: 128Mi
            ephemeral-storage: 128Mi
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /home/app/.cache
      volumes:
      - name: tmp
        emptyDir: {}
      - name: cache
        emptyDir:
          sizeLimit: 10Mi
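
Assuming the manifest is saved as deployment.yaml (per the source references above) and targets the develop namespace, applying it and listing the resulting Pod is enough to see the contract take effect:

kubectl apply -f deployment.yaml -n develop
kubectl get pod -n develop -l app=backend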

4. The Guardrail: Namespace Quotas

To prevent a single environment from consuming the entire cluster’s resources, we use ResourceQuotas. This provides an additional layer of protection by rejecting any deployment that exceeds the namespace’s total memory or CPU budget.

Develop namespace quota

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: develop
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
    pods: "15"

5. Verification: Did I Get It?

Verify your workload’s QoS class and its actual resource consumption:

# Check the assigned QoS class
kubectl get pod -n develop <pod-name> -o jsonpath='{.status.qosClass}'
# Check actual CPU/memory usage
kubectl top pod -n develop <pod-name>

Expected Output: Burstable or Guaranteed (the backend manifest above, with requests below limits, yields Burstable) and resource usage that stays within the defined limits.


Detailed Lessons

Hands-On Materials

Labs, quizzes, and runbooks — available to course members.

  • Lab: Requests, Limits, QoS, and OOM Analysis
  • Quiz: Chapter 08 (Resource Management & QoS)

The Incident: The Noisy Neighbor

Result: Without resource guardrails, on-call loses control of prioritization and recovery as the cluster's stability degrades.

Observed Symptoms (what the team sees first): Repeated OOMKilled events on one workload. …

Investigation & Containment

Safe investigation sequence (commands sketched below):

  1. Inspect Pod Events: Look for OOMKilled, Throttling, and Evicted signals.
  2. Confirm QoS Class: Check the QoS class of the affected workloads.
  3. Compare Behavior: Compare the requests and limits …
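
A minimal sketch of the first two steps, assuming the affected workload runs in the develop namespace:

# Container last state and Pod events (look for Reason: OOMKilled)
kubectl describe pod -n develop <pod-name>
# Node-pressure evictions surface as events with reason Evicted
kubectl get events -n develop --field-selector reason=Evicted
# QoS class of the affected Pod
kubectl get pod -n develop <pod-name> -o jsonpath='{.status.qosClass}'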

Workflow & Baseline

The baseline is the backend Deployment manifest shown in section 3: explicit requests, limits, and probes for every container. …

Lab & Completion

Done When: You have completed this chapter when:

  • You can explain Burstable vs. Guaranteed vs. BestEffort QoS classes.
  • You have successfully identified an OOMKilled event in a pod's history.
  • You can verify namespace …