Core Track: Guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • clusterrole.yaml
  • deployment.yaml

What You Will Produce

A reproducible lab result, a completed quiz, and evidence that you can operate the Guardian safely during an incident.

Chapter 13: AI-Assisted SRE Guardian

Learning Objectives

By the end of this chapter, you will be able to:

  • Describe k8s-ai-monitor’s normalization, deduplication, and routing model
  • Configure LLM boundaries including sanitization, budget caps, and no-mutation rules
  • Use API, CLI, and MCP surfaces for safe investigation support
  • Explain why AI cannot own production decisions

Start with the video for the concept overview, then work through each lesson section.

Alert fatigue is real. During an incident, you are often flooded with dozens of technical alerts but lack a clear root cause. In this chapter, we deploy an AI Guardian that normalizes signals, enriches them with context, and assists in triage without taking risky autonomous actions.


1. The Problem: The “Alert Storm”

Multiple warning signals fire after a failure. Responders receive fragmented alerts with no clear priority, leading to manual triage that burns time on duplicate noise while the real impact grows. The problem is not a lack of detection; it is a lack of normalization.

2. The Concept: AI as a Triage Assistant

We use k8s-ai-monitor as a filter between the cluster and the human responder. It rests on four principles:

  1. Normalization: Collapses multiple related alerts into a single structured incident.
  2. Context Enrichment: Automatically pulls relevant logs, events, and metrics for analysis.
  3. Sanitization: Redacts secrets and tokens before any LLM call.
  4. No Mutation: The AI can propose a fix, but it can never apply it.
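
The normalization step above can be sketched in a few lines of Python. This is a minimal illustration, not k8s-ai-monitor's actual implementation: the `Alert` and `Incident` shapes and the choice to fingerprint on namespace, kind, and name are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    # Hypothetical alert shape; the real k8s-ai-monitor schema may differ.
    namespace: str
    kind: str      # e.g. "Deployment"
    name: str      # workload the alert points at
    reason: str    # e.g. "CrashLoopBackOff"


@dataclass
class Incident:
    fingerprint: tuple
    reasons: set = field(default_factory=set)
    count: int = 0


def fingerprint(alert: Alert) -> tuple:
    # Assumption: alerts about the same workload belong to one incident.
    return (alert.namespace, alert.kind, alert.name)


def normalize(alerts):
    """Collapse related alerts into one Incident per fingerprint."""
    incidents = {}
    for alert in alerts:
        key = fingerprint(alert)
        incident = incidents.setdefault(key, Incident(fingerprint=key))
        incident.reasons.add(alert.reason)
        incident.count += 1
    return list(incidents.values())


storm = [
    Alert("production", "Deployment", "checkout", "CrashLoopBackOff"),
    Alert("production", "Deployment", "checkout", "Unhealthy"),
    Alert("production", "Deployment", "checkout", "CrashLoopBackOff"),
    Alert("production", "Deployment", "search", "Unhealthy"),
]
incidents = normalize(storm)  # four alerts become two incidents
```

Four raw alerts collapse into two incidents, one per affected workload, and each incident keeps the distinct reasons and a count so the responder can see how noisy the underlying signal was.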

3. The Code: The Guardian Deployment

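Our sre/ repo deploys the Guardian as a singleton (replicas: 1) in the observability namespace. It watches the cluster’s health heartbeats and reacts to failures in real time. Its state lives in a SQLite file on a PersistentVolumeClaim mounted at /data, which is consistent with running a single replica.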

Guardian deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-ai-monitor
  labels:
    app.kubernetes.io/name: k8s-ai-monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: k8s-ai-monitor
  template:
    metadata:
      labels:
        app.kubernetes.io/name: k8s-ai-monitor
    spec:
      serviceAccountName: k8s-ai-monitor
      securityContext:
        fsGroup: 1000
      imagePullSecrets:
        - name: ghcr-credentials-docker
      terminationGracePeriodSeconds: 30
      containers:
        - name: k8s-ai-monitor
          image: ${image_registry}/k8s-ai-monitor:main # {"$imagepolicy": "observability:k8s-ai-monitor:tag"}
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
          env:
            - name: CLUSTER_NAME
              value: safeops
            - name: WATCH_NAMESPACES
              value: production
            - name: NON_PROD_NAMESPACES
              value: develop,staging
            - name: EXCLUDE_NAMESPACES
              value: kube-system,kube-public,kube-node-lease,flux-system
            - name: LOG_LEVEL
              value: INFO
            - name: LLM_PROVIDER
              value: openai
            - name: PROMETHEUS_URL
              value: http://kube-prometheus-stack-prometheus.observability.svc.cluster.local:9090
            - name: SQLITE_PATH
              value: /data/k8s-ai-monitor.db
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: openai-api-key
                  optional: true
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: anthropic-api-key
                  optional: true
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: slack-webhook-url
                  optional: true
            - name: SLACK_WEBHOOK_URL_NONPROD
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: slack-webhook-url-nonprod
                  optional: true
            - name: INTERNAL_TOKEN
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: internal-token
                  optional: true
            - name: ELASTICSEARCH_URL
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: elasticsearch-url
                  optional: true
            - name: ELASTICSEARCH_USER
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: elasticsearch-user
                  optional: true
            - name: ELASTICSEARCH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: elasticsearch-password
                  optional: true
            - name: SCANNER_CRITICAL_ENDPOINT_ENABLED
              value: "true"
            - name: ENDPOINT_INGRESS_SERVICE
              value: traefik.traefik.svc.cluster.local
            - name: SCANNER_BACKUP_ENABLED
              value: "true"
          volumeMounts:
            - name: data
              mountPath: /data
          readinessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 20
            periodSeconds: 20
          resources:
            requests:
              cpu: 10m
              memory: 64Mi
            limits:
              cpu: 100m
              memory: 256Mi
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: k8s-ai-monitor-data
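
The namespace variables and the two Slack webhook variables in the manifest above suggest a simple routing model. The sketch below is one plausible reading, not the Guardian's confirmed behavior: the `route` function and the precedence order (exclusions first, then non-prod, then production) are assumptions made for illustration.

```python
from typing import Optional

# Values copied from the Deployment's env block above.
WATCH_NAMESPACES = "production"
NON_PROD_NAMESPACES = "develop,staging"
EXCLUDE_NAMESPACES = "kube-system,kube-public,kube-node-lease,flux-system"


def parse_csv(value: str) -> set:
    """Turn a comma-separated env value into a set of namespace names."""
    return {item.strip() for item in value.split(",") if item.strip()}


def route(namespace: str) -> Optional[str]:
    """Return the name of the webhook env var to use, or None to drop.

    Assumed precedence (not confirmed by the source): exclusions win,
    then non-prod namespaces, then watched production namespaces.
    """
    if namespace in parse_csv(EXCLUDE_NAMESPACES):
        return None
    if namespace in parse_csv(NON_PROD_NAMESPACES):
        return "SLACK_WEBHOOK_URL_NONPROD"
    if namespace in parse_csv(WATCH_NAMESPACES):
        return "SLACK_WEBHOOK_URL"
    return None  # namespace is not monitored at all
```

Under these assumptions, an incident in staging goes to the non-prod channel, one in production goes to the main channel, and anything in kube-system or an unlisted namespace is dropped before it can page anyone.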

4. The Guardrail: Read-Only RBAC

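The most important safety rule for AI in SRE is: AI proposes, humans approve. The Guardian is restricted by a read-oriented ClusterRole; it can observe everything it needs to understand the incident, but it has no authority to mutate workloads or manifests. The only write verbs in the role are create, patch, and update on events, presumably so the Guardian can record its findings as Kubernetes events. Note that the role also grants read access to secrets, which is why sanitization before any LLM call matters.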

Guardian ClusterRole

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: k8s-ai-monitor
  labels:
    app.kubernetes.io/name: k8s-ai-monitor
rules:
  - apiGroups: [""]
    resources:
      - pods
      - pods/log
      - events
      - namespaces
      - nodes
      - services
      - endpoints
      - persistentvolumeclaims
      - configmaps
      - secrets
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - events
    verbs: ["create", "patch", "update"]
  - apiGroups: ["apps"]
    resources:
      - deployments
      - replicasets
      - statefulsets
      - daemonsets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - jobs
      - cronjobs
    verbs: ["get", "list", "watch"]
  - apiGroups: ["autoscaling"]
    resources:
      - horizontalpodautoscalers
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - apiGroups: ["cert-manager.io"]
    resources:
      - certificates
    verbs: ["get", "list", "watch"]
  - apiGroups: ["metrics.k8s.io"]
    resources:
      - pods
      - nodes
    verbs: ["get", "list"]
  - apiGroups: ["postgresql.cnpg.io"]
    resources:
      - clusters
      - backups
      - scheduledbackups
    verbs: ["get", "list", "watch"]
  - apiGroups: ["psmdb.percona.com"]
    resources:
      - perconaservermongodbs
      - perconaservermongodbbackups
    verbs: ["get", "list", "watch"]
  - apiGroups: ["kustomize.toolkit.fluxcd.io"]
    resources:
      - kustomizations
    verbs: ["get", "list", "watch"]
  - apiGroups: ["helm.toolkit.fluxcd.io"]
    resources:
      - helmreleases
    verbs: ["get", "list", "watch"]
  - apiGroups: ["source.toolkit.fluxcd.io"]
    resources:
      - gitrepositories
      - helmrepositories
      - helmcharts
    verbs: ["get", "list", "watch"]
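
A read-only claim like this is easy to check mechanically. The following sketch scans a list of RBAC rules and reports every verb that is not get, list, or watch. The `write_grants` helper is hypothetical, and the `rules` list is a hand-copied subset of the ClusterRole above rather than the parsed file.

```python
READ_ONLY_VERBS = {"get", "list", "watch"}


def write_grants(rules):
    """Yield (apiGroup, resource, verb) for every non-read verb in the rules."""
    for rule in rules:
        extra = sorted(set(rule["verbs"]) - READ_ONLY_VERBS)
        for group in rule["apiGroups"]:
            for resource in rule["resources"]:
                for verb in extra:
                    yield (group, resource, verb)


# A hand-copied subset of the ClusterRole above; load the full file with a
# YAML parser of your choice if you want to audit every rule.
rules = [
    {"apiGroups": [""], "resources": ["pods", "pods/log", "secrets"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": [""], "resources": ["events"],
     "verbs": ["create", "patch", "update"]},
    {"apiGroups": ["apps"], "resources": ["deployments", "statefulsets"],
     "verbs": ["get", "list", "watch"]},
]

for group, resource, verb in write_grants(rules):
    print(f"write access: {verb} on {resource} (apiGroup {group!r})")
```

For this subset the only lines printed concern events, which matches the intent of the role. A check like this could run in CI so that a later change adding, say, patch on deployments is caught in review. A wildcard verb would also be reported, since it is not in the read-only set.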

5. Verification: Did I Get It?

Verify the Guardian is active and capturing cluster signals:

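# Check the Guardian logs for detection events
kubectl -n observability logs deploy/k8s-ai-monitor
# In a second terminal, forward the Guardian's HTTP port (8080) to your machine
kubectl -n observability port-forward deploy/k8s-ai-monitor 8080:8080
# Query the Guardian's API for the current incident state
curl http://localhost:8080/state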

Expected Output: You should see a list of recent cluster events being processed, sanitized, and (if applicable) grouped into incident records.


Detailed Lessons

Hands-On Materials

Labs, quizzes, and runbooks (available to course members):

  • Lab: Guardian on Top of Controlled Chaos
  • Quiz: Chapter 13 (AI-Assisted SRE Guardian)
  • Runbook: AI Guardian Operations

The Incident: The Alert Storm

Result: Tooling is not enough if it does not provide a single, normalized, and actionable picture of the incident.

Observed Symptoms

What the team sees first: Many alerts are technically true but operationally …

Investigation & Containment

Safe investigation sequence:

  1. Inspect Raw Signals: Review the raw Kubernetes events and metrics entering the Guardian.
  2. Verify Sanitization: Confirm that secrets, tokens, and context budgets are correctly handled before …

Workflow & Analysis Pipeline

  1. Detect: From real-time events, Flux stalled conditions, and periodic scanners (Pods, PVCs, Certs, etc.).
  2. Analyze: Collects pod state, logs, and metrics, then sanitizes and budgets the context.
  3. Decide: Creates/updates an …
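
The "sanitizes and budgets the context" part of the Analyze step can be sketched as follows. This is an illustration under stated assumptions, not the Guardian's code: the redaction patterns are deliberately small, the function names are hypothetical, and the budget is counted in characters as a stand-in for whatever token accounting the real tool uses.

```python
import re

# Illustrative patterns only; a real redaction list would be broader.
# Order matters: bearer tokens are redacted before key=value pairs.
REDACTIONS = [
    (re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    (re.compile(r"(?i)(password|token|api[_-]?key|secret)(\s*[=:]\s*)\S+"),
     r"\1\2[REDACTED]"),
]


def sanitize(text: str) -> str:
    """Replace values that look like credentials with a placeholder."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


def build_context(lines, max_chars: int) -> str:
    """Sanitize log lines, then keep the newest ones that fit the budget."""
    kept = []
    used = 0
    for line in reversed(lines):      # newest lines are last in the list
        clean = sanitize(line)
        cost = len(clean) + 1         # +1 reserves room for a newline
        if used + cost > max_chars:
            break
        kept.append(clean)
        used += cost
    return "\n".join(reversed(kept))
```

Sanitizing before budgeting matters: the budget is then measured on exactly the text that would be sent, and no unredacted line is ever held in the outgoing context. Keeping the newest lines first reflects the assumption that recent log output is the most relevant to an active incident.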

Lab & Completion

Done When

You have completed this chapter when:

  • The Guardian has captured and analyzed at least one controlled chaos scenario.
  • You can explain why the Guardian has a read-only RBAC boundary.
  • You have successfully …