Chapter 13 — AI-Assisted SRE Guardian (Part 1)
Alert fatigue is real. During an incident, you are often flooded with dozens of technical alerts but lack a clear root cause. In this chapter, we deploy an AI Guardian that normalizes signals, enriches them with context, and assists in triage without taking risky autonomous actions.
Multiple warning signals fire after a failure. Responders receive fragmented alerts with no clear priority, leading to manual triage that burns time on duplicate noise while the real impact grows. The problem is not a lack of detection; it is a lack of normalization.
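To make "normalization" concrete, here is a minimal sketch of the idea: heterogeneous signals are mapped onto one schema and collapsed by a stable fingerprint, so ten alerts about the same failing resource become one triage candidate. The schema and function names are illustrative assumptions, not the Guardian's actual API.

```python
from dataclasses import dataclass

# Hypothetical schema: the real k8s-ai-monitor types are not shown in this chapter.
@dataclass(frozen=True)
class Signal:
    source: str     # "prometheus", "k8s-event", "flux", ...
    namespace: str
    resource: str   # e.g. "deployment/checkout"
    reason: str     # e.g. "KubePodCrashLooping"
    message: str

def fingerprint(sig: Signal) -> str:
    """Alerts about the same resource and failure reason share a
    fingerprint, regardless of which system emitted them."""
    return f"{sig.namespace}/{sig.resource}/{sig.reason}"

def normalize(signals: list[Signal]) -> dict[str, list[Signal]]:
    """Group raw signals by fingerprint so responders see one incident
    candidate instead of N duplicate alerts."""
    groups: dict[str, list[Signal]] = {}
    for sig in signals:
        groups.setdefault(fingerprint(sig), []).append(sig)
    return groups
```

Two alerts from different sources about the same crash-looping deployment collapse into a single group, while an unrelated alert stays separate.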
We use the k8s-ai-monitor to act as a filter between the cluster and the human.
Our sre/ repo deploys the Guardian as a singleton operator in the observability namespace. It watches the cluster's health heartbeats and reacts to failures in real time.
Guardian deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-ai-monitor
  labels:
    app.kubernetes.io/name: k8s-ai-monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: k8s-ai-monitor
  template:
    metadata:
      labels:
        app.kubernetes.io/name: k8s-ai-monitor
    spec:
      serviceAccountName: k8s-ai-monitor
      securityContext:
        fsGroup: 1000
      imagePullSecrets:
        - name: ghcr-credentials-docker
      terminationGracePeriodSeconds: 30
      containers:
        - name: k8s-ai-monitor
          image: ${image_registry}/k8s-ai-monitor:main # {"$imagepolicy": "observability:k8s-ai-monitor:tag"}
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
          env:
            - name: CLUSTER_NAME
              value: safeops
            - name: WATCH_NAMESPACES
              value: production
            - name: NON_PROD_NAMESPACES
              value: develop,staging
            - name: EXCLUDE_NAMESPACES
              value: kube-system,kube-public,kube-node-lease,flux-system
            - name: LOG_LEVEL
              value: INFO
            - name: LLM_PROVIDER
              value: openai
            - name: PROMETHEUS_URL
              value: http://kube-prometheus-stack-prometheus.observability.svc.cluster.local:9090
            - name: SQLITE_PATH
              value: /data/k8s-ai-monitor.db
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: openai-api-key
                  optional: true
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: anthropic-api-key
                  optional: true
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: slack-webhook-url
                  optional: true
            - name: SLACK_WEBHOOK_URL_NONPROD
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: slack-webhook-url-nonprod
                  optional: true
            - name: INTERNAL_TOKEN
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: internal-token
                  optional: true
            - name: ELASTICSEARCH_URL
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: elasticsearch-url
                  optional: true
            - name: ELASTICSEARCH_USER
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: elasticsearch-user
                  optional: true
            - name: ELASTICSEARCH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: elasticsearch-password
                  optional: true
            - name: SCANNER_CRITICAL_ENDPOINT_ENABLED
              value: "true"
            - name: ENDPOINT_INGRESS_SERVICE
              value: traefik.traefik.svc.cluster.local
            - name: SCANNER_BACKUP_ENABLED
              value: "true"
          volumeMounts:
            - name: data
              mountPath: /data
          readinessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 20
            periodSeconds: 20
          resources:
            requests:
              cpu: 10m
              memory: 64Mi
            limits:
              cpu: 100m
              memory: 256Mi
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: k8s-ai-monitor-data
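The Deployment mounts a PersistentVolumeClaim named k8s-ai-monitor-data for the SQLite database, but the claim itself is not shown above. A minimal sketch of what it might look like follows; the access mode and 1Gi size are assumptions, not values taken from the repo.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: k8s-ai-monitor-data
spec:
  accessModes:
    - ReadWriteOnce   # a singleton operator needs only one writer
  resources:
    requests:
      storage: 1Gi    # assumed size; the incident database is small
```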
The most important safety rule for AI in SRE is: AI proposes, humans approve. The Guardian is restricted by a read-oriented ClusterRole; it can observe everything it needs to understand the incident, but it has no authority to mutate workloads or manifests.
Guardian ClusterRole (read-only boundary)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: k8s-ai-monitor
  labels:
    app.kubernetes.io/name: k8s-ai-monitor
rules:
  - apiGroups: [""]
    resources:
      - pods
      - pods/log
      - events
      - namespaces
      - nodes
      - services
      - endpoints
      - persistentvolumeclaims
      - configmaps
      - secrets
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - events
    verbs: ["create", "patch", "update"]
  - apiGroups: ["apps"]
    resources:
      - deployments
      - replicasets
      - statefulsets
      - daemonsets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - jobs
      - cronjobs
    verbs: ["get", "list", "watch"]
  - apiGroups: ["autoscaling"]
    resources:
      - horizontalpodautoscalers
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - apiGroups: ["cert-manager.io"]
    resources:
      - certificates
    verbs: ["get", "list", "watch"]
  - apiGroups: ["metrics.k8s.io"]
    resources:
      - pods
      - nodes
    verbs: ["get", "list"]
  - apiGroups: ["postgresql.cnpg.io"]
    resources:
      - clusters
      - backups
      - scheduledbackups
    verbs: ["get", "list", "watch"]
  - apiGroups: ["psmdb.percona.com"]
    resources:
      - perconaservermongodbs
      - perconaservermongodbbackups
    verbs: ["get", "list", "watch"]
  - apiGroups: ["kustomize.toolkit.fluxcd.io"]
    resources:
      - kustomizations
    verbs: ["get", "list", "watch"]
  - apiGroups: ["helm.toolkit.fluxcd.io"]
    resources:
      - helmreleases
    verbs: ["get", "list", "watch"]
  - apiGroups: ["source.toolkit.fluxcd.io"]
    resources:
      - gitrepositories
      - helmrepositories
      - helmcharts
    verbs: ["get", "list", "watch"]
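Assuming the ClusterRole above is bound to the Guardian's service account via a ClusterRoleBinding (not shown here), you can confirm the read-only boundary directly with kubectl's built-in access review:

```shell
# Read access should be granted (expect "yes")
kubectl auth can-i get pods \
  --as=system:serviceaccount:observability:k8s-ai-monitor

# Mutating workloads should be denied (expect "no" for both)
kubectl auth can-i delete pods \
  --as=system:serviceaccount:observability:k8s-ai-monitor
kubectl auth can-i patch deployments.apps \
  --as=system:serviceaccount:observability:k8s-ai-monitor
```

This is a useful check to keep in CI: if a future manifest change accidentally widens the Guardian's verbs, the denial assertions fail before the change reaches the cluster.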
Verify the Guardian is active and capturing cluster signals:
# Check the Guardian logs for detection events
kubectl -n observability logs deploy/k8s-ai-monitor
# Query the Guardian's API for the current incident state
# (the port is not exposed outside the cluster, so port-forward first)
kubectl -n observability port-forward deploy/k8s-ai-monitor 8080:8080 &
curl http://localhost:8080/state
Expected Output: You should see a list of recent cluster events being processed, sanitized, and (if applicable) grouped into incident records.
Result: Tooling is not enough if it does not provide a single, normalized, and actionable picture of the incident.
Observed Symptoms (what the team sees first): Many alerts are technically true but operationally …
Safe investigation sequence:
1. Inspect Raw Signals: Review the raw Kubernetes events and metrics entering the Guardian.
2. Verify Sanitization: Confirm that secrets, tokens, and context budgets are correctly handled before …
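The sanitization step above can be sketched in a few lines: redact secret-looking substrings before any text leaves the cluster, then enforce a context budget so only the most relevant tail of the logs goes to the LLM. This is an illustrative sketch, not the Guardian's actual sanitizer; the patterns and the 8000-character budget are assumptions.

```python
import re

# Assumed patterns for secret-looking material; a real sanitizer would be broader.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|token|password|secret)\s*[:=]\s*\S+"),
    re.compile(r"sk-[A-Za-z0-9]{20,}"),  # OpenAI-style API keys
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWTs
]

def sanitize(text: str) -> str:
    """Redact secret-looking substrings before the text leaves the cluster."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def budget(text: str, max_chars: int = 8000) -> str:
    """Crude context budget: keep the tail of the logs, where the failure
    usually is, instead of sending everything to the LLM."""
    return text if len(text) <= max_chars else text[-max_chars:]
```

Redaction runs before budgeting, so a secret near the start of the logs is scrubbed even if that region is later truncated away.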
Detect: From real-time events, Flux stalled conditions, and periodic scanners (Pods, PVCs, Certs, etc.).
Analyze: Collects pod state, logs, and metrics, then sanitizes and budgets the context.
Decide: Creates/updates an …
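The Decide step above can be thought of as an upsert keyed by a stable fingerprint: the first occurrence opens an incident record, and repeats increment a counter instead of paging the responder again. The field names here are assumptions for illustration, not the Guardian's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    """Hypothetical incident record; the real schema lives in SQLite."""
    fingerprint: str
    status: str = "open"
    occurrences: int = 0
    messages: list[str] = field(default_factory=list)

def upsert(incidents: dict[str, Incident], fingerprint: str, message: str) -> Incident:
    """Create a record on first sight; on repeats, accumulate evidence
    on the existing record instead of opening a duplicate."""
    inc = incidents.get(fingerprint)
    if inc is None:
        inc = Incident(fingerprint=fingerprint)
        incidents[fingerprint] = inc
    inc.occurrences += 1
    inc.messages.append(message)
    return inc
```

Keying on the fingerprint rather than on each raw alert is what turns a stream of duplicate noise into a single growing incident record.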
Done When:
- The Guardian has captured and analyzed at least one controlled chaos scenario.
- You can explain why the Guardian has a read-only RBAC boundary.
- You have successfully …