Chapter 13 — AI-Assisted SRE Guardian (Part 1)
Alert fatigue is real. During an incident, you are often flooded with dozens of technical alerts but lack a clear root cause. In this chapter, we deploy an AI Guardian that normalizes signals, enriches them with context, and assists in triage without taking risky autonomous actions.
Multiple warning signals fire after a failure. Responders receive fragmented alerts with no clear priority, leading to manual triage that burns time on duplicate noise while the real impact grows. The problem is not a lack of detection; it is a lack of normalization.
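To make "normalization" concrete, here is a minimal sketch of the idea: heterogeneous signals are mapped onto one schema and collapsed by a stable fingerprint, so ten alerts about the same failing resource become one triage candidate. The schema and function names are illustrative assumptions, not the Guardian's actual API.

```python
from dataclasses import dataclass

# Hypothetical schema: the real k8s-ai-monitor types are not shown in this chapter.
@dataclass(frozen=True)
class Signal:
    source: str     # "prometheus", "k8s-event", "flux", ...
    namespace: str
    resource: str   # e.g. "deployment/checkout"
    reason: str     # e.g. "KubePodCrashLooping"
    message: str

def fingerprint(sig: Signal) -> str:
    """Alerts about the same resource and failure reason share a
    fingerprint, regardless of which system emitted them."""
    return f"{sig.namespace}/{sig.resource}/{sig.reason}"

def normalize(signals: list[Signal]) -> dict[str, list[Signal]]:
    """Group raw signals by fingerprint so responders see one incident
    candidate instead of N duplicate alerts."""
    groups: dict[str, list[Signal]] = {}
    for sig in signals:
        groups.setdefault(fingerprint(sig), []).append(sig)
    return groups
```

Two alerts from different sources about the same crash-looping deployment collapse into a single group, while an unrelated alert stays separate.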
We use the k8s-ai-monitor to act as a filter between the cluster and the human.
Our sre/ repo deploys the Guardian as a singleton operator in the observability namespace. It watches the cluster's health heartbeats and reacts to failures in real time.
Guardian deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-ai-monitor
  labels:
    app.kubernetes.io/name: k8s-ai-monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: k8s-ai-monitor
  template:
    metadata:
      labels:
        app.kubernetes.io/name: k8s-ai-monitor
    spec:
      serviceAccountName: k8s-ai-monitor
      securityContext:
        fsGroup: 1000
      imagePullSecrets:
        - name: ghcr-credentials-docker
      terminationGracePeriodSeconds: 30
      containers:
        - name: k8s-ai-monitor
          image: ${image_registry}/k8s-ai-monitor:main # {"$imagepolicy": "observability:k8s-ai-monitor:tag"}
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
          env:
            - name: CLUSTER_NAME
              value: safeops
            - name: WATCH_NAMESPACES
              value: production
            - name: NON_PROD_NAMESPACES
              value: develop,staging
            - name: EXCLUDE_NAMESPACES
              value: kube-system,kube-public,kube-node-lease,flux-system
            - name: LOG_LEVEL
              value: INFO
            - name: LLM_PROVIDER
              value: openai
            - name: PROMETHEUS_URL
              value: http://kube-prometheus-stack-prometheus.observability.svc.cluster.local:9090
            - name: SQLITE_PATH
              value: /data/k8s-ai-monitor.db
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: openai-api-key
                  optional: true
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: anthropic-api-key
                  optional: true
            - name: SLACK_WEBHOOK_URL
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: slack-webhook-url
                  optional: true
            - name: SLACK_WEBHOOK_URL_NONPROD
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: slack-webhook-url-nonprod
                  optional: true
            - name: INTERNAL_TOKEN
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: internal-token
                  optional: true
            - name: ELASTICSEARCH_URL
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: elasticsearch-url
                  optional: true
            - name: ELASTICSEARCH_USER
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: elasticsearch-user
                  optional: true
            - name: ELASTICSEARCH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: k8s-ai-monitor-secrets
                  key: elasticsearch-password
                  optional: true
            - name: SCANNER_CRITICAL_ENDPOINT_ENABLED
              value: "true"
            - name: ENDPOINT_INGRESS_SERVICE
              value: traefik.traefik.svc.cluster.local
            - name: SCANNER_BACKUP_ENABLED
              value: "true"
          volumeMounts:
            - name: data
              mountPath: /data
          readinessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: http
            initialDelaySeconds: 20
            periodSeconds: 20
          resources:
            requests:
              cpu: 10m
              memory: 64Mi
            limits:
              cpu: 100m
              memory: 256Mi
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: k8s-ai-monitor-data
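The Deployment mounts a PersistentVolumeClaim named k8s-ai-monitor-data for the SQLite database, but the claim itself is not shown above. A minimal sketch of what it might look like follows; the access mode and 1Gi size are assumptions, not values taken from the repo.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: k8s-ai-monitor-data
spec:
  accessModes:
    - ReadWriteOnce   # a singleton operator needs only one writer
  resources:
    requests:
      storage: 1Gi    # assumed size; the incident database is small
```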
The most important safety rule for AI in SRE is: AI proposes, humans approve. The Guardian is restricted by a read-oriented ClusterRole; it can observe everything it needs to understand the incident, but it has no authority to mutate workloads or manifests.
Guardian ClusterRole (read-only boundary)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: k8s-ai-monitor
  labels:
    app.kubernetes.io/name: k8s-ai-monitor
rules:
  - apiGroups: [""]
    resources:
      - pods
      - pods/log
      - events
      - namespaces
      - nodes
      - services
      - endpoints
      - persistentvolumeclaims
      - configmaps
      - secrets
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - events
    verbs: ["create", "patch", "update"]
  - apiGroups: ["apps"]
    resources:
      - deployments
      - replicasets
      - statefulsets
      - daemonsets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - jobs
      - cronjobs
    verbs: ["get", "list", "watch"]
  - apiGroups: ["autoscaling"]
    resources:
      - horizontalpodautoscalers
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - apiGroups: ["cert-manager.io"]
    resources:
      - certificates
    verbs: ["get", "list", "watch"]
  - apiGroups: ["metrics.k8s.io"]
    resources:
      - pods
      - nodes
    verbs: ["get", "list"]
  - apiGroups: ["postgresql.cnpg.io"]
    resources:
      - clusters
      - backups
      - scheduledbackups
    verbs: ["get", "list", "watch"]
  - apiGroups: ["psmdb.percona.com"]
    resources:
      - perconaservermongodbs
      - perconaservermongodbbackups
    verbs: ["get", "list", "watch"]
  - apiGroups: ["kustomize.toolkit.fluxcd.io"]
    resources:
      - kustomizations
    verbs: ["get", "list", "watch"]
  - apiGroups: ["helm.toolkit.fluxcd.io"]
    resources:
      - helmreleases
    verbs: ["get", "list", "watch"]
  - apiGroups: ["source.toolkit.fluxcd.io"]
    resources:
      - gitrepositories
      - helmrepositories
      - helmcharts
    verbs: ["get", "list", "watch"]
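Assuming the ClusterRole above is bound to the Guardian's service account via a ClusterRoleBinding (not shown here), you can confirm the read-only boundary directly with kubectl's built-in access review:

```shell
# Read access should be granted (expect "yes")
kubectl auth can-i get pods \
  --as=system:serviceaccount:observability:k8s-ai-monitor

# Mutating workloads should be denied (expect "no" for both)
kubectl auth can-i delete pods \
  --as=system:serviceaccount:observability:k8s-ai-monitor
kubectl auth can-i patch deployments.apps \
  --as=system:serviceaccount:observability:k8s-ai-monitor
```

This is a useful check to keep in CI: if a future manifest change accidentally widens the Guardian's verbs, the denial assertions fail before the change reaches the cluster.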
Verify the Guardian is active and capturing cluster signals:
# Check the Guardian logs for detection events
kubectl -n observability logs deploy/k8s-ai-monitor
# Query the Guardian's API for the current incident state
# (the port is not exposed outside the cluster, so port-forward first)
kubectl -n observability port-forward deploy/k8s-ai-monitor 8080:8080 &
curl http://localhost:8080/state
Expected Output: You should see a list of recent cluster events being processed, sanitized, and (if applicable) grouped into incident records.
Result: Tooling is not enough if it does not provide a single, normalized, and actionable picture of the incident.
Observed Symptoms (what the team sees first): Many alerts are technically true but operationally …
Safe investigation sequence:
1. Inspect Raw Signals: Review the raw Kubernetes events and metrics entering the Guardian.
2. Verify Sanitization: Confirm that secrets, tokens, and context budgets are correctly handled before …
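The sanitization step above can be sketched in a few lines: redact secret-looking substrings before any text leaves the cluster, then enforce a context budget so only the most relevant tail of the logs goes to the LLM. This is an illustrative sketch, not the Guardian's actual sanitizer; the patterns and the 8000-character budget are assumptions.

```python
import re

# Assumed patterns for secret-looking material; a real sanitizer would be broader.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|token|password|secret)\s*[:=]\s*\S+"),
    re.compile(r"sk-[A-Za-z0-9]{20,}"),  # OpenAI-style API keys
    re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),  # JWTs
]

def sanitize(text: str) -> str:
    """Redact secret-looking substrings before the text leaves the cluster."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def budget(text: str, max_chars: int = 8000) -> str:
    """Crude context budget: keep the tail of the logs, where the failure
    usually is, instead of sending everything to the LLM."""
    return text if len(text) <= max_chars else text[-max_chars:]
```

Redaction runs before budgeting, so a secret near the start of the logs is scrubbed even if that region is later truncated away.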
Detect: From real-time events, Flux stalled conditions, and periodic scanners (Pods, PVCs, Certs, etc.).
Analyze: Collects pod state, logs, and metrics, then sanitizes and budgets the context.
Decide: Creates/updates an …
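The Decide step above can be thought of as an upsert keyed by a stable fingerprint: the first occurrence opens an incident record, and repeats increment a counter instead of paging the responder again. The field names here are assumptions for illustration, not the Guardian's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    """Hypothetical incident record; the real schema lives in SQLite."""
    fingerprint: str
    status: str = "open"
    occurrences: int = 0
    messages: list[str] = field(default_factory=list)

def upsert(incidents: dict[str, Incident], fingerprint: str, message: str) -> Incident:
    """Create a record on first sight; on repeats, accumulate evidence
    on the existing record instead of opening a duplicate."""
    inc = incidents.get(fingerprint)
    if inc is None:
        inc = Incident(fingerprint=fingerprint)
        incidents[fingerprint] = inc
    inc.occurrences += 1
    inc.messages.append(message)
    return inc
```

Keying on the fingerprint rather than on each raw alert is what turns a stream of duplicate noise into a single growing incident record.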
Done When:
- The Guardian has captured and analyzed at least one controlled chaos scenario.
- You can explain why the Guardian has a read-only RBAC boundary.
- You have successfully …