Chapter 12: AI-Assisted SRE Guardian (Draft)

Why This Chapter Exists

Chaos testing and alerts generate noise unless incidents are normalized and prioritized. This chapter introduces an AI-assisted guardian that analyzes incidents, proposes actions, and escalates safely without auto-fixing production.

Scope (Current Draft)

Implementation target is ../k8s-ai-monitor/:

  • Kopf operator handlers for events and Flux objects
  • scanner loops for pod/pvc/certificate/endpoint
  • LLM analysis with strict JSON schema
  • incident lifecycle backend (SQLite preferred)
  • confidence-based human escalation

Guardian Responsibilities

  1. Detect:
  • Kubernetes Warning events
  • Flux stalled conditions
  • periodic scanner findings
  2. Analyze:
  • collect structured context
  • sanitize sensitive data
  • enforce context budget
  • call LLM for structured root-cause hypotheses
  3. Decide:
  • create/update incident record
  • deduplicate repeated noise
  • escalate recurring/persistent incidents
  4. Notify:
  • send structured alert
  • expose incident APIs for ack/resolve
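The four responsibilities can be sketched as one linear pass. This is an illustrative Python sketch, not the actual code in src/engine/pipeline.py: `Incident`, `process_event`, and the dict-backed store are hypothetical stand-ins for the SQLite incident store and the sanitized LLM call.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    key: str            # dedup key, e.g. "<namespace>/<kind>/<reason>"
    summary: str
    confidence: float   # 0.0-1.0, as reported by the LLM analysis
    count: int = 1      # how many times this incident has recurred

def process_event(event: dict, store: dict, threshold: float = 0.7) -> str:
    """Detect -> analyze -> decide -> notify, as one linear pass.

    `store` stands in for the SQLite incident store; in the real
    pipeline, confidence would come from the sanitized LLM analysis.
    """
    key = f"{event['namespace']}/{event['kind']}/{event['reason']}"
    if key in store:                       # decide: deduplicate repeated noise
        store[key].count += 1
        return "deduplicated"
    incident = Incident(key=key,
                        summary=event.get("message", ""),
                        confidence=event.get("confidence", 0.0))
    store[key] = incident                  # decide: persist incident record
    if incident.confidence < threshold:    # guardrail: low confidence -> human
        return "escalated-to-human"
    return "notified"                      # notify: send structured alert
```

The dedup key is the decision that keeps a CrashLoopBackOff that fires every 30 seconds from becoming 120 incidents an hour.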

Guardrails

  • AI proposes; human approves remediation.
  • No autonomous write-back to production workloads.
  • Confidence below the threshold requires explicit human review.
  • Secret/token redaction is mandatory before LLM call.
  • Rate and cost limits are mandatory.
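The redaction guardrail can be approximated with a small pattern-based sanitizer. A minimal sketch, assuming regex patterns for key/value credentials and JWT-shaped tokens; the actual policy in src/engine/sanitizer.py is presumably broader than these two rules.

```python
import re

# Illustrative patterns only -- not the real sanitizer.py policy.
REDACTIONS = [
    # key: value or key=value credential pairs
    (re.compile(r"(?i)(password|token|secret)\s*[:=]\s*\S+"), r"\1=[REDACTED]"),
    # anything shaped like a JWT (three base64url segments starting with eyJ)
    (re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
     "[REDACTED-JWT]"),
]

def sanitize(text: str) -> str:
    """Strip obvious credentials before any text reaches the LLM."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Because this runs before the LLM call, a miss here leaks data to a third party, which is why the chapter treats redaction as mandatory rather than best-effort.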

Repository Mapping

  • Guardian config: ../k8s-ai-monitor/src/config.py
  • Event handlers: ../k8s-ai-monitor/src/handlers/events.py, ../k8s-ai-monitor/src/handlers/flux.py
  • Scanner startup loops + HTTP API: ../k8s-ai-monitor/src/handlers/startup.py
  • Processing pipeline: ../k8s-ai-monitor/src/engine/pipeline.py
  • LLM schema + cost tracking: ../k8s-ai-monitor/src/engine/llm.py
  • Sanitizer: ../k8s-ai-monitor/src/engine/sanitizer.py
  • Incident store: ../k8s-ai-monitor/src/engine/store/sqlite.py

Lab Files

  • lab.md
  • runbook-guardian.md
  • quiz.md

Done When (MVP)

  • guardian catches one Chapter 11 chaos scenario
  • incident is persisted with structured analysis and confidence
  • on-call can ack/resolve incident via API
  • one escalation scenario is demonstrated (recurring or persistent)

Lab: Guardian on Top of Controlled Chaos (Draft)

Goal

Validate guardian flow end-to-end for one controlled incident:

  • detect from cluster events/scanner
  • structured AI analysis
  • incident persistence and lifecycle actions

Prerequisites

  • Chapter 11 chaos flow available in develop
  • k8s-ai-monitor image built and deployed in playground cluster
  • STATE_BACKEND=sqlite
  • API token configured for write endpoints

Step 1: Trigger Controlled Failure

Use one scenario:

  • backend /status/500 burst, or
  • backend /panic, or
  • one manual Chaos Monkey run.

Capture start timestamp.
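For the /status/500 burst, a short Python loop is enough. The `TARGET` URL is an assumption and must be replaced with your backend Service address; the script also prints the start timestamp this step asks you to capture.

```python
import datetime
import urllib.error
import urllib.request

# Hypothetical in-cluster URL -- substitute your backend Service address.
TARGET = "http://backend.playground.svc:8080/status/500"

def burst(url: str, n: int = 20) -> list[int]:
    """Fire n requests and return the observed status codes."""
    codes = []
    for _ in range(n):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                codes.append(resp.status)
        except urllib.error.HTTPError as exc:
            codes.append(exc.code)   # 4xx/5xx arrive as HTTPError
        except OSError:
            codes.append(-1)         # connection/timeout failure
    return codes

if __name__ == "__main__":
    print("start:", datetime.datetime.now(datetime.timezone.utc).isoformat())
    print(burst(TARGET))
```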

Quiz: Chapter 12 (AI-Assisted SRE Guardian)

Questions

  1. Why should guardian not auto-fix production by default?

  2. Which backend enables the full incident lifecycle in guardian?

  • A) configmap
  • B) sqlite
  • C) in-memory only

  3. Which output format is required from the LLM for reliable automation boundaries?

  4. What should happen when confidence is low?

  5. Which is a mandatory pre-LLM guardrail?

  • A) plaintext env dump
  • B) sanitizer/redaction
  • C) unlimited context

  6. What is the purpose of dedup + cooldown in the guardian pipeline?

Runbook: AI Guardian Operations (Draft)

Purpose

Operate the guardian as a safe incident triage layer, not an auto-remediation engine.

Runtime Checks

  1. Health:
  curl -s http://localhost:8080/healthz

  2. Recent incidents:
  curl -s http://localhost:8080/incidents | jq

  3. LLM usage and rate:
  curl -s "http://localhost:8080/llm-usage?hours=24" | jq

Incident Handling Workflow

  1. Confirm the symptom in platform observability.
  2. Open the guardian incident detail.
  3. Validate confidence and evidence:
  • if confidence is low, require a manual deep-dive
  • if confidence is high with strong evidence, apply the runbook action
  4. Ack the incident when ownership is clear.
  5. Resolve only after recovery is verified.

Escalation Logic

  • recurring incidents should raise urgency
  • persistent incidents should trigger a hardening task with owner/due date
  • no closure without verified mitigation
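The rules above can be expressed as a tiny policy function. `escalation_level` and its thresholds (3 recurrences, 24 hours) are assumptions for illustration, not the guardian's actual configuration.

```python
import datetime as dt

def escalation_level(count: int, first_seen: dt.datetime, now: dt.datetime,
                     recur_threshold: int = 3,
                     persist_hours: float = 24.0) -> str:
    """Map recurrence and age to an urgency level (thresholds are assumed)."""
    age_hours = (now - first_seen).total_seconds() / 3600
    if age_hours >= persist_hours:
        return "hardening-task"   # persistent -> task with owner/due date
    if count >= recur_threshold:
        return "urgent"           # recurring -> raise urgency
    return "normal"
```

Keeping this as a pure function of (count, age) makes the escalation policy trivially testable and auditable, which matters when the output drives paging.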

Security & Compliance Checks

  • ensure the sanitizer policy is active
  • verify no plaintext secrets appear in incident payloads
  • rotate API tokens regularly

Failure Modes

  1. LLM unavailable:
  • continue with raw context and manual triage
  • avoid blocking incident response
  2. SQLite unavailable:
  • fall back to configmap mode if needed for continuity
  • restore SQLite for full incident lifecycle features
  3. Alert storm:
  • tune dedup/cooldown thresholds
  • reduce scanner frequency temporarily