Chapter 12: AI-Assisted SRE Guardian (Draft)

Why This Chapter Exists

Chaos testing and alerts generate noise unless incidents are normalized and prioritized. This chapter introduces an AI-assisted guardian that analyzes incidents, proposes actions, and escalates safely without auto-fixing production.

Scope (Current Draft)

Implementation target is ../k8s-ai-monitor/:

  • Kopf operator handlers for events and Flux objects
  • scanner loops for pod/pvc/certificate/endpoint
  • LLM analysis with strict JSON schema
  • incident lifecycle backend (SQLite preferred)
  • confidence-based human escalation

Guardian Responsibilities

  1. Detect:
  • Kubernetes Warning events
  • Flux stalled conditions
  • periodic scanner findings
  2. Analyze:
  • collect structured context
  • sanitize sensitive data
  • enforce context budget
  • call LLM for structured root-cause hypotheses
  3. Decide:
  • create/update incident record
  • deduplicate repeated noise
  • escalate recurring/persistent incidents
  4. Notify:
  • send structured alert
  • expose incident APIs for ack/resolve
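The four responsibilities can be sketched as one linear pass. This is an illustrative Python sketch, not the actual code in src/engine/pipeline.py: `Incident`, `process_event`, and the dict-backed store are hypothetical stand-ins for the SQLite incident store and the sanitized LLM call.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    key: str            # dedup key, e.g. "<namespace>/<kind>/<reason>"
    summary: str
    confidence: float   # 0.0-1.0, as reported by the LLM analysis
    count: int = 1      # how many times this incident has recurred

def process_event(event: dict, store: dict, threshold: float = 0.7) -> str:
    """Detect -> analyze -> decide -> notify, as one linear pass.

    `store` stands in for the SQLite incident store; in the real
    pipeline, confidence would come from the sanitized LLM analysis.
    """
    key = f"{event['namespace']}/{event['kind']}/{event['reason']}"
    if key in store:                       # decide: deduplicate repeated noise
        store[key].count += 1
        return "deduplicated"
    incident = Incident(key=key,
                        summary=event.get("message", ""),
                        confidence=event.get("confidence", 0.0))
    store[key] = incident                  # decide: persist incident record
    if incident.confidence < threshold:    # guardrail: low confidence -> human
        return "escalated-to-human"
    return "notified"                      # notify: send structured alert
```

The dedup key is the decision that keeps a CrashLoopBackOff that fires every 30 seconds from becoming 120 incidents an hour.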

Guardrails

  • AI proposes; human approves remediation.
  • No autonomous write-back to production workloads.
  • Confidence below the threshold requires explicit human review.
  • Secret/token redaction is mandatory before LLM call.
  • Rate and cost limits are mandatory.
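The redaction guardrail can be approximated with a small pattern-based sanitizer. A minimal sketch, assuming regex patterns for key/value credentials and JWT-shaped tokens; the actual policy in src/engine/sanitizer.py is presumably broader than these two rules.

```python
import re

# Illustrative patterns only -- not the real sanitizer.py policy.
REDACTIONS = [
    # key: value or key=value credential pairs
    (re.compile(r"(?i)(password|token|secret)\s*[:=]\s*\S+"), r"\1=[REDACTED]"),
    # anything shaped like a JWT (three base64url segments starting with eyJ)
    (re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
     "[REDACTED-JWT]"),
]

def sanitize(text: str) -> str:
    """Strip obvious credentials before any text reaches the LLM."""
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```

Because this runs before the LLM call, a miss here leaks data to a third party, which is why the chapter treats redaction as mandatory rather than best-effort.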

Repository Mapping

  • Guardian config: ../k8s-ai-monitor/src/config.py
  • Event handlers: ../k8s-ai-monitor/src/handlers/events.py, ../k8s-ai-monitor/src/handlers/flux.py
  • Scanner startup loops + HTTP API: ../k8s-ai-monitor/src/handlers/startup.py
  • Processing pipeline: ../k8s-ai-monitor/src/engine/pipeline.py
  • LLM schema + cost tracking: ../k8s-ai-monitor/src/engine/llm.py
  • Sanitizer: ../k8s-ai-monitor/src/engine/sanitizer.py
  • Incident store: ../k8s-ai-monitor/src/engine/store/sqlite.py

Lab Files

  • lab.md
  • runbook-guardian.md
  • quiz.md

Done When (MVP)

  • guardian catches one Chapter 11 chaos scenario
  • incident is persisted with structured analysis and confidence
  • on-call can ack/resolve incident via API
  • one escalation scenario is demonstrated (recurring or persistent)

Lab: Guardian on Top of Controlled Chaos (Draft)

Goal

Validate guardian flow end-to-end for one controlled incident:

  • detect from cluster events/scanner
  • structured AI analysis
  • incident persistence and lifecycle actions

Prerequisites

  • Chapter 11 chaos flow available in develop
  • k8s-ai-monitor image built and deployed in playground cluster
  • STATE_BACKEND=sqlite
  • API token configured for write endpoints

Step 1: Trigger Controlled Failure

Use one scenario:

  • backend /status/500 burst, or
  • backend /panic, or
  • one manual Chaos Monkey run.

Capture start timestamp.
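For the /status/500 burst, a short Python loop is enough. The `TARGET` URL is an assumption and must be replaced with your backend Service address; the script also prints the start timestamp this step asks you to capture.

```python
import datetime
import urllib.error
import urllib.request

# Hypothetical in-cluster URL -- substitute your backend Service address.
TARGET = "http://backend.playground.svc:8080/status/500"

def burst(url: str, n: int = 20) -> list[int]:
    """Fire n requests and return the observed status codes."""
    codes = []
    for _ in range(n):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                codes.append(resp.status)
        except urllib.error.HTTPError as exc:
            codes.append(exc.code)   # 4xx/5xx arrive as HTTPError
        except OSError:
            codes.append(-1)         # connection/timeout failure
    return codes

if __name__ == "__main__":
    print("start:", datetime.datetime.now(datetime.timezone.utc).isoformat())
    print(burst(TARGET))
```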

Quiz: Chapter 12 (AI-Assisted SRE Guardian)

Questions

  1. Why should guardian not auto-fix production by default?

  2. Which backend enables the full incident lifecycle in guardian?

  • A) configmap
  • B) sqlite
  • C) in-memory only

  3. Which output format is required from the LLM for reliable automation boundaries?

  4. What should happen when confidence is low?

  5. Which is a mandatory pre-LLM guardrail?

  • A) plaintext env dump
  • B) sanitizer/redaction
  • C) unlimited context

  6. What is the purpose of dedup + cooldown in the guardian pipeline?

Runbook: AI Guardian Operations (Draft)

Purpose

Operate the guardian as a safe incident triage layer, not an auto-remediation engine.

Runtime Checks

  1. Health:
  curl -s http://localhost:8080/healthz

  2. Recent incidents:
  curl -s http://localhost:8080/incidents | jq

  3. LLM usage and rate:
  curl -s "http://localhost:8080/llm-usage?hours=24" | jq

Incident Handling Workflow

  1. Confirm the symptom in platform observability.
  2. Open the guardian incident detail.
  3. Validate confidence and evidence:
  • if confidence is low, require a manual deep-dive
  • if confidence is high with strong evidence, apply the runbook action
  4. Ack the incident when ownership is clear.
  5. Resolve only after recovery is verified.

Escalation Logic

  • recurring incidents should raise urgency
  • persistent incidents should trigger a hardening task with owner/due date
  • no closure without verified mitigation
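The rules above can be expressed as a tiny policy function. `escalation_level` and its thresholds (3 recurrences, 24 hours) are assumptions for illustration, not the guardian's actual configuration.

```python
import datetime as dt

def escalation_level(count: int, first_seen: dt.datetime, now: dt.datetime,
                     recur_threshold: int = 3,
                     persist_hours: float = 24.0) -> str:
    """Map recurrence and age to an urgency level (thresholds are assumed)."""
    age_hours = (now - first_seen).total_seconds() / 3600
    if age_hours >= persist_hours:
        return "hardening-task"   # persistent -> task with owner/due date
    if count >= recur_threshold:
        return "urgent"           # recurring -> raise urgency
    return "normal"
```

Keeping this as a pure function of (count, age) makes the escalation policy trivially testable and auditable, which matters when the output drives paging.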

Security & Compliance Checks

  • ensure the sanitizer policy is active
  • verify no plaintext secrets appear in incident payloads
  • rotate API tokens regularly

Failure Modes

  1. LLM unavailable:
  • continue with raw context and manual triage
  • avoid blocking incident response
  2. SQLite unavailable:
  • fall back to configmap mode if needed for continuity
  • restore SQLite for full incident lifecycle features
  3. Alert storm:
  • tune dedup/cooldown thresholds
  • reduce scanner frequency temporarily