Lab: Guardian on Top of Controlled Chaos

Goal

Validate guardian flow end-to-end for one controlled incident:

  • detect from cluster events/scanner
  • structured AI analysis
  • incident persistence and lifecycle actions

Prerequisites

  • Chapter 11 chaos flow available in develop
  • k8s-ai-monitor image built and deployed in playground cluster
  • STATE_BACKEND=sqlite
  • API token configured for write endpoints

Step 1: Trigger Controlled Failure

Use one scenario:

  • backend /status/500 burst, or
  • backend /panic, or
  • one manual Chaos Monkey run.

Capture start timestamp.

Step 2: Verify Detection

Check guardian logs:

kubectl -n <guardian-ns> logs deploy/k8s-ai-monitor --since=15m

Expected:

  • warning/scan detection
  • state key creation
  • analysis call entry

Step 3: Verify Incident Record

kubectl -n <guardian-ns> port-forward deploy/k8s-ai-monitor 8080:8080
curl -s http://localhost:8080/incidents | jq

Expected:

  • active incident present
  • occurrence_count >= 1

Step 4: Validate Structured Analysis

curl -s http://localhost:8080/incidents/<id> | jq

Expected fields:

  • root_cause
  • confidence
  • hypotheses[]
  • suggested_actions[]

Step 5: Incident Lifecycle Actions

curl -s -X POST -H "X-Internal-Token: <token>" http://localhost:8080/incidents/<id>/ack
curl -s -X POST -H "X-Internal-Token: <token>" http://localhost:8080/incidents/<id>/resolve

Expected:

  • status transitions to acknowledged, then resolved

Step 6: Cost/Usage Check

curl -s "http://localhost:8080/llm-usage?hours=24" | jq

Confirm:

  • calls are rate-limited
  • usage and cost visible for audit

Hard Stop Conditions

  • guardian attempts autonomous remediation
  • raw secrets/tokens visible in incident context
  • no dedup and alert storm on repeated identical events

Done When

  • one chaos incident is fully tracked by guardian
  • analysis is structured and actionable
  • lifecycle actions are auditable