Lab: Guardian on Top of Controlled Chaos
Goal
Validate guardian flow end-to-end for one controlled incident:
- detection from cluster events or the scanner
- structured AI analysis
- incident persistence and lifecycle actions
Prerequisites
- Chapter 11 chaos flow available in develop
- k8s-ai-monitor image built and deployed in playground cluster
- STATE_BACKEND=sqlite
- API token configured for write endpoints
Step 1: Trigger Controlled Failure
Use one scenario:
- backend /status/500 burst, or
- backend /panic, or
- one manual Chaos Monkey run.
Capture start timestamp.
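A minimal sketch of capturing the start timestamp, assuming a POSIX shell. The variable name START_TS is hypothetical and used only for this lab; UTC in RFC 3339 form is chosen so the value can be compared directly with log timestamps.

```shell
# Record the moment the failure is triggered, in UTC (RFC 3339 format).
# START_TS is a hypothetical variable name, not something the guardian reads.
START_TS="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "chaos start: ${START_TS}"
```

The same value can be passed to kubectl logs through its --since-time flag if a tighter window than --since=15m is wanted.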
Step 2: Verify Detection
Check guardian logs:
kubectl -n <guardian-ns> logs deploy/k8s-ai-monitor --since=15m
Expected:
- warning/scan detection
- state key creation
- analysis call entry
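The three expected log markers can be checked mechanically rather than by eye. The sketch below assumes nothing about the guardian's exact log wording: check_markers is a hypothetical helper, and the patterns in the usage comment are placeholders to be replaced with the real log phrases.

```shell
# check_markers: read log text on stdin and verify that every pattern given
# as an argument matches at least one line (case-insensitive, extended regex).
# Prints each missing pattern and returns non-zero if any is absent.
check_markers() {
  logs="$(cat)"
  missing=0
  for pattern in "$@"; do
    if ! printf '%s\n' "$logs" | grep -Eiq -- "$pattern"; then
      echo "missing: $pattern"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (patterns are placeholders; adjust to the real log wording):
# kubectl -n <guardian-ns> logs deploy/k8s-ai-monitor --since=15m \
#   | check_markers 'warning|scan' 'state key' 'analy'
```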
Step 3: Verify Incident Record
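In a separate terminal, start the port-forward (the command blocks while it runs):
kubectl -n <guardian-ns> port-forward deploy/k8s-ai-monitor 8080:8080
Then list incidents:
curl -s http://localhost:8080/incidents | jq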
Expected:
- active incident present
- occurrence_count >= 1
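Both expectations can be turned into a pass/fail check with jq. This is a sketch under stated assumptions: the /incidents response is taken to be a JSON array, and each entry is taken to carry a "status" field whose initial value is "active". Neither is confirmed by this lab, so adjust the filter to the real schema. has_active_incident is a hypothetical helper name.

```shell
# has_active_incident: read the /incidents response on stdin and succeed only
# if at least one entry is active with occurrence_count >= 1.
# Assumptions: the response is a JSON array, and "active" is the initial
# value of a top-level "status" field on each entry.
has_active_incident() {
  jq -e 'any(.[]; .status == "active" and (.occurrence_count // 0) >= 1)' >/dev/null
}

# Usage:
# curl -s http://localhost:8080/incidents | has_active_incident && echo "ok"
```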
Step 4: Validate Structured Analysis
curl -s http://localhost:8080/incidents/<id> | jq
Expected fields:
- root_cause
- confidence
- hypotheses[]
- suggested_actions[]
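A field-presence check keeps this step from depending on visual inspection. The sketch below assumes the four fields sit at the top level of the incident object; if the guardian nests them (for example under an analysis key), the jq paths need adjusting. has_analysis_fields is a hypothetical helper name.

```shell
# has_analysis_fields: read one incident object on stdin and succeed only if
# root_cause and confidence are present and both list fields are arrays.
# Assumption: the fields are top-level keys of the incident object.
has_analysis_fields() {
  jq -e '
    has("root_cause") and has("confidence")
    and (.hypotheses | type == "array")
    and (.suggested_actions | type == "array")
  ' >/dev/null
}

# Usage (replace <id> with the incident identifier):
# curl -s http://localhost:8080/incidents/<id> | has_analysis_fields && echo "ok"
```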
Step 5: Incident Lifecycle Actions
curl -s -X POST -H "X-Internal-Token: <token>" http://localhost:8080/incidents/<id>/ack
curl -s -X POST -H "X-Internal-Token: <token>" http://localhost:8080/incidents/<id>/resolve
Expected:
- status transitions to acknowledged, then resolved
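To confirm each transition, the incident can be re-read after every POST and its status compared with the expected value. This sketch assumes the status is exposed as a top-level "status" string on the incident object; expect_status is a hypothetical helper name.

```shell
# expect_status: read one incident object on stdin and succeed only if its
# "status" field equals the expected value passed as the first argument.
# Assumption: the status is a top-level "status" string.
expect_status() {
  actual="$(jq -r '.status // "missing"')"
  if [ "$actual" != "$1" ]; then
    echo "expected status '$1', got '$actual'"
    return 1
  fi
}

# Usage after each POST (replace <id> with the incident identifier):
# curl -s http://localhost:8080/incidents/<id> | expect_status acknowledged
# curl -s http://localhost:8080/incidents/<id> | expect_status resolved
```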
Step 6: Cost/Usage Check
curl -s "http://localhost:8080/llm-usage?hours=24" | jq
Confirm:
- calls are rate-limited
- usage and cost visible for audit
Hard Stop Conditions
- guardian attempts autonomous remediation
- raw secrets/tokens visible in incident context
- missing deduplication, so repeated identical events cause an alert storm
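The secrets condition can be screened with a rough text scan of the incident JSON. This is a heuristic sketch, not a complete detector: the pattern list is an assumption, it will produce false positives (for example a root cause that mentions the word password), and scan_for_secrets is a hypothetical helper name.

```shell
# scan_for_secrets: read text (for example an incident JSON document) on stdin
# and fail if any line matches a common secret-like pattern.
# Matching lines are printed with their line numbers for review.
scan_for_secrets() {
  if grep -Ein -- 'bearer [a-z0-9._-]{8,}|password|api[_-]?key|x-internal-token|BEGIN [A-Z ]*PRIVATE KEY'; then
    echo "possible secret in incident context"
    return 1
  fi
}

# Usage (replace <id> with the incident identifier):
# curl -s http://localhost:8080/incidents/<id> | scan_for_secrets
```

A non-zero exit here is a prompt for manual review of the printed lines, not proof of a leak.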
Done When
- one chaos incident is fully tracked by guardian
- analysis is structured and actionable
- lifecycle actions are auditable