Chapter 09: Observability (Metrics, Logs, Traces)

Why This Chapter Exists

Without correlated signals, incidents become guesswork. This chapter defines the minimum production baseline:

  • metrics for symptom detection
  • traces for path analysis
  • logs for evidence

Scope Decision (MVP)

  • No in-cluster OpenTelemetry Collector in this phase.
  • Frontend and backend export telemetry directly to Uptrace.
  • Target investigation path: frontend -> backend now; extend to the database once the DB layer is introduced.

References:

  • docs/observability/uptrace-cloud.md
  • docs/observability/uptrace-e2e-plan.md

The Incident Hook

Users report intermittent 5xx errors and slow responses. Dashboards show elevated latency, but the root cause is unclear. Without trace correlation, the team jumps between pods and logs blindly. With this baseline observability in place, on-call can narrow the cause in minutes.

Guardrails

  • No telemetry credentials in plaintext Git.
  • No logs-only debugging; always pivot through traces.
  • Keep rollback decision tied to evidence: metrics + traces + logs.

Repo Mapping

  • Frontend telemetry init: ../frontend/src/services/telemetry.js
  • Frontend manual spans: ../frontend/src/stores/backend.js, ../frontend/src/views/ChaosView.vue
  • Backend telemetry: ../backend/pkg/telemetry/telemetry.go
  • Backend trace/log correlation and panic endpoint: ../backend/pkg/server/server.go
  • Kubernetes env wiring: flux/apps/frontend/base/deployment.yaml, flux/apps/backend/base/deployment.yaml
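A sketch of the env wiring referenced above, assuming a Secret-backed DSN (the Secret name, key, and env var names here are assumptions, not the repo's actual values):

```yaml
# Sketch for flux/apps/backend/base/deployment.yaml (names are assumptions)
spec:
  template:
    spec:
      containers:
        - name: backend
          env:
            - name: UPTRACE_DSN             # assumed env var name
              valueFrom:
                secretKeyRef:
                  name: uptrace-credentials # assumed Secret name
                  key: dsn                  # assumed key
            - name: OTEL_SERVICE_NAME
              value: backend
```

Sourcing the DSN from a Secret keeps it out of plaintext Git, per the guardrails above.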

Lab Files

  • lab.md
  • runbook-incident-debug.md
  • sli-slo.md
  • quiz.md

Done When

  • learner can trigger and find one end-to-end trace from frontend to backend
  • learner can match backend error log by trace_id
  • learner can run incident workflow metrics -> traces -> logs -> action
  • learner can explain backend availability SLI/SLO and validate burn-rate alerts
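The trace/log correlation behind "match backend error log by trace_id" can be sketched with stdlib-only Go (the repo's real implementation lives in ../backend/pkg/telemetry and ../backend/pkg/server; the helper name here is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// traceIDFromTraceparent extracts the 32-hex-character trace-id field from a
// W3C traceparent header ("version-traceid-spanid-flags"), which is the
// context the frontend propagates to the backend. Returns "" for malformed input.
func traceIDFromTraceparent(header string) string {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return ""
	}
	return parts[1]
}

func main() {
	// Example header as an OTel-instrumented frontend would send it.
	h := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

	// Emitting trace_id in every error log line is what makes matching a
	// backend error log to its trace possible.
	fmt.Printf("{\"level\":\"error\",\"msg\":\"request failed\",\"trace_id\":\"%s\"}\n",
		traceIDFromTraceparent(h))
}
```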

Lab: Baseline Observability with Uptrace (No In-Cluster Collector)

Goal

Validate that telemetry is operational and correlated:

  • frontend creates spans for user actions
  • backend receives trace context and emits correlated logs
  • Uptrace shows trace chain and related service signals
  • Prometheus alert path is connected to the same incident workflow

Prerequisites

  • frontend and backend are deployed in one environment (recommended: develop)
  • Uptrace DSN is configured in secrets and injected into workloads
  • Flux reconciliation is healthy

Quick checks:
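A minimal sketch of the quick checks, assuming the flux and kubectl CLIs and that workloads run in a namespace named after the environment (namespace, label, and env var names are assumptions):

```shell
# Flux reconciliation healthy?
flux get kustomizations -A

# Frontend and backend pods running? (namespace/label are assumptions)
kubectl get pods -n develop -l 'app in (frontend, backend)'

# Uptrace DSN injected into the backend? (env var name is an assumption)
kubectl exec -n develop deploy/backend -- printenv | grep -i uptrace
```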

Quiz: Chapter 09 (Observability)

Questions

  1. Why is “metrics -> traces -> logs” the preferred incident flow?

  2. In this MVP, where is telemetry exported from?

  • A) only an in-cluster OTel collector
  • B) directly from frontend/backend to Uptrace
  • C) only the backend exports telemetry

  3. What header set is required for end-to-end context propagation?

  4. Which backend endpoint is used for the controlled crash-correlation drill?

  5. Which signal should confirm the symptom first during incident triage?

  6. If backend spans are orphaned (not linked to frontend spans), what should you inspect first?

Runbook: Incident Debug (Metrics -> Traces -> Logs)

Purpose

Provide one repeatable on-call path for the most common symptom:

  • elevated latency and/or sporadic 5xx

This runbook is optimized for the current MVP setup:

  • direct export to Uptrace from frontend/backend
  • no in-cluster OTel collector

Inputs

  • environment (develop, staging, or production)
  • incident window (UTC time range)
  • primary route/symptom if known

Step 1: Confirm Symptom (Metrics First)

Check service-level symptoms:

  • request rate anomaly
  • p95/p99 latency increase
  • 5xx error-rate increase

Decision:

  • symptom confirmed in metrics: pivot to traces for the affected route and window
  • no metric-level symptom: widen the time window or re-check environment scope before escalating

SLI/SLO Spec: Chapter 09 Baseline

Scope

Service in scope:

  • backend HTTP API

Environment scope:

  • develop, staging, production

Indicators (SLIs)

1. Availability SLI
  • Definition: ratio of successful requests (non-5xx) to total requests.
  • PromQL:

    1 - (
      sum(rate(app_http_requests_total{job="backend",status=~"5.."}[30m]))
      /
      clamp_min(sum(rate(app_http_requests_total{job="backend"}[30m])), 1e-9)
    )

2. Latency SLI (p95)
  • Definition: p95 backend request duration over a rolling window.
  • PromQL:

    histogram_quantile(0.95,
      sum(rate(app_http_request_duration_seconds_bucket{job="backend"}[5m])) by (le)
    )

Objectives (SLOs)

1. Availability SLO
  • Target: 99.5% over 30 days.
  • Error budget: 0.5%.
2. Latency objective (operational target)
  • Target: p95 < 1s on 5-minute windows.
  • Used for warning/critical operational alerts.

Alert Strategy

1. Immediate symptom alerts
  • BackendCriticalErrorRate
  • BackendHighLatency
  • BackendServiceDown
2. Budget consumption alerts (burn-rate)
  • BackendSLOErrorBudgetBurnCritical: fast burn on 5m and 1h windows (14.4x budget).
  • BackendSLOErrorBudgetBurnWarning: sustained burn on 30m and 1h windows (6x budget).
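The critical burn-rate alert above can be sketched as a Prometheus rule (a sketch built on the availability SLI in this spec; 14.4 times the 0.5% budget gives a 7.2% error-ratio threshold, following the multiwindow burn-rate pattern):

```yaml
groups:
  - name: backend-slo-burn
    rules:
      - alert: BackendSLOErrorBudgetBurnCritical
        # Fast burn: error ratio exceeds 14.4x the 0.5% budget on both
        # the 5m and 1h windows (short window confirms it is still burning).
        expr: |
          (
            sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
            / sum(rate(app_http_requests_total{job="backend"}[5m]))
          ) > (14.4 * 0.005)
          and
          (
            sum(rate(app_http_requests_total{job="backend",status=~"5.."}[1h]))
            / sum(rate(app_http_requests_total{job="backend"}[1h]))
          ) > (14.4 * 0.005)
        labels:
          severity: critical
        annotations:
          runbook: runbook-incident-debug.md
```

The warning alert follows the same shape with the 30m/1h windows and a 6x multiplier.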

Guardrails

  • Do not page only on single-point spikes without cross-signal evidence.
  • For customer-impact decisions, require: metrics symptom + one representative trace + correlated log line.
  • Every alert route must include runbook: runbook-incident-debug.md.