Chapter 09: Observability (Metrics, Logs, Traces)

Why This Chapter Exists

Without correlated signals, incidents become guesswork. This chapter defines the minimum production baseline:

  • metrics for symptom detection
  • traces for path analysis
  • logs for evidence

Scope Decision (MVP)

  • No in-cluster OpenTelemetry Collector in this phase.
  • Frontend and backend export telemetry directly to Uptrace.
  • Target investigation path: frontend -> backend now; extend to the database once the DB layer is introduced.

References:

  • docs/observability/uptrace-cloud.md
  • docs/observability/uptrace-e2e-plan.md

The Incident Hook

Users report intermittent 5xx errors and slow responses. Dashboards show elevated latency, but the root cause is unclear. Without trace correlation, the team jumps between pods and logs blindly. With this baseline observability in place, on-call can narrow the cause in minutes.

Guardrails

  • No telemetry credentials in plaintext Git.
  • No logs-only debugging; always pivot through traces.
  • Keep rollback decision tied to evidence: metrics + traces + logs.

Repo Mapping

  • Frontend telemetry init: ../frontend/src/services/telemetry.js
  • Frontend manual spans: ../frontend/src/stores/backend.js, ../frontend/src/views/ChaosView.vue
  • Backend telemetry: ../backend/pkg/telemetry/telemetry.go
  • Backend trace/log correlation and panic endpoint: ../backend/pkg/server/server.go
  • Kubernetes env wiring: flux/apps/frontend/base/deployment.yaml, flux/apps/backend/base/deployment.yaml
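A sketch of the env wiring referenced above, assuming a Secret-backed DSN (the Secret name, key, and env var names here are assumptions, not the repo's actual values):

```yaml
# Sketch for flux/apps/backend/base/deployment.yaml (names are assumptions)
spec:
  template:
    spec:
      containers:
        - name: backend
          env:
            - name: UPTRACE_DSN             # assumed env var name
              valueFrom:
                secretKeyRef:
                  name: uptrace-credentials # assumed Secret name
                  key: dsn                  # assumed key
            - name: OTEL_SERVICE_NAME
              value: backend
```

Sourcing the DSN from a Secret keeps it out of plaintext Git, per the guardrails above.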

Lab Files

  • lab.md
  • runbook-incident-debug.md
  • sli-slo.md
  • quiz.md

Done When

  • learner can trigger and find one end-to-end trace from frontend to backend
  • learner can match backend error log by trace_id
  • learner can run incident workflow metrics -> traces -> logs -> action
  • learner can explain backend availability SLI/SLO and validate burn-rate alerts
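The trace/log correlation behind "match backend error log by trace_id" can be sketched with stdlib-only Go (the repo's real implementation lives in ../backend/pkg/telemetry and ../backend/pkg/server; the helper name here is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// traceIDFromTraceparent extracts the 32-hex-character trace-id field from a
// W3C traceparent header ("version-traceid-spanid-flags"), which is the
// context the frontend propagates to the backend. Returns "" for malformed input.
func traceIDFromTraceparent(header string) string {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return ""
	}
	return parts[1]
}

func main() {
	// Example header as an OTel-instrumented frontend would send it.
	h := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

	// Emitting trace_id in every error log line is what makes matching a
	// backend error log to its trace possible.
	fmt.Printf("{\"level\":\"error\",\"msg\":\"request failed\",\"trace_id\":\"%s\"}\n",
		traceIDFromTraceparent(h))
}
```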

Lab: Baseline Observability with Uptrace (No In-Cluster Collector)

Goal

Validate that telemetry is operational and correlated:

  • frontend creates spans for user actions
  • backend receives trace context and emits correlated logs
  • Uptrace shows trace chain and related service signals
  • Prometheus alert path is connected to the same incident workflow

Prerequisites

  • frontend and backend are deployed in one environment (recommended: develop)
  • Uptrace DSN is configured in secrets and injected into workloads
  • Flux reconciliation is healthy

Quick checks:
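A minimal sketch of the quick checks, assuming the flux and kubectl CLIs and that workloads run in a namespace named after the environment (namespace, label, and env var names are assumptions):

```shell
# Flux reconciliation healthy?
flux get kustomizations -A

# Frontend and backend pods running? (namespace/label are assumptions)
kubectl get pods -n develop -l 'app in (frontend, backend)'

# Uptrace DSN injected into the backend? (env var name is an assumption)
kubectl exec -n develop deploy/backend -- printenv | grep -i uptrace
```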

Quiz: Chapter 09 (Observability)

Questions

  1. Why is “metrics -> traces -> logs” the preferred incident flow?

  2. In this MVP, where is telemetry exported from?

  • A) only an in-cluster OTel collector
  • B) directly from frontend/backend to Uptrace
  • C) only the backend exports telemetry

  3. What header set is required for end-to-end context propagation?

  4. Which backend endpoint is used for the controlled crash-correlation drill?

  5. Which signal should confirm the symptom first during incident triage?

  6. If backend spans are orphaned (not linked to frontend spans), what should you inspect first?

Runbook: Incident Debug (Metrics -> Traces -> Logs)

Purpose

Provide one repeatable on-call path for the most common symptom:

  • elevated latency and/or sporadic 5xx

This runbook is optimized for the current MVP setup:

  • direct export to Uptrace from frontend/backend
  • no in-cluster OTel collector

Inputs

  • environment (develop, staging, or production)
  • incident window (UTC time range)
  • primary route/symptom if known

Step 1: Confirm Symptom (Metrics First)

Check service-level symptoms:

  • request rate anomaly
  • p95/p99 latency increase
  • 5xx error-rate increase

Decision:

  • symptom confirmed in metrics: pivot to traces for the affected route and window
  • no metric-level symptom: widen the time window or re-check environment scope before escalating

SLI/SLO Spec: Chapter 09 Baseline

Scope

Service in scope:

  • backend HTTP API

Environment scope:

  • develop, staging, production

Indicators (SLIs)

1. Availability SLI
  • Definition: ratio of successful requests (non-5xx) to total requests.
  • PromQL:

    1 - (
      sum(rate(app_http_requests_total{job="backend",status=~"5.."}[30m]))
      /
      clamp_min(sum(rate(app_http_requests_total{job="backend"}[30m])), 1e-9)
    )

2. Latency SLI (p95)
  • Definition: p95 backend request duration over a rolling window.
  • PromQL:

    histogram_quantile(0.95,
      sum(rate(app_http_request_duration_seconds_bucket{job="backend"}[5m])) by (le)
    )

Objectives (SLOs)

1. Availability SLO
  • Target: 99.5% over 30 days.
  • Error budget: 0.5%.
2. Latency objective (operational target)
  • Target: p95 < 1s on 5-minute windows.
  • Used for warning/critical operational alerts.

Alert Strategy

1. Immediate symptom alerts
  • BackendCriticalErrorRate
  • BackendHighLatency
  • BackendServiceDown
2. Budget consumption alerts (burn-rate)
  • BackendSLOErrorBudgetBurnCritical: fast burn on 5m and 1h windows (14.4x budget).
  • BackendSLOErrorBudgetBurnWarning: sustained burn on 30m and 1h windows (6x budget).
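The critical burn-rate alert above can be sketched as a Prometheus rule (a sketch built on the availability SLI in this spec; 14.4 times the 0.5% budget gives a 7.2% error-ratio threshold, following the multiwindow burn-rate pattern):

```yaml
groups:
  - name: backend-slo-burn
    rules:
      - alert: BackendSLOErrorBudgetBurnCritical
        # Fast burn: error ratio exceeds 14.4x the 0.5% budget on both
        # the 5m and 1h windows (short window confirms it is still burning).
        expr: |
          (
            sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
            / sum(rate(app_http_requests_total{job="backend"}[5m]))
          ) > (14.4 * 0.005)
          and
          (
            sum(rate(app_http_requests_total{job="backend",status=~"5.."}[1h]))
            / sum(rate(app_http_requests_total{job="backend"}[1h]))
          ) > (14.4 * 0.005)
        labels:
          severity: critical
        annotations:
          runbook: runbook-incident-debug.md
```

The warning alert follows the same shape with the 30m/1h windows and a 6x multiplier.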

Guardrails

  • Do not page only on single-point spikes without cross-signal evidence.
  • For customer-impact decisions, require: metrics symptom + one representative trace + correlated log line.
  • Every alert route must include runbook: runbook-incident-debug.md.