Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • backend-alerts.yaml
  • servicemonitor.yaml

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Chapter 10: Observability (Metrics, Logs, Traces)

Learning Objectives

By the end of this chapter, you will be able to:

  • Follow the metrics-to-traces-to-logs investigation path during an incident
  • Correlate signals across service boundaries using trace_id
  • Configure ServiceMonitor for automatic Prometheus discovery
  • Distinguish symptom detection from root-cause analysis

Start with the video for the concept overview, then work through each lesson section.

Monitoring tells you if something is wrong. Observability tells you why it is wrong. In this chapter, we implement the three pillars of observability to move from guesswork to evidence-based incident response.


1. The Problem: The “Intermittent 5xx” Mystery

Users report slow responses and random errors. Dashboards show high latency, but logs are a wall of noise. You jump between Pods blindly, hoping a restart fixes it. You waste hours because your signals aren’t connected by a single causal path.

2. The Concept: The Three Pillars

We use three correlated signals to triangulate any issue:

  1. Metrics (The Symptom): High-level indicators (Rate, Errors, Latency). They answer: “Is there a problem?”
  2. Traces (The Path): A map of a single request traveling through services. They answer: “Where is it failing?”
  3. Logs (The Evidence): Detailed text records of a specific point in time. They answer: “What exactly happened?”

3. The Code: ServiceMonitor & Alerts

Our sre/ repo uses the Prometheus Operator to pull metrics from our applications. The ServiceMonitor is our contract for symptom discovery.

Backend ServiceMonitor

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend
  labels:
    app: backend
spec:
  selector:
    matchLabels:
      app: backend          # Selects the backend Service by label
  endpoints:
    - port: http            # Named port on the Service (not a numeric container port)
      path: /metrics        # Metrics endpoint exposed by the application
      interval: 30s         # Scrape every 30 seconds
      scrapeTimeout: 10s    # Abort any scrape that takes longer than 10 seconds
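
The other half of this contract is the application itself: the backend has to expose a /metrics endpoint on the named http port. The sketch below is a minimal, hypothetical example, assuming the backend is a Go service instrumented with prometheus/client_golang; the metric names mirror the ones queried by the alert rules in the next section, but your real instrumentation may look different.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metric names match the expressions in backend-alerts.yaml
// (app_http_requests_total, app_http_request_duration_seconds).
var (
	httpRequests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "app_http_requests_total",
		Help: "Total HTTP requests, labeled by route and status code.",
	}, []string{"route", "status"})

	httpDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "app_http_request_duration_seconds",
		Help:    "HTTP request latency in seconds.",
		Buckets: prometheus.DefBuckets,
	}, []string{"route"})
)

// handleHealth is an illustrative handler that records both metrics.
func handleHealth(w http.ResponseWriter, r *http.Request) {
	timer := prometheus.NewTimer(httpDuration.WithLabelValues("/healthz"))
	defer timer.ObserveDuration()

	httpRequests.WithLabelValues("/healthz", "200").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthz", handleHealth)
	// /metrics is the path the ServiceMonitor above scrapes on the named http port.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}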

4. The Guardrail: Trace Correlation (The trace_id)

The most important guardrail in observability is ensuring that your logs and traces share the same Trace ID. This allows you to go from a slow trace directly to the relevant log entries, saving minutes or hours during an investigation.
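
For concreteness, here is one way to attach the active trace ID to structured log output. This is a minimal sketch, assuming the backend is a Go service instrumented with OpenTelemetry and logging through log/slog; the logWithTrace helper and the trace_id field name are illustrative, not taken from the course repo.

package main

import (
	"context"
	"log/slog"
	"os"

	"go.opentelemetry.io/otel/trace"
)

// logWithTrace emits a structured log line that carries the current
// trace_id, so a slow or failing trace found in Uptrace can be matched
// directly to the backend log entries for that request.
func logWithTrace(ctx context.Context, logger *slog.Logger, msg string, args ...any) {
	if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
		args = append(args, slog.String("trace_id", sc.TraceID().String()))
	}
	logger.InfoContext(ctx, msg, args...)
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// In a real HTTP handler, ctx comes from the incoming request and
	// already carries the span created by the tracing middleware; here
	// a bare context is used only to keep the sketch self-contained.
	ctx := context.Background()
	logWithTrace(ctx, logger, "payment processed", slog.Int("status", 200))
}

Whatever logging library you use, the guardrail is the same: the field must be written on every request-scoped log line and preserved by your log pipeline, so that searching for a trace_id returns exactly the entries for that request.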

Backend alert rules

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backend-alerts
  namespace: observability
  labels:
    prometheus: kube-prometheus-stack
    role: alert-rules
spec:
  groups:
    - name: backend.rules
      interval: 30s
      rules:
        # High error rate alert
        - alert: BackendHighErrorRate
          expr: |
            (
              sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
              / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
            ) > 0.05
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high error rate"
            description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Critical error rate alert
        - alert: BackendCriticalErrorRate
          expr: |
            (
              sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
              / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
            ) > 0.10
          for: 2m
          labels:
            severity: critical
            component: backend
          annotations:
            summary: "Backend service has critical error rate"
            description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 10%)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # High latency alert (p95)
        - alert: BackendHighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(app_http_request_duration_seconds_bucket{job="backend"}[5m])) by (le)
            ) > 1
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high latency"
            description: "Backend p95 latency is {{ $value }}s (threshold: 1s)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Service down alert
        - alert: BackendServiceDown
          expr: up{job="backend"} == 0
          for: 1m
          labels:
            severity: critical
            component: backend
          annotations:
            summary: "Backend service is down"
            description: "Backend service {{ $labels.instance }} is down"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # High memory usage
        - alert: BackendHighMemoryUsage
          expr: |
            (
              process_resident_memory_bytes{job="backend"}
              /
              1024 / 1024 / 1024
            ) > 0.8
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high memory usage"
            description: "Backend memory usage is {{ $value }}GB (threshold: 0.8GB)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Too many goroutines
        - alert: BackendHighGoroutines
          expr: go_goroutines{job="backend"} > 10000
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend has too many goroutines"
            description: "Backend has {{ $value }} goroutines (threshold: 10000)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Pod restarts
        - alert: BackendPodRestarting
          expr: |
            rate(kube_pod_container_status_restarts_total{
              namespace=~"develop|staging|production",
              pod=~"backend-.*"
            }[15m]) > 0
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend pod is restarting frequently"
            description: "Backend pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

5. Verification: Did I Get It?

Run an end-to-end “Signal Drill”:

  1. Open Grafana to detect a latency spike.
  2. Open Uptrace and find the failing trace.
  3. Search your Backend Logs for that specific trace_id.

Detailed Lessons

Hands-On Materials

Labs, quizzes, and runbooks, available to course members.

  • Lab: Baseline Observability with Uptrace
  • Quiz: Chapter 10 (Observability)
  • Runbook: Incident Debug (Metrics -> Traces -> Logs)
  • SLI/SLO Spec: Chapter 10 Baseline

The Incident: Intermittent 5xx

Result: Time is lost because responders cannot see the causal path across service boundaries.

Observed Symptoms (what the team sees first): Metrics clearly show a user-facing problem (latency/error spikes). Logs contain …

Investigation & Containment

Safe investigation sequence:

  1. Detect Symptom: Start from the metric symptom (latency or error spike).
  2. Pivot to Traces: Use traces to isolate the exact failing path.
  3. Correlate Logs: Search logs for the trace_id from the …

Workflow & Operating Model

Backend ServiceMonitor: the scrape contract from section 3 (port: http, path: /metrics, interval: …

Lab & Completion

  • Metrics Symptom: Latency or error-rate spike in Grafana.
  • Trace Path: Showing the failing route and span chain in Uptrace.
  • Log Evidence: Matching backend log with the correct trace_id.
  • Success Condition: All three …