Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • backend-alerts.yaml
  • servicemonitor.yaml

What You Will Produce

A reproducible lab result plus quiz verification and incident-safe operating evidence.

Chapter 10: Observability (Metrics, Logs, Traces)

Learning Objectives

By the end of this chapter, you will be able to:

  • Follow the metrics-to-traces-to-logs investigation path during an incident
  • Correlate signals across service boundaries using trace_id
  • Configure ServiceMonitor for automatic Prometheus discovery
  • Distinguish symptom detection from root-cause analysis

Start with the video for the concept overview, then work through each lesson section.

Monitoring tells you if something is wrong. Observability tells you why it is wrong. In this chapter, we implement the three pillars of observability to move from guesswork to evidence-based incident response.


1. The Problem: The “Intermittent 5xx” Mystery

Users report slow responses and random errors. Dashboards show high latency, but logs are a wall of noise. You jump between Pods blindly, hoping a restart fixes it. You waste hours because your signals aren’t connected by a single causal path.

2. The Concept: The Three Pillars

We use three correlated signals to triangulate any issue:

  1. Metrics (The Symptom): High-level indicators (Rate, Errors, Latency). They answer: “Is there a problem?”
  2. Traces (The Path): A map of a single request traveling through services. They answer: “Where is it failing?”
  3. Logs (The Evidence): Detailed text records of a specific point in time. They answer: “What exactly happened?”

3. The Code: ServiceMonitor & Alerts

Our sre/ repo uses the Prometheus Operator to pull metrics from our applications. The ServiceMonitor is our contract for symptom discovery.

Backend ServiceMonitor

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend
  labels:
    app: backend
spec:
  selector:
    matchLabels:
      app: backend          # Selects the backend Service by label
  endpoints:
    - port: http            # Named port on the Service (not a numeric container port)
      path: /metrics        # Metrics endpoint exposed by the application
      interval: 30s         # Scrape every 30 seconds
      scrapeTimeout: 10s    # Abort any scrape that takes longer than 10 seconds
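
The other half of this contract is the application itself: the backend has to expose a /metrics endpoint on the named http port. The sketch below is a minimal, hypothetical example, assuming the backend is a Go service instrumented with prometheus/client_golang; the metric names mirror the ones queried by the alert rules in the next section, but your real instrumentation may look different.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metric names match the expressions in backend-alerts.yaml
// (app_http_requests_total, app_http_request_duration_seconds).
var (
	httpRequests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "app_http_requests_total",
		Help: "Total HTTP requests, labeled by route and status code.",
	}, []string{"route", "status"})

	httpDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "app_http_request_duration_seconds",
		Help:    "HTTP request latency in seconds.",
		Buckets: prometheus.DefBuckets,
	}, []string{"route"})
)

// handleHealth is an illustrative handler that records both metrics.
func handleHealth(w http.ResponseWriter, r *http.Request) {
	timer := prometheus.NewTimer(httpDuration.WithLabelValues("/healthz"))
	defer timer.ObserveDuration()

	httpRequests.WithLabelValues("/healthz", "200").Inc()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/healthz", handleHealth)
	// /metrics is the path the ServiceMonitor above scrapes on the named http port.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}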

4. The Guardrail: Trace Correlation (The trace_id)

The most important guardrail in observability is ensuring that your logs and traces share the same Trace ID. This allows you to go from a slow trace directly to the relevant log entries, saving minutes or hours during an investigation.
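
For concreteness, here is one way to attach the active trace ID to structured log output. This is a minimal sketch, assuming the backend is a Go service instrumented with OpenTelemetry and logging through log/slog; the logWithTrace helper and the trace_id field name are illustrative, not taken from the course repo.

package main

import (
	"context"
	"log/slog"
	"os"

	"go.opentelemetry.io/otel/trace"
)

// logWithTrace emits a structured log line that carries the current
// trace_id, so a slow or failing trace found in Uptrace can be matched
// directly to the backend log entries for that request.
func logWithTrace(ctx context.Context, logger *slog.Logger, msg string, args ...any) {
	if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
		args = append(args, slog.String("trace_id", sc.TraceID().String()))
	}
	logger.InfoContext(ctx, msg, args...)
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// In a real HTTP handler, ctx comes from the incoming request and
	// already carries the span created by the tracing middleware; here
	// a bare context is used only to keep the sketch self-contained.
	ctx := context.Background()
	logWithTrace(ctx, logger, "payment processed", slog.Int("status", 200))
}

Whatever logging library you use, the guardrail is the same: the field must be written on every request-scoped log line and preserved by your log pipeline, so that searching for a trace_id returns exactly the entries for that request.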

Backend alert rules

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backend-alerts
  namespace: observability
  labels:
    prometheus: kube-prometheus-stack
    role: alert-rules
spec:
  groups:
    - name: backend.rules
      interval: 30s
      rules:
        # High error rate alert
        - alert: BackendHighErrorRate
          expr: |
            (
              sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
              / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
            ) > 0.05
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high error rate"
            description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Critical error rate alert
        - alert: BackendCriticalErrorRate
          expr: |
            (
              sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
              / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
            ) > 0.10
          for: 2m
          labels:
            severity: critical
            component: backend
          annotations:
            summary: "Backend service has critical error rate"
            description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 10%)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # High latency alert (p95)
        - alert: BackendHighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(app_http_request_duration_seconds_bucket{job="backend"}[5m])) by (le)
            ) > 1
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high latency"
            description: "Backend p95 latency is {{ $value }}s (threshold: 1s)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Service down alert
        - alert: BackendServiceDown
          expr: up{job="backend"} == 0
          for: 1m
          labels:
            severity: critical
            component: backend
          annotations:
            summary: "Backend service is down"
            description: "Backend service {{ $labels.instance }} is down"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # High memory usage
        - alert: BackendHighMemoryUsage
          expr: |
            (
              process_resident_memory_bytes{job="backend"}
              /
              1024 / 1024 / 1024
            ) > 0.8
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high memory usage"
            description: "Backend memory usage is {{ $value }}GB (threshold: 0.8GB)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Too many goroutines
        - alert: BackendHighGoroutines
          expr: go_goroutines{job="backend"} > 10000
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend has too many goroutines"
            description: "Backend has {{ $value }} goroutines (threshold: 10000)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

        # Pod restarts
        - alert: BackendPodRestarting
          expr: |
            rate(kube_pod_container_status_restarts_total{
              namespace=~"develop|staging|production",
              pod=~"backend-.*"
            }[15m]) > 0
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend pod is restarting frequently"
            description: "Backend pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"

5. Verification: Did I Get It?

Run an end-to-end “Signal Drill”:

  1. Open Grafana to detect a latency spike.
  2. Open Uptrace and find the failing trace.
  3. Search your Backend Logs for that specific trace_id.

Detailed Lessons

Hands-On Materials

Labs, quizzes, and runbooks, available to course members.

  • Lab: Baseline Observability with Uptrace
  • Quiz: Chapter 10 (Observability)
  • Runbook: Incident Debug (Metrics -> Traces -> Logs)
  • SLI/SLO Spec: Chapter 10 Baseline

The Incident: Intermittent 5xx

Result: Time is lost because responders cannot see the causal path across service boundaries.

Observed Symptoms (what the team sees first): Metrics clearly show a user-facing problem (latency/error spikes). Logs contain …

Investigation & Containment

Safe investigation sequence:

  1. Detect Symptom: Start from the metric symptom (latency or error spike).
  2. Pivot to Traces: Use traces to isolate the exact failing path.
  3. Correlate Logs: Search logs for the trace_id from the …

Workflow & Operating Model

Backend ServiceMonitor: the scrape contract from section 3 (port: http, path: /metrics, interval: …

Lab & Completion

  • Metrics Symptom: Latency or error-rate spike in Grafana.
  • Trace Path: Showing the failing route and span chain in Uptrace.
  • Log Evidence: Matching backend log with the correct trace_id.
  • Success Condition: All three …