Chapter 10 — Observability (Metrics, Logs, Traces) (Part 1)
By the end of this chapter, you will be able to:

- Scrape application metrics with a ServiceMonitor and the Prometheus Operator.
- Define alert rules as a PrometheusRule covering error rate, latency, availability, and resource saturation.
- Correlate logs and traces through a shared trace_id.
- Run an end-to-end Signal Drill that connects a metric symptom to a trace path and log evidence.

Start with the video for the concept overview, then work through each lesson section.
Monitoring tells you if something is wrong. Observability tells you why it is wrong. In this chapter, we implement the three pillars of observability to move from guesswork to evidence-based incident response.
Users report slow responses and random errors. Dashboards show high latency, but logs are a wall of noise. You jump between Pods blindly, hoping a restart fixes it. You waste hours because your signals aren’t connected by a single causal path.
We use three correlated signals to triangulate any issue:

- Metrics: the symptom — a latency or error-rate spike tells you something is wrong.
- Traces: the path — the failing route and span chain tell you where it is wrong.
- Logs: the evidence — the matching log lines, keyed by trace_id, tell you why.
Our sre/ repo uses the Prometheus Operator to pull metrics from our applications. The ServiceMonitor is our contract for symptom discovery.
Backend ServiceMonitor

```yaml
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend
  labels:
    app: backend
spec:
  selector:
    matchLabels:
      app: backend
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
```
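For context, here is a minimal sketch of how the backend might expose the metrics that this scrape picks up, assuming a Go service using prometheus/client_golang. The metric names app_http_requests_total and app_http_request_duration_seconds match the alert rules later in this chapter, but the handler and middleware shown here are illustrative, not the course repo's actual code:

```go
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Request counter; the alert rules filter this by status=~"5..".
	httpRequests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "app_http_requests_total",
		Help: "Total HTTP requests handled by the backend.",
	}, []string{"status", "path"})

	// Latency histogram; the p95 alert reads the _bucket series it produces.
	httpDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "app_http_request_duration_seconds",
		Help:    "HTTP request duration in seconds.",
		Buckets: prometheus.DefBuckets,
	}, []string{"path"})
)

// instrument records a count and a duration for each handler call.
// (A production middleware would capture the real response status code;
// 200 is hard-coded here to keep the sketch short.)
func instrument(path string, next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next(w, r)
		httpDuration.WithLabelValues(path).Observe(time.Since(start).Seconds())
		httpRequests.WithLabelValues(strconv.Itoa(http.StatusOK), path).Inc()
	}
}

func main() {
	http.HandleFunc("/healthz", instrument("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	// The ServiceMonitor above scrapes this endpoint via the Service's "http" port.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```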
The most important guardrail in observability is ensuring that your logs and traces share the same trace_id. This allows you to go from a slow trace directly to the relevant log entries, saving minutes or hours during an investigation.
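As one way to enforce that guardrail, here is a minimal sketch assuming a Go backend instrumented with OpenTelemetry and the standard-library slog logger; the traceLogger helper is ours for illustration, not the repo's:

```go
package main

import (
	"context"
	"log/slog"
	"net/http"
	"os"

	"go.opentelemetry.io/otel/trace"
)

// traceLogger returns a logger that stamps every line with the current
// trace_id, so a slow trace can be matched to its log entries directly.
func traceLogger(ctx context.Context, base *slog.Logger) *slog.Logger {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		return base // no active span: log without correlation fields
	}
	return base.With(
		"trace_id", sc.TraceID().String(),
		"span_id", sc.SpanID().String(),
	)
}

func handler(w http.ResponseWriter, r *http.Request) {
	base := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	log := traceLogger(r.Context(), base)

	// This line now carries the same trace_id the tracing backend shows,
	// so "search logs for trace_id=<...>" works during an incident.
	log.Info("handling request", "path", r.URL.Path)
	w.WriteHeader(http.StatusOK)
}

func main() {
	// In a real service the handler would be wrapped (for example with
	// otelhttp) so a span, and therefore a trace_id, exists in the context.
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```

With logs structured this way, searching the log store for the trace_id of a slow trace returns exactly the lines emitted on that request path.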
Backend alert rules

```yaml
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backend-alerts
  namespace: observability
  labels:
    prometheus: kube-prometheus-stack
    role: alert-rules
spec:
  groups:
    - name: backend.rules
      interval: 30s
      rules:
        # High error rate alert
        - alert: BackendHighErrorRate
          expr: |
            (
              sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
              / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
            ) > 0.05
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high error rate"
            description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
        # Critical error rate alert
        - alert: BackendCriticalErrorRate
          expr: |
            (
              sum(rate(app_http_requests_total{job="backend",status=~"5.."}[5m]))
              / clamp_min(sum(rate(app_http_requests_total{job="backend"}[5m])), 1e-9)
            ) > 0.10
          for: 2m
          labels:
            severity: critical
            component: backend
          annotations:
            summary: "Backend service has critical error rate"
            description: "Backend error rate is {{ $value | humanizePercentage }} (threshold: 10%)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
        # High latency alert (p95)
        - alert: BackendHighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(app_http_request_duration_seconds_bucket{job="backend"}[5m])) by (le)
            ) > 1
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high latency"
            description: "Backend p95 latency is {{ $value }}s (threshold: 1s)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
        # Service down alert
        - alert: BackendServiceDown
          expr: up{job="backend"} == 0
          for: 1m
          labels:
            severity: critical
            component: backend
          annotations:
            summary: "Backend service is down"
            description: "Backend service {{ $labels.instance }} is down"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
        # High memory usage
        - alert: BackendHighMemoryUsage
          expr: |
            (
              process_resident_memory_bytes{job="backend"}
              /
              1024 / 1024 / 1024
            ) > 0.8
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend service has high memory usage"
            description: "Backend memory usage is {{ $value }}GB (threshold: 0.8GB)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
        # Too many goroutines
        - alert: BackendHighGoroutines
          expr: go_goroutines{job="backend"} > 10000
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend has too many goroutines"
            description: "Backend has {{ $value }} goroutines (threshold: 10000)"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
        # Pod restarts
        - alert: BackendPodRestarting
          expr: |
            rate(kube_pod_container_status_restarts_total{
              namespace=~"develop|staging|production",
              pod=~"backend-.*"
            }[15m]) > 0
          for: 5m
          labels:
            severity: warning
            component: backend
          annotations:
            summary: "Backend pod is restarting frequently"
            description: "Backend pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting"
            runbook: "docs/course/chapter-09-observability/runbook-incident-debug.md"
```
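Note the clamp_min(..., 1e-9) in both error-rate expressions: it keeps the denominator from ever being zero when the backend receives no traffic, so the ratio evaluates to roughly 0 instead of NaN and the alert stays quiet. As a worked example, 2 failing requests per second out of 30 total is an error rate of about 6.7%, which crosses the 5% warning threshold but not the 10% critical one.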
Run an end-to-end "Signal Drill" using the safe investigation sequence:

1. Detect Symptom: Start from the metric symptom (latency or error spike).
2. Pivot to Traces: Use traces to isolate the exact failing path.
3. Correlate Logs: Search logs for the trace_id from the failing trace.

The drill exists because time is lost in incidents when responders cannot see the causal path across service boundaries; practicing the sequence proves your three signals are actually connected before a real incident forces the issue.
Evidence to capture:

- Metrics Symptom: Latency or error-rate spike in Grafana.
- Trace Path: The failing route and span chain in Uptrace.
- Log Evidence: Matching backend log with the correct trace_id.
- Success Condition: All three signals point to the same request.
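To give yourself a symptom to drill against in a lab cluster, one option (an illustrative sketch, not part of the course repo) is a temporary endpoint that fails a fraction of requests, pushing the error rate past the 5% warning threshold so the alert and the full metric-to-trace-to-log path can be exercised:

```go
package main

import (
	"math/rand"
	"net/http"
)

// flakyHandler fails roughly 10% of requests with a 500, enough to trip
// BackendHighErrorRate (5%) and, at times, the 10% critical alert.
func flakyHandler(w http.ResponseWriter, r *http.Request) {
	if rand.Float64() < 0.10 {
		http.Error(w, "injected failure for signal drill", http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("ok"))
}

func main() {
	// Deploy this only to a lab environment; generate load against /drill
	// and walk the metric -> trace -> log path while the alert is firing.
	http.HandleFunc("/drill", flakyHandler)
	http.ListenAndServe(":8080", nil)
}
```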