Advanced Track: Do this after finishing Chapters 01-14.

Estimated Time

  • Reading: 30-40 min
  • Lab: 60-90 min
  • Quiz: 15-20 min

Prerequisites

  • Core track (Chapters 01-14) completed.
  • GitOps promotion and observability workflows available.

Source Code References

  • canary.example.yaml Members

What You Will Produce

A go/no-go evidence package: rollout results, remediation notes, and explicit rollback conditions.

Module: Progressive Delivery

Learning Objectives

By the end of this chapter, you will be able to:

  • Configure Flagger canary analysis with weighted rollout progression
  • Define Prometheus-driven abort criteria for canary deployments
  • Execute controlled traffic shifting via Traefik ingress-level control
  • Analyze canary metrics to make informed promotion or rollback decisions

Start with the video for the concept overview, then work through each lesson section.

A deployment reaches 100% of production traffic instantly. A hidden bug causes a global outage. In this module, we implement Progressive Delivery using Flagger to move away from high-risk “all-or-nothing” deployments toward automated, metric-driven canary releases.


1. The Problem: The “Big-Bang” Failure

Traditional “all-at-once” deployments have a 100% blast radius. If a bug reaches production, every single user is affected simultaneously. Manual rollbacks are slow and error-prone, leading to extended downtime and a high-stress environment for responders.

2. The Concept: Metric-Driven Canaries

We use Flagger to shift traffic incrementally while automatically analyzing system health.

  1. Initial Shift: Route a tiny fraction of traffic (e.g., 5%) to the new version.
  2. Analysis: The system checks real-time metrics (latency, error rate) from Prometheus.
  3. Automated Promotion: If healthy, traffic weight increases step-by-step.
  4. Automated Rollback: If metrics degrade past the configured thresholds, Flagger halts the rollout and routes all traffic back to the stable version, so only the small share of users already on the canary is exposed.

3. The Code: The Canary Object

Our sre/ repo contains the Canary objects that drive our release process. Each object specifies the analysis interval, the traffic steps, and the health thresholds for one workload.
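
The course's canary.example.yaml is not reproduced on this page, so the following is only a minimal sketch of what such an object might look like, based on Flagger's `flagger.app/v1beta1` Canary resource. The name `backend` and the namespace `develop` come from the verification commands in this module. The Deployment reference, service port, interval, failure threshold, and step list are assumptions for illustration, and the two built-in metric checks mirror the 1% error-rate and 500 ms latency limits used as guardrails in this module.

```yaml
# Hypothetical sketch, not the course's canary.example.yaml.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: backend        # from the kubectl commands in this module
  namespace: develop   # from the kubectl commands in this module
spec:
  provider: traefik    # traffic is shifted at the Traefik ingress level
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend      # assumed Deployment name
  service:
    port: 80           # assumed service port
  analysis:
    interval: 1m       # assumed: run the checks once per minute
    threshold: 5       # assumed: abort after 5 failed checks
    stepWeights: [5, 10, 20]   # assumed steps: percent of traffic on the canary
    metrics:
      - name: request-success-rate   # Flagger built-in metric (percent)
        interval: 1m
        thresholdRange:
          min: 99      # check fails if more than 1% of requests error
      - name: request-duration       # Flagger built-in metric (milliseconds)
        interval: 1m
        thresholdRange:
          max: 500     # check fails if latency exceeds 500 ms
```

With a step list like this one, the canary weight would move through 5, 10, and 20 as each analysis interval passes. The real file may define further steps or different thresholds.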

4. The Guardrail: Automated Health Checks

We never “guess” whether a release is safe. PromQL queries enforce our production invariants at every stage of the traffic shift: if the canary’s error rate exceeds 1% or its latency exceeds 500 ms, the release is aborted automatically.
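
One way such a PromQL guardrail could be expressed is a Flagger MetricTemplate, sketched below for the error-rate limit. This is an illustration under stated assumptions, not the course's configuration: the template name, the Prometheus address, and the `service` label pattern are hypothetical and depend on how Traefik and Prometheus are set up in your cluster. `traefik_service_requests_total` is the request counter Traefik exposes to Prometheus.

```yaml
# Hypothetical sketch of a custom error-rate check.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: canary-error-rate    # hypothetical name
  namespace: develop
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090   # assumed Prometheus address
  # Percentage of canary requests answered with a 5xx status.
  # {{ namespace }}, {{ target }} and {{ interval }} are filled in by Flagger.
  # The service label pattern is an assumption about Traefik's naming.
  query: |
    100 - sum(
      rate(
        traefik_service_requests_total{
          service=~"{{ namespace }}-{{ target }}-canary-.*",
          code!~"5.."
        }[{{ interval }}]
      )
    )
    /
    sum(
      rate(
        traefik_service_requests_total{
          service=~"{{ namespace }}-{{ target }}-canary-.*"
        }[{{ interval }}]
      )
    ) * 100
```

A Canary would reference this template from `analysis.metrics` with a `templateRef` (name and namespace) and a `thresholdRange` of `max: 1`, so that any analysis interval with more than 1% errors counts as a failed check.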

5. Verification: Did I Get It?

Verify your canary status and observe a traffic shift in real time:

# Watch the canary analysis progress (Ctrl+C to stop)
kubectl get canaries -n develop -w
# After triggering a deployment, read the current canary traffic weight
kubectl get canary backend -n develop -o jsonpath='{.status.canaryWeight}'

Expected Output: You should see the weight increase incrementally (5, 10, 20…) until the promotion is complete or a rollback is triggered.


Detailed Lessons

Hands-On Materials

Labs, quizzes, and runbooks — available to course members.

  • Lab: Canary Rollout with Traefik + Flagger (Advanced) Members
  • Progressive Delivery Scorecard (Template) Members
  • Quiz: Advanced Module (Progressive Delivery with Traefik + Flagger) Members
  • Runbook: Progressive Delivery Operations (Advanced) Members

The Incident: The Big-Bang Failure

Result: The failure blast radius is 100% of your users because the deployment was an “all-or-nothing” event.

Observed Symptoms

What the team sees first: Error rates spike across all users immediately after …

Investigation & Containment

Safe investigation sequence:

  1. Verify Traffic Split: Check the current traffic distribution between the stable and canary versions.
  2. Monitor Canary Metrics: Analyze the latency and error rates specifically for the canary …

Workflow & Components

  • Canary Object: The manifest that defines the traffic shifting rules and health metrics.
  • Analysis: Defines the PromQL queries used to validate the canary pods.
  • Primary Service: The stable version of your application. …

Lab & Completion

Done When

You have completed this module when:

  • You can explain the difference between a “Big-Bang” and “Progressive” deployment.
  • You have successfully executed a canary deployment with at least 3 …