Advanced Module: Linkerd + Progressive Delivery (Canary / A-B)

Why This Module Exists

Safe delivery is not only “deploy or rollback”. This module adds service-mesh-driven progressive rollout guardrails:

  • Linkerd mTLS by default
  • canary rollout with measurable abort criteria
  • A/B routing with explicit experiment boundaries

The Incident Hook

A full rollout passes smoke checks but fails under the real production traffic mix. Error rate and latency spike after the deploy, and rollback starts late because detection is manual. The team needs controlled traffic progression with automatic safety checks.

What AI Would Propose (Brave Junior)

  • “Ship 100% now; we can roll back if needed.”
  • “Canary is too slow for this fix.”
  • “Use ad-hoc routing rules without SLO checks.”

Why this sounds reasonable:

  • fastest short-term path
  • fewer moving parts in one deploy

Why This Is Dangerous

  • blast radius is immediate and broad
  • no objective stop conditions during rollout
  • A/B test drift can hide impact in one segment

Guardrails That Stop It

  • traffic progression in controlled steps (for example 5% -> 25% -> 50% -> 100%)
  • abort on SLO violation (error rate, latency, success rate)
  • mTLS identity and policy checks before rollout
  • rollback path tested before canary start
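
The stepped progression and abort rule above can be sketched in a few lines. This is an illustrative model only, not how Flagger or any other controller is implemented: `STEPS`, `SloSample`, `within_slo`, `run_progression`, and the threshold values are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

# Hypothetical traffic weights (percent), mirroring the 5 -> 25 -> 50 -> 100 example.
STEPS = (5, 25, 50, 100)


@dataclass(frozen=True)
class SloSample:
    success_rate: float    # percent of successful requests in the window
    p99_latency_ms: float  # 99th percentile latency in the window


def within_slo(sample: SloSample,
               min_success_rate: float = 99.0,
               max_p99_latency_ms: float = 500.0) -> bool:
    """Return True when the sample meets both (assumed) thresholds."""
    return (sample.success_rate >= min_success_rate
            and sample.p99_latency_ms <= max_p99_latency_ms)


def run_progression(steps: Iterable[int],
                    observe: Callable[[int], SloSample]) -> Tuple[str, int]:
    """Advance through the weights and abort at the first SLO violation.

    Returns ("promoted", last_weight) or ("aborted", failing_weight).
    """
    last = 0
    for weight in steps:
        if not within_slo(observe(weight)):
            return ("aborted", weight)
        last = weight
    return ("promoted", last)
```

In a real rollout `observe` would query the metrics backend for the canary at the given weight; here it is just a callable, so the stop condition can be exercised without a cluster.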

Module Scope

  1. Linkerd baseline (check, inject, identity, mTLS status).
  2. Canary rollout flow (Flagger + Linkerd or equivalent controller).
  3. A/B routing flow (header/cookie based).
  4. Evidence capture for rollout decision and postmortem.

Repository Mapping

  • flux/infrastructure/progressive-delivery/linkerd/
  • flux/infrastructure/progressive-delivery/flagger/
  • flux/infrastructure/progressive-delivery/develop/
  • flux/bootstrap/flux-system/infrastructure.yaml (Linkerd + Flagger enabled, develop canary pack opt-in)

Files

  • lab.md
  • runbook-linkerd-progressive-delivery.md
  • quiz.md

Done When

  • learner can run canary with automated abort criteria
  • learner can execute bounded A/B experiment with clear success metrics
  • learner can explain mesh value for rollout risk reduction

Lab: Linkerd Canary Rollout and A/B Routing (Advanced)

Goal

Run one progressive delivery exercise in develop:

  • validate Linkerd health and mTLS
  • execute canary rollout with automated analysis
  • run one A/B route experiment and review outcomes

Prerequisites

  • Linkerd control plane installed and healthy
  • rollout controller installed (Flagger recommended)
  • test workload and service available in develop
  • baseline SLO signals available (Prometheus metrics)
  • progressive-delivery manifests present in:
    • flux/infrastructure/progressive-delivery/linkerd/
    • flux/infrastructure/progressive-delivery/flagger/
    • flux/infrastructure/progressive-delivery/develop/

Quick checks:

linkerd check
kubectl -n linkerd get pods
kubectl -n develop get deploy,svc

Step 1: Verify Mesh Baseline

Confirm workload is meshed and identities are present:

Quiz: Advanced Module (Linkerd + Progressive Delivery)

Questions

  1. Why is progressive rollout safer than immediate 100% rollout?

  2. What is the main value of Linkerd in canary operations?

  3. Which signal is mandatory for automated canary abort decisions?

  4. Why should A/B routing be time-bounded?

  5. Which statement is correct?

  • A) Canary without abort criteria is acceptable in production.
  • B) Mesh telemetry can provide per-route success/latency for rollout decisions.
  • C) A/B rules should stay permanently after experiment end.

  6. Give one valid canary traffic progression pattern.

    Runbook: Linkerd Progressive Delivery Operations (Advanced)

    Purpose

    Operate canary and A/B rollouts with objective safety gates.

    Pre-Rollout Checklist

    1. Linkerd control plane healthy (linkerd check).
    2. Target workload meshed and observable.
    3. Abort thresholds defined and approved.
    4. Rollback action documented and tested.

    Canary Operation Flow

    1. Start canary at low traffic weight.
    2. Evaluate window metrics (success rate, latency, error rate).
    3. Promote only if all thresholds pass.
    4. Abort automatically or manually on threshold breach.
    5. Record decision with evidence.
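
One way to make steps 2 to 5 auditable is to evaluate each window against the approved thresholds and emit the result as a structured record. The sketch below assumes the three metrics named in step 2; `Thresholds`, `evaluate_window`, `decision_record`, and the field names are hypothetical and not part of Linkerd or Flagger.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict


@dataclass(frozen=True)
class Thresholds:
    min_success_rate: float    # percent
    max_p99_latency_ms: float  # milliseconds
    max_error_rate: float      # percent


def evaluate_window(metrics: Dict[str, float], t: Thresholds) -> Dict[str, bool]:
    """Check each metric against its threshold; one boolean per check."""
    return {
        "success_rate": metrics["success_rate"] >= t.min_success_rate,
        "p99_latency_ms": metrics["p99_latency_ms"] <= t.max_p99_latency_ms,
        "error_rate": metrics["error_rate"] <= t.max_error_rate,
    }


def decision_record(app: str, weight: int,
                    metrics: Dict[str, float], t: Thresholds) -> str:
    """Build a JSON evidence record; promote only if every check passes."""
    checks = evaluate_window(metrics, t)
    record = {
        "app": app,
        "weight": weight,
        "metrics": metrics,
        "thresholds": asdict(t),
        "checks": checks,
        "decision": "promote" if all(checks.values()) else "abort",
    }
    return json.dumps(record, sort_keys=True)
```

Storing one such record per analysis window gives the postmortem both the decision and the numbers that justified it.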

    A/B Operation Flow

    1. Define experiment hypothesis and metric target.
    2. Apply bounded route split (header/cookie/segment).
    3. Run for fixed window.
    4. Compare cohorts and decide keep/revert.
    5. Remove temporary routing rules after decision.
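
The bounded split in steps 2 and 3 amounts to two conditions: the request carries an explicit cohort marker, and the experiment window is still open. The sketch below models that decision in plain Python as an illustration. The header name, cookie name, and `choose_backend` are hypothetical, and in the mesh the same rule would be expressed as routing configuration, not application code.

```python
from datetime import datetime, timedelta, timezone
from typing import Mapping

# Hypothetical marker names; use whatever the experiment defines.
AB_HEADER = "x-experiment-cohort"
AB_COOKIE = "experiment_cohort"


def choose_backend(headers: Mapping[str, str],
                   cookies: Mapping[str, str],
                   now: datetime,
                   started: datetime,
                   window: timedelta) -> str:
    """Route to "b" only for explicitly marked requests inside the window.

    Header keys are assumed to be lower-cased by the caller.
    """
    if not (started <= now < started + window):
        return "a"  # experiment inactive: every request gets the default
    marker = headers.get(AB_HEADER) or cookies.get(AB_COOKIE)
    return "b" if marker == "b" else "a"


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(choose_backend({AB_HEADER: "b"}, {}, start, start, timedelta(hours=2)))
```

Because the default branch is always "a", a rule that outlives its window degrades to normal routing instead of silently keeping a cohort on the variant.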

    Commands (Examples)

    linkerd check
    # On Linkerd 2.10 and later, stat and routes are provided by the viz extension
    linkerd viz stat deploy -n develop
    linkerd viz routes deploy/<app-name> -n develop
    kubectl -n develop get canary
    kubectl -n develop describe canary <app-name>
    

    Failure Modes

    1. Metric noise causes false abort:
    • increase observation window; validate baseline first
    2. Canary stuck:
    • inspect controller events and policy spec; roll back if uncertain
    3. A/B drift:
    • ensure route selectors are explicit and temporary rules are removed
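
To see why a longer observation window reduces false aborts from metric noise, compare the same threshold applied to one sample and to the mean of several. This is a minimal sketch with made-up numbers; `breaches` is a hypothetical helper, and a real controller may aggregate differently.

```python
from statistics import mean
from typing import Sequence


def breaches(samples: Sequence[float], window: int, min_success_rate: float) -> bool:
    """True if the mean of the last `window` samples is below the threshold."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = samples[-window:]
    return mean(recent) < min_success_rate


# One noisy dip at the end of otherwise healthy success-rate samples (percent).
noisy = [99.9, 99.8, 99.9, 99.7, 97.0]
# A sustained degradation.
degraded = [99.9, 98.0, 97.5, 98.2, 97.0]

if __name__ == "__main__":
    print(breaches(noisy, 1, 99.0), breaches(noisy, 5, 99.0), breaches(degraded, 5, 99.0))
```

With a window of 1 the single dip in `noisy` trips the abort; with a window of 5 it does not, while the sustained drop in `degraded` still does. The trade-off is slower detection, which is why the baseline should be validated before widening the window.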