Advanced Track: Do this after finishing Chapters 01-14.

Estimated Time

  • Reading: 30-40 min
  • Lab: 60-90 min
  • Quiz: 15-20 min

Prerequisites

  • Core track (Chapters 01-14) completed.
  • GitOps promotion and observability workflows available.

Source Code References

  • cnpg-clusters/

What You Will Produce

A go/no-go evidence package: rollout results, remediation notes, and explicit rollback conditions.

Chapter 17: Rollback & Data Migrations

Learning Objectives

By the end of this chapter, you will be able to:

  • Apply the expand/contract migration model for safe schema changes
  • Use feature flags to maintain rollback capability during releases
  • Configure destructive migration approval gates with evidence requirements
  • Design a recovery plan for failed data migrations

Start with the video for the concept overview, then work through each lesson section.

A database migration runs, fails, and corrupts data. You roll back the application code, but the schema is already changed. In this final chapter, we tackle the Point-of-No-Return problem by implementing atomic code and data rollback procedures.


1. The Problem: The “Point-of-No-Return”

Database migrations are inherently risky because they change the persistent state of the system. If a migration fails halfway through, rolling back the application code through GitOps is not enough: the application still fails because the database schema is left in an inconsistent or incompatible state.

2. The Concept: Atomic Recovery

We treat stateful changes as a synchronized operation between code and data.

  1. Expand-Contract Pattern: We avoid breaking changes by separating “Adding” from “Removing” across two releases.
  2. Backup-Before-Migrate: We ensure a fresh backup or PITR checkpoint exists immediately before any schema change.
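  3. Point-in-Time Recovery (PITR): We use CloudNativePG to restore the database to a target timestamp just before the migration started. The restore bootstraps a new cluster from a base backup plus archived WAL; it does not rewind the existing cluster in place.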

3. The Code: PITR Recovery

Our sre/ repo uses CloudNativePG’s native recovery objects to automate stateful rollbacks. This allows us to “revert” the database state just as easily as we revert a Git commit.

CloudNativePG cluster baseline

  • flux/infrastructure/data/cnpg-clusters/develop/cluster.yaml
  • flux/infrastructure/data/cnpg-clusters/develop/kustomization.yaml
  • flux/infrastructure/data/cnpg-clusters/develop/scheduled-backup.yaml
  • flux/infrastructure/data/cnpg-clusters/production/cluster.yaml
  • flux/infrastructure/data/cnpg-clusters/production/kustomization.yaml
  • flux/infrastructure/data/cnpg-clusters/production/postgres-app-secret.yaml
  • flux/infrastructure/data/cnpg-clusters/production/scheduled-backup.yaml
  • flux/infrastructure/data/cnpg-clusters/staging/cluster.yaml
  • flux/infrastructure/data/cnpg-clusters/staging/kustomization.yaml
  • flux/infrastructure/data/cnpg-clusters/staging/scheduled-backup.yaml
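
To make the recovery object concrete, here is a hedged sketch that builds a PITR recovery manifest as a Python dictionary and prints it as JSON. It is not taken from the files listed above. The field names follow the CloudNativePG v1 Cluster API (bootstrap.recovery with a recoveryTarget.targetTime) as commonly documented, so check them against your operator version. The cluster names, namespace, bucket path, and function name are placeholders, and credentials for the object store are deliberately omitted.

```python
import json


def recovery_cluster_manifest(new_name, namespace, source_cluster, target_time,
                              instances=1, storage_size="1Gi"):
    """Build a Cluster resource that bootstraps from backups up to target_time."""
    return {
        "apiVersion": "postgresql.cnpg.io/v1",
        "kind": "Cluster",
        # Recovery creates a NEW cluster; it does not modify the source cluster.
        "metadata": {"name": new_name, "namespace": namespace},
        "spec": {
            "instances": instances,
            "storage": {"size": storage_size},
            "bootstrap": {
                "recovery": {
                    # Must match an entry in externalClusters below.
                    "source": source_cluster,
                    # RFC 3339 timestamp captured before the migration started.
                    "recoveryTarget": {"targetTime": target_time},
                }
            },
            "externalClusters": [
                {
                    "name": source_cluster,
                    "barmanObjectStore": {
                        # Placeholder path; credentials and endpoint are omitted.
                        "destinationPath": "s3://example-bucket/backups",
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = recovery_cluster_manifest(
        "app-db-restore", "develop", "app-db", "2024-01-01T00:00:00Z"
    )
    print(json.dumps(manifest, indent=2))
```

Generating the manifest from a captured timestamp keeps the restore target explicit and reviewable before anything is applied to the cluster.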

4. The Guardrail: Pre-Migration Checkpoint

We never run a migration blindly. A guardrail requires a manual or automated “snapshot” of the database before any DDL statement is executed, so that every migration, however complex, starts from a known restore point.
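
One way such a guardrail could be expressed is as a small gate that refuses to proceed unless a recent backup exists. This is a sketch under assumptions, not the course's actual gate: it assumes the timestamp comes from the cluster's status.lastSuccessfulBackup field, and the 15-minute threshold and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_rfc3339(ts):
    # Accept the trailing "Z" that Kubernetes status fields use for UTC.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def migration_allowed(last_successful_backup, now, max_age=timedelta(minutes=15)):
    """Return True only if a backup finished recently enough to act as the checkpoint."""
    if not last_successful_backup:
        # No backup recorded at all: block the migration.
        return False
    age = now - parse_rfc3339(last_successful_backup)
    # Reject stale backups and timestamps that lie in the future.
    return timedelta(0) <= age <= max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(migration_allowed("2024-01-01T11:50:00Z", now))
```

A pipeline step could call migration_allowed() and exit non-zero on False, which keeps the DDL job from starting without a checkpoint.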

5. Verification: Did I Get It?

Verify your database health and capture a pre-migration checkpoint:

# Check the PITR status reported by CloudNativePG
kubectl get cluster -n develop <cluster-name> -o yaml | grep -E "firstRecoverabilityPoint|lastSuccessfulBackup"
# Capture the current UTC timestamp for a potential restore
date -u +"%Y-%m-%dT%H:%M:%SZ"

Expected Output: You should have a valid backup history and a clear timestamp to use for a PITR restore if the migration fails.
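
The same check can be scripted so that the captured timestamp is validated against the recoverable window instead of being read by eye. The sketch below assumes the cluster was exported with -o json and that the status carries a firstRecoverabilityPoint field, as CloudNativePG clusters with backups configured normally do; the function name is hypothetical.

```python
import json
from datetime import datetime


def _parse(ts):
    # Accept the trailing "Z" used for UTC in Kubernetes timestamps.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def check_restore_target(cluster_json, target):
    """Return (ok, reason) for a planned PITR target timestamp."""
    status = json.loads(cluster_json).get("status", {})
    first = status.get("firstRecoverabilityPoint")
    if not first:
        return False, "no recoverability point recorded; PITR is not available"
    if _parse(target) < _parse(first):
        return False, f"target {target} is older than first recoverability point {first}"
    return True, f"target {target} is inside the recoverable window starting at {first}"


if __name__ == "__main__":
    sample = json.dumps({"status": {"firstRecoverabilityPoint": "2024-01-01T00:00:00Z"}})
    print(check_restore_target(sample, "2024-01-02T08:30:00Z"))
```

The returned reason string can go straight into the go/no-go evidence package alongside the timestamp itself.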


Detailed Lessons

Hands-On Materials

Labs, quizzes, and runbooks, available to course members.

  • Lab: Rollback-Safe Migration Drill (Advanced)
  • Quiz: Chapter 17 (Rollback and Data Migrations)
  • Rollback & Data Migrations Scorecard (Template)
  • Runbook: Rollback and Migration Operations (Advanced)

The Incident: Point-of-No-Return

Result: You’ve reached a “point-of-no-return” where a standard GitOps rollback is insufficient because the stateful layer is broken.

Observed Symptoms

What the team sees first: The application fails to …

Investigation & Containment

Safe investigation sequence:

  1. Verify Migration Status: Identify which migration step failed and what changes were partially applied.
  2. Check App Compatibility: Confirm if the current application version can function with …

Workflow & Safe Migrations

  • Expand: Add new columns or tables. Update code to write to both old and new locations.
  • Migrate: Move existing data from old to new structures.
  • Contract: Update code to read only from new locations. Once stable, remove …

Lab & Completion

Done When

You have completed this chapter when:

  • You can explain why code rollbacks are insufficient for stateful changes.
  • You have successfully executed an atomic code + data rollback.
  • You can demonstrate a Point-in-Time …