Chapter 17: Rollback & Data Migrations

Learning Objectives

By the end of this chapter, you will be able to:

Apply the expand/contract migration model for safe schema changes
Use feature flags to maintain rollback capability during releases
Configure destructive migration approval gates with evidence requirements
Design a recovery plan for failed data migrations

Start with the video for the concept overview, then work through each lesson section.

A database migration runs, fails, and corrupts data. You roll back the application code, but the schema has already changed, so the old code no longer matches the database. In this final chapter, we tackle the Point-of-No-Return problem by implementing rollback procedures that treat code and data as one unit.

1. The Problem: The “Point-of-No-Return”

Database migrations are inherently risky because they change the persistent state of the system. If a migration fails halfway through, a standard GitOps rollback of the application code is not enough—the application will still fail because the database schema is in an inconsistent or incompatible state.

2. The Concept: Atomic Recovery

We treat stateful changes as a synchronized operation between code and data.

Expand-Contract Pattern: We avoid breaking changes by separating “Adding” from “Removing” across two releases.
Backup-Before-Migrate: We ensure a fresh backup or PITR checkpoint exists immediately before any schema change.
Point-in-Time Recovery (PITR): We use CloudNativePG to restore the database to a chosen timestamp just before the migration started.
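
The expand/contract split above can be sketched as two migration files, one per release. This is a minimal illustration, not the course repository's actual migrations: the `users` table, the `full_name` and `display_name` columns, and the `migrations/` directory are all hypothetical.

```shell
#!/bin/sh
# Hypothetical scenario: rename users.full_name to users.display_name
# without ever shipping a release that breaks the previous one.
mkdir -p migrations

# Release N (expand): additive changes only. The old column stays, so
# rolling the application back to release N-1 still works.
cat > migrations/001_expand_add_display_name.sql <<'SQL'
ALTER TABLE users ADD COLUMN display_name text;
UPDATE users SET display_name = full_name WHERE display_name IS NULL;
SQL

# Release N+1 (contract): remove the old column only after no running
# code reads it. This step is destructive, so it ships on its own.
cat > migrations/002_contract_drop_full_name.sql <<'SQL'
ALTER TABLE users DROP COLUMN full_name;
SQL
```

Until the contract file is applied, a code rollback needs no database restore. Once it is applied, the backup or PITR checkpoint is the only path back.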

3. The Code: PITR Recovery

Our sre/ repo uses CloudNativePG’s native recovery objects to automate stateful rollbacks. This allows us to “revert” the database state just as easily as we revert a Git commit.

CloudNativePG cluster baseline

flux/infrastructure/data/cnpg-clusters/develop/cluster.yaml
flux/infrastructure/data/cnpg-clusters/develop/kustomization.yaml
flux/infrastructure/data/cnpg-clusters/develop/scheduled-backup.yaml
flux/infrastructure/data/cnpg-clusters/production/cluster.yaml
flux/infrastructure/data/cnpg-clusters/production/kustomization.yaml
flux/infrastructure/data/cnpg-clusters/production/postgres-app-secret.yaml
flux/infrastructure/data/cnpg-clusters/production/scheduled-backup.yaml
flux/infrastructure/data/cnpg-clusters/staging/cluster.yaml
flux/infrastructure/data/cnpg-clusters/staging/kustomization.yaml
flux/infrastructure/data/cnpg-clusters/staging/scheduled-backup.yaml
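
A point-in-time recovery with CloudNativePG is typically declared as a new Cluster whose bootstrap section replays the archived backup up to a target time. The sketch below writes such a manifest; it is not a copy of the files listed above. The cluster names, bucket path, secret name, storage size, and timestamp are placeholders, and it assumes backups were written to an S3-compatible object store through `barmanObjectStore`.

```shell
#!/bin/sh
# All names below are hypothetical placeholders, not values from the sre/ repo.
SOURCE_CLUSTER="app-db"               # cluster that ran the failed migration
RESTORE_TIME="2024-01-01T00:00:00Z"   # timestamp captured before the migration

cat > recovery-cluster.yaml <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ${SOURCE_CLUSTER}-restore
spec:
  instances: 1
  storage:
    size: 10Gi
  bootstrap:
    recovery:
      source: ${SOURCE_CLUSTER}
      recoveryTarget:
        targetTime: "${RESTORE_TIME}"
  externalClusters:
    - name: ${SOURCE_CLUSTER}
      barmanObjectStore:
        destinationPath: s3://example-bucket/cnpg
        s3Credentials:
          accessKeyId:
            name: backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: backup-creds
            key: ACCESS_SECRET_KEY
EOF

# Review the file, then apply it (for example through a Git commit that Flux reconciles):
# kubectl apply -n develop -f recovery-cluster.yaml
```

CloudNativePG recovers into a new cluster rather than rewinding the existing one, so the application must be pointed at the restored cluster afterwards.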

4. The Guardrail: Pre-Migration Checkpoint

We never run a migration blindly. We enforce a guardrail that requires a manual or automated “snapshot” of the database before any DDL statements are executed. This gives even the most complex migration a known restore point to fall back to.
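
One way to express this guardrail is a small gate that refuses to run DDL unless a fresh backup has finished. The sketch below is an assumption about how such a gate could look, not the course's implementation: `require_checkpoint` is a hypothetical helper, the backup name and migration script are placeholders, and it assumes the checkpoint is a CloudNativePG Backup resource whose `status.phase` reads `completed` on success.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the backup phase passed in is "completed".
require_checkpoint() {
  phase="$1"
  if [ "$phase" = "completed" ]; then
    echo "checkpoint ok: safe to run migration"
    return 0
  fi
  echo "checkpoint missing or unfinished (phase: ${phase:-unknown}); refusing to migrate" >&2
  return 1
}

# On a real cluster the phase could be read like this (names are placeholders):
#   phase="$(kubectl get backups.postgresql.cnpg.io pre-migration -n develop -o jsonpath='{.status.phase}')"
#   require_checkpoint "$phase" && ./run-migration.sh
```

Passing the phase in as an argument keeps the gate testable without a cluster; the `kubectl` lookup stays a thin wrapper around it.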

5. Verification: Did I Get It?

Verify your database health and capture a pre-migration checkpoint:

# Check current PITR status (earliest restorable point and most recent successful backup)
kubectl get cluster -n develop <cluster-name> -o yaml | grep -E "firstRecoverabilityPoint|lastSuccessfulBackup"
# Capture the current timestamp for a potential restore
date -u +"%Y-%m-%dT%H:%M:%SZ"

Expected Output: You should have a valid backup history and a clear timestamp to use for a PITR restore if the migration fails.
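
A captured timestamp is only useful if it falls inside the window the backups can actually restore. As a rough check, the hypothetical `in_recovery_window` helper below compares it with the cluster's first recoverability point. It assumes both values are UTC timestamps in the same `YYYY-MM-DDTHH:MM:SSZ` format, so plain string ordering matches time ordering.

```shell
#!/bin/sh
# Hypothetical helper: return 0 when the restore target is at or after the
# first recoverability point. Both arguments must share the same UTC format.
in_recovery_window() {
  first_point="$1"
  target="$2"
  # The earlier of the two timestamps sorts first in byte order.
  earliest="$(printf '%s\n%s\n' "$first_point" "$target" | LC_ALL=C sort | head -n 1)"
  [ "$earliest" = "$first_point" ]
}

# Example with a placeholder recoverability point and the current time:
target="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
if in_recovery_window "2024-01-01T00:00:00Z" "$target"; then
  echo "restore target $target is inside the recovery window"
fi
```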

Detailed Lessons

Hands-On Materials

The Incident: Point-of-No-Return

Investigation & Containment

Workflow & Safe Migrations

Lab & Completion