Chapter 17: Rollback & Data Migrations
Learning Objectives
By the end of this chapter, you will be able to:
- Apply the expand/contract migration model for safe schema changes
- Use feature flags to maintain rollback capability during releases
- Configure destructive migration approval gates with evidence requirements
- Design a recovery plan for failed data migrations
Start with the video for the concept overview, then work through each lesson section.
A database migration runs, fails, and corrupts data. You roll back the application code, but the schema is already changed. In this final chapter, we tackle the Point-of-No-Return problem by implementing atomic code and data rollback procedures.
1. The Problem: The “Point-of-No-Return”
Database migrations are inherently risky because they change the persistent state of the system. If a migration fails halfway through, a standard GitOps rollback of the application code is not enough: the previous code version can still fail because the database schema is left in an inconsistent or incompatible state.
2. The Concept: Atomic Recovery
We treat stateful changes as a synchronized operation between code and data.
- Expand-Contract Pattern: We avoid breaking changes by separating “Adding” from “Removing” across two releases.
- Backup-Before-Migrate: We ensure a fresh backup or PITR checkpoint exists immediately before any schema change.
- Point-in-Time Recovery (PITR): We use CloudNativePG to bootstrap a new cluster from a base backup and replay archived WAL up to a target timestamp taken just before the migration started.
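The expand-contract split can be sketched as two migration steps shipped in separate releases. This is a minimal illustration, not the course repo's actual migration: the table and column names (users, fullname, display_name) are hypothetical, and the functions only print SQL so that each phase can be reviewed before it is piped to psql.

```shell
#!/bin/sh
# Sketch of an expand/contract column rename split across two releases.
# Table and column names (users, fullname, display_name) are hypothetical.

# Release N (expand): additive only, so the previous application version
# keeps working. The application in this release is assumed to write both
# columns while the old one still exists.
expand_sql() {
  cat <<'SQL'
BEGIN;
ALTER TABLE users ADD COLUMN IF NOT EXISTS display_name text;
UPDATE users SET display_name = fullname WHERE display_name IS NULL;
COMMIT;
SQL
}

# Release N+1 (contract): destructive, run only once no deployed code
# reads or writes the old column.
contract_sql() {
  cat <<'SQL'
BEGIN;
ALTER TABLE users DROP COLUMN IF EXISTS fullname;
COMMIT;
SQL
}

# Usage: print a phase for review, then pipe it to psql to apply it.
# expand_sql
# contract_sql
```

Because the expand phase only adds, rolling the application back after release N needs no schema change. Only the contract phase crosses the point of no return, which is why it is the step that should sit behind a checkpoint.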
3. The Code: PITR Recovery
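Our sre/ repo uses CloudNativePG’s native recovery objects to automate stateful rollbacks. This lets us revert database state through a declarative, reviewable change, much as we revert a Git commit. Unlike a Git revert, a restore discards every write made after the target time, so it is a last resort rather than a routine operation.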
CloudNativePG cluster baseline
- flux/infrastructure/data/cnpg-clusters/develop/cluster.yaml
- flux/infrastructure/data/cnpg-clusters/develop/kustomization.yaml
- flux/infrastructure/data/cnpg-clusters/develop/scheduled-backup.yaml
- flux/infrastructure/data/cnpg-clusters/production/cluster.yaml
- flux/infrastructure/data/cnpg-clusters/production/kustomization.yaml
- flux/infrastructure/data/cnpg-clusters/production/postgres-app-secret.yaml
- flux/infrastructure/data/cnpg-clusters/production/scheduled-backup.yaml
- flux/infrastructure/data/cnpg-clusters/staging/cluster.yaml
- flux/infrastructure/data/cnpg-clusters/staging/kustomization.yaml
- flux/infrastructure/data/cnpg-clusters/staging/scheduled-backup.yaml
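To make the recovery step concrete, here is a hedged sketch of a helper that renders a CloudNativePG Cluster manifest for point-in-time recovery. It is not taken from the files listed above. The cluster name, backup name, instance count, and storage size are placeholder assumptions, and the manifest assumes a Backup object for the source cluster already exists and that WAL archiving is configured.

```shell
#!/bin/sh
# Sketch: render a CloudNativePG Cluster manifest that bootstraps a NEW cluster
# from an existing Backup object and replays WAL up to a target time.
# All names and sizes below are hypothetical placeholders.

render_recovery_cluster() {
  restore_name="$1"   # name of the new cluster to create
  backup_name="$2"    # existing Backup object to start from
  target_time="$3"    # RFC 3339 timestamp captured before the migration
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ${restore_name}
spec:
  instances: 1
  storage:
    size: 1Gi
  bootstrap:
    recovery:
      backup:
        name: ${backup_name}
      recoveryTarget:
        targetTime: "${target_time}"
EOF
}

# Usage: review the rendered manifest, then pipe it to
# "kubectl apply -n develop -f -".
# render_recovery_cluster app-db-restore pre-migration-backup "2024-05-01T10:15:00Z"
```

Recovery creates a separate cluster and does not rewind the existing one in place, so the application still has to be repointed at the restored cluster once it is healthy.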
4. The Guardrail: Pre-Migration Checkpoint
We never run a migration blindly. We enforce a guardrail that requires a completed checkpoint of the database, taken manually or by automation, before any DDL statements are executed. This gives even the most complex migration a known restore point to fall back to.
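One way to wire such a guardrail is sketched below: request an on-demand CloudNativePG Backup, then refuse to continue unless it reports completion. The backup and cluster names are hypothetical, and the check assumes the Backup resource reports the phase value "completed" in its status when it finishes.

```shell
#!/bin/sh
# Sketch of a pre-migration gate: refuse to run DDL unless a checkpoint
# backup has completed. Backup and cluster names are hypothetical.

# Render an on-demand CloudNativePG Backup ($1 = backup name, $2 = cluster name).
render_backup() {
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: $1
spec:
  cluster:
    name: $2
EOF
}

# Succeed only when the reported backup phase is "completed".
checkpoint_ok() {
  [ "$1" = "completed" ]
}

# Example wiring (needs kubectl access, so it is left commented out):
# render_backup pre-migration-001 app-db | kubectl apply -n develop -f -
# phase=$(kubectl get backups.postgresql.cnpg.io -n develop pre-migration-001 \
#   -o jsonpath='{.status.phase}')
# checkpoint_ok "$phase" || { echo "no completed checkpoint, aborting" >&2; exit 1; }
```

Keeping the phase check in its own function lets a CI job or a migration init container call the same gate, and a missing or failed backup stops the pipeline before any DDL runs.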
5. Verification: Did I Get It?
Verify your database health and capture a pre-migration checkpoint:
# Check the recoverability window and the last successful backup
kubectl get cluster -n develop <cluster-name> -o yaml | grep -E "firstRecoverabilityPoint|lastSuccessfulBackup"
# Capture the current timestamp for a potential restore
date -u +"%Y-%m-%dT%H:%M:%SZ"
Expected Output: You should have a valid backup history and a clear timestamp to use for a PITR restore if the migration fails.
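A captured timestamp is only useful if it falls inside the window the cluster can actually recover to. The sketch below compares a restore target against the first recoverability point. It assumes both values are UTC timestamps in the same fixed-width format as the date command above, and the helper name is hypothetical.

```shell
#!/bin/sh
# Sketch: confirm a captured restore target lies at or after the cluster's
# first recoverability point. Both values must be UTC timestamps in the same
# fixed-width format (YYYY-MM-DDTHH:MM:SSZ), so that byte-wise sorting
# matches chronological order.

target_is_recoverable() {
  first_point="$1"   # earliest time the cluster can be restored to
  target="$2"        # timestamp captured before the migration
  earliest=$(printf '%s\n%s\n' "$first_point" "$target" | LC_ALL=C sort | head -n 1)
  # The target is usable when nothing sorts before the first recoverability point.
  [ "$earliest" = "$first_point" ]
}

# Example wiring (needs kubectl access, so it is left commented out):
# first=$(kubectl get clusters.postgresql.cnpg.io -n develop CLUSTER_NAME \
#   -o jsonpath='{.status.firstRecoverabilityPoint}')
# target_is_recoverable "$first" "$(date -u +"%Y-%m-%dT%H:%M:%SZ")" \
#   || echo "target is outside the recoverable window" >&2
```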