Core Track: a guardrails-first chapter in the core learning path.

Estimated Time

  • Reading: 20-25 min
  • Lab: 45-60 min
  • Quiz: 10-15 min

Prerequisites

Source Code References

  • cnpg-clusters/ Members
  • main.tf Members


What You Will Produce

A reproducible lab result, a passing quiz, and evidence that you can operate this system safely during an incident.

Chapter 11: Backup & Restore Basics

Learning Objectives

By the end of this chapter, you will be able to:

  • Verify backup success independently from restore success
  • Execute CloudNativePG backup and point-in-time restore
  • Apply the restore verification checklist as the actual safety bar
  • Explain why untested backups are not backups

Start with the video for the concept overview, then work through each lesson section.

A backup job reports success, but a real restore attempt fails under pressure. In this chapter, we move from “backup as a checkbox” to a validated recovery capability. A backup does not exist until it has been successfully restored.


1. The Problem: The “Schrödinger’s Backup” Incident

Objects are present in your S3 storage, but when a real incident happens, the restore fails due to a permission or schema mismatch. The service remains degraded because you mistook backup existence for proof of recoverability.

2. The Concept: Validated Recovery

We treat stateful recovery with the same evidence standard as code and infrastructure.

  1. Automated Operator: We use CloudNativePG to handle complex PostgreSQL operations.
  2. Immutable Artifacts: Backups are streamed to remote, off-cluster storage.
  3. Drill-First Culture: We don’t trust a backup until we’ve “cloned” it into a test target and validated the data.

3. The Code: CloudNativePG (CNPG) Clusters

Our sre/ repo defines our data plane as a set of managed Postgres clusters. The ScheduledBackup object ensures we have a regular heartbeat of recovery artifacts.

CloudNativePG cluster baseline

  • flux/infrastructure/data/cnpg-clusters/develop/cluster.yaml
  • flux/infrastructure/data/cnpg-clusters/develop/kustomization.yaml
  • flux/infrastructure/data/cnpg-clusters/develop/scheduled-backup.yaml
  • flux/infrastructure/data/cnpg-clusters/production/cluster.yaml
  • flux/infrastructure/data/cnpg-clusters/production/kustomization.yaml
  • flux/infrastructure/data/cnpg-clusters/production/postgres-app-secret.yaml
  • flux/infrastructure/data/cnpg-clusters/production/scheduled-backup.yaml
  • flux/infrastructure/data/cnpg-clusters/staging/cluster.yaml
  • flux/infrastructure/data/cnpg-clusters/staging/kustomization.yaml
  • flux/infrastructure/data/cnpg-clusters/staging/scheduled-backup.yaml
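
The repo manifests are the source of truth; as a rough sketch of the heartbeat they define, a ScheduledBackup might look like the following (name, namespace, and schedule here are assumptions, not values from the repo):

apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: daily-backup            # assumed name
  namespace: develop
spec:
  # CNPG uses a six-field cron expression (seconds first): daily at 02:00
  schedule: "0 0 2 * * *"
  backupOwnerReference: self
  cluster:
    name: app-db                # assumed Cluster name

Each run creates a Backup object in the same namespace; those are the objects you list in the verification step below.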

4. The Guardrail: S3-Backed Bootstrap

We ensure that our clusters can always find their remote storage. Credentials for the backup target (e.g., Cloudflare R2) are bootstrapped by Terraform and managed via secure GitOps secrets.

Backup credential bootstrap

Show the bootstrap Terraform
provider "hcloud" {
  token = var.hcloud_token
}

locals {
  # Control plane — always one pool.
  control_plane_nodepools = [
    {
      name         = "cp"
      server_type  = var.control_plane_server_type
      location     = var.location
      labels       = ["project=sre", "managed-by=terraform"]
      taints       = []
      count        = var.control_plane_count
      disable_ipv6 = true
    },
  ]

  # Static workers — used when autoscaling is OFF.
  # When autoscaling is ON the workers pool moves to autoscaler_nodepools.
  static_agent_pools = var.autoscaling_enabled ? [] : [
    {
      name         = "workers"
      server_type  = var.workers_server_type
      location     = var.location
      labels       = ["role=workers", "project=sre", "managed-by=terraform"]
      taints       = []
      count        = var.workers_count
      disable_ipv6 = true
    },
  ]

  # Autoscaler pool — used when autoscaling is ON.
  autoscaler_nodepools = var.autoscaling_enabled ? [
    {
      name        = "workers"
      server_type = var.workers_server_type
      location    = var.location
      min_nodes   = var.autoscaling_min_nodes
      max_nodes   = var.autoscaling_max_nodes
      labels      = { "role" = "workers", "project" = "sre", "managed-by" = "terraform" }
    },
  ] : []

  # Kured options — only populated when enabled.
  kured_options = var.kured_enabled ? {
    "reboot-days" = var.kured_reboot_days
    "start-time"  = var.kured_start_time
    "end-time"    = var.kured_end_time
  } : {}

  # etcd S3 backup — reuses the R2/S3 credentials already wired through load-env.sh.
  # k3s expects a bare hostname (no https:// prefix).
  etcd_s3_endpoint = var.backup_s3_endpoint != "" ? replace(var.backup_s3_endpoint, "https://", "") : ""

  etcd_s3_backup = local.etcd_s3_endpoint != "" ? {
    "etcd-s3-endpoint"   = local.etcd_s3_endpoint
    "etcd-s3-access-key" = var.backup_s3_access_key_id
    "etcd-s3-secret-key" = var.backup_s3_secret_access_key
    "etcd-s3-bucket"     = var.backup_s3_bucket
    "etcd-s3-folder"     = "${var.cluster_name}/etcd-snapshots"
    "etcd-s3-region"     = var.backup_s3_region
  } : {}
}

module "kube_hetzner" {
  # kube-hetzner v2.19.0
  source = "git::https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner.git?ref=a52d120bfb9f67d6c1d01add5d202609543df3ab"
  providers = {
    hcloud = hcloud
  }

  # Core
  hcloud_token    = var.hcloud_token
  cluster_name    = var.cluster_name
  ssh_public_key  = var.ssh_public_key
  ssh_private_key = var.ssh_private_key

  # Node pools
  control_plane_nodepools           = local.control_plane_nodepools
  agent_nodepools                   = local.static_agent_pools
  autoscaler_nodepools              = local.autoscaler_nodepools
  allow_scheduling_on_control_plane = var.allow_scheduling_on_control_plane

  # Load balancer
  load_balancer_type         = var.load_balancer_type
  load_balancer_location     = var.location
  load_balancer_disable_ipv6 = true

  # Ingress
  ingress_controller        = var.ingress_controller
  traefik_redirect_to_https = var.traefik_redirect_to_https
  traefik_autoscaling       = var.traefik_autoscaling

  # K3s versioning
  initial_k3s_channel       = var.k3s_channel
  install_k3s_version       = var.k3s_version
  automatically_upgrade_k3s = var.auto_upgrade_k3s
  automatically_upgrade_os  = var.auto_upgrade_os

  # cert-manager is managed by Flux, not kube-hetzner
  enable_cert_manager = false

  # Kured
  kured_options = local.kured_options

  # etcd backup to S3/R2
  etcd_s3_backup = local.etcd_s3_backup

  # OIDC / extra kube-apiserver flags
  k3s_exec_server_args = var.k3s_exec_server_args
}

locals {
  kubeconfig_path = pathexpand("${path.module}/kubeconfig.yaml")

  # Render pullSecret only when a token is provided.
  flux_pull_secret_yaml = var.flux_git_token != "" ? "    pullSecret: flux-system\n" : ""

  flux_git_secret_enabled = var.flux_git_token != ""
  sops_age_secret_enabled = var.sops_age_key != ""
  backup_s3_secret_enabled = nonsensitive(
    var.backup_s3_access_key_id != "" &&
    var.backup_s3_secret_access_key != "" &&
    var.backup_s3_bucket != ""
  )
}

resource "local_sensitive_file" "kubeconfig" {
  content         = module.kube_hetzner.kubeconfig
  filename        = local.kubeconfig_path
  file_permission = "0600"
}

provider "helm" {
  kubernetes {
    host                   = module.kube_hetzner.kubeconfig_data.host
    client_certificate     = module.kube_hetzner.kubeconfig_data.client_certificate
    client_key             = module.kube_hetzner.kubeconfig_data.client_key
    cluster_ca_certificate = module.kube_hetzner.kubeconfig_data.cluster_ca_certificate
  }
}

provider "kubernetes" {
  host                   = module.kube_hetzner.kubeconfig_data.host
  client_certificate     = module.kube_hetzner.kubeconfig_data.client_certificate
  client_key             = module.kube_hetzner.kubeconfig_data.client_key
  cluster_ca_certificate = module.kube_hetzner.kubeconfig_data.cluster_ca_certificate
}

resource "kubernetes_namespace" "bootstrap" {
  for_each = toset([
    "flux-system",
    "develop",
    "staging",
    "production",
    "observability",
    "auth",
  ])

  metadata {
    name = each.value
    labels = {
      "managed-by" = "terraform"
    }
  }

  depends_on = [local_sensitive_file.kubeconfig]

  lifecycle {
    ignore_changes = [
      metadata[0].labels,
      metadata[0].annotations,
    ]
  }
}

  data = {
    cloudflare_proxied = "enabled"
    cluster_name       = var.cluster_name
    image_registry     = var.image_registry
    git_owner          = var.git_owner
  }

  depends_on = [kubernetes_namespace.bootstrap]
}

  type = "Opaque"

  data = {
    uptrace_dsn = var.uptrace_dsn
  }

  depends_on = [kubernetes_namespace.bootstrap]
}

# Git credentials consumed by the FluxInstance (referenced below as kubernetes_secret.flux_git_credentials).
resource "kubernetes_secret" "flux_git_credentials" {
  metadata {
    name      = "flux-system"
    namespace = "flux-system"
  }

  type = "Opaque"

  data = {
    username = "git"
    password = var.flux_git_token
  }

  depends_on = [kubernetes_namespace.bootstrap]
}

resource "null_resource" "flux_operator_install" {
  depends_on = [kubernetes_namespace.bootstrap]

  triggers = {
    kubeconfig_path = local.kubeconfig_path
  }

  provisioner "local-exec" {
    when        = create
    interpreter = ["/bin/bash", "-c"]
    command     = "kubectl --kubeconfig=\"${local.kubeconfig_path}\" apply -f https://github.com/controlplaneio-fluxcd/flux-operator/releases/latest/download/install.yaml"
  }
}

resource "null_resource" "flux_instance" {
  depends_on = [
    null_resource.flux_operator_install,
    kubernetes_secret.flux_git_credentials,
  ]

  triggers = {
    kubeconfig_path = local.kubeconfig_path
    repo_url        = var.flux_git_repository_url
    repo_branch     = var.flux_git_repository_branch
    repo_path       = var.flux_kustomization_path
    flux_version    = var.flux_version
    provider        = "generic"
  }

  provisioner "local-exec" {
    when        = create
    interpreter = ["/bin/bash", "-c"]
    command     = <<-EOC
      cat <<EOF | kubectl --kubeconfig="${local.kubeconfig_path}" apply -f -
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux
  namespace: flux-system
spec:
  distribution:
    version: "${var.flux_version}"
    registry: ghcr.io/fluxcd
  components:
    - source-controller
    - kustomize-controller
    - helm-controller
    - notification-controller
    - image-reflector-controller
    - image-automation-controller
  cluster:
    type: kubernetes
  sync:
    kind: GitRepository
    url: "${var.flux_git_repository_url}"
    ref: "refs/heads/${var.flux_git_repository_branch}"
    provider: generic
    path: "${var.flux_kustomization_path}"
${local.flux_pull_secret_yaml}
EOF
    EOC
  }

  provisioner "local-exec" {
    when        = destroy
    on_failure  = continue
    interpreter = ["/bin/bash", "-c"]
    command     = "kubectl --kubeconfig=\"${self.triggers.kubeconfig_path}\" delete fluxinstance flux -n flux-system --ignore-not-found=true --wait=false --timeout=30s 2>/dev/null || true"
  }
}

resource "null_resource" "flux_pre_destroy" {
  depends_on = [
    local_sensitive_file.kubeconfig,
    kubernetes_namespace.bootstrap,
    null_resource.flux_instance,
  ]

  triggers = {
    kubeconfig_path = local.kubeconfig_path
    namespaces      = "flux-system,develop,staging,production,observability"
  }

  provisioner "local-exec" {
    when        = destroy
    on_failure  = continue
    interpreter = ["/bin/bash", "-c"]
    command     = "\"${path.module}/../scripts/flux-pre-destroy.sh\" \"${self.triggers.kubeconfig_path}\" \"${self.triggers.namespaces}\""
  }
}

  metadata {
    name      = "ghcr-credentials-docker"
    namespace = each.key
  }

  type = "kubernetes.io/dockerconfigjson"

  data = {
    ".dockerconfigjson" = jsonencode({
      auths = {
        "ghcr.io" = {
          username = var.ghcr_username
          password = var.ghcr_token
          auth     = base64encode("${var.ghcr_username}:${var.ghcr_token}")
        }
      }
    })
  }

  depends_on = [kubernetes_namespace.bootstrap]
}


  metadata {
    name      = "sops-age"
    namespace = "flux-system"
  }

  data = {
    "age.agekey" = var.sops_age_key
  }

  type = "Opaque"

  depends_on = [kubernetes_namespace.bootstrap]
}

  metadata {
    name      = "cnpg-backup-s3"
    namespace = each.key
  }

  type = "Opaque"

  data = merge(
    {
      ACCESS_KEY_ID     = var.backup_s3_access_key_id
      ACCESS_SECRET_KEY = var.backup_s3_secret_access_key
      BUCKET            = var.backup_s3_bucket
    },
    var.backup_s3_endpoint != "" ? { ENDPOINT = var.backup_s3_endpoint } : {},
    var.backup_s3_region != "" ? { REGION = var.backup_s3_region } : {},
  )

  depends_on = [kubernetes_namespace.bootstrap]
}
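
On the Kubernetes side, the CNPG Cluster consumes the cnpg-backup-s3 secret created above. A hedged sketch of the backup stanza (cluster name, bucket, endpoint, and retention are placeholders, not values from the repo):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db                  # assumed Cluster name
  namespace: develop
spec:
  instances: 2
  storage:
    size: 10Gi
  backup:
    retentionPolicy: "30d"      # placeholder retention window
    barmanObjectStore:
      destinationPath: s3://<bucket>/backups                        # placeholder bucket/path
      endpointURL: https://<account-id>.r2.cloudflarestorage.com    # placeholder R2 endpoint
      s3Credentials:
        accessKeyId:
          name: cnpg-backup-s3        # secret created by the Terraform above
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: cnpg-backup-s3
          key: ACCESS_SECRET_KEY
      wal:
        compression: gzip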

5. Verification: Did I Get It?

Verify your scheduled backups and perform a manual “heartbeat” check:

# Verify backups are running
kubectl get backups -n develop
# Describe the most recent backup status
kubectl describe backup -n develop <name>

Expected Output: The backup should be in Completed phase and have a valid S3/R2 target path.
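
A Completed backup is the heartbeat; the safety bar is an actual restore. A hedged sketch of a point-in-time recovery drill into an isolated target (names, namespace, and the target timestamp are placeholders):

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db-restore-test     # throwaway drill target, never production
  namespace: restore-test
spec:
  instances: 1
  storage:
    size: 10Gi
  bootstrap:
    recovery:
      source: app-db
      recoveryTarget:
        targetTime: "2024-05-01 12:00:00+00"    # placeholder point in time
  externalClusters:
    - name: app-db              # must match the name the backups were written under
      barmanObjectStore:
        destinationPath: s3://<bucket>/backups                      # same path the backups were written to
        endpointURL: https://<account-id>.r2.cloudflarestorage.com
        s3Credentials:
          accessKeyId:
            name: cnpg-backup-s3
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: cnpg-backup-s3
            key: ACCESS_SECRET_KEY

When the drill cluster reports healthy, apply the restore verification checklist (schema, read/write, app smoke) before declaring the backup valid.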


Detailed Lessons

Hands-On Materials

Labs, quizzes, and runbooks — available to course members.

  • Lab: CloudNativePG Backup and Restore Simulation Members
  • Quiz: Chapter 11 (Backup & Restore Basics) Members
  • Runbook: Backup and Restore (CNPG) Members

The Incident: Schrödinger's Backup

Result: The team realizes too late that a “green” backup status does not equal a functional recovery plan. Observed symptoms, what the team sees first: the backup job is “green” (successful). …

Investigation & Containment

Safe investigation sequence:

  1. Confirm Artifact: Verify the backup artifact exists and matches the retention policy.
  2. Restore Locally: Restore into an isolated, non-production target (e.g., a restore-test namespace).
  3. Verify …

Workflow & Data Plane

  • Scheduled Backups: Built-in support for streaming to S3-compatible storage.
  • Environment Isolation: Separate clusters for develop, staging, and production.
  • Operator-Managed: Flux handles the operator lifecycle and cluster …

Lab & Completion

  • Schema Exists: Verify that database objects and expected migrations are present.
  • Read/Write Check: Execute a sample read query and write query to confirm access.
  • App Smoke Check: Verify that the application can …
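
For the schema and read/write checks, a throwaway Job against the drill cluster's read-write service works well. A minimal sketch, assuming the app-db-restore-test cluster from the restore example and its CNPG-generated connection secret:

apiVersion: batch/v1
kind: Job
metadata:
  name: restore-smoke-check
  namespace: restore-test
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: psql
          image: ghcr.io/cloudnative-pg/postgresql:16    # any image with psql works
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-db-restore-test-app    # assumed CNPG connection secret; a recovered cluster may need the original app credentials instead
                  key: uri
          command: ["/bin/sh", "-ec"]
          args:
            - |
              # Schema exists: list tables in the restored database
              psql "$DATABASE_URL" -c '\dt'
              # Read/write check: simple round-trip against a probe table
              psql "$DATABASE_URL" -c "CREATE TABLE IF NOT EXISTS restore_probe (ts timestamptz); INSERT INTO restore_probe VALUES (now()); SELECT count(*) FROM restore_probe;"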