Appendix: DNS and TLS Automation

Incident Hook

A service works through a raw load balancer IP, but the real hostname fails during rollout. DNS still points at the wrong target, HTTPS is missing or expired, and the incident looks like an application bug. Time is wasted debugging pods while the actual failure sits at the edge. Production ingress needs automated DNS and automated certificate issuance together.

Why This Appendix Exists

The main course keeps early chapters focused on platform safety and GitOps. This appendix explains the edge automation layer used by the SafeOps platform:

  • external-dns manages DNS records from cluster state
  • cert-manager issues certificates through Cloudflare DNS-01
  • Traefik ingresses reference the production issuer and TLS hosts

This is not a separate core chapter because ingress is not the center of the course. It is a supporting production capability you will rely on once the platform is running.

SafeOps Baseline

In the current SafeOps implementation:

  • Traefik is the ingress controller.
  • external-dns runs in the cert-manager namespace and syncs records for the target domain.
  • cert-manager manages ClusterIssuer objects for Let’s Encrypt staging and production.
  • Ingresses request certificates by referencing the issuer and TLS hostnames.
  • The Cloudflare API token secret is the shared dependency for both DNS and certificate issuance.

Investigation Snapshots

Here is the DNS/TLS GitOps bundle used in the SafeOps system.

DNS and TLS GitOps bundle

  • flux/infrastructure/security/dns-and-certificates/cluster-issuer.yaml
  • flux/infrastructure/security/dns-and-certificates/kustomization.yaml
  • flux/infrastructure/security/dns-and-certificates/release-external-dns.yaml
  • flux/infrastructure/security/dns-and-certificates/repository-external-dns.yaml
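
The contents of the bundle's kustomization.yaml are not reproduced in this appendix. A minimal sketch of what such a file could look like, assuming it does nothing more than list the other three manifests (the ordering is an assumption, not taken from the SafeOps repository):

```yaml
# Hypothetical kustomization.yaml for the dns-and-certificates bundle.
# Assumption: the file only aggregates the three sibling manifests.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - repository-external-dns.yaml   # Helm chart source for external-dns
  - release-external-dns.yaml      # the HelmRelease itself
  - cluster-issuer.yaml            # staging and production ClusterIssuers
```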

Here is the external-dns release used to synchronize DNS records.

external-dns release

---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: external-dns
  namespace: flux-system
spec:
  interval: 30m
  chart:
    spec:
      chart: external-dns
      version: "1.20.0"
      sourceRef:
        kind: HelmRepository
        name: external-dns
        namespace: flux-system
  targetNamespace: cert-manager
  install:
    createNamespace: false
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: 3
  values:
    provider:
      name: cloudflare
    env:
      - name: CF_API_TOKEN
        valueFrom:
          secretKeyRef:
            name: cloudflare-api-token
            key: api-token
    extraArgs:
      - --annotation-filter=cloudflare-proxied=enabled
      - --cloudflare-proxied
      - --request-timeout=60s
      - --cloudflare-dns-records-per-page=500
    domainFilters:
      - safeops.work
    policy: sync
    sources:
      - ingress
    txtOwnerId: "k8s-external-dns-${cluster_name}"
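
The release's sourceRef points at a HelmRepository named external-dns in flux-system, which is presumably what repository-external-dns.yaml defines. That file is not shown here, so the following is only a sketch: it assumes the upstream kubernetes-sigs chart repository, and the interval is an illustrative value.

```yaml
# Sketch of the HelmRepository that the HelmRelease's sourceRef resolves to.
# Assumptions: upstream chart repository URL, 1h refresh interval.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: external-dns        # must match spec.chart.spec.sourceRef.name
  namespace: flux-system    # must match spec.chart.spec.sourceRef.namespace
spec:
  interval: 1h
  url: https://kubernetes-sigs.github.io/external-dns/
```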

Here are the ClusterIssuer objects used for Let’s Encrypt staging and production.

ClusterIssuer configuration

---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: "admin@safeops.work"
    privateKeySecretRef:
      name: letsencrypt-production-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: "admin@safeops.work"
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
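
Both issuers and the external-dns release read the same secret: name cloudflare-api-token, key api-token. cert-manager resolves a ClusterIssuer's secret references in its cluster resource namespace, which in a default Helm install is the namespace cert-manager runs in. The sketch below shows only the shape that secret needs. The token value is a placeholder, and how SafeOps actually delivers the secret (for example through an encrypted or external secret store) is not stated in this appendix.

```yaml
# Shape of the shared Cloudflare token secret (illustrative only).
# Assumption: created in cert-manager, where external-dns runs and where
# cert-manager looks up ClusterIssuer secrets by default.
apiVersion: v1
kind: Secret
metadata:
  name: cloudflare-api-token
  namespace: cert-manager
type: Opaque
stringData:
  # Placeholder. Never commit a real token in plain text.
  api-token: "<cloudflare-api-token>"
```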

Here is the frontend ingress pattern that requests TLS from cert-manager.

Frontend ingress with TLS

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production
    cloudflare-proxied: "${cloudflare_proxied}"
spec:
  ingressClassName: traefik
  rules:
  - host: frontend.local
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend
            port:
              number: 8080
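
The manifest above carries the issuer annotation, but it shows no spec.tls block and its host is frontend.local, which falls outside the safeops.work domain filter. cert-manager only creates a certificate for an Ingress when spec.tls lists hosts and a secretName, and external-dns here only picks up ingresses whose cloudflare-proxied annotation equals enabled. The following sketch shows what a fully rendered version could look like. The hostname frontend.safeops.work and the secret name frontend-tls are hypothetical, and the annotation value assumes the variable resolves to enabled.

```yaml
# Sketch of a rendered frontend Ingress (hostname and secretName are hypothetical).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: frontend
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-production
    # Must equal "enabled" to pass --annotation-filter=cloudflare-proxied=enabled
    cloudflare-proxied: "enabled"
spec:
  ingressClassName: traefik
  tls:
    - hosts:
        - frontend.safeops.work   # hypothetical; must match a rule host
      secretName: frontend-tls    # hypothetical; cert-manager writes the cert here
  rules:
    - host: frontend.safeops.work
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend
                port:
                  number: 8080
```

While testing a new hostname, pointing the annotation at letsencrypt-staging first avoids consuming production rate limits.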

Safe Workflow (Step-by-Step)

  1. Confirm the Cloudflare token secret exists in the cert-manager namespace before enabling either DNS sync or certificate issuance.
  2. Reconcile the dns-and-certificates bundle so external-dns and the issuers exist before you depend on them.
  3. Verify cert-manager and external-dns pods are healthy.
  4. Confirm ClusterIssuer readiness for both staging and production issuers.
  5. Add or verify ingress hostnames, TLS blocks, and issuer annotations in the application ingress.
  6. Wait for DNS record creation and certificate issuance before declaring the route healthy.
  7. Validate with the real hostname over HTTPS, not only with raw service or load balancer IP checks.
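
The ordering in steps 2 through 5 can also be expressed in Flux instead of being checked by hand. The sketch below is one way to do that, not the SafeOps configuration: the Kustomization names, the application path, the GitRepository name, and the substituted values are all assumptions. It also assumes the ${cluster_name} and ${cloudflare_proxied} placeholders seen earlier are filled by Flux post-build substitution.

```yaml
# Hypothetical Flux Kustomizations encoding "edge bundle first, app second".
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: dns-and-certificates          # hypothetical name
  namespace: flux-system
spec:
  interval: 10m
  timeout: 5m
  path: ./flux/infrastructure/security/dns-and-certificates
  prune: true
  wait: true                          # Ready only once the bundle's resources are healthy
  sourceRef:
    kind: GitRepository
    name: flux-system                 # assumed source name
  postBuild:
    substitute:
      cluster_name: "example-cluster" # hypothetical value
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: frontend                      # hypothetical name
  namespace: flux-system
spec:
  interval: 10m
  path: ./flux/apps/frontend          # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  dependsOn:
    - name: dns-and-certificates      # do not apply the ingress before the edge bundle is Ready
  postBuild:
    substitute:
      cloudflare_proxied: "enabled"   # must match the external-dns annotation filter
```

With wait enabled, the first Kustomization reports Ready only after its HelmRelease and ClusterIssuers are ready, so the application ingress is never applied against missing issuers.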

Verification Commands

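# cert-manager and external-dns pods should be Running
kubectl -n cert-manager get pods
# Both issuers should report READY=True
kubectl get clusterissuer
# Select by label: the Deployment name depends on the Helm release name
kubectl -n cert-manager logs -l app.kubernetes.io/name=external-dns --since=10m --tail=-1
# Check hosts, TLS block, and annotations, then the issued certificate and its secret
kubectl -n develop describe ingress frontend
kubectl -n develop get certificate,secret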

Common Failure Patterns

  • The Cloudflare token secret is missing or invalid, so both DNS record sync and ACME DNS-01 challenges fail.
  • The ingress host exists, but the TLS block or issuer annotation is missing, so no certificate is requested.
  • DNS resolves correctly, but the certificate is still pending because the ACME challenge never completed.
  • Teams test only with raw IPs and miss that the real hostname path is still broken.

Guardrail Principle

Automate DNS and TLS together. Manual DNS records plus manual certificate handling create hidden outage debt.

Done When

  • external-dns is reconciling without errors
  • staging and production ClusterIssuer objects are ready
  • ingress resources request TLS explicitly
  • hostname resolution and HTTPS both succeed for the intended route
  • you can explain whether a failure belongs to app routing, DNS sync, or certificate issuance