
Disaster Recovery

Scenario 1: Bastion Goes Down

Impact: All traffic stops. WireGuard hub lost. No ingress to cluster.

Recovery

  1. Reboot from the Euronodes panel; WireGuard restarts automatically via systemd
  2. Verify WireGuard peers reconnect: `sudo wg show`
  3. Check that Caddy is serving traffic: `systemctl status caddy`
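
Step 2 can be scripted for a quick health check: `wg show <iface> latest-handshakes` prints one Unix timestamp per peer, so stale tunnels are easy to flag. A minimal sketch, assuming the interface is named `wg0` and treating anything over 180 seconds as stale (both are assumptions):

```shell
#!/usr/bin/env bash
# Flag WireGuard peers whose last handshake is older than MAX_AGE seconds.
# Pipe in the output of: sudo wg show wg0 latest-handshakes   (wg0 is an assumption)
MAX_AGE=${MAX_AGE:-180}

stale_peers() {
  local now peer ts
  now=$(date +%s)
  while read -r peer ts; do
    [ -z "${peer:-}" ] && continue
    # A peer that has never completed a handshake reports ts=0
    if (( now - ts > MAX_AGE )); then
      echo "$peer"
    fi
  done
}

# Usage: sudo wg show wg0 latest-handshakes | stale_peers
```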

If Disk is Corrupt

  1. Rebuild with Ubuntu 24.04 from Euronodes panel
  2. Run Ansible playbooks:
```bash
ansible-playbook -i inventory playbooks/bastion.yml
```

This restores: WireGuard, Caddy, OTel Collector (host + relay), nftables rules, Semaphore (Docker Compose), webhook relay.


Scenario 2: K8s Node Dies

Control Plane Node

Impact: Cluster API unavailable. Workloads on workers continue running but cannot be managed.

DANGER

Do NOT run `talosctl bootstrap` — etcd data persists on disk. Bootstrapping would destroy it.

  1. Flash Talos on the node again
  2. Apply the node config:

```bash
talosctl apply-config --insecure --nodes [ipv6] --file /tmp/quinza-cp.yaml
```

  3. Node rejoins with existing etcd data
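
Before declaring recovery complete, it is worth confirming that the existing etcd data actually survived the rejoin. A few read-only checks, sketched with the same `[ipv6]` placeholder:

```shell
# etcd service should reach the Running state on the rebuilt node
talosctl --nodes [ipv6] service etcd

# The member list should show the node alongside its original peers
talosctl --nodes [ipv6] etcd members

# The Kubernetes API should answer once etcd is healthy
kubectl get nodes
```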

Single Worker Node

Impact: Minimal. Pods migrate to the other worker automatically via Kubernetes scheduling.

No action required unless the node won't recover.
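
If the node is written off, it can be removed cleanly so the scheduler stops accounting for it. A sketch of the usual sequence (`<node-name>` is a placeholder):

```shell
# Stop new pods landing on the node and evict what remains
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Remove it from the cluster once it is confirmed dead
kubectl delete node <node-name>
```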

Both Worker Nodes

Impact: All workloads down. Need at least one worker to restore service.

  1. Flash Talos on at least one worker
  2. Apply config:

```bash
talosctl apply-config --insecure --nodes [ipv6] --file /tmp/quinza-worker-X.yaml
```

  3. Pods are scheduled once the node joins

Scenario 3: Lost Cluster Secrets

Talos secrets are encrypted in the repository with SOPS + age.

Decrypt

```bash
SOPS_AGE_KEY_FILE=~/.config/sops/age/keys.txt sops -d talsecret.enc.yaml
```

WARNING

The age private key is backed up in 1Password under "Talos Cluster - Age Key (SOPS)". If you lose both the local key file and 1Password access, secrets are unrecoverable.
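
It is cheap to verify the local key file is the right one before an emergency: the public recipient can be derived from the private key and compared against the `age` recipient your SOPS config encrypts to. A sketch (the plaintext filename mirrors the decrypt example above and is an assumption):

```shell
# Derive the public recipient from the private key file
age-keygen -y ~/.config/sops/age/keys.txt

# Re-encrypt after editing decrypted secrets
sops -e talsecret.yaml > talsecret.enc.yaml
```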


Scenario 4: Home Servers Go Down

Impact: SigNoz and OneUptime become unavailable. Production is NOT affected — these are observability-only.

  • PagerDuty continues alerting (SaaS, independent of homelab)
  • Metrics will have a gap until home servers recover
  • No user-facing impact

Scenario 5: etcd Data Loss

Impact: Cluster state lost. All workloads, services, and configuration must be recreated.

Prevention

The script `scripts/etcd-snapshot.sh` takes etcd snapshots via `talosctl`.

```bash
# Local snapshot
make etcd-snapshot

# Snapshot + push offsite to bastion
make etcd-snapshot-push
```

The `--push-to-bastion` flag copies the snapshot to the bastion for offsite storage. Retention: 14 days.
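
The script itself is not shown here, but its core is small. A minimal sketch of what `scripts/etcd-snapshot.sh` might contain, assuming a local snapshot directory, a timestamped filename scheme, and the 14-day retention above (the directory path, filename scheme, and `bastion` host alias are all assumptions):

```shell
#!/usr/bin/env bash
# Sketch: take a timestamped etcd snapshot and prune old copies.
set -euo pipefail

SNAP_DIR=${SNAP_DIR:-/var/backups/etcd}
RETENTION_DAYS=${RETENTION_DAYS:-14}

snapshot_name() {
  # e.g. etcd-20250101-020000.db
  echo "etcd-$(date -u +%Y%m%d-%H%M%S).db"
}

take_snapshot() {
  mkdir -p "$SNAP_DIR"
  talosctl etcd snapshot "$SNAP_DIR/$(snapshot_name)"
}

push_to_bastion() {
  # Offsite copy; destination path is an assumption
  scp "$SNAP_DIR/$1" bastion:/var/backups/etcd/
}

prune_old() {
  # Delete snapshots older than the retention window
  find "$SNAP_DIR" -name 'etcd-*.db' -mtime +"$RETENTION_DAYS" -delete
}
```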

Recovery

  1. Restore etcd from the most recent snapshot. Note that restore goes through a recovery bootstrap, not `talosctl etcd snapshot` (which only takes snapshots); running bootstrap is safe here because the on-disk etcd data is already gone:

```bash
talosctl bootstrap --recover-from=/path/to/snapshot.db
```

  2. If no snapshot is available, the cluster must be rebuilt from scratch using the Talos configs; ArgoCD will then re-sync workloads from GitLab.

Scenario 6: PostgreSQL Data Loss

Impact: Application data lost (Directus CMS content).

Prevention

A CronJob runs `pg_dump` daily at 02:00 UTC with 7-day retention. Offsite backups to the bastion are available via `scripts/pg-backup-to-bastion.sh`.
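
The CronJob's work is essentially a timestamped `pg_dump` plus retention pruning. A sketch of the equivalent shell logic, assuming the standard `PGHOST`/`PGUSER`/`PGPASSWORD`/`PGDATABASE` environment variables and a `/backups` mount (the paths and filename scheme are assumptions):

```shell
#!/usr/bin/env bash
# Sketch: daily logical backup with 7-day retention.
set -euo pipefail

BACKUP_DIR=${BACKUP_DIR:-/backups}
RETENTION_DAYS=7

backup_name() {
  # e.g. directus-20250101.sql.gz
  echo "directus-$(date -u +%Y%m%d).sql.gz"
}

run_backup() {
  mkdir -p "$BACKUP_DIR"
  # Connection details come from the PG* environment variables
  pg_dump | gzip > "$BACKUP_DIR/$(backup_name)"
  # Prune dumps older than the retention window
  find "$BACKUP_DIR" -name 'directus-*.sql.gz' -mtime +"$RETENTION_DAYS" -delete
}
```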

Recovery

```bash
# Restore from a backup
./scripts/pg-restore.sh <backup-file>
```

If both in-cluster PVC and bastion copies are lost, data is unrecoverable.


Semaphore Auto-Remediation

Ansible Semaphore runs on the bastion (Docker Compose, BoltDB, port 8010) and is accessible at https://semaphore.quinza.dev. SigNoz alerts trigger playbooks automatically via a webhook relay (Python, bastion:8011).

Pipeline: SigNoz alert --> webhook --> relay (bastion:8011) --> Semaphore API --> Ansible playbook

Task Templates

| ID | Name | Trigger | Notes |
|----|------|---------|-------|
| P1 | Restart OTel Relay + DaemonSet | Auto | Fixes DeadlineExceeded after WireGuard restart |
| P2 | Restart OTel DaemonSet | Auto | Fixes stale K8s collector connections |
| P3 | Delete CrashLoop Pod | Auto | Parameterized: requires pod name and namespace |
| P4 | Restart WireGuard + OTel cascade | Auto | Full connectivity recovery |
| P5 | Scale Deployment | Auto | Parameterized: deployment name, namespace, replicas |
| P6 | Restart Traefik | Auto | Fixes ingress routing issues |
| P7 | PostgreSQL Failover | Manual confirm | Triggers CNPG failover; requires human approval |

DANGER

P7 (PostgreSQL Failover) requires manual confirmation. Do NOT make this fully automatic -- a bad failover can cause data loss.

Recovery with Semaphore

For scenarios where automated remediation is appropriate:

  1. Bastion OTel relay down -- P1 restarts relay + DaemonSet cascade
  2. K8s pod in CrashLoopBackOff -- P3 deletes the pod (K8s recreates it)
  3. WireGuard tunnel broken -- P4 restarts WireGuard and cascades OTel restarts
  4. Traefik not routing -- P6 restarts Traefik pods
  5. Need to scale a deployment -- P5 scales to desired replicas

Manual Semaphore Access

```bash
# Run a template manually via API
curl -X POST https://semaphore.quinza.dev/api/project/1/tasks \
  -H "Authorization: Bearer $SEMAPHORE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"template_id": 1}'
```
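
Parameterized templates such as P3 also need their variables supplied. A hypothetical invocation: the numeric template id and the `environment` field (a JSON-encoded string of variables) both depend on how the templates are configured and on the Semaphore version, so verify against your instance before relying on it:

```shell
# Hypothetical: run the "Delete CrashLoop Pod" template with its parameters
curl -X POST https://semaphore.quinza.dev/api/project/1/tasks \
  -H "Authorization: Bearer $SEMAPHORE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"template_id": 3, "environment": "{\"pod_name\": \"my-pod\", \"namespace\": \"default\"}"}'
```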
