Disaster Recovery
Scenario 1: Bastion Goes Down
Impact: All traffic stops. WireGuard hub lost. No ingress to cluster.
Recovery
- Reboot from Euronodes panel — WireGuard restarts automatically via systemd
- Verify WireGuard peers reconnect:
sudo wg show
- Check Caddy is serving traffic (see the end-to-end check below):
systemctl status caddy
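As a quick end-to-end check after the reboot, a sketch assuming the WireGuard interface is named wg0 and that a public site is served via Caddy; substitute real values:

```bash
# Recent handshakes mean peers re-established their tunnels.
# "wg0" is an assumed interface name; substitute the real one.
sudo wg show wg0 latest-handshakes

# End-to-end through Caddy: expect a 200.
# quinza.dev is a placeholder hostname for a site Caddy actually serves.
curl -fsS -o /dev/null -w '%{http_code}\n' https://quinza.dev
```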
If Disk is Corrupt
- Rebuild with Ubuntu 24.04 from Euronodes panel
- Run Ansible playbooks:
ansible-playbook -i inventory playbooks/bastion.yml
This restores: WireGuard, Caddy, OTel Collector (host + relay), nftables rules, Semaphore (Docker Compose), and the webhook relay.
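If you want to preview what the playbook would change before applying it, Ansible's standard check mode works here (no custom flags assumed):

```bash
# Dry run: report what would change on the bastion without applying it.
ansible-playbook -i inventory playbooks/bastion.yml --check --diff
```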
Scenario 2: K8s Node Dies
Control Plane Node
Impact: Cluster API unavailable. Workloads on workers continue running but cannot be managed.
DANGER
Do NOT run talosctl bootstrap — etcd data persists on disk. Bootstrapping would destroy it.
- Flash Talos on the node again
- Apply the node config:
talosctl apply-config --insecure --nodes [ipv6] --file /tmp/quinza-cp.yaml
- Node rejoins with existing etcd data
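Before touching anything else, confirm etcd actually came back with its data; a sketch using standard talosctl subcommands (same [ipv6] placeholder as above):

```bash
# Membership should list the control plane node; health runs cluster checks.
talosctl --nodes [ipv6] etcd members
talosctl --nodes [ipv6] health
```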
Single Worker Node
Impact: Minimal. Pods migrate to the other worker automatically via Kubernetes scheduling.
No action required unless the node won't recover.
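To confirm the rescheduling happened, a quick look at node and pod placement (the node name is a placeholder):

```bash
# The dead worker should show NotReady and carry no running pods.
kubectl get nodes
kubectl get pods -A -o wide --field-selector spec.nodeName=<dead-worker>
```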
Both Worker Nodes
Impact: All workloads down. Need at least one worker to restore service.
- Flash Talos on at least one worker
- Apply config:
talosctl apply-config --insecure --nodes [ipv6] --file /tmp/quinza-worker-X.yaml
- Pods will be scheduled once the node joins
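One way to watch the recovery converge (standard kubectl, nothing cluster-specific assumed):

```bash
# Watch the worker register, then confirm pods land on it.
kubectl get nodes -w
kubectl get pods -A -o wide
```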
Scenario 3: Lost Cluster Secrets
Talos secrets are encrypted in the repository with SOPS + age.
Decrypt
SOPS_AGE_KEY_FILE=~/.config/sops/age/keys.txt sops -d talsecret.enc.yaml
WARNING
The age private key is backed up in 1Password under "Talos Cluster - Age Key (SOPS)". If you lose both the local key file and 1Password access, secrets are unrecoverable.
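To verify the local key file is intact without decrypting anything, age can derive the public key from it; compare the output against the recipient the repo encrypts to:

```bash
# Print the public key (recipient) derived from the private key file.
# If this matches the recipient the repo encrypts to, the key is good.
age-keygen -y ~/.config/sops/age/keys.txt
```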
Scenario 4: Home Servers Go Down
Impact: SigNoz and OneUptime become unavailable. Production is NOT affected — these are observability-only.
- PagerDuty continues alerting (SaaS, independent of homelab)
- Metrics will have a gap until home servers recover
- No user-facing impact
Scenario 5: etcd Data Loss
Impact: Cluster state lost. All workloads, services, and configuration must be recreated.
Prevention
The script scripts/etcd-snapshot.sh takes etcd snapshots via talosctl.
# Local snapshot
make etcd-snapshot
# Snapshot + push offsite to bastion
make etcd-snapshot-push
The --push-to-bastion flag copies the snapshot to the bastion for offsite storage. Retention: 14 days.
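For orientation, the core of what a script like this does can be sketched with talosctl and plain shell; the node variable, paths, bastion SSH alias, and retention logic below are assumptions, not the actual contents of scripts/etcd-snapshot.sh:

```bash
#!/usr/bin/env bash
# Sketch only: CP_NODE, paths, and the "bastion" SSH alias are assumptions.
set -euo pipefail

SNAP="etcd-$(date +%Y%m%d-%H%M%S).db"

# Stream an etcd snapshot from the control plane to the local machine.
talosctl --nodes "$CP_NODE" etcd snapshot "/tmp/${SNAP}"

if [[ "${1:-}" == "--push-to-bastion" ]]; then
  # Offsite copy, then prune bastion copies older than 14 days.
  scp "/tmp/${SNAP}" bastion:/var/backups/etcd/
  ssh bastion 'find /var/backups/etcd -name "etcd-*.db" -mtime +14 -delete'
fi
```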
Recovery
- Restore etcd from the most recent snapshot:
talosctl bootstrap --recover-from /path/to/snapshot.db
- This is the one case where bootstrapping is correct: the etcd data is already gone, so there is nothing to destroy
- If no snapshot is available, rebuild the cluster from scratch using the Talos configs; ArgoCD will then re-sync workloads from GitLab
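After the restore, verify etcd membership and wait for workloads to reconcile before declaring victory:

```bash
# etcd should list its member(s); ArgoCD then re-syncs workloads.
talosctl --nodes [ipv6] etcd members
kubectl get pods -A
```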
Scenario 6: PostgreSQL Data Loss
Impact: Application data lost (Directus CMS content).
Prevention
A CronJob runs pg_dump daily at 02:00 UTC with 7-day retention. Offsite backups to the bastion are available via scripts/pg-backup-to-bastion.sh.
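The command at the heart of that CronJob looks roughly like the sketch below; the connection string, paths, and dump name are assumptions (the real values live in the CronJob manifest):

```bash
# Custom-format dump (pg_restore-compatible), then enforce 7-day retention.
# $DATABASE_URL and the /backups path are assumptions.
pg_dump --format=custom --file="/backups/directus-$(date +%Y%m%d).dump" "$DATABASE_URL"
find /backups -name 'directus-*.dump' -mtime +7 -delete
```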
Recovery
# Restore from a backup
./scripts/pg-restore.sh <backup-file>
If both the in-cluster PVC and the bastion copies are lost, the data is unrecoverable.
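If the script itself is unavailable, the equivalent manual restore is roughly this; the database URL is an assumption:

```bash
# Drop and recreate objects from the dump, then load the data.
pg_restore --clean --if-exists --dbname="$DATABASE_URL" /path/to/backup.dump
```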
Semaphore Auto-Remediation
Ansible Semaphore runs on the bastion (Docker Compose, BoltDB, port 8010) and is accessible at https://semaphore.quinza.dev. SigNoz alerts trigger playbooks automatically via a webhook relay (Python, bastion:8011).
Pipeline: SigNoz alert --> webhook --> relay (bastion:8011) --> Semaphore API --> Ansible playbook
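To exercise the pipeline without waiting for a real alert, post a test payload to the relay. The payload shape below is an assumption (SigNoz webhooks are Alertmanager-style); match it to whatever the relay actually parses:

```bash
# Hypothetical test payload; adjust fields to what the relay expects.
curl -X POST http://bastion:8011/ \
  -H "Content-Type: application/json" \
  -d '{"alerts": [{"status": "firing", "labels": {"alertname": "OTelRelayDown"}}]}'
```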
Task Templates
| ID | Name | Trigger | Notes |
|---|---|---|---|
| P1 | Restart OTel Relay + DaemonSet | Auto | Fixes DeadlineExceeded after WireGuard restart |
| P2 | Restart OTel DaemonSet | Auto | Fixes stale K8s collector connections |
| P3 | Delete CrashLoop Pod | Auto | Parameterized: requires pod name and namespace |
| P4 | Restart WireGuard + OTel cascade | Auto | Full connectivity recovery |
| P5 | Scale Deployment | Auto | Parameterized: deployment name, namespace, replicas |
| P6 | Restart Traefik | Auto | Fixes ingress routing issues |
| P7 | PostgreSQL Failover | Manual confirm | Triggers CNPG failover -- requires human approval |
DANGER
P7 (PostgreSQL Failover) requires manual confirmation. Do NOT make this fully automatic -- a bad failover can cause data loss.
Recovery with Semaphore
For scenarios where automated remediation is appropriate:
- Bastion OTel relay down -- P1 restarts relay + DaemonSet cascade
- K8s pod in CrashLoopBackOff -- P3 deletes the pod (K8s recreates it)
- WireGuard tunnel broken -- P4 restarts WireGuard and cascades OTel restarts
- Traefik not routing -- P6 restarts Traefik pods
- Need to scale a deployment -- P5 scales to desired replicas
Manual Semaphore Access
# Run a template manually via API
curl -X POST https://semaphore.quinza.dev/api/project/1/tasks \
-H "Authorization: Bearer $SEMAPHORE_TOKEN" \
-H "Content-Type: application/json" \
-d '{"template_id": 1}'
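For the parameterized templates (P3, P5), Semaphore's task API also takes variables; in recent Semaphore versions the request body accepts an environment field containing a JSON-encoded string. The template ID and variable names below are placeholders; match them to the actual template and playbook vars:

```bash
# Run a parameterized template. "3" and the variable names are placeholders.
curl -X POST https://semaphore.quinza.dev/api/project/1/tasks \
  -H "Authorization: Bearer $SEMAPHORE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"template_id": 3, "environment": "{\"pod_name\": \"my-pod\", \"namespace\": \"default\"}"}'
```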