
Disaster Recovery

Scenario 1: Bastion Goes Down

Impact: All traffic stops. WireGuard hub lost. No ingress to cluster.

Recovery

  1. Reboot from the Euronodes panel; WireGuard restarts automatically via systemd
  2. Verify WireGuard peers reconnect: `sudo wg show`
  3. Check that Caddy is serving traffic: `systemctl status caddy`
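
Step 2 can be scripted for a quick health check: `wg show <iface> latest-handshakes` prints one Unix timestamp per peer, so stale tunnels are easy to flag. A minimal sketch, assuming the interface is named `wg0` and treating anything over 180 seconds as stale (both are assumptions):

```shell
#!/usr/bin/env bash
# Flag WireGuard peers whose last handshake is older than MAX_AGE seconds.
# Pipe in the output of: sudo wg show wg0 latest-handshakes   (wg0 is an assumption)
MAX_AGE=${MAX_AGE:-180}

stale_peers() {
  local now peer ts
  now=$(date +%s)
  while read -r peer ts; do
    [ -z "${peer:-}" ] && continue
    # A peer that has never completed a handshake reports ts=0
    if (( now - ts > MAX_AGE )); then
      echo "$peer"
    fi
  done
}

# Usage: sudo wg show wg0 latest-handshakes | stale_peers
```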

If Disk is Corrupt

  1. Rebuild with Ubuntu 24.04 from Euronodes panel
  2. Run Ansible playbooks:
```bash
ansible-playbook -i inventory playbooks/bastion.yml
```

This restores: WireGuard, Caddy, OTel Collector (host + relay), nftables rules, Semaphore (Docker Compose), webhook relay.


Scenario 2: K8s Node Dies

Control Plane Node

Impact: Cluster API unavailable. Workloads on workers continue running but cannot be managed.

DANGER

Do NOT run `talosctl bootstrap` — etcd data persists on disk. Bootstrapping would destroy it.

  1. Flash Talos on the node again
  2. Apply the node config:

```bash
talosctl apply-config --insecure --nodes [ipv6] --file /tmp/quinza-cp.yaml
```

  3. Node rejoins with existing etcd data
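
Before declaring recovery complete, it is worth confirming that the existing etcd data actually survived the rejoin. A few read-only checks, sketched with the same `[ipv6]` placeholder:

```shell
# etcd service should reach the Running state on the rebuilt node
talosctl --nodes [ipv6] service etcd

# The member list should show the node alongside its original peers
talosctl --nodes [ipv6] etcd members

# The Kubernetes API should answer once etcd is healthy
kubectl get nodes
```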

Single Worker Node

Impact: Minimal. Pods migrate to the other worker automatically via Kubernetes scheduling.

No action required unless the node won't recover.
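
If the node is written off, it can be removed cleanly so the scheduler stops accounting for it. A sketch of the usual sequence (`<node-name>` is a placeholder):

```shell
# Stop new pods landing on the node and evict what remains
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Remove it from the cluster once it is confirmed dead
kubectl delete node <node-name>
```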

Both Worker Nodes

Impact: All workloads down. Need at least one worker to restore service.

  1. Flash Talos on at least one worker
  2. Apply config:

```bash
talosctl apply-config --insecure --nodes [ipv6] --file /tmp/quinza-worker-X.yaml
```

  3. Pods are scheduled once the node joins

Scenario 3: Lost Cluster Secrets

Talos secrets are encrypted in the repository with SOPS + age.

Decrypt

```bash
SOPS_AGE_KEY_FILE=~/.config/sops/age/keys.txt sops -d talsecret.enc.yaml
```

WARNING

The age private key is backed up in 1Password under "Talos Cluster - Age Key (SOPS)". If you lose both the local key file and 1Password access, secrets are unrecoverable.
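
It is cheap to verify the local key file is the right one before an emergency: the public recipient can be derived from the private key and compared against the `age` recipient your SOPS config encrypts to. A sketch (the plaintext filename mirrors the decrypt example above and is an assumption):

```shell
# Derive the public recipient from the private key file
age-keygen -y ~/.config/sops/age/keys.txt

# Re-encrypt after editing decrypted secrets
sops -e talsecret.yaml > talsecret.enc.yaml
```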


Scenario 4: Home Servers Go Down

Impact: SigNoz and OneUptime become unavailable. Production is NOT affected — these are observability-only.

  • PagerDuty continues alerting (SaaS, independent of homelab)
  • Metrics will have a gap until home servers recover
  • No user-facing impact

Scenario 5: etcd Data Loss

Impact: Cluster state lost. All workloads, services, and configuration must be recreated.

Prevention

The script `scripts/etcd-snapshot.sh` takes etcd snapshots via `talosctl`.

```bash
# Local snapshot
make etcd-snapshot

# Snapshot + push offsite to bastion
make etcd-snapshot-push
```

The `--push-to-bastion` flag copies the snapshot to the bastion for offsite storage. Retention: 14 days.
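
The script itself is not shown here, but its core is small. A minimal sketch of what `scripts/etcd-snapshot.sh` might contain, assuming a local snapshot directory, a timestamped filename scheme, and the 14-day retention above (the directory path, filename scheme, and `bastion` host alias are all assumptions):

```shell
#!/usr/bin/env bash
# Sketch: take a timestamped etcd snapshot and prune old copies.
set -euo pipefail

SNAP_DIR=${SNAP_DIR:-/var/backups/etcd}
RETENTION_DAYS=${RETENTION_DAYS:-14}

snapshot_name() {
  # e.g. etcd-20250101-020000.db
  echo "etcd-$(date -u +%Y%m%d-%H%M%S).db"
}

take_snapshot() {
  mkdir -p "$SNAP_DIR"
  talosctl etcd snapshot "$SNAP_DIR/$(snapshot_name)"
}

push_to_bastion() {
  # Offsite copy; destination path is an assumption
  scp "$SNAP_DIR/$1" bastion:/var/backups/etcd/
}

prune_old() {
  # Delete snapshots older than the retention window
  find "$SNAP_DIR" -name 'etcd-*.db' -mtime +"$RETENTION_DAYS" -delete
}
```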

Recovery

  1. Restore etcd from the most recent snapshot. Note that restore goes through a recovery bootstrap, not `talosctl etcd snapshot` (which only takes snapshots); running bootstrap is safe here because the on-disk etcd data is already gone:

```bash
talosctl bootstrap --recover-from=/path/to/snapshot.db
```

  2. If no snapshot is available, the cluster must be rebuilt from scratch using the Talos configs; ArgoCD will then re-sync workloads from GitLab.

Scenario 6: PostgreSQL Data Loss

Impact: Application data lost (Directus CMS content).

Prevention

A CronJob runs `pg_dump` daily at 02:00 UTC with 7-day retention. Offsite backups to the bastion are available via `scripts/pg-backup-to-bastion.sh`.
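
The CronJob's work is essentially a timestamped `pg_dump` plus retention pruning. A sketch of the equivalent shell logic, assuming the standard `PGHOST`/`PGUSER`/`PGPASSWORD`/`PGDATABASE` environment variables and a `/backups` mount (the paths and filename scheme are assumptions):

```shell
#!/usr/bin/env bash
# Sketch: daily logical backup with 7-day retention.
set -euo pipefail

BACKUP_DIR=${BACKUP_DIR:-/backups}
RETENTION_DAYS=7

backup_name() {
  # e.g. directus-20250101.sql.gz
  echo "directus-$(date -u +%Y%m%d).sql.gz"
}

run_backup() {
  mkdir -p "$BACKUP_DIR"
  # Connection details come from the PG* environment variables
  pg_dump | gzip > "$BACKUP_DIR/$(backup_name)"
  # Prune dumps older than the retention window
  find "$BACKUP_DIR" -name 'directus-*.sql.gz' -mtime +"$RETENTION_DAYS" -delete
}
```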

Recovery

```bash
# Restore from a backup
./scripts/pg-restore.sh <backup-file>
```

If both in-cluster PVC and bastion copies are lost, data is unrecoverable.


Semaphore Auto-Remediation

Ansible Semaphore runs on the bastion (Docker Compose, BoltDB, port 8010) and is accessible at https://semaphore.quinza.dev. SigNoz alerts trigger playbooks automatically via a webhook relay (Python, bastion:8011).

Pipeline: SigNoz alert --> webhook --> relay (bastion:8011) --> Semaphore API --> Ansible playbook

Task Templates

| ID | Name | Trigger | Notes |
|----|------|---------|-------|
| P1 | Restart OTel Relay + DaemonSet | Auto | Fixes DeadlineExceeded after WireGuard restart |
| P2 | Restart OTel DaemonSet | Auto | Fixes stale K8s collector connections |
| P3 | Delete CrashLoop Pod | Auto | Parameterized: requires pod name and namespace |
| P4 | Restart WireGuard + OTel cascade | Auto | Full connectivity recovery |
| P5 | Scale Deployment | Auto | Parameterized: deployment name, namespace, replicas |
| P6 | Restart Traefik | Auto | Fixes ingress routing issues |
| P7 | PostgreSQL Failover | Manual confirm | Triggers CNPG failover; requires human approval |

DANGER

P7 (PostgreSQL Failover) requires manual confirmation. Do NOT make this fully automatic -- a bad failover can cause data loss.

Recovery with Semaphore

For scenarios where automated remediation is appropriate:

  1. Bastion OTel relay down -- P1 restarts relay + DaemonSet cascade
  2. K8s pod in CrashLoopBackOff -- P3 deletes the pod (K8s recreates it)
  3. WireGuard tunnel broken -- P4 restarts WireGuard and cascades OTel restarts
  4. Traefik not routing -- P6 restarts Traefik pods
  5. Need to scale a deployment -- P5 scales to desired replicas

Manual Semaphore Access

```bash
# Run a template manually via API
curl -X POST https://semaphore.quinza.dev/api/project/1/tasks \
  -H "Authorization: Bearer $SEMAPHORE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"template_id": 1}'
```
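
Parameterized templates such as P3 also need their variables supplied. A hypothetical invocation: the numeric template id and the `environment` field (a JSON-encoded string of variables) both depend on how the templates are configured and on the Semaphore version, so verify against your instance before relying on it:

```shell
# Hypothetical: run the "Delete CrashLoop Pod" template with its parameters
curl -X POST https://semaphore.quinza.dev/api/project/1/tasks \
  -H "Authorization: Bearer $SEMAPHORE_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"template_id": 3, "environment": "{\"pod_name\": \"my-pod\", \"namespace\": \"default\"}"}'
```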
