
Troubleshooting

Known issues encountered with the infrastructure and their fixes.


Issue 1: OTel K8s Pipelines Exporting Only to Debug

Symptom: SigNoz dashboards empty, no K8s metrics visible. OTel DaemonSet pods are running without errors.

Root cause: Helm values had logs and metrics pipelines configured with exporters: [debug] only. The otlp exporter was only present in the traces pipeline. When using the OTel Helm chart with presets, preset receivers are injected automatically but exporters in custom pipeline config are NOT merged -- they override entirely.

Fix: Run helm upgrade with otlp added to the exporters lists of both the logs and metrics pipelines.
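A minimal sketch of the corrected Helm values; pipeline names follow the upstream chart's defaults, and the exact exporter lists should match what the existing traces pipeline already uses:

```yaml
config:
  service:
    pipelines:
      logs:
        exporters: [otlp, debug]
      metrics:
        exporters: [otlp, debug]
      traces:
        exporters: [otlp, debug]
```

Keeping debug alongside otlp preserves local log visibility while data flows upstream.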

Detection:

```bash
kubectl get configmap -n monitoring otel-collector-opentelemetry-collector-agent -o yaml \
  | grep -A3 'exporters:'
```

If any pipeline shows only [debug], the otlp exporter is missing.


Issue 2: Bastion OTel Relay DeadlineExceeded

Symptom: SigNoz receives no data from the K8s cluster. Bastion relay logs show:

```
rpc error: code = DeadlineExceeded desc = context deadline exceeded
```

for the otlp/signoz exporter.

Root cause: After a WireGuard restart, the gRPC connection to SigNoz (10.0.0.201:4317) becomes stale. The OTel Collector enters exponential backoff and eventually stops retrying.

Fix:

```bash
# On the bastion
sudo systemctl restart otelcol-contrib

# Then restart the K8s DaemonSet to re-establish upstream connections
kubectl rollout restart daemonset/otel-collector-opentelemetry-collector-agent -n monitoring
```

Prevention: The bastion relay should auto-recover under normal conditions, but WireGuard restarts break the underlying TCP state. After any WireGuard restart, check the relay logs.
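That post-restart check can be wrapped in a small helper. The function itself is generic (it just counts matching lines on stdin); the journalctl invocation in the comment assumes the service name used above:

```bash
# Count DeadlineExceeded occurrences in a log stream read from stdin.
count_deadline_errors() {
  grep -c 'DeadlineExceeded' || true   # grep exits 1 on zero matches; don't fail the pipeline
}

# On the bastion, after a WireGuard restart:
#   sudo journalctl -u otelcol-contrib --since '10 minutes ago' --no-pager \
#     | count_deadline_errors
```

A nonzero count means the relay is stuck in backoff and needs the restart described above.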


Issue 3: kubeletstats TLS Certificate Validation Failure

Symptom: OTel DaemonSet logs show:

```
Error scraping metrics: tls: failed to verify certificate: x509: cannot validate certificate
for 10.10.1.x because it doesn't contain any IP SANs
```

Root cause: Talos Linux generates kubelet TLS certificates without IP SANs. The OTel kubeletstats receiver validates the certificate against the node IP and fails.

Fix: Add insecure_skip_verify: true to the kubeletstats receiver config in Helm values:

```yaml
config:
  receivers:
    kubeletstats:
      insecure_skip_verify: true
```

This is acceptable in a homelab where all traffic is internal and travels over WireGuard.

Effect: Without this fix, pod-level CPU and memory metrics are missing. Cluster-level metrics (pod phase, container restarts, deployments) still work because they come from the K8s API server, not the kubelet.


Issue 4: step-cli Binary Extracted to Wrong Path

Symptom: step ca init fails with No such file or directory: b'step'.

Root cause: The step CLI tarball contains the binary at step_linux_amd64/bin/step (two levels deep). The Ansible task used --strip-components=1, which extracted to /usr/local/bin/bin/step instead of /usr/local/bin/step.

Fix: Changed to --strip-components=2 with pattern */bin/step in roles/step_ca/tasks/main.yml.
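The difference is easy to reproduce outside Ansible. This sketch builds a tarball with the same two-level layout and shows that --strip-components=2 plus a */bin/step pattern lands the binary at the expected path (the /tmp paths are illustrative; GNU tar needs --wildcards to match member patterns at extraction time):

```bash
# Recreate the tarball layout: step_linux_amd64/bin/step
mkdir -p /tmp/stripdemo/step_linux_amd64/bin /tmp/stripdemo/out
printf '#!/bin/sh\necho step\n' > /tmp/stripdemo/step_linux_amd64/bin/step
chmod +x /tmp/stripdemo/step_linux_amd64/bin/step
tar -czf /tmp/stripdemo/step.tar.gz -C /tmp/stripdemo step_linux_amd64

# --strip-components=2 drops "step_linux_amd64/bin/", extracting a plain "step";
# with --strip-components=1 it would have landed at out/bin/step instead
tar -xzf /tmp/stripdemo/step.tar.gz -C /tmp/stripdemo/out \
  --strip-components=2 --wildcards '*/bin/step'

ls /tmp/stripdemo/out
```

In the Ansible unarchive task this corresponds roughly to extra_opts: ['--strip-components=2', '--wildcards', '*/bin/step'].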


Issue 5: SigNoz API Routes Returning HTML Instead of JSON

Symptom: Some SigNoz API v3 endpoints (like /api/v3/autocomplete/metric_name) return the frontend HTML instead of JSON.

Root cause: The SigNoz frontend proxy intercepts unrecognized routes and serves the SPA. The v1 endpoints (/api/v1/dashboards, /api/v1/rules) work correctly.

Workaround: Use v1 API endpoints when available. For metric discovery, use the SigNoz UI Metrics Explorer instead of the API.
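When scripting against the API, a cheap guard against this failure mode is to classify the response body before parsing it. The helper below is generic; the curl line in the comment assumes the SigNoz address used elsewhere in this document:

```bash
# Classify an HTTP response body as json or html by its first non-space character.
body_kind() {
  case "$(printf '%s' "$1" | sed 's/^[[:space:]]*//')" in
    '{'*|'['*) echo json ;;
    '<'*)      echo html ;;
    *)         echo unknown ;;
  esac
}

# Example:
#   body=$(curl -s http://10.0.0.201:8080/api/v1/dashboards -H "SIGNOZ-API-KEY: $KEY")
#   [ "$(body_kind "$body")" = json ] || echo "got SPA HTML; use a v1 endpoint"
```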


Issue 6: MetalLB VIP Unreachable from Bastion

Symptom: curl to 10.10.1.200 times out from the bastion.

Root cause: MetalLB L2 mode uses ARP announcements, which do not cross WireGuard tunnels. The VIP needs explicit routing in the WireGuard peer config.

Fix:

```bash
wg set wg0 peer <pubkey> allowed-ips 10.10.1.1/32,10.10.1.200/32
```

Persist the change in inventories/production/hosts.yml so the Ansible WireGuard playbook applies it on future runs.
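A sketch of the corresponding inventory entry; the wireguard_peers structure and key names here are illustrative and should be adapted to whatever schema the WireGuard role actually consumes:

```yaml
# inventories/production/hosts.yml (illustrative variable names)
wireguard_peers:
  - public_key: "<pubkey>"
    allowed_ips:
      - 10.10.1.1/32
      - 10.10.1.200/32   # MetalLB VIP must be routed explicitly
```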


Issue 7: CNPG Metrics Not Scraped -- hostNetwork Cannot Reach Pod IPs

Symptom: OTel Collector DaemonSet fails to scrape CNPG metrics on 10.42.x.x pod IPs. No PostgreSQL metrics in SigNoz.

Root cause: The OTel DaemonSet runs with hostNetwork: true, which means it operates on the host network stack and cannot reach pod IPs on the CNI network (10.42.x.x). Additionally, DNS-based discovery fails because CoreDNS ClusterIP is unreachable from hostNetwork.

Fix: Deploy a dedicated OTel Collector Deployment (otel-cnpg) in the monitoring namespace without hostNetwork. This Deployment scrapes CNPG Prometheus metrics on port 9187 using kubernetes_sd_configs. A headless Service postgresql-metrics provides DNS discovery.
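A minimal sketch of the scrape side of that Deployment's collector config. The job name and keep-filter here are illustrative, but the shape follows the standard Prometheus receiver configuration embedded in OTel collectors:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cnpg
          kubernetes_sd_configs:
            - role: endpoints
              namespaces:
                names: [apps]
          relabel_configs:
            # keep only endpoints backed by the headless metrics Service
            - source_labels: [__meta_kubernetes_service_name]
              action: keep
              regex: postgresql-metrics
```

Because this collector runs on the pod network, the 10.42.x.x endpoint addresses resolved by service discovery are directly reachable.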


Issue 8: Semaphore Template Shows Blank Page (Vue Error)

Symptom: TypeError: Cannot read properties of undefined (reading 'name') in TemplateView.vue. Template page renders blank.

Root cause: The template was created via the Semaphore API without an environment_id. The Semaphore UI crashes when trying to read environment.name on a template with no environment assigned.

Fix: Create an Environment in the Semaphore project, then assign environment_id to all templates (via API or UI).


Issue 9: ExternalDNS --cloudflare-proxied=false Crashes

Symptom: ExternalDNS pod exits with flag parsing error: unexpected false.

Root cause: The extraArgs format for boolean flags is incorrect. --cloudflare-proxied=false is not valid flag syntax for the ExternalDNS binary.

Fix: Remove the flag entirely. proxied=false is the default behavior -- no flag needed.


OTel Pipeline Health Checks

Diagnostic commands for verifying pipeline health end to end.

Bastion Relay

```bash
sudo journalctl -u otelcol-contrib --since '5 minutes ago' --no-pager
```

K8s DaemonSet

```bash
kubectl logs -n monitoring -l app.kubernetes.io/name=opentelemetry-collector --tail=20
```

WireGuard Tunnel

```bash
sudo wg show wg0
```

Look for recent handshakes on relevant peers. A handshake older than 5 minutes may indicate a connectivity issue.
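That age check is easy to script. The threshold logic below is self-contained; the wg invocation in the comment matches the command above (wg show wg0 latest-handshakes prints one "<peer> <unix-timestamp>" pair per peer):

```bash
# Succeed (exit 0) when a handshake age in seconds exceeds the 5-minute threshold.
is_stale() { [ "$1" -gt 300 ]; }

# On the bastion:
#   now=$(date +%s)
#   sudo wg show wg0 latest-handshakes | while read -r peer ts; do
#     [ "$ts" -gt 0 ] && is_stale $((now - ts)) && echo "stale peer: $peer"
#   done
```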

Bastion to SigNoz Connectivity

```bash
nc -zv 10.0.0.201 4317
```

ConfigMap Exporter Verification

```bash
kubectl get configmap -n monitoring otel-collector-opentelemetry-collector-agent -o yaml \
  | grep -A3 'exporters:'
```

All pipelines (traces, metrics, logs) must include otlp in their exporters list.

SigNoz Alert Rules

```bash
curl -s http://10.0.0.201:8080/api/v1/rules -H "SIGNOZ-API-KEY: $KEY"
```

Recovery Playbook: Empty Dashboards

When SigNoz dashboards show no data, work through these steps in order:

  1. Check bastion relay logs for DeadlineExceeded errors. If present, restart the relay:

    ```bash
    sudo systemctl restart otelcol-contrib
    ```
  2. Check K8s DaemonSet logs for exporter errors. If present, restart the DaemonSet:

    ```bash
    kubectl rollout restart daemonset/otel-collector-opentelemetry-collector-agent -n monitoring
    ```
  3. Verify the ConfigMap has otlp in all pipeline exporters (traces, metrics, logs).

  4. Verify WireGuard handshakes are recent on the bastion:

    ```bash
    sudo wg show wg0
    ```
  5. Wait 2-3 minutes for ClickHouse to process incoming data before concluding the pipeline is still broken.


Recovery Playbook: CNPG Metrics Missing

When PostgreSQL metrics are absent from SigNoz:

  1. Check the CNPG collector pod is running:

    ```bash
    kubectl get pods -n monitoring -l app=otel-cnpg
    ```
  2. Check the headless Service resolves:

    ```bash
    kubectl get endpoints postgresql-metrics -n apps
    ```
  3. Verify the collector logs for scrape errors:

    ```bash
    kubectl logs -n monitoring -l app=otel-cnpg --tail=20
    ```
  4. If the collector is running but metrics are missing, check that CNPG pods expose port 9187 (the metrics port configured in the CNPG Cluster resource).

Quinza Infrastructure