
# Observability

## Stack

| Component | URL | Host | Purpose |
| --- | --- | --- | --- |
| SigNoz | dogs.quinza.dev | EliteDesk LXC 101 (10.0.0.201) | Metrics, logs, traces |
| OneUptime | ops.quinza.dev | ThinkCentre LXC 100 (10.0.0.51) | Incidents, on-call, status pages |
| PagerDuty | | SaaS | Backup alerting channel |

## Data Flow

```mermaid
graph LR
    subgraph K8s["K8s Cluster (euronodes)"]
        OTEL_DS[OTel DaemonSet x3]
    end

    subgraph Bastion
        OTEL_HOST[OTel Collector - host metrics]
        OTEL_RELAY[OTel Relay - 10.10.0.1:4317]
    end

    subgraph EliteDesk
        SIGNOZ[SigNoz - 10.0.0.201:4317]
    end

    subgraph SaaS
        PD[PagerDuty]
    end

    subgraph ThinkCentre
        OT[OneUptime]
    end

    OTEL_DS -->|OTLP| OTEL_RELAY
    OTEL_HOST -->|OTLP| SIGNOZ
    OTEL_RELAY -->|OTLP| SIGNOZ
    SIGNOZ -->|alerts| PD
    SIGNOZ -->|alerts| OT
```

## OTel Collectors

| Location | Role | Destination |
| --- | --- | --- |
| Bastion (host) | Host metrics + logs | SigNoz directly (10.0.0.201:4317) |
| Bastion (relay) | Listens on 10.10.0.1:4317 | Forwards K8s data to SigNoz |
| K8s DaemonSet (3 pods) | hostMetrics, kubeletMetrics, kubernetesAttributes, logsCollection, clusterMetrics | Bastion relay (10.10.0.1:4317) |
| K8s CNPG Collector (Deployment) | Scrapes CNPG Prometheus metrics on port 9187 | Bastion relay (10.10.0.1:4317) |

The CNPG Collector is a separate Deployment (not a DaemonSet) in the monitoring namespace. This is necessary because the DaemonSet runs with hostNetwork: true and cannot reach pod IPs on the CNI network (10.42.x.x). A headless Service postgresql-metrics provides DNS-based discovery via kubernetes_sd_configs.
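A minimal sketch of those discovery pieces follows. Only the Service name, namespace, and port 9187 come from this page; the selector label, CNPG cluster name, and scrape job name are assumptions to make the sketch concrete.

```yaml
# Headless Service fronting the CNPG metrics pods.
apiVersion: v1
kind: Service
metadata:
  name: postgresql-metrics
  namespace: monitoring
spec:
  clusterIP: None              # headless: DNS resolves directly to pod IPs
  selector:
    cnpg.io/cluster: pg-main   # assumption: label on the CNPG cluster pods
  ports:
    - name: metrics
      port: 9187
---
# Prometheus receiver fragment for the CNPG collector Deployment
# (collector config sketch; relabeling omitted for brevity).
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cnpg       # assumption: job name
          kubernetes_sd_configs:
            - role: endpoints
              namespaces:
                names: [monitoring]
```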

Cluster name: euronodes

K8s nodes cannot reach 10.0.0.x directly — the bastion relay bridges the gap.
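The relay can be a plain OTLP pass-through. A sketch of what its collector config might look like, assuming the pipeline layout and plaintext TLS (the listen and destination addresses come from this page):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 10.10.0.1:4317   # WireGuard-side listener reachable from K8s nodes
      http:
        endpoint: 10.10.0.1:4318   # HTTP endpoint used by auto-instrumentation
exporters:
  otlp:
    endpoint: 10.0.0.201:4317      # SigNoz on the EliteDesk
    tls:
      insecure: true               # assumption: plaintext inside the LAN
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      exporters: [otlp]
```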

## Alerts

The following alerts are configured with PagerDuty integration:

| Category | Count | Examples |
| --- | --- | --- |
| USE method | 6 | CPU saturation, memory utilization, disk usage, network errors (wg0 excluded, threshold >100), swap (>500MB, warning) |
| SLO burn rate | 2 | Disk headroom burn, network health burn |
| Heartbeat | 1 | "No data" -- detects missing telemetry |
| Kubernetes | 1 | Pod CrashLoopBackOff (container restarts >5 in 10min) |
| Traefik | 1 | 5xx Error Rate (>10 in 5min, log-based) |
| PostgreSQL (CNPG) | 4 | PG down, high connections, replication lag, low cache hit ratio |

Alert tuning notes:

  • Network Errors: excluded wg0 interface, raised threshold to >100
  • Swap Memory: raised threshold to 500MB, severity changed from info to warning

## SLOs

| SLO | Target | Window |
| --- | --- | --- |
| Disk Headroom | 99% | 30 days |
| Network Health | 99.9% | 30 days |
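For intuition, the error budget behind each target is simple arithmetic: the budget is the fraction of the window allowed to be out of SLO. A quick sketch:

```python
def error_budget_minutes(target: float, window_days: int) -> float:
    """Minutes of allowed SLO violation in the window for a given target (e.g. 0.99)."""
    return (1 - target) * window_days * 24 * 60

# Disk Headroom: 99% over 30 days -> 432 minutes (7.2 hours) of budget
disk_budget = error_budget_minutes(0.99, 30)

# Network Health: 99.9% over 30 days -> 43.2 minutes of budget
net_budget = error_budget_minutes(0.999, 30)
```

A burn-rate alert fires when the budget is being consumed faster than the window allows, e.g. a burn rate of 10 means the 30-day budget would be gone in 3 days.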

## Dashboards

| Dashboard | Panels | Description |
| --- | --- | --- |
| USE Method | 18 | Utilization, Saturation, Errors per resource |
| Infrastructure SLOs | -- | Burn rate and error budget tracking |
| K8s Nodes (Euronodes Cluster) | -- | Node-level metrics for the K8s cluster |
| K8s Pods (Euronodes Cluster) | -- | Pod-level metrics for the K8s cluster |
| K8s Namespaces (Euronodes Cluster) | -- | Namespace-level metrics for the K8s cluster |
| PostgreSQL (CNPG) | -- | Connections, DB size, cache hit ratio, WAL, transactions, replication lag |

### K8s Pod Status panels

Pod Status value panels (Running/Pending/Failed) use ClickHouse SQL that returns a count only, not a timestamp. `k8s.pod.phase` is a gauge metric whose value encodes the phase (1-5) rather than exposing it as a label, so these panels require ClickHouse SQL queries instead of query-builder filters.
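For illustration, a count-only query in that style might look like the following. This is a sketch only: the table and column names are assumptions and differ across SigNoz schema versions; only the metric and its phase encoding come from this page.

```sql
-- Count pods currently reporting Running (k8s.pod.phase = 2).
-- Phase encoding: 1=Pending, 2=Running, 3=Succeeded, 4=Failed, 5=Unknown.
SELECT count(DISTINCT fingerprint) AS running_pods   -- count only, no timestamp column
FROM signoz_metrics.distributed_samples_v4           -- assumption: samples table name
WHERE metric_name = 'k8s_pod_phase'                  -- assumption: dots normalized to underscores
  AND value = 2
  AND unix_milli >= toUnixTimestamp64Milli(now64() - INTERVAL 5 MINUTE)
```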

The K8s dashboards are stored as JSON files in `roles/signoz_dashboards/files/dashboards/` and deployed via the Ansible role `signoz_dashboards` (or a direct API POST to SigNoz).

## Auto-Instrumentation (Traces)

The OTel Operator injects auto-instrumentation into application pods via annotations. No code changes are required.

| Service | Namespace | Type | Annotation |
| --- | --- | --- | --- |
| carzying (frontend) | carzying | Node.js | `instrumentation.opentelemetry.io/inject-nodejs: "true"` |
| directus (CMS) | carzying | Node.js | `instrumentation.opentelemetry.io/inject-nodejs: "true"` |

Instrumentation resources are created per namespace with:

  • Exporter endpoint: http://10.10.0.1:4318 (bastion relay, HTTP)
  • Environment: production
  • Cluster: euronodes
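A sketch of what such a per-namespace Instrumentation resource might look like. The exporter endpoint, environment, and cluster values come from this page; the resource name and the env-var approach to setting the environment/cluster attributes are assumptions.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: instrumentation        # assumption: resource name
  namespace: carzying
spec:
  exporter:
    endpoint: http://10.10.0.1:4318   # bastion relay, OTLP over HTTP
  env:
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: deployment.environment=production,k8s.cluster.name=euronodes
```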

Traces flow: App pod → OTel init container → bastion relay → SigNoz

### Services in SigNoz

| Service | Metrics |
| --- | --- |
| carzying | P99 latency, error rate, operations/sec |
| directus | P99 latency, error rate, operations/sec (includes SQL query traces) |

### Adding instrumentation to a new service

  1. Ensure an Instrumentation resource exists in the target namespace
  2. Annotate the deployment:

```bash
kubectl patch deployment my-app -n my-namespace -p \
  '{"spec":{"template":{"metadata":{"annotations":{"instrumentation.opentelemetry.io/inject-nodejs":"true"}}}}}'
```

Supported runtimes: `inject-nodejs`, `inject-python`, `inject-java`, `inject-dotnet`, `inject-go`

## Auto-Remediation Pipeline

SigNoz alerts can trigger automated remediation via Ansible Semaphore on the bastion.

SigNoz alert --> webhook --> relay (bastion:8011) --> Semaphore API --> Ansible playbook

The webhook relay is a Python service on the bastion (port 8011) that maps SigNoz alert names to Semaphore task template IDs. See Disaster Recovery - Semaphore Runbooks for the full template list.
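A minimal sketch of what that relay could look like. The port 8011 and the alert-name-to-template mapping role come from this page; the template IDs, project ID, token, Semaphore address, and Alertmanager-style payload shape are all assumptions, not the real values.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SEMAPHORE_URL = "http://127.0.0.1:3000"  # assumption: Semaphore API address
SEMAPHORE_TOKEN = "changeme"             # assumption: API token
PROJECT_ID = 1                           # assumption: Semaphore project ID

# Map SigNoz alert names to Semaphore task template IDs (placeholder IDs).
ALERT_TO_TEMPLATE = {
    "Pod CrashLoopBackOff": 10,
    "Disk Usage High": 11,
}

def template_for_alert(alert_name: str):
    """Return the Semaphore template ID for an alert name, or None if unmapped."""
    return ALERT_TO_TEMPLATE.get(alert_name)

def trigger_task(template_id: int):
    """Start a task via the Semaphore API (endpoint shape per Semaphore docs)."""
    req = urllib.request.Request(
        f"{SEMAPHORE_URL}/api/project/{PROJECT_ID}/tasks",
        data=json.dumps({"template_id": template_id}).encode(),
        headers={"Authorization": f"Bearer {SEMAPHORE_TOKEN}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

class RelayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumption: SigNoz sends an Alertmanager-style payload with
        # alerts[].labels.alertname identifying each firing alert.
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        for alert in body.get("alerts", []):
            tid = template_for_alert(alert.get("labels", {}).get("alertname", ""))
            if tid is not None:
                trigger_task(tid)
        self.send_response(204)
        self.end_headers()

# To run the relay on the bastion:
#   HTTPServer(("0.0.0.0", 8011), RelayHandler).serve_forever()
```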

## Caddy Improvements

  • Passive health checks on all reverse_proxy upstreams (fail_duration 30s, max_fails 2)
  • Traefik JSON access logs enabled
  • OTel transform processor enriches Traefik logs with target namespace (RouterName extraction)
  • Filtering by namespace in SigNoz now shows Traefik access logs for that namespace
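For reference, a passive health check on a `reverse_proxy` block looks like this; the site address and upstream are placeholders, while `fail_duration` and `max_fails` are the values listed above:

```caddyfile
example.quinza.dev {
    reverse_proxy 10.0.0.51:3000 {
        fail_duration 30s   # remember a failed upstream as down for 30s
        max_fails 2         # failures within fail_duration before skipping the upstream
    }
}
```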

## Completed

  • Deploy OTel Collector inside the Kubernetes cluster -- Done
  • Implement "no data" alerts (detect missing telemetry) -- Done
  • Auto-instrumentation for traces -- Done
  • PostgreSQL CNPG monitoring (dedicated collector) -- Done
  • Auto-remediation pipeline (Semaphore + webhook relay) -- Done
  • MetalLB L2 for Traefik LoadBalancer -- Done
  • ExternalDNS for Cloudflare -- Done
