# Observability

## Stack
| Component | URL | Host | Purpose |
|---|---|---|---|
| SigNoz | dogs.quinza.dev | EliteDesk LXC 101 (10.0.0.201) | Metrics, logs, traces |
| OneUptime | ops.quinza.dev | ThinkCentre LXC 100 (10.0.0.51) | Incidents, on-call, status pages |
| PagerDuty | SaaS | — | Backup alerting channel |
## Data Flow

```mermaid
graph LR
subgraph K8s["K8s Cluster (euronodes)"]
OTEL_DS[OTel DaemonSet x3]
end
subgraph Bastion
OTEL_HOST[OTel Collector - host metrics]
OTEL_RELAY[OTel Relay - 10.10.0.1:4317]
end
subgraph EliteDesk
SIGNOZ[SigNoz - 10.0.0.201:4317]
end
subgraph SaaS
PD[PagerDuty]
end
subgraph ThinkCentre
OT[OneUptime]
end
OTEL_DS -->|OTLP| OTEL_RELAY
OTEL_HOST -->|OTLP| SIGNOZ
OTEL_RELAY -->|OTLP| SIGNOZ
SIGNOZ -->|alerts| PD
SIGNOZ -->|alerts| OT
```

## OTel Collectors
| Location | Role | Destination |
|---|---|---|
| Bastion (host) | Host metrics + logs | SigNoz directly (10.0.0.201:4317) |
| Bastion (relay) | Listens on 10.10.0.1:4317 | Forwards K8s data to SigNoz |
| K8s DaemonSet (3 pods) | hostMetrics, kubeletMetrics, kubernetesAttributes, logsCollection, clusterMetrics | Bastion relay (10.10.0.1:4317) |
| K8s CNPG Collector (Deployment) | Scrapes CNPG Prometheus metrics on port 9187 | Bastion relay (10.10.0.1:4317) |
The CNPG Collector is a separate Deployment (not a DaemonSet) in the `monitoring` namespace. This is necessary because the DaemonSet runs with `hostNetwork: true` and cannot reach pod IPs on the CNI network (10.42.x.x). A headless Service `postgresql-metrics` provides DNS-based discovery via `kubernetes_sd_configs`.
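A hedged sketch of that discovery pattern (resource names, the namespace, and the CNPG pod label are assumptions; the real manifests live in the cluster config):

```yaml
# 1) Headless Service providing stable DNS for the CNPG pods' metrics port.
#    Namespace and pod selector below are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: postgresql-metrics
  namespace: monitoring
spec:
  clusterIP: None                 # headless: DNS resolves directly to pod IPs
  selector:
    cnpg.io/cluster: pg           # assumed CNPG cluster label value
  ports:
    - name: metrics
      port: 9187
---
# 2) Prometheus receiver in the CNPG collector, keeping only endpoints that
#    belong to the Service above (this lives in the collector's own config).
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cnpg
          kubernetes_sd_configs:
            - role: endpoints
              namespaces:
                names: [monitoring]
          relabel_configs:
            - source_labels: [__meta_kubernetes_service_name]
              action: keep
              regex: postgresql-metrics
```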
- Cluster name: `euronodes`
- K8s nodes cannot reach 10.0.0.x directly; the bastion relay bridges the gap (see the sketch below)
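A minimal sketch of the relay's collector config under that constraint (the pipeline layout and TLS settings are assumptions):

```yaml
# OTel Collector relay on the bastion (sketch): listen on the WireGuard-side
# address that K8s nodes can reach, forward everything to SigNoz on the LAN.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 10.10.0.1:4317   # reachable from K8s nodes
      http:
        endpoint: 10.10.0.1:4318   # used by the trace auto-instrumentation (below)
exporters:
  otlp:
    endpoint: 10.0.0.201:4317      # SigNoz, reachable only from the bastion
    tls:
      insecure: true               # assumption: plain OTLP inside the LAN
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      exporters: [otlp]
    traces:
      receivers: [otlp]
      exporters: [otlp]
```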
## Alerts
15 alerts configured with PagerDuty integration:
| Category | Count | Examples |
|---|---|---|
| USE method | 6 | CPU saturation, memory utilization, disk usage, network errors (wg0 excluded, threshold >100), swap (>500MB, warning) |
| SLO burn rate | 2 | Disk headroom burn, network health burn |
| Heartbeat | 1 | "No data" -- detects missing telemetry |
| Kubernetes | 1 | Pod CrashLoopBackOff (container restarts >5 in 10min) |
| Traefik | 1 | 5xx Error Rate (>10 in 5min, log-based) |
| PostgreSQL (CNPG) | 4 | PG down, high connections, replication lag, low cache hit ratio |
Alert tuning notes:
- Network Errors: excluded the `wg0` interface, raised threshold to >100
- Swap Memory: raised threshold to 500MB, severity changed from info to warning
## SLOs
| SLO | Target | Window |
|---|---|---|
| Disk Headroom | 99% | 30 days |
| Network Health | 99.9% | 30 days |
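For a sense of scale: over a 30-day window, a 99% target leaves an error budget of roughly 7.2 hours, and a 99.9% target roughly 43 minutes. The burn-rate alerts above fire when the budget is being consumed faster than the window allows.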
## Dashboards
| Dashboard | Panels | Method |
|---|---|---|
| USE Method | 18 | Utilization, Saturation, Errors per resource |
| Infrastructure SLOs | -- | Burn rate and error budget tracking |
| K8s Nodes (Euronodes Cluster) | -- | Node-level metrics for the K8s cluster |
| K8s Pods (Euronodes Cluster) | -- | Pod-level metrics for the K8s cluster |
| K8s Namespaces (Euronodes Cluster) | -- | Namespace-level metrics for the K8s cluster |
| PostgreSQL (CNPG) | -- | Connections, DB size, cache hit ratio, WAL, transactions, replication lag |
### K8s Pod Status panels
Pod Status value panels (Running/Pending/Failed) use ClickHouse SQL that returns a bare count with no timestamp column. `k8s.pod.phase` is a GAUGE value (1-5), not a label, so these panels require ClickHouse SQL queries rather than builder filters.
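A hedged sketch of such a panel query, assuming SigNoz's v4 ClickHouse schema (table, column, and normalized metric names vary across SigNoz versions):

```sql
-- Count pods whose latest k8s.pod.phase sample is Running (2).
-- OTel phase values: 1=Pending, 2=Running, 3=Succeeded, 4=Failed, 5=Unknown.
-- Table/column names here are assumptions based on SigNoz's v4 schema.
SELECT countIf(phase = 2) AS running
FROM (
    SELECT fingerprint, argMax(value, unix_milli) AS phase
    FROM signoz_metrics.distributed_samples_v4
    WHERE metric_name = 'k8s_pod_phase'
      AND unix_milli >= toUnixTimestamp64Milli(now64() - INTERVAL 5 MINUTE)
    GROUP BY fingerprint
)
```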
The K8s dashboards are stored as JSON files in `roles/signoz_dashboards/files/dashboards/` and deployed via the Ansible role `signoz_dashboards` (or a direct API POST to SigNoz).
## Auto-Instrumentation (Traces)
The OTel Operator injects auto-instrumentation into application pods via annotations. No code changes required.
| Service | Namespace | Type | Annotation |
|---|---|---|---|
| carzying (frontend) | carzying | Node.js | `instrumentation.opentelemetry.io/inject-nodejs: "true"` |
| directus (CMS) | carzying | Node.js | `instrumentation.opentelemetry.io/inject-nodejs: "true"` |
`Instrumentation` resources are created per namespace with the following settings (a sketch follows the list):

- Exporter endpoint: `http://10.10.0.1:4318` (bastion relay, HTTP)
- Environment: `production`
- Cluster: `euronodes`
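A sketch of one such resource for the `carzying` namespace, using the OTel Operator's `Instrumentation` CRD (the propagator list and the wiring of environment/cluster via `OTEL_RESOURCE_ATTRIBUTES` are assumptions):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: instrumentation                    # illustrative name
  namespace: carzying
spec:
  exporter:
    endpoint: http://10.10.0.1:4318        # bastion relay, OTLP over HTTP
  propagators:                             # assumed defaults
    - tracecontext
    - baggage
  env:
    # Assumption: environment and cluster are attached as resource attributes.
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: deployment.environment=production,k8s.cluster.name=euronodes
```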
Traces flow: App pod → OTel init container → bastion relay → SigNoz
### Services in SigNoz
| Service | Metrics |
|---|---|
| carzying | P99 latency, error rate, operations/sec |
| directus | P99 latency, error rate, operations/sec (includes SQL query traces) |
### Adding instrumentation to a new service

- Ensure an `Instrumentation` resource exists in the target namespace
- Annotate the deployment:

```bash
kubectl patch deployment my-app -n my-namespace -p \
  '{"spec":{"template":{"metadata":{"annotations":{"instrumentation.opentelemetry.io/inject-nodejs":"true"}}}}}'
```

Supported runtimes: `inject-nodejs`, `inject-python`, `inject-java`, `inject-dotnet`, `inject-go`
## Auto-Remediation Pipeline
SigNoz alerts can trigger automated remediation via Ansible Semaphore on the bastion.
```
SigNoz alert --> webhook --> relay (bastion:8011) --> Semaphore API --> Ansible playbook
```

The webhook relay is a Python service on the bastion (port 8011) that maps SigNoz alert names to Semaphore task template IDs. See Disaster Recovery - Semaphore Runbooks for the full template list.
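A minimal sketch of the relay's shape, assuming SigNoz's Alertmanager-style webhook payload and Semaphore's task API; the URL, token handling, and template IDs below are placeholders, not the real mapping:

```python
# Hypothetical sketch of the bastion webhook relay (names/IDs are assumptions).
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SEMAPHORE_URL = "http://127.0.0.1:3000"   # placeholder
SEMAPHORE_TOKEN = "changeme"              # placeholder API token
PROJECT_ID = 1                            # placeholder project

# Map SigNoz alert names to Semaphore task template IDs (illustrative values).
ALERT_TO_TEMPLATE = {
    "Disk Usage High": 12,
    "Pod CrashLoopBackOff": 14,
}

class RelayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        payload = json.loads(body or b"{}")
        # Assumption: SigNoz sends an Alertmanager-compatible payload with an
        # "alerts" array whose labels carry the rule name as "alertname".
        for alert in payload.get("alerts", []):
            name = alert.get("labels", {}).get("alertname", "")
            template_id = ALERT_TO_TEMPLATE.get(name)
            if template_id is None:
                continue  # no remediation mapped for this alert
            req = urllib.request.Request(
                f"{SEMAPHORE_URL}/api/project/{PROJECT_ID}/tasks",
                data=json.dumps({"template_id": template_id}).encode(),
                headers={
                    "Authorization": f"Bearer {SEMAPHORE_TOKEN}",
                    "Content-Type": "application/json",
                },
                method="POST",
            )
            urllib.request.urlopen(req)  # kick off the Ansible playbook run
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8011), RelayHandler).serve_forever()
```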
## Caddy Improvements
- Passive health checks on all `reverse_proxy` upstreams (`fail_duration` 30s, `max_fails` 2)
- Traefik JSON access logs enabled
- OTel transform processor enriches Traefik logs with the target namespace (RouterName extraction); see the sketch below
- Filtering by namespace in SigNoz now shows Traefik access logs for that namespace
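A sketch of what that transform processor could look like, assuming the parsed Traefik log exposes the router as a `RouterName` attribute; the capture regex and target attribute name are assumptions to adapt to the actual router naming scheme:

```yaml
processors:
  transform/traefik_namespace:
    log_statements:
      - context: log
        statements:
          # Traefik router names for Kubernetes ingresses typically embed the
          # namespace, e.g. "carzying-frontend-...@kubernetes"; extract the
          # leading segment as the namespace attribute.
          - set(attributes["k8s.namespace.name"], ExtractPatterns(attributes["RouterName"], "^(?P<ns>[^-@]+)-")["ns"]) where attributes["RouterName"] != nil
```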
## Completed

| Task | Status |
|---|---|
| Deploy OTel Collector inside the Kubernetes cluster | Done |
| Implement "no data" alerts (detect missing telemetry) | Done |
| Auto-instrumentation for traces | Done |
| PostgreSQL CNPG monitoring (dedicated collector) | Done |
| Auto-remediation pipeline (Semaphore + webhook relay) | Done |
| MetalLB L2 for Traefik LoadBalancer | Done |
| ExternalDNS for Cloudflare | Done |