# Observability

## Stack
| Component | URL | Host | Purpose |
|---|---|---|---|
| SigNoz | dogs.quinza.dev | EliteDesk LXC 101 (10.0.0.201) | Metrics, logs, traces |
| OneUptime | ops.quinza.dev | ThinkCentre LXC 100 (10.0.0.51) | Incidents, on-call, status pages |
| PagerDuty | SaaS | — | Backup alerting channel |
## Data Flow

```mermaid
graph LR
subgraph K8s["K8s Cluster (euronodes)"]
OTEL_DS[OTel DaemonSet x3]
end
subgraph Bastion
OTEL_HOST[OTel Collector - host metrics]
OTEL_RELAY[OTel Relay - 10.10.0.1:4317]
end
subgraph EliteDesk
SIGNOZ[SigNoz - 10.0.0.201:4317]
end
subgraph SaaS
PD[PagerDuty]
end
subgraph ThinkCentre
OT[OneUptime]
end
OTEL_DS -->|OTLP| OTEL_RELAY
OTEL_HOST -->|OTLP| SIGNOZ
OTEL_RELAY -->|OTLP| SIGNOZ
SIGNOZ -->|alerts| PD
SIGNOZ -->|alerts| OT
```

## OTel Collectors
| Location | Role | Destination |
|---|---|---|
| Bastion (host) | Host metrics + logs | SigNoz directly (10.0.0.201:4317) |
| Bastion (relay) | Listens on 10.10.0.1:4317 | Forwards K8s data to SigNoz |
| K8s DaemonSet (3 pods) | hostMetrics, kubeletMetrics, kubernetesAttributes, logsCollection, clusterMetrics | Bastion relay (10.10.0.1:4317) |
| K8s CNPG Collector (Deployment) | Scrapes CNPG Prometheus metrics on port 9187 | Bastion relay (10.10.0.1:4317) |
The CNPG Collector is a separate Deployment (not a DaemonSet) in the `monitoring` namespace. This is necessary because the DaemonSet runs with `hostNetwork: true` and cannot reach pod IPs on the CNI network (10.42.x.x). A headless Service `postgresql-metrics` provides DNS-based discovery via `kubernetes_sd_configs`.
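A hedged sketch of that discovery pattern (resource names, the namespace, and the CNPG pod label are assumptions; the real manifests live in the cluster config):

```yaml
# 1) Headless Service providing stable DNS for the CNPG pods' metrics port.
#    Namespace and pod selector below are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: postgresql-metrics
  namespace: monitoring
spec:
  clusterIP: None                 # headless: DNS resolves directly to pod IPs
  selector:
    cnpg.io/cluster: pg           # assumed CNPG cluster label value
  ports:
    - name: metrics
      port: 9187
---
# 2) Prometheus receiver in the CNPG collector, keeping only endpoints that
#    belong to the Service above (this lives in the collector's own config).
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cnpg
          kubernetes_sd_configs:
            - role: endpoints
              namespaces:
                names: [monitoring]
          relabel_configs:
            - source_labels: [__meta_kubernetes_service_name]
              action: keep
              regex: postgresql-metrics
```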
- Cluster name: `euronodes`
- K8s nodes cannot reach 10.0.0.x directly; the bastion relay bridges the gap (see the sketch below)
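A minimal sketch of the relay's collector config under that constraint (the pipeline layout and TLS settings are assumptions):

```yaml
# OTel Collector relay on the bastion (sketch): listen on the WireGuard-side
# address that K8s nodes can reach, forward everything to SigNoz on the LAN.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 10.10.0.1:4317   # reachable from K8s nodes
      http:
        endpoint: 10.10.0.1:4318   # used by the trace auto-instrumentation (below)
exporters:
  otlp:
    endpoint: 10.0.0.201:4317      # SigNoz, reachable only from the bastion
    tls:
      insecure: true               # assumption: plain OTLP inside the LAN
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      exporters: [otlp]
    traces:
      receivers: [otlp]
      exporters: [otlp]
```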
## Alerts
15 alerts configured with PagerDuty integration:
| Category | Count | Examples |
|---|---|---|
| USE method | 6 | CPU saturation, memory utilization, disk usage, network errors (wg0 excluded, threshold >100), swap (>500MB, warning) |
| SLO burn rate | 2 | Disk headroom burn, network health burn |
| Heartbeat | 1 | "No data" -- detects missing telemetry |
| Kubernetes | 1 | Pod CrashLoopBackOff (container restarts >5 in 10min) |
| Traefik | 1 | 5xx Error Rate (>10 in 5min, log-based) |
| PostgreSQL (CNPG) | 4 | PG down, high connections, replication lag, low cache hit ratio |
Alert tuning notes:
- Network Errors: excluded the `wg0` interface, raised threshold to >100
- Swap Memory: raised threshold to 500MB, severity changed from info to warning
## SLOs
| SLO | Target | Window |
|---|---|---|
| Disk Headroom | 99% | 30 days |
| Network Health | 99.9% | 30 days |
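For a sense of scale: over a 30-day window, a 99% target leaves an error budget of roughly 7.2 hours, and a 99.9% target roughly 43 minutes. The burn-rate alerts above fire when the budget is being consumed faster than the window allows.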
## Dashboards
| Dashboard | Panels | Method |
|---|---|---|
| USE Method | 18 | Utilization, Saturation, Errors per resource |
| Infrastructure SLOs | -- | Burn rate and error budget tracking |
| K8s Nodes (Euronodes Cluster) | -- | Node-level metrics for the K8s cluster |
| K8s Pods (Euronodes Cluster) | -- | Pod-level metrics for the K8s cluster |
| K8s Namespaces (Euronodes Cluster) | -- | Namespace-level metrics for the K8s cluster |
| PostgreSQL (CNPG) | -- | Connections, DB size, cache hit ratio, WAL, transactions, replication lag |
### K8s Pod Status panels
Pod Status value panels (Running/Pending/Failed) use ClickHouse SQL that returns a bare count with no timestamp column. `k8s.pod.phase` is a GAUGE value (1-5), not a label, so these panels require ClickHouse SQL queries rather than builder filters.
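A hedged sketch of such a panel query, assuming SigNoz's v4 ClickHouse schema (table, column, and normalized metric names vary across SigNoz versions):

```sql
-- Count pods whose latest k8s.pod.phase sample is Running (2).
-- OTel phase values: 1=Pending, 2=Running, 3=Succeeded, 4=Failed, 5=Unknown.
-- Table/column names here are assumptions based on SigNoz's v4 schema.
SELECT countIf(phase = 2) AS running
FROM (
    SELECT fingerprint, argMax(value, unix_milli) AS phase
    FROM signoz_metrics.distributed_samples_v4
    WHERE metric_name = 'k8s_pod_phase'
      AND unix_milli >= toUnixTimestamp64Milli(now64() - INTERVAL 5 MINUTE)
    GROUP BY fingerprint
)
```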
The K8s dashboards are stored as JSON files in `roles/signoz_dashboards/files/dashboards/` and deployed via the Ansible role `signoz_dashboards` (or a direct API POST to SigNoz).
## Auto-Instrumentation (Traces)
The OTel Operator injects auto-instrumentation into application pods via annotations. No code changes required.
| Service | Namespace | Type | Annotation |
|---|---|---|---|
| carzying (frontend) | carzying | Node.js | `instrumentation.opentelemetry.io/inject-nodejs: "true"` |
| directus (CMS) | carzying | Node.js | `instrumentation.opentelemetry.io/inject-nodejs: "true"` |
`Instrumentation` resources are created per namespace with the following settings (a sketch follows the list):

- Exporter endpoint: `http://10.10.0.1:4318` (bastion relay, HTTP)
- Environment: `production`
- Cluster: `euronodes`
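A sketch of one such resource for the `carzying` namespace, using the OTel Operator's `Instrumentation` CRD (the propagator list and the wiring of environment/cluster via `OTEL_RESOURCE_ATTRIBUTES` are assumptions):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: instrumentation                    # illustrative name
  namespace: carzying
spec:
  exporter:
    endpoint: http://10.10.0.1:4318        # bastion relay, OTLP over HTTP
  propagators:                             # assumed defaults
    - tracecontext
    - baggage
  env:
    # Assumption: environment and cluster are attached as resource attributes.
    - name: OTEL_RESOURCE_ATTRIBUTES
      value: deployment.environment=production,k8s.cluster.name=euronodes
```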
Traces flow: App pod → OTel init container → bastion relay → SigNoz
### Services in SigNoz
| Service | Metrics |
|---|---|
| carzying | P99 latency, error rate, operations/sec |
| directus | P99 latency, error rate, operations/sec (includes SQL query traces) |
### Adding instrumentation to a new service

- Ensure an `Instrumentation` resource exists in the target namespace
- Annotate the deployment:

```bash
kubectl patch deployment my-app -n my-namespace -p \
  '{"spec":{"template":{"metadata":{"annotations":{"instrumentation.opentelemetry.io/inject-nodejs":"true"}}}}}'
```

Supported runtimes: `inject-nodejs`, `inject-python`, `inject-java`, `inject-dotnet`, `inject-go`
## Auto-Remediation Pipeline
SigNoz alerts can trigger automated remediation via Ansible Semaphore on the bastion.
```
SigNoz alert --> webhook --> relay (bastion:8011) --> Semaphore API --> Ansible playbook
```

The webhook relay is a Python service on the bastion (port 8011) that maps SigNoz alert names to Semaphore task template IDs. See Disaster Recovery - Semaphore Runbooks for the full template list.
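A minimal sketch of the relay's shape, assuming SigNoz's Alertmanager-style webhook payload and Semaphore's task API; the URL, token handling, and template IDs below are placeholders, not the real mapping:

```python
# Hypothetical sketch of the bastion webhook relay (names/IDs are assumptions).
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SEMAPHORE_URL = "http://127.0.0.1:3000"   # placeholder
SEMAPHORE_TOKEN = "changeme"              # placeholder API token
PROJECT_ID = 1                            # placeholder project

# Map SigNoz alert names to Semaphore task template IDs (illustrative values).
ALERT_TO_TEMPLATE = {
    "Disk Usage High": 12,
    "Pod CrashLoopBackOff": 14,
}

class RelayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        payload = json.loads(body or b"{}")
        # Assumption: SigNoz sends an Alertmanager-compatible payload with an
        # "alerts" array whose labels carry the rule name as "alertname".
        for alert in payload.get("alerts", []):
            name = alert.get("labels", {}).get("alertname", "")
            template_id = ALERT_TO_TEMPLATE.get(name)
            if template_id is None:
                continue  # no remediation mapped for this alert
            req = urllib.request.Request(
                f"{SEMAPHORE_URL}/api/project/{PROJECT_ID}/tasks",
                data=json.dumps({"template_id": template_id}).encode(),
                headers={
                    "Authorization": f"Bearer {SEMAPHORE_TOKEN}",
                    "Content-Type": "application/json",
                },
                method="POST",
            )
            urllib.request.urlopen(req)  # kick off the Ansible playbook run
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8011), RelayHandler).serve_forever()
```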
## Caddy Improvements
- Passive health checks on all `reverse_proxy` upstreams (`fail_duration` 30s, `max_fails` 2)
- Traefik JSON access logs enabled
- OTel transform processor enriches Traefik logs with the target namespace (RouterName extraction); see the sketch below
- Filtering by namespace in SigNoz now shows Traefik access logs for that namespace
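A sketch of what that transform processor could look like, assuming the parsed Traefik log exposes the router as a `RouterName` attribute; the capture regex and target attribute name are assumptions to adapt to the actual router naming scheme:

```yaml
processors:
  transform/traefik_namespace:
    log_statements:
      - context: log
        statements:
          # Traefik router names for Kubernetes ingresses typically embed the
          # namespace, e.g. "carzying-frontend-...@kubernetes"; extract the
          # leading segment as the namespace attribute.
          - set(attributes["k8s.namespace.name"], ExtractPatterns(attributes["RouterName"], "^(?P<ns>[^-@]+)-")["ns"]) where attributes["RouterName"] != nil
```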
## Completed

| Task | Status |
|---|---|
| Deploy OTel Collector inside the Kubernetes cluster | Done |
| Implement "no data" alerts (detect missing telemetry) | Done |
| Auto-instrumentation for traces | Done |
| PostgreSQL CNPG monitoring (dedicated collector) | Done |
| Auto-remediation pipeline (Semaphore + webhook relay) | Done |
| MetalLB L2 for Traefik LoadBalancer | Done |
| ExternalDNS for Cloudflare | Done |