Petanque Life

Observability and operations monitoring

F16.13 · 11 features · In progress

At a glance

Full-stack observability across the entire petanque platform: structured logging with structlog and request-context binding, Prometheus metrics covering HTTP and business signals, OpenTelemetry distributed tracing with auto-instrumentation, four-level health probes for Container Apps, JSON-provisioned Grafana dashboards, per-tenant alert routing to PagerDuty and Opsgenie, per-tenant SLO measurement with error-budget burndown, Sentry frontend error capture and a public live status page.

How it works

Structured logging is the spine. structlog binds request_id, tenant_id, user_id and trace_id into the context of every log line, masks a frozen set of sensitive field names (passwords, tokens, secrets) before serialisation, and switches between a JSON renderer in production and a console renderer in dev. ObservabilityMiddleware logs every request with method, path, status, duration and client IP, plus a startup info event so a deploy is unmissable in the log stream.

Prometheus metrics come in two layers. The craft-easy MetricsMiddleware exposes generic HTTP metrics (request count, duration, in-flight, request/response size); petanque-specific gauges and counters cover business signals — petanque_license_operations_total, petanque_auth_operations_total, petanque_tenant_api_requests_total, petanque_match_scores_submitted_total, petanque_audit_writes_total. Prometheus scrapes the /metrics endpoint, and the Grafana dashboards read from that same Prometheus instance.
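The business-metrics side can be sketched with prometheus_client. The metric names come from the text above; the label sets and the service code paths that increment them are assumptions.

```python
from prometheus_client import REGISTRY, Counter, generate_latest

license_ops = Counter(
    "petanque_license_operations_total",
    "License operations by type and outcome",
    ["operation", "status"],
)
match_scores = Counter(
    "petanque_match_scores_submitted_total",
    "Match scores submitted per tenant",
    ["tenant_id"],
)

# Incremented from the relevant service code paths:
license_ops.labels(operation="renew", status="ok").inc()
match_scores.labels(tenant_id="club-42").inc()

# What a scrape of /metrics would return (Prometheus exposition format):
exposition = generate_latest(REGISTRY).decode()
```

Keeping business counters in the same registry as the HTTP middleware metrics means one scrape endpoint serves both layers.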

OpenTelemetry distributed tracing is initialised through setup_telemetry() with an OTLP gRPC exporter, auto-instrumenting FastAPI, httpx and pymongo. trace_id and span_id are injected into every structlog event so logs and traces correlate by ID; an X-Trace-Id response header lets the apps echo the trace into client telemetry. When OTel packages are absent the integration becomes a no-op; configuration is environment-driven (OTEL_ENABLED, OTEL_EXPORTER_OTLP_ENDPOINT).
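In outline, a setup_telemetry() of this shape degrades gracefully in both failure modes the text describes. This is a sketch under assumptions — the real function's resource attributes, sampler and httpx/pymongo instrumentor calls are not shown in the source.

```python
import os

def setup_telemetry(app):
    """Initialise OTLP tracing; a no-op when disabled or packages are absent."""
    if os.getenv("OTEL_ENABLED", "false").lower() != "true":
        return  # disabled via environment
    try:
        from opentelemetry import trace
        from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
            OTLPSpanExporter,
        )
        from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
        from opentelemetry.sdk.trace import TracerProvider
        from opentelemetry.sdk.trace.export import BatchSpanProcessor
    except ImportError:
        return  # graceful no-op when OTel packages are absent

    provider = TracerProvider()
    provider.add_span_processor(
        BatchSpanProcessor(
            OTLPSpanExporter(endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT"))
        )
    )
    trace.set_tracer_provider(provider)
    FastAPIInstrumentor.instrument_app(app)  # httpx/pymongo get their own instrumentors
```

Gating on OTEL_ENABLED before any OpenTelemetry import keeps the tracing dependency optional for deployments that do not collect traces.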

Health checks come at four levels: /health (liveness — process up), /ready (readiness with MongoDB ping for traffic admission), /startup (startup probe for Container Apps cold-start), /health/detailed (per-dependency status with latency for MongoDB and Redis). /observability/config exposes non-sensitive runtime config for ops verification.
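The probe ladder can be sketched with plain async handlers (stdlib-only here; in the service these are FastAPI routes, and mongo_ping/redis_ping are hypothetical stand-ins for the real dependency pings):

```python
import asyncio
import time

async def mongo_ping() -> bool:
    return True  # stand-in for a real MongoDB ping

async def redis_ping() -> bool:
    return True  # stand-in for a real Redis ping

async def health() -> dict:
    # GET /health — liveness: the process is up, nothing else is checked
    return {"status": "ok"}

async def ready() -> dict:
    # GET /ready — readiness: admit traffic only once MongoDB answers a ping
    return {"status": "ready" if await mongo_ping() else "not_ready"}

async def health_detailed() -> dict:
    # GET /health/detailed — per-dependency status with latency
    deps = {}
    for name, probe in {"mongodb": mongo_ping, "redis": redis_ping}.items():
        start = time.perf_counter()
        up = await probe()
        deps[name] = {
            "up": up,
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        }
    overall = "ok" if all(d["up"] for d in deps.values()) else "degraded"
    return {"status": overall, "dependencies": deps}

print(asyncio.run(health_detailed())["status"])  # prints: ok
```

The split matters for Container Apps: liveness must stay cheap so a slow dependency never gets a healthy process restarted, while readiness is the gate that actually controls traffic admission.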

Grafana dashboards are JSON-provisioned under infrastructure/grafana/ — an API overview with twelve panels (request rate, latency p50/p95/p99, in-flight, error rate, health status, license ops, audit writes), plus per-domain views. Alerts route per-tenant through AlertRoutingConfig with first-class support for PagerDuty Events API v2, Opsgenie Alerts API v2 and generic webhooks; cooldown-based dedup prevents alert storms and AlertHistory captures the audit trail.

Per-tenant SLO measurement runs through TenantSLOConfig (defaults: 99.9% availability, p95 ≤ 300 ms, p99 ≤ 800 ms), with SLOMeasurement documents keeping the history. Error-budget burndown computes the burn rate per hour and the estimated time to exhaustion, surfacing status as ok / warning / critical / exhausted through a Prometheus gauge.

Sentry handles frontend errors through DSN config plus an envelope tunnel that satisfies the apps' CSP. The public status page hits /public/status and /public/status/incidents, backed by ServiceHealthSample (90-day TTL) and StatusIncident; a 60-second background job drives auto-incident creation on status change.
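The burndown arithmetic above reduces to a few lines. This is a worked sketch: the function shape, status thresholds and field names are assumptions, while the formulae are the standard error-budget ones (budget = 1 − target, consumed = 1 − compliance).

```python
def error_budget_burndown(slo_target: float, compliance: float, window_hours: float) -> dict:
    """slo_target and compliance as ratios, e.g. 0.999 and 0.9985."""
    budget = 1.0 - slo_target                  # allowed error ratio, e.g. 0.001
    errors = max(0.0, 1.0 - compliance)        # error ratio actually observed
    remaining = max(0.0, budget - errors) / budget if budget > 0 else 0.0
    burn_rate_per_hour = (errors / budget) / window_hours if budget > 0 else 0.0
    hours_to_exhaustion = (
        remaining / burn_rate_per_hour if burn_rate_per_hour > 0 else float("inf")
    )
    if remaining == 0.0:
        status = "exhausted"
    elif remaining < 0.25:
        status = "critical"
    elif remaining < 0.5:
        status = "warning"
    else:
        status = "ok"
    return {
        "remaining_ratio": remaining,
        "burn_rate_per_hour": burn_rate_per_hour,
        "hours_to_exhaustion": hours_to_exhaustion,
        "status": status,
    }

result = error_budget_burndown(0.999, 0.998, window_hours=24.0)
# 0.2% errors against a 0.1% budget: remaining 0.0, status "exhausted"
```

Expressing burn as a rate per hour is what lets the endpoint project time-to-exhaustion instead of only reporting the current remainder.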

Key capabilities

  • Structured logging with structlog, sensitive-field masking and request-context binding
  • Prometheus metrics covering HTTP and petanque-specific business signals
  • OpenTelemetry distributed tracing with auto-instrumentation and trace-id correlation in logs
  • Four-level health probes (liveness, readiness, startup, detailed) for Container Apps
  • JSON-provisioned Grafana dashboards with twelve-panel API overview
  • Per-tenant alert routing to PagerDuty, Opsgenie and generic webhooks with cooldown dedup
  • Per-tenant SLO measurement with error-budget burndown and burnrate-per-hour calculation
  • Sentry frontend error capture with envelope tunnel for CSP compliance
  • Public live status page driven by 60-second health sampling and auto-incident creation

In practice

A 5xx spike hits the API at 14:07. Within 60 seconds the Prometheus alert fires, AlertRoutingConfig fans the page out to PagerDuty for the on-call, ServiceHealthSample flips the API service to degraded and the public status page auto-creates an incident the user community can see. The on-call clicks the trace_id from the log line, the OpenTelemetry span tree shows MongoDB latency p99 at 4.2 s on the licenses collection, the Grafana API overview confirms the regression started at 14:06 right after a deploy.

They roll back through the deployment subsystem, watch the burndown chart recover, and the auto-resolved status incident posts a clean timeline that the federation can read after the fact.

Features in this subsystem (11)

ID Status Feature
F16.13.01 Shipped Structured logging with structlog — sensitive-field masking (SENSITIVE_FIELDS frozen set), request context binding (request_id/tenant_id/user_id/trace_id), JSON/console renderers, startup info event, ObservabilityMiddleware logs every request with method/path/status/duration/client_ip. Implemented (PL-F1613a). ✅ PL-F1613a
F16.13.02 Shipped Prometheus metrics — craft-easy MetricsMiddleware (HTTP requests, duration, in-flight, request/response size), petanque-specific business metrics: petanque_license_operations_total, petanque_auth_operations_total, petanque_tenant_api_requests_total, petanque_match_scores_submitted_total, petanque_audit_log_entries_total, petanque_sync_operations_total, petanque_db_query_duration_seconds, petanque_health_check_*. /metrics scrape endpoint. Implemented (PL-F1613a). ✅ PL-F1613a
F16.13.03 Shipped OpenTelemetry distributed tracing — setup_telemetry() initializes OTLP gRPC exporter, auto-instruments FastAPI/httpx/pymongo, injects trace_id/span_id into structlog events, X-Trace-Id response header, graceful no-op when OTel packages absent. Configurable via OTEL_ENABLED, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME. Implemented (PL-F1613a). ✅ PL-F1613a
F16.13.04 Shipped Health check endpoints — craft-easy /health (liveness) and /ready (readiness with MongoDB ping), petanque /startup (startup probe for Kubernetes/Azure Container Apps), /health/detailed (per-dependency status with latency: MongoDB, Redis), /observability/config (non-sensitive config inspection). Implemented (PL-F1613a). ✅ PL-F1613a
F16.13.05 Shipped Grafana dashboards — JSON provisioning under infrastructure/grafana/: dashboard provider config (provisioning/dashboards.yml), API overview dashboard (dashboards/api-overview.json) with 12 panels: request rate, latency p50/p95/p99, in-flight, error rate, health status, health latency, license ops, auth ops, tenant traffic, DB query latency, audit log rate, sync ops/conflicts. Implemented (PL-F1613a). ✅ PL-F1613a
F16.13.06 Shipped Alert routing to PagerDuty/Opsgenie — per-tenant alert routing config (AlertRoutingConfig), support for PagerDuty Events API v2, Opsgenie Alerts API v2 and generic webhooks, cooldown-based deduplication, AlertHistory audit trail. Endpoints: GET/PUT /observability/alerts/config/{tenant_id}, POST /observability/alerts/fire/{tenant_id}, GET /observability/alerts/history/{tenant_id}. Prometheus metrics: petanque_alerts_fired_total, petanque_alerts_suppressed_total. Implemented (PL-F1613b). ✅ PL-F1613b
F16.13.07 Shipped Per-tenant SLO measurement — configurable SLO targets per tenant (TenantSLOConfig) with system defaults (99.9% availability, p95 ≤ 300 ms, p99 ≤ 800 ms). SLOMeasurement documents for history. Endpoints: GET/PUT /observability/slo/config/{tenant_id}, POST /observability/slo/measure/{tenant_id}, GET /observability/slo/status/{tenant_id}, GET /observability/slo/history/{tenant_id}/{slo_type}. Prometheus gauge: petanque_slo_compliance_ratio. Implemented (PL-F1613b). ✅ PL-F1613b
F16.13.08 Shipped Error budget burndown — computes the error budget from SLO targets and actual compliance, burn rate per hour, estimated time to budget exhaustion. Status: ok/warning/critical/exhausted. Endpoint: GET /observability/slo/burndown/{tenant_id}. Prometheus gauge: petanque_error_budget_remaining_ratio. Implemented (PL-F1613b). ✅ PL-F1613b
F16.13.09 Shipped Sentry for frontend error capture — Sentry DSN configuration for frontend apps (GET /observability/sentry/config), envelope tunnel for CSP compliance (POST /observability/sentry/envelope), lightweight error reporting (POST /observability/sentry/report). Prometheus counter: petanque_frontend_errors_total. Implemented (PL-F1613b). ✅ PL-F1613b
F16.13.10 Shipped Live status page — public status page fetching real-time status via GET /public/status and GET /public/status/incidents. ServiceHealthSample model with TTL (90 days), StatusIncident model with automatic incident creation on status change. Background job (CollectStatusSamplesJob) runs every 60 seconds. Admin endpoints POST /incidents + PATCH /incidents/{id} for manual incident handling. Uptime is computed over a 30-day rolling window. The www status page fetches data client-side with 30 s auto-refresh and a 5 s timeout fallback. Implemented (PL-T045). ✅ PL-T045
F16.13.11 Shipped Performance profiling (continuous) ✅ PL-T288