Infrastructure & Costs
At a glance
Infrastructure observability and cost control: Azure cost breakdown per service per month, Container App CPU/memory/latency metrics, error-rate dashboards with App Insights deep-links, 30-day uptime per service, budget alerts with PagerDuty, Cosmos RU dashboards, external-dependency health probes, a force-directed service-dependency graph, impact analysis with exposed MRR, incident impact-preview, and a historical-incidents scorecard per dependency.
How it works
Infrastructure & Costs is where finance, engineering, and on-call all read the same numbers. Azure cost breakdown calls the Azure Cost Management API through a pluggable `AzureProvider` (with a stub fallback when the `sp-petanque-sys-cost-reader` service principal is missing) and renders per-service monthly totals (Cosmos, ACR, SWA, Container Apps, AI Services) with a trend; results are cached for six hours. Container App metrics expose CPU, memory, replicas, request rate, and P50/P95/P99 latency for a hard-coded allowlist (api/admin/app/sys/web/www); any service outside the allowlist returns `404 SysInfraContainerAppUnknown`.
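
The provider fallback and six-hour cache described above can be pictured roughly as follows. This is a minimal sketch, not the shipped `AzureProvider`: the names `CostProvider`, `StubProvider`, `CachedCosts`, and `pick_provider`, the `monthly_costs` method, and the zeroed stub values are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Protocol, Tuple

SIX_HOURS = 6 * 60 * 60  # cache lifetime in seconds, matching the six-hour cache


class CostProvider(Protocol):
    """Hypothetical interface; the real AzureProvider may look different."""

    def monthly_costs(self, month: str) -> Dict[str, float]: ...


class StubProvider:
    """Assumed fallback when no service principal is configured (placeholder zeros)."""

    def monthly_costs(self, month: str) -> Dict[str, float]:
        services = ("Cosmos", "ACR", "SWA", "Container Apps", "AI Services")
        return {name: 0.0 for name in services}


def pick_provider(real: Optional[CostProvider]) -> CostProvider:
    """Use the real provider when one is available, otherwise the stub."""
    return real if real is not None else StubProvider()


class CachedCosts:
    """Caches per-month results and refetches once an entry is older than the TTL."""

    def __init__(
        self,
        provider: CostProvider,
        ttl: float = SIX_HOURS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._provider = provider
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Dict[str, float]]] = {}

    def get(self, month: str) -> Dict[str, float]:
        now = self._clock()
        hit = self._entries.get(month)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh
        costs = self._provider.monthly_costs(month)
        self._entries[month] = (now + self._ttl, costs)
        return costs
```

Injecting the clock keeps the expiry logic testable without waiting six hours.
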
The error-rate dashboard rolls up 5xx responses by endpoint over 1-, 6-, 24-, and 72-hour windows from `ApiUsageEvent` (capped at 20 000 rows per window) and renders an App Insights KQL deep-link on each row. The uptime log derives 30-day rolling availability per service by overlapping `SysIncident.affected_services` with the window; a service that appears in no incident reports 100 percent. Budget alerts let `sys_engineer` or `sys_finance` create per-service thresholds with a notify list and optional PagerDuty; a daily `sys_infra_budget_alert` job probes Azure month-to-date (MTD) spend and fires the pluggable `BudgetNotifier`.
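
To make the uptime derivation concrete, here is one way the overlap with the 30-day window might be computed. It is a sketch under assumptions: the `Incident` shape, the `started_at` and `resolved_at` field names, and the choice to merge overlapping incidents so downtime is not double-counted are illustrative, not taken from the actual `SysIncident` model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Incident:
    """Hypothetical stand-in for SysIncident; field names are assumptions."""

    affected_services: List[str]
    started_at: datetime
    resolved_at: Optional[datetime]  # None while the incident is still open


def uptime_percent(
    service: str, incidents: List[Incident], now: datetime, window_days: int = 30
) -> float:
    window_start = now - timedelta(days=window_days)

    # Clip every incident that names the service to the rolling window.
    intervals: List[Tuple[datetime, datetime]] = []
    for inc in incidents:
        if service not in inc.affected_services:
            continue
        start = max(inc.started_at, window_start)
        end = min(inc.resolved_at if inc.resolved_at is not None else now, now)
        if end > start:
            intervals.append((start, end))

    # Merge overlapping intervals (an assumption) and sum the downtime.
    intervals.sort()
    down = timedelta()
    cur_start: Optional[datetime] = None
    cur_end: Optional[datetime] = None
    for start, end in intervals:
        if cur_end is None or start > cur_end:
            if cur_start is not None and cur_end is not None:
                down += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_start is not None and cur_end is not None:
        down += cur_end - cur_start

    # A service with no matching incident keeps down == 0 and reports 100.0.
    return 100.0 * (1 - down / timedelta(days=window_days))
```
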
The Cosmos RU dashboard plots provisioned vs consumed P95, throttled requests, and top hot partitions per collection. External-dependency health is probed every five minutes by a scheduled job through pluggable `DepProbe` adapters; the endpoint returns the most recent `SysDepHealthSample` per service with a status badge. The service-dependency graph reads a static `service-graph.yaml`, enriches each node with live health and any open incidents, and renders a force-directed SVG at `/dependencies` with regional-scope badges (SE-only, FR-only).
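
The "most recent sample per service" behaviour of the dependency-health endpoint can be sketched as a simple reduction over stored samples. The `DepHealthSample` class and its `probed_at` field are hypothetical stand-ins for `SysDepHealthSample`; only the idea of keeping the newest sample per service comes from the description above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable


@dataclass(frozen=True)
class DepHealthSample:
    """Hypothetical shape of a stored probe result; field names are assumptions."""

    service: str
    status: str  # one of: up, degraded, down, unknown
    probed_at: datetime


def latest_per_service(samples: Iterable[DepHealthSample]) -> Dict[str, DepHealthSample]:
    """Keep only the newest sample for each service, regardless of input order."""
    latest: Dict[str, DepHealthSample] = {}
    for sample in samples:
        current = latest.get(sample.service)
        if current is None or sample.probed_at > current.probed_at:
            latest[sample.service] = sample
    return latest
```

In a real store this reduction would more likely be pushed into the query; the in-memory version only illustrates the selection rule.
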
Impact analysis at `GET /sys/dependencies/{id}/impact` computes exposed MRR, the affected tenant count and top 5 by MRR, affected features and endpoints, and mitigation guidance. It is region-aware, so a Bankgirot outage does not list French tenants. Incident impact-preview aggregates per-service impact across `affected_services` and renders above the timeline on `/incidents/{id}` with a deduplicated tenant union and the top 10 by aggregated MRR. The historical-incidents scorecard at `GET /sys/dependencies/{id}/incidents` lists every incident that affected a service in a window, with downtime minutes clamped to `to_date` so still-open incidents do not inflate the number.
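
As an illustration of the region-aware impact calculation, the sketch below filters tenants by dependency and regional scope, sums their MRR, and picks the top five. Everything about the data model is assumed: the `Tenant` class, the idea that each tenant carries a set of dependency ids, and the response keys are hypothetical, not the actual endpoint contract.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class Tenant:
    """Hypothetical tenant record; the real model is not shown in this document."""

    tenant_id: str
    region: str  # e.g. "SE" or "FR"
    mrr_eur: float
    dependencies: FrozenSet[str]


def impact(
    dep_id: str, region_scope: Optional[str], tenants: List[Tenant], top_n: int = 5
) -> Dict[str, object]:
    """Exposed MRR and top tenants for one dependency, honouring its regional scope."""
    affected = [
        t
        for t in tenants
        if dep_id in t.dependencies
        and (region_scope is None or t.region == region_scope)
    ]
    top = sorted(affected, key=lambda t: t.mrr_eur, reverse=True)[:top_n]
    return {
        "exposed_mrr": sum(t.mrr_eur for t in affected),
        "affected_tenant_count": len(affected),
        "top_tenants": [t.tenant_id for t in top],
    }
```

With an SE-only scope for a dependency such as Bankgirot, a French tenant that nominally depends on it is left out of both the count and the exposed MRR.
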
Key capabilities
- Azure cost breakdown per service per month with 6-hour cache and pluggable provider
- Container App CPU/memory/replicas/latency on hard-coded allowlist
- 5xx error-rate dashboard with App Insights KQL deep-links
- 30-day rolling uptime derived from `SysIncident.affected_services`
- Budget alerts with daily probe and pluggable notifier (PagerDuty optional)
- Cosmos RU/s dashboard with throttled-request and hot-partition views
- External-dependency health probes every 5 minutes with status badges
- Force-directed service-dependency graph with regional-scope badges
- Impact analysis: exposed MRR, top-5 affected tenants, mitigation guidance
- Incident impact-preview rendered on the incident timeline
- Historical-incidents scorecard per dependency with clamped downtime
In practice
A SEV2 incident opens against `payments`. The on-call engineer sees the incident impact-preview above the timeline: 23 affected tenants, 1 274 EUR/month aggregated MRR, top 10 listed. He clicks through the dependency graph to `Bankgirot`, opens its impact analysis, and confirms the regional filter excludes French tenants.
The dependency-health badge has been red for nine minutes; the historical-incidents scorecard shows two prior outages this quarter. He pages the upstream contact and uses the scorecard data to open an SLA conversation. Separately, the daily Azure cost probe trips a budget alert: Cosmos RU consumption is up 22 percent month-over-month.
The CFO sees the alert in PagerDuty and opens the RU dashboard to identify a hot partition.
Features in this subsystem
| ID | Status | Features |
|---|---|---|
| F21.15.01 | Shipped | Azure cost breakdown — per service (Cosmos, ACR, SWA, Container Apps, AI Services) per month, with trend. Data from Azure Cost Management API. Results cached 6 h; MTD window + per-bucket amounts; pluggable AzureProvider with stub fallback when sp-petanque-sys-cost-reader is missing. ✅ PL-T135 |
| F21.15.02 | Shipped | Container app metrics — CPU, memory, replicas, request rate, P50/P95/P99 latency per service. Hard-coded allowlist SYS_INFRA_CONTAINER_APPS (api/admin/app/sys/web/www); 60 s cache; 404 SysInfraContainerAppUnknown for anything outside the list. Implemented (PL-T135) |
| F21.15.03 | Shipped | Error-rate dashboard — 5xx by endpoint over 1/6/24/72 h, top offenders, per-row App Insights KQL deeplink. Aggregated from ApiUsageEvent; backend caps at 20 000 rows per window (follow-up in the backlog for aggregation pipeline). Implemented (PL-T135) |
| F21.15.04 | Shipped | Uptime log — 30-day rolling availability per service, derived from SysIncident.affected_services overlapped with the window. Services outside any incident surface 100 %. Implemented (PL-T135) |
| F21.15.05 | Shipped | Budget alerts — create/activate/deactivate per-service thresholds with notify list + optional PagerDuty. sys_engineer or sys_finance for writes; sys_support rejected by the capability gate. Daily sys_infra_budget_alert job probes Azure MTD and fires the pluggable BudgetNotifier. Implemented (PL-T135) |
| F21.15.06 | Shipped | Cosmos RU/s dashboard — provisioned vs consumed (P95), throttled requests, top hot partitions per collection. Collection + window-hours query params; per-collection cache keyed by window. Implemented (PL-T135) |
| F21.15.07 | Shipped | External-dependency health — latest probe per upstream (SendGrid/PayPal/Azure/…). Scheduled sys_dep_health job probes every 5 min via the pluggable DepProbe; endpoint returns the most recent SysDepHealthSample per service with status badge (up/degraded/down/unknown). Implemented (PL-T135) |
| F21.15.08 | Shipped | Service dependency graph — static service-graph.yaml enriched with live health + open incidents, rendered as a force-directed SVG at /dependencies. Regional-scope badges (SE-only, FR-only). Four GET /sys/dependencies[/…] endpoints (list, detail, impact, historical incidents); impact cached 5 min via InfraCache. See specs/api/endpoints/sys-dependencies.md + specs/sys/views/dependencies.md. ✅ PL-T147 |
| F21.15.09 | Shipped | Impact analysis — GET /sys/dependencies/{id}/impact returns exposed MRR, affected tenant count + top-5 by MRR, affected features / endpoints, mitigation. Regional-aware filter excludes non-matching tenants (Bankgirot = SE-only). ✅ PL-T147 |
| F21.15.10 | Shipped | Incident impact-preview — GET /sys/incidents/{id}/impact-preview aggregates per-service impact across affected_services; rendered in /incidents/{id} above the timeline. Deduplicated tenant union + top-10 by aggregated MRR. Runbook: docs/engineering/operations/incident-impact-analysis.md. ✅ PL-T147 |
| F21.15.11 | Shipped | Historical-incidents scorecard — GET /sys/dependencies/{id}/incidents?from=&to= lists every incident that affected the service during the window. downtime_minutes clamped to to_date so still-open incidents don't inflate the number. Feeds the upstream-SLA conversation. ✅ PL-T147 |
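
The `downtime_minutes` clamp described for F21.15.11 might look roughly like the sketch below. The function signature and the `started_at` / `resolved_at` parameter names are assumptions; only the rule that the end of an incident is capped at `to_date` comes from the table above.

```python
from datetime import datetime
from typing import Optional


def downtime_minutes(
    started_at: datetime, resolved_at: Optional[datetime], to_date: datetime
) -> int:
    """Whole minutes of downtime, with the end of the incident capped at to_date."""
    # A still-open incident (resolved_at is None) counts only up to to_date.
    end = to_date if resolved_at is None else min(resolved_at, to_date)
    seconds = (end - started_at).total_seconds()
    # An incident that starts after to_date contributes nothing.
    return max(0, int(seconds // 60))
```

Capping at `to_date` rather than at the current time keeps the figure stable for a given query window, so re-running the same scorecard later returns the same number.
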