Petanque Life

Incident Management

F21.09 · 7 features

In brief

End-to-end incident management built into the console: severity-tagged active list, a create-and-page wizard that wakes on-call through PagerDuty or OpsGenie, an append-only timeline of updates and comms, public status-page sync, post-mortem links, an affected-users impact estimate scanned from real `ApiUsageEvent` data, and SLA credit calculation per affected tenant against the 99.9 percent target.

How it works

When the firehose lights up or a customer reports an outage, the response happens here. The active incident list shows severity (SEV1 to SEV4), affected tenants, responder, started-at time, and ETA to resolution. The create-incident wizard captures severity, affected services, affected tenants, and an initial summary, then pages on-call through PagerDuty Events API v2 or OpsGenie Alerts API v2 with `dedup_key = sys-incident:<id>`, so repeated triggers for the same incident collapse into a single alert and never wake anyone twice.
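As a rough illustration of the paging step, the sketch below builds a PagerDuty Events API v2 trigger event with the `sys-incident:<id>` dedup key. The severity mapping from SEV1..SEV4 to PagerDuty's `critical` / `error` / `warning` / `info`, the function name `build_trigger_event`, the `sys-console` source label, and the routing key are assumptions made for the example, not the console's actual code.

```python
import json

# Public ingestion endpoint of the PagerDuty Events API v2.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

# Assumed mapping from console severity to PagerDuty payload.severity values.
SEVERITY_MAP = {"SEV1": "critical", "SEV2": "error", "SEV3": "warning", "SEV4": "info"}


def build_trigger_event(routing_key, incident_id, severity, summary, source="sys-console"):
    """Build a trigger event whose dedup_key is stable for a given incident."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # Same incident id -> same dedup_key, so re-sending does not open a second alert.
        "dedup_key": f"sys-incident:{incident_id}",
        "payload": {
            "summary": summary[:1024],  # PagerDuty caps the summary at 1024 characters
            "source": source,
            "severity": SEVERITY_MAP[severity],
        },
    }


# Placeholder routing key and incident id, for illustration only.
event = build_trigger_event("PLACEHOLDER_ROUTING_KEY", 42, "SEV2", "api errors for tenant-fr")
print(json.dumps(event, indent=2))
```

Sending the event would be an HTTP POST of this JSON body to `PAGERDUTY_EVENTS_URL`; the network call is left out so the sketch runs offline.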

Timeline updates go through `POST /sys/incidents/{id}/updates`, which writes append-only `SysIncidentUpdateEntry` rows carrying the author, status transitions, and a comms-sent summary, so investigators can reconstruct the response without reading Slack. Public status-page sync toggles visibility through `POST /sys/incidents/{id}/public`; `GET /public/status/active` returns a customer-safe projection for `status.petanque.life`, cached for 60 seconds so the status page never hammers the API. Post-mortem documents live externally (Notion, Confluence, Google Docs): the console intentionally does not host the document and only records its URL, so the audit trail can prove the post-mortem exists.
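One way to picture the customer-safe projection and its 60-second cache is the minimal sketch below. The class name `PublicStatusCache`, the list of public fields, the `public` flag on each row, and the rule that resolved incidents drop out of the active view are all assumptions for illustration; the real projection may expose different fields.

```python
import time

# Assumed set of fields that are safe to show to customers.
PUBLIC_FIELDS = ("id", "severity", "status", "summary", "started_at")
CACHE_TTL_SECONDS = 60


class PublicStatusCache:
    """Serve the public projection from memory, reloading at most once per TTL."""

    def __init__(self, load_incidents, clock=time.monotonic):
        self._load = load_incidents  # callable returning a list of incident dicts
        self._clock = clock          # injectable so the TTL can be tested
        self._expires_at = None
        self._value = None

    def get_active(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            rows = self._load()
            # Keep only public, unresolved incidents and strip internal fields.
            self._value = [
                {field: row[field] for field in PUBLIC_FIELDS}
                for row in rows
                if row.get("public") and row["status"] != "resolved"
            ]
            self._expires_at = now + CACHE_TTL_SECONDS
        return self._value
```

Injecting the clock keeps the cache deterministic in tests: advancing a fake clock past 60 seconds is enough to force a reload.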

The affected-users endpoint scans `ApiUsageEvent` for the outage window and returns the unique user count per tenant, so an operator can answer "how many people did this hit?" with real data rather than a guess. The SLA impact report emits one credit row per affected tenant against the 99.9 percent target, with credit tiers of 0 / 10 / 25 / 50 percent starting at outage durations of 0 / 45 / 240 / 720 minutes; the rows feed the F21.07 billing flow so credits land on the next invoice without manual arithmetic. Every action is audited, and each incident links back to the dashboard system map and the firehose for deep links during response.
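The affected-users scan can be sketched as a distinct count per tenant over the outage window. The field names `tenant_id`, `user_id`, and `occurred_at`, the half-open window, and the sample events (including the `tenant-be` tenant and the dates) are hypothetical; the actual `ApiUsageEvent` schema is not shown in this document.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def affected_users_by_tenant(events, window_start, window_end):
    """Count distinct users per tenant with at least one event in [start, end)."""
    users = defaultdict(set)
    for event in events:
        if window_start <= event["occurred_at"] < window_end:
            users[event["tenant_id"]].add(event["user_id"])
    return {tenant: len(user_ids) for tenant, user_ids in users.items()}


# Illustrative sample data only: a 38-minute window and four usage events.
start = datetime(2024, 1, 1, 9, 0)
end = start + timedelta(minutes=38)
events = [
    {"tenant_id": "tenant-fr", "user_id": "u1", "occurred_at": start + timedelta(minutes=1)},
    {"tenant_id": "tenant-fr", "user_id": "u1", "occurred_at": start + timedelta(minutes=5)},
    {"tenant_id": "tenant-fr", "user_id": "u2", "occurred_at": start + timedelta(minutes=9)},
    {"tenant_id": "tenant-be", "user_id": "u3", "occurred_at": start + timedelta(minutes=60)},
]
counts = affected_users_by_tenant(events, start, end)
print(counts)  # {'tenant-fr': 2}
```

Using a set per tenant means a user who made many calls during the window is counted once, which matches the "unique user count" wording above.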

Key capabilities

  • Active incident list with severity, affected tenants, responder, ETA
  • Create-incident wizard with PagerDuty/OpsGenie paging and dedup_key
  • Append-only timeline with author, status, and comms-sent summary
  • Public status-page sync with customer-safe projection (60s cache)
  • External post-mortem URL recorded for audit
  • Affected-users estimate scanned from real `ApiUsageEvent` data
  • SLA credit calculation (0/10/25/50% tiers at 0/45/240/720 minutes)
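
The SLA credit tiers listed above amount to a threshold lookup. The sketch below assumes each threshold is an inclusive lower bound (an outage of exactly 45 minutes earns 10 percent), which the document does not state explicitly; the function names and the row shape are also illustrative.

```python
from bisect import bisect_right

# Tier thresholds in outage minutes and the matching credit percentages.
TIER_MINUTES = (0, 45, 240, 720)
TIER_CREDIT_PERCENT = (0, 10, 25, 50)


def credit_percent(outage_minutes):
    """Return the credit tier for an outage, treating thresholds as inclusive."""
    if outage_minutes < 0:
        raise ValueError("outage_minutes must be non-negative")
    # bisect_right finds how many thresholds are <= outage_minutes.
    return TIER_CREDIT_PERCENT[bisect_right(TIER_MINUTES, outage_minutes) - 1]


def sla_credit_rows(affected_tenants, outage_minutes):
    """Emit one credit row per affected tenant, as the SLA impact report does."""
    percent = credit_percent(outage_minutes)
    return [
        {"tenant": tenant, "outage_minutes": outage_minutes, "credit_percent": percent}
        for tenant in affected_tenants
    ]


print(sla_credit_rows(["tenant-fr"], 38))
# [{'tenant': 'tenant-fr', 'outage_minutes': 38, 'credit_percent': 0}]
```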

In practice

The dashboard correlation engine fires an `incident_suggest` for `api × tenant-fr`. The on-call sys engineer clicks `Create incident`, picks SEV2, marks `api` and `tenant-fr` as affected, and submits. PagerDuty pages the on-call backend engineer in 12 seconds.

While the fix rolls out, she posts three timeline updates: `investigating`, `mitigation deployed`, `resolved`. She flips `public=true` so customers see the issue on `status.petanque.life`, then attaches a Notion post-mortem link the next morning. The affected-users endpoint reports 1,247 unique users hit during the 38-minute window; the SLA impact report emits a 0 percent credit row per affected tenant (38 minutes is under the 45-minute threshold), and finance is satisfied without manual calculation.

Features in this subsystem

| ID | Status | Feature |
|---|---|---|
| F21.09.01 | Shipped | Active incident list: severity, affected tenants, responder, started-at, ETA to resolution. ✅ PL-T129 |
| F21.09.02 | Shipped | Create incident: severity (SEV1/2/3/4), affected services, affected tenants, initial summary. Auto-pages on-call (PagerDuty Events API v2 / OpsGenie Alerts API v2, `dedup_key = sys-incident:<id>`). ✅ PL-T129 |
| F21.09.03 | Shipped | Incident timeline: append-only `SysIncidentUpdateEntry` with author, status transitions, and comms-sent summary. ✅ PL-T129 |
| F21.09.04 | Shipped | Public status-page sync: `POST /sys/incidents/{id}/public` toggles visibility; `GET /public/status/active` returns the customer-safe projection for `status.petanque.life`, cached 60 s. ✅ PL-T129 |
| F21.09.05 | Shipped | Post-mortem link: `POST /sys/incidents/{id}/post-mortem` stores an external URL (Notion/Confluence/Google Docs). ✅ PL-T129 |
| F21.09.06 | Shipped | Affected-users impact estimate: `GET /sys/incidents/{id}/affected-users` scans `ApiUsageEvent` for the outage window. ✅ PL-T129 |
| F21.09.07 | Shipped | SLA impact report: `GET /sys/incidents/{id}/sla-impact` emits one credit row per affected tenant (credit tiers 0 / 10 / 25 / 50 % at 0 / 45 / 240 / 720 outage minutes, 99.9 % target). ✅ PL-T129 |