Disaster Recovery & Restore Drill

F16.14 8 features Planned

At a glance

Documented disaster-recovery plan with RPO 4 h / RTO 8 h for Cosmos DB, automated monthly restore drill via GitHub Actions cron that exercises point-in-time restore on real Azure resources, smoke-test verification against production counts within ±1 %, a dedicated drill service principal with minimum-permission custom role, and rotation runbooks for every critical credential including Key Vault, Stripe and Azure Communication Services.

How it works

The DR plan in docs/engineering/architecture/07a-disaster-recovery.md is a scenario matrix — data corruption, region-down, credential leak, ransomware, delete-by-mistake — with per-scenario detection → containment → recovery → post-mortem flow and the exact az cosmosdb mongodb database restore commands that the on-call needs at 3 a.m. RPO is 4 hours, RTO is 8 hours, and the plan is reviewed against drill outcomes recorded in infrastructure/drill-history.md.

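The monthly restore drill runs as a GitHub Actions workflow scheduled with cron 0 4 1-7 * 1, intended to fire on the first Monday of each month at 04:00 UTC. Under POSIX cron semantics, which GitHub Actions schedules use, a job runs when either the day-of-month or the day-of-week field matches, so this expression on its own also triggers on every day from the 1st to the 7th and on every Monday; the workflow needs a date guard to restrict the drill to the first Monday. The drill uses Cosmos DB's point-in-time restore to create petanque-drill-<timestamp>, runs tools/dr-drill-smoke.py against the recovered database, and compares document counts in tenants, users, licenses, invoices and tenant_subscriptions against production with a ±1 % tolerance. It then deletes the temporary database and posts a PASS/FAIL report to Slack. Exit codes are explicit (0 PASS, 1 FAIL, 2 prereq-fail), so an ops dashboard or PagerDuty integration can react cleanly.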
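The count comparison and exit-code mapping described above can be sketched roughly as follows. This is not the contents of tools/dr-drill-smoke.py; the function names, the treatment of a missing count as a prereq failure, and the handling of an empty production collection are all assumptions made for illustration.

```python
from typing import Mapping

# Collections named in the drill description.
COLLECTIONS = ("tenants", "users", "licenses", "invoices", "tenant_subscriptions")
TOLERANCE = 0.01  # ±1 % relative delta

EXIT_PASS, EXIT_FAIL, EXIT_PREREQ = 0, 1, 2


def relative_delta(prod: int, restored: int) -> float:
    """Relative difference of the restored count against production."""
    if prod == 0:
        # Assumption: an empty production collection only matches an empty restore.
        return 0.0 if restored == 0 else float("inf")
    return abs(restored - prod) / prod


def evaluate(prod_counts: Mapping[str, int], restored_counts: Mapping[str, int]) -> int:
    """Map two sets of per-collection counts to a drill exit code."""
    # Assumption: a count that could not be collected is a prerequisite failure.
    missing = [c for c in COLLECTIONS if c not in prod_counts or c not in restored_counts]
    if missing:
        return EXIT_PREREQ
    for name in COLLECTIONS:
        if relative_delta(prod_counts[name], restored_counts[name]) > TOLERANCE:
            return EXIT_FAIL
    return EXIT_PASS
```

A caller would gather the counts from both databases, pass them to evaluate, and hand the result to sys.exit so the workflow step fails or passes accordingly.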

The drill runs under sp-petanque-dr-drill, a dedicated Azure service principal with a custom role Petanque DR Drill Operator scoped to exactly the Cosmos DB permissions needed for restore and read — nothing more. It is created through infrastructure/scripts/create-dr-drill-sp.sh so re-provisioning is reproducible. The drill never touches production data; the restored database is named with a timestamp and torn down at the end of every run.

Rotation of every critical credential has a documented runbook. Key Vault soft-delete and purge-protection are on, and the az keyvault secret backup/restore procedures are captured. When a Stripe key leaks, the rotation is az containerapp secret set followed by az containerapp update, so that the API picks up the new value without losing in-flight requests. Azure Communication Services connection-string rotation follows the same pattern with az communication regenerate-key. Every drill run also updates the RPO/RTO baseline, so the team can see drift over time and renegotiate the SLO on the basis of measured data rather than an aspirational spec.
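As a rough illustration of the Stripe rotation order, the sketch below only builds the two az command lines and a redacted copy for logging; it does not run them. The app name, resource group, secret name and the use of --revision-suffix to force a new revision are assumptions, not details taken from the runbook.

```python
from typing import List


def stripe_rotation_commands(app: str, resource_group: str, secret_name: str,
                             new_value: str, revision_suffix: str) -> List[List[str]]:
    """Return the two az invocations in the order the runbook describes."""
    return [
        # Step 1: store the new key as a Container App secret.
        ["az", "containerapp", "secret", "set",
         "--name", app, "--resource-group", resource_group,
         "--secrets", f"{secret_name}={new_value}"],
        # Step 2: update the app so a new revision picks up the secret
        # (the revision suffix is a hypothetical choice for this sketch).
        ["az", "containerapp", "update",
         "--name", app, "--resource-group", resource_group,
         "--revision-suffix", revision_suffix],
    ]


def redact(cmd: List[str]) -> List[str]:
    """Hide secret values so the command can be logged safely."""
    out = []
    for i, part in enumerate(cmd):
        if i > 0 and cmd[i - 1] == "--secrets" and "=" in part:
            out.append(part.split("=", 1)[0] + "=***")
        else:
            out.append(part)
    return out
```

Each returned list could be passed to subprocess.run(cmd, check=True); logging redact(cmd) instead of cmd keeps the new key out of the incident log.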

Key capabilities

DR runbook with scenario matrix (data corruption, region down, credential leak, ransomware, delete-by-mistake)
Documented RPO 4 h / RTO 8 h with exact az cosmosdb mongodb database restore commands
Automated monthly restore drill via GitHub Actions cron (first Monday 04:00 UTC)
Smoke-test script comparing tenants, users, licenses, invoices and tenant_subscriptions counts against production within ±1 %
Dedicated sp-petanque-dr-drill service principal with custom minimum-permission role
Documented Key Vault backup, Stripe key rotation and ACS connection-string rotation runbooks
RPO/RTO baseline auto-updated in infrastructure/drill-history.md after every drill
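The baseline update could look something like the sketch below. The actual layout of infrastructure/drill-history.md is not shown in this document, so the Markdown-table row format, the column choice and the function names are hypothetical.

```python
from datetime import date
from pathlib import Path

RTO_MINUTES = 8 * 60  # documented RTO of 8 hours


def history_row(run_date: date, result: str, restore_minutes: int, max_delta_pct: float) -> str:
    """Format one drill outcome as a table row (hypothetical column layout)."""
    headroom = RTO_MINUTES - restore_minutes  # minutes left before the RTO is breached
    return (f"| {run_date.isoformat()} | {result} | {restore_minutes} "
            f"| {max_delta_pct:.1f} | {headroom} |")


def append_history(path: Path, row: str) -> None:
    """Append the row to the history file, creating the file if needed."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(row + "\n")
```

Recording the headroom against the RTO alongside the raw duration makes drift visible at a glance when the file is reviewed.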

In practice

First Monday of the month, 04:00 UTC. The DR drill workflow kicks off, restores petanque-drill-2026-04-06 from a PITR point exactly four hours back, runs the smoke-test script and posts to Slack: PASS, deltas within 0.3 %, total restore + verify time 47 minutes — well inside the 8-hour RTO. The drill-history file updates and the temporary database is torn down before breakfast.
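The timing in this scenario can be sketched as a small planning function: a first-Monday check, a date-stamped database name, and a restore point four hours back. The function names and the date format of the database name are assumptions inferred from the example above, not taken from the workflow itself.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

RPO = timedelta(hours=4)  # documented RPO; the drill restores from this far back


def is_first_monday(moment: datetime) -> bool:
    """True only for a Monday that falls on day 1 to 7 of its month."""
    return moment.weekday() == 0 and 1 <= moment.day <= 7


def plan_drill(now: datetime) -> Optional[Tuple[str, str]]:
    """Return (database name, restore timestamp) or None if the drill should be skipped."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now = now.astimezone(timezone.utc)
    if not is_first_monday(now):
        return None
    db_name = f"petanque-drill-{now:%Y-%m-%d}"
    restore_point = (now - RPO).strftime("%Y-%m-%dT%H:%M:%SZ")
    return db_name, restore_point
```

A guard like is_first_monday is one way to make sure the drill body runs only once a month even if the schedule itself fires more often.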

Three weeks later a junior engineer accidentally drops a collection on staging. The on-call opens the runbook, picks delete-by-mistake, runs the documented restore command with a timestamp seven minutes before the drop, swaps the collection name back in, and the data is whole again. It works because they have done this exact dance, on real Azure resources, every month for a year.

Features in this subsystem

ID	Status	Features
F16.14.01	Shipped	DR runbook with scenario matrix (data-corruption, region-down, credential-leak, ransomware, delete-by-mistake). Per scenario: detection → containment → recovery → post-mortem. Exact az cosmosdb mongodb database restore commands. RPO 4 h / RTO 8 h. ✅ PL-T052
F16.14.02	Shipped	Automated monthly restore drill (GitHub Actions cron 0 4 1-7 * 1). Creates petanque-drill-<timestamp> via PITR, runs smoke queries, compares counts against prod (±1 %), deletes the temp DB, sends a Slack report. ✅ PL-T052
F16.14.03	Shipped	Smoke-test script (tools/dr-drill-smoke.py): queries tenants, users, licenses, invoices, tenant_subscriptions. Delta limit ±1 %. Exit code 0 = PASS, 1 = FAIL, 2 = prereq-fail. ✅ PL-T052
F16.14.04	Shipped	Service principal sp-petanque-dr-drill with custom role Petanque DR Drill Operator (minimal Cosmos DB permissions). Created via infrastructure/scripts/create-dr-drill-sp.sh. ✅ PL-T052
F16.14.05	Shipped	Key Vault backup routine documented (soft-delete + purge-protection, az keyvault secret backup/restore). ✅ PL-T052
F16.14.06	Shipped	Stripe key rotation procedure documented (az containerapp secret set + az containerapp update on credential leak). ✅ PL-T052
F16.14.07	Shipped	ACS connection-string rotation procedure documented (az communication regenerate-key + Container App update). ✅ PL-T052
F16.14.08	Shipped	RPO/RTO baseline in infrastructure/drill-history.md, updated automatically after every drill run. ✅ PL-T052