Petanque Life

Disaster Recovery & Restore Drill

F16.14 8 features Planned

At a glance

Documented disaster-recovery plan with RPO 4 h / RTO 8 h for Cosmos DB, automated monthly restore drill via GitHub Actions cron that exercises point-in-time restore on real Azure resources, smoke-test verification against production counts within ±1 %, a dedicated drill service principal with minimum-permission custom role, and rotation runbooks for every critical credential including Key Vault, Stripe and Azure Communication Services.

How it works

The DR plan in docs/engineering/architecture/07a-disaster-recovery.md is a scenario matrix — data corruption, region-down, credential leak, ransomware, delete-by-mistake — with per-scenario detection → containment → recovery → post-mortem flow and the exact az cosmosdb mongodb database restore commands that the on-call needs at 3 a.m. RPO is 4 hours, RTO is 8 hours, and the plan is reviewed against drill outcomes recorded in infrastructure/drill-history.md.

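The monthly restore drill runs as a GitHub Actions workflow on cron 0 4 1-7 * 1, intended to fire on the first Monday of each month at 04:00 UTC. Under POSIX cron rules a restricted day-of-month and a restricted day-of-week are combined with OR, so the expression on its own also matches every day from the 1st to the 7th and every Monday; the first-Monday behaviour depends on the workflow checking the date before it proceeds. The drill uses Cosmos DB's point-in-time restore to spin up petanque-drill-<timestamp>, runs tools/dr-drill-smoke.py over the recovered database, compares document counts in tenants, users, licenses, invoices and tenant_subscriptions against production within a ±1 % tolerance, deletes the temporary database and posts a Slack report with PASS/FAIL. Exit codes are explicit (0 PASS, 1 FAIL, 2 prereq-fail), so an ops dashboard or PagerDuty integration can react cleanly.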
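The count comparison and exit-code contract described above can be sketched as follows. This is a minimal illustration, not the contents of tools/dr-drill-smoke.py: the function names, the treatment of a missing collection as a prereq failure, and the sample counts are all assumptions, and the database queries are replaced by plain dictionaries.

```python
# Collections and tolerance as named in the drill description.
COLLECTIONS = ["tenants", "users", "licenses", "invoices", "tenant_subscriptions"]
TOLERANCE = 0.01  # +/- 1 %

EXIT_PASS, EXIT_FAIL, EXIT_PREREQ = 0, 1, 2


def relative_delta(prod: int, restored: int) -> float:
    """Relative difference of the restored count against production."""
    if prod == 0:
        return 0.0 if restored == 0 else float("inf")
    return abs(restored - prod) / prod


def evaluate(prod_counts: dict, restored_counts: dict) -> int:
    """Return 0 (PASS), 1 (FAIL) or 2 (prereq-fail) for one drill run."""
    # Assumption: a count that could not be collected is a prerequisite
    # failure, because no meaningful comparison is possible.
    for name in COLLECTIONS:
        if name not in prod_counts or name not in restored_counts:
            return EXIT_PREREQ
    for name in COLLECTIONS:
        if relative_delta(prod_counts[name], restored_counts[name]) > TOLERANCE:
            return EXIT_FAIL
    return EXIT_PASS


# Hypothetical counts; a real run would query both databases.
prod = {"tenants": 120, "users": 5400, "licenses": 300,
        "invoices": 10000, "tenant_subscriptions": 118}
restored = {"tenants": 120, "users": 5390, "licenses": 299,
            "invoices": 9980, "tenant_subscriptions": 118}
exit_code = evaluate(prod, restored)
```

A wrapper script would pass exit_code to sys.exit so that the workflow step, and anything watching it, sees the three outcomes as distinct process exit codes.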

The drill runs under sp-petanque-dr-drill, a dedicated Azure service principal with a custom role Petanque DR Drill Operator scoped to exactly the Cosmos DB permissions needed for restore and read — nothing more. It is created through infrastructure/scripts/create-dr-drill-sp.sh so re-provisioning is reproducible. The drill never touches production data; the restored database is named with a timestamp and torn down at the end of every run.
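One way to back the "never touches production data" guarantee in code is to refuse teardown of any database whose name does not look like a drill database. The sketch below assumes the timestamp is a plain UTC date, as in the name petanque-drill-2026-04-06 used elsewhere on this page; the real format may carry more precision, and both function names are hypothetical.

```python
import re
from datetime import datetime, timezone

DRILL_PREFIX = "petanque-drill-"
# Assumed layout: prefix followed by YYYY-MM-DD. Adjust if the real
# timestamp includes a time component.
_DRILL_NAME = re.compile(r"petanque-drill-\d{4}-\d{2}-\d{2}")


def drill_database_name(now: datetime) -> str:
    """Build the temporary database name for a drill started at `now`."""
    return DRILL_PREFIX + now.astimezone(timezone.utc).strftime("%Y-%m-%d")


def assert_safe_to_delete(name: str) -> None:
    """Raise unless `name` matches the drill naming pattern exactly."""
    if not _DRILL_NAME.fullmatch(name):
        raise ValueError(f"refusing to delete non-drill database: {name!r}")


name = drill_database_name(datetime(2026, 4, 6, 4, 0, tzinfo=timezone.utc))
assert_safe_to_delete(name)  # passes silently for a drill database
```

The guard is a second line of defence only; the scoped custom role remains the primary control.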

Every critical credential has a documented rotation runbook. Key Vault soft-delete and purge-protection are on, with az keyvault secret backup/restore procedures captured. After a credential leak, the Stripe key is rotated with az containerapp secret set followed by az containerapp update, so the API picks up the new value without dropping in-flight requests. Azure Communication Services connection-string rotation follows the same pattern with az communication regenerate-key. Every drill run updates the RPO/RTO baseline in infrastructure/drill-history.md, so the team can see drift over time and renegotiate the SLO from measured data rather than an aspirational spec.
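The baseline update could look roughly like the sketch below, which turns one drill's timings into a table row and checks them against the 4 h / 8 h targets. The layout of infrastructure/drill-history.md is not described here, so the row format, the DrillResult type and its field names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

RPO_TARGET = timedelta(hours=4)
RTO_TARGET = timedelta(hours=8)


@dataclass
class DrillResult:
    started: datetime        # when the drill workflow began
    restore_point: datetime  # PITR timestamp that was restored
    finished: datetime       # when verification completed
    passed: bool             # smoke-test outcome

    @property
    def restore_point_age(self) -> timedelta:
        return self.started - self.restore_point

    @property
    def restore_duration(self) -> timedelta:
        return self.finished - self.started


def history_row(r: DrillResult) -> str:
    """Format one row for the history file (assumed Markdown-style table)."""
    def minutes(td: timedelta) -> int:
        return int(td.total_seconds() // 60)

    within = r.restore_point_age <= RPO_TARGET and r.restore_duration <= RTO_TARGET
    cells = [
        r.started.strftime("%Y-%m-%d"),
        "PASS" if r.passed else "FAIL",
        f"{minutes(r.restore_point_age)} min",
        f"{minutes(r.restore_duration)} min",
        "yes" if within else "no",
    ]
    return "| " + " | ".join(cells) + " |"


start = datetime(2026, 4, 6, 4, 0, tzinfo=timezone.utc)
row = history_row(DrillResult(
    started=start,
    restore_point=start - timedelta(hours=4),
    finished=start + timedelta(minutes=47),
    passed=True,
))
print(row)
```

Recording raw minutes rather than a pass/fail flag alone is what makes drift visible from one month to the next.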

Key capabilities

  • DR runbook with scenario matrix (data corruption, region down, credential leak, ransomware, delete-by-mistake)
  • Documented RPO 4 h / RTO 8 h with exact az cosmosdb mongodb database restore commands
  • Automated monthly restore drill via GitHub Actions cron (first Monday 04:00 UTC)
  • Smoke-test script comparing tenants, users, licenses, invoices and tenant_subscriptions against production within ±1 %
  • Dedicated sp-petanque-dr-drill service principal with custom minimum-permission role
  • Documented Key Vault backup, Stripe key rotation and ACS connection-string rotation runbooks
  • RPO/RTO baseline auto-updated in infrastructure/drill-history.md after every drill
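The first-Monday schedule in the list above deserves a note on cron semantics. Under POSIX cron rules, when both the day-of-month and day-of-week fields are restricted, a job fires when either one matches, so 0 4 1-7 * 1 alone matches more days than the first Monday. A minimal sketch of the difference and of a date guard a workflow step could apply; both function names are illustrative and whether the actual workflow uses such a guard is not stated here.

```python
from datetime import date, timedelta


def cron_days_match(d: date) -> bool:
    """Day fields of `0 4 1-7 * 1` under POSIX OR semantics."""
    return 1 <= d.day <= 7 or d.weekday() == 0  # weekday() 0 is Monday


def is_first_monday(d: date) -> bool:
    """True only on the first Monday of the month."""
    return d.weekday() == 0 and 1 <= d.day <= 7


# Compare the two predicates over April 2026.
april = [date(2026, 4, 1) + timedelta(days=i) for i in range(30)]
cron_days = [d.day for d in april if cron_days_match(d)]
guarded_days = [d.day for d in april if is_first_monday(d)]
print(cron_days)     # days the raw expression would fire on
print(guarded_days)  # days that survive the guard
```

With the guard in place the workflow can exit early on every scheduled run except the one that falls on the first Monday.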

In practice

First Monday of the month, 04:00 UTC. The DR drill workflow kicks off, restores petanque-drill-2026-04-06 from a PITR point exactly four hours back, runs the smoke-test script and posts to Slack: PASS, deltas within 0.3 %, total restore + verify time 47 minutes — well inside the 8-hour RTO. The drill-history file updates and the temporary database is torn down before breakfast.

Three weeks later a junior engineer accidentally drops a collection on staging. The on-call opens the runbook, picks delete-by-mistake, runs the documented restore command against the production timestamp seven minutes before the drop, swaps the collection name back in, and the data is whole again — because they have done this exact dance, on real Azure resources, every month for a year.

Features in this subsystem

8
ID Status Features
F16.14.01 Shipped DR runbook with scenario matrix (data-corruption, region-down, credential-leak, ransomware, delete-by-mistake). Per scenario: detection → containment → recovery → post-mortem. Exact az cosmosdb mongodb database restore commands. RPO 4 h / RTO 8 h. ✅ PL-T052
F16.14.02 Shipped Automated monthly restore drill (GitHub Actions cron 0 4 1-7 * 1). Creates petanque-drill-<timestamp> via PITR, runs smoke queries, compares counts against prod (±1 %), deletes the temp DB, sends a Slack report. ✅ PL-T052
F16.14.03 Shipped Smoke-test script (tools/dr-drill-smoke.py): queries tenants, users, licenses, invoices, tenant_subscriptions. Delta limit ±1 %. Exit code 0 = PASS, 1 = FAIL, 2 = prereq failure. ✅ PL-T052
F16.14.04 Shipped Service principal sp-petanque-dr-drill with custom role Petanque DR Drill Operator (minimal Cosmos DB permissions). Created via infrastructure/scripts/create-dr-drill-sp.sh. ✅ PL-T052
F16.14.05 Shipped Key Vault backup routine documented (soft-delete + purge-protection, az keyvault secret backup/restore). ✅ PL-T052
F16.14.06 Shipped Stripe key rotation procedure documented (az containerapp secret set + az containerapp update on credential leak). ✅ PL-T052
F16.14.07 Shipped ACS connection-string rotation procedure documented (az communication regenerate-key + Container App update). ✅ PL-T052
F16.14.08 Shipped RPO/RTO baseline in infrastructure/drill-history.md, updated automatically after every drill run. ✅ PL-T052