Security Operations
In brief
Security Operations is the surface a `sys_security` operator lives in during a live threat: a cross-tenant active user-sessions panel with bulk revoke; a suspicious-activity queue covering impossible travel, new devices on high-value accounts, and brute force; a failed-login heatmap; API-key and OAuth-client rotation; a Service Principal inventory; idempotent secret-rotation runbooks for six secret classes; per-tenant IP allowlists; and a two-person-armed emergency kill-switch.
How it works
The active user-sessions panel lists every end-user session across all tenants, filterable by tenant, role, origin, and a `suspicious` flag. Bulk-revoke takes a list of session ids and tears them all down in one POST. All mutations require fresh-auth.
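As a rough illustration of the filter-then-revoke flow, the sketch below filters an in-memory list of sessions and builds a bulk-revoke request body. The `UserSession` type, the helper names, and the `session_ids` field are assumptions made for this example; the actual payload shape of `POST /sys/security/user-sessions/bulk-revoke` is not specified here.

```python
import json
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class UserSession:
    # Hypothetical shape; mirrors the filter dimensions named in the text.
    id: str
    tenant: str
    role: str
    origin: str
    suspicious: bool


def filter_sessions(
    sessions: Iterable[UserSession],
    tenant: Optional[str] = None,
    role: Optional[str] = None,
    origin: Optional[str] = None,
    suspicious: Optional[bool] = None,
) -> List[UserSession]:
    """Keep sessions matching every filter that was actually supplied."""
    wanted = {"tenant": tenant, "role": role, "origin": origin, "suspicious": suspicious}
    active = {k: v for k, v in wanted.items() if v is not None}
    return [s for s in sessions if all(getattr(s, k) == v for k, v in active.items())]


def bulk_revoke_body(sessions: Iterable[UserSession]) -> str:
    """Build a JSON body listing session ids (field name is an assumption)."""
    return json.dumps({"session_ids": [s.id for s in sessions]})
```

An operator tool might call `filter_sessions(all_sessions, tenant="t1", suspicious=True)` and send the resulting body in a single request, after passing the fresh-auth check.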
The suspicious-activity queue persists `SysSuspiciousLoginEvent` rows from three rules — `impossible_travel` (two logins from geo-distant IPs within an implausible window), `new_device_hva` (unknown device on a high-value account), and `brute_force` (failure spike per IP/user). Triage is a single POST with action `none/notify/revoke/lock`. The failed-login heatmap aggregates `sys_failed_login_samples` by IP, user, country, or hour.
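One common way to implement an impossible-travel rule is to compare the great-circle distance between two login locations against the time between them. The sketch below takes that approach; the `Login` type, the 900 km/h threshold, and the assumption that IPs have already been resolved to coordinates are all illustrative, not the platform's actual rule.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0  # assumed threshold, roughly airliner cruise speed


@dataclass(frozen=True)
class Login:
    # Hypothetical record: login time plus geo-resolved coordinates of the IP.
    at: datetime
    lat: float
    lon: float


def haversine_km(a: Login, b: Login) -> float:
    """Great-circle distance between two login locations in kilometres."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlam = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def is_impossible_travel(prev: Login, curr: Login, max_kmh: float = MAX_PLAUSIBLE_KMH) -> bool:
    """True when the speed implied by two consecutive logins exceeds max_kmh."""
    hours = abs((curr.at - prev.at).total_seconds()) / 3600
    distance = haversine_km(prev, curr)
    if hours == 0:
        # Simultaneous logins are only suspicious if the locations differ.
        return distance > 0
    return distance / hours > max_kmh
```

A real rule would also need to tolerate geo-IP inaccuracy and VPN exits, for example with a minimum-distance floor; that is left out here for brevity.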
API-key and OAuth-client management offers cross-tenant lists with rotate and revoke for both `ApiToken` and `M2MClient`; the raw secret is returned exactly once after rotation, never persisted in plaintext server-side. Service Principal management treats `SysServicePrincipal` rows as inventory, surfacing expiry and last-used; schedule-rotation queues a future rotation, and rotate executes it. Secret rotation is the centrepiece: idempotent stepwise runbooks for six secret classes (`db`, `jwt_keys`, `webhook`, `stripe`, `sendgrid`, `bankgirot`) with `advance/retry/cancel` verbs and a pluggable `RotationStepRunner`.
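The stepwise runbook model can be sketched as a small state machine that records how many steps have completed and only ever runs the next one. Everything below is a minimal illustration under assumptions: the `RotationStepRunner` signature, the status names, and the step names in the usage note are hypothetical, not the platform's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed signature: a runner executes one named step for one secret class
# and raises on failure.
RotationStepRunner = Callable[[str, str], None]


@dataclass
class RotationRunbook:
    secret_class: str
    steps: List[str]
    runner: RotationStepRunner
    completed: int = 0       # number of steps that finished successfully
    status: str = "pending"  # pending | failed | done | cancelled

    def advance(self) -> str:
        """Run the next pending step; no-op when done, cancelled, or failed."""
        if self.status in ("done", "cancelled", "failed"):
            return self.status
        if self.completed >= len(self.steps):
            self.status = "done"
            return self.status
        try:
            self.runner(self.secret_class, self.steps[self.completed])
        except Exception:
            # completed is left untouched, so retry resumes at the same step.
            self.status = "failed"
            return self.status
        self.completed += 1
        self.status = "done" if self.completed == len(self.steps) else "pending"
        return self.status

    def retry(self) -> str:
        """Re-run the step that failed; otherwise leave state unchanged."""
        if self.status != "failed":
            return self.status
        self.status = "pending"
        return self.advance()

    def cancel(self) -> str:
        """Stop the runbook unless it has already finished."""
        if self.status != "done":
            self.status = "cancelled"
        return self.status
```

For example, a runbook built as `RotationRunbook("stripe", ["create", "swap", "revoke_old"], runner)` that fails on `swap` keeps `completed == 1`, so `retry()` re-runs only `swap` and never repeats `create`. Persisting `completed` and `status` is what would let an interrupted run pick up where it stopped.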
Re-running a runbook resumes from the last completed step, so a partial rotation never leaves the platform stuck. Per-tenant IP allowlists set CIDR ranges that the tenant authentication path enforces — useful for federations on a stable office network. The emergency kill-switch is the nuclear option: arming requires two operators, each entering a single-use `SysKillSwitchApprovalCode` with a 10-minute TTL, and once armed the middleware returns `503 tenant_kill_switch_armed` for every non-sys tenant-scoped request until disarmed.
The kill-switch is the only way an on-call security engineer can stop the platform end-to-end without redeploying.
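The two-person arming rule and the resulting request gate can be sketched as follows. This is a simplified in-memory model: the `ApprovalCode` and `KillSwitch` classes, the binding of each code to one operator, and the `gate` helper are assumptions for illustration; only the 10-minute TTL, single-use codes, and the `503 tenant_kill_switch_armed` response come from the description above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Set, Tuple

CODE_TTL = timedelta(minutes=10)


@dataclass
class ApprovalCode:
    operator: str        # assumed: each code is bound to the operator it was issued to
    issued_at: datetime
    used: bool = False


@dataclass
class KillSwitch:
    codes: Dict[str, ApprovalCode] = field(default_factory=dict)
    approvers: Set[str] = field(default_factory=set)
    armed: bool = False

    def issue(self, code: str, operator: str, now: datetime) -> None:
        self.codes[code] = ApprovalCode(operator, now)

    def approve(self, code: str, operator: str, now: datetime) -> bool:
        """Consume a code; arm once two distinct operators have approved."""
        entry = self.codes.get(code)
        if entry is None or entry.used or entry.operator != operator:
            return False
        if now - entry.issued_at > CODE_TTL:
            return False
        entry.used = True
        self.approvers.add(operator)
        if len(self.approvers) >= 2:
            self.armed = True
        return True

    def disarm(self) -> None:
        self.armed = False
        self.approvers.clear()

    def gate(self, is_sys_request: bool) -> Tuple[int, str]:
        """Middleware-style check: block tenant-scoped traffic while armed."""
        if self.armed and not is_sys_request:
            return 503, "tenant_kill_switch_armed"
        return 200, "ok"
```

Tracking approvers as a set means one operator entering two codes still counts as a single approval, which is the property that makes the switch two-person.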
Key capabilities
- Cross-tenant active user-sessions panel with filter and bulk-revoke
- Suspicious-activity queue: impossible_travel, new_device_hva, brute_force; one-POST triage
- Failed-login heatmap by IP / user / country / hour
- API-key and OAuth-client rotation with single-time raw-secret return
- Service Principal inventory with expiry, last-used, schedule-rotation, rotate
- Idempotent six-class secret-rotation runbooks with advance/retry/cancel and resume-from-step
- Per-tenant IP allowlist (CIDR) enforced at tenant auth
- Two-person-armed emergency kill-switch with single-use codes (10-minute TTL) returning 503
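The per-tenant CIDR allowlist check listed above could look roughly like this, using Python's standard `ipaddress` module. The function name and the choice to treat an empty allowlist as "no restriction" are assumptions for the sketch; the platform's actual default may differ.

```python
import ipaddress
from typing import Iterable


def ip_allowed(client_ip: str, allowlist: Iterable[str]) -> bool:
    """True if client_ip falls inside any CIDR range in the allowlist.

    An empty allowlist is treated as unrestricted (an assumption).
    """
    networks = [ipaddress.ip_network(cidr, strict=False) for cidr in allowlist]
    if not networks:
        return True
    addr = ipaddress.ip_address(client_ip)
    # Membership is False when the address and network IP versions differ.
    return any(addr in net for net in networks)
```

For a tenant on a stable office network, a call such as `ip_allowed("203.0.113.7", ["203.0.113.0/24"])` would pass, while a login attempt from outside that range would be rejected during tenant authentication.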
In practice
The suspicious-activity queue flags three `impossible_travel` events for the same user inside 12 minutes. The on-call security engineer triages with action `revoke`, and all sessions for that user are torn down. She opens the failed-login heatmap and sees a clear cluster from one IP range.
She sets an IP allowlist for the user's tenant and rotates a leaked partner API key, copying the new secret once and sending it to the partner through a secure channel. Later in the week, a planned Stripe-key rotation walks through the runbook one step at a time; an interrupted run resumes cleanly from step 4. During a separate ransomware drill she arms the kill-switch with a colleague, every tenant request returns 503 within seconds, and disarming thirty minutes later restores normal operation.
Features in this subsystem
| ID | Status | Feature |
|---|---|---|
| F21.17.01 | Delivered | Active user-sessions panel: cross-tenant end-user sessions with filter (tenant, role, origin, suspicious) and bulk-revoke. GET /sys/security/user-sessions + DELETE /sys/security/user-sessions/{id} + POST /sys/security/user-sessions/bulk-revoke; fresh-auth on mutations. ✅ PL-T137 |
| F21.17.02 | Delivered | Suspicious-activity queue: three rules (impossible_travel, new_device_hva, brute_force) persist SysSuspiciousLoginEvent rows; triage via POST /sys/security/suspicious/{id}/triage with action none\|lock_user\|revoke_sessions. Implemented (PL-T137) |
| F21.17.03 | Delivered | Failed-login heatmap: GET /sys/security/failed-logins?group_by=ip\|email&window=24h aggregates SysFailedLoginSample for abuse detection. Implemented (PL-T137) |
| F21.17.04 | Delivered | API-key & OAuth-client management: cross-tenant list + rotate + revoke for ApiToken and M2MClient; raw secret returned once after rotation. Implemented (PL-T137) |
| F21.17.05 | Delivered | Service Principal management: SysServicePrincipal inventory with expiry, last-used, rotation scheduling; schedule-rotation + rotate endpoints mirror the manual Azure procedure in docs/engineering/security/service-principals.md. Implemented (PL-T137) |
| F21.17.06 | Delivered | Secret-rotation runbook launcher: idempotent stepwise rotation for six secret classes (db, jwt_keys, webhook, stripe, sendgrid, bankgirot) with advance/retry/cancel verbs; pluggable RotationStepRunner. Implemented (PL-T137) |
| F21.17.07 | Delivered | Per-tenant IP allowlist: GET/PUT /sys/tenants/{id}/ip-allowlist with CIDR list; enforced during tenant authentication. Implemented (PL-T137) |
| F21.17.08 | Delivered | Emergency kill-switch: two-person armed via single-use SysKillSwitchApprovalCode (10 min TTL); middleware returns 503 tenant_kill_switch_armed for every non-sys tenant-scoped request until disarmed. Implemented (PL-T137) |