Operations Runbook
Day-2 operations for running Nexis in production.
Daily checks
- Data plane healthy and accepting websocket connections.
- Control API healthy and serving token/key endpoints.
- Postgres healthy and migrations current.
- Error rates and reconnect rates within normal range.
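The daily checks above can be collected into a single pass/fail summary. This is a minimal sketch: the `CheckResult` shape and `failingChecks` helper are hypothetical, and the sample values are made up for illustration rather than taken from a real Nexis deployment.

```typescript
// Hypothetical shape for one daily check result; not part of Nexis itself.
type CheckResult = { name: string; ok: boolean; detail?: string };

// Returns the names of failing checks; an empty array means all checks passed.
function failingChecks(results: CheckResult[]): string[] {
  return results.filter((r) => !r.ok).map((r) => r.name);
}

// Illustrative results, one entry per item in the daily checklist.
const today: CheckResult[] = [
  { name: "data-plane websocket", ok: true },
  { name: "control API token/key endpoints", ok: true },
  { name: "postgres migrations current", ok: false, detail: "1 pending migration" },
  { name: "error and reconnect rates", ok: true },
];

const failed = failingChecks(today);
console.log(
  failed.length === 0 ? "all daily checks passed" : `failing: ${failed.join(", ")}`
);
```

How each `ok` value is obtained (health endpoint, SQL query, dashboard query) depends on your setup and is left out here.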
Key metrics to monitor
- handshake success/failure rate
- room join success/failure rate
- rpc error rate
- reconnect/resume success rate
- state resync frequency
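Most of these metrics reduce to a success rate over a pair of counters, plus a per-minute frequency for resyncs. The sketch below assumes you can read raw success/failure counts for a time window; the `Counter` type and both helper names are hypothetical.

```typescript
// Hypothetical counter pair; real metric names depend on your metrics setup.
type Counter = { success: number; failure: number };

// Success rate in [0, 1]. Returns null when there were no attempts, so an
// idle window is not misread as 0% or 100%.
function successRate(c: Counter): number | null {
  const total = c.success + c.failure;
  return total === 0 ? null : c.success / total;
}

// Events per minute over a window given in milliseconds.
function perMinute(count: number, windowMs: number): number {
  return count / (windowMs / 60000);
}

const handshakes: Counter = { success: 75, failure: 25 };
console.log(successRate(handshakes)); // 0.75
console.log(perMinute(30, 10 * 60000)); // 3 resyncs per minute over 10 minutes
```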
Alerts to add first
- sudden spike in handshake failures
- sudden spike in invalid signature errors
- high websocket disconnect rate
- control API token endpoint failure rate
- no data-plane metrics scrape for N minutes
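Two alert shapes cover the list above: a spike relative to a baseline, and a missing scrape. The following is a rough sketch; the default `factor` and `minRate` values are assumptions to tune, not recommended thresholds.

```typescript
// Flags a spike when the current rate exceeds the baseline by `factor`.
// `minRate` avoids alerting on tiny absolute rates.
function isSpike(
  current: number,
  baseline: number,
  factor: number = 3,
  minRate: number = 0.01
): boolean {
  return current >= minRate && current > baseline * factor;
}

// True when no scrape has been seen for more than `maxMinutes`.
function isScrapeStale(lastScrapeMs: number, nowMs: number, maxMinutes: number): boolean {
  return nowMs - lastScrapeMs > maxMinutes * 60000;
}

// Handshake failure rate jumped from 2% to 20%: treated as a spike.
console.log(isSpike(0.2, 0.02)); // true
// Last scrape was 6 minutes ago with N = 5: treated as stale.
console.log(isScrapeStale(0, 6 * 60000, 5)); // true
```

In practice these conditions usually live in the alerting system's own rule language; the functions only make the logic explicit.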
Key rotation procedure
- Create new project key.
- Start minting tokens with new key.
- Revoke old key after grace window.
- Confirm old-key tokens fail and new-key tokens succeed.
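The rotation steps above imply a window in which both keys are valid, followed by a point where only the new key is. This sketch models that with an in-memory key list; `ProjectKey`, `isKeyAccepted`, and the 24-hour grace window are all assumptions for illustration, not the Nexis control API.

```typescript
// Hypothetical in-memory model of a project key; not the Nexis schema.
type ProjectKey = { id: string; revokedAtMs: number | null };

// A token is accepted only if its signing key exists and has not been
// revoked at verification time.
function isKeyAccepted(keys: ProjectKey[], keyId: string, nowMs: number): boolean {
  const key = keys.find((k) => k.id === keyId);
  if (!key) return false;
  return key.revokedAtMs === null || nowMs < key.revokedAtMs;
}

const graceWindowMs = 24 * 60 * 60 * 1000; // assumed 24h grace window
const rotationStartMs = 0;
const keys: ProjectKey[] = [
  { id: "old", revokedAtMs: rotationStartMs + graceWindowMs },
  { id: "new", revokedAtMs: null },
];

// During the grace window both keys verify.
console.log(isKeyAccepted(keys, "old", rotationStartMs + 1000)); // true
// After revocation, old-key tokens fail and new-key tokens succeed.
console.log(isKeyAccepted(keys, "old", rotationStartMs + graceWindowMs)); // false
console.log(isKeyAccepted(keys, "new", rotationStartMs + graceWindowMs)); // true
```

The last two lines correspond to the final confirmation step of the procedure.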
Deploy checklist
- Run CI and integration smoke tests.
- Deploy control API and data plane.
- Verify health + metrics.
- Run synthetic join/send/patch check.
- Watch error dashboard for 15-30 minutes.
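The checklist is ordered, and a failed step should stop the rollout before later steps run. Below is a minimal sketch of that gating. `DeployStep` and `runChecklist` are hypothetical names, and the step bodies are stubs standing in for real CI, deploy, and probe commands.

```typescript
// Hypothetical deploy step: a name plus a function that reports success.
type DeployStep = { name: string; run: () => boolean };

// Runs steps in order and stops at the first failure, returning its name
// (or null when every step passed).
function runChecklist(steps: DeployStep[]): string | null {
  for (const step of steps) {
    if (!step.run()) return step.name;
  }
  return null;
}

// Stub steps that record when they run; replace with real checks.
const ran: string[] = [];
const step = (name: string, ok: boolean): DeployStep => ({
  name,
  run: () => {
    ran.push(name);
    return ok;
  },
});

const failedStep = runChecklist([
  step("CI and integration smoke", true),
  step("deploy control API and data plane", true),
  step("verify health + metrics", false),
  step("synthetic join/send/patch", true),
]);
console.log(failedStep === null ? "all steps passed" : `stopped at: ${failedStep}`);
```

Here the synthetic check never runs, because verification failed first.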
Incident triage quick steps
- Identify scope (all users / one project / one room type).
- Check control API token/key status.
- Check data-plane handshake and decode errors.
- Verify recent deploy or config changes.
- If needed: roll back to last known healthy release.
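The first triage step, identifying scope, can be approximated by grouping recent failures by project and room type. This is only a heuristic sketch: `FailureEvent`, `IncidentScope`, and `identifyScope` are hypothetical, and real events would come from your logs or error dashboard.

```typescript
// Hypothetical failure record extracted from logs.
type FailureEvent = { projectId: string; roomType: string };
type IncidentScope = "all users" | "one project" | "one room type" | "unknown";

// Heuristic: if every failure shares a project (or a room type), treat the
// incident as scoped to it; otherwise assume it is broad.
function identifyScope(events: FailureEvent[]): IncidentScope {
  if (events.length === 0) return "unknown";
  const projects = new Set(events.map((e) => e.projectId));
  const roomTypes = new Set(events.map((e) => e.roomType));
  if (projects.size === 1) return "one project";
  if (roomTypes.size === 1) return "one room type";
  return "all users";
}

console.log(
  identifyScope([
    { projectId: "p1", roomType: "chat" },
    { projectId: "p2", roomType: "chat" },
  ])
); // "one room type"
```

A small sample can make a broad incident look narrow, so treat the result as a starting point for the remaining steps.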
Typed Incident State
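// Type and variable names here are illustrative.
type IncidentState = 'investigating' | 'mitigating' | 'monitoring' | 'resolved';
let state: IncidentState = 'investigating';
state = 'mitigating';
state = 'monitoring';
state = 'resolved';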