Operations Runbook

Day-2 operations for running Nexis in production.

Daily checks

Data plane healthy and accepting websocket connections.
Control API healthy and serving token/key endpoints.
Postgres healthy and migrations current.
Error rates and reconnect rates within normal range.
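
The daily checks above can be scripted so that every check runs even when one of them fails. The following is a minimal sketch: the check names, the `DailyProbe` type, and both functions are hypothetical and are not part of Nexis. Each probe is supplied by the caller, for example an HTTP health request or a migration-status query.

```typescript
// Hypothetical names throughout; none of these are Nexis APIs.
type DailyCheck = 'data-plane' | 'control-api' | 'postgres' | 'error-rates';

interface DailyCheckResult {
  check: DailyCheck;
  ok: boolean;
  detail?: string;
}

// A probe resolves to true when its check passes.
type DailyProbe = () => Promise<boolean>;

// Runs all probes concurrently. A probe that throws is recorded as a failed
// check so that one broken probe does not hide the results of the others.
async function runDailyChecks(
  probes: Record<DailyCheck, DailyProbe>,
): Promise<DailyCheckResult[]> {
  const checks = Object.keys(probes) as DailyCheck[];
  return Promise.all(
    checks.map(async (check): Promise<DailyCheckResult> => {
      try {
        return { check, ok: await probes[check]() };
      } catch (err) {
        const detail = err instanceof Error ? err.message : String(err);
        return { check, ok: false, detail };
      }
    }),
  );
}

// Names of the checks that need attention.
function failedDailyChecks(results: DailyCheckResult[]): DailyCheck[] {
  return results.filter((r) => !r.ok).map((r) => r.check);
}
```

An empty result from `failedDailyChecks` means all four checks passed; anything else is the list to follow up on.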

Key metrics to monitor

handshake success/failure rate
room join success/failure rate
rpc error rate
reconnect/resume success rate
state resync frequency
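
Most of these metrics are rates derived from counters. As a sketch of that derivation, assuming the data plane exposes monotonically increasing counters (the field names below are invented for illustration), the rates for a window can be computed from two snapshots:

```typescript
// Hypothetical counter names; map them to whatever the data plane actually exports.
interface NexisCounters {
  handshakeOk: number;
  handshakeFailed: number;
  roomJoinOk: number;
  roomJoinFailed: number;
  rpcTotal: number;
  rpcErrors: number;
  resumeOk: number;
  resumeFailed: number;
  stateResyncs: number;
}

// A rate is null when the window contained no events of that kind.
interface WindowRates {
  handshakeFailureRate: number | null;
  roomJoinFailureRate: number | null;
  rpcErrorRate: number | null;
  resumeSuccessRate: number | null;
  resyncsPerMinute: number | null;
}

function fraction(part: number, total: number): number | null {
  return total > 0 ? part / total : null;
}

// Derives per-window rates from two snapshots of the same counters.
function ratesBetween(
  prev: NexisCounters,
  curr: NexisCounters,
  windowMinutes: number,
): WindowRates {
  const delta = (key: keyof NexisCounters): number => curr[key] - prev[key];
  return {
    handshakeFailureRate: fraction(
      delta('handshakeFailed'),
      delta('handshakeOk') + delta('handshakeFailed'),
    ),
    roomJoinFailureRate: fraction(
      delta('roomJoinFailed'),
      delta('roomJoinOk') + delta('roomJoinFailed'),
    ),
    rpcErrorRate: fraction(delta('rpcErrors'), delta('rpcTotal')),
    resumeSuccessRate: fraction(
      delta('resumeOk'),
      delta('resumeOk') + delta('resumeFailed'),
    ),
    resyncsPerMinute: windowMinutes > 0 ? delta('stateResyncs') / windowMinutes : null,
  };
}
```

Returning `null` for an empty window keeps "no traffic" distinct from "0% success". The sketch does not handle counter resets after a process restart, which would produce negative deltas.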

Alerts to add first

sudden spike in handshake failures
sudden spike in invalid signature errors
high websocket disconnect rate
control API token endpoint failure rate
no data-plane metrics scrape for N minutes
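
Two alert shapes recur in this list: a "sudden spike" relative to a baseline, and a missing scrape. The sketch below shows one way to express both as predicates. The multiplier and minimum rate are placeholder assumptions, not recommended values, and in practice these rules would usually live in the alerting system rather than in application code.

```typescript
// All thresholds here are placeholder assumptions; tune them against real traffic.
interface SpikeRule {
  factor: number;  // alert when the current rate exceeds baseline * factor
  minRate: number; // ignore spikes below this absolute rate
}

const defaultSpikeRule: SpikeRule = { factor: 3, minRate: 0.01 };

// "Sudden spike": the current rate is well above the recent baseline and
// also high enough in absolute terms to matter.
function isSpike(
  current: number,
  baseline: number,
  rule: SpikeRule = defaultSpikeRule,
): boolean {
  return current >= rule.minRate && current > baseline * rule.factor;
}

// "No scrape for N minutes": the last successful scrape is older than the allowed gap.
function isScrapeStale(
  lastScrapeMs: number,
  nowMs: number,
  maxGapMinutes: number,
): boolean {
  return nowMs - lastScrapeMs > maxGapMinutes * 60 * 1000;
}
```

The absolute floor (`minRate`) avoids paging on a jump from, say, 0.01% to 0.05% failures, which is a large relative change but rarely an incident.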

Key rotation procedure

Create new project key.
Start minting tokens with new key.
Revoke old key after grace window.
Confirm old-key tokens fail and new-key tokens succeed.
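
The rotation steps rely on an overlap: during the grace window both keys verify tokens, but only the new key mints them. The sketch below models that behaviour with a simplified, hypothetical key record; it is not the control API's actual schema or verification logic.

```typescript
// A simplified, hypothetical model of project keys.
interface ProjectKeyRecord {
  id: string;
  createdAtMs: number;
  revokedAtMs?: number; // set once revocation is scheduled or done
}

function isRevoked(key: ProjectKeyRecord, nowMs: number): boolean {
  return key.revokedAtMs !== undefined && nowMs >= key.revokedAtMs;
}

// Step 2: mint with the newest key that has not been revoked.
function mintingKey(
  keys: ProjectKeyRecord[],
  nowMs: number,
): ProjectKeyRecord | undefined {
  return keys
    .filter((k) => !isRevoked(k, nowMs))
    .sort((a, b) => b.createdAtMs - a.createdAtMs)[0];
}

// During the grace window both keys verify; after revocation only the new one does.
function verifies(keys: ProjectKeyRecord[], keyId: string, nowMs: number): boolean {
  const key = keys.filter((k) => k.id === keyId)[0];
  return key !== undefined && !isRevoked(key, nowMs);
}

// Step 3: revoke the old key at the end of the grace window.
function revokeAt(
  keys: ProjectKeyRecord[],
  keyId: string,
  atMs: number,
): ProjectKeyRecord[] {
  return keys.map((k) => (k.id === keyId ? { ...k, revokedAtMs: atMs } : k));
}
```

Step 4 of the procedure then corresponds to checking that `verifies` returns `false` for the old key and `true` for the new key once the grace window has ended.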

Deploy checklist

Run CI and the integration smoke tests.
Deploy control API and data plane.
Verify health and metrics.
Run synthetic join/send/patch check.
Watch error dashboard for 15-30 minutes.
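
The checklist is ordered: each step only makes sense if the previous one succeeded. A minimal sketch of a runner that enforces this is shown below. `DeployStep`, `DeployOutcome`, and `runDeployChecklist` are hypothetical names, and the body of each step (CI trigger, deploy command, synthetic check) is left to the caller.

```typescript
// Step names mirror the checklist; the step bodies are supplied by the caller.
interface DeployStep {
  name: string;
  run: () => Promise<void>; // rejects when the step fails
}

interface DeployOutcome {
  completed: string[];
  failedAt?: string;
  error?: string;
}

// Runs steps strictly in order and stops at the first failure, so a failed
// smoke test never leads to a deploy, and a failed deploy is never verified.
async function runDeployChecklist(steps: DeployStep[]): Promise<DeployOutcome> {
  const completed: string[] = [];
  for (const step of steps) {
    try {
      await step.run();
      completed.push(step.name);
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      return { completed, failedAt: step.name, error };
    }
  }
  return { completed };
}
```

When `failedAt` is set, `completed` shows how far the rollout got, which is useful input for deciding whether to roll back.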

Incident triage quick steps

Identify scope (all users / one project / one room type).
Check control API token/key status.
Check data-plane handshake and decode errors.
Check for recent deploys or config changes.
If needed: roll back to last known healthy release.
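
The first triage step, identifying scope, can be approximated from a sample of recent failures. The sketch below is a deliberately crude heuristic under the assumption that each failure can be tagged with a project and a room type; `FailureSample` and `identifyScope` are illustrative names, not part of Nexis.

```typescript
// Hypothetical shape of one sampled failure (e.g. from logs or error events).
interface FailureSample {
  projectId: string;
  roomType: string;
}

type IncidentScope =
  | { kind: 'none' }
  | { kind: 'one-project'; projectId: string }
  | { kind: 'one-room-type'; roomType: string }
  | { kind: 'all-users' };

// Narrowest scope that explains every sampled failure. Failures spread over
// several projects and several room types are treated as affecting all users.
function identifyScope(samples: FailureSample[]): IncidentScope {
  const first = samples[0];
  if (first === undefined) return { kind: 'none' };

  const sameProject = samples.every((s) => s.projectId === first.projectId);
  if (sameProject) return { kind: 'one-project', projectId: first.projectId };

  const sameRoomType = samples.every((s) => s.roomType === first.roomType);
  if (sameRoomType) return { kind: 'one-room-type', roomType: first.roomType };

  return { kind: 'all-users' };
}
```

A single-project scope points toward that project's keys and tokens, while a broad scope makes a recent deploy or config change the more likely cause. With very few samples the result is only a hint.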

Typed Incident State

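// incidentState is an illustrative name.
let incidentState: 'investigating' | 'mitigating' | 'monitoring' | 'resolved' = 'investigating';
incidentState = 'mitigating';
incidentState = 'monitoring';
incidentState = 'resolved';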
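
Typing the state as a union of the four values above catches misspelled states at compile time, but it does not restrict the order in which states are assigned. One possible extension, sketched here, adds a transition table. The allowed transitions are an assumption chosen for illustration: states advance in order, and an incident under monitoring may fall back to mitigating.

```typescript
type IncidentState = 'investigating' | 'mitigating' | 'monitoring' | 'resolved';

// Assumed transition rules; adjust to match the team's actual incident process.
const allowedNext: Record<IncidentState, readonly IncidentState[]> = {
  investigating: ['mitigating'],
  mitigating: ['monitoring'],
  monitoring: ['resolved', 'mitigating'],
  resolved: [],
};

function canTransition(from: IncidentState, to: IncidentState): boolean {
  return allowedNext[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move is not in the table.
function transition(from: IncidentState, to: IncidentState): IncidentState {
  if (!canTransition(from, to)) {
    throw new Error(`invalid incident transition: ${from} -> ${to}`);
  }
  return to;
}
```

For example, `transition('monitoring', 'mitigating')` succeeds under these rules, while `transition('resolved', 'investigating')` throws, which forces a reopened problem to be tracked as a new incident.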
