Improve resource usage alerting
Created by: caugustus-sourcegraph
Context
As a site admin, I am missing out on important infra-level alerts.
Deliverable
Review the existing provisioning alerts with an eye towards notifying site admins about underprovisioned setups.
For example:
- Add critical-level alerts for exceeding provisioning limits
- Add warnings for long-term downward trends
- CPU throttling
- OOM @sourcegraph/cloud-devops will implement this
- container restarts
- max connection errors from the databases @sourcegraph/cloud-devops will implement this
Improve the instructions on how to address memory limit issues: https://github.com/sourcegraph/sourcegraph/pull/34808/files#r863734505
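To make the first two example alerts concrete, here is a minimal sketch of the decision logic: a critical/warning classification when usage approaches a provisioned limit, and a warning when remaining headroom shrinks over a long window. This is not how the existing alerts are implemented. The function names, the 80%/95% thresholds, the per-step drop threshold, and the reading of "downward trend" as shrinking headroom are all assumptions for illustration.

```python
from statistics import mean

# Hypothetical thresholds, chosen only for illustration.
WARNING_RATIO = 0.80
CRITICAL_RATIO = 0.95


def classify_usage(used: float, limit: float) -> str:
    """Classify resource usage against a provisioned limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    if ratio >= CRITICAL_RATIO:
        return "critical"
    if ratio >= WARNING_RATIO:
        return "warning"
    return "ok"


def trend_slope(samples: list[float]) -> float:
    """Least-squares slope of evenly spaced samples, in units per sample step."""
    n = len(samples)
    if n < 2:
        return 0.0
    xs = range(n)
    x_mean = mean(xs)
    y_mean = mean(samples)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


def headroom_trend_alert(headroom: list[float], min_drop_per_step: float = 0.01) -> str:
    """Warn when free headroom (fraction of the limit) falls steadily over the window."""
    if trend_slope(headroom) <= -min_drop_per_step:
        return "warning"
    return "ok"
```

For example, `classify_usage(96, 100)` falls in the critical band under these thresholds, while a headroom series such as `[0.5, 0.45, 0.4, 0.35]` triggers the trend warning even though no single sample is near the limit.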
Are there any other performance metrics that could be monitored ("golden signals")?
Are there any metrics about site configuration errors?