Improve resource usage alerting

Created by: caugustus-sourcegraph

Context

As a site admin, I am missing out on important infra-level alerts.

Deliverable

Review the existing provisioning alerts with an eye towards notifying site admins about underprovisioned setups.

For example:

  • Add critical-level alerts for exceeding provisioning limits
  • Add warnings for long-term downward trends
  • CPU throttling
  • OOM (@sourcegraph/cloud-devops will implement this)
  • Container restarts
  • Max connection errors from the databases (@sourcegraph/cloud-devops will implement this)
  • Improve the instructions on how to address memory limit issues: https://github.com/sourcegraph/sourcegraph/pull/34808/files#r863734505
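
As a rough illustration of how two of the items above might be evaluated, the sketch below classifies a CPU throttling ratio against warning and critical thresholds and flags a sustained downward trend using a least-squares slope. The threshold values, function names, and the choice of a linear fit are all hypothetical and are not taken from Sourcegraph's monitoring definitions. The throttling inputs are assumed to be increases of cAdvisor's `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total` counters over some window.

```python
from typing import Sequence

# Hypothetical thresholds; real values would need tuning per deployment.
WARNING_THROTTLE_RATIO = 0.25
CRITICAL_THROTTLE_RATIO = 0.50


def throttle_level(throttled_periods: float, total_periods: float) -> str:
    """Classify the share of CPU scheduling periods in which a container was throttled."""
    if total_periods <= 0:
        return "ok"  # no scheduling data in the window, nothing to alert on
    ratio = throttled_periods / total_periods
    if ratio >= CRITICAL_THROTTLE_RATIO:
        return "critical"
    if ratio >= WARNING_THROTTLE_RATIO:
        return "warning"
    return "ok"


def trend_slope(samples: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced samples, in units per sample interval."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def downward_trend(samples: Sequence[float], min_drop_per_interval: float) -> bool:
    """True if the fitted slope falls at least min_drop_per_interval per sample."""
    return trend_slope(samples) <= -min_drop_per_interval
```

For example, `throttle_level(60, 100)` would be classified as critical under these assumed thresholds, and `downward_trend` could be fed daily samples of a metric such as free disk space. A linear fit is only one possible way to detect a long-term trend; the issue does not prescribe a method.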

Are there any other performance metrics that could be monitored ("golden signals")?

Are there any metrics about site configuration errors?