monitoring: alerting + dashboard overhaul
Created by: slimsag
Blog post entry
In recent Sourcegraph versions we introduced standardized Prometheus and Grafana monitoring as part of all Sourcegraph deployments. Since then, we have continued to invest in making it easier to understand the health of your instance.
In Sourcegraph 3.11, we introduce a new set of dashboards and high-level health metrics that make it easier to understand the health of your Sourcegraph instance at a glance.
These dashboards are built on a new set of combinatorial alerting metrics that we have introduced for each service. They let site admins query the number of critical and warning-class alerts firing across their Sourcegraph instance, as well as alerts on a per-service basis.
Alerting can also be configured easily through these metrics, so that you receive email, Slack, PagerDuty, and many other kinds of notifications when your instance is definitely, or could be, unhealthy.
Because these combinatorial alerting metrics are exposed as Prometheus metrics, you can even use the Prometheus API to query the list of critical and warning alerts that are firing on your instance, e.g. through curl (documentation for this is coming soon).
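Until that documentation lands, here is a minimal sketch of such a query in Python. It uses the standard Prometheus HTTP API instant-query endpoint (/api/v1/query) and its JSON response shape. The metric name alert_count, the level label, and the Prometheus address are hypothetical placeholders that this post does not confirm, so substitute the names your instance actually exposes.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

# ASSUMPTION: "alert_count" and the "level" label are hypothetical names used
# for illustration only; check your instance for the real metric and labels.
ALERT_METRIC = "alert_count"


def build_query_url(prometheus_url: str, level: str) -> str:
    """Build an instant-query URL selecting alerts of one level that are firing."""
    promql = f'{ALERT_METRIC}{{level="{level}"}} > 0'
    query_string = urllib.parse.urlencode({"query": promql})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + query_string


def firing_alerts(payload: dict) -> List[Dict[str, str]]:
    """Return the label sets of all samples with a value above zero."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    alerts = []
    for sample in payload["data"]["result"]:
        # Instant-vector samples look like {"metric": {...}, "value": [ts, "1"]}.
        _timestamp, value = sample["value"]
        if float(value) > 0:
            alerts.append(sample["metric"])
    return alerts


def fetch_firing_alerts(prometheus_url: str, level: str) -> List[Dict[str, str]]:
    """Query a live Prometheus server (the address is deployment-specific)."""
    url = build_query_url(prometheus_url, level)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return firing_alerts(json.load(resp))
```

Calling fetch_firing_alerts("http://localhost:9090", "critical") would return one label dictionary per firing critical alert, assuming Prometheus is reachable at that (hypothetical) address.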
In future versions, we will add more exhaustive alert definitions (today we only monitor a basic set of ~21 alerts over 9 services) and more detailed information on these dashboards, so please stay tuned!
Near future follow-ups:
- Dev docs.
- Describe in the docs how to configure alerting, including SMTP via env vars prefixed with GF_SMTP_.