generate Grafana dashboards and Prometheus alert rules from a single source of truth
Created by: slimsag
Today
We have three sets of dashboards:
- Non-generated & deprecated "internal" dashboards
- jsonnet-generated & deprecated "internal" dashboards
- New non-generated & formalized high-level-alerting dashboards
I am actively working towards us just having the third type.
Constraints of new dashboards
The new non-generated dashboards impose two important restrictions:
- Enforcing that we do not introduce arbitrary dashboards, i.e. that we only have one dashboard per service and that is it.
- Enforcing that we do not introduce panels or dashboards that monitor Prometheus metrics without defined alerts.
I started documenting why these two restrictions are so critically important, and thought a lot about how to enforce them.
When does generation make sense?
Previously, with our jsonnet generation of dashboards, I had stated I was opposed to generation because it often makes it harder than necessary to keep dashboards up to date (it requires learning which options are available, etc.). However, Keegan also made a good point at the time: generation can be used to enforce consistency across dashboards.
Expanding upon this idea, I have created a generator which enforces both policies and generates both the Grafana dashboards and Prometheus alerting rules for anything we want to observe.
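To illustrate how a generator can enforce both policies, here is a minimal sketch. The `Threshold`, `Observable`, and `Container` types and the `validateAll` function are hypothetical names chosen for this example and are not necessarily what the generator in this PR uses. The idea is that validation fails before anything is generated if a service is declared twice or if a panel has no alert.

```go
package main

import (
	"errors"
	"fmt"
)

// Threshold, Observable, and Container are hypothetical types for this sketch.
type Threshold struct {
	GreaterOrEqual float64 // alert fires when the query result is >= this value
}

type Observable struct {
	Name     string     // identifier for the panel and its alerts
	Query    string     // Prometheus query backing the panel
	Warning  *Threshold // nil means no warning alert
	Critical *Threshold // nil means no critical alert
}

type Container struct {
	Service     string // one Container produces exactly one dashboard
	Observables []Observable
}

// validate rejects any observable that has no alert defined.
func (o Observable) validate() error {
	if o.Name == "" || o.Query == "" {
		return errors.New("observable must have a name and a query")
	}
	if o.Warning == nil && o.Critical == nil {
		return fmt.Errorf("observable %q: no warning or critical alert defined", o.Name)
	}
	return nil
}

// validateAll rejects duplicate services (one dashboard per service) and
// any observable without an alert.
func validateAll(containers []Container) error {
	seen := map[string]bool{}
	for _, c := range containers {
		if seen[c.Service] {
			return fmt.Errorf("service %q: more than one dashboard defined", c.Service)
		}
		seen[c.Service] = true
		for _, o := range c.Observables {
			if err := o.validate(); err != nil {
				return fmt.Errorf("service %q: %w", c.Service, err)
			}
		}
	}
	return nil
}

func main() {
	// A panel with a query but no alert thresholds is rejected.
	bad := []Container{{
		Service:     "syntect-server",
		Observables: []Observable{{Name: "example_panel", Query: "up"}},
	}}
	fmt.Println(validateAll(bad)) // reports the missing alert
}
```

Because the check runs inside the generator, a definition that violates either policy never produces a dashboard or a rule file in the first place.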
To keep this PR small in scope, it only converts a single dashboard, the new syntect-server one, into the new format. Next, I will convert all other dashboards and completely eliminate all other types of dashboards (including jsonnet and hand-crafted ones).
How does it look?
These two files:
- docker-images/grafana/config/provisioning/dashboards/sourcegraph/syntect-server.json
- docker-images/prometheus/config/syntect_server_rules.yml
Are now replaced with just:
- observability/syntect_server.go
The above file automatically produces the relevant Prometheus alert definitions, as well as producing the relevant Grafana dashboard -- all of which are compatible with our high level alerting approach.
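As a rough sketch of the single-source-of-truth idea, the example below renders both a Prometheus alerting rule and a Grafana panel from one definition. Everything here is an assumption for illustration: the `Observable` type, the `promRule` and `grafanaPanel` functions, the `example_errors_total` metric, and the alert naming scheme are hypothetical, and the rule shown is a plain Prometheus alerting rule rather than the exact output of the generator in this PR. The panel's `thresholds` fields are modeled on Grafana's legacy graph panel and should be checked against the Grafana version in use.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Observable is a hypothetical definition of one monitored metric.
type Observable struct {
	Name    string  // identifier, used in the alert name
	Title   string  // panel title shown in Grafana
	Query   string  // Prometheus expression
	Warning float64 // a warning alert fires when Query >= Warning
}

// promRule renders a Prometheus alerting rule group (YAML) for the observable.
func promRule(service string, o Observable) string {
	return fmt.Sprintf(`groups:
  - name: %s
    rules:
      - alert: %s_%s
        expr: (%s) >= %v
        labels:
          severity: warning
`, service, service, o.Name, o.Query, o.Warning)
}

// grafanaPanel renders a graph panel (JSON) for the same observable, with a
// shaded region that starts at the same warning threshold.
func grafanaPanel(o Observable) ([]byte, error) {
	panel := map[string]interface{}{
		"type":    "graph",
		"title":   o.Title,
		"targets": []map[string]string{{"expr": o.Query}},
		"thresholds": []map[string]interface{}{{
			"value": o.Warning, "op": "gt", "colorMode": "warning", "fill": true,
		}},
	}
	return json.MarshalIndent(panel, "", "  ")
}

func main() {
	o := Observable{
		Name:    "errors",
		Title:   "Errors every 5m",
		Query:   "sum(increase(example_errors_total[5m]))", // hypothetical metric
		Warning: 5,
	}
	fmt.Print(promRule("syntect_server", o))
	panel, err := grafanaPanel(o)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(panel))
}
```

Since the query and the threshold appear only once in the definition, the panel and the alert cannot drift apart.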
Before:
After:
Note: The orange background region shows when a "warning" alert threshold has been met. There is also a red background for when a "critical" alert threshold has been met; it is just not shown above because no critical alerts were firing.
Developer productivity
Whenever you edit observability/*.go, the generator is run automatically and both Prometheus and Grafana are updated. This just required calling into some Grafana/Prometheus APIs correctly. So you can simply edit, save, and refresh in order to develop.
Fixes #9523 Fixes #9524 Fixes #9525