prometheus: bundle Alertmanager and siteConfig sync
Created by: bobheadxi
Context
For onlookers, some recent product-level context may help: this change is aiding in getting us to a point where we can push all site admins to turn on monitoring for their Sourcegraph instance by having them configure alerting through the Sourcegraph site configuration and having our monitoring stack automatically respond to that. This comment goes into a little more detail.
Changes
This PR completely removes the grafana-wrapper
and functionality for deploying observability.alerts
as Grafana alerts (effectively reverting it to vanilla Grafana), and replaces it with a new prom-wrapper
that bundles Prometheus + Alertmanager in sourcegraph/prometheus
alongside the same observability.alerts
features added in #10641
The rationale for this change is covered in the comments of https://github.com/sourcegraph/sourcegraph/issues/11452, but in a nutshell:
- Grafana's vision of alerting doesn't align with what we have (most notably, warning/critical for each metric), which makes #11452 very difficult without nontrivial workarounds. Alertmanager leverages our Prometheus
alert_count
rules so will give us the granularity we need, and routing makes it easy to send alerts to the appropriate notifiers (referred to in Alertmanager as "receivers") - Without the above, it's pretty much impossible to achieve #11210 (silencing of individual alerts) either in Grafana without some more nontrivial workarounds
- The initial implementation of the Grafana-based version of this feature (https://github.com/sourcegraph/sourcegraph/pull/11427, https://github.com/sourcegraph/sourcegraph/pull/11483) was already a bit of a hack, and involved duplicating dashboards to work around limitations, whereas this PR achieves the same thing with a lot less fenangling
Other advantages:
- Alertmanager configuration just involves writing to a file and hitting a reload endpoint, which simplifies the implementation (#11554 required adding some Grafana process management to reload some configuration)
- The frontend is already configured to talk to the Prometheus container for information about alerts as part of https://github.com/sourcegraph/sourcegraph/pull/10704, which makes responsibilities a bit clearer than the Grafana approach where it seemed like getting information about alerting would start to involve talking to different services
- Alertmanager has a nice UI which I've reverse-proxied on
prometheus:9090/alertmanager
- might come in handy
Caveats:
- This PR introduces some breaking changes in configuration to align it with Alertmanager's options - after discussion with @slimsag, we decided the number of admins taking advantage of the alerting introduced in 3.17 should be small enough that we should be fine just noting the breaking changes in the changelog. Going forward, I think it would be best if we don't align configuration directly with the options of our alerting provider like I did in #11427, and just add minimal
notifier
fields on a kind of need-to-have basis - Logs of 3 services in one container are difficult to read, but I'm unsure how often we would look at Prometheus logs in practice. I considered leveraging a library that can add prefixes to lines written to an
io.Writer
, but only managed to find one simple but seemingly abandoned library,go-prefix-writer
, so I've opted not to introduce the dependency for now - Somewhat duplicated definitions - to trigger Alertmanager, we need "real" Prometheus alerts, so now we have
alert_count
and Prometheus alerts
Best reviewed as a whole and ignoring all the Grafana changes
Issues
- closes #11452 - see above
- closes #11554 - this PR starts with the WIP stuff here (35fe4c6)
- closes #11612 - I'll follow up with a similar PR that accounts for this change for #11473
- closes #11663 - Grafana no longer has the
grafana-wrapper
, and is now pretty much just vanilla Grafana. This implementation loads the initial configuration in a goroutine, so it won't be a problem here - closes #11454 - based on #11554
- helps #11210 - the Alertmanager UI is reverse-proxied on
prometheus:9090/alertmanager
, so users will be able to port-forward the Prometheus service and go to http://localhost:9090/alertmanager/#/silences to silence alerts out of the box (though I think #11210 envisions a site-config based approach, so this doesn't fully resolve it yet)
TODOs
-
manually test each individual notifier - the documentation indicates I'm setting the fields correctly and the Alertmanager UI indicates that changes are applying as expected, but I've yet to successfully receive a notification. I'm currently doing this by manually triggering alerts via: curl -H "Content-Type: application/json" -d '[{"labels":{"alertname":"robertrobert","level":"critical"}}]' localhost:9090/alerts/api/v1/alerts
trying to figure out a way around this short of copy-pasting the definitionsopened https://github.com/prometheus/alertmanager/pull/2316 -
find a way to trigger a Sourcegraph alert to make sure the whole thing works (because the above is manual) => set low thresholds temporarily for some metrics. See samples in #alerts -
changelog item communicating breaking changes
Non-goals
- Linking alerts back to Grafana panels for the corresponding metric - this requires some work on our generator, and might not be very straightforward (seems only incremental IDs are available, and not UIDs, for panels)
- Amazing templates for each notification type - the most lacking one right now is the email template, which is just a plain-text thing at the moment, but I think those are best addressed in follow-ups. The goal for this iteration is to have them be consistent (link back to solutions docs, indicate basic metadata)