Skip to content

prometheus: bundle Alertmanager and siteConfig sync

Warren Gifford requested to merge distribution/prom-alerting into master

Created by: bobheadxi

Context

For onlookers, some recent product-level context may help: this change is aiding in getting us to a point where we can push all site admins to turn on monitoring for their Sourcegraph instance by having them configure alerting through the Sourcegraph site configuration and having our monitoring stack automatically respond to that. This comment goes into a little more detail.

Changes

This PR completely removes the grafana-wrapper and functionality for deploying observability.alerts as Grafana alerts (effectively reverting it to vanilla Grafana), and replaces it with a new prom-wrapper that bundles Prometheus + Alertmanager in sourcegraph/prometheus alongside the same observability.alerts features added in #10641 (closed)

The rationale for this change is covered in the comments of https://github.com/sourcegraph/sourcegraph/issues/11452, but in a nutshell:

  • Grafana's vision of alerting doesn't align with what we have (most notably, warning/critical for each metric), which makes #11452 (closed) very difficult without nontrivial workarounds. Alertmanager leverages our Prometheus alert_count rules so will give us the granularity we need, and routing makes it easy to send alerts to the appropriate notifiers (referred to in Alertmanager as "receivers")
  • Without the above, it's pretty much impossible to achieve #11210 (closed) (silencing of individual alerts) either in Grafana without some more nontrivial workarounds
  • The initial implementation of the Grafana-based version of this feature (https://github.com/sourcegraph/sourcegraph/pull/11427, https://github.com/sourcegraph/sourcegraph/pull/11483) was already a bit of a hack, and involved duplicating dashboards to work around limitations, whereas this PR achieves the same thing with a lot less fenangling

Other advantages:

  • Alertmanager configuration just involves writing to a file and hitting a reload endpoint, which simplifies the implementation (#11554 required adding some Grafana process management to reload some configuration)
  • The frontend is already configured to talk to the Prometheus container for information about alerts as part of https://github.com/sourcegraph/sourcegraph/pull/10704, which makes responsibilities a bit clearer than the Grafana approach where it seemed like getting information about alerting would start to involve talking to different services
  • Alertmanager has a nice UI which I've reverse-proxied on prometheus:9090/alertmanager - might come in handy

Caveats:

  • This PR introduces some breaking changes in configuration to align it with Alertmanager's options - after discussion with @slimsag, we decided the number of admins taking advantage of the alerting introduced in 3.17 should be small enough that we should be fine just noting the breaking changes in the changelog. Going forward, I think it would be best if we don't align configuration directly with the options of our alerting provider like I did in #11427, and just add minimal notifier fields on a kind of need-to-have basis
  • Logs of 3 services in one container are difficult to read, but I'm unsure how often we would look at Prometheus logs in practice. I considered leveraging a library that can add prefixes to lines written to an io.Writer, but only managed to find one simple but seemingly abandoned library, go-prefix-writer, so I've opted not to introduce the dependency for now
  • Somewhat duplicated definitions - to trigger Alertmanager, we need "real" Prometheus alerts, so now we have alert_count and Prometheus alerts

Best reviewed as a whole and ignoring all the Grafana changes

Issues

  • closes #11452 (closed) - see above
    • closes #11554 - this PR starts with the WIP stuff here (35fe4c6)
    • closes #11612 - I'll follow up with a similar PR that accounts for this change for #11473 (closed)
  • closes #11663 (closed) - Grafana no longer has the grafana-wrapper, and is now pretty much just vanilla Grafana. This implementation loads the initial configuration in a goroutine, so it won't be a problem here
  • closes #11454 (closed) - based on #11554
  • helps #11210 (closed) - the Alertmanager UI is reverse-proxied on prometheus:9090/alertmanager, so users will be able to port-forward the Prometheus service and go to http://localhost:9090/alertmanager/#/silences to silence alerts out of the box (though I think #11210 (closed) envisions a site-config based approach, so this doesn't fully resolve it yet)

TODOs

  • manually test each individual notifier - the documentation indicates I'm setting the fields correctly and the Alertmanager UI indicates that changes are applying as expected, but I've yet to successfully receive a notification. I'm currently doing this by manually triggering alerts via:
     curl -H "Content-Type: application/json" -d '[{"labels":{"alertname":"robertrobert","level":"critical"}}]' localhost:9090/alerts/api/v1/alerts
    update: the config structs provided by alertmanager never marshal any secrets, so they are not being written to disk correctly, which is very unfortunate (https://github.com/prometheus/alertmanager/issues/1985) - trying to figure out a way around this short of copy-pasting the definitions opened https://github.com/prometheus/alertmanager/pull/2316
  • find a way to trigger a Sourcegraph alert to make sure the whole thing works (because the above is manual) => set low thresholds temporarily for some metrics. See samples in #alerts
  • changelog item communicating breaking changes

Non-goals

  • Linking alerts back to Grafana panels for the corresponding metric - this requires some work on our generator, and might not be very straightforward (seems only incremental IDs are available, and not UIDs, for panels)
  • Amazing templates for each notification type - the most lacking one right now is the email template, which is just a plain-text thing at the moment, but I think those are best addressed in follow-ups. The goal for this iteration is to have them be consistent (link back to solutions docs, indicate basic metadata)

Merge request reports

Loading