Skip to content
Snippets Groups Projects

prometheus: bundle Alertmanager and siteConfig sync

Merged Administrator requested to merge distribution/prom-alerting into master

Created by: bobheadxi

Context

For onlookers, some recent product-level context may help: this change is aiding in getting us to a point where we can push all site admins to turn on monitoring for their Sourcegraph instance by having them configure alerting through the Sourcegraph site configuration and having our monitoring stack automatically respond to that. This comment goes into a little more detail.

Changes

This PR completely removes the grafana-wrapper and functionality for deploying observability.alerts as Grafana alerts (effectively reverting it to vanilla Grafana), and replaces it with a new prom-wrapper that bundles Prometheus + Alertmanager in sourcegraph/prometheus alongside the same observability.alerts features added in #10641

The rationale for this change is covered in the comments of https://github.com/sourcegraph/sourcegraph/issues/11452, but in a nutshell:

  • Grafana's vision of alerting doesn't align with what we have (most notably, warning/critical for each metric), which makes #11452 very difficult without nontrivial workarounds. Alertmanager leverages our Prometheus alert_count rules so will give us the granularity we need, and routing makes it easy to send alerts to the appropriate notifiers (referred to in Alertmanager as "receivers")
  • Without the above, it's pretty much impossible to achieve #11210 (silencing of individual alerts) either in Grafana without some more nontrivial workarounds
  • The initial implementation of the Grafana-based version of this feature (https://github.com/sourcegraph/sourcegraph/pull/11427, https://github.com/sourcegraph/sourcegraph/pull/11483) was already a bit of a hack, and involved duplicating dashboards to work around limitations, whereas this PR achieves the same thing with a lot less fenangling

Other advantages:

  • Alertmanager configuration just involves writing to a file and hitting a reload endpoint, which simplifies the implementation (#11554 required adding some Grafana process management to reload some configuration)
  • The frontend is already configured to talk to the Prometheus container for information about alerts as part of https://github.com/sourcegraph/sourcegraph/pull/10704, which makes responsibilities a bit clearer than the Grafana approach where it seemed like getting information about alerting would start to involve talking to different services
  • Alertmanager has a nice UI which I've reverse-proxied on prometheus:9090/alertmanager - might come in handy

Caveats:

  • This PR introduces some breaking changes in configuration to align it with Alertmanager's options - after discussion with @slimsag, we decided the number of admins taking advantage of the alerting introduced in 3.17 should be small enough that we should be fine just noting the breaking changes in the changelog. Going forward, I think it would be best if we don't align configuration directly with the options of our alerting provider like I did in #11427, and just add minimal notifier fields on a kind of need-to-have basis
  • Logs of 3 services in one container are difficult to read, but I'm unsure how often we would look at Prometheus logs in practice. I considered leveraging a library that can add prefixes to lines written to an io.Writer, but only managed to find one simple but seemingly abandoned library, go-prefix-writer, so I've opted not to introduce the dependency for now
  • Somewhat duplicated definitions - to trigger Alertmanager, we need "real" Prometheus alerts, so now we have alert_count and Prometheus alerts

Best reviewed as a whole and ignoring all the Grafana changes

Issues

  • closes #11452 - see above
    • closes #11554 - this PR starts with the WIP stuff here (35fe4c6)
    • closes #11612 - I'll follow up with a similar PR that accounts for this change for #11473
  • closes #11663 - Grafana no longer has the grafana-wrapper, and is now pretty much just vanilla Grafana. This implementation loads the initial configuration in a goroutine, so it won't be a problem here
  • closes #11454 - based on #11554
  • helps #11210 - the Alertmanager UI is reverse-proxied on prometheus:9090/alertmanager, so users will be able to port-forward the Prometheus service and go to http://localhost:9090/alertmanager/#/silences to silence alerts out of the box (though I think #11210 envisions a site-config based approach, so this doesn't fully resolve it yet)

TODOs

  • manually test each individual notifier - the documentation indicates I'm setting the fields correctly and the Alertmanager UI indicates that changes are applying as expected, but I've yet to successfully receive a notification. I'm currently doing this by manually triggering alerts via:
     curl -H "Content-Type: application/json" -d '[{"labels":{"alertname":"robertrobert","level":"critical"}}]' localhost:9090/alerts/api/v1/alerts
    update: the config structs provided by alertmanager never marshal any secrets, so they are not being written to disk correctly, which is very unfortunate (https://github.com/prometheus/alertmanager/issues/1985) - trying to figure out a way around this short of copy-pasting the definitions opened https://github.com/prometheus/alertmanager/pull/2316
  • find a way to trigger a Sourcegraph alert to make sure the whole thing works (because the above is manual) => set low thresholds temporarily for some metrics. See samples in #alerts
  • changelog item communicating breaking changes

Non-goals

  • Linking alerts back to Grafana panels for the corresponding metric - this requires some work on our generator, and might not be very straightforward (seems only incremental IDs are available, and not UIDs, for panels)
  • Amazing templates for each notification type - the most lacking one right now is the email template, which is just a plain-text thing at the moment, but I think those are best addressed in follow-ups. The goal for this iteration is to have them be consistent (link back to solutions docs, indicate basic metadata)

Merge request reports

Merged by avatar (Jul 5, 2025 6:15pm UTC)

Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Created by: slimsag

    This is excellent, thanks for the write-up! I will take a look at the implementation and try this out tomorrow.

  • Created by: bobheadxi

    [...] but I've yet to successfully receive a notification. I'm going this by [???]

    It looks like your message got cut off here

    I've fixed the point now :)

  • Created by: slimsag

    find a way to trigger a Sourcegraph alert to make sure the whole thing works (because the above is manual)

    It would be nice to have this happen automatically after saving changes to the alerting configuration - or even just a GraphQL query one can run in e.g. sourcegraph.com/api/console (but neither of this is needed for this PR and I don't want to increase its scope)

    One easy way to do it would be to declare a a Prometheus metric like test_alert and define alerts on that through our regular monitoring stack, then simply increment that metric whenever you want it to fire

  • Created by: uwedeportivo

    this is very cool. i'll take a closer look tomorrow.

  • Created by: unknwon

    Do you think cloud teams should review this PR? (come here via code owner)

  • Created by: slimsag

    @unknwon I would say not necessary as Distribution owns this - but feel free to review if you like as usual of course. Will make sure CODEOWNERS file reflects this.

  • Created by: tsenart

    Can we dog-food this for Sourcegraph Cloud?

  • Created by: bobheadxi

    Can we dog-food this for Sourcegraph Cloud?

    Yep! Once I finish this up, I'll update https://github.com/sourcegraph/deploy-sourcegraph-dot-com/pull/2839 for Cloud (Cloud == dot-com right?)

  • Created by: davejrt

    Will take a closer look today and review

  • Created by: codecov[bot]

    Codecov Report

    Merging #11832 into master will not change coverage. The diff coverage is n/a.

    @@           Coverage Diff           @@
    ##           master   #11832   +/-   ##
    =======================================
      Coverage   47.76%   47.76%           
    =======================================
      Files        1401     1401           
      Lines       79917    79917           
      Branches     6726     6726           
    =======================================
      Hits        38172    38172           
      Misses      38159    38159           
      Partials     3586     3586           
    Flag Coverage Δ
    #go 51.94% <0.00%> (ø)
    #storybook 10.11% <0.00%> (ø)
    #typescript 36.61% <0.00%> (ø)
    #unit 47.36% <0.00%> (ø)
  • Created by: codecov[bot]

    Codecov Report

    Merging #11832 into master will not change coverage. The diff coverage is n/a.

    @@           Coverage Diff           @@
    ##           master   #11832   +/-   ##
    =======================================
      Coverage   50.07%   50.07%           
    =======================================
      Files        1517     1517           
      Lines       88832    88832           
      Branches     6769     6769           
    =======================================
      Hits        44483    44483           
      Misses      40403    40403           
      Partials     3946     3946           
    Flag Coverage Δ
    #go 54.50% <0.00%> (ø)
    #storybook 10.74% <0.00%> (ø)
    #typescript 36.58% <0.00%> (ø)
    #unit 49.69% <0.00%> (ø)
  • Created by: keegancsmith

    That PR description is a thing of beauty. Thank you so much.

Please register or sign in to reply
Loading