
monitoring: migrate out-of-band alerts to the generator

Created by: bobheadxi

Migrate all out-of-band alert rules defined by deploy-sourcegraph to the generator, such that each alert now has a panel as part of a service dashboard like all our other alerts, with a few changes:

  • alerts that previously spanned multiple services are now defined per-service, as per the monitoring pillars (enforced by the generator), since a multi-service alert would require a multi-service dashboard
  • k8s monitoring now lives in a new group (currently only pod availability)

And a few additions:

  • go_gc_duration to go along with go_goroutines, plus a new "golang runtime monitoring" group for relevant services (a sketch follows these lists)
  • an entire prometheus + alertmanager dashboard (to go with prometheus_metrics_bloat)

And a few exceptions:

  • gitserver alerts were not migrated (seems captured by existing gitserver observables)
  • k8s node disk space remaining alerts were not migrated (seems nonfunctional - see thread)
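
As an illustration of what a migrated alert looks like once it lives in the generator, here is a minimal sketch of the new "golang runtime monitoring" group for a single service. The Group/Observable/Alert types below are illustrative stand-ins rather than the generator's real definitions (which live under monitoring/ and differ in detail), and the gitserver job matcher and thresholds are made up for the example:

```go
package main

import "fmt"

// Illustrative stand-ins for the generator's types; the real definitions
// live under monitoring/ and differ in detail.
type Alert struct{ GreaterOrEqual float64 }

type Observable struct {
	Name        string
	Description string
	Query       string // PromQL; drives both the dashboard panel and the alert rule
	Warning     Alert
}

type Group struct {
	Title       string
	Observables []Observable
}

func main() {
	// Sketch of the new "golang runtime monitoring" group for one service
	// (gitserver is just an example service; thresholds are made up).
	golang := Group{
		Title: "Golang runtime monitoring",
		Observables: []Observable{
			{
				Name:        "go_goroutines",
				Description: "maximum active goroutines",
				Query:       `max by(instance) (go_goroutines{job=~".*gitserver"})`,
				Warning:     Alert{GreaterOrEqual: 10000},
			},
			{
				Name:        "go_gc_duration_seconds",
				Description: "maximum go garbage collection duration",
				Query:       `max by(instance) (go_gc_duration_seconds{job=~".*gitserver"})`,
				Warning:     Alert{GreaterOrEqual: 2},
			},
		},
	}
	fmt.Printf("group %q defines %d observables\n", golang.Title, len(golang.Observables))
}
```

Each observable drives both a panel on the service's dashboard and a generated alert rule, which is what keeps every migrated alert visible to site admins.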

Closes https://github.com/sourcegraph/sourcegraph/issues/12117

Try it out

  1. kubectl port-forward svc/prometheus 9090:30090 -n prod
  2. Check out monitoring/migrate-out-of-band-alerts
  3. go generate ./monitoring
  4. ./dev/grafana.sh
  5. http://localhost:3370

TODOs

  • A few alerts use a Prometheus `for` clause, so I'm going to complete #12336 first and then add the appropriate `for` parameters here (see the sketch after this list)
  • Write up possible solutions for alerts that might have some
  • Improve panel options where possible
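
For reference, a Prometheus `for` clause makes an alert fire only once its condition has held continuously for the given duration. Below is a purely hypothetical sketch of how such a parameter might attach to an alert definition after #12336; the `For` field is invented for illustration and says nothing about the actual API:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical: a threshold paired with a Prometheus-style `for` duration,
// meaning the condition must hold this long before the alert fires.
// The real API shape is up to #12336.
type Alert struct {
	GreaterOrEqual float64
	For            time.Duration
}

func main() {
	a := Alert{GreaterOrEqual: 10000, For: 10 * time.Minute}
	fmt.Printf("fire when the value stays >= %.0f for %s\n", a.GreaterOrEqual, a.For)
}
```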

Merge request reports

Approval is optional

Merged (Aug 27, 2025 7:49am UTC)

Merge details

  • Changes merged into master with 54800c2a.
  • Deleted the source branch.

Activity

  • Created by: pecigonzalo

    previously multi-service alerts are now defined per-service as per monitoring pillars

    Is this the 2nd pillar? I don't think it applies here; I believe this was also closely related to defining alerts in Grafana and the mapping between alerts and dashboards (cc/ @slimsag). Unless we need different thresholds for each, I don't think we need to define per-service alerts/recording rules. wdyt?

    Related: https://www.robustperception.io/undoing-the-benefits-of-labels

  • Created by: bobheadxi

    @pecigonzalo to have alerts span multiple services, we'd need a multi-service dashboard. We could forgo the dashboard entirely and make the generator capable of creating alerts without dashboards at all, but I'm not sure how I feel about that - it is not explicitly stated in the pillars, but I like the idea that each alert is associated with a specific panel for a specific service so that it is useful to site admins.

    The monitoring dashboards here are also similar to our container monitoring/provisioning monitoring dashboards, where each service gets its own dashboard

  • Created by: pecigonzalo

    to have alerts span multiple services, we'd need a multi-service dashboard.

    Is this a rule set by our current generator? As I interpreted the pillar, it defines that a graph should have an associated alert, but not that an alert necessarily needs a graph.

    If we want to create a dashboard per service that shows service alerts, we should not need to split alerts into multiple alert rules; I believe we could use one of the labels to filter on each graph.
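
    To make the label-based alternative concrete, here is a minimal editorial sketch (the queries are illustrative, not the actual rules) contrasting per-service rules with a single rule that keeps the job label so one graph could be filtered per service:

```go
package main

import "fmt"

func main() {
	// (a) Per-service rules, as the generator currently requires:
	// one query (and one dashboard panel) per service.
	perService := map[string]string{
		"frontend":  `sum(go_goroutines{job="frontend"})`,
		"gitserver": `sum(go_goroutines{job="gitserver"})`,
	}

	// (b) A single rule that keeps the job label, so one graph could be
	// filtered per service instead of duplicating the rule.
	singleRule := `sum by(job) (go_goroutines)`

	for svc, query := range perService {
		fmt.Printf("per-service rule (%s): %s\n", svc, query)
	}
	fmt.Println("single labelled rule:", singleRule)
}
```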

  • Created by: bobheadxi

    As I interpreted the pillar, it defines that a graph should have an associated alert, but not that an alert necessarily needs a graph.

    I think requiring a dashboard per alert makes sense as well, otherwise what we are alerting on is essentially invisible. For example, our disk space alerts have been nonfunctional without an obvious way to realize that (you can only tell what we are alerting on by going into config/prometheus itself and then diving into the specific metric).

    Would like to hear what @slimsag thinks about this and defer to his judgement :)

    If we want to create a dashboard per service that shows service alerts, we should not need to split alerts into multiple alert rules; we should be able to use one of the labels to filter on each graph.

    This is definitely an improvement that could be made! However, I think this is an implementation detail that is not within the scope of this PR to fix (since such an improvement would apply to many alerts we have) - I'm focusing on simply moving the alerts here for now

  • Created by: pecigonzalo

    I think requiring a dashboard per alert makes sense as well

    I agree that we should link to a relevant dashboard, graph, and/or document; my comment is about what the monitoring pillars require as the reason why we are splitting into an alert rule per service.

    This is definitely an improvement that could be made! However, I think this is an implementation detail that is not within the scope of this PR to fix (since such an improvement would apply to many alerts we have) - I'm focusing on simply moving the alerts here for now

    I understood from "previously multi-service alerts are now defined per-service as per monitoring pillars, since a multi-service alert would require multi-service dashboards" that we were implementing that split into multiple per-service alerts here.

  • Created by: bobheadxi

    Clarified in meeting that this is because of requirements imposed by the generator

  • Created by: slimsag

    Sounds like the confusion here was cleared up? If not let me know.

    You can find why I believe we should only allow 1 dashboard per service here: https://about.sourcegraph.com/handbook/engineering/distribution/observability/monitoring_pillars#faq-why-can-t-i-create-an-ad-hoc-dashboard-e-g-an-http-dashboard

  • Created by: bobheadxi

    Checked each dashboard using the steps described in the PR

  • Created by: codecov[bot]

    Codecov Report

    Merging #12391 into master will decrease coverage by 0.03%. The diff coverage is 71.15%.

    @@            Coverage Diff             @@
    ##           master   #12391      +/-   ##
    ==========================================
    - Coverage   50.61%   50.58%   -0.04%     
    ==========================================
      Files        1423     1417       -6     
      Lines       80436    79283    -1153     
      Branches     6838     6552     -286     
    ==========================================
    - Hits        40711    40103     -608     
    + Misses      36169    35728     -441     
    + Partials     3556     3452     -104     
    Flag           Coverage Δ
    #go            51.97% <71.27%> (-0.39%) :arrow_down:
    #integration   24.25% <0.00%> (+0.10%) :arrow_up:
    #storybook     12.63% <0.00%> (?)
    #typescript    46.85% <0.00%> (+0.86%) :arrow_up:
    #unit          47.20% <71.15%> (-0.39%) :arrow_down:

    Flags with carried forward coverage won't be shown.

    Impacted Files                                       Coverage Δ
    browser/src/shared/code-hosts/github/codeHost.ts     60.40% <0.00%> (-0.41%) :arrow_down:
    cmd/frontend/backend/site_admin.go                     0.00% <0.00%> (ø)
    cmd/frontend/graphqlbackend/campaigns.go               0.00% <0.00%> (ø)
    cmd/frontend/graphqlbackend/json.go                   47.36% <0.00%> (-5.58%) :arrow_down:
    cmd/frontend/graphqlbackend/saved_searches.go         33.73% <0.00%> (ø)
    cmd/frontend/graphqlbackend/search.go                 66.27% <0.00%> (-0.68%) :arrow_down:
    cmd/frontend/graphqlbackend/search_alert.go           28.36% <0.00%> (-0.79%) :arrow_down:
    cmd/frontend/graphqlbackend/search_results.go         43.55% <0.00%> (+0.16%) :arrow_up:
    cmd/frontend/internal/app/debugproxies/scanner.go     38.20% <0.00%> (-1.34%) :arrow_down:
    cmd/query-runner/email.go                              0.00% <0.00%> (ø)
    ... and 206 more