monitoring: simple support for alert rule 'for'

Warren Gifford requested to merge monitoring/support-alert-rule-for into master

Created by: bobheadxi

Adds simple support for the for: duration parameter of Prometheus's native alerts by:

  • putting the alert query directly on the alert: rule
  • making record: alert_count query ALERTS (so that the results remain consistent, even with for)
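For illustration, a rule pair with for: set might look roughly like the sketch below. The 5m duration is a made-up value, not something the generator emits today, and the description label is omitted for brevity; the names and expressions are taken from the generated output in this PR.

```yaml
- alert: critical_gitserver_running_git_commands
  # Hypothetical duration: the expression has to keep returning a result
  # for 5 minutes before the alert moves from pending to firing.
  for: 5m
  labels:
    alert_type: builtin
    level: critical
    name: running_git_commands
    service_name: gitserver
  expr: max(((((max(src_gitserver_exec_running)) / 100) OR on() vector(0)) >= 0) OR on() vector(1)) > 0
- record: alert_count
  labels:
    alert_type: builtin
    level: critical
    name: running_git_commands
    service_name: gitserver
  # Reads Prometheus's own ALERTS series instead of re-evaluating the query,
  # so the recorded value follows the alert rule, including its for: window.
  expr: max(ALERTS{alertname="critical_gitserver_running_git_commands"} OR on() vector(0))
```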

This might help us resolve some flaky alerts (e.g. #12158 (closed)) and is needed to ensure we can migrate our out-of-band alerts exactly (since some do use for - see #12391).

Also adds a new label, alert_type: builtin, to indicate generated alerts.

Closes https://github.com/sourcegraph/sourcegraph/issues/12336

Considerations

Keeping the 0-or-1 query

I'm calling all the min-ing and max-ing stuff part of generating a "0-or-1 query". I originally wanted to change the generator so that we get a clean alert rule that simply renders some_value > my_threshold or similar, since the operations needed to get a clean 0 or 1 are very difficult to parse (see below). However, this plays poorly with our support for DataMayBeNaN and DataMayNotExist, which depends on the min/max stuff, and I'm currently unable to find a way to make these two options work without the 0-or-1 query, so I've opted to keep this PR simple and just swap alert and alert_count.
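To show why the expression is hard to parse, here is a rough annotated reading of the alert expression from the generated output in this PR, reformatted across lines. The comments are an interpretation of standard PromQL semantics, not generator output, and labels are omitted for brevity.

```yaml
- alert: critical_gitserver_running_git_commands
  expr: |
    max(
      (
        (
          # Divide the observed value by 100, the number from the alert
          # description ("100+ running git commands").
          (max(src_gitserver_exec_running) / 100)
          # If the metric returns no series at all, fall back to 0.
          OR on() vector(0)
        )
        # A comparison without "bool" acts as a filter: NaN samples fail
        # ">= 0" and are dropped from the result.
        >= 0
      )
      # If the filter dropped everything (the value was NaN), fall back to 1.
      OR on() vector(1)
    # The rule returns a result only when the collapsed value is above 0.
    ) > 0
```

The two OR on() vector(...) fallbacks are presumably where DataMayNotExist and DataMayBeNaN come in, which is why a plain some_value > my_threshold rule cannot replace them directly.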

Keeping alert_count

alert_count is a documented way for customers to query alerts, so this PR keeps it.

Use of Prometheus ALERTS metrics

Sadly, inactive alerts don't show up in ALERTS. One feature of alert_count is that it exists even when the alert is inactive, so we can use it for e.g. seeing if customers have an alert configured at all. So the alert_count record wraps ALERTS in a max and an OR on() vector(0), which yields 0 when the alert is inactive instead of no series at all.

New output

  - alert: critical_gitserver_running_git_commands
    labels:
      alert_type: builtin
      description: 'gitserver: 100+ running git commands (signals load)'
      level: critical
      name: running_git_commands
      service_name: gitserver
    expr: max(((((max(src_gitserver_exec_running)) / 100) OR on() vector(0)) >= 0) OR on() vector(1)) > 0
  - record: alert_count
    labels:
      alert_type: builtin
      description: 'gitserver: 100+ running git commands (signals load)'
      level: critical
      name: running_git_commands
      service_name: gitserver
    expr: max(ALERTS{alertname="critical_gitserver_running_git_commands"} OR on() vector(0))

Comparison: https://sourcegraph.com/-/debug/grafana/explore?orgId=1&left=%5B%22now-7d%22,%22now%22,%22Prometheus%22,%7B%22expr%22:%22max(ALERTS%7Balertname%3D%5C%22critical_gitserver_running_git_commands%5C%22%7D%20OR%20on()%20vector(0))%22%7D,%7B%22mode%22:%22Metrics%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D&right=%5B%22now-7d%22,%22now%22,%22Prometheus%22,%7B%22expr%22:%22clamp_max(clamp_min(floor(%5Cn%5Cn%20%20%20%20%20%20max(((((max(src_gitserver_exec_running))%20%2F%20100)%20OR%20on()%20vector(0))%20%3E%3D%200)%20OR%20on()%20vector(1))%5Cn%5Cn),%200),%201)%20OR%20on()%20vector(1)%22%7D,%7B%22mode%22:%22Metrics%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D
