monitoring: simple support for alert rule 'for'
Created by: bobheadxi
Adds simple support for the for: duration parameter of Prometheus's native alerts by:
- turning the alert query into an alert rule
- making record: alert_count query ALERTS (so that the results remain consistent, even when for: is set)
This might help us resolve some flaky alerts (e.g. #12158 (closed)) and is needed to ensure we can migrate our out-of-band alerts exactly (since some do use for: - see #12391).
Also adds a new label, alert_type: builtin, to indicate generated alerts.
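As a rough sketch of what this enables, a generated rule could carry a for: field like the one below. The 5m duration is an assumption picked for illustration and does not come from this PR; the rule name, labels and expression are copied from the New output section.

```yaml
# Sketch only: the "for" value below is hypothetical.
- alert: critical_gitserver_running_git_commands
  # Prometheus holds the alert in the pending state until expr
  # has been true for this long, then moves it to firing.
  for: 5m
  labels:
    alert_type: builtin # new label marking generated alerts
    description: 'gitserver: 100+ running git commands (signals load)'
    level: critical
    name: running_git_commands
    service_name: gitserver
  expr: max(((((max(src_gitserver_exec_running)) / 100) OR on() vector(0)) >= 0) OR on() vector(1)) > 0
```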
Closes https://github.com/sourcegraph/sourcegraph/issues/12336
Considerations
Keeping the 0-or-1 query
I'm calling all the min-ing and max-ing stuff part of generating a "0-or-1 query". I originally wanted to change the generator so that we get a clean alert rule that simply renders some_value > my_threshold or similar, since the operations needed to get a clean 0 or 1 are very difficult to parse (see below). However, this plays poorly with our support for DataMayBeNaN and DataMayNotExist, which depend on the min/max stuff, and I'm currently unable to find a way to make these two options work while removing the 0-or-1 query, so I've opted to keep this PR simple and just swap alert and alert_count.
Keeping alert_count
alert_count is a documented way for customers to query alerts.
Use of Prometheus ALERTS metrics
Sadly, inactive alerts don't show up in ALERTS. One feature of alert_count is that it exists even when the alert is inactive, so we can use it for e.g. seeing if customers have an alert configured at all. So the alert_count record does a max and an or, falling back to 0 when ALERTS has no matching series.
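The fallback can be read off the generated record, annotated here as a sketch. The comments describe standard PromQL behaviour as I understand it; the rule itself is the one listed under New output.

```yaml
- record: alert_count
  labels:
    alert_type: builtin
    description: 'gitserver: 100+ running git commands (signals load)'
    level: critical
    name: running_git_commands
    service_name: gitserver
  # ALERTS{alertname="..."} only has a series (value 1) while the alert
  # is active, i.e. pending or firing; otherwise the selector is empty.
  # "OR on() vector(0)" supplies a single 0 sample when the left side is
  # empty, and is ignored when the left side has any sample.
  # max(...) collapses the result to one series, so alert_count is
  # always present: 1 while ALERTS has a matching series, 0 otherwise.
  expr: max(ALERTS{alertname="critical_gitserver_running_git_commands"} OR on() vector(0))
```

Note that the selector as written does not filter on alertstate, so a pending alert would also yield 1.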
New output
- alert: critical_gitserver_running_git_commands
  labels:
    alert_type: builtin
    description: 'gitserver: 100+ running git commands (signals load)'
    level: critical
    name: running_git_commands
    service_name: gitserver
  expr: max(((((max(src_gitserver_exec_running)) / 100) OR on() vector(0)) >= 0) OR on() vector(1)) > 0
- record: alert_count
  labels:
    alert_type: builtin
    description: 'gitserver: 100+ running git commands (signals load)'
    level: critical
    name: running_git_commands
    service_name: gitserver
  expr: max(ALERTS{alertname="critical_gitserver_running_git_commands"} OR on() vector(0))