codeintel: alert when all executor jobs are failing

Warren Gifford requested to merge nsc-ef/executor-errors-over-time into main

Created by: Strum355

Creates an alert on the executor error rate that fires when the rate reaches 100%, which indicates some global misconfiguration (as happened before with src-cli related issues).

The alert is a bit special in that it uses a different query from the panel, one based on the last_over_time aggregate. We do this because we don't want the alert to mark itself as resolved if there happens to be a period within the defined window where there are no auto-indexing jobs (i.e. when the error rate is "technically" < 100%).

The screenshot below illustrates how the alert query maintains the last value over a predefined window: if no executor jobs are processing but the error rate was 100% before, we continue alerting, as the absence of running jobs does not imply the issue is resolved.

image
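
For readers without the screenshot, the hold-last-value behaviour can be sketched as a small simulation. This is not the actual alert query or the monitoring generator code. The function names, the per-step sample lists, the 3-step window, and the choice to treat "no jobs" as a missing sample are all assumptions made for illustration.

```python
from typing import List, Optional


def error_rate(errors: List[int], total: List[int]) -> List[Optional[float]]:
    """Error percentage per step; None when no jobs ran (assumed to mean 'no data')."""
    return [100.0 * e / t if t > 0 else None for e, t in zip(errors, total)]


def last_over_time(samples: List[Optional[float]], window: int) -> List[Optional[float]]:
    """For each step, the most recent non-missing sample among the last `window` steps."""
    held: List[Optional[float]] = []
    for i in range(len(samples)):
        value = None
        # Walk backwards from the current step, at most `window` steps.
        for j in range(i, max(i - window, -1), -1):
            if samples[j] is not None:
                value = samples[j]
                break
        held.append(value)
    return held


def firing(values: List[Optional[float]], threshold: float = 100.0) -> List[bool]:
    """The alert fires on a step only if a value exists and meets the threshold."""
    return [v is not None and v >= threshold for v in values]


# Hypothetical data: every job fails, then no jobs run for two steps, then jobs succeed.
total = [4, 5, 0, 0, 3]
errors = [4, 5, 0, 0, 0]

raw = error_rate(errors, total)
panel_style = firing(raw)                             # resolves as soon as jobs stop
alert_style = firing(last_over_time(raw, window=3))   # keeps firing through the gap

print(panel_style)
print(alert_style)
```

With these made-up samples, the panel-style evaluation stops firing as soon as jobs stop running, while the held value keeps the alert firing until a step with successful jobs arrives.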

Closes https://github.com/sourcegraph/sourcegraph/issues/30494

Test plan

Only modifies dashboards/alerts, n/a
