monitoring: improve long-term provisioning alerts aggregation

Administrator requested to merge monitoring/provisioning-alerts-improvements into main

Created by: bobheadxi

Our long-term provisioning alerts currently have a few issues:

This change addresses the above by switching long-term alerts to quantile_over_time for the 90th percentile, on 1d aggregation, i.e. "if the 90th percentile usage per day is above/below X for 14 days straight, fire an alert". See below for comparisons.
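As a rough illustration of that rule (not the actual generated Prometheus rule), the sketch below takes the 90th percentile of each day's samples and fires only if every one of the last 14 days breaches the threshold. The function names, the discrete per-day bucketing, and the sample values are assumptions for illustration; the real alert evaluates a rolling 1d window, and the interpolation here is only meant to approximate quantile_over_time.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated quantile of `values`, with q in [0, 1]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def long_term_alert_fires(
    daily_samples: Sequence[Sequence[float]],  # hypothetical: one list of usage samples per day
    threshold: float,
    days: int = 14,
    q: float = 0.9,
    above: bool = True,
) -> bool:
    """True if the per-day q-quantile breached `threshold` on each of the last `days` days."""
    if len(daily_samples) < days:
        return False  # not enough history yet, so the alert cannot fire
    for samples in daily_samples[-days:]:
        p = percentile(samples, q)
        breached = p > threshold if above else p < threshold
        if not breached:
            return False  # a single non-breaching day resets the streak
    return True


# Made-up usage percentages: a brief daily spike to 90 no longer dominates the
# signal the way a max-based aggregation would.
busy_day = [10, 20, 30, 40, 90]
print(long_term_alert_fires([busy_day] * 14, threshold=60))
```

Because each day is summarized by a high percentile rather than its maximum, a short spike on an otherwise idle service does not by itself keep the aggregated value high, which fits the low-utilization panels compared below.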

The resulting panels are far more readable, which closes https://github.com/sourcegraph/sourcegraph/issues/12692

This change also makes the following improvements:

  • render the alert's "for" duration in the alert description
  • more fixes to correct the wording of the provisioning alert solutions
  • remove the time from the provisioning alert identifiers and use short_term and long_term instead, which gives us more flexibility in improving what we are trying to signal with these alerts

Comparisons

Queries for searcher, with current on the left and new on the right, charting the min/max of the target query against "actual" (avg of max over 5m, shown in green in the screenshot):

image

Before and after: the provisioning dashboard for Searcher now accurately reflects low utilization (current on the left, new on the right):

image

Similarly for Frontend:

image
