monitoring: improve long-term provisioning alerts aggregation
Created by: bobheadxi
Our long-term provisioning alerts currently have a few issues:
- the aggregation was changed from `avg_over_time` to `max_over_time` (https://github.com/sourcegraph/sourcegraph/issues/12032), which reflects poorly on actual usage, overemphasizing small peaks and causing warnings to fire about under-provisioning when instances of a service reaching high usage are very rare (https://github.com/sourcegraph/sourcegraph/issues/12454#issuecomment-669658768)
- the aggregation window was changed to `7d`, which made sense until we added support for the `for` parameter:
  - alerts now represent "the peak usage of a service per week, for 2 weeks", which doesn't seem particularly useful
- the current 7d query aligns poorly with actual usage (see below) and is difficult to understand (https://github.com/sourcegraph/sourcegraph/issues/12692)
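To illustrate the first issue, here is a minimal sketch of how the two aggregations react to a rare spike. The sample values and the 80% threshold are made up for illustration, and the helper functions are plain-Python stand-ins for the PromQL functions of the same name, not the actual alert queries.

```python
# Hypothetical usage samples (percent) for one service over one window:
# idle at 20% almost all the time, with a single short spike to 95%.
samples = [20.0] * 99 + [95.0]

# Hypothetical under-provisioning warning threshold, for illustration only.
THRESHOLD = 80.0


def avg_over_time(values):
    """Mean of all samples in the window."""
    return sum(values) / len(values)


def max_over_time(values):
    """Largest sample in the window."""
    return max(values)


# The average barely moves, while the max is decided entirely by the spike.
print(avg_over_time(samples) > THRESHOLD)  # False
print(max_over_time(samples) > THRESHOLD)  # True
```

With a 7d window, a single spike like this keeps the `max_over_time` value elevated for the whole week, which is why the warnings fire even when high usage is very rare.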
This change addresses the above by switching long-term alerts to `quantile_over_time` for the 90th percentile, on 1d aggregation, i.e. "if the 90th percentile usage per day is above/below X for 14 days straight, fire an alert". See below for comparisons.
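The rule described above can be sketched roughly as follows. This is a simplified model, not the generated alert: it assumes one evaluation per day instead of a sliding window, the threshold and sample data are hypothetical, and the quantile helper uses linear interpolation between neighbouring ranks as an approximation of what `quantile_over_time` computes.

```python
import math


def quantile_over_time(q, values):
    """q-quantile of the samples, interpolating linearly between ranks."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def fires(daily_p90, threshold, for_days=14):
    """True only if the last `for_days` daily values all exceed the threshold.

    Simplified stand-in for the `for` parameter: one evaluation per day.
    """
    if len(daily_p90) < for_days:
        return False
    return all(value > threshold for value in daily_p90[-for_days:])


# Hypothetical days of usage samples (percent).
quiet_day = [20.0] * 99 + [95.0]        # one rare spike
busy_day = [90.0] * 50 + [20.0] * 50    # high usage half of the day

p90_quiet = quantile_over_time(0.9, quiet_day)  # close to 20: spike ignored
p90_busy = quantile_over_time(0.9, busy_day)    # close to 90

print(fires([p90_busy] * 14, threshold=80.0))                # True
print(fires([p90_busy] * 13 + [p90_quiet], threshold=80.0))  # False
```

A single quiet day resets the condition, so the alert only fires when high usage is sustained rather than when an isolated peak occurs.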
The resulting panels are far more readable, which closes https://github.com/sourcegraph/sourcegraph/issues/12692
This change also makes the following improvements:
- render `for` in alert description
- more fixes to correct the wording of the provisioning alert solutions
- remove the time from the provisioning alerts identifiers and use `short_term` and `long_term` instead, which gives us more flexibility in improving what we are trying to signal with these alerts
Comparisons
Queries for searcher - current on the left, new on the right, charting min/max of the target query compared to "actual" (avg of max over 5m - in screenshot, this is green)
Before and after: the provisioning dashboard for Searcher now accurately reflects low utilization (current left, new right):
Similarly for Frontend: