monitoring: improve long-term provisioning alerts aggregation (!12778)

Warren Gifford requested to merge monitoring/provisioning-alerts-improvements into main Aug 06, 2020

Created by: bobheadxi

Our long-term provisioning alerts currently have a few issues:

- the aggregation was changed from avg_over_time to max_over_time (https://github.com/sourcegraph/sourcegraph/issues/12032), which reflects actual usage poorly: it overemphasizes small peaks and causes under-provisioning warnings to fire even when instances of a service only rarely reach high usage (https://github.com/sourcegraph/sourcegraph/issues/12454#issuecomment-669658768)
- the aggregation window was changed to 7d, which made sense until we added support for the for parameter:
  - alerts now represent "the peak usage of a service per week, for 2 weeks", which doesn't seem particularly useful
  - the current 7d query aligns poorly with actual usage (see below) and is difficult to understand (https://github.com/sourcegraph/sourcegraph/issues/12692)

This change addresses the above by switching long-term alerts to quantile_over_time for the 90th percentile, on a 1d aggregation window, i.e. "if the 90th percentile usage per day is above/below X for 14 days straight, fire an alert". See below for comparisons.
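
To make the difference between the two aggregations concrete, here is a minimal Python sketch. It is not the monitoring generator's code: the sample values, the 80% threshold, and the helper names quantile and fires are assumptions for illustration, and the linear interpolation is only intended to approximate how Prometheus evaluates quantile_over_time.

```python
import math

def quantile(q, samples):
    """q-quantile with linear interpolation between neighbouring ranks
    (assumed to approximate Prometheus' quantile_over_time)."""
    values = sorted(samples)
    rank = q * (len(values) - 1)
    lower = math.floor(rank)
    upper = min(len(values) - 1, lower + 1)
    weight = rank - lower
    return values[lower] * (1 - weight) + values[upper] * weight

def fires(daily_values, threshold, for_days=14):
    """True if the daily aggregate stayed above threshold for the last for_days days."""
    recent = daily_values[-for_days:]
    return len(recent) == for_days and all(v > threshold for v in recent)

# Hypothetical service: ~20% usage all day, with one short spike to 95% each day.
day = [20.0] * 23 + [95.0]  # one sample per hour, illustrative only
days = [day] * 14

daily_max = [max(d) for d in days]            # old style: peak per window
daily_p90 = [quantile(0.9, d) for d in days]  # new style: 90th percentile per day

print(fires(daily_max, threshold=80))  # max-based check sees the daily spike
print(fires(daily_p90, threshold=80))  # p90-based check ignores the rare spike
```

In this made-up scenario the max-based check treats the service as under-provisioned because of a single hourly spike, while the 90th percentile stays at the typical usage level, which is the behaviour this change is after.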

The resulting panels are far more readable, which closes https://github.com/sourcegraph/sourcegraph/issues/12692

This change also makes the following improvements:

- render the for parameter in alert descriptions
- more fixes to correct the wording of the provisioning alert solutions
- remove the time window from the provisioning alert identifiers and use short_term and long_term instead, which gives us more flexibility in improving what we are trying to signal with these alerts

Comparisons

Queries for searcher - current on the left, new on the right, charting min/max of the target query compared to "actual" (avg of max over 5m - in screenshot, this is green)

Before and after: the provisioning dashboard for Searcher now accurately reflects low utilization (current left, new right):

Similarly for Frontend:
