monitoring(frontend): use ratios instead of hard thresholds
Created by: bobheadxi
While the original alert described in #12158 (closed) is not as noisy anymore, some of the noisiest alerts in #alerts-cloud are generally those with hard thresholds, i.e. "Y+ errors in X minutes". On larger instances like Sourcegraph Cloud, this can mean we fire alerts on issues that only affect a very small number of users.
This change tries to convert all of the frontend service's hard threshold alerts to ratio-based alerts. I opted to just change them all for consistency.
To account for smaller instances, where a few errors across a few requests could trigger a ratio-based alert, I've added a `for` parameter to each alert, so that requests must error out consistently before an alert is raised.
Closes #12158 (closed)
Some comparisons - left is the previous query, right is the new ratio query. Each link also includes a second metric query on the old panel, for comparison with total requests:
- precise_code_intel_api_errors - this one is particularly noisy
If this change seems useful, I can apply it to our other services as well.