monitoring(frontend): use ratios instead of hard thresholds (!12756)

Administrator requested to merge monitoring/ratio-instead-of-hard into main Aug 05, 2020

Created by: bobheadxi

While the original alert described in #12158 is no longer as noisy, some of the noisiest alerts in #alerts-cloud are still those with hard thresholds, i.e. "Y+ errors in X minutes". On larger instances like Sourcegraph Cloud, this can mean we fire alerts for issues that affect only a very small number of users.

This change tries to convert all of the frontend service's hard threshold alerts to ratio-based alerts. I opted to just change them all for consistency.
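
To illustrate the difference between the two alert styles, here is a minimal Python sketch. It is not the actual monitoring definition from this change: the function names, the threshold of 5 errors, and the 5% ratio are hypothetical values chosen only for the example.

```python
def hard_threshold_fires(errors: int, max_errors: int = 5) -> bool:
    """Hard threshold: fire once the error count reaches max_errors (hypothetical value)."""
    return errors >= max_errors


def ratio_fires(errors: int, total: int, max_ratio: float = 0.05) -> bool:
    """Ratio: fire once errors make up at least max_ratio of all requests (hypothetical value)."""
    if total == 0:
        # No traffic in the window, so there is nothing to alert on.
        return False
    return errors / total >= max_ratio


# Large instance: 10 errors out of 100,000 requests.
large_hard = hard_threshold_fires(10)
large_ratio = ratio_fires(10, 100_000)

# Small instance: 10 errors out of 50 requests.
small_hard = hard_threshold_fires(10)
small_ratio = ratio_fires(10, 50)
```

With these assumed numbers, the hard threshold fires on both instances, while the ratio fires only on the small instance, where the same 10 errors are a large share of the traffic.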

To account for smaller instances, where a few errors across a few requests can be enough to trigger a ratio-based alert, I've added a "for" parameter to each alert, so that requests must error out consistently before an alert is raised.
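
The effect of requiring consistent errors can be sketched as follows. This is a simplified Python model, not the real alerting engine: it assumes the "for" parameter acts as a pending period, and it approximates that period as a number of consecutive evaluations (`for_samples`, a hypothetical name) rather than a time duration. The 5% ratio is also a hypothetical value.

```python
from typing import Iterable


def fires_with_for(ratios: Iterable[float], max_ratio: float = 0.05, for_samples: int = 3) -> bool:
    """Return True if the error ratio stays at or above max_ratio
    for for_samples consecutive evaluations (both values are hypothetical)."""
    consecutive = 0
    for ratio in ratios:
        if ratio >= max_ratio:
            consecutive += 1
            if consecutive >= for_samples:
                return True
        else:
            # The condition cleared, so the pending period starts over.
            consecutive = 0
    return False


# Small instance: a single evaluation where 1 of 2 requests failed.
brief_spike = fires_with_for([0.0, 0.5, 0.0, 0.0])

# Requests keep failing across three evaluations in a row.
sustained = fires_with_for([0.0, 0.5, 0.4, 0.6])
```

In this model the brief spike does not raise an alert, while the sustained run of errors does.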

Closes #12158

Some comparisons: the left is the previous query, the right is the new ratio query. Each link also has a second metric query on the old panel for comparison with total requests.

hard_error_search_responses

precise_code_intel_api_errors - this one is particularly noisy

If this change seems useful, I can apply it to our other services as well.
