monitoring: reduce threshold of low-utilization alerts

Created by: bobheadxi

Low-utilization alerts were introduced as part of an effort to have alerting help guide provisioning decisions. At Sourcegraph, we currently have many low-traffic instances where this alert fires for pretty much every service, all the time, which has prompted a need to silence them en masse (https://github.com/sourcegraph/sourcegraph/pull/14474). However, I don't think simply removing them is the best solution, since severe over-provisioning can be a real problem.

https://github.com/sourcegraph/sourcegraph/pull/14474#issuecomment-705287782

I really think they can be a lot stricter: the 30% threshold is rather high. At 10% or even 5%, the alert is a much more useful indicator that "hey, this service is really not using any resources", and it is more likely to prompt a response.
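To illustrate the effect of a stricter threshold, here is a minimal sketch. It is not the actual monitoring implementation: the `ServiceUsage` type, the `low_utilization_alerts` helper, and the utilization figures are all hypothetical and chosen only to show how the set of firing alerts shrinks as the threshold drops.

```python
from dataclasses import dataclass


@dataclass
class ServiceUsage:
    # Hypothetical record: a service and its average resource
    # utilization (0-100) over the alert window.
    name: str
    utilization_percent: float


def low_utilization_alerts(services, threshold_percent):
    """Return the names of services whose utilization is below the threshold."""
    return [s.name for s in services if s.utilization_percent < threshold_percent]


# Made-up numbers for a low-traffic instance.
services = [
    ServiceUsage("frontend", 22.0),
    ServiceUsage("gitserver", 12.0),
    ServiceUsage("searcher", 3.0),
]

alerts_at_30 = low_utilization_alerts(services, 30.0)  # every service fires
alerts_at_10 = low_utilization_alerts(services, 10.0)  # only the idle one fires
print(alerts_at_30)
print(alerts_at_10)
```

With these made-up numbers, the 30% threshold flags all three services, while the 10% threshold flags only the one that is nearly idle, which is the case where someone would plausibly act on the alert.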

That said, having to silence low-util alerts does not necessarily mean they are not useful. We have instances we keep around for testing that probably won't get any traffic but that we don't want to scale down to 0 (e.g. devmanaged.sourcegraph.com), and I hear that customers have test deployments as well where they might want to ignore this alert.