proposal(monitoring): forbid critical alerts with no solutions

Created by: bobheadxi

This proposal arises from a discussion I had with @michaellzc today regarding https://github.com/sourcegraph/sourcegraph/pull/36321, the motivation for which is basically "most critical alerts are not useful, because they are unactionable with unhelpful documentation". I think this points to a very serious issue with our critical alerts today: not only can they be noisy, but many give site admins like @michaellzc and the DevOps team no useful action to take and no indication of how to debug them. It has reached the point where one such admin decided it was easier to build an entire new feature for allow-listing certain alerts than to be paged night and day. The severity of a critical alert is already outlined in our documentation:

critical: something is definitively wrong with Sourcegraph. We suggest using a high-visibility notification channel for these alerts.

When we add a critical alert, we are basically saying that we would be willing to page the Sourcegraph admins of every customer if this alert goes off. IMO, anyone adding a critical alert should be expected to provide a robust debugging path for dealing with it. Not only is this a better experience for customers, but it also helps grow our teams as new members onboard and take on-call rotations.

The proposal: forbid critical alerts from being defined without any PossibleSolutions provided, and remove all critical alerts that currently do not have any PossibleSolutions defined. Additionally, as follow-ups:

This PR adds the above restriction as a hard requirement within the generator, and fixes all the outstanding issues:

  • Adds PossibleSolutions for the pods_available_percentage and pg_up critical alerts (we have a similar solution doc in the container_missing alert)
  • Removes all other critical alerts that have no PossibleSolutions
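
As a rough illustration of the kind of check the generator enforces, here is a minimal sketch in Go. The Observable type, its fields, and validateObservable are simplified, hypothetical stand-ins and not the generator's actual definitions; the guidance string is placeholder text.

```go
package main

import (
	"fmt"
	"strings"
)

// Observable is a hypothetical, simplified stand-in for an alert definition
// in the monitoring generator. The real type is assumed to carry more fields.
type Observable struct {
	Name              string
	Critical          bool   // true if a critical-level alert is defined
	PossibleSolutions string // debugging guidance shown to site admins
}

// validateObservable returns an error when a critical alert is defined
// without any possible solutions (empty or whitespace-only guidance).
func validateObservable(o Observable) error {
	if o.Critical && strings.TrimSpace(o.PossibleSolutions) == "" {
		return fmt.Errorf("%s: critical alert defined without PossibleSolutions", o.Name)
	}
	return nil
}

func main() {
	observables := []Observable{
		// Placeholder guidance text, for illustration only.
		{Name: "pg_up", Critical: true, PossibleSolutions: "Placeholder: check that the database is reachable."},
		// Hypothetical alert with no guidance; this one should be rejected.
		{Name: "example_alert", Critical: true},
	}
	for _, o := range observables {
		if err := validateObservable(o); err != nil {
			fmt.Println("invalid:", err)
		}
	}
}
```

Failing at generation time, rather than at review time, means an unactionable critical alert cannot be merged by accident.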

Test plan

Eventually, go generate ./monitoring should pass.
