proposal(monitoring): forbid critical alerts with no solutions
Created by: bobheadxi
This proposal arises from a discussion I had with @michaellzc today regarding https://github.com/sourcegraph/sourcegraph/pull/36321, the motivation for which is basically "most critical alerts are not useful, because they are unactionable with unhelpful documentation". I think this points to a very serious issue with our critical alerts today: not only can they be noisy, but many do not give site admins like @michaellzc and the DevOps team any useful action to take or any indication of how to debug them, to the point where one such admin decided it was easier to build an entire new feature for allow-listing certain alerts than to be paged night and day. The severity of a critical alert is already outlined in our documentation:
critical: something is definitively wrong with Sourcegraph. We suggest using a high-visibility notification channel for these alerts.
When we add a critical alert, we are basically saying that we would be willing to page Sourcegraph admins at every customer if this alert goes off. IMO anyone adding a critical alert should be expected to provide a robust debugging path for dealing with it. Not only is this a better experience for customers, but it also helps grow our teams as new members onboard and take on-call rotations.
The proposal: forbid critical alerts from being defined without any `PossibleSolutions` provided, and remove all critical alerts that currently do not have any `PossibleSolutions` defined. Additionally, as follow-ups:
- we will rebrand `PossibleSolutions` as `NextSteps` to lower the bar for this field (the instructions don't need to provide an immediate solution, just guide the next steps once an alert is received): https://github.com/sourcegraph/sourcegraph/pull/36500
- check in with DevOps to identify other problematic critical alerts they have encountered: https://github.com/sourcegraph/sourcegraph/issues/36434
This PR adds the above restriction as a hard requirement within the generator, and fixes all the outstanding issues:
- Adds `PossibleSolutions` for the `pods_available_percentage` and `pg_up` critical alerts (we have a similar solution doc in the `container_missing` alert)
- Removes all the other critical alerts without possible solutions
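
For illustration, the generator-side restriction described above could look roughly like the sketch below. It is not the code in this PR: the `Observable` struct, its fields, and the `validateCritical` / `validateAll` helpers are hypothetical, simplified stand-ins, and the real monitoring generator's types and error messages may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Observable is a hypothetical, simplified stand-in for a monitoring
// definition. The real generator's types may look different.
type Observable struct {
	Name              string
	Critical          bool   // true if a critical alert is defined for this observable
	PossibleSolutions string // free-form guidance for whoever receives the alert
}

// validateCritical returns an error if a critical alert is defined
// without any possible solutions. Non-critical observables always pass.
func validateCritical(o Observable) error {
	if !o.Critical {
		return nil
	}
	if strings.TrimSpace(o.PossibleSolutions) == "" {
		return fmt.Errorf("%s: critical alerts must define PossibleSolutions", o.Name)
	}
	return nil
}

// validateAll checks every observable and collects all violations, so a
// single generator run can report every offending alert at once.
func validateAll(observables []Observable) []error {
	var errs []error
	for _, o := range observables {
		if err := validateCritical(o); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	// Example definitions; the guidance text here is made up.
	observables := []Observable{
		{Name: "pg_up", Critical: true, PossibleSolutions: "Check that the database container is running."},
		{Name: "example_alert", Critical: true}, // no guidance: rejected
		{Name: "example_warning", Critical: false},
	}
	for _, err := range validateAll(observables) {
		fmt.Println("invalid:", err)
	}
}
```

Failing at generation time, rather than at review time, means a critical alert without guidance can never reach a customer instance unnoticed.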
Test plan
Eventually, `go generate ./monitoring` should pass.