Created by: bobheadxi
This proposal arises from a discussion I had with @michaellzc today regarding https://github.com/sourcegraph/sourcegraph/pull/36321, the motivation for which is basically "most critical alerts are not useful, because they are unactionable with unhelpful documentation". I think this points to a very serious issue with our critical alerts today: not only can they be noisy, but many give site admins like @michaellzc and the DevOps team no useful action to take and no indication of how to debug them, to the point that one such admin decided it is easier to build an entire new feature for allow-listing certain alerts than to be paged night and day. The severity of a critical alert is already outlined in our documentation:
critical: something is definitively wrong with Sourcegraph. We suggest using a high-visibility notification channel for these alerts.
When we add a critical alert, we are basically saying that we would be willing to page Sourcegraph admins at every customer if this alert goes off. IMO a reasonable expectation when adding a critical alert is that it comes with a robust debugging path for dealing with it. Not only is this a better experience for customers, it also helps grow our teams as new members onboard and take on-call rotations.
The proposal: forbid critical alerts from being defined without any `PossibleSolutions` provided, and remove all critical alerts that currently do not have any `PossibleSolutions` defined. Additionally, as follow-ups:
- Rename `PossibleSolutions` as `NextSteps` to lower the bar for this field (the instructions don't need to provide an immediate solution, just guide the next steps once an alert is received): https://github.com/sourcegraph/sourcegraph/pull/36500
This PR adds the above restriction as a hard requirement within the generator, and fixes all the outstanding issues:
- `PossibleSolutions` for the `pods_available_percentage` and `pg_up` critical alerts (we have a similar solution doc in the `container_missing` alert)

Eventually, `go generate ./monitoring` should pass.
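
As a rough illustration of the restriction described above, here is a minimal sketch of what such a check could look like. The `Alert` type, its fields, and the `validateAlert`/`validateAll` helpers are hypothetical stand-ins invented for this example; they are not the generator's actual types or API, which this description does not show.

```go
package main

import (
	"fmt"
	"strings"
)

// Alert is a simplified, hypothetical stand-in for an alert definition.
// The real generator's types are not shown in this proposal.
type Alert struct {
	Name              string
	Critical          bool   // true if the alert fires at critical severity
	PossibleSolutions string // guidance for whoever receives the alert
}

// validateAlert rejects a critical alert whose PossibleSolutions is empty
// or contains only whitespace. Non-critical alerts are not restricted.
func validateAlert(a Alert) error {
	if a.Critical && strings.TrimSpace(a.PossibleSolutions) == "" {
		return fmt.Errorf("alert %q: critical alerts must define PossibleSolutions", a.Name)
	}
	return nil
}

// validateAll checks every alert and collects all violations, so a single
// run can report every offending alert rather than stopping at the first.
func validateAll(alerts []Alert) []error {
	var errs []error
	for _, a := range alerts {
		if err := validateAlert(a); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}
```

Under this sketch, a generation step would call `validateAll` over its alert definitions and fail if the returned slice is non-empty, which is one way a missing `PossibleSolutions` could become a hard error rather than a review comment.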