Approved: Proposal: RFC-189: Support per-team alerts and on-call rotations
Created by: bobheadxi
Context
RFC 189 suggests that we might want engineering teams to own their own alerts. The RFC does not yet thoroughly detail this, but this would likely entail:
- Some way to automatically send alerts to appropriate teams to handle
- Some way to denote ownership of alerts (defined in package /monitoring)
Proposal
Simple alert routing
https://github.com/sourcegraph/sourcegraph/pull/11832 adds site-config-based notification definitions via Prometheus Alertmanager. We can extend the observability.alerts entries to accommodate matching on a small set of labels, for example:
{
    "level": "critical",
    "notifier": {
        "type": "opsgenie",
        "apiKey": "xxx",
        "responders": [ ... ]
    },
+   "onLabels": {
+       "service": [ "git_server", "frontend" ]
+   }
}
This alone might be a sufficient (if tedious) way to help teams own their own alerts, though it might also leave some alerts unowned. We could also restrict the implied breadth of the onLabels field above and instead expose specific routing fields as top-level options (an example is given in the next section).
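To make the matching semantics concrete, here is a minimal sketch in Go. It assumes that every label listed in onLabels must be present on the alert with one of the accepted values, and that an entry without onLabels matches on level alone. The AlertRule type and Matches method are hypothetical names used for illustration only; they are not part of the existing site config handling.

```go
package main

import "fmt"

// AlertRule is a hypothetical, simplified view of one observability.alerts entry.
type AlertRule struct {
	Level    string
	OnLabels map[string][]string // label name -> accepted values (assumed semantics)
}

// Matches reports whether an alert with the given level and labels would be
// routed to this rule. Every label in OnLabels must be present on the alert
// with one of the accepted values; a rule without OnLabels matches on level alone.
func (r AlertRule) Matches(level string, labels map[string]string) bool {
	if r.Level != level {
		return false
	}
	for name, accepted := range r.OnLabels {
		value, ok := labels[name]
		if !ok || !contains(accepted, value) {
			return false
		}
	}
	return true
}

// contains reports whether want appears in values.
func contains(values []string, want string) bool {
	for _, v := range values {
		if v == want {
			return true
		}
	}
	return false
}

func main() {
	rule := AlertRule{
		Level:    "critical",
		OnLabels: map[string][]string{"service": {"git_server", "frontend"}},
	}
	fmt.Println(rule.Matches("critical", map[string]string{"service": "frontend"}))
	fmt.Println(rule.Matches("critical", map[string]string{"service": "searcher"}))
}
```

Under these assumptions, an alert whose service label is not listed by any rule is matched by no rule, which is one way alerts could end up unowned.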
Denoting ownership
Ideally, whatever we do to denote ownership should not be Cloud-specific, i.e. it would be unpleasant to have to generate different alerting for Cloud. An additional required field could be added to our monitoring Observables to give each panel a "product area" corresponding to the teams defined in https://github.com/sourcegraph/about/pull/1150, for example:
{
    Name: "disk_space_remaining",
    Description: "disk space remaining by instance",
+   Owner: OwnerSearch, // "search"
    Query: `(src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100`,
    DataMayNotExist: true,
    Critical: Alert{LessOrEqual: 5},
    PanelOptions: PanelOptions().LegendFormat("{{instance}}").Unit(Percentage),
    PossibleSolutions: `
        - **Provision more disk space:** Sourcegraph will begin deleting least-used repository clones at 10% disk space remaining which may result in decreased performance, users having to wait for repositories to clone, etc.
    `,
},
This would:
- give us useful information on who to contact about an alert when customers send us alerts (i.e. via the bug report page)
- "a lot of alerts in group 'search' are firing, ask someone in the search team"
- make it easier to maintain notifiers by being able to just use site config and the functionality developed to deploy/silence these notifications.
- enable us to better dogfood alerts (we'll use the same templates, alert timings, etc. as customers)
Alternative naming for this field: Team, ProductArea, Group
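As a rough illustration of what a required ownership field could look like, the sketch below defines an Owner type and a validation step that rejects Observables without a known owner. Only OwnerSearch and its "search" value come from the example above. OwnerExampleTeam, knownOwners, validate, and the trimmed-down Observable struct are hypothetical and do not reflect the actual monitoring package.

```go
package main

import "fmt"

// Owner names the team responsible for an Observable.
type Owner string

const (
	// OwnerSearch is taken from the proposal's example.
	OwnerSearch Owner = "search"
	// OwnerExampleTeam is a hypothetical placeholder, not a real team name.
	OwnerExampleTeam Owner = "example-team"
)

// knownOwners is the (assumed) closed set of valid owners.
var knownOwners = map[Owner]bool{
	OwnerSearch:      true,
	OwnerExampleTeam: true,
}

// Observable is a trimmed-down stand-in for the monitoring Observable type.
type Observable struct {
	Name  string
	Owner Owner
}

// validate rejects Observables whose Owner is missing or not a known team,
// which is one way to make the field effectively required.
func validate(o Observable) error {
	if o.Owner == "" {
		return fmt.Errorf("observable %q: Owner is required", o.Name)
	}
	if !knownOwners[o.Owner] {
		return fmt.Errorf("observable %q: unknown Owner %q", o.Name, o.Owner)
	}
	return nil
}

func main() {
	fmt.Println(validate(Observable{Name: "disk_space_remaining", Owner: OwnerSearch}))
	fmt.Println(validate(Observable{Name: "unowned_panel"}))
}
```

Using a dedicated type with constants, rather than a free-form string, would let typos in team names fail at compile time or validation time instead of silently producing an unrouted alert.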
A notifier configuration with the above ownership label might then look like:
{
    "level": "critical",
    "notifier": {
        "type": "opsgenie",
        "apiKey": "xxx",
        "responders": [ ... ]
    },
+   "owners": [ "search" ]
}
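The following sketch shows one way such entries could be decoded and used to pick notifiers for a firing alert. It assumes that an entry without owners receives all alerts of its level (acting as a catch-all), which the proposal does not specify. AlertConfig, parseAlerts, and notifiersFor are hypothetical names, and the apiKey and responders fields are omitted for brevity.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Notifier holds only the notifier type; other fields are omitted in this sketch.
type Notifier struct {
	Type string `json:"type"`
}

// AlertConfig is a hypothetical Go view of one notifier entry with owners.
type AlertConfig struct {
	Level    string   `json:"level"`
	Notifier Notifier `json:"notifier"`
	Owners   []string `json:"owners"`
}

// parseAlerts decodes a JSON array of notifier entries.
func parseAlerts(raw string) ([]AlertConfig, error) {
	var alerts []AlertConfig
	if err := json.Unmarshal([]byte(raw), &alerts); err != nil {
		return nil, err
	}
	return alerts, nil
}

// notifiersFor returns the notifier types that should receive an alert with
// the given level and owner. Entries without owners are treated as catch-alls
// (an assumption of this sketch).
func notifiersFor(alerts []AlertConfig, level, owner string) []string {
	var types []string
	for _, a := range alerts {
		if a.Level != level {
			continue
		}
		if len(a.Owners) == 0 || hasOwner(a.Owners, owner) {
			types = append(types, a.Notifier.Type)
		}
	}
	return types
}

// hasOwner reports whether owner appears in owners.
func hasOwner(owners []string, owner string) bool {
	for _, o := range owners {
		if o == owner {
			return true
		}
	}
	return false
}

// exampleConfig contains one owned entry and one catch-all entry.
const exampleConfig = `[
  {"level": "critical", "notifier": {"type": "opsgenie"}, "owners": ["search"]},
  {"level": "critical", "notifier": {"type": "slack"}}
]`

func main() {
	alerts, err := parseAlerts(exampleConfig)
	if err != nil {
		panic(err)
	}
	fmt.Println(notifiersFor(alerts, "critical", "search"))
	fmt.Println(notifiersFor(alerts, "critical", "example-team"))
}
```

With a catch-all entry in place, alerts owned by a team that has not configured its own notifier would still reach someone, which addresses the concern about unowned alerts.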