Closed Approved: Proposal: RFC-189: Support per-team alerts and on-call rotations
  • Closed Issue created by Warren Gifford

    Created by: bobheadxi

    Context

    RFC 189 suggests that we might want engineering teams to own their own alerts. The RFC does not yet thoroughly detail this, but this would likely entail:

    • Some way to automatically send alerts to appropriate teams to handle
    • Some way to denote ownership of alerts (defined in package /monitoring)

    Proposal

    Simple alert routing

    https://github.com/sourcegraph/sourcegraph/pull/11832 adds site-config-based notification definitions via Prometheus Alertmanager. We can extend the entries in observability.alerts to accommodate matching on a small set of labels, for example:

    {
      "level": "critical",
      "notifier": {
        "type": "opsgenie",
        "apiKey": "xxx",
        "responders": [ ... ]
      },
    + "onLabels": {
    +   "service": [ "git_server", "frontend" ]
    + }
    }

    This alone might be a sufficient (if tedious) way to help teams own their own alerts. It might also cause some alerts to remain unowned, since an alert whose labels match no notifier would not be routed to any team. We could also restrict the implied breadth of the onLabels field above and instead expose specific routing fields as top-level options (see the example under "Denoting ownership").
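The matching semantics implied by onLabels could be sketched roughly as follows. This is only an illustration, not the implementation: the Notifier type and its Matches method are hypothetical names, and the sketch assumes that every label listed in onLabels must be present on the alert with one of the accepted values. In practice the matching would presumably be delegated to Alertmanager routing rather than done in our own code.

```go
package main

import "fmt"

// Notifier is a hypothetical, simplified stand-in for one observability.alerts entry.
type Notifier struct {
	Level    string
	Type     string
	OnLabels map[string][]string // label name -> accepted values; empty matches everything
}

// Matches reports whether an alert with the given level and labels would be
// routed to this notifier. Assumption: every label in OnLabels must be present
// on the alert and carry one of the accepted values.
func (n Notifier) Matches(level string, labels map[string]string) bool {
	if n.Level != level {
		return false
	}
	for name, accepted := range n.OnLabels {
		value, ok := labels[name]
		if !ok || !contains(accepted, value) {
			return false
		}
	}
	return true
}

// contains reports whether want appears in values.
func contains(values []string, want string) bool {
	for _, v := range values {
		if v == want {
			return true
		}
	}
	return false
}

func main() {
	n := Notifier{
		Level:    "critical",
		Type:     "opsgenie",
		OnLabels: map[string][]string{"service": {"git_server", "frontend"}},
	}
	fmt.Println(n.Matches("critical", map[string]string{"service": "frontend"})) // true
	fmt.Println(n.Matches("critical", map[string]string{"service": "searcher"})) // false
	fmt.Println(n.Matches("warning", map[string]string{"service": "frontend"}))  // false
}
```

An alert for a service that no notifier lists (the second call) matches nothing, which is the "unowned alert" case mentioned above.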

    Denoting ownership

    Ideally, whatever we do to denote ownership should not be Cloud-specific, i.e. it would be unpleasant to have to generate different alerting for Cloud. An additional required field could be added to our monitoring Observables to give each panel a "product area" corresponding to the teams defined in https://github.com/sourcegraph/about/pull/1150, for example:

    {
    	Name:            "disk_space_remaining",
    	Description:     "disk space remaining by instance",
    +	Owner:           OwnerSearch, // "search"
    	Query:           `(src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100`,
    	DataMayNotExist: true,
    	Critical:        Alert{LessOrEqual: 5},
    	PanelOptions:    PanelOptions().LegendFormat("{{instance}}").Unit(Percentage),
    	PossibleSolutions: `
    		- **Provision more disk space:** Sourcegraph will begin deleting least-used repository clones at 10% disk space remaining which may result in decreased performance, users having to wait for repositories to clone, etc.
    	`,
    },

    This would:

    • give us useful information on whom to contact about an alert when customers report alerts to us (i.e. via the bug report page)
      • "a lot of alerts in group 'search' are firing, ask someone in the search team"
    • make it easier to maintain notifiers by being able to just use site config and the functionality developed to deploy/silence these notifications.
      • enable us to better dogfood alerts (we'll use the same templates, alert timings, etc. as customers)

    Alternative naming for this field: Team, ProductArea, Group
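One way the required Owner field might be declared and enforced is sketched below. Only OwnerSearch and its "search" value come from the example above; OwnerExample, the trimmed-down Observable struct, and the validate method are hypothetical and exist purely for illustration. The real list of owners would follow the teams defined in the linked pull request.

```go
package main

import "fmt"

// Owner identifies the team or product area responsible for an Observable.
type Owner string

const (
	OwnerSearch  Owner = "search"  // taken from the example above
	OwnerExample Owner = "example" // hypothetical placeholder for another team
)

// knownOwners lists the owners that validation accepts.
var knownOwners = map[Owner]bool{
	OwnerSearch:  true,
	OwnerExample: true,
}

// Observable is a trimmed-down, hypothetical version of the monitoring type,
// keeping only the fields needed to illustrate ownership.
type Observable struct {
	Name  string
	Owner Owner
}

// validate rejects Observables without an owner, or with an unknown one, so
// that no alert is generated unowned.
func (o Observable) validate() error {
	if o.Owner == "" {
		return fmt.Errorf("observable %q: Owner is required", o.Name)
	}
	if !knownOwners[o.Owner] {
		return fmt.Errorf("observable %q: unknown owner %q", o.Name, o.Owner)
	}
	return nil
}

func main() {
	owned := Observable{Name: "disk_space_remaining", Owner: OwnerSearch}
	unowned := Observable{Name: "some_unowned_panel"}

	fmt.Println(owned.validate())   // <nil>
	fmt.Println(unowned.validate()) // observable "some_unowned_panel": Owner is required
}
```

Making the field required at generation time addresses the unowned-alert concern from the routing section, because an Observable without an owner fails validation instead of silently shipping.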

    A notifier configuration using the above ownership label might then look like:

    {
      "level": "critical",
      "notifier": {
        "type": "opsgenie",
        "apiKey": "xxx",
        "responders": [ ... ]
      },
    + "owners": [ "search" ]
    }
