monitoring: relax alerting thresholds for sentinel queries

Administrator requested to merge sentinel-relax into main

Created by: ggilmore

We get sentinel query alerts whenever Zoekt is deployed on sourcegraph.com.

These alerts are just noise - artifacts caused by how Zoekt is deployed:

  • a statefulset with ~50 replicas will take quite a long time to update
  • when a Zoekt pod first starts, it needs to load the shards on its disk into memory before it can respond to search queries

There is no action that anyone can take to resolve this, so all these alerts do is cause stress. Therefore, this PR relaxes the sentinel query alerts so they don't trigger Opsgenie in the common case of a Zoekt rollout:

  • the critical alerting threshold duration increases from 30 minutes to 3 1/2 hours (this window was determined by looking at the width of latency spikes from past alerts + adding some slack)
  • the width of the sampling duration increases from 1 1/2 hours to 2 hours
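
To illustrate why a longer threshold duration suppresses rollout noise, here is a minimal sketch. It assumes the critical alert behaves like a "condition must hold continuously for N minutes" rule, which may not match the real alert definition exactly. The function name `fires`, the one-sample-per-minute resolution, and the 90-minute and 240-minute spike widths are hypothetical, chosen only for illustration. Only the 30-minute and 3 1/2-hour durations come from this PR.

```python
from typing import Sequence


def fires(breached: Sequence[bool], for_minutes: int) -> bool:
    """Return True if the condition held for at least `for_minutes`
    consecutive samples (one sample per minute is assumed)."""
    run = 0
    for is_breached in breached:
        # Extend the current streak, or reset it when the condition clears.
        run = run + 1 if is_breached else 0
        if run >= for_minutes:
            return True
    return False


OLD_FOR = 30            # minutes: previous critical threshold duration
NEW_FOR = 3 * 60 + 30   # minutes: new critical threshold duration (3 1/2 hours)

# Hypothetical single rollout: latency stays above the threshold for 90 minutes.
single_rollout = [False] * 60 + [True] * 90 + [False] * 60

# Hypothetical back-to-back rollouts: latency stays high for 240 minutes.
back_to_back = [False] * 60 + [True] * 240 + [False] * 60

print(fires(single_rollout, OLD_FOR))  # old setting pages on a routine rollout
print(fires(single_rollout, NEW_FOR))  # new setting stays quiet
print(fires(back_to_back, NEW_FOR))    # an unusually long spike still pages
```

Under these assumptions, a routine rollout no longer reaches the critical duration, while a much longer spike still does.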

Take a look at this dashboard snapshot over the past 7 days. Note the highlighted spikes: those correspond to a Zoekt rollout that tripped the critical alerting threshold (which, in turn, triggered an Opsgenie call).

[Screenshot: dashboard snapshot, 2022-04-19 at 8:41:36 AM]

Now take a look at the same dashboard after we apply the changes from this PR (relaxing the alerting thresholds and widening the sampling duration):

[Screenshot: same dashboard with this PR's changes applied, 2022-04-19 at 8:46:55 AM]

  • Notice that the highlighted spike no longer trips the critical alert threshold (which would trigger an Opsgenie call), only the warning threshold
  • The original spike (which corresponds to the back-to-back Zoekt rollouts in this incident) would still trigger a critical alert (which is still okay, since it's an unusual situation)

Test plan

This PR just changes a Grafana dashboard, and doesn't need further testing.
