monitoring: relax alerting thresholds for sentinel queries (!34106)

Administrator requested to merge sentinel-relax into main Apr 19, 2022

Created by: ggilmore

We get sentinel query alerts whenever Zoekt is deployed on sourcegraph.com.

These alerts are just noise - artifacts caused by how Zoekt is deployed:

- A StatefulSet with ~50 replicas takes quite a long time to update.
- When a Zoekt pod first starts, it needs to load the shards on its disk into memory before it can respond to search queries.

There is no action that a user can take to resolve this - all these alerts do is cause stress. Therefore, this PR relaxes the sentinel query alerts so they don't trigger Opsgenie in the common case of a Zoekt rollout:

- The critical alerting threshold duration increases from 30 minutes to 3 1/2 hours (this window was determined by looking at the width of latency spikes from past alerts and adding some slack).
- The sampling window widens from 1 1/2 hours to 2 hours.
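
To illustrate how these two settings interact, here is a minimal sketch that simulates an alert which fires only when a windowed latency value stays above a threshold for a sustained duration. Everything in it is an assumption made for illustration: one sample per minute, a rolling mean as the aggregation, the 5s threshold, the latency values, and the helper names (`rolling_mean`, `alert_fires`, `series_with_spike`). The actual sentinel alert queries are not shown in this PR description and may aggregate differently.

```python
from typing import List, Sequence


def rolling_mean(samples: Sequence[float], window: int) -> List[float]:
    """Mean of the trailing `window` samples at each position (fewer at the start)."""
    means = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        means.append(sum(chunk) / len(chunk))
    return means


def alert_fires(samples: Sequence[float], threshold: float, window: int, hold: int) -> bool:
    """True if the windowed value exceeds `threshold` for `hold` consecutive samples."""
    run = 0
    for value in rolling_mean(samples, window):
        run = run + 1 if value > threshold else 0
        if run >= hold:
            return True
    return False


def series_with_spike(spike_minutes: int, baseline: float = 1.0, peak: float = 10.0) -> List[float]:
    """Hypothetical per-minute latency: flat baseline, one flat spike, flat baseline."""
    return [baseline] * 200 + [peak] * spike_minutes + [baseline] * 400


THRESHOLD = 5.0                          # hypothetical latency threshold, in seconds
single_rollout = series_with_spike(100)  # assumed spike width for one rollout
back_to_back = series_with_spike(300)    # assumed width for consecutive rollouts

# Old settings: 1 1/2 hour window, 30 minute duration.
print(alert_fires(single_rollout, THRESHOLD, window=90, hold=30))    # True
# New settings: 2 hour window, 3 1/2 hour duration.
print(alert_fires(single_rollout, THRESHOLD, window=120, hold=210))  # False
print(alert_fires(back_to_back, THRESHOLD, window=120, hold=210))    # True
```

In this toy model, a single rollout-sized spike trips the old settings but not the new ones, while a much wider spike still trips the new settings.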

Take a look at this dashboard snapshot over the past 7 days. Note the highlighted spikes - those correspond to a Zoekt rollout that tripped the critical alerting threshold (which, in turn, triggered an Opsgenie call).

Now take a look at the same dashboard after we apply the changes from this PR (relaxing the alerting thresholds and widening the sampling duration):

- Notice that the highlighted spike no longer trips the critical alert threshold (which would trigger an Opsgenie call), only the warning threshold.
- The original spike (which corresponds to the back-to-back Zoekt rollouts in this incident) would still trigger a critical alert (which is still okay, since it's an unusual situation).

Test plan

This PR just changes a Grafana dashboard, and doesn't need further testing.
