monitoring: relax alerting thresholds for sentinel queries
Created by: ggilmore
We get sentinel query alerts whenever Zoekt is deployed on sourcegraph.com.
These alerts are just noise - artifacts of the rollout process itself:
- a StatefulSet with ~50 replicas takes quite a long time to update, since pods are replaced one at a time
- when a Zoekt pod first starts, it must load the shards on its disk into memory before it can respond to search queries
There is no action that a user can take to resolve this - all these alerts do is cause stress. Therefore, this PR relaxes the sentinel query alerts so that they don't trigger Opsgenie in the common case of a Zoekt rollout:
- the critical alerting threshold duration increases from 30 minutes to 3.5 hours (this window was determined by looking at the width of latency spikes from past alerts and adding some slack)
- the sampling window widens from 1.5 hours to 2 hours (see the sketch below this list for how the two changes interact)
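To make the interaction between these two knobs concrete, here is a minimal, self-contained Go simulation of Prometheus-style alert evaluation (a longer sampling window smooths the latency signal; a longer "for" duration requires the smoothed signal to stay above the threshold continuously before paging). Everything in it other than the two durations above is an assumption invented for illustration: the 5-second threshold, the shape and 2-hour width of the rollout spike, and all function names. It is not the actual dashboard configuration.

```go
package main

import (
	"fmt"
	"time"
)

const step = time.Minute // evaluation resolution for the simulation

// sentinelLatency models raw sentinel query latency: a 1s baseline with an
// assumed 2-hour spike to 10s starting at t=0, approximating a Zoekt rollout
// during which restarting pods reload shards and respond slowly.
func sentinelLatency(t time.Duration) float64 {
	if t >= 0 && t < 2*time.Hour {
		return 10.0
	}
	return 1.0
}

// windowedAvg averages the latency over the sampling window ending at t,
// mimicking a range-vector average in the alert query.
func windowedAvg(t, window time.Duration) float64 {
	var sum float64
	var n int
	for s := t - window; s <= t; s += step {
		sum += sentinelLatency(s)
		n++
	}
	return sum / float64(n)
}

// fires reports whether the windowed latency stays above threshold for at
// least forDur somewhere in the simulated horizon - i.e. whether a
// Prometheus-style "for" clause would promote the alert to firing.
func fires(threshold float64, window, forDur time.Duration) bool {
	var above time.Duration
	for t := -window; t <= 8*time.Hour; t += step {
		if windowedAvg(t, window) > threshold {
			above += step
			if above >= forDur {
				return true
			}
		} else {
			above = 0
		}
	}
	return false
}

func main() {
	const threshold = 5.0 // seconds; assumed value for illustration

	// Old configuration: 1.5h sampling window, critical after 30 minutes.
	fmt.Println("old critical fires:", fires(threshold, 90*time.Minute, 30*time.Minute))

	// New configuration: 2h sampling window, critical after 3.5 hours.
	fmt.Println("new critical fires:", fires(threshold, 2*time.Hour, 210*time.Minute))
}
```

Under these assumed numbers, the smoothed latency stays above the threshold for roughly two hours during a single rollout: long enough to exceed the old 30-minute critical duration, but well short of the new 3.5-hour one, so the old configuration pages and the new one doesn't.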
Take a look at this dashboard snapshot over the past 7 days. Note the highlighted spikes - they correspond to Zoekt rollouts that tripped the critical alerting threshold (which, in turn, triggered an Opsgenie call).
Now take a look at the same dashboard after applying the changes from this PR (relaxed alerting thresholds and a widened sampling window):
- Notice that the highlighted spike no longer trips the critical alert threshold (which would trigger an Opsgenie call) - it only trips the warning threshold
- The original spike (which corresponds to the back-to-back Zoekt rollouts in this incident) would still trigger a critical alert, which is still okay, since back-to-back rollouts are an unusual situation
Test plan
This PR only changes a Grafana dashboard and doesn't need further testing.