Revert "Revert "search-blitz: add ability to change sampling duration""
Created by: ggilmore
This PR re-adds the search-blitz dashboards changes that were reverted in https://github.com/sourcegraph/sourcegraph/pull/29813 with one new bit.
Prometheus' rule manager (which loads our alert rules) rejects sampling durations that come from the interpolated strings (`$...`) that Grafana fills in:
ts=2022-01-18T16:22:30.432Z caller=manager.go:968 level=error component="rule manager" msg="loading groups failed" err="/sg_config_prometheus/frontend_alert_rules.yml: 1203:11: group \"frontend\", rule 99, \"critical_frontend_90th_percentile_successful_sentinel_duration\": could not parse expression: 1:148: parse error: missing unit character in duration"
This error was (silently) causing the `sourcegraph/server` Docker image failures that we were seeing in CI.
I worked around this by hard-coding the duration (`1h30m`) for the 4 dashboards that generate alerts. As a result, the dashboard now looks like this:
Passing `main-dry-run` Buildkite build (proving that the CI failures are fixed): https://buildkite.com/sourcegraph/sourcegraph/builds/126002
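To illustrate why the alert rules need a hard-coded window, here is a minimal Go sketch that checks whether a window is a literal Prometheus duration before interpolating it into an alert query. The regular expression only approximates the duration syntax described in the Prometheus docs. `isPromDuration`, `alertQuery`, and the metric name `example_sentinel_duration_seconds_bucket` are hypothetical and are not taken from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// promDurationRE approximates Prometheus' duration literal syntax: an integer
// followed by a unit, with units ordered from largest to smallest
// (y, w, d, h, m, s, ms), each used at most once, e.g. "1h30m".
var promDurationRE = regexp.MustCompile(`^(\d+y)?(\d+w)?(\d+d)?(\d+h)?(\d+m)?(\d+s)?(\d+ms)?$`)

// isPromDuration reports whether s looks like a literal Prometheus duration.
// Grafana template variables such as "$sampling_duration" do not match.
func isPromDuration(s string) bool {
	return s != "" && promDurationRE.MatchString(s)
}

// alertQuery builds a hypothetical 90th-percentile query for an alert rule.
// The metric name is illustrative only.
func alertQuery(window string) (string, error) {
	if !isPromDuration(window) {
		return "", fmt.Errorf("alert rule window %q is not a Prometheus duration literal", window)
	}
	return fmt.Sprintf(
		"histogram_quantile(0.90, sum by (le)(rate(example_sentinel_duration_seconds_bucket[%s])))",
		window,
	), nil
}

func main() {
	// A dashboard panel may keep a Grafana variable, but an alert rule
	// must receive a literal duration such as "1h30m".
	for _, w := range []string{"1h30m", "$sampling_duration"} {
		q, err := alertQuery(w)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(q)
	}
}
```

Running it prints the query for `1h30m` and an error for `$sampling_duration`, which mirrors the "missing unit character in duration" failure in the log above: the check happens when the rules are generated rather than when Prometheus loads them.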