Revert "Revert "search-blitz: add ability to change sampling duration"" (!29872)

Administrator requested to merge revert-revert into main Jan 18, 2022

Created by: ggilmore

This PR re-adds the search-blitz dashboards changes that were reverted in https://github.com/sourcegraph/sourcegraph/pull/29813 with one new bit.

Prometheus' rule manager can't parse sampling durations that come from the interpolated strings ($...) that Grafana fills in, because they aren't literal durations when the alert rules are loaded:

ts=2022-01-18T16:22:30.432Z caller=manager.go:968 level=error component="rule manager" msg="loading groups failed" err="/sg_config_prometheus/frontend_alert_rules.yml: 1203:11: group \"frontend\", rule 99, \"critical_frontend_90th_percentile_successful_sentinel_duration\": could not parse expression: 1:148: parse error: missing unit character in duration"

This error was (silently) causing the sourcegraph/server docker image failures that we were seeing in CI.

I worked around this by hard-coding the duration (1h30m) for the 4 dashboards that generate alerts. As a result, the dashboard now looks like this: [screenshot of the updated dashboard]

Passing main-dry-run Buildkite build (to show that the CI failures are fixed): https://buildkite.com/sourcegraph/sourcegraph/builds/126002
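
For illustration, the workaround described above can be sketched roughly as follows. This is not the code from this PR: the function names, the metric name, and the `$sampling_duration` variable name are hypothetical, and the regular expression is only a simplified approximation of what Prometheus accepts as a duration literal.

```python
import re

# Simplified approximation of a Prometheus duration literal such as "5m" or "1h30m".
# Assumption: units must appear in descending order (y, w, d, h, m, s, ms).
DURATION_RE = re.compile(r"(\d+y)?(\d+w)?(\d+d)?(\d+h)?(\d+m)?(\d+s)?(\d+ms)?")


def is_literal_duration(text: str) -> bool:
    """Return True if text looks like a literal duration rather than a Grafana variable."""
    return bool(text) and DURATION_RE.fullmatch(text) is not None


def range_selector(metric: str, duration: str, for_alert: bool,
                   alert_duration: str = "1h30m") -> str:
    """Build `metric[duration]`.

    Alert rules are loaded by Prometheus without Grafana's interpolation,
    so they get the hard-coded literal instead of the dashboard variable.
    """
    chosen = alert_duration if for_alert else duration
    if for_alert and not is_literal_duration(chosen):
        raise ValueError(f"alert rules need a literal duration, got {chosen!r}")
    return f"{metric}[{chosen}]"


# Hypothetical metric and variable names, for illustration only.
dashboard_query = range_selector("example_duration_bucket", "$sampling_duration", for_alert=False)
alert_query = range_selector("example_duration_bucket", "$sampling_duration", for_alert=True)
```

Here the dashboard query keeps the Grafana variable so the duration stays adjustable in the UI, while the alert query always carries a literal that Prometheus can parse.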
