prometheus: alert dashboard links with fixed timestamps
Created by: bobheadxi
While working on https://github.com/sourcegraph/sourcegraph/pull/17014 I added a relative timestamp to the dashboard link in alerts, then did a bit of finagling to make the link completely fixed to the timestamps associated with the delivered alert.
We can depend on (index .Alerts 0) because our grouping strategy ensures each delivered group has only one alert.
Now that I've got it working, I'm still a bit unsure about this: the experience is less than ideal for alerts that are, for example, spikes lasting a second, since we then get a link to a panel with a tiny window. An alternative is time and time.window, but that might not be great for alerts lasting longer. It is not possible to do arithmetic on these timestamps in alertmanager templates: https://github.com/prometheus/alertmanager/issues/1188
=> update: see https://github.com/sourcegraph/sourcegraph/pull/17034#issuecomment-756598154
Created by: sourcegraph-bot
Notifying subscribers in CODENOTIFY files for diff 1851376163f4158e3458a73d0c222a3f6ddce5d0...ffa11b5105aeebdb76481bdb36f41a0769bca2b1.
No notifications.
Created by: codecov[bot]
Codecov Report
Merging #17034 (ffa11b5) into main (1851376) will decrease coverage by 0.00%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main   #17034      +/-   ##
==========================================
- Coverage   51.98%   51.98%   -0.01%
==========================================
  Files        1703     1703
  Lines       84786    84788       +2
  Branches     7524     7666     +142
==========================================
- Hits        44079    44077       -2
- Misses      36806    36808       +2
- Partials     3901     3903       +2

Flag          Coverage Δ
go            51.02% <100.00%> (-0.01%)
integration   30.54% <ø> (ø)
storybook     30.03% <ø> (ø)
typescript    54.30% <ø> (ø)
unit          34.80% <ø> (ø)

Impacted Files                                          Coverage Δ
...er-images/prometheus/cmd/prom-wrapper/receivers.go   66.84% <100.00%> (+0.35%)
.../internal/codeintel/resolvers/graphql/locations.go   79.38% <0.00%> (-4.13%)
Created by: pecigonzalo
@bobheadxi You can use something like &time=1609931477000&time.window=3600000 instead; see https://grafana.com/docs/grafana/latest/dashboards/time-range-controls/#control-the-time-range-using-a-url
Created by: bobheadxi
I considered that (see PR description):
An alternative is time and time.window, but that might not be great for alerts lasting longer.
It's not a great experience if the alert spans, say, 24 hours of problems (the link would only show a few minutes, or whatever value we set there).
Created by: pecigonzalo
I see, I did not notice that in the body, my bad. I think defaulting to 1h or inferring time.window from something like the alert period * 1.5 should be ok. E.g. if CPU utilization alerts when it's > 50 for 5m, we link to a dashboard at the time of the alert with a time.window of 7.5m.
Created by: bobheadxi
inferring time.window from something like the alert period * 1.5 should be ok.
Unfortunately you cannot do arithmetic in alertmanager templates :/ However, I think I found a nice middle-ground strategy with e3f166ff1be18a491b39dce230c03d1c70cd1cb2:
- If start and end available, link to a fixed timeframe on the start and end
- If end is not available (alert still active), link to start and window of 1 hour
Created by: pecigonzalo
We can't infer period * 1.5 for example (assuming by period you mean the time the alert is active - or you mean the for parameter? these are often just 0 or a small number)
I meant the for parameter. To be honest, I think just using 1h for the link is also fine. /shrug