observability: NaN values can leak into alert_count metric
Created by: slimsag
There is a nasty edge case where NaN can leak into our alert_count metric, either from an underlying query being wrong or from it correctly representing its current value as NaN. This also violates the promise of alert_count that it can only ever have whole zero or one values.
Consider, for example, a query that produces NaN:
vector(0/0)
== NaN
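A more realistic way to hit this is a ratio whose numerator and denominator are both zero; with hypothetical metric names, something like:
rate(src_search_errors_total[5m]) / rate(src_search_requests_total[5m])
== NaN  (0/0 whenever both rates are zero)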
With our clamping logic, which surrounds all alert_count definitions, NaN can propagate outward:
clamp_max(clamp_min(floor(
vector(0/0)
), 0), 1) OR on() vector(1)
== NaN!
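Each layer of the wrapper passes NaN straight through (a quick check in the Prometheus expression browser):
floor(vector(0/0))
== NaN
clamp_min(vector(0/0), 0)
== NaN
clamp_max(vector(0/0), 1)
== NaN
And the trailing OR on() vector(1) fallback does not help, because the left-hand side is not empty: it contains a sample whose value just happens to be NaN.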
Luckily, since the alert query (vector(0/0) above) should produce a value in the range [0,1] (i.e. it can never go negative), we can simply use the >= operator to filter out NaN (any comparison against NaN is false in PromQL, so the offending sample is dropped):
clamp_max(clamp_min(floor(
(vector(0/0)) >= 0
), 0), 1) OR on() vector(1)
== 1
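The two halves of this are easy to check on their own: the comparison drops the NaN sample (leaving an empty vector), which lets the outer OR on() vector(1) fallback take over, while ordinary values in [0,1] pass through unchanged:
(vector(0/0)) >= 0
== (empty: the NaN sample is filtered out)
(vector(0)) >= 0
== 0
(vector(1)) >= 0
== 1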
So, if NaN leaks out, the alert will fire. However, that leaves no good way to say "don't fire an alert if NaN is present" (which seems like a practical thing to want), so instead we can do:
clamp_max(clamp_min(floor(
((vector(0/0)) >= 0) OR on() vector($VALUE)
), 0), 1) OR on() vector(1)
Where $VALUE can either be 0 ("do not fire when NaN is present") or 1 ("fire when NaN is present"). The default should be to fire, as NaN may signal a mistake/typo in the user's query -- with an option to disable this for queries that may have expected NaN values.