
observability: NaN values can leak into alert_count metric

Created by: slimsag

There is a nasty edge case where NaN can leak into our alert_count metric, either because an underlying query is wrong or because it correctly represents its current value as NaN.

This also violates the promise of alert_count that it only ever holds whole values of zero or one.

Consider, for example, a query that produces NaN:

vector(0/0)
== NaN

With our clamping logic, which surrounds all alert_count definitions, NaN can propagate outward:

clamp_max(clamp_min(floor(
vector(0/0)
), 0), 1) OR on() vector(1)
== NaN!
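
To spell out why the surrounding functions do not help: floor and the clamp_* functions pass NaN through unchanged, and OR on() vector(1) only substitutes a value when its left-hand side returns no samples -- here the left-hand side still returns one sample, whose value happens to be NaN. Step by step:

floor(vector(0/0))
== NaN

clamp_max(clamp_min(floor(vector(0/0)), 0), 1)
== NaN (still one sample, so the OR fallback never applies)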

Luckily, since the alert query (vector(0/0) above) should only ever produce values in the range [0,1] (i.e. it can never go negative), we can simply use the >= 0 comparison to filter out NaN:

clamp_max(clamp_min(floor(
(vector(0/0)) >= 0
), 0), 1) OR on() vector(1)
== 1
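
This works because PromQL comparison operators drop samples for which the comparison is false, and every comparison against NaN is false. The NaN sample is therefore filtered out, leaving an empty result that the OR on() vector(1) fallback then fills in:

(vector(0/0)) >= 0
== no data

((vector(0/0)) >= 0) OR on() vector(1)
== 1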

So if NaN leaks out, the alert will fire. However, that leaves us with no good way to say "don't fire an alert if NaN is present" (which seems like a practical thing to want), so instead we can do:

clamp_max(clamp_min(floor(
((vector(0/0)) >= 0) OR on() vector($VALUE)
), 0), 1) OR on() vector(1)

Where $VALUE can be either 0 ("do not fire when NaN is present") or 1 ("fire when NaN is present"). The default should be to fire, as NaN may signal a mistake/typo in the user's query -- with an option to disable this for queries where NaN values are expected.
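
As a sketch of the two substitutions: with $VALUE = 0 a NaN-producing query resolves to 0 and the alert stays silent, while with $VALUE = 1 it resolves to 1 and fires:

clamp_max(clamp_min(floor(
((vector(0/0)) >= 0) OR on() vector(0)
), 0), 1) OR on() vector(1)
== 0 (NaN present, do not fire)

clamp_max(clamp_min(floor(
((vector(0/0)) >= 0) OR on() vector(1)
), 0), 1) OR on() vector(1)
== 1 (NaN present, fire)

When the query returns an ordinary non-negative value, the >= 0 comparison keeps the sample, so the inner OR on() vector($VALUE) never applies and the expression behaves exactly like the original clamping logic.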