
metrics: fix broken error rate when count unset

Warren Gifford requested to merge nsc/error-rate-null into main

Created by: Strum355

We determine the error rate with a query along the lines of sum by (op) (increase(src_<root>_errors_total[5m])) / (sum by (op) (increase(src_<root>_total[5m])) + sum by (op) (increase(src_<root>_errors_total[5m]))) * 100. This works because on success we increment a "success" counter (src_<root>_total), and on error we increment an "error" counter (src_<root>_errors_total).

This falls apart when the "success" counter has never been incremented. The metric series is then "unset" (see the gaps in the "success" visualization on the right below), so the denominator has no sample for that op and the query returns nothing for it. The result is an apparent error rate of 0 despite a non-zero error count.

This PR addresses the issue by making sure the "success" counter always exists in the error case: on error, we also increment it by 0, which seeds the series at 0 without changing its value.

image

Test plan

Tested locally 👍 👍 👍
