metrics: fix broken error rate when count unset (!40357)

Administrator requested to merge nsc/error-rate-null into main Aug 15, 2022

Created by: Strum355

We determine the error rate with a query along the lines of:

    sum by (op) (increase(src_<root>_errors_total[5m])) / (sum by (op) (increase(src_<root>_total[5m])) + sum by (op) (increase(src_<root>_errors_total[5m]))) * 100

This works because on success we increment a "success" counter (src_<root>_total), and on error we increment an "error" counter (src_<root>_errors_total), so the denominator is the total number of operations.

This falls apart when the "success" counter has never been incremented. In that case the metric series is "unset" (see the gaps in the "success" visualization on the right below), and the query reports an apparent error rate of 0 despite a non-zero error count.

This PR addresses the issue by incrementing the "success" counter by 0 in the error case, so the series is always seeded with a value of at least 0.

Test plan

Tested locally 👍 👍 👍
