monitoring: Update syncer sync errors alerting

Warren Gifford requested to merge ig/syncer_sync_error_rate into main

Created by: indradhanush

In this commit we:

  1. Add a Warning alert that fires if the sync error rate is greater than 0.5 over 10 minutes.
  2. Modify the Critical alert to fire only if the sync error rate is greater than 1 over 10 minutes.
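
For reviewers who want to reason about the two thresholds, here is a minimal sketch of the intended behaviour in Python. It is not the monitoring definition from this change. The function name `alert_level` and the sample format are hypothetical, and it assumes that "over 10 minutes" means the value stays above the threshold for every sample in the last 10 minutes.

```python
from typing import Sequence, Tuple

WARNING_THRESHOLD = 0.5   # from the description above
CRITICAL_THRESHOLD = 1.0  # from the description above
WINDOW_SECONDS = 10 * 60  # "over 10 minutes"


def alert_level(samples: Sequence[Tuple[float, float]], now: float) -> str:
    """Return "critical", "warning", or "ok" for (timestamp, value) samples.

    Assumption: an alert fires only when every sample inside the window
    exceeds the threshold, i.e. the condition is sustained, not a single spike.
    """
    # Keep only the values observed during the last WINDOW_SECONDS.
    window = [v for t, v in samples if now - WINDOW_SECONDS <= t <= now]
    if not window:
        return "ok"  # no data in the window: treated as not firing in this sketch
    lowest = min(window)
    if lowest > CRITICAL_THRESHOLD:
        return "critical"
    if lowest > WARNING_THRESHOLD:
        return "warning"
    return "ok"
```

Under that assumption, a brief spike that recovers within the window does not fire either alert, which is the behaviour this change is after.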

Over approximately the last six months, the value of this metric has rarely gone above 0.5, and when it has, it has recovered almost immediately. Only once has it gone above 1.

At the moment we get paged for this alert on what might be intermittent errors on the code host's end that resolve immediately, which reduces the signal-to-noise ratio of our alerting.

Link to production dashboard for reference: https://sourcegraph.com/-/debug/grafana/explore?orgId=1&left=%5B%221614537000000%22,%221629484199000%22,%22Prometheus%22,%7B%22expr%22:%22max%20by%20(family)%20(rate(src_repoupdater_syncer_sync_errors_total%7Bowner!%3D%5C%22user%5C%22%7D%5B5m%5D))%20OR%20on()%20vector(0)%22,%22datasource%22:%22Prometheus%22,%22exemplar%22:true,%22requestId%22:%22Q-6ba88982-7cc8-4c07-9135-56645d70c8ca-0A%22%7D%5D

Screenshot from the above linked dashboard: image
