search: ignore dial errors during zoekt rollout

Administrator requested to merge k/zoekt-connection-error into main

Created by: keegancsmith

This extends our existing pattern of ignoring errors caused by a zoekt rollout. Since 2022-01-18 we have been encountering dial errors during zoekt rollouts. We suspect a change in Kubernetes/GCE networking or service discovery.

We extracted these errors from Honeycomb and correlated them with rollouts. In particular, the i/o timeout error was occurring often enough to trigger our alert thresholds.

Observability was extended to record a reason in Prometheus and in traces, since there can now be more than one reason for ignoring an error. Additionally, a minor observability bug was fixed where we also counted errors that were not dns.IsNotFound.
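As a rough illustration of the idea (not the code in this change), the sketch below classifies an error into a reason string using only the stdlib net package. The helper name `rolloutReason`, the reason labels, the sample errors and the map used in place of a Prometheus counter are all hypothetical; it also assumes that dns.IsNotFound refers to the `IsNotFound` field on `net.DNSError`.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
)

// rolloutReason is a hypothetical helper. It returns a non-empty reason when
// err looks like something we would expect during a rollout, and "" when the
// error should be reported as usual. The reason labels are made up.
func rolloutReason(err error) string {
	// DNS lookup failed because the host does not exist (yet).
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) && dnsErr.IsNotFound {
		return "dns-not-found"
	}
	// Timed out while establishing the connection (the "i/o timeout" case).
	// Timeouts on other operations, such as reads, are deliberately not matched.
	var opErr *net.OpError
	if errors.As(err, &opErr) && opErr.Op == "dial" && opErr.Timeout() {
		return "dial-timeout"
	}
	return ""
}

// Hypothetical sample errors, constructed by hand for the demonstration.
var (
	errSampleNotFound error = &net.OpError{Op: "dial", Net: "tcp", Err: &net.DNSError{
		Err: "no such host", Name: "zoekt.example", IsNotFound: true,
	}}
	errSampleDialTimeout error = &net.OpError{Op: "dial", Net: "tcp", Err: os.ErrDeadlineExceeded}
	errSampleReadTimeout error = &net.OpError{Op: "read", Net: "tcp", Err: os.ErrDeadlineExceeded}
	errSampleOther       error = errors.New("unrelated failure")
)

func main() {
	// A plain map stands in for a counter labelled by reason.
	ignored := map[string]int{}
	samples := []error{errSampleNotFound, errSampleDialTimeout, errSampleReadTimeout, errSampleOther}
	for _, err := range samples {
		if reason := rolloutReason(err); reason != "" {
			ignored[reason]++
			continue
		}
		fmt.Println("not ignored:", err)
	}
	fmt.Println("ignored by reason:", ignored)
}
```

Returning a reason string rather than a bool keeps the decision to ignore an error and the label recorded for it in one place, so they cannot drift apart.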

Test Plan: Just unit tests. I'm confident in the code after a lot of exploring of our instrumentation and reading the stdlib's net package.

Fixes https://github.com/sourcegraph/sourcegraph/issues/30795
