Skip to content

search: correctly check if a timeout has ocurred

Administrator requested to merge sg/no-suppress-alert into master

Created by: slimsag

In #7020 I thought I was exhaustive enough by checking for timeouts via effectively the following condition:

err == nil && len(timedout) == len(repositoriesSearched)

It was later uncovered that unfortunately diff, commit, and repo search violated our search API promises and did not properly fill out the repositoriesSearched list as they should have. This broke commit, diff, and repo search (#7090) and was released as 3.10.2 (no customers upgraded AFAIK).

To fix the issue, I corrected the faulty commit/diff/repo search implementations in both non-timeout and timeout cases on both single-container and sourcegraph.com instances.

However, upon releasing 3.10.3 with this patch it became clear that something was still wrong. On cluster deployments with more than 50 repositories, diff and commit search would show the following timeout alert instead of the expected "diff/commit search is limited to 50 repositories" warning alert:

image

To fix this, I have finally conceded and made the conditional here somewhat defensive:

err == nil && len(timedout) > 0 && len(timedout) == len(repositoriesSearched

This is needed because it can be the case that we get no error (err == nil), no results (timedout = [], repositoriesSearched = []`) AND still get a search alert as an "error" ("diff/commit search is limited to 50 repositories").

Ultimately, I place fault on two areas:

  1. A substantial lack of proper end-to-end regression tests (issue #7102) which would have caught the first issue above but not the second.
  2. A large amount of complexity in our search API around what constitues a failure condition for a search exactly (an error? all repositories timing out? an alert and no results? something else I haven't thought of?).
  3. Technical debt we have failed to pay off and have accrued in the search codebase which has led to multiple complex nesting layers ("we will just wrap X because handling this problem properly is too involves and would require large refactors").

I am confident in this change and it should once and for all fix #7090, and I will do yet another 10.3.4 patch release with this change, but I would not be doing diligence if I did not stress and emphasize the fact that making such a relatively small and simple change to the search codebase has proven so damn hard and error-prone. We have a problem (not to be solved here).

Fixes #7090

Merge request reports

Loading