insights: search can produce non-deterministic result counts when queries time out on a large search corpus
Created by: coury-clark
We have observed two behaviors when creating a code insight using streaming search with a large (millions of results) result set.
- Insights execute in two modes. In backfill mode we generate and execute one search query per repository, and each query is given the same timeout value (say 60 seconds). This means backfill mode gets a sum total of num_repos * 60s to perform searches. In snapshot mode we execute a single search query over all repositories with that same timeout value (say 60 seconds). As a result, the backfill typically has multiple orders of magnitude more time to search than the global query. If we were to give the global snapshot search the same amount of time (num_repos * timeout), we would likely see the opposite problem: because the search runs concurrently across shards, larger repos that would otherwise exceed the single-repo timeout would get more overall time.
- From testing in the UI, global searches appear to return non-deterministic results when the full result set cannot be produced within the timeout window. One hypothesis is that the search is spread across the zoekt shards non-deterministically (or perhaps the repos are sharded non-deterministically). This means that when we do hit a timeout, the result count at that point is not deterministic, which in turn produces non-deterministic insight values.
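To make the two observations concrete, here is a rough Python sketch. It is a toy model, not the insights or zoekt implementation: the function names, the (result_count, cost_seconds) shard representation, and the sequential shard loop are all assumptions made for illustration (real searches run concurrently across shards). It shows the time-budget gap between the two modes, and how visiting shards in an arbitrary order under a fixed timeout could yield different counts from run to run, as the hypothesis above suggests.

```python
import random


def backfill_budget_seconds(num_repos: int, timeout_s: float) -> float:
    """Total search time available when each repository gets its own query."""
    return num_repos * timeout_s


def snapshot_budget_seconds(timeout_s: float) -> float:
    """Total search time available to the single global query."""
    return timeout_s


def truncated_count(shards, timeout_s, rng):
    """Sum results from shards visited in a random order until time runs out.

    Each shard is a hypothetical (result_count, cost_seconds) pair. The shuffle
    stands in for whatever makes the shard order vary between runs.
    """
    order = list(shards)
    rng.shuffle(order)
    elapsed = 0.0
    total = 0
    for results, cost in order:
        if elapsed + cost > timeout_s:
            break  # timeout reached: remaining shards contribute nothing
        elapsed += cost
        total += results
    return total


if __name__ == "__main__":
    # With 1,000 repos and a 60s timeout, backfill gets 1,000x the search time.
    print(backfill_budget_seconds(1000, 60), snapshot_budget_seconds(60))

    shards = [(100, 30.0), (10, 30.0), (1, 30.0)]
    # Enough time for every shard: the count is stable across runs.
    print({truncated_count(shards, 90.0, random.Random(s)) for s in range(20)})
    # Time for only two of three shards: the count depends on the order.
    print({truncated_count(shards, 60.0, random.Random(s)) for s in range(50)})
```

In this model the count is stable whenever the budget covers every shard, and only becomes order-dependent once the timeout cuts the search short, which matches the behavior described above for large result sets.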
According to the search team, the timeout is set universally for an instance through the site config option maxTimeoutSeconds, which is set to 60 seconds.
The context for this option is that:
most customers are behind a load balancer that has its own timeout. For example, dot-com has a 60s timeout due to Cloudflare. This might be historical, and SSE with streaming would likely get around that.