search: Deduplicate results w.r.t. repositories

Administrator requested to merge core/dedup-shards into master

Created by: keegancsmith

During a rebalancing event, a repository can be indexed on multiple nodes. This is temporary, but it leads to us returning duplicate results. To avoid this, we do not aggregate a FileMatch from a shard if another shard has already returned a result for the same repository.
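As a rough illustration of the rule described above, here is a minimal sketch in Go. It is not the code in this change: the `FileMatch` fields, the `dedupAggregator` type, and its `add` method are simplified, hypothetical names chosen for the example. It assumes each shard's results arrive as one batch, and that a repository is "claimed" by the first shard that returns a match for it.

```go
package main

import "fmt"

// FileMatch is a simplified, hypothetical stand-in for the real result type.
type FileMatch struct {
	Repository string
	FileName   string
}

// dedupAggregator (hypothetical) merges per-shard results and drops matches
// for repositories that an earlier shard has already returned.
type dedupAggregator struct {
	seen  map[string]struct{} // repositories claimed by earlier shards
	files []FileMatch         // aggregated, deduplicated matches
}

func newDedupAggregator() *dedupAggregator {
	return &dedupAggregator{seen: map[string]struct{}{}}
}

// add merges the matches returned by a single shard.
func (a *dedupAggregator) add(shard []FileMatch) {
	// Track repositories claimed by this shard separately, so that several
	// matches from the same shard for the same repository are all kept.
	claimed := map[string]struct{}{}
	for _, fm := range shard {
		if _, dup := a.seen[fm.Repository]; dup {
			continue // another shard already returned this repository
		}
		a.files = append(a.files, fm)
		claimed[fm.Repository] = struct{}{}
	}
	for repo := range claimed {
		a.seen[repo] = struct{}{}
	}
}

func main() {
	agg := newDedupAggregator()
	// Shard 1 holds repo-a. During rebalancing, shard 2 also holds repo-a.
	agg.add([]FileMatch{{"repo-a", "main.go"}, {"repo-a", "util.go"}})
	agg.add([]FileMatch{{"repo-a", "main.go"}, {"repo-b", "README.md"}})
	for _, fm := range agg.files {
		fmt.Println(fm.Repository, fm.FileName)
	}
}
```

In this sketch the `seen` map holds at most one entry per repository that produced a match, which is why it stays smaller than the aggregated slice of matches.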

Notes:

  • RFC 30 encourages us to split up the RepoSet to avoid duplicate results. This is something we may still do, but this is a more incremental change and will allow us to focus on making indexing more intelligent first.
  • The map this allocates can be as large as our RepoSet. However, this should be much smaller in practice, and will always be smaller than the memory used by []FileMatch. As such this shouldn't impact performance.
  • We don't dedup List since its only caller already does deduplication. We will address this coupling later; it is noted in the code.
  • The Stats struct has fields for MatchCount and FileCount. We don't update these fields w.r.t. deduplication since we do not consume them. As with the point above, we can address this at a later stage.

Test plan: Unit tests

Part of https://github.com/sourcegraph/sourcegraph/issues/5725
