search: Deduplicate results w.r.t. repositories (!6266) · Merge requests · Warren Gifford / sourcegraph

Warren Gifford requested to merge core/dedup-shards into master Oct 30, 2019

Created by: keegancsmith

During a rebalancing event, repositories can be indexed on multiple nodes. This is temporary, but it leads to us returning duplicate results. To avoid this, we do not aggregate a file match from a shard if another shard has already returned a result for the same repository.
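
For illustration, the rule above can be sketched roughly as follows. This is a minimal sketch and not the code in this change: the `FileMatch` fields, the `dedupper` type, and the use of a string to identify a shard are all assumptions made for the example.

```go
package main

// FileMatch is a simplified, hypothetical stand-in for a search result.
// The real type has more fields; only the repository name matters here.
type FileMatch struct {
	Repository string
	FileName   string
}

// dedupper (hypothetical name) remembers which shard first returned a
// result for each repository. It is not safe for concurrent use; a caller
// aggregating from several goroutines would need to guard it with a mutex.
type dedupper struct {
	owner map[string]string // repository name -> shard that first reported it
}

func newDedupper() *dedupper {
	return &dedupper{owner: make(map[string]string)}
}

// Dedup returns the subset of fms that should be aggregated. A file match
// is dropped when a different shard has already returned a result for the
// same repository. Several matches from the same shard are all kept.
func (d *dedupper) Dedup(shard string, fms []FileMatch) []FileMatch {
	kept := make([]FileMatch, 0, len(fms))
	for _, fm := range fms {
		first, seen := d.owner[fm.Repository]
		if !seen {
			// This shard is the first to report the repository.
			d.owner[fm.Repository] = shard
			first = shard
		}
		if first == shard {
			kept = append(kept, fm)
		}
	}
	return kept
}
```

In this sketch the map gains at most one entry per repository that has returned a result, so it cannot grow beyond the number of repositories searched.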

Notes:

- RFC 30 encourages us to split up the RepoSet to avoid duplicate results. This is something we may still do, but this change is more incremental and lets us focus on making indexing more intelligent first.
- The map this allocates can be as large as our RepoSet. However, it should be much smaller in practice, and will always be smaller than the memory used by []FileMatch. As such, this shouldn't impact performance.
- We don't dedup List, since its only use already does deduplication. We will address this coupling later; it is noted in the code.
- The Stats struct has fields for MatchCount and FileCount. We don't update these fields w.r.t. deduplication since we do not consume them. As with the point above, we can address this at a later stage.

Test plan: Unit tests

Part of https://github.com/sourcegraph/sourcegraph/issues/5725
