gitserver rebalancing/sharding logic should be smarter
Created by: slimsag
If you introduce or remove a gitserver replica, the consistent hash on repo name means almost all repositories will be reassigned to another gitserver (example), which has negative consequences like:
- Most repositories will be recloned from the code host
- Most searches will remain fast (no re-indexing will be needed), but search results may load a bit slowly while repositories are cloning.
- Unindexed searches (non-master branches, commit/diff search, etc.) may be slower while repositories re-clone
- Users visiting repositories directly on Sourcegraph may be prompted to wait a few seconds while the repository reclones
Example:
You have 10,000 repositories across 3 gitserver instances:
- gitserver-1 contains repos 0 to 3,333
- gitserver-2 contains repos 3,333 to 6,666
- gitserver-3 contains repos 6,666 to 10,000
If you introduce a new gitserver-4, something like the following will happen:
- gitserver-1 now begins cloning repos previously assigned to gitserver-2
- gitserver-2 now begins cloning repos previously assigned to gitserver-3
- gitserver-3 now begins cloning repos previously assigned to gitserver-1
- gitserver-4 now begins cloning 1/4th of the repositories
The load will be even in the end, with each replica holding 1/4th of the repositories, but gitservers 1, 2, and 3 had their repositories unavailable for a period of time because everything got shuffled around and they had to reclone everything. What would be better (and what indexed-search does) is to merely shift 1/4th of the load to the new 4th replica, without the original replicas (effectively) starting from scratch (i.e., the assignment should take into account the data they already have).
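To make the difference concrete, here is a rough Python sketch comparing two assignment schemes when going from 3 to 4 replicas. It is not gitserver's or indexed-search's actual code: `assign_modulo` assumes a plain "hash of repo name modulo replica count" scheme, which reproduces the shuffling described above, and `assign_rendezvous` uses rendezvous (highest random weight) hashing as one possible scheme that only moves data to the new replica. The repo names and helper names are made up for illustration.

```python
import hashlib


def h(key: str) -> int:
    # Stable 64-bit hash of a string (Python's built-in hash() is salted per process).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def assign_modulo(repo: str, shards: list[str]) -> str:
    # Assumed naive scheme: hash the repo name, take it modulo the replica count.
    return shards[h(repo) % len(shards)]


def assign_rendezvous(repo: str, shards: list[str]) -> str:
    # Rendezvous hashing: score every (shard, repo) pair, pick the highest score.
    # The shard hostname is part of the hash input.
    return max(shards, key=lambda shard: h(shard + "/" + repo))


def moved(assign, repos: list[str], before: list[str], after: list[str]) -> list[str]:
    # Repos whose owning shard changes between the two replica sets.
    return [r for r in repos if assign(r, before) != assign(r, after)]


# Hypothetical repo names, 10,000 as in the example above.
repos = [f"github.com/example/repo-{i}" for i in range(10_000)]
before = ["gitserver-1", "gitserver-2", "gitserver-3"]
after = before + ["gitserver-4"]

modulo_moved = moved(assign_modulo, repos, before, after)
rendezvous_moved = moved(assign_rendezvous, repos, before, after)

print(f"modulo:     {len(modulo_moved) / len(repos):.0%} of repos reassigned")
print(f"rendezvous: {len(rendezvous_moved) / len(repos):.0%} of repos reassigned")
```

Under these assumptions the modulo scheme reassigns roughly three quarters of the repositories, while the rendezvous scheme reassigns roughly one quarter, and every repository it moves goes to gitserver-4. The original three replicas keep everything else they already cloned.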
Additionally, if a gitserver replica goes down for an extended period of time it becomes an outage of that entire subset of repositories, instead of the load rebalancing across shards.
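As a sketch of what rebalancing on failure could look like, the snippet below removes one replica from the set and reassigns only its repositories, again using rendezvous hashing as an assumed scheme (not something gitserver does today). The `assign` helper, the repo names, and the idea of a "healthy replicas" list are all hypothetical.

```python
import hashlib
from collections import Counter


def h(key: str) -> int:
    # Stable 64-bit hash of a string.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def assign(repo: str, shards: list[str]) -> str:
    # Rendezvous hashing: the shard with the highest score for this repo owns it.
    return max(shards, key=lambda shard: h(shard + "/" + repo))


repos = [f"github.com/example/repo-{i}" for i in range(10_000)]
all_shards = ["gitserver-1", "gitserver-2", "gitserver-3"]
# Hypothetical: gitserver-2 has been down long enough to be dropped from the set.
healthy = [s for s in all_shards if s != "gitserver-2"]

orphaned = [r for r in repos if assign(r, all_shards) == "gitserver-2"]
unaffected = [r for r in repos if assign(r, all_shards) != "gitserver-2"]

# Where the orphaned repos land once only healthy replicas are considered.
new_owners = Counter(assign(r, healthy) for r in orphaned)
print(new_owners)
```

Repositories that were not on the failed replica keep their owner, and the orphaned ones are split across the remaining replicas rather than staying unavailable. They would still need to be recloned on their new owner, so this trades an outage for temporary clone load.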
indexed-search does not have these same issues, because it shards based on the hostname. We should do the same for gitserver, but care must be taken to ensure we respect the existing sharding scheme or migrate it appropriately, so there is no service degradation for instances upgrading to this new scheme.