Highly available gitserver

Created by: tsenart

Context

While gitserver is already sharded for scalability, it is not highly available — if one shard goes down, all requests for repos in that shard will fail.

Outcomes

These are the outcomes we want from enabling a replication factor greater than one.

Zero downtime deployments — if one repo is in more than one shard, we can do a rolling update of all gitserver instances without introducing downtime.
Load balancing of hot repos and expensive commands — if one repo is very popular, we can spread the load across all the shards that have that repo.
First class rebalancing support — we should be able to easily add or remove shards and the system should rebalance the repos in each shard without introducing downtime or degradation of service, pacing data movement if necessary.
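To make the replication and rebalancing outcomes above concrete, here is a minimal sketch of one way a repo could be mapped to several shards. It uses rendezvous (highest random weight) hashing as a stand-in, not the consistent hashing scheme gitserver uses today. The names `score` and `ReplicasFor` and the shard names are hypothetical, chosen only for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// score returns a deterministic pseudo-random weight for a (shard, repo) pair.
// Hypothetical helper: FNV-1a is used only because it is in the standard library.
func score(shard, repo string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(shard))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	h.Write([]byte(repo))
	return h.Sum64()
}

// ReplicasFor returns the rf shards with the highest scores for repo.
// rf is the replication factor; it is capped at the number of shards.
func ReplicasFor(repo string, shards []string, rf int) []string {
	if rf > len(shards) {
		rf = len(shards)
	}
	if rf < 0 {
		rf = 0
	}
	// Copy so the caller's slice is not reordered.
	ranked := append([]string(nil), shards...)
	sort.Slice(ranked, func(i, j int) bool {
		si, sj := score(ranked[i], repo), score(ranked[j], repo)
		if si != sj {
			return si > sj
		}
		return ranked[i] < ranked[j] // deterministic tie-break
	})
	return ranked[:rf]
}

func main() {
	// Assumed shard names, for illustration only.
	shards := []string{"gitserver-0", "gitserver-1", "gitserver-2"}
	fmt.Println(ReplicasFor("github.com/sourcegraph/sourcegraph", shards, 2))
}
```

With this kind of placement, adding a shard can only insert the new shard into a repo's replica set. Existing replicas are never shuffled among the old shards, so the data that has to move is limited to the repos that land on the new shard.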

Open questions

Should we be able to specify a different replication factor for specific repos that are more popular / important?
Do we need better load balancing infrastructure before we tackle this? Currently, the list of gitserver addresses is hardcoded in config.
How do we factor other inputs into the sharding logic, beyond the simple consistent hashing scheme we have today?
- Repo size
- Repo popularity / access frequency
- ...
Could we leverage GitLab's Gitaly? What would the migration effort be like?
What kind of consistency do we need between the different shards? Do we need a single repo to be updated consistently across all shards it is in? Or could we have gitserver client be smart enough to talk to the most up-to-date shard by default, taking into account other constraints such as load and availability?
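As a rough illustration of the "smart client" option in the last question, the sketch below picks a replica by skipping unavailable ones, keeping those within a tolerated lag of the freshest copy, and then preferring the least loaded. Everything here is an assumption made for the example: the `Replica` type, its fields, `PickReplica`, and the idea of measuring freshness as a Unix timestamp of the last successful fetch are not part of gitserver's current client.

```go
package main

import "fmt"

// Replica is a hypothetical view of one shard's copy of a repo.
type Replica struct {
	Addr            string
	Healthy         bool  // false if the shard is down or draining
	LastFetchedUnix int64 // assumed freshness signal: time of last successful fetch
	InFlight        int   // assumed load signal: requests currently in progress
}

// PickReplica returns the least-loaded healthy replica whose copy is at most
// maxLagSeconds behind the freshest healthy copy. It returns false if no
// replica is healthy.
func PickReplica(replicas []Replica, maxLagSeconds int64) (Replica, bool) {
	// First pass: find the freshest healthy copy.
	var freshest int64
	found := false
	for _, r := range replicas {
		if r.Healthy && (!found || r.LastFetchedUnix > freshest) {
			freshest, found = r.LastFetchedUnix, true
		}
	}
	if !found {
		return Replica{}, false
	}

	// Second pass: among fresh-enough healthy replicas, prefer the lowest load.
	var best Replica
	picked := false
	for _, r := range replicas {
		if !r.Healthy || freshest-r.LastFetchedUnix > maxLagSeconds {
			continue
		}
		if !picked || r.InFlight < best.InFlight ||
			(r.InFlight == best.InFlight && r.Addr < best.Addr) {
			best, picked = r, true
		}
	}
	return best, true
}

func main() {
	replicas := []Replica{
		{Addr: "gitserver-0", Healthy: true, LastFetchedUnix: 1000, InFlight: 9},
		{Addr: "gitserver-1", Healthy: true, LastFetchedUnix: 990, InFlight: 2},
		{Addr: "gitserver-2", Healthy: false, LastFetchedUnix: 1005, InFlight: 0},
	}
	if r, ok := PickReplica(replicas, 30); ok {
		fmt.Println(r.Addr)
	}
}
```

Setting the tolerated lag to zero makes the client always read from the most up-to-date healthy replica, at the cost of concentrating load on it. A larger lag trades freshness for better load spreading.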

Reading material / Prior art

GitHub:

GitLab:

https://docs.gitlab.com/ee/administration/gitaly/