repo-updater: make syncer & scheduler aware of uncloned gitserver repositories
Created by: slimsag
This is a fix for #11091, which we have learned has become a critical, blocking issue for a large customer (details here) in the past day.
I chose to work after hours to implement this because there is a heavy sense of urgency. The implementation turned out to be trickier than I expected: it was not clear-cut to me where this logic should be structured and live, so it may make sense for someone with broader architectural design goals in mind to evaluate and adjust the implementation at a later date. This is not an unreasonable implementation; it just doesn't fit as seamlessly as I would like.
Prior to this change:
- repo-updater would instruct gitserver to update repositories if they were added, removed, or had their metadata changed on the code host.
- gitserver would be responsible for keeping its own repositories up-to-date for Git updates by periodically running `git pull`.
However, it was possible for these two to get out of sync. For example, starting from a completely clean slate when deploying a new Sourcegraph instance:
- repo-updater identifies thousands of new repositories on the code host not yet in the DB store.
- repo-updater instructs gitserver to clone the thousands of new repositories.
- gitserver's in-memory clone queue now has thousands of repositories and begins cloning.
- gitserver is restarted for any reason; its in-memory queue is now gone.
- gitserver does not have the repositories cloned and is completely unaware of them.
- repo-updater just naively assumes that gitserver has them cloned and does nothing to correct the situation.
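The sequence above can be reproduced with a toy model. This is only a sketch: the `gitserver` type, its fields, and `simulate` are hypothetical stand-ins for illustration, not the real gitserver code. The one assumption carried over from the description is that the clone queue lives in memory while cloned repositories persist across restarts.

```go
package main

import "fmt"

// gitserver is a hypothetical stand-in for the real service.
type gitserver struct {
	cloned map[string]bool // persisted state: survives a restart
	queue  []string        // in-memory clone queue: lost on restart
}

// enqueueClone models repo-updater asking gitserver to clone repositories.
func (g *gitserver) enqueueClone(names ...string) {
	g.queue = append(g.queue, names...)
}

// cloneNext clones the repository at the head of the queue, if any.
func (g *gitserver) cloneNext() {
	if len(g.queue) == 0 {
		return
	}
	g.cloned[g.queue[0]] = true
	g.queue = g.queue[1:]
}

// restart drops all in-memory state and keeps what was already cloned.
func (g *gitserver) restart() {
	g.queue = nil
}

// simulate enqueues four repositories, clones one, then restarts gitserver.
func simulate() (cloned, queued, known int) {
	repos := []string{"a", "b", "c", "d"}
	gs := &gitserver{cloned: map[string]bool{}}
	gs.enqueueClone(repos...)
	gs.cloneNext()
	gs.restart()
	return len(gs.cloned), len(gs.queue), len(repos)
}

func main() {
	cloned, queued, known := simulate()
	fmt.Printf("known=%d cloned=%d queued=%d\n", known, cloned, queued)
}
```

Running it prints `known=4 cloned=1 queued=0`: three repositories are still uncloned, nothing is queued, and neither side does anything about it.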
This has particularly bad effects for users & admins:
- Searches are incredibly slow and time out because, e.g., half of your several thousand repos are not indexed.
  - Why: zoekt can discover that the repositories exist, but when it attempts to fetch the tar archive it cannot, because gitserver says the repository doesn't exist.
- The admin UI shows "100,000 repositories enqueued for cloning..." yet the system never does anything.
  - Why: quite ironically, the status message code does compare with gitserver how many repositories are uncloned; we just do absolutely nothing for that case in the actual code.
- All logs, monitoring, and metrics just show "everything is fine" and "nothing is happening".
  - Why: from the perspective of gitserver, everything is cloned, and from the naive perspective of repo-updater, everything is cloned.
- There is no good workaround or way to correct this issue.
  - Why: you could delete all repositories and start over, but cloning takes a long time (and you had better hope it doesn't happen again, as was the case with this customer).
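For reviewers, here is a minimal sketch of the reconciliation idea behind this change: compare the repositories repo-updater knows about with those gitserver reports as cloned, and re-enqueue the difference. The `uncloned` function and the data shapes are hypothetical, chosen for illustration; the actual syncer and scheduler code in this PR is structured differently.

```go
package main

import (
	"fmt"
	"sort"
)

// uncloned returns the repositories that are known to repo-updater but
// not reported as cloned by gitserver. Both inputs are hypothetical
// simplifications of the real data.
func uncloned(known []string, cloned map[string]bool) []string {
	var missing []string
	for _, name := range known {
		if !cloned[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing) // deterministic order for scheduling and logs
	return missing
}

func main() {
	known := []string{"github.com/a/a", "github.com/b/b", "github.com/c/c"}
	cloned := map[string]bool{"github.com/b/b": true}

	// In the real system this is where the scheduler would be told to
	// (re-)enqueue each repository; here we only print the decision.
	for _, name := range uncloned(known, cloned) {
		fmt.Println("enqueue clone:", name)
	}
}
```

Running a comparison like this periodically means a gitserver restart only delays cloning rather than silently dropping it.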
Fixes #11091