Handle renames and deletions of mirrored repositories
Created by: sqs
This is an umbrella issue that covers generally cleaning up the problem where code host entries in site config (and `repos.list`) merely create repositories; they do not "manage" the active set of repositories. This causes the following (specific) problems:
- If a GitHub token grants access to repos `a` and `b`, and the token's owner later loses access to `b`, the `b` repo remains on the Sourcegraph instance (but is not git-updated)
- If a repository is renamed on the code host, both the old and new repos will be added on Sourcegraph
- If a repository is deleted on the code host, it remains on Sourcegraph (from the backend POV, this is equivalent to the token-no-longer-can-access case above)
- If a repository's metadata is changed such that it is no longer in the set of repositories-to-sync (e.g., a GitHub `repositoryQuery` excludes archived repos and someone archives a repo on GitHub), it remains on Sourcegraph
Fixing this issue also addresses the following problems:
- https://github.com/sourcegraph/sourcegraph/issues/383
- https://github.com/sourcegraph/sourcegraph/issues/368
- https://github.com/sourcegraph/sourcegraph/issues/108
In general, the principle we want is: "You tell Sourcegraph which repositories you want on Sourcegraph. Sourcegraph upholds that contract continuously." So if you want archived repositories to not exist on Sourcegraph, then you need to specify a `repositoryQuery` that omits archived repositories. If a GitHub repository is archived, then Sourcegraph will automatically remove it. Another way to say this is that the site config is essentially a constraint you specify that Sourcegraph always seeks to satisfy in the current instant.
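One way to read this principle is as a reconciliation step: compare the set of repositories the config selects with the set currently on the instance, and derive what to add and what to stop mirroring. The following is a minimal Python sketch, assuming repositories are identified by name strings; `reconcile` and `Plan` are hypothetical names, not Sourcegraph's actual API.

```python
from typing import NamedTuple


class Plan(NamedTuple):
    to_add: set[str]     # selected by config but not yet on the instance
    to_remove: set[str]  # on the instance but no longer selected by config


def reconcile(desired: set[str], current: set[str]) -> Plan:
    """Return the changes needed to make `current` match `desired`."""
    return Plan(to_add=desired - current, to_remove=current - desired)


# Config now selects a and c; the instance still holds a and b.
plan = reconcile(
    {"github.com/org/a", "github.com/org/c"},
    {"github.com/org/a", "github.com/org/b"},
)
print(plan.to_add)     # {'github.com/org/c'}
print(plan.to_remove)  # {'github.com/org/b'}
```

Running this on every sync pass, rather than only when a repository is first seen, is what would make the config behave like a continuously enforced constraint.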
Robustness
However, we don't want this to result in data loss. It needs to be robust to (1) unexpected errors from the code host API and (2) typos in site config. If one of those things happens and suddenly a previously included repository is no longer included, we do not want to permanently delete all data associated with the repository. (Today this really just includes code discussions but will likely include other info in the future.) So if Sourcegraph detects that a repository is no longer included, it should disable it on Sourcegraph.
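The requirement to survive code host API errors could be sketched like this: if listing repositories fails, the sync pass makes no changes at all, rather than treating the failure as an empty result. This is only an illustration; `sync_once`, `list_remote`, and `CodeHostError` are hypothetical names.

```python
from typing import Callable, Optional


class CodeHostError(Exception):
    """Hypothetical error raised when the code host API call fails."""


def sync_once(
    list_remote: Callable[[], set[str]], enabled: set[str]
) -> Optional[set[str]]:
    """Return the repos to disable, or None if the listing failed.

    A failed listing must not be mistaken for "no repositories selected",
    otherwise a single API outage would disable every repository.
    """
    try:
        desired = list_remote()
    except CodeHostError:
        return None  # keep the current state; retry on the next pass
    return enabled - desired


def failing() -> set[str]:
    raise CodeHostError("503 from code host")


print(sync_once(failing, {"a", "b"}))        # None
print(sync_once(lambda: {"a"}, {"a", "b"}))  # {'b'}
```

A config typo is harder to detect than an API error, because it yields a valid but wrong listing. That is why the result here is a set to disable, not a set to delete.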
This entails another addition, I think: there should be a "disable reason" that explains to site admins why a repo was disabled (e.g., "repository deleted on GitHub.com" or "repository not included by `repositoryQuery`").
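A disable reason could be modelled as an extra field stored next to the enabled flag. The sketch below assumes a simple in-memory record; `Repo`, `DisableReason`, `disable`, and `enable` are hypothetical and do not reflect Sourcegraph's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DisableReason(Enum):
    # Hypothetical reasons; the wording mirrors the examples in this issue.
    DELETED_ON_CODE_HOST = "repository deleted on GitHub.com"
    NOT_IN_QUERY = "repository not included by repositoryQuery"


@dataclass
class Repo:
    name: str
    enabled: bool = True
    disable_reason: Optional[DisableReason] = None


def disable(repo: Repo, reason: DisableReason) -> None:
    """Soft-disable: flip the flag and record why, keeping the record itself."""
    repo.enabled = False
    repo.disable_reason = reason


def enable(repo: Repo) -> None:
    """Re-enable a repo that is selected again and clear the stale reason."""
    repo.enabled = True
    repo.disable_reason = None


repo = Repo("github.com/org/b")
disable(repo, DisableReason.NOT_IN_QUERY)
print(repo.enabled, repo.disable_reason.value)
# False repository not included by repositoryQuery
```

Because the record is never deleted, fixing the config and calling `enable` restores the repository with its associated data intact.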
Of course, this should be smart in the case of renames; if there is a rename, it should just rename the repo on Sourcegraph, too, not disable the old name and create a new repo with the new name.
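Rename handling could work by matching repositories on a stable identifier from the code host rather than on their name. The sketch below assumes each repository carries such an ID (GitHub, for example, exposes a repository ID that does not change on rename). `detect_renames` and the dictionary layout are hypothetical.

```python
def detect_renames(current: dict[str, str], remote: dict[str, str]) -> dict[str, str]:
    """Map old name -> new name for repos whose external ID is unchanged.

    Both arguments map external ID -> repository name.
    """
    return {
        current[ext_id]: remote[ext_id]
        for ext_id in current.keys() & remote.keys()
        if current[ext_id] != remote[ext_id]
    }


current = {"id-1": "github.com/org/old-name", "id-2": "github.com/org/stable"}
remote = {"id-1": "github.com/org/new-name", "id-2": "github.com/org/stable"}
print(detect_renames(current, remote))
# {'github.com/org/old-name': 'github.com/org/new-name'}
```

Applying renames before computing which repositories to add or disable would prevent the old name from being disabled and the new name from being created as a separate repo.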