repo-updater: Deterministic pagination from github and gitlab API
Created by: keegancsmith
Previously we did this so we would add recently pushed repositories first. However, with paged based pagination if a repository is pushed to while we are querying we may miss it. This is the likely cause of repositories occasionally being deleted at a customer.
Instead we use the default sort order. In the case of GitLab this is by created_at which is perfect since that will have a consistent order. For GitHub it is "best match" which assigns a score based on the query. This could not be deterministic. However, I would be surprised if it wasn't. There doesn't seem to be an option you can pass in to GitHub which is based on an immutable value.
Updated testdata with the following commands:
ruplacer --go sort=pushed\& ''
ruplacer --go order_by=last_activity_at\& ''
Merge request reports
Activity
Created by: slimsag
GitHub's docs seem to suggest we can do
sort=created
: https://developer.github.com/v3/repos/#list-user-repositoriesor am I wrong here
Created by: keegancsmith
i wonder if instead we need something stronger here. Like if we decide to delete a repo, we need to source all repos again just to confirm. eg if a repo is deleted while paginating, we could miss a repo. Doing it twice in case of deletion will solve this in general (assuming we arent caching)? (or make it exceptionally rare).
Created by: keegancsmith
For delete, what about we just keep them in a list, and try to explicitly fetch them to double check after the main streaming of sync is done?
If you directly fetch the repo it has a different meaning. Since the external service is configured to only return repos matching a search/etc. But directly fetching a repo doesn't mean it does match the search. But the idea of double checking may be needed to be honest if this PR still doesn't resolve complaints. Its a bit more complexity, but should give us a very high confidence.
But this approach won’t solve the problem for the very first sync, which we wouldn’t even know there is a missing repo.
FYI that is fine, since if we don't know a repo exists, you can't delete it either :P