
repo-updater: Deterministic pagination from GitHub and GitLab APIs

Merged Administrator requested to merge core/gh-deterministic-api-order into master

Created by: keegancsmith

Previously we sorted results so that we would add recently pushed repositories first. However, with page-based pagination, if a repository is pushed to while we are querying, we may miss it. This is the likely cause of repositories occasionally being deleted at a customer.

Instead we use the default sort order. In the case of GitLab this is by created_at, which is perfect since that gives a consistent order. For GitHub it is "best match", which assigns a score based on the query. This may not be deterministic, but I would be surprised if it wasn't. There doesn't seem to be an option you can pass to GitHub that sorts on an immutable value.

Updated testdata with the following commands:

```
ruplacer --go sort=pushed\& ''
ruplacer --go order_by=last_activity_at\& ''
```

Activity

  • Created by: slimsag

GitHub's docs seem to suggest we can do sort=created: https://developer.github.com/v3/repos/#list-user-repositories

    Or am I wrong here?

  • Created by: slimsag

    Oh, I see it doesn't matter because the default sorting on these APIs on GitHub is deterministic anyway (either sorting by full repo name or by created as a default)

  • Created by: keegancsmith

Created is probably better, but you are right: we should never miss a repo we previously had due to new repos being created. When I said GitHub didn't have it, I was looking at the wrong docs.

  • Created by: keegancsmith

I wonder if instead we need something stronger here. For example, if we decide to delete a repo, we could source all repos again just to confirm: if a repo is deleted while we are paginating, we could miss an unrelated repo. Doing the listing twice in the deletion case would solve this in general (assuming we aren't caching), or at least make it exceptionally rare.

  • Created by: keegancsmith

My idea around sourcing repos twice is a bit more complicated, so I'm going to punt on that for now and hope that sorting by created is good enough. However, I am concerned about the fact that we do hard deletes. That won't fly once we become a system of record.

  • Created by: unknwon

For delete, what if we just keep them in a list, and try to explicitly fetch them to double-check after the main streaming of the sync is done? But this approach won't solve the problem for the very first sync, where we wouldn't even know there is a missing repo.

  • Created by: keegancsmith

    > For delete, what about we just keep them in a list, and try to explicitly fetch them to double check after the main streaming of sync is done?

    Directly fetching the repo has a different meaning: the external service is configured to only return repos matching a search (etc.), and a direct fetch doesn't tell us the repo still matches that search. That said, the idea of double-checking may be needed, to be honest, if this PR still doesn't resolve the complaints. It's a bit more complexity, but should give us very high confidence.

    > But this approach won’t solve the problem for the very first sync, which we wouldn’t even know there is a missing repo.

    FYI that is fine: if we don't know a repo exists, we can't delete it either :P
