repoupdater/github: Fetch GitHub repositories in batches

Administrator requested to merge core/github-batch into master

Created by: mrnugget

This PR adds a new method to the GitHub client, GetRepositoriesByNameWithOwnerFromAPI, which allows the caller to fetch repositories in batches, using the GraphQL API. It should fix https://github.com/sourcegraph/sourcegraph/issues/3907 but...

IMPORTANT: GitHub's GraphQL API only supports fetching repositories in batches by GraphQL node ID. But we don't have the node IDs at hand when doing our first sync; we only have a list of "owner/repository-name" strings. That means we can't use the existing getRepositoryByNodeIDFromAPI.

Instead I decided to build GraphQL queries on the fly, using aliases to fetch multiple repositories in the same query. Here's an example query:

fragment RepositoryFields on Repository {
  id
  nameWithOwner
  description
  url
  isPrivate
  isFork
  isArchived
}

{
  repo_sourcegraph_repository_1: repository(owner: "sourcegraph", name: "repository-1") {
    ... on Repository {
      ...RepositoryFields
    }
  }
  repo_sourcegraph_repository_2: repository(owner: "sourcegraph", name: "repository-2") {
    ... on Repository {
      ...RepositoryFields
    }
  }
  repo_sourcegraph_repository_3: repository(owner: "sourcegraph", name: "repository-3") {
    ... on Repository {
      ...RepositoryFields
    }
  }
}

Of course I tried to find out how many repositories can be queried in a single request like this, but neither the GitHub API documentation nor the GraphQL spec was any help here (pointers appreciated!). So I did some experimentation and, as it turns out, the API stops responding with results when more than 37 repositories are specified in one query.

Since 37 is a weird number I rounded it down to 30 and used that as the batch size.
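
For illustration, splitting the list of names into batches of that size could look like the sketch below. The function name batches and the constant batchSize are hypothetical and not taken from this PR; the only value carried over is the batch size of 30.

```go
package main

import "fmt"

// batchSize is the value chosen above: safely below the observed limit of 37.
const batchSize = 30

// batches (hypothetical helper) splits names into consecutive slices of at
// most size elements. The returned slices share memory with names.
func batches(names []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var out [][]string
	for start := 0; start < len(names); start += size {
		end := start + size
		if end > len(names) {
			end = len(names)
		}
		out = append(out, names[start:end])
	}
	return out
}

func main() {
	// 100 repositories, as in the manual test described in the test plan.
	names := make([]string, 100)
	for i := range names {
		names[i] = fmt.Sprintf("sourcegraph/repository-%d", i+1)
	}
	for i, b := range batches(names, batchSize) {
		fmt.Printf("batch %d: %d repositories\n", i+1, len(b))
	}
}
```

With 100 names this yields four batches: three of 30 repositories and one of 10, so four API requests in total.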

But still: since I'm a GraphQL newbie I feel like I'm relying on undefined behavior and I'm not sure whether the GitHub API will keep it up.

Please see this PR as an idea/proposal and cause for discussion: I would love to know what you think about using the API like this and what I'm possibly missing.

Test plan: go test, plus manual testing with an external service configuration that contains 100 repos.