repos: streaming syncer

Created by: tsenart

This PR introduces streaming repo syncing, feature-flagged behind the ENABLE_STREAMING_REPOS_SYNCER environment variable. I structured this change to be as reviewable as possible, while being as well tested as the batch implementation. Once we have gained confidence in the streaming implementation by running it in production for a while, we can get rid of the batch one and the feature flag, which will be a massive code burn.

Streaming syncing is a prerequisite for the search core team to keep ranking data for 5.5M repos up to date over time. In recent explorations, I was able to use GitHub's search via their GraphQL API to sync 1.1M repos in under 6h, without exhausting a single token's rate limit quota. This finding contradicts our long-standing assumption that syncing all repos on dotcom isn't feasible, and will enable us to get rid of many special cases we introduced for that reason.

The end goal is to have the following site-level external service continually syncing. I have overcome the GitHub API's 1000 result limit on searches by detecting when a request is capped and refining the original search query to be narrower, leveraging both the stars filter and the created filter. Here's a sneak peek of that logic, which should come to a Sourcegraph deployment close to you soon.

{
    "url": "https://github.com",
    "token": "REDACTED",
    "repositoryQuery": ["stars:1..500000"]
}
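
To illustrate the query refinement described above, here is a minimal sketch of how a capped search could be narrowed. It is not the code in this PR: the `Query` type, the `Refine` function, the `count` callback and the choice to bisect ranges are all assumptions made for illustration, and creation dates are reduced to whole years for brevity. It splits the stars range first and only falls back to the created range once the stars range cannot be split any further.

```go
package main

import "fmt"

// searchCap mirrors the 1000 result limit on GitHub searches.
const searchCap = 1000

// Query is a hypothetical search window with inclusive bounds.
type Query struct {
	MinStars, MaxStars     int
	CreatedFrom, CreatedTo int // creation years, kept coarse for brevity
}

// String renders the window using GitHub's range qualifiers.
func (q Query) String() string {
	return fmt.Sprintf("stars:%d..%d created:%04d-01-01..%04d-12-31",
		q.MinStars, q.MaxStars, q.CreatedFrom, q.CreatedTo)
}

// Refine bisects q until every returned window matches at most searchCap
// results according to count, which stands in for asking the API how many
// results a query has. Windows with no results are dropped.
func Refine(q Query, count func(Query) int) []Query {
	n := count(q)
	if n == 0 {
		return nil
	}
	if n <= searchCap {
		return []Query{q}
	}

	lo, hi := q, q
	switch {
	case q.MinStars < q.MaxStars:
		// Narrow the stars filter first.
		mid := q.MinStars + (q.MaxStars-q.MinStars)/2
		lo.MaxStars, hi.MinStars = mid, mid+1
	case q.CreatedFrom < q.CreatedTo:
		// A single star value is still capped: narrow the created filter.
		mid := q.CreatedFrom + (q.CreatedTo-q.CreatedFrom)/2
		lo.CreatedTo, hi.CreatedFrom = mid, mid+1
	default:
		// Nothing left to split in this sketch; the window stays capped.
		return []Query{q}
	}
	return append(Refine(lo, count), Refine(hi, count)...)
}

func main() {
	// Stand-in for a real API call: pretend each star value from 1 to 50
	// has exactly 100 repositories, whatever the creation year.
	count := func(q Query) int {
		lo, hi := q.MinStars, q.MaxStars
		if hi > 50 {
			hi = 50
		}
		if lo > hi {
			return 0
		}
		return (hi - lo + 1) * 100
	}

	start := Query{MinStars: 1, MaxStars: 500000, CreatedFrom: 2008, CreatedTo: 2020}
	for _, q := range Refine(start, count) {
		fmt.Println(q)
	}
}
```

Bisecting keeps the sketch short. A real implementation would likely choose split points based on how repos are actually distributed across star counts and creation dates, and would need a finer created granularity than whole years.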

Better reviewed without whitespace changes!
