Add rate limiting handling to campaigns code host actions (!15370) · Merge requests · Warren Gifford / sourcegraph

Warren Gifford requested to merge es/rl-final into main Nov 04, 2020

Created by: eseliger

This closes #14526 (closed) that got me here, which was: running into rate limits in the reconciler/syncer and retrying over and over again. With this PR, in over three hours of testing, I've only got a single call running into a rate limit because the estimation was off by one and the request wasn't allowed although we thought it would be. (But thats fine, before this PR it's been a couple thousand calls, all still hitting the code host and potentially leading to us being blocked permanently).

What this PR does

It reverts to using one syncer per code host. I recently moved to using one syncer per external service, but now that I restructured stuff a bit I realized this doesn't give us much of a benefit. And one syncer per code host keeps us safe from not spawning hundreds of syncers on systems with many code host connections (ie dotcom where user-provided external services are coming)
It splits up the GitHub client into a V3 and a V4 client. That is, because they actually are different clients. They don't share rate limits, they have different endpoints and the way to request from both also differs a lot.
It adds a global registry for RateLimitMonitors, mirroring what we do for RateLimiters, where we also have a global registry.
- This fixes a common problem we ran into before: The client would be reconstructed per call, as we use a Sourcer to create a new source. Hence, the client doesn't know about previously reported rate limit stats and would need to make one request first before knowing. Since we usually only did one request with one client, when using Sourcer, this led to us never knowing about the state of the rate limit monitor. This effectively shares it across all clients of a codehost/token combination. For GitLab, this was implemented using a ClientRegistry that shared the Monitor, but it was wrong, because the monitor would be updated by any call to the code host, regardless of the token. So if we had two tokens used for gitlab, the monitored rate limit would always flip between the one for token 1 and the one for token 2.

What this PR does NOT:

Efficiently use all tokens available on the Sourcegraph instance (using round-robin, or some other sophisticated token rotation)
Handle the potential stalling of syncer or reconciler until the end of the rate limit period (usually 1h from the first request made)
- With this PR the reconciler repeatedly tries to make another call, but now detects that it's out of requests and sleeps until it can do another request (sleep time is capped at at most 1h, but is very likely under a minute, if no external client is also using up rate limits and our limiters estimation is right)
- This is no regression, we just don't fail continuously now but wait, to not risk getting flagged as "abusers"
Take the rate limiting into consideration when calculating the syncer schedule (as we need to back-feed rate limits PLUS need to know which token has access to which repository)
Doesn't change the backoff strategy (currently linear) based on certain metrics.
Have some sophisticated combination of the rate limit for the internal rate limiter and the rate limit monitor. (I can open a follow-up PR doing this as I have implemented it for testing, but I think to make it easier to get this here to land and us having something in place for the next few months, this should stay out of here)

Add rate limiting handling to campaigns code host actions

Merge request reports