Report: 3.3 upgrade general instability

Created by: slimsag

TL;DR

This report is an in-depth analysis of an issue that multiple customers reported at around the same time today. It affects all instances with more than a few thousand repositories that upgrade to Sourcegraph 3.3. It causes general instability due to code host API rate limit exhaustion, with various symptoms (see below).

I've identified four root causes of this issue, as well as three distinct symptoms stemming from those root causes.

This issue will be addressed in an upcoming patch release ASAP. For immediate help, contact us and see "Mitigation & Immediate Resolution" below.

Problem background

Sourcegraph 3.3 changed how repository updating worked. As part of this change, some automatic migrations occurred. These auto-migrations were intended to be transparent to users (i.e. require no manual steps when upgrading).

After release of Sourcegraph 3.3, we began receiving reports (nearly simultaneously) that these auto-migrations were causing issues:

repo-updater being blocked on migration (i.e. not doing any work, despite the process running).
Hitting code host API rate limits.
Jump-to-codehost links not appearing.
Panics in the frontend taking down the server.
Slow search performance and timeout errors / proxy errors / 502 errors (due to slow search performance).
Errors viewing repository settings pages (Cannot read property 'nodes' of null).
Increased codehost 404s, according to the src_bitbucket_requests_total prometheus metric.

This issue affects all customers who upgrade to 3.3.0 or 3.3.1 with a large number of repositories.

Mitigation & Immediate Resolution

When affected, repo-updater initially fails to perform the auto-migrations for one of two reasons:

The auto-migration started, but is unable to complete due to already hitting the codehost API rate limit.
The auto-migration did not start and is unable to do so because an external service configuration became invalid after the upgrade (usually only affecting Bitbucket Server users).

To identify which state the system is in, inspect the repo-updater logs and grep for migrate:

docker logs <container> 2>&1 | grep 'migrate'

If the above reports an invalid configuration, visiting the external service and saving the config will tell you what needs to be done to bring it into a valid state once again.

At this point, the auto-migration will either complete right away or complete eventually, once enough code host API rate limit quota has accumulated over time.

Once this occurs, new entries in the external service will appear under the "repos" field.

At this point, the issue is partially mitigated. To fully resolve the issue, however, non-existent repositories MUST be removed from the "repos" field, or else Sourcegraph will continue to hit code host API rate limits.
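
As a rough illustration of that cleanup step, the sketch below splits the entries of a "repos" list into those to keep and those to remove. It assumes the field is a list of repository names, and `prune_missing_repos` and the `exists` callback are hypothetical names, not part of Sourcegraph. Note that an `exists` check against the code host itself costs API quota, so in practice it may be better to work from a list obtained another way.

```python
from typing import Callable, Iterable, List, Tuple


def prune_missing_repos(
    repos: Iterable[str], exists: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Split repos into (kept, removed) based on the exists() callback.

    `exists` is a hypothetical check supplied by the caller; it is not
    a Sourcegraph or code host API.
    """
    kept: List[str] = []
    removed: List[str] = []
    for name in repos:
        if exists(name):
            kept.append(name)
        else:
            removed.append(name)
    return kept, removed


# Example with a hard-coded set standing in for the code host.
known = {"org/a", "org/b"}
kept, removed = prune_missing_repos(
    ["org/a", "org/deleted", "org/b"], lambda name: name in known
)
```

The `kept` list is what would go back into the "repos" field; `removed` is worth reviewing by hand before saving the configuration.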

Root causes

There were four root causes of this issue: two primary ones affecting all users, and two secondary ones affecting only Bitbucket Server users.

All four issues prevented repo-updater from working at all, which caused serious harm to the Sourcegraph instance (see Symptoms section below).

Auto-migration can become blocked on code-host rate limit forever

As part of migrating disabled repositories to the exclude list, our migration code must list all repositories on the code host.

If the code host API rate limit is already significantly consumed, and there are a lot of repositories to list (100 per page), the ListAll logic will return an error which bubbles up to fail the migration.

When the migration fails, the repo-updater process exits and Kubernetes/Docker restarts it, which starts the migration again and consumes more of the API quota, making the problem worse.
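
To make the restart loop concrete, here is a toy model of it, not the actual repo-updater code. It assumes one API request per page of 100 repositories, that every restart begins listing from the first page again, and that a fixed amount of quota (`refill_per_restart`, a made-up parameter) comes back between restarts.

```python
import math
from typing import Optional


def pages_needed(repo_count: int, page_size: int = 100) -> int:
    """Number of list requests needed at page_size repositories per page."""
    return math.ceil(repo_count / page_size)


def simulate_restarts(
    pages: int, quota: int, refill_per_restart: int, max_restarts: int
) -> Optional[int]:
    """Return the attempt on which listing completes, or None if it never does.

    Toy model: each attempt starts from page 1 and spends quota until it
    either finishes or runs out, at which point the process "restarts".
    """
    for attempt in range(1, max_restarts + 1):
        used = min(quota, pages)
        quota -= used
        if used == pages:
            return attempt
        # Failed attempt drained the quota; only the refill is left.
        quota += refill_per_restart
    return None
```

In this model, when restarts happen quickly the refill stays smaller than the number of pages, so every attempt drains what little quota came back and the listing never finishes. If enough quota accumulates before an attempt, the same listing succeeds.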

Tracking issue: https://github.com/sourcegraph/sourcegraph/issues/3590

Auto-migration tries to add repositories that were deleted which consumes more API quota

Due to a bug in the auto-migration, repositories that were previously deleted are added to the external service configuration. This occurred in both reported cases of this issue.

In one particular case I observed requests to the code host at 6,000 req/hr all leading to 404s due to these deleted repositories. repo-updater trying to fetch info on these invalid repositories from the code host appears to eat ALL remaining API quota (6000 out of 7200/hr just going to 404s) and cause the instance to hit rate limits extremely quickly.
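
The numbers from that observation can be worked through directly. This is only arithmetic on the two figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the observation above.
hourly_quota = 7200      # requests per hour allowed by the code host
wasted_on_404s = 6000    # requests per hour that ended in a 404

remaining = hourly_quota - wasted_on_404s          # quota left for useful work
wasted_fraction = wasted_on_404s / hourly_quota    # share of quota spent on 404s

print(f"{remaining} req/hr left, {wasted_fraction:.0%} of quota spent on 404s")
```

That leaves 1,200 requests per hour, about a sixth of the quota, for everything else the instance needs from the code host.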

It is not yet clear to me where exactly these repositories were deleted or why they are added to the external service config after auto-migration.

Tracking issue: https://github.com/sourcegraph/sourcegraph/issues/3588

(Bitbucket Server only) external service missing `repositoryQuery` after upgrade

This prevents repo-updater from performing its auto-migration. It occurs because Sourcegraph was supposed to add this field automatically after the upgrade to 3.3, but did not.

The code for this was written and merged, but it apparently is not used anywhere.

Tracking issue: https://github.com/sourcegraph/sourcegraph/issues/3591

(Bitbucket Server only) external service missing `username` field after upgrade

This prevents repo-updater from performing its auto-migration. It occurs on instances where the Bitbucket Server external service configuration did not previously contain a username field alongside the token field.

In 3.3, the username field became required (this is expected), but its absence prevents repo-updater from auto-migrating and starting properly (this is very bad).

Tracking issue: https://github.com/sourcegraph/sourcegraph/issues/3592
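
For the two Bitbucket Server cases, a pre-upgrade check along the following lines could flag affected configurations. This is a minimal sketch, assuming the external service configuration has been parsed into a dictionary; `missing_fields` is a hypothetical helper, and the only field names used are the ones mentioned in this report.

```python
from typing import Dict, List

# Fields this report identifies as blocking the auto-migration when absent.
FIELDS_NEEDED_FOR_MIGRATION = ("username", "repositoryQuery")


def missing_fields(config: Dict[str, object]) -> List[str]:
    """Return the needed fields that are absent or empty in config."""
    return [name for name in FIELDS_NEEDED_FOR_MIGRATION if not config.get(name)]


# A pre-3.3 style configuration with only a token set.
old_config = {"token": "<redacted>"}
problems = missing_fields(old_config)
```

An empty result means neither of the two blocking conditions applies; it says nothing about other validation the real schema may perform.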

Symptoms

The following issues, mentioned at the start of this report, are all symptoms of the root causes above. That is, they are symptoms of us not handling the error case of hitting the code host API rate limit well.

Jump-to-codehost links not appearing: This was the expected behavior when we hit code host API rate limits. However, I am arguing for improving the error display here: https://github.com/sourcegraph/sourcegraph/issues/3589
Panics in the frontend taking down the server: a recent regression that occurs when a search fails (such as when repo-updater is inaccessible). https://github.com/sourcegraph/sourcegraph/issues/3579
Errors viewing repository settings pages (Cannot read property 'nodes' of null): https://github.com/sourcegraph/sourcegraph/issues/3593

Users affected by this issue:

https://app.hubspot.com/contacts/2762526/company/407948923
https://app.hubspot.com/contacts/2762526/company/561806411
And any other user upgrading an instance with more than a few thousand repositories to 3.3.