# Proposal: Global Repository Permissions Caching
Created by: tsenart
## Abstract
This proposal outlines a potential new approach to repository permissions caching.
## Background
At a high level, the current implementation of repository permissions verification works like this:

- Load repositories from Postgres.
- If the authenticated user is an admin, return all repos.
- Load all external user accounts associated with the authenticated user from Postgres.
- For each configured authorization provider:
  - If none of the loaded external accounts match this provider:
    - Fetch an external account from this provider (i.e. via `FetchAccount`).
    - Associate this external account with the authenticated user and persist it to Postgres.
  - Filter out repositories that don't belong to this provider (based on `api.ExternalRepoSpec`).
  - Load repository permissions for the selected repositories (via `RepoPerms`):
    - Return cached permissions if there's an unexpired cache entry for this user.
    - Otherwise, fill the cache with fresh permissions from the provider's code host and return those.
  - Add those repositories for which the current user has the required permissions to the `accepted` set.
  - If there are still repositories to verify, loop to the next provider.
- Return accepted repositories in the same order as the input repos slice.
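The flow above can be sketched roughly as follows. This is a simplified, hypothetical model: `Repo`, `Provider`, and `authorizedRepos` stand in for the real types, caching and account fetching are elided, and `Provider` is modeled as a struct of values (rather than an interface) so a fake can be built inline.

```go
package main

import "fmt"

// Repo is an illustrative stand-in for the real repository type.
type Repo struct {
	ID       uint32
	Provider string // stand-in for matching on api.ExternalRepoSpec
}

// Provider models an authorization provider for this sketch.
type Provider struct {
	ServiceID string
	// RepoPerms reports which of the given repos the user may read,
	// consulting the provider's cache and falling back to the code host.
	RepoPerms func(userID int32, repos []Repo) (map[uint32]bool, error)
}

// authorizedRepos mirrors the flow described above: admins see
// everything; otherwise each provider filters the repos that belong to
// it, and the accepted set is returned in input order.
func authorizedRepos(isAdmin bool, userID int32, repos []Repo, providers []Provider) ([]Repo, error) {
	if isAdmin {
		return repos, nil
	}
	accepted := make(map[uint32]bool)
	for _, p := range providers {
		var mine []Repo // repos belonging to this provider
		for _, r := range repos {
			if r.Provider == p.ServiceID {
				mine = append(mine, r)
			}
		}
		perms, err := p.RepoPerms(userID, mine)
		if err != nil {
			return nil, err
		}
		for id, ok := range perms {
			if ok {
				accepted[id] = true
			}
		}
	}
	out := make([]Repo, 0, len(accepted))
	for _, r := range repos { // preserve input order
		if accepted[r.ID] {
			out = append(out, r)
		}
	}
	return out, nil
}

func main() {
	bbs := Provider{
		ServiceID: "bbs",
		RepoPerms: func(_ int32, _ []Repo) (map[uint32]bool, error) {
			return map[uint32]bool{1: true, 3: false}, nil
		},
	}
	repos := []Repo{{ID: 1, Provider: "bbs"}, {ID: 2, Provider: "gh"}, {ID: 3, Provider: "bbs"}}
	out, _ := authorizedRepos(false, 42, repos, []Provider{bbs})
	fmt.Println(out) // prints [{1 bbs}]
}
```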
## Current Problems

### Cache filling slows down interactive user requests
Since cache filling happens synchronously in the user request path, it makes the user wait for as long as it takes to load all the requested permissions. How often this happens depends on the configured cache expiration `ttl`. Admins who are concerned about the load inflicted on their code hosts can increase this setting to a large duration at the expense of data freshness (i.e. permission changes in the code host will take up to `ttl` time to be reflected on Sourcegraph).

This problem manifests when the code host has slow-downs or when loading permissions for a large number of repos is simply slow, which is the case for some of our largest customers (e.g. loading permissions for ~10k repositories from the Bitbucket Server API takes roughly 10s).
### Missing cache filling concurrency control

Neither the GitHub nor the GitLab authorization provider controls the concurrency of cache fills for a single user. If users happen to send multiple requests after their permissions cache has expired, all or many of these requests will concurrently refresh the permissions cache from the code host.

Concurrency control is implemented in the Bitbucket Server provider, but it's not in an ideal state. The pessimistic locking approach used there, leveraging Postgres row locking, can quickly exhaust the Postgres connection pool when faced with higher concurrency than the configured pool size.
### Maintainability burden from lack of consistency and duplication

Each authorization provider re-implements many cross-cutting concerns, chief among them permissions caching. There are other redundancies that can be eliminated, such as the duplicated `Repos` method that partitions repositories into `mine` and `theirs`, whose implementations simply call out to `authz.GetCodeHostRepos`.

This duplication of effort opens the door to bugs and makes this sub-system harder to maintain. Moreover, caching repository permissions outside of each provider, at the `authzFilter` level, would result in a significant performance boost.
## Proposal

To address the problems described above, we propose to unify repository permissions caching outside of `authz.Provider` implementations. With a central place that handles this concern, we can address all of the identified problems at once, for all providers.

- Async cache filling: Instead of blocking a user's request, we'd use the stale cache immediately and start a refresh in the background (i.e. in a goroutine).
- Concurrency control: We'd start controlling the concurrency of cache fills for every provider. Instead of Postgres row-level locking, we'd use an alternative method that isn't prone to exhausting the connection pool (e.g. optimistically update the row's `updated_at` timestamp first, so that other processes don't concurrently cache fill, and then update the other columns when the data becomes available). Additional concurrency control of cache fills across all users would be done via well-tuned `bitbucketserver.Client` rate limiting.
- Maintainability: An `Authorizer` type would be introduced, with its logic based on `authzFilter`, but using dependency injection for easier testing. This type would handle all caching of permissions while each `authz.Provider` would only take care of fetching fresh permissions.
- Performance: For most user requests, since caching of permissions is now global instead of per provider, filtering repos via the `Authorizer` would be a simple in-memory set intersection using the authenticated user's repository permissions cache, represented as a `roaring.Bitmap` of authorized repository IDs.
## Rationale
This proposal kills many birds with one stone. However, it is a significant change to a security sensitive sub-system. The risk of introducing bugs can be managed by splitting changes in manageable chunks that will be thoroughly reviewed and tested.
## Compatibility

- Individual `ttl` settings on each authorization provider could be replaced by a global setting that'd apply to all configured providers. If we pursue this route, deprecation of these settings would be necessary.
- Redis ought to be cleaned up after this change, since it holds significant amounts of data related to permissions caching for the GitHub and GitLab providers.
## Implementation

This work fits the 3.6 milestone goal of the Core Services team, since it would significantly decrease the 99th percentile latency of searches. Full completion may carry over into the next milestone, but here's a proposed implementation order that could bring some of the benefits to users already in the 3.6 release:
1. Change the current Bitbucket Server provider's `store` to use an alternative to row-level locking. By itself, this would prevent potential connection pool exhaustion in the scenarios described before.
2. Change the current Bitbucket Server provider to update its caches asynchronously. This would allow us to recommend Bitbucket Server ACLs to some of our largest customers after we release 3.6, with caveats (see Open issues).
3. Building on the previous work, lift the `store` out of the `bitbucket` provider package into the top-level `authz` package as `PermsStore`.
4. Introduce the `Authorizer` type in the `authz` package. This type would use an injected `PermsStore` for all its caching. Implement the equivalent of `authzFilter` as a method on this type: `func (a *Authorizer) AuthorizedRepos(ctx context.Context, user *types.User, repos []*types.Repo) ([]*types.Repo, error)`.
5. Reuse the `Authorizer` in the code-paths where `authzFilter` is currently used.
6. Change the GitHub authorization provider to not do any caching itself.
7. Change the GitLab authorization provider to not do any caching itself.
## Open issues

### Cold caches
This proposal doesn't address the cold cache issue. When new users are created, or caches are wiped, we can't use a stale cache in user requests because there is none. In this scenario, users will have to wait. This is particularly problematic when Sourcegraph is deployed anew, since potentially many users can storm it and all of them would have to wait: not a great first experience.
Here are two complementary ideas that could alleviate this issue:
- Document this clearly and ask admins to not open the flood-gates all at once.
- When creating a new user, start a permissions cache fill in the background. This raises the chances that when the user hits a page that needs authorization, the permissions will already be there.