# Proposal: Global Repository Permissions Caching
Created by: tsenart
## Abstract
This proposal outlines a potential new approach to repository permissions caching.
## Background
At a high level, the current implementation of repository permissions verification works like this:

- Load repositories from Postgres.
- If the authenticated user is an admin, return all repos.
- Load all external user accounts associated with the authenticated user from Postgres.
- For each configured authorization provider:
  - If none of the loaded external accounts match this provider:
    - Fetch an external account from this provider (i.e. via `FetchAccount`).
    - Associate this external account with the authenticated user and persist it to Postgres.
  - Filter out repositories that don't belong to this provider (based on `api.ExternalRepoSpec`).
  - Load repository permissions for the selected repositories (via `RepoPerms`):
    - Return cached permissions if there's an unexpired cache entry for this user.
    - Otherwise, fill the cache with fresh permissions from the provider's code host and return those.
  - Add those repositories for which the current user has the required permissions to the `accepted` set.
  - If there are still repositories to verify, loop to the next provider.
- Return accepted repositories in the same order as the input repos slice.
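The flow above can be sketched roughly as follows. This is a simplified, hypothetical model: `Repo`, `Provider`, and `authorizedRepos` stand in for the real types, caching and account fetching are elided, and `Provider` is modeled as a struct of values (rather than an interface) so a fake can be built inline.

```go
package main

import "fmt"

// Repo is an illustrative stand-in for the real repository type.
type Repo struct {
	ID       uint32
	Provider string // stand-in for matching on api.ExternalRepoSpec
}

// Provider models an authorization provider for this sketch.
type Provider struct {
	ServiceID string
	// RepoPerms reports which of the given repos the user may read,
	// consulting the provider's cache and falling back to the code host.
	RepoPerms func(userID int32, repos []Repo) (map[uint32]bool, error)
}

// authorizedRepos mirrors the flow described above: admins see
// everything; otherwise each provider filters the repos that belong to
// it, and the accepted set is returned in input order.
func authorizedRepos(isAdmin bool, userID int32, repos []Repo, providers []Provider) ([]Repo, error) {
	if isAdmin {
		return repos, nil
	}
	accepted := make(map[uint32]bool)
	for _, p := range providers {
		var mine []Repo // repos belonging to this provider
		for _, r := range repos {
			if r.Provider == p.ServiceID {
				mine = append(mine, r)
			}
		}
		perms, err := p.RepoPerms(userID, mine)
		if err != nil {
			return nil, err
		}
		for id, ok := range perms {
			if ok {
				accepted[id] = true
			}
		}
	}
	out := make([]Repo, 0, len(accepted))
	for _, r := range repos { // preserve input order
		if accepted[r.ID] {
			out = append(out, r)
		}
	}
	return out, nil
}

func main() {
	bbs := Provider{
		ServiceID: "bbs",
		RepoPerms: func(_ int32, _ []Repo) (map[uint32]bool, error) {
			return map[uint32]bool{1: true, 3: false}, nil
		},
	}
	repos := []Repo{{ID: 1, Provider: "bbs"}, {ID: 2, Provider: "gh"}, {ID: 3, Provider: "bbs"}}
	out, _ := authorizedRepos(false, 42, repos, []Provider{bbs})
	fmt.Println(out) // prints [{1 bbs}]
}
```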
## Current Problems

### Cache filling slows down interactive user requests
Since cache filling happens synchronously in the user request path, it makes the user wait for as long as it takes to load all the requested permissions. How often this happens depends on the configured cache expiration `ttl`. Admins who are concerned about the load inflicted on their code hosts can increase this setting to a large duration at the expense of data freshness (i.e. permission changes in the code host will take up to `ttl` time to be reflected on Sourcegraph).

This problem manifests when the code host has slow-downs or when loading permissions for a large number of repos is simply slow, which is the case for some of our largest customers (e.g. loading permissions for ~10k repositories from the Bitbucket Server API takes roughly 10s).
### Missing cache filling concurrency control

Neither the GitHub nor the GitLab authorization provider controls the concurrency of cache fills for a single user. If users happen to send multiple requests after their permissions cache has expired, all or many of these requests will concurrently refresh the permissions cache from the code host.

Concurrency control is implemented in the Bitbucket Server provider, but it's not in an ideal state. The pessimistic locking approach used there, leveraging Postgres row locking, can quickly exhaust the Postgres connection pool when faced with higher concurrency than the configured pool size.
### Maintainability burden from lack of consistency and duplication

Each authorization provider re-implements many cross-cutting concerns, chief among them permissions caching. There are other redundancies that can be eliminated, such as the duplicated `Repos` method that partitions repositories into `mine` and `theirs`, whose implementations simply call out to `authz.GetCodeHostRepos`.

This duplication of effort opens the door to bugs and makes this sub-system harder to maintain. Moreover, caching repository permissions outside of each provider, at the `authzFilter` level, would result in a significant performance boost.
## Proposal

To address the problems described above, we propose to unify repository permissions caching outside of `authz.Provider` implementations. With a central place that handles this concern, we can address all of the identified problems at once, for all providers.

- Async cache filling: Instead of blocking a user's request, we'd use the stale cache immediately and start a refresh in the background (i.e. in a goroutine).
- Concurrency control: We'd start controlling the concurrency of cache fills for every provider. Instead of Postgres row-level locking, we'd use an alternative method that isn't prone to exhausting the connection pool (e.g. optimistically update the row's `updated_at` timestamp first, so that other processes don't concurrently cache fill, and then update the other columns when the data becomes available). Additional concurrency control of cache fills across all users would be done via well-tuned `bitbucketserver.Client` rate limiting.
- Maintainability: An `Authorizer` type would be introduced, with its logic based on `authzFilter`, but using dependency injection for easier testing. This type would handle all caching of permissions while each `authz.Provider` would only take care of fetching fresh permissions.
- Performance: For most user requests, since caching of permissions is now global instead of per provider, filtering repos via the `Authorizer` would be a simple in-memory set intersection using the authenticated user's repository permissions cache, represented as a `roaring.Bitmap` of authorized repository IDs.
## Rationale
This proposal kills many birds with one stone. However, it is a significant change to a security sensitive sub-system. The risk of introducing bugs can be managed by splitting changes in manageable chunks that will be thoroughly reviewed and tested.
## Compatibility

- Individual `ttl` settings on each authorization provider could be replaced by a global setting that'd apply to all configured providers. If we pursue this route, deprecation of these settings would be necessary.
- Redis ought to be cleaned up after this change, since it holds significant amounts of data related to permissions caching for the GitHub and GitLab providers.
## Implementation

This work fits the 3.6 milestone goal of the Core Services team, since it would significantly decrease the 99th percentile latency of searches. Full completion may carry over into the next milestone, but here's a proposed implementation order that could bring some of the benefits to users already in the 3.6 release:
1. Change the current Bitbucket Server provider's `store` to use an alternative to row-level locking. By itself, this would prevent potential connection pool exhaustion in the scenarios described before.
2. Change the current Bitbucket Server provider to update its caches asynchronously. This would allow us to recommend Bitbucket Server ACLs to some of our largest customers after we release 3.6, with caveats (see Open issues).
3. Building on the previous work, lift the `store` out of the `bitbucket` provider package into the top-level `authz` package as `PermsStore`.
4. Introduce the `Authorizer` type in the `authz` package. This type would use an injected `PermsStore` for all its caching. Implement the equivalent of `authzFilter` as a method on this type: `func (a *Authorizer) AuthorizedRepos(ctx context.Context, user *types.User, repos []*types.Repo) ([]*types.Repo, error)`.
5. Reuse the `Authorizer` in the code-paths where `authzFilter` is currently used.
6. Change the GitHub authorization provider to not do any caching itself.
7. Change the GitLab authorization provider to not do any caching itself.
## Open issues

### Cold caches
This proposal doesn't address the cold cache issue. When new users are created, or caches are wiped, we can't use a stale cache in user requests because there is none. In this scenario, users will have to wait. This is particularly problematic when Sourcegraph is deployed anew, since potentially many users can storm it and all of them would have to wait: not a great first experience.
Here are two complementary ideas that could alleviate this issue:
- Document this clearly and ask admins to not open the flood-gates all at once.
- When creating a new user, start a permissions cache fill in the background. This raises the chances that when the user hits a page that needs authorization, the permissions will already be there.