authz: Bitbucket Server ACLs caching
Created by: tsenart
Summary
This change set introduces per-user caching of Bitbucket Server ACLs that respects a user-configured TTL. It leverages Postgres row locking for concurrency control of cache-fill events, so that many concurrent requests during an expiration don't overload the Bitbucket Server API.
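Conceptually, the row-locking flow looks something like the sketch below. This is a minimal illustration under assumptions, not the code in this PR: the `user_permissions` table, its columns, and the `permsStore`/`LoadPermissions`/`fill` names are made up for the example.

```go
package authz

import (
	"context"
	"database/sql"
	"time"
)

type permsStore struct {
	db  *sql.DB
	ttl time.Duration
}

// LoadPermissions returns the cached permissions bitmap for userID, refilling
// it via fill (a call to the Bitbucket Server API) when the cached row is
// older than the TTL. SELECT ... FOR UPDATE makes concurrent requests racing
// on an expired entry queue behind a single filler instead of all calling
// upstream at once.
func (s *permsStore) LoadPermissions(ctx context.Context, userID int32, fill func(context.Context) ([]byte, error)) ([]byte, error) {
	tx, err := s.db.BeginTx(ctx, nil)
	if err != nil {
		return nil, err
	}
	defer tx.Rollback()

	var (
		perms     []byte
		updatedAt time.Time
	)
	err = tx.QueryRowContext(ctx,
		`SELECT permissions, updated_at FROM user_permissions WHERE user_id = $1 FOR UPDATE`,
		userID,
	).Scan(&perms, &updatedAt)
	if err != nil && err != sql.ErrNoRows {
		return nil, err
	}

	if err == nil && time.Since(updatedAt) < s.ttl {
		// Fresh enough: no upstream call, no write.
		return perms, tx.Commit()
	}

	// Cache miss or expired entry: fetch from Bitbucket Server while holding
	// the row lock, then write the new value back in the same transaction.
	// (On the very first fill there is no row to lock yet; the real change
	// has to handle that case, this sketch does not.)
	if perms, err = fill(ctx); err != nil {
		return nil, err
	}
	_, err = tx.ExecContext(ctx,
		`INSERT INTO user_permissions (user_id, permissions, updated_at)
		 VALUES ($1, $2, now())
		 ON CONFLICT (user_id) DO UPDATE SET permissions = $2, updated_at = now()`,
		userID, perms,
	)
	if err != nil {
		return nil, err
	}
	return perms, tx.Commit()
}
```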
It features a second, in-memory read-through cache that avoids the marshaling and network round-trip costs during steady state.
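The in-memory layer can be pictured as a small TTL-guarded map consulted before Postgres. Again a sketch with illustrative names (`memCache`, `cachedPerms`), not the actual implementation:

```go
package authz

import (
	"sync"
	"time"
)

type cachedPerms struct {
	perms     []byte
	updatedAt time.Time
}

type memCache struct {
	mu  sync.RWMutex
	ttl time.Duration
	m   map[int32]cachedPerms
}

// get returns the cached entry for userID if it is still within the TTL,
// skipping the unmarshaling and network round trip to Postgres on the hot path.
func (c *memCache) get(userID int32) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.m[userID]
	if !ok || time.Since(e.updatedAt) >= c.ttl {
		return nil, false
	}
	return e.perms, true
}

// set stores a freshly loaded entry after a Postgres (or upstream) fill.
func (c *memCache) set(userID int32, perms []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[userID] = cachedPerms{perms: perms, updatedAt: time.Now()}
}
```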
We cache the complete set of Sourcegraph repository IDs a user has access to. This set is represented as a highly compressed bitmap to save on storage, memory, and compute resources. For a hypothetical upper bound of 100k repositories and 10k users, the aggregate memory usage would be around 329.36 MB per frontend API instance. For 30k repositories and 1k users, that number would be 16.472 MB. These figures are computed with https://godoc.org/github.com/RoaringBitmap/roaring#BoundSerializedSizeInBytes
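For reference, an estimate of this shape can be computed with the linked function along the lines below. The exact cardinality/universe-size arguments behind the quoted numbers aren't stated, so this sketch assumes each user can access every repository; its output may not match the summary figures exactly.

```go
package main

import (
	"fmt"

	"github.com/RoaringBitmap/roaring"
)

func main() {
	for _, c := range []struct{ repos, users uint64 }{
		{100_000, 10_000},
		{30_000, 1_000},
	} {
		// Upper bound on the serialized size of one user's bitmap when that
		// user can access every repository (cardinality == universe size).
		perUser := roaring.BoundSerializedSizeInBytes(c.repos, c.repos)
		total := perUser * c.users
		fmt.Printf("%d repos x %d users: %.2f MB\n", c.repos, c.users, float64(total)/(1<<20))
	}
}
```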
Closes #1108
Benchmarks
benchmark iter time/iter bytes alloc allocs
--------- ---- --------- ----------- ------
BenchmarkStore/ttl=0-12 1 2010936381.00 ns/op 121320 B/op 450 allocs/op
BenchmarkStore/ttl=60s/no-in-memory-cache-12 3000 448181.00 ns/op 30705 B/op 78 allocs/op
BenchmarkStore/ttl=60s/in-memory-cache-12 10000000 134.00 ns/op 0 B/op 0 allocs/op
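The three cases above map naturally onto Go sub-benchmarks along these lines; `newBenchStore`, `loader`, and `staticLoader` are placeholders standing in for the real store wiring, not code from this PR.

```go
package authz

import (
	"testing"
	"time"
)

// loader abstracts whatever store variant is being measured; the three cases
// in the table differ only in how it is constructed.
type loader interface {
	LoadPermissions(userID int32) ([]byte, error)
}

// newBenchStore is a placeholder constructor standing in for the real store
// setup (Postgres-backed TTL cache, optional in-memory layer).
func newBenchStore(ttl time.Duration, inMemoryCache bool) loader {
	return staticLoader{}
}

type staticLoader struct{}

func (staticLoader) LoadPermissions(int32) ([]byte, error) { return []byte{0x01}, nil }

func BenchmarkStore(b *testing.B) {
	for _, bc := range []struct {
		name     string
		ttl      time.Duration
		memCache bool
	}{
		{"ttl=0", 0, false},
		{"ttl=60s/no-in-memory-cache", 60 * time.Second, false},
		{"ttl=60s/in-memory-cache", 60 * time.Second, true},
	} {
		b.Run(bc.name, func(b *testing.B) {
			s := newBenchStore(bc.ttl, bc.memCache)
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				if _, err := s.LoadPermissions(1); err != nil {
					b.Fatal(err)
				}
			}
		})
	}
}
```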
Load tests
Before this PR
Requests [total, rate] 240, 100.42
Duration [total, attack, wait] 12.750183628s, 2.389969s, 10.360214628s
Latencies [mean, 50, 95, 99, max] 257.711098ms, 21.508257ms, 45.953569ms, 8.382180325s, 10.360214628s
Bytes In [total, mean] 49490, 206.21
Bytes Out [total, mean] 16560, 69.00
Success [ratio] 100.00%
Status Codes [code:count] 200:240
Error Set:
After this PR
Requests [total, rate] 500, 100.16
Duration [total, attack, wait] 5.000159564s, 4.991773s, 8.386564ms
Latencies [mean, 50, 95, 99, max] 6.555395ms, 6.271367ms, 8.724946ms, 13.90822ms, 31.487489ms
Bytes In [total, mean] 105000, 210.00
Bytes Out [total, mean] 34500, 69.00
Success [ratio] 100.00%
Status Codes [code:count] 200:500
Error Set:
Activity
Created by: codecov[bot]
Codecov Report
Merging #4493 into master will increase coverage by 0.1%. The diff coverage is 78.23%.

@@            Coverage Diff            @@
##           master    #4493     +/-   ##
=========================================
+ Coverage    47.24%   47.34%    +0.1%
=========================================
  Files          720      720
  Lines        43589    43685      +96
  Branches      1763     1763
=========================================
+ Hits         20592    20683      +91
+ Misses       21037    21031       -6
- Partials      1960     1971      +11
Impacted Files Coverage Δ
pkg/extsvc/bitbucketserver/client.go 31.5% <ø> (ø)
cmd/frontend/authz/common.go 0% <0%> (ø)
...prise/cmd/frontend/internal/authz/github/github.go 66.82% <100%> (ø)
cmd/repo-updater/repos/store.go 86.58% <100%> (ø)
enterprise/cmd/frontend/internal/authz/init.go 67.85% <100%> (ø)
...d/frontend/internal/authz/bitbucketserver/store.go 74.46% <74.46%> (ø)
...d/frontend/internal/authz/authz_bitbucketserver.go 79.41% <75%> (ø)
...rontend/internal/authz/bitbucketserver/provider.go 79.79% <87.87%> (+6.95%)
Created by: keegancsmith
kk. Note we have in the past used https://godoc.org/gopkg.in/redsync.v1 but this seems good to me as well.
Created by: tsenart
I’ve been wary of Redis locks since Martin Kleppmann’s analysis a few years ago: http://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
Antirez has responded to that analysis, but I’m not entirely convinced of their safety. http://antirez.com/news/101
Created by: tsenart
> yeah the analysis is probably correct, but the flaws only exist on distributed redis. We don't use distributed redis.
From what I understand, there are flaws that are independent of using cluster mode (e.g. lock lease timeouts are vulnerable to long process halts, such as the GC pauses that some languages can trigger).