resolve major search perf regression in v3.2.0 (add redis cache to git commit OID resolution)
Created by: slimsag
In Sourcegraph v3.2.0 a major search performance regression was introduced by accident. It affected all instances using indexed search. It went unnoticed because we currently don't have sufficient load testing of search (which is something we are addressing very soon).
The regression was introduced in https://github.com/sourcegraph/sourcegraph/pull/2318 (cc @ijsnow @keegancsmith) by accident due to some complex code interaction. The change made the assumption that using a sync.Once
associated with the *gitCommitResolver
was enough to prevent doing too much duplicative work. However, since there is a one-to-one mapping of search results and gitCommitResolver
we regressed significantly on search performance as after that change for every search result we would:
-
Resolve the default branch of the repo, which requires making an HTTP request to gitserver to ask it to execute
git symbolic-ref HEAD
(and in some cases executing a second command). -
Ask zoekt to list indexed git refs for the repo over HTTP (expensive given it is also wrapped in
sync.Once
). - Resolve the git ref zoekt says is indexed back to a concrete git OID.
This is all to say: after that change we would do a substantial amount of work per search result after https://github.com/sourcegraph/sourcegraph/pull/2318 was merged.
This PR resolves the issue by putting in place a Redis cache with a TTL of 60s. This makes computing this information just a simple cache lookup, which should be substantially faster. The sync.Once
logic remains because we may still call OID
repetitively on the same *gitCommitObject
(I didn't vet all call sites.)
Additionally, this PR adds metrics for us to monitor (which should've probably been added earlier as part of #2318):
- How much time we spend computing this information.
- How often we hit / miss the cache.
- How much time we spend performing cache lookups (in case a Redis cache isn't sufficient and we need a process-local cache in the future).
Test plan:
Manual load testing performed by benchmarking search with vegeta (see "Raw data / details on how tests were ran" at end for exact steps I performed to test this change).
Comparison
All tests performed with this search query returning only 10 results: repo:^github\\.com/sourcegraph/sourcegraph$ error count:10 timeout:60s
Initially I tried higher result counts (11,000, 1000) but the perf before this change was so bad ALL requests would timeout at just 1 QPS and vegeta can't send requests slower than that (or I couldn't find the option).
QPS | Mean change (before -> after) | P95 change (before -> after) | before: non-200s / total req | after: non-200s / total req |
---|---|---|---|---|
1 | (-92.98%) 228ms -> 16ms | (-93.67%) 332ms -> 21ms | 0/60 | 0/60 |
5 | (-99.37%) 2,230ms -> 14ms | (-99.60%) 4,321ms -> 17ms | 0/300 | 0/300 |
50 | (-99.845) 60,000ms -> 92ms | (-99.92%) 63,543ms -> 52ms | 166/250 (502 timeouts) | 0/250 |
- Negative percent above indicates e.g. 92% less time spent after change.
- Last row before time I believe is not super accurate because high number of timeouts skewed the request duration mean / P95..
All tests
Before: Lots of warnings: gitserver | WARN Long exec request, repo: github.com/sourcegraph/sourcegraph, args: [symbolic-ref HEAD], duration: 3.183s
After: No warnings.
Raw data / details on how tests were ran
Details
-
search.body.json
is the exact GraphQL search query we send when a user does a search request, copied from Chrome network tab.
1 QPS for 60s
Before:
$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=1qps -rate=1 -duration=60s | tee results.bin | vegeta report
Requests [total, rate] 60, 1.02
Duration [total, attack, wait] 59.1882914s, 59.004040464s, 184.250936ms
Latencies [mean, 50, 95, 99, max] 228.168332ms, 213.124139ms, 332.409ms, 554.912644ms, 575.35818ms
Bytes In [total, mean] 508206, 8470.10
Bytes Out [total, mean] 327300, 5455.00
Success [ratio] 100.00%
Status Codes [code:count] 200:60
Error Set:
After:
$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=1qps -rate=1 -duration=60s | tee results.bin | vegeta report
Requests [total, rate] 60, 1.02
Duration [total, attack, wait] 59.01640499s, 59.004437167s, 11.967823ms
Latencies [mean, 50, 95, 99, max] 16.430629ms, 13.723219ms, 20.86999ms, 133.612645ms, 145.057061ms
Bytes In [total, mean] 508201, 8470.02
Bytes Out [total, mean] 327300, 5455.00
Success [ratio] 100.00%
Status Codes [code:count] 200:60
Error Set:
5 QPS for 60s
Before:
$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=5qps -rate=5 -duration=60s | tee results.bin | vegeta report
Requests [total, rate] 300, 5.02
Duration [total, attack, wait] 1m2.116713558s, 59.803600375s, 2.313113183s
Latencies [mean, 50, 95, 99, max] 2.231654701s, 2.644444083s, 4.321173379s, 4.960104982s, 5.056616823s
Bytes In [total, mean] 2541017, 8470.06
Bytes Out [total, mean] 1636500, 5455.00
Success [ratio] 100.00%
Status Codes [code:count] 200:300
Error Set:
After:
$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=5qps -rate=5 -duration=60s | tee results.bin | vegeta report
Requests [total, rate] 300, 5.02
Duration [total, attack, wait] 59.815219217s, 59.804088472s, 11.130745ms
Latencies [mean, 50, 95, 99, max] 13.718213ms, 13.242097ms, 16.508382ms, 25.167756ms, 129.288757ms
Bytes In [total, mean] 2541003, 8470.01
Bytes Out [total, mean] 1636500, 5455.00
Success [ratio] 100.00%
Status Codes [code:count] 200:300
Error Set:
50 QPS for 5s
Before:
$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=50qps -rate=50 -duration=5s | tee results.bin | vegeta report
Requests [total, rate] 250, 50.18
Duration [total, attack, wait] 1m7.318174155s, 4.982542856s, 1m2.335631299s
Latencies [mean, 50, 95, 99, max] 1m0.103715429s, 1m0.957090646s, 1m3.543040632s, 1m4.220043201s, 1m4.250048319s
Bytes In [total, mean] 736409, 2945.64
Bytes Out [total, mean] 1363750, 5455.00
Success [ratio] 33.60%
Status Codes [code:count] 200:84 502:166
Error Set:
502 Bad Gateway
After:
$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=50qps -rate=50 -duration=5s | tee results.bin | vegeta report
Requests [total, rate] 250, 50.17
Duration [total, attack, wait] 4.993111979s, 4.983242598s, 9.869381ms
Latencies [mean, 50, 95, 99, max] 92.333272ms, 13.51449ms, 52.175179ms, 1.841178441s, 1.869576507s
Bytes In [total, mean] 2117508, 8470.03
Bytes Out [total, mean] 1363750, 5455.00
Success [ratio] 100.00%
Status Codes [code:count] 200:250
Error Set: