Warren Gifford requested to merge sg/oid-cache into master May 01, 2019

Created by: slimsag

In Sourcegraph v3.2.0 a major search performance regression was introduced by accident. It affected all instances using indexed search. It went unnoticed because we currently don't have sufficient load testing of search (which is something we are addressing very soon).

The regression was introduced in https://github.com/sourcegraph/sourcegraph/pull/2318 (cc @ijsnow @keegancsmith) by accident due to some complex code interaction. The change made the assumption that using a sync.Once associated with the *gitCommitResolver was enough to prevent doing too much duplicative work. However, since there is a one-to-one mapping of search results and gitCommitResolver we regressed significantly on search performance as after that change for every search result we would:

Resolve the default branch of the repo, which requires making an HTTP request to gitserver to ask it to execute git symbolic-ref HEAD (and in some cases executing a second command).
Ask zoekt to list indexed git refs for the repo over HTTP (expensive given it is also wrapped in sync.Once).
Resolve the git ref zoekt says is indexed back to a concrete git OID.

This is all to say: after that change we would do a substantial amount of work per search result after https://github.com/sourcegraph/sourcegraph/pull/2318 was merged.

This PR resolves the issue by putting in place a Redis cache with a TTL of 60s. This makes computing this information just a simple cache lookup, which should be substantially faster. The sync.Once logic remains because we may still call OID repetitively on the same *gitCommitObject (I didn't vet all call sites.)

Additionally, this PR adds metrics for us to monitor (which should've probably been added earlier as part of #2318):

How much time we spend computing this information.
How often we hit / miss the cache.
How much time we spend performing cache lookups (in case a Redis cache isn't sufficient and we need a process-local cache in the future).

Test plan:

Manual load testing performed by benchmarking search with vegeta (see "Raw data / details on how tests were ran" at end for exact steps I performed to test this change).

Comparison

All tests performed with this search query returning only 10 results: repo:^github\\.com/sourcegraph/sourcegraph$ error count:10 timeout:60s

Initially I tried higher result counts (11,000, 1000) but the perf before this change was so bad ALL requests would timeout at just 1 QPS and vegeta can't send requests slower than that (or I couldn't find the option).

QPS	Mean change (before -> after)	P95 change (before -> after)	before: non-200s / total req	after: non-200s / total req
1	(-92.98%) 228ms -> 16ms	(-93.67%) 332ms -> 21ms	0/60	0/60
5	(-99.37%) 2,230ms -> 14ms	(-99.60%) 4,321ms -> 17ms	0/300	0/300
50	(-99.845) 60,000ms -> 92ms	(-99.92%) 63,543ms -> 52ms	166/250 (502 timeouts)	0/250

Negative percent above indicates e.g. 92% less time spent after change.
Last row before time I believe is not super accurate because high number of timeouts skewed the request duration mean / P95..

All tests

Before: Lots of warnings: gitserver | WARN Long exec request, repo: github.com/sourcegraph/sourcegraph, args: [symbolic-ref HEAD], duration: 3.183s

After: No warnings.

Raw data / details on how tests were ran

Details

search.body.json is the exact GraphQL search query we send when a user does a search request, copied from Chrome network tab.

1 QPS for 60s

Before:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=1qps -rate=1 -duration=60s | tee results.bin | vegeta report
Requests      [total, rate]            60, 1.02
Duration      [total, attack, wait]    59.1882914s, 59.004040464s, 184.250936ms
Latencies     [mean, 50, 95, 99, max]  228.168332ms, 213.124139ms, 332.409ms, 554.912644ms, 575.35818ms
Bytes In      [total, mean]            508206, 8470.10
Bytes Out     [total, mean]            327300, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:60  
Error Set:

After:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=1qps -rate=1 -duration=60s | tee results.bin | vegeta report
Requests      [total, rate]            60, 1.02
Duration      [total, attack, wait]    59.01640499s, 59.004437167s, 11.967823ms
Latencies     [mean, 50, 95, 99, max]  16.430629ms, 13.723219ms, 20.86999ms, 133.612645ms, 145.057061ms
Bytes In      [total, mean]            508201, 8470.02
Bytes Out     [total, mean]            327300, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:60  
Error Set:

5 QPS for 60s

Before:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=5qps -rate=5 -duration=60s | tee results.bin | vegeta report
Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    1m2.116713558s, 59.803600375s, 2.313113183s
Latencies     [mean, 50, 95, 99, max]  2.231654701s, 2.644444083s, 4.321173379s, 4.960104982s, 5.056616823s
Bytes In      [total, mean]            2541017, 8470.06
Bytes Out     [total, mean]            1636500, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:

After:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=5qps -rate=5 -duration=60s | tee results.bin | vegeta report
Requests      [total, rate]            300, 5.02
Duration      [total, attack, wait]    59.815219217s, 59.804088472s, 11.130745ms
Latencies     [mean, 50, 95, 99, max]  13.718213ms, 13.242097ms, 16.508382ms, 25.167756ms, 129.288757ms
Bytes In      [total, mean]            2541003, 8470.01
Bytes Out     [total, mean]            1636500, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:300  
Error Set:

50 QPS for 5s

Before:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=50qps -rate=50 -duration=5s | tee results.bin | vegeta report
Requests      [total, rate]            250, 50.18
Duration      [total, attack, wait]    1m7.318174155s, 4.982542856s, 1m2.335631299s
Latencies     [mean, 50, 95, 99, max]  1m0.103715429s, 1m0.957090646s, 1m3.543040632s, 1m4.220043201s, 1m4.250048319s
Bytes In      [total, mean]            736409, 2945.64
Bytes Out     [total, mean]            1363750, 5455.00
Success       [ratio]                  33.60%
Status Codes  [code:count]             200:84  502:166  
Error Set:
502 Bad Gateway

After:

$ echo 'POST http://localhost:3080/.api/graphql?SearchBenchmark' | vegeta attack -body ./search.body.json -timeout=120s -name=50qps -rate=50 -duration=5s | tee results.bin | vegeta report
Requests      [total, rate]            250, 50.17
Duration      [total, attack, wait]    4.993111979s, 4.983242598s, 9.869381ms
Latencies     [mean, 50, 95, 99, max]  92.333272ms, 13.51449ms, 52.175179ms, 1.841178441s, 1.869576507s
Bytes In      [total, mean]            2117508, 8470.03
Bytes Out     [total, mean]            1363750, 5455.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:250  
Error Set:

resolve major search perf regression in v3.2.0 (add redis cache to git commit OID resolution)

Comparison

All tests

Details

1 QPS for 60s

5 QPS for 60s

50 QPS for 5s

Merge request reports