codeintel: Rockskip for symbols
Created by: chrismwendt
This implements Rockskip indexing and search for the symbols service.
Motivation 1: improve speed on 10GB+ repos
On ~50GB repos, the symbols service takes ~30s to process symbols for new commits. That slowness frequently causes empty hovers and red error messages in the symbols sidebar on such repos. Although the speed was massively improved by incrementally indexing symbols, there remains one part that's slow and not incremental: making a copy of the SQLite DB for each new commit. Rockskip processes new commits in ~100ms, not ~30s.
Motivation 2: maximize index usage
Because the symbols service gets restarted ~10 times/day on Sourcegraph.com and its caches are wiped on each restart, it ends up re-processing lots of repos. An alternative solution would be to use a stateful set. This PR stores symbols data in Postgres and eliminates the problem of needing to reprocess all symbols on restart.
Motivation 3: vet the concept for usage elsewhere
Rockskip has been described as potentially "transformative" for many use cases at Sourcegraph: text search, symbol search, and code insights. Symbol search is the easiest to implement, so this PR serves as a proof of concept for Rockskip and should increase confidence in using it elsewhere.
TODO
- Implement the core algorithm using a MockGit and MockDB
- Implement real SubprocessGit and PostgresDB
- Perf milestone: add indexes to tables, use a persistent git cat-file process
- Index ctags
- Try transactions, done: wrapping everything in a transaction made it 1.3x slower
- Feature milestone: integrate Rockskip into the symbols service
- Wrap each commit in a transaction to avoid corruption
- Store line, kind, and parent for each symbol
- Create migration
- Limit Postgres disk usage with LRU cache
- https://github.com/sourcegraph/sourcegraph/pull/30663 to avoid hitting argv and URL length limits
- Make sure rev-list and others terminate when ctx is canceled
- Stream git log if the HTTP payload is too big
- Stream git rev-list
- Add limits for max file size, max number of symbols per file, max concurrent repos being processed
- Make a Symbols Status page
- Edit: merging here with a known bug (see below) to avoid merge conflicts due to long-lived branch
- Log comparison of response time and errors between SQLite and Rockskip
- Run load test with one megarepo
- Run load test with many small repos
- (optional for speed) Use a pool of ctags parsers
- (optional for DB size) Store as int IDs: repo names, commit hashes, and paths
- (optional for speed) Teach gitserver to run ctags itself to avoid sending file contents over the network
- (optional for stability) Implement throttling in case autovacuum gets overwhelmed
- (optional for speed) Lazily compactify the ancestry chain in blobs
- (maybe) Try replacing git archive with a persistent git cat-file --batch connection
Benchmarks
Currently, this prototype of Rockskip processes the kubernetes repo at ~40 commits per second on my laptop in a total of 20 minutes. Queries take ~10ms.
Timing breakdown on kubernetes:
- 62% constructing the Rockskip structure in Postgres
- 30% ctags (this can be parallelized down to almost nothing)
- 5% git operations
- 3% other
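To put the breakdown above in perspective, the following sketch applies Amdahl's law to the 20 minute total. It assumes the ctags share is exactly 30% and that ctags scales linearly with the number of parsers, which is an idealization; `EstimatedMinutes` is a hypothetical helper, not something in this PR.

```go
package main

// EstimatedMinutes returns the projected total runtime when one phase,
// which takes phaseFraction of totalMinutes, is sped up by phaseSpeedup.
// The rest of the work is assumed to be unaffected (Amdahl's law).
func EstimatedMinutes(totalMinutes, phaseFraction, phaseSpeedup float64) float64 {
	return totalMinutes * ((1 - phaseFraction) + phaseFraction/phaseSpeedup)
}
```

Under these assumptions, `EstimatedMinutes(20, 0.30, 8)` gives 14.75 minutes, and even with ctags reduced to nothing the total stays at about 14 minutes (roughly a 1.4x speedup), because constructing the Rockskip structure in Postgres dominates.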
Known bug
There are cases (often in protobuf files) where Rockskip can't find a symbol that was apparently deleted.