codeintel: Rockskip for symbols

Warren Gifford requested to merge rockskip-symbols into main Dec 08, 2021

Created by: chrismwendt

This implements Rockskip indexing and search for the symbols service.

Motivation 1: improve speed on 10GB+ repos

On ~50GB repos, the symbols service takes ~30s to process symbols for new commits. That slowness frequently causes empty hovers and red error messages in the symbols sidebar on such repos. Incremental symbol indexing already improved speed substantially, but one step remains slow and non-incremental: copying the SQLite DB for each new commit. Rockskip processes new commits in ~100ms, not ~30s.

Motivation 2: maximize index usage

Because the symbols service gets restarted ~10 times/day on Sourcegraph.com and its caches are wiped on each restart, it ends up re-processing lots of repos. This PR stores symbols data in Postgres, which eliminates the need to reprocess all symbols on restart. An alternative solution would be to use a stateful set.

Motivation 3: vet the concept for usage elsewhere

Rockskip has been described as potentially "transformative" for many use cases at Sourcegraph: text search, symbol search, and code insights. Symbol search is the easiest to implement, so this PR serves as a proof-of-concept for Rockskip and should increase confidence in using it elsewhere.

TODO

Implement the core algorithm using a MockGit and MockDB
Implement real SubprocessGit and PostgresDB
Perf milestone: add indexes to tables, use a persistent git cat-file process
Index ctags
Try transactions (done: wrapping everything in a transaction made it 1.3x slower)
Feature milestone: integrate Rockskip into the symbols service
Wrap each commit in a transaction to avoid corruption
Store line, kind, and parent for each symbol
Create migration
Limit Postgres disk usage with LRU cache
https://github.com/sourcegraph/sourcegraph/pull/30663 to avoid hitting argv and URL length limits
Make sure rev-list and others terminate when ctx is canceled
Stream git log if the HTTP payload is too big
Stream git rev-list
Add limits for max file size, max number of symbols per file, max concurrent repos being processed
Make a Symbols Status page
Edit: merging here with a known bug (see below) to avoid merge conflicts due to long-lived branch
Log comparison of response time and errors between SQLite and Rockskip
Run load test with one megarepo
Run load test with many small repos
(optional for speed) Use a pool of ctags parsers
(optional for DB size) Store as int IDs: repo names, commit hashes, and paths
(optional for speed) Teach gitserver to run ctags itself to avoid sending file contents over the network
(optional for stability) Implement throttling in case autovacuum gets overwhelmed
(optional for speed) Lazily compactify the ancestry chain in blobs
(maybe) Try replacing git archive with a persistent git cat-file --batch connection

Benchmarks

Currently, this prototype of Rockskip processes the kubernetes repo at ~40 commits per second on my laptop, taking 20 minutes in total. Queries take ~10ms.

Timing breakdown on kubernetes:

62% constructing the Rockskip structure in Postgres
30% ctags (this can be parallelized down to almost nothing)
5% git operations
3% other
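
As a rough illustration of how the ctags share could be parallelized, here is a minimal worker-pool sketch. It is an assumption about the shape of such a pool, not the implementation in this PR: the `file` type, `parseAll`, and `toyParse` are hypothetical names, and `toyParse` is a stand-in for a real ctags parser process.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// file is a hypothetical unit of work: a path plus its contents.
type file struct {
	path     string
	contents string
}

// parseAll fans files out to a fixed number of workers and collects the
// symbols found in each file, keyed by path.
func parseAll(files []file, workers int, parse func(file) []string) map[string][]string {
	if workers < 1 {
		workers = 1 // avoid blocking forever with no consumers
	}
	jobs := make(chan file)
	results := make(map[string][]string, len(files))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range jobs {
				symbols := parse(f) // the expensive step runs outside the lock
				mu.Lock()
				results[f.path] = symbols
				mu.Unlock()
			}
		}()
	}

	for _, f := range files {
		jobs <- f
	}
	close(jobs)
	wg.Wait()
	return results
}

// toyParse is a stand-in for ctags: it treats every line that starts with
// "func " as a symbol and returns the name before the first parenthesis.
func toyParse(f file) []string {
	var symbols []string
	for _, line := range strings.Split(f.contents, "\n") {
		if !strings.HasPrefix(line, "func ") {
			continue
		}
		name := strings.TrimPrefix(line, "func ")
		if i := strings.IndexByte(name, '('); i >= 0 {
			name = name[:i]
		}
		symbols = append(symbols, name)
	}
	return symbols
}

func main() {
	files := []file{
		{"a.go", "package a\nfunc Foo() {}\nfunc Bar() {}"},
		{"b.go", "package b\nfunc Baz() {}"},
	}
	for path, symbols := range parseAll(files, 4, toyParse) {
		fmt.Println(path, symbols)
	}
}
```

With real ctags, each worker would presumably own one long-lived parser process, so the pool size bounds both CPU use and the number of subprocesses.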

Known bug

There are cases (often in protobuf files) where Rockskip can't find a symbol that was apparently deleted.
