codeintel: Rockskip for symbols

Administrator requested to merge rockskip-symbols into main

Created by: chrismwendt

This implements Rockskip indexing and search for the symbols service.

Motivation 1: improve speed on 10GB+ repos

On ~50GB repos, the symbols service takes ~30s to process symbols for new commits. That slowness frequently causes empty hovers and red error messages in the symbols sidebar on such repos. Although the speed was massively improved by incrementally indexing symbols, there remains one part that's slow and not incremental: making a copy of the SQLite DB for each new commit. Rockskip processes new commits in ~100ms, not ~30s.

Motivation 2: maximize index usage

Because the symbols service gets restarted ~10 times/day on Sourcegraph.com and its caches are wiped on each restart, it ends up re-processing lots of repos. An alternative solution would be to use a stateful set. This PR stores symbols data in Postgres and eliminates the problem of needing to reprocess all symbols on restart.

Motivation 3: vet the concept for usage elsewhere

Rockskip has been described as potentially "transformative" for many use cases at Sourcegraph: text search, symbol search, and code insights. Symbol search is the easiest to implement, so this PR serves as a proof of concept for Rockskip and should increase confidence in using it elsewhere.

TODO

  • Implement the core algorithm using a MockGit and MockDB
  • Implement real SubprocessGit and PostgresDB
  • Perf milestone: add indexes to tables, use a persistent git cat-file process
  • Index ctags
  • Try transactions (done: wrapping everything in a transaction made it 1.3x slower)
  • Feature milestone: integrate Rockskip into the symbols service
  • Wrap each commit in a transaction to avoid corruption
  • Store line, kind, and parent for each symbol
  • Create migration
  • Limit Postgres disk usage with LRU cache
  • https://github.com/sourcegraph/sourcegraph/pull/30663 to avoid hitting argv and URL length limits
  • Make sure rev-list and others terminate when ctx is canceled
  • Stream git log if the HTTP payload is too big
  • Stream git rev-list
  • Add limits for max file size, max number of symbols per file, max concurrent repos being processed
  • Make a Symbols Status page
  • Edit: merging here with a known bug (see below) to avoid merge conflicts due to long-lived branch
  • Log comparison of response time and errors between SQLite and Rockskip
  • Run load test with one megarepo
  • Run load test with many small repos
  • (optional for speed) Use a pool of ctags parsers
  • (optional for DB size) Store as int IDs: repo names, commit hashes, and paths
  • (optional for speed) Teach gitserver to run ctags itself to avoid sending file contents over the network
  • (optional for stability) Implement throttling in case autovacuum gets overwhelmed
  • (optional for speed) Lazily compactify the ancestry chain in blobs
  • (maybe) Try replacing git archive with a persistent git cat-file --batch connection

Benchmarks

Currently, this prototype of Rockskip processes the kubernetes repo at ~40 commits per second on my laptop, taking 20 minutes in total. Queries take ~10ms.

Timing breakdown on kubernetes:

  • 62% constructing the Rockskip structure in Postgres
  • 30% ctags (this can be parallelized down to almost nothing)
  • 5% git operations
  • 3% other
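To put the percentages in wall-clock terms, the following Go sketch converts them into minutes. It assumes the breakdown applies to the same 20-minute run quoted above and that ~40 commits per second was sustained for the whole run. The function names are hypothetical, and the results are back-of-the-envelope estimates rather than measurements.

```go
package main

import "fmt"

// Figures quoted in the benchmark text.
const (
	commitsPerSecond = 40.0
	totalMinutes     = 20.0
)

// impliedCommits is the commit count implied by the quoted rate and duration.
func impliedCommits() float64 {
	return commitsPerSecond * totalMinutes * 60
}

// phaseMinutes converts a share of the run (in percent) into minutes.
func phaseMinutes(percent float64) float64 {
	return totalMinutes * percent / 100
}

// minutesWithoutPhase estimates the total run time if one phase were
// reduced to zero, for example ctags after full parallelization.
func minutesWithoutPhase(percent float64) float64 {
	return totalMinutes - phaseMinutes(percent)
}

func main() {
	phases := []struct {
		name    string
		percent float64
	}{
		{"constructing the Rockskip structure in Postgres", 62},
		{"ctags", 30},
		{"git operations", 5},
		{"other", 3},
	}
	fmt.Printf("implied commits processed: ~%.0f\n", impliedCommits())
	for _, p := range phases {
		fmt.Printf("%-48s %4.1f min\n", p.name, phaseMinutes(p.percent))
	}
	fmt.Printf("estimated total with ctags fully parallelized: %.0f min\n", minutesWithoutPhase(30))
}
```

Under these assumptions, parallelizing ctags would bring the run from 20 minutes to roughly 14, leaving the Postgres work as the dominant cost.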

Known bug

There are cases (often in protobuf files) where Rockskip can't find a symbol that was apparently deleted.
