Skip to content

Symbols perf: use sqlite3

Administrator requested to merge symbols-sqlite into master

Created by: chrismwendt

Background

Previously, the symbols service was pretty slow on big repositories (linear in the repository size). See https://github.com/sourcegraph/sourcegraph/issues/2257 for analysis.

GIF of kubernetes/kubernetes code intel before this PR

2019-02-18 01 25 15

This vastly improves the speed of searches (it uses sqlite3 and indexes the symbol name column for nearly instant queries). This has no effect on the speed of initial indexing of symbols.

GIF of kubernetes/kubernetes code intel after this PR

2019-02-18 01 19 54

Benchmark results

$ env CTAGS_COMMAND=$PWD/cmd/symbols/universal-ctags-dev go test -v -bench=. -test.run=sdfsfd ./cmd/symbols/internal/symbols
goos: darwin
goarch: amd64
pkg: github.com/sourcegraph/sourcegraph/cmd/symbols/internal/symbols
BenchmarkSearch/indexing_go-langserver@391-8         	      10	 140234483 ns/op
BenchmarkSearch/indexing_moby@6e5-8                  	       1	7930319865 ns/op
BenchmarkSearch/searching_go-langserver@391_^sortedImportRecord$-8         	   30000	     44872 ns/op
BenchmarkSearch/searching_go-langserver@391_1234doesnotexist1234-8         	    5000	    281762 ns/op
BenchmarkSearch/searching_moby@6e5_^fsCache$-8                             	   20000	     80045 ns/op
BenchmarkSearch/searching_moby@6e5_1234doesnotexist1234-8                  	      30	  46515624 ns/op
PASS
ok  	github.com/sourcegraph/sourcegraph/cmd/symbols/internal/symbols	36.963s

I forced the non exact regexes to be run over all symbols by using 1234doesnotexist1234 as the regex. Exact matches (^...$) are much faster and scale better than non exact matches that don't exist.

The current implementation takes ~40ms to search for 1234doesnotexist1234 in moby/moby, and takes ~46ms using sqlite3, which I think is acceptable.

Memory usage improvements

This also reduces the memory consumption (it streams symbols from the ctags process to sqlite3 without needing to build up a giant []Symbol in Go). This should reduce the frequency of the symbols service getting OOM killed (cc @ggilmore, just FYI).

  • Memory usage while indexing kubernetes/kubernetes before this PR: ~200 MB
  • Memory usage while indexing kubernetes/kubernetes after this PR: ~26 MB

TypeScript build script

This also uses TypeScript for a build script. It's much safer than shell, and I'm curious if anyone has thoughts on this.

Review

This is one of the largest changes I've made, and I'd be happy to do a video chat 🎥 to make the review process go quickly and smoothly if you want.

Here are the areas I'd like review on (feel free to review anything, these are just my guesses for what you will want to look at):

  • @keegancsmith
    • Correct usage of concurrency primitives, especially in the functions that return channels for streaming symbols (I'm pretty new to this area of Go, so there is a high probability I've done things that are plain wrong)
    • Correct usage of sqlite3, sqlf, and sqlx
    • Database schema
  • @slimsag
    • Dev build system changes (mostly build.ts, some changes to Procfile and other shell scripts)
    • General Go idioms, control flow, organization of functions, methods, etc.
  • @felixfbecker
    • Usage of ts-node, ./dev/ts-script, and cmd/symbols/build.ts
  • Anyone: am I missing anything that you'd expect to see?

I'm hoping to address feedback until both @slimsag and @keegancsmith are confident that this is a robust implementation and are happy with the result before merging.

TODO

  • Add tests they already exist
  • Add a few more comments, minor refactoring
  • Have a few people run this branch locally to see what parts of the build break (missing libs, missing sqlite3, etc.), since my machine is dirty (both macOS and Linux)

Issue references

Resolves https://github.com/sourcegraph/sourcegraph/issues/2257 Fixes https://github.com/sourcegraph/sourcegraph/issues/1625

Merge request reports

Loading