API docs search: switch from pg_trgm -> tsvector index; 10x query time improvement & 1000x more results
Created by: slimsag
This switches API docs search indexing from a pg_trgm-based schema, to a tsvector-based schema:
- tsvector is better for our use case of API docs search
- We get far better relevancy of results
- Query times are >10x faster, and we get 1000x more results (which translates directly into better relevance/ranking)
The pg_trgm-based schema stored trigrams, while the new tsvector-based schema stores lexemes in the form of tsvectors. We manually craft the tsvectors in Go code for order, case, and punctuation sensitivity (more on this below).
Other notable changes:
- repo names, language names, and tags for search results are now stored in separate tables which reduces duplication of data and improves query perf.
- we abbreviate the `lsif_data_documentation_search` table prefix to `lsif_data_docs_search`
More details on pg_trgm vs. tsvector differences
Performance
For the same set of 155 repos, 133,388 rows (symbols):
- Index size: increases slightly, 95MB -> 113MB
- Query time: substantially reduced
- A query for `Router` goes from 663ms -> 63ms, and finds 10,000 results instead of just 10
- A query for `error` goes from 471ms -> 91ms, and finds 10,000 results instead of just 10
Relevancy
One major issue with our usage of pg_trgm at query time was finding a way to make matching terms in the left side of search key strings take precedence over terms in the right side of the search key. For example, making the query `mux.Router` match `mux.Router` before it matches `foo.Router`. With pg_trgm, this was not really possible at all, and results were poor because of it.
With the tsvector approach, we get much better relevancy because we can utilize `'mux' <-> 'Router'`, the `<->` operator indicating "followed by".
Because performance is so much better than with pg_trgm, we can consider many more results in a partial `ORDER BY` (so we're not ordering over the entire table): 10,000 instead of just 10, and it's still faster. In practice, this improves the relevancy of results dramatically.
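As a minimal sketch of the "followed by" idea above: given the lexemes of a query, we can join them with `<->` to build a tsquery string that only matches terms in sequence. The helper name below is hypothetical; the actual query construction in this PR may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// followedBy joins already-split lexemes into a Postgres tsquery string
// using the <-> ("followed by") operator, so that 'mux' <-> 'Router'
// ranks mux.Router ahead of foo.Router. Hypothetical sketch, not the
// exact implementation in this PR.
func followedBy(lexemes []string) string {
	quoted := make([]string, len(lexemes))
	for i, lex := range lexemes {
		// Escape embedded single quotes per SQL string literal rules.
		quoted[i] = "'" + strings.ReplaceAll(lex, "'", "''") + "'"
	}
	return strings.Join(quoted, " <-> ")
}

func main() {
	fmt.Println(followedBy([]string{"mux", "Router"})) // 'mux' <-> 'Router'
}
```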
Strictness vs. Fuzziness
One tradeoff with using tsvector is that only exact matches of the lexemes are found, unlike pg_trgm which did truly fuzzy word matching. In practice, this means that:
- Typos are not handled: if you fumble `Router` to `ruoter`, we're not going to find it. In practice, I think this is a good thing, as results were often too fuzzy with pg_trgm.
- Case sensitivity: with tsvector, we must decide to be either case-sensitive or case-insensitive. If we want both, our index size doubles. I've chosen case sensitivity, which means a search for `router` will not turn up `Router`. This means users can't choose between case-sensitive/insensitive search for now. I may revisit this in the future.
- Partial string matching: tsvectors match whole lexemes (words), and within a lexeme a tsvector query can match a prefix but not a suffix or substring. We can have `Rout` match `Router`, but we can't have `oute` or `ter` match `Router`. This is, of course, just within what we define as a lexeme/word. To support prefix, suffix, and substring matching, we store the reverse string (so we can use prefix matching for suffix matching) and double our index size.
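The reversed-string trick above can be sketched as follows: store `reverse(lexeme)` alongside each lexeme, and a suffix query becomes a prefix query over the reversed strings (which tsquery's `:*` prefix matching can serve). The function names are illustrative, not from the actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// reverse returns s with its runes in reverse order.
func reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// suffixMatches demonstrates the idea: "ter" is a suffix of "Router"
// exactly when reverse("ter") is a prefix of reverse("Router").
func suffixMatches(lexeme, suffix string) bool {
	return strings.HasPrefix(reverse(lexeme), reverse(suffix))
}

func main() {
	fmt.Println(suffixMatches("Router", "ter"))  // true: suffix match
	fmt.Println(suffixMatches("Router", "oute")) // false: substring, not suffix
}
```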
Ranking
tsvector gives us far more flexibility over ranking. We can assign ranking weights per lexeme to e.g. give more priority to package name matches or similar. We're not using this functionality yet but it opens some interesting possibilities.
Lexemes
"But wait!" you might be thinking, "doesn't tsvector operate on human language / whole english words and ignore all punctuation, special characters, etc.?"
By default, yes. The `to_tsvector` configurations of `simple`, `english`, and everything else operate on human-language dictionaries. They normalize words by e.g. lowercasing everything, and sometimes reducing words (e.g. `dogs` -> `dog`). In code search, and in our use case, this is obviously not acceptable.
Luckily, Postgres allows us to create `tsvector` lexemes on our own and define what we want. We create these in Go code by considering contiguous sequences of alphabetical/digit/number characters as a single lexeme, and all other characters as their own lexemes. For example, `foo::bar.Router'a-baz()` becomes `foo`, `:`, `:`, `bar`, `.`, `Router`, `'`, `a`, `-`, `baz`, `(`, `)`.
By crafting the lexemes ourselves, we gain case sensitivity and the ability to match punctuation/special characters. This obviously increases index size (which is why Postgres doesn't do it by default), but in practice it remains competitive with, and by an awful lot better than, the trigram index we had previously.
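The splitting rule described above can be sketched in a few lines of Go. This is a minimal reimplementation of the stated rule (letter/digit runs form one lexeme, every other character is its own lexeme), not the exact function from this PR:

```go
package main

import (
	"fmt"
	"unicode"
)

// lexemes splits a code string as described: contiguous runs of letters
// and digits form a single lexeme; every other character becomes its own
// lexeme. Treatment of whitespace and non-ASCII is my assumption here.
func lexemes(s string) []string {
	var out []string
	var word []rune
	flush := func() {
		if len(word) > 0 {
			out = append(out, string(word))
			word = word[:0]
		}
	}
	for _, r := range s {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			word = append(word, r)
			continue
		}
		flush()
		out = append(out, string(r))
	}
	flush()
	return out
}

func main() {
	fmt.Println(lexemes("foo::bar.Router'a-baz()"))
	// [foo : : bar . Router ' a - baz ( )]
}
```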
Query power
With tsvector we have a lot more control over how our queries are matched.
Where pg_trgm did only fuzzy matching (or regexp, at a fair cost), with tsvector we get the ability to match our query terms with the logical boolean operators `&` (AND), `|` (OR), and `!` (NOT), plus grouping with `()`.
In practical terms, this allows us to do a few things better:
- Where we have a query like `github.com/golang/go http.StatusNotFound` and we don't actually know which query terms are a repository or not, we can do a more precise match. With pg_trgm, we had to rely on string fuzziness to match such a thing and hope the unrelated terms wouldn't conflict.
- Where we have a query like `go http Post`, and we aren't sure which term is likely to be a programming language name, we can use a logical OR of all query terms to match the language.
- Where we have a query like `private package internal/util`, and we're not sure which terms are tags or not, we can use a logical OR of all query terms to match the tags.
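The logical-OR cases above amount to joining quoted terms with `|`. As a hedged sketch (the helper name is hypothetical, not from this PR):

```go
package main

import (
	"fmt"
	"strings"
)

// anyOf builds a tsquery string that matches if any one of the query
// terms matches, e.g. for testing which term of "go http Post" is a
// language name or a tag. Illustrative sketch only.
func anyOf(terms []string) string {
	quoted := make([]string, len(terms))
	for i, t := range terms {
		// Escape embedded single quotes per SQL string literal rules.
		quoted[i] = "'" + strings.ReplaceAll(t, "'", "''") + "'"
	}
	return strings.Join(quoted, " | ")
}

func main() {
	fmt.Println(anyOf([]string{"go", "http", "Post"})) // 'go' | 'http' | 'Post'
}
```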
Helps #21938
Note: This is all currently off by default and behind a feature flag. Actual querying against this table is not yet implemented; a subsequent PR will add this.