API docs search: switch from pg_trgm -> tsvector index; 10x query time improvement & 1000x more results
Created by: slimsag
This switches API docs search indexing from a pg_trgm-based schema, to a tsvector-based schema:
- tsvector is better for our use case of API docs search
- We get far better relevancy of results
- Query times are >10x faster, and we get 1000x more results (which translates directly into better relevance/ranking)
The pg_trgm-based schema stored trigrams, while the new tsvector-based schema stores lexemes in the form of tsvectors. We manually craft the tsvectors in Go code for order, case, and punctuation sensitivity (more on this below).
Other notable changes:
- repo names, language names, and tags for search results are now stored in separate tables which reduces duplication of data and improves query perf.
- we abbreviate the `lsif_data_documentation_search` table prefix to `lsif_data_docs_search`
More details on pg_trgm vs. tsvector differences
Performance
For the same set of 155 repos, 133,388 rows (symbols):
- Index size: increases slightly, 95MB -> 113MB
- Query time: substantially reduced
- A query for `Router` goes from 663ms -> 63ms, and finds 10,000 results instead of just 10
- A query for `error` goes from 471ms -> 91ms, and finds 10,000 results instead of just 10
Relevancy
One major issue with our usage of pg_trgm at query time was finding a way to make matching terms in the left side of search key strings take precedence over terms in the right side of the search key. For example, making the query `mux.Router` match `mux.Router` before it matches `foo.Router`. With pg_trgm, this was not really possible at all, and results were poor because of it.
With the tsvector approach, we get much better relevancy because we can utilize `'mux' <-> 'Router'`, the `<->` operator indicating "followed by".
Because performance is so much better than with pg_trgm, we can consider many more results in a partial `ORDER BY` (so we're not ordering over the entire table): 10,000 instead of just 10, and it's still faster. In practice, this improves the relevancy of results dramatically.
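As a minimal sketch of the "followed by" idea above: given the lexemes of a query, we can join them with `<->` to build a tsquery string that only matches terms in sequence. The helper name below is hypothetical; the actual query construction in this PR may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// followedBy joins already-split lexemes into a Postgres tsquery string
// using the <-> ("followed by") operator, so that 'mux' <-> 'Router'
// ranks mux.Router ahead of foo.Router. Hypothetical sketch, not the
// exact implementation in this PR.
func followedBy(lexemes []string) string {
	quoted := make([]string, len(lexemes))
	for i, lex := range lexemes {
		// Escape embedded single quotes per SQL string literal rules.
		quoted[i] = "'" + strings.ReplaceAll(lex, "'", "''") + "'"
	}
	return strings.Join(quoted, " <-> ")
}

func main() {
	fmt.Println(followedBy([]string{"mux", "Router"})) // 'mux' <-> 'Router'
}
```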
Strictness vs. Fuzziness
One tradeoff with using tsvector is that only exact matches of the lexemes are found, unlike pg_trgm which did truly fuzzy word matching. In practice, this means that:
- Typos are not handled: if you fumble `Router` to `ruoter`, we're not going to find it. In practice, I think this is a good thing, as results were often too fuzzy with pg_trgm.
- Case sensitivity: with tsvector, we must decide to be either case-sensitive or case-insensitive. If we want both, our index size doubles. I've chosen case sensitivity, which means a search for `router` will not turn up `Router`. This means users can't choose between case-sensitive/insensitive search for now. I may revisit this in the future.
- Partial string matching: tsvectors match whole lexemes (words), and within a lexeme a tsvector query can match a prefix but not a suffix or substring. We can have `Rout` match `Router`, but we can't have `oute` or `ter` match `Router`. This is, of course, just within what we define as a lexeme/word. To support prefix, suffix, and substring matching, we store the reverse string (so we can use prefix matching for suffix matching) and double our index size.
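The reversed-string trick above can be sketched as follows: store `reverse(lexeme)` alongside each lexeme, and a suffix query becomes a prefix query over the reversed strings (which tsquery's `:*` prefix matching can serve). The function names are illustrative, not from the actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// reverse returns s with its runes in reverse order.
func reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

// suffixMatches demonstrates the idea: "ter" is a suffix of "Router"
// exactly when reverse("ter") is a prefix of reverse("Router").
func suffixMatches(lexeme, suffix string) bool {
	return strings.HasPrefix(reverse(lexeme), reverse(suffix))
}

func main() {
	fmt.Println(suffixMatches("Router", "ter"))  // true: suffix match
	fmt.Println(suffixMatches("Router", "oute")) // false: substring, not suffix
}
```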
Ranking
tsvector gives us far more flexibility over ranking. We can assign ranking weights per lexeme to e.g. give more priority to package name matches or similar. We're not using this functionality yet but it opens some interesting possibilities.
Lexemes
"But wait!" you might be thinking, "doesn't tsvector operate on human language / whole english words and ignore all punctuation, special characters, etc.?"
By default, yes. The `to_tsvector` configurations of `simple`, `english`, and everything else operate on human-language dictionaries. They normalize words by e.g. lowercasing everything, and sometimes reducing words (e.g. `dogs` -> `dog`). In code search, and in our use case, this is obviously not acceptable.
Luckily, Postgres allows us to create `tsvector` lexemes on our own and define what we want. We create these in Go code by considering contiguous sequences of alphabetical/digit/number characters as a single lexeme, and all other characters as their own lexemes. For example, `foo::bar.Router'a-baz()` becomes `foo`, `:`, `:`, `bar`, `.`, `Router`, `'`, `a`, `-`, `baz`, `(`, `)`.
By crafting the lexemes ourselves, we gain case sensitivity and the ability to match punctuation/special characters. This obviously increases index size (which is why Postgres doesn't do it by default), but in practice it remains competitive with, and by an awful lot better than, the trigram index we had previously.
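The splitting rule described above can be sketched in a few lines of Go. This is a minimal reimplementation of the stated rule (letter/digit runs form one lexeme, every other character is its own lexeme), not the exact function from this PR:

```go
package main

import (
	"fmt"
	"unicode"
)

// lexemes splits a code string as described: contiguous runs of letters
// and digits form a single lexeme; every other character becomes its own
// lexeme. Treatment of whitespace and non-ASCII is my assumption here.
func lexemes(s string) []string {
	var out []string
	var word []rune
	flush := func() {
		if len(word) > 0 {
			out = append(out, string(word))
			word = word[:0]
		}
	}
	for _, r := range s {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			word = append(word, r)
			continue
		}
		flush()
		out = append(out, string(r))
	}
	flush()
	return out
}

func main() {
	fmt.Println(lexemes("foo::bar.Router'a-baz()"))
	// [foo : : bar . Router ' a - baz ( )]
}
```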
Query power
With tsvector we have a lot more control over how our queries are matched.
Where pg_trgm did only fuzzy matching (or regexp, at a fair cost), with tsvector we get the ability to match our query terms with the logical boolean operators `&` (AND), `|` (OR), and `!` (NOT), plus grouping with `()`.
In practical terms, this allows us to do a few things better:
- Where we have a query like `github.com/golang/go http.StatusNotFound` and we don't actually know which query terms are a repository or not, we can do a more precise match. With pg_trgm, we had to rely on string fuzziness to match such a thing and hope the unrelated terms wouldn't conflict.
- Where we have a query like `go http Post`, and we aren't sure which term is likely to be a programming language name, we can use a logical OR of all query terms to match the language.
- Where we have a query like `private package internal/util`, and we're not sure which terms are tags or not, we can use a logical OR of all query terms to match the tags.
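The logical-OR cases above amount to joining quoted terms with `|`. As a hedged sketch (the helper name is hypothetical, not from this PR):

```go
package main

import (
	"fmt"
	"strings"
)

// anyOf builds a tsquery string that matches if any one of the query
// terms matches, e.g. for testing which term of "go http Post" is a
// language name or a tag. Illustrative sketch only.
func anyOf(terms []string) string {
	quoted := make([]string, len(terms))
	for i, t := range terms {
		// Escape embedded single quotes per SQL string literal rules.
		quoted[i] = "'" + strings.ReplaceAll(t, "'", "''") + "'"
	}
	return strings.Join(quoted, " | ")
}

func main() {
	fmt.Println(anyOf([]string{"go", "http", "Post"})) // 'go' | 'http' | 'Post'
}
```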
Helps #21938
Note: This is all currently off by default and behind a feature flag. Actual querying against this table is not yet implemented; a subsequent PR will add this.