indexed-search: Performance improvements for concatenated regex queries
Created by: keegancsmith
A customer reported indexed-search running slower than expected for regexes of the form (AX|AY|AZ) (or groups of constant strings). We suggested adjusting the regex to A(X|Y|Z). This did indeed improve the performance of the query, but it is still slower than expected. As such, we should investigate potential performance improvements for evaluating these sort of queries.
Context
Zoekt's matching engine transforms regex's into a tree of substring queries. These substring queries can efficiently consult zoekt's index to find potentially matching documents. So a regex like foo(bar|baz) will become the query (AND (content "foo") (OR (content "bar") (content "baz"))). My guess at the performance issue is that (OR ...) could be more efficiently rule out poor matches.
Ideas
- Investigate how to more efficiently evaluate / reduce the work done for
(OR ...)in zoekt's matchtree evaluater. - More generally see if we can more efficiently consult the zoekt index for regular expressions with groups.
- Translate
(X|Y|Z)whereX,YandZare strings into just substring search. This avoids the cost of running the regex engine in this common case.