search: and/or expression parser (!8633) · Merge requests · Warren Gifford / sourcegraph

Warren Gifford requested to merge rvt/query-parser into master Feb 27, 2020

Created by: rvantonder

This is just the parser for the basic tree structure we want to target with tests. Leaves are only strings. This code is not called by any of our code yet. I will incrementally build on this code until it converges with our current query parsing and types, and migrate existing tests. Right now it's time to start checking in code :-). See the test inputs to get an idea of what's supported.

Background for posterity and those interested

I wanted to avoid a couple of things to make our implementation simple and straightforward. I looked at existing parser generators and implementations (including Zoekt's) and there was usually something that I put me off :P (confusing or bloated code). I used some existing implementations as inspiration but eventually settled on this PR. Implementation highlights:

No lex/parse distinction. Unnecessary level of indirection, particularly for something simpler like our language.
No token types: parse directly to the tree data structure. The traditional idea of having tokens might be helpful in some cases but it doesn't really gain much (in our context) and adds another layer of definitions that we don't need.
No state tracking for operator precedence parsing, it's implicit in the recursion order.
Parser control flow depends on lookahead on reserved syntax/keywords like (, and, or.
No extra steps or processing for tree reduction: expressions reduce as the tree is constructed during parsing. This removes the need for implementing another post processing pass. The key functions that do this are newOp and reduce and not very long. For comparison, the Zoekt implementation is ~130 lines just for this simplification, and it's not obvious what's going on, and does an additional traversal of the query after parsing (not that perf matters, but it's just preferable to avoid the complexity). This is also one of the reasons I avoided generators, which usually return the raw tree. In our language we know what reductions we can do at parsing time thanks to the shape of the grammar.

Next steps, to happen in subsequent PRs:

Support the lexical parts of our current queries (parameters, pattern sequences, not)
Add unicode support
Migrate tests

search: and/or expression parser

Merge request reports