search: optimize pattern expressions to native Zoekt queries
Created by: rvantonder
Closes #9824 (closed). Hopefully without incident. I'm quite confident in the correctness. I only tested speed a little bit on a local machine. The effect is going to be a much bigger win on larger instances. But you can still see a major difference locally:
Before:
Now:
The way we optimize is a bit involved. But it is "clean" and predictable, in the sense that it is a small footprint, and generalizes well enough that we can extend the current structure (e.g., for commit queries too, which Camden threatened to do). So yeah, because I don't want to spaghettify our code and actually have a concrete vision of keeping things hygienic and "small" (this ended up being under 100 line addition), this is how it works:
-
Create our job tree like we normally would: call
ToSearchJob
-
Do an optimization phase. First:
- Create a job tree like we normally would: call
ToSearchJob
. But this time, we give it the entire pattern expression by calling it with aquery.Basic
type, but before we do the expression splitting we normally do. The Zoekt jobs created inToSearchJob
is already structured to be able to handle this advanced expression type, and not just a simple string. This means that the Zoekt jobs created in this call are the optimized ones, with native Zoekt queries. We get some non-Zoekt jobs back from this tree that are invalid (maybe) because they might still be assuming that they're not dealing with an expression. This does not matter, we care about optimizing Zoekt queries only at this point. - To wit, we take this job tree with optimized Zoekt queries and extract them--we just want the atomic jobs, which have optimized Zoekt queries.
- Now we take our original job tree and remove (trim) all the Zoekt jobs that we've created optimizated versions of. Everything else stays in place.
- Now we add our optimized Zoekt jobs to the trimmed version and compose with a
Parallel
job.
- Create a job tree like we normally would: call
Here's a visual example.
foo and bar and not baz
Default job tree for: flowchart TB
0([AND])
0---1
1([LIMIT])
1---2
2[40000]
1---3
3([PARALLEL])
3---4
4([ZoektGlobalSearch])
3---5
5([RepoSearch])
3---6
6([ComputeExcludedRepos])
0---7
7([LIMIT])
7---8
8[40000]
7---9
9([PARALLEL])
9---10
10([ZoektGlobalSearch])
9---11
11([RepoSearch])
9---12
12([ComputeExcludedRepos])
0---13
13([LIMIT])
13---14
14[40000]
13---15
15([PARALLEL])
15---16
16([ZoektGlobalSearch])
15---17
17([RepoSearch])
15---18
18([ComputeExcludedRepos])
foo and bar and not baz
Optimized job tree for: flowchart TB
0([PARALLEL])
0---1
1([ZoektGlobalSearch])
0---2
2([AND])
2---3
3([LIMIT])
3---4
4[40000]
3---5
5([PARALLEL])
5---6
6([NoopJob])
5---7
7([RepoSearch])
5---8
8([ComputeExcludedRepos])
2---9
9([LIMIT])
9---10
10[40000]
9---11
11([PARALLEL])
11---12
12([NoopJob])
11---13
13([RepoSearch])
11---14
14([ComputeExcludedRepos])
2---15
15([LIMIT])
15---16
16[40000]
15---17
17([PARALLEL])
17---18
18([NoopJob])
17---19
19([RepoSearch])
17---20
20([ComputeExcludedRepos])
(don't mind the benign NoOp
nodes, we will tidy such things later)
Optional reading
Why not...?
-
why not create the optimized Zoekt jobs directly from the query? Because there is so much stuff that needs to be computed in order to know which Zoekt jobs to create when (global vs non global, on soucegraph.com, symbol or not, etc.). This is what I tried to keep contained but it is still too difficult to cut through the state. So the solution is: do exactly what we would have done before since we make a second call to `ToSearchJob. We're guaranteed to make the right decisions and not regress on which jobs to create. The static computations are cheap--simplifying the logic and separating the state is what's tricky. I'll continue chugging on this but we are able to have these optimizations now. At some point we may be able to achieve this.
-
why not figure out whether you can add optimized Zoekt queries before doing the default strategy and then just not add the other jobs? Even if we can't create them directly from the query, we can detect whether to add the queries beforehand when creating jobs currently no? Yeah but that would be far less tractable: with our current structure, you would then need to send flags all over the places between functions based on something you did or didn't do before. This is a recipe for disaster. Build an abstraction to manipulate trees and use those abstractions to ensure correctness. Manipulation is cheap and reasoning is sound--mutating state and ad-hoc control flow will destroy us.
Test plan
Added unit tests.