search: use regex expressions in Zoekt for structural search

Created by: rvantonder

This PR is also for discussion, I'd appreciate some insights @keegancsmith.

The problem: For structural search I just want Zoekt to tell me which files contain some content. Currently I use a Zoekt query that returns file matches, and then I extract the file paths. A comment here mentions that:

Our zoekt fork contains two extensions that may interest you:
- type:file will only return path file matches. IE type:file foo returns all files which has foo in contents.
...

But (a) type:file doesn't seem to do anything in search: it does not give a list of files, just the matches as usual (example query), and the code just calls searcherFilesInRepos, so it doesn't look like anything special is happening here. (b) I cannot find any indication in zoekt.go that we can specify an option that returns only the files containing some content. It always seems to retrieve the matched contents and put them in []FileMatchResolver.

Now, the problem is that I only care about retrieving a list of files containing a matched string, but as far as I can tell from the above, Zoekt always returns the number of matches in these files too. This affects how many file matches Zoekt returns: the number of files it returns depends on how many matches it finds within those files (something I don't care about), which leads to the following problematic behavior for structural search:

  • Specifying a pattern like foo(:[args]) will convert to a zoekt query that says "find me all files that contain matches of string foo( and )".
  • Zoekt finds a file bar.c and sees it contains both foo( and ), so it satisfies the query. Then it does a bunch of work and counts and returns how many matches there are of the strings foo( and ) in this file. It will find hundreds to thousands of matches of the dangling ).
  • Zoekt does this for a few more files then says "dang, that's a lot of matches, let's stop returning files that satisfy this query". Upstream, in zoekt.go, we look at this information and also say "dang, that's a lot of matches, let's set limitHit to true".
  • Meanwhile, I don't care how many matches there are in the file; I just want file paths, regardless of how many matches each file contains.
  • So, Zoekt only returns information about, say, 5 or 7 files that satisfy the original query, whereas I'd be happy to get 30 to 100.
  • All this messes with whether limitHit gets set or not inside zoekt, and I'll need to do extra accounting so that limitHit is accurate for structural search.
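
To make the effect above concrete, here is a toy model in Go. It is not Zoekt's actual algorithm: fileHits, collectByMatchBudget, and collectByFileBudget are hypothetical names invented for illustration, and the numbers are made up. It only shows how a budget counted in matches cuts off the file list much earlier than a budget counted in files.

```go
package main

import "fmt"

// fileHits is a hypothetical stand-in for one file result:
// a path plus the number of matches found in that file.
type fileHits struct {
	path    string
	matches int
}

// collectByMatchBudget models a limit on the total number of matches.
// It stops returning files once the summed match count reaches budget.
func collectByMatchBudget(files []fileHits, budget int) (paths []string, limitHit bool) {
	total := 0
	for _, f := range files {
		if total >= budget {
			return paths, true
		}
		paths = append(paths, f.path)
		total += f.matches
	}
	return paths, false
}

// collectByFileBudget models what structural search wants:
// each matching file counts once, however many matches it contains.
func collectByFileBudget(files []fileHits, budget int) (paths []string, limitHit bool) {
	for _, f := range files {
		if len(paths) >= budget {
			return paths, true
		}
		paths = append(paths, f.path)
	}
	return paths, false
}

func main() {
	// Made-up counts: the first files are dominated by matches of a dangling ")".
	files := []fileHits{
		{"a.c", 400}, {"b.c", 350}, {"c.c", 300}, {"d.c", 2}, {"e.c", 1},
	}
	byMatches, hit := collectByMatchBudget(files, 1000)
	fmt.Println(len(byMatches), hit)
	byFiles, hit := collectByFileBudget(files, 1000)
	fmt.Println(len(byFiles), hit)
}
```

With the same budget of 1000, the match-based limit stops after three files and reports the limit as hit, while the file-based limit returns all five.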

Ways to fix this?

  • Tell Zoekt to give me file matches and stop looking at match counts. I couldn't find a way to have Zoekt ignore the number of matches and just give me the files irrespective of match count. Setting the match count in searchOptions to some very high number is not a solution here, because that might force it to do a lot of work in cases where it doesn't need to. In general, fiddling with match counts (if that's the only control we have) is just not the right way to get Zoekt to return matching files. The ideal behavior would be for Zoekt to short-circuit on a matching file and just return it.
  • This PR: Change the conversion of foo(:[args]) to the regex foo\(.*\) so that Zoekt doesn't tell me about all the thousands of matches it might find for some patterns, and only finds the handful satisfying foo\(.*\). This solution is unsatisfying because it's regex now, but it does play much better with the number of matching files and the limitHit logic.
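
The conversion described in the second option can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: structuralPatToRegex and holeRe are hypothetical names, and the sketch assumes only simple named holes of the form :[name], ignoring any other hole syntax. The literal text around each hole is escaped and the holes are replaced with .*.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// holeRe matches simple named holes such as :[args].
// Assumption: other hole forms are not handled in this sketch.
var holeRe = regexp.MustCompile(`:\[\w+\]`)

// structuralPatToRegex escapes the literal pieces of a structural
// pattern and joins them with ".*" where the holes were.
func structuralPatToRegex(pattern string) string {
	pieces := holeRe.Split(pattern, -1)
	quoted := make([]string, 0, len(pieces))
	for _, p := range pieces {
		quoted = append(quoted, regexp.QuoteMeta(p))
	}
	return strings.Join(quoted, ".*")
}

func main() {
	fmt.Println(structuralPatToRegex("foo(:[args])")) // prints foo\(.*\)
}
```

Because the pieces are joined into a single regex, a file only counts as matching when foo( is followed by ) on the same line, rather than whenever both strings appear somewhere in the file.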

It's possible I'm completely missing something. However, I spent 1d+ just figuring out why this behavior is what it is, so putting this up to see if I can get a quicker path to a solution.
