Automation should never miss search results
Created by: mrnugget
In its current state, Automation campaigns might miss search results because of the way Automation interacts with our search infrastructure. See this comment chain for the full context. Below are a few excerpts:
@rvantonder in https://github.com/sourcegraph/sourcegraph/pull/7790#discussion_r367736633:
Here's a breakdown of things:

For the repo question: this (to me) is a problem where we can't have two things at the same time: (1) a clean way to operate on a repo, given we already know what the repo is, and (2) a filtering mechanism to only run on certain files, or with certain constraints. The nice thing about replacer is that it actually solves (1) as far as the API is concerned: just give it the repo, it'll do the fetching, and it'll run comby on all the files (there is a way to filter files, but it is not exactly the same as what we support in the query). But the point is there's no danger of fiddling with queries/escaping. The problem: the replacer interface is too rudimentary to understand/implement the full spectrum of our search query. The frontend/search interface (2) does understand the full query and can perform the filtering mechanism (useful!), but the problem is that we need to mess around with the query to do that properly. I agree this is the best possible option right now, but there is another danger hidden in (2). Next point.
Any time we ever call `newSearch`, we are never guaranteed to find all matches. For search/replace, it is imperative that we (a) always try our hardest to replace every instance (i.e., we will wait!) or (b) if we don't get everything, we need to know about what was missed. It would be great if there was a knob to say 'search and replace this, just guarantee me that you found every single instance', but there is a ton of search functionality across concerns (index/searcher/...) that makes adding such a knob very, very hard IMO, and it would be easier to build such a "guaranteed searcher" from scratch in replacer than to try to layer it onto search. The alternative is that we are cognizant of the limitations of search and monitor it when it tells us 'there were more results, but I didn't return them'. If we go this route, we will need to add a lot of (TBH probably annoying) logic that checks the state of returned results, refetches, and all that. I'm not even sure how we can guarantee that pans out, since search can skip repos or files. It is brittle for this purpose. Pagination gives me some hope that we can do finer accounting and do 'retry' operations in a much easier way (right now we'd have to do the accounting ourselves, somehow).

So we have bigger problems than that repo issue. Just for this regex campaign mode, I realized that since this is using searcher, partial results will return quickly, and so I basically will add `count:10000` with no real guarantee that all of the matches are covered. For example, when I create these campaigns locally:

```json
{ "scopeQuery": "repo:github.com/sourcegraph/sourcegraph file:.*", "regexMatch": "func", "textReplace": "foo" }
```

will have 1048 matches (and returns quickly), while

```json
{ "scopeQuery": "repo:github.com/sourcegraph/sourcegraph file:.* count:10000", "regexMatch": "func", "textReplace": "foo" }
```

will have 10647.
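The "monitor and refetch" alternative described above could look roughly like the sketch below. It assumes a `limitHit`-style flag on the response that signals dropped results; `runSearch` and its fake data are hypothetical stand-ins for a real search API call. Note the caveat from the comment still applies: even this loop can't catch results the backend silently skipped without flagging.

```go
package main

import "fmt"

// searchResult models the relevant parts of a search response. The LimitHit
// flag stands in for the backend's "there were more results, but I didn't
// return them" signal; everything else here is illustrative.
type searchResult struct {
	Matches  int
	LimitHit bool // true when the backend dropped results
}

// runSearch is a stand-in for a real search API call. It pretends the repo
// has 10647 matches in total, so any count below that is truncated.
func runSearch(query string, count int) searchResult {
	const total = 10647
	if count < total {
		return searchResult{Matches: count, LimitHit: true}
	}
	return searchResult{Matches: total, LimitHit: false}
}

// exhaustiveSearch is the "annoying accounting logic" from the comment:
// retry with a larger count until the backend reports nothing was dropped.
func exhaustiveSearch(query string) searchResult {
	count := 10000
	for {
		res := runSearch(query, count)
		if !res.LimitHit {
			return res
		}
		count *= 2 // escalate and refetch
	}
}

func main() {
	res := exhaustiveSearch("repo:github.com/sourcegraph/sourcegraph func")
	fmt.Println(res.Matches, res.LimitHit) // 10647 false
}
```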
But there's an even bigger problem than the number of matches getting cut off by `count` or `timer`: when we filter using, e.g., `file:`, the list of files can also be cut off. So for example, the query

```json
{ "scopeQuery": "repo:github.com/torvalds/linux index:no file:.* count:10000", "regexMatch": "TODO", "textReplace": "DONE" }
```

on my machine will return only 15 matches. That's because we return some number of files (I think 1000, and `count` doesn't appear to do anything) and there are only 15 matches in those files. If I want to force search to try harder, I actually have to use the `regexMatch` value in the `scopeQuery`:

```json
{ "scopeQuery": "repo:github.com/torvalds/linux TODO index:no file:.* count:10000", "regexMatch": "TODO", "textReplace": "yo" }
```

This finds 1776 matches. Does this find all the matches? I still don't know, and that's the problem. Long term we'll have to think about how to either rearchitect the search code or queries to handle this issue, or build up a dedicated replacer service. There's contention between designing for 'time-to-first-result' and 'match-and-replace-all-instances-at-almost-any-cost'.
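The workaround above — embedding the `regexMatch` value into the query sent to search — could be sketched as a small helper. The helper name and the `count:10000` default are illustrative, not actual Automation code, and as the first comment notes, fiddling with queries like this reintroduces the escaping danger: a `regexMatch` containing query-significant characters would need to be escaped before concatenation.

```go
package main

import "fmt"

// buildSearchQuery embeds the campaign's match pattern into the scope query,
// biasing the (truncated) file list towards files that actually contain
// matches. Hypothetical helper; no escaping of regexMatch is done here.
func buildSearchQuery(scopeQuery, regexMatch string) string {
	return fmt.Sprintf("%s %s count:10000", scopeQuery, regexMatch)
}

func main() {
	fmt.Println(buildSearchQuery("repo:github.com/torvalds/linux index:no file:.*", "TODO"))
	// repo:github.com/torvalds/linux index:no file:.* TODO count:10000
}
```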
@rvantonder in https://github.com/sourcegraph/sourcegraph/pull/7790#discussion_r367868835:
To be clear, what I mean is that we need to do (possibly significant) extra work compared to what the current search code is geared towards, not that it's at odds.
@keegancsmith probably has better insight, but my gut reaction is that using indexed search results/Zoekt is a no-go. It is a black box and takes lots of shortcuts in ways that make it hard to guarantee that it returns all matches, whether content or files (the issue in #7709 may be indicative). Which implies using searcher logic. But searcher doesn't do repo filtering; that happens in the frontend. So there is an interface to cut out here.
@mrnugget in https://github.com/sourcegraph/sourcegraph/pull/7790#discussion_r367877914:
There are two problems we have to address:

1. We need to reliably come up with a list of repositories over which to run Automation campaigns.
2. We need to find every search result in each one of these repositories.
Given our current state of campaign types and our roadmap, I think (1) is already reliable (or can be made reliable) by restricting `scopeQuery` to only accept `repo:` filters. That should always give us the full list of repositories that match the `scopeQuery`.
The problem is that once you do that, you start to get a lot of false positives. Example: you run an NPM credentials campaign with `"scopeQuery": "repo:github"`. That returns 200 repos, but only 3 of those have an `.npmrc` file. So I decided to add `file:.npmrc` to the `scopeQuery`. The problem is that, apparently, we're now no longer reliable without adding `count:99999` to it. For regex-search-replace the problem is even bigger, because it can yield more hits.
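The proposed restriction — accept only `repo:` filters in a `scopeQuery` so that resolving it always yields the complete repository list — could be enforced with a validation step along these lines. The function name and rules are illustrative, not Sourcegraph's actual implementation; it assumes filters are whitespace-separated fields.

```go
package main

import (
	"fmt"
	"strings"
)

// validateScopeQuery rejects any filter other than repo: (or a negated
// -repo:), so that the scope query can be resolved to a complete,
// reliable repository list. Illustrative sketch only.
func validateScopeQuery(q string) error {
	for _, field := range strings.Fields(q) {
		if strings.HasPrefix(field, "repo:") || strings.HasPrefix(field, "-repo:") {
			continue
		}
		return fmt.Errorf("scopeQuery may only contain repo: filters, got %q", field)
	}
	return nil
}

func main() {
	fmt.Println(validateScopeQuery("repo:github.com/sourcegraph -repo:archived"))
	fmt.Println(validateScopeQuery("repo:github file:.npmrc"))
}
```

Under this scheme the `.npmrc` filtering would have to happen after checkout (per repo), rather than via a `file:` filter whose result list can be silently truncated.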
Problem (2) is what we're discussing in this comment chain, and it becomes more important with regex-search-and-replace and future campaigns that will run in the background and proactively scan/search repositories.
I hope that we, builders of code search, can come up with a way that allows us to make use of the search infrastructure we already have in a reliable way 😄

[... comment by @rvantonder ...]
This comment combined with your earlier comment makes me think again of the idea of combining searcher/replacer into a single service with added "run code over git repo" functionality.