Support multiline matches in search: implement logical result and content retrieval parameter for search
Created by: rvantonder
Problem
Currently all content matches in search results conform to this type, the core assumption is that a matched range can only ever occur in one line:
export interface ContentMatch {
lineMatches: LineMatch[]
...
}
interface LineMatch {
line: string // AKA preview
lineNumber: number
offsetAndLengths: number[][]
}
This representation is problematic for rendering or post processing data. Examples below, but I'm going to jump into the proposed solution. The point is that the time has come to fix this because it's actively blocking Code Insights/Compute and will also solve a host of longstanding and thorny issues. Most pertinent business/customer use case ("why are we doing this now?"): https://docs.google.com/document/d/1lNwfX-yGXx-sEcM7-X9iIAG6yFspJwBXXrzkPsYAuII/edit#heading=h.6mw6j1ignh29. A rough ask around timeline is to enable the multiline use case by end of April, if I recall correctly @Joelkw?
We'll need to coordinate effort here. @sourcegraph/search-core will be invaluable for helping identify/implement changes that we settle on, so I appreciate your eyes and thoughts here
Proposed solution
We need to fix two parts:
(1) Represent the content matched as a single, logical result irrespective of how many lines it has
type FileMatch struct {
...
matches Range[]
...
}
type Range struct {
Start Location
End Location
}
type Location struct {
Line int
Column int
}
To do this, we need to update searcher and Zoekt to do bookkeeping of ranges. For single-line matches, there's an isomorphism from LineMatches
to this representation, so we already have code to do some bookkeeping. I don't recall the internals well enough to estimate how much effort this is. Having matches represented this way will stop regressions with match counts, make rendering logic easier, and also enable linking to ranges of associated matches in our UI/webview from Code Insights or compute applications.
(2) We need to allow clients to optionally request the content associated with a logical result.
We benefit from being able to request content for a result in the following cases:
- Compute / Code Insights can work with entirety of content to efficiently calculate capture groups (alternatively, it would need to request all file content from gitserver)
-
src-cli
can effectively render logical matches based on matched content and avoid fetching file content from gitserver.
In these cases, the requested content will replace and supersede the role of preview
in our current result type.
Note that we don't want to include the preview
or result content
by default in the main Sourcegraph app. This because the typical code path has the client getting search results from the backend, then request the highlighted file (i.e., "content") from the backend, and then only uses the LineMatch
ranges to compute the highlighting/match overlay. We want to continue with this architecture, and only respond to our main app's client with match ranges and migrate away from LineMatches
. We do not need to send it the actual content. There is a separate effort around streaming highlighted content, but that's an orthogonal issue to this issue of line-based results versus logical results (our streaming highlighting effort also relies on the line-based result type currently, which is not good).
So the point is, we likely need a request parameter to our search backends (searcher, Zoekt) to optionally return result content.
An important for Code Insights is to efficiently compute capture groups for (potentially multiline) search results. After discussion, I believe the easiest and cleanest path to get there is to implement the above by first tackling part (2) which means:
-
adding a parameter to our search backend API (Zoekt and searcher) to request it hydrate a new
content
field inFileMatch
result type. Over time, we deprecate and removepreview
andcontent
supersedespreview
. This is the direction we've wanted to go in for a while, so it aligns with "better state of search code". We cannot deprecatepreview
right away for<reasons>
. Thoughts here @sourcegraph/search-core, specifically for Zoekt? -
capturing the content for a logical match in our backends. Here I don't remember the details for searcher and Zoekt and how much effort it is to do this. Please chime in @sourcegraph/search-core if you have an intuition.
We can complete this part (2) without doing part (1), or migrating away from LineMatch
just yet. But while we're doing (2), it's useful to think about how we could add the bookkeeping to our backends to capture the full range of matches. Over time, but as a less urgent case, we want to represent matches as a list of these absolute ranges to overcome all the snags we have with the current line matches, and to unlock things like https://github.com/sourcegraph/sourcegraph/issues/30711 later.
Alternatives / related thoughts: It's possible we could add functionality to our search backends to natively recognize capture groups, but I honestly don't know how big this ask is, or whether it makes sense architecturally yet. Doing something like that will need a larger change to the result type coming from our backends, and I don't think that footprint is worth it right now. As a matter of our current needs, empirically Code Insights / Compute can operate efficiently if it can get the content to process capture groups and I don't think there's a strong need to add native capture group support to backends.
Next steps
Mainly open questions for @sourcegraph/search-core (@keegancsmith @ggilmore): does this direction make sense to you? Ideas around effort and availability on the core team to help with this (esp. for modification to Zoekt?)
@camdencheek and @coury-clark feel free to discuss/comment if the direction makes sense to you. I know @camdencheek could probably estimate some effort for doing the plumbing in frontend/searcher. From there we can do a walkthrough of the code with Coury and work together on changes.
More context, why a line-based result type is not great for our tooling.
Even though our search backend (searcher and Zoekt) can find matches for multiline regex patterns like foo\nbar
, multiline content that match the pattern are broken up to conform to the line-based assumption the above type. When multiline content matches a pattern, and we jam it into this line-based result type, we throw away the information that tells us which lines correspond logically to the pattern. In a read-only client, where we simply render highlighted ranges (potentially across multilines), we get away with the line-based assumption, since the user can likely visually pick out how the pattern corresponds to highlighted lines. But clients don't actually know which line ranges a matching pattern corresponds to, we've thrown that information away by the time it gets to the client. This one assumption has been a cause of many troubles:
-
inaccurate result counts. Here we would count the number of lines as a single result, since our type only tracks lines, and we have no way to tell how many of them correspond to the "logical" match (i.e., the one that corresponds to the pattern. https://github.com/sourcegraph/sourcegraph/issues/9426
-
inconsistent expectations from the search API. Using the line-based assumption, clients may assume that complete information about the content should be exposed in the
Preview
orline
field, and use that content data for rendering. These assumptions break down when using structural search which inherently does not consider a single line as any kind of match or result--a match is always qualified absolutely as a range over lines and columns (or, alternatively, absolute offsets in a file). https://github.com/sourcegraph/sourcegraph/issues/11134 -
The line-based content of
Preview
/line
stops us from building data processing on top of results. When a process wants to extract capture groups (or any similar process that needs to match over the entirety of some logical result), we can't rely on any search result likePreview
orLineMatch
offsets since it can't tell us what the extent of that entire result is. We'd have to process an entire file to get the result. This last case is the straw that breaks the camel's back that now pushes us to finally address the shortcomings of the line-based assumption.