search: move commit and diff search to gitserver, but still use git CLI
Created by: camdencheek
This is an MVP reimplementation of commit and diff search on gitserver. It still uses the git CLI, not libgit2 (see #24595 (closed) for the reasons).
There are still a few followup tasks that need to be done before this is ready to be switched on in prod, but I want to get this merged before it gets even larger than it is. The followup tasks all have issues created for them and are attached to the epic associated with this PR.
An overview of the changes:
- The new codepath is only enabled with the feature flag "cc_commit_search"
- A new `/search` endpoint is added to gitserver
  - The endpoint uses gob encoding to keep encoding the query tree simple
  - The endpoint streams results back to the client in the same style as searcher now does
- `search.Search()` iterates over all commits, batching them into chunks and sending the batches to a worker pool
  - The chunks are sent in a "job", which contains the batched commits and a result channel. The workers process the jobs in parallel, but the results are read in the same order that the jobs are submitted
  - For each commit in a job, a worker searches that commit with the given `MatchTree`
    - A `MatchTree` is a tree of commit predicates, such as `(MessageMatches(camden) AND DiffModifiesFile(\.go))`
    - A `MatchTree` returns both whether a commit matches and what parts of the commit match (highlights)
    - The `LazyCommit` passed to the `MatchTree.Match()` method is a shallowly parsed commit that provides helper methods to do more expensive parsing if needed by the `MatchTree`
- The most expensive thing to generate for each commit is the diff.
  - In order to generate diffs on demand (so we only generate them if earlier nodes in the match tree pass), we embed a `DiffFetcher` in the `LazyCommit`, which is a handle to a subprocess running `git diff-tree --stdin`. Whenever we need a diff for a commit, we write the commit hash to stdin of the subprocess, then read the output. This is pretty cheap, and keeps us from having to do complex diff batching when iterating over commits.
  - We have one subprocess per worker since the wrapper isn't thread-safe, and it allows us to parallelize the work of generating diffs since `git` only uses a single core.
- The `onMatches` callback passed to `search.Search()` is called for every commit that matches
  - In this case, the `onMatches` callback transforms the result into our `protocol.CommitMatch` and passes it to the stream goroutine, which will write it to the stream, flushing occasionally
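
For reviewers less familiar with gob, here is a minimal standalone sketch of round-tripping a query tree that contains interface values. It is not the code in this PR: `SearchRequest`, `Node`, `Leaf`, and `AndNode` are hypothetical names chosen for illustration. The one real requirement it shows is that concrete types stored behind an interface must be registered with `gob.Register` on both the encoding and decoding side.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"strings"
)

// Node is a hypothetical query-tree node; the real tree types may differ.
type Node interface{ String() string }

// Leaf is a single predicate, e.g. MessageMatches(camden).
type Leaf struct{ Field, Expr string }

// AndNode requires all of its children to match.
type AndNode struct{ Children []Node }

// SearchRequest is a hypothetical request body for the /search endpoint.
type SearchRequest struct {
	Repo  string
	Query Node
}

func (l Leaf) String() string { return l.Field + "(" + l.Expr + ")" }

func (a AndNode) String() string {
	parts := make([]string, len(a.Children))
	for i, c := range a.Children {
		parts[i] = c.String()
	}
	return "(" + strings.Join(parts, " AND ") + ")"
}

func init() {
	// gob needs to know every concrete type that can appear behind Node.
	gob.Register(Leaf{})
	gob.Register(AndNode{})
}

func encodeRequest(req SearchRequest) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(req); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func decodeRequest(data []byte) (SearchRequest, error) {
	var req SearchRequest
	err := gob.NewDecoder(bytes.NewReader(data)).Decode(&req)
	return req, err
}

func main() {
	req := SearchRequest{
		Repo: "example/repo",
		Query: AndNode{Children: []Node{
			Leaf{Field: "MessageMatches", Expr: "camden"},
			Leaf{Field: "DiffModifiesFile", Expr: `\.go`},
		}},
	}
	data, err := encodeRequest(req)
	if err != nil {
		panic(err)
	}
	decoded, err := decodeRequest(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded.Repo, decoded.Query)
}
```

The appeal of gob here is that the tree can be sent as-is, without a hand-written wire format for each node type.
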
Sorry for the massive PR. Happy to do a sync review if that's preferable