search: move commit and diff search to gitserver, but still use git CLI (!25006)

Administrator requested to merge cc/back-to-git into main Sep 15, 2021

Created by: camdencheek

This is an MVP reimplementation of commit and diff search on gitserver. It still uses the git CLI, not libgit2 (see #24595 (closed) for the reasons).

There are still a few follow-up tasks to do before this is ready to be switched on in prod, but I want to get this merged before it grows any larger. The follow-up tasks all have issues created for them, attached to the epic associated with this PR.

An overview of the changes:

The new codepath is only enabled with the feature flag "cc_commit_search"
A new /search endpoint is added to gitserver
- The endpoint gob-encodes the query tree, for simplicity
- The endpoint streams results back to the client in the same style that searcher now uses
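For reviewers unfamiliar with gob, here is a minimal sketch of round-tripping a query tree through `encoding/gob`. The `Node`, `SearchRequest`, and `roundTrip` names are hypothetical and only for illustration; the actual request type in this PR may look different. The part worth noting is that gob needs every concrete type carried inside an interface value to be registered.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"strings"
)

// Node is a hypothetical query-tree node; the real type in the PR may differ.
type Node interface{ String() string }

type MessageMatches struct{ Expr string }
type DiffModifiesFile struct{ Expr string }
type And struct{ Children []Node }

func (m MessageMatches) String() string   { return "MessageMatches(" + m.Expr + ")" }
func (d DiffModifiesFile) String() string { return "DiffModifiesFile(" + d.Expr + ")" }
func (a And) String() string {
	parts := make([]string, len(a.Children))
	for i, c := range a.Children {
		parts[i] = c.String()
	}
	return "(" + strings.Join(parts, " AND ") + ")"
}

// SearchRequest is a hypothetical request body for the /search endpoint.
type SearchRequest struct {
	Repo  string
	Query Node
}

func init() {
	// gob requires each concrete type stored in an interface to be registered.
	gob.Register(MessageMatches{})
	gob.Register(DiffModifiesFile{})
	gob.Register(And{})
}

// roundTrip encodes req as a client would and decodes it as a server would.
func roundTrip(req SearchRequest) (SearchRequest, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(req); err != nil {
		return SearchRequest{}, err
	}
	var out SearchRequest
	err := gob.NewDecoder(&buf).Decode(&out)
	return out, err
}

func main() {
	req := SearchRequest{
		Repo: "github.com/example/repo", // placeholder repo name
		Query: And{Children: []Node{
			MessageMatches{Expr: "camden"},
			DiffModifiesFile{Expr: `\.go`},
		}},
	}
	got, err := roundTrip(req)
	if err != nil {
		panic(err)
	}
	fmt.Println(got.Query)
}
```

The appeal over JSON here is that the interface-typed tree survives the round trip without hand-written marshalling for each node type.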
search.Search() iterates over all commits, batching them into chunks, and sending the batches to a worker pool
- The chunks are sent in a "job", which contains the batched commits and a result channel. The workers process the jobs in parallel, but the results are read in the same order that the jobs are submitted
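The job/result-channel pattern described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: commits are plain strings, and `searchOrdered`, `job`, and the `match` callback are hypothetical names. It assumes `batchSize` and `workers` are both positive.

```go
package main

import "fmt"

// job carries one batch of commits plus the channel its result comes back on.
type job struct {
	batch  []string
	result chan []string // capacity 1, so a worker never blocks on send
}

// searchOrdered runs match over commits in parallel batches and calls
// onMatches in the same order the batches were submitted.
func searchOrdered(commits []string, batchSize, workers int,
	match func(string) bool, onMatches func([]string)) {

	jobs := make(chan job, workers)    // consumed by workers in any order
	ordered := make(chan job, workers) // the same jobs, in submission order

	for i := 0; i < workers; i++ {
		go func() {
			for j := range jobs {
				var matched []string
				for _, c := range j.batch {
					if match(c) {
						matched = append(matched, c)
					}
				}
				j.result <- matched
			}
		}()
	}

	// Producer: chunk the commits and submit each job to both channels.
	go func() {
		defer close(jobs)
		defer close(ordered)
		for start := 0; start < len(commits); start += batchSize {
			end := start + batchSize
			if end > len(commits) {
				end = len(commits)
			}
			j := job{batch: commits[start:end], result: make(chan []string, 1)}
			jobs <- j
			ordered <- j
		}
	}()

	// Consumer: wait for each job's result in submission order.
	for j := range ordered {
		if m := <-j.result; len(m) > 0 {
			onMatches(m)
		}
	}
}

func main() {
	commits := []string{"a1", "b2", "a3", "b4", "a5", "a6", "b7"}
	isA := func(c string) bool { return c[0] == 'a' }
	searchOrdered(commits, 2, 3, isA, func(m []string) { fmt.Println(m) })
}
```

Because each job owns its result channel, workers can finish out of order while the reader still sees batches in commit order, and the bounded channels cap how far the producer can run ahead.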
For each commit in a job, a worker searches that commit with the given MatchTree
- A MatchTree is a tree of commit predicates, such as (MessageMatches(camden) AND DiffModifiesFile(\.go))
- A MatchTree returns both whether a commit matches, and what parts of the commit match (highlights)
- The LazyCommit passed to the MatchTree.Match() method is a shallowly-parsed commit that provides helper methods to do more expensive parsing if needed by the MatchTree
The most expensive thing to generate for each commit is the diff.
- In order to generate diffs on demand (so we only generate them if earlier nodes in the match tree pass), we embed a DiffFetcher in the LazyCommit, which is a handle to a subprocess running git diff-tree --stdin. Whenever we need a diff for a commit, we write the commit hash to stdin of the subprocess, then read the output. This is pretty cheap, and keeps us from having to do complex diff batching when iterating over commits.
- We run one subprocess per worker because the wrapper isn't thread-safe. This also lets us parallelize diff generation, since each git process only uses a single core.
The onMatches callback passed to search.Search() is called for every commit that matches
- In this case, the onMatches callback transforms the result into our protocol.CommitMatch and passes it to the stream goroutine, which will write it to the stream, flushing occasionally

Sorry for the massive PR. Happy to do a sync review if that's preferable
