proposal: indexed search uses shallow git clones
Created by: keegancsmith
Parent issue: https://github.com/sourcegraph/sourcegraph/issues/6728
Zoekt currently indexes from archives of git repositories. Instead of relying on archives, we can index from shallow clones. The motivation is to avoid the complications around git archives and multiple branches that I ran into while working on https://github.com/sourcegraph/sourcegraph/issues/7930.
Two approaches:
- Clone just for indexing. We can optionally add caching to avoid refetching objects, which has nice properties for fast-moving repositories but costs simplicity.
- Maintain a global set of clones. We can then use the standard zoekt repository indexing tooling, which is quite nice. The cost is disk space; the benefit is simpler code and decoupling.
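To make the first approach concrete, here is a rough sketch of a throwaway clone that only lives for one index run. The function name `index_once` and the `INDEX_CMD` override are hypothetical (they exist only so the indexer can be swapped out when experimenting); the `zoekt-git-index` flags are the same ones used in my testing notes. It does no caching, so every run refetches objects.

```shell
# Sketch of approach 1: clone only for the duration of one index run.
# Usage: index_once REMOTE [BRANCH...]
# INDEX_CMD is a hypothetical hook to replace the indexer command.
index_once() {
    remote=$1; shift
    tmp=$(mktemp -d) || return 1

    # Bare, shallow clone containing only the remote's HEAD commit.
    git clone --quiet --bare --depth 1 "$remote" "$tmp/repo.git" \
        || { rm -rf "$tmp"; return 1; }

    branches=HEAD
    for branch in "$@"; do
        # Fetch just the tip commit of each extra branch.
        git -C "$tmp/repo.git" fetch --quiet --depth 1 origin \
            "+refs/heads/$branch:refs/heads/$branch" \
            || { rm -rf "$tmp"; return 1; }
        branches="$branches,$branch"
    done

    ${INDEX_CMD:-zoekt-git-index} -submodules=false -incremental \
        -branches "$branches" "$tmp/repo.git"
    rc=$?

    # Nothing is kept between runs; this is where caching would go.
    rm -rf "$tmp"
    return $rc
}
```

For example, `index_once "$REMOTE" develop test` would index HEAD plus two branches and leave nothing behind on disk.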
I'm looking for input on the viability of this approach, especially the cost of requiring larger disks in the second approach.
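One way to get a feel for the disk cost is to compare a shallow bare clone against a full bare clone of the same repository. This is only a sketch: `compare_clone_sizes` is a made-up helper name, it measures with `du -sk`, and it only clones HEAD for the shallow case, so extra branches would add to the shallow number.

```shell
# Prints "SHALLOW_KB FULL_KB" for bare clones of REMOTE.
# Usage: compare_clone_sizes REMOTE
compare_clone_sizes() {
    remote=$1
    tmp=$(mktemp -d) || return 1

    git clone --quiet --bare --depth 1 "$remote" "$tmp/shallow.git" &&
    git clone --quiet --bare "$remote" "$tmp/full.git" ||
        { rm -rf "$tmp"; return 1; }

    # du -sk reports the total size in kilobytes.
    shallow_kb=$(du -sk "$tmp/shallow.git" | awk '{print $1}')
    full_kb=$(du -sk "$tmp/full.git" | awk '{print $1}')

    rm -rf "$tmp"
    echo "$shallow_kb $full_kb"
}
```

Note that `--depth` is ignored for plain local paths, so use a `file://` URL when pointing this at a local repository.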
We will also need to add endpoints that allow us to clone. We used to have these, and it isn't too much work to add them back.
Notes on how I tested this approach:
```shell
export REMOTE=file:///Users/keegan/go/src/github.com/sourcegraph/sourcegraph
mkdir target.git

# just clones HEAD
git clone --bare --depth 1 "$REMOTE" target.git
cd target.git

# for each branch we want we add it to the fetch specs
for branch in develop test; do
    git config --add remote.origin.fetch "+refs/heads/$branch:refs/heads/$branch"
done

# This will now fetch each of the above branches only
git fetch --depth 1

# you can just use the upstream zoekt-git-index tool
zoekt-git-index -submodules=false -incremental -branches HEAD,develop,test .
```
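If you want to sanity-check that the clone really is shallow after the steps above, something like the following should do. `verify_shallow` is a hypothetical helper, and it assumes git 2.15 or newer for `rev-parse --is-shallow-repository`.

```shell
# Succeeds only if DIR is a shallow repository and every listed
# branch has exactly one reachable commit.
# Usage: verify_shallow DIR BRANCH...
verify_shallow() {
    dir=$1; shift

    # Prints "true" for shallow repositories (git >= 2.15).
    [ "$(git -C "$dir" rev-parse --is-shallow-repository)" = true ] || return 1

    for branch in "$@"; do
        count=$(git -C "$dir" rev-list --count "refs/heads/$branch") || return 1
        [ "$count" -eq 1 ] || return 1
    done
}
```

For the clone above, `verify_shallow . develop test` should succeed.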
The main divergence in the resulting index is that we no longer refer to the branch as HEAD; instead, zoekt-git-index resolves it to the actual branch name. This will break assumptions made in our frontend. It is probably easier to patch zoekt-git-index to keep the HEAD name.
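To see which name HEAD resolves to in a given clone, you can ask git directly. This only shows the git side of things; I'm assuming, not asserting, that it matches the name zoekt-git-index ends up recording. `head_branch` is a hypothetical helper name.

```shell
# Prints the short name of the branch that HEAD points to in DIR.
# Fails if HEAD is detached (not a symbolic ref).
# Usage: head_branch DIR
head_branch() {
    git -C "$1" symbolic-ref --short HEAD
}
```

Running `head_branch .` inside the clone tells you which branch name would replace HEAD in the index.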
cc @sourcegraph/core-services