Write external import script for CVS repositories
Created by: LawnGnome
Based on the excellent work of @christinelovett and @abeatrix, we need to write a script that allows CVS repositories to be imported into Sourcegraph.
Requirements
- At least partial history
- Able to scale to repositories of a hundred gigabytes or more with reasonable resource usage (run time in the hours or days, not weeks; no exotic requirements around memory or storage)
- Able to be re-run to sync with new commits incrementally, rather than doing a full reimport
- Able to filter included content based on file name
- Able to filter branches
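
As a rough illustration of the two filtering requirements, the sketch below maps a path inside the CVSROOT to the path a checkout would see (dropping the `,v` suffix and an `Attic` parent directory, which is how CVS stores removed files) and then applies glob patterns. The function names and the include/exclude pattern scheme are hypothetical, not part of any existing tool; the same `included` helper could be applied to branch names.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def working_path(rcs_path: str) -> str:
    """Map a path inside the CVSROOT to the path a checkout would see."""
    p = PurePosixPath(rcs_path)
    # RCS files carry a ",v" suffix.
    name = p.name[:-2] if p.name.endswith(",v") else p.name
    parent = p.parent
    # Removed files live in an "Attic" subdirectory of their real directory.
    if parent.name == "Attic":
        parent = parent.parent
    return str(parent / name)


def included(name: str, include=("*",), exclude=()) -> bool:
    """Hypothetical filter: keep names matching any include glob and no exclude glob.

    Note that fnmatch's "*" also matches "/" characters.
    """
    return any(fnmatchcase(name, pat) for pat in include) and not any(
        fnmatchcase(name, pat) for pat in exclude
    )
```

For example, `included(working_path("src/Attic/old.c,v"), include=("*.c",), exclude=("vendor/*",))` checks the path `src/old.c` against both pattern lists.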
Prior art
There are three tools/workflows I know of that allow for CVS-to-Git conversion:
- git-cvsimport: this tool is bundled with Git itself, but depends on cvsps version 2 being available to do patchset detection. This tool supports incremental update, but is slow (as it shells out to run git for each commit), and suffers from cvsps's limitations around what it considers to be a valid repository. The performance concerns make it a non-starter.
- cvs-fast-export: this tool is based on earlier tools to parse and export CVS repositories, and generates a stream of data that can be imported by git-fast-import. In practice, while faster than git-cvsimport, this tool also suffers from significant scaling issues due to the way it detects patchsets and branches, has a tendency to error on CVS repositories with messier branch/tag histories that can't be represented sensibly in Git, and doesn't support incremental updates.
- Combining cvs2svn with one of the many tools to convert a Subversion repository to Git (most likely svn2git). cvs2svn doesn't have a concept of incremental updates, although practically speaking, it might be OK to just do the conversion each time. However, this relies on us being able to retain enough history and structure to be useful across two leaky abstractions, which feels dangerous.
Preferred method
I believe there's a path to implementing a tool with the desired properties. The key is that we can use git-fast-import to perform a full import with only one pass over the CVSROOT:
- Add all file revisions as blobs, not bothering to note if we actually need them or not.
- Simultaneously build an ordered map of file commits, (author, commit) => (file, time, revision), with the order taken from the RCS revision IDs.
- Split map values into buckets based on the closeness of their commit times, as cvsps does.
- Retrieve patchsets in order with author, commit, files; tracking the previous patchset as the parent commit, we can use filedeleteall when constructing git-fast-import commits to allow Git to figure out history.
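
The bucketing step could look roughly like the sketch below. It is a simplification under stated assumptions, not the planned implementation: the type names, the `detect_patchsets` function, and the 300-second window are all hypothetical; it orders file commits by time rather than by RCS revision ID; and it also starts a new bucket when the same file shows up twice, since one commit cannot contain two revisions of a file.

```python
from dataclasses import dataclass
from typing import NamedTuple


class FileCommit(NamedTuple):
    path: str
    time: int      # commit time, seconds since the epoch
    revision: str  # RCS revision ID, e.g. "1.4"


@dataclass
class Patchset:
    author: str
    message: str
    files: list


def detect_patchsets(commits, fuzz=300):
    """Split each (author, message) group into patchsets by time gap.

    `commits` maps (author, message) -> list of FileCommit.
    `fuzz` is an assumed window in seconds, chosen for illustration.
    """
    patchsets = []
    for (author, message), file_commits in commits.items():
        bucket, seen = [], set()
        for fc in sorted(file_commits, key=lambda fc: fc.time):
            too_late = bool(bucket) and fc.time - bucket[-1].time > fuzz
            if too_late or fc.path in seen:
                # Close the current bucket and start a new patchset.
                patchsets.append(Patchset(author, message, bucket))
                bucket, seen = [], set()
            bucket.append(fc)
            seen.add(fc.path)
        if bucket:
            patchsets.append(Patchset(author, message, bucket))
    # Replay patchsets in order of their earliest file commit.
    patchsets.sort(key=lambda ps: ps.files[0].time)
    return patchsets
```

Because the map is keyed on author and commit message, each group can be bucketed independently, and only the final sort needs to look at all patchsets together.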
If we store and retrieve marks and inferred patchsets, we can also do this incrementally.
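
To make the role of marks concrete, here is a minimal sketch of writing a git-fast-import stream. The helper names and signatures are hypothetical. Each blob and commit gets a numeric mark; a commit names its parent with `from :<mark>` and, after `deleteall` (the filedeleteall command), lists its whole tree by blob mark. Running `git fast-import` with its `--export-marks=<file>` option, and `--import-marks=<file>` on a later run, is one way to carry those marks between runs; storing the inferred patchsets alongside would be up to the script.

```python
import io


def write_blob(out, mark: int, content: bytes) -> None:
    """Emit one file revision as a blob that later commits refer to by mark."""
    out.write(b"blob\nmark :%d\ndata %d\n" % (mark, len(content)))
    out.write(content)
    out.write(b"\n")


def write_commit(out, mark, ref, author, email, when, message, tree, parent=None):
    """Emit one patchset as a commit.

    `tree` maps path -> blob mark for every file present after this patchset.
    Paths needing fast-import quoting are not handled in this sketch.
    """
    msg = message.encode("utf-8")
    out.write(b"commit %s\n" % ref.encode("utf-8"))
    out.write(b"mark :%d\n" % mark)
    out.write(
        b"committer %s <%s> %d +0000\n"
        % (author.encode("utf-8"), email.encode("utf-8"), when)
    )
    out.write(b"data %d\n" % len(msg))
    out.write(msg)
    out.write(b"\n")
    if parent is not None:
        # The previous patchset on this branch is the parent commit.
        out.write(b"from :%d\n" % parent)
    # Start from an empty tree and list every file again.
    out.write(b"deleteall\n")
    for path, blob_mark in sorted(tree.items()):
        out.write(b"M 100644 :%d %s\n" % (blob_mark, path.encode("utf-8")))
    out.write(b"\n")
```

Writing to an `io.BytesIO` (or directly to the standard input of a `git fast-import` process) keeps the whole import to a single pass, since the full tree is restated in each commit and nothing has to be read back from Git.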
Milestones
To be turned into issues:
- HEAD only import
  - RCS parsing: done
  - git-fast-import support: done
  - Patchset detection: 1d
- More forward ed testing: 1d
- Branch support: 2d
- Persist state for incremental updates: 2-3d