Skip to content

campaigns: add and use volume mounts by default on Intel macOS

Erik Seliger requested to merge aharvey/in-docker-workspace into main

Created by: LawnGnome

Background

As we now know all too well, the default bind mount behaviour we use to manage workspaces when executing campaigns is slow on macOS: any file I/O in the workspace has to go through the loopback interface, and this causes problems with I/O heavy operations such as pretty much anything to do with npm or yarn.

The steps we perform within a workspace are extremely simple, however: we unzip the repository archive and run a variety of git commands to generate the diff. It's handy for this to be done on the host filesystem, since it allows the user to inspect what happens to their repository if a campaign step fails, but it's not a technical requirement.

What's in the box PR

This PR generalises our existing concept of a workspace, and expands it significantly. Workspace modes are now defined by implementing a WorkspaceCreator interface, which knows how to create Workspace implementations which implement the common operations we need to set up, diff, and tear down workspaces. This is used to add a new workspace mode that uses a Docker volume to contain the workspace, rather than bind mounting the workspace from the host. This is controlled by the new -workspace flag, and the default for Intel macOS has been switched to the new volume mode.

For this to work, we have to perform the workspace management commands — the aforementioned unzipping and git malarkey — within a container, since that's the only way to access the volume. As such, part of this PR adds a Docker image that extends alpine to always have git available and configured appropriately, along with unzip and curl. A GitHub Action has been added that will rebuild and push this image on src release.

There's also a lot of new testing stuff: the volume workspace code is copiously unit tested. In order to support this, there's a (very) minimal implementation of a mock API client, and an expanded ripoff implementation of the method Go's os/exec test suite uses to mock external commands, which means the new tests don't need dummydocker and work on Windows.

Adam's guess at obvious questions

Why is CI failing?

The os/exec method for mocking external commands basically redirects exec.Cmd calls back into the test binary, which gains a little bit of logic in its TestMain to handle those correctly.

This works fine cross-platform, but something about AppVeyor specifically makes this not work on Windows. The errors are fairly inscrutable. However, the approach doesn't have any systemic issue on Windows: it works fine both for me locally, and in GitHub Actions, which now has Windows runner support.

I've opened #415, which is included in this PR as well, to remove our AppVeyor support and only use GitHub Actions for CI. If we decide to go ahead with that, then we can remove the AppVeyor integration and this PR will start passing.

How much faster is this on macOS?

It depends.

For campaigns that generate small-to-medium sized diffs and do little I/O, there isn't much difference: maybe a few percent here and there. For campaigns that generate large diffs and do little I/O, this can actually be slower. (That feels like a really weird case, though.)

For campaigns that perform lots of I/O, this is a big win. A test campaign that upgrades TypeScript in Sourcegraph completes in approximately a quarter of the time in volume mode compared to bind mode. (~11 minutes compared to ~42.)

Why change the default on macOS only?

Volume mode isn't universally better. On Linux (assuming nothing weird like a remote Docker server), there's no reason not to use bind mount mode: the performance is the same, and you have the advantage that you can inspect what happened in your workspace if a step fails.

Why change the default on Intel macOS only?

Conservatism. I've put in a bunch of effort to make multi-architecture builds of the Docker image needed by volume mode work, but I don't have an M1 Mac sitting around to test this. It should work, but I'd prefer to find that out by testing it using -workspace volume explicitly, rather than finding out that it doesn't work after making it the default.

What about Windows?

My suspicion is that this will provide similar performance improvements on Windows, but I don't have good numbers on this right now, don't have an environment ready to go to prove that out, and I'd prefer to focus on macOS for now. It's easy enough to change the default later.

That said, I think there's another reason we should consider this for Windows sooner rather than later: it effectively removes the requirement to have git in your PATH, and removes any potential for Windows-specific git weirdness.

What future work could we do to improve this further?

Glad you asked! I had more ideas, but this PR was more than big enough already.

  1. Bind mode tests: now that we have comprehensive tests for volume workspaces, we could retrofit the same techniques onto the bind workspace tests.
  2. Volume mode debugging: right now, if a step fails, the volume is removed, which means the user would have to do some hackery using something like sleep to exec into the container to inspect the workspace if that was necessary. We should probably provide some sort of interactive debugging mode where you get dropped into a container with the workspace already set up and running on error. (I mean, this is probably a good idea for bind mode, too.)
  3. Volume container communication: I have a suspicion that the large diff slowness noted above might be more to do with extracting the diff from the Docker container via stdout than anything endemic to the approach. Another way of doing this would be to have the utility container running the whole time while executing a campaign on a repository, and provide some sort of RPC interface to perform setup and teardown and (most importantly) get diffs without having to go through Docker.
  4. Exec output: to allow mocking external commands, the volume workspace uses the new internal/exec package instead of os/exec directly. If we migrate other modules to use this, then we could use that down the track to provide verbose logging of every command that's run, since command execution will go through a central place.

PR links

Skeletal end user documentation is provided by sourcegraph/sourcegraph#16979.

Fixes sourcegraph/sourcegraph#16809.

Merge request reports

Loading