
executors: Architecture ideas

Created by: eseliger

Our initial implementation of executors is mostly based on what we already had: a repo with terraform files to manage our resources, augmented by configuration changes that need to be applied to the worker and Prometheus deployments. As of today, they are probably among the hardest-to-set-up components in the Sourcegraph architecture. When we started the project, that all made total sense: the first goal was to get something going just on k8s to prove that it could work. Later, we needed to quickly make it available to customers as well, and a quick solution was to publish our internal terraform definitions as a terraform module and add some docs on how we set executors up for ourselves.

I don't think that is what the ideal solution should look like going forward. Ideally, executors are part of pretty much every Sourcegraph deployment, so that all features built on them are generally and widely available. Therefore, we need to make some changes to how they get deployed.

This is a collection of ideas I have for this problem space:

They are based on the assumption that

  • a good chunk of enterprise deployments is not actually interested in KVM isolation (we at Sourcegraph aren't for our internal tooling either)
  • Cloud will always need that sort of isolation

Hence, we need to support both cases well, while keeping the maintenance burden as low as possible.

Instead of a strictly pull-based workflow (which made sense in the past, given it's just a simple adapter on top of the dbworker), we could introduce an internal service that monitors the queues and distributes work as needed. Think of it as an augmented dbworker in the worker instance. This would be part of the standard deployment of a Sourcegraph instance. This service would be responsible for the following (a rough sketch of its core loop follows after the list):

  • Monitor queues for work
  • Make autoscaling decisions
  • Potentially: Scrape and forward Prometheus metrics to our internal instance
  • Provision resources to process workloads
  • Issue one-time access tokens
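
As a rough sketch, the core loop of that service could look something like the following. All of the names here (Provider, queueAPI, controller, issueToken) are made up for illustration; they are simplified stand-ins for the dbworker store and the provider abstraction sketched further below, not existing code:

import (
  "context"
  "time"
)

// Provider abstracts a compute backend (GCP, AWS, k8s, ...). A fuller version of
// this interface (ExecutorProvider) is sketched further down in this document.
type Provider interface {
  HandleQueueSize(ctx context.Context, queued int64) error
  RunJob(ctx context.Context, jobID int64, accessToken string) error
}

// queueAPI stands in for the dbworker store; these signatures are invented.
type queueAPI interface {
  QueuedCount(ctx context.Context, queue string) (int64, error)
  Dequeue(ctx context.Context, queue string) (jobID int64, ok bool, err error)
}

type controller struct {
  queues     []string
  store      queueAPI
  provider   Provider
  issueToken func(ctx context.Context, jobID int64) (string, error)
}

// Run polls the queues, feeds their depth into the provider so it can make
// autoscaling decisions, and provisions a resource plus a one-time access token
// for every job it dequeues.
func (c *controller) Run(ctx context.Context) error {
  ticker := time.NewTicker(5 * time.Second)
  defer ticker.Stop()
  for {
    select {
    case <-ctx.Done():
      return ctx.Err()
    case <-ticker.C:
    }
    for _, q := range c.queues {
      queued, err := c.store.QueuedCount(ctx, q)
      if err != nil {
        return err
      }
      if err := c.provider.HandleQueueSize(ctx, queued); err != nil {
        return err
      }
      jobID, ok, err := c.store.Dequeue(ctx, q)
      if err != nil {
        return err
      }
      if !ok {
        continue
      }
      // Short-lived, single-job token instead of a shared secret in the site config.
      token, err := c.issueToken(ctx, jobID)
      if err != nil {
        return err
      }
      if err := c.provider.RunJob(ctx, jobID, token); err != nil {
        return err
      }
    }
  }
}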

Things this architecture should solve:

  • Configuring the access token in the site config is tedious, and the token tends never to be rotated, which is risky given it's essentially an internal actor.
  • Autoscaling is managed by the cloud providers, which is not super flexible for what we need:
    • We need to publish a metric to the different cloud providers, which requires manual setup and is hard for us to debug when it doesn't work, because we would need insight into the customer's cloud provider account
    • We are tied to what the cloud providers offer; for AWS that meant workarounds to transform our target number into a proper scaling metric, and we cannot control how scale-ins happen. Currently, when we scale in, all running workloads are canceled because the cloud provider gives us very short notice.
  • Telling a customer to modify the Prometheus config is very error-prone, and more likely than not they won't do it, given it's optional but very helpful for us when debugging problems.
  • Customers need to run and manage a lot of additional terraform resources
  • Customers can still have user-provided executors that can attach to sub-queues (i.e. an org-wide executor on the cloud platform)
    • So we will keep the current general executors architecture and only augment it
    • We still need to solve the access tokens problem and tackle user-provided executors
    • Those executors might be docker-isolation only, so we can give customers a single binary (TBD)
    • Those executors don't have autoscaling and might not be part of the internal Prometheus dashboards
    • Sub-queues might want to be able to opt out of global executors (trust)

How this could be implemented:

  • We write yet another adapter for the dbworker that can monitor queues and "forward" jobs. This could be achieved by
    • Having a regular dbworker in the "executors backend service" (name TBD) that, on dequeue, spawns a job executor, logs a first step about the provisioning to the execution logs, and keeps the heartbeat going until the job executor takes over and starts reporting for that job over an HTTP API, as the current executors do
    • Adding a PreDequeue hook to make decisions on when to dequeue a job (see the wiring sketch after the interface below)
    • Adding a sidecar that tracks the scaling metric (optional, based on config and adapter type (aws, gcp, ...))
  • We add a site config entry for this service that tracks:
    • Which adapters (aws,gcp,..) to use, and credentials and settings for those
    • Which queues to attach to
    • Auto-scaling config, and more
    • => So that in the end, all a user needs to do to get global executors to work is "give us a credential for your provider with permissions X" (a rough sketch of such a config entry follows after this list)
  • We write adapters for different providers; strong contenders would be
    • GCP
    • AWS
    • ssh/shell (local bare-metal)
    • Kubernetes (very likely no KVM support)
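
To make that site config entry a bit more concrete, it could be backed by a shape roughly like this. All field names and JSON keys here are made up for illustration and don't exist in today's schema:

// ExecutorsServiceConfig is a hypothetical shape for the new site config entry.
type ExecutorsServiceConfig struct {
  // Adapter selects the provider implementation: "aws", "gcp", "kubernetes", "ssh", ...
  Adapter string `json:"adapter"`
  // Credential is the provider credential with "permissions X" mentioned above,
  // e.g. a service account key or an IAM role reference.
  Credential string `json:"credential"`
  // Queues lists which queues this service should attach to, e.g. ["batches", "codeintel"].
  Queues []string `json:"queues"`
  // Autoscaling bounds how many executor resources may be provisioned at once.
  Autoscaling struct {
    Min int `json:"min"`
    Max int `json:"max"`
  } `json:"autoscaling"`
  // Settings holds adapter-specific options such as machine type, region, or namespace.
  Settings map[string]interface{} `json:"settings,omitempty"`
}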

These adapters might be implementable behind a relatively simple, generic interface that the dbworker handler calls out to, depending on its config:

type ExecutorProvider interface {
  // HandleQueueSize is called periodically with the current size of the dbworkers queue. The
  // provider should make decisions for auto-scaling based on this number and its configuration.
  // For AWS, GCP this could mean increasing the desired-count property of an autoscaling group.
  // For kubernetes this might be as simple as returning nil, or spawning some "warm" pods in advance.
  HandleQueueSize(context.Context, int64) error
  // RunJob is called within the worker handler function of the dbworker. This is the biggest part of the whole provider integration.
  // For AWS, GCP this could mean spawning an EC2 instance with the job payload passed to a single-run executor binary mode
  // or something more sophisticated. On kubernetes, this could mean spawning a job with that payload as an argument.
  RunJob(context.Context, ExecutionJobPayload) error
  // This is called periodically and meant to clean up left-behind resources. On kubernetes that could mean pruning pods,
  // on AWS, GCP this could mean finding and terminating running instances.
  PruneUnknownResources(context.Context) error
  // CanDequeue is called in PreDequeue to determine if a provider has availability. This should take into
  // account if any machines are free, or about to be free, or can be provisioned right away.
  CanDequeue(context.Context) (bool, error)
}
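
For illustration, wiring this interface into the dbworker could look roughly like the following; the handler and hook signatures here are simplified stand-ins, not the actual dbworker package API:

type providerBackedHandler struct {
  provider ExecutorProvider
}

// PreDequeue only lets the dbworker dequeue when the provider reports capacity
// (free machines, machines about to be free, or capacity that can be provisioned right away).
func (h *providerBackedHandler) PreDequeue(ctx context.Context) (bool, error) {
  return h.provider.CanDequeue(ctx)
}

// Handle stands in for the dbworker handler function: it would log a provisioning step
// to the execution logs, ask the provider to run the job, and keep the heartbeat going
// until the job executor takes over reporting via the HTTP API.
func (h *providerBackedHandler) Handle(ctx context.Context, job ExecutionJobPayload) error {
  // e.g. append "Provisioning executor resource..." to the job's execution log here.
  return h.provider.RunJob(ctx, job)
}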

TODO: This approach suggests it's simplest to have a 1:1 relation between an execution job and a provider resource, i.e. 1 GCP compute instance for 1 job. That might be desirable for some use-cases, but we should keep in mind that it adds overhead for spawning and terminating VMs. We also need to think about how things like the very important docker registry mirror still fit into this model.

With the Kubernetes adapter, we could potentially even have zero-conf executors for those who don't care strongly about isolation. We can still add NetworkPolicies around those instances so we keep internal APIs internal, and have a service account as part of the k8s deployment that gives the worker pod the required permissions.
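
As a sketch of such a Kubernetes adapter, RunJob could create a batch Job via client-go roughly like this. The image name, the single-job flag, and the payload's ID field are assumptions made for illustration:

import (
  "context"
  "fmt"

  batchv1 "k8s.io/api/batch/v1"
  corev1 "k8s.io/api/core/v1"
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/client-go/kubernetes"
)

type kubernetesProvider struct {
  client    kubernetes.Interface
  namespace string
}

func (p *kubernetesProvider) RunJob(ctx context.Context, payload ExecutionJobPayload) error {
  job := &batchv1.Job{
    ObjectMeta: metav1.ObjectMeta{
      // Assumes the payload carries a numeric job ID.
      Name:      fmt.Sprintf("executor-job-%d", payload.ID),
      Namespace: p.namespace,
    },
    Spec: batchv1.JobSpec{
      Template: corev1.PodTemplateSpec{
        Spec: corev1.PodSpec{
          RestartPolicy: corev1.RestartPolicyNever,
          Containers: []corev1.Container{{
            Name:  "executor",
            Image: "sourcegraph/executor:insiders", // assumed image name
            // Hypothetical single-run mode; the executor does not ship such a flag today.
            Args: []string{"run-single-job", fmt.Sprintf("--job-id=%d", payload.ID)},
          }},
        },
      },
    },
  }
  _, err := p.client.BatchV1().Jobs(p.namespace).Create(ctx, job, metav1.CreateOptions{})
  return err
}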

I don't think we can ever support docker-compose setups for an OOTB executor, though, given docker-compose is a single-machine deployment and executors usually do some pretty heavy lifting.