Proposal: Sunset repo-updater

Created by: tsenart

Proposal

Today, repo-updater serves too many concerns as a singleton instance, with no failure or resource isolation between those concerns.

Any nil pointer panic, memory leak, or other noisy-neighbor issue can cause a cascading failure of unrelated features owned by different teams.

We got to this situation because there was an unmet need for a place to run background jobs, and repo-updater made that easy, so more and more use cases have been integrated into this singleton service over time. The more things we add to repo-updater, the higher the chance of cascading failure.

In contrast, the Code Intelligence team has created completely separate services for their background jobs. While we could split all of these concerns into separate services too, we must consider the overhead that would impose on us: more things to monitor, build, deploy and provision.

To avoid that overhead and address the issues at hand, I propose that we introduce a new service called worker, and extract the different concerns that currently live in repo-updater. Some of those would end up in worker, others elsewhere.

The worker program can be configured to run specific jobs, or all of them. This allows us to isolate workloads by configuring which jobs run in each Kubernetes deployment (or equivalent), while still building and provisioning a single binary and keeping things simple in the developer environment and the single Docker image.
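As a rough illustration of the job selection described above, here is a minimal Go sketch. The `Job` type, the `registry` map, the `selectJobs` helper and the repeatable `--job` flag are assumptions made for this example, not an existing API, and the job names are only illustrative.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"sort"
)

// Job is a hypothetical unit of background work that the worker binary can run.
type Job func() error

// registry maps job names to implementations. Names and bodies are placeholders.
var registry = map[string]Job{
	"campaigns.sync-registry":      func() error { return nil },
	"campaigns.reconciler":         func() error { return nil },
	"campaigns.spec-expire-worker": func() error { return nil },
}

// selectJobs returns the sorted names of the jobs to run. An empty request
// selects every registered job, which suits the single-binary dev setup.
func selectJobs(requested []string) ([]string, error) {
	var names []string
	if len(requested) == 0 {
		for name := range registry {
			names = append(names, name)
		}
	} else {
		for _, name := range requested {
			if _, ok := registry[name]; !ok {
				return nil, fmt.Errorf("unknown job %q", name)
			}
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names, nil
}

// jobList implements flag.Value so that --job can be passed more than once.
type jobList []string

func (j *jobList) String() string     { return fmt.Sprint([]string(*j)) }
func (j *jobList) Set(v string) error { *j = append(*j, v); return nil }

func main() {
	var requested jobList
	flag.Var(&requested, "job", "job to run (repeatable); all jobs when omitted")
	flag.Parse()

	names, err := selectJobs(requested)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// The sketch runs jobs one after another; a real worker would
	// presumably run long-lived jobs concurrently.
	for _, name := range names {
		fmt.Println("starting job:", name)
		if err := registry[name](); err != nil {
			fmt.Fprintln(os.Stderr, name, "failed:", err)
			os.Exit(1)
		}
	}
}
```

With this shape, a deployment that passes `--job=campaigns.reconciler` runs only that job, while a developer environment that passes no flag runs everything from the same binary.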

For instance, we could have a new Kubernetes deployment for campaigns, called campaigns-workers, which would specify three different containers, one for each job type they run:

containers:
  - image: index.docker.io/sourcegraph/worker:insiders
    name: sync-registry
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: "1"
        memory: 1Gi
    args:
      - --job=campaigns.sync-registry
  - image: index.docker.io/sourcegraph/worker:insiders
    name: reconciler
    resources:
      limits:
        cpu: "1"
        memory: 500M
      requests:
        cpu: 100m
        memory: 100M
    args:
      - --job=campaigns.reconciler
  - image: index.docker.io/sourcegraph/worker:insiders
    name: cleanup
    resources:
      limits:
        cpu: "0.5"
        memory: 300M
      requests:
        cpu: "0.5"
        memory: 300M
    args:
      - --job=campaigns.spec-expire-worker

Concerns

Below is an inventory of all current and some future repo-updater concerns, what they're about and how we could remove them from repo-updater.

HTTP API

/repo-update-scheduler-info - Serves git-update schedule information (that is in memory) for a given repo. This schedule information will be migrated to Postgres, so it'll be queryable directly from the frontend without calling out to repo-updater.
/repo-lookup - Looks up a repo in the database and serves it. On Cloud, if we don't find the repo in the database, we look it up on the code host (for github.com or gitlab.com), and sync that one repo back to the database. There's no reason this logic can't be executed in the frontend.
/repo-external-services - Looks up all the external services a given repo belongs to (in Postgres) and serves them. Can be done from the frontend.
/enqueue-repo-update - Enqueues a high priority git-update in the git-updates queue. Once we move this queue to Postgres, this operation becomes a database write from the frontend.
/exclude-repo - Adds a given repo to the exclude config setting of all the external services that repo belongs to. Used only in a deprecated GraphQL mutation: setRepositoryEnabled. Can be done from the frontend.
/sync-external-service - Triggers a sync of a given external service. Can be done from the frontend, which would just enqueue a job to be picked up by worker.
/status-messages - Serves syncing status messages (i.e. errors) that are shown in the admin header panel. With sync jobs being in Postgres, this could be served from the database directly, instead of calling out to repo-updater.
/enqueue-changeset-sync - Enqueues a specific changeset sync. Used by the campaigns UI. Can be done from the frontend.
/schedule-perms-sync - Schedules a permissions sync job for the given users and repos. Can be done from the frontend.
/debug/repo-updater-state - A debug page that displays the in-memory state of the git-update schedule and queue. Can be moved to the frontend, with the state coming from Postgres.
/debug/list-authz-providers - A debug page that serves all configured authz providers across all external services. Can be moved to the frontend.

Repo syncing

External service sync job scheduler.
External service sync job worker. Uses internal/workerutil.
External service sync job resetter. Uses internal/workerutil.
External service sync job cleaner.
In-memory git updates scheduler - We maintain a priority queue of git update requests that are then sent to gitserver. With gitserver now being able to talk to Postgres, I don't see a reason for keeping this concern in repo-updater. gitserver can maintain this update schedule in Postgres, like we do for other things, and consume it directly.
Repo clone state syncer - Periodically calls out to gitserver to get the clone state of every repo, and updates the corresponding repo.clone column in Postgres accordingly. This will be replaced by gitserver writing and maintaining a set of Postgres tables with this same state that can be efficiently joined against, instead of updating the repo table directly.
GitolitePhabricatorMetadataSyncer - A huge hack we implemented to satisfy a specific customer's needs way back in the past.
PhabricatorRepositorySyncWorker - Part of the above huge hack. We have a separate phabricator_repos table that this worker maintains.
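To make the git updates scheduler item above more concrete, here is a minimal Go sketch of a priority queue of update requests built on the standard `container/heap` package. The `update` and `updateQueue` types, the integer priorities and the FIFO tie-break are assumptions for illustration and do not reflect the actual repo-updater implementation.

```go
package main

import (
	"container/heap"
	"fmt"
)

// update is a hypothetical git-update request for one repository.
type update struct {
	repo     string
	priority int // higher values are served first
	seq      int // insertion order, used to break ties (FIFO)
}

// updateHeap implements heap.Interface, ordering by priority, then by seq.
type updateHeap []update

func (h updateHeap) Len() int { return len(h) }
func (h updateHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority
	}
	return h[i].seq < h[j].seq
}
func (h updateHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *updateHeap) Push(x any)   { *h = append(*h, x.(update)) }
func (h *updateHeap) Pop() any {
	old := *h
	last := old[len(old)-1]
	*h = old[:len(old)-1]
	return last
}

// updateQueue wraps the heap and assigns sequence numbers on enqueue.
type updateQueue struct {
	h   updateHeap
	seq int
}

// enqueue adds an update request for repo with the given priority.
func (q *updateQueue) enqueue(repo string, priority int) {
	q.seq++
	heap.Push(&q.h, update{repo: repo, priority: priority, seq: q.seq})
}

// dequeue removes and returns the next repo to update, if there is one.
func (q *updateQueue) dequeue() (string, bool) {
	if q.h.Len() == 0 {
		return "", false
	}
	return heap.Pop(&q.h).(update).repo, true
}

func main() {
	q := &updateQueue{}
	q.enqueue("github.com/example/scheduled", 0)
	q.enqueue("github.com/example/user-requested", 1)
	for repo, ok := q.dequeue(); ok; repo, ok = q.dequeue() {
		fmt.Println("update", repo)
	}
}
```

If the queue moves to Postgres as proposed, the same ordering could plausibly be expressed as an ORDER BY over a priority column and an insertion timestamp, with gitserver consuming rows directly.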

Campaigns

SyncRegistry: Manages one syncer per code host to keep our changesets up to date with what is in the code host (i.e. read path)
Reconciler: Reconciles the current and the desired state of changesets on the code host. Uses internal/workerutil. (i.e. write path)
Resetter: Resets stuck reconciler jobs. Uses internal/workerutil.
SpecExpireWorker: Cleans up expired campaign and changeset specs.

Permissions

Permissions sync job scheduler
Permissions sync job worker

Internal rate limit registry

Records our remaining internal rate limit per code host
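A registry of this kind can be pictured as a small concurrency-safe map keyed by code host. The Go sketch below is only an assumption about its shape: the `rateLimitRegistry` type and its `Record` and `Remaining` methods are hypothetical names, not the existing implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// rateLimitRegistry is a hypothetical in-memory record of the remaining
// request budget per code host, keyed by host name.
type rateLimitRegistry struct {
	mu        sync.Mutex
	remaining map[string]int
}

func newRateLimitRegistry() *rateLimitRegistry {
	return &rateLimitRegistry{remaining: make(map[string]int)}
}

// Record stores the remaining budget last observed for a code host.
func (r *rateLimitRegistry) Record(host string, remaining int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.remaining[host] = remaining
}

// Remaining reports the recorded budget; ok is false for unknown hosts.
func (r *rateLimitRegistry) Remaining(host string) (n int, ok bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	n, ok = r.remaining[host]
	return n, ok
}

func main() {
	reg := newRateLimitRegistry()
	reg.Record("github.com", 4999)
	n, _ := reg.Remaining("github.com")
	fmt.Println("github.com remaining:", n)
}
```

Because this state lives in process memory, it is shared only by jobs that run in the same process, which is worth keeping in mind when deciding where this concern should live once jobs are split across worker containers.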

Code monitoring

Query enqueuer.
Jobs log deleter.
Query runner. Uses internal/workerutil.
Query resetter. Uses internal/workerutil.

Code insights

Yet to come