Proposal: Sunset repo-updater
Created by: tsenart
Proposal
Today, repo-updater serves too many concerns as a singleton instance, with no failure or resource isolation between them.
A nil pointer panic, a memory leak, or any other noisy-neighbor issue in one of these concerns will cause a cascading failure of unrelated features owned by different teams.
We got to this situation because there was an unmet need for a place to run background jobs, and repo-updater made that easy, so more and more use cases have been integrated into this singleton service over time. The more things we add to repo-updater, the higher the chance of cascading failure.
In contrast, the Code Intelligence team has created completely separate services for their background jobs. While we could split each of these concerns into its own service, we must consider the overhead that would impose on us: more things to monitor, build, deploy and provision.
To avoid that overhead and address the issues at hand, I propose that we introduce a new service called worker, and extract the different concerns that currently live in repo-updater. Some of those would end up in worker, others elsewhere.
The worker program can be configured to run specific jobs, or all of them. This allows us to isolate workloads by configuring which jobs run in each Kubernetes deployment (or equivalent), while still building and provisioning a single binary shipped as a single Docker image, and keeping the developer environment simple.
For instance, we could have a new Kubernetes deployment for campaigns, called campaigns-workers, which would specify three different containers, one for each job type it runs.
```yaml
containers:
  - image: index.docker.io/sourcegraph/worker:insiders
    name: sync-registry
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: "1"
        memory: 1Gi
    args:
      - --job=campaigns.sync-registry
  - image: index.docker.io/sourcegraph/worker:insiders
    name: reconciler
    resources:
      limits:
        cpu: "1"
        memory: 500M
      requests:
        cpu: 100m
        memory: 100M
    args:
      - --job=campaigns.reconciler
  - image: index.docker.io/sourcegraph/worker:insiders
    name: cleanup
    resources:
      limits:
        cpu: "0.5"
        memory: 300M
      requests:
        cpu: "0.5"
        memory: 300M
    args:
      - --job=campaigns.spec-expire-worker
```
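The --job flag in the container args above could be interpreted by the worker binary roughly as follows. This is a minimal sketch, not the actual worker implementation. It assumes a hypothetical in-process registry that maps job names to run functions, and a --job flag that may be repeated. When no --job flag is given, every registered job runs. The job names mirror the deployment example, and their no-op bodies are placeholders.

```go
package main

import (
	"context"
	"flag"
	"fmt"
	"os"
	"sort"
	"strings"
	"sync"
)

// Job is a hypothetical unit of background work hosted by the worker binary.
type Job func(ctx context.Context) error

// registry maps job names to jobs. Names and no-op bodies are placeholders.
var registry = map[string]Job{
	"campaigns.sync-registry":      func(ctx context.Context) error { return nil },
	"campaigns.reconciler":         func(ctx context.Context) error { return nil },
	"campaigns.spec-expire-worker": func(ctx context.Context) error { return nil },
}

// jobFlags implements flag.Value so that --job can be passed multiple times.
type jobFlags []string

func (j *jobFlags) String() string     { return strings.Join(*j, ",") }
func (j *jobFlags) Set(v string) error { *j = append(*j, v); return nil }

// selectJobs returns the requested job names, or every registered job (sorted)
// when none were requested. Unknown names are rejected.
func selectJobs(requested []string) ([]string, error) {
	if len(requested) == 0 {
		all := make([]string, 0, len(registry))
		for name := range registry {
			all = append(all, name)
		}
		sort.Strings(all)
		return all, nil
	}
	for _, name := range requested {
		if _, ok := registry[name]; !ok {
			return nil, fmt.Errorf("unknown job %q", name)
		}
	}
	return requested, nil
}

func main() {
	var jobs jobFlags
	flag.Var(&jobs, "job", "job to run (repeatable); runs all jobs if omitted")
	flag.Parse()

	names, err := selectJobs(jobs)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Each selected job runs in its own goroutine; main returns once all finish.
	ctx := context.Background()
	var wg sync.WaitGroup
	for _, name := range names {
		wg.Add(1)
		go func(name string, job Job) {
			defer wg.Done()
			if err := job(ctx); err != nil {
				fmt.Fprintf(os.Stderr, "job %s failed: %v\n", name, err)
			}
		}(name, registry[name])
	}
	wg.Wait()
}
```

The isolation comes from running one selected job per container, as in the deployment above. Jobs that share a process in the all-jobs mode (e.g. in the developer environment) still share its failure domain.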
Concerns
Below is an inventory of all current and some future repo-updater concerns, what they're about and how we could remove them from repo-updater.
HTTP API
- /repo-update-scheduler-info: Serves the in-memory git-update schedule information for a given repo. This schedule information will be migrated to Postgres, so it will be queryable directly from the frontend without calling out to repo-updater.
- /repo-lookup: Looks up a repo in the database and serves it. On Cloud, if the repo isn't in the database, we look it up on the code host (GitHub.com or GitLab.com) and sync that single repo back to the database. There's no reason this logic can't be executed in the frontend.
- /repo-external-services: Looks up all the external services a given repo belongs to (in Postgres) and serves them. Can be done from the frontend.
- /enqueue-repo-update: Enqueues a high-priority git-update in the git-updates queue. Once we move this queue to Postgres, this operation becomes a database write from the frontend.
- /exclude-repo: Adds a given repo to the exclude config setting of all the external services that repo belongs to. Used only in a deprecated GraphQL mutation, setRepositoryEnabled. Can be done from the frontend.
- /sync-external-service: Triggers a sync of a given external service. Can be done from the frontend, which would just enqueue a job to be picked up by worker.
- /status-messages: Serves the syncing status messages (i.e. errors) shown in the admin header panel. With sync jobs stored in Postgres, these could be served directly from the database instead of calling out to repo-updater.
- /enqueue-changeset-sync: Enqueues a specific changeset sync. Used by the campaigns UI. Can be done from the frontend.
- /schedule-perms-sync: Schedules a permissions sync job for the given users and repos. Can be done from the frontend.
- /debug/repo-updater-state: A debug page that displays the in-memory state of the git-update schedule and queue. Can be moved to the frontend, with the state coming from Postgres.
- /debug/list-authz-providers: A debug page that serves all configured authz providers across all external services. Can be moved to the frontend.
Repo syncing
- External service sync job scheduler.
- External service sync job worker. Uses internal/workerutil.
- External service sync job resetter. Uses internal/workerutil.
- External service sync job cleaner.
- In-memory git updates scheduler: We maintain a priority queue of git update requests that are then sent to gitserver. Now that gitserver can talk to Postgres, I don't see a reason to keep this concern in repo-updater: gitserver can maintain this update schedule in Postgres, as we do for other things, and consume it directly.
- Repo clone state syncer: Periodically calls out to gitserver to get the clone state of every repo, and updates the corresponding repo.clone column in Postgres. This will be replaced by gitserver writing and maintaining a set of Postgres tables with the same state, which can be efficiently joined against, instead of updating the repo table directly.
- GitolitePhabricatorMetadataSyncer: A huge hack we implemented long ago to satisfy a specific customer's needs.
- PhabricatorRepositorySyncWorker: Part of the above huge hack. We have a separate phabricator_repos table that this worker maintains.
Campaigns
- SyncRegistry: Manages one syncer per code host to keep our changesets up to date with what is in the code host (i.e. read path)
- Reconciler: Reconciles the current and the desired state of changesets on the code host (i.e. write path). Uses internal/workerutil.
- Resetter: Resets stuck reconciler jobs. Uses internal/workerutil.
- SpecExpireWorker: Cleans up expired campaign and changeset specs.
Permissions
- Permissions sync job scheduler
- Permissions sync job worker
Internal rate limit registry
- Records our remaining internal rate limit per code host
Code monitoring
- Query enqueuer.
- Jobs log deleter
- Query runner. Uses internal/workerutil.
- Query resetter. Uses internal/workerutil.
Code insights
Yet to come