LSIF: Replicate worker within container
Created by: efritz
From a conversation with @creachadair and @slimsag, we decided to scale lsif-workers on two fronts:
- horizontally via replica counts in k8s deployments (easy and the correct way)
- horizontally via a combination of multiple compose services and intra-container parallelization in docker/docker-compose deployments (harder for us, but easier for companies using that environment - it's a big pain point to have to scale to double-digits manually)
Syntect server does something similar to the second approach (ENV WORKERS=4) in the dockerfile to ensure that one slow process does not block the remaining resources of the container. This could also be beneficial on bursty workloads with small repositories, as the containers may otherwise be idle or under-provisioned and a handful of processes per container may make use of the excess headroom.
Unfortunately, running multiple (unaltered) workers immediately runs into port clashes. Each worker tries to serve its own metrics on port 3187, and only one process can bind to a given port. Giving the workers unique, sequentially increasing ports (3187, 3188, 3189, etc) solves the problem, but our current Prometheus configuration would then only scrape the metrics of the first of n workers. We could scrape all ports, but then the worker count is not a dynamic property of the deployment (and every port must be exposed via compose or
A chat with @uwedeportivo revealed that we could get seamless scaling using Prometheus federation. This will basically let one Prometheus instance pre-aggregate the metrics that can then be scraped by a higher-level Prometheus instance.
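For context, a higher-level Prometheus pulls from a lower-level one through the lower-level instance's /federate endpoint, passing one or more match[] selectors that choose which series to pull. The sketch below only builds that request path; the function name and the job label in the usage note are hypothetical and are not taken from this PR.

```typescript
/**
 * Build the path and query string a higher-level Prometheus would request
 * from a lower-level instance when federating.
 * `selectors` are Prometheus series selectors; at least one is required.
 * Hypothetical helper for illustration only.
 */
function federatePath(selectors: string[]): string {
    if (selectors.length === 0) {
        throw new Error('federation requires at least one match[] selector')
    }
    // Each selector becomes its own match[] parameter, URL-encoded.
    const query = selectors.map(selector => `match[]=${encodeURIComponent(selector)}`).join('&')
    return `/federate?${query}`
}
```

For example, `federatePath(['{job="lsif-worker"}'])` (with an assumed job name) yields a path that selects every series carrying that job label.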
This PR changes the lsif-server image to accept the number of workers as an environment variable, and will start up 0-1 servers, 0-n workers, and a Prometheus instance that scrapes the (dynamic number of) running processes. The Prometheus instance exposes itself so that it can itself be scraped by our "main" Prometheus instance within a compose or k8s cluster.
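As a rough sketch of the idea (not the code in this PR), the in-container Prometheus needs one scrape target per running worker, derived from the configured worker count. The function names, the `localhost` host, and passing the count as a plain argument are assumptions made for illustration; only the base port 3187 and the sequential numbering come from the description above.

```typescript
// Base metrics port for the first worker, as stated in the description.
const BASE_METRICS_PORT = 3187

/**
 * Assign each worker a unique, sequentially increasing metrics port.
 * Hypothetical helper; the PR may derive ports differently.
 */
function workerMetricsPorts(numWorkers: number, basePort: number = BASE_METRICS_PORT): number[] {
    if (numWorkers < 0 || Math.floor(numWorkers) !== numWorkers) {
        throw new Error(`invalid worker count: ${numWorkers}`)
    }
    const ports: number[] = []
    for (let index = 0; index < numWorkers; index++) {
        ports.push(basePort + index)
    }
    return ports
}

/**
 * List the host:port targets the in-container Prometheus would scrape.
 * Assumes the workers are reachable on localhost inside the container.
 */
function workerScrapeTargets(numWorkers: number): string[] {
    return workerMetricsPorts(numWorkers).map(port => `localhost:${port}`)
}
```

With three workers this gives targets on ports 3187 through 3189, and with zero workers the list is empty, which matches the "0-n workers" case. The outer Prometheus then only needs to know about the single in-container instance, regardless of the worker count.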
Created by: codecov[bot]
Codecov Report
Merging #8951 into master will decrease coverage by <.01%. The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #8951      +/-   ##
==========================================
- Coverage   41.63%   41.63%   -0.01%
==========================================
  Files        1314     1314
  Lines       70675    70676       +1
  Branches     6554     6554
==========================================
  Hits        29423    29423
- Misses      38575    38576       +1
  Partials     2677     2677
Impacted Files                                 Coverage Δ
lsif/src/shared/database/postgres.ts           31.11% <ø> (ø)
lsif/src/server/routes/meta.ts                 0% <ø> (ø)
lsif/src/worker/server.ts                      0% <0%> (ø)
lsif/src/server/startup-migrations/redis.ts    0% <0%> (ø)
Created by: slimsag
> A chat with @uwedeportivo revealed that we could get seamless scaling using Prometheus federation. This will basically let one Prometheus instance pre-aggregate the metrics that can then be scraped by a higher-level Prometheus instance.
I am confused by this. Is your proposal here to run a prometheus instance inside the lsif-server container too?
EDIT: Ok, I see, yes it is.
Created by: efritz
This PR is okay to merge as it retains all of the previous capabilities if you don't turn the knobs for the number of servers/workers (and we have even more work before we can increase the replica count, see RFC 127).
There are two open PRs to enable federation on k8s and in docker:
- https://github.com/sourcegraph/deploy-sourcegraph/pull/548
- https://github.com/sourcegraph/deploy-sourcegraph-docker/pull/82
@slimsag OK to merge this first?