LSIF: Cross-repository database to target Postgres
Created by: efritz
The current LSIF backend uses SQLite as its data store for everything: each repo/commit pair has its own SQLite database (written once by one process and read many times by many processes), as well as a cross-repository database (written by the worker and read many times by many processes).
We will soon need many writers (both the API and the worker) to access the cross-repository database concurrently, which will lead to (1) contention on a disk-based database, and (2) a higher chance of corruption (journaling modes and exclusive locks still do not solve this problem). This work enables https://github.com/sourcegraph/sourcegraph/pull/5691 to be implemented without the chance of data corruption due to multiple writers.
To get ahead of this, this PR moves the data that can be written concurrently by multiple processes out of SQLite and into Postgres. In dev, this database can be shared with the same Postgres instance used by the frontend/repo-updater/management-console. In production, it should likely have its own instance. Tagging distribution (@slimsag) and core services (@tsenart) members who commented on this distribution issue in RFC 25.
Created by: efritz
Regarding Postgres: the idea is to use the same Postgres instance, but a different database, correct? If so, do we have enough documentation in place so that users know to configure two different PGDATASOURCE strings?

Well, that's what I'd like @sourcegraph/distribution to weigh in on. Conversation in the RFCs made me believe that it would be better/safer to run a second instance of Postgres, but if another database within the same container works, then that's possible as well.

I'm also looking for advice on all the places where `PG_*` variables are set in different contexts so that I can take care of everything in one effort.

Created by: slimsag
Architecture-wise:

- A separate DB in the same Postgres instance obviously sounds most ideal to me, because that means admins don't need to e.g. provision another RDS database. But it's not clear to me how much load this will place on Postgres, so it is hard to know whether a separate instance is actually warranted. I would suggest rolling this out on a large instance (k8s.sgdev.org) with the separate-database approach and measuring the change. We could, for example, take a day to measure how much the Postgres disk grows in size and maybe extract some generic info on average query time -- which I imagine would probably be enough to gain some confidence (looks like this could maybe be used: https://www.citusdata.com/blog/2019/02/08/the-most-useful-postgres-extension-pg-stat-statements/).
- If you do decide that a separate Postgres instance is needed, we would still use just one in the server case, I think (no point in running two in the same container).
- You can just specify the `PG_*` vars on your lsif-server container. This is:
  - https://github.com/sourcegraph/deploy-sourcegraph/blob/master/base/lsif-server/lsif-server.Deployment.yaml
  - https://github.com/sourcegraph/deploy-sourcegraph-docker/blob/master/deploy-lsif-server.sh
- https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/server/shared/shared.go#L149-150 (here you just need to override `PGDATABASE` using your own new env var named e.g. `PGDATABASE_LSIF`, which users can configure -- the others have preconfigured values)
- In any case, you should make use of the same tool we use for migrations and create a similar strategy for applying migrations to this DB: see https://github.com/golang-migrate/migrate and https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/migrations/README.md
- Please please please just make sure the above is documented well :) Every hour you spend making the above seamless, documenting the expected increase in resource usage on Postgres, etc. saves our admin users several hours -- it's time well spent!
Created by: efritz
Updated with migrations and docs. I'm looking to see how I can auto-create the lsif database in the base postgres image.
https://github.com/sourcegraph/deploy-sourcegraph/pull/386 https://github.com/sourcegraph/deploy-sourcegraph-docker/pull/44
Created by: efritz
When is concurrent write access needed in both the server and worker, and why don't journaling modes and exclusive locks solve this problem?
Because we need to update commit data on demand, the API also needs write access to the xrepo db. (Alternatively, we could drop that request and queue the update, which is bad UX, or we could queue a job and block until it's complete, which would cause long delays if there are any LSIF uploads to process.)
SQLite requires exclusive write access (an exclusive write file lock) on the database for any write. For one, this could be slow with multiple API servers and/or workers trying to update things. But more importantly, SQLite is basically a wrapper around filesystem calls: its locks are implemented in terms of POSIX file locking. This seems very brittle when the drive is networked and/or shared. I've seen some cautionary tales about data loss when SQLite was written to concurrently over the network, and losing this database would nerf all of the LSIF data that had already been uploaded/processed.
That's not a failure mode I want to bring to ops, and that's not a customer issue I'd want to bring to any of us (sorry all your code intel is gone, rerun things in CI).
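To illustrate the write-contention point: SQLite serializes writers with advisory file locks, which are exactly the mechanism that is unreliable over networked filesystems. A toy sketch of the exclusive-lock behavior (using `flock` for simplicity; SQLite actually uses fcntl-style byte-range locks, and the file name here is made up):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// lockExclusive opens path and tries to take a non-blocking exclusive
// advisory lock on it, roughly like a SQLite writer. The caller holds
// the lock for as long as the returned file stays open.
func lockExclusive(path string) (*os.File, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

func main() {
	// First "writer" acquires the exclusive lock.
	w1, err := lockExclusive("/tmp/xrepo.db.lock")
	if err != nil {
		fmt.Println("writer 1 failed:", err)
		return
	}
	defer w1.Close()
	fmt.Println("writer 1: exclusive lock held")

	// A second concurrent "writer" is refused until writer 1 releases.
	if _, err := lockExclusive("/tmp/xrepo.db.lock"); err != nil {
		fmt.Println("writer 2: must wait:", err)
	}
}
```

A row-level-locking server like Postgres sidesteps this whole class of problems, since the database, not the shared filesystem, arbitrates concurrent writers.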
Created by: slimsag
Yeah, as a frontend-based migration that lsif-server waits on, we handle all of the following cases nicely -- they are all automatic and handled effectively just like any other DB migration we've had in the past:

- A new cluster deployment running `sourcegraph/postgres-11.1`
- An existing cluster deployment running `sourcegraph/postgres-11.1` and upgrading to this new Sourcegraph version
- A new cluster deployment with an external Postgres DB v9.6+
- An existing cluster deployment using an external Postgres DB v9.6+ upgrading
- A new `sourcegraph/server` deployment with the built-in Postgres
- An existing `sourcegraph/server` deployment with the built-in Postgres upgrading to this new Sourcegraph version
- A new `sourcegraph/server` deployment with external Postgres v9.6+
- An existing `sourcegraph/server` deployment with external Postgres v9.6+ upgrading
- Dev environments