
LSIF: Cross-repository database to target Postgres

Merged Administrator requested to merge lsif-postgres-xrepo into master

Created by: efritz

The current LSIF backend uses SQLite as its data store for everything: each repo/commit pair has its own SQLite database (written once by one process and read many times by many processes), as well as a cross-repository database (written by the worker and read many times by many processes).

We will soon need many writers (both the api and the worker) to access the cross-repository database concurrently, which will lead to (1) contention on a disk-based database, and (2) a higher chance of corruption (journaling modes and exclusive locks still do not solve this problem). This work enables https://github.com/sourcegraph/sourcegraph/pull/5691 to be implemented without the chance of data corruption due to multiple writers.

To get ahead of this, this PR moves the data that can be written concurrently by multiple processes into Postgres instead of SQLite. In dev, this tablespace can be shared with the same Postgres instance used by the frontend/repo-updater/management-console. In production, it should likely have its own instance. Tagging distribution (@slimsag) and core service (@tsenart) members who commented on this distribution issue in RFC 25.
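A minimal sketch of the split described above, assuming hypothetical helper names (these are not the actual lsif-server functions): the cross-repository store points at a Postgres connection string built from conventional PG* environment variables, while per-dump data keeps its one-writer/many-reader SQLite file layout.

```typescript
// Sketch only: illustrative helpers, not the real lsif-server config code.

// Build a postgres:// URL for the cross-repository store from conventional
// PG* environment variables, with dev-friendly defaults (names assumed).
function xrepoConnectionString(env: Record<string, string | undefined>): string {
  const user = env.PGUSER ?? 'sg';
  const host = env.PGHOST ?? 'localhost';
  const port = env.PGPORT ?? '5432';
  const database = env.PGDATABASE ?? 'lsif';
  return `postgres://${user}@${host}:${port}/${database}`;
}

// Per-dump data stays in one SQLite file per repo/commit pair, written once
// and read many times. The path scheme here is illustrative.
function dumpDatabasePath(storageRoot: string, repository: string, commit: string): string {
  // Encode the repository name so slashes don't create nested directories.
  return `${storageRoot}/${encodeURIComponent(repository)}@${commit}.db`;
}
```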

Merge request reports

Merged by avatar (Jul 25, 2025 5:26am UTC)


Activity

  • Created by: tsenart

    I won't be able to review this PR this week. @keegancsmith or @mrnugget: Would you have the chance?

  • Created by: felixfbecker

    @efritz let me know when you want me to do another review once the architecture/functionality is approved

  • Created by: efritz

    Regarding Postgres: the idea is to use the same Postgres instance, but a different database, correct? If so, do we have enough documentation in place so that users know to configure two different PGDATASOURCE strings?

    Well, that's what I'd like @sourcegraph/distribution to weigh in on. Conversation in RFCs made me believe that it would be better/safer to run a second instance of Postgres, but if another database within the same container works, then that's possible as well.

    I'm also looking for advice on all the places where PG_ variables are set in different contexts so that I can take care of everything in one effort.
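One possible fallback scheme for the two-datasource question above: let an LSIF-specific variable override the shared PG* value, so deployments that use one instance configure nothing extra. The `LSIF_` prefix is an assumption for illustration, not the settled variable naming.

```typescript
// Sketch only: the LSIF_ prefix and defaults are assumptions, not the
// actual configuration keys decided in this PR.

// Prefer an LSIF-specific override, falling back to the shared PG* value.
function lsifPgVar(env: Record<string, string | undefined>, name: string): string | undefined {
  return env[`LSIF_${name}`] ?? env[name];
}

// Example: resolve the database name for the cross-repository store,
// defaulting to a dedicated 'lsif' database in the shared instance.
function lsifDatabase(env: Record<string, string | undefined>): string {
  return lsifPgVar(env, 'PGDATABASE') ?? 'lsif';
}
```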

  • Created by: slimsag

    Architecture-wise:

    1. A separate DB in the same Postgres instance obviously sounds most ideal to me, because it means admins don't need to e.g. provision another RDS database. But it's not clear to me how much load this will place on Postgres, so it is hard to know whether a separate instance is actually warranted. I would suggest rolling this out on a large instance (k8s.sgdev.org) with the separate-DB approach and measuring the change. We could, for example, take a day to do this and measure how much the Postgres disk grows in size, and maybe extract some generic info on avg query time -- which I imagine would probably be enough to gain some confidence (looks like this could maybe be used: https://www.citusdata.com/blog/2019/02/08/the-most-useful-postgres-extension-pg-stat-statements/ )

    2. If you do decide that a separate Postgres instance is needed, we would still use just one in the server case, I think (no point in running two in the same container).

    3. You can just specify the PG_* vars on your lsif-server container.

    4. In any case, you should make use of the same tool we use for migrations and create a similar strategy for applying migrations to this DB: see https://github.com/golang-migrate/migrate and https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/migrations/README.md

    5. Please please please just make sure the above is documented well :) Every hour you spend making the above seamless, documenting the expected increase in resource usage on Postgres, etc. saves our admin users several hours -- it's time well spent!
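The versioned-migration strategy referenced above (golang-migrate applies numbered `NNNN_description.up.sql` files in order and records the current version) can be sketched with a small helper that selects the migrations still to be applied. This is an illustration of the ordering logic only; the actual tool handles locking and version tracking itself.

```typescript
// Sketch only: illustrates the versioned-migration ordering that
// golang-migrate implements; not a replacement for the tool.

// Extract the leading numeric version from a migration filename like
// 1528395537_lsif.up.sql (versions here are made up for illustration).
function migrationVersion(filename: string): number {
  const match = /^(\d+)_.*\.up\.sql$/.exec(filename);
  if (!match) throw new Error(`not an up-migration file: ${filename}`);
  return parseInt(match[1], 10);
}

// Return the up-migrations newer than the currently applied version,
// sorted so they run in ascending order.
function pendingMigrations(files: string[], appliedVersion: number): string[] {
  return files
    .filter(f => /\.up\.sql$/.test(f))
    .filter(f => migrationVersion(f) > appliedVersion)
    .sort((a, b) => migrationVersion(a) - migrationVersion(b));
}
```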

  • Created by: tsenart

    @slimsag's advice is sound architecturally. At the code level, I'm not the most qualified to review TypeScript, so I'll defer.

  • Created by: efritz

    Updated with migrations and docs. I'm looking to see how I can auto-create the lsif database in the base postgres image.

    https://github.com/sourcegraph/deploy-sourcegraph/pull/386 https://github.com/sourcegraph/deploy-sourcegraph-docker/pull/44

  • Created by: efritz

    When is concurrent write access needed in both the server and worker, and why don't journaling modes and exclusive locks solve this problem?

    Because we need to update commit data on demand, the API also needs write access to the xrepo db (alternatively, we could drop the request and queue the update, which is bad UX, or we could queue a job and block until it's complete, which would cause long delays if there are any LSIF uploads to process).

    SQLite requires exclusive write access (an exclusive write file lock) on the database for any write. For one, this could be slow with multiple API servers and/or workers trying to update things. But more importantly, SQLite is basically a wrapper around filesystem calls: its locks are implemented in terms of POSIX file locking. This seems very brittle when the drive is networked and/or shared. I've seen some cautionary tales about data loss when SQLite was written to concurrently over the network, and losing this database would wipe out all of the LSIF data that had already been uploaded/processed.
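The exclusive-lock contention described above typically surfaces as SQLITE_BUSY errors that each writer must retry with backoff. A hedged sketch of that retry loop (the `write` callback, error code check, and backoff parameters are all illustrative) shows the cost that moving the cross-repo data to Postgres avoids:

```typescript
// Sketch only: illustrates the busy-retry loop concurrent SQLite writers
// need; all names and parameters here are assumptions for illustration.

// Capped exponential backoff delay (in ms) for the nth retry attempt.
function busyBackoffMs(attempt: number, baseMs = 10, capMs = 1000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retry a write that may fail with SQLITE_BUSY while another process holds
// the exclusive file lock, giving up after maxAttempts tries.
async function withBusyRetry<T>(write: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await write();
    } catch (error: any) {
      if (error.code !== 'SQLITE_BUSY' || attempt + 1 >= maxAttempts) throw error;
      await new Promise(resolve => setTimeout(resolve, busyBackoffMs(attempt)));
    }
  }
}
```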

    That's not a failure mode I want to bring to ops, and that's not a customer issue I'd want to bring to any of us (sorry all your code intel is gone, rerun things in CI).

  • Created by: slimsag

    Yeah, as a frontend-based migration that lsif-server waits on, we handle all of the following cases nicely -- they are all automatic and handled effectively just like any other DB migration we've had in the past:

    1. A new cluster deployment running sourcegraph/postgres-11.1
    2. An existing cluster deployment running sourcegraph/postgres-11.1 and upgrading to this new Sourcegraph version
    3. A new cluster deployment with external postgres DB v9.6+
    4. An existing cluster deployment using an external Postgres DB v9.6+ upgrading.
    5. A new sourcegraph/server deployment with the built-in postgres
    6. An existing sourcegraph/server deployment with the built-in postgres upgrading to this new Sourcegraph version
    7. A new sourcegraph/server deployment with external postgres v9.6+
    8. An existing sourcegraph/server deployment with external postgres v9.6+ upgrading
    9. Dev environments
  • Created by: slimsag

    Let me know when this is ready for re-review.
