codeintel: Standardize lsifstore out-of-band migrations

Administrator requested to merge ef/standardized-codeintel-oobmigrations into main

Created by: efritz

This is prerequisite work for https://github.com/sourcegraph/sourcegraph/pull/18549. This PR adds an abstract migrator tailored to the codeinteldb store's current scale and schema layout. As a proof of concept, we rewrite our existing migration to fit the new pattern.

We are making this change because fully counting the number of migrated rows in either the definitions or references table takes multiple minutes per query at today's Cloud scale.

Key insight: Don't track progress by counting the number of migrated rows; track progress by counting the number of completely migrated uploads. (Thanks, @shrouxm!)

Instead of doing a full-table scan to count the total number of rows, we keep track, per upload, of the range of schema versions over all of that upload's records in a small sister table. We can populate this table efficiently once in an in-band migration keyed by distinct dump_id values (an indexed field). The sister table should be small enough that even full table scans are cheap for a background process; in any case, we can always access it via its primary key, so index usage here is not a concern.
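
A rough sketch of what such a sister table and its one-time backfill might look like. The table and column names (lsif_data_definitions_schema_versions, dump_id, min_schema_version, max_schema_version) are illustrative assumptions, not necessarily the exact names used in this PR:

```go
package lsifmigration

import (
	"context"
	"database/sql"
)

// backfillSchemaVersions creates a hypothetical sister table keyed by dump_id
// and populates it once by aggregating the schema_version of every row in the
// data table, grouped by the indexed dump_id column.
func backfillSchemaVersions(ctx context.Context, db *sql.DB) error {
	queries := []string{
		`CREATE TABLE IF NOT EXISTS lsif_data_definitions_schema_versions (
			dump_id            integer PRIMARY KEY,
			min_schema_version integer,
			max_schema_version integer
		)`,
		`INSERT INTO lsif_data_definitions_schema_versions
		 SELECT dump_id, MIN(schema_version), MAX(schema_version)
		 FROM lsif_data_definitions
		 GROUP BY dump_id
		 ON CONFLICT (dump_id) DO NOTHING`,
	}
	for _, q := range queries {
		if _, err := db.ExecContext(ctx, q); err != nil {
			return err
		}
	}
	return nil
}
```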

Rows in this sister table are updated when rows are inserted into the data table (via a trigger) and when rows are deleted from the data table (via an application-level transaction). Only migrators update data rows in place, so those updates to the sister table are handled explicitly by the migrator.
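
A hedged sketch of the insert trigger and the app-level delete path, reusing the hypothetical table names from the snippet above; this is not the PR's exact code:

```go
package lsifmigration

import (
	"context"
	"database/sql"
)

// A row-level trigger keeps the sister table's version bounds fresh whenever
// new data rows are written for an upload.
const createInsertTrigger = `
CREATE OR REPLACE FUNCTION update_definitions_schema_versions() RETURNS trigger AS $$
BEGIN
    INSERT INTO lsif_data_definitions_schema_versions (dump_id, min_schema_version, max_schema_version)
    VALUES (NEW.dump_id, NEW.schema_version, NEW.schema_version)
    ON CONFLICT (dump_id) DO UPDATE SET
        min_schema_version = LEAST(lsif_data_definitions_schema_versions.min_schema_version, EXCLUDED.min_schema_version),
        max_schema_version = GREATEST(lsif_data_definitions_schema_versions.max_schema_version, EXCLUDED.max_schema_version);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER lsif_data_definitions_schema_versions_insert
AFTER INSERT ON lsif_data_definitions
FOR EACH ROW EXECUTE FUNCTION update_definitions_schema_versions();
`

// deleteUploadData removes an upload's data rows and its sister-table row in
// the same transaction, since deletes are handled at the application level
// rather than by a trigger.
func deleteUploadData(ctx context.Context, db *sql.DB, dumpID int) (err error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			_ = tx.Rollback()
		} else {
			err = tx.Commit()
		}
	}()

	if _, err = tx.ExecContext(ctx, `DELETE FROM lsif_data_definitions WHERE dump_id = $1`, dumpID); err != nil {
		return err
	}
	_, err = tx.ExecContext(ctx, `DELETE FROM lsif_data_definitions_schema_versions WHERE dump_id = $1`, dumpID)
	return err
}
```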

When choosing rows to migrate, we can now search by dump_id, which is an indexed column across all of our LSIF data tables. This should speed up the initial search and update. More importantly, we can now efficiently check the progress of the migration by counting the rows in the sister table whose minimum schema version is below the target migration threshold. On Cloud, this should go from a ten-minute sequential scan to an efficient index scan over a much smaller table.
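
A sketch of that progress calculation, again against the hypothetical sister table above: instead of counting rows in the data table, we count uploads whose minimum schema version has reached the target. The function and column names are assumptions for illustration:

```go
package lsifmigration

import (
	"context"
	"database/sql"
)

// migrationProgress returns the fraction of uploads whose definition rows
// have all been migrated to at least targetVersion.
func migrationProgress(ctx context.Context, db *sql.DB, targetVersion int) (float64, error) {
	var total, migrated int
	err := db.QueryRowContext(ctx, `
		SELECT
			COUNT(*),
			COUNT(*) FILTER (WHERE min_schema_version >= $1)
		FROM lsif_data_definitions_schema_versions
	`, targetVersion).Scan(&total, &migrated)
	if err != nil {
		return 0, err
	}
	if total == 0 {
		return 1, nil // nothing to migrate
	}
	return float64(migrated) / float64(total), nil
}
```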

Future data migrations that fit the same pattern only need to define and register a new driver, which decodes/re-encodes the data for each row.
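
To illustrate the "driver" idea, here is one possible shape for such an interface: the shared migrator handles batching by dump_id and progress accounting, while each concrete migration only supplies the per-row decode/re-encode step. The interface name and methods below are illustrative assumptions, not the PR's actual API:

```go
package lsifmigration

import "context"

// MigrationDriver is implemented once per data migration. The surrounding
// migrator selects a batch of rows for a single dump_id, calls the driver on
// each payload, writes the results back with the new schema_version, and then
// updates the sister table's version bounds.
type MigrationDriver interface {
	// TableName names the data table this driver operates on,
	// e.g. "lsif_data_definitions".
	TableName() string

	// TargetVersion is the schema_version written for migrated rows.
	TargetVersion() int

	// MigratePayload decodes a row's encoded data and re-encodes it in the
	// format expected by TargetVersion.
	MigratePayload(ctx context.Context, payload []byte) ([]byte, error)
}
```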
