codeintel: Standardize bulk upload technique

Warren Gifford requested to merge ef/standardize-bulk-upload into main

Created by: efritz

There are several places where we do bulk uploads, all inconsistently:

  • Bulk-inserting of lsif_package data
  • Bulk-inserting of lsif_references data
  • Bulk-updating of lsif_nearest_uploads(_links) data
  • Bulk-inserting/updating of lsif_data_* data (in the codeintel-db)

This PR ensures that we have a unified technique for adding large amounts of data into Postgres (a sketch of the full pattern follows the two lists below):

  1. Create a temporary table that does NOT include columns whose value is the same for every row in the operation; this table will be dropped at the end of the transaction
  2. Bulk insert into this temporary table
  3. Upsert from the temporary table into the target table

This last step allows us, in different circumstances, to:

  • Reduce the total number of round-trip queries (N updates can be performed in N / ((2^16-1) / #cols) bulk inserts plus one additional update)
  • Reduce our bandwidth by supplying only one parameterized copy of constant-valued fields instead of supplying them on every row update (common for dump identifiers and schema versions)
  • Minimize the number of tuples we have to touch in a permanent table, reducing the need for frequent vacuum operations (important for nearest upload data, which changes infrequently but is mass-updated frequently)
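
To make the pattern concrete, here is a minimal sketch of the three steps using plain `database/sql`. The table and column names (`t_temp`, `t_target`, `scheme`, `name`, `version`) are illustrative, and the actual changes go through the existing bulk inserter rather than hand-built multi-row INSERT statements:

```go
package bulkexample

import (
	"context"
	"database/sql"
	"fmt"
)

// writePackages is a sketch of the three-step pattern, assuming a hypothetical
// target table t_target(dump_id, scheme, name, version) whose dump_id column is
// constant for every row in the batch.
func writePackages(ctx context.Context, db *sql.DB, dumpID int, rows [][3]string) (err error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			_ = tx.Rollback()
		} else {
			err = tx.Commit()
		}
	}()

	// (1) Create a temporary table WITHOUT the constant-valued dump_id column;
	// ON COMMIT DROP removes it at the end of the transaction.
	if _, err := tx.ExecContext(ctx, `
		CREATE TEMPORARY TABLE t_temp (scheme text, name text, version text)
		ON COMMIT DROP
	`); err != nil {
		return err
	}

	// (2) Bulk insert into the temporary table, sized so that the number of
	// parameters per statement stays under Postgres' 2^16-1 limit.
	const batchSize = (1<<16 - 1) / 3
	for start := 0; start < len(rows); start += batchSize {
		end := start + batchSize
		if end > len(rows) {
			end = len(rows)
		}

		query := "INSERT INTO t_temp (scheme, name, version) VALUES "
		args := make([]interface{}, 0, (end-start)*3)
		for i, row := range rows[start:end] {
			if i > 0 {
				query += ","
			}
			query += fmt.Sprintf("($%d,$%d,$%d)", i*3+1, i*3+2, i*3+3)
			args = append(args, row[0], row[1], row[2])
		}
		if _, err := tx.ExecContext(ctx, query, args...); err != nil {
			return err
		}
	}

	// (3) Upsert from the temporary table into the target, binding the constant
	// dump_id exactly once instead of once per row.
	_, err = tx.ExecContext(ctx, `
		INSERT INTO t_target (dump_id, scheme, name, version)
		SELECT $1, scheme, name, version FROM t_temp
		ON CONFLICT DO NOTHING
	`, dumpID)
	return err
}
```

Depending on the call site, the final statement may update on conflict (e.g. to bump schema versions) rather than ignore conflicting rows, but the shape is the same.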

Reviewers: Please review by commit. The following commits have non-trivial changes:

  • d5a4ce7: Update dbstore.UpdatePackages and dbstore.UpdatePackageReferences. Previously these used a bulk inserter but did not use a temporary table. We now insert rows (minus the dump ID) into a temporary table, then transfer them to the target table once the bulk inserter has flushed.
  • 4be02ff: Update lsifstore.WriteDocuments. This also did not previously use a temporary table. We can now supply the dump ID and the current schema version once per bulk insert statement instead of on every row.
  • ec097e5: Same thing, but for result chunks.
  • b90c575: Same thing, but with definition and references.
  • ccaf455: Update the insertion technique for out-of-band migrations. We used to issue an update query for every row in the batch, which is incredibly sluggish. We now insert into a temporary table and mass-update based on the primary key values (sketched below). This required a slight refactor in each migration implementation: each migration now supplies a list of fieldSpecs for the columns it reads or updates.
  • 00ac185: Update the insertion technique for nearest commit data. We used to create three batch inserters by hand, insert into a temporary table, then insert/update/delete based on the difference between the target and temporary tables. This is all the same, but we now load all of the insertion data into channels so that we can feed all three inserters concurrently (also sketched below).

The remaining commits are fixup efforts and should be self-explanatory.
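
For ccaf455, the mass update amounts to staging the rewritten rows in a temporary table and joining on the primary key. A rough sketch, again with plain `database/sql` and illustrative names (`t_data`, `payload`, `schema_version`); the fieldSpecs plumbing is omitted:

```go
package bulkexample

import (
	"context"
	"database/sql"
	"fmt"
)

// updateBatch sketches the migration update path: stage the rewritten values in
// a temporary table keyed by primary key, then issue a single UPDATE joining on
// that key. The schema shown here is hypothetical.
func updateBatch(ctx context.Context, tx *sql.Tx, migrated map[int]string, newVersion int) error {
	if len(migrated) == 0 {
		return nil
	}

	// (1) Temporary table holding only the primary key and the rewritten value.
	if _, err := tx.ExecContext(ctx, `
		CREATE TEMPORARY TABLE t_migrated (id integer, payload text) ON COMMIT DROP
	`); err != nil {
		return err
	}

	// (2) Bulk insert the (id, payload) pairs (batching by parameter count is
	// omitted here for brevity; see the first sketch above).
	query := "INSERT INTO t_migrated (id, payload) VALUES "
	args := make([]interface{}, 0, len(migrated)*2)
	i := 0
	for id, payload := range migrated {
		if i > 0 {
			query += ","
		}
		query += fmt.Sprintf("($%d,$%d)", i*2+1, i*2+2)
		args = append(args, id, payload)
		i++
	}
	if _, err := tx.ExecContext(ctx, query, args...); err != nil {
		return err
	}

	// (3) One mass update joining on the primary key, replacing the previous
	// one-UPDATE-per-row behavior; the new schema version is bound once.
	_, err := tx.ExecContext(ctx, `
		UPDATE t_data
		SET payload = m.payload, schema_version = $1
		FROM t_migrated m
		WHERE t_data.id = m.id
	`, newVersion)
	return err
}
```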
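
For 00ac185, the interesting part is how the three inserters are fed rather than the SQL itself. Below is a rough sketch of the channel-based feeding, with a stand-in `rowInserter` interface in place of the repository's actual batch inserter type; all names are illustrative:

```go
package bulkexample

import (
	"context"
	"sync"
)

// rowInserter stands in for the batch inserter; only an Insert method matters
// for this sketch.
type rowInserter interface {
	Insert(ctx context.Context, values ...interface{}) error
}

// feedInserters drains three channels of row values into three inserters
// concurrently, mirroring the insert/update/delete split for nearest commit data.
func feedInserters(ctx context.Context, inserts, updates, deletes <-chan []interface{}, inserters [3]rowInserter) error {
	channels := [3]<-chan []interface{}{inserts, updates, deletes}

	var wg sync.WaitGroup
	errs := make(chan error, len(channels))

	for i, ch := range channels {
		wg.Add(1)

		// Each inserter is fed by its own goroutine so that all three
		// temporary tables fill up concurrently.
		go func(inserter rowInserter, ch <-chan []interface{}) {
			defer wg.Done()
			for values := range ch {
				if err := inserter.Insert(ctx, values...); err != nil {
					errs <- err
					return
				}
			}
		}(inserters[i], ch)
	}

	wg.Wait()
	close(errs)
	return <-errs // nil when every channel drained cleanly
}
```

Error handling is simplified here: in practice the producers also need to stop (e.g. via context cancellation) when one of the consumers fails, otherwise a producer could block on a channel nobody is reading.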
