codeintel: Add janitor to clean up commits refs now unresolvable by gitserver

Administrator requested to merge ef/unreachable-commits into main

Created by: efritz

NOTE: 1.7k lines are autogenerated mocks.

This PR adds a new background process that periodically fetches repository_id/commit pairs from the lsif_uploads and lsif_indexes tables and ensures they are resolvable. If they are not, the records are soft-deleted (and eventually hard-deleted by the existing janitor behavior).

To do this in batches, rather than scanning the entire lsif_uploads and lsif_indexes tables on every run (every 10s or so, whatever our cleanup interval is), we need to store some metadata recording when each commit was last checked. We can then order by this metadata so that the least recently checked commits are refreshed first.

We choose to store this metadata directly on the lsif_uploads and lsif_indexes tables as a sibling field to the commit reference itself. The janitor will periodically:

  • union lsif_uploads and lsif_indexes records and find the unique set of repository_id/commit pairs, along with the oldest refresh date for each pair across both tables
  • ask gitserver to resolve a batch of commits
  • for each commit gitserver knows about, update all upload and index records for that repository_id/commit pair (there may be many if there are many independently indexed subprojects) with the current timestamp so it becomes the most recently refreshed (and the furthest away to be re-checked)
  • for each commit gitserver can't resolve, soft delete all upload and index records for that repository_id/commit pair

There are rudimentary metrics attached (number of records deleted) but we should likely add something like "oldest commit record unchecked" so that we can tune the batch sizes for different instance classes.