codeintel: Fix race condition in upload expirer

Administrator requested to merge ef/codeintel-qa-retention into main

Created by: efritz

The original problem

See this sample failure, in which an index is removed during the test.

This is undesired behavior: by default, we have a data retention rule that keeps uploads visible from any commit for 168 hours (one week). Having them deleted in the time it takes to run the test is obviously concerning.

Reproduction and root cause

The problem can be reproduced trivially by setting the following environment variables and running the ./upload command in the codeintel-qa test. This makes the upload retention scanner run in a near-constant loop, which eventually catches the race condition seen (very rarely) in CI.

 PRECISE_CODE_INTEL_CLEANUP_TASK_INTERVAL=1s
 PRECISE_CODE_INTEL_COMMIT_GRAPH_UPDATE_TASK_INTERVAL=1s
 PRECISE_CODE_INTEL_RETENTION_UPLOAD_PROCESS_DELAY=1s
 PRECISE_CODE_INTEL_RETENTION_REPOSITORY_PROCESS_DELAY=1s

At some point, the test fails with an upload whose status is '', indicating that the upload was deleted. With additional local logging, we found that the deleted uploads always have an empty set of commits from which they are visible.

There are two ways we can meet this condition:

  1. The commit actually no longer exists in the repo (force-pushed away and subsequently gc'd)
  2. The upload has not yet been inserted into the commit graph

When we run the upload expirer, we first select repositories to process, then select uploads within those repositories to match against retention policies. Those not protected by a policy will be marked for deletion.

When selecting repositories, we only select those with an up-to-date commit graph. This means that all uploads that are present in the database at that time are visible in the commit graph. The code is written assuming this holds over time.

It does not.

It's entirely possible for an upload to finish processing between the time a repository is selected for processing and the time all of that repository's uploads are handled. This race is trivially reproducible with three uploads processed in sequence for the same repository.
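
To make the window concrete, here is a minimal Go sketch of the expirer's control flow. The types and method names (store, ReposWithFreshCommitGraph, and so on) are hypothetical stand-ins, not the actual store API; the comment marks where a freshly processed upload can slip in.

 package expirer
 
 import "time"
 
 // Upload is a trimmed-down stand-in for the real upload record.
 type Upload struct {
     ID             int
     FinishedAt     time.Time
     VisibleCommits []string // commits from which this upload is visible
 }
 
 // store is a hypothetical interface standing in for the real database layer.
 type store interface {
     ReposWithFreshCommitGraph() []int // repos whose commit graph is up to date
     UploadsForRepo(repoID int) []Upload
     ProtectedByPolicy(u Upload) bool // does any retention policy protect u?
     MarkForDeletion(uploadID int)
 }
 
 func expire(s store) {
     // Select repositories whose commit graph was up to date at selection time.
     for _, repoID := range s.ReposWithFreshCommitGraph() {
         // RACE WINDOW: an upload can finish processing right here, after the
         // repository was selected but before its commit graph has been
         // recalculated. That upload is visible from zero commits, so no
         // retention policy matches it and it is marked for deletion below.
         for _, upload := range s.UploadsForRepo(repoID) {
             if !s.ProtectedByPolicy(upload) {
                 s.MarkForDeletion(upload.ID)
             }
         }
     }
 }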

The fix

We already write the start time of the transaction that updates the commit graph to the database. We can use this data to ensure that we never try to expire an upload whose finished_at time is later than the last commit graph update for its repository.
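
Reusing the types from the sketch above, the guard might look roughly like this. Here lastCommitGraphUpdatedAt is a hypothetical accessor for the stored transaction start time, and in practice the comparison presumably lives in the SQL that selects candidate uploads.

 // expireSafely skips any upload that finished after the repository's commit
 // graph was last recalculated, since that upload cannot yet be visible in the
 // graph and would otherwise look unprotected.
 func expireSafely(s store, lastCommitGraphUpdatedAt func(repoID int) time.Time) {
     for _, repoID := range s.ReposWithFreshCommitGraph() {
         graphUpdatedAt := lastCommitGraphUpdatedAt(repoID)
 
         for _, upload := range s.UploadsForRepo(repoID) {
             if upload.FinishedAt.After(graphUpdatedAt) {
                 // Not yet reflected in the commit graph; leave it for a
                 // later pass once the graph has been updated.
                 continue
             }
             if !s.ProtectedByPolicy(upload) {
                 s.MarkForDeletion(upload.ID)
             }
         }
     }
 }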

Thanks for coming to my PR.
