codeintel: Fix race condition in upload expirer
Created by: efritz
The original problem
See this sample failure, which removes an index during the test.
This is undesired behavior, as we by default have a data retention rule that keeps uploads visible from any commit for 168 hours (1 week). Having them deleted in the time it takes to run the test is very obviously concerning.
Reproduction and root cause
The problem can be reproduced trivially by setting the following environment variables and running the ./upload
command in the codeintel-qa test. This simply makes the upload retention scanner happen in a near-constant loop and eventually catches the race condition seen (very rarely) in CI.
PRECISE_CODE_INTEL_CLEANUP_TASK_INTERVAL=1s
PRECISE_CODE_INTEL_COMMIT_GRAPH_UPDATE_TASK_INTERVAL=1s
PRECISE_CODE_INTEL_RETENTION_UPLOAD_PROCESS_DELAY=1s
PRECISE_CODE_INTEL_RETENTION_REPOSITORY_PROCESS_DELAY=1s
At some point, the test will fail with an upload with a status of ''
, indicating that the upload was deleted. Adding additional logs locally, we found that the deleted uploads always have an empty set of commits from which they are visible.
There are two ways we can meet this condition:
- The commit actually no longer exists in the repo (force-pushed away and subsequently gc'd)
- The upload has not yet been inserted into the commit graph
When we run the upload expirer, we first select repositories to process, then select uploads within those repositories to match against retention policies. Those not protected by a policy will be marked for deletion.
When selecting repositories, we only select those with an up-to-date commit graph. This means that all uploads that are present in the database at that time are visible in the commit graph. The code is written assuming this holds over time.
It does not.
It's very possible for an upload to be finished processing between selecting a repository for processing and handling all of that repository's uploads. This race is actually trivially reproducible with three uploads being processed in sequence for the same repository.
The fix
We happen to already write the start time of the transaction that updates the commit graph to the database. We can use this data to ensure that we don't try to expire any uploads whose finished_at time is greater than the last commit graph update for its repository.
Thanks to coming to my PR.