codeintel: Consider dependency graph for LSIF data retention
Created by: efritz
This updates the query that periodically soft-deletes stale LSIF data to better support dependency indexing. Previously, we would soft-delete all records older than 720h (configurable) that were not used to resolve code intelligence queries for a non-stale tag or the tip of a non-stale branch. This ignored dependencies, meaning that go-to-definition could stop working for a repository we care about.
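For reference, the previous rule had roughly the shape below. This is a minimal sketch, not the production query; the table and column names (`lsif_uploads`, `lsif_uploads_visible_at_tip`, `uploaded_at`, `state`) are illustrative assumptions about the schema.

```sql
-- Simplified sketch of the old retention rule: soft-delete anything old
-- that isn't visible from a non-stale branch or tag. Names are assumed.
UPDATE lsif_uploads
SET state = 'deleted'  -- soft delete
WHERE uploaded_at < NOW() - INTERVAL '720 hours'
  AND id NOT IN (
    -- uploads resolving code intelligence for a non-stale branch or tag
    SELECT upload_id FROM lsif_uploads_visible_at_tip
  );
```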
Now, the query deletes all irrelevant LSIF data by:
- initializing a working set by adding all uploads younger than 720h as well as all uploads used to resolve code intelligence for a non-stale branch or tag
- expanding the working set by repeatedly adding the dependencies of uploads already in the set
- deleting all uploads outside of the working set
This ensures that as long as an upload is protected, all of its transitive dependencies are as well. Fixes #19120.
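Roughly, the new query computes the working set with a recursive CTE and then soft-deletes everything outside it. The sketch below is a minimal illustration under an assumed, simplified schema (`protected_uploads` and `upload_dependencies` stand in for the real visibility and dependency tables); it is not the production query.

```sql
-- Minimal sketch, assuming simplified tables: lsif_uploads(id, uploaded_at,
-- state), protected_uploads(upload_id) for uploads resolving code
-- intelligence for non-stale branches/tags, and
-- upload_dependencies(upload_id, dependency_upload_id) as the edge table.
WITH RECURSIVE working_set(id) AS (
    -- Base case: uploads younger than the retention window, plus uploads
    -- visible from a non-stale branch or tag.
    SELECT u.id
    FROM lsif_uploads u
    WHERE u.uploaded_at >= NOW() - INTERVAL '720 hours'
       OR u.id IN (SELECT upload_id FROM protected_uploads)

    UNION

    -- Recursive case: add each protected upload's dependencies. UNION
    -- (not UNION ALL) deduplicates, so traversal terminates even if the
    -- dependency graph contains cycles.
    SELECT d.dependency_upload_id
    FROM working_set ws
    JOIN upload_dependencies d ON d.upload_id = ws.id
)
-- Everything outside the working set is soft-deleted.
UPDATE lsif_uploads
SET state = 'deleted'
WHERE id NOT IN (SELECT id FROM working_set);
```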
This query may become a performance concern as the dependency graph grows. I've tried to make adjustments early on to address scaling, but have so far been unsuccessful:
- We thought of storing the set of branches and tags from which each upload is visible. This turns out to be a huge mapping: some uploads in our production database today are visible from 8k distinct interesting commits. We don't want to materialize what is essentially a cross-join of our dependency graph and our upload records in the database.
- We thought of selecting a small set of old candidates, then tracing each one up the dependency graph. This should be more efficient when the number of uploads to delete is small, as it would amount to only a few traversals of the dependency graph. However, I'm not sure how to do this in SQL without issuing n+1 queries from the application: there's no obvious way to share traversal work, and I'm not sure how to invoke a recursive CTE on each row of a candidate set (see the sketch after this list).
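For concreteness, the per-candidate traversal would look roughly like the sketch below (same assumed schema as above): walk up the dependency edges from one old candidate and check whether any transitive dependent is protected. The application would have to run this once per candidate, which is exactly the n+1 shape we'd like to avoid.

```sql
-- Hedged sketch of a single-candidate reverse traversal; $1 is one old
-- upload id. Uses the same illustrative schema as above.
WITH RECURSIVE dependents(upload_id) AS (
    -- Uploads that depend directly on the candidate.
    SELECT d.upload_id
    FROM upload_dependencies d
    WHERE d.dependency_upload_id = $1

    UNION

    -- Their transitive dependents.
    SELECT d.upload_id
    FROM dependents t
    JOIN upload_dependencies d ON d.dependency_upload_id = t.upload_id
)
-- The candidate must be kept if any of its transitive dependents is
-- protected.
SELECT EXISTS (
    SELECT 1
    FROM dependents
    WHERE upload_id IN (SELECT upload_id FROM protected_uploads)
) AS must_keep;
```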
We should check on this query periodically as we grow. It currently runs in 1m10s on a clone of the production database and is nowhere near a hot path.