codeintel: Optimize nearest upload operations (!16636) · Merge requests · Administrator / sourcegraph

Administrator requested to merge ef/optimize-lsif-nearest-uploads into main Dec 11, 2020

Created by: efritz

We pre-calculate the set of uploads visible from all commits of repository and store the results in Postgres. This allows us to efficiently answer the question "which uploads can answer queries for this text document at this commit", which is necessary for almost all code intelligence API operations.

The limits of this method have been reached, especially at enterprise customers that have a larger number of distinct upload roots (uploading LSIF indexes for many projects within a monorepo). This is because the graph we write to Postgres is multiplicative in the size of the commit graph and the number of distinct roots.

Previous efforts have reduced the CPU and memory resources required to commit this data. However, we still are writing far too much data to Postgres in these enterprise environments. This PR addresses the amount of data that must be serialized in Postgres, which seems to be the common thread in the code intel issues arising in https://github.com/sourcegraph/customer/issues/156.

We no longer considering uploads defined on descendant commits as visible. Previously, we would use the closest upload visible to the commit looking either at ancestors or descendants to answer code intelligence queries. We've changed this behavior to look only at ancestors. This still covers the main use case where users are browsing in the UI ahead of what gitserver knows about: these commits will still see ancestor uploads while the current version is still being indexed. Looking in one direction reduces the amount of data we need to store by about 50%. We also discovered a bug with the descendant-direction search, which increased memory usage when fixed, so it seen sensible to just stop worrying about the reverse case, which was never necessary.

Split nearest upload records. Insight: that when looking only in one direction, many commits that don't define their own data will have the same visibility as their parent(s). The only commits we need to store data for are commits that: (a) define their own data, (b) have multiple parents, or (c) have children with multiple parents. All other commits with more than one nearest visible upload is encoded as a reference to a unique ancestor and the distance between them (the new lsif_nearest_uploads_links table). This costs less space to store when the number of visible uploads increases (e.g. the number of distinct roots within a repository increase).

Compress multiple nearest upload records. We were also able to greatly compress the total size of data stored in postgres with app-level compression. Instead of inserting a row for each upload visible at every commit, we insert a single row for each commit with a jsonb-encoded set of visible uploads. This is the same data, just encoded a way that drastically reduces the number of tuples needed to be written to a page and an index. This reduced the full loading time (including database writes) from 2m30s to ~30s. This also further reduced the amount of memory needed during this operation.

Closes https://github.com/sourcegraph/sourcegraph/issues/16667.

codeintel: Optimize nearest upload operations

Merge request reports