
codeintel: Fix unbounded memory in lsifstore locations method

Administrator requested to merge ef/18246 into main

Created by: efritz

Fixes https://github.com/sourcegraph/sourcegraph/issues/18246. This makes a fairly simple change to the locations method. For background, this method takes a dump id and a set of "result set" ids (definition or reference vertex identifiers in a processed LSIF index), and it returns a map from the result set id to a list of locations (document paths and ranges within documents).

We implement this by first mapping each result set id to the index of the "result chunk" that stores it. Result chunks store data that can be referenced cross-document within a dump. We then ask the database for each of those result chunk payloads. Each result chunk contains data of the form result set id -> document id -> [range id] and document id -> document path. Once we extract the appropriate data, we query for each document relevant to the result set and resolve the range identifiers into actual ranges.
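To make the shapes above concrete, here is a minimal Go sketch of a result chunk and of grouping result set ids by chunk index. The type and function names (`ID`, `ResultChunk`, `chunkIndex`, `groupByChunk`) are hypothetical, and the hash-modulo scheme is only an assumption for illustration; the description above does not say how an id is mapped to its index.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ID is a hypothetical alias for LSIF identifiers (result sets, documents, ranges).
type ID string

// ResultChunk mirrors the two mappings described above:
//   document id -> document path
//   result set id -> document id -> [range id]
type ResultChunk struct {
	DocumentPaths      map[ID]string
	DocumentIDRangeIDs map[ID]map[ID][]ID
}

// chunkIndex maps an id onto a chunk index in [0, numChunks).
// ASSUMPTION: hashing modulo the chunk count is used here purely for
// illustration. numChunks must be positive.
func chunkIndex(id ID, numChunks int) int {
	h := fnv.New32a()
	h.Write([]byte(id))
	return int(h.Sum32() % uint32(numChunks))
}

// groupByChunk buckets ids by the chunk that would hold them, so each
// chunk payload only needs to be requested once.
func groupByChunk(ids []ID, numChunks int) map[int][]ID {
	groups := make(map[int][]ID)
	for _, id := range ids {
		idx := chunkIndex(id, numChunks)
		groups[idx] = append(groups[idx], id)
	}
	return groups
}

func main() {
	chunk := ResultChunk{
		DocumentPaths: map[ID]string{"d1": "cmd/main.go"},
		DocumentIDRangeIDs: map[ID]map[ID][]ID{
			"rs1": {"d1": {"r1", "r2"}},
		},
	}

	// Which chunk indexes do we need to fetch for these result sets?
	for idx, ids := range groupByChunk([]ID{"rs1", "rs2", "rs3"}, 4) {
		fmt.Println("chunk", idx, "holds", ids)
	}

	// Extract document paths and range ids for one result set.
	for docID, rangeIDs := range chunk.DocumentIDRangeIDs["rs1"] {
		fmt.Println(chunk.DocumentPaths[docID], rangeIDs)
	}
}
```

Grouping first means a chunk shared by several result set ids is read once rather than once per id.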

Before this PR, we would load all result chunks into memory at once, and then load all documents into memory at once. There can be many of each, and each payload can be large. I suspect this is currently contributing to high memory usage at a few customers.

This PR now scans only one result chunk or document at a time and processes everything it needs from that row before releasing the reference to the object and moving on to the next row. That's basically the only change here.
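The row-at-a-time pattern can be sketched roughly as follows. This is not the code in this PR: `rowSource`, `sliceRows`, `Document`, `visitDocuments`, and `resolveLines` are hypothetical names, the in-memory `sliceRows` stands in for a database cursor, and JSON stands in for whatever encoding the real payloads use.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Document is a hypothetical decoded payload: a path plus range id -> line.
type Document struct {
	Path   string         `json:"path"`
	Ranges map[string]int `json:"ranges"`
}

// rowSource is a stand-in for a database cursor (ASSUMPTION: simplified API).
type rowSource interface {
	Next() bool
	Payload() []byte
}

// sliceRows is an in-memory rowSource used only for this example.
type sliceRows struct {
	rows [][]byte
	pos  int
}

func (s *sliceRows) Next() bool      { s.pos++; return s.pos <= len(s.rows) }
func (s *sliceRows) Payload() []byte { return s.rows[s.pos-1] }

// visitDocuments decodes a single row, hands it to visit, and then lets it
// go out of scope before decoding the next one. At most one decoded
// document is referenced by this loop at any time.
func visitDocuments(rows rowSource, visit func(Document) error) error {
	for rows.Next() {
		var doc Document
		if err := json.Unmarshal(rows.Payload(), &doc); err != nil {
			return err
		}
		if err := visit(doc); err != nil {
			return err
		}
	}
	return nil
}

// resolveLines keeps only the small resolved values (path -> lines) for the
// requested range ids, not the documents themselves.
func resolveLines(rows rowSource, wanted map[string][]string) (map[string][]int, error) {
	out := make(map[string][]int)
	err := visitDocuments(rows, func(doc Document) error {
		for _, rangeID := range wanted[doc.Path] {
			if line, ok := doc.Ranges[rangeID]; ok {
				out[doc.Path] = append(out[doc.Path], line)
			}
		}
		return nil
	})
	return out, err
}

func main() {
	rows := &sliceRows{rows: [][]byte{
		[]byte(`{"path":"a.go","ranges":{"r1":3,"r2":9}}`),
		[]byte(`{"path":"b.go","ranges":{"r3":1}}`),
	}}
	lines, err := resolveLines(rows, map[string][]string{"a.go": {"r2"}, "b.go": {"r3"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(lines)
}
```

The point of the callback shape is that only the extracted results accumulate; peak memory on our side is then bounded by the largest single payload rather than by the sum of all of them.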

There's a bit more work to do to ensure we don't blow out memory in Postgres: even if we page through results on our end, it has its own batching mechanisms that may preload a lot of this data and hold it in memory while we process it. This should be a simple addition after this lands.
