Skip to content

codeintel: Improve `UpdateReferenceCounts`

Administrator requested to merge ef/lat-join into main

Created by: efritz

Observation: In this query plan we materialize 106 million rows by a product of lsif_references and the canonical package providers. What we actually want is a count for each of the 1,210 uploads. The way that the original query achieved this was by materializing the full join, sorting by id, then grouping. This took ~1.4GB to sort on disk each query.

Alternatively, we can use a correlated subquery/lateral join to do a single count per row. This is mechanically the same as the nested loop join that was previously used, except now we skip to the aggregate value instead of blowing up the intermediate result.

Before query:

WITH
packages_defined_by_target_uploads AS (
	SELECT p.scheme, p.name, p.version
	FROM lsif_packages p
	WHERE p.dump_id = ANY ('{1714789}')
),
ranked_uploads_providing_packages AS (
	SELECT
		u.id,
		p.scheme,
		p.name,
		p.version,
		rank() OVER (
			PARTITION BY
				p.scheme, p.name, p.version,
				u.repository_id, u.indexer, u.root
			ORDER BY
				u.committed_at,
				u.id
		) AS rank
	FROM lsif_uploads u
	LEFT JOIN lsif_packages p ON p.dump_id = u.id
	WHERE
		(
			u.id = ANY ('{1714789}') OR
			(p.scheme, p.name, p.version) IN (
				SELECT p.scheme, p.name, p.version
				FROM packages_defined_by_target_uploads p
			)
		) AND
		u.state NOT IN ('deleted', 'deleting')
),
SELECT
    ru.id,
    count(*) AS count
FROM ranked_uploads_providing_packages ru
JOIN lsif_references r
ON
    r.scheme = ru.scheme AND
    r.name = ru.name AND
    r.version = ru.version AND
    r.dump_id != ru.id
WHERE ru.rank = 1
GROUP BY ru.id;

After query:

WITH
packages_defined_by_target_uploads AS (
	SELECT p.scheme, p.name, p.version
	FROM lsif_packages p
	WHERE p.dump_id = ANY ('{1714789}')
),
ranked_uploads_providing_packages AS (
	SELECT
		u.id,
		p.scheme,
		p.name,
		p.version,
		rank() OVER (
			PARTITION BY
				p.scheme, p.name, p.version,
				u.repository_id, u.indexer, u.root
			ORDER BY
				u.committed_at,
				u.id
		) AS rank
	FROM lsif_uploads u
	LEFT JOIN lsif_packages p ON p.dump_id = u.id
	WHERE
		(
			u.id = ANY ('{1714789}') OR
			(p.scheme, p.name, p.version) IN (
				SELECT p.scheme, p.name, p.version
				FROM packages_defined_by_target_uploads p
			)
		) AND
		u.state NOT IN ('deleted', 'deleting')
),
SELECT
    ru.id,
    rc.count,
FROM ranked_uploads_providing_packages ru,
LATERAL (
    SELECT COUNT(*) AS COUNT
    FROM lsif_references r
    WHERE
        r.scheme = ru.scheme AND
        r.name = ru.name AND
        r.version = ru.version AND
        r.dump_id != ru.id
) rc
WHERE ru.rank = 1;

The following high-cost nodes are no longer in the query plan:

Screen Shot 2022-08-22 at 7 17 52 PM

I don't know in practice if this will help us with our current low queue throughput, but it should advance the bad-SQL part of the queries for now. We should still look into reducing lock contention on the lsif_uploads table.

Test plan

Existing unit tests.

Merge request reports

Loading