codeintel: Improve `UpdateReferenceCounts`
Created by: efritz
Observation: In this query plan we materialize 106 million rows by a product of lsif_references
and the canonical package providers. What we actually want is a count for each of the 1,210 uploads. The way that the original query achieved this was by materializing the full join, sorting by id, then grouping. This took ~1.4GB to sort on disk each query.
Alternatively, we can use a correlated subquery/lateral join to do a single count per row. This is mechanically the same as the nested loop join that was previously used, except now we skip to the aggregate value instead of blowing up the intermediate result.
Before query:
WITH
packages_defined_by_target_uploads AS (
SELECT p.scheme, p.name, p.version
FROM lsif_packages p
WHERE p.dump_id = ANY ('{1714789}')
),
ranked_uploads_providing_packages AS (
SELECT
u.id,
p.scheme,
p.name,
p.version,
rank() OVER (
PARTITION BY
p.scheme, p.name, p.version,
u.repository_id, u.indexer, u.root
ORDER BY
u.committed_at,
u.id
) AS rank
FROM lsif_uploads u
LEFT JOIN lsif_packages p ON p.dump_id = u.id
WHERE
(
u.id = ANY ('{1714789}') OR
(p.scheme, p.name, p.version) IN (
SELECT p.scheme, p.name, p.version
FROM packages_defined_by_target_uploads p
)
) AND
u.state NOT IN ('deleted', 'deleting')
),
SELECT
ru.id,
count(*) AS count
FROM ranked_uploads_providing_packages ru
JOIN lsif_references r
ON
r.scheme = ru.scheme AND
r.name = ru.name AND
r.version = ru.version AND
r.dump_id != ru.id
WHERE ru.rank = 1
GROUP BY ru.id;
After query:
WITH
packages_defined_by_target_uploads AS (
SELECT p.scheme, p.name, p.version
FROM lsif_packages p
WHERE p.dump_id = ANY ('{1714789}')
),
ranked_uploads_providing_packages AS (
SELECT
u.id,
p.scheme,
p.name,
p.version,
rank() OVER (
PARTITION BY
p.scheme, p.name, p.version,
u.repository_id, u.indexer, u.root
ORDER BY
u.committed_at,
u.id
) AS rank
FROM lsif_uploads u
LEFT JOIN lsif_packages p ON p.dump_id = u.id
WHERE
(
u.id = ANY ('{1714789}') OR
(p.scheme, p.name, p.version) IN (
SELECT p.scheme, p.name, p.version
FROM packages_defined_by_target_uploads p
)
) AND
u.state NOT IN ('deleted', 'deleting')
),
SELECT
ru.id,
rc.count,
FROM ranked_uploads_providing_packages ru,
LATERAL (
SELECT COUNT(*) AS COUNT
FROM lsif_references r
WHERE
r.scheme = ru.scheme AND
r.name = ru.name AND
r.version = ru.version AND
r.dump_id != ru.id
) rc
WHERE ru.rank = 1;
The following high-cost nodes are no longer in the query plan:
I don't know in practice if this will help us with our current low queue throughput, but it should advance the bad-SQL part of the queries for now. We should still look into reducing lock contention on the lsif_uploads
table.
Test plan
Existing unit tests.