Skip to content

codeintel: remove redundant double `repo` table join

Warren Gifford requested to merge nsc/lsif-repo-joins into main

Created by: Strum355

Low hanging fruit PR that removes using certain Postgres views lsif_uploads_with_repository_name and lsif_indexes_with_repository_name in cases where we were doing join onto repo for authz-conds, as these views were already doing a join onto repo. This shaves between 1-3 seconds off the queries. The biggest offenders are yet to be addressed (huge heap fetches from the index-only scan on repo)

me the SQL wizard

Old Query + Plan
SELECT
	u.id,
	u.commit,
	u.queued_at,
	u.state,
	u.failure_message,
	u.started_at,
	u.finished_at,
	u.process_after,
	u.num_resets,
	u.num_failures,
	u.repository_id,
	u.repository_name,
	u.docker_steps,
	u.root,
	u.indexer,
	u.indexer_args,
	u.outfile,
	u.execution_logs,
	s.rank,
	u.local_steps,
	(
	    SELECT MAX(id) FROM lsif_uploads WHERE associated_index_id = u.id
    ) AS associated_upload_id
FROM lsif_indexes_with_repository_name u
LEFT JOIN (
    SELECT
	r.id,
	ROW_NUMBER() OVER (
	    ORDER BY COALESCE(r.process_after, r.queued_at), r.id
    ) as rank
    FROM lsif_indexes_with_repository_name r
    WHERE r.state = 'queued'
) s
ON u.id = s.id
JOIN repo ON repo.id = u.repository_id
ORDER BY queued_at DESC, u.id LIMIT 50;
Limit  (cost=500273.03..500406.95 rows=50 width=577) (actual time=4877.716..4986.262 rows=50 loops=1)
"  Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, r.name, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, s.rank, u.local_steps, ((SubPlan 1))"
  Buffers: shared hit=10342226
  ->  Result  (cost=500273.03..2317785.05 rows=678591 width=577) (actual time=4877.714..4986.252 rows=50 loops=1)
"        Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, r.name, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, s.rank, u.local_steps, (SubPlan 1)"
        Buffers: shared hit=10342226
        ->  Sort  (cost=500273.03..501969.51 rows=678591 width=573) (actual time=4877.648..4986.059 rows=50 loops=1)
"              Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, r.name, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, s.rank, u.local_steps"
"              Sort Key: u.queued_at DESC, u.id"
              Sort Method: top-N heapsort  Memory: 78kB
              Buffers: shared hit=10342076
              ->  Hash Left Join  (cost=162778.90..477730.73 rows=678591 width=573) (actual time=2300.504..4449.476 rows=1007576 loops=1)
"                    Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, r.name, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, s.rank, u.local_steps"
                    Hash Cond: (u.id = s.id)
                    Buffers: shared hit=10342076
                    ->  Gather  (cost=162611.64..475018.46 rows=678591 width=565) (actual time=2300.077..4013.694 rows=1007576 loops=1)
"                          Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, u.local_steps, r.name"
                          Workers Planned: 2
                          Workers Launched: 2
                          Buffers: shared hit=10341757
                          ->  Nested Loop  (cost=161611.64..406159.36 rows=282746 width=565) (actual time=2289.993..4171.644 rows=335859 loops=3)
"                                Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, u.local_steps, r.name"
                                Inner Unique: true
                                Buffers: shared hit=10341757
                                Worker 0: actual time=2285.207..4470.810 rows=394736 loops=1
                                  Buffers: shared hit=3719316
                                Worker 1: actual time=2285.218..4461.963 rows=397046 loops=1
                                  Buffers: shared hit=3728602
                                ->  Parallel Hash Join  (cost=161611.21..275512.27 rows=282746 width=569) (actual time=2289.971..2877.886 rows=335859 loops=3)
"                                      Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, u.local_steps, r.name, r.id"
                                      Inner Unique: true
                                      Hash Cond: (u.repository_id = r.id)
                                      Buffers: shared hit=5608147
                                      Worker 0: actual time=2285.187..2968.299 rows=394736 loops=1
                                        Buffers: shared hit=1865475
                                      Worker 1: actual time=2285.198..2960.412 rows=397046 loops=1
                                        Buffers: shared hit=1863186
                                      ->  Parallel Seq Scan on public.lsif_indexes u  (cost=0.00..112800.73 rows=419173 width=521) (actual time=0.008..296.627 rows=335859 loops=3)
"                                            Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, u.local_steps"
                                            Buffers: shared hit=108609
                                            Worker 0: actual time=0.008..345.063 rows=394736 loops=1
                                              Buffers: shared hit=42332
                                            Worker 1: actual time=0.009..345.255 rows=397046 loops=1
                                              Buffers: shared hit=42561
                                      ->  Parallel Hash  (cost=130930.73..130930.73 rows=2454438 width=48) (actual time=2271.239..2271.240 rows=1971750 loops=3)
"                                            Output: r.name, r.id"
                                            Buckets: 8388608  Batches: 1  Memory Usage: 505760kB
                                            Buffers: shared hit=5499478
                                            Worker 0: actual time=2285.025..2285.026 rows=1963275 loops=1
                                              Buffers: shared hit=1823113
                                            Worker 1: actual time=2285.041..2285.041 rows=1957885 loops=1
                                              Buffers: shared hit=1820595
                                            ->  Parallel Index Only Scan using repo_non_deleted_id_name_idx on public.repo r  (cost=0.43..130930.73 rows=2454438 width=48) (actual time=0.031..1356.792 rows=1971750 loops=3)
"                                                  Output: r.name, r.id"
                                                  Heap Fetches: 2199897
                                                  Buffers: shared hit=5499478
                                                  Worker 0: actual time=0.030..1352.703 rows=1963275 loops=1
                                                    Buffers: shared hit=1823113
                                                  Worker 1: actual time=0.039..1352.041 rows=1957885 loops=1
                                                    Buffers: shared hit=1820595
                                ->  Index Only Scan using repo_pkey on public.repo  (cost=0.43..0.46 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=1007576)
                                      Output: repo.id
                                      Index Cond: (repo.id = r.id)
                                      Heap Fetches: 910427
                                      Buffers: shared hit=4733610
                                      Worker 0: actual time=0.003..0.003 rows=1 loops=394736
                                        Buffers: shared hit=1853841
                                      Worker 1: actual time=0.003..0.003 rows=1 loops=397046
                                        Buffers: shared hit=1865416
                    ->  Hash  (cost=166.69..166.69 rows=45 width=16) (actual time=0.409..0.414 rows=69 loops=1)
"                          Output: s.rank, s.id"
                          Buckets: 1024  Batches: 1  Memory Usage: 12kB
                          Buffers: shared hit=319
                          ->  Subquery Scan on s  (cost=165.34..166.69 rows=45 width=16) (actual time=0.339..0.392 rows=69 loops=1)
"                                Output: s.rank, s.id"
                                Buffers: shared hit=319
                                ->  WindowAgg  (cost=165.34..166.24 rows=45 width=24) (actual time=0.338..0.376 rows=69 loops=1)
"                                      Output: u_1.id, row_number() OVER (?), (COALESCE(u_1.process_after, u_1.queued_at))"
                                      Buffers: shared hit=319
                                      ->  Sort  (cost=165.34..165.45 rows=45 width=16) (actual time=0.328..0.337 rows=69 loops=1)
"                                            Output: u_1.id, (COALESCE(u_1.process_after, u_1.queued_at))"
"                                            Sort Key: (COALESCE(u_1.process_after, u_1.queued_at)), u_1.id"
                                            Sort Method: quicksort  Memory: 28kB
                                            Buffers: shared hit=319
                                            ->  Nested Loop  (cost=0.86..164.10 rows=45 width=16) (actual time=0.068..0.293 rows=69 loops=1)
"                                                  Output: u_1.id, COALESCE(u_1.process_after, u_1.queued_at)"
                                                  Inner Unique: true
                                                  Buffers: shared hit=319
                                                  ->  Index Scan using lsif_indexes_state on public.lsif_indexes u_1  (cost=0.42..59.15 rows=67 width=28) (actual time=0.053..0.102 rows=69 loops=1)
"                                                        Output: u_1.id, u_1.process_after, u_1.queued_at, u_1.repository_id"
                                                        Index Cond: (u_1.state = 'queued'::text)
                                                        Buffers: shared hit=33
                                                  ->  Index Only Scan using repo_non_deleted_id_name_idx on public.repo r_1  (cost=0.43..1.57 rows=1 width=4) (actual time=0.002..0.002 rows=1 loops=69)
"                                                        Output: r_1.id, r_1.name"
                                                        Index Cond: (r_1.id = u_1.repository_id)
                                                        Heap Fetches: 67
                                                        Buffers: shared hit=286
        SubPlan 1
          ->  Aggregate  (cost=2.66..2.67 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=50)
                Output: max(lsif_uploads.id)
                Buffers: shared hit=150
                ->  Index Scan using lsif_uploads_associated_index_id on public.lsif_uploads  (cost=0.42..2.65 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=50)
                      Output: lsif_uploads.id
                      Index Cond: (lsif_uploads.associated_index_id = u.id)
                      Buffers: shared hit=150
Planning Time: 1.442 ms
Execution Time: 4986.415 ms
New Plan + Query
SELECT
	u.id,
	u.commit,
	u.queued_at,
	u.state,
	u.failure_message,
	u.started_at,
	u.finished_at,
	u.process_after,
	u.num_resets,
	u.num_failures,
	u.repository_id,
	repo.name,
	u.docker_steps,
	u.root,
	u.indexer,
	u.indexer_args,
	u.outfile,
	u.execution_logs,
	s.rank,
	u.local_steps,
	(
	    SELECT MAX(id) FROM lsif_uploads WHERE associated_index_id = u.id
    ) AS associated_upload_id
FROM lsif_indexes u
LEFT JOIN (
    SELECT
	r.id,
	ROW_NUMBER() OVER (
	    ORDER BY COALESCE(r.process_after, r.queued_at), r.id
    ) as rank
    FROM lsif_indexes_with_repository_name r
    WHERE r.state = 'queued'
) s
ON u.id = s.id
JOIN repo ON repo.id = u.repository_id
WHERE repo.deleted_at IS NULL
ORDER BY queued_at DESC, u.id LIMIT 50;
Limit  (cost=369625.94..369814.16 rows=50 width=577) (actual time=3589.991..3673.780 rows=50 loops=1)
"  Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, repo.name, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, s.rank, u.local_steps, ((SubPlan 1))"
  Buffers: shared hit=5608955
  ->  Result  (cost=369625.94..2924013.65 rows=678591 width=577) (actual time=3589.990..3673.770 rows=50 loops=1)
"        Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, repo.name, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, s.rank, u.local_steps, (SubPlan 1)"
        Buffers: shared hit=5608955
        ->  Sort  (cost=369625.94..371322.42 rows=678591 width=573) (actual time=3589.948..3673.475 rows=50 loops=1)
"              Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, repo.name, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, s.rank, u.local_steps"
"              Sort Key: u.queued_at DESC, u.id"
              Sort Method: top-N heapsort  Memory: 81kB
              Buffers: shared hit=5608756
              ->  Hash Left Join  (cost=162778.46..347083.64 rows=678591 width=573) (actual time=2349.443..3185.644 rows=1007727 loops=1)
"                    Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, repo.name, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, s.rank, u.local_steps"
                    Hash Cond: (u.id = s.id)
                    Buffers: shared hit=5608756
                    ->  Gather  (cost=162611.21..344371.37 rows=678591 width=565) (actual time=2349.118..2750.451 rows=1007727 loops=1)
"                          Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, u.local_steps, repo.name"
                          Workers Planned: 2
                          Workers Launched: 2
                          Buffers: shared hit=5608630
                          ->  Parallel Hash Join  (cost=161611.21..275512.27 rows=282746 width=565) (actual time=2338.288..2866.322 rows=335909 loops=3)
"                                Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, u.local_steps, repo.name"
                                Inner Unique: true
                                Hash Cond: (u.repository_id = repo.id)
                                Buffers: shared hit=5608630
                                Worker 0: actual time=2333.120..3113.884 rows=488314 loops=1
                                  Buffers: shared hit=1852067
                                Worker 1: actual time=2333.128..3103.832 rows=506976 loops=1
                                  Buffers: shared hit=1911134
                                ->  Parallel Seq Scan on public.lsif_indexes u  (cost=0.00..112800.73 rows=419173 width=521) (actual time=0.013..275.659 rows=335909 loops=3)
"                                      Output: u.id, u.commit, u.queued_at, u.state, u.failure_message, u.started_at, u.finished_at, u.process_after, u.num_resets, u.num_failures, u.repository_id, u.docker_steps, u.root, u.indexer, u.indexer_args, u.outfile, u.execution_logs, u.local_steps"
                                      Buffers: shared hit=108609
                                      Worker 0: actual time=0.015..408.241 rows=488314 loops=1
                                        Buffers: shared hit=51959
                                      Worker 1: actual time=0.015..406.191 rows=506976 loops=1
                                        Buffers: shared hit=53738
                                ->  Parallel Hash  (cost=130930.73..130930.73 rows=2454438 width=48) (actual time=2318.496..2318.497 rows=1971791 loops=3)
"                                      Output: repo.name, repo.id"
                                      Buckets: 8388608  Batches: 1  Memory Usage: 505792kB
                                      Buffers: shared hit=5499915
                                      Worker 0: actual time=2332.972..2332.973 rows=1935991 loops=1
                                        Buffers: shared hit=1800055
                                      Worker 1: actual time=2332.939..2332.940 rows=1997412 loops=1
                                        Buffers: shared hit=1857343
                                      ->  Parallel Index Only Scan using repo_non_deleted_id_name_idx on public.repo  (cost=0.43..130930.73 rows=2454438 width=48) (actual time=0.033..1381.237 rows=1971791 loops=3)
"                                            Output: repo.name, repo.id"
                                            Heap Fetches: 2200449
                                            Buffers: shared hit=5499915
                                            Worker 0: actual time=0.047..1377.731 rows=1935991 loops=1
                                              Buffers: shared hit=1800055
                                            Worker 1: actual time=0.029..1378.511 rows=1997412 loops=1
                                              Buffers: shared hit=1857343
                    ->  Hash  (cost=166.69..166.69 rows=45 width=16) (actual time=0.301..0.305 rows=15 loops=1)
"                          Output: s.rank, s.id"
                          Buckets: 1024  Batches: 1  Memory Usage: 9kB
                          Buffers: shared hit=126
                          ->  Subquery Scan on s  (cost=165.34..166.69 rows=45 width=16) (actual time=0.279..0.294 rows=15 loops=1)
"                                Output: s.rank, s.id"
                                Buffers: shared hit=126
                                ->  WindowAgg  (cost=165.34..166.24 rows=45 width=24) (actual time=0.277..0.287 rows=15 loops=1)
"                                      Output: u_1.id, row_number() OVER (?), (COALESCE(u_1.process_after, u_1.queued_at))"
                                      Buffers: shared hit=126
                                      ->  Sort  (cost=165.34..165.45 rows=45 width=16) (actual time=0.266..0.269 rows=15 loops=1)
"                                            Output: u_1.id, (COALESCE(u_1.process_after, u_1.queued_at))"
"                                            Sort Key: (COALESCE(u_1.process_after, u_1.queued_at)), u_1.id"
                                            Sort Method: quicksort  Memory: 25kB
                                            Buffers: shared hit=126
                                            ->  Nested Loop  (cost=0.86..164.10 rows=45 width=16) (actual time=0.090..0.255 rows=15 loops=1)
"                                                  Output: u_1.id, COALESCE(u_1.process_after, u_1.queued_at)"
                                                  Inner Unique: true
                                                  Buffers: shared hit=126
                                                  ->  Index Scan using lsif_indexes_state on public.lsif_indexes u_1  (cost=0.42..59.15 rows=67 width=28) (actual time=0.072..0.181 rows=15 loops=1)
"                                                        Output: u_1.id, u_1.process_after, u_1.queued_at, u_1.repository_id"
                                                        Index Cond: (u_1.state = 'queued'::text)
                                                        Buffers: shared hit=59
                                                  ->  Index Only Scan using repo_non_deleted_id_name_idx on public.repo r  (cost=0.43..1.57 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=15)
"                                                        Output: r.id, r.name"
                                                        Index Cond: (r.id = u_1.repository_id)
                                                        Heap Fetches: 13
                                                        Buffers: shared hit=67
        SubPlan 1
          ->  Aggregate  (cost=3.74..3.75 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=50)
                Output: max(lsif_uploads.id)
                Buffers: shared hit=199
                ->  Index Scan using lsif_uploads_associated_index_id on public.lsif_uploads  (cost=0.42..3.74 rows=2 width=4) (actual time=0.003..0.004 rows=0 loops=50)
                      Output: lsif_uploads.id
                      Index Cond: (lsif_uploads.associated_index_id = u.id)
                      Buffers: shared hit=199
Planning Time: 1.430 ms
Execution Time: 3673.916 ms

Test plan

Covered by existing testing. Ran queries against production database for scale.

Merge request reports

Loading