Skip to content

Speed up ListIndexableRepos query for uncloned

Warren Gifford requested to merge es/faster-indexable-repos into main

Created by: eseliger

Since we introduced an invariant that gitserver_repos MUST have a record for each repo, we no longer need to do a left join here, and hence can optimize. Still not fast, but faster.

Before:

                                                                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=1385996.09..1428720.77 rows=366186 width=45) (actual time=7499.200..7589.828 rows=4914 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=1384996.07..1385453.80 rows=183093 width=45) (actual time=7484.861..7485.005 rows=1638 loops=3)
         Sort Key: repo.stars DESC NULLS LAST
         Sort Method: quicksort  Memory: 248kB
         Worker 0:  Sort Method: quicksort  Memory: 257kB
         Worker 1:  Sort Method: quicksort  Memory: 256kB
         ->  Parallel Hash Left Join  (cost=480914.62..1368991.71 rows=183093 width=45) (actual time=2002.405..7483.617 rows=1638 loops=3)
               Hash Cond: (repo.id = gr.repo_id)
               Filter: ((gr.clone_status IS NULL) OR (gr.clone_status = 'not_cloned'::text))
               Rows Removed by Filter: 994128
               ->  Parallel Seq Scan on repo  (cost=104508.90..988268.68 rows=1644687 width=45) (actual time=312.475..4995.084 rows=995766 loops=3)
                     Filter: ((deleted_at IS NULL) AND (blocked IS NULL) AND (((stars >= 5) AND (NOT COALESCE(fork, false)) AND (NOT archived)) OR (lower((name)::text) ~ '^(src\.fedoraproject\.org|maven|npm|jdk)'::text) OR (hashed SubPlan 1)))
                     Rows Removed by Filter: 1258974
                     SubPlan 1
                       ->  Append  (cost=6113.13..103243.67 rows=506091 width=4) (actual time=45.900..249.734 rows=476519 loops=3)
                             ->  Bitmap Heap Scan on external_service_repos  (cost=6113.13..95632.31 rows=504001 width=4) (actual time=45.899..190.160 rows=474429 loops=3)
                                   Recheck Cond: ((user_id IS NOT NULL) OR (org_id IS NOT NULL))
                                   Heap Blocks: exact=39490
                                   ->  BitmapOr  (cost=6113.13..6113.13 rows=504118 width=0) (actual time=38.602..38.603 rows=0 loops=3)
                                         ->  Bitmap Index Scan on external_service_user_repos_idx  (cost=0.00..5836.75 rows=502372 width=0) (actual time=38.509..38.509 rows=472829 loops=3)
                                               Index Cond: (user_id IS NOT NULL)
                                         ->  Bitmap Index Scan on external_service_repos_org_id_idx  (cost=0.00..24.38 rows=1746 width=0) (actual time=0.092..0.092 rows=1600 loops=3)
                                               Index Cond: (org_id IS NOT NULL)
                             ->  Seq Scan on user_public_repos  (cost=0.00..40.90 rows=2090 width=4) (actual time=0.041..0.412 rows=2090 loops=3)
               ->  Parallel Hash  (cost=341028.10..341028.10 rows=2830210 width=11) (actual time=1477.973..1477.974 rows=2254740 loops=3)
                     Buckets: 8388608  Batches: 1  Memory Usage: 382976kB
                     ->  Parallel Seq Scan on gitserver_repos gr  (cost=0.00..341028.10 rows=2830210 width=11) (actual time=0.014..661.638 rows=2254740 loops=3)
 Planning Time: 1.929 ms
 Execution Time: 7592.740 ms
(31 rows)

After

                                                                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather Merge  (cost=1164998.24..1207722.92 rows=366186 width=45) (actual time=5885.152..5900.653 rows=4913 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=1163998.21..1164455.94 rows=183093 width=45) (actual time=5879.589..5879.741 rows=1638 loops=3)
         Sort Key: repo.stars DESC NULLS LAST
         Sort Method: quicksort  Memory: 260kB
         Worker 0:  Sort Method: quicksort  Memory: 257kB
         Worker 1:  Sort Method: quicksort  Memory: 244kB
         ->  Parallel Hash Join  (cost=259916.77..1147993.85 rows=183093 width=45) (actual time=1061.089..5878.503 rows=1638 loops=3)
               Hash Cond: (repo.id = gr.repo_id)
               ->  Parallel Seq Scan on repo  (cost=104508.90..988268.68 rows=1644687 width=45) (actual time=286.767..4787.528 rows=995766 loops=3)
                     Filter: ((deleted_at IS NULL) AND (blocked IS NULL) AND (((stars >= 5) AND (NOT COALESCE(fork, false)) AND (NOT archived)) OR (lower((name)::text) ~ '^(src\.fedoraproject\.org|maven|npm|jdk)'::text) OR (hashed SubPlan 1)))
                     Rows Removed by Filter: 1258975
                     SubPlan 1
                       ->  Append  (cost=6113.13..103243.67 rows=506091 width=4) (actual time=35.820..253.044 rows=476519 loops=3)
                             ->  Bitmap Heap Scan on external_service_repos  (cost=6113.13..95632.31 rows=504001 width=4) (actual time=35.818..193.572 rows=474429 loops=3)
                                   Recheck Cond: ((user_id IS NOT NULL) OR (org_id IS NOT NULL))
                                   Heap Blocks: exact=39490
                                   ->  BitmapOr  (cost=6113.13..6113.13 rows=504118 width=0) (actual time=28.704..28.705 rows=0 loops=3)
                                         ->  Bitmap Index Scan on external_service_user_repos_idx  (cost=0.00..5836.75 rows=502372 width=0) (actual time=28.623..28.624 rows=472829 loops=3)
                                               Index Cond: (user_id IS NOT NULL)
                                         ->  Bitmap Index Scan on external_service_repos_org_id_idx  (cost=0.00..24.38 rows=1746 width=0) (actual time=0.078..0.079 rows=1600 loops=3)
                                               Index Cond: (org_id IS NOT NULL)
                             ->  Seq Scan on user_public_repos  (cost=0.00..40.90 rows=2090 width=4) (actual time=0.032..0.428 rows=2090 loops=3)
               ->  Parallel Hash  (cost=151488.03..151488.03 rows=313587 width=4) (actual time=601.020..601.025 rows=247423 loops=3)
                     Buckets: 1048576  Batches: 1  Memory Usage: 37248kB
                     ->  Parallel Index Only Scan using gitserver_repos_not_cloned_status_idx on gitserver_repos gr  (cost=0.42..151488.03 rows=313587 width=4) (actual time=0.163..507.651 rows=247423 loops=3)
                           Heap Fetches: 1081295
 Planning Time: 1.356 ms
 Execution Time: 5902.782 ms
(30 rows)

Test plan

Validated with query plans this change is good.

Merge request reports

Loading