batches: DB fell into a state with very poor statistics, causing all dequeues to fail
Created by: eseliger
- the queue got stuck on k8s when 32 executor machines were online
- no visible problem, it's just not dequeuing
- CPU of pgsql-0 is constantly at 100%
- found this in executor machine logs: Jun 20 18:40:57 batches-dogfood-sourcegraph-executor-1ckk executor[3292]: {"SeverityText":"ERROR","Timestamp":1655750457528289914,"InstrumentationScope":"executor_processor","Caller":"workerutil/worker.go:192","Function":"github.com/sourcegraph/sourcegraph/internal/workerutil.(*Worker).Start","Body":"Failed to dequeue and handle record","Resource":{"service.name":"executor","service.version":"155249_2022-06-17_0c2ed651b4ee","service.instance.id":"batches-dogfood-sourcegraph-executor-1ckk"},"Attributes":{"name":"executor_batches_worker","name":"executor_batches_worker","error":"store.Dequeue: unexpected status code 504"}}
- started a rollout of new instances - no signs of improvement; DB CPU usage dropped briefly, but is now back at 100%
- closed the browser tab to make sure it wasn't somehow holding a lock with its constant refetches; it didn't change anything
- scaling the instance group down to 1 manually to see if it helps
- a minute later the CPU is not going down yet; no sign of improvement either, still 0 dequeues
- update: CPU is down now, still no successful dequeues
- a few minutes later, still no dequeues. Running systemctl restart on the executor on the last VM now
- GCP seems to have restarted the VM now? Is this a health check thing? Weird. On second thought, it's probably the script that reboots the VM after a 0 exit code from the executor.
- Machine is back up, still 0 dequeues.
- No logs in the frontend; the Grafana metric still shows dequeue errors.
- Looked at the running queries: quite a few very slow ones, many from listRepos, but nothing that had been running for minutes
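For reference, this is roughly the kind of lookup I was using to inspect running statements - a standard pg_stat_activity query, nothing Sourcegraph-specific:

```sql
-- List non-idle backends ordered by how long their current query has run.
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 100)   AS query_snippet
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```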
- Killed the DB now, let's see what happens
- Nothing happens
- Going through the app, everything seems to work??
- I see queries approaching 1 minute of age and then disappearing from the list - maybe that's a timeout in our apiclient?
- Still no clue where it hangs
- Running the dequeue SQL query by hand now; it's not returning
- It returned after a bit more than a minute... a record that's still 'queued'?? How is that possible?
- List changeset sync data is also appearing in long running queries, no idea how that's related
- The very slow dequeue query that returns a job that is still 'queued' returns ascending IDs
- Running with EXPLAIN ANALYZE now. Monstrous plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=17385.21..17387.48 rows=1 width=17) (actual time=116832.835..116833.067 rows=1 loops=1)
Join Filter: (j.id = queue_candidates.id)
Rows Removed by Join Filter: 1204
CTE candidate
-> Limit (cost=8694.24..8694.25 rows=1 width=62) (actual time=57918.292..57918.301 rows=1 loops=1)
-> LockRows (cost=8694.24..8694.25 rows=1 width=62) (actual time=57918.291..57918.300 rows=1 loops=1)
-> Sort (cost=8694.24..8694.24 rows=1 width=62) (actual time=57918.254..57918.263 rows=1 loops=1)
Sort Key: pc."order"
Sort Method: quicksort Memory: 95kB
-> Nested Loop (cost=8690.70..8694.23 rows=1 width=62) (actual time=57506.624..57918.045 rows=500 loops=1)
Join Filter: (batch_spec_workspace_execution_jobs.id = pc.candidate_id)
Rows Removed by Join Filter: 602000
-> Index Scan using batch_spec_workspace_execution_jobs_state on batch_spec_workspace_execution_jobs (cost=0.42..3.90 rows=1 width=14) (actual time=0.087..0.531 rows=1205 loops=1)
Index Cond: (state = ANY ('{queued,errored}'::text[]))
-> Subquery Scan on pc (cost=8690.28..8690.31 rows=1 width=56) (actual time=47.725..48.037 rows=500 loops=1205)
-> Limit (cost=8690.28..8690.30 rows=1 width=24) (actual time=47.724..47.944 rows=500 loops=1205)
-> WindowAgg (cost=8690.28..8690.30 rows=1 width=24) (actual time=47.724..47.912 rows=500 loops=1205)
-> Sort (cost=8690.28..8690.29 rows=1 width=16) (actual time=47.723..47.745 rows=501 loops=1205)
Sort Key: (row_number() OVER (?))
Sort Method: quicksort Memory: 105kB
-> Nested Loop Left Join (cost=8688.26..8690.27 rows=1 width=16) (actual time=56832.214..57505.935 rows=1205 loops=1)
Join Filter: (j_1.id = queue_candidates_1.id)
Rows Removed by Join Filter: 1450820
-> Index Scan using batch_spec_workspace_execution_jobs_state on batch_spec_workspace_execution_jobs j_1 (cost=0.42..2.38 rows=1 width=8) (actual time=0.040..1.163 rows=1205 loops=1)
Index Cond: (state = 'queued'::text)
Filter: ((process_after IS NULL) OR (process_after <= now()))
-> WindowAgg (cost=8687.84..8687.87 rows=1 width=24) (actual time=47.165..47.656 rows=1205 loops=1205)
-> Subquery Scan on queue_candidates_1 (cost=8687.84..8687.86 rows=1 width=8) (actual time=47.164..47.334 rows=1205 loops=1205)
-> Sort (cost=8687.84..8687.85 rows=1 width=36) (actual time=47.164..47.224 rows=1205 loops=1205)
Sort Key: (rank() OVER (?)), (max(exec_3.started_at)) NULLS FIRST
Sort Method: quicksort Memory: 143kB
-> WindowAgg (cost=8687.81..8687.83 rows=1 width=36) (actual time=56831.106..56831.824 rows=1205 loops=1)
-> Sort (cost=8687.81..8687.81 rows=1 width=28) (actual time=56831.084..56831.138 rows=1205 loops=1)
Sort Key: exec_3.user_id, exec_2.created_at, exec_2.id
Sort Method: quicksort Memory: 143kB
-> Nested Loop (cost=0.84..8687.80 rows=1 width=28) (actual time=48.804..56828.179 rows=1205 loops=1)
Join Filter: (exec_2.user_id = exec_3.user_id)
Rows Removed by Join Filter: 4820
-> Index Scan using batch_spec_workspace_execution_jobs_state on batch_spec_workspace_execution_jobs exec_2 (cost=0.42..2.37 rows=1 width=20) (actual time=0.034..3.575 rows=1205 loops=1)
Index Cond: (state = 'queued'::text)
-> GroupAggregate (cost=0.42..8685.06 rows=16 width=12) (actual time=0.008..47.154 rows=5 loops=1205)
Group Key: exec_3.user_id
-> Index Only Scan using batch_spec_workspace_execution_jobs_last_dequeue on batch_spec_workspace_execution_jobs exec_3 (cost=0.42..7232.36 rows=290509 width=12) (actual time=0.003..28.038 rows=282163 loops=1205)
Heap Fetches: 4100615
CTE updated_record
-> Update on batch_spec_workspace_execution_jobs batch_spec_workspace_execution_jobs_1 (cost=0.45..2.67 rows=1 width=290) (actual time=0.288..0.288 rows=0 loops=1)
-> Nested Loop (cost=0.45..2.67 rows=1 width=290) (actual time=0.077..0.077 rows=1 loops=1)
-> HashAggregate (cost=0.02..0.03 rows=1 width=40) (actual time=0.010..0.010 rows=1 loops=1)
Group Key: candidate_1.id
-> CTE Scan on candidate candidate_1 (cost=0.00..0.02 rows=1 width=40) (actual time=0.005..0.006 rows=1 loops=1)
-> Index Scan using batch_spec_workspace_execution_jobs_pkey on batch_spec_workspace_execution_jobs batch_spec_workspace_execution_jobs_1 (cost=0.42..2.64 rows=1 width=222) (actual time=0.063..0.063 rows=1 loops=1)
Index Cond: (id = candidate_1.id)
-> Nested Loop (cost=0.45..2.67 rows=1 width=17) (actual time=57918.349..57918.352 rows=1 loops=1)
-> HashAggregate (cost=0.02..0.03 rows=1 width=8) (actual time=57918.300..57918.301 rows=1 loops=1)
Group Key: candidate.id
-> CTE Scan on candidate (cost=0.00..0.02 rows=1 width=8) (actual time=57918.294..57918.295 rows=1 loops=1)
-> Index Scan using batch_spec_workspace_execution_jobs_pkey on batch_spec_workspace_execution_jobs j (cost=0.42..2.64 rows=1 width=17) (actual time=0.046..0.047 rows=1 loops=1)
Index Cond: (id = candidate.id)
-> Subquery Scan on queue_candidates (cost=8687.84..8687.86 rows=1 width=24) (actual time=58914.482..58914.633 rows=1205 loops=1)
-> Sort (cost=8687.84..8687.85 rows=1 width=36) (actual time=58914.480..58914.529 rows=1205 loops=1)
Sort Key: (rank() OVER (?)), (max(exec_1.started_at)) NULLS FIRST
Sort Method: quicksort Memory: 143kB
-> WindowAgg (cost=8687.81..8687.83 rows=1 width=36) (actual time=58913.415..58914.151 rows=1205 loops=1)
-> Sort (cost=8687.81..8687.81 rows=1 width=28) (actual time=58913.390..58913.442 rows=1205 loops=1)
Sort Key: exec_1.user_id, exec.created_at, exec.id
Sort Method: quicksort Memory: 143kB
-> Nested Loop (cost=0.84..8687.80 rows=1 width=28) (actual time=45.145..58910.444 rows=1205 loops=1)
Join Filter: (exec.user_id = exec_1.user_id)
Rows Removed by Join Filter: 4820
-> Index Scan using batch_spec_workspace_execution_jobs_state on batch_spec_workspace_execution_jobs exec (cost=0.42..2.37 rows=1 width=20) (actual time=0.042..4.225 rows=1205 loops=1)
Index Cond: (state = 'queued'::text)
-> GroupAggregate (cost=0.42..8685.06 rows=16 width=12) (actual time=0.008..48.881 rows=5 loops=1205)
Group Key: exec_1.user_id
-> Index Only Scan using batch_spec_workspace_execution_jobs_last_dequeue on batch_spec_workspace_execution_jobs exec_1 (cost=0.42..7232.36 rows=290509 width=12) (actual time=0.003..28.453 rows=282163 loops=1205)
Heap Fetches: 4100615
Planning Time: 2.259 ms
Execution Time: 116834.020 ms
(77 rows)
Not really giving me a hint as to what's going on, though
- Maybe it's time for dinner
- The resetter also doesn't seem to work? There are now 5 jobs in 'processing' (are those coming from the query that returned wrong data??)
- Yes they are, and when re-selecting their state they are indeed 'processing'. Now certain that the resetter doesn't work.
- Maybe killing the worker does something? I wouldn't understand why, though, since the problem survived a DB restart
- No visible change
- Killed all frontends at the same time
- No visible change
- 1h later; took a break. Nothing seems to have changed: the resetter didn't do anything and the executor still hasn't dequeued anything either
- Just learned: Postgres's CTE implementation doesn't forward the effects of an UPDATE in one CTE to the next CTE or the main query; all parts of a WITH statement run against the same snapshot of the data from the start of the statement. So seeing 'queued' is actually expected.
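A minimal illustration of that snapshot behavior, using a hypothetical jobs table (not the real schema): the main query still sees the pre-UPDATE row, while only the CTE's RETURNING clause sees the new value.

```sql
-- Hypothetical setup, not the real Sourcegraph schema:
--   CREATE TABLE jobs (id int PRIMARY KEY, state text);
--   INSERT INTO jobs VALUES (1, 'queued');

WITH updated AS (
    UPDATE jobs SET state = 'processing' WHERE id = 1
    RETURNING id, state
)
SELECT j.state AS seen_by_main_query,  -- 'queued': snapshot from statement start
       u.state AS seen_via_returning   -- 'processing': RETURNING sees the updated row
FROM jobs j
JOIN updated u ON u.id = j.id;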
- Another find: the resetter only resets records where last_heartbeat_at is not null. That covers every record that was officially dequeued, but not the records I dequeued manually. User error.
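A sketch of the selection condition as I understand it - a hypothetical simplification, not the actual resetter query (the 5-minute stall threshold is made up for illustration):

```sql
-- Hypothetical simplification: only rows that have ever heartbeated are
-- considered stalled and eligible for reset, which is why my manually
-- dequeued rows (last_heartbeat_at IS NULL) were never picked up.
SELECT id
FROM batch_spec_workspace_execution_jobs
WHERE state = 'processing'
  AND last_heartbeat_at IS NOT NULL
  AND last_heartbeat_at < now() - interval '5 minutes';
```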
- Ran
update batch_spec_workspace_execution_jobs set last_heartbeat_at = now() where state = 'processing';
and now they're reset correctly.
- Back to the dequeue problem: running the first CTE (selecting candidates) on its own is fast; with the second CTE that does the locking it becomes much slower
- After a VACUUM ANALYZE of the exec jobs table, the runtime for those two CTEs went from 60s to 45ms. smh.
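For the record, the fix, plus a standard pg_stat_user_tables check to see whether autovacuum/autoanalyze had fallen behind on the table (which would explain both the stale statistics and the 4M heap fetches in the plan above):

```sql
-- Rebuild planner statistics and clean up dead tuples on the jobs table.
VACUUM ANALYZE batch_spec_workspace_execution_jobs;

-- When was the table last (auto)vacuumed/analyzed, and how many dead
-- tuples had accumulated?
SELECT relname, n_live_tup, n_dead_tup,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'batch_spec_workspace_execution_jobs';
```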
- AND THE WORKER RETURNS