batches: DB fell into a state with very poor statistics, causing all dequeues to fail
Created by: eseliger
- the queue got stuck on k8s when 32 executor machines were online
- no visible problem, it's just not dequeuing
- CPU of pgsql-0 is constantly at 100%
- found this in executor machine logs: Jun 20 18:40:57 batches-dogfood-sourcegraph-executor-1ckk executor[3292]: {"SeverityText":"ERROR","Timestamp":1655750457528289914,"InstrumentationScope":"executor_processor","Caller":"workerutil/worker.go:192","Function":"github.com/sourcegraph/sourcegraph/internal/workerutil.(*Worker).Start","Body":"Failed to dequeue and handle record","Resource":{"service.name":"executor","service.version":"155249_2022-06-17_0c2ed651b4ee","service.instance.id":"batches-dogfood-sourcegraph-executor-1ckk"},"Attributes":{"name":"executor_batches_worker","name":"executor_batches_worker","error":"store.Dequeue: unexpected status code 504"}}
- started a rollout of new instances - no signs of improvement; DB CPU usage dropped briefly, but is now back at 100%
- closed the browser tab to make sure it wasn't somehow holding a lock with its constant refetches; it didn't change anything
- scaling the instance group down to 1 manually to see if it helps
- a minute later the CPU is not going down yet; no sign of improvement either, still 0 dequeues
- update: CPU is down now, still no successful dequeues
- a few minutes later, still no dequeues. Running systemctl restart on the executor on the last VM now
- GCP seems to have restarted the VM now? Is this a health check thing? Weird. On second thought, it's probably the script that reboots the VM after a 0 exit code from the executor.
- Machine is back up, still 0 dequeues.
- No logs in the frontend; the Grafana metric still shows dequeue errors.
- Looked at the running queries: quite a few very slow ones, many from listRepos, but nothing that had been running for minutes
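For reference, this is roughly the kind of lookup I was using to inspect running statements - a standard pg_stat_activity query, nothing Sourcegraph-specific:

```sql
-- List non-idle backends ordered by how long their current query has run.
SELECT pid,
       now() - query_start AS runtime,
       state,
       left(query, 100)   AS query_snippet
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```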
- Killed the DB now, let's see what happens
- Nothing happens
- Going through the app, everything seems to work??
- I see queries approaching 1 minute of age and then disappearing from the list - maybe that's a timeout in our apiclient?
- Still no clue where it hangs
- Running the dequeue SQL query by hand now; it's not returning
- It returned after a bit more than a minute... a record that's still 'queued'?? How is that possible?
- List changeset sync data is also appearing in long running queries, no idea how that's related
- The very slow dequeue query that returns a job that is still 'queued' returns ascending IDs
- Running with EXPLAIN ANALYZE now. Monstrous plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=17385.21..17387.48 rows=1 width=17) (actual time=116832.835..116833.067 rows=1 loops=1)
Join Filter: (j.id = queue_candidates.id)
Rows Removed by Join Filter: 1204
CTE candidate
-> Limit (cost=8694.24..8694.25 rows=1 width=62) (actual time=57918.292..57918.301 rows=1 loops=1)
-> LockRows (cost=8694.24..8694.25 rows=1 width=62) (actual time=57918.291..57918.300 rows=1 loops=1)
-> Sort (cost=8694.24..8694.24 rows=1 width=62) (actual time=57918.254..57918.263 rows=1 loops=1)
Sort Key: pc."order"
Sort Method: quicksort Memory: 95kB
-> Nested Loop (cost=8690.70..8694.23 rows=1 width=62) (actual time=57506.624..57918.045 rows=500 loops=1)
Join Filter: (batch_spec_workspace_execution_jobs.id = pc.candidate_id)
Rows Removed by Join Filter: 602000
-> Index Scan using batch_spec_workspace_execution_jobs_state on batch_spec_workspace_execution_jobs (cost=0.42..3.90 rows=1 width=14) (actual time=0.087..0.531 rows=1205 loops=1)
Index Cond: (state = ANY ('{queued,errored}'::text[]))
-> Subquery Scan on pc (cost=8690.28..8690.31 rows=1 width=56) (actual time=47.725..48.037 rows=500 loops=1205)
-> Limit (cost=8690.28..8690.30 rows=1 width=24) (actual time=47.724..47.944 rows=500 loops=1205)
-> WindowAgg (cost=8690.28..8690.30 rows=1 width=24) (actual time=47.724..47.912 rows=500 loops=1205)
-> Sort (cost=8690.28..8690.29 rows=1 width=16) (actual time=47.723..47.745 rows=501 loops=1205)
Sort Key: (row_number() OVER (?))
Sort Method: quicksort Memory: 105kB
-> Nested Loop Left Join (cost=8688.26..8690.27 rows=1 width=16) (actual time=56832.214..57505.935 rows=1205 loops=1)
Join Filter: (j_1.id = queue_candidates_1.id)
Rows Removed by Join Filter: 1450820
-> Index Scan using batch_spec_workspace_execution_jobs_state on batch_spec_workspace_execution_jobs j_1 (cost=0.42..2.38 rows=1 width=8) (actual time=0.040..1.163 rows=1205 loops=1)
Index Cond: (state = 'queued'::text)
Filter: ((process_after IS NULL) OR (process_after <= now()))
-> WindowAgg (cost=8687.84..8687.87 rows=1 width=24) (actual time=47.165..47.656 rows=1205 loops=1205)
-> Subquery Scan on queue_candidates_1 (cost=8687.84..8687.86 rows=1 width=8) (actual time=47.164..47.334 rows=1205 loops=1205)
-> Sort (cost=8687.84..8687.85 rows=1 width=36) (actual time=47.164..47.224 rows=1205 loops=1205)
Sort Key: (rank() OVER (?)), (max(exec_3.started_at)) NULLS FIRST
Sort Method: quicksort Memory: 143kB
-> WindowAgg (cost=8687.81..8687.83 rows=1 width=36) (actual time=56831.106..56831.824 rows=1205 loops=1)
-> Sort (cost=8687.81..8687.81 rows=1 width=28) (actual time=56831.084..56831.138 rows=1205 loops=1)
Sort Key: exec_3.user_id, exec_2.created_at, exec_2.id
Sort Method: quicksort Memory: 143kB
-> Nested Loop (cost=0.84..8687.80 rows=1 width=28) (actual time=48.804..56828.179 rows=1205 loops=1)
Join Filter: (exec_2.user_id = exec_3.user_id)
Rows Removed by Join Filter: 4820
-> Index Scan using batch_spec_workspace_execution_jobs_state on batch_spec_workspace_execution_jobs exec_2 (cost=0.42..2.37 rows=1 width=20) (actual time=0.034..3.575 rows=1205 loops=1)
Index Cond: (state = 'queued'::text)
-> GroupAggregate (cost=0.42..8685.06 rows=16 width=12) (actual time=0.008..47.154 rows=5 loops=1205)
Group Key: exec_3.user_id
-> Index Only Scan using batch_spec_workspace_execution_jobs_last_dequeue on batch_spec_workspace_execution_jobs exec_3 (cost=0.42..7232.36 rows=290509 width=12) (actual time=0.003..28.038 rows=282163 loops=1205)
Heap Fetches: 4100615
CTE updated_record
-> Update on batch_spec_workspace_execution_jobs batch_spec_workspace_execution_jobs_1 (cost=0.45..2.67 rows=1 width=290) (actual time=0.288..0.288 rows=0 loops=1)
-> Nested Loop (cost=0.45..2.67 rows=1 width=290) (actual time=0.077..0.077 rows=1 loops=1)
-> HashAggregate (cost=0.02..0.03 rows=1 width=40) (actual time=0.010..0.010 rows=1 loops=1)
Group Key: candidate_1.id
-> CTE Scan on candidate candidate_1 (cost=0.00..0.02 rows=1 width=40) (actual time=0.005..0.006 rows=1 loops=1)
-> Index Scan using batch_spec_workspace_execution_jobs_pkey on batch_spec_workspace_execution_jobs batch_spec_workspace_execution_jobs_1 (cost=0.42..2.64 rows=1 width=222) (actual time=0.063..0.063 rows=1 loops=1)
Index Cond: (id = candidate_1.id)
-> Nested Loop (cost=0.45..2.67 rows=1 width=17) (actual time=57918.349..57918.352 rows=1 loops=1)
-> HashAggregate (cost=0.02..0.03 rows=1 width=8) (actual time=57918.300..57918.301 rows=1 loops=1)
Group Key: candidate.id
-> CTE Scan on candidate (cost=0.00..0.02 rows=1 width=8) (actual time=57918.294..57918.295 rows=1 loops=1)
-> Index Scan using batch_spec_workspace_execution_jobs_pkey on batch_spec_workspace_execution_jobs j (cost=0.42..2.64 rows=1 width=17) (actual time=0.046..0.047 rows=1 loops=1)
Index Cond: (id = candidate.id)
-> Subquery Scan on queue_candidates (cost=8687.84..8687.86 rows=1 width=24) (actual time=58914.482..58914.633 rows=1205 loops=1)
-> Sort (cost=8687.84..8687.85 rows=1 width=36) (actual time=58914.480..58914.529 rows=1205 loops=1)
Sort Key: (rank() OVER (?)), (max(exec_1.started_at)) NULLS FIRST
Sort Method: quicksort Memory: 143kB
-> WindowAgg (cost=8687.81..8687.83 rows=1 width=36) (actual time=58913.415..58914.151 rows=1205 loops=1)
-> Sort (cost=8687.81..8687.81 rows=1 width=28) (actual time=58913.390..58913.442 rows=1205 loops=1)
Sort Key: exec_1.user_id, exec.created_at, exec.id
Sort Method: quicksort Memory: 143kB
-> Nested Loop (cost=0.84..8687.80 rows=1 width=28) (actual time=45.145..58910.444 rows=1205 loops=1)
Join Filter: (exec.user_id = exec_1.user_id)
Rows Removed by Join Filter: 4820
-> Index Scan using batch_spec_workspace_execution_jobs_state on batch_spec_workspace_execution_jobs exec (cost=0.42..2.37 rows=1 width=20) (actual time=0.042..4.225 rows=1205 loops=1)
Index Cond: (state = 'queued'::text)
-> GroupAggregate (cost=0.42..8685.06 rows=16 width=12) (actual time=0.008..48.881 rows=5 loops=1205)
Group Key: exec_1.user_id
-> Index Only Scan using batch_spec_workspace_execution_jobs_last_dequeue on batch_spec_workspace_execution_jobs exec_1 (cost=0.42..7232.36 rows=290509 width=12) (actual time=0.003..28.453 rows=282163 loops=1205)
Heap Fetches: 4100615
Planning Time: 2.259 ms
Execution Time: 116834.020 ms
(77 rows)
Not really giving me a hint as to what's going on, though
- Maybe it's time for dinner
- The resetter also doesn't seem to work? There are now 5 jobs in 'processing' (are those coming from the query that returned wrong data??)
- Yes they are, and when re-selecting their state they are indeed 'processing'. Now certain that the resetter doesn't work.
- Maybe killing the worker does something? I wouldn't understand why, though, since the problem survived a DB restart
- No visible change
- Killed all frontends at the same time
- No visible change
- 1h later; took a break. Nothing seems to have changed: the resetter didn't do anything and the executor still hasn't dequeued anything either
- Just learned: Postgres's CTE implementation doesn't forward the effects of an UPDATE in one CTE to the next CTE or the main query; all parts of a WITH statement run against the same snapshot of the data from the start of the statement. So seeing 'queued' is actually expected.
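A minimal illustration of that snapshot behavior, using a hypothetical jobs table (not the real schema): the main query still sees the pre-UPDATE row, while only the CTE's RETURNING clause sees the new value.

```sql
-- Hypothetical setup, not the real Sourcegraph schema:
--   CREATE TABLE jobs (id int PRIMARY KEY, state text);
--   INSERT INTO jobs VALUES (1, 'queued');

WITH updated AS (
    UPDATE jobs SET state = 'processing' WHERE id = 1
    RETURNING id, state
)
SELECT j.state AS seen_by_main_query,  -- 'queued': snapshot from statement start
       u.state AS seen_via_returning   -- 'processing': RETURNING sees the updated row
FROM jobs j
JOIN updated u ON u.id = j.id;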
- Another find: the resetter only resets records where last_heartbeat_at is not null. That covers every record that was officially dequeued, but not the records I dequeued manually. User error.
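A sketch of the selection condition as I understand it - a hypothetical simplification, not the actual resetter query (the 5-minute stall threshold is made up for illustration):

```sql
-- Hypothetical simplification: only rows that have ever heartbeated are
-- considered stalled and eligible for reset, which is why my manually
-- dequeued rows (last_heartbeat_at IS NULL) were never picked up.
SELECT id
FROM batch_spec_workspace_execution_jobs
WHERE state = 'processing'
  AND last_heartbeat_at IS NOT NULL
  AND last_heartbeat_at < now() - interval '5 minutes';
```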
- Ran
update batch_spec_workspace_execution_jobs set last_heartbeat_at = now() where state = 'processing';
and now they're reset correctly.
- Back to the dequeue problem: running the first CTE (selecting candidates) on its own is fast; with the second CTE that does the locking it becomes much slower
- After a VACUUM ANALYZE of the exec jobs table, the runtime for those two CTEs went from 60s to 45ms. smh.
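For the record, the fix, plus a standard pg_stat_user_tables check to see whether autovacuum/autoanalyze had fallen behind on the table (which would explain both the stale statistics and the 4M heap fetches in the plan above):

```sql
-- Rebuild planner statistics and clean up dead tuples on the jobs table.
VACUUM ANALYZE batch_spec_workspace_execution_jobs;

-- When was the table last (auto)vacuumed/analyzed, and how many dead
-- tuples had accumulated?
SELECT relname, n_live_tup, n_dead_tup,
       last_autovacuum, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'batch_spec_workspace_execution_jobs';
```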
- AND THE WORKER RETURNS