workerutil: Remove long-running transactions
Created by: efritz
Overview
This PR changes the workerutil and dbworker to use a heartbeat update of a job record instead of a long-held transaction to signal an active worker. Fixes https://github.com/sourcegraph/sourcegraph/issues/14920.
Technical overview
- A nullable timestamp with time zone column, last_updated_at, was added to all tables that are used as a worker job record. Associated Postgres views are not updated, as these columns do not need to be exposed outside of the dequeue code.
- The workerutil Store interface changed to replace its transactional store-like API with a simple cancel function returned alongside a dequeued record. This allows a backing store to be implemented by a transaction (as previously) or by some periodic process that ensures the record is still being serviced (as now). It also lets us simplify the Dequeue function and remove the Transact and Done methods.
- The dbworker Store implementation was updated. Instead of wrapping a record in a transaction via a two-stage optimistic locking mechanism, we use the non-transactional store to lock a record and periodically update a timestamp to signal that the record is still being processed. The returned cancel function stops the heartbeat goroutine when the record should be left alone (either having moved into a terminal state or able to be re-processed).
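
The heartbeat-plus-cancel shape described above can be sketched roughly as follows. This is a minimal in-memory illustration, not the actual workerutil or dbworker code: memStore, record, heartbeat, lastUpdated, and heartbeatDemo are hypothetical names, and a mutex-guarded map stands in for the Postgres job table.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// record is a hypothetical job row; only the fields relevant here are modeled.
type record struct {
	state         string
	lastUpdatedAt time.Time
}

// memStore is a hypothetical in-memory stand-in for a job table.
type memStore struct {
	mu      sync.Mutex
	records map[int]*record
}

// Dequeue marks one queued record as processing and starts a goroutine that
// refreshes its lastUpdatedAt every interval. The returned cancel function
// stops the heartbeat and blocks until the goroutine has exited.
func (s *memStore) Dequeue(ctx context.Context, interval time.Duration) (id int, cancel func(), ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for candidate, r := range s.records {
		if r.state != "queued" {
			continue
		}
		r.state = "processing"
		r.lastUpdatedAt = time.Now()
		hbCtx, stop := context.WithCancel(ctx)
		done := make(chan struct{})
		go s.heartbeat(hbCtx, candidate, interval, done)
		return candidate, func() { stop(); <-done }, true
	}
	return 0, nil, false
}

// heartbeat periodically bumps the record's timestamp until ctx is canceled.
func (s *memStore) heartbeat(ctx context.Context, id int, interval time.Duration, done chan<- struct{}) {
	defer close(done)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			s.mu.Lock()
			s.records[id].lastUpdatedAt = time.Now()
			s.mu.Unlock()
		}
	}
}

// lastUpdated reads the heartbeat timestamp under the store lock.
func (s *memStore) lastUpdated(id int) time.Time {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.records[id].lastUpdatedAt
}

// heartbeatDemo reports whether the timestamp advanced while the record was
// held, and whether it stayed fixed after cancel was called.
func heartbeatDemo() (advanced, frozen bool) {
	s := &memStore{records: map[int]*record{1: {state: "queued"}}}
	id, cancel, ok := s.Dequeue(context.Background(), 5*time.Millisecond)
	if !ok {
		return false, false
	}
	start := s.lastUpdated(id)
	time.Sleep(50 * time.Millisecond)
	mid := s.lastUpdated(id)
	cancel()
	atCancel := s.lastUpdated(id)
	time.Sleep(30 * time.Millisecond)
	return mid.After(start), s.lastUpdated(id).Equal(atCancel)
}

func main() {
	advanced, frozen := heartbeatDemo()
	fmt.Println("advanced while held:", advanced, "frozen after cancel:", frozen)
}
```

In this sketch the cancel function waits for the goroutine to exit, so a caller can move the record into a terminal state afterwards without racing a late heartbeat write. Whether the real implementation waits in the same way is not stated in this description.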
Review by commit
The first commit contains all the shared changes and doesn't need to be reviewed by everyone, as long as you can confirm that your use of the worker interfaces still behaves as expected. Each team should check their own use to make sure I didn't accidentally break a guarantee. I've split changes specific to a team's worker setup into separate commits below.
- Repository updates (cc @asdine):
  - https://github.com/sourcegraph/sourcegraph/pull/20936/commits/0f525a8c913b05514204e3333eb76962e25f1009: We updated the handler to create a new transaction as one is no longer created for you by the worker.
- Code insights (cc @slimsag):
  - https://github.com/sourcegraph/sourcegraph/pull/20936/commits/c8d06d046aeb80e428e434a7d17f9b895fe113be: This should require almost no review - you were not using the transaction within your handler body. We just added a heartbeat interval much lower than the max reset timeout.
- Code monitors (cc @stefanhengl):
  - https://github.com/sourcegraph/sourcegraph/pull/20936/commits/325b2b82fa6c08a2974a2173195d5efadd327a11: We updated the handler to create a new transaction as one is no longer created for you by the worker.
- Batches (cc @eseliger):
  - https://github.com/sourcegraph/sourcegraph/pull/20936/commits/b497c839a88a42429e93971eb3f2e747faa0a3a0: We updated the handler to create a new transaction as one is no longer created for you by the worker.
- Code intelligence (cc @Strum355 and @shrouxm):
  - https://github.com/sourcegraph/sourcegraph/pull/20936/commits/f94cad1c2ba57edcfec71e3e93946abe4230a17e: We updated the handler to create a new transaction as one is no longer created for you by the worker.
  - https://github.com/sourcegraph/sourcegraph/pull/20936/commits/6b80b8a36044562284f2a8d27ebfc57d90f1b1a9: We had to update the executor-queue API to stash a cancel function instead of a transaction handle.
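
For teams checking their handlers, the recurring change in the commits above (the handler opens its own transaction because the worker no longer supplies one) might look roughly like the following. This is a sketch under assumptions, not code from any of the linked commits: Tx, fakeStore, fakeTx, and handle are hypothetical names, and an in-memory map stands in for the database.

```go
package main

import (
	"errors"
	"fmt"
)

// Tx is a hypothetical transaction handle. Done commits when err is nil and
// rolls back otherwise, returning err unchanged.
type Tx interface {
	Write(key, value string)
	Done(err error) error
}

// fakeStore is a hypothetical in-memory store used in place of a database.
type fakeStore struct{ committed map[string]string }

type fakeTx struct {
	store   *fakeStore
	pending map[string]string
}

// Transact opens a new transaction whose writes are buffered until Done.
func (s *fakeStore) Transact() Tx {
	return &fakeTx{store: s, pending: map[string]string{}}
}

func (t *fakeTx) Write(key, value string) { t.pending[key] = value }

func (t *fakeTx) Done(err error) error {
	if err == nil {
		for k, v := range t.pending {
			t.store.committed[k] = v
		}
	}
	t.pending = nil // discard buffered writes on both commit and rollback
	return err
}

// handle processes one job. The transaction is created inside the handler,
// so all writes for the job commit or roll back together.
func handle(s *fakeStore, jobID string, fail bool) (err error) {
	tx := s.Transact()
	defer func() { err = tx.Done(err) }()

	tx.Write(jobID, "started")
	if fail {
		return errors.New("processing failed")
	}
	tx.Write(jobID+"/result", "ok")
	return nil
}

func main() {
	s := &fakeStore{committed: map[string]string{}}
	fmt.Println("success:", handle(s, "job-1", false), "rows:", len(s.committed))
	fmt.Println("failure:", handle(s, "job-2", true), "rows:", len(s.committed))
}
```

The point of the pattern is that a failed handler leaves no partial writes behind, which is the guarantee the worker-provided transaction used to give implicitly.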