Skip to content

ssbc: Fix duplicate changeset specs

Administrator requested to merge es/fix-duplicate-changeset-specs into main

Created by: eseliger

Okay, this has been quite a journey to discover. It happened occasionally on k8s when updates were rolling out. Here's what happens:

  • Executor A has job Z
  • Frontend goes down, executor A cannot send heartbeats anymore
  • Worker resets job to queued, since it is considered “lost”, executor A doesn't know about this yet, as it hasn't had a successful heartbeat again so far
  • Executor A continues to work as frontend is down
  • Frontend comes back up slowly
  • Executor B gets assigned job Z, overwriting the worker_hostname column thus now owning the job
  • Executor A is done, and has not yet learned that it doesn’t own the record anymore (no successful heartbeat yet)
  • Executor A calls MarkComplete({workerHostname: A})
  • In the backend, we first do our side-effects and then call dbworkerstore.MarkComplete. This returns false, nil (not updated, but no DB err)
  • We don't look at the ok value, commit the transaction since err is nil, the changeset spec is committed to the DB
  • Executor A gives up
  • Executor B completes work, calls MarkComplete({workerHostname: B}), this time returning true, nil, since the hostname matches
  • The workspace.changeset_specs field is overwritten and includes the new ID, but the old stray changeset spec remains
  • We get the conflicting changeset specs error in the UI, but only 1 result is displayed on the workspace, just as observed

We now properly revert the transactions, test this in code and also do one query less, which was not needed.

Closes https://github.com/sourcegraph/sourcegraph/issues/36918

Test plan

Extended test suite.

Merge request reports

Loading