External service job can become stuck processing forever
Created by: ryanslade
We observed a customer instance that had an external service sync job stuck in the queue for many weeks. It was in the processing state and had been reset twice. We deleted the row, which caused a new job to be queued, and repos started to sync as expected.
Our current theory is that repo-updater was restarted while the job was processing, so the job was never marked as complete. After that, every attempt to queue the job found the existing job and did not queue a duplicate.
We should:
- Ensure that we can't get into this state again, perhaps by clearing out processing jobs when repo-updater starts
- Surface the state of the sync job somewhere so that it's easier to spot if it does happen again
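
To illustrate the theory and the first proposal, here is a minimal in-memory sketch. It is not the actual repo-updater code: `Job`, `Queue`, `Enqueue` and `ResetProcessing` are hypothetical names, and the real queue lives in the database. It only shows how a job left in the processing state blocks new jobs, and how resetting such jobs at startup would unblock them.

```go
package main

import "fmt"

// Job is a hypothetical, simplified stand-in for a row in the sync job queue.
type Job struct {
	ExternalServiceID int
	State             string // "queued", "processing", "completed" or "errored"
}

// Queue is an in-memory model of the job table (assumption: the real one is a DB table).
type Queue struct {
	jobs []*Job
}

// Enqueue models the dedup behaviour described in this issue: it refuses to
// add a job when one is already queued or processing for the same service.
func (q *Queue) Enqueue(svcID int) bool {
	for _, j := range q.jobs {
		if j.ExternalServiceID == svcID && (j.State == "queued" || j.State == "processing") {
			return false
		}
	}
	q.jobs = append(q.jobs, &Job{ExternalServiceID: svcID, State: "queued"})
	return true
}

// ResetProcessing moves every processing job back to queued and returns how
// many jobs were reset. Intended to run once at startup, before any worker
// picks up jobs, so anything still marked processing must be left over from
// a previous run.
func (q *Queue) ResetProcessing() int {
	n := 0
	for _, j := range q.jobs {
		if j.State == "processing" {
			j.State = "queued"
			n++
		}
	}
	return n
}

func main() {
	q := &Queue{}
	q.Enqueue(1)
	q.jobs[0].State = "processing" // a worker picks the job up, then the process is restarted

	fmt.Println("enqueue after restart:", q.Enqueue(1)) // blocked by the stale job
	fmt.Println("jobs reset at startup:", q.ResetProcessing())
	fmt.Println("state after reset:", q.jobs[0].State)
}
```

Resetting at startup like this is only safe if a single repo-updater instance owns the queue. With more than one instance, a reset could requeue a job that another instance is still working on, so a heartbeat or timeout on the job row would be the safer variant.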
Related issue: https://github.com/sourcegraph/customer/issues/978