workerutil: experimenting with batch dequeue to reduce queue overhead
Created by: coury-clark
This PR is not meant to be merged; I'm just looking for feedback here.
So, what is this?
While debugging some slowness in the code insights worker, I noticed a few things:
- The dequeue mechanism can be a bottleneck for short-lived, high-concurrency handlers (dequeue operations can have non-trivial overhead).
- Other queuing systems (Kafka, SQS, etc.) allow consumers to poll a batch of records, with the semantics that records not successfully marked as completed within some time interval may be dequeued again.
Now, I'm not interested in rebuilding Kafka here, but it would be useful for code insights to have some batching. We are currently spending roughly 400ms on each dequeue operation (this could probably be a bit better, but regardless) and only 100-200ms on average on the actual handler invocation. We have millions of invocations to process, so if we could batch something like 20-50 records we would nearly eliminate the queue overhead and save a lot of time.
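To make the back-of-envelope math concrete, here is a quick sketch of the amortized per-record cost at different batch sizes, using the ~400ms dequeue overhead and the midpoint of the 100-200ms handler time from above (purely illustrative numbers):

```go
package main

import "fmt"

func main() {
	const dequeueMs = 400.0 // observed per-dequeue overhead
	const handlerMs = 150.0 // midpoint of the 100-200ms handler time

	for _, batchSize := range []float64{1, 20, 50} {
		// The fixed dequeue cost is amortized across the batch.
		perRecord := dequeueMs/batchSize + handlerMs
		overhead := 100 * (dequeueMs / batchSize) / perRecord
		fmt.Printf("batch=%2.0f: ~%.0fms/record (%.0f%% queue overhead)\n",
			batchSize, perRecord, overhead)
	}
}
```

With batch sizes of 20-50, queue overhead drops from roughly 73% of wall time per record to about 5-12%, which is where the "nearly eliminate" claim comes from.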
Here are some of the implementation problems I'm looking for feedback on:
- The `RecordScanFn` can currently brick the entire set of records because it eats the entire `rows` and gives back just one `Record`. I could update this to return a `[]workerutil.Record`, or I could add a `BatchRecordScanFn` with different semantics. Ideally, the operations on a single `Record` would live in a different store than the operations to fetch a `Record`, but I think that would be much more work.
- Should the `Store` interface have both a `Dequeue` and a `BatchDequeue` operation? I'm leaning towards yes, to reduce the random impact of changing the `Dequeue` semantics. For example, this PR contains an executor that uses the same `Store` interface and would have to be updated.
- What should we attach to the trace? Currently, it associates the single record `id` on dequeue. Is this even relevant?
- Should batching respect `NumTotalJobs` exactly, or execute within a single `BatchSize`?
So basically, this PR is a rough sketch of what this could look like. It doesn't work correctly yet because the mapping function eats all of the records, but it should serve to illustrate the idea.
Looking for feedback, thanks!