codeintel: Reduce the size of bundle-manager network payloads (!11004)

Administrator requested to merge split-bundle-manager-payloads into master May 26, 2020

Created by: efritz

This PR splits large payloads that are transferred between the bundle manager and other services (particularly the worker). When a large upload is transferred to the worker from the bundle manager's disk, we get connection reset errors, which cause the upload to fail to process.

This PR adds resiliency to this communication path. When the worker requests an upload file, it keeps track of its current offset (the number of bytes received so far) and, on a transient network error, retries the request at that offset. When the worker sends large uploads (converted database files) to the bundle manager for permanent storage, it splits each one into smaller files and transfers them serially.
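The offset-tracking retry described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `fetchFunc`, `errTransient`, `fetchWithResume`, `fetchAll`, and the `flakySource` simulator are all hypothetical names, and the sketch assumes the bundle manager can serve a file starting at an arbitrary byte offset.

```go
package main

import (
	"bytes"
	"errors"
	"io"
)

// fetchFunc is a hypothetical stand-in for the worker's request to the bundle
// manager: it opens the upload starting at the given byte offset.
type fetchFunc func(offset int64) (io.ReadCloser, error)

// errTransient marks errors that this sketch assumes are safe to retry.
var errTransient = errors.New("transient network error")

// fetchWithResume copies the whole upload into w. After a transient failure it
// re-requests the file starting at the number of bytes already written.
func fetchWithResume(w io.Writer, fetch fetchFunc, maxRetries int) (int64, error) {
	var offset int64
	for retries := 0; ; retries++ {
		n, err := copyFrom(w, fetch, offset)
		offset += n // keep whatever arrived before the failure
		if err == nil {
			return offset, nil
		}
		if !errors.Is(err, errTransient) || retries >= maxRetries {
			return offset, err
		}
	}
}

// copyFrom performs a single request and reports how many bytes it delivered.
func copyFrom(w io.Writer, fetch fetchFunc, offset int64) (int64, error) {
	body, err := fetch(offset)
	if err != nil {
		return 0, err
	}
	defer body.Close()
	return io.Copy(w, body)
}

// fetchAll is a convenience wrapper that collects the upload in memory.
func fetchAll(fetch fetchFunc, maxRetries int) ([]byte, error) {
	var buf bytes.Buffer
	_, err := fetchWithResume(&buf, fetch, maxRetries)
	return buf.Bytes(), err
}

// errReader always fails the way a reset connection might.
type errReader struct{}

func (errReader) Read([]byte) (int, error) { return 0, errTransient }

// flakySource simulates a server whose connection resets after limit bytes.
// It exists only to exercise the sketch.
type flakySource struct {
	data  []byte
	limit int
}

func (s flakySource) fetch(offset int64) (io.ReadCloser, error) {
	rest := s.data[offset:]
	if len(rest) <= s.limit {
		return io.NopCloser(bytes.NewReader(rest)), nil
	}
	// Serve limit bytes, then fail with a transient error.
	r := io.MultiReader(bytes.NewReader(rest[:s.limit]), errReader{})
	return io.NopCloser(r), nil
}
```

With `flakySource{data: payload, limit: 4}`, calling `fetchAll(src.fetch, 5)` on a ten-byte payload takes three requests (offsets 0, 4, and 8) and returns the payload intact. A real client would also need to decide which errors count as transient, which this sketch leaves to a single sentinel value.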

This should remedy #10913, but we need to ensure that it works in the dot-com environment for an extended period before declaring it solved. I hope that smaller transfer sizes will no longer trigger the network issues present in the Kubernetes clusters where this behavior is being seen. We may also need to add retry mechanisms when uploading the smaller payloads, but I will hold off on that until a second PR.

My intuition is that this should be enough, as we don't see similar errors when uploading LSIF indexes from CI. Those are already split into smaller payloads using the same technique as in this PR, so the maximum size of an upload request is likely lower than the threshold required to trigger this behavior (whereas the stitched file, transferred pod-to-pod, was large enough to trigger it).
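For illustration, splitting a large file into smaller parts that are sent serially might look like the sketch below. The names `sendPartFunc`, `sendInParts`, and `splitBytes` are hypothetical and not taken from this PR, and the part size is left as a parameter because the description does not state one.

```go
package main

import (
	"bytes"
	"errors"
	"io"
)

// sendPartFunc is a hypothetical stand-in for one request to the bundle
// manager carrying a single part. The slice is reused between calls, so an
// implementation must not retain it.
type sendPartFunc func(index int, part []byte) error

// sendInParts reads r in pieces of at most partSize bytes and sends them one
// at a time, in order. It returns the number of parts sent.
func sendInParts(r io.Reader, partSize int, send sendPartFunc) (int, error) {
	if partSize <= 0 {
		return 0, errors.New("partSize must be positive")
	}
	buf := make([]byte, partSize)
	for index := 0; ; index++ {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			if sendErr := send(index, buf[:n]); sendErr != nil {
				return index, sendErr
			}
		}
		switch err {
		case nil:
			continue // a full part was sent; read the next one
		case io.EOF:
			return index, nil // nothing left to read
		case io.ErrUnexpectedEOF:
			return index + 1, nil // the final, short part was sent
		default:
			return index, err
		}
	}
}

// splitBytes is a small in-memory demonstration that collects a copy of each
// part instead of sending it over the network.
func splitBytes(data []byte, partSize int) ([][]byte, error) {
	var parts [][]byte
	_, err := sendInParts(bytes.NewReader(data), partSize, func(_ int, part []byte) error {
		parts = append(parts, append([]byte(nil), part...))
		return nil
	})
	return parts, err
}
```

Because each part goes out as its own request, a failure only affects that part; a retry wrapper around the `send` callback would be one place to add the follow-up retry mechanism mentioned above.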
