codeintel: Reduce correlation memory usage by 50%
Created by: efritz
In order to run workers concurrently (https://github.com/sourcegraph/sourcegraph/issues/11643), we need to be able to estimate the memory requirements of processing an upload given its input size. This took me down a small optimization hole that turned out to be rather fruitful.
Before these changes, the memory required to process an LSIF dump for aws-sdk-go was ~2x the input size (a 1.3G dump took 2.4G of heap space). After these changes it can be processed with only ~1x the input size (now 1.2-1.5G of heap space). Processing is also slightly faster.
Changes:
- Convert LSIF identifiers into integers at the parsing layer. Storing integers rather than strings in value types and containers saves heap space and reduces pointer chasing, and integer comparisons are faster. (See the interning sketch after this list.)
- Rewrite the integer set to take advantage of the bimodal use of sets during processing: most sets are very small (zero to a handful of elements), but some are large (millions of elements), so it pays to optimize for the small common case. The main difference is using a small slice instead of a map for small sets: a slice's elements (likely) reside in the same cache line, and it carries much less overhead than a map, both for the struct itself and per element. Because we allocate so many sets, shrinking each one yields a large benefit. (See the hybrid-set sketch after this list.)
- Be smarter about parsing markdown. The previous implementation moved strings around a lot, copying at each step. (See the string-building sketch after this list.)
- Be smarter about supplying size hints for maps and slices during grouping, so containers are allocated at (or near) their final capacity instead of growing repeatedly. (See the size-hint sketch after this list.)
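
To illustrate the identifier change, here's a minimal interning sketch; the type and function names are hypothetical, not the PR's actual code:

```go
package main

import "fmt"

// interner maps each distinct string identifier to a small integer,
// so downstream structures can store ints instead of strings.
type interner struct {
	ids map[string]int
}

func newInterner() *interner {
	return &interner{ids: map[string]int{}}
}

// intern returns the integer assigned to s, allocating the next
// free integer the first time s is seen.
func (i *interner) intern(s string) int {
	id, ok := i.ids[s]
	if !ok {
		id = len(i.ids) + 1
		i.ids[s] = id
	}
	return id
}

func main() {
	i := newInterner()
	fmt.Println(i.intern("vertex-123")) // 1
	fmt.Println(i.intern("vertex-456")) // 2
	fmt.Println(i.intern("vertex-123")) // 1 (same id for the same string)
}
```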
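
For the set rewrite, a sketch of a slice-backed small set that promotes to a map once it grows; the threshold and names are assumptions, not the PR's actual values:

```go
package main

import "fmt"

// smallThreshold is an assumed cutoff; the real value would be tuned.
const smallThreshold = 16

// intSet stores small sets in a slice (cache-friendly, little overhead)
// and switches to a map once the set grows past the threshold.
type intSet struct {
	small []int
	large map[int]struct{}
}

func (s *intSet) add(v int) {
	if s.large != nil {
		s.large[v] = struct{}{}
		return
	}
	for _, x := range s.small {
		if x == v {
			return // already present
		}
	}
	if len(s.small) < smallThreshold {
		s.small = append(s.small, v)
		return
	}
	// Promote: copy slice elements into a map and continue there.
	s.large = make(map[int]struct{}, len(s.small)+1)
	for _, x := range s.small {
		s.large[x] = struct{}{}
	}
	s.large[v] = struct{}{}
	s.small = nil
}

func (s *intSet) contains(v int) bool {
	if s.large != nil {
		_, ok := s.large[v]
		return ok
	}
	for _, x := range s.small {
		if x == v {
			return true
		}
	}
	return false
}

func main() {
	var s intSet
	for v := 0; v < 100; v += 2 {
		s.add(v)
	}
	fmt.Println(s.contains(42), s.contains(43)) // true false
}
```

The linear scan looks wasteful on paper, but for a handful of elements it beats a map's hashing and indirection, and the zero-value struct costs almost nothing until elements are added.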
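
The markdown item doesn't say exactly what changed, so this is only a hedged illustration of the general fix for string movement: write the output once into a strings.Builder instead of concatenating, which copies the accumulated string on every append:

```go
package main

import (
	"fmt"
	"strings"
)

// stripBackticks removes backticks from s. Building into a
// strings.Builder writes each byte once; the naive version
// (out += string(c)) reallocates and copies on every append.
func stripBackticks(s string) string {
	var b strings.Builder
	b.Grow(len(s)) // single allocation up front
	for i := 0; i < len(s); i++ {
		if s[i] != '`' {
			b.WriteByte(s[i])
		}
	}
	return b.String()
}

func main() {
	fmt.Println(stripBackticks("call `foo` then `bar`"))
}
```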
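
And for the size hints, a sketch of the pattern during grouping: count first, then pre-size the map and each slice so nothing has to grow (the edge type is a stand-in for the real grouped values):

```go
package main

import "fmt"

// edge is a stand-in for whatever is being grouped during correlation.
type edge struct {
	from, to int
}

// groupByFrom buckets edge targets by source. The first pass counts
// group sizes so the map and every slice can be allocated exactly
// once, rather than grown (and copied) repeatedly by append.
func groupByFrom(edges []edge) map[int][]int {
	counts := make(map[int]int)
	for _, e := range edges {
		counts[e.from]++
	}

	grouped := make(map[int][]int, len(counts)) // size hint: number of keys
	for _, e := range edges {
		if grouped[e.from] == nil {
			grouped[e.from] = make([]int, 0, counts[e.from]) // exact capacity
		}
		grouped[e.from] = append(grouped[e.from], e.to)
	}
	return grouped
}

func main() {
	edges := []edge{{1, 2}, {1, 3}, {2, 3}}
	fmt.Println(groupByFrom(edges)) // map[1:[2 3] 2:[3]]
}
```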
Memory trace (before):
Memory trace (after):