Remodel SQLite Schema
Created by: efritz
The Problem
The old SQLite database stored everything in a document blob, a gzipped and JSON-encoded set of the following data (a sketch of the decoded shape follows the list):
- ranges in the document
- monikers attached to ranges
- package information attached to monikers
- definition results shared by multiple ranges in the document
- reference results shared by multiple ranges in the document
- hover results shared by multiple ranges in the document
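For concreteness, here is a minimal Go sketch of what such a decoded document blob might contain. All identifiers below (ID, DocumentData, and friends) are illustrative assumptions, not the actual schema.

```go
package lsif

// ID identifies a range, moniker, or result within a single dump.
type ID string

// Location points at a range within some document of the dump.
type Location struct {
	Path                      string
	StartLine, StartCharacter int
	EndLine, EndCharacter     int
}

// DocumentData is the decoded form of one gzipped, JSON-encoded document blob.
type DocumentData struct {
	Ranges             map[ID]RangeData          // ranges in the document
	Monikers           map[ID]MonikerData        // monikers attached to ranges
	PackageInformation map[ID]PackageInformation // package info attached to monikers
	DefinitionResults  map[ID][]Location         // definition results shared by ranges
	ReferenceResults   map[ID][]Location         // reference results shared by ranges
	HoverResults       map[ID]string             // hover text shared by ranges
}

// RangeData describes a single range and the results attached to it.
type RangeData struct {
	StartLine, StartCharacter int
	EndLine, EndCharacter     int
	DefinitionResultID        ID
	ReferenceResultID         ID
	HoverResultID             ID
	MonikerIDs                []ID
}

// MonikerData describes a moniker attached to a range.
type MonikerData struct {
	Kind, Scheme, Identifier string
	PackageInformationID     ID
}

// PackageInformation describes the package a moniker belongs to.
type PackageInformation struct {
	Name, Version string
}
```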
This allowed us to use a simple SQL query that returns the blob of data we need to answer any query that doesn't involve looking at two files.
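As a sketch of that single-query read path (reusing the types above and assuming a hypothetical `documents(path, data)` table), answering a within-file query looks roughly like:

```go
import (
	"bytes"
	"compress/gzip"
	"database/sql"
	"encoding/json"
)

// readDocument fetches and decodes the single gzipped, JSON-encoded
// blob for a document. Table and column names are assumptions.
func readDocument(db *sql.DB, path string) (*DocumentData, error) {
	var compressed []byte
	if err := db.QueryRow(`SELECT data FROM documents WHERE path = ?`, path).Scan(&compressed); err != nil {
		return nil, err
	}

	gzipReader, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer gzipReader.Close()

	var document DocumentData
	if err := json.NewDecoder(gzipReader).Decode(&document); err != nil {
		return nil, err
	}
	return &document, nil
}
```

Everything needed to resolve ranges, hovers, and monikers for any position in the file comes back in that one row.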
However, after removing the result sets at import time in order to stop the server from having to trace graph edges at query time (which is undesirable -- why do it on every query when you can do it just once up front?), the size of the definition and reference results became apparent. This didn't cause a problem by itself, but it did reveal the extent of a problem that already existed.
This created much larger document blobs, which will become a problem at some point. In order to quickly answer a query about a single range in a large document, it may be necessary to pull multiple megabytes of unrelated information out of the SQLite file.
The Solution
Re-model the database so that documents no longer track their own definition and reference results (but they do retain their ranges, monikers, package information, and hovers). Profiling has shown that the OVERWHELMING majority of the data is in these two fields.
We now put definition and reference results in another table. However, experiments over the Labor Day holiday showed that storing data at this scale one row per result is infeasible (the per-tuple overhead is too high at insertion time and too large on disk). We need the same gzipped and JSON-encoded trickery here as well.
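Under the new model, the document blob sketched earlier simply drops its two largest fields; ranges keep their result ids, which now resolve against the separate table (again, illustrative names only):

```go
// DocumentData, remodeled: definition and reference results no longer
// live in the document blob. Ranges still carry DefinitionResultID and
// ReferenceResultID, which are now resolved against the new table.
type DocumentData struct {
	Ranges             map[ID]RangeData          // ranges in the document
	Monikers           map[ID]MonikerData        // monikers attached to ranges
	PackageInformation map[ID]PackageInformation // package info attached to monikers
	HoverResults       map[ID]string             // hover text shared by ranges
}
```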
So far, we've found that we can't store this definition and reference result data in any of the obvious ways:
- all in one giant blob (it would be much larger than a document),
- along with a document or as a sibling of a document (it would not be easy to share the same definition or reference results between documents), or
- in individual rows (due to the required throughput of the converter and the rarity of the rare earth materials required to produce enough disk space).
What we can do is shard definition and reference results over several rows, with a number of chunks that scales with the size of the input dump. Given an identifier for a definition or reference result from a document, we can then determine the id of the result chunk that contains it (using the same hash function and the total number of chunks). This requires loading a second blob for definition and reference results, but these can be cached in memory in the same manner as document blobs. See code for details!
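A rough illustration of that scheme follows; it is not the actual implementation. The `result_chunks` table, the type and function names, and the choice of FNV-1a are all assumptions -- any stable hash works, as long as the converter and the query path use the same hash and the same chunk count.

```go
import (
	"bytes"
	"compress/gzip"
	"database/sql"
	"encoding/json"
	"hash/fnv"
)

// ResultChunkData is the decoded form of one shard ("chunk") of the
// definition and reference results: a map from result id to the
// locations that make up that result.
type ResultChunkData map[ID][]Location

// resultChunkID maps a definition or reference result id to the chunk
// that holds it. The converter and the query path share the same hash
// and the same total chunk count, so no extra lookup table is needed.
func resultChunkID(id ID, numResultChunks int) int {
	hasher := fnv.New32a()
	_, _ = hasher.Write([]byte(id))
	return int(hasher.Sum32() % uint32(numResultChunks))
}

// resolveResult loads the chunk containing the given result id and
// returns its locations. In practice the decoded chunk would be cached
// in memory, just like decoded document blobs.
func resolveResult(db *sql.DB, id ID, numResultChunks int) ([]Location, error) {
	chunkID := resultChunkID(id, numResultChunks)

	var compressed []byte
	if err := db.QueryRow(`SELECT data FROM result_chunks WHERE id = ?`, chunkID).Scan(&compressed); err != nil {
		return nil, err
	}

	gzipReader, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer gzipReader.Close()

	var chunk ResultChunkData
	if err := json.NewDecoder(gzipReader).Decode(&chunk); err != nil {
		return nil, err
	}
	return chunk[id], nil
}
```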
Results
Uploading is now 2-3x faster