insights: store full vectors for each data series sample
Created by: coury-clark
Context
The code insights backend started off as a straightforward implementation that would populate a series by iterating over all repositories for each data point and performing a search query. This process was not ideal: it took a long time, and it scaled linearly with the repository count of the Sourcegraph installation, regardless of how many repositories were actually under development.
We introduced a compression technique (RFC) to make better choices about which repositories need to be queried, such that only repositories with commits during the sampling window would get queried. This compression is very effective in practice, typically reducing the number of queries necessary by > 90%. We stored the historical data in the same format as a partial vector, where a value at time `T` for some repository `R` would have to be derived by finding the most recent value `<= T`. To generate a full vector for time `T`, we resolved this relation for each repository at query time. While this significantly reduces the number of recordings stored, in practice we found this query pattern had a few distinct problems:
- The performance was quite poor, and scaled with the number of distinct repositories in the data series multiplied by the number of time points resolved. The performance was unacceptably bad on `sourcegraph.com` and trended the same on `k8s-dogfood` after a short time.
- Storing mixed formats of historically compressed partial vectors and non-historical full vectors led to artifacts that would require either more work at query time or some other technique to set unseen values to zero when full vectors were recorded.
Certainly, there is room for improvement in this query; however, current performance is orders of magnitude away from target performance.
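To make the cost concrete, the query-time resolution described above might look something like the following sketch. The `series_points` table, its columns, and the function name are assumptions for illustration only; the shape is what matters: every resolved time point forces a most-recent-value lookup per repository.

```go
package insights

import (
	"context"
	"database/sql"
	"time"
)

// resolvePartialVectors is a hypothetical sketch of the old query-time
// resolution, assuming a series_points(series_id, repo_id, time, value)
// table. For each generated sample time it finds, per repository, the most
// recent value recorded at or before that time, which is why cost scales
// with (distinct repositories) x (time points resolved).
const resolvePartialVectors = `
SELECT t.sample_time, SUM(p.value) AS value
FROM generate_series($2::timestamptz, $3::timestamptz, interval '2 weeks') AS t(sample_time)
CROSS JOIN LATERAL (
    SELECT DISTINCT ON (repo_id) repo_id, value
    FROM series_points
    WHERE series_id = $1 AND time <= t.sample_time
    ORDER BY repo_id, time DESC -- most recent recording <= T per repository
) p
GROUP BY t.sample_time
ORDER BY t.sample_time`

// seriesValues runs the resolution for one series over [from, to].
func seriesValues(ctx context.Context, db *sql.DB, seriesID string, from, to time.Time) (*sql.Rows, error) {
	return db.QueryContext(ctx, resolvePartialVectors, seriesID, from, to)
}
```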
We considered the possibility of computing these full vectors in the background and resolving them from a precomputed table at query time. At the moment there are concerns in the product that this doesn't necessarily satisfy customer need, and it comes with enough implementation complexity that we would likely be putting the overall roadmap in question. It's possible that in a future iteration this will be a useful approach (especially for very large installations, e.g. `sourcegraph.com`), but it needs more thought.
Moving forward
We have a few immediate needs to resolve:
1. Get the performance to an acceptable (and usable) state on clusters at least as large as `k8s-dogfood`, and ideally as large as `sourcegraph.com`.
2. Accomplish (1) without too much churn in the product roadmap.
To this end, we are going to change the storage format to represent full vectors at each time point, while keeping the compression method to reduce the number of queries.
- It will be useful to continue the trend of deterministic generation plans by aligning sample times across all series. We have chosen a day-of-month approach, currently fixed at days 1 and 15 (a sketch of this schedule follows this list). Only historical samples will be able to be fixed at a specific commit; indexed samples will execute at least after a fixed amount of time. Commits should continue to be derived as the `HEAD` at the sampling time, and should be recorded with the sample time instead of the committed time.
- Historical frames that are skipped due to having no changes need to record a value. Given the asynchronous processing of these queries, we will introduce the concept of frame dependencies on some exemplar query. All skipped frames will atomically receive the same value as the exemplar query. That is to say, for some exemplar value `E`, a set of dependent frames `D` must all succeed in writing the recorded value `E`. Any failure writing either `E` or any frame in `D` should flag `E` as an error and retry, and leave the underlying series in the same state as prior to `E` (a sketch of this transactional write also follows this list).
- This approach allows us to define some bounds on the expected cardinality of an insight by loosely associating it with the resolved number of repositories that currently match the query.
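As a concrete reading of the sample-time alignment in the first bullet, here is a minimal sketch of a deterministic schedule generator fixed at days 1 and 15. The function name and signature are hypothetical; only the day-of-month rule comes from this document.

```go
package insights

import "time"

// alignedSampleTimes sketches the deterministic generation plan: sample
// times are aligned to fixed days of the month (1 and 15), so every series
// shares the same sampling schedule regardless of when it was created.
func alignedSampleTimes(from, to time.Time) []time.Time {
	var samples []time.Time
	// Start at day 1 of the month containing `from`.
	t := time.Date(from.Year(), from.Month(), 1, 0, 0, 0, 0, time.UTC)
	for !t.After(to) {
		if !t.Before(from) {
			samples = append(samples, t)
		}
		// Alternate between day 1 and day 15; time.Date normalizes
		// Month()+1 past December into the next year.
		if t.Day() == 1 {
			t = time.Date(t.Year(), t.Month(), 15, 0, 0, 0, 0, time.UTC)
		} else {
			t = time.Date(t.Year(), t.Month()+1, 1, 0, 0, 0, 0, time.UTC)
		}
	}
	return samples
}
```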
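The frame-dependency rule in the second bullet reduces to a single transaction: the exemplar value `E` and every dependent frame in `D` are written together or not at all. A hedged sketch follows, again assuming a `series_points(series_id, repo_id, time, value)` table; none of these names are confirmed by this document.

```go
package insights

import (
	"context"
	"database/sql"
	"time"
)

// recordWithDependents sketches the atomic write required by frame
// dependencies: the exemplar value E and every dependent (skipped) frame in
// D succeed or fail together. On any error the transaction rolls back,
// leaving the series in the same state as prior to E; the caller flags the
// exemplar as errored and retries.
func recordWithDependents(ctx context.Context, db *sql.DB, seriesID, repoID string, value float64, exemplar time.Time, dependents []time.Time) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds

	const insert = `INSERT INTO series_points (series_id, repo_id, time, value)
	                VALUES ($1, $2, $3, $4)`
	for _, t := range append([]time.Time{exemplar}, dependents...) {
		if _, err := tx.ExecContext(ctx, insert, seriesID, repoID, t, value); err != nil {
			return err // rollback via defer; series state unchanged
		}
	}
	return tx.Commit()
}
```

Putting all writes in one transaction gives the rollback-to-prior-state behavior for free: if any insert fails, the series is left exactly as it was before `E` was attempted.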