insights: backfill time frames before the commit index are treated as compressed frames
Created by: coury-clark
I think I just confirmed this is a problem:
seriesId: 242lOzkcVYjfMwS3LXSwintsZ6N
on k8s-dogfood - for repo_id: 45849
which corresponds to github.com/DefinitelyTyped/DefinitelyTyped
Series is configured for 9 month intervals.
I just regenerated the series using streaming search and found this repo only got 2 queries: 2021-04-22 and 2022-01-22. The oldest commit in the repo was Oct 5 09:39:46 2012 -0700. Viewing the logs confirms that the compression resulted in this execution plan:
[{ 2013-10-22 00:00:00 +0000 UTC [2014-07-22 00:00:00 +0000 UTC 2015-04-22 00:00:00 +0000 UTC 2016-01-22 00:00:00 +0000 UTC 2016-10-22 00:00:00 +0000 UTC 2017-07-22 00:00:00 +0000 UTC 2018-04-22 00:00:00 +0000 UTC 2019-01-22 00:00:00 +0000 UTC 2019-10-22 00:00:00 +0000 UTC 2020-07-22 00:00:00 +0000 UTC]},{bb9f40328b8349b856937b03b2f4d5c5fec789d1 2021-04-22 00:00:00 +0000 UTC []},{21c9f99f8e0f62f0fef7194e52e8782c8c737e0a 2022-01-22 00:00:00 +0000 UTC []}]
Unfortunately, there is another bug that caused the first search to be pre-empted as well. As an optimization, we preemptively remove any queries that would fall before the earliest commit in the repo. To do this we load the oldest commit; however, the code that finds it can return a commit that is not the oldest, only one that has no parent commit.
In this case there are 5 commits with no parent, and only one of them would have triggered a search on the first data point.
Originally posted by @coury-clark in https://github.com/sourcegraph/sourcegraph/issues/30255#issuecomment-1026190792
If a commit falls before the oldest indexed commit, we should always perform an uncompressed query. We can only treat the lack of commits in a time frame from the index as valid if the current commit falls somewhere in the range min(committed_at where repo_id = %s) <= committer time < last_indexed_at where repo_id = %s.