Make Code Monitors repo-aware
Created by: camdencheek
Summary
Code Monitors currently suffer from an issue where, when searching multiple repositories or revisions, results may be missing or duplicated.
Multiple repo search
Currently, we run the provided search query, then if we get results, we save the timestamp of the latest result as a starting point for the next time the code monitor search query triggers by appending `after:` to the query. This is problematic with multiple repos because results may be returned for one repo before valid results from another repo have been pulled from the code host.
Consider the following race:
- 1:05 - repo A is fetched from the code host
- 1:10 - repo B is fetched from the code host
- 1:11 - repo A and repo B are searched, with filter `after:"1:01"`. One result is returned for repo B with timestamp 1:09
- 1:12 - repo A is fetched. One commit is pulled with timestamp 1:06 that would match the monitor query
- 1:13 - repo A and repo B are searched, with filter `after:"1:09"`. No results are returned because the matching commit in repo A is filtered out
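The race above can be replayed in a few lines. This is only an illustrative simulation, not code from the code monitors backend: the `Commit` type, the `search` helper, and both runner functions are hypothetical, and times are expressed as minutes past 1:00.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    repo: str
    timestamp: int   # commit time, in minutes past 1:00
    fetched_at: int  # minute at which the commit became searchable


def search(commits, now, repo, after):
    """Return commits in `repo` that are already fetched and newer than `after`."""
    return [
        c for c in commits
        if c.repo == repo and c.fetched_at <= now and c.timestamp > after
    ]


def run_global_cursor(commits, repos, search_times, start):
    """One shared `after:` cursor for all repos (the current behavior described above)."""
    cursor, found = start, []
    for now in search_times:
        results = [c for repo in repos for c in search(commits, now, repo, cursor)]
        found.extend(results)
        if results:
            cursor = max(c.timestamp for c in results)
    return found


def run_per_repo_cursor(commits, repos, search_times, start):
    """One cursor per repo, so a result in repo B cannot hide a commit in repo A."""
    cursors = {repo: start for repo in repos}
    found = []
    for now in search_times:
        for repo in repos:
            results = search(commits, now, repo, cursors[repo])
            found.extend(results)
            if results:
                cursors[repo] = max(c.timestamp for c in results)
    return found


# The trace from the list above: repo B's commit (1:09) is searchable at 1:10,
# repo A's commit (1:06) only becomes searchable at 1:12.
COMMITS = [Commit("B", timestamp=9, fetched_at=10), Commit("A", timestamp=6, fetched_at=12)]
SEARCH_TIMES = [11, 13]  # searches at 1:11 and 1:13
```

With the shared cursor, `run_global_cursor(COMMITS, ["A", "B"], SEARCH_TIMES, 1)` only ever reports the commit in repo B. With per-repo cursors, the commit in repo A is found by the 1:13 search. A per-repo timestamp alone would still not cover the multiple-revision case described in the next section.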
Multiple revision search
In addition to the repo fetch race condition, we have a similar race condition with searching multiple commit refs. A branch may be pushed to after a result is found in a different branch in the same repo.
Consider the following race:
- create commit1 on branch1
- create commit2 on main
- push main to codehost
- gitserver fetches commit2 on main
- run `type:diff repo:myrepo rev:*refs/heads/* test`, get commit2 as a result
- push branch1 to codehost
- gitserver fetches commit1 on branch1
- run `type:diff repo:myrepo rev:*refs/heads/* test after:<commit2 timestamp>`
- get no results because commit1 has an earlier timestamp than commit2, so the `after:` filter excludes it
Proposed solution
In order to correctly start where we left off, I think we need to make code monitors repo-aware. That is, keep track of where we left off in each repo between searches.
Additionally, we will need to change the way we store the point we left off. We should move from saving a timestamp of the last result commit to saving a list of hashes that we previously searched from (more detail in the linked PR below).
In order to successfully make code monitors repo-aware, the code monitors backend will need to run each commit/diff search itself so that it can expand the refs into commit hashes and store the expanded refs for the next search. For this to be possible, I think we will need to complete some refactors up to the point where we have a "search plan" that the code monitors backend can mutate before kicking off the commit/diff search jobs.
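As a rough sketch of why saved hashes avoid the revision race, the snippet below treats "new since last run" as the commits reachable from the currently expanded refs but not from the hashes saved on the previous run (similar in spirit to `git log <new> --not <old>`). This is one possible reading of the proposal, not the approach taken in the linked PR; the `parents` map and both function names are assumptions made for illustration.

```python
def reachable(parents, heads):
    """Return every commit reachable from `heads` by following parent links."""
    seen = set()
    stack = list(heads)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


def new_commits(parents, current_heads, previous_heads):
    """Commits reachable from the current heads but not from the saved ones."""
    return reachable(parents, current_heads) - reachable(parents, previous_heads)


# Hypothetical history for the race above: commit1 (branch1) and commit2 (main)
# both sit on top of a shared commit, "base".
PARENTS = {"base": [], "commit1": ["base"], "commit2": ["base"]}

# First run: only main has been fetched. The monitor previously saw "base".
first_run = new_commits(PARENTS, ["commit2"], ["base"])

# Second run: branch1 has now been fetched as well. The saved hashes are ["commit2"].
second_run = new_commits(PARENTS, ["commit2", "commit1"], ["commit2"])
```

Here `first_run` contains only `commit2` and `second_run` contains only `commit1`, even though commit1 was created before commit2. Because the comparison is by reachability rather than by timestamp, the order in which branches are pushed no longer matters.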
Tasks
- Figure out a way to start exactly from where we previously left off (#27301)
- Complete refactoring up to the point that code monitors can create a search plan outside of the normal search path
- Rewrite code monitors to 1) generate a search plan, 2) expand revs for each repo, 3) save the expanded revs for each repo, 4) execute the search for each repo
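The four steps in the last task could fit together roughly as follows. This is a minimal sketch under stated assumptions: `SearchPlan`, `MonitorState`, `run_monitor`, and the injected `expand_revs` and `search_repo` callables are all hypothetical names, not existing Sourcegraph APIs, and the real search plan will likely carry much more than a list of revision specs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shapes, for illustration only.
SearchPlan = Dict[str, List[str]]    # repo name -> revision specs (e.g. ref globs)
MonitorState = Dict[str, List[str]]  # repo name -> commit hashes searched last time


def run_monitor(
    plan: SearchPlan,
    state: MonitorState,
    expand_revs: Callable[[str, List[str]], List[str]],
    search_repo: Callable[[str, List[str], List[str]], List[str]],
) -> Tuple[Dict[str, List[str]], MonitorState]:
    """Run one monitor trigger; return per-repo results and the state to persist."""
    results: Dict[str, List[str]] = {}
    new_state: MonitorState = {}
    for repo, rev_specs in plan.items():   # 1) the plan is generated by the caller
        hashes = expand_revs(repo, rev_specs)  # 2) expand revs into commit hashes
        new_state[repo] = hashes               # 3) record the expanded revs
        previous = state.get(repo, [])
        # 4) search this repo from the expanded hashes, excluding what was
        #    already covered by the previously saved hashes
        results[repo] = search_repo(repo, hashes, previous)
    return results, new_state
```

The sketch returns the new state instead of writing it, so the caller decides when to persist it (for example, only after the searches succeed). Keeping the state keyed by repo is what lets a slow fetch in one repo leave the starting point of every other repo untouched.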