Skip to content

search: fix indexed search match count

Administrator requested to merge rvt/fix-index-countr-regression into master

Created by: rvantonder

This fixes #11236, a regression introduced by https://github.com/sourcegraph/sourcegraph/pull/9541. The MatchCount value should be set on the Zoekt path.

While investigating this I discovered an interesting match count discrepancy. Irrespective of the regression, and prior to the bug I introduced, we had been calculating the number of matches returned by Zoekt by counting the number of lines. This stems from returning lines here on the Zoekt path, and using len(fm.LinesMatches) prior to using the MatchCount member that I introduced.

So three matches of foo on a line like foo foo foo would only count as one match on the indexed search path. To count the three separate foos, we need to calculate the number based off of offsets AKA fragments (which is what this PR does). Now this prompted the question of whether we intended to report the number of line matches, or the number of fragments. I believe we really do mean to report the number of fragments (since that seems reasonable), but also because that's what the unindexed search part did before the change (and this behavior is preserved, with MatchCount) but only because the logic in the unindexed search path always adds only one fragment per line match, as you can see here.

So, there is/was a discrepancy of match count in the indexed/unindexed search paths before the regression, and this PR addresses that bug as well. If we really mean to report the number of lines, and not fragments, then it implies that the unindexed search part contains a bug (reports more matches than number of lines). I do think we mean to report the number of fragments.

But there's another interesting tidbit: Zoekt returns a MatchCount member, for example, we access it here: resp.MatchCount. I thought that I could use that MatchCount value to return the number of matches, because the docstring says:

// Number of non-overlapping matches ref

And from this description I assumed it meant it counts multiple matches on a line, since I'm not sure what "non-overlapping matches" would mean if they are separated by lines, which are already discrete, non-overlapping data. Maybe it means something entirely different, because this MatchCount only counts the number of lines, not the number of matches in a line. It's also possible that's a bug, but I am not familiar enough with what this data member of Zoekt is trying to say, so CC @keegancsmith, take from it what you will.


It took me ~15 minutes to verify a fix, and 1.5+ hours to write the test, and that's really frustrating. I had to keep adding data to the test case so the function wouldn't nil panic all over the place until finally I wound up with a ~60 line struct :/. This would be better as an integration test, but the infra isn't ready yet. CC @unknwon I'm just pointing out this is important and a real pain point, I'm hoping it comes soon 🤞

Merge request reports

Loading