insights: add a new insightsCount compute command to support insights work in aggregating search results
Created by: chwarwick
This adds a new compute command for use by Code Insights to support on going work to provide aggregates of search results.
This initial pr adds the command and the ability to return counts that can be summed by a calling client for Repo, Path, Author, and Language. Additional work will extend this command's return type to optionally include a repo id so it can be used for insights that are persisted as well as add the ability to count capture group matches like the MatchOnly
command currently does.
typical usage:
repo:^github.com/org/myrepo$ lang:Go content:insights.count(my search -> $REPO) count:all patterntype:standard
I choose to add a new command because I wanted to constrain the behaviors of the command to the minimum needed to support the code insights functionality under development.
Known issues:
- Currently search queries passed to compute are initially parsed as
patterntype:regexp
, this can lead to parsing errors if the search pattern contains an invalid regex. For examplei++
. I think this is something that can be resolved in followup work. - Currently search queries passed to compute without a patterntype run as a regexp patterntype and do not follow the same default logic that search does which would run as
patterntype:standard
There is a very similar issue raised in https://github.com/sourcegraph/sourcegraph/issues/40125
Test plan
Unit tests pass
GET - q= content:insights.count(todo -> $REPO) patterntype:literal count:10
event: results
data: [{"value":"github.com/sourcegraph/sourcegraph","count":10}]
event: progress
data: {"done":false,"matchCount":0,"durationMs":33,"skipped":[{"reason":"shard-match-limit","title":"result limit hit","message":"Not all results have been returned due to hitting a match limit. Sourcegraph has limits for the number of results returned from a line, document and repository.","severity":"info","suggested":{"title":"increase limit","queryExpression":"count:1000"}}]}
event: progress
data: {"done":true,"matchCount":0,"durationMs":33,"skipped":[{"reason":"shard-match-limit","title":"result limit hit","message":"Not all results have been returned due to hitting a match limit. Sourcegraph has limits for the number of results returned from a line, document and repository.","severity":"info","suggested":{"title":"increase limit","queryExpression":"count:1000"}}]}
event: done
data: {}
Closes https://github.com/sourcegraph/sourcegraph/issues/39776
Merge request reports
Activity
Created by: rvantonder
Review: Commented
I choose to add a new command because I wanted to constrain the behaviors of the command to the minimum needed to support the code insights functionality under development.
I hesitate that this is a good way to introduce the notion of aggregations and in the
compute
package itself. Aggregations are a general-purpose concept, and should be treated as a first-class data processing layer, and this addition feels like a diversion from those principles. For example, I would expect something more in line with using the current output/match-only command to ingest the output in an actual aggregation layer (i.e., in the code insights client), rather than expanding the core compute language with a side-command that is essentially a hardcoded aggregation. This would keep the compute layer/implementation "clean" and allow focus on a pure aggregation layer. Building aggregation into compute means thinking carefully about exposing such functionality in a lanuage/syntax more amenable to aggregations, likewhere
clauses or aggregation functions in other languages. This design and implementation should be part of the exercise, and a first class concern, not an independent one-off, ancillary command.I am on time off, and would really like to give more guidance/input to these additions in the
compute
package. Please note any external clients or addition to clients that use the current state of thecompute
package (like using the Code Insights streaming client and, as I suggest, doing aggregations on that output) do not need my input. But I have asked to be involved with direct changes to thecompute
package, and that such changes/requests not be bound by time-sensitive constraints, so I would prefer to guide this specific PR further once I'm back.