Skip to content
Snippets Groups Projects

insights: add a new insightsCount compute command to support insights work in aggregating search results

Open Warren Gifford requested to merge cw/insights-compute-cmd into main

Created by: chwarwick

This adds a new compute command for use by Code Insights to support on going work to provide aggregates of search results.

This initial pr adds the command and the ability to return counts that can be summed by a calling client for Repo, Path, Author, and Language. Additional work will extend this command's return type to optionally include a repo id so it can be used for insights that are persisted as well as add the ability to count capture group matches like the MatchOnly command currently does.

typical usage:

repo:^github.com/org/myrepo$ lang:Go content:insights.count(my search -> $REPO) count:all patterntype:standard

I choose to add a new command because I wanted to constrain the behaviors of the command to the minimum needed to support the code insights functionality under development.

Known issues:

  • Currently search queries passed to compute are initially parsed as patterntype:regexp, this can lead to parsing errors if the search pattern contains an invalid regex. For example i++. I think this is something that can be resolved in followup work.
  • Currently search queries passed to compute without a patterntype run as a regexp patterntype and do not follow the same default logic that search does which would run as patterntype:standard

There is a very similar issue raised in https://github.com/sourcegraph/sourcegraph/issues/40125

Test plan

Unit tests pass

GET - q= content:insights.count(todo -> $REPO) patterntype:literal count:10

event: results
data: [{"value":"github.com/sourcegraph/sourcegraph","count":10}]

event: progress
data: {"done":false,"matchCount":0,"durationMs":33,"skipped":[{"reason":"shard-match-limit","title":"result limit hit","message":"Not all results have been returned due to hitting a match limit. Sourcegraph has limits for the number of results returned from a line, document and repository.","severity":"info","suggested":{"title":"increase limit","queryExpression":"count:1000"}}]}

event: progress
data: {"done":true,"matchCount":0,"durationMs":33,"skipped":[{"reason":"shard-match-limit","title":"result limit hit","message":"Not all results have been returned due to hitting a match limit. Sourcegraph has limits for the number of results returned from a line, document and repository.","severity":"info","suggested":{"title":"increase limit","queryExpression":"count:1000"}}]}

event: done
data: {}

Closes https://github.com/sourcegraph/sourcegraph/issues/39776

Merge request reports

The source branch cw/insights-compute-cmd does not exist. Please restore it or use a different source branch.
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Created by: rvantonder

    Review: Commented

    I choose to add a new command because I wanted to constrain the behaviors of the command to the minimum needed to support the code insights functionality under development.

    I hesitate that this is a good way to introduce the notion of aggregations and in the compute package itself. Aggregations are a general-purpose concept, and should be treated as a first-class data processing layer, and this addition feels like a diversion from those principles. For example, I would expect something more in line with using the current output/match-only command to ingest the output in an actual aggregation layer (i.e., in the code insights client), rather than expanding the core compute language with a side-command that is essentially a hardcoded aggregation. This would keep the compute layer/implementation "clean" and allow focus on a pure aggregation layer. Building aggregation into compute means thinking carefully about exposing such functionality in a lanuage/syntax more amenable to aggregations, like where clauses or aggregation functions in other languages. This design and implementation should be part of the exercise, and a first class concern, not an independent one-off, ancillary command.

    I am on time off, and would really like to give more guidance/input to these additions in the compute package. Please note any external clients or addition to clients that use the current state of the compute package (like using the Code Insights streaming client and, as I suggest, doing aggregations on that output) do not need my input. But I have asked to be involved with direct changes to the compute package, and that such changes/requests not be bound by time-sensitive constraints, so I would prefer to guide this specific PR further once I'm back.

Please register or sign in to reply
Loading