insights: backend implementation plan for language insights
Created by: slimsag
Currently the insights backend can run search queries and store the number of results found per repository.
For language insights, we want to do almost the same thing except we want to query statistics about the breakdown of languages returned by the search result (e.g. if searching for a generic term like `error`, how many lines were in Go and how many in Python).
- Schema: https://sourcegraph.com/github.com/sourcegraph/sourcegraph@21751566d524d1f82ed046a819d72255217f9c3f/-/blob/cmd/frontend/graphqlbackend/schema.graphql#L1911-1915
- How a query would look roughly: https://sourcegraph.com/github.com/sourcegraph/sourcegraph-code-stats-insights@9da23e6399843d06c0597af1b91e65756c928016/-/blob/src/code-stats-insights.ts#L42-54
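
To make the language breakdown concrete, here is a minimal Go sketch of decoding such a response. The response shape (`data.search.stats.languages`) and the field names `name` and `totalLines` are assumptions based on the linked schema and query, and `LanguageStat` and `parseLanguageStats` are hypothetical names used only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LanguageStat mirrors one entry of the language breakdown.
// The field names are assumptions, not verified against the real schema.
type LanguageStat struct {
	Name       string `json:"name"`
	TotalLines int    `json:"totalLines"`
}

// parseLanguageStats decodes a GraphQL response whose assumed shape is
// {"data":{"search":{"stats":{"languages":[...]}}}}.
func parseLanguageStats(payload []byte) ([]LanguageStat, error) {
	var resp struct {
		Data struct {
			Search struct {
				Stats struct {
					Languages []LanguageStat `json:"languages"`
				} `json:"stats"`
			} `json:"search"`
		} `json:"data"`
	}
	if err := json.Unmarshal(payload, &resp); err != nil {
		return nil, err
	}
	return resp.Data.Search.Stats.Languages, nil
}

func main() {
	// Made-up sample payload for a generic query such as "error".
	payload := []byte(`{"data":{"search":{"stats":{"languages":[
		{"name":"Go","totalLines":1200},
		{"name":"Python","totalLines":300}
	]}}}}`)
	stats, err := parseLanguageStats(payload)
	if err != nil {
		panic(err)
	}
	for _, s := range stats {
		fmt.Printf("%s: %d lines\n", s.Name, s.TotalLines)
	}
}
```

Each decoded entry is a candidate data point: the line count would become the recorded value and the language name would travel alongside it.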
Supporting this would be pretty straightforward:
- Define a new type of insight in the settings schema, e.g. instead of `search` maybe we have `search` AND `"languageStats": true` to indicate what should be captured is language stats https://sourcegraph.com/github.com/sourcegraph/sourcegraph@21751566d524d1f82ed046a819d72255217f9c3f/-/blob/schema/settings.schema.json#L377-399
- Ensure these new types of insights have a unique ID. That is, change this hash function so that `search:"foobar"` and `search:"foobar" languageStats:true` produce two different hash IDs. This ensures they are represented uniquely in the database and their data does not overlap/conflict (which would be REALLY confusing and impossible to undo) https://sourcegraph.com/github.com/sourcegraph/sourcegraph@2175156/-/blob/enterprise/internal/insights/discovery/series_id.go#L10-26
- Similar to how we pass the `search_query` to the `queryrunner` worker, we will need to pass the `language_stats` boolean property to the query runner. To do this, we can add a field `language_stats` to the DB schema like this.
- Actually pass the `language_stats` property to the queryrunner worker. Basically just duplicate these lines and these lines.
- Now when the queryrunner `work_handler.go` is performing a search query, it will have access to `job.LanguageStats` in the same way it does `job.SearchQuery` here.
- It is now as simple as recording whatever information we want from the GraphQL response here by calling `RecordSeriesPoint` with whatever `Value` we want. We can choose to either specify the `Repo` fields as we do today (and record per-repo language breakdowns) OR omit those fields and record a single global value. Probably we want repo, so I'd go with that.
- For "how to store the actual language names", I would use the `Metadata` field when recording and simply pass an `[]string{"Go", "Java", "etc"}` into it.
- For querying the data, it will already return with the `Metadata` today (but we may need to expose it to GraphQL), if we want to do more advanced filtering ("give me only Go data points") that is possible with the jsonb indexing which is already set up.
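
The unique-ID requirement in the list above can be sketched roughly as follows. This is a standalone illustration, not the code in `series_id.go`: the `insightSeries` type, the `seriesID` function, and the `s:` / `l:` prefixes are all hypothetical, and the real hash function may be structured differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// insightSeries is a hypothetical, trimmed-down series definition
// holding only the two settings discussed above.
type insightSeries struct {
	Search        string
	LanguageStats bool
}

// seriesID hashes the search query and prefixes the digest by series kind,
// so the same query with and without languageStats never shares an ID.
func seriesID(s insightSeries) string {
	sum := sha256.Sum256([]byte(s.Search))
	prefix := "s:" // hypothetical prefix for plain search series
	if s.LanguageStats {
		prefix = "l:" // hypothetical prefix for language-stats series
	}
	return prefix + hex.EncodeToString(sum[:])
}

func main() {
	plain := seriesID(insightSeries{Search: "foobar"})
	stats := seriesID(insightSeries{Search: "foobar", LanguageStats: true})
	fmt.Println(plain)
	fmt.Println(stats)
	fmt.Println("distinct:", plain != stats)
}
```

In this sketch the ID of a plain search series depends only on its query, so introducing the flag would not change the IDs of series that do not set it. Whether that property holds in practice depends on how the existing hash function is written.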