insights: backend implementation plan for language insights

Created by: slimsag

Currently the insights backend can run search queries and store the number of results found per repository.

For language insights, we want to do almost the same thing, except that we want to query statistics about the language breakdown of the search results (e.g. when searching for a generic term like error, how many matching lines were in Go and how many in Python).
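
As a rough illustration of the breakdown described above, here is a minimal Go sketch that sums matching lines per language. The fileMatch type and both function names are hypothetical and not part of the insights backend; the real data would come from the search GraphQL response.

```go
package main

import "sort"

// fileMatch is a hypothetical, simplified search result: one matched file,
// the repository it lives in, its detected language, and the number of
// matching lines in that file.
type fileMatch struct {
	Repo     string
	Language string
	Lines    int
}

// languageBreakdown sums the matching lines per language across all matches.
func languageBreakdown(matches []fileMatch) map[string]int {
	totals := make(map[string]int)
	for _, m := range matches {
		totals[m.Language] += m.Lines
	}
	return totals
}

// sortedLanguages returns the language names in alphabetical order so that
// callers can iterate over the breakdown deterministically.
func sortedLanguages(totals map[string]int) []string {
	names := make([]string, 0, len(totals))
	for name := range totals {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}
```

Grouping by Repo as well as Language would give the per-repo variant of the same breakdown.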

Supporting this would be pretty straightforward:

  1. Define a new type of insight in the settings schema. For example, instead of just search, a series could specify search AND "languageStats": true to indicate that language stats are what should be captured: https://sourcegraph.com/github.com/sourcegraph/sourcegraph@21751566d524d1f82ed046a819d72255217f9c3f/-/blob/schema/settings.schema.json#L377-399
  2. Ensure these new types of insights have a unique ID. That is, change the series ID hash function (linked at the end of this item) so that search:"foobar" and search:"foobar" languageStats:true produce two different hash IDs. This ensures they are represented uniquely in the database and that their data does not overlap or conflict (which would be REALLY confusing and impossible to undo): https://sourcegraph.com/github.com/sourcegraph/sourcegraph@2175156/-/blob/enterprise/internal/insights/discovery/series_id.go#L10-26
  3. Similar to how we pass the search_query to the queryrunner worker, we will need to pass the language_stats boolean property to the queryrunner worker. To do this, we can add a field language_stats to the DB schema like this.
  4. Actually pass the language_stats property to the queryrunner worker. Basically just duplicate these lines and these lines.
  5. Now when the queryrunner work_handler.go is performing a search query, it will have access to job.LanguageStats in the same way it does job.SearchQuery here.
  6. It is now as simple as recording whatever information we want from the GraphQL response here by calling RecordSeriesPoint with whatever Value we want. We can choose to either specify the Repo fields as we do today (and record per-repo language breakdowns) OR omit those fields and record a single global value. We probably want per-repo data, so I'd go with that.
  7. For "how to store the actual language names", I would use the Metadata field when recording and simply pass an []string{"Go", "Java", "etc"} into it.
  8. For querying the data, it will already be returned with the Metadata today (but we may need to expose it to GraphQL). If we want to do more advanced filtering ("give me only Go data points"), that is possible with the jsonb indexing which is already set up.
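
Steps 2 and 7 can be sketched roughly as follows. This is only an illustration under stated assumptions: the insightSeries type, the seriesID and languageMetadata functions, and the "s:" / "l:" prefixes are all hypothetical, and the real hash function in series_id.go may be structured differently. The one property that matters is that the same query with and without languageStats maps to two different IDs.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// insightSeries is a hypothetical stand-in for the settings schema type,
// reduced to the two fields that matter for this sketch.
type insightSeries struct {
	Search        string
	LanguageStats bool
}

// seriesID derives a deterministic ID for a series. The kind of series is
// mixed into both the hashed input and the visible prefix, so a plain search
// series and a language stats series for the same query never collide.
// The prefixes here are assumptions, not the existing encoding.
func seriesID(s insightSeries) string {
	prefix := "s:"
	if s.LanguageStats {
		prefix = "l:"
	}
	sum := sha256.Sum256([]byte(prefix + s.Search))
	return prefix + hex.EncodeToString(sum[:])
}

// languageMetadata encodes language names as a JSON array, which is the
// shape a jsonb metadata column could store and later filter on.
func languageMetadata(languages []string) ([]byte, error) {
	return json.Marshal(languages)
}
```

If existing series IDs are already stored in the database, the plain search branch should keep producing exactly the IDs it produces today, so that only the new language stats series get new IDs.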