Expose full repository languages/file types inventory via GraphQL
Created by: sqs
Backend
Each repository has a full "inventory", which is the result of processing all of its files, determining their languages, and counting the bytes of each language/file type. This is not exposed in GraphQL; only the top language is exposed (via Repository.language). We should expose the full inventory so that API consumers can see all of the languages in use in the repository.
Also, it should include line counts, not just byte counts.
Details: For every Git tree and blob, there should be a GraphQL field inventory that exposes:
- All languages found in the tree/file (files obviously only have one language right now), as an array containing the following for each detected language:
  - Language name and other metadata (eg "is this markup or code?") that the inventory package already produces
  - Number of bytes of this language (have not actually heard this from a customer)
  - Number of lines of code of this language (can be naive lines, no need to use SLOC or anything)
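To make the per-language entry concrete, here is a rough sketch of the data each inventory entry could carry and how a tree's totals could be summed from its files. It is written in Python purely for illustration. The names `LanguageStats`, `blob_inventory`, `tree_inventory`, and the `EXTENSIONS` table are hypothetical, and the extension lookup is only a stand-in for the detection the existing inventory package already does.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LanguageStats:
    name: str             # language name, e.g. "Go"
    kind: str             # other metadata, e.g. "programming" or "markup"
    total_bytes: int = 0
    total_lines: int = 0


def count_lines(content: bytes) -> int:
    """Naive line count: newline characters, plus a final unterminated line."""
    if not content:
        return 0
    lines = content.count(b"\n")
    if not content.endswith(b"\n"):
        lines += 1
    return lines


# Hypothetical stand-in for the existing inventory package's language detection.
EXTENSIONS = {
    ".go": ("Go", "programming"),
    ".md": ("Markdown", "prose"),
    ".html": ("HTML", "markup"),
}


def blob_inventory(path: str, content: bytes) -> List[LanguageStats]:
    """Inventory for a single file: at most one language."""
    for ext, (name, kind) in EXTENSIONS.items():
        if path.endswith(ext):
            return [LanguageStats(name, kind, len(content), count_lines(content))]
    return []


def tree_inventory(files: Dict[str, bytes]) -> List[LanguageStats]:
    """Inventory for a tree: per-language totals over all files beneath it."""
    totals: Dict[str, LanguageStats] = {}
    for path, content in files.items():
        for entry in blob_inventory(path, content):
            agg = totals.setdefault(entry.name, LanguageStats(entry.name, entry.kind))
            agg.total_bytes += entry.total_bytes
            agg.total_lines += entry.total_lines
    # Largest language first; ties broken by name for a stable order.
    return sorted(totals.values(), key=lambda s: (-s.total_bytes, s.name))
```

A blob returns the same shape as a tree (a list with zero or one entries), so the GraphQL field could have one type for both.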
The most common use case is to get this info for a repository at its HEAD commit, so there should be a Repository.inventory field that returns this info for the HEAD commit's root Git tree.
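For discussion, this is roughly what an API consumer's call could look like. The query shape is an assumption, not a decided schema: the `languages`, `totalBytes`, and `totalLines` subfields are illustrative names only, and the `repository(name:)` lookup is assumed here as well. The Python helpers just build the request body and flatten a response.

```python
import json
from typing import Dict, Tuple

# Hypothetical query shape; every field name under `inventory` is illustrative.
REPOSITORY_INVENTORY_QUERY = """
query RepositoryInventory($name: String!) {
  repository(name: $name) {
    inventory {
      languages {
        name
        totalBytes
        totalLines
      }
    }
  }
}
"""


def build_request(repo_name: str) -> str:
    """Serialize the GraphQL request body for one repository."""
    return json.dumps(
        {"query": REPOSITORY_INVENTORY_QUERY, "variables": {"name": repo_name}}
    )


def parse_languages(response_body: str) -> Dict[str, Tuple[int, int]]:
    """Map language name -> (bytes, lines); empty if the repository is missing."""
    repo = json.loads(response_body)["data"]["repository"]
    if repo is None or repo.get("inventory") is None:
        return {}
    return {
        lang["name"]: (lang["totalBytes"], lang["totalLines"])
        for lang in repo["inventory"]["languages"]
    }
```

The same `inventory` selection would presumably also be available on tree and blob types, with Repository.inventory acting as a shortcut to the HEAD commit's root tree.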
The language determination from the existing inventory package is sufficient. No need to make it more advanced.
The expected usage patterns are mainly:
- Querying across many/all repositories (eg up to 30k repositories)
  - This should be possible, but it is OK if it takes minutes or hours and requires repeated calls until the data is computed and cached for every repository. Basically, there should be a way to get this info for that many repositories, but it doesn't need to be instant or even very fast.
- Seeing this in the UI (eg on a tree page) like the GitHub languages breakdown on the repo page
  - So caching this computation is probably a good idea. It doesn't need to be precomputed; it can be computed on-the-fly if not cached.
UI
- Show this on the directory page
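One possible shape for the "compute on-the-fly if not cached" and "repeated calls" behavior described in the usage patterns, sketched in Python. It assumes the cache is keyed by Git tree object ID. Since a tree's ID is a hash of its contents, an entry never needs invalidation. `InventoryCache`, `collect`, and the per-call time budget are all hypothetical names and choices, and the in-memory dict stands in for whatever store would really be used.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Inventory = Dict[str, int]  # simplified: language name -> line count


class InventoryCache:
    """Cache keyed by tree OID; computes on the fly when an entry is missing."""

    def __init__(self, compute: Callable[[str], Inventory]) -> None:
        self._compute = compute
        self._entries: Dict[str, Inventory] = {}

    def peek(self, tree_oid: str) -> Optional[Inventory]:
        """Return the cached inventory, or None without computing anything."""
        return self._entries.get(tree_oid)

    def get_or_compute(self, tree_oid: str) -> Inventory:
        cached = self._entries.get(tree_oid)
        if cached is None:
            cached = self._compute(tree_oid)
            self._entries[tree_oid] = cached
        return cached


def collect(
    cache: InventoryCache,
    tree_oids: Iterable[str],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Dict[str, Inventory], List[str]]:
    """Return (done, pending). Cached entries are always returned; uncached
    ones are computed only while the time budget lasts. Callers repeat the
    call with the pending list until it comes back empty."""
    deadline = clock() + budget_seconds
    done: Dict[str, Inventory] = {}
    pending: List[str] = []
    for oid in tree_oids:
        cached = cache.peek(oid)
        if cached is not None:
            done[oid] = cached
        elif clock() < deadline:
            done[oid] = cache.get_or_compute(oid)
        else:
            pending.append(oid)
    return done, pending
```

With something like this, a query across 30k repositories makes progress on every call without any single request having to run for minutes, and the tree page gets a fast answer once the entry is warm.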
Related Info
Related: #2586 (closed)
Customers:
- https://app.hubspot.com/contacts/2762526/company/464956351
- https://app.hubspot.com/contacts/2762526/company/732945850
- https://app.hubspot.com/contacts/2762526/company/557476190
- https://app.hubspot.com/contacts/2762526/company/419771425
- https://app.hubspot.com/contacts/2762526/company/407948923
- https://github.com/sourcegraph/issues-uber/issues/188