more accurate file language detection using file content heuristics

Warren Gifford requested to merge enhanced-lang-detection into master

Created by: sqs

  • use src-d's enry (Go linguist port) for lang detection
  • clean up lang detection and inventory code
  • use more accurate heuristics for lang detection

With this PR, we now use enry (https://github.com/src-d/enry), src-d's Go port of GitHub's linguist, to detect the programming language of a file. Previously we detected the language from the filename alone, which leads to incorrect results for many extensions, such as .m (Objective-C, Mathematica, or Matlab?).
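To illustrate why the filename alone is not enough for .m, here is a minimal stdlib-only sketch. It is not enry's algorithm or API: the `candidatesByExt` table, the `guessLanguage` function, and the string markers it checks are all hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// candidatesByExt is a tiny hypothetical table; a real detector carries far more entries.
var candidatesByExt = map[string][]string{
	".m":  {"Objective-C", "Mathematica", "MATLAB"},
	".go": {"Go"},
}

// guessLanguage returns a language when the filename, plus the content if
// needed, settles the question, and "" when it stays undecided.
func guessLanguage(filename string, content []byte) string {
	ext := strings.ToLower(filepath.Ext(filename))
	cands := candidatesByExt[ext]
	if len(cands) == 1 {
		return cands[0] // the extension alone is unambiguous
	}
	if ext != ".m" {
		return "" // unknown extension, or no heuristic for it in this sketch
	}
	// Illustrative content markers for .m only (assumptions, not enry's rules).
	src := strings.TrimSpace(string(content))
	switch {
	case strings.Contains(src, "@interface"),
		strings.Contains(src, "@implementation"),
		strings.Contains(src, "#import"):
		return "Objective-C"
	case strings.Contains(src, "(*") && strings.Contains(src, "*)"):
		return "Mathematica"
	case strings.HasPrefix(src, "%"), strings.HasPrefix(src, "function "):
		return "MATLAB"
	}
	return ""
}

func main() {
	objc := []byte("#import <Foundation/Foundation.h>\n@interface Foo : NSObject\n@end\n")
	matlab := []byte("% add one\nfunction y = f(x)\n  y = x + 1;\nend\n")
	fmt.Println(guessLanguage("foo.m", objc))
	fmt.Println(guessLanguage("foo.m", matlab))
}
```

Without content, `guessLanguage("foo.m", nil)` stays undecided, which is the situation filename-only detection is always in for such extensions.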

This commit adds a feature flag, the env var USE_ENHANCED_LANGUAGE_DETECTION (default true in dev, false otherwise), which enables enhanced language detection using file contents. This mode is significantly slower because it needs to read most files in each repository. The existing (default) behavior is virtually unchanged, except that vendor files are now excluded.
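As a rough sketch of how such a flag could be resolved (not the code in this PR): the function name `enhancedDetectionEnabled`, the `isDev` parameter, and the use of `strconv.ParseBool` for parsing are assumptions. Only the env var name and the dev/non-dev defaults come from the description above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// enhancedDetectionEnabled reports whether content-based detection should run.
// lookup has the same shape as os.LookupEnv, so it can be swapped out in tests.
// isDev is a hypothetical stand-in for however the application knows it is in dev mode.
func enhancedDetectionEnabled(lookup func(string) (string, bool), isDev bool) bool {
	if v, ok := lookup("USE_ENHANCED_LANGUAGE_DETECTION"); ok {
		// An explicit, parseable value always wins over the default.
		if b, err := strconv.ParseBool(v); err == nil {
			return b
		}
	}
	// Unset or unparseable: default true in dev, false otherwise.
	return isDev
}

func main() {
	fmt.Println(enhancedDetectionEnabled(os.LookupEnv, false))
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the default logic easy to test without mutating the process environment.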

We will evaluate and tune the performance of the USE_ENHANCED_LANGUAGE_DETECTION mode while the feature flag defaults to off outside dev. We intend to support accurate language detection and to eventually enable the feature flag by default.

(See commit messages for details.)

This is a precursor to better language detection and stats as described in #5287, #2586 (closed), #2587 (closed), and #1235 (closed), and for https://app.hubspot.com/contacts/2762526/company/556068698, https://app.hubspot.com/contacts/2762526/company/464956351, and others.