LSIF: Delete indexed files not known by git
Created by: efritz
Indexers such as lsif-tsc will add index entries for:
- files outside of the dump root (e.g. will include ../node_modules/ when indexing web/)
- files outside of the repository root (e.g. will include ../../lsif-tsc/node_modules/typescript/lib/lib.es2015.symbol.d.ts)
- files in the repository/dump root that may be untracked by git (e.g. node_modules, other generated files)
For the most part, this is just wasted space in the dump. However, it also shows up in jump-to-definition and find-references results as locations that lead to a 404 in the UI. This PR removes references to any location that we can't jump to in the UI. It does not change any hover data, though: if we pull a hover definition out of lib.es2015.symbol.d.ts, it will still be in the resulting dump (we just won't give you the option to jump to a 404).
@sourcegraph/core-services, please review the general algorithm implemented in lsif/src/worker/conversion/visibility.ts, described below, as it adds additional queries to gitserver at conversion time. If there are any concerns about scale, or issues we need to address before this is merged, please raise them sooner rather than later.
For each document in the LSIF dump (this includes each source file in the root, and possibly additional files that it imports from vendored dependencies), we need to determine if the document path is a file known to git. We do this with the following steps:
- determine the dirname of the file relative to the repository root
- return early if it's outside of the repo (starts with ..) as git ls-tree will fail here
- run a (cached) git ls-tree on each ancestor of the dirname, from the repository root down; if any ancestor returns an empty child set, we can exit early
- determine if the path exists in the child set of the parent directory
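The steps above can be sketched roughly as follows. This is not the code in visibility.ts, just an illustration of the idea. `GetChildren` and `isPathVisible` are hypothetical names, the listing function is synchronous here for brevity (a real gitserver query would be asynchronous), and the input path is assumed to be already normalized relative to the repository root.

```typescript
// Hypothetical signature: returns the direct children of a directory as paths
// relative to the repository root, as a non-recursive `git ls-tree` would.
// An empty set means the directory is not known to git.
type GetChildren = (dirname: string) => Set<string>

// Returns true if `relPath` names a file known to git. `relPath` is assumed to
// be relative to the repository root and to use forward slashes.
function isPathVisible(relPath: string, getChildren: GetChildren): boolean {
    // Paths outside of the repository cannot be listed by git ls-tree.
    if (relPath === '..' || relPath.startsWith('../')) {
        return false
    }

    const segments = relPath.split('/').filter(s => s !== '' && s !== '.')
    if (segments.length === 0) {
        return false
    }

    // Walk from the repository root ('') down to the parent directory.
    let dirname = ''
    for (let i = 0; i < segments.length; i++) {
        const children = getChildren(dirname)
        if (children.size === 0) {
            // Untracked ancestor: nothing below it can be tracked either.
            return false
        }
        if (i === segments.length - 1) {
            // `dirname` is now the parent directory of the file itself.
            return children.has(segments.join('/'))
        }
        dirname = dirname === '' ? segments[i] : `${dirname}/${segments[i]}`
    }

    return false
}
```

Passing the listing function in as a parameter keeps the check independent of how gitserver is queried, so a cache can be layered on without touching the walk.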
Each git ls-tree command is non-recursive and we memoize the results of each call. This means that we add at most one git ls-tree call per unique parent directory of the LSIF index. Because we also query from the root of the repo down to the leaf, once we hit an untracked vendor root we don't need to go any further (think lsif-go indexing vendor/github.com/..., or lsif-tsc indexing node_modules in the root or a subdir). This will cut out any git ls-tree call for a non-existent directory within these vendor directories.
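To make the cost argument concrete, here is a small sketch of the memoization and the root-to-leaf early exit. `memoizeListing` and `hasTrackedAncestors` are hypothetical helpers, the listing is synchronous for brevity, and the in-memory `tracked` table (with made-up paths) stands in for real git ls-tree results.

```typescript
// Hypothetical: performs one non-recursive listing of `dirname`.
type ListDirectory = (dirname: string) => Set<string>

// Wraps `list` so that each directory is listed at most once.
function memoizeListing(list: ListDirectory): ListDirectory {
    const cache = new Map<string, Set<string>>()
    return dirname => {
        let children = cache.get(dirname)
        if (children === undefined) {
            children = list(dirname)
            cache.set(dirname, children)
        }
        return children
    }
}

// Lists the ancestors of `relPath` from the repository root ('') down to its
// parent directory, stopping at the first one with no tracked children.
function hasTrackedAncestors(relPath: string, list: ListDirectory): boolean {
    const dirSegments = relPath.split('/').slice(0, -1)
    for (let depth = 0; depth <= dirSegments.length; depth++) {
        const dirname = dirSegments.slice(0, depth).join('/')
        if (list(dirname).size === 0) {
            return false
        }
    }
    return true
}

// Stand-in for git: only the root and cmd/ contain tracked files.
const tracked: Record<string, string[]> = {
    '': ['go.mod', 'cmd'],
    cmd: ['cmd/main.go'],
}

let calls = 0
const list = memoizeListing(dirname => {
    calls++ // counts listings that would actually reach gitserver
    return new Set(tracked[dirname] || [])
})

// Lists '' and 'vendor'; 'vendor' is empty, so nothing deeper is queried.
const first = hasTrackedAncestors('vendor/github.com/a/b/c.go', list)
// Both '' and 'vendor' are served from the cache: no new listings.
const second = hasTrackedAncestors('vendor/github.com/x/y/z.go', list)
// '' is cached; only 'cmd' is a new listing.
const third = hasTrackedAncestors('cmd/main.go', list)
```

After these three checks `calls` is 3 (for '', 'vendor', and 'cmd'), even though the two vendored paths together have many more ancestor directories.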
If there are more intelligent ways to batch these requests, I'd be happy to hear them. See this Slack thread for some additional context.