Skip to content

Implement contains(...) filter predicate (RFC 254)

Created by: rvantonder

See RFC 254 for full context. This issue tracks the implementation plan for the contains filter predicate in the search query.

We should implement contains for the repo and file filters. We'll do it for repo first, then expand to file time allowing. Later, we may want to generalize contains to other filters like commits, but it is not in scope of this plan nor RFC 254.

The backend will need quite a bit of work, since we want to essentially perform subqueries based on contains to extract values that then constrain and scope the parent query. Realistically, I think we may only get to repo within the time allocated, and enable the queries and behavior as documented in RFC 254, and file is a stretch goal.

Backend

  • Add scaffolding to recognize filter predicates. This should be a map so we can do a lookup on a filter predicate and then dispatch to a function that does the work. We will only add contains as the first filter predicate in this map.

  • Parse filter predicate values. There are a couple of ways to do this. My thinking is to add a post-processing step after parsing a query into it's canonical form (simple field:value) and then parsing out the value and perform the behavior (i.e., filter predicates are "plugin" logic after parsing a query, and we activate the plugin logic based on recognizing a registered predicate).

  • Implement contains logic for repo. The reference of behavior we want to implement is documented in RFC 254.

    • Syntax: We will start with simple field:value inputs for contains, where field can be file or content and values are interpreted as regular expressions. Am a bit handy wavy on the details here, but we can reuse the field/value parser/scanners for the arguments of value to extract these values.
    • Semantics: repo:contains(v) essentially means "substitute for contains the repository names that satisfy the query v". The idea is that we want to substitute contains(v) for something like repo:foo|bar|baz so that we can simply evaluate the query with our existing search engine. Important for the implementation, is the order in which we evaluate repo:contains(...) predicates. If we have a query like repo:sourcegraph repo:contains(file:.*), we could in theory evaluate contains(...) first (resolving to all repos), and then intersect with the values satisfying repo:sourcegraph. This would be inefficient. More efficient would be if we resolved contains with respect to existing repo:sourcegraph, i.e., we run the search repo:sourcegraph file:.* select:repo and substitute the output of this value for repo:... and run our query. Thus, repo:contains(...) should be implemented as a search query using select, and give precedence to other repo: values in the query before running contains(...).
    • Proper predicate parsing #19075 (closed)
  • Deprecate repoHasFile depends on #20337 (closed)

Additional scope beyond RFC 254

  • Support repo:contains(file:foo content:bar)
  • Support repo:contains(-file:foo -content:bar)
  • Negation on repo, e.g., -repo:cointains(...)
  • Add support for repo:contains(x) repo:contains(y) i.e., repo:contains(x) and repo:contains(y), e.g., make this work repo:^github\.com/sourcegraph/ repo:contains(file:README.md) repo:contains(file:CHANGELOG). At this point, we've fulfilled the scope of RFC 254 and deprecated repoHasFile. The following are stretch goals.

Examples: - repo:foo repo:contains(file:bar) implies repo:(repo:foo repo:contains(file:bar) select:repo). The latter query can be expressed in Zoekt as (type:repo repo: file:bar), but that's not possible for us right now--we can rely on our select implementation. The result of repo:foo repo:contains(file:bar) select:repo would be transformed into a regular expression value that we then replace the repo: value in the original query and run this search. We have some flexibility on figuring out how to evaluate subexpressions, i.e., if the regular expression is very large (many repos satisfy contains) then we can optimize by splitting it into or queries for evaluation, for example. For simplicity, it's easiest to assume we'll build a regular expression for a start. - repogroup:foo -repo:bar repo:contains(content:qux) implies repo:(repogroup:foo -repo:bar content:qux select:repo) - Shelved the below, cf. https://github.com/sourcegraph/sourcegraph/pull/19049#discussion_r592661271 - repo:foo -file:bar file:contains(content:baz) implies repo:foo file:(-file:bar content:baz select:file). I.e., for file:contains(...) we'll do similarly. - ~Complex case (we won't implement this yet, but I want to demonstrate how it pans out): repo:foo repo:contains(file:bar) -file:baz file:contains(content:qux) hey implies ~ repo:(repo:foo file:(file:bar -file:baz content:qux select:file) select:repo) file:(file:bar -file:baz content:qux select:file) hey

Shelving interaction of `repo:contains` and `file:contains` https://github.com/sourcegraph/sourcegraph/pull/19049#discussion_r592661271 ~I.e., For combinations of `repo:contains` and `file:contains`, we'll need to do subquery expansion. We will punt on this for RFC purposes right now. If we implement things nicely, it should be straightforward to evaluate this--it would only mean getting the query in the right shape--we are just substituting the expansion of `file:` inside the `repo:` routine, for example.~

Extra notes: it'd be worthwhile to look at how repoHasFile is currently implemented. I don't think we can reuse much of this code, but it would be interesting if we can learn anything about how it optimizes this use case, and maybe we can borrow parts of the approach.

Frontend

Documentation

Resolved open question:

  • We thought about whether to prefix filter predicates with something like # or +. No satisfying solution popped out of that discussion, which took place over a couple of weeks. We decided to not add a prefix and just recognize registered predicate names.