Implement contains(...) filter predicate (RFC 254)
Created by: rvantonder
See RFC 254 for full context. This issue tracks the implementation plan for the contains
filter predicate in the search query.
We should implement contains
for the repo
and file
filters. We'll do it for repo
first, then expand to file
time allowing. Later, we may want to generalize contains
to other filters like commits, but it is not in scope of this plan nor RFC 254.
The backend will need quite a bit of work, since we want to essentially perform subqueries based on contains
to extract values that then constrain and scope the parent query. Realistically, I think we may only get to repo
within the time allocated, and enable the queries and behavior as documented in RFC 254, and file
is a stretch goal.
Backend
-
Add scaffolding to recognize filter predicates. This should be a map so we can do a lookup on a filter predicate and then dispatch to a function that does the work. We will only add contains
as the first filter predicate in this map. -
Parse filter predicate values. There are a couple of ways to do this. My thinking is to add a post-processing step after parsing a query into it's canonical form (simple field:value
) and then parsing out the value and perform the behavior (i.e., filter predicates are "plugin" logic after parsing a query, and we activate the plugin logic based on recognizing a registered predicate). -
Implement contains
logic for repo. The reference of behavior we want to implement is documented in RFC 254.-
Syntax: We will start with simple field:value
inputs forcontains
, wherefield
can befile
orcontent
and values are interpreted as regular expressions. Am a bit handy wavy on the details here, but we can reuse the field/value parser/scanners for the arguments ofvalue
to extract these values. -
Semantics: repo:contains(v)
essentially means "substitute forcontains
the repository names that satisfy the queryv
". The idea is that we want to substitutecontains(v)
for something likerepo:foo|bar|baz
so that we can simply evaluate the query with our existing search engine. Important for the implementation, is the order in which we evaluaterepo:contains(...)
predicates. If we have a query likerepo:sourcegraph repo:contains(file:.*)
, we could in theory evaluatecontains(...)
first (resolving to all repos), and then intersect with the values satisfyingrepo:sourcegraph
. This would be inefficient. More efficient would be if we resolvedcontains
with respect to existingrepo:sourcegraph
, i.e., we run the searchrepo:sourcegraph file:.* select:repo
and substitute the output of this value forrepo:...
and run our query. Thus,repo:contains(...)
should be implemented as a search query usingselect
, and give precedence to otherrepo:
values in the query before runningcontains(...)
. -
Proper predicate parsing #19075 (closed)
-
-
Deprecate repoHasFile
depends on #20337 (closed)
Additional scope beyond RFC 254
-
Support repo:contains(file:foo content:bar)
-
Support repo:contains(-file:foo -content:bar)
-
Negation on repo
, e.g.,-repo:cointains(...)
-
Add support for repo:contains(x) repo:contains(y)
i.e.,repo:contains(x) and repo:contains(y)
, e.g., make this workrepo:^github\.com/sourcegraph/ repo:contains(file:README.md) repo:contains(file:CHANGELOG)
. At this point, we've fulfilled the scope of RFC 254 and deprecatedrepoHasFile
. The following are stretch goals.
Examples:
- repo:foo repo:contains(file:bar)
implies repo:(repo:foo repo:contains(file:bar) select:repo)
. The latter query can be expressed in Zoekt as (type:repo repo: file:bar)
, but that's not possible for us right now--we can rely on our select
implementation. The result of repo:foo repo:contains(file:bar) select:repo
would be transformed into a regular expression value that we then replace the repo:
value in the original query and run this search. We have some flexibility on figuring out how to evaluate subexpressions, i.e., if the regular expression is very large (many repos satisfy contains
) then we can optimize by splitting it into or
queries for evaluation, for example. For simplicity, it's easiest to assume we'll build a regular expression for a start.
- repogroup:foo -repo:bar repo:contains(content:qux)
implies repo:(repogroup:foo -repo:bar content:qux select:repo)
- Shelved the below, cf. https://github.com/sourcegraph/sourcegraph/pull/19049#discussion_r592661271
-
- ~Complex case (we won't implement this yet, but I want to demonstrate how it pans out): repo:foo -file:bar file:contains(content:baz)
implies repo:foo file:(-file:bar content:baz select:file)
. I.e., for file:contains(...)
we'll do similarly.repo:foo repo:contains(file:bar) -file:baz file:contains(content:qux) hey
implies ~
repo:(repo:foo file:(file:bar -file:baz content:qux select:file) select:repo) file:(file:bar -file:baz content:qux select:file) hey
Shelving interaction of `repo:contains` and `file:contains` https://github.com/sourcegraph/sourcegraph/pull/19049#discussion_r592661271 ~I.e., For combinations of `repo:contains` and `file:contains`, we'll need to do subquery expansion. We will punt on this for RFC purposes right now. If we implement things nicely, it should be straightforward to evaluate this--it would only mean getting the query in the right shape--we are just substituting the expansion of `file:` inside the `repo:` routine, for example.~
Extra notes: it'd be worthwhile to look at how repoHasFile
is currently implemented. I don't think we can reuse much of this code, but it would be interesting if we can learn anything about how it optimizes this use case, and maybe we can borrow parts of the approach.
Frontend
-
Add scaffolding to recognize filter predicates https://github.com/sourcegraph/sourcegraph/pull/19641 -
Syntax: Parsing and validation routines for tokens https://github.com/sourcegraph/sourcegraph/pull/19785 -
Smart autocomplete parentheses for recognized predicates https://github.com/sourcegraph/sourcegraph/pull/20181 -
smart diagnostics for recognizing valid values WIP -
Hovers and highlighting https://github.com/sourcegraph/sourcegraph/pull/19894 https://github.com/sourcegraph/sourcegraph/pull/19900 https://github.com/sourcegraph/sourcegraph/pull/19990 https://github.com/sourcegraph/sourcegraph/pull/20192 -
Pings/Metrics
Documentation
-
select
documentation -
contains
documentation https://github.com/sourcegraph/sourcegraph/pull/20573
Resolved open question:
- We thought about whether to prefix filter predicates with something like
#
or+
. No satisfying solution popped out of that discussion, which took place over a couple of weeks. We decided to not add a prefix and just recognize registered predicate names.