proposal: default to literal searches with keyword, regex, etc. fallbacks on no results
Created by: ijt
We want to support simple, literal queries but also make it easy for regex queries to continue working. Here is a proposal to accomplish that.
Goals
- Literal searches like `NewRouter(` just work.
- Searches can be shared by URL and preferably also by query.
- Old search URLs continue to produce the same results as before.
- Old regex-based search queries continue to produce the same results as before.
- Saved searches from before continue to work unchanged.
- New saved searches work smoothly.
- Scripts calling the `src` CLI continue to work unchanged.
- Latency is not noticeably increased.
Proposal
We start by assuming nothing, since there is plenty of uncertainty about what sort of query will be made when the user lands on the search page. Even if the previous query was meant as a regular expression, we don't know if this one will be. The previous search may have yielded results with case-sensitive matching, but this search may need to be case-insensitive.
The user can tell Sourcegraph what kind of search to run by adding a new flag to the query called `exp`, short for "expression" as in "regular expression". This flag can have the following values:
- `literal` or `lit`: the query is interpreted literally
- `keyword` or `key`: like literal but spaces are converted to `.*` regexes
- `regexp` or `regex` or `rx`: the query is interpreted as a regular expression
- `keyregexp` or `keyregex` or `krx`: the query is interpreted as a regular expression, except that spaces are converted to `.*`. This is how Sourcegraph search currently works, and would be a sensible default for the GQL API `Search` query for backwards compatibility.
- `structural` or `sx` or `comby`: the query is interpreted as one of Rijnard's patterns with holes like `:[_]` or `:[foo]` (future work)
- `unknown`: the query is tried as each of `literal`, `keyword`, `regexp`, `keyregexp`, `structural` until some results are found, and an additional field is set in the GQL result saying which type. The match type is displayed in the UI. This would make sense as the default for the web app, probably implemented by setting a new `ExpDefault` field in the GQL `Search` query to `unknown` instead of its default value of `keyregexp`.
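To make the flag values concrete, here is a minimal sketch of how an `exp:` token could be pulled out of a query and how each non-structural mode could be turned into a regex. The names `EXP_ALIASES`, `split_exp_flag`, and `to_regex` are hypothetical, and Python's `re` stands in for whatever regex engine the backend really uses, so treat this as an illustration of the rules above rather than the actual implementation.

```python
import re

# Hypothetical alias table: every spelling listed above maps to one mode name.
EXP_ALIASES = {
    "literal": "literal", "lit": "literal",
    "keyword": "keyword", "key": "keyword",
    "regexp": "regexp", "regex": "regexp", "rx": "regexp",
    "keyregexp": "keyregexp", "keyregex": "keyregexp", "krx": "keyregexp",
    "structural": "structural", "sx": "structural", "comby": "structural",
    "unknown": "unknown",
}


def split_exp_flag(query, default="unknown"):
    """Return (mode, query without the exp: token)."""
    mode = default
    rest = []
    for token in query.split(" "):
        if token.startswith("exp:") and token[4:] in EXP_ALIASES:
            mode = EXP_ALIASES[token[4:]]
        else:
            rest.append(token)
    return mode, " ".join(rest).strip()


def to_regex(mode, query):
    """Translate a query into regex source for the non-structural modes."""
    if mode == "literal":
        return re.escape(query)  # every character matches itself
    if mode == "keyword":
        # Literal terms, with the spaces between them turned into .*
        return ".*".join(re.escape(term) for term in query.split())
    if mode == "regexp":
        return query  # used as written
    if mode == "keyregexp":
        # Regex terms, with the spaces between them turned into .*
        return ".*".join(query.split())
    raise ValueError("no regex translation for mode: " + mode)
```

For example, `split_exp_flag("exp:lit NewRouter(")` yields the `literal` mode and the query `NewRouter(`, and `to_regex("keyregexp", "func ma[ik][ne]")` yields `func.*ma[ik][ne]`.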
Saved queries will need updating to work with this. The `saved_queries` table will have a new column called `exp_default` whose default value would be `keyregexp` to ensure backwards compatibility for existing saved queries. All new saved queries will set `exp_default` to `unknown`.
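The column default is what keeps old saved queries unchanged. The sketch below shows the effect using an in-memory SQLite database purely so that it runs anywhere; the `id` and `query` columns are assumptions, since the real `saved_queries` schema isn't shown here, and the production database may well need different migration syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed minimal shape of the existing table (hypothetical columns).
conn.execute("CREATE TABLE saved_queries (id INTEGER PRIMARY KEY, query TEXT NOT NULL)")
conn.execute("INSERT INTO saved_queries (query) VALUES ('func ma[ik][ne]')")

# The migration: existing rows pick up the backwards-compatible default.
conn.execute(
    "ALTER TABLE saved_queries "
    "ADD COLUMN exp_default TEXT NOT NULL DEFAULT 'keyregexp'"
)

# New saved queries set exp_default explicitly.
conn.execute(
    "INSERT INTO saved_queries (query, exp_default) VALUES ('NewRouter(', 'unknown')"
)

rows = conn.execute(
    "SELECT query, exp_default FROM saved_queries ORDER BY id"
).fetchall()
```

After the migration, the pre-existing row reads back with `exp_default` equal to `keyregexp`, while the newly saved one carries `unknown`.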
Let's have a look at how well this proposal accomplishes the goals:
- Literal searches like `NewRouter(` just work.
  - They do because the default `unknown` search type first tries literal search.
- Searches can be shared by URL and preferably also by query.
  - The searches are just as shareable as before, by URL and by query.
- Old search URLs continue to produce the same results as before.
  - Usually they will. Ones that get unintended literal search results will need to have `exp:keyregexp` added to continue working.
- Old regex-based search queries continue to produce the same results as before.
  - Same as goal 3 above (old search URLs).
- Saved searches from before continue to work unchanged.
  - The new `saved_queries` column `exp_default`, defaulting to `keyregexp`, handles this.
- New saved searches work smoothly.
  - Saving new queries with `exp_default` set to `unknown` handles this.
- Scripts calling the `src` CLI continue to work unchanged.
  - This is done by having the default for `exp` be `keyregexp` outside of the web app.
- Latency is not noticeably increased.
  - On my local dev instance with 65 repos, empty results are taking 20-30ms, so it's at least plausible that `exp:unknown` won't make regex searches noticeably worse. We may find that it works best to run the different possible interpretations in parallel since most of the time most of them will be ruled out or return nothing.
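The fallback behaviour of `exp:unknown`, including the idea of running the interpretations in parallel, can be sketched as follows. Everything here is hypothetical: `search_unknown` searches a plain list of lines rather than a real index, the thread pool only illustrates the shape of the fan-out, and `structural` is left out because it is future work. The one property the sketch is meant to show is that the result comes from the first interpretation in priority order that has matches, even when the searches run concurrently.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Priority order for exp:unknown; structural is omitted (future work).
FALLBACK_ORDER = ["literal", "keyword", "regexp", "keyregexp"]


def compile_for(mode, query):
    """Return a compiled pattern for this interpretation, or None if invalid."""
    if mode == "literal":
        source = re.escape(query)
    elif mode == "keyword":
        source = ".*".join(re.escape(term) for term in query.split())
    elif mode == "regexp":
        source = query
    else:  # keyregexp
        source = ".*".join(query.split())
    try:
        return re.compile(source)
    except re.error:
        return None  # this interpretation is ruled out


def search_one(mode, query, lines):
    pattern = compile_for(mode, query)
    if pattern is None:
        return []
    return [line for line in lines if pattern.search(line)]


def search_unknown(query, lines):
    """Run all interpretations concurrently; return (match type, matches)."""
    with ThreadPoolExecutor(max_workers=len(FALLBACK_ORDER)) as pool:
        futures = [
            (mode, pool.submit(search_one, mode, query, lines))
            for mode in FALLBACK_ORDER
        ]
        # Inspect results in priority order, not completion order.
        for mode, future in futures:
            matches = future.result()
            if matches:
                return mode, matches
    return None, []
```

The returned mode name plays the role of the extra GQL result field that tells the UI which match type was used.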
Optimizations
If `unknown` is selected, `regexp` and `keyregexp` can be skipped if the query contains no special regex characters or is not a valid regex. Also, `structural` can be skipped if the query contains no holes like `:[foo]`.
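A rough sketch of that pruning step is below. The function name `candidate_modes`, the set of characters counted as "special", and the hole pattern (which only covers the `:[_]` and `:[foo]` forms mentioned in this proposal) are all assumptions. For simplicity it validates the raw query once and uses that answer for both `regexp` and `keyregexp`.

```python
import re

# Characters treated as regex metacharacters for this check (an assumption).
REGEX_SPECIALS = set(".^$*+?{}[]()|\\")

# Matches holes of the form :[_] or :[foo]; other hole syntaxes are not covered.
HOLE = re.compile(r":\[\w+\]")


def is_valid_regex(query):
    try:
        re.compile(query)
        return True
    except re.error:
        return False


def candidate_modes(query):
    """List the interpretations worth trying for an exp:unknown query."""
    modes = ["literal", "keyword"]
    has_specials = any(ch in REGEX_SPECIALS for ch in query)
    if has_specials and is_valid_regex(query):
        modes += ["regexp", "keyregexp"]
    if HOLE.search(query):
        modes.append("structural")
    return modes
```

With this, `NewRouter(` is only tried as `literal` and `keyword` because it does not compile as a regex, while `func ma[ik][ne]` is also tried as `regexp` and `keyregexp`.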
Making one big regexp won't work because it would mix search results from different interpretations of the query. The idea considered was to get everything but `structural` at once by making a big regex with a disjunction (or) of all the possible interpretations. For example, given a query like `func ma[ik][ne]` the big regex would be `(("func ma[ik][ne]")|("func".*"ma[ik][ne]")|(func ma[ik][ne])|(func.*ma[ik][ne]))`. It might be faster, though we'd have to run some benchmarks to find out. The `rx` and `krx` groups would have to be excluded if they don't compile. It would also be less clear how to communicate to the user what type of match has been found.
Alternatives
Always try everything
The proposal here can be considered a refinement of this earlier proposal, with the addition of the `exp` flag and the change that the `unknown` value returns results for the first search type that has them instead of results for all possible interpretations. It was too confusing to consider how to mix together search results from different interpretations of the query.
Make regexp opt-in via UI toggle button
This is a practical near-term proposal; however, it has some drawbacks compared to the fallback proposal here:
- Query sharing is complicated by the need to click UI widgets when Alice and Bob have different settings.
- As noted earlier, we often want to do a literal search right after having done a regex search and it's nice not to have to fiddle with the UI if we don't have to.
- If the `.*` box is checked and Alice inputs `foo+bar` while forgetting that `+` is a regex special character, she'll be surprised at the results. With the `exp:unknown` default of this proposal, `foo+bar` is more likely to work as intended, probably getting literal results and stopping there.
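The `foo+bar` surprise is easy to reproduce. The snippet below uses Python's `re` as a stand-in for the search backend's regex engine (an assumption), with a made-up line of source text.

```python
import re

line = "total = foo+bar"

# As a regex, "foo+bar" means "fo", one or more "o", then "bar",
# so it does not match the text Alice actually typed.
as_regex = re.search("foo+bar", line)

# As a literal, every character matches itself.
as_literal = re.search(re.escape("foo+bar"), line)

# The regex reading does match a different string instead.
regex_on_other_text = re.search("foo+bar", "fooobar")
```

Here `as_regex` is `None` while `as_literal` is a match, which is why trying the literal interpretation first is more likely to give Alice what she meant.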