New exact phrase searching feature (for HTML) #12552
base: master
Conversation
Shouldn't we add some tests?
This is a clever way to implement the feature without having to change the format of the search index. An alternative implementation I have in mind would involve storing the location of each term in the documents it was found in as part of the index. Also agreed with @picnixz that some test coverage would be good if and when we add this. Edit: rephrase; I shouldn't have suggested that this is an incomplete approach.
Note: stopwords (…) would need special handling. Then the contents of a hypothetical document 15, with contents (…), and a query for (…) could be resolved against the stored term positions. Perhaps I should try to link to some kind of information retrieval coursebook or online resource, but I wanted to mention some of that to have it in context here while it's on my mind. It's doable and probably quite a challenging and satisfying implementation, but there would be quirks and details. Edit: add exact-match post-filter step
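A standalone sketch of the positional-index check described above (the shape is hypothetical and not from this PR: one sorted array of positions per query term, all within a single candidate document; a phrase matches if the terms occur at consecutive positions):

```javascript
// Given one array of term positions per query term (for one document),
// check whether the phrase occurs, i.e. whether some start position p
// has term k at position p + k for every k.
const phraseOccursAt = (positionLists) =>
  positionLists[0].some((start) =>
    positionLists.every((list, offset) => list.includes(start + offset)));
```

For example, position lists `[[3, 9], [4], [5]]` match (the phrase starts at position 3), while `[[1], [5]]` do not.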
This may lead to a lot of false positives.
One idea is to use n-gram predictions, but I'm not sure whether it will be sufficient (and efficient).
Co-authored-by: Bénédikt Tran <[email protected]>
I'm a bit on the fence about this one. The implementation isn't that bad, but on the other hand I'm not sure how useful it is and whether it's worth the additional surface area to support (especially if we might want to refactor the search internals in the future, as several people have discussed).
I would definitely want to see tests before it goes in.
Co-authored-by: Will Lachance <[email protected]>
Despite my initial flip-out about an implementation that isn't index-driven, I would note that this is the most-requested search-related feature in the bugtracker. That's bringing me more towards acceptance of it.
if (data) {
  const lowercaseData = data.toLowerCase();
  const mismatch = (s) => !lowercaseData.includes(s);
  if (exactSearchPhrases.some(mismatch)) return;
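For context, a minimal sketch of how the quoted phrases consumed by a check like the one above might be extracted from a raw query string (the function name is hypothetical and not part of this PR; it assumes double-quoted phrases and lowercases them to match the case-insensitive comparison):

```javascript
// Hypothetical helper: collect the contents of double-quoted spans
// from a raw query, lowercased for case-insensitive matching.
const parseExactSearchPhrases = (query) => {
  const phrases = [];
  const quoted = /"([^"]+)"/g;
  let m;
  while ((m = quoted.exec(query)) !== null) phrases.push(m[1].toLowerCase());
  return phrases;
};
```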
Perhaps `every` would be better than `some` here?
If searching for two phrases, it could be frustrating not to find any results at all, despite the fact that some pages do include one of the phrases.
If a user runs a query for (…)
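To illustrate the difference the two predicates make for a two-phrase query (a self-contained toy example, not PR code):

```javascript
// Toy example: one phrase present, one absent.
const lowercaseData = "the quick brown fox";
const phrases = ["quick brown", "lazy dog"]; // second phrase is absent
const mismatch = (s) => !lowercaseData.includes(s);

// With `some`, any missing phrase excludes the page.
const excludedWithSome = phrases.some(mismatch);   // true: page dropped
// With `every`, the page is kept as long as at least one phrase matches.
const excludedWithEvery = phrases.every(mismatch); // false: page kept
```

So `some(mismatch)` implements AND semantics across quoted phrases, while `every(mismatch)` implements OR semantics.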
For me, quotes should mean exactly what I want to search for, even if the page contains other content as well. Quotes mean "I want that exact string, I don't want anything else". So maybe we could add a warning saying "remove quotes"?
Hmm. Would searching for (…) match (…)?
Hm. Fortunately not, thanks to both of those being EN-language stopwords.
That seems simple, and as a user, if I've intentionally used quotes to try to get exact-match results, then I am probably reasonably likely to be able to figure out that all of them must match if I use multiple quoted phrases in my query. I think my largest concern about these changes remains the efficiency/time-cost of the client reading through the entire contents of documents for matches. I could draft an ngram-based solution? (this time using inter-term ngrams, as compared to intra-term ngrams in #12596).
Idea: when indexing tri-grams in the manner proposed in #12596, add the following handling:
The flaw in all of the above reasoning is that it is global across the entire document collection. It may be preferable to have per-document filtering, because otherwise query performance may be inconsistent (some very fast queries where the phrase is known not to exist at all -- but then slow queries where we still have to check every document).
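A rough standalone sketch of the inter-term trigram idea (all names are hypothetical; "_" stands in for the term separator, and the boundary gram joins the last character of the previous term to the first character of the current one):

```javascript
// Trigrams within a single term.
const termTrigrams = (term) => {
  const grams = [];
  for (let i = 0; i + 3 <= term.length; i++) grams.push(term.slice(i, i + 3));
  return grams;
};

// Trigrams for a sequence of terms: intra-term grams, plus a boundary
// gram between each adjacent pair of terms so that phrase queries can
// rule out documents where the terms never appear next to each other.
const documentTrigrams = (terms) => {
  const grams = new Set();
  terms.forEach((term, i) => {
    termTrigrams(term).forEach((g) => grams.add(g));
    if (i > 0) {
      const prev = terms[i - 1];
      grams.add(prev[prev.length - 1] + "_" + term[0]);
    }
  });
  return grams;
};
```

A query phrase would be decomposed the same way, and only documents whose gram set is a superset of the query's grams would need the exact-match post-filter.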
(…) using term-ngram index support (sphinx-doc#12596) with additional previous-term-suffix ngrams. Design ref: sphinx-doc#12552 (comment)
// exclude results that don't contain exact phrases if we are searching for them
if (data) {
  const lowercaseData = data.toLowerCase();
  const mismatch = (s) => !lowercaseData.includes(s);
Suggestion: perhaps we could/should add word boundaries around the match?
- const mismatch = (s) => !lowercaseData.includes(s);
+ const mismatch = (s) => !new RegExp(`\\b${s}\\b`).test(lowercaseData);
Reasoning:
- Could make it easier to exact-search for strings that are substrings of other phrases/words.
- Although regex usage can introduce some overhead, there's also optimization opportunity if the matching can skip over non-word boundary match positions.
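Note that before interpolating a phrase into a pattern, regex metacharacters in it would also need escaping. A minimal sketch, assuming a hypothetical escapeRegExp helper (not part of this PR):

```javascript
// Escape regex metacharacters so the phrase is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Build a word-boundary-aware mismatch predicate for one document's text.
const makeMismatch = (lowercaseData) => (s) =>
  !new RegExp(`\\b${escapeRegExp(s)}\\b`).test(lowercaseData);
```

With boundaries, searching for "python" would no longer match a document that only contains "pythonic", which is the substring behaviour the suggestion is trying to avoid.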
Re-opened version of #4254. (See #4254 (comment))
I've just rebased the old PR and updated it. However, I'm not sure that this is the best implementation now, given that we have split "display" logic from "search" logic -- so if it's best to close this PR and start anew, then I won't object.
Closes #3301