New exact phrase searching feature (for HTML) #12552

AA-Turner · 2024-07-13T06:14:12Z

Re-opened version of #4254. (See #4254 (comment))

New exact phrase searching feature (for HTML)

I've just rebased the old PR and updated it. However, I'm not sure that this is the best implementation now, given that we have split "display" logic from "search" logic -- so if it's best to close this PR and start anew then I won't object.

Closes #3301

A

picnixz

Shouldn't we have some tests?

sphinx/themes/basic/static/searchtools.js

jayaddison · 2024-07-13T09:07:42Z

This is a clever way to implement the feature without having to change the format of the search index (searchindex.js) file! My main concern is that it relies on the search summary functionality (HTTP GET of the complete content of each result), meaning that the retrieval behaviour of some search queries becomes entangled with what is otherwise mainly a result-formatting config setting (html_show_search_summary - on by default, but even so, as a user/project maintainer I would not expect that setting to alter search query capabilities).

An alternative implementation I have in mind would involve storing the location of each term in the documents it was found in as part of searchindex.js -- and then a phrase-query would check that all of the words within the phrase appear adjacently at least once. However, that's significantly more effort.

Also agreed with @picnixz that some test coverage would be good if+when we add this.

Edit: rephrase; I shouldn't have suggested that this is an incomplete approach.

jayaddison · 2024-07-13T11:36:19Z

> An alternative implementation I have in mind would involve storing the location of each term in the documents it was found in as part of searchindex.js -- and then a phrase-query would check that all of the words within the phrase appear adjacently at least once. However, that's significantly more effort.

Note: stopwords (the, a, it...) are a challenge with this approach, because their positions aren't stored. The trick is to remove them from each phrase in the input query too.

Then a hypothetical document 15, with contents "The example on page A is a useful example!", might tokenize to _ example _ page _ _ _ useful example => document term positions example: {15: [2, 9]}, page: {15: [4]}, useful: {15: [8]}...

...and a query for "example on page" could tokenize to example _ page => query term positions {example: 1}, {<ANY>: 2}, {page: 3} -- and now we need to match documents where both "example" and "page" appear, and then filter those results to cases where each matched term from the tokenized phrase has a corresponding next-match (allowing the ANY wildcard) at the same relative offset. Finally, an exact-match phase would eliminate incorrect wildcard matches (for instance "example code page" in an unrelated document 14).
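The positional-index idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the stopword list, the function names (`tokenize`, `buildPositionalIndex`, `phraseCandidates`, `phraseSearch`) and the in-memory `docs` object are all hypothetical and are not part of Sphinx or of searchindex.js. The final exact-match pass here is a plain lowercase substring test, which assumes the document text is available to the client.

```javascript
// Hypothetical stopword list, for illustration only.
const STOPWORDS = new Set(["the", "a", "an", "on", "is", "it"]);

// Split text into [term, position] pairs (1-based positions).
// Stopwords are dropped, but the positions they occupied remain as gaps.
function tokenize(text) {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return words
    .map((word, i) => [word, i + 1])
    .filter(([word]) => !STOPWORDS.has(word));
}

// Build { term: { docId: [positions...] } }.
function buildPositionalIndex(docs) {
  const index = {};
  for (const [docId, text] of Object.entries(docs)) {
    for (const [term, pos] of tokenize(text)) {
      const postings = (index[term] ??= {});
      (postings[docId] ??= []).push(pos);
    }
  }
  return index;
}

// Documents in which every non-stopword term of the phrase occurs at the
// same relative offsets. Stopword slots act as wildcards.
function phraseCandidates(index, phrase) {
  const terms = tokenize(phrase);
  if (terms.length === 0) return [];
  const [firstTerm, firstPos] = terms[0];
  const result = [];
  for (const [docId, starts] of Object.entries(index[firstTerm] ?? {})) {
    const matched = starts.some((start) =>
      terms.every(([term, pos]) =>
        (index[term]?.[docId] ?? []).includes(start + (pos - firstPos))
      )
    );
    if (matched) result.push(docId);
  }
  return result;
}

// Exact-match post-filter to discard wildcard false positives.
function phraseSearch(docs, index, phrase) {
  const needle = phrase.toLowerCase();
  return phraseCandidates(index, phrase).filter((docId) =>
    docs[docId].toLowerCase().includes(needle)
  );
}

const docs = {
  14: "An example code page",
  15: "The example on page A is a useful example!",
};
const index = buildPositionalIndex(docs);
console.log(phraseCandidates(index, "example on page")); // [ '14', '15' ]
console.log(phraseSearch(docs, index, "example on page")); // [ '15' ]
```

Document 14 survives the positional filter because the wildcard slot matches "code", and is only removed by the exact-match pass.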

Perhaps I should try to link to some kind of information retrieval coursebook or online resource, but I wanted to mention some of that to have it in context here while it's on my mind.

It's doable and probably quite a challenging and satisfying implementation, but there would be quirks and details.

Edit: add exact-match post-filter step
Edit: don't imply in the description that we'll definitely implement this

picnixz · 2024-07-13T12:16:56Z

> The trick is to remove them from each phrase in the input query too.

This may lead to a lot of false positives.

> Perhaps I should try to link to some kind of information retrieval coursebook or online resource, but I wanted to mention some of that to have it in context here while it's on my mind.

One idea is to use n-gram predictions but I'm not sure if it will be sufficient (and efficient).

Co-authored-by: Bénédikt Tran <[email protected]>

wlach

I'm a bit on the fence about this one. The implementation isn't that bad, but on the other hand I'm not sure how useful it is and whether it's worth the additional surface area to support (especially if we might want to refactor the search internals in the future, as several people have discussed).

I would definitely want to see tests before it goes in.

sphinx/themes/basic/static/searchtools.js

Co-authored-by: Will Lachance <[email protected]>

jayaddison · 2024-07-15T14:38:18Z

Despite my initial flip-out about an implementation that isn't index-driven, I would note that this is the most-requested search-related feature in the bugtracker. That's bringing me more towards acceptance of it.

jayaddison · 2024-07-17T11:03:14Z

sphinx/themes/basic/static/searchtools.js

+        if (data) {
+          const lowercaseData = data.toLowerCase();
+          const mismatch = (s) => !lowercaseData.includes(s);
+          if (exactSearchPhrases.some(mismatch)) return;


Perhaps every would be better than some here?

If searching for two phrases, it could be frustrating not to find any results at all, despite the fact that some pages do include one of the phrases.
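To make the difference concrete, here is a small sketch comparing the two array methods on the same `mismatch` predicate. The page text and the query phrases are made-up values for illustration; only `mismatch`, `lowercaseData` and `exactSearchPhrases` mirror names from the diff.

```javascript
// Hypothetical page text and quoted phrases (already lowercased).
const lowercaseData = "a short code example written in golang";
const exactSearchPhrases = ["code example", "unit test"];

const mismatch = (s) => !lowercaseData.includes(s);

// As in the diff: the page is dropped if ANY phrase is missing,
// so every quoted phrase is required.
const droppedWithSome = exactSearchPhrases.some(mismatch);

// Suggested alternative: the page is dropped only if EVERY phrase is
// missing, so one matching phrase is enough to keep it.
const droppedWithEvery = exactSearchPhrases.every(mismatch);

console.log(droppedWithSome); // true
console.log(droppedWithEvery); // false
```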

jayaddison · 2024-07-17T11:33:21Z

If a user runs a query for golang "code example" and no pages contain the phrase 'code example', but pages do contain 'golang', do we have a preferred outcome? (zero results, or include the non-phrase results)

picnixz · 2024-07-17T13:13:37Z

For me, quotes should be what I want to search, even if it contains the rest. Quotes mean "I want that exact string, I don't want anything else". So maybe we could add a warning saying "remove quotes"?

AA-Turner · 2024-07-17T14:09:42Z

Google presents the following ideas:

A

jayaddison · 2024-07-17T14:17:31Z

Hmm. Would searching for "that with" on a large (English-language example, but generalizable to others) documentation set potentially launch many, many HTTP GET requests with this?

jayaddison · 2024-07-17T14:32:34Z

> Hmm. Would searching for "that with" on a large (English-language example, but generalizable to others) documentation set potentially launch many, many HTTP GET requests with this?

Hm. Fortunately not, thanks to both of those being EN-language stopwords.

jayaddison · 2024-07-17T14:35:36Z

> For me, quotes should be what I want to search, even if it contains the rest. Quotes mean "I want that exact string, I don't want anything else". So maybe we could add a warning saying "remove quotes"?

That seems simple, and as a user, if I've intentionally used quotes to try to get exact-match results, then I am probably reasonably likely to be able to figure out that all of them must match if I use multiple quoted phrases in my query.

I think my largest concern about these changes remains the efficiency/time-cost of the client reading through the entire contents of documents for matches. I could draft an ngram-based solution? (this time using inter-term ngrams, as compared to intra-term ngrams in #12596).

jayaddison · 2024-07-19T15:48:40Z

> I think my largest concern about these changes remains the efficiency/time-cost of the client reading through the entire contents of documents for matches. I could draft an ngram-based solution? (this time using inter-term ngrams, as compared to intra-term ngrams in #12596).

Idea: when indexing tri-grams in the manner proposed in #12596, add the following handling:

- During indexing, keep track of the word preceding each term, provided that it is part of the same block of text (paragraph/sentence).
- If the trigram being created is at the start (prefix) of a word, then include the trigram of the suffix of the previous word in the term list. So, when indexing the phrase "context matters" (term offsets zero and one respectively), the trigram for ext would point to term offset zero, the trigram for ers would point to term offset one, and the trigram for mat -- seemingly oddly -- would point to both term offsets zero and one.
- During phrase queries, we would begin by collecting all of the starting-edge trigrams from the query phrase. If we query again for "context matters", this returns [0], [0, 1] -- and we can observe that there is a valid, ordered path along the ordered terms. If we queried for "context indeed matters", then we would retrieve [0], [2], [0, 1] or something along those lines -- the second result ([2]) doesn't contain a path back to the first result ([0]), and so this pair of terms does not appear adjacently in the document collection.

The flaw in all of the above reasoning is that it is global across the entire document collection. It may be preferable to have per-document filtering, because otherwise query performance may be inconsistent (some very fast queries where the phrase is known not to exist at all -- but then slow queries where we still have to check every document).
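One possible reading of the idea above, as a rough sketch rather than the design from #12596: all names here (`prefixTrigram`, `buildIndex`, `mayContainPhrase`) are hypothetical, only word-initial trigrams are indexed, and each trigram keeps two separate sets (`own` and `prev`) whose union corresponds to the combined list in the comment ([0, 1] for mat). Keeping them separate is an assumption made here so that the ordering check is unambiguous. As noted, the index is global across the collection, so a positive answer still needs per-document verification.

```javascript
const prefixTrigram = (word) => word.slice(0, 3);

// Map each word-initial trigram to two sets of term ids:
//   own:  terms that start with this trigram
//   prev: terms seen immediately before such a term in the same block
function buildIndex(blocks) {
  const termIds = new Map();
  const trigrams = new Map();
  for (const block of blocks) {
    let previousId = null; // reset at each block boundary
    for (const word of block.toLowerCase().split(/\s+/).filter(Boolean)) {
      if (!termIds.has(word)) termIds.set(word, termIds.size);
      const id = termIds.get(word);
      const gram = prefixTrigram(word);
      if (!trigrams.has(gram)) {
        trigrams.set(gram, { own: new Set(), prev: new Set() });
      }
      const entry = trigrams.get(gram);
      entry.own.add(id);
      if (previousId !== null) entry.prev.add(previousId);
      previousId = id;
    }
  }
  return trigrams;
}

// Cheap pre-filter: false means the phrase cannot occur anywhere in the
// collection; true means it might, and still has to be verified.
function mayContainPhrase(trigrams, phrase) {
  const entries = phrase
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .map((word) => trigrams.get(prefixTrigram(word)));
  if (entries.length === 0 || entries.includes(undefined)) return false;
  for (let i = 1; i < entries.length; i++) {
    // Some term preceding word i must itself start like word i - 1.
    const linked = [...entries[i].prev].some((id) => entries[i - 1].own.has(id));
    if (!linked) return false;
  }
  return true;
}

const trigrams = buildIndex(["context matters", "indeed"]);
console.log(mayContainPhrase(trigrams, "context matters")); // true
console.log(mayContainPhrase(trigrams, "context indeed matters")); // false
```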

) using term-ngram index support (sphinx-doc#12596) with additional previous-term-suffix ngrams. Design ref: sphinx-doc#12552 (comment)

jayaddison · 2024-07-20T13:49:04Z

sphinx/themes/basic/static/searchtools.js

+        // exclude results that don't contain exact phrases if we are searching for them
+        if (data) {
+          const lowercaseData = data.toLowerCase();
+          const mismatch = (s) => !lowercaseData.includes(s);


Suggestion: perhaps we could/should add word boundaries around the match?

Suggested change

const mismatch = (s) => !lowercaseData.includes(s);

const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
const mismatch = (s) => !new RegExp(`\\b${escapeRegExp(s)}\\b`).test(lowercaseData);

Reasoning:

Could make it easier to exact-search for strings that are substrings of other phrases/words.

Although regex usage can introduce some overhead, there's also optimization opportunity if the matching can skip over non-word boundary match positions.
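A minimal sketch of a word-boundary check, assuming a hypothetical helper named `containsWholePhrase`. Two details matter when building the pattern from a string: the phrase is the pattern and the page text is the subject, and `\b` has to be written as `\\b` inside a string or template literal (a bare `\b` there is a backspace character). The phrase is also escaped so that regex metacharacters in it are matched literally.

```javascript
// Escape regex metacharacters so the phrase is matched literally.
const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Hypothetical helper: true when `phrase` occurs in `text` on word boundaries.
function containsWholePhrase(text, phrase) {
  const pattern = new RegExp(`\\b${escapeRegExp(phrase)}\\b`);
  return pattern.test(text);
}

const lowercaseData = "the pattern matches every path in the tree";
console.log(lowercaseData.includes("pat")); // true (substring of "pattern")
console.log(containsWholePhrase(lowercaseData, "pat")); // false
console.log(containsWholePhrase(lowercaseData, "every path")); // true
```

Note that JavaScript's `\b` is defined in terms of ASCII word characters (`[A-Za-z0-9_]`), so this check may behave unexpectedly for phrases that begin or end with punctuation or non-ASCII letters.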

New exact phrase searching feature (for HTML)

b7220ca

AA-Turner added the "html search" and "javascript" labels Jul 13, 2024

AA-Turner requested a review from wlach July 13, 2024 06:14

AA-Turner mentioned this pull request Jul 13, 2024

Add exact phrase searching in HTML #4254

Closed

AA-Turner requested a review from jayaddison July 13, 2024 06:15

picnixz reviewed Jul 13, 2024

Simplify

e97f92f

Co-authored-by: Bénédikt Tran <[email protected]>

wlach reviewed Jul 15, 2024

Apply suggestions from code review

c3df68b

Co-authored-by: Will Lachance <[email protected]>

jayaddison reviewed Jul 17, 2024

jayaddison reviewed Jul 20, 2024

jayaddison mentioned this pull request Jul 24, 2024

HTML search: Introduce ngram-based partial-match searching #12596

Closed

3 tasks
