When using Tantivy (v0.22) to index and query email data, the NOT operator with a phrase query returns an empty result set, even when matching documents exist.
Steps to Reproduce:
Index documents with fields account, mailbox, and subject.
Run query: account:asdasd AND mailbox:iuhiuhihu AND subject:"苹果手机"
Returns correct documents with the phrase "苹果手机" in subject.
Run query: account:asdasd AND mailbox:iuhiuhihu AND NOT subject:"苹果手机"
Returns empty results, even though documents matching account:asdasd AND mailbox:iuhiuhihu exist without "苹果手机" in subject.
Expected Behavior:
The second query should return documents where account:asdasd and mailbox:iuhiuhihu match, and subject does not contain "苹果手机".
Actual Behavior:
Empty result set.
Additional Observations:
Base query account:asdasd AND mailbox:iuhiuhihu works as expected.
NOT subject:苹果 (single term) also returns an empty set.
Environment:
Tantivy: 0.22
Rust: 1.84.1
How can I query for documents where subject does not contain the phrase "苹果手机"? Willing to provide sample data or logs if needed.
The problem does not come from the handling of the NOT operator, but from a bad interaction with the ngram tokenizer.
The ngram tokenizer is used both at indexing time and at query time, so the phrase in the query is itself split into n-grams:
subject:"苹果手机" -> subject:ngram1 OR subject:ngram2 OR ...
I suspect you created your ngram tokenizer as NgramTokenizer::all_grams(1, 3)?
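To make the n-gram expansion concrete, here is a minimal std-only sketch of character-level n-grams. It only imitates what an ngram tokenizer does and does not call Tantivy; `char_ngrams` and `matches_any_gram` are hypothetical helper names, and the OR semantics are taken from the explanation above rather than from Tantivy's query parser.

```rust
/// Character-level n-grams of `text`, for every length in `min..=max`.
/// Hypothetical helper: it mimics an ngram tokenizer, it is not Tantivy's code.
fn char_ngrams(text: &str, min: usize, max: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut grams = Vec::new();
    for start in 0..chars.len() {
        for len in min..=max {
            // Stop once the n-gram would run past the end of the text.
            if start + len > chars.len() {
                break;
            }
            grams.push(chars[start..start + len].iter().collect());
        }
    }
    grams
}

/// True if `doc` shares at least one n-gram with `query`,
/// i.e. an OR over the query's n-grams (assumed semantics, see above).
fn matches_any_gram(doc: &str, query: &str, min: usize, max: usize) -> bool {
    let doc_grams = char_ngrams(doc, min, max);
    char_ngrams(query, min, max)
        .iter()
        .any(|gram| doc_grams.contains(gram))
}
```

With sizes 3 to 3, `char_ngrams("苹果手机", 3, 3)` yields the two trigrams 苹果手 and 果手机, and `char_ngrams("苹果", 3, 3)` yields nothing because the input is shorter than three characters. With sizes 1 to 3, a subject such as 新手机到货 already shares the grams 手, 机 and 手机 with the query, so a NOT applied to the OR of all grams would exclude it even though it does not contain the full phrase.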
If you change the tokenizer, it might work as intended. There are several Chinese tokenizers available for Tantivy.
For instance, you could use lindera (the crate is named lindera-tantivy) with a Chinese tokenizer.
Another possible approach would be to write a tokenizer that just emits each kanji as its own token, and then use phrase queries.
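The per-character idea could look roughly like the sketch below. It is plain std Rust, not an implementation of Tantivy's `Tokenizer` trait; `is_cjk`, `tokenize_mixed` and `contains_phrase` are hypothetical names, and the code point ranges are a rough assumption that covers kana and the common Han blocks only. Non-CJK text is split into lowercased alphanumeric words so that mixed Chinese-English subjects still tokenize sensibly.

```rust
/// Rough test for kana and Han ideographs (assumed ranges, not exhaustive).
fn is_cjk(c: char) -> bool {
    matches!(
        c as u32,
        0x3040..=0x30FF       // Hiragana and Katakana
            | 0x3400..=0x4DBF // CJK Unified Ideographs Extension A
            | 0x4E00..=0x9FFF // CJK Unified Ideographs
            | 0xF900..=0xFAFF // CJK Compatibility Ideographs
    )
}

/// Emits every CJK character as its own token and every run of other
/// alphanumeric characters as one lowercased word token.
fn tokenize_mixed(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut word = String::new();
    for c in text.chars() {
        if is_cjk(c) {
            // A CJK character ends any pending word and stands alone.
            if !word.is_empty() {
                tokens.push(std::mem::take(&mut word));
            }
            tokens.push(c.to_string());
        } else if c.is_alphanumeric() {
            word.extend(c.to_lowercase());
        } else if !word.is_empty() {
            // Whitespace or punctuation closes the current word.
            tokens.push(std::mem::take(&mut word));
        }
    }
    if !word.is_empty() {
        tokens.push(word);
    }
    tokens
}

/// Phrase match: the phrase tokens must appear consecutively in the document.
fn contains_phrase(doc: &[String], phrase: &[String]) -> bool {
    !phrase.is_empty() && doc.windows(phrase.len()).any(|window| window == phrase)
}
```

For a subject like 新款苹果手机 iPhone 15 发布, this produces one token per Chinese character plus the words iphone and 15. The phrase 苹果手机 then matches only where its four characters appear consecutively, so negating that phrase match keeps a subject such as 新手机到货.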
@fulmicoton I’m currently using NgramTokenizer::new(3, 3, false) for the subject field. My use case involves emails in various mainstream languages worldwide (e.g., English, Chinese, Japanese, etc.), often mixed within a single email (like Chinese-English combos). I can’t specify a tokenizer per email. Is there a tokenizer that can effectively handle such multilingual and mixed-language scenarios?