NOT Operator with Phrase Query Returns Empty Results #2584

Open
inboxsphere opened this issue Feb 21, 2025 · 2 comments

@inboxsphere

When using Tantivy (v0.22) to index and query email data, the NOT operator with a phrase query returns an empty result set, even when matching documents exist.

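use tantivy::schema::{IndexRecordOption, Schema, TextFieldIndexing, TextOptions, FAST, STORED, STRING};

let mut builder = Schema::builder();
let account = builder.add_text_field("account", STRING | STORED | FAST);
let mailbox = builder.add_text_field("mailbox", STRING | STORED | FAST);
let subject = builder.add_text_field("subject", custom()); // uses the "ngram3" tokenizer

// Text options for `subject`: indexed with the "ngram3" tokenizer, with
// positions recorded (needed for phrase queries), and stored.
fn custom() -> TextOptions {
    let indexing = TextFieldIndexing::default()
        .set_tokenizer("ngram3")
        .set_index_option(IndexRecordOption::WithFreqsAndPositions);
    TextOptions::default()
        .set_indexing_options(indexing)
        .set_stored()
}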

Steps to Reproduce:

1. Index documents with fields account, mailbox, and subject.
2. Run query: account:asdasd AND mailbox:iuhiuhihu AND subject:"苹果手机"
   Returns the correct documents, i.e. those with the phrase "苹果手机" in subject.
3. Run query: account:asdasd AND mailbox:iuhiuhihu AND NOT subject:"苹果手机"
   Returns empty results, even though documents matching account:asdasd AND mailbox:iuhiuhihu exist without "苹果手机" in subject.

Expected Behavior:

The second query should return documents where account:asdasd and mailbox:iuhiuhihu match, and subject does not contain "苹果手机".

Actual Behavior:

Empty result set.

Additional Observations:

Base query account:asdasd AND mailbox:iuhiuhihu works as expected.
NOT subject:苹果 (single term) also returns an empty set.

Environment:

Tantivy: 0.22
Rust: 1.84.1

How can I query for documents where subject does not contain the phrase "苹果手机"? Willing to provide sample data or logs if needed.

@fulmicoton
Collaborator

Hello @inboxsphere

The problem does not come from the handling of the NOT operator, but from a bad interaction with the ngram tokenizer.

The ngram tokenizer is used both at indexing time and at query time.

subject:"苹果手机" -> subject:ngram1 OR subject:ngram2 OR ...

I suspect you created your ngram tokenizer as NgramTokenizer::all_grams(1, 3)?
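
To make the tokenization step concrete, here is a minimal std-only sketch of character n-gram generation. It is not tantivy's implementation: `char_ngrams` is a hypothetical helper, and the order in which it emits grams is an assumption. It only illustrates which grams a given query string can produce for a chosen minimum and maximum gram length.

```rust
/// Hypothetical helper: all character n-grams of `text` with lengths in
/// `min_gram..=max_gram`, counted in Unicode scalar values rather than bytes.
fn char_ngrams(text: &str, min_gram: usize, max_gram: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut grams = Vec::new();
    for start in 0..chars.len() {
        for len in min_gram..=max_gram {
            let end = start + len;
            if end > chars.len() {
                // Longer grams starting here would run past the end of the text.
                break;
            }
            grams.push(chars[start..end].iter().collect());
        }
    }
    grams
}
```

In this sketch, `char_ngrams("苹果手机", 3, 3)` gives the two trigrams 苹果手 and 果手机, `char_ngrams("苹果手机", 1, 3)` gives nine grams, and `char_ngrams("苹果", 3, 3)` gives none, because the input is shorter than the minimum gram length. If the real tokenizer behaves the same way, a two-character term tokenized with 3-grams has nothing to match on, which may be worth checking against the single-term observation above.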

If you change the tokenizer, it might work as intended. There are several Chinese tokenizers available for tantivy.
For instance, you could use lindera (the crate is named lindera-tantivy) with a Chinese tokenizer.

Another possible approach would be to write a tokenizer that just emits each CJK character (kanji/hanzi) as its own token, and use phrase queries.
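
As a rough illustration of that last idea, the splitting logic could look like the std-only sketch below. Everything here is an assumption for illustration: `is_cjk` and `split_mixed` are hypothetical names, the Unicode ranges are a non-exhaustive selection, and the code does not implement tantivy's tokenizer traits. It only shows one way to emit each CJK character as a token while keeping runs of other alphanumeric characters together as words.

```rust
/// Hypothetical check covering a few common CJK blocks; not exhaustive.
fn is_cjk(c: char) -> bool {
    matches!(c,
        '\u{3040}'..='\u{30FF}'   // Hiragana and Katakana
        | '\u{3400}'..='\u{4DBF}' // CJK Unified Ideographs Extension A
        | '\u{4E00}'..='\u{9FFF}' // CJK Unified Ideographs
    )
}

/// Hypothetical splitter: each CJK character becomes its own token, and runs
/// of other alphanumeric characters become lowercased word tokens.
fn split_mixed(text: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut word = String::new();
    for c in text.chars() {
        if is_cjk(c) {
            // Flush any pending word before emitting the single-character token.
            if !word.is_empty() {
                tokens.push(std::mem::take(&mut word));
            }
            tokens.push(c.to_string());
        } else if c.is_alphanumeric() {
            word.extend(c.to_lowercase());
        } else if !word.is_empty() {
            // Whitespace or punctuation ends the current word.
            tokens.push(std::mem::take(&mut word));
        }
    }
    if !word.is_empty() {
        tokens.push(word);
    }
    tokens
}
```

For a mixed subject such as "New 苹果手机 iPhone16!", this sketch yields new, 苹, 果, 手, 机, iphone16. A phrase query for "苹果手机" would then look for those four single-character tokens at consecutive positions, so the same splitter would need to run at both indexing and query time.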

@inboxsphere
Copy link
Author

@fulmicoton I’m currently using NgramTokenizer::new(3, 3, false) for the subject field. My use case involves emails in various mainstream languages worldwide (e.g., English, Chinese, Japanese, etc.), often mixed within a single email (like Chinese-English combos). I can’t specify a tokenizer per email. Is there a tokenizer that can effectively handle such multilingual and mixed-language scenarios?
