Fill `position`s in `NgramTokenizer` #2471

darashi · 2024-08-06T12:04:07Z

I am looking into ways to make searches including non-ASCII characters. I want to search all documents that contain the query string as a substring, regardless of the word delimiter position.

For this purpose, I tried to use NgramTokenizer and PhraseQuery. However, it did not work as I expected. The order of the tokens seems to be ignored. I looked into the cause and found that Position was changed to all zeros in commit e75bb1d.

To be honest, I don't properly understand the reason for this change, but if this change was not necessary for the consistency of the specification or something, why not set the value of position?

So, I made a patch to fill "where the token starts, in terms of characters (not bytes)" in positions.

There may be a more appropriate way to do this and/or there may be something I've missed, but I would appreciate it if you could take a look.

As a supplement, I've used the following code to check the overall behavior:

// main.rs

use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::tokenizer::NgramTokenizer;
use tantivy::{doc, Index, IndexWriter, ReloadPolicy};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = SchemaBuilder::default();

    let text_options = TextOptions::default()
        .set_stored()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_index_option(IndexRecordOption::WithFreqsAndPositions)
                .set_tokenizer("ngram"),
        )
        .set_stored();

    schema_builder.add_text_field("text", text_options);
    let schema = schema_builder.build();

    let index = Index::create_in_ram(schema.clone());
    let mut index_writer: IndexWriter = index.writer(50_000_000)?;

    let tokenizer = NgramTokenizer::new(1, 1, false)?;
    index.tokenizers().register("ngram", tokenizer.clone());

    let text = schema.get_field("text").unwrap();

    // This is the only document in the index
    index_writer.add_document(doc!(
    text => "本日は晴天なり", // 本日は晴天なり: Today is a fine day
    ))?;
    index_writer.commit()?;

    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommitWithDelay)
        .try_into()?;
    let searcher = reader.searcher();

    let query_parser = QueryParser::for_index(&index, vec![text]);

    // ✅ Work as expected
    let query = query_parser.parse_query(r#""本日は""#)?; // 本日は: Today is
    println!("Query: {:?}", query);
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    for (_score, doc_address) in top_docs {
        let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
        println!("Matched (as expected): {}", retrieved_doc.to_json(&schema));
    }

    // ❌ Should not return any results
    println!();
    let query = query_parser.parse_query(r#""日本は""#)?; // 日本は: Japan is
    println!("Query: {:?}", query);
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    for (_score, doc_address) in top_docs {
        let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
        println!(
            "Matched (but not expeced): {}",
            retrieved_doc.to_json(&schema)
        );
    }

    Ok(())
}

PSeitz · 2024-08-29T02:39:19Z

@darashi I think removing the 0 in position seems good, but I'm not sure what would be best in terms of spec. position is expressed in number of tokens, but the number of tokens in case of NgramTokenizer is ambiguous.

    /// Position, expressed in number of tokens.
    pub position: usize,

It would be good to know how Lucene handles this, the docs here are a little unclear to me. Instead of Position increment I'd be interested in the absolute position emitted.
https://lucene.apache.org/core/9_11_1/analysis/common/org/apache/lucene/analysis/ngram/NGramTokenizer.html

What do you think? Can you check what tokens Lucene emits?

darashi · 2024-08-29T10:01:48Z

Thank you for your comment! The offset of a token in N-Gram varies depending on the N, so I agree that it's not that simple to determine it uniquely especially in cases where multiple Ns are mixed together. Even so, I thought it would be reasonable to express the position in terms of number of Unicode characters.

I'm a novice with Java, but I managed to get NGramTokenizer running;

For hεllo (including non-ascii) with minGram=2 and maxGram=3:

NGramTokenizer@6576fe71 term=hε,bytes=[68 ce b5],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=hεl,bytes=[68 ce b5 6c],startOffset=0,endOffset=3,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=εl,bytes=[ce b5 6c],startOffset=1,endOffset=3,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=εll,bytes=[ce b5 6c 6c],startOffset=1,endOffset=4,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=ll,bytes=[6c 6c],startOffset=2,endOffset=4,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=llo,bytes=[6c 6c 6f],startOffset=2,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=lo,bytes=[6c 6f],startOffset=3,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=,bytes=[],startOffset=5,endOffset=5,positionIncrement=0,positionLength=1,type=word,termFrequency=1

and for hello with minGram=2 and maxGram=3:

NGramTokenizer@6576fe71 term=he,bytes=[68 65],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=hel,bytes=[68 65 6c],startOffset=0,endOffset=3,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=el,bytes=[65 6c],startOffset=1,endOffset=3,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=ell,bytes=[65 6c 6c],startOffset=1,endOffset=4,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=ll,bytes=[6c 6c],startOffset=2,endOffset=4,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=llo,bytes=[6c 6c 6f],startOffset=2,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=lo,bytes=[6c 6f],startOffset=3,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=,bytes=[],startOffset=5,endOffset=5,positionIncrement=0,positionLength=1,type=word,termFrequency=1

If I'm not mistaken, it looks like startOffset indicates the absolute position offset in characters. Regardless of the presence of non-ascii characters, the same values are returned in both cases, namely [0, 0, 1, 1, 2, 2, 3, 5].

I used the codes in https://github.com/darashi/hellolucene to obtain the output mentioned above.

PSeitz · 2024-09-06T01:39:41Z

Thanks for checking. Unfortunately they emit positionIncrement, so it's still unclear what they use for position as absolute value.

If I'm not mistaken, it looks like startOffset indicates the absolute position offset in characters. Regardless of the presence of non-ascii characters, the same values are returned in both cases, namely [0, 0, 1, 1, 2, 2, 3, 5].

In tantivy we use byte offsets, maybe they use something else?

Can you add tests for ngram with different settings with phrase queries?
btw. I think the regular PhraseQuery is quite wasteful as it will match all tokens

Fill positions in NgramTokenizer

f90a800

Fix format using cargo +nightly fmt

94c4585

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fill `position`s in `NgramTokenizer` #2471

Fill `position`s in `NgramTokenizer` #2471

darashi commented Aug 6, 2024

PSeitz commented Aug 29, 2024

darashi commented Aug 29, 2024 •

edited

Loading

PSeitz commented Sep 6, 2024 •

edited

Loading

Fill positions in NgramTokenizer #2471

Are you sure you want to change the base?

Fill positions in NgramTokenizer #2471

Conversation

darashi commented Aug 6, 2024

PSeitz commented Aug 29, 2024

darashi commented Aug 29, 2024 • edited Loading

PSeitz commented Sep 6, 2024 • edited Loading

Fill `position`s in `NgramTokenizer` #2471

Fill `position`s in `NgramTokenizer` #2471

darashi commented Aug 29, 2024 •

edited

Loading

PSeitz commented Sep 6, 2024 •

edited

Loading