Updating Index in Pyserini Without Full Reindexing when Document Contents Change #1964

LaplaceXD · 2024-08-20T05:34:49Z

Hi all,

I'm working on a project with Pyserini and would appreciate some guidance on efficiently updating document indexes. My goal is to avoid reindexing the entire document list whenever a document changes. Initially, I planned to delete the specific document from the index and then append the updated version. However, I couldn't find an IndexWriter module that allows for document deletion.

I also tried using the -uniqueDocid flag with the LuceneIndexer set to append mode, but it didn't seem to remove the old document entry from the index.

At this point, I'm uncertain whether this approach is possible in Pyserini or if there's a more suitable method for incremental indexing. Any guidance or references to relevant code or examples would be greatly appreciated.

Thanks in advance for your insights!

lintool · 2024-08-20T16:30:47Z

See #1451 - does this help?

LaplaceXD · 2024-08-20T17:58:48Z

> See #1451 - does this help?

Unfortunately, it doesn't. It works for one of our use cases, adding new documents, but not for the other: when the contents of an existing document change, we want the index to update accordingly.

>>> from pyserini.index.lucene import LuceneIndexer, IndexReader
>>> indexer = LuceneIndexer("index")
2024-08-21 01:51:56,150 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:138) - Using DefaultEnglishAnalyzer
2024-08-21 01:51:56,153 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:139) - Stemmer: porter
2024-08-21 01:51:56,153 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:140) - Keep stopwords? false
2024-08-21 01:51:56,153 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:141) - Stopwords file: null
Aug 21, 2024 1:51:56 AM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
>>> indexer.add_doc_dict({'id': '0', 'contents': 'Hello there!'})
>>> indexer.add_doc_dict({'id': '1', 'contents': 'A completely unique document.'})
>>> indexer.close()
>>> reader = IndexReader("index")
>>> reader.stats()
{'total_terms': 4, 'documents': 2, 'non_empty_documents': 2, 'unique_terms': 4}
>>> indexer = LuceneIndexer("index", append=True)
2024-08-21 01:52:54,745 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:138) - Using DefaultEnglishAnalyzer
2024-08-21 01:52:54,745 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:139) - Stemmer: porter
2024-08-21 01:52:54,746 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:140) - Keep stopwords? false
2024-08-21 01:52:54,746 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:141) - Stopwords file: null
>>> indexer.add_doc_dict({'id': '1', 'contents': 'A new document!'})
>>> indexer.close()
>>> reader = IndexReader("index")
>>> reader.stats()
{'total_terms': 6, 'documents': 3, 'non_empty_documents': 3, 'unique_terms': -1}

Here, in the second call to reader.stats(), I expected re-adding document id 1 to overwrite the existing document in the index. Instead, it was treated as a different document, and the document count went from 2 to 3.

lintool · 2024-08-20T18:08:27Z

Unfortunately, the document deletion bindings have not been exposed on the Java end (from Lucene), so this is not currently doable. You're certainly welcome to send a PR to implement this functionality... otherwise, this feature request is noted and we might circle back to implement when our team has extra cycles.

LaplaceXD · 2024-08-20T18:11:48Z

I see, thanks for the clarification! I'll see what I can do in the meantime.
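One possible stopgap while deletion is not exposed is to leave stale entries in the index and hide them at query time. The sketch below is only an illustration of that idea, not something Pyserini provides: `VersionRegistry`, `filter_stale`, and the `<id>@v<n>` docid scheme are all hypothetical names. It assumes each update is appended under a fresh versioned docid (for example via `add_doc_dict` as in the session above) and that search results are available as `(docid, score)` pairs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class VersionRegistry:
    """Hypothetical sidecar map: logical document id -> latest version number."""
    latest: Dict[str, int] = field(default_factory=dict)

    def next_docid(self, logical_id: str) -> str:
        """Bump the version and return the docid to write into the index."""
        version = self.latest.get(logical_id, 0) + 1
        self.latest[logical_id] = version
        return f"{logical_id}@v{version}"

    def is_current(self, index_docid: str) -> bool:
        """True only if the docid carries the latest known version."""
        logical_id, sep, version = index_docid.rpartition("@v")
        if not sep or not version.isdigit():
            return False  # not a docid produced by next_docid
        return self.latest.get(logical_id) == int(version)


def filter_stale(hits: List[Tuple[str, float]],
                 registry: VersionRegistry,
                 k: int) -> List[Tuple[str, float]]:
    """Drop hits that point at superseded versions, then keep the top k."""
    fresh = [(docid, score) for docid, score in hits
             if registry.is_current(docid)]
    return fresh[:k]


if __name__ == "__main__":
    registry = VersionRegistry()
    first = registry.next_docid("1")    # docid for the original document
    second = registry.next_docid("1")   # docid for the updated document
    # Ranked hits as (docid, score) pairs; the scores here are made up.
    hits = [(first, 3.0), (second, 2.5)]
    print(filter_stale(hits, registry, k=10))
```

This has real limits: superseded documents still take up space and still count toward collection statistics, so scores can drift until the index is rebuilt. The caller also has to request more than k hits to leave room for filtered-out entries, and the registry has to be persisted alongside the index.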

LaplaceXD changed the title ~~Updating Indexes in Pyserini Without Full Reindexing~~ Updating Index in Pyserini Without Full Reindexing when Document Contents Change Aug 22, 2024
