Updating Index in Pyserini Without Full Reindexing when Document Contents Change #1964
Comments
See #1451 - does this help?
Unfortunately, it doesn't. It works for one of our use cases, adding new documents, but not for the other: when we update the contents of an existing document, we want the index to update accordingly.
>>> from pyserini.index.lucene import LuceneIndexer, IndexReader
>>> indexer = LuceneIndexer("index")
2024-08-21 01:51:56,150 INFO [main] index.SimpleIndexer (SimpleIndexer.java:138) - Using DefaultEnglishAnalyzer
2024-08-21 01:51:56,153 INFO [main] index.SimpleIndexer (SimpleIndexer.java:139) - Stemmer: porter
2024-08-21 01:51:56,153 INFO [main] index.SimpleIndexer (SimpleIndexer.java:140) - Keep stopwords? false
2024-08-21 01:51:56,153 INFO [main] index.SimpleIndexer (SimpleIndexer.java:141) - Stopwords file: null
Aug 21, 2024 1:51:56 AM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 21; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
>>> indexer.add_doc_dict({'id': '0', 'contents': 'Hello there!'})
>>> indexer.add_doc_dict({'id': '1', 'contents': 'A completely unique document.'})
>>> indexer.close()
>>> reader = IndexReader("index")
>>> reader.stats()
{'total_terms': 4, 'documents': 2, 'non_empty_documents': 2, 'unique_terms': 4}
>>> indexer = LuceneIndexer("index", append=True)
2024-08-21 01:52:54,745 INFO [main] index.SimpleIndexer (SimpleIndexer.java:138) - Using DefaultEnglishAnalyzer
2024-08-21 01:52:54,745 INFO [main] index.SimpleIndexer (SimpleIndexer.java:139) - Stemmer: porter
2024-08-21 01:52:54,746 INFO [main] index.SimpleIndexer (SimpleIndexer.java:140) - Keep stopwords? false
2024-08-21 01:52:54,746 INFO [main] index.SimpleIndexer (SimpleIndexer.java:141) - Stopwords file: null
>>> indexer.add_doc_dict({'id': '1', 'contents': 'A new document!'})
>>> indexer.close()
>>> reader = IndexReader("index")
>>> reader.stats()
{'total_terms': 6, 'documents': 3, 'non_empty_documents': 3, 'unique_terms': -1}
Here in the second invocation of
Unfortunately, the document deletion bindings have not been exposed on the Java end (from Lucene), so this is not currently doable. You're certainly welcome to send a PR to implement this functionality... otherwise, this feature request is noted and we might circle back to implement when our team has extra cycles.
I see, thanks for the clarification! I'll see what I can do in the meantime.
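Until deletion is exposed, one possible stopgap is to keep a JSONL collection outside the index as the source of truth, replace records there by id, and rebuild the index from that file into a fresh directory (for example by looping over the records with add_doc_dict, as in the session above). This does not avoid reindexing; it only keeps the collection free of stale duplicates. The sketch below uses only the Python standard library, and the helper name upsert_jsonl and the file layout are hypothetical, not part of Pyserini.

```python
import json
import os
import tempfile


def upsert_jsonl(path, doc):
    """Replace the record whose 'id' matches doc['id'], or append it if absent.

    Hypothetical helper: it edits a JSONL collection file, not a Lucene index.
    Returns True if an existing record was replaced, False if doc was appended.
    """
    records = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            records = [json.loads(line) for line in f if line.strip()]

    replaced = False
    for i, record in enumerate(records):
        if record["id"] == doc["id"]:
            records[i] = doc
            replaced = True
            break
    if not replaced:
        records.append(doc)

    # Write to a temporary file first, then swap it in, so a crash
    # mid-write cannot leave a half-written collection behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    os.replace(tmp_path, path)
    return replaced


def load_jsonl(path):
    """Read the collection back as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling upsert_jsonl with the three documents from the session above leaves two records in the file, with id '1' holding the new contents, so a rebuild from that file would index two documents rather than three.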
Hi all,
I'm working on a project with Pyserini and would appreciate some guidance on efficiently updating document indexes. My goal is to avoid reindexing the entire document list whenever a document changes. Initially, I planned to delete the specific document from the index and then append the updated version. However, I couldn't find an IndexWriter module that allows for document deletion.
I also tried using the -uniqueDocid flag with the LuceneIndexer set to append mode, but it didn't seem to remove the old document entry from the index.
At this point, I'm uncertain whether this approach is possible in Pyserini or if there's a more suitable method for incremental indexing. Any guidance or references to relevant code or examples would be greatly appreciated.
Thanks in advance for your insights!
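To make the difference between the two behaviours concrete, here is a minimal toy model of append-only adds versus the delete-then-add update described above. It is an in-memory illustration only: ToyIndex and its methods are hypothetical names, the tokenizer is a plain whitespace split, and none of it reflects how Pyserini or Lucene is implemented.

```python
from collections import defaultdict


class ToyIndex:
    """In-memory stand-in for an index; not Pyserini or Lucene code."""

    def __init__(self):
        self._next_slot = 0
        self.slots = {}                    # internal slot -> (doc_id, tokens)
        self.postings = defaultdict(set)   # token -> set of internal slots

    def add(self, doc_id, contents):
        """Append-only add: never checks whether doc_id already exists."""
        tokens = contents.lower().split()
        slot = self._next_slot
        self._next_slot += 1
        self.slots[slot] = (doc_id, tokens)
        for token in tokens:
            self.postings[token].add(slot)

    def delete(self, doc_id):
        """Remove every entry stored under doc_id; return how many were removed."""
        doomed = [s for s, (d, _) in self.slots.items() if d == doc_id]
        for slot in doomed:
            _, tokens = self.slots.pop(slot)
            for token in set(tokens):
                self.postings[token].discard(slot)
                if not self.postings[token]:
                    del self.postings[token]
        return len(doomed)

    def update(self, doc_id, contents):
        """Delete-then-add: the behaviour the question is asking for."""
        self.delete(doc_id)
        self.add(doc_id, contents)

    def num_docs(self):
        return len(self.slots)

    def search(self, token):
        """Return the sorted external ids of entries containing token."""
        slots = self.postings.get(token.lower(), ())
        return sorted({self.slots[s][0] for s in slots})


# Append-only: re-adding id '1' leaves the old entry in place.
appended = ToyIndex()
appended.add("0", "Hello there!")
appended.add("1", "A completely unique document.")
appended.add("1", "A new document!")
print(appended.num_docs())         # 3
print(appended.search("unique"))   # ['1'] (stale entry still matches)

# Delete-then-add: the old entry is removed before the new one is added.
updated = ToyIndex()
updated.add("0", "Hello there!")
updated.add("1", "A completely unique document.")
updated.update("1", "A new document!")
print(updated.num_docs())          # 2
print(updated.search("unique"))    # []
```

The append-only run ends with three entries, matching the 'documents': 3 reported in the session above, while the delete-then-add run ends with two.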