Memory-Efficient Document Loading with LeanDocs + QdrantVectorStore #783
Conversation
Hey @ThomasRochefortB thanks for the contribution!! I think lazy-load embedding + text data into That being said the I think a more flexible approach here could be to make a retrieval tool for
Hey @mskarlin! Thank you for the answer. Maybe we are using PaperQA2 wrong... Our current workflow starts from a cloud bucket containing a large number of papers, and we concurrently loop through the bucket to do something like this:

```python
# Prepare the Docs object by adding a bunch of documents
docs = Docs()
for doc_path in doc_paths:
    docs.add(doc_path)
```

Our thinking was that embedding the papers "on the fly" resulted in a big impact on latency during Q&A usage.
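For concreteness, one way such a concurrent loading loop might be sketched is with `asyncio`, assuming an async `aadd`-style add method; the `Docs` class below is only a minimal stand-in so the snippet is self-contained, not PaperQA2's actual implementation:

```python
import asyncio

class Docs:
    """Toy stand-in for PaperQA2's Docs, just enough to show the pattern."""
    def __init__(self):
        self.docnames = set()

    async def aadd(self, doc_path: str) -> None:
        await asyncio.sleep(0)  # placeholder for download + parse + embed I/O
        self.docnames.add(doc_path)

async def load_all(doc_paths, concurrency: int = 8) -> Docs:
    docs = Docs()
    sem = asyncio.Semaphore(concurrency)  # cap concurrent embedding calls

    async def add_one(path: str) -> None:
        async with sem:
            await docs.aadd(path)

    await asyncio.gather(*(add_one(p) for p in doc_paths))
    return docs

docs = asyncio.run(load_all([f"paper_{i}.pdf" for i in range(20)]))
print(len(docs.docnames))  # 20
```

The semaphore bounds how many papers are parsed and embedded at once, which matters when each add triggers remote embedding calls.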
Hey @ThomasRochefortB -- you're right that embedding papers can cause extra latency, but we only incur the latency cost once, then use a cache for subsequent uses. In your case, you could start by making a full-text index of the files in the bucket, or, if you add a Qdrant search tool, you could initially use that. Then, after querying, you'll have a subset of candidate documents (like 10 or so). You can run Then you'd use
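The two-stage workflow described above (a cheap lexical search to narrow the corpus, then building a small `Docs` from only the survivors) could be sketched roughly like this; every name here is an illustrative stand-in, not a PaperQA2 or Qdrant API:

```python
# Hypothetical two-stage retrieval: a toy full-text scorer narrows the
# corpus before any paper is embedded.
def full_text_search(index, query, k=10):
    """Toy keyword scorer standing in for a real full-text / Qdrant index."""
    words = query.lower().split()
    scored = sorted(index, key=lambda doc: -sum(w in doc["text"].lower()
                                                for w in words))
    return scored[:k]

index = [
    {"path": "a.pdf", "text": "protein folding with transformers"},
    {"path": "b.pdf", "text": "weather forecasting models"},
    {"path": "c.pdf", "text": "protein structure prediction"},
]

candidates = full_text_search(index, "protein structure", k=2)
candidate_paths = [c["path"] for c in candidates]
print(candidate_paths)  # ['c.pdf', 'a.pdf']
```

Each path in `candidate_paths` would then go through the normal add/evidence-gathering step, so only a handful of papers per query ever incurs the embedding cost.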
@mskarlin Thanks for the insightful discussion! Thank you to you and the rest of the FutureHouse team for your great work! 🔥

Thanks for this Thomas, keep it up!
Memory-Efficient Document Loading with LeanDocs
Context:
At Valence Labs, we are working with large document collections (>100k to >1M papers). We are encountering significant challenges in using PaperQA2 due to the simple implementation of the `Docs` object. Now that `Qdrant` is better supported in the repo, I would like to start a discussion on the possibility of using a "leaner" `Docs()` object and better leveraging the external DB.

Problem
The current `Docs` implementation loads all text chunks and their embeddings into memory when loading documents. For large document collections, this can lead to significant memory usage, since all text content and embeddings are stored in RAM. For >40k papers, we are seeing >35 GB of RAM usage just from loading a saved `docs.pkl`.
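As a rough back-of-envelope check on why this blows up (the chunk count and embedding dimensionality below are assumptions for illustration, not measurements from PaperQA2):

```python
papers = 40_000          # corpus size from the report above
chunks_per_paper = 50    # assumed average; varies with paper length
dim = 1536               # e.g. a common OpenAI embedding dimensionality
bytes_per_float = 4      # float32

embedding_bytes = papers * chunks_per_paper * dim * bytes_per_float
print(f"{embedding_bytes / 1e9:.1f} GB of embeddings alone")  # 12.3 GB of embeddings alone
```

Add the chunk text, Python object overhead, and pickle deserialization copies on top of that, and tens of GB of resident memory is plausible.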
Solution
Introduce `LeanDocs`, a memory-efficient alternative to `Docs` that maintains document metadata in memory while keeping text chunks and embeddings in Qdrant. Key changes:
- `LeanDocs` maintains the same interface as `Docs` for compatibility
- `QdrantVectorStore.load_docs()` to create a `LeanDocs` instance

Implementation Details
- `LeanDocs` follows the same pattern as `Docs`, with `docs`, `docnames`, and other core attributes
- The `load_docs` method only loads document metadata on initialization

Testing
Benefits
Quick local test on my end: the `Docs` object takes about 70.15 MB, while the new `LeanDocs` takes about 0.0574 MB.

I would like guidance on testing the implementation on LitQA2 to make sure no "silent" bugs are affecting the quality of the results.
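To make the distinction concrete, here is a self-contained toy sketch of the two designs and the resulting serialized-size gap. The classes and the dict standing in for Qdrant are illustrative only, not the PR's actual code, and the corpus numbers are deliberately small:

```python
import pickle
from dataclasses import dataclass, field

@dataclass
class FullDocs:
    """Stand-in for Docs: chunks and embeddings live in RAM."""
    docnames: list = field(default_factory=list)
    chunks: dict = field(default_factory=dict)       # docname -> list[str]
    embeddings: dict = field(default_factory=dict)   # docname -> list[vector]

@dataclass
class LeanDocs:
    """Stand-in for the proposed LeanDocs: metadata only; chunks and
    embeddings stay in the external store and are fetched on demand."""
    docnames: list = field(default_factory=list)

    def get_chunks(self, store, docname):
        return store[docname]  # a Qdrant lookup in the real version

store = {}  # plays the role of the Qdrant collection
full, lean = FullDocs(), LeanDocs()
for i in range(100):
    name = f"paper_{i}"
    chunks = [f"chunk {j} of {name}" for j in range(20)]
    vecs = [[0.0] * 384 for _ in chunks]  # toy 384-dim embeddings
    full.docnames.append(name)
    full.chunks[name] = chunks
    full.embeddings[name] = vecs
    lean.docnames.append(name)
    store[name] = chunks

# The lean variant serializes orders of magnitude smaller, mirroring
# the 70.15 MB vs 0.0574 MB observation above.
print(len(pickle.dumps(full)) > 100 * len(pickle.dumps(lean)))  # True
print(lean.get_chunks(store, "paper_0")[0])  # chunk 0 of paper_0
```

The design trade-off is that every chunk access becomes a store lookup, so query latency depends on the vector DB rather than on RAM, which is exactly the caching discussion from the conversation above.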