
Memory-Efficient Document Loading with LeanDocs + QdrantVectorStore #783

Closed

Conversation

ThomasRochefortB
Contributor

Memory-Efficient Document Loading with LeanDocs

Context:

At Valence Labs, we work with large document collections (>100k to >1M papers). We are encountering significant challenges using PaperQA2 because of the simple, fully in-memory implementation of the Docs object.

Now that Qdrant is better supported in the repo, I would like to start a discussion on the possibility of using a "leaner" Docs() object that better leverages the external DB.

Problem

The current Docs implementation loads all text chunks and their embeddings into memory when loading documents. For large document collections this leads to substantial memory usage, since every chunk's text and embedding is held in RAM. For more than 40k papers, we see over 35 GB of RAM usage just from loading a saved docs.pkl.

Solution

Introduce LeanDocs - a memory-efficient alternative to Docs that maintains document metadata in memory while keeping text chunks and embeddings in Qdrant. Key changes:

  • LeanDocs maintains the same interface as Docs for compatibility
  • Modified QdrantVectorStore.load_docs() to create a LeanDocs instance
  • Text chunks and embeddings remain in Qdrant and are retrieved only when needed
  • Only document metadata (docname, citation, dockey) is kept in memory

Implementation Details

  • LeanDocs follows the same pattern as Docs with docs, docnames, and other core attributes
  • The load_docs method only loads document metadata on initialization
  • Text chunks are accessed through the Qdrant vector store when needed for queries
  • Maintains full compatibility with existing PaperQA functionality
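To make the idea above concrete, here is a minimal, hypothetical sketch of the metadata-only shape LeanDocs takes. The class and method names below are illustrative stand-ins, not the actual PR code: only `docname`, `citation`, and `dockey` live in memory, while chunk texts and embeddings stay in the external store.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the LeanDocs idea: lightweight metadata in memory,
# chunk texts and embeddings left in the external vector store (Qdrant).
@dataclass
class LeanDocsSketch:
    docs: dict = field(default_factory=dict)   # dockey -> metadata record
    docnames: set = field(default_factory=set)

    def add_metadata(self, dockey: str, docname: str, citation: str) -> None:
        # Store only the three metadata fields; no texts, no embeddings.
        self.docs[dockey] = {
            "docname": docname,
            "citation": citation,
            "dockey": dockey,
        }
        self.docnames.add(docname)

docs = LeanDocsSketch()
docs.add_metadata("abc123", "Smith2024", "Smith et al., 2024")
print(docs.docnames)  # {'Smith2024'}
```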

Testing

  • I have tested the implementation with the following:
import asyncio

from qdrant_client import AsyncQdrantClient

from paperqa.llms import QdrantVectorStore  # adjust import path as needed

import nest_asyncio

nest_asyncio.apply()  # only needed if a loop is already running (e.g. in a notebook)


async def test_load_docs_from_qdrant():
    client = AsyncQdrantClient(url="http://localhost:6333")
    docs = await QdrantVectorStore.load_lean_docs(
        client=client,
        collection_name="test-collection",
        vector_name=None,
        batch_size=100,
        max_concurrent_requests=5,
    )
    print(docs)
    return docs


if __name__ == "__main__":
    # Run the async test
    docs = asyncio.run(test_load_docs_from_qdrant())

Benefits

  • Quick local test on my end:

    • For a collection of 43 papers, the original Docs object takes about 70.15 MB, while the new LeanDocs takes about 0.0574 MB
  • I would like guidance on testing the implementation on LitQA2 to make sure no "silent" bugs are affecting the quality of the results.
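The memory comparison above can be reproduced with a rough proxy: an object's pickled size. This is only a sketch (pickled size is not the same as resident RAM, and the `heavy`/`lean` objects below are stand-ins, not real Docs instances), but it shows the measurement approach.

```python
import pickle

def pickled_size_mb(obj) -> float:
    # Approximate an object's footprint by its serialized (pickled) size.
    # This is a rough proxy: pickle size != resident memory.
    return len(pickle.dumps(obj)) / (1024 * 1024)

# Stand-ins for a chunk-holding Docs vs. a metadata-only LeanDocs:
heavy = {"chunks": ["x" * 1000] * 1000}  # texts held in memory
lean = {"metadata": {"dockey": "abc123", "docname": "Smith2024"}}

print(pickled_size_mb(heavy) > pickled_size_mb(lean))  # True
```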

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. enhancement New feature or request labels Jan 2, 2025
@mskarlin
Collaborator

mskarlin commented Jan 2, 2025

Hey @ThomasRochefortB, thanks for the contribution!! I think lazy-loading embedding + text data into LeanDocs is a solid approach to your issue.

That being said, the paperqa.Docs design philosophy diverges from this thinking a bit. At FutureHouse, we also have 10M+ document stores, and ideally a Docs object holds the current working state for a single question. We rely on retrieval steps (like what you'd see here: https://github.com/Future-House/paper-qa/blob/main/paperqa/agents/tools.py#L107) to populate a Docs object on the fly in the context of a particular question. That way Docs objects only store ~100-1000 relevant chunks, which makes them much more lightweight. That's also why we haven't needed stores like Qdrant.

I think a more flexible approach here could be to make a retrieval tool for Qdrant which pulls down the candidate papers and inserts them into a Docs object in the context of a user question. You could rely on the agent to build the Qdrant query to add some intelligence and flexibility; we find it actually increases QA performance. Let me know your thoughts!
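The retrieval-tool pattern described above could be sketched as follows. This is a hedged illustration, not PaperQA or Qdrant API: `retrieve_and_populate` and `fake_store` are hypothetical names, and the store is stubbed in-memory where a real implementation would call an AsyncQdrantClient search.

```python
import asyncio

# Sketch of the retrieval-tool pattern: query an external store for a given
# question, then insert only the candidate hits into a per-question Docs-like
# container, keeping it lightweight (~10 candidates rather than the corpus).
async def retrieve_and_populate(question: str, search_store, docs: dict,
                                top_k: int = 10) -> dict:
    hits = await search_store(question, top_k)  # e.g. a Qdrant query
    for hit in hits:
        docs[hit["dockey"]] = hit               # insert candidates only
    return docs

async def fake_store(question: str, top_k: int):
    # Stand-in for an AsyncQdrantClient search call.
    return [{"dockey": f"doc{i}", "text": f"chunk {i}"} for i in range(top_k)]

docs = asyncio.run(retrieve_and_populate("what is X?", fake_store, {}, top_k=3))
print(sorted(docs))  # ['doc0', 'doc1', 'doc2']
```

An agent could construct the `question` string (or a structured Qdrant filter) itself, which is where the added intelligence comes from.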

@ThomasRochefortB
Contributor Author

Hey @mskarlin ! Thank you for the answer.

Maybe we are using PaperQA2 wrong... Our current workflow starts from a cloud bucket containing a large number of papers and concurrently loops through the bucket to do something like this:

# Prepare the Docs object by adding a bunch of documents
docs = Docs()
for doc_path in doc_paths:
    docs.add(doc_path)

Our thinking was that embedding the papers "on the fly" would add significant latency during Q&A usage.
Are we doing this wrong? Do you not consider the text embedding step a bottleneck in your process? Should we instead just pre-build an index of the bucket and use agent_query at runtime?

@mskarlin
Collaborator

mskarlin commented Jan 2, 2025

Hey @ThomasRochefortB -- you're right that embedding papers can cause extra latency, but we only incur that cost once, then use a cache for subsequent uses. In your case, you could start by making a full-text index on the files in the bucket, or, if you add a Qdrant search tool, you could initially use that. Then, after querying, you'll have a subset of candidate documents (around 10 or so). You can run docs.aadd(doc_path) on any that are new, then save the resulting embeddings and texts in a database. The next time those documents come back from your full-text or Qdrant search, you can pull the texts/embeddings from your DB and add them to the Docs object with docs.aadd_texts, which should only take a few milliseconds for typical document/embedding sizes.

Then you'd use agent_query at runtime, just like you suggested.
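The embed-once-then-cache flow described above can be sketched like this. Everything here is a stand-in for illustration: the cache would be a real database keyed by dockey, and `fake_embed` would be a real embedding call (the expensive step you only pay once per document).

```python
# Sketch of the caching flow: embed a document's text the first time it is
# seen, store the result, and serve all later requests from the cache.
cache: dict[str, list[float]] = {}  # stand-in for a persistent DB

def fake_embed(text: str) -> list[float]:
    # Placeholder for a real (slow, paid-once) embedding call.
    return [float(len(text))]

def get_embedding(dockey: str, text: str) -> list[float]:
    if dockey not in cache:            # first encounter: pay the embedding cost
        cache[dockey] = fake_embed(text)
    return cache[dockey]               # subsequent calls: fast, from cache

get_embedding("paper1", "some chunk")
print("paper1" in cache)  # True
```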

@ThomasRochefortB
Contributor Author

@mskarlin Thanks for the insightful discussion!
I suggest closing this PR, then, since I misunderstood the goal of the Docs() object.
We'll keep the implementation in our forked branch, since it is what we will be using internally going forward.
Feel free to reach out again if you would be interested in it in the future.

Thank you to you and the rest of the Futurehouse team for your great work! 🔥

@jamesbraza
Collaborator

Thanks for this Thomas, keep it up!

@jamesbraza jamesbraza closed this Jan 3, 2025