
Memory-Efficient Document Loading with LeanDocs + QdrantVectorStore #783

Closed

Conversation

ThomasRochefortB
Contributor

Memory-Efficient Document Loading with LeanDocs

Context:

At Valence Labs, we work with large document collections (>100k to >1M papers). We are encountering significant challenges using PaperQA2 because of the simple, fully in-memory implementation of the Docs object.

Now that Qdrant is better supported in the repo, I would like to start a discussion on the possibility of using a "leaner" Docs() object that better leverages the external DB.

Problem

The current Docs implementation loads all text chunks and their embeddings into memory when loading documents. For large document collections this leads to substantial memory usage, since every chunk's text and embedding is held in RAM. For more than 40k papers, we see over 35 GB of RAM usage just from loading a saved docs.pkl.

Solution

Introduce LeanDocs - a memory-efficient alternative to Docs that maintains document metadata in memory while keeping text chunks and embeddings in Qdrant. Key changes:

  • LeanDocs maintains the same interface as Docs for compatibility
  • Modified QdrantVectorStore.load_docs() to create a LeanDocs instance
  • Text chunks and embeddings remain in Qdrant and are retrieved only when needed
  • Only document metadata (docname, citation, dockey) is kept in memory

Implementation Details

  • LeanDocs follows the same pattern as Docs with docs, docnames, and other core attributes
  • The load_docs method only loads document metadata on initialization
  • Text chunks are accessed through the Qdrant vector store when needed for queries
  • Maintains full compatibility with existing PaperQA functionality
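To make the idea above concrete, here is a minimal, hypothetical sketch of the metadata-only shape LeanDocs takes. The class and method names below are illustrative stand-ins, not the actual PR code: only `docname`, `citation`, and `dockey` live in memory, while chunk texts and embeddings stay in the external store.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the LeanDocs idea: lightweight metadata in memory,
# chunk texts and embeddings left in the external vector store (Qdrant).
@dataclass
class LeanDocsSketch:
    docs: dict = field(default_factory=dict)   # dockey -> metadata record
    docnames: set = field(default_factory=set)

    def add_metadata(self, dockey: str, docname: str, citation: str) -> None:
        # Store only the three metadata fields; no texts, no embeddings.
        self.docs[dockey] = {
            "docname": docname,
            "citation": citation,
            "dockey": dockey,
        }
        self.docnames.add(docname)

docs = LeanDocsSketch()
docs.add_metadata("abc123", "Smith2024", "Smith et al., 2024")
print(docs.docnames)  # {'Smith2024'}
```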

Testing

  • I have tested the implementation with the following:
import asyncio

from qdrant_client import AsyncQdrantClient

from paperqa.llms import QdrantVectorStore  # adjust import path as needed

import nest_asyncio

nest_asyncio.apply()  # only needed if a loop is already running (e.g. in a notebook)


async def test_load_docs_from_qdrant():
    client = AsyncQdrantClient(url="http://localhost:6333")
    docs = await QdrantVectorStore.load_lean_docs(
        client=client,
        collection_name="test-collection",
        vector_name=None,
        batch_size=100,
        max_concurrent_requests=5,
    )
    print(docs)
    return docs


if __name__ == "__main__":
    # Run the async test
    docs = asyncio.run(test_load_docs_from_qdrant())

Benefits

  • Quick local test on my end:

    • For a collection of 43 papers, the original Docs object takes about 70.15 MB, while the new LeanDocs takes about 0.0574 MB
  • I would like guidance on testing the implementation on LitQA2 to make sure no "silent" bugs are affecting the quality of the results.
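The memory comparison above can be reproduced with a rough proxy: an object's pickled size. This is only a sketch (pickled size is not the same as resident RAM, and the `heavy`/`lean` objects below are stand-ins, not real Docs instances), but it shows the measurement approach.

```python
import pickle

def pickled_size_mb(obj) -> float:
    # Approximate an object's footprint by its serialized (pickled) size.
    # This is a rough proxy: pickle size != resident memory.
    return len(pickle.dumps(obj)) / (1024 * 1024)

# Stand-ins for a chunk-holding Docs vs. a metadata-only LeanDocs:
heavy = {"chunks": ["x" * 1000] * 1000}  # texts held in memory
lean = {"metadata": {"dockey": "abc123", "docname": "Smith2024"}}

print(pickled_size_mb(heavy) > pickled_size_mb(lean))  # True
```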

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. enhancement New feature or request labels Jan 2, 2025
@mskarlin
Collaborator

mskarlin commented Jan 2, 2025

Hey @ThomasRochefortB, thanks for the contribution!! I think lazy-loading embedding + text data into LeanDocs is a solid approach to your issue.

That being said, the paperqa.Docs design philosophy diverges from this thinking a bit. At FutureHouse, we also have 10M+ document stores, and ideally a Docs object holds the current working state for a single question. We rely on retrieval steps (like what you'd see here: https://github.com/Future-House/paper-qa/blob/main/paperqa/agents/tools.py#L107) to populate a Docs object on the fly in the context of a particular question. That way Docs objects only store ~100-1000 relevant chunks, which makes them much more lightweight. That's also why we haven't needed stores like Qdrant.

I think a more flexible approach here could be to make a retrieval tool for Qdrant which pulls down the candidate papers and inserts them into a Docs object in the context of a user question. You could rely on the agent to build the Qdrant query to add some intelligence and flexibility; we find it actually increases QA performance. Let me know your thoughts!
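The retrieval-tool pattern described above could be sketched as follows. This is a hedged illustration, not PaperQA or Qdrant API: `retrieve_and_populate` and `fake_store` are hypothetical names, and the store is stubbed in-memory where a real implementation would call an AsyncQdrantClient search.

```python
import asyncio

# Sketch of the retrieval-tool pattern: query an external store for a given
# question, then insert only the candidate hits into a per-question Docs-like
# container, keeping it lightweight (~10 candidates rather than the corpus).
async def retrieve_and_populate(question: str, search_store, docs: dict,
                                top_k: int = 10) -> dict:
    hits = await search_store(question, top_k)  # e.g. a Qdrant query
    for hit in hits:
        docs[hit["dockey"]] = hit               # insert candidates only
    return docs

async def fake_store(question: str, top_k: int):
    # Stand-in for an AsyncQdrantClient search call.
    return [{"dockey": f"doc{i}", "text": f"chunk {i}"} for i in range(top_k)]

docs = asyncio.run(retrieve_and_populate("what is X?", fake_store, {}, top_k=3))
print(sorted(docs))  # ['doc0', 'doc1', 'doc2']
```

An agent could construct the `question` string (or a structured Qdrant filter) itself, which is where the added intelligence comes from.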

@ThomasRochefortB
Contributor Author

Hey @mskarlin ! Thank you for the answer.

Maybe we are using PaperQA2 wrong... Our current workflow starts from a cloud bucket containing a large number of papers and concurrently loops through the bucket to do something like this:

# Prepare the Docs object by adding a bunch of documents
docs = Docs()
for doc_path in doc_paths:
    docs.add(doc_path)

Our thinking was that embedding the papers "on the fly" would add significant latency during Q&A usage.
Are we doing this wrong? Do you not consider the text embedding step a bottleneck in your process? Should we instead just pre-build an index of the bucket and use agent_query at runtime?

@mskarlin
Collaborator

mskarlin commented Jan 2, 2025

Hey @ThomasRochefortB -- you're right that embedding papers can cause extra latency, but we only incur that cost once, then use a cache for subsequent uses. In your case, you could start by making a full-text index on the files in the bucket, or, if you add a Qdrant search tool, you could initially use that. Then, after querying, you'll have a subset of candidate documents (around 10 or so). You can run docs.aadd(doc_path) on any that are new, then save the resulting embeddings and texts in a database. The next time those documents come back from your full-text or Qdrant search, you can pull the texts/embeddings from your DB and add them to the Docs object with docs.aadd_texts, which should only take a few milliseconds for typical document/embedding sizes.

Then you'd use agent_query at runtime, just like you suggested.
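The embed-once-then-cache flow described above can be sketched like this. Everything here is a stand-in for illustration: the cache would be a real database keyed by dockey, and `fake_embed` would be a real embedding call (the expensive step you only pay once per document).

```python
# Sketch of the caching flow: embed a document's text the first time it is
# seen, store the result, and serve all later requests from the cache.
cache: dict[str, list[float]] = {}  # stand-in for a persistent DB

def fake_embed(text: str) -> list[float]:
    # Placeholder for a real (slow, paid-once) embedding call.
    return [float(len(text))]

def get_embedding(dockey: str, text: str) -> list[float]:
    if dockey not in cache:            # first encounter: pay the embedding cost
        cache[dockey] = fake_embed(text)
    return cache[dockey]               # subsequent calls: fast, from cache

get_embedding("paper1", "some chunk")
print("paper1" in cache)  # True
```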

@ThomasRochefortB
Contributor Author

@mskarlin Thanks for the insightful discussion!
I suggest closing this PR, then, since I misunderstood the goal of the Docs() object.
We'll keep the implementation in our forked branch, since it is what we will be using internally going forward.
Feel free to reach out again if you would be interested in it in the future.

Thank you to you and the rest of the Futurehouse team for your great work! 🔥

@jamesbraza
Collaborator

Thanks for this Thomas, keep it up!

@jamesbraza jamesbraza closed this Jan 3, 2025