fix: add recursively merge of documents by their parents if they meet the threshold #162
Pull Request Test Coverage Report for Build 12654383306 (Coveralls)
I have one change request about try_merge_level. Other than that, the PR looks good to me.
There are cases where try_merge_level checks over and over whether the same child_docs can be merged.
Imagine we start try_merge_level with 4 documents: 2 from a parent A and 2 from a parent B. Say that when we calculate the merge scores, the documents from parent A get merged and the documents from parent B don't.
Because some merges were made, we're not done yet, so we go into the recursion and check the documents from B again. Let's prevent this from happening. I suggest we change two lines for this, but I haven't tried it. Please check, or maybe you have another idea.
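The repeated check described above can be made visible with a small counter. This is only a sketch on a hypothetical toy hierarchy of string ids, not the code in auto_merging_retriever.py: the PARENT_OF table, the recheck_unmerged flag, the `>=` comparison, and the 0.6 threshold are all assumptions chosen for illustration. It also assumes every input document sits at the same level, so an unmerged document can never be a sibling of a newly merged parent.

```python
from collections import Counter, defaultdict

# Hypothetical toy hierarchy (not Haystack objects): child id -> parent id.
PARENT_OF = {
    "a1": "A", "a2": "A",
    "b1": "B", "b2": "B", "b3": "B", "b4": "B",
    "A": "root", "B": "root",
}
CHILDREN_OF = defaultdict(list)
for child, parent in PARENT_OF.items():
    CHILDREN_OF[parent].append(child)


def try_merge_level(docs, threshold, checks, recheck_unmerged):
    """Replace groups of siblings by their parent when enough siblings are present."""
    groups = defaultdict(list)
    parentless = []
    for doc in docs:
        if doc in PARENT_OF:
            groups[PARENT_OF[doc]].append(doc)
        else:
            parentless.append(doc)  # the root has nothing to merge into

    merged, unmerged = [], list(parentless)
    for parent, children in groups.items():
        checks[parent] += 1  # record every time this sibling group is scored
        score = len(children) / len(CHILDREN_OF[parent])
        if score >= threshold:  # assumption: "meets the threshold" means >=
            merged.append(parent)
        else:
            unmerged.extend(children)

    if not merged:
        return unmerged  # no merge happened, so the recursion stops
    if recheck_unmerged:
        # Behaviour described in the review: everything goes into the next round.
        return try_merge_level(merged + unmerged, threshold, checks, True)
    # Possible alternative: only newly merged parents are considered again.
    return try_merge_level(merged, threshold, checks, False) + unmerged


retrieved = ["a1", "a2", "b1", "b2"]
naive_checks, lean_checks = Counter(), Counter()
naive = try_merge_level(retrieved, 0.6, naive_checks, recheck_unmerged=True)
lean = try_merge_level(retrieved, 0.6, lean_checks, recheck_unmerged=False)
print(sorted(naive), naive_checks["B"])  # ['A', 'b1', 'b2'] 2
print(sorted(lean), lean_checks["B"])    # ['A', 'b1', 'b2'] 1
```

Both variants return the same documents here, but the second one scores the group under B only once. Whether that holds for inputs that mix levels would need a dedicated test.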
haystack_experimental/components/retrievers/auto_merging_retriever.py
The class docstring also needs to be updated.
from haystack import Document
from haystack_experimental.components.splitters import HierarchicalDocumentSplitter
from haystack_experimental.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
# create a hierarchical document structure with 2 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes=[10, 3], split_overlap=0, split_by="word")
docs = builder.run([original_document])["documents"]
>>> len([d for d in docs if d.meta["__level"]==0])
1
>>> len([d for d in docs if d.meta["__level"]==1])
2
>>> len([d for d in docs if d.meta["__level"]==2])
7
This contradicts the comment above about 2 levels and 3 children.
I wrote it without counting the tree's root, the original document, which is always present, but we should count it as a level.
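The counts 1, 2 and 7 follow from plain word arithmetic once the root is counted as level 0. The sketch below is a simplification that assumes whitespace word splitting with no overlap; count_per_level and split_words are hypothetical helpers, not part of HierarchicalDocumentSplitter.

```python
def split_words(words, block_size):
    """Cut a list of words into consecutive blocks of at most block_size words."""
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]


def count_per_level(text, block_sizes):
    """Return the number of blocks at each level, with the whole text as level 0."""
    levels = [[text.split()]]  # level 0: the original document (the root)
    for size in block_sizes:
        next_level = []
        for block in levels[-1]:
            # every block of the previous level is split again with the smaller size
            next_level.extend(split_words(block, size))
        levels.append(next_level)
    return [len(level) for level in levels]


text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
counts = count_per_level(text, [10, 3])
print(counts)  # [1, 2, 7]
```

The text has 19 words, so the 10-word level yields blocks of 10 and 9 words, and the 3-word level yields 4 + 3 = 7 blocks. With the root included, that is a 3-level hierarchy in which no parent has exactly 3 children at level 1.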
I found one more line that, in my opinion, needs to be changed. Since all tests already pass, I suggest adding a test that covers that change too.
Thanks for the quick call! Looks good to me now!
Related Issues
Proposed Changes:
Updated the run() method to use a recursive merging process. It recursively groups documents by their parents and merges them if they meet the threshold, continuing up the hierarchy until no more merges are possible. Previously, this was hard-coded to a 2-level hierarchy.
The function that checks the input was not being called.
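The recursive process described above can be sketched without any Haystack dependency. Everything here is an assumption made for illustration: the Node dataclass, its parent_id and children_ids fields, the in-memory store dictionary, and the `>=` comparison against the threshold are stand-ins, not the retriever's actual data model or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a hierarchical document; not a Haystack class."""
    id: str
    parent_id: Optional[str] = None
    children_ids: Tuple[str, ...] = ()


def build_store() -> Dict[str, Node]:
    """A 3-level toy hierarchy: root -> sections s1, s2 -> leaves l1..l5."""
    nodes = [
        Node("root", None, ("s1", "s2")),
        Node("s1", "root", ("l1", "l2")),
        Node("s2", "root", ("l3", "l4", "l5")),
        Node("l1", "s1"), Node("l2", "s1"),
        Node("l3", "s2"), Node("l4", "s2"), Node("l5", "s2"),
    ]
    return {node.id: node for node in nodes}


def merge_recursively(docs: List[Node], store: Dict[str, Node], threshold: float) -> List[Node]:
    """Group docs by parent, merge groups that meet the threshold, repeat until stable."""
    groups = defaultdict(list)
    next_docs = []
    for doc in docs:
        if doc.parent_id is None:
            next_docs.append(doc)  # the root cannot be merged any further
        else:
            groups[doc.parent_id].append(doc)

    merged_any = False
    for parent_id, children in groups.items():
        parent = store[parent_id]
        if len(children) / len(parent.children_ids) >= threshold:
            next_docs.append(parent)  # enough children present: return the parent
            merged_any = True
        else:
            next_docs.extend(children)  # too few children: keep them as they are

    if not merged_any:
        return next_docs  # no more merges are possible
    return merge_recursively(next_docs, store, threshold)


store = build_store()
full = merge_recursively([store[i] for i in ("l1", "l2", "l3", "l4")], store, 0.6)
partial = merge_recursively([store[i] for i in ("l1", "l2", "l3")], store, 0.6)
print([d.id for d in full])     # ['root']
print([d.id for d in partial])  # ['s1', 'l3']
```

In the first call both sections meet the threshold and then merge again into the root, which a merge hard-coded to two levels could not do. In the second call only s1 is formed, and the merge stops one level below the root.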