Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add recursively merge of documents by their parents if they meet the threshold #162

Merged
merged 21 commits into from
Jan 7, 2025

Conversation

davidsbatista
Copy link
Contributor

@davidsbatista davidsbatista commented Jan 3, 2025

Related Issues

Proposed Changes:

  • Updated the run() method to have a recursive merging process. It recursively groups documents by their parents and merges them if they meet the threshold, continuing up the hierarchy until no more merges are possible. Previously, this was hard-coded to a 2-level hierarchy

  • The function that checks for the input was not being called

How did you test it?

  • ran unit tests locally and did manual verification
  • CI tests
  • added more tests, and bumped test coverage to 93%, previously it was below 80%

Checklist

@coveralls
Copy link

coveralls commented Jan 3, 2025

Pull Request Test Coverage Report for Build 12654383306

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.4%) to 83.545%

Totals Coverage Status
Change from base Build 12624157864: 0.4%
Covered Lines: 2041
Relevant Lines: 2443

💛 - Coveralls

@davidsbatista davidsbatista changed the title Auto merging retriever update fix: add recursively merge of documents by their parents if they meet the threshold Jan 3, 2025
@davidsbatista davidsbatista marked this pull request as ready for review January 3, 2025 16:18
@davidsbatista davidsbatista requested a review from a team as a code owner January 3, 2025 16:18
@davidsbatista davidsbatista requested review from julian-risch and removed request for a team January 3, 2025 16:18
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one change request about the try_merge_level. Other than that the PR looks good to me.
There are cases where we check multiple times whether we can merge the same child_docs over and over in try_merge_level.
Imagine we start try_merge_level with 4 documents.
2 documents from a parent A and two from a parent B. Let's say when we calculate merge scores, documents from parent A get merged. documents from parent B don't get merged.
Now, there were some merges made, so we're not done yet. We go into the recursion and check the documents for B again. Let's prevent this from happening. I suggest we change two lines for this but haven't tried it. Please check or maybe you have another idea.

@julian-risch
Copy link
Member

The class docstring also needs to be updated.
When I run the example code, i get three levels of documents now.

from haystack import Document
from haystack_experimental.components.splitters import HierarchicalDocumentSplitter
from haystack_experimental.components.retrievers.auto_merging_retriever import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# create a hierarchical document structure with 2 levels, where the parent document has 3 children
text = "The sun rose early in the morning. It cast a warm glow over the trees. Birds began to sing."
original_document = Document(content=text)
builder = HierarchicalDocumentSplitter(block_sizes=[10, 3], split_overlap=0, split_by="word")
docs = builder.run([original_document])["documents"]


>>> len([d for d in docs if d.meta["__level"]==0])
1
>>> len([d for d in docs if d.meta["__level"]==1])
2
>>> len([d for d in docs if d.meta["__level"]==2])
7

which contradicts the comment # create a hierarchical document structure with 2 levels, where the parent document has 3 children

@davidsbatista
Copy link
Contributor Author

I wrote it not counting the tree's root, the original document, which is always present - but we should count it as a level.

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one more line that needs to be changed in my opinion. As all tests pass already, I would suggest adding a test to cover that change too.

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick call! Looks good to me now!

@davidsbatista davidsbatista merged commit 66cfc02 into main Jan 7, 2025
8 checks passed
@davidsbatista davidsbatista deleted the auto-merging-retriever-update branch January 7, 2025 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AutoMergingRetriever support multiple levels of hierarchy
3 participants