Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Metadata extractor update #147

Merged
merged 18 commits into from
Dec 10, 2024
Merged

feat: Metadata extractor update #147

merged 18 commits into from
Dec 10, 2024

Conversation

sjrl
Copy link
Collaborator

@sjrl sjrl commented Dec 5, 2024

Related Issues

Updates to the LLM Metadata Extractor.

Proposed Changes:

Output Changes:

  • Returns documents which contains all documents that succeeded in metadata extraction and failed_documents which contains all documents that failed (either because the LLM could not be executed or a JSON parsing error). By having this separation it's easy to use the failed_documents directly in another LLM metadata extractor to try and fix the issue especially if the error is related to JSON parsing.

Prompt Changes

  • Removes the prompt_variable
  • Instead we require the prompt variable to have to be document. This way we can utilize all the info contained within a document as well. E.g. I can use document.content to print out the contents and then also use document.meta.XX to print out any relevant meta information as well (e.g. file name) into the prompt instructions.

Se/de

  • Identified some bugs in the from_dict method. Basically, we don't have a way where we can automatically call the from_dict method of all the different LLM Providers. So instead we need to create a custom from_dict method that can optionally handle all the different edge cases.

Other

  • Made expected_keys optional since it's not always alarming if not all expected keys are filled. Sometimes documents won't contain relevant info to extract so instead it can be passed optionally. If passed now it will only print a warning message if there are missing (or additionally unexpected keys).
  • Used Thread Pool Executor to add "parallelism" to the LLM calls. This should greatly speed up the component since it will run the LLM calls in parallel.

How did you test it?

Added a lot of new tests

Notes for the reviewer

Checklist

@coveralls
Copy link

coveralls commented Dec 9, 2024

Pull Request Test Coverage Report for Build 12256963475

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 46 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.08%) to 83.211%

Files with Coverage Reduction New Missed Lines %
components/extractors/llm_metadata_extractor.py 46 70.7%
Totals Coverage Status
Change from base Build 12246737953: 0.08%
Covered Lines: 2037
Relevant Lines: 2448

💛 - Coveralls

@sjrl sjrl marked this pull request as ready for review December 9, 2024 15:02
@sjrl sjrl requested a review from a team as a code owner December 9, 2024 15:02
@sjrl sjrl requested review from anakin87 and davidsbatista and removed request for a team and anakin87 December 9, 2024 15:02
Copy link
Contributor

@davidsbatista davidsbatista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good - just left a few minor comments/checks

Copy link
Contributor

@davidsbatista davidsbatista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sjrl sjrl merged commit c50a835 into main Dec 10, 2024
8 checks passed
@sjrl sjrl deleted the metadata-extractor-update branch December 10, 2024 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants