feat: Metadata extractor update #147

sjrl · 2024-12-05T12:58:14Z

Related Issues

Updates to the LLM Metadata Extractor.

Proposed Changes:

Output Changes:

The component now returns two outputs: documents, which contains every document whose metadata extraction succeeded, and failed_documents, which contains every document whose extraction failed (either because the LLM call could not be executed or because the reply could not be parsed as JSON). This separation makes it easy to pass failed_documents directly to another LLM metadata extractor to try to fix the issue, especially when the error is related to JSON parsing.
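To make the split concrete, here is a minimal sketch of the routing logic. It is not the code from this PR: the Doc class, the split_results helper, and the metadata_extraction_error / metadata_extraction_response meta keys are hypothetical names chosen for illustration, and a reply of None is assumed to stand for an LLM call that could not be executed.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not the real Haystack class."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def split_results(docs: List[Doc], replies: List[Optional[str]]) -> Dict[str, List[Doc]]:
    """Route each document to 'documents' or 'failed_documents'.

    A reply of None models an LLM call that could not be executed.
    """
    succeeded: List[Doc] = []
    failed: List[Doc] = []
    for doc, reply in zip(docs, replies):
        if reply is None:
            doc.meta["metadata_extraction_error"] = "LLM call failed"
            failed.append(doc)
            continue
        try:
            extracted = json.loads(reply)
        except json.JSONDecodeError as exc:
            # Keep the raw reply so a second extractor can try to repair it.
            doc.meta["metadata_extraction_error"] = f"JSON parsing error: {exc}"
            doc.meta["metadata_extraction_response"] = reply
            failed.append(doc)
            continue
        if not isinstance(extracted, dict):
            # Valid JSON, but not an object we can merge into the metadata.
            doc.meta["metadata_extraction_error"] = "Reply is not a JSON object"
            doc.meta["metadata_extraction_response"] = reply
            failed.append(doc)
            continue
        doc.meta.update(extracted)
        succeeded.append(doc)
    return {"documents": succeeded, "failed_documents": failed}
```

Because the failed documents carry the raw reply in their meta, a follow-up extractor with a "repair this JSON" prompt could consume failed_documents as its input without any extra plumbing.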

Prompt Changes

Removes the prompt_variable parameter.
Instead, the prompt is required to use document as its variable. This way the prompt can use all the information contained in a document. E.g. document.content prints out the contents, and document.meta.XX prints any relevant meta information (e.g. the file name) into the prompt instructions.
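As an illustration of why a single document variable is convenient, the sketch below renders a prompt that reads both the content and a meta field. It deliberately uses Python's built-in str.format as a dependency-free stand-in for the component's real template engine, so the placeholder syntax differs from what the extractor accepts. The Doc class, the PROMPT text, and render_prompt are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not the real Haystack class."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


# str.format supports attribute access (.content) and key lookup ([file_name]),
# which is enough to mimic a template that receives one `document` variable.
PROMPT = (
    "Extract the named entities from the file {document.meta[file_name]}.\n"
    "Text: {document.content}\n"
    "Answer with a JSON object."
)


def render_prompt(template: str, document: Doc) -> str:
    """Fill the template using only the `document` variable."""
    return template.format(document=document)
```

With this shape, adding another meta field to the prompt only requires editing the template text; no extra input has to be wired into the component.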

Serialization/deserialization

Identified some bugs in the from_dict method. We don't have a way to automatically call the from_dict method of each of the different LLM providers, so instead we need a custom from_dict method that handles the provider-specific edge cases where needed.
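One way such a custom from_dict could be structured is a lookup table from a provider name to the generator class whose from_dict should be called. This is only a sketch under assumptions: the two generator classes, the provider strings, and the generator_api / generator_api_params keys are hypothetical and not taken from the PR diff.

```python
from typing import Any, Dict, Type


class FakeOpenAIGenerator:
    """Hypothetical provider class used only for this sketch."""

    def __init__(self, model: str = "default-model"):
        self.model = model

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "FakeOpenAIGenerator":
        return cls(**data.get("init_parameters", {}))


class FakeBedrockGenerator(FakeOpenAIGenerator):
    """Second hypothetical provider, to show the dispatch."""


# Hypothetical registry: provider name -> class that knows how to deserialize itself.
GENERATORS: Dict[str, Type[FakeOpenAIGenerator]] = {
    "openai": FakeOpenAIGenerator,
    "aws_bedrock": FakeBedrockGenerator,
}


def extractor_from_dict(data: Dict[str, Any]) -> Dict[str, Any]:
    """Rebuild the extractor state, delegating to the provider's own from_dict."""
    init = data["init_parameters"]
    provider = init["generator_api"]
    if provider not in GENERATORS:
        raise ValueError(f"Unsupported LLM provider: {provider!r}")
    generator = GENERATORS[provider].from_dict(
        {"init_parameters": init.get("generator_api_params", {})}
    )
    return {"prompt": init["prompt"], "generator": generator}
```

The explicit registry keeps provider-specific handling in one place, and an unknown provider fails loudly instead of producing a half-initialized component.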

Other

Made expected_keys optional, since it's not always alarming if not all expected keys are filled: sometimes documents won't contain the relevant info to extract. If expected_keys is passed, the component now only logs a warning message when keys are missing (or when additional, unexpected keys are present).
Used a ThreadPoolExecutor to run the LLM calls in parallel. This should greatly speed up the component, since the calls no longer have to wait for one another.
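The two changes above can be sketched roughly as follows. This is an illustration rather than the PR's implementation: run_llm_calls, check_keys, the call_llm callable, and the max_workers default are hypothetical. Threads are a reasonable fit here because the time is spent waiting on remote API calls rather than on local computation.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Optional, Set, Tuple

logger = logging.getLogger(__name__)


def run_llm_calls(
    prompts: List[str],
    call_llm: Callable[[str], str],
    max_workers: int = 3,
) -> List[Optional[str]]:
    """Run one LLM call per prompt in a thread pool.

    Results come back in the same order as `prompts`; a failed call yields None.
    """

    def safe_call(prompt: str) -> Optional[str]:
        try:
            return call_llm(prompt)
        except Exception as exc:  # any provider error marks this prompt as failed
            logger.warning("LLM call failed: %s", exc)
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order even if calls finish out of order.
        return list(executor.map(safe_call, prompts))


def check_keys(
    extracted: Dict[str, Any],
    expected_keys: Optional[List[str]] = None,
) -> Tuple[Set[str], Set[str]]:
    """Warn (but do not fail) on missing or unexpected keys; no-op if not configured."""
    if expected_keys is None:
        return set(), set()
    missing = set(expected_keys) - set(extracted)
    unexpected = set(extracted) - set(expected_keys)
    if missing or unexpected:
        logger.warning("Missing keys: %s; unexpected keys: %s", missing, unexpected)
    return missing, unexpected
```

Returning None for a failed call, instead of letting the exception propagate, means one bad request does not discard the results of the other calls in the batch.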

How did you test it?

Added a lot of new tests

Notes for the reviewer

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

coveralls · 2024-12-09T09:34:48Z

Pull Request Test Coverage Report for Build 12256963475

Details

0 of 0 changed or added relevant lines in 0 files are covered.
46 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.08%) to 83.211%

Files with Coverage Reduction	New Missed Lines	%
components/extractors/llm_metadata_extractor.py	46	70.7%

Totals
Change from base Build 12246737953:	0.08%
Covered Lines:	2037
Relevant Lines:	2448

💛 - Coveralls

davidsbatista

Looks very good - just left a few minor comments/checks

haystack_experimental/components/extractors/llm_metadata_extractor.py


davidsbatista

LGTM

sjrl added 8 commits December 5, 2024 13:57

Updates

0f72c08

Fix doc string

7ae8cb5

Remove commented out code

c08ccfb

Fix linting

fa995a6

Fix another lint

0ee221d

Fix typing

bfad2bf

Fix tests

a5bf68d

Update integration test

1e38bab

sjrl added 5 commits December 9, 2024 10:44

More tests

7cad295

Update tests

e819e7d

Try adding mocks

189f9f0

Adding more tests

37892e5

More tests

43d8086

sjrl marked this pull request as ready for review December 9, 2024 15:02

sjrl requested a review from a team as a code owner December 9, 2024 15:02

sjrl requested review from anakin87 and davidsbatista and removed request for a team and anakin87 December 9, 2024 15:02

Merge branch 'main' into metadata-extractor-update

559a436

davidsbatista reviewed Dec 10, 2024


sjrl and others added 4 commits December 10, 2024 14:12

Update haystack_experimental/components/extractors/llm_metadata_extra…

6a7eea4

…ctor.py Co-authored-by: David S. Batista <[email protected]>

Update haystack_experimental/components/extractors/llm_metadata_extra…

bc43c6a

…ctor.py Co-authored-by: David S. Batista <[email protected]>

Update haystack_experimental/components/extractors/llm_metadata_extra…

c34af8f

…ctor.py Co-authored-by: David S. Batista <[email protected]>

Fix linting

723899f

davidsbatista approved these changes Dec 10, 2024


sjrl merged commit c50a835 into main Dec 10, 2024
8 checks passed

sjrl deleted the metadata-extractor-update branch December 10, 2024 14:19

julian-risch mentioned this pull request Dec 10, 2024

Add notebook with example code for LLMMetadataExtractor #154

Open
