How to report multiple documents on extract #391

Open
mih opened this issue Jul 20, 2023 · 2 comments

@mih
Member

mih commented Jul 20, 2023

I am implementing a dataset-level metadata extractor, and I think I need to be able to report multiple, individual metadata records. In principle, one should be able to build these records so that they can be reported in a nested fashion (thereby reporting just a single object). However, in my case I have no control over the nature of these documents, and they might be linked (or not) in different ways.

What is a desirable approach here?

  • an arbitrary top-level key that maps onto an array?
  • a JSON-LD style @graph top-level key (as a realization of the above)?
  • something else?
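For illustration, the `@graph` option could look roughly like the sketch below. It uses only plain dicts and the standard library; the helper name `wrap_as_graph` and the two sample records are made up for this example and are not part of any extractor API.

```python
import json
from typing import Any, Dict, Iterable, List


def wrap_as_graph(records: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Collect independent records under a single JSON-LD style "@graph" key.

    Hypothetical helper: the records are not inspected or linked in any way.
    """
    return {"@graph": list(records)}


# Two unrelated, made-up records standing in for documents an extractor found.
records = [
    {"@id": "doc-1", "name": "first document"},
    {"@id": "doc-2", "name": "second document"},
]

# A single dict, which is the shape immediate_data is described as taking.
immediate_data = wrap_as_graph(records)
serialized = json.dumps(immediate_data)
```

The records stay flat and independent, so nothing has to be known about how (or whether) they reference each other.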

Related: We might be talking about a lot of data to return. If I see things correctly, I need to load many standalone records into memory and report them via immediate_data as a single dict, so that they can be written out as JSON (again). I have yet to understand why meta-extract turns a single return value of type ExtractorResult into a result record, rather than dealing with result records directly. Doing the latter would make the standard machinery of seamlessly switching between return values and generator yields applicable to metadata extractors too.
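To make the memory concern concrete, here is a minimal sketch of what a generator-based flow could look like in plain Python. It is not how meta-extract currently works; `iter_records` and `write_json_lines` are hypothetical names, and the JSON Lines output format is an assumption chosen only to keep the example short.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator, TextIO


def iter_records(count: int) -> Iterator[Dict[str, Any]]:
    """Yield made-up records one at a time instead of building a full list."""
    for index in range(count):
        yield {"@id": f"doc-{index}"}


def write_json_lines(records: Iterable[Dict[str, Any]], stream: TextIO) -> int:
    """Write each record as one JSON line; only one record is held at a time."""
    written = 0
    for record in records:
        stream.write(json.dumps(record) + "\n")
        written += 1
    return written


# An in-memory buffer stands in for a file or a result stream.
buffer = io.StringIO()
n_written = write_json_lines(iter_records(3), buffer)
```

The same consumer accepts a list or a generator, which is the kind of switching between return values and yields mentioned above.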

@christian-monch
Collaborator

With the realization that the general approach requires fixing, I would opt for the {"@graph": [<objects>]} approach as a quick fix.

@christian-monch
Collaborator

christian-monch commented Jul 21, 2023

@mih I didn't think about this yesterday afternoon, but another option would be to return a list containing the individual results in immediate_data.
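If both shapes end up in use, a consumer could normalize them along the lines of the sketch below. This is only an illustration with plain Python values; `records_from_immediate_data` is a hypothetical helper, and whether downstream tooling accepts a bare list in immediate_data is an assumption that would need checking.

```python
from typing import Any, Dict, List, Union

Record = Dict[str, Any]


def records_from_immediate_data(data: Union[Record, List[Record]]) -> List[Record]:
    """Return the individual records regardless of how they were packaged.

    Hypothetical helper handling three shapes:
    - a bare list of records,
    - a dict with a top-level "@graph" list,
    - a single record dict.
    """
    if isinstance(data, list):
        return data
    if isinstance(data.get("@graph"), list):
        return data["@graph"]
    return [data]


# Made-up sample records for illustration.
as_list = [{"@id": "doc-1"}, {"@id": "doc-2"}]
as_graph = {"@graph": [{"@id": "doc-1"}, {"@id": "doc-2"}]}
single = {"@id": "doc-1"}
```

With such a helper, the list and `@graph` variants carry the same information, so the choice between them mostly comes down to what the serialization step expects.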
