Provide a default implementation for locating metadata source files for extraction #379

jsheunis · 2023-06-19T08:39:45Z

I'm thinking about streamlining and deduplicating code for metadata extractors in the context of psychoinformatics-de/datalad-tabby#2.

The goal is to provide a mechanism through which agents can (optionally) supply arguments to meta-extract, giving flexibility in locating metadata source files. Currently, this is done in several ways:

Most extractors have a hard-coded location for the metadata_source file, e.g. the bids_dataset extractor in datalad-neuroimaging will always look for the ./dataset_description.json file relative to the root of the dataset.
The updated genericjson_dataset extractor in datalad-metalad uses a combination of a default location (.metadata/dataset.json) and location(s) provided as extraction arguments during the meta-extract call.
Another option would be to specify the default location via a dataset configuration, and allow users to set this themselves.

I think it makes sense to provide a standard implementation for this process within metalad, so that there doesn't have to be any code duplication in extractor code. My suggestion is:

use a dataset configuration to provide the default location,
tell users to overwrite the configuration, if they prefer, before running extraction, OR
tell users to pass the metadata_source extraction argument to the meta-extract call (multiple locations = serialized list)

Any extractor would then give priority to the extractor arguments, and fall back to the dataset configuration if none are provided.
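To make the proposed priority order concrete, here is a minimal sketch of how such a shared helper might look. It is only an illustration: the configuration key name, the function name, and the use of a plain dict in place of the real dataset configuration are all assumptions, and "serialized list" is assumed to mean a JSON-encoded list.

```python
import json
from pathlib import Path

# Hypothetical configuration key; not an existing metalad setting.
CONFIG_KEY = "datalad.metalad.metadata-source"


def resolve_metadata_sources(dataset_root, extractor_args, config):
    """Return metadata source paths, preferring extraction arguments.

    extractor_args: dict of extraction arguments (may be empty).
    config: plain mapping standing in for the dataset configuration.
    """
    # Extraction arguments take priority over the configured default.
    raw = extractor_args.get("metadata_source")
    if raw is None:
        raw = config.get(CONFIG_KEY)
    if raw is None:
        raise ValueError("no metadata source given and no default configured")

    # Assumption: multiple locations arrive as a JSON-serialized list,
    # a single location as a plain string.
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = raw
    locations = parsed if isinstance(parsed, list) else [raw]
    return [Path(dataset_root) / str(loc) for loc in locations]
```

With this, an extractor would call the helper once and not need its own lookup logic; a user who never passes metadata_source simply gets whatever the dataset configuration says.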


christian-monch · 2023-07-03T07:50:04Z

metalad supports an unlimited number of "runtime"-arguments that can be provided via CLI or API to the extractors. For example:

> datalad meta-extract my_awesome_extractor extractor_arg_1 extractor_arg_2 ... extractor_arg_n

The number and nature of the additional arguments are not defined by meta-extract; the arguments (in the example above, extractor_arg_1, ..., extractor_arg_n) are just forwarded to the extractors. It makes sense to handle identical or similar extractor argument definitions in a single place. One possible solution would be a common base class that reads and validates the arguments; another would be a set of functions that read and validate the arguments and provide a structured representation, e.g. a dataclass with the argument content.
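The second option (functions returning a structured representation) could look roughly like the sketch below. The class and function names are hypothetical, and the flat key/value layout of the forwarded arguments is an assumption made for illustration, not a statement about how metalad delivers them.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass(frozen=True)
class SourceArguments:
    """Structured view of the arguments forwarded to an extractor."""
    metadata_source: List[str] = field(default_factory=list)


def parse_extractor_args(args: Sequence[str]) -> SourceArguments:
    """Validate a flat [key, value, key, value, ...] list (assumed layout)."""
    if len(args) % 2 != 0:
        raise ValueError("extractor arguments must be key/value pairs")
    sources = []
    # Pair up keys (even positions) with values (odd positions).
    for key, value in zip(args[0::2], args[1::2]):
        if key != "metadata_source":
            raise ValueError(f"unknown extractor argument: {key}")
        sources.append(value)
    return SourceArguments(metadata_source=sources)
```

Keeping validation in one function like this means every extractor rejects malformed arguments in the same way, which is the deduplication the issue is after.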

(The topic is also mentioned in psychoinformatics-de/datalad-tabby#2)

yarikoptic · 2023-09-14T16:16:42Z

I think this relates to my "desire" to just be able to extract all metadata an extractor can extract across the files in the dataset. E.g. for datalad-catalog we need metalad_core extractor to extract metadata for all the files in the dataset. I would expect some easy way to do it besides getting a full list of files in the dataset and feeding it to the extractor (IIRC needs to be done in the loop with some smart ways to pre-do something or otherwise would be slow).

But overall it is likely the "property" of an extractor to know which files it could extract metadata about, instead of me feeding each extractor with all the paths even if they are not appropriate for it.
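One way to picture an extractor "knowing" which files it applies to is a small base class that filters candidate paths by pattern. This is purely a sketch: the class names and the patterns attribute are hypothetical and not part of metalad, and glob-style matching is just one possible selection rule.

```python
from fnmatch import fnmatch
from typing import Iterable, List, Tuple


class PatternAwareExtractor:
    """Hypothetical base class, not part of metalad."""

    # Glob patterns for the files this extractor can handle.
    patterns: Tuple[str, ...] = ()

    def applicable_paths(self, paths: Iterable[str]) -> List[str]:
        """Keep only the paths matching at least one declared pattern."""
        return [
            p for p in paths
            if any(fnmatch(p, pattern) for pattern in self.patterns)
        ]


class JsonSidecarExtractor(PatternAwareExtractor):
    """Illustrative extractor that only cares about JSON files."""
    patterns = ("*.json",)
```

A driver could then hand the full file list of a dataset to each extractor and let `applicable_paths` decide, instead of the caller pre-selecting paths per extractor.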

jsheunis mentioned this issue Jun 19, 2023

Implement metadata extractor psychoinformatics-de/datalad-tabby#2

Closed

yarikoptic mentioned this issue Sep 14, 2023

Add file level metadata extraction datalad/datalad-registry#244

Open
