Information issue: example for using meta_conduct from Python code #395

christian-monch opened this issue Sep 20, 2023 · 4 comments

@christian-monch

This is an example of how to use meta_conduct from Python code to extract metadata from all files of a datalad dataset.

In this example, we assume that we want to execute three extractors on every file and that the extraction results should be written to stdout.

The example uses meta-conduct to perform the extraction. It defines a pipeline that traverses all files of a datalad dataset and feeds each file into a processing pipeline of three extractors. Each "extractor" is an instance of the generic class datalad_metalad.pipeline.processor.extract.MetadataExtractor. The instances are configured during instantiation through the parameters extractor_name and extractor_type; the values for these parameters are provided as arguments to meta_conduct.

This is the example code. The dataset directory is given in the global variable dataset_dir, and the names of the three extractors are given in the global variables X, Y, and Z. The code executes all three extractors as file-level extractors and assumes that all provided extractors are either file-level extractors or legacy extractors:

from datalad.api import meta_conduct

# The dataset on which the extractors should run
dataset_dir = "<path to your datalad-dataset>"

# The extractors that should be executed
X = "metalad_core"
Y = "datalad_core"
Z = "metalad_example_file"


# The pipeline definition: a provider that traverses the dataset and
# emits one item per file, followed by three extractor processors
pipeline = {
    'provider': {
        'module': 'datalad_metalad.pipeline.provider.datasettraverse',
        'class': 'DatasetTraverser',
        'name': 'trav',
        'arguments': {}
    },
    'processors': [
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext1',
            'arguments': {}
        },
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext2',
            'arguments': {}
        },
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext3',
            'arguments': {}
        }
    ]
}

# The same pipeline, extended by a consumer that adds the resulting
# metadata records to a dataset (see the discussion below)
adding_pipeline = {
    **pipeline,
    'consumer': {
        'module': 'datalad_metalad.pipeline.consumer.add_stdin',
        'class': 'StdinAdder',
        'name': 'adder',
        'arguments': {}
    }
}


# Per-element configuration, passed to meta_conduct as "arguments":
# where to traverse, what to emit, and which extractor each
# processor instance should run
arguments = [
    f'trav.top_level_dir={dataset_dir}', 'trav.item_type=file',
    f'ext1.extractor_name={X}', 'ext1.extractor_type=file',
    f'ext2.extractor_name={Y}', 'ext2.extractor_type=file',
    f'ext3.extractor_name={Z}', 'ext3.extractor_type=file',
]

for result in meta_conduct(configuration=pipeline,
                           arguments=arguments,
                           result_renderer='disabled'):
    print(result)

The code above will print the resulting metadata records to stdout. They can then be processed further and, for example, added to the git-repo of the dataset via meta-add. To this end, it is sufficient to pipe the results into a call like the following, which instructs meta-add to read content from stdin and to expect the content in JSON-lines format, i.e. one metadata record per input line:

datalad meta-add --json-lines -
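
If you would rather do the piping from Python, a minimal sketch follows, continuing from the code above. Note two assumptions that may need adjusting for your datalad-metalad version: that each result carries its metadata record under the key metadata_record, and that the record is directly JSON-serializable:

import json
import subprocess

# Sketch: feed extraction results into "datalad meta-add".
# ASSUMPTION: the metadata record sits under the key 'metadata_record'
# in each result; adjust the key if your results are structured
# differently. Run this from within the dataset, or extend the
# command with '-d', dataset_dir.
adder = subprocess.Popen(
    ['datalad', 'meta-add', '--json-lines', '-'],
    stdin=subprocess.PIPE,
    text=True,
)
for result in meta_conduct(configuration=pipeline,
                           arguments=arguments,
                           result_renderer='disabled'):
    record = result.get('metadata_record', result)
    adder.stdin.write(json.dumps(record, default=str) + '\n')
adder.stdin.close()
adder.wait()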

Adding to the git-repo can also be achieved by adding a consumer component to the pipeline definition. In the code above, the variable adding_pipeline contains such a consumer. Running the pipeline adding_pipeline is equivalent to running datalad meta-conduct and piping the result into datalad meta-add as described above. (If you want to try it, you might have to specify the parameter adder.dataset=<path to your datalad-dataset> to tell the consumer in which git-repo the metadata records should be stored.)
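
Concretely, running the consumer variant from Python could look like the following sketch, which continues from the code above and adds the adder.dataset parameter mentioned in the note:

# Run the variant of the pipeline that contains the "adder" consumer;
# the consumer writes the metadata records into the given dataset
adder_arguments = arguments + [f'adder.dataset={dataset_dir}']

for result in meta_conduct(configuration=adding_pipeline,
                           arguments=adder_arguments,
                           result_renderer='disabled'):
    print(result)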

@christian-monch

@yarikoptic @candleindark This is a write-up of Matrix chat messages regarding the extraction of metadata from dataset files.

@yarikoptic

Thank you @christian-monch. An earlier dump I did is available in the now-closed #394.

@yarikoptic

Should we keep file-level extraction separate, or could/should we add dataset-level extraction here too?

@christian-monch

We can combine file- and dataset-level extraction. To do this, the traverser has to be instructed to emit both file- and dataset-items. This is done with trav.item_type=both.

In addition, the extractor processors should be configured with dataset-level extractors (or legacy extractors) and with extractor type dataset, e.g. ext3.extractor_type=dataset, ext3.extractor_name=metalad_example_dataset. The respective extractor will then only be executed on dataset-items.
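
A minimal sketch of the combined setup, continuing from the example in the issue description (metalad_example_dataset serves as the dataset-level extractor here, as in the example above):

# Combined file- and dataset-level extraction: the traverser emits
# both item types; ext1 and ext2 handle files, ext3 handles datasets
combined_arguments = [
    f'trav.top_level_dir={dataset_dir}', 'trav.item_type=both',
    f'ext1.extractor_name={X}', 'ext1.extractor_type=file',
    f'ext2.extractor_name={Y}', 'ext2.extractor_type=file',
    'ext3.extractor_name=metalad_example_dataset',
    'ext3.extractor_type=dataset',
]

for result in meta_conduct(configuration=pipeline,
                           arguments=combined_arguments,
                           result_renderer='disabled'):
    print(result)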
