Information issue: example for using meta_conduct from Python code #395

christian-monch opened this issue Sep 20, 2023 · 4 comments

@christian-monch

This is an example of how to use meta_conduct from Python code to extract metadata from all files of a datalad dataset.

In this example, we assume that we want to execute three extractors on every file and that the extraction results should be written to stdout.

The example uses meta-conduct to perform the extraction. It defines a pipeline that traverses all files of a datalad dataset and feeds each file into a processing pipeline of three extractors. Each "extractor" is an instance of the generic class datalad_metalad.pipeline.processor.extract.MetadataExtractor. The instances are configured during instantiation through the parameters extractor_name and extractor_type; the values for these parameters are provided as arguments to meta_conduct.

This is the example code. The dataset directory is given in the global variable dataset_dir, and the names of the three extractors are given in the global variables X, Y, and Z. The code executes all three extractors as file-level extractors and assumes that all provided extractors are either file-level extractors or legacy extractors:

from datalad.api import meta_conduct

# The dataset on which the extractors should run
dataset_dir = "<path to your datalad-dataset>"

# The extractors that should be executed
X = "metalad_core"
Y = "datalad_core"
Z = "metalad_example_file"


# The pipeline definition: a provider that traverses the dataset and
# emits one item per file, followed by three extractor processors
pipeline = {
    'provider': {
        'module': 'datalad_metalad.pipeline.provider.datasettraverse',
        'class': 'DatasetTraverser',
        'name': 'trav',
        'arguments': {}
    },
    'processors': [
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext1',
            'arguments': {}
        },
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext2',
            'arguments': {}
        },
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext3',
            'arguments': {}
        }
    ]
}

# The same pipeline, extended by a consumer that adds the resulting
# metadata records to a dataset (see the discussion below)
adding_pipeline = {
    **pipeline,
    'consumer': {
        'module': 'datalad_metalad.pipeline.consumer.add_stdin',
        'class': 'StdinAdder',
        'name': 'adder',
        'arguments': {}
    }
}


# Per-element configuration, passed to meta_conduct as "arguments":
# where to traverse, what to emit, and which extractor each
# processor instance should run
arguments = [
    f'trav.top_level_dir={dataset_dir}', 'trav.item_type=file',
    f'ext1.extractor_name={X}', 'ext1.extractor_type=file',
    f'ext2.extractor_name={Y}', 'ext2.extractor_type=file',
    f'ext3.extractor_name={Z}', 'ext3.extractor_type=file',
]

for result in meta_conduct(configuration=pipeline,
                           arguments=arguments,
                           result_renderer='disabled'):
    print(result)

The code above will print the resulting metadata records to stdout. They can then be processed further and, for example, added to the git-repo of the dataset via meta-add. To this end, it is sufficient to pipe the results into a call like the following, which instructs meta-add to read content from stdin and to expect the content in JSON-lines format, i.e. one metadata record per input line:

datalad meta-add --json-lines -
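
If you would rather do the piping from Python, a minimal sketch follows, continuing from the code above. Note two assumptions that may need adjusting for your datalad-metalad version: that each result carries its metadata record under the key metadata_record, and that the record is directly JSON-serializable:

import json
import subprocess

# Sketch: feed extraction results into "datalad meta-add".
# ASSUMPTION: the metadata record sits under the key 'metadata_record'
# in each result; adjust the key if your results are structured
# differently. Run this from within the dataset, or extend the
# command with '-d', dataset_dir.
adder = subprocess.Popen(
    ['datalad', 'meta-add', '--json-lines', '-'],
    stdin=subprocess.PIPE,
    text=True,
)
for result in meta_conduct(configuration=pipeline,
                           arguments=arguments,
                           result_renderer='disabled'):
    record = result.get('metadata_record', result)
    adder.stdin.write(json.dumps(record, default=str) + '\n')
adder.stdin.close()
adder.wait()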

Adding to the git-repo can also be achieved by adding a consumer component to the pipeline definition. In the code above, the variable adding_pipeline contains such a consumer. Running the pipeline adding_pipeline is equivalent to running datalad meta-conduct and piping the result into datalad meta-add as described above. (If you want to try it, you might have to specify the parameter adder.dataset=<path to your datalad-dataset> to tell the consumer in which git-repo the metadata records should be stored.)
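
Concretely, running the consumer variant from Python could look like the following sketch, which continues from the code above and adds the adder.dataset parameter mentioned in the note:

# Run the variant of the pipeline that contains the "adder" consumer;
# the consumer writes the metadata records into the given dataset
adder_arguments = arguments + [f'adder.dataset={dataset_dir}']

for result in meta_conduct(configuration=adding_pipeline,
                           arguments=adder_arguments,
                           result_renderer='disabled'):
    print(result)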

@christian-monch

@yarikoptic @candleindark This is a write-up of Matrix chat messages regarding the extraction of metadata from dataset files.

@yarikoptic

Thank you @christian-monch. An earlier dump I did is available in the now-closed #394.

@yarikoptic

Should we keep file-level extraction separate, or could/should we add dataset-level extraction here too?

@christian-monch

We can combine file- and dataset-level extraction. To do this, the traverser has to be instructed to emit both file- and dataset-items. This is done with trav.item_type=both.

In addition, the extractor processors should be configured with dataset-level extractors (or legacy extractors) and with extractor type dataset, e.g. ext3.extractor_type=dataset, ext3.extractor_name=metalad_example_dataset. The respective extractor will then only be executed on dataset-items.
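
A minimal sketch of the combined setup, continuing from the example in the issue description (metalad_example_dataset serves as the dataset-level extractor here, as in the example above):

# Combined file- and dataset-level extraction: the traverser emits
# both item types; ext1 and ext2 handle files, ext3 handles datasets
combined_arguments = [
    f'trav.top_level_dir={dataset_dir}', 'trav.item_type=both',
    f'ext1.extractor_name={X}', 'ext1.extractor_type=file',
    f'ext2.extractor_name={Y}', 'ext2.extractor_type=file',
    'ext3.extractor_name=metalad_example_dataset',
    'ext3.extractor_type=dataset',
]

for result in meta_conduct(configuration=pipeline,
                           arguments=combined_arguments,
                           result_renderer='disabled'):
    print(result)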
