Information issue: example for using `meta_conduct` from Python code #395
Comments
@yarikoptic @candleindark This is a write-up of matrix chat messages regarding the extraction of metadata from dataset files.
Thank you @christian-monch. Another, earlier dump I did is available in the now-closed #394.
should we keep …
We can combine … In addition, the extractor-processors should be configured with dataset-level extractors (or legacy extractors) and with extractor-type "dataset", e.g. as in the sketch below.
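(The concrete example after the "e.g." did not survive extraction. Roughly, and using the pipeline-element names from the write-up below, the conduct arguments would change along these lines; the extractor name is purely illustrative:)

```python
# Dataset-level variant: traverse datasets instead of files and run a
# dataset-level (or legacy) extractor. The extractor name below is a
# placeholder, not a real metalad extractor.
arguments = [
    "traverser.top_level_dir=/path/to/datalad-dataset",
    "traverser.item_type=dataset",
    "extractor1.extractor_name=<a dataset-level extractor name>",
    "extractor1.extractor_type=dataset",
]
```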
This is an example of how to use `meta_conduct` from Python code to extract metadata from all files of a datalad dataset. In this example, we assume that we want to execute three extractors on every file. Also, the results of the extraction should be written to `stdout`.

The example uses `meta-conduct` to perform the extraction. It defines a pipeline that traverses over all files of a datalad dataset and feeds each file into a processing pipeline of three extractors. Each "extractor" is an instance of the generic class `datalad_metalad.pipeline.processor.MetadataExtractor`. The instances are configured during instantiation through the parameters `extractor_name` and `extractor_type`. The values for those parameters are provided as arguments to `meta_conduct`.

This is the example code. The dataset directory is given in the global variable `dataset_dir`, and the names of the three extractors are given in the global variables `X`, `Y`, and `Z` (the code executes all three extractors as file-level extractors and assumes that all provided extractors are either file-level extractors or legacy extractors):
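(The original code block was lost in rendering; the following is a minimal reconstruction sketch. The module paths for the traverser and extractor elements, the argument names `top_level_dir`, `item_type`, `extractor_name`, and `extractor_type`, and passing the pipeline definition as an in-memory dict are assumptions based on the surrounding text; verify them against your installed datalad-metalad. If your version only accepts a file path as `configuration`, dump the definition to a JSON file first.)

```python
import json

import datalad.api as dl

# Globals referenced in the text: the dataset to traverse and the names
# of the three extractors. All values here are placeholders.
dataset_dir = "/path/to/datalad-dataset"
X = "<first-extractor-name>"
Y = "<second-extractor-name>"
Z = "<third-extractor-name>"

# Pipeline definition: a provider that traverses all files of the dataset
# and three processors, each an instance of the generic MetadataExtractor.
# (Module paths are assumptions; check them against your metalad version.)
extraction_pipeline = {
    "provider": {
        "module": "datalad_metalad.pipeline.provider.datasettraverse",
        "class": "DatasetTraverser",
        "name": "traverser",
        "arguments": {},
    },
    "processors": [
        {
            "module": "datalad_metalad.pipeline.processor.extract",
            "class": "MetadataExtractor",
            "name": f"extractor{i}",
            "arguments": {},
        }
        for i in (1, 2, 3)
    ],
}

# Per-element parameters are handed to meta_conduct as
# "<element name>.<parameter>=<value>" arguments.
results = dl.meta_conduct(
    configuration=extraction_pipeline,
    arguments=[
        f"traverser.top_level_dir={dataset_dir}",
        "traverser.item_type=file",
        f"extractor1.extractor_name={X}",
        "extractor1.extractor_type=file",
        f"extractor2.extractor_name={Y}",
        "extractor2.extractor_type=file",
        f"extractor3.extractor_name={Z}",
        "extractor3.extractor_type=file",
    ],
    return_type="generator",
    result_renderer="disabled",
)

# Print one JSON record per line (JSON-lines), suitable for piping into
# `datalad meta-add`. The exact result-record layout may differ between
# metalad versions.
for result in results:
    print(json.dumps(result, default=str))
```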
The code above will print the resulting metadata records to `stdout`. They can then be processed further and, for example, be added to the git-repo of the dataset via `meta-add`. To this end, it would be sufficient to pipe the results into a call like the one below, which instructs `meta-add` to read content from `stdin` and to expect the content to be in JSON-lines format, i.e. to contain one metadata record per input line:
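(The original call was not preserved. Assuming the metalad CLI's `--json-lines` option and the `-` convention for reading from `stdin`, and with a hypothetical script name for the example above, it would look roughly like this; verify the flags against `datalad meta-add --help`:)

```sh
python extract_metadata.py \
    | datalad meta-add -d <path to your datalad-dataset> --json-lines -
```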
Adding to the git-repo can also be achieved by adding a consumer-component to the pipeline definition. In the code below, the variable `adding_pipeline` contains such a consumer. The result of running the pipeline `adding_pipeline` would be equivalent to running `datalad meta-conduct` and piping the result into `datalad meta-add` as described above. (If you want to try it, you might have to specify the parameter `adder.dataset=<path to your datalad-dataset>` to tell the consumer in which git-repo the metadata records should be stored.)
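(That code block also did not survive. Here is a sketch under the same assumptions as above; the consumer module and class names, `datalad_metalad.pipeline.consumer.add` / `BatchAdder`, and the singular `consumer` key are further assumptions to verify against your metalad version:)

```python
# Continuing the sketch above (reuses dl, dataset_dir, X, Y, Z, and
# extraction_pipeline): a shallow copy of the extraction pipeline with a
# consumer that feeds every metadata record into meta-add directly.
adding_pipeline = dict(
    extraction_pipeline,
    consumer={
        "module": "datalad_metalad.pipeline.consumer.add",  # assumed path
        "class": "BatchAdder",                              # assumed class
        "name": "adder",
        "arguments": {},
    },
)

# Equivalent to `datalad meta-conduct ... | datalad meta-add ...`:
# "adder.dataset" names the git-repo that receives the metadata records.
for _ in dl.meta_conduct(
        configuration=adding_pipeline,
        arguments=[
            f"traverser.top_level_dir={dataset_dir}",
            "traverser.item_type=file",
            f"extractor1.extractor_name={X}",
            "extractor1.extractor_type=file",
            f"extractor2.extractor_name={Y}",
            "extractor2.extractor_type=file",
            f"extractor3.extractor_name={Z}",
            "extractor3.extractor_type=file",
            f"adder.dataset={dataset_dir}",
        ],
        return_type="generator",
        result_renderer="disabled"):
    pass  # the consumer does the adding; nothing needs to be printed
```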