Info: performance numbers on smaug #371

christian-monch opened this issue Mar 28, 2023 · 2 comments

christian-monch commented Mar 28, 2023

This issue is informational only.

Using meta-conduct and a simple extraction pipeline on smaug, we extracted about 1,300,000 file-level metadata records with the following command:

openneuro> date; time datalad -f json meta-conduct ~/tmp/extract_test.json \
traverser.top_level_dir=$(pwd) \
traverser.item_type=both \
traverser.traverse_sub_datasets=True \
extractor.extractor_type=file \
extractor.extractor_name=metalad_core \
>/.../extract_test_1.jsonl 2>/.../extract_test_1.err
Tue 28 Mar 2023 01:08:38 AM EDT

real    464m43.061s
user    3171m24.090s
sys     2597m34.377s
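
To put these numbers in perspective, the following is a minimal sketch that converts the `time` output above into seconds and derives a throughput and a CPU-to-wall-clock ratio. The record count is the approximate figure quoted above, so the results are estimates; the helper `parse_time` is a hypothetical name introduced only for this illustration.

```python
import re


def parse_time(value: str) -> float:
    """Convert a bash `time` value such as '464m43.061s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Values copied from the run above.
real = parse_time("464m43.061s")
user = parse_time("3171m24.090s")
sys_time = parse_time("2597m34.377s")

# Approximate record count quoted in this issue, not an exact measurement.
records = 1_300_000

records_per_second = records / real
# Ratio of total CPU time to wall-clock time, i.e. the average number of
# CPU cores kept busy over the whole run.
cpu_to_wall = (user + sys_time) / real

print(f"wall clock:  {real:.0f} s")
print(f"throughput:  {records_per_second:.1f} records/s")
print(f"CPU/wall:    {cpu_to_wall:.1f}")
```

Under these assumptions the run averages roughly 47 records per second, and the sum of user and system time is about 12 times the wall-clock time.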

The extraction pipeline in file ~/tmp/extract_test.json has the following content:

{
  "provider": {
    "module": "datalad_metalad.pipeline.provider.datasettraverse",
    "class": "DatasetTraverser",
    "name": "traverser",
    "arguments": {}
  },
  "processors": [
    {
      "module": "datalad_metalad.pipeline.processor.extract",
      "class": "MetadataExtractor",
      "name": "extractor",
      "arguments": {}
    }
  ]
}
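
Both elements in this file have empty `arguments`; the values come from the `traverser.…=…` and `extractor.…=…` parameters on the command line. The sketch below illustrates that mapping, assuming each `<element-name>.<argument-name>=<value>` parameter is routed to the element with the matching `name`. The function `apply_overrides` is hypothetical and is not metalad's implementation; values are kept as strings here.

```python
import json
from copy import deepcopy

# Pipeline definition as shown above.
PIPELINE = json.loads("""
{
  "provider": {
    "module": "datalad_metalad.pipeline.provider.datasettraverse",
    "class": "DatasetTraverser",
    "name": "traverser",
    "arguments": {}
  },
  "processors": [
    {
      "module": "datalad_metalad.pipeline.processor.extract",
      "class": "MetadataExtractor",
      "name": "extractor",
      "arguments": {}
    }
  ]
}
""")


def apply_overrides(pipeline: dict, overrides: list[str]) -> dict:
    """Return a copy of `pipeline` with 'name.argument=value' items applied."""
    result = deepcopy(pipeline)
    # Index provider and processors by their "name" entry.
    elements = {
        element["name"]: element
        for element in [result["provider"], *result["processors"]]
    }
    for item in overrides:
        key, _, value = item.partition("=")
        element_name, _, argument = key.partition(".")
        if element_name not in elements or not argument:
            raise ValueError(f"cannot route parameter: {item!r}")
        elements[element_name]["arguments"][argument] = value
    return result


configured = apply_overrides(
    PIPELINE,
    [
        "traverser.item_type=both",
        "traverser.traverse_sub_datasets=True",
        "extractor.extractor_type=file",
        "extractor.extractor_name=metalad_core",
    ],
)
print(json.dumps(configured, indent=2))
```

Keeping the file free of arguments is what allows the same `extract_test.json` to be reused for the file-level and dataset-level runs in this issue.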
@christian-monch

Performance of bids_dataset extraction on openneuro:

openneuro> time datalad -f json meta-conduct ~/tmp/extract_test.json \
traverser.top_level_dir=$(pwd) \
traverser.item_type=dataset \
traverser.traverse_sub_datasets=True \
extractor.extractor_type=dataset \
extractor.extractor_name=bids_dataset \
>/.../extract_test_bids_dataset_1.jsonl 2>/.../extract_test_bids_dataset_1.err



real    25m23.549s
user    146m47.854s
sys     4m54.585s

christian-monch commented Mar 29, 2023

Extraction performance on haxby:

File-level extraction with metalad_core:

haxby> time datalad -f json meta-conduct ~/tmp/extract_test.json \
traverser.top_level_dir=$(pwd) \
traverser.item_type=file \
traverser.traverse_sub_datasets=True \
extractor.extractor_type=file \
extractor.extractor_name=metalad_core \
2>/.../test_haxby_1.err >/.../test_haxby_1.jsonl

real    10m49.770s
user    28m33.421s
sys     30m13.129s

Dataset-level extraction with metalad_core:

haxby> time datalad -f json meta-conduct ~/tmp/extract_test.json \
traverser.top_level_dir=$(pwd) \
traverser.item_type=dataset \
traverser.traverse_sub_datasets=True \
extractor.extractor_type=dataset \
extractor.extractor_name=metalad_core \
2>/.../test_haxby_dataset_1.err >/.../test_haxby_dataset_1.jsonl

real    0m4.564s
user    0m7.279s
sys     0m3.550s
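
The record count behind each timing can be checked from the redirected `.jsonl` file. The sketch below assumes that `datalad -f json` writes one JSON object per line and that every such line counts as one result; whether each result is a successful metadata record is not checked. The function name `count_records` and any path passed to it are hypothetical.

```python
import json


def count_records(path: str) -> tuple[int, int]:
    """Count lines that parse as JSON objects and lines that do not."""
    records = 0
    skipped = 0
    with open(path, encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                parsed = json.loads(line)
            except json.JSONDecodeError:
                skipped += 1
                continue
            if isinstance(parsed, dict):
                records += 1
            else:
                skipped += 1
    return records, skipped
```

Calling `count_records` on one of the output files returns the number of JSON results and the number of lines that could not be interpreted, which gives a quick sanity check before dividing by the wall-clock time.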
