Change dataset download scripts to use Cloudflare buckets directly #712

Merged
merged 1 commit into from
Mar 12, 2024

Conversation

morphine00
Contributor

No description provided.

@morphine00 morphine00 requested a review from a team as a code owner February 17, 2024 20:47

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@morphine00
Contributor Author

Additional comment: we noticed that the two scripts that download a sequence of TAR files also download a checksum file and verify it, but that verification produces a lot of errors because it tries to check many associated files that weren't downloaded (e.g., JSON metadata). Was this the intended behavior?

@ahmadki
Contributor

ahmadki commented Feb 21, 2024

> Additional comment: we noticed that the two scripts that download a sequence of TAR files also download a checksum file and run its verification; but said verification will bring up a lot of errors because it tries to check checksums on many associated files that weren't downloaded (ex: JSON metadata). Was this the intended behavior?

This is expected, although it could've been handled better.

There are two types of meta files: parquet files, which hold the original image links, and json files, which are generated by img2dataset and include information about the download process, such as the error code when a file fails to download.

The meta files were uploaded along with the dataset (you can view them here), but I didn't include them in the download scripts to save on bandwidth. Changing the validation command from:

sha512sum --quiet -c sha512sums.txt

to

cat sha512sums.txt | grep .tar | sha512sum --quiet -c

should get rid of the warnings.
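
For illustration, the filtered check above could be wrapped in a small function. This is only a sketch: `verify_tars` is a hypothetical name, not something from the repository's scripts, and it assumes GNU coreutils `sha512sum` and a checksum list where each line ends with the file name. It anchors the pattern to a literal `.tar` suffix, since an unescaped `.` in `grep .tar` matches any character.

```shell
#!/usr/bin/env bash
set -euo pipefail

# verify_tars: hypothetical helper (not part of the repository's scripts).
# Verifies only the checksum entries whose file name ends in ".tar",
# so entries for meta files that were never downloaded are skipped.
verify_tars() {
    local sums_file="$1"
    # '\.tar$' matches a literal ".tar" at the end of the line.
    # "sha512sum -c -" reads the checksum list from standard input.
    grep '\.tar$' "$sums_file" | sha512sum --quiet -c -
}

# Usage (run from the directory that holds the downloaded TAR files):
#   verify_tars sha512sums.txt
```

Because of `pipefail`, the function returns non-zero if any listed TAR file is missing or does not match its checksum. GNU `sha512sum` also has an `--ignore-missing` option that skips entries whose files are absent, which may be an alternative to filtering with `grep`.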

@morphine00
Contributor Author

It's trivial to change that line. That said, we checked, and quite frankly the dataset itself is many gigabytes, while the support files are only a few MB each. Perhaps it's simply better to have the download scripts grab the entire directories?

@nathanw-mlc
Member

bump

@morphine00 morphine00 merged commit 68f8f38 into master Mar 12, 2024
1 check passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 12, 2024
4 participants