dataset tokenization script improvements #106
You should be pointing this PR to 93-tokenize-... instead of main.
84f8260 to cd19305
For the reviewers: please let this command run once and verify that it uploaded your dataset.
Weird, on my machine it used just ~1 GB of memory.
But it's failing with
I think one of these calls is at fault
I added
b63fdd3 to f80a511
6be2593 to c3f39a7
I reduced memory usage, but broke the tests (should be easy to fix). This https://huggingface.co/datasets/delphi-suite/stories-tokenized is the result of
One of the unit tests fails because I replaced delphi-suite/stories-tokenizer with a different tokenizer; that needs updating too.
Fixes #105
Fixes the dataset tokenization script, which currently supports only the delphi-suite/stories dataset with its (unique) structure of parquet files. The script should be able to download all suitable HF datasets even if they have a slightly different structure.
Note: Needs to be rebased on #94 once that branch is rebased on main again
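For illustration, here is a minimal sketch of one way the script could avoid depending on a particular parquet layout. It is not the code in this PR. It assumes the Hugging Face `datasets` library and loads a split through `load_dataset` rather than hard-coded file paths. The helper names (`pick_text_column`, `load_texts`) and the candidate column names are hypothetical; the real text column depends on the dataset.

```python
from typing import Optional, Sequence

# Hypothetical candidate names; the actual text column depends on the dataset.
CANDIDATE_TEXT_COLUMNS = ("story", "text", "content")


def pick_text_column(
    column_names: Sequence[str],
    candidates: Sequence[str] = CANDIDATE_TEXT_COLUMNS,
) -> str:
    """Return the first candidate that appears in column_names, else raise."""
    for name in candidates:
        if name in column_names:
            return name
    raise ValueError(f"no text column found among {list(column_names)}")


def load_texts(dataset_name: str, split: str, column: Optional[str] = None):
    """Load one split via the Hub loader instead of hard-coded parquet paths."""
    # Imported lazily so pick_text_column works without `datasets` installed.
    from datasets import load_dataset

    dataset = load_dataset(dataset_name, split=split)
    text_column = column or pick_text_column(dataset.column_names)
    return dataset[text_column]
```

A call might look like `load_texts("delphi-suite/stories", split="train")`, assuming that split exists. Passing `column` explicitly skips the guess when a dataset uses a name outside the candidate list.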