
dataset tokenization script improvements #106

Merged
merged 12 commits into main from 105-fix-dataset-download-for-its-tokenization2 on Apr 24, 2024

Conversation

@joshuawe (Collaborator) commented Apr 2, 2024

Fixes #105

This fixes the dataset tokenization script, which currently supports only the delphi-suite/stories dataset with its particular structure of parquet files.
The script should be able to download any suitable HF dataset, even if it has a slightly different structure.

Note: Needs to be rebased on #94 once that branch is rebased on main again

@joshuawe linked an issue Apr 2, 2024 that may be closed by this pull request
@jettjaniak (Contributor)

You should point this PR at 93-tokenize-... instead of main.
Then, when 93-tokenize-... is merged, this one will automatically update to point at main again.

@jettjaniak (Contributor)

Looks like the merges caused the displayed diff to be wrong.
[screenshot of the incorrect diff]

@joshuawe force-pushed the 105-fix-dataset-download-for-its-tokenization2 branch from 84f8260 to cd19305 on April 5, 2024 16:14
@joshuawe marked this pull request as ready for review April 10, 2024 14:40
@joshuawe (Collaborator, Author)

For the reviewers: please run this command once and verify that it uploaded your dataset.
It worked for me on a subset of the dataset, but my RAM was not sufficient to tokenize the entire dataset in one go. :(

python ./scripts/tokenize_dataset.py --token HF_TOKEN --input-dataset-name delphi-suite/stories --tokenizer-name delphi-suite/stories-tokenizer --output-dataset-name NEW_HF_DATASET_NAME --column-name=story

@jettjaniak @siwei-li

@jettjaniak (Contributor)

Weird, on my machine it used just ~1 GB of memory

@jettjaniak (Contributor)

But it's failing with

[1]    87023 killed     ./scripts/tokenize_dataset.py --hf-token hf_cHQmKbyWcgrUxZQAgUWuphVtJvheAGFSB
/opt/homebrew/Cellar/[email protected]/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I think one of these calls is at fault

        # Store the tokenized data in a new dataset for this split
        tokenized_datasets[split] = Dataset.from_dict({"tokens": tokenized_dataset})

    # Create a new dataset with the same structure (splits) as the original dataset, but with tokenized data
    output_dataset = DatasetDict(tokenized_datasets)
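That diagnosis fits the OOM kill above: `Dataset.from_dict` materializes every tokenized row in memory before the `DatasetDict` is built. A minimal sketch of a lazier alternative, using a plain generator so rows are produced one at a time (`toy_tokenize` and the sample texts are illustrative stand-ins, not the script's real tokenizer):

```python
from typing import Iterable, Iterator


def toy_tokenize(text: str) -> list[int]:
    # Hypothetical stand-in for the real tokenizer: one "token id" per word.
    return [len(word) for word in text.split()]


def tokenize_lazily(texts: Iterable[str]) -> Iterator[dict]:
    # Yield one row at a time instead of collecting everything into a dict,
    # so peak memory is one row rather than a whole split.
    for text in texts:
        yield {"tokens": toy_tokenize(text)}


rows = list(tokenize_lazily(["one two", "three four five"]))
# rows == [{"tokens": [3, 3]}, {"tokens": [5, 4, 4]}]
```

If the script switched to `datasets.Dataset.from_generator`, it could consume a generator of exactly this shape instead of a fully materialized dict.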

@jettjaniak (Contributor)

I added scripts/demo_upload_in_chunks.py as an example of how to upload the dataset in chunks; we should adapt the tokenization script accordingly.
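As a rough illustration of the chunking idea (the `shard_names` helper below is hypothetical and simply follows the Hub's conventional shard-file naming; it is not necessarily what demo_upload_in_chunks.py does):

```python
def shard_names(split: str, n_rows: int, rows_per_shard: int) -> list[str]:
    # Ceil-divide the rows into fixed-size shards and name each file in the
    # Hub's conventional "<split>-<index>-of-<total>.parquet" style.
    n_shards = max(1, -(-n_rows // rows_per_shard))
    return [f"{split}-{i:05d}-of-{n_shards:05d}.parquet" for i in range(n_shards)]


print(shard_names("train", 2500, 1000))
# ['train-00000-of-00003.parquet',
#  'train-00001-of-00003.parquet',
#  'train-00002-of-00003.parquet']
```

Each shard could then be written out separately (e.g. `Dataset.select(...)` followed by `to_parquet(...)`) and pushed one file at a time, so only one shard's rows need to be in memory during upload.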

@jettjaniak changed the title from "105 fix dataset download for its tokenization2" to "dataset tokenization script improvements" on Apr 17, 2024
@jettjaniak force-pushed the 105-fix-dataset-download-for-its-tokenization2 branch from b63fdd3 to f80a511 on April 20, 2024 20:59
@jettjaniak force-pushed the 105-fix-dataset-download-for-its-tokenization2 branch from 6be2593 to c3f39a7 on April 21, 2024 02:02
@jettjaniak (Contributor)

I reduced memory usage, but broke tests (should be easy to fix)

This dataset, https://huggingface.co/datasets/delphi-suite/stories-tokenized, is the result of running
scripts/tokenize_dataset.py -i delphi-suite/stories -f story -s SPLIT -o delphi-suite/stories-tokenized -r delphi-suite/stories-tokenizer -l 512 -t hf_...
where SPLIT={train, validation} (two separate commands)
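The two per-split invocations could be wrapped in a small shell loop; this sketch only echoes the commands rather than running the script, and keeps the hf_... token placeholder elided as above:

```shell
# Print the tokenization command once per split; replace `echo` with a real
# invocation to actually run it. hf_... stands for an actual HF token.
for SPLIT in train validation; do
  echo scripts/tokenize_dataset.py -i delphi-suite/stories -f story -s "$SPLIT" \
    -o delphi-suite/stories-tokenized -r delphi-suite/stories-tokenizer -l 512 -t hf_...
done
```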

@jettjaniak (Contributor)

One of the unit tests fails because I replaced delphi-suite/stories-tokenizer with a different tokenizer; that needs updating too.

@jettjaniak merged commit ad2936f into main Apr 24, 2024
1 check passed
@jettjaniak deleted the 105-fix-dataset-download-for-its-tokenization2 branch April 24, 2024 14:21
Successfully merging this pull request may close these issues.

fix dataset download for its tokenization