
dataset tokenization script improvements #106

Merged
merged 12 commits into main from 105-fix-dataset-download-for-its-tokenization2 on Apr 24, 2024

Conversation

@joshuawe (Collaborator) commented Apr 2, 2024

Fixes #105

This fixes the dataset tokenization script, which currently supports only the delphi-suite/stories dataset with its particular structure of parquet files.
The script should be able to download any suitable HF dataset, even if it has a slightly different structure.

Note: Needs to be rebased on #94 once that branch is rebased on main again

@joshuawe linked an issue Apr 2, 2024 that may be closed by this pull request
@jettjaniak (Contributor)

You should point this PR at 93-tokenize-... instead of main.
Then, when 93-tokenize-... is merged, this one will automatically update to point at main again.

@jettjaniak (Contributor)

Looks like the merges caused the displayed diff to be wrong.
[screenshot of the incorrect diff]

@joshuawe force-pushed the 105-fix-dataset-download-for-its-tokenization2 branch from 84f8260 to cd19305 on April 5, 2024 16:14
@joshuawe marked this pull request as ready for review April 10, 2024 14:40
@joshuawe (Collaborator, Author)

For the reviewers: please run this command once and verify that it uploaded your dataset.
It worked for me on a subset of the dataset, but my RAM was not sufficient to tokenize the entire dataset in one go. :(

python ./scripts/tokenize_dataset.py --token HF_TOKEN --input-dataset-name delphi-suite/stories --tokenizer-name delphi-suite/stories-tokenizer --output-dataset-name NEW_HF_DATASET_NAME --column-name=story

@jettjaniak @siwei-li

@jettjaniak (Contributor)

Weird, on my machine it used just ~1 GB of memory

@jettjaniak (Contributor)

But it's failing with

[1]    87023 killed     ./scripts/tokenize_dataset.py --hf-token hf_cHQmKbyWcgrUxZQAgUWuphVtJvheAGFSB
/opt/homebrew/Cellar/[email protected]/3.10.13_2/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

I think one of these calls is at fault

        # Store the tokenized data in a new dataset for this split
        tokenized_datasets[split] = Dataset.from_dict({"tokens": tokenized_dataset})

    # Create a new dataset with the same structure (splits) as the original dataset, but with tokenized data
    output_dataset = DatasetDict(tokenized_datasets)
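That diagnosis fits the OOM kill above: `Dataset.from_dict` materializes every tokenized row in memory before the `DatasetDict` is built. A minimal sketch of a lazier alternative, using a plain generator so rows are produced one at a time (`toy_tokenize` and the sample texts are illustrative stand-ins, not the script's real tokenizer):

```python
from typing import Iterable, Iterator


def toy_tokenize(text: str) -> list[int]:
    # Hypothetical stand-in for the real tokenizer: one "token id" per word.
    return [len(word) for word in text.split()]


def tokenize_lazily(texts: Iterable[str]) -> Iterator[dict]:
    # Yield one row at a time instead of collecting everything into a dict,
    # so peak memory is one row rather than a whole split.
    for text in texts:
        yield {"tokens": toy_tokenize(text)}


rows = list(tokenize_lazily(["one two", "three four five"]))
# rows == [{"tokens": [3, 3]}, {"tokens": [5, 4, 4]}]
```

If the script switched to `datasets.Dataset.from_generator`, it could consume a generator of exactly this shape instead of a fully materialized dict.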

@jettjaniak (Contributor)

I added scripts/demo_upload_in_chunks.py as an example of how to upload the dataset in chunks; we should adapt the tokenization script accordingly.
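As a rough illustration of the chunking idea (the `shard_names` helper below is hypothetical and simply follows the Hub's conventional shard-file naming; it is not necessarily what demo_upload_in_chunks.py does):

```python
def shard_names(split: str, n_rows: int, rows_per_shard: int) -> list[str]:
    # Ceil-divide the rows into fixed-size shards and name each file in the
    # Hub's conventional "<split>-<index>-of-<total>.parquet" style.
    n_shards = max(1, -(-n_rows // rows_per_shard))
    return [f"{split}-{i:05d}-of-{n_shards:05d}.parquet" for i in range(n_shards)]


print(shard_names("train", 2500, 1000))
# ['train-00000-of-00003.parquet',
#  'train-00001-of-00003.parquet',
#  'train-00002-of-00003.parquet']
```

Each shard could then be written out separately (e.g. `Dataset.select(...)` followed by `to_parquet(...)`) and pushed one file at a time, so only one shard's rows need to be in memory during upload.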

@jettjaniak changed the title from "105 fix dataset download for its tokenization2" to "dataset tokenization script improvements" on Apr 17, 2024
@jettjaniak force-pushed the 105-fix-dataset-download-for-its-tokenization2 branch from b63fdd3 to f80a511 on April 20, 2024 20:59
@jettjaniak force-pushed the 105-fix-dataset-download-for-its-tokenization2 branch from 6be2593 to c3f39a7 on April 21, 2024 02:02
@jettjaniak (Contributor)

I reduced memory usage, but broke tests (should be easy to fix)

This dataset, https://huggingface.co/datasets/delphi-suite/stories-tokenized, is the result of running
scripts/tokenize_dataset.py -i delphi-suite/stories -f story -s SPLIT -o delphi-suite/stories-tokenized -r delphi-suite/stories-tokenizer -l 512 -t hf_...
where SPLIT={train, validation} (two separate commands)
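The two per-split invocations could be wrapped in a small shell loop; this sketch only echoes the commands rather than running the script, and keeps the hf_... token placeholder elided as above:

```shell
# Print the tokenization command once per split; replace `echo` with a real
# invocation to actually run it. hf_... stands for an actual HF token.
for SPLIT in train validation; do
  echo scripts/tokenize_dataset.py -i delphi-suite/stories -f story -s "$SPLIT" \
    -o delphi-suite/stories-tokenized -r delphi-suite/stories-tokenizer -l 512 -t hf_...
done
```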

@jettjaniak (Contributor)

One of the unit tests fails because I replaced delphi-suite/stories-tokenizer with a different tokenizer; that needs updating too.

@jettjaniak merged commit ad2936f into main Apr 24, 2024
1 check passed
@jettjaniak deleted the 105-fix-dataset-download-for-its-tokenization2 branch April 24, 2024 14:21
Successfully merging this pull request may close these issues.

fix dataset download for its tokenization