
tokenization memory usage #88

Open
brian-ham opened this issue Oct 13, 2024 · 1 comment
Comments

@brian-ham

Hi! I am currently trying to tokenize the processed 400m-1x data, but I'm running into object store memory issues: the tokenize_shuffle.py script seems to attempt to tokenize the entire processed dataset in memory instead of periodically writing to disk. For context, I don't have S3 access, so I modified the script slightly to save to a local disk instead. I tried enabling --no_shuffle, in case shuffling was what prevented periodic writes, and also played around with force_num_cores, num_writer_per_node, and allow_imbalanced_write, to little effect. Roughly the invocation I've been using is below.
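(Paths are placeholders, and I may be misremembering the exact input/output argument names, so treat those as approximate.)

```bash
python open_lm/datapreprocess/ray/tokenize_shuffle.py \
    --input /data/400m-1x/processed \
    --output /data/400m-1x/tokenized \
    --no_shuffle \
    --force_num_cores 32 \
    --num_writer_per_node 2 \
    --allow_imbalanced_write
```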

Are there any other tips for managing memory usage with the tokenize_shuffle script on the 400m-1x data, or is it by design that the write to disk happens only at the end? Thanks!

@afang-story
Contributor

It is by design, because we need to shuffle the data, which you should do if you plan on training on it. If you don't care about shuffling, you could try changing https://github.com/mlfoundations/open_lm/blob/main/open_lm/datapreprocess/ray/tokenize_shuffle.py#L672 to `ds = ds`, but this might slow down writing. In that case it would probably be better to write new tokenization-only code.
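As a sketch, assuming the line at L672 currently performs a Ray Dataset global shuffle (verify against your checkout; the exact call may differ):

```python
# tokenize_shuffle.py, around L672 (sketch; assumes the original line is a
# Ray global shuffle like the commented-out call below)

# before: a global shuffle forces Ray to materialize the full dataset in the
# object store before any writer can start
# ds = ds.random_shuffle()

# after: a no-op keeps the pipeline lazy, so blocks can stream to the writers
ds = ds
```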

Assuming you want tokenized and shuffled data, I would recommend getting a machine with more disk space, or using the single-node Rust code here: https://github.com/revbucket/tokshuf-rust
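If you do end up writing tokenization-only code, a minimal sketch of the idea (untested, and not code from this repo; it assumes JSONL inputs with a "text" field and uses a HuggingFace tokenizer as a placeholder) would be to flush fixed-size token shards to disk as you go, so memory stays bounded by a single shard's buffer:

```python
import glob
import json
import os

import numpy as np
from transformers import AutoTokenizer


def tokenize_to_shards(
    input_glob: str,    # e.g. "/data/400m-1x/processed/*.jsonl"
    output_dir: str,
    tokenizer_name: str = "EleutherAI/gpt-neox-20b",  # placeholder tokenizer
    tokens_per_shard: int = 1 << 24,  # ~16M tokens per output shard
):
    """Tokenize JSONL text files, flushing token shards to disk periodically."""
    os.makedirs(output_dir, exist_ok=True)
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    eot = tok.eos_token_id

    buffer: list[int] = []
    shard_idx = 0

    def flush():
        nonlocal buffer, shard_idx
        if not buffer:
            return
        path = os.path.join(output_dir, f"shard_{shard_idx:05d}.npy")
        np.save(path, np.asarray(buffer, dtype=np.uint32))
        buffer = []
        shard_idx += 1

    for fname in sorted(glob.glob(input_glob)):
        with open(fname) as f:
            for line in f:
                text = json.loads(line)["text"]
                buffer.extend(tok.encode(text))
                if eot is not None:
                    buffer.append(eot)  # document separator
                if len(buffer) >= tokens_per_shard:
                    flush()  # write out now instead of accumulating
    flush()  # final partial shard


if __name__ == "__main__":
    tokenize_to_shards("/data/400m-1x/processed/*.jsonl", "/data/tokenized")
```

Since nothing is shuffled here, peak memory is just the tokens_per_shard buffer; you would still need to shuffle (e.g., at the shard level) before training.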
