
tokenization memory usage #88

Open
brian-ham opened this issue Oct 13, 2024 · 1 comment
Comments

@brian-ham

Hi! I am currently trying to tokenize the processed 400m-1x data, but I'm running into object store memory issues: the tokenize_shuffle.py script seems to attempt to tokenize the entire processed dataset in memory instead of periodically writing to disk. For context, I don't have S3 access, so I modified the script slightly to save to a local disk instead. I tried enabling --no_shuffle, in case shuffling was what prevented periodic writes, and also played around with force_num_cores, num_writer_per_node, and allow_imbalanced_write, to little effect. Roughly the invocation I've been using is below.
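(Paths are placeholders, and I may be misremembering the exact input/output argument names, so treat those as approximate.)

```bash
python open_lm/datapreprocess/ray/tokenize_shuffle.py \
    --input /data/400m-1x/processed \
    --output /data/400m-1x/tokenized \
    --no_shuffle \
    --force_num_cores 32 \
    --num_writer_per_node 2 \
    --allow_imbalanced_write
```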

Are there any other tips for managing memory usage with the tokenize_shuffle script on the 400m-1x data, or is it by design that the write to disk happens only at the end? Thanks!

@afang-story
Contributor

It is by design, because we need to shuffle the data, which you should do if you plan on training on it. If you don't care about shuffling, you could try changing https://github.com/mlfoundations/open_lm/blob/main/open_lm/datapreprocess/ray/tokenize_shuffle.py#L672 to `ds = ds`, but this might slow down writing. In that case it would probably be better to write new tokenization-only code.
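As a sketch, assuming the line at L672 currently performs a Ray Dataset global shuffle (verify against your checkout; the exact call may differ):

```python
# tokenize_shuffle.py, around L672 (sketch; assumes the original line is a
# Ray global shuffle like the commented-out call below)

# before: a global shuffle forces Ray to materialize the full dataset in the
# object store before any writer can start
# ds = ds.random_shuffle()

# after: a no-op keeps the pipeline lazy, so blocks can stream to the writers
ds = ds
```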

Assuming you want tokenized and shuffled data, I would recommend getting a machine with more disk space, or using the single-node Rust code here: https://github.com/revbucket/tokshuf-rust
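If you do end up writing tokenization-only code, a minimal sketch of the idea (untested, and not code from this repo; it assumes JSONL inputs with a "text" field and uses a HuggingFace tokenizer as a placeholder) would be to flush fixed-size token shards to disk as you go, so memory stays bounded by a single shard's buffer:

```python
import glob
import json
import os

import numpy as np
from transformers import AutoTokenizer


def tokenize_to_shards(
    input_glob: str,    # e.g. "/data/400m-1x/processed/*.jsonl"
    output_dir: str,
    tokenizer_name: str = "EleutherAI/gpt-neox-20b",  # placeholder tokenizer
    tokens_per_shard: int = 1 << 24,  # ~16M tokens per output shard
):
    """Tokenize JSONL text files, flushing token shards to disk periodically."""
    os.makedirs(output_dir, exist_ok=True)
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    eot = tok.eos_token_id

    buffer: list[int] = []
    shard_idx = 0

    def flush():
        nonlocal buffer, shard_idx
        if not buffer:
            return
        path = os.path.join(output_dir, f"shard_{shard_idx:05d}.npy")
        np.save(path, np.asarray(buffer, dtype=np.uint32))
        buffer = []
        shard_idx += 1

    for fname in sorted(glob.glob(input_glob)):
        with open(fname) as f:
            for line in f:
                text = json.loads(line)["text"]
                buffer.extend(tok.encode(text))
                if eot is not None:
                    buffer.append(eot)  # document separator
                if len(buffer) >= tokens_per_shard:
                    flush()  # write out now instead of accumulating
    flush()  # final partial shard


if __name__ == "__main__":
    tokenize_to_shards("/data/400m-1x/processed/*.jsonl", "/data/tokenized")
```

Since nothing is shuffled here, peak memory is just the tokens_per_shard buffer; you would still need to shuffle (e.g., at the shard level) before training.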
