Hi! I'm currently trying to tokenize the processed 400m-1x data, but I'm running into object store memory issues: the tokenize_shuffle.py script seems to attempt to tokenize the entire processed dataset in memory instead of periodically writing results to disk. For context, I don't have S3 access, so I modified the script slightly to save to local disk instead. I tried enabling --no_shuffle, in case shuffling was what prevented periodic writes, and also played around with force_num_cores, num_writer_per_node, and allow_imbalanced_write, to little effect.
Are there any other tips for managing memory usage with the tokenize_shuffle script on the 400m-1x data, or is it by design that the write to disk only happens at the end? Thanks!
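(Not a fix confirmed by the script's authors, just a sketch of one knob worth checking if the pressure really is coming from Ray's object store: Ray can cap the object store size and spill overflow objects to a local directory. The 64 GiB figure and the /mnt/ray_spill path below are placeholders, and since tokenize_shuffle.py calls ray.init itself, these settings would need to be patched into that call or applied via `ray start` before launching the script.)

```python
import json
import ray

# Sketch only: cap Ray's in-memory object store and spill overflow objects to
# local disk instead of running out of memory. Size and spill directory are
# placeholders -- tune them to the machine.
ray.init(
    object_store_memory=64 * 1024**3,  # hard cap on the object store, in bytes
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/mnt/ray_spill"}}
        )
    },
)
```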
Assuming you want tokenized and shuffled data, I would recommend getting a machine with more space, or using the single-node Rust code here: https://github.com/revbucket/tokshuf-rust