I have a basic setup working: training the tokenizer (on a subset of the corpus), building a TokenDataset from the full corpus, and training.
I run out of RAM while creating the TokenDataset (I have 30 GB of RAM on Kaggle). This seems strange, given that the original corpus is far smaller than 30 GB.
Is there a way to encode the corpus into another file (in smaller batches) and then load it lazily for training?
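For what it's worth, one way to work around this outside of TokenDataset itself is to encode the corpus in small batches, append the token IDs to a binary file on disk, and then serve fixed-size blocks lazily through a memory-mapped Dataset. A minimal sketch, assuming a Hugging Face `tokenizers` tokenizer saved as `aitextgen.tokenizer.json` and a plain-text corpus at `corpus.txt` (both paths are placeholders):

```python
import numpy as np
import torch
from torch.utils.data import Dataset
from tokenizers import Tokenizer

# --- Step 1: encode the corpus in small batches, appending token IDs to disk ---
tokenizer = Tokenizer.from_file("aitextgen.tokenizer.json")  # placeholder path

with open("corpus.txt", "r", encoding="utf-8") as src, open("corpus.tokens.bin", "wb") as dst:
    batch = []
    for line in src:
        batch.append(line)
        if len(batch) >= 10_000:  # tune the batch size to your RAM budget
            for enc in tokenizer.encode_batch(batch):
                # uint16 is enough for vocab sizes below 65536
                np.asarray(enc.ids, dtype=np.uint16).tofile(dst)
            batch = []
    if batch:
        for enc in tokenizer.encode_batch(batch):
            np.asarray(enc.ids, dtype=np.uint16).tofile(dst)

# --- Step 2: a lazy Dataset that memory-maps the token file instead of loading it ---
class LazyTokenDataset(Dataset):
    def __init__(self, path, block_size=256):
        self.tokens = np.memmap(path, dtype=np.uint16, mode="r")
        self.block_size = block_size

    def __len__(self):
        return (len(self.tokens) - 1) // self.block_size

    def __getitem__(self, idx):
        start = idx * self.block_size
        chunk = self.tokens[start : start + self.block_size]
        return torch.tensor(chunk, dtype=torch.long)

dataset = LazyTokenDataset("corpus.tokens.bin", block_size=256)
```

This keeps peak RAM at roughly one batch of encodings, at the cost of an up-front pass over the corpus, and the binary file can be reused across runs.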
I have run into an issue where specific files, for whatever reason, can crash the tokenizer. I've had a 15 KB XML file swallow 30 GB of RAM. I'm not really sure why some files cause this, but perhaps that's the issue you're running into?
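If you want to check whether a single file is the culprit, one quick way is to encode each corpus file on its own and see which one hangs or exhausts RAM. A rough sketch (the tokenizer path and `corpus_dir` are placeholders):

```python
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("aitextgen.tokenizer.json")  # placeholder path

# Encode each file separately; the one that hangs or exhausts RAM
# is the file the tokenizer is choking on.
for path in sorted(Path("corpus_dir").glob("*")):  # placeholder directory
    text = path.read_text(encoding="utf-8", errors="ignore")
    print(f"encoding {path} ({path.stat().st_size} bytes)...", flush=True)
    ids = tokenizer.encode(text).ids
    print(f"  ok: {len(ids)} tokens")
```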
Kaggle gives you 16 GB of RAM, and you would need something like 100 GB or more of RAM to encode it. (Edit: this is very old; it should now use CPU rather than RAM.)