Unable to train on 4-5 GTX 1070s #17
I went ahead and SIGKILLed it. Nothing was happening that I could tell: no IOPS or anything in Grafana.
Hi, could you give more details, such as the training command you used? It is not recommended to fit a model that is too big, as it will cause excessive offloading of the model to CPU memory (assuming you are using FSDP), which is very slow.
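For illustration, here is a minimal sketch of FSDP with CPU offloading in plain PyTorch. This is an assumption about how the setup roughly works rather than this project's actual training code, but it shows why an oversized model ends up spending most of its time copying parameter shards between host RAM and the GPUs.

```python
# Minimal FSDP + CPU offload sketch (assumed plain-PyTorch setup, not this
# project's actual code). With offload_params=True, each rank keeps its
# parameter shards in host RAM and copies them to the GPU for every
# forward/backward pass, so a model that barely fits is dominated by
# PCIe transfers.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

dist.init_process_group("nccl")  # one process per GPU, e.g. launched via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Placeholder model; the real model comes from the repository.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
)

fsdp_model = FSDP(
    model,
    cpu_offload=CPUOffload(offload_params=True),
    device_id=torch.cuda.current_device(),
)
```

If the offloaded shards (plus optimizer state) do not comfortably fit in system RAM, the offload spills into swap, which would be consistent with the stalls reported here.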
I was using the example command in the README with the use_fsdp option:
I see. It could be that the model is too large, causing slow CPU offload, or that FSDP is not working properly on your system. Does the same command work if you change to a smaller model, e.g. …?
The base model just says bus error:
This also results in an idling process
And here are the logs from the normal model:
And when I give it the RAM it needs via swap, I get the same bus error:
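As a quick sanity check (my own sketch, not anything from the repo), comparing the checkpoint size against available RAM plus free swap before launching can show whether the bus error is simply the OS failing to back the memory the process touches. The checkpoint path below is a placeholder.

```python
# Rough host-memory headroom check before launching (a suggested sketch,
# not part of the project). The checkpoint path is a placeholder.
import os
import psutil

ckpt_path = "path/to/checkpoint.bin"  # placeholder, substitute the real file
ckpt_gb = os.path.getsize(ckpt_path) / 1e9

vm = psutil.virtual_memory()
swap = psutil.swap_memory()
print(f"checkpoint size : {ckpt_gb:.1f} GB")
print(f"available RAM   : {vm.available / 1e9:.1f} GB")
print(f"free swap       : {swap.free / 1e9:.1f} GB")
```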
I've got 5 GTX 1070s here that I'm trying to train on. The memory disappears quickly at first: 40 GB of VRAM but only 64 GB of system memory. I added swap, but I imagine this will take forever to load. Are there other flags I can use to reduce the memory usage?
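In case it helps, these are the FSDP-level knobs that usually reduce memory pressure, sketched in plain PyTorch. Whether this project exposes flags for them is an assumption on my part, and fp16 is chosen here because Pascal cards like the 1070 have no usable bf16.

```python
# Sketch of common FSDP memory-reduction options (assumed plain-PyTorch usage;
# the project's own command-line flags may or may not map onto these).
# Assumes the process group is already initialized, one process per GPU.
import torch
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    CPUOffload,
    MixedPrecision,
    ShardingStrategy,
)

model = torch.nn.Sequential(               # placeholder model
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
)

fp16_policy = MixedPrecision(              # halves the param/grad/buffer footprint
    param_dtype=torch.float16,
    reduce_dtype=torch.float16,
    buffer_dtype=torch.float16,
)

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
    mixed_precision=fp16_policy,
    cpu_offload=CPUOffload(offload_params=True),    # keep only if it still does not fit
    device_id=torch.cuda.current_device(),
    limit_all_gathers=True,                         # throttle prefetch to cap peak VRAM
)
```

Activation checkpointing on the transformer blocks and a smaller per-device batch size are the other usual levers; gradient accumulation can recover the effective batch size.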