Out of memory when fine-tuning #43
I'm not sure this is an OOM error. The training should succeed on a 16GB V100. Can you provide more details about the file you're fine-tuning, TF version, etc.? Did the fine-tuning steps for Moby Dick succeed for you, or did those fail as well?
I am using Python 3.7.4 (a fresh Anaconda distribution) on an EC2 Linux machine. I am now running with Moby Dick and hit the same situation: training seems to hang fairly quickly after printing this warning:

2019-10-15 18:09:05.363842: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

The GPU utilization: [nvidia-smi table not preserved]

A while later (maybe an hour) I get the error I mentioned in my previous post and the program exits.
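(Side note, as a sketch only: the XLA warning above is informational and is not the cause of the hang or the OOM. For completeness, the environment variable it mentions would be set before TensorFlow is imported, along the lines of the snippet below.)

```python
import os

# Must be set before TensorFlow is imported for it to take effect;
# this only enables XLA:CPU and silences the warning, nothing more.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_cpu_global_jit"

import tensorflow as tf  # tensorflow-gpu 1.14 in this thread
```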
Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we have added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.
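(For anyone curious about that angle, a rough sketch of what fine-tuning through the transformers port could look like — this is the PyTorch port, not this repo's TF code, and whether it fits on a 16GB V100 is not verified here; the class names assume the CTRL support added to transformers.)

```python
import torch
from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained("ctrl")
model = CTRLLMHeadModel.from_pretrained("ctrl")
model.train()

# Plain SGD keeps per-variable optimizer state to a minimum.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

# "Links" is one of CTRL's control codes; the text after it is a placeholder.
input_ids = tokenizer.encode("Links Example fine-tuning text.", return_tensors="pt")
loss = model(input_ids, labels=input_ids)[0]  # LM loss is the first output
loss.backward()
optimizer.step()
```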
@keskarnitish How do I run training.py on GPU? When I ran it ... Oh, my CUDA 10.1 is not compatible with tensorflow-gpu 1.14.0. After fixing this issue, I get the following error [log not preserved]:

My system is Ubuntu 18.04 with a Tesla V100 32GB (about 25GB of which is free) and tensorflow-gpu 1.14.0. I tried batch sizes of 4, 2, and 1.
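(A quick sanity check for the CUDA mismatch mentioned above — illustrative only, not part of training.py: tensorflow-gpu 1.14 is built against CUDA 10.0, so with only CUDA 10.1 installed the GPU is usually not visible to TensorFlow at all. The snippet below prints what TF can actually see.)

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# False if the CUDA libraries fail to load for this TF build.
print(tf.test.is_gpu_available())

# Should include something like '/device:GPU:0' when the GPU is usable.
print([d.name for d in device_lib.list_local_devices()])
```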
While I explore this, I noticed a PR that seems to circumvent this issue (#51). I haven't tested it out, but it might be a temporary solution.
Yeah, I can confirm that I also can't get a 16GB V100 (8 CPUs, 30GB RAM, 100GB SSD) to work with tensorflow-gpu==1.14 on the Moby Dick training example, with batch_size = 1 and iterations = 1, using the 256 model (_v0). Can you recommend another GPU that would be good for training? Happy to try another one. To my understanding, NickWalton's fix manages multiple GPUs but doesn't describe which ones?
Fine-tuning does work on the 32 GB GV100.
About this (for general info): what tricks are usually applied to make a lower-memory branch like you did? I looked at the ...
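(For general reference only, and not necessarily what the lower-memory branch actually does: two common memory-saving tricks when fine-tuning in TF 1.x are freezing part of the model so no gradients or optimizer slots are allocated for it, and switching to an optimizer with fewer per-variable slots. A minimal self-contained sketch with placeholder variable-scope names:)

```python
import tensorflow as tf

# Tiny stand-in graph so the snippet runs on its own.
with tf.variable_scope("embedding"):
    emb = tf.get_variable("table", shape=[100, 16])
with tf.variable_scope("head"):
    w = tf.get_variable("w", shape=[16, 1])
loss = tf.reduce_mean(tf.matmul(emb, w))

# Trick 1: compute gradients only for a subset of variables (freeze the rest),
# so no gradient tensors or optimizer slots are created for them.
trainable = [v for v in tf.trainable_variables()
             if not v.name.startswith("embedding")]  # placeholder scope name

# Trick 2: use an optimizer with fewer per-variable slots
# (Adam keeps two extra copies of every variable, Adagrad one, plain SGD none).
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-3)
train_op = optimizer.minimize(loss, var_list=trainable)
```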
|
Thank you for this important contribution!
I am trying to fine-tune your full model on a V100 with 16GB of memory. Even when setting the batch size to 1 in the patch, I seem to be running out of memory (see the error below). Is there any way to fine-tune your model on a 16GB machine?
Thanks,
Oren.
2019-10-14 20:27:40.672735: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 15753943296 memory_limit_: 15753943450 available bytes: 154 curr_region_allocation_bytes_: 31507887104
2019-10-14 20:27:40.672751: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:
Limit: 15753943450
InUse: 15753943296
MaxInUse: 15753943296
NumAllocs: 3949
MaxAllocSize: 1262254080
2019-10-14 20:27:40.672835: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
ERROR:tensorflow:Error recorded from training_loop: Dst tensor is not initialized.
[[node save/RestoreV2 (defined at training.py:164) ]]