Out of memory when fine-tuning #43

Open
orenmelamud opened this issue Oct 14, 2019 · 9 comments

@orenmelamud

Thank you for this important contribution!

I am trying to fine-tune your full model on a V100 with 16GB memory. Even when setting batch size to 1 in the patch, I seem to be running out of memory (see error below). Is there any way to fine-tune your model on a 16GB machine?

Thanks,
Oren.

2019-10-14 20:27:40.672735: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 15753943296 memory_limit_: 15753943450 available bytes: 154 curr_region_allocation_bytes_: 31507887104
2019-10-14 20:27:40.672751: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:
Limit: 15753943450
InUse: 15753943296
MaxInUse: 15753943296
NumAllocs: 3949
MaxAllocSize: 1262254080

2019-10-14 20:27:40.672835: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
ERROR:tensorflow:Error recorded from training_loop: Dst tensor is not initialized.
[[node save/RestoreV2 (defined at training.py:164) ]]
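
A note on the allocator stats above: TensorFlow 1.x reserves essentially the whole GPU by default, so the BFC dump alone does not show how much the model itself needs. Below is a minimal sketch of switching to on-demand allocation, assuming the training loop is driven by a tf.estimator.Estimator (the actual training.py may wire this differently); it will not make an oversized model fit, but it makes the true failure point clearer.

```python
# Sketch only: let TF 1.x allocate GPU memory on demand instead of reserving
# the entire card up front. Pass the session config into the Estimator's
# RunConfig (assumption: the training loop accepts a RunConfig).
import tensorflow as tf

session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True

run_config = tf.estimator.RunConfig(session_config=session_config)
# estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)
```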

@keskarnitish (Contributor)

I'm not sure this is an OOM error. The training should succeed on a 16GB V100. Can you provide more details about the file you're fine-tuning on, your TF version, etc.?

Did the fine-tuning steps for Moby Dick succeed for you or did those fail as well?

@orenmelamud (Author)

I am using Python 3.7.4 (fresh Anaconda distribution) on an EC2 Linux machine, with tensorflow-gpu==1.14 and your Keras patch applied with the batch size set to 1.

I am now running the Moby Dick example. Same situation: training seems to hang pretty quickly after printing this warning:

2019-10-15 18:09:05.363842: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

The gpu utilization:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 41C P0 40W / 300W | 15469MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 4700 C python 15443MiB |
+-----------------------------------------------------------------------------+

A while later (maybe an hour) I get the error I mentioned in my previous post and the program exits.
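
The XLA warning above is only about XLA:CPU clustering and is not itself an error. If one wanted to enable XLA:CPU as the message describes, a minimal sketch (assuming the flag is set before TensorFlow is initialized) would be:

```python
# Sketch: set the flag the warning mentions before TensorFlow starts. This
# only enables XLA:CPU clustering; it does not address GPU memory pressure.
import os
os.environ["TF_XLA_FLAGS"] = "--tf_xla_cpu_global_jit"
import tensorflow as tf  # import only after the variable is set
```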

@keskarnitish (Contributor)

Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.
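
For reference, the transformers angle mentioned above would look roughly like the sketch below. This is a hedged example, not the repo's fine-tuning code: it assumes a transformers release that includes the CTRL classes and simply takes one language-modeling step on a toy batch; memory behaviour on a 16 GB card is not guaranteed.

```python
# Rough sketch of loading CTRL via huggingface/transformers (PyTorch) and
# taking one language-modeling gradient step on a toy input.
import torch
from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained("ctrl")
model = CTRLLMHeadModel.from_pretrained("ctrl").cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

text = "Books Call me Ishmael."            # "Books" is one of CTRL's control codes
inputs = tokenizer.encode(text, return_tensors="pt").cuda()

model.train()
loss = model(inputs, labels=inputs)[0]     # first tuple element is the LM loss
loss.backward()
optimizer.step()
```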

@zhongpeixiang commented Oct 30, 2019

@keskarnitish How do I run training.py on the GPU? When I run python training.py --model_dir ../seqlen256_v1.ckpt --iterations 250, the model stays on the CPU by default.

Oh, it turns out my CUDA 10.1 is not compatible with tensorflow-gpu 1.14.0.
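
In case it helps anyone else hitting the CPU-only symptom: a quick hedged check, using standard TF 1.x calls rather than anything from this repo, to confirm the GPU is visible at all.

```python
# If only CPU devices are listed here, the CUDA/cuDNN + tensorflow-gpu pairing
# is the problem (as above), not the training script itself.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_gpu_available())
print([d.name for d in device_lib.list_local_devices()])
```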

After fixing this issue, I get the following:

2019-10-30 18:26:06.376093: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
2019-10-30 18:26:06.376141: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at reduction_ops_common.h:180 : Resource exhausted: OOM when allocating tensor with shape[512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
ERROR:tensorflow:Error recorded from training_loop: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node encoder/encoder_layer_12/layer_normalization_24/moments/mean (defined at ../transformer.py:90) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[training/clip_by_global_norm/mul_1/_12367]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node encoder/encoder_layer_12/layer_normalization_24/moments/mean (defined at ../transformer.py:90) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node encoder/encoder_layer_12/layer_normalization_24/moments/mean:
 encoder/encoder_layer_11/add_1 (defined at ../transformer.py:98)

Input Source operations connected to node encoder/encoder_layer_12/layer_normalization_24/moments/mean:
 encoder/encoder_layer_11/add_1 (defined at ../transformer.py:98)

My system is Ubuntu 18.04 with a 32GB Tesla V100 (about 25GB free) and tensorflow-gpu 1.14.0. I tried batch sizes of 4, 2, and 1.
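
The hint in that traceback can be followed literally; a minimal sketch, assuming direct access to the session.run call (the Estimator-driven training loop may not expose it):

```python
# RunOptions with report_tensor_allocations_upon_oom makes TF dump the live
# tensors when the OOM is raised, which shows what is dominating GPU memory.
import tensorflow as tf

run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
# sess.run(train_op, options=run_options)
```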

@keskarnitish (Contributor)

While I explore this, I noticed a PR that seems to circumvent the issue (#51). I haven't tested it out, but it might be a temporary solution.

@hypnoai commented Nov 4, 2019

Yeah, I can confirm that I also can't get a 16 GB V100 (8 CPUs, 30 GB RAM, 100 GB SSD) to work with tensorflow-gpu==1.14 on the Moby Dick training example, even with batch_size = 1 and iterations = 1, using the 256 model (v0).

Can you recommend another GPU that would be good for training? Happy to try another. As I understand it, NickWalton's fix manages multiple GPUs but doesn't describe which ones?

@keskarnitish (Contributor)

> Yeah, I can confirm that I also can't get a 16 GB V100 (8 CPUs, 30 GB RAM, 100 GB SSD) to work with tensorflow-gpu==1.14 on the Moby Dick training example, even with batch_size = 1 and iterations = 1, using the 256 model (v0).
>
> Can you recommend another GPU that would be good for training? Happy to try another. As I understand it, NickWalton's fix manages multiple GPUs but doesn't describe which ones?

Fine-tuning does work on the 32 GB GV100.

@pgrandinetti

> Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.

About this (for general info): what tricks are usually applied to make a lower-memory branch like you did? I looked at the diff with master, and it seems you reduced many tensors from float32 to float16. What else would you try?
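
For context on the float32 → float16 point: half precision halves the byte count of every tensor stored that way, which is the main saving in such a branch. A toy, hypothetical illustration (not the branch's actual changes):

```python
# Toy demo of the float32 -> float16 memory saving mentioned above.
import numpy as np

x32 = np.ones((256, 1280), dtype=np.float32)
x16 = x32.astype(np.float16)
print(x32.nbytes)   # 1310720 bytes
print(x16.nbytes)   # 655360 bytes, half the footprint
```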

@Heiheiyo

> Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.

I get an OOM error on a 32GB V100. [screenshot of the error attached]
