Out of memory when fine-tuning #43

Open
orenmelamud opened this issue Oct 14, 2019 · 9 comments

@orenmelamud

Thank you for this important contribution!

I am trying to fine-tune your full model on a V100 with 16GB memory. Even when setting batch size to 1 in the patch, I seem to be running out of memory (see error below). Is there any way to fine-tune your model on a 16GB machine?

Thanks,
Oren.

2019-10-14 20:27:40.672735: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 15753943296 memory_limit_: 15753943450 available bytes: 154 curr_region_allocation_bytes_: 31507887104
2019-10-14 20:27:40.672751: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:
Limit: 15753943450
InUse: 15753943296
MaxInUse: 15753943296
NumAllocs: 3949
MaxAllocSize: 1262254080

2019-10-14 20:27:40.672835: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
ERROR:tensorflow:Error recorded from training_loop: Dst tensor is not initialized.
[[node save/RestoreV2 (defined at training.py:164) ]]
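
A note on the allocator stats above: TensorFlow 1.x reserves essentially the whole GPU by default, so the BFC dump alone does not show how much the model itself needs. Below is a minimal sketch of switching to on-demand allocation, assuming the training loop is driven by a tf.estimator.Estimator (the actual training.py may wire this differently); it will not make an oversized model fit, but it makes the true failure point clearer.

```python
# Sketch only: let TF 1.x allocate GPU memory on demand instead of reserving
# the entire card up front. Pass the session config into the Estimator's
# RunConfig (assumption: the training loop accepts a RunConfig).
import tensorflow as tf

session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True

run_config = tf.estimator.RunConfig(session_config=session_config)
# estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)
```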

@keskarnitish (Contributor)

I'm not sure this is an OOM error. The training should succeed on a 16GB V100. Can you provide more details about the file you're fine-tuning on, your TF version, etc.?

Did the fine-tuning steps for Moby Dick succeed for you or did those fail as well?

@orenmelamud (Author)

I am using Python 3.7.4 (fresh Anaconda distribution) on an EC2 Linux machine, with tensorflow-gpu==1.14 and your Keras patch applied with the batch size set to 1.

I am now running the Moby Dick example. Same situation: training seems to hang pretty quickly after printing this warning:

2019-10-15 18:09:05.363842: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

The gpu utilization:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 41C P0 40W / 300W | 15469MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 4700 C python 15443MiB |
+-----------------------------------------------------------------------------+

A while later (maybe an hour) I get the error I mentioned in my previous post and the program exits.
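
The XLA warning above is only about XLA:CPU clustering and is not itself an error. If one wanted to enable XLA:CPU as the message describes, a minimal sketch (assuming the flag is set before TensorFlow is initialized) would be:

```python
# Sketch: set the flag the warning mentions before TensorFlow starts. This
# only enables XLA:CPU clustering; it does not address GPU memory pressure.
import os
os.environ["TF_XLA_FLAGS"] = "--tf_xla_cpu_global_jit"
import tensorflow as tf  # import only after the variable is set
```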

@keskarnitish (Contributor)

Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.
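
For reference, the transformers angle mentioned above would look roughly like the sketch below. This is a hedged example, not the repo's fine-tuning code: it assumes a transformers release that includes the CTRL classes and simply takes one language-modeling step on a toy batch; memory behaviour on a 16 GB card is not guaranteed.

```python
# Rough sketch of loading CTRL via huggingface/transformers (PyTorch) and
# taking one language-modeling gradient step on a toy input.
import torch
from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained("ctrl")
model = CTRLLMHeadModel.from_pretrained("ctrl").cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

text = "Books Call me Ishmael."            # "Books" is one of CTRL's control codes
inputs = tokenizer.encode(text, return_tensors="pt").cuda()

model.train()
loss = model(inputs, labels=inputs)[0]     # first tuple element is the LM loss
loss.backward()
optimizer.step()
```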

@zhongpeixiang commented Oct 30, 2019

@keskarnitish How do I run training.py on the GPU? When I run python training.py --model_dir ../seqlen256_v1.ckpt --iterations 250, the model stays on the CPU by default.

Oh, it turns out my CUDA 10.1 is not compatible with tensorflow-gpu 1.14.0.
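
In case it helps anyone else hitting the CPU-only symptom: a quick hedged check, using standard TF 1.x calls rather than anything from this repo, to confirm the GPU is visible at all.

```python
# If only CPU devices are listed here, the CUDA/cuDNN + tensorflow-gpu pairing
# is the problem (as above), not the training script itself.
import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.test.is_gpu_available())
print([d.name for d in device_lib.list_local_devices()])
```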

After fixing this issue, I get the following:

2019-10-30 18:26:06.376093: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
2019-10-30 18:26:06.376141: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at reduction_ops_common.h:180 : Resource exhausted: OOM when allocating tensor with shape[512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
ERROR:tensorflow:Error recorded from training_loop: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node encoder/encoder_layer_12/layer_normalization_24/moments/mean (defined at ../transformer.py:90) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[training/clip_by_global_norm/mul_1/_12367]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[512] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node encoder/encoder_layer_12/layer_normalization_24/moments/mean (defined at ../transformer.py:90) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node encoder/encoder_layer_12/layer_normalization_24/moments/mean:
 encoder/encoder_layer_11/add_1 (defined at ../transformer.py:98)

Input Source operations connected to node encoder/encoder_layer_12/layer_normalization_24/moments/mean:
 encoder/encoder_layer_11/add_1 (defined at ../transformer.py:98)

My system is Ubuntu 18.04 with a 32GB Tesla V100 (about 25GB free) and tensorflow-gpu 1.14.0. I tried batch sizes of 4, 2, and 1.
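
The hint in that traceback can be followed literally; a minimal sketch, assuming direct access to the session.run call (the Estimator-driven training loop may not expose it):

```python
# RunOptions with report_tensor_allocations_upon_oom makes TF dump the live
# tensors when the OOM is raised, which shows what is dominating GPU memory.
import tensorflow as tf

run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
# sess.run(train_op, options=run_options)
```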

@keskarnitish (Contributor)

While I explore this, I noticed a PR that seems to circumvent the issue (#51). I haven't tested it out, but it might be a temporary solution.

@hypnoai commented Nov 4, 2019

Yeah, I can confirm that I also can't get a 16 GB V100 (8 CPUs, 30 GB RAM, 100 GB SSD) to work with tensorflow-gpu==1.14 on the Moby Dick training example, even with batch_size = 1 and iterations = 1, using the 256 model (v0).

Can you recommend another GPU that would be good for training? Happy to try another. As I understand it, NickWalton's fix manages multiple GPUs but doesn't describe which ones?

@keskarnitish (Contributor)

> Yeah, I can confirm that I also can't get a 16 GB V100 (8 CPUs, 30 GB RAM, 100 GB SSD) to work with tensorflow-gpu==1.14 on the Moby Dick training example, even with batch_size = 1 and iterations = 1, using the 256 model (v0).
>
> Can you recommend another GPU that would be good for training? Happy to try another. As I understand it, NickWalton's fix manages multiple GPUs but doesn't describe which ones?

Fine-tuning does work on the 32 GB GV100.

@pgrandinetti

> Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.

About this (for general info): what tricks are usually applied to make a lower-memory branch like you did? I looked at the diff with master, and it seems you reduced many tensors from float32 to float16. What else would you try?
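
For context on the float32 → float16 point: half precision halves the byte count of every tensor stored that way, which is the main saving in such a branch. A toy, hypothetical illustration (not the branch's actual changes):

```python
# Toy demo of the float32 -> float16 memory saving mentioned above.
import numpy as np

x32 = np.ones((256, 1280), dtype=np.float32)
x16 = x32.astype(np.float16)
print(x32.nbytes)   # 1310720 bytes
print(x16.nbytes)   # 655360 bytes, half the footprint
```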

@Heiheiyo

> Yeah, I was able to replicate this. I was testing the fine-tuning on a 32GB V100 and it worked with higher batch sizes. Let me look into fine-tuning with lower memory. Now that we added CTRL to https://github.com/huggingface/transformers, I wonder if it is also worth trying that angle. I'll update once I have a solution.

I get an OOM error on a 32GB V100. [screenshot of the error attached]
