
a bug of "OOM, about gpu" when I run it on more than one spark worker node #12

Open
younfor opened this issue Jan 4, 2017 · 1 comment

younfor commented Jan 4, 2017

For example: I changed the files to load my own pictures, shaped [None, 32, 32, 3]. Everything is OK, but when I set partition = 2 (or 4, 8, ...) the problem below appears. My computer is a GTX 1070 with 8 GB, running Ubuntu 14.04. I also changed the model init code to:

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.allocator_type = 'BFC'
    # config.gpu_options.per_process_gpu_memory_fraction = 0.2
    session = tf.Session(config=config)

The settings above allow several processes to share one GPU.
The bug: after the program runs for some epochs, the GPU memory reported by "nvidia-smi" grows without stopping, from 800 MB to 2 GB, 4 GB, 8 GB..., until it finally fails with a CUDA OOM error.
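A minimal way to watch this growth (a sketch, assuming nvidia-smi is on the PATH; not part of the project's code) is to poll the reported usage between epochs:

    import subprocess
    import time

    def gpu_memory_used_mb():
        # Ask nvidia-smi for the per-GPU memory usage, in MiB.
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"])
        return [int(x) for x in out.decode().split()]

    while True:
        # Climbs steadily (800 MB -> 2 GB -> 4 GB -> ...) while the leak is present.
        print(gpu_memory_used_mb())
        time.sleep(10)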
My way to solve it:
After checking and trying to fix it, I found the function that leads to the GPU memory leak:

    def reset_gradients(self):
        # with self.session.as_default():
        #     self.gradients = [tf.zeros(g[1].get_shape()).eval() for g in self.compute_gradients]
        self.gradients = [0.0] * len(self.compute_gradients)  # my modification
        self.num_gradients = 0

Though I don't know the details of why this change works, it did.
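My guess at why (not verified against the rest of the codebase): in TF 1.x, every call to tf.zeros(...).eval() adds a new op to the default graph, so a reset that is called repeatedly keeps growing the graph and the memory behind it, while plain Python zeros never touch the graph. A small sketch of the pattern that was removed (hypothetical [32, 32, 3] shape):

    import tensorflow as tf

    # Sketch of the removed pattern: building tf.zeros(...) inside a function
    # that is called repeatedly keeps adding new ops to the default graph
    # instead of reusing one.
    sess = tf.Session()
    with sess.as_default():
        for step in range(3):
            _ = tf.zeros((32, 32, 3)).eval()  # a new op is created on every call
            print(len(tf.get_default_graph().get_operations()))
            # The op count keeps climbing; [0.0] * n, as in the fix above,
            # leaves the graph untouched.

Building the zeros op once outside the loop (or just using Python floats, as in the fix) keeps the graph size constant.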
email: [email protected]

illuzen commented Jan 20, 2017

Interesting. I remember seeing this error sometimes; IIRC we installed newer drivers to fix it, but this seems like a pretty benign fix. I can kinda see why this might allocate more memory. In any case, we shouldn't be running an eval just to get a bunch of zeros. Good catch, wanna roll this into the other PR?

Did you see a performance change when you set allow_growth = True? And do you have a cluster, or is it a bunch of cores on a local machine?
