GPU memory consumption #58

Open
vevenom opened this issue Nov 26, 2021 · 17 comments

@vevenom

vevenom commented Nov 26, 2021

Hi,

When running PatchmatchNet on the ETH3D dataset through eval.py, I end up using 15 GB of GPU memory, while the paper reports 5529 MB. Could it be that all images for all scenes are loaded into memory at the same time through the DataLoader? Or is there something else in the code that might be causing such large memory consumption?

I appreciate your answer, thanks.

Best,
Sinisa

@FangjinhuaWang
Owner

Hi,

I think you are using our default image size (2688x1792)? If so, what are your CUDA and PyTorch versions? I am not an expert on this, but I have observed that using a higher version of PyTorch, e.g. 1.7.1, results in higher memory usage shown in nvidia-smi. I think this may relate to the internal memory management.
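(For reference, the gap between what PyTorch has actually allocated and what nvidia-smi reports can be checked directly; this is a minimal sketch assuming a CUDA-enabled build, and the helper name is illustrative, not part of the repository:)

```python
import torch

def print_gpu_memory(tag: str) -> None:
    # memory_allocated: tensors currently held by PyTorch.
    # memory_reserved: the caching allocator's pool, which is what nvidia-smi
    # sees (plus the CUDA context), so it is usually the larger number.
    allocated = torch.cuda.memory_allocated() / 1024 ** 3
    reserved = torch.cuda.memory_reserved() / 1024 ** 3
    print(f"[{tag}] allocated: {allocated:.3f} GB | reserved: {reserved:.3f} GB")

print_gpu_memory("before forward")
# ... run the model's forward pass here ...
print_gpu_memory("after forward")
```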

@vevenom
Author

vevenom commented Nov 26, 2021

Yes, I am using the default image size. The CUDA version is 11.3 and the PyTorch version is 1.10. I see, that is an interesting observation. I will do some analysis to get a deeper understanding of this and provide an update if I find anything interesting.

Thanks for the quick response.

@anmatako

anmatako commented Dec 7, 2021

@vevenom I noticed in the past that, due to the structure of the eval script, the initialization was invoked 3 times instead of just once (which is strange, since I do not know where this was coming from). However, after the latest merge #57 it should only happen once, so you should be seeing the advertised 5 GB consumption instead of 15 GB. At least that's what I saw on my end when running the older and newer versions.

@atztao

atztao commented Dec 21, 2021

Hey, this is very good work, but could you provide your point cloud results on the DTU validation set?

@anmatako

@atztao I am sharing the new point cloud results along with a text file with the metrics. For comparison, here are the legacy point cloud results with a metrics text file as well.

@atztao

atztao commented Dec 23, 2021

> @atztao I am sharing the new point cloud results along with a text file with the metrics. For comparison, here are the legacy point cloud results with a metrics text file as well.

Thanks, but we don't have an account on my.sharepoint.com.

@anmatako

@vevenom have you tried again with the code after merging PR #57? I'm curious whether you still see increased GPU memory consumption, because that's not something I see on my end now. If the issue still persists, please share some more info on the setup and data you're using; otherwise, if the issue is resolved, feel free to close it.

@vevenom
Author

vevenom commented Jan 4, 2022

Sorry, I was busy with other experiments. I am going to check this in the following week and provide an update if the issue persists.

@vevenom
Author

vevenom commented Jan 5, 2022

Even with #57 and the latest merge, I still observe large memory usage in nvidia-smi. However, the allocated memory never seems to reach this number. I tried to track where in the code the large spike occurs for comparison.

Interestingly, for example for the living_room scene in the ETH3D dataset, there is a spike in reserved memory, but the allocated memory does not change as much, at this line:

res = self.res(self.conv3(torch.cat((deconv, conv0), dim=1)))

torch.cuda.memory_allocated: from 1.961059GB to 2.266113GB
torch.cuda.memory_reserved: from 7.986328GB to 13.800781GB
torch.cuda.max_memory_reserved: from 18.468750GB to 18.468750GB

After separating the function calls on this line, it seems the spike is coming from self.res. Still, when printing the max allocated memory, I get torch.cuda.max_memory_allocated: 11.5GB. Earlier, there is a notable spike in reserved memory when extracting image features, while the allocated memory again barely changes, at this line:

conv1 = self.conv1(self.conv0(x))


torch.cuda.memory_allocated: 0.876828GB to 1.020382GB
torch.cuda.memory_reserved: 0.878906GB to 18.466797GB
torch.cuda.max_memory_reserved: 0.878906GB to 18.466797GB

There are some reports of such behavior for CUDA 11, so I would be interested to see if other users with a similar setup are affected by this issue as well. Again, the CUDA version I am using is 11.3 and the PyTorch version is 1.10.
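(A minimal sketch of this kind of per-line instrumentation, assuming it is inserted around the suspect line inside the module's forward; the reset-then-print placement is illustrative, not code from the repository:)

```python
import torch

# Reset the peak counters right before the suspect line so any spike can be
# attributed to it, then read the stats immediately afterwards.
torch.cuda.reset_peak_memory_stats()

res = self.res(self.conv3(torch.cat((deconv, conv0), dim=1)))  # suspect line

gib = 1024 ** 3
print(f"allocated:     {torch.cuda.memory_allocated() / gib:.3f} GB")
print(f"reserved:      {torch.cuda.memory_reserved() / gib:.3f} GB")
print(f"max allocated: {torch.cuda.max_memory_allocated() / gib:.3f} GB")
print(f"max reserved:  {torch.cuda.max_memory_reserved() / gib:.3f} GB")
```

If the gap between reserved and allocated memory comes from allocator fragmentation, experimenting with the PYTORCH_CUDA_ALLOC_CONF environment variable (e.g. max_split_size_mb, available from PyTorch 1.10) might be worth a try; that is an assumption, not something verified in this thread.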

@hx804722948

The CUDA version I am using is 11.0, and the PyTorch version is 1.7.1.
[image attached]

@anmatako

@hx804722948 Can you please provide more info regarding the dataset and inputs you're using? I think Torch 1.7.1 had an issue with a batch size of 2, and @FangjinhuaWang had to do a workaround to make the training loss work correctly. Running with Torch 1.9.1 there is no issue with the batch size. If what you're seeing is unrelated to the Torch version and/or the batch size, I can try to reproduce your results and see if there's a bug in the code.

@hx804722948

@anmatako thank you, I will try with Torch 1.9.1. I use convert_dtu_dataset and train.py with the following parameters:

train.py --batch_size 2 --epochs 8 --input_folder Y:\converted_dtu --train_list lists/dtu/train.txt --test_list lists/dtu/val.txt --output_folder D:/checkpoints/PatchmatchNet/dtu/Cuda110+torch1.7.1 --num_light_idx 7

It works correctly with CUDA 10.1 + Torch 1.3.0, but I get a NaN loss with Torch 1.7.1; evaluation works correctly with Torch 1.7.1.
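(A minimal sketch of guarding against the NaN loss during training, so the offending iteration can be inspected; model_loss, outputs, and targets are hypothetical placeholders, not names from train.py:)

```python
import torch

# Hypothetical guard inside the training loop: abort as soon as the loss
# becomes non-finite so the offending batch/iteration can be inspected.
loss = model_loss(outputs, targets)  # placeholder for the actual loss call
if not torch.isfinite(loss):
    raise RuntimeError(f"Loss became non-finite ({loss.item()}) at this step")
loss.backward()
```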

@hx804722948

@anmatako thank you very much! CUDA 11.1 + Torch 1.9.1 works correctly.

@anmatako

anmatako commented Feb 2, 2022

@vevenom I will try to reproduce your issue on my end once I get some time, hopefully soon. I have not seen such a discrepancy running on Windows, but maybe I'm missing something in the way I'm monitoring the memory usage. It could be that I'm monitoring allocated vs. reserved memory, and I'll also keep in mind the potential issue with CUDA 11. I'll let you know once I have more on that one.
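(For reference, torch.cuda.memory_summary() gives a one-call overview of the caching allocator's state, which helps tell allocator caching apart from genuine allocation growth; a minimal sketch, assuming device 0:)

```python
import torch

# Prints a table with allocated/reserved/inactive memory per pool, which is
# useful when nvidia-smi and memory_allocated() disagree.
print(torch.cuda.memory_summary(device=0, abbreviated=True))
```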

@ly27253

ly27253 commented Mar 23, 2022

Hello, when I run the BaseEvalMain_web.m file in MATLAB, is there a GPU-accelerated version of the .m file available?

@FangjinhuaWang
Owner

No, we use the original evaluation code from the DTU dataset.

@ly27253

ly27253 commented Mar 23, 2022

> No, we use the original evaluation code from the DTU dataset.

Thank you very much for your reply, and thank you again for your selfless dedication. I sincerely wish you a happy life and all the best.
