GPU memory consumption #58

Open
vevenom opened this issue Nov 26, 2021 · 17 comments

@vevenom

vevenom commented Nov 26, 2021

Hi,

When running PatchmatchNet on the ETH3D dataset through eval.py, I end up using 15 GB of GPU memory, while the paper reports 5529 MB. Could it be that all images for all scenes are loaded into memory at the same time through the DataLoader? Or is there something else in the code that might be causing such large memory consumption?

I appreciate your answer, thanks.

Best,
Sinisa

@FangjinhuaWang
Owner

Hi,

I think you are using our default image size (2688x1792)? If so, what are your CUDA and PyTorch versions? I am not an expert on this, but I have observed that using a higher version of PyTorch, e.g. 1.7.1, results in higher memory usage shown in nvidia-smi. I think this may relate to the internal memory management.
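(For reference, the gap between what PyTorch has actually allocated and what nvidia-smi reports can be checked directly; this is a minimal sketch assuming a CUDA-enabled build, and the helper name is illustrative, not part of the repository:)

```python
import torch

def print_gpu_memory(tag: str) -> None:
    # memory_allocated: tensors currently held by PyTorch.
    # memory_reserved: the caching allocator's pool, which is what nvidia-smi
    # sees (plus the CUDA context), so it is usually the larger number.
    allocated = torch.cuda.memory_allocated() / 1024 ** 3
    reserved = torch.cuda.memory_reserved() / 1024 ** 3
    print(f"[{tag}] allocated: {allocated:.3f} GB | reserved: {reserved:.3f} GB")

print_gpu_memory("before forward")
# ... run the model's forward pass here ...
print_gpu_memory("after forward")
```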

@vevenom
Author

vevenom commented Nov 26, 2021

Yes, I am using the default image size. The CUDA version is 11.3 and the PyTorch version is 1.10. I see, that is an interesting observation. I will do some analysis to get a deeper understanding of this and provide an update if I find anything interesting.

Thanks for the quick response.

@anmatako

anmatako commented Dec 7, 2021

@vevenom I noticed in the past that, due to the structure of the eval script, the initialization was invoked 3 times instead of just once (which is strange, since I do not know where this was coming from). However, after the latest merge #57 it should only happen once, so you should be seeing the advertised 5 GB consumption instead of 15 GB. At least that's what I saw on my end when running the older and newer versions.

@atztao

atztao commented Dec 21, 2021

Hey, this is very good work, but could you provide your point cloud results on the DTU validation set?

@anmatako

@atztao I am sharing the new point cloud results along with a text file with the metrics. For comparison, here are the legacy point cloud results with a metrics text file as well.

@atztao

atztao commented Dec 23, 2021

> @atztao I am sharing the new point cloud results along with a text file with the metrics. For comparison, here are the legacy point cloud results with a metrics text file as well.

Thanks, but we don't have an account on my.sharepoint.com.

@anmatako

@vevenom have you tried again with the code after merging PR #57? I'm curious whether you still see increased GPU memory consumption, because that's not something I see on my end now. If the issue still persists, please share some more info on the setup and data you're using; otherwise, if the issue is resolved, feel free to close it.

@vevenom
Author

vevenom commented Jan 4, 2022

Sorry, I was busy with other experiments. I am going to check this in the following week and provide an update if the issue persists.

@vevenom
Author

vevenom commented Jan 5, 2022

Even with #57 and the latest merge, I still observe large memory usage in nvidia-smi. However, the allocated memory never seems to reach this number. I tried to track where in the code the large spike occurs for comparison.

Interestingly, for example for the living_room scene in the ETH3D dataset, there is a spike in reserved memory, but the allocated memory does not change as much, at this line:

res = self.res(self.conv3(torch.cat((deconv, conv0), dim=1)))

torch.cuda.memory_allocated: from 1.961059GB to 2.266113GB
torch.cuda.memory_reserved: from 7.986328GB to 13.800781GB
torch.cuda.max_memory_reserved: from 18.468750GB to 18.468750GB

After separating the function calls on this line, it seems the spike is coming from self.res. Still, when printing the max allocated memory, I get torch.cuda.max_memory_allocated: 11.5GB. Earlier, there is a notable spike in reserved memory when extracting image features, while the allocated memory again barely changes, at this line:

conv1 = self.conv1(self.conv0(x))


torch.cuda.memory_allocated: 0.876828GB to 1.020382GB
torch.cuda.memory_reserved: 0.878906GB to 18.466797GB
torch.cuda.max_memory_reserved: 0.878906GB to 18.466797GB

There are some reports of such behavior for CUDA 11, so I would be interested to see if other users with a similar setup are affected by this issue as well. Again, the CUDA version I am using is 11.3 and the PyTorch version is 1.10.
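(A minimal sketch of this kind of per-line instrumentation, assuming it is inserted around the suspect line inside the module's forward; the reset-then-print placement is illustrative, not code from the repository:)

```python
import torch

# Reset the peak counters right before the suspect line so any spike can be
# attributed to it, then read the stats immediately afterwards.
torch.cuda.reset_peak_memory_stats()

res = self.res(self.conv3(torch.cat((deconv, conv0), dim=1)))  # suspect line

gib = 1024 ** 3
print(f"allocated:     {torch.cuda.memory_allocated() / gib:.3f} GB")
print(f"reserved:      {torch.cuda.memory_reserved() / gib:.3f} GB")
print(f"max allocated: {torch.cuda.max_memory_allocated() / gib:.3f} GB")
print(f"max reserved:  {torch.cuda.max_memory_reserved() / gib:.3f} GB")
```

If the gap between reserved and allocated memory comes from allocator fragmentation, experimenting with the PYTORCH_CUDA_ALLOC_CONF environment variable (e.g. max_split_size_mb, available from PyTorch 1.10) might be worth a try; that is an assumption, not something verified in this thread.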

@hx804722948

The CUDA version I am using is 11.0, and the PyTorch version is 1.7.1.
[image attached]

@anmatako

@hx804722948 Can you please provide more info regarding the dataset and inputs you're using? I think Torch 1.7.1 had an issue with a batch size of 2, and @FangjinhuaWang had to do a workaround to make the training loss work correctly. Running with Torch 1.9.1 there is no issue with the batch size. If what you're seeing is unrelated to the Torch version and/or the batch size, I can try to reproduce your results and see if there's a bug in the code.

@hx804722948

@anmatako thank you, I will try with Torch 1.9.1. I use convert_dtu_dataset and train.py with the following parameters:

train.py --batch_size 2 --epochs 8 --input_folder Y:\converted_dtu --train_list lists/dtu/train.txt --test_list lists/dtu/val.txt --output_folder D:/checkpoints/PatchmatchNet/dtu/Cuda110+torch1.7.1 --num_light_idx 7

It works correctly with CUDA 10.1 + Torch 1.3.0, but I get a NaN loss with Torch 1.7.1; evaluation works correctly with Torch 1.7.1.
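(A minimal sketch of guarding against the NaN loss during training, so the offending iteration can be inspected; model_loss, outputs, and targets are hypothetical placeholders, not names from train.py:)

```python
import torch

# Hypothetical guard inside the training loop: abort as soon as the loss
# becomes non-finite so the offending batch/iteration can be inspected.
loss = model_loss(outputs, targets)  # placeholder for the actual loss call
if not torch.isfinite(loss):
    raise RuntimeError(f"Loss became non-finite ({loss.item()}) at this step")
loss.backward()
```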

@hx804722948

@anmatako thank you very much! CUDA 11.1 + Torch 1.9.1 works correctly.

@anmatako

anmatako commented Feb 2, 2022

@vevenom I will try to reproduce your issue on my end once I get some time, hopefully soon. I have not seen such a discrepancy running on Windows, but maybe I'm missing something in the way I'm monitoring the memory usage. It could be that I'm monitoring allocated vs. reserved memory, and I'll also keep in mind the potential issue with CUDA 11. I'll let you know once I have more on that one.
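(For reference, torch.cuda.memory_summary() gives a one-call overview of the caching allocator's state, which helps tell allocator caching apart from genuine allocation growth; a minimal sketch, assuming device 0:)

```python
import torch

# Prints a table with allocated/reserved/inactive memory per pool, which is
# useful when nvidia-smi and memory_allocated() disagree.
print(torch.cuda.memory_summary(device=0, abbreviated=True))
```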

@ly27253

ly27253 commented Mar 23, 2022

Hello, when I run the BaseEvalMain_web.m file in MATLAB, is there a GPU-accelerated version of the .m file available?

@FangjinhuaWang
Owner

No, we use the original evaluation code from the DTU dataset.

@ly27253

ly27253 commented Mar 23, 2022

> No, we use the original evaluation code from the DTU dataset.

Thank you very much for your reply, and thank you again for your selfless dedication. I sincerely wish you a happy life and all the best.
