GPU out of memory and HDF5 files not readable for crashed simulations #5459
Hi @Tissot11,
Perhaps @atmyers could have a look at that memory report then, once you have it. Your setup seems to be periodic in one direction; for further test runs you could reduce the domain size in that direction and use fewer resources. What is the error you get when you try to read your data?
If, however, you are trying to access the unfinished file itself, then that data might be corrupted. I only know of ADIOS2 being able to produce readable files even if the writing process crashes (and possibly only when certain options are activated).
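(For illustration only, a minimal sketch of switching the openPMD diagnostics from the HDF5 to the ADIOS2 backend in the input deck; "diag1" is a placeholder diagnostics name and the interval is arbitrary, so check the WarpX documentation for your version:)

diagnostics.diags_names = diag1
diag1.intervals = 100         # placeholder output interval
diag1.diag_type = Full
diag1.format = openpmd
diag1.openpmd_backend = bp    # ADIOS2 (BP) files instead of the default HDF5 (h5)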
I attach an out file of a finished simulation. In the AMReX report I somehow see lower memory usage, and at the bottom of the file I see about 600 GB, while the device memory is reported as about 29 GiB. Is the device memory meant for a single GPU? I am using … This is the message I get while reading a file from the crashed simulation:
In my experience with other codes, I can read the HDF5 files of a crashed simulation as well. I have no …
Could you also attach the …?
Sorry, I had already deleted the files of the previous run. But I have just encountered another run that crashed due to a memory error. I attach the err, out, and Backtrace (converted to txt) files.
I see. So the code is running out of memory in AddPlasmaFlux. One thing to note is that, while the total memory across all GPUs may be enough to store your particles, if any one GPU runs out of memory the simulation will crash. Do you have a sense of how many particles are on each box prior to running out of memory? Also, the reason why …
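(As an illustrative aside, not part of the original reply: one hedged way to reduce the chance that a single GPU ends up holding far more particles than the others is to use smaller boxes together with WarpX's load balancing; parameter names follow the WarpX documentation, values are placeholders:)

amr.max_grid_size = 64                       # smaller boxes give a finer-grained distribution across GPUs
algo.load_balance_intervals = 100            # redistribute boxes every 100 steps (placeholder interval)
algo.load_balance_costs_update = Heuristic   # estimate per-box cost from cell and particle counts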
OK, this would be great for avoiding these memory errors. This was also my observation, and I asked in #5131 about using these two mechanisms for particle injection and the memory usage in each case. To be honest, I probably cannot calculate the number of particles in each box during the simulation runtime.
Any idea when this would be implemented in WarpX?
@Tissot11, just out of curiosity: if it is feasible and you have the resources, what happens when you run your setup in WarpX's CPU-only mode and compare with your previous results?
Maybe you would not want to run the full simulation on CPUs alone, and the speed advantage of GPUs is why you chose WarpX. I am just generally curious whether the WarpX CPU mode could produce first results for you to compare against what you already have, or whether there are general roadblocks.
I can try WarpX on CPUs for testing, but you're right that GPU acceleration is the reason for switching to WarpX. I guess the problem probably lies with the boundary conditions and the injector itself. I have tried continuous injection, and I would like to try the NFlux method. I will probably try it tomorrow with CPU runs and see if I can reproduce the results.
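(For reference, a hedged sketch of what flux-based injection looks like in a WarpX input deck; the parameter names follow the NFluxPerCell documentation, while the species name "electrons", the axis, and all numbers are placeholders, and the momentum distribution and other species settings are omitted:)

electrons.injection_style = NFluxPerCell
electrons.num_particles_per_cell = 2      # particles injected per cell per time step (placeholder)
electrons.flux_normal_axis = x            # inject through a plane normal to x
electrons.flux_direction = +1             # inject towards positive x
electrons.surface_flux_pos = 0.0          # position of the injection plane along that axis (placeholder)
electrons.flux_profile = constant
electrons.flux = 1.e12                    # placeholder flux value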
I tried running the job on 384 GPUs to avoid the memory error. I attach Backtrace.226.txt. It seems that I cannot run my jobs, since there seem to be several issues involved with …
Using the same input deck as in #5131, I could not finish a simulation because the GPU ran out of memory. Setting
amrex.abort_on_out_of_gpu_memory = 0
did not help. However, the stdout file generated by WarpX reports significantly lower memory usage than the memory available. I attach the err and out files. You can see in the out file that WarpX reports only about 650 GB of memory usage, which is far lower than the total memory of 32 GPUs with 40 GB each.
errWarpX-2735006.txt
outWarpX-2735006.txt
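(As an aside, not from the original report: a hedged sketch of AMReX memory options that are sometimes tried together with the flag above; whether managed-memory oversubscription actually helps is machine- and version-dependent, so treat this only as a starting point:)

amrex.abort_on_out_of_gpu_memory = 0   # as above: do not abort as soon as the device arena would be exhausted
amrex.the_arena_is_managed = 1         # let the main arena use managed (unified) memory, which can oversubscribe the GPU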
Since this simulation did not finish, I tried reading the data using the openPMD time series, but it cannot read the files. Is this expected? In my experience with other codes, I am able to read whatever data was written before a crash. Do I need to compile HDF5 with some other flags?