
GPU out of memory and hdf5 file not read for the crashed simulations #5459

Open
Tissot11 opened this issue Nov 14, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@Tissot11

Using the same input deck as in #5131, I could not finish a simulation because the GPUs ran out of memory. Setting amrex.abort_on_out_of_gpu_memory = 0 did not help. However, the stdout file generated by WarpX reports significantly lower memory usage than the memory available. I attach the err and std files. You can see in the out file that WarpX reports only 650 GB of memory usage, which is far lower than the total memory of 32 GPUs with 40 GB each.

errWarpX-2735006.txt
outWarpX-2735006.txt

Since this simulation did not finish, I tried reading the data using the OpenPMD time series, but it cannot read the files. Is this expected? In my experience with other codes, I should be able to read whatever data was written before the simulation crashed. Do I need to compile HDF5 with some other flags?

@Tissot11 Tissot11 added the bug Something isn't working label Nov 14, 2024
@n01r
Member

n01r commented Nov 14, 2024

Hi @Tissot11,

  • Could you run a test simulation only until step 6000 (assuming that it would always crash shortly after that)? Then you will get an AMReX report on memory usage.

Perhaps @atmyers could have a look at this memory report then, once you have it.

Your problem seems to be periodic in one direction. For further test runs you could reduce the size in that direction and use fewer resources.

What is the error when you are trying to read your data?
The OpenPMDTimeSeries may fail to initialize if there is an unfinished file (i.e., the crash happened while the file was still being written). Although, if I am not mistaken, that was fixed and should now only give a warning. Right, @RemiLehe?

  • @Tissot11, please post which version of openpmd_viewer you are using.

If, however, you are trying to access an unfinished file itself, then that data might be corrupted. I only know of ADIOS2 being able to produce readable files even if the writing process crashes (and possibly only when certain options are activated).
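
For reference, here is a minimal sketch (the diagnostics directory is a placeholder) of how you could report the openpmd_viewer version and surface the underlying read error:

```python
# Minimal sketch: report the openpmd_viewer version and show the underlying
# backend error. The diagnostics directory below is a placeholder.
import openpmd_viewer
from openpmd_viewer import OpenPMDTimeSeries

print("openpmd_viewer version:", openpmd_viewer.__version__)

try:
    ts = OpenPMDTimeSeries("./DIAGS")  # directory containing the openPMD output
    print("Available iterations:", ts.iterations)
except Exception as exc:
    # The HDF5 backend error is usually included in the exception message.
    print("Failed to open the time series:", exc)
```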

@Tissot11
Author

I attach an out file of a finished simulation. In the AMReX report I somehow see lower memory usage, and at the bottom of the file I see about 600 GB, while the device memory is about 29 GiB. Is the device memory reported for a single GPU?

outWarpX-2757117.txt

I am using openpmd_viewer version 1.10.0, installed from conda-forge.

This is the message I get when reading a file from the crashed simulation:

```
Error: Read Error in backend HDF5
Object type:	File
Error type:	Inaccessible
Further description:	Failed to open HDF5 file /home/hk-project-obliques/hd_ff296/WarpXSimulations/2D-MA30_MS33_th75_rT216/d25_mi100-IP-NFlux/DIAGS/openpmd.h5


HDF5-DIAG: Error detected in HDF5 (1.12.2) thread 0:
  #000: H5F.c line 620 in H5Fopen(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #001: H5VLcallback.c line 3501 in H5VL_file_open(): failed to iterate over available VOL connector plugins
    major: Virtual Object Layer
    minor: Iteration failed
  #002: H5PLpath.c line 578 in H5PL__path_table_iterate(): can't iterate over plugins in plugin path '(null)'
    major: Plugin for dynamically loaded library
    minor: Iteration failed
  #003: H5PLpath.c line 620 in H5PL__path_table_iterate_process_path(): can't open directory: /usr/local/hdf5/lib/plugin
    major: Plugin for dynamically loaded library
    minor: Can't open directory or file
  #004: H5VLcallback.c line 3351 in H5VL__file_open(): open failed
    major: Virtual Object Layer
    minor: Can't open object
  #005: H5VLnative_file.c line 97 in H5VL__native_file_open(): unable to open file
    major: File accessibility
    minor: Unable to open file
  #006: H5Fint.c line 1990 in H5F_open(): unable to read superblock
    major: File accessibility
    minor: Read failed
  #007: H5Fsuper.c line 405 in H5F__super_read(): file signature not found
    major: File accessibility
    minor: Not an HDF5 file
[AbstractIOHandlerImpl] IO Task OPEN_FILE failed with exception. Clearing IO queue and passing on the exception.
```

In my experience with other codes, I can read the HDF5 files of a crashed simulation as well. I have only HDF5 configured with WarpX, no ADIOS2.
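
For what it is worth, "file signature not found" usually means the file was truncated before the HDF5 header was written. A minimal sketch (directory and glob pattern are placeholders) for checking which of the written files are structurally valid HDF5:

```python
# Check which output files carry a valid HDF5 signature; the directory and
# file pattern are placeholders for the actual diagnostics path.
import glob
import h5py

for fname in sorted(glob.glob("DIAGS/*.h5")):
    if h5py.is_hdf5(fname):  # True only if the HDF5 signature is present
        with h5py.File(fname, "r") as f:
            print(fname, "readable, top-level groups:", list(f.keys()))
    else:
        print(fname, "not a valid HDF5 file (likely truncated during the crash)")
```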

@atmyers
Member

atmyers commented Nov 15, 2024

Could you also attach the Backtrace.68 file, referred to in errWarpX-2735006.txt?

@Tissot11
Author

Sorry, I had deleted the files from the previous run. But I have just encountered another run that crashed due to a memory error. I attach the err, out, and Backtrace (converted to txt) files.

errWarpX-2759955.txt
outWarpX-2759955.txt
Backtrace.42.txt

@atmyers
Member

atmyers commented Nov 15, 2024

I see. So the code is running out of memory in AddPlasmaFlux.

One thing is that, while the total memory across all GPUs may be enough to store your particles, if any one GPU runs out of memory the simulation will crash. Do you have a sense of how many particles are on each box prior to running out of memory?

Also, the reason why AddPlasmaFlux takes more memory than AddPlasma is that we add the particles to a temporary container in AddPlasmaFlux, thus briefly doubling the newly-added ones in memory. Perhaps we could add an option to clear them from the tmp container on the fly.
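
To make the per-GPU constraint concrete, here is a rough back-of-the-envelope sketch; every number in it is an illustrative assumption, not a measurement from this run:

```python
# Illustrative estimate of per-GPU particle memory around an injection step.
# All numbers are assumptions for the sake of the example.
existing = 200_000_000          # hypothetical particles already on the busiest GPU
newly_injected = 50_000_000     # hypothetical particles injected in one step
bytes_per_particle = 80         # assumed footprint: positions, momenta, weight, id, ...
gpu_memory = 40 * 1024**3       # 40 GB device

# After injection the new particles live in the main container ...
after_injection = (existing + newly_injected) * bytes_per_particle
# ... but during AddPlasmaFlux they also sit in a temporary container,
# so the newly injected ones are briefly counted twice at the peak.
peak = after_injection + newly_injected * bytes_per_particle

print(f"after injection: {after_injection / 1024**3:.1f} GiB")
print(f"peak during injection: {peak / 1024**3:.1f} GiB of {gpu_memory / 1024**3:.0f} GiB")
```

The point being: even if the sum over all 32 GPUs looks comfortable, what has to fit is the busiest GPU's particles plus the temporary copy of its newly injected ones, within that single GPU's 40 GB.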

@Tissot11
Author

Tissot11 commented Nov 15, 2024

OK, this would be great for avoiding these memory errors. This was also my observation, and I asked in #5131 about using these two particle-injection mechanisms and the memory usage of each. To be honest, I probably cannot calculate the number of particles in each box during the simulation runtime.

@Tissot11
Author

Any idea when this would be implemented in WarpX?

@n01r
Member

n01r commented Nov 25, 2024

@Tissot11, just out of curiosity: if feasible and you have the resources, what happens when you run your setup in WarpX's CPU-only mode and compare with your previous results?

@n01r
Member

n01r commented Nov 25, 2024

Maybe you would not want to run the full simulation with just CPUs, and the speed advantage of GPUs is probably why you chose WarpX. I am just generally curious whether the WarpX CPU mode could produce first results that you could compare to what you already have, or whether there are general roadblocks.

@Tissot11
Author

I can try WarpX on CPU for testing, but you're right that GPU acceleration is the reason for switching to WarpX. I guess the problem probably lies with the boundary conditions and the injector itself. I have tried continuous injection and would like to try the NFlux method. I will probably try it tomorrow with CPU runs and see if I can reproduce the results.

@Tissot11
Author

I tried running the job on 384 GPUs to avoid the out-of-memory errors. Yet the job still crashed, this time not because of out-of-memory errors but something else. I attach the out, err, and Backtrace files.

Backtrace.226.txt
errWarpX-2796001.txt
outWarpX-2796001.txt

It seems that I cannot run my jobs, since there seem to be several issues involving NFluxPerCell, openpmd_viewer, and also the boundary conditions. Please let me know whether these issues are a priority for you and fixable in the near future. In that case I would be interested in providing feedback and would also seriously consider using WarpX for my research work.
