Problem with non-MPICH MPI #71
This issue seems not directly related to IGG, but rather to not being able to hook into the system MPI. Did you try a plain MPI.jl example without the jll-shipped MPICH?
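For reference, a minimal sketch (not from the thread) of how MPI.jl can be pointed at a system MPI via MPIPreferences.jl; which library gets detected depends on the local installation:

```julia
# Run once in the project environment to switch MPI.jl from the jll-shipped
# MPICH to a system-provided MPI (e.g. OpenMPI). The choice is recorded in
# LocalPreferences.toml; the library is located on the usual search paths.
using MPIPreferences
MPIPreferences.use_system_binary()
```

A plain MPI.jl hello world launched with the system `mpirun` can then be used to check that the system library is actually picked up.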
Thanks @luraess, it looks like you're right: when using the system OpenMPI, a plain MPI.jl example runs fine. So it looks like the whole chain of dependencies can work fine with the locally installed OpenMPI version. However, in our real use-case the error still occurs.
PS: Not sure whether it's related, but we also have to add the ... (It's not really an issue for us, but I'm not sure whether that warrants an update of the example in the README.)
What is the error message?
Are there any differences in how you call it?
That should not be the case: we will sort this out rather than adjusting the README 👍 Thanks for letting us know!
Each process writes an error looking like:

Some ...

The line above works when run in the context of the 50-lines multi-xPU example in the exact same environment.
Are you maybe calling ...?
In all cases, I'm running the computation using

```
$ mpirun -n 8 julia --project -t 1 my_script.jl
```

That would mean there is only one thread per process, right?

Also, both in the (working) ParallelStencil example and the (non-working) real code, the ParallelStencil initialization is the same:

```julia
@init_parallel_stencil(Threads, Float64, 3)
```
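As a quick sanity check (my illustration, not from the thread), a tiny standalone script run with the same `mpirun` command should confirm that each rank indeed gets a single Julia thread; `check_threads.jl` is a hypothetical file name:

```julia
# check_threads.jl: report the MPI rank and the number of Julia threads
# available to this process.
using MPI
MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)
println("rank $rank has $(Threads.nthreads()) Julia thread(s)")
MPI.Finalize()
```

Launched with `mpirun -n 8 julia --project -t 1 check_threads.jl`, each of the 8 ranks should report 1 thread.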
Actually, it might be that you are running more processes than you have GPUs and the error occurs rightfully. If your GPUs are distributed over multiple nodes, then you can solve this, for example, with a hostfile (see the example below).
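For illustration (assuming OpenMPI's `mpirun`; the hostnames and slot counts below are placeholders), a hostfile spreading 8 processes over two nodes could look like:

```
# hostfile: two example nodes providing 4 MPI slots each
node01 slots=4
node02 slots=4
```

It would then be passed as `mpirun --hostfile hostfile -n 8 julia --project -t 1 my_script.jl`.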
Can you call ...?

Also don't forget: ...
Yes, if I do this the same error occurs in ...
Sorry, I had forgotten about it. But no: no ...
Ah, looking at it more closely, it seems that ...

In any case, this definitely sounds like (1) this issue doesn't have much (if anything) to do with ...

Many many thanks! We'll report back if/when we find what happens, in case this could help other users.
Seems the ... If you manage to make an MWE or reproducer, then we could also look further into it if needed. And if it turns out to be an issue not related to IGG, maybe posting it on Discourse could help other users know about it.
OK, I think I found the issue. It is a "long distance" bug that has nothing to do with either ImplicitGlobalGrid or ParallelStencil:

```julia
using MPI
# using HDF5   # <-- Uncomment this and `MPI.Init()` breaks when using system-wide MPI
MPI.Init()
comm = MPI.COMM_WORLD
print("Hello world, I am rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))\n")
MPI.Barrier(comm)
```

It seems we're hitting JuliaIO/HDF5.jl#1079, which is due to recent ...

In any case, thank you for your time, helping us find the origin of the problem!
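As a side note (my addition, hedged): to see which MPI library and ABI a Julia session has actually picked up, which is useful when a jll such as HDF5_jll drags in a different `libmpi`, the following diagnostic calls from MPIPreferences.jl and MPI.jl should help:

```julia
# Print the MPI selection recorded in the preferences and details about the
# MPI library that MPI.jl has actually loaded at runtime.
using MPIPreferences, MPI
@show MPIPreferences.binary     # e.g. "system" or "MPICH_jll"
MPI.versioninfo()               # prints the loaded MPI library and its version
```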
Yes, it might very well be that this error occurs rightfully and we simply do not use IGG in the correct way. In this specific instance, for example, our objective was to test that everything works first in a multi-CPU context. So we're launching a few MPI processes on one compute node (that happens to have a GPU, but we don't really want to use it, only the CPUs). To be clear, IGG is right to say that there are more MPI processes than GPUs, but this is also precisely the situation we want to be in. What would be the correct way to do this? Should we use a hostfile rather than the plain `mpirun -n` invocation above?
Okay, then this means that the error occurs rightfully, and it is correct usage to deactivate device selection by setting the keyword `select_device=false` (or `device_type="none"`) in `init_global_grid` (see the sketch after this reply).
No, a hostfile would only be something to use if you are trying to run multiple processes on different nodes but, due to the way you're launching your commands, they are not placed on the nodes as you wish.
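For concreteness, a minimal sketch of the deactivated device selection mentioned above (grid sizes are placeholders; the keyword names are the ones discussed later in this thread):

```julia
# Sketch: set up the global grid for a multi-CPU run while ignoring any GPUs
# present on the node, by disabling IGG's automatic device selection.
using ImplicitGlobalGrid
nx, ny, nz = 256, 256, 256
init_global_grid(nx, ny, nz; select_device=false)   # alternatively: device_type="none"
# ... computation on the local CPU subdomains ...
finalize_global_grid()
```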
Thanks for your answer!
Yes, that's a very good idea. I have to admit I was at first a bit puzzled that there wasn't any option available for ...

As I was writing above, I tend to think that a simple comment in the 50-lines multi-xPU example might help future users find this option more easily. Maybe simply something along the lines of:

```julia
# Numerics
nx, ny, nz = 256, 256, 256;   # Number of gridpoints in dimensions x, y and z
nt         = 100;             # Number of time steps
# Add the select_device=false / device_type="none" kwarg below
# to ignore GPUs in a multi-CPU configuration
init_global_grid(nx, ny, nz);
dx = lx/(nx_g()-1);           # Space step in x-dimension
dy = ly/(ny_g()-1);           # Space step in y-dimension
dz = lz/(nz_g()-1);           # Space step in z-dimension
```
Hello,
I use ImplicitGlobalGrid and ParallelStencil for my simulation, and everything is fine when I use the default MPI (MPICH).
This does not work properly when OpenMPI is selected via MPIPreferences (MPI.jl itself seems to run OK, but ImplicitGlobalGrid complains about a mismatching MPI version and runs the same mono-process problem N times).
Are there particular settings I should use with ImplicitGlobalGrid.jl so that it picks up the proper MPI library?
julia 1.9.3