Some tests fail on s390x #59
I've never tested on s390x, but looking at the log I suspect this is an MPI issue. The tests pass when they don't use MPI or run only on a single node; as soon as a test tries 2 or more nodes, it fails. Is MPI configured correctly in the test environment? |
The test environment is just the Fedora package; @keszybz would know the details. |
The "mpi environment" is just what the test sets up. The build is done in a dedicated VM, the The test does this:
i.e.
I can try to answer some general questions, but I know nothing about this package and about as much about s390x ;)
|
All MPI tests are failing; the non-MPI tests are passing. The log does not contain details, e.g., the output of the failing tests. Also, for some reason, the tests are running in parallel, which further messes up the ctest output. CMake tells ctest to run everything in series, otherwise we can get really nasty over-subscription of resources. There are multiple tests that take 12 MPI ranks, and each rank may or may not be using multiple threads. How can we get the details from at least one failing test? |
|
#67 should help with the over-subscription issue. |
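Not the actual heffte CMake code, just a minimal sketch (the test name, target, and rank count are placeholders) of how a CMake project can tell ctest to run an MPI test in series and declare how many processors it uses, which avoids over-subscription when ctest runs tests in parallel:

```cmake
# Hypothetical test registration; MPIEXEC_EXECUTABLE and MPIEXEC_NUMPROC_FLAG come from FindMPI.
add_test(NAME test_fft3d_np12
         COMMAND ${MPIEXEC_EXECUTABLE} ${MPIEXEC_NUMPROC_FLAG} 12 $<TARGET_FILE:test_fft3d>)

# RUN_SERIAL keeps ctest from launching other tests alongside this one, and
# PROCESSORS tells ctest how many cores the test consumes under `ctest -j`.
set_tests_properties(test_fft3d_np12 PROPERTIES RUN_SERIAL ON PROCESSORS 12)
```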
@junghans Let me know if the PR helped or if you need anything else; I would like to close this issue before tagging the release. |
I did another build, https://koji.fedoraproject.org/koji/taskinfo?taskID=125159590:
|
Something is wrong with MPI.
Hard to figure this out without hands on the hardware. |
The sequential one, https://koji.fedoraproject.org/koji/taskinfo?taskID=125161367, fails as well:
But I think @mkstoyanov is right, that looks very much like a more fundamental issue in the MPI stack. |
In my experience, the Red Hat family is rather paranoid. There are a bunch of flags about "hardened" and "secure" that I have not used and I don't know what they mean. I wouldn't put it past them to have something in the environment that blocks processes from communicating with MPI. The error message says as much. I don't think I can help here. |
Let me add an MPI hello world to the build and see if that fails, too! |
Hmm, https://koji.fedoraproject.org/koji/taskinfo?taskID=125191181,
|
That'd be me. But I only picked up mpich because nobody else wanted it. I'm not qualified to fix real issues. |
@opoplawski any ideas? Otherwise, I will ping the developer mailing list. |
I can't find the source code within the build logs. In the hello-world example, do you have any code other than MPI init and the print statement? You should add at least a loop that queries and prints each rank in order (see the sketch below).
That will call the method which failed in the heffte logs, and the order of the ranks should be sequential, i.e., 0, 1, 2, 3, ... |
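A minimal sketch of such a check, assuming nothing about the actual Fedora hello-world source (the file name, build command, and rank count are placeholders); the barrier loop is the usual trick to get the ranks to print in order, and it exercises MPI_Comm_rank / MPI_Comm_size the same way the failing heffte tests do:

```cpp
// hello_mpi.cpp -- hypothetical standalone check, not the heffte test code.
// Build: mpicxx hello_mpi.cpp -o hello_mpi ; run: mpirun -np 4 ./hello_mpi
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // rank/size queries are on-node work, no communication
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Print one rank at a time so the output comes out as 0, 1, 2, 3, ...
    for (int i = 0; i < size; i++) {
        if (i == rank)
            std::printf("hello from rank %d of %d\n", rank, size);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```

Note that MPI does not strictly guarantee ordered stdout, but the barrier loop is usually good enough for a quick sanity check.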
Ok, I made it print the source and added the suggested loop as well in https://koji.fedoraproject.org/koji/taskinfo?taskID=125198447 |
Hmm
|
Those are the commands that read the process info: rank, comm-size, etc. They do not require actual communication, only on-node work. The log shows a crash on the second such call. You can play around with send/recv to see how those behave and whether they work properly (a sketch follows below), but there's something wrong with MPI in this environment. |
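A minimal send/recv check along those lines, assuming at least two ranks; this is a generic sketch, not anything from the heffte test suite:

```cpp
// sendrecv_check.cpp -- hypothetical point-to-point check, not heffte code.
// Build: mpicxx sendrecv_check.cpp -o sendrecv_check ; run: mpirun -np 2 ./sendrecv_check
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) std::printf("need at least 2 ranks\n");
        MPI_Finalize();
        return 0;
    }

    int value = 0;
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      // rank 0 sends to rank 1
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              // rank 1 receives from rank 0
        std::printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

If this works but the rank/size queries still crash, that points at the environment rather than point-to-point communication; if even this fails, MPI itself is broken in the build environment.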
Yeah, that will need some deeper investigation. I would just go ahead with v2.4.1 and not wait for this issue. |
From https://koji.fedoraproject.org/koji/taskinfo?taskID=125093717:
Full build log: build_s390x.log.txt.zip
It says v2.4.0, but it is actually c7c8f69. Aarch64, ppc64le and x86_64 work.