
Some tests fail on s390x #59

Open · junghans opened this issue Oct 22, 2024 · 21 comments

@junghans (Contributor) commented Oct 22, 2024

From https://koji.fedoraproject.org/koji/taskinfo?taskID=125093717:

25/25 Test #25: heffte_longlong_np4 ..............***Failed  375.02 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
     float                  -np 4  test int/long long<stock>              pass
24% tests passed, 19 tests failed out of 25
Total Test time (real) = 1354.20 sec
The following tests FAILED:
	  4 - heffte_reshape3d_np4 (Failed)
	  5 - heffte_reshape3d_np7 (Failed)
	  6 - heffte_reshape3d_np12 (Failed)
	  8 - heffte_fft3d_np2 (Failed)
	  9 - heffte_fft3d_np4 (Failed)
	 10 - heffte_fft3d_np6 (Failed)
	 11 - heffte_fft3d_np8 (Failed)
	 12 - heffte_fft3d_np12 (Failed)
	 13 - heffte_streams_np6 (Failed)
	 14 - test_subcomm_np8 (Failed)
	 15 - test_subcomm_np12 (Failed)
	 17 - heffte_fft3d_r2c_np2 (Failed)
	 18 - heffte_fft2d_r2c_np4 (Failed)
	 19 - heffte_fft3d_r2c_np6 (Failed)
	 20 - heffte_fft3d_r2c_np8 (Failed)
	 21 - heffte_fft3d_r2c_np12 (Failed)
	 23 - test_cos_np2 (Failed)
	 24 - test_cos_np4 (Failed)
	 25 - heffte_longlong_np4 (Failed)
Errors while running CTest
error: Bad exit status from /var/tmp/rpm-tmp.kyrg4o (%check)
RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.kyrg4o (%check)
Child return code was: 1

Full build log: build_s390x.log.txt.zip

It says v2.4.0, but it is actually commit c7c8f69.

Aarch64, ppc64le and x86_64 work.

@mkstoyanov (Collaborator) commented Oct 22, 2024

I've never tested on s390x, but looking at the log I suspect this is an MPI issue. The tests pass when they don't use MPI or run on only a single rank; as soon as a test uses 2 or more ranks, it fails.

Is MPI configured correctly in the test environment?

@junghans (Contributor, Author)

The test environment is just the Fedora package; @keszybz would know the details.

@mkstoyanov mentioned this issue Oct 23, 2024

@keszybz commented Oct 23, 2024

The "mpi environment" is just what the test sets up. The build is done in a dedicated VM, the hw_info.log file linked from the build describes the machine.

The test does this:

%check
# allow openmpi to oversubscribe, i.e. runs test with more
# cores than the builder has
export PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe

for mpi in mpich openmpi; do
  test -n "${mpi}" && module load mpi/${mpi}-%{_arch}
  %ctest
  test -n "${mpi}" && module unload mpi/${mpi}-%{_arch}
done

i.e., expanded:

export PRTE_MCA_rmaps_default_mapping_policy=:oversubscribe

for mpi in mpich openmpi; do
  test -n "${mpi}" && module load mpi/${mpi}-x86_64  
  /usr/bin/ctest --test-dir "redhat-linux-build" \
           --output-on-failure \
           --force-new-ctest-process \
            -j${RPM_BUILD_NCPUS} 
  test -n "${mpi}" && module unload mpi/${mpi}-x86_64
done

I can try to answer some general questions, but I know nothing about this package and about as much about s390x ;)

CPU info:
Architecture:                         s390x
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Big Endian
CPU(s):                               3
On-line CPU(s) list:                  0-2
Vendor ID:                            IBM/S390
Model name:                           -
Machine type:                         3931
Thread(s) per core:                   1
Core(s) per socket:                   1
Socket(s) per book:                   1
Book(s) per drawer:                   1
Drawer(s):                            3
CPU dynamic MHz:                      5200
CPU static MHz:                       5200
BogoMIPS:                             3331.00
Dispatching mode:                     horizontal
Flags:                                esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx vxd vxe gs vxe2 vxp sort dflt vxp2 nnpa sie
Hypervisor:                           KVM/Linux
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            384 KiB (3 instances)
L1i cache:                            384 KiB (3 instances)
L2 cache:                             96 MiB (3 instances)
L3 cache:                             256 MiB
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-2

@mkstoyanov (Collaborator)

All MPI tests are failing and the non-MPI tests are passing. The log does not contain details, e.g., the output of ctest -V.

Also, for some reason, the tests are running in parallel, which further messes up the ctest output. CMake tells ctest to run everything in series; otherwise we can get really nasty oversubscription of resources. There are multiple tests that take 12 MPI ranks, and each rank may or may not be using multiple threads.

How can we get the details from at least one failing test, e.g., test_reshape3d running with 4 ranks?

@junghans (Contributor, Author)

--output-on-failure should be the same as -V on error, but let me add -j1 to the ctest call.

@junghans (Contributor, Author)

#67 should help with the oversubscription issue.

@mkstoyanov (Collaborator)

@junghans Let me know if the PR helped or if you need to use -D Heffte_SEQUENTIAL_TESTING

I would like to close this issue before tagging the release.

@junghans (Contributor, Author)

I did another build, https://koji.fedoraproject.org/koji/taskinfo?taskID=125159590:

Test project /builddir/build/BUILD/heffte-2.4.0-build/heffte-master/s390x-redhat-linux-gnu-mpich
      Start  1: heffte_fortran_fftw
      Start  2: unit_tests_nompi
 1/25 Test  #1: heffte_fortran_fftw ..............   Passed    0.08 sec
      Start  3: unit_tests_stock
      Start  7: heffte_fft3d_np1
 2/25 Test  #3: unit_tests_stock .................   Passed    0.00 sec
      Start 16: heffte_fft3d_r2c_np1
 3/25 Test #16: heffte_fft3d_r2c_np1 .............   Passed    0.08 sec
      Start 22: test_cos_np1
 4/25 Test  #7: heffte_fft3d_np1 .................   Passed    0.13 sec
 5/25 Test #22: test_cos_np1 .....................   Passed    0.08 sec
      Start  8: heffte_fft3d_np2
 6/25 Test  #8: heffte_fft3d_np2 .................***Failed    0.08 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
                            constructor heffte::fft3d<stock>              pass
Abort(676410127) on node 1 (rank 1 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................: 
MPIDI_Barrier_allcomm_composition_json(132): 
MPIDI_POSIX_mpi_bcast(238).................: 
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
      Start 17: heffte_fft3d_r2c_np2
 7/25 Test  #2: unit_tests_nompi .................   Passed    0.38 sec
 8/25 Test #17: heffte_fft3d_r2c_np2 .............***Failed    0.14 sec
--------------------------------------------------------------------------------
                             heffte::fft_r2c class
--------------------------------------------------------------------------------
                        constructor heffte::fft3d_r2c<stock>              pass
Abort(676410127) on node 1 (rank 1 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................: 
MPIDI_Barrier_allcomm_composition_json(132): 
MPIDI_POSIX_mpi_bcast(238).................: 
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
      Start  4: heffte_reshape3d_np4
 9/25 Test  #4: heffte_reshape3d_np4 .............***Failed    0.75 sec
--------------------------------------------------------------------------------
                             heffte reshape methods
--------------------------------------------------------------------------------
                                   heffte::mpi::gather_boxes              pass
      Start  5: heffte_reshape3d_np7
10/25 Test  #5: heffte_reshape3d_np7 .............***Failed    1.70 sec
--------------------------------------------------------------------------------
                             heffte reshape methods
--------------------------------------------------------------------------------
                                   heffte::mpi::gather_boxes              pass
     float         -np 7  heffte::reshape3d_alltoall all-2-1              pass
      Start  6: heffte_reshape3d_np12
11/25 Test  #6: heffte_reshape3d_np12 ............***Failed    3.13 sec
--------------------------------------------------------------------------------
                             heffte reshape methods
--------------------------------------------------------------------------------
                                   heffte::mpi::gather_boxes              pass
      Start  9: heffte_fft3d_np4
12/25 Test  #9: heffte_fft3d_np4 .................***Failed    8.16 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
  ccomplex                  -np 4  test heffte::fft2d<stock>              pass
      Start 10: heffte_fft3d_np6
13/25 Test #10: heffte_fft3d_np6 .................***Failed   25.74 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
     float                  -np 6  test heffte::fft3d<stock>              pass
      Start 11: heffte_fft3d_np8
14/25 Test #11: heffte_fft3d_np8 .................***Failed   45.90 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
     float                  -np 8  test heffte::fft3d<stock>              pass
      Start 12: heffte_fft3d_np12
15/25 Test #12: heffte_fft3d_np12 ................***Failed   58.30 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
     float                 -np 12  test heffte::fft3d<stock>              pass
      Start 13: heffte_streams_np6
16/25 Test #13: heffte_streams_np6 ...............***Failed   17.86 sec
--------------------------------------------------------------------------------
                              heffte::fft streams
--------------------------------------------------------------------------------
  ccomplex         -np 6  test heffte::fft3d (stream)<stock>              pass
Abort(676410127) on node 3 (rank 3 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................: 
MPIDI_Barrier_allcomm_composition_json(132): 
MPIDI_POSIX_mpi_bcast(238).................: 
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
      Start 14: test_subcomm_np8
17/25 Test #14: test_subcomm_np8 .................***Failed   76.87 sec
--------------------------------------------------------------------------------
                          heffte::fft subcommunicators
--------------------------------------------------------------------------------
    double                -np 8  test subcommunicator<stock>              pass
      Start 15: test_subcomm_np12
18/25 Test #15: test_subcomm_np12 ................***Failed  116.44 sec
--------------------------------------------------------------------------------
                          heffte::fft subcommunicators
--------------------------------------------------------------------------------
    double               -np 12  test subcommunicator<stock>              pass
      Start 18: heffte_fft2d_r2c_np4
19/25 Test #18: heffte_fft2d_r2c_np4 .............***Failed   31.25 sec
--------------------------------------------------------------------------------
                             heffte::fft_r2c class
--------------------------------------------------------------------------------
     float              -np 4  test heffte::fft2d_r2c<stock>              pass
      Start 19: heffte_fft3d_r2c_np6
20/25 Test #19: heffte_fft3d_r2c_np6 .............***Failed   63.12 sec
--------------------------------------------------------------------------------
                             heffte::fft_r2c class
--------------------------------------------------------------------------------
     float              -np 6  test heffte::fft3d_r2c<stock>              pass
      Start 20: heffte_fft3d_r2c_np8
21/25 Test #20: heffte_fft3d_r2c_np8 .............***Failed  104.04 sec
--------------------------------------------------------------------------------
                             heffte::fft_r2c class
--------------------------------------------------------------------------------
     float              -np 8  test heffte::fft3d_r2c<stock>              pass
      Start 21: heffte_fft3d_r2c_np12
22/25 Test #21: heffte_fft3d_r2c_np12 ............***Failed  155.48 sec
--------------------------------------------------------------------------------
                             heffte::fft_r2c class
--------------------------------------------------------------------------------
     float             -np 12  test heffte::fft3d_r2c<stock>              pass
      Start 23: test_cos_np2
23/25 Test #23: test_cos_np2 .....................***Failed    1.13 sec
      Start 24: test_cos_np4
24/25 Test #24: test_cos_np4 .....................***Failed    6.16 sec
--------------------------------------------------------------------------------
                               cosine transforms
--------------------------------------------------------------------------------
     float             -np 4  test cosine<stock-cos-type-II>              pass
      Start 25: heffte_longlong_np4
25/25 Test #25: heffte_longlong_np4 ..............***Failed  105.14 sec
--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------
     float                  -np 4  test int/long long<stock>              pass
24% tests passed, 19 tests failed out of 25
Total Test time (real) = 821.67 sec
The following tests FAILED:
	  4 - heffte_reshape3d_np4 (Failed)
	  5 - heffte_reshape3d_np7 (Failed)
	  6 - heffte_reshape3d_np12 (Failed)
	  8 - heffte_fft3d_np2 (Failed)
	  9 - heffte_fft3d_np4 (Failed)
	 10 - heffte_fft3d_np6 (Failed)
	 11 - heffte_fft3d_np8 (Failed)
	 12 - heffte_fft3d_np12 (Failed)
	 13 - heffte_streams_np6 (Failed)
	 14 - test_subcomm_np8 (Failed)
	 15 - test_subcomm_np12 (Failed)
	 17 - heffte_fft3d_r2c_np2 (Failed)
	 18 - heffte_fft2d_r2c_np4 (Failed)
	 19 - heffte_fft3d_r2c_np6 (Failed)
	 20 - heffte_fft3d_r2c_np8 (Failed)
	 21 - heffte_fft3d_r2c_np12 (Failed)
	 23 - test_cos_np2 (Failed)
	 24 - test_cos_np4 (Failed)
	 25 - heffte_longlong_np4 (Failed)
Errors while running CTest

@mkstoyanov (Collaborator) commented Oct 24, 2024

Something is wrong with MPI; MPI_Barrier(MPI_COMM_WORLD) should always work. It is the first MPI method called after MPI_Init().

  • You can try -D Heffte_SEQUENTIAL_TESTING=ON to make sure different MPI processes don't try to sync across ranks of different tests.
  • You can try a minimal MPI example, e.g., no heffte, just a small program that does a send/recv, and check whether that works in the rpm environment; maybe MPI does not work at all. A sketch is below.
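
For reference, here is a standalone sketch of such a check (my own untested example, not part of heffte): each rank exchanges one integer with its ring neighbors via MPI_Sendrecv and then calls the same MPI_Barrier that fails in the logs above.

#include <iostream>
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int me, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  int right = (me + 1) % nranks;           // neighbor to send to
  int left  = (me + nranks - 1) % nranks;  // neighbor to receive from
  int token = me, received = -1;

  // combined send/recv avoids deadlock regardless of MPI buffering
  MPI_Sendrecv(&token, 1, MPI_INT, right, 0,
               &received, 1, MPI_INT, left, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  std::cout << "rank " << me << " received " << received
            << " (expected " << left << ")" << std::endl;

  MPI_Barrier(MPI_COMM_WORLD);  // the call that fails in the heffte tests
  MPI_Finalize();
  return 0;
}

Build it the same way as the hello world (mpicc/mpicxx) and run with mpiexec -np 4; if either the exchange or the barrier misbehaves, the problem is in the MPI stack rather than in heffte.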

It is hard to figure this out without hands on the hardware.

@junghans (Contributor, Author)

The sequential one, https://koji.fedoraproject.org/koji/taskinfo?taskID=125161367, fails as well:

4/25 Test  #4: heffte_reshape3d_np4 .............***Failed    0.51 sec
--------------------------------------------------------------------------------
                             heffte reshape methods
--------------------------------------------------------------------------------
                                   heffte::mpi::gather_boxes              pass
Abort(676410127) on node 1 (rank 1 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................: 
MPIDI_Barrier_allcomm_composition_json(132): 
MPIDI_POSIX_mpi_bcast(238).................: 
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1
Abort(676410127) on node 3 (rank 3 in comm 0): Fatal error in internal_Barrier: Other MPI error, error stack:
internal_Barrier(7454).....................: MPI_Barrier(MPI_COMM_WORLD) failed
MPID_Barrier(167)..........................: 
MPIDI_Barrier_allcomm_composition_json(132): 
MPIDI_POSIX_mpi_bcast(238).................: 
MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across processes in the collective routine: Received 0 but expected 1

But I think @mkstoyanov is right; that looks very much like a more fundamental issue in Fedora's mpich package.
@keszybz, who is Fedora's mpich package maintainer?

@mkstoyanov (Collaborator)

In my experience, the Red Hat family is rather paranoid. There are a bunch of flags about "hardened" and "secure" that I have not used and whose meaning I don't know. I wouldn't put it past them to have something in the environment that blocks processes from communicating over MPI. The error message says that MPI_Barrier failed to send/recv even a single byte.

I don't think I can help here.

@junghans (Contributor, Author)

Let me add an MPI hello world to the build and see if that fails, too!

@junghans (Contributor, Author) commented Oct 25, 2024

Hmm, https://koji.fedoraproject.org/koji/taskinfo?taskID=125191181, hello world worked:

+ mpicc /builddir/build/SOURCES/mpi_hello_world.c -o mpi_hello_world
+ mpiexec -np 12 ./mpi_hello_world
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 0 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 2 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 1 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 3 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 8 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 4 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 6 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 10 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 9 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 7 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 11 out of 12 processors
Hello world from processor db7a3174e80d4305b9dd4b1a288fb59b, rank 5 out of 12 processors

@keszybz commented Oct 25, 2024

@keszybz, who is Fedora's mpich package maintainer?

That'd be me. But I only picked up mpich because nobody else wanted it. I'm not qualified to fix real issues.

@junghans (Contributor, Author)

@opoplawski any ideas? Otherwise, I will ping the developer mailing list.

@mkstoyanov (Collaborator)

Hmm, https://koji.fedoraproject.org/koji/taskinfo?taskID=125191181, hello world worked:

I can't find the source code in the build logs. In the hello-world example, do you have any code other than MPI init and the print statement? You should add at least an MPI_Barrier, e.g., a minimal standalone check like this:

#include <iostream>
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int me, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);
  // print in rank order; every iteration calls the MPI_Barrier that fails in the heffte logs
  for (int i=0; i<nranks; i++) {
    if (me == i)
      std::cout << "hello from rank: " << me << std::endl;
    MPI_Barrier(MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}

That will call the method that failed in the heffte logs, and the ranks should print in sequential order, i.e., 0, 1, 2, 3, ...

@junghans (Contributor, Author)

OK, I made it print the source and added the suggested loop as well in https://koji.fedoraproject.org/koji/taskinfo?taskID=125198447

@junghans (Contributor, Author)

Hmm

Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 0 out of 12 processors
hello from rank: 0
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 2 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 4 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 1 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 5 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 8 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 10 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 6 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 11 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 3 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 9 out of 12 processors
Hello world from processor 9db9bf1ab4534e4e9507c05c42b72e39, rank 7 out of 12 processors
hello from rank: 1
hello from rank: 2
error: Bad exit status from /var/tmp/rpm-tmp.ttBmu8 (%build)
    Bad exit status from /var/tmp/rpm-tmp.ttBmu8 (%build)
RPM build errors:
Child return code was: 1

@mkstoyanov (Collaborator)

The calls that read the process info, rank, comm size, etc., do not require actual communication, only on-node work. The log shows a crash on the second call to MPI_Barrier(), so the hello world is also failing here, while it works on the other systems.

You can play around with send/recv to see how those behave and whether they work properly, but there's something wrong with MPI in this environment.

@junghans (Contributor, Author)

Yeah, that will need some deeper investigation.

I would just go ahead with v2.4.1 and not wait for this issue.
