
MPI processes (Starting PEs : 1) does not match the expected number 24 #454

Open

YueZhang720 opened this issue Oct 23, 2024 · 4 comments

Labels: category: Debug Help (Request for help debugging GCHP), topic: Runtime (Related to runtime issues, e.g. simulation stops with error)

@YueZhang720

Your name

Yue Zhang

Your affiliation

HKUST(GZ)

What happened? What did you expect to happen?

After submitting the Slurm job, the GCHP log contains the following errors:

FATAL: mpp_domains_define.inc: not all the pe_end are in the pelist

 Starting PEs :            1
 Starting Threads :           56

 For k_split (remapping)=           1
n_split is set to 02 for resolution-dt=0025x0025x6-  600.000
Using n_zfilter : 000
Using n_sponge : 001
Using non_ortho :       T

[... the same "Starting PEs : 1", "Starting Threads : 56", and FATAL messages repeat many times throughout the log ...]

What are the steps to reproduce the bug?

I have tried GCHP 13.3.4 and GCHP 14.4.3, and both simulations report the same errors. I also tried different versions of [email protected] and [email protected]; neither worked. What do you think caused this issue, and what should I do to solve it?

Please attach any relevant configuration and log files.

ExtData.txt
GCHP_log.txt
run_sh.txt
setCommonRunSettings.txt

What GCHP version were you using?

14.4.3

What environment were you running GCHP on?

Local cluster

What compiler and version were you using?

gcc 10.2.0

What MPI library and version were you using?

openmpi 5.0.5

Will you be addressing this bug yourself?

Yes, but I will need some help

Additional information

No response

@YueZhang720 YueZhang720 added the "category: Bug (Something isn't working)" label on Oct 23, 2024
@lizziel
Contributor

lizziel commented Oct 23, 2024

Hi @YueZhang720, this looks like an issue where ESMF was built without real MPI support, so each process runs as if it were the only PE. That would also explain why you are seeing the same startup messages printed over and over. Check your ESMF build: was ESMF_COMM set to mpiuni? It needs to specify your MPI library, in this case openmpi. See the GCHP ReadTheDocs instructions for environment settings, which include the ESMF settings needed for the build: https://gchp.readthedocs.io/en/stable/getting-started/requirements.html.
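
For reference, a minimal sketch of the build-time environment being described above; the compiler choice and paths are placeholders, not taken from this thread:

# ESMF reads these variables at build time; ESMF_COMM must name the MPI
# implementation GCHP will run with. "mpiuni" builds a serial stub MPI,
# which makes every MPI rank think it is the only PE.
export ESMF_COMPILER=gfortran                      # placeholder compiler
export ESMF_COMM=openmpi                           # NOT mpiuni
export ESMF_DIR=/path/to/esmf/source               # placeholder path
export ESMF_INSTALL_PREFIX=/path/to/esmf/install   # placeholder path
make -j && make install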

@lizziel lizziel self-assigned this Oct 23, 2024
@lizziel lizziel added the "category: Debug Help (Request for help debugging GCHP)" and "topic: Runtime (Related to runtime issues, e.g. simulation stops with error)" labels and removed the "category: Bug (Something isn't working)" label on Oct 23, 2024
@lizziel
Contributor

lizziel commented Nov 12, 2024

@YueZhang720, were you able to resolve this issue?

@YueZhang720
Author

> @YueZhang720, were you able to resolve this issue?

It still doesn't work, so I tried GCHP 14.5.0 with [email protected]. When I use mpirun -np 6 ./gchp, here is the error message:

[node09:1761803] [[57744,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 320
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

When I use srun -n 48 -N 2 -m plane=24 --mpi=pmi2 ./gchp, the error is as follows:

--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM support. This usually happens
when OMPI was not configured --with-slurm and we weren't able
to discover a SLURM installation in the usual places.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node09:1762615] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

[... the same "not built with SLURM support" and MPI_Init_thread abort messages repeat for the other processes ...]

Is there some incompatibility between my Slurm setup and ESMF/MPI? I have tried many times, but it hasn't worked out.
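
For reference, a quick way to check whether an existing Open MPI installation was built with Slurm and PMIx support, assuming Open MPI's standard ompi_info tool is on the PATH:

# List the Slurm- and PMIx-related components compiled into this Open MPI.
# No output for "slurm" means the library lacks Slurm support, which matches
# the "OMPI was not built with SLURM support" message above.
ompi_info | grep -i slurm
ompi_info | grep -i pmix
mpirun --version    # confirm which Open MPI build is actually being used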

@lizziel
Copy link
Contributor

lizziel commented Nov 21, 2024

Hi @YueZhang720, this still looks like an MPI issue. Do you have a system administrator on your cluster who can help look into the MPI configuration?
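
A hedged sketch of how Open MPI is commonly rebuilt with Slurm and PMIx support, which is what the "not configured --with-slurm" message above points at; the flags shown are typical for Open MPI 4.x and the paths are placeholders:

# Point Open MPI's configure at the cluster's Slurm and (optionally) an
# external PMIx installation so that both mpirun and srun launches work.
./configure --prefix=/path/to/openmpi/install \
            --with-slurm \
            --with-pmix=/path/to/pmix    # or omit to use the bundled PMIx
make -j && make install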
