
prterun hangs, Open MPI v5.0.6 #12939

Open

axiom-ctrl opened this issue Nov 27, 2024 · 6 comments

Comments

@axiom-ctrl

axiom-ctrl commented Nov 27, 2024

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

  • Open MPI v5.0.6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

  • from a source/distribution tarball

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux 8.8
  • Computer hardware: HPE Cray EX
  • Network type: HPE Slingshot

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I have noticed that, intermittently and without any discernible pattern, prterun hangs after I issue an mpirun command, meaning the main application executable never starts. The likelihood of this happening increases when launching a larger number of MPI processes across multiple compute nodes.

Open MPI was compiled with PBS support, as outlined below:

./configure \
    CC="gcc" \
    CXX="g++" \
    FC="gfortran" \
    --prefix=${install_dir} \
    --enable-shared \
    --enable-static \
    --with-pbs \
    --with-libfabric="/opt/cray" \
    --with-libfabric-libdir="/opt/cray/lib64" \
    --with-tm="/opt/pbs" \
    --with-tm-libdir="/opt/pbs/lib"

I run the application as follows:

mpirun -np 32 pw.x -i scf.input

I have observed the same behavior with earlier versions of Open MPI 5 as well. The issue is not tied to the specific application used (i.e., Quantum ESPRESSO), and it occurs independently of any particular compute node: simply restarting the job often resolves the problem.

@rhc54
Contributor

rhc54 commented Nov 27, 2024

This is usually traced back to a race condition in the TM support integration. The TM library isn't the most robust interface, I'm afraid, as it is rather atomistic - launch daemon, poll until reap, etc. We've seen plenty of "lost" interactions over the years.

You can try using the ssh launcher instead - it is more robust. Just add --prtemca plm ssh to your mpirun command line. There is a long-running debate over which launcher (tm or ssh) is faster at scale, but there is no question that ssh tends to be more reliable.
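For concreteness, applied to the invocation from the report, the suggested flag would be added like this (a sketch; the executable and input file are those from the original command):

mpirun --prtemca plm ssh -np 32 pw.x -i scf.input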

@axiom-ctrl
Author

axiom-ctrl commented Nov 27, 2024

Thank you for the prompt response. I attempted to run the command as:

mpirun --prtemca plm ssh -np 32 --hostfile $PBS_NODEFILE pw.x -i scf.input

... but it outputs:

/apps/utils/openmpi/5.0.6/gnu/11.2.1/bin/prted: error while loading shared libraries: libpbs.so.0: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-x8000c1s0b0n1-2021969@0,0] on node x8000c1s0b0n1
  Remote daemon: [prterun-x8000c1s0b0n1-2021969@0,1] on node x8000c1s0b1n0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

It appears that prted is correctly linked:

ldd /apps/utils/openmpi/5.0.6/gnu/11.2.1/bin/prted

	linux-vdso.so.1 (0x00007ffffefd2000)
	libprrte.so.3 => /apps/utils/openmpi/5.0.6/gnu/11.2.1/lib/libprrte.so.3 (0x000014f103e8e000)
	libpbs.so.0 => /opt/pbs/lib/libpbs.so.0 (0x000014f103afa000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x000014f1038da000)

... and the library is present:

find /opt/pbs/lib/ -iname 'libpbs.so.0'

	/opt/pbs/lib/libpbs.so.0

Additionally, I set PRTE_MCA_prte_fwd_environment to true. The aforementioned error disappears if I remove the flag you suggested.
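For reference, that parameter was presumably exported in the job environment along these lines (a sketch; the exact placement in the job script is not shown above):

export PRTE_MCA_prte_fwd_environment=true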

Thanks again.

@rhc54
Contributor

rhc54 commented Nov 27, 2024

You're missing a library on the backend, but that is pretty typical - the PBS libraries are usually not installed on the compute nodes. Reconfigure without the --with-tm lines and see if that works. If not, then try removing --with-pbs as well. You may or may not need to forward the environment, but it won't hurt if you do.
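One quick way to confirm the missing backend library (a sketch; x8000c1s0b1n0 is the remote daemon node from the error message above, and the prted path is the one from the earlier ldd output) is to resolve prted's libraries on a compute node rather than on the node where mpirun was issued:

ssh x8000c1s0b1n0 ldd /apps/utils/openmpi/5.0.6/gnu/11.2.1/bin/prted | grep libpbs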

@axiom-ctrl
Author

I attempted to reconfigure the installation with the following settings:

./configure \
    CC="gcc" \
    CXX="g++" \
    FC="gfortran" \
    --prefix=${install_dir} \
    --enable-shared \
    --enable-static \
    --with-libfabric="/opt/cray" \
    --with-libfabric-libdir="/opt/cray/lib64" \
    --without-pbs \
    --without-tm

This allowed me to move past the previous error.


Then I ran three identical jobs, each with 32 MPI processes distributed across 32 separate nodes. (PBS_NODEFILE contains 32 distinct lines, corresponding to 32 different compute nodes.)

mpirun --prtemca plm ssh -np 32 --hostfile $PBS_NODEFILE pw.x -i scf.input

What happens is that the application reports:

     Parallel version (MPI & OpenMP), running on      32 processor cores
     Number of MPI processes:                32
     Threads/MPI process:                     1

     MPI processes distributed on     2 nodes

Indeed, when I SSH into one of the nodes where processes should be running (checking PBS_NODEFILE), I can observe that the prted processes are hanging and have failed to launch the pw.x executable.


If I omit the -np flag (while keeping --hostfile), I get the following error:

--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: x8000c0s7b0n1
  Location: mtl_ofi_component.c:1007
  Error: Invalid argument (22)
--------------------------------------------------------------------------

I am not sure whether this is related to the way mpirun is being invoked.

@hppritcha
Member

To help diagnose the last error message, could you rerun with the following Open MPI MCA parameters set?

--mca pml ob1 --mca btl self,sm,tcp

Does this system have the HPE PALS package installed? Is there an aprun command in your path?
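For concreteness, combined with the earlier command line, the suggested parameters might be added like this (a sketch; the executable, input file, and hostfile are taken from the report above):

mpirun --prtemca plm ssh --mca pml ob1 --mca btl self,sm,tcp -np 32 --hostfile $PBS_NODEFILE pw.x -i scf.input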

@rhc54
Contributor

rhc54 commented Nov 27, 2024

Indeed, when I SSH into one of the nodes where processes should be running (checking PBS_NODEFILE), I can observe that the prted processes are hanging and have failed to launch the pw.x executable.

I suspect they aren't hanging; rather, they simply don't think they have anything to do. You could also try adding --prtemca plm_base_verbose 5 to get some more debugging output - you might want to add --enable-debug to your configure line first to activate some of the debugging messages.
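For example, a more verbose debugging run might look roughly like this (a sketch; the configure step only matters if the install is rebuilt with --enable-debug, and the rest of the command line mirrors the report above):

./configure ... --enable-debug        # rebuild to enable additional debug messages
mpirun --prtemca plm ssh --prtemca plm_base_verbose 5 -np 32 --hostfile $PBS_NODEFILE pw.x -i scf.input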
