
prterun hangs, Open MPI v5.0.6 #12939

Open

axiom-ctrl opened this issue Nov 27, 2024 · 6 comments

Comments

@axiom-ctrl

axiom-ctrl commented Nov 27, 2024

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

  • Open MPI v5.0.6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

  • from a source/distribution tarball

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux 8.8
  • Computer hardware: HPE Cray EX
  • Network type: HPE Slingshot

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I have noticed that, intermittently and without any discernible pattern, prterun hangs after I issue an mpirun command, meaning the main application executable never starts. The likelihood of this happening increases when launching a larger number of MPI processes across multiple compute nodes.

Open MPI was compiled with PBS support, as outlined below:

./configure \
    CC="gcc" \
    CXX="g++" \
    FC="gfortran" \
    --prefix=${install_dir} \
    --enable-shared \
    --enable-static \
    --with-pbs \
    --with-libfabric="/opt/cray" \
    --with-libfabric-libdir="/opt/cray/lib64" \
    --with-tm="/opt/pbs" \
    --with-tm-libdir="/opt/pbs/lib"

I run the application as follows:

mpirun -np 32 pw.x -i scf.input

I have observed the same behavior with earlier versions of Open MPI 5 as well. The issue is not tied to the specific application used (i.e., Quantum ESPRESSO), and it occurs independently of any particular compute node: simply restarting the job often resolves the problem.

@rhc54
Contributor

rhc54 commented Nov 27, 2024

This is usually traced back to a race condition in the TM support integration. The TM library isn't the most robust interface, I'm afraid, as it is rather atomistic - launch daemon, poll until reap, etc. We've seen plenty of "lost" interactions over the years.

You can try using the ssh launcher instead - it is more robust. Just add --prtemca plm ssh to your mpirun command line. There is a long-running debate over which launcher (tm or ssh) is faster at scale, but there is no question that ssh tends to be more reliable.
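For concreteness, applied to the invocation from the report, the suggested flag would be added like this (a sketch; the executable and input file are those from the original command):

mpirun --prtemca plm ssh -np 32 pw.x -i scf.input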

@axiom-ctrl
Author

axiom-ctrl commented Nov 27, 2024

Thank you for the prompt response. I attempted to run the command as:

mpirun --prtemca plm ssh -np 32 --hostfile $PBS_NODEFILE pw.x -i scf.input

... but it outputs:

/apps/utils/openmpi/5.0.6/gnu/11.2.1/bin/prted: error while loading shared libraries: libpbs.so.0: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
PRTE has lost communication with a remote daemon.

  HNP daemon   : [prterun-x8000c1s0b0n1-2021969@0,0] on node x8000c1s0b0n1
  Remote daemon: [prterun-x8000c1s0b0n1-2021969@0,1] on node x8000c1s0b1n0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

It appears that prted is correctly linked:

ldd /apps/utils/openmpi/5.0.6/gnu/11.2.1/bin/prted

	linux-vdso.so.1 (0x00007ffffefd2000)
	libprrte.so.3 => /apps/utils/openmpi/5.0.6/gnu/11.2.1/lib/libprrte.so.3 (0x000014f103e8e000)
	libpbs.so.0 => /opt/pbs/lib/libpbs.so.0 (0x000014f103afa000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x000014f1038da000)

... and the library is present:

find /opt/pbs/lib/ -iname 'libpbs.so.0'

	/opt/pbs/lib/libpbs.so.0

Additionally, I set PRTE_MCA_prte_fwd_environment to true. The aforementioned error disappears if I remove the flag you suggested.
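For reference, that parameter was presumably exported in the job environment along these lines (a sketch; the exact placement in the job script is not shown above):

export PRTE_MCA_prte_fwd_environment=true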

Thanks again.

@rhc54
Contributor

rhc54 commented Nov 27, 2024

You're missing a library on the backend, but that is pretty typical - the PBS libraries are usually not installed on the compute nodes. Reconfigure without the --with-tm lines and see if that works. If not, then try removing --with-pbs as well. You may or may not need to forward the environment, but it won't hurt if you do.
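One quick way to confirm the missing backend library (a sketch; x8000c1s0b1n0 is the remote daemon node from the error message above, and the prted path is the one from the earlier ldd output) is to resolve prted's libraries on a compute node rather than on the node where mpirun was issued:

ssh x8000c1s0b1n0 ldd /apps/utils/openmpi/5.0.6/gnu/11.2.1/bin/prted | grep libpbs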

@axiom-ctrl
Author

I attempted to reconfigure the installation with the following settings:

./configure \
    CC="gcc" \
    CXX="g++" \
    FC="gfortran" \
    --prefix=${install_dir} \
    --enable-shared \
    --enable-static \
    --with-libfabric="/opt/cray" \
    --with-libfabric-libdir="/opt/cray/lib64" \
    --without-pbs \
    --without-tm

This allowed me to move past the previous error.


Then I ran three identical jobs, each with 32 MPI processes distributed across 32 separate nodes. (PBS_NODEFILE contains 32 distinct lines, corresponding to 32 different compute nodes.)

mpirun --prtemca plm ssh -np 32 --hostfile $PBS_NODEFILE pw.x -i scf.input

What happens is that the application reports:

     Parallel version (MPI & OpenMP), running on      32 processor cores
     Number of MPI processes:                32
     Threads/MPI process:                     1

     MPI processes distributed on     2 nodes

Indeed, when I SSH into one of the nodes where processes should be running (checking PBS_NODEFILE), I can observe that the prted processes are hanging and have failed to launch the pw.x executable.


If I omit the -np flag (while keeping --hostfile), I get the following error:

--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: x8000c0s7b0n1
  Location: mtl_ofi_component.c:1007
  Error: Invalid argument (22)
--------------------------------------------------------------------------

I am not sure whether this is related to the way mpirun is being invoked.

@hppritcha
Member

To help diagnose the last error message, could you rerun with the following Open MPI MCA parameters set?

--mca pml ob1 --mca btl self,sm,tcp

Does this system have the HPE PALS package installed? Is there an aprun command in your path?
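For concreteness, combined with the earlier command line, the suggested parameters might be added like this (a sketch; the executable, input file, and hostfile are taken from the report above):

mpirun --prtemca plm ssh --mca pml ob1 --mca btl self,sm,tcp -np 32 --hostfile $PBS_NODEFILE pw.x -i scf.input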

@rhc54
Contributor

rhc54 commented Nov 27, 2024

Indeed, when I SSH into one of the nodes where processes should be running (checking PBS_NODEFILE), I can observe that the prted processes are hanging and have failed to launch the pw.x executable.

I suspect they aren't hanging; rather, they simply don't think they have anything to do. You could also try adding --prtemca plm_base_verbose 5 to get some more debugging output - you might want to add --enable-debug to your configure line first to activate some of the debugging messages.
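For example, a more verbose debugging run might look roughly like this (a sketch; the configure step only matters if the install is rebuilt with --enable-debug, and the rest of the command line mirrors the report above):

./configure ... --enable-debug        # rebuild to enable additional debug messages
mpirun --prtemca plm ssh --prtemca plm_base_verbose 5 -np 32 --hostfile $PBS_NODEFILE pw.x -i scf.input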
