prterun hangs, Open MPI v5.0.6 #12939
Comments
This is usually traced back to a race condition in the TM support integration. The TM library isn't the most robust interface, I'm afraid, as it is rather atomistic: launch daemon, poll until reap, etc. We've seen plenty of "lost" interactions over the years. You can try using the
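As a purely illustrative sketch (not necessarily the workaround being suggested above), one way to take the TM integration out of the picture is to select the ssh-based launcher explicitly; the component name, flags, and hostfile variable below are assumptions about this setup:

```shell
# Sketch: bypass the TM/PBS launcher and use the ssh-based PLM instead
# (assumes passwordless ssh between nodes; $PBS_NODEFILE is provided by PBS)
mpirun --prtemca plm ssh --hostfile "$PBS_NODEFILE" -np 32 pw.x -i scf.input
```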
Thank you for the prompt response. I attempted to run the command as:
... but it outputs:
It appears that
... and the library is present:
Additionally, I set ...

Thanks again.
You're missing a library on the backend, but that is pretty typical - they don't put the
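For illustration, a hedged sketch of making a missing PBS/TM library visible on the backend nodes (the library path is an assumption; adjust to the actual install location):

```shell
# Sketch: extend the runtime library search path on the backend nodes
# (the /opt/pbs/lib location is an assumption for this cluster)
export LD_LIBRARY_PATH=/opt/pbs/lib:$LD_LIBRARY_PATH
```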
I attempted to reconfigure the installation with the following settings:
This allowed me to move past the previous error. Then I ran three identical jobs, each with 32 MPI processes distributed across 32 separate nodes. (...)
What happens is that the application reports:
Indeed, when I SSH into one of the nodes where processes should be running (checking ...).

If I attempt to omit the
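As a hedged illustration of that kind of check (the node name and process names below are placeholders, not taken from this report):

```shell
# Sketch: verify whether the ranks and the PRRTE daemon actually started on a node
# (node042 is a placeholder hostname)
ssh node042 'ps -ef | grep -E "pw\.x|prted" | grep -v grep'
```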
I am not sure if this is related to the way the
To help diagnose the last error message, could you rerun with the following Open MPI MCA params set?
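As a hedged sketch of the kind of verbosity parameters typically used for this sort of diagnosis (the names below are standard PRRTE framework verbosity knobs; whether they match the exact parameters requested is an assumption):

```shell
# Sketch: rerun with launcher and allocation verbosity turned up
# (plm_base_verbose / ras_base_verbose control PRRTE framework debug output;
#  treating these as the requested params is an assumption)
mpirun --prtemca plm_base_verbose 5 \
       --prtemca ras_base_verbose 5 \
       -np 32 pw.x -i scf.input 2>&1 | tee prterun-debug.log
```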
Does this system have the HPE PALS package installed? Is there a
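For example, a generic way to look for a PALS installation (install locations vary between sites; the paths searched below are assumptions):

```shell
# Sketch: look for HPE PALS components on the system
# (best-effort search; exact install locations differ per site)
find /opt -maxdepth 3 -iname '*pals*' 2>/dev/null
ldconfig -p | grep -i pals
```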
I suspect they aren't hanging, but rather don't think they have anything to do. You could also try adding
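A hedged sketch of the sort of option that can confirm whether the ranks received the expected allocation and mapping (whether this matches the option being suggested above is an assumption):

```shell
# Sketch: print the allocation and process map so idle ranks are visible
# (uses the Open MPI v5 --display syntax; adjust if your build differs)
mpirun --display allocation,map -np 32 pw.x -i scf.input
```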
Background information
Details of the problem
I have noticed that sometimes, without any specific regularity, after initiating an mpirun command, a prterun hang occurs, meaning the main executable application does not start. The likelihood of this happening increases when launching a larger number of MPI processes across different computational nodes.

Open MPI was compiled with PBS support, as outlined below:
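For reference, a hedged sketch of what a PBS/TM-enabled configure invocation typically looks like (the install prefix and PBS location are assumptions, not the actual settings used here):

```shell
# Sketch: configure Open MPI with Torque/PBS (TM) launch support
# (paths are assumptions; --with-tm points at the PBS Pro / Torque install)
./configure --prefix=/opt/openmpi-5.0.6 --with-tm=/opt/pbs
make -j 8 && make install
```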
I run the application as follows:
mpirun -np 32 pw.x -i scf.input
I have observed the same behavior with earlier versions of Open MPI 5 as well. The issue is not specific to the application I used (i.e. Quantum ESPRESSO) and occurs independently of any particular computational node; simply restarting the job often resolves the problem.