Jobs started with srun using cm/PSM2 fail #12886
Comments
I'm not sure.

> With Open MPI 4.0, PSM2 and srun are working, though using PMI2 instead of PMIx. As far as I understand, there also seems to be some successful communication on the PMIx level before the crash.
PMIx has nothing to do with the fabric, so the test doesn't mean anything per the reported issue. Likewise, the Slurm PMI2 module is independent of the PMIx one, and so the question remains regarding provision of the OmniPath token. You could just …
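For what it's worth, a minimal way to compare what the launched processes actually see under the two launchers (assuming a PMIx-enabled Slurm; a single task is enough for the check):

srun --mpi=pmix -n 1 env | grep OMPI_MCA_orte_precondition_transports
mpirun -n 1 env | grep OMPI_MCA_orte_precondition_transports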
OMPI_MCA_orte_precondition_transports is indeed not present. The strange thing is that it is also not present with Open MPI 4.0, where the PSM2 MTL seems to be working correctly. Is setting this variable a new requirement for Open MPI 5.0 or PMIx? I couldn't find anything about it in the Open MPI documentation.
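One way to test the token theory by hand would be to export a value before launching. The two-part hex format below mirrors what older Open MPI releases generated for this variable, so treat both the format and the ring binary name as assumptions:

export OMPI_MCA_orte_precondition_transports=$(openssl rand -hex 8)-$(openssl rand -hex 8)   # assumed format: two 16-digit hex strings joined by "-"
srun --mpi=pmix -n 2 ./ring_c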
I managed to get Open MPI 4.1 running with PMIx and PML cm / MTL PSM2, even though OMPI_MCA_orte_precondition_transports is not set. So the problem is most likely on the Open MPI side and not related to the environment variable.
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.5
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Open MPI was built from the released sources (https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.5.tar.bz2) using gcc 14.2 and the following configure options:
./configure --prefix=%{_prefix} \
    --libdir=%{_libdir} \
    --enable-shared \
    --disable-heterogeneous \
    --enable-prte-prefix-by-default \
    --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda,coll-cuda,io-romio341 \
    --with-show-load-errors=no \
    --with-slurm \
    --with-psm2 \
    --with-pmix=internal \
    --with-ucx=%{builddir}%{_prefix} \
    --with-ofi=%{builddir}%{_prefix}
Furthermore, we use a manually built libfabric v1.21.0 and UCX v1.16.0.
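As a sanity check on such a build (a generic check, not specific to this spec file), the PML and MTL components that were actually compiled in can be listed with ompi_info; a line such as "MCA mtl: psm2" would confirm that the PSM2 MTL is available:

ompi_info | grep -E "MCA (pml|mtl)"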
Please describe the system on which you are running
Details of the problem
On our cluster with Slurm 23.11.5, Open MPI 5.0 jobs (here the Open MPI ring example) fail inside a Slurm allocation when started with srun. The same job runs successfully when started with mpirun. The PML is set to "cm" and the MTL to "psm2" via environment variables. Jobs with the ucx or ob1 PML run successfully, but with higher latency.
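Concretely, the failing and working invocations look roughly like this; the binary name and task counts are placeholders, and the MCA environment variables shown are the standard way of forcing a PML/MTL:

export OMPI_MCA_pml=cm
export OMPI_MCA_mtl=psm2
srun --mpi=pmix -n 2 ./ring_c    # fails
mpirun -n 2 ./ring_c             # works inside the same allocation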