
mpi4py test failures with ompi@main #12929

Open
dalcinl opened this issue Nov 20, 2024 · 3 comments

@dalcinl
Contributor

dalcinl commented Nov 20, 2024

Nightly mpi4py tests with ompi@main have been failing from time to time. The tests pass after a re-run, so the problem is not easily reproducible. The latest failure produces the following output.

testCreateFromGroup (test_comm.TestCommSelfDup.testCreateFromGroup) ... [fv-az654-539:143367] PMIX ERROR: PMIX_ERR_UNPACK_READ_PAST_END_OF_BUFFER in file client/pmix_client_group.c at line 1376

Full logs here
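For reference, the failing test exercises the MPI-4 function MPI_Comm_create_from_group, and the error in the log comes from PMIx's group client code (client/pmix_client_group.c). Below is a minimal sketch of what a test like testCreateFromGroup does, assuming mpi4py 4.x (which exposes Intracomm.Create_from_group); the string tag is just an illustrative value, not necessarily the one the test suite uses:

```python
from mpi4py import MPI

# Sketch: create a new communicator from a group without using a parent
# communicator (MPI_Comm_create_from_group, new in MPI-4).
# Assumes mpi4py 4.x; the stringtag value here is purely illustrative.
group = MPI.COMM_SELF.Get_group()
newcomm = MPI.Intracomm.Create_from_group(
    group, stringtag="example.org/create-from-group-sketch"
)
group.Free()
assert newcomm.Get_size() == 1
newcomm.Free()
```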

@rhc54
Contributor

rhc54 commented Nov 20, 2024

I've noticed some semi-random errors on PRs, but I haven't seen that particular error message before. I suspect that specific error may be indicative of the growing disconnect between the PMIx master branch and the OMPI fork of PRRTE. I've tried to start some discussion over here about it, but due to Supercomputing and the holidays it will take some time to address the problem.

The line number indicates that the PMIx submodule isn't current - indeed, a quick glance shows it is far behind the head of the master branch. I can post a PR to update it, just to see if it impacts anything.

However, the overall problem could have nothing to do with PMIx or PRRTE. 🤷‍♂️ Difficult to say.

@rhc54
Contributor

rhc54 commented Nov 20, 2024

Sigh - can't update PMIx as the OMPI PRRTE fork is simply too out-of-sync. 🤷‍♂️ Not much I can help with, I'm afraid.

@rhc54
Contributor

rhc54 commented Nov 20, 2024

FWIW: looking at the OMPI nightly regression tests (their own test suite), it appears that the one-sided tests are uniformly failing in both the main and v5.0 branches. I'm seeing the same failure signatures that Debian is reporting elsewhere.

Interestingly enough, I'm not seeing the failure you are reporting here - but given that it is intermittent, that may simply be luck.
