mpi4py CI failures on main, v5.0.x #12940
It is worth noting that the test program is spawning on `MPI_COMM_SELF`. The error message is actually coming from the TCP BTL: https://github.com/open-mpi/ompi/blob/main/opal/mca/btl/tcp/btl_tcp_endpoint.c#L588-L590

This suggests that there might be some kind of race condition in the BTL disconnect teardown.
I would first suggest reverting #12920.
Look at the failure - it had nothing to do with that PR. You'll also have seen other PRs failing mpi4py at that time, with these same random failures - one was just fixing a typo in the docs. Still, you are welcome to revert and try again.
I don't know; just look at https://github.com/open-mpi/ompi/pulls?q=is%3Apr+is%3Aclosed and the one PR that jumps out as being merged in with mpi4py failures on main is #12920.
Correct - we didn't merge the others. They are still sitting there with errors, going nowhere.
FWIW: I cannot get that test to run at all with the head of the PMIx/PRRTE master branches - it immediately segfaults with bad returned data from the "lookup" operation. If I change […]

Can't comment on the correctness of the test - but I suspect OMPI's dpm code really cannot handle the scenario of all parent procs calling with "comm_self". All our tests to date have had one parent proc (rank=0) doing the spawn. 🤷♂️ Just noting the difference. Probably not something I'll further pursue, as the plan is to remove the pub/lookup operation anyway.
Interesting - I decided to add a check for NULL return of the lookup data so this test wouldn't segfault, and found that it runs perfectly fine with the head of the PMIx/PRRTE master branches and with […].

I then found that I can make the entire thing work without oversubscription if I simply add a 1 second delay in the spawn loop.

Someone who cares could investigate the root cause of the NULL return. It could be in OPAL's pmix_base_exchange function or in the PRRTE data server. There is a timeout in there, so it could be that stress causes the timeout to fire - or maybe we don't clean up a timeout event fast enough and it erroneously fires. 🤷♂️
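For illustration only, the shape of the guard described above might look like the sketch below. This is not the actual OMPI dpm or PRRTE data-server code; the types and names (`lookup_result_t`, `lookup_parent_port`) are placeholders.

```c
/* Illustrative sketch only: placeholder types and names, not real OMPI/PRRTE
 * code.  The point is the guard: treat a NULL lookup result as an error
 * instead of dereferencing it and segfaulting. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char *port_string;   /* hypothetical payload returned by a lookup */
} lookup_result_t;

/* Hypothetical lookup that may return NULL under load, e.g. if a timeout
 * fires in the data server before the publisher's data arrives. */
static lookup_result_t *lookup_parent_port(const char *key)
{
    (void)key;
    return NULL;
}

int main(void)
{
    lookup_result_t *res = lookup_parent_port("parent-port");
    if (NULL == res || NULL == res->port_string) {
        fprintf(stderr, "lookup returned no data; bailing out gracefully\n");
        return EXIT_FAILURE;   /* instead of crashing on a NULL pointer */
    }
    printf("connecting via %s\n", res->port_string);
    free(res->port_string);
    free(res);
    return EXIT_SUCCESS;
}
```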
As the title says, we've been seeing some mpi4py CI failures on `main` and `v5.0.x` recently.

C reproducer
I've managed to reproduce the spawn test failures locally on my mac. The problem is that they're non-deterministic. 🙁
I've written a short C reproducer. It only seems to trip the error — sometimes! — when we run a bunch of Comm spawns in a single process.
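The reproducer's source didn't survive this page capture; a minimal sketch along the same lines (the iteration count, the use of `argv[0]` as the spawn target, and all names are assumptions, not the original code) would be:

```c
/* Minimal sketch of a spawn-loop reproducer; not the original source from
 * this issue.  Each parent rank repeatedly spawns a single child on
 * MPI_COMM_SELF and disconnects, mirroring the mpi4py CommSpawn tests. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent != MPI_COMM_NULL) {
        /* We are a spawned child: just disconnect and exit. */
        MPI_Comm_disconnect(&parent);
    } else {
        /* Parent: every rank spawns on MPI_COMM_SELF in a loop. */
        for (int i = 0; i < 16; i++) {
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                           0, MPI_COMM_SELF, &intercomm,
                           MPI_ERRCODES_IGNORE);
            MPI_Comm_disconnect(&intercomm);
        }
    }

    MPI_Finalize();
    return 0;
}
```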
Compile and run it with:
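For example (the exact command line from the issue wasn't preserved; the filenames and process count below are assumptions):

```sh
mpicc -o spawn_loop spawn_loop.c
mpiexec -n 2 ./spawn_loop
```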
If I run this a few times, it will definitely fail at least once.
Supplemental detail
Sometimes the mpi4py tests all succeed (!). Sometimes one of the spawn tests randomly fails.
If you want to see the failure in the original mpi4py test suite, the good news is that there is a pytest command to rapidly re-run just the spawn tests. I find that this command fails once every several iterations:
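The exact invocation wasn't captured here, but something along these lines (the test path and `mpiexec` options are assumptions) matches the description:

```sh
mpiexec -n 1 python -m pytest test/ -k CommSpawn
```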
The `-k CommSpawn` is the selector: it runs any test that includes CommSpawn in the name (I think it's case sensitive...?). This ends up being only 16 tests (out of the entire mpi4py test suite), and when it succeeds, it only takes 2-3 seconds.

Here's a sample output from an mpi4py test that fails (it's not always this test):