
mpi4py CI failures on main, v5.0.x #12940

Open · jsquyres opened this issue Nov 27, 2024 · 7 comments

@jsquyres (Member)

As the title says, we've been seeing some mpi4py CI failures on main and v5.0.x recently.

C reproducer

I've managed to reproduce the spawn test failures locally on my Mac. The problem is that they're non-deterministic. 🙁

I've written a short C reproducer. It only seems to trip the error — sometimes! — when we run a bunch of Comm spawns in a single process.

// C version of an mpi4py test, blatantly stolen and converted to C
// from
// https://github.com/mpi4py/mpi4py/blob/master/test/test_spawn.py#L205-L217

#include <stdio.h>
#include <mpi.h>

void do_child(MPI_Comm parent)
{
    MPI_Barrier(parent);
    MPI_Comm_disconnect(&parent);
}

void do_parent(char *argv[])
{
    enum { count = 3 };   /* compile-time constant: an initialized array cannot be a VLA */
    char *commands[count] = { argv[0], argv[0], argv[0] };
    int maxprocs[3] = { 1, 1, 1 };
    MPI_Comm child;
    int errcodes[3];
    MPI_Info infos[] = { MPI_INFO_NULL, MPI_INFO_NULL, MPI_INFO_NULL };
    MPI_Comm_spawn_multiple(count, commands, MPI_ARGVS_NULL,
                            maxprocs, infos, 0,
                            MPI_COMM_SELF, &child,
                            errcodes);
    
    int local_size, remote_size;
    MPI_Comm_size(child, &local_size);
    MPI_Comm_remote_size(child, &remote_size);
    MPI_Barrier(child);
    MPI_Comm_disconnect(&child);
    MPI_Barrier(MPI_COMM_SELF);

    if (local_size != 1) {
        printf("WARNING: local_size == %d, expected 1\n", local_size);
    }
    if (remote_size != count) {
        printf("WARNING: remote_size == %d, expected %d\n",
               remote_size, count);
    }
}

int main(int argc, char* argv[])
{
    MPI_Init(NULL, NULL);
    MPI_Barrier(MPI_COMM_SELF);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        for (int i = 0; i < 32; ++i) {
            do_parent(argv);
        }
    } else {
        do_child(parent);
    }

    MPI_Barrier(MPI_COMM_SELF);
    MPI_Finalize();
    return 0;
}

Compile and run it with:

mpicc -g mpi4py-comm-spawn-defaults1.c -o mcsd
mpirun --mca rmaps_default_mapping_policy :oversubscribe -n 2 mcsd

If I run this a few times, it will definitely fail at least once.

Supplemental detail

Sometimes the mpi4py tests all succeed (!). Sometimes one of the spawn tests randomly fails.

If you want to see the failure in the original mpi4py test suite, the good news is that there is a command to rapidly re-run just the spawn tests. I find that this command fails once every several iterations:

mpirun --mca rmaps_default_mapping_policy :oversubscribe -n 2 python3 test/main.py -v -f -k CommSpawn

The -k CommSpawn is the selector — it runs any test that includes CommSpawn in the name (I think it's case sensitive...?). This ends up only being 16 tests (out of the entire mpi4py test suite) and when it succeeds, it only takes 2-3 seconds.

Here's a sample output from an mpi4py test that fails (it's not always this test):

testCommSpawnDefaults1 (test_spawn.TestSpawnMultipleSelfMany.testCommSpawnDefaults1) ... [JSQUYRES-M-4LRP:00000] *** An error occurred in Socket closed
[JSQUYRES-M-4LRP:00000] *** reported by process [1182269441,0]
[JSQUYRES-M-4LRP:00000] *** on a NULL communicator
[JSQUYRES-M-4LRP:00000] *** Unknown error
[JSQUYRES-M-4LRP:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[JSQUYRES-M-4LRP:00000] ***    and MPI will try to terminate your MPI job as well)
@jsquyres (Member, Author) commented Nov 27, 2024

It is worth noting that the test program is spawning on MPI_COMM_SELF -- meaning that each of the 2 processes in MPI_COMM_WORLD is calling spawn_multiple (i.e., each launching 3 MPI processes). I have 12 cores, so running 8 processes is still undersubscribing my laptop, but it does indicate a lot of control messages flying around to set up and tear down two sets of 3 processes. Indeed, if I change the test to spawn on MPI_COMM_WORLD, the test fails far less frequently (i.e., fewer control messages flowing around). I don't know if this means anything, but it is an interesting datapoint.
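
For reference, the MPI_COMM_WORLD variant mentioned above amounts to changing the communicator argument of the spawn call in the reproducer. A minimal sketch (not necessarily the exact change that was tested); the size checks in do_parent would also need adjusting, since the local group of the intercommunicator then contains both parents:

/* Sketch: one collective spawn over MPI_COMM_WORLD instead of per-rank spawns
 * on MPI_COMM_SELF. Both parent ranks call this; only the root's (rank 0's)
 * commands/maxprocs/infos arguments are significant, so a single set of 3
 * children is created instead of one set per parent. */
MPI_Comm_spawn_multiple(count, commands, MPI_ARGVS_NULL,
                        maxprocs, infos, 0,
                        MPI_COMM_WORLD, &child,
                        errcodes);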

The error message is actually coming from the TCP BTL: https://github.com/open-mpi/ompi/blob/main/opal/mca/btl/tcp/btl_tcp_endpoint.c#L588-L590

This suggests that there might be some kind of race condition happening in the BTL disconnect teardown.

@hppritcha (Member)

I would first suggest reverting #12920 and seeing if that helps. Looking at our merged PRs, that one was merged in with an mpi4py failure.

@rhc54 (Contributor) commented Nov 27, 2024

Look at the failure - it had nothing to do with that PR. You'll also have seen other PRs failing mpi4py at that time, with these same random failures - one was just fixing a typo in the docs. Still, you are welcome to revert and try again.

@hppritcha (Member)

I don't know, just look at

https://github.com/open-mpi/ompi/pulls?q=is%3Apr+is%3Aclosed

and the one PR that jumps out as having been merged with mpi4py failures on main is #12920

@rhc54 (Contributor) commented Nov 27, 2024

Correct - we didn't merge the others. They are still sitting there with errors, going nowhere.

@rhc54 (Contributor) commented Nov 27, 2024

FWIW: I cannot get that test to run at all with head of PMIx/PRRTE master branches - it immediately segfaults with bad returned data from the "lookup" operation. If I change MPI_COMM_SELF to MPI_COMM_WORLD, then everything runs fine - repeatedly.

Can't comment on the correctness of the test - but I suspect OMPI's dpm code really cannot handle the scenario of all parent procs calling with "comm_self". All our tests to date have had one parent proc (rank=0) doing the spawn.

🤷‍♂️ Just noting the difference. Probably not something I'll further pursue as the plan is to remove the pub/lookup operation anyway.

@rhc54 (Contributor) commented Nov 28, 2024

Interesting - decided to add a check for NULL return of the lookup data so this test wouldn't segfault and found that it runs perfectly fine with head of PMIx/PRRTE master branches and with MPI_COMM_SELF...so long as I allow oversubscription! If I don't explicitly permit oversubscription, then I hit the "NULL return" error that would otherwise cause the segfault.

I then found that I can make the entire thing work without oversubscription if I simply add a 1 second delay in the loop over do_parent. So this appears to just be a case of overwhelming the pub/lookup code path such that it winds up returning a NULL, and not having adequate error detection to deal with that case.
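
For concreteness, the delay described here corresponds to something like the following change to the loop in the reproducer's main() (a sketch, assuming a plain sleep(1) between iterations):

/* In main(), with #include <unistd.h> added for sleep(): */
for (int i = 0; i < 32; ++i) {
    do_parent(argv);
    sleep(1);   /* throttle back-to-back spawn/disconnect cycles */
}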

Someone who cares could investigate the root cause of the NULL return. Could be in OPAL's pmix_base_exchange function or in the PRRTE data server. There is a timeout in there, so it could be that stress causes the timeout to fire - or maybe we don't clean up a timeout event fast enough and it erroneously fires. 🤷‍♂️
