Collective message failure with PSM2 #60
Patrick, you never replied to my last email. Are you using CUDA? Also - please post a sample command line showing how you are exactly running this test case, showing successful and unsuccessful outputs and exactly what version of PSM2 you are using. Intel never directly supported Debian and neither does Cornelis Networks but I will take another look when I get a chance. |
Hi Michael,
sorry for not answering this question. I have run so many tests these
last few days, some of them with the admins of this cluster, that I am
only now replying with these details. They also asked me to open this issue.
But no, I do not use CUDA; these nodes do not have any GPU or GPGPU.
I have also spent some time downgrading the CFD code to use point-to-point
communications instead of mpi_alltoallw global communications. It is less
efficient, but the code is usable again on this supercomputer.
|
About this test:
No GPU on the node, no multithreaded implementation (just MPI). About PSM2 versions: I installed from GitHub. It seems to be the same as the OS version, as there are no recent commits. The COMMIT file contains: 30c52a0fd155774e18cc06328a1ba83c2a6a8104. For the OS-provided libraries (also tested):
Notice that I also have the bug with the OS-deployed openmpi-gcc-hfi. Patrick |
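As a quick sanity check, something along these lines can confirm which libpsm2 an Open MPI build actually resolves at runtime (a rough sketch; $OMPI_PREFIX is a placeholder for the Open MPI install prefix, and libpsm2 is normally pulled in through Open MPI's PSM2 MTL component):

    # Version of the distribution-provided package on Debian
    dpkg -l | grep -i psm2

    # Library resolved by Open MPI's PSM2 component
    # ($OMPI_PREFIX stands for the Open MPI install prefix)
    ldd $OMPI_PREFIX/lib/openmpi/mca_mtl_psm2.so | grep libpsm2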
Okay. I'll try to look at this today. You're running 4 ranks on only one host? |
Yes, it is enough to show the problem. And the problem size is also very small: in the main program it is set to 5x7 points so the problem can easily be tracked with a debugger. Larger dimensions also show the problem. |
I can reproduce your issue on our production platform with all of the OpenMPIs I tried (2.1.2, 3.1.2 and 4.0.5). Our system is CentOS7 based with libpsm2-11.2.78-1.el7. |
Hi Peter |
Note that I said I COULD reproduce it. In fact I could not make it run successfully with OpenMPI and PSM2 in any way. It did run ok without PSM2 or with IntelMPI on PSM2. This however does not guarantee that the code is correct (I've not had time to analyze it myself). |
I agree, even such a small code may have a bug inside... even after my thorough checks with gdb. |
Well, I'm just a systems expert who read your thread on openmpi-users and thought I'd help you out by contributing my testing results. Also, it's not until now that I realized you actually meant alltoallw (not a typo for alltoallv). I can imagine that being bugged without people noticing. In fact, this is the first time I've heard of an application that uses it (I'm sure there are examples I've missed, though). Anyway, alltoallw is not very common and probably sees very limited testing... edit: yes, my IntelMPI test was using the same PSM2. One can do "export PSM2_IDENTIFY=1" before mpirun to get runtime info on what is used. |
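A minimal sketch of that check (the binary name is a placeholder; -x simply forwards the variable to all ranks under Open MPI):

    # Ask PSM2 to print its identification (library version, build) at startup
    export PSM2_IDENTIFY=1
    mpirun -np 4 -x PSM2_IDENTIFY ./alltoallw_test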
Up until our spin-off from Intel, yes, they did. I'm not sure what their plans are now.
|
Patrick, I just tried your DEBUG package on my machines and I did get your error when I used PSM2 - but I got the same error when I used verbs, so I still don't know that this is a PSM2 issue. Here's what I did:
|
I get the same failure when using OFI sockets: ...
... |
I haven't used Fortran in ~20 years so I'm having trouble reading your sample app. What is the largest chunk of data that you send at one time? |
Well, that ruins that idea. Many transports have a maximum message size but don't enforce it, leading to data corruption - but you'd have to be sending 2 gigabytes or more in a single message for this to become a factor for PSM2. |
Patrick, I'm going to continue to look at this when I can - but since I get the same error with verbs and with sockets, I really think you should move this to the OMPI repo. |
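For comparison runs like the ones described above, transport selection under Open MPI is typically done with MCA parameters; a rough sketch, not the exact commands used in this thread (component names vary between Open MPI releases, and the binary name is a placeholder):

    # Omni-Path via the PSM2 MTL
    mpirun -np 4 --mca pml cm --mca mtl psm2 ./alltoallw_test

    # InfiniBand/verbs via the (deprecated) openib BTL
    mpirun -np 4 --mca pml ob1 --mca btl self,vader,openib ./alltoallw_test

    # OFI with the sockets provider
    mpirun -np 4 --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include sockets ./alltoallw_test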
Thanks, Michael, for your help. I'll open an issue on OMPI soon; this week I have a lot of teaching hours, so maybe at the end of the week. I will also point to this discussion. |
Hi,
for several weeks I have been trying to understand a problem (wrong behavior) with Fortran MPI_ALLTOALLW calls. The problem only occurs on a Debian supercomputer that uses this opa-psm2 library for its Omni-Path architecture. I, and two OpenMPI developers, have tested many other architectures (Intel or AMD CPUs, with Ethernet, Omni-Path, or InfiniBand networks, running RedHat or SUSE OS). The problem does not occur in any of these tests. Moreover, if I build OpenMPI on the Debian computer with the --without-psm2 flag, the problem does not occur, but Omni-Path performance is not reached.
I'm building OpenMPI 4.0.5 with gcc 6.3 or gcc 10.2 (same behavior)
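The two builds being compared look roughly like this (a sketch only; the prefixes are placeholders, while --with-psm2/--without-psm2 are the standard Open MPI configure switches):

    # Build without PSM2 (works around the problem, but loses Omni-Path performance)
    ./configure --without-psm2 --prefix=$HOME/ompi-no-psm2 && make -j && make install

    # Build against the installed libpsm2 (the configuration that shows the bug)
    ./configure --with-psm2=/usr --prefix=$HOME/ompi-psm2 && make -j && make install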
Please find attached a really small test case showing the problem. If all runs fine, it prints "Test pass!"; otherwise it shows the wrong values and calls mpi_abort().
To run this test:
Patrick
DEBUG.tar.gz