mpi4py: Regression in GitHub Actions #12857
@hppritcha Maybe this issue comes from #12853?
Highly likely that's the case.
Unsure exactly what was included in that update, but I was immediately suspicious when that PR was filed because it did not include the corresponding update to the PMIx submodule. As previously noted, the two are tightly coupled - so you should never update one without updating the other. Might not be the cause here, but it is the first thing I would try.
@hppritcha Please let me know once you open a PR. I would like to suggest updating the configuration of oversubscription in this repo, such that it is in sync with what I do in mpi4py. I assume that OMPI wants all configuration in $HOME/.openmpi/mca-params.conf honored, including that of PRRTE. Am I right?
Hmm, well, actually this is not a bug. See https://docs.open-mpi.org/en/v5.0.x/mca.html#mca-parameter-changes-between-open-mpi-4-x-and-5-x - you'll need to add the renamed v5-style parameter described there instead. The OMPI mpi4py GitHub action is kind of misleading here, so I'll fix it to not use that deprecated MCA param in the $HOME/.openmpi/mca-params.conf file.
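For illustration, a sketch of the rename as I read the linked v5.0.x docs; the exact parameter names should be verified against that page rather than taken from this comment:

```
# Old Open MPI v4.x-style setting in $HOME/.openmpi/mca-params.conf
# (deprecated -- no longer honored by v5.x / PRRTE):
rmaps_base_oversubscribe = true

# Replacement mapping-policy form understood by v5.x / PRRTE:
rmaps_default_mapping_policy = :oversubscribe
```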
@hppritcha I suppose this opens the question I noted on your PR. We could register synonyms for the OMPI v4 form of the PRRTE params. Do we want to do so? It would be a little bit of grunge work, but nothing all that bad. It may not be possible for all params, as they may not have a corresponding OMPI v5 version - or the values may conflict so that PRRTE doesn't understand them. But we could pick up a number of them, and certainly the "oversubscribe" one should be doable. I've added a check to ensure we are actually processing the OMPI default param files - pretty sure we are, but I'll check. Just need to finish something else first.
So the OMPI documentation @hppritcha pointed to tells users to use THREE different locations to configure Open MPI operation. Are you guys sure you want to put that complexity on users? I must add a clarification about the purpose of that line in my configuration: until a few days ago, oversubscription seemed to work with all branches, and as of today it is still working in v5.0.x. From what @rhc54 said, I understand that OMPI v5.0.x and main simply ignore the old v4.x setting (is that correct?).
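For reference, my understanding of the three per-user files being discussed; the first two paths appear elsewhere in this thread, and the PMIx one is my assumption by analogy, so please verify it locally:

```
$HOME/.openmpi/mca-params.conf   # Open MPI params
$HOME/.prte/mca-params.conf      # PRRTE params
$HOME/.pmix/mca-params.conf      # PMIx params (path assumed, not verified)
```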
Guys, I have a somewhat explicit question. Are PRRTE and PMIx (when used via/with Open MPI) supposed to read configuration from OMPI's config file?
I have said several times that I will investigate it - just not a high priority at the moment when compared with other things.
I see @dalcinl. I'll take a further look at what may have caused this change in behavior on our main branch.
So here is the problem. The change came about because of an OMPI-reported issue regarding passing of MCA params. We got that fixed. However, what it exposed was that we were overlaying processing of OMPI, PRRTE, and PMIx params on top of each other, thereby causing confusion. Long story short: each project parses its own default param file during its own init, before the other layers can hand values across. This is why we have three param files - to ensure that the settings are applied before anything happens in their respective projects. I personally don't see a simple way around that one - I suppose you could add param file parsing in OMPI's mpirun.
From a user perspective this is mind-blowingly complicated and cumbersome. And I'm trying to be polite.
I actually agree - just waiting for someone to propose a realistic solution. Frankly, the whole MCA param thing has been a considerable source of confusion to users over the years - way too many params, far too detailed knowledge of the code base required to really use them. Adding knowledge of which ones to set where does increase that burden.

To recap the problem a bit, in case someone does want to come up with a proposal: imagine the case where someone puts a PRRTE-related value in the OMPI param file, and still has that same param in the PRRTE param file. What precedence should be applied to determining the final value? Which one "wins"? We could collapse to a single param file, passing its name in the environment to override the project default (e.g., modify OMPI's "mpirun" to pass the name/path to the OMPI param file) - but we'd have to use that name to override looking for any of the other files. Not sure how confusing that will get. If I put something in a PMIx param file, then suddenly that setting disappears because I ran OMPI's "mpirun"? So I'm open to thoughts - just unsure we really have a "right" answer here.
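To make the precedence question concrete, here is a purely illustrative Python sketch. The file contents, parameter name, and merge rule are invented for the example; nothing here reflects actual OMPI or PRRTE code:

```python
# Sketch of the precedence ambiguity: the same parameter set in two
# project-level files. Which value should "win" is exactly the open
# question in this thread.

def merge_params(*files):
    """Merge param dicts with a last-file-wins rule -- one possible
    (but debatable) precedence choice."""
    merged = {}
    for params in files:
        merged.update(params)
    return merged

# e.g. the user sets oversubscription in the OMPI file...
ompi_file = {"rmaps_default_mapping_policy": ":oversubscribe"}
# ...and a conflicting value lingers in the PRRTE file.
prte_file = {"rmaps_default_mapping_policy": ":nooversubscribe"}

# Reversing the file order silently flips the final value:
print(merge_params(ompi_file, prte_file))  # PRRTE file wins
print(merge_params(prte_file, ompi_file))  # OMPI file wins
```

Whatever order is chosen, the user who only edited one of the files gets a surprise, which is the crux of the complaint above.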
Afraid that won't work @hppritcha - PRRTE uses the PMIx MCA base. You would have to convert PRRTE to use the OPAL MCA base, which means pulling PRRTE completely inside OMPI a la ORTE... which forces a near-complete rewrite of PRRTE (e.g., every framework would need a redo, you'd have to convert to the OPAL object system, etc.). At that point, you might as well go back to ORTE and start from there.
git bisect and code inspection show that the behavior changed when we merged the commit from openpmix/prrte#1999 into our fork of PRRTE. Not sure whether to consider this a bug or a feature at this point. I guess at a minimum we should document the need to put PRRTE MCA params in $HOME/.prte/mca-params.conf when not using environment variables or the command line.
Yeah, the problem lies in the required conversion - you need to identify that a param actually belongs to PRRTE so you can correctly prefix it. However, PRRTE doesn't know we are using the OMPI personality, and therefore doesn't know we should read the OMPI param file(s), until after we have populated the MCA system. I can investigate perhaps doing another parse at the time of selecting the ompi/schizo component - that might be in time for some things, but some params will already have taken effect by then. Have to see how the MCA system would deal with duplicate parsing.
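A toy sketch of the classify-then-prefix step being described. The framework list and the `prte_` prefix are my assumptions for illustration, not PRRTE's actual tables or naming rules:

```python
# Hypothetical illustration: to route a param from the OMPI file into
# PRRTE's MCA base, first recognize that it belongs to a PRRTE
# framework, then prefix it; anything else is left for OMPI.

PRTE_FRAMEWORKS = {"rmaps", "ras", "plm", "odls"}  # partial, illustrative

def to_prte_name(param):
    """Return the PRRTE-prefixed name, or None if not a PRRTE param."""
    framework = param.split("_", 1)[0]
    if framework in PRTE_FRAMEWORKS:
        return "prte_" + param   # belongs to a PRRTE framework
    return None                  # an OMPI-side param: leave it alone

print(to_prte_name("rmaps_default_mapping_policy"))  # prte_rmaps_default_mapping_policy
print(to_prte_name("btl_tcp_if_include"))            # None
```

The hard part in the real code, as the comment notes, is that this classification can only happen once the OMPI personality is known, which is after the MCA system has already been populated.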
Hmm... well, our prte could be made to know it's doing the ompi personality.
I think you'll find things are more complicated than you realize if/when you try! However, you may find it a valuable learning experience. I have fixed this issue in the referenced PR. As noted in the commit comment, it isn't a perfect solution - however, we do identify the OMPI personality fairly early in the init process, and so most (maybe the vast majority?) of params of interest to users will be caught by it. Per the commit message:
This really needs to be documented, as someone is bound to spend oodles of hours trying to understand why the value they specified somewhere isn't working. You should also note that we still have lost all info as to the source of a param's value - still no idea if/how one might recover that info.
we ignore pmix mca params in the file now as that's being treated as a separate piece of software that we use as clients.

related to open-mpi/ompi#12857

Signed-off-by: Howard Pritchard <[email protected]>
It seems to me that all these issues arise from the fact that we persist at separating the MCA for the different layers. This also forces us to prepend every MCA param with a repeated project prefix (as in @rhc54's PR). I propose we do the opposite: MCA files are read once (by the different projects, in the context of mpirun or a batch plugin) and appended, or prepended, to a single environment variable. No wasted space for tagging the project, compact and clean, easy to print, and all MCA-based frameworks see the same thing. The downside is a replication of unnecessary params in processes and projects that don't need them, but as long as there is no name collision (and we know that's not the case) the wasted space will be ridiculously small compared with our current memory footprint for all the MCA variables.
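A minimal sketch of the single-environment-variable idea, assuming each project appends its parsed params to one shared variable. The variable name and the separator are made up for illustration:

```python
# Hypothetical illustration of the proposal: every project's param file
# is parsed once and appended to one shared environment variable, so
# all MCA-based layers see the same settings.
import os

def append_params(envvar, params):
    """Append key=value pairs to a single ';'-separated env variable."""
    chunk = ";".join(f"{k}={v}" for k, v in params.items())
    existing = os.environ.get(envvar)
    os.environ[envvar] = f"{existing};{chunk}" if existing else chunk

# e.g. OMPI contributes its params, then PMIx contributes its own:
append_params("MCA_ALL_PARAMS", {"rmaps_default_mapping_policy": ":oversubscribe"})
append_params("MCA_ALL_PARAMS", {"pmix_ptl_base_verbose": "1"})
print(os.environ["MCA_ALL_PARAMS"])
```

Unknown keys would simply be ignored by projects that don't recognize them, which is where the "replicated but small" overhead mentioned above comes from.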
Didn't realize I had the power to close an issue over here with a commit in upstream PRRTE, so I'll reopen it. I have no immediate objection to your proposal. My only initial concerns would be:
mpi4py's own nightly CI testing on GHA has been failing with ompi@main for the last three days.
However, I understand that OMPI's own CI had not failed, otherwise you would not have merged in the regression.
These are the full logs from the last failed run:
https://github.com/mpi4py/mpi4py-testing/actions/runs/11310080041/job/31454714552
This is the specific error; note the message prte-rmaps-base:all-available-resources-used in the output.

On closer inspection, I noticed a difference between the mpi4py and open-mpi configurations:
mpi4py configures oversubscription via $HOME/.openmpi/mca-params.conf:
https://github.com/mpi4py/mpi4py-testing/blob/master/.github/workflows/openmpi.yml#L101

Open MPI configures oversubscription in both $HOME/.openmpi/mca-params.conf and $HOME/.prte/mca-params.conf:
https://github.com/open-mpi/ompi/blob/main/.github/workflows/ompi_mpi4py.yaml#L80
Looks like something has changed recently, and oversubscription settings in $HOME/.openmpi/mca-params.conf are being ignored. Was this change intentional, or is it a regression?