WeeklyTelcon_20200225
Geoffrey Paulsen edited this page Feb 25, 2020
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoffrey Paulsen (IBM)
- Jeff Squyres (Cisco)
- Akshay Venkatesh (NVIDIA)
- Austen Lauria (IBM)
- Brendan Cunningham (Intel)
- Edgar Gabriel (UH)
- Harumi Kuno (HPE)
- Howard Pritchard (LANL)
- Joseph Schuchart
- Josh Hursey (IBM)
- Joshua Ladd (Mellanox)
- Ralph Castain (Intel)
- Thomas Naughton (ORNL)
- Michael Heinz (Intel)
- Noah Evans (Sandia)
- Scott Breyer (Sandia?)
- William Zhang (AWS)
- Artem Polyakov (Mellanox)
- Nathan Hjelm (Google)
- Todd Kordenbrock (Sandia)
- Charles Shereda (LLNL)
- David Bernhold (ORNL)
- George Bosilca (UTK)
- Matthew Dosanjh (Sandia)
- Brandon Yates (Intel)
- Erik Zeiske
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Xin Zhao (Mellanox)
- mohan (AWS)
-
MTT -
- If you change your MTT to start up PRRTE at the beginning of the session and just use prun, you can see times cut in half or more.
- This is good, but also need to test mpirun wrapper.
- Cisco is converting half of MPI installs to use prrte/prun
-
AWS, which can scale out horizontally, will continue to do both.
-
PRRTE Transition:
- ORTE is gone, PRRTE is in its place. Expect some hiccups.
- A bunch of MTT failures, because people need to update for the command line changes (`-` vs `--` in command line prompts).
- A number of Fortran failures that don't make much sense.
-
IBM MTT is hitting an IOF issue, where a file descriptor shuts down and libevent spins hard.
-
PRRTE - Josh turned on CI.
- Auto-labeler is not there yet. Still experimenting.
- Would like to get the OMPI side running the prte option.
- Whenever we move PMIX or PRTE submodule pointer, it'll label the PR.
-
Anyone can click the override-merge button.
- Hasn't been an issue, but remember this won't trigger PR based hooks.
- Still 1+ month of effort before Open MPI v5.0 could be ready with this.
- see: https://github.com/openpmix/prrte/issues/298 for additional mpirun launch items
-
OMPI master submodule pointers setup to track PMIx and PRRTE master.
- Hopefully long term, master can track release branches.
- But still ensure there's some regression tracking of master/master/master.
- But once things settle down, might not want everyone's masters tracking each other.
- But if we DON'T have master/master/master, then new features that span repos will be challenging.
- OMPI v5.0 might want to trigger a major revision of other dependencies (PMIx and PRRTE)?
- Minimum PMIx version for v5.0 will be PMIx v3.0
- Scheduled v5.0 for end of June
- Geoff Paulsen (with Austen Lauria) and Joshua Ladd will be release managers.
- prefixing of "mca" with "opal_mca"
- Do we dlopen MCA components in global or private scope?
- pretty sure there are tech reasons why this is the case, and if someone wanted to make this private, would need to document and make many
- Note on wiki for pull requests on master.
- Did we agree last week to REQUIRE Pull Requests on master?
- Think yes for "common" code, but maybe not for typos in components we own.
- Can't partially enable it per directory.
- Jeff will do this.
- Think we ended up saying yes for all to increase code visibility.
- CI working on parallelizing better, and coordinate
- hwloc, libevent, and pmix to go up to top level directory (like prrte) and out of mca.
- Big downside of being in mca is that they get slurped into opal.
- mca was a convenient location before.
- Might still be convenient because we still want to support external.
- Deleted C++ bindings, but haven't deleted MPI1 bindings yet.
- Scalapack hasn't released updated usage of MPI3 replacements.
- Configure changes to fatally error if user tries something no longer supported.
- removing sync
- SPI - parent organization
- Jeff is now primary, and Brian is secondary.
- Nathan is going to revive his aliasing PR.
- really remove SM
- somehow rename vader? I missed this item.
- Lock-bot / Stale-bot
- don't want to upset people. Want an open community.
- Good conversation. Pros didn't outweigh the cons.
- Not going to do either of these.
- Are encouraging the use of a "Draft" PR.
- A stronger version of WIP
- Common Conventions
- Command line options are really going to change.
- Deprecated notices for OMPI v5.0
- confusion about parameters on command line versus AMCA, will generate some errors.
- Josh is looking into some of this work as well
- SM Cuda BTL - Jeff talked to Akshay about this component.
- Want to primarily support CUDA via UCX.
- Still a few use-cases where want to support CUDA in BTL
- Akshay will discuss possibly integrating this feature to Vader Post v5.0
Blockers All Open Blockers
Review v3.0.x Milestones v3.0.6
Review v3.1.x Milestones v3.1.6
- Need to look at MTT to see if we have new regressions or not.
- Brian is out a few days.
Review v4.0.x Milestones v4.0.3
- v4.0.3 in the works.
- Config file needs to be committed to
- Merging in PRs
- Seeing some instability on github right now.
- Will merge 7462, 7463 after github.com comes back, leave others for v4.0.4
- Schedule: End of June
- May need doc updates for command line changes, both prrte, and mca changes.
- Failures in IBM suite for Inter-comm Timeouts.
- Since the PRRTE changeover.
- Comm Spawn calls connect/accept, but that's working.
- MTT build failures on master
- Passing the `--with-pmi` option
- Jeff will email William
- Maybe mid-late summer.
- scale-testing, PRs have to opt-into it.
Review Master Master Pull Requests
- PMIx v3.1.5 rc2 posted this week. Release should be Friday.
- CI testing only tests build and did it run, but doesn't test HOW it ran.
- Environment setup can be a bit different.
- For example, no permissions in `/tmp`. Might pass on one machine and fail on another without `/tmp` permissions.
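
The `/tmp` point above could be caught with a small pre-flight check run before a test session. This is a minimal sketch only, not part of MTT or the Open MPI CI. The function name `scratch_dir_usable` is hypothetical, and it assumes that being able to create and delete a file is a good enough proxy for "has `/tmp` permissions".

```python
import tempfile


def scratch_dir_usable(path="/tmp"):
    """Return True if a file can be created (and removed) inside `path`.

    Hypothetical helper for illustration; not an MTT or Open MPI API.
    """
    try:
        # Actually creating a file is a stronger check than inspecting
        # permission bits, which can mislead on ACL or network filesystems.
        # NamedTemporaryFile deletes the file again when the block exits.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        # Covers PermissionError, FileNotFoundError, read-only mounts, etc.
        return False
    return True


if __name__ == "__main__":
    status = "usable" if scratch_dir_usable("/tmp") else "NOT usable"
    print(f"/tmp: {status}")
```

Recording the result next to each run would make it easier to tell an environment difference from a real regression when the same test passes on one machine and fails on another.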