WeeklyTelcon_20220607
Geoffrey Paulsen edited this page Jun 7, 2022
- Dialup Info: (Do not post to public mailing list or public wiki)
- Austen Lauria (IBM)
- Brendan Cunningham (Cornelis Networks)
- Brian Barrett (AWS)
- Christoph Niethammer (HLRS)
- Edgar Gabriel (UoH)
- Geoffrey Paulsen (IBM)
- Jeff Squyres (Cisco)
- Joseph Schuchart
- Josh Fisher (Cornelis Networks)
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Thomas Naughton (ORNL)
- Todd Kordenbrock (Sandia)
- Tommy Janjusic (nVidia)
- William Zhang (AWS)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- David Bernhold (ORNL)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Harumi Kuno (HPE)
- Hessam Mirsadeghi (UCX/nVidia)
- Howard Pritchard (LANL)
- Joshua Ladd (nVidia)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Raghu Raja (AWS)
- Ralph Castain (Intel)
- Sam Gutierrez (LLNL)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Xin Zhao (nVidia)
- v4.1.4 Released!
- A dozen bugfixes
- UCC backported
- v4.1.5
- Schedule: targeting ~6 months (Nov 1)
- No driver on schedule yet.
- A couple of critical new issues.
- Issue 10437 - blocker even for the next RC.
- Issue 10435 - a regression from v4.1
- Progress being made on missing Sessions symbols.
- Looking for coll_han tuning runs
- Joseph is planning to do runs, though it might not be next week
- Tommy is working on
- Also on Brendan and
- Thomas Naughton also
- `main` and `v5.0.x` should be the same, use either
- Call to PRRTE / PMIx
- Longest pole in the tent right now.
- If you want OMPI v5.0 released in the near-ish future, please scare up some resources
- Use PRRTE `critical` and `Target v2.1` labels for issues.
- Thomas did testing on the latest PRRTE (not submodule pointers)
- Ralph pulled in a larger PR that seemed to fix things.
- Schedule:
- Blockers are still the same.
- PRRTE blocker -
- Right now looking like late summer (due to us not having a PRRTE release for packagers to package)
- Call for help - If anyone has resources to help, we can move this release date much sooner.
- Requires investment from us.
- Blockers are listed. Some are in the PRRTE project.
- Any Alternatives?
- The problem for Open MPI is not that PRRTE isn't ready to release. The parts we use work great, but other parts still have issues (namely DVM).
- Because we install PMIx and PRRTE as if they came from their own tarballs.
- This leaves Packagers no good way to distribute Open MPI.
- How do we install PMIx and PRRTE in `open-mpi/lib` instead and get all of the `rpaths` correct?
- This might be the best bet (aside from fixing PRRTE sources, of course)
- Several backported PRs
- New issue opened on performance when oversubscribed.
- New issue: topology issues when mapping by L3 cache.
- Please HELP!
- Performance test default selection of Tuned vs HAN
- Brian hasn't had (and might not have for a while) time to send out instructions on how to test.
- Can anyone send out these instructions?
- Call for folks to performance test at 16 nodes, and at whatever "makes sense" for them.
- Accelerator stuff that William is working on, should be able to get out of draft.
- Edgar has been working on the ROCm component of the framework
- Post v5.0.0? Originally the answer was that it shouldn't go in, since the release was close, but if the release slips to end of summer, we'll see ...
- Edgar finished the ROCm component... appears to be working.
- William or Brian can comment on how close to merge to `main`.
- William is working on btl sm_cuda and rcache code. Could maybe merge at the end of this week.
- Tommy was going to get some nVidia people to review / test.
- Discussion on `btl sm_cuda`
- It used to be a cloned copy of `sm`, but of the older `sm` component, not `vader`, which was renamed to `sm`.
- Might be time to drop `btl sm_cuda`?
- vader component does not have hooks to the new framework.
- Uses where `btl sm_cuda` might get used today would be:
- TCP path would use this for on-node
- Node without UCX
- even one-sided would not end up using `btl sm_cuda`.
- v5.0.0 would be a good time to remove this.
- Being based on the old `sm` is a big detractor.
- Can we ALSO remove `rcache`? Unclear.
- What's the status of the accelerator branch on the v5.0.x branch?
- PR is just to `main`.
- We said we could do a backport, but that would be after it gets merged to `main`
- If v5.0.0 is still a month out, is that enough time?
- v5.0.0 is lurking closer.
- This is a BIG chunk of code...
- But if v5.0.0 delays longer... this would be good to get in.
- Answer is largely dependent on PMIx and PRRTE.
- Also has implications on OMPI-next?
- Can anyone who understands packaging review: https://github.com/open-mpi/ompi/pull/10386 ?
- Automate 3rd-party minimum version checks into a text file, so that both configure and the docs could read from a common file.
- `conf.py` runs at the beginning of Sphinx and could read in files, etc.
- Still iterating on.
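
A minimal sketch of the common-file idea discussed above, not the approach taken in the PR. It assumes a hypothetical `name=version` text format and shows how a Sphinx `conf.py` could read it and expose each entry as an RST substitution through Sphinx's `rst_prolog` setting. The helper names, the substitution naming scheme, and the version numbers are all illustrative assumptions.

```python
import io


def parse_min_versions(stream):
    """Parse 'name=version' lines, ignoring blank lines and '#' comments."""
    versions = {}
    for lineno, raw in enumerate(stream, start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if "=" not in line:
            raise ValueError(f"line {lineno}: expected name=version, got {raw!r}")
        name, version = (part.strip() for part in line.split("=", 1))
        versions[name] = version
    return versions


def rst_substitutions(versions):
    """Build RST substitution definitions, e.g. |pmix_min_version|."""
    lines = [
        f".. |{name}_min_version| replace:: {version}"
        for name, version in sorted(versions.items())
    ]
    return "\n".join(lines) + "\n"


# Hypothetical file contents; the version numbers are placeholders only.
SAMPLE = """\
# package=minimum version
pmix=1.2.3
prrte = 4.5.6
"""

# In a real conf.py this would be open("<path to the shared file>") instead.
versions = parse_min_versions(io.StringIO(SAMPLE))

# rst_prolog is prepended to every RST source file that Sphinx reads.
rst_prolog = rst_substitutions(versions)
```

With something like this in place, the docs could write `|pmix_min_version|` wherever the minimum version is mentioned, and configure could parse the same file with a few lines of shell.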
- https://github.com/open-mpi/ompi/pull/8941
- Like to get this in, or close it
- Geoff will send an email to George to ask him to review.
- What are companies thinking about travel?
- Wiki for face to face: https://github.com/open-mpi/ompi/wiki/Meeting-2022
- Should think about schedule, location, and topics.
- Some new topics added this week. Please consider adding more topics.
- MPI Forum was virtual
- The next one, EuroMPI, will be hybrid.
- Plan to continue being hybrid with 1-2 meetings / year.