
WeeklyTelcon_20220524

Jeff Squyres edited this page May 24, 2022 · 3 revisions

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Jeff Squyres (Cisco)
  • Austen Lauria (IBM)
  • Brian Barrett (AWS)
  • David Bernholdt (ORNL)
  • Josh Fisher (Cornelis Networks)
  • Thomas Naughton (ORNL)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)
  • Hessam Mirsadeghi (UCX/NVIDIA)
  • Tommy Janjusic (NVIDIA)
  • George Bosilca (UTK)

not there today (I keep this for easy cut-n-paste for future notes)

  • Edgar Gabriel (AMD)
  • Howard Pritchard (LANL)
  • Joseph Schuchart (UTK)
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Geoffrey Paulsen (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (NVIDIA)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Christoph Niethammer (HLRS)
  • Harumi Kuno (HPE)
  • Joshua Ladd (NVIDIA)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Sam Gutierrez (LLNL)
  • Xin Zhao (NVIDIA)

v4.1.x

  • 4.1.4
    • Will do today!
      • Actually, after the call, we realized we forgot to merge a PR we intended to be included in v4.1.x. Will merge that and let MTT run on it tonight.
    • We'll do 4.1.5 whenever it becomes relevant
  • Open question about COMM_SPLIT_TYPE from user
    • George is investigating. Could be a PMIx issue...?
    • Need to investigate main/v5.0 after that.

v5.0.x

  • Will roll RC next week (after long weekend)
    • have had build fixes/bug fixes come in
    • 2 OMPI blockers:
      • ADAPT and HAN priorities
        • Setting the priorities is easy
        • The Bosilca paper shows really good results
        • EFA/ARM shows a slight improvement on short messages and a slight regression on large messages. This could be an EFA issue, but it could also be a main/5.0 issue introduced since the paper was written. Can someone -- who isn't EFA -- re-run the tests and confirm that we don't have a regression? Meaning: we have one data point that doesn't look good, but it's also not an entirely trustworthy data point. We need more data.
        • George: we were busy last week, sorry. :-(
        • Would even be good for others to run, too.
        • Point: tuned is pretty good when processes are mapped well. Adapt runs well all the time. Maybe try to run EFA with "poorly mapped" processes...?
        • IBM said that they would re-run with UCX. I'll run on a "handful" of nodes.
        • AWS ran on 16 or 32 nodes.
        • Joseph also volunteered to re-run.
        • NVIDIA: we'll run the tests, too.
        • The ask is to run OSU or IMB collective benchmarks.
        • David B. volunteered Tom Naughton.
        • Josh F. from Cornelis will run as well.
        • https://github.com/open-mpi/ompi/issues/10347 is the issue.
      • mpirun external dependencies
        • progress is being made, slowly.
    • Main blockers are PRTE and PMIx issues.
      • Need to fix PRTE 2.1.x blockers before there will be a PRTE release. These are not OMPI blockers, but they need to be fixed before a PRTE 2.1 release.
      • To be clear: there are PRTE 2.0.x releases.
        • We previously thought we would be able to use these for OMPI v5.0.x.
        • This has unfortunately turned out to not be the case -- there are new PRTE v2.1.x features that we really need for OMPI v5.0.x.
        • We really need community/people to help fix the PRTE 2.1.x issues so that we can get a PRTE 2.1.x release so that we can release OMPI v5.0.x.

Bottom line: we need resources to help with PRTE 2.1 release.

Per last week's discussion, we have now decided what the minimum versions are for OMPI v5.0.x:

  • PMIx 4.1
    • We initially said 4.0, but later in the call (after most people had dropped) this was updated to 4.1.
  • PRTE 2.1

There is a pending PR to change our configury.

NOTE: We should not set any public release to require a PMIx / PRTE version that does not exist.
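
As an illustration only, the minimum-version rule above (PMIx 4.1, PRTE 2.1) can be sketched as a simple comparison on major.minor. This is not the pending configury PR; the names `MINIMUMS`, `parse_version`, and `meets_minimum` are hypothetical and exist only for this sketch.

```python
# Minimums as stated in these notes; everything else here is an assumption.
MINIMUMS = {"PMIx": (4, 1), "PRTE": (2, 1)}


def parse_version(text):
    """Turn a string like '4.1.2' or '2.1.0rc1' into a tuple of ints.

    Parsing stops at the first component that does not start with a digit,
    and any non-numeric suffix on a component (such as 'rc1') is ignored.
    """
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(name, found):
    """Return True if the found version is at least the minimum major.minor."""
    return parse_version(found)[:2] >= MINIMUMS[name]
```

For example, `meets_minimum("PMIx", "4.0.1")` is false and `meets_minimum("PRTE", "2.1.0")` is true under these assumed minimums.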

v4.0.x

  • Howard and Geoff not here -- nothing new to discuss.

Main branch

  • Did not get to discuss this. See notes from last meeting.

MTT

  • Did not get to discuss this. See notes from last meeting.

Face-to-face

  • Did not get to discuss this. See notes from last meeting.