
WeeklyTelcon_20200225

Geoffrey Paulsen edited this page Feb 25, 2020 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoffrey Paulsen (IBM)
  • Jeff Squyres (Cisco)
  • Akshay Venkatesh (NVIDIA)
  • Austen Lauria (IBM)
  • Brendan Cunningham (Intel)
  • Edgar Gabriel (UH)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Joshua Ladd (Mellanox)
  • Ralph Castain (Intel)
  • Thomas Naughton (ORNL)
  • Michael Heinz (Intel)
  • Noah Evans (Sandia)
  • Scott Breyer (Sandia?)

not there today (I keep this for easy cut-n-paste for future notes)

  • William Zhang (AWS)
  • Artem Polyakov (Mellanox)
  • Nathan Hjelm (Google)
  • Todd Kordenbrock (Sandia)
  • Charles Shereda (LLNL)
  • David Bernhold (ORNL)
  • George Bosilca (UTK)
  • Matthew Dosanjh (Sandia)
  • Brandon Yates (Intel)
  • Erik Zeiske
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Xin Zhao (Mellanox)
  • mohan (AWS)

Old Business

  • MTT -

    • Change your MTT setup to start PRRTE at the beginning of the session, and then just use prun.
    • Run times can be cut in half or more.
    • This is good, but we also need to test the mpirun wrapper.
    • Cisco is converting half of its MPI installs to use prrte/prun.
  • AWS, which can scale out horizontally, will continue to do both.

  • PRRTE Transition:

    • ORTE is gone, PRRTE is in its place. Expect some hiccups.
    • A bunch of MTT failures, because people need to update their command lines for the - vs -- option changes.
    • A number of Fortran failures that don't make much sense.
  • IBM MTT is hitting an IOF issue, where a file descriptor shuts down and libevent spins hard.

  • PRRTE - Josh turned on CI.

    • Auto labeller is not there yet. Experimenting.
    • Would like to get the OMPI side running the prte option.
    • Whenever we move PMIX or PRTE submodule pointer, it'll label the PR.
  • Anyone can click the override-merge button.

    • Hasn't been an issue, but remember this won't trigger PR based hooks.
    • Still 1+ month of effort before Open MPI v5.0 could be ready with this.
    • see: https://github.com/openpmix/prrte/issues/298 for additional mpirun launch items
  • OMPI master submodule pointers setup to track PMIx and PRRTE master.

    • Hopefully long term, master can track release branches.
    • But still ensure there's some regression tracking of master/master/master.
    • But once things settle down, might not want everyone's masters tracking each other.
    • But if we DON'T have master/master/master, then new features that span repos will be challenging.
    • OMPI v5.0 might want to trigger a major revision of other dependencies (PMIx and PRRTE)?
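
The MTT item above (start PRRTE once per session, then launch every test with prun) could be sketched roughly as follows. This is only an illustration, not the actual MTT configuration: the function name `run_session` and the injectable `runner` parameter are hypothetical, and the `prte --daemonize`, `prun`, and `pterm` commands are assumed from PRRTE's tooling and should be checked against the installed version.

```python
import subprocess


def run_session(tests, runner=subprocess.run):
    """Start one PRRTE DVM, launch every test with prun, then shut the DVM down.

    `tests` is a list of argument lists, e.g. [["-n", "2", "./hello"]].
    `runner` is injectable so the flow can be exercised without PRRTE installed.
    """
    # Assumed command: start the DVM once, in the background.
    runner(["prte", "--daemonize"], check=True)
    results = []
    try:
        for argv in tests:
            # Each test reuses the running DVM instead of paying startup cost again.
            results.append(runner(["prun", *argv], check=False))
    finally:
        # Assumed command: tear the DVM down even if a test raised.
        runner(["pterm"], check=True)
    return results
```

The saving described in the notes would come from paying the runtime startup cost once per session rather than once per test. The mpirun wrapper path is not exercised by this flow, which is why the notes call for testing both.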

Open MPI Face-to-face Highlights.

  • Minimum PMIx version for v5.0 will be PMIx v3.0
  • Scheduled v5.0 for end of June
    • Geoff Paulsen (with Austen Lauria) and Joshua Ladd will be release managers.
  • prefixing of "mca" with "opal_mca"
  • Do we dlopen mca components in global or private scope?
    • Pretty sure there are technical reasons why this is the case, and if someone wanted to make this private, they would need to document and make many
  • Note on wiki for pull requests on master.
    • Did we agree last week to REQUIRE Pull Requests on master?
    • Think yes for "common" code, but maybe not for typos in components we own.
      • Can't partially enable it per directory.
      • Jeff will do this.
    • Think we ended up saying yes for all to increase code visibility.
  • CI: working on parallelizing and coordinating better.
  • hwloc, libevent, and pmix to go up to top level directory (like prrte) and out of mca.
    • Big downside of being in mca is that they get slurped into opal.
    • mca was a convenient location before.
    • Might still be convenient because we still want to support external.
  • Deleted C++ bindings, but haven't deleted MPI1 bindings yet.
    • ScaLAPACK hasn't yet released a version that uses the MPI-3 replacements.
  • Configure changes to fatally error if user tries something no longer supported.
  • removing sync
  • SPI - parent organization
    • Jeff is now primary, and Brian is secondary.
  • Nathan is going to revive his aliasing PR.
    • really remove SM
    • somehow rename vader? I missed this item.
  • Lock-bot / Stale-bot
    • don't want to upset people. Want an open community.
    • Good conversation. Pros didn't outweigh the cons.
    • Not going to do either of these.
  • We are encouraging the use of a "Draft" PR.
    • A stronger version of WIP
  • Common Conventions
  • Command line options are really going to change.
    • Deprecated notices for OMPI v5.0
    • Confusion about parameters on the command line versus AMCA will generate some errors.
    • Josh is looking into some of this work as well
  • SM Cuda BTL - Jeff talked to Akshay about this component.
    • Want to primarily support CUDA via UCX.
    • Still a few use cases where we want to support CUDA in the BTL.
    • Akshay will discuss possibly integrating this feature into Vader post v5.0.
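
For readers unfamiliar with the dlopen scope question raised above, the difference between global and private scope can be illustrated with a minimal sketch. It uses Python's ctypes wrapper around dlopen, not Open MPI's own component loader; the helper name `open_component` is hypothetical, and the C math library stands in for an mca component.

```python
import ctypes
import ctypes.util


def open_component(path, private):
    """Load a shared object with local (private) or global symbol scope."""
    # RTLD_LOCAL: symbols are not made available to libraries loaded later.
    # RTLD_GLOBAL: symbols can be used to resolve references in libraries
    # loaded later.
    mode = ctypes.RTLD_LOCAL if private else ctypes.RTLD_GLOBAL
    return ctypes.CDLL(path, mode=mode)


# Stand-in for a component: the C math library, if it can be located.
# If find_library returns None, CDLL opens the running program itself on POSIX.
path = ctypes.util.find_library("m")
private_handle = open_component(path, private=True)
global_handle = open_component(path, private=False)
```

Looking up a symbol through the returned handle works in both modes. The scope only matters when one dynamically loaded object depends on symbols exported by another, which is the kind of cross-component dependency that would have to be audited before switching to private scope.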

Release Branches

Review v3.0.x Milestones v3.0.6

Review v3.1.x Milestones v3.1.6

  • Need to look at MTT to see if we have new regressions or not.
  • Brian is out a few days.

Review v4.0.x Milestones v4.0.3

  • v4.0.3 in the works.
    • Config file needs to be committed to
    • Merging in PRs
    • Seeing some instability on github right now.
    • Will merge 7462, 7463 after github.com comes back, leave others for v4.0.4

v5.0.0

  • Schedule: End of June
    • May need doc updates for command line changes, both prrte, and mca changes.

master

  • Failures in IBM suite for Inter-comm Timeouts.
    • Since the PRRTE changeover.
    • Comm Spawn calls connect/accept, but that's working.
  • MTT build failures on master
    • Passing the -with-pmi option
    • Jeff will email William

Face to face

  • Maybe mid-late summer.

Infrastructure

  • Scale testing: PRs have to opt into it.

Review Master Pull Requests

CI status


Dependencies

PMIx Update

  • PMIx v3.1.5 rc2 posted this week. Release should be Friday.
  • CI testing only tests build and did it run, but doesn't test HOW it ran.
    • Environment setup can be a bit different.
    • For example, no permissions in /tmp. A test might pass on one machine and fail on another that lacks /tmp permissions.
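
One way to make an environment difference like the /tmp case visible is a small pre-flight check run before the tests. The sketch below is an assumption about how such a check might look, not part of the existing CI; the function name `tmp_is_usable` is hypothetical.

```python
import tempfile


def tmp_is_usable(directory="/tmp"):
    """Return True if a file can actually be created (and removed) in `directory`."""
    try:
        # Creating a real file catches missing directories, read-only mounts,
        # and missing write permission in one step.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False
```

Logging the result of `tmp_is_usable()` alongside each CI run would show how the environment differed when a run passes on one machine and fails on another.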

ORTE/PRRTE

MTT


Back to 2019 WeeklyTelcon-2019
