WeeklyTelcon_20160216

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Brad Benton
  • Edgar Gabriel
  • Howard
  • Josh Hursey
  • Ryan Grant
  • Todd Kordenbrock
  • Joshua Ladd
  • Ralph
  • Sylvain Jeaugey

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.3
  • Targeting beginning April for 1.10.3 - no new drivers
  • Nathan - 0 byte send issue?
  • Howard - verbs usNIC build default issue? - PR 938 waiting for Howard to review.
  • Jeff - Fortran08? - ralph just committed.
  • Issue 1136 - SLES12 - Long-running jobs, mpirun SIGCHLD at end of job?
  • nVidia now showing MTT failures; tests were silently failing before.
    • hello_alloc_memusempi - 1sided. Sylvain should open an issue against 1.10.x.
    • Some race condition, so possibly not fixed on master and 2.x; might just not hit it.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
    • Issue 1348 - disable addprocs 32bit & bigendian. Resolved.
    • Issue 1346 - grpcomm fixes. Resolved.
    • Issue 1252 - openib causes horrible same node perf.
      • Can't test on 2.0 due to unresolved symbol / bad init.
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
  • PR 953: --host discussion Issue 1344
    • Have a pow-wow at the face-to-face. Let's make sure we settle on what we want.
  • ompi-release PR 962 - in master.
    • Nathan would like this change everywhere. Nathan will put this on master everywhere PR.
  • 966 - Fortran 08 - good with this, push everywhere.
  • PR 967 - work occurred over weekend.
  • Is there a more generic jenkins test that could be configured to catch this?
    • Yes, but can't anticipate that this will happen commonly
    • Jeff could add another configure option to Jeff's nightly MTTs.
    • Edgar has a general problem that many components require OMPIO component.
  • PMI-x 1.1.2? - fixes some bugs, but also adds more "stuff".
    • Ralph: this is a 1.1.2+ for OMPI 2.0 release.
    • PMI-x 1.2 for OMPI 2.0.1
    • Will this be okay for our versioning backwards compatibility statements?
      • Yes, none of this gets built into the users' application.

Review Master?

MTT status:

  • From Last week:
    • A lot of issues are usNIC related. Jeff will STILL look at them.
      • non-one-sided failures with usNIC cluster. Perhaps cluster network setup.
    • nVidia failures look dynamics related. Sylvain is fixing something about the way it launches.
      • Turned off nVidia MTT tests right now. Just started getting different errors.
        • BOTH Master and 2.x - some CUDA related things are broken. IS collective related.
        • Some new errors for 1.10 - because Jeff committed some fixes to the test, which is now SHOWING the error.
      • Hope to get testing back online today or tomorrow.
    • Nathan will look at all one-sided failures.
    • tcp btl might have an issue: getting a "tried to lock resource but already locked" warning.

Status Updates:

  • LANL - Release stuff, Some investigations for meeting next week.
    • Now that we have KNL boxes, been working some with Open MPI and MPICH on KNL; vast improvement over KNC.
    • Binaries will work on KNL or Haswells.
    • Want to get back to OMPI_PLACES setting. Not sure where to put it. Discuss at face2face.
      • Will need to use NESTED OMP parallelism. Want to make that easy.
    • Want to make sure everything is clean for 1-sided for 2.0
    • Trying to find last error with MPOOL re-write. Asking for feedback, and asking how people like the new organization.
      • Really want George's comment here.
      • Will give us the ability to use MEMKIND, and will take some work to get everything to use the same allocators.
      • Can expose performance variables to tweak settings.
  • Houston - Mostly using release branch, done a little more code development for glass
  • IBM -
    • Getting MTT and builds setup internally.
    • Defining support matrix for new Open MPI product.
    • Will be using RFC process for some bigger features.
    • Problem with MTT reporter. Josh put patch for it. Still running off svn repo, but we'll need to do a swap.
    • During the swap MTT will be down.

Status Update Rotation

  1. LANL, Houston, IBM
  2. Cisco, ORNL, UTK, NVIDIA
  3. Mellanox, Sandia, Intel
