
WeeklyTelcon_20160112


Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Brad Benton
  • Edgar Gabriel
  • Geoffroy Vallee
  • George
  • Howard
  • Josh Hursey
  • Nathan Hjelm
  • Ralph
  • Ryan Grant
  • Sylvain Jeaugey
  • Todd Kordenbrock

Minutes

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.2
    • mpirun hangs ONLY on SLES 12, with a minimum of 40 procs/node, at the very end of mpirun. Only seeing it in certain cases; not sure what's going on.
    • Is mpirun not exiting because the ORTEDs are not exiting? Nathan saw this on 2.0.
    • Wait for Paul Hargrove.
    • No objections to Ralph shipping 1.10.2.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
    • Group comms weren't working for communicators with power-of-2 sizes. Nathan found a massive memory issue.
    • https://github.com/open-mpi/ompi/issues/1252 - Nathan is working on a decay function for progress functions to "fix" this (a sketch of the idea follows this list).
      • Nathan's been delayed until later this week. Could get done by the middle of next week.
      • George commented that the openib BTL specifically could be made to progress only if there is a send/recv message posted.
        • uGNI progress - could check for datagrams only every so often (only a 200ns hit).
      • Prefer to stick with Nathan's original decay function without modifying openib.
    • https://github.com/open-mpi/ompi/issues/1225 - TotalView debugger problem + PMIx.
      • SLURM users use srun, which doesn't have this issue.
      • DDT does NOT have this issue either. Don't know why it's different. The attach FIFO:
        • mpirun waits on a pipe for the debugger to write a 1 on that pipe (an illustration of this handshake follows this list).
        • Don't see how that CAN work.
        • Nathan's been using attach rather than mpirun --debug. Attach happens after launch, so it doesn't go through this step. Nathan thinks this is not so critical since attach works.
      • Anything will work, as long as you're ATTACHING to a running job rather than launching through the debugger.
      • Barring a breakthrough with PMIx notify in the next week, we'll do an RC2 and just carefully document what works/doesn't as far as debuggers.
      • Will disable "mpirun --debug" on the 2.0 branch and print an error that says it's broken.
      • No longer a blocker for 2.0.0 due to schedule. Still want to fix this for the next release.
    • No new features, except for:
      • Howard will review.
      • Review group comm.
      • Don't know if we'll bother with the PLFS filesystem.
    • UCX using the modex stuff.
    • OMPIO + Lustre is slow on the 2.0.0 (and master) branches. Discussed making ROMIO the default for OMPI on Lustre (only).
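
The "decay function" mentioned for issue 1252 above was not spelled out in the meeting; the following is a minimal sketch of one plausible interpretation (exponentially backing off progress callbacks that report no events), using made-up names — it is not Open MPI's actual progress API.

```c
/* Hedged sketch: call an unproductive progress callback less and less often,
 * and reset to every-poll as soon as it does real work.  progress_entry_t
 * and decay_progress are hypothetical names for illustration only. */
typedef int (*progress_fn_t)(void);

typedef struct {
    progress_fn_t fn;
    unsigned skip_interval;   /* call fn only once per (skip_interval + 1) polls */
    unsigned countdown;       /* polls remaining before the next real call */
} progress_entry_t;

#define MAX_SKIP 64

static int decay_progress(progress_entry_t *e)
{
    if (e->countdown > 0) {
        e->countdown--;
        return 0;                       /* skip this poll */
    }
    e->countdown = e->skip_interval;

    int events = e->fn();
    if (events > 0) {
        e->skip_interval = 0;           /* productive: poll every time again */
    } else if (e->skip_interval < MAX_SKIP) {
        e->skip_interval = e->skip_interval ? 2 * e->skip_interval : 1;
    }
    return events;
}
```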
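The attach-FIFO handshake described under issue 1225 ("mpirun waits on a pipe for the debugger to write a 1") can be illustrated with the standalone sketch below; the FIFO path and the program are made up for the example and are not the ORTE code itself.

```c
/* Hedged illustration: block on a FIFO until a debugger-like process
 * writes '1' into it, then continue.  The path name is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *fifo = "/tmp/example_attach_fifo";   /* hypothetical path */
    char flag = 0;

    mkfifo(fifo, 0600);

    /* Opening a FIFO read-only blocks until a writer opens the other end;
     * read() then returns whatever byte the "debugger" wrote. */
    int fd = open(fifo, O_RDONLY);
    if (fd >= 0 && read(fd, &flag, 1) == 1 && flag == '1') {
        printf("debugger signaled attach; continuing\n");
    }
    if (fd >= 0) close(fd);
    unlink(fifo);
    return 0;
}
```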

Review Master?

  • Bunch of failures on the master branch. No chance to look at them yet.

  • Cisco and Ivy cluster.

  • Nathan's seeing a "resource deadlock avoided" error in MPI_Waitall. Some TCP BTL issue; looks like something going on down there. Should be fairly easy to test this. Cisco TCP one-sided stuff. (An illustration of where this error can come from follows this list.)

    • Nathan will see if he can figure this out. Haven't changed one-sided pt2pt recently, so surprised. Maybe proc locks are on by default? Need to work this out. Just changed locks from being conditional to being unconditional.
  • Edgar found some Lustre issues. OMPI master has bad MPI-IO performance on Lustre. Performance looked reasonable on master before, but now it is poor. Not completely sure when the performance regressed.

    • For Lustre itself, could switch back to ROMIO as the default.
    • GPFS and others will look good, but Lustre is bad. Can't have OMPIO as the default on Lustre.
    • Problem for both the 2.0.0 AND master branches.
  • https://github.com/open-mpi/ompi/issues/398 is ready for a pull request.

    • Nathan - should go to 2.1 (since the mpool changes were pushed to 2.1).
  • https://github.com/open-mpi/ompi/pull/1118 - The mpool rewrite should be ready to go, but we want George to look at it and make comments. Probably one of the first 2.1 requests after it goes into master.

  • https://github.com/open-mpi/ompi/pull/1296 - PMIx - spreading changes from PMIx across non-PMIx infrastructure. Is that okay?

    • This is just making changes in the glue that is OMPI-specific.
    • Should go into 2.0.0. Plugs leaks, but minor; still good.
  • https://github.com/open-mpi/ompi/pull/1290 - OPAL hotel problem. Do we need to get this into 2.0 as well?

    • Definitely needs to go into 2.0! Jeff is using it in 1.10.
  • https://github.com/open-mpi/ompi/pull/1278 - Nathan might want to look at this. Gilles is fixing derived datatypes in one-sided.

    • Nathan says it looks okay. Perfectly reasonable to use two different sets of tags.
    • Absolutely a 2.0.0 bug as well.
    • Nathan will merge it, and open the PR.
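
For the "resource deadlock avoided" report above: on Linux that string is strerror(EDEADLK), which an error-checking pthread mutex returns when a thread tries to re-lock a mutex it already holds. The standalone sketch below (not the OMPI code path) shows how switching from conditional to unconditional locking can surface it.

```c
/* Hedged illustration: an error-checking mutex turns a self-deadlock into
 * EDEADLK ("Resource deadlock avoided") instead of hanging. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    pthread_mutex_t lock;
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&lock, &attr);

    pthread_mutex_lock(&lock);            /* first (legitimate) lock             */
    int rc = pthread_mutex_lock(&lock);   /* unconditional re-lock, same thread  */
    if (rc != 0) {
        printf("second lock failed: %s\n", strerror(rc));  /* EDEADLK */
    }

    pthread_mutex_unlock(&lock);
    pthread_mutex_destroy(&lock);
    pthread_mutexattr_destroy(&attr);
    return 0;
}
```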

Status Updates:

  • Mellanox - (via email update after the meeting)

We are just now preparing the patch to open a PR. We've just finished testing this morning and got the 'OK' from the UCX folks to open a PR. Sorry for the delay; we just wanted to be sure all the 't's were crossed and 'i's dotted before submission.

  • https://github.com/open-mpi/ompi-release/pull/891
  • Sandia - Ryan is working on getting some bug fixes in for 2.0. No major issues.
  • Intel - Working on the MTT re-write. Trying to track down the error notification thing. Not many cycles.
    • Re-writing the client in Python, making it more pluggable, and extending the feature set to handle a broader range of stages.
    • Josh has been working on the reporter side (for the last 6 months) with some students. Thinking about a more flexible architecture.
      • A REST interface around the database, to support Python and a more flexible JavaScript reporter. Hopefully get that to a stage where people can play with it.

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM
  3. Cisco, ORNL, UTK, NVIDIA

Back to 2016 WeeklyTelcon-2016
