WeeklyTelcon_20201110

Open MPI Weekly Telecon

Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

NOT-YET-UPDATED

4.0.x

v4.0.6 is targeted for the end of the month.
- 7199 - ras based orted on headnode.
- No other blockers currently on
- Geoff will create milestone issue for v4.0.6
PR 8187 - deterministic build.
- rhc will review.
Ralph fixed a bug in ORTE last night for AWS. This hasn't gone back to v4.1. (PR 8176)
- It's technically more correct, so probably doesn't matter.
- Weren't communicating the cpuset correctly. OFI MTL needed it in v4.1.
  - Jeff will see which BTL.
- Probably not critical.
- Ralph will pull it out and PR it for v4.0.x

v4.1

Made another RC last night; Jeff will send out an email this morning.
- First RC to include HAN and Adapt.
  - Intent is that we'd activate both together.
- Jeff will email with instructions on how to enable.
Warnings have gotten out of hand on v4.1 on macOS.
- Some warnings would be fixed if we just fixed some macros. A bunch of unused variables.
- In general not critical.
- There is also a /bin/sh command substitution error (unexpected EOF), which is probably critical.
Coverity only runs on master - Coverity then

Open-MPI v5.0

Ralph is working on updating PMIx / PRRTE submodule pointers.
- Jeff is helping with configury issues.
Hoping these are one-time issues, and not every time we update submodule pointers.
- All new configury.
- If this isn't a one-time thing, we should consider no longer embedding it.
- Submodules are good for a number of things, but you have to take a change in configury, go through CI, commit it to PMIx master, then go through OMPI CI, and find out that there's a bug. Then you have to go back to PMIx master.
Cisco has some of this.
- Should figure out a keyword on PMIx side, that effectively does
Hit a problem on AWS Amazon Linux 1 trying to build a tarball.
Also hitting failures in Mellanox CI due to an older Python version.
IBM doesn't do dist check - bug is in there (real issue)
Some issues are caused by Embedding. (What do you mean by SRCDIR) (Which SRCDIR?)
But other issues are NOT caused by embedding pmix.
- So getting rid of embedding would not solve these issues
At the moment it's looking like 2Q next year.
- IBM's been pushing on PRRTE as well. We're testing the map/bind options.
- Trying to push tickets up (some are clarification of expected behavior)
- Community help with these issues would help move the PRRTE deadline forward.
What's going to be the state of the SM CUDA BTL and CUDA support in v5.0?
- What's the general state? Any known issues?
- AWS would like to get.
- Josh Ladd - Will take internally to see what they have to say.
- From NVIDIA/Mellanox, CUDA support is through UCX; SM CUDA isn't tested that much.
- Hessam Mirsadeg - All CUDA awareness is through UCX.
- May ask George Bosilca about this.
- Don't want to remove a BTL if someone is interested in it.
- UCX also supports TCP via CUDA
- PRRTE CLI on v5.0 will have some GPU functionality that Ralph is working on
PR 8191 - converted all OMPI READMEs to Markdown.
- Looks nicer when browsing on Github (can do formatting)
- Do things in one markdown language (easy to edit)
- Master only.
Should we consolidate the top-level README, the website FAQ (sorta googlable), and all the man pages?
- If two of these are going to Markdown, maybe we should do the FAQ in Markdown too, and put them all in one place like readthedocs.io.
- Unless someone has an allergic reaction, Jeff's interested in working on this.
- They have a decent versioning scheme to version docs based on release(s).
- When would the docs get pushed to readthedocs.io?
  - Master would be on a github hook after PRs are merged.
  - Stable Release branches would go out at release time.
- LICENSE question - what license would the docs be available under? Open-MPI BSD license, or
- readthedocs.io encourages the reStructuredText format over Markdown.
  - They also support a hybrid for projects that have both.
- Thomas Naughton has done the reStructuredText, and it allows
Ralph tried Instant-On at scale:
- 10,000 nodes x 32 PPN
- Ralph verified Open-MPI could do all of that in < 5 seconds, Instant-On.
- Through MPI_Init() (if using Instant-On)
- TCP and Slingshot (OFI provider private now)
- PRRTE with PMIx v4.0 support
- SLURM has some of the integration, but hasn't taken this patch yet.
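For a sense of scale, the job size implied by the node count and PPN above can be worked out directly. This is only a sketch of the arithmetic; the variable names are illustrative and not taken from the discussion.

```python
# Figures reported in the notes above.
nodes = 10_000          # number of nodes in the run
procs_per_node = 32     # PPN

# Total number of processes launched across the job.
total_procs = nodes * procs_per_node
print(f"{total_procs:,} processes")
```
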
Discussion on:
- Draft Request Make default static https://github.com/open-mpi/ompi/pull/8132
- One con is that many providers hard link against libraries, which would then make libmpi dependent on this.
- Talking about amending to request MCAs to know if it should be slurped in.
  - (if the component hard links or dlopens their libraries)
- Roadrunner experiments... The bottleneck in launching was I/O in loading all the .so files.
  - spindle, and burst buffer reduce this, but still
- Still going through function pointers, no additional inlining.
  - can do this today.
- Still different than STATIC (sharing this image across processes), just not calling dlopen that many times.
- New proposal is to have a 3rd option where a component decides that its default is to be slurped into libmpi.
  - It's nice to have fabric providers not bring their dependencies into libmpi, so that the main libmpi can be run on nodes that may not have the provider's dependencies installed.
- Low priority anyway; if we get it in for v5.0 it'd be nice, but it's not critical.
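The three-way choice discussed above (separate plugin, slurped into libmpi, or let the component pick its own default) could be modeled roughly as follows. This is a minimal sketch, not the build system's actual logic: the names `BuildMode`, `resolve_build_mode`, and the two example components are hypothetical, and the precedence order (explicit user request, then the component's own default, then the global default) is an assumption made for illustration.

```python
from enum import Enum


class BuildMode(Enum):
    DSO = "dso"        # separate plugin, opened with dlopen at run time
    STATIC = "static"  # slurped into libmpi


def resolve_build_mode(global_default, component_default=None, user_override=None):
    """Pick the build mode for one component.

    Assumed precedence: user override > component default > global default.
    """
    if user_override is not None:
        return user_override
    if component_default is not None:
        return component_default
    return global_default


# Hypothetical components: one with no preference, and one fabric provider
# that hard links a vendor library and prefers to stay out of libmpi.
components = {
    "coll_example": None,
    "mtl_fabric_example": BuildMode.DSO,
}

modes = {
    name: resolve_build_mode(BuildMode.STATIC, preference)
    for name, preference in components.items()
}

for name, mode in modes.items():
    print(f"{name}: {mode.value}")
```

With a global default of static, the fabric component in this sketch still builds as a plugin, so libmpi itself carries no dependency on the provider's libraries.
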

Video Presentation

George and Jeff are leading
No new updates this week (see last week)
