WeeklyTelcon_20201103
- Dialup Info: (Do not post to public mailing list or public wiki)
- NOT-YET-UPDATED
- If OpenPMIx 3.2.1 fixes spawn, it may drive the release.
- We have PMIx 3.2.1rc1 - it should have the fix.
- Issue with singleton comm-spawn - it used to work; not sure when the regression was introduced.
- Would be nice to get this working (a minimal reproducer sketch follows this block).
- Josh will PR PMIx 3.2.1rc today.
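As a point of reference, here is a minimal, hypothetical reproducer for the singleton comm-spawn case: the parent is run directly (no mpirun), so it starts as a singleton, then spawns one child. The `./spawn_child` binary name is a placeholder, not anything from the discussion.

```c
/* Hypothetical singleton comm-spawn reproducer.
 * Run directly (./spawn_parent, no mpirun) so the process starts
 * as a singleton; "./spawn_child" is a placeholder child binary. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* A singleton spawning children is the case reported as regressed. */
    MPI_Comm_spawn("./spawn_child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    printf("rank %d: spawn succeeded\n", rank);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```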
- Howard - get these merged today for MTT tonight.
- In a new position. HAN and Coverity fixes are in.
- Raghu is taking over for Brian.
- The VERSION, NEWS, and README updates are in.
- RC this week?
- Ralph is working on updating the PMIx / PRRTE submodule pointers.
- Jeff is helping with configury issues.
- Hoping these are one-time issues, and not something we hit every time we update the submodule pointers.
- All new configury.
- If this isn't a one-time thing, we should consider no longer embedding PMIx.
- Submodules are good for a number of things, but you have to take a configury change, go through CI, commit it to PMIx master, then go through OMPI CI, and only then find out that there's a bug. Then you have to go back to PMIx master.
- Cisco has some of this.
- Should figure out a keyword on the PMIx side that effectively does this.
- Hit a problem on AWS Amazon-Linux1 trying to build a tarball.
- And we were hitting Mellanox CI failures due to an older Python version.
- IBM doesn't do a dist check - the bug is in there (a real issue).
- Some issues are caused by embedding. (What do you mean by SRCDIR? Which SRCDIR?)
- But other issues are NOT caused by embedding PMIx.
- So getting rid of embedding would not solve these issues
- At the moment it's looking like 2Q next year.
- IBM's been pushing on PRRTE as well. We're testing the map/bind options.
- Trying to push tickets upstream (some are clarifications of expected behavior).
- Community help with these issues would help move the PRRTE deadline forward.
- What's going to be the state of the SM CUDA BTL and CUDA support in v5.0?
- What's the general state? Any known issues?
- AWS would like to get.
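For anyone checking CUDA support in their own builds, Open MPI exposes a run-time query through its MPI extensions. A minimal sketch, assuming a build where `mpi-ext.h` provides the CUDA extension:

```c
/* Minimal sketch: ask at run time whether this Open MPI build is
 * CUDA-aware.  Assumes the mpi-ext.h CUDA extension is available. */
#include <mpi.h>
#include <stdio.h>
#if defined(OPEN_MPI)
#include <mpi-ext.h>   /* declares MPIX_Query_cuda_support() */
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    printf("CUDA-aware support is %s at run time\n",
           MPIX_Query_cuda_support() ? "available" : "not available");
#else
    printf("This MPI was not compiled with CUDA-aware support\n");
#endif
    MPI_Finalize();
    return 0;
}
```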
- Ralph tried Instant-On at scale:
- 10,000 nodes x 32 PPN
- Ralph verified Open MPI could do all of that in < 5 seconds with Instant-On.
- Through MPI_Init() (if using Instant-On)
- TCP and Slingshot (the OFI provider is private for now)
- PRRTE with PMIx v4.0 support
- SLURM has some of the integration, but hasn't taken this patch yet.
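A rough sketch of how one might time the wire-up window that Instant-On shrinks. This is an illustration, not the benchmark Ralph ran, and it only measures the time each process spends inside MPI_Init():

```c
/* Rough per-process timing of MPI_Init(), where wire-up happens.
 * Illustrative only; not the actual at-scale measurement. */
#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    MPI_Init(&argc, &argv);          /* wire-up happens in here */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (rank == 0)
        printf("MPI_Init took %.3f s\n", secs);

    MPI_Finalize();
    return 0;
}
```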
- Discussion on:
- Draft pull request to make static the default: https://github.com/open-mpi/ompi/pull/8132
- One con is that many providers hard-link against external libraries, which would then make libmpi dependent on those libraries.
- Talking about amending it so each MCA component can declare whether it should be slurped in.
- (i.e., whether the component hard-links or dlopens its libraries)
- Roadrunner experiments... the bottleneck in launching was the I/O of loading all the .so files.
- Spindle and burst buffers reduce this, but it is still significant.
- Calls still go through function pointers, so there is no additional inlining.
- We can do this today.
- Still different from STATIC (which shares the image across processes); it just avoids calling dlopen so many times (see the sketch below).
- George and Jeff are leading
- No new updates this week (see last week)
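To illustrate the trade-off in the static-by-default discussion, here is a hypothetical plugin-loader sketch (not Open MPI's actual MCA code; the component names and the `init` symbol are made up). It shows both points above: every dlopen() costs file I/O at launch, and calls go through a function pointer whether the component is dynamic or static.

```c
/* Hypothetical plugin loader illustrating why opening many .so
 * files at launch is I/O-heavy.  Build with -ldl.  Not Open MPI's
 * MCA code; component names and the "init" symbol are invented. */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*init_fn)(void);

int main(void)
{
    const char *components[] = { "./mca_one.so", "./mca_two.so" };
    for (unsigned i = 0; i < sizeof(components) / sizeof(*components); i++) {
        /* Each dlopen() is a file open + mmap + relocation pass;
         * with hundreds of components this dominates launch time. */
        void *h = dlopen(components[i], RTLD_NOW | RTLD_LOCAL);
        if (!h) {
            fprintf(stderr, "%s: %s\n", components[i], dlerror());
            continue;
        }

        /* Dynamic or static, the call still goes through a
         * function pointer - no extra inlining either way. */
        init_fn init = (init_fn) dlsym(h, "init");
        if (init) init();
    }
    return 0;
}
```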