WeeklyTelcon_20230509
Geoffrey Paulsen edited this page Jul 25, 2023 · 2 revisions
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen (IBM)
- Howard Pritchard (LANL)
- Edgar Gabriel (AMD)
- Luke Robison (Amazon)
- Joseph Schuchart
- Thomas Huber
- Thomas Naughton (ORNL)
- Todd Kordenbrock (Sandia)
- Tommy Janjusic (nVidia)
- https://github.com/open-mpi/ompi/pull/11649 - an OFI callback is scoped incorrectly.
- This probably affects main, v5.0.x, and v4.x
- A test case could be crafted. Is this only seen with multi-threaded (MT) apps? No, not only MT.
- The code can be re-entered even with a single thread.
- More than one completion was delivered because an array was overwritten.
- Luke will search for test case
- Blocker for v5.0.0
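
The notes do not spell out the details of the bug in PR 11649, so the following is only a generic sketch of how single-threaded re-entrancy can overwrite a shared array and deliver one completion twice. Every name here (`events`, `poll_queue`, `progress`, `on_completion`) is hypothetical and is not taken from the OFI code in Open MPI.

```c
#include <stdio.h>

#define MAX_EVENTS 4

/* Hypothetical shared scratch array reused by every progress() call. */
static int events[MAX_EVENTS];

static int pending;        /* completions waiting in the fake queue */
static int next_id;        /* id handed to the next completion */
static int handled[16];    /* ids seen by the callback, in order */
static int n_handled;

/* Drain up to max completions from the fake queue into out[]. */
static int poll_queue(int *out, int max)
{
    int n = 0;
    while (n < max && pending > 0) {
        out[n++] = next_id++;
        pending--;
    }
    return n;
}

void progress(void);

/* Callback: records the id; the first completion re-enters progress(). */
void on_completion(int id)
{
    handled[n_handled++] = id;
    if (id == 100) {
        pending += 2;      /* new work arrives while we are in the callback */
        progress();        /* same thread, nested call: overwrites events[] */
    }
}

void progress(void)
{
    int n = poll_queue(events, MAX_EVENTS);
    for (int i = 0; i < n; i++) {
        /* After a nested call, events[i] may no longer be what we polled. */
        on_completion(events[i]);
    }
}

/* Reset the state, run one outer progress() call, return callbacks fired. */
int run_reentrancy_demo(void)
{
    pending = 2;
    next_id = 100;
    n_handled = 0;
    progress();
    return n_handled;
}
```

In this sketch the outer call polls ids 100 and 101, the nested call overwrites the array with 102 and 103, and the outer loop then reads slot 1 again: id 103 is handled twice and id 101 is never handled. Keeping the array local to each `progress()` call (or otherwise scoping it per invocation) avoids the overwrite in this toy model.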
- No new updates
- MCA params issues are the biggest issues now
- https://github.com/open-mpi/ompi/issues/11532
- https://github.com/openpmix/prrte/issues/1731
- Just want 2 of the 3 fixes for v5.0.0; the 3rd can wait for v5.0.x
- Quincy was going to take this over, but is busy with other things.
- Might be in PMIx base / framework
- https://github.com/open-mpi/ompi/wiki/WeeklyTelcon_20230425#pmix-mca-parameter-issues
- Ralph volunteered to help, but might take a month
- Luke will check with Ralph
- Need to cherry-pick NIC selection to v5.0.x
- The commit that went into main broke some AWS configurations
- It also caused some Coverity issues; a fix has already been PRed against main
- 2 MTT issues
- UCX and DSO - a fix may need to be cherry-picked back to v5.0.x
- Issue 11632 - Fix provided in re-review
- Good to retest ABI with v4.1.x before v5.0.0
- Geoff will do this this week or next week
- SMCuda should disqualify itself if no CUDA hardware is available.
- Want this for v5.0
- One rank, or a singleton, closes itself early.
- Edge case in SMCuda: it attempts to clean up and tie into the framework.
- When it gets unloaded, there are dangling pointers.
- Fix: don't set up callback functions unless Cuda_Init succeeds.
- Edgar's PR (PR 11617) is still trying to always compile the CUDA collective
- Waiting for review
- Summary, we want both
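
The fix described above (register callbacks only after initialization succeeds, otherwise disqualify) can be sketched in plain C. This is a minimal illustration under stated assumptions, not the SMCuda code: `fake_device_init`, `framework_register`, `component_open`, and `framework_shutdown` are all hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*cleanup_cb_t)(void);

/* Hypothetical framework-side slot holding a component's cleanup callback. */
static cleanup_cb_t registered_cb = NULL;
static int cleanup_calls = 0;

static void framework_register(cleanup_cb_t cb)
{
    registered_cb = cb;
}

/* Hypothetical component cleanup; only valid if init succeeded. */
static void gpu_cleanup(void)
{
    cleanup_calls++;
}

/* Stand-in for device initialization: 0 on success, nonzero otherwise. */
static int fake_device_init(bool hw_present)
{
    return hw_present ? 0 : -1;
}

/* Returns true if the component stays eligible for selection. */
bool component_open(bool hw_present)
{
    if (fake_device_init(hw_present) != 0) {
        /* Disqualify: nothing was registered, so nothing can dangle
         * after the component is unloaded. */
        return false;
    }
    framework_register(gpu_cleanup);
    return true;
}

/* Framework teardown: invoke the callback only if one was registered. */
void framework_shutdown(void)
{
    if (registered_cb != NULL) {
        registered_cb();
        registered_cb = NULL;
    }
}
```

With `component_open(false)` the callback slot stays empty, so `framework_shutdown()` has no pointer into an unloaded component to call; with `component_open(true)` the cleanup runs exactly once.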
- Doc work still remains; any remaining issues will be enumerated next week
- A fix went in 20 minutes ago; other than that, there's a pmix cross-version issue (11658)
- These same doc fixes will trickle through pmix/prrte and
- New issue: nVidia's internal MTT found an async-modex problem
- global dstore has an issue.
- Triggered if you set async-modex or dstore-hash.
- Scale-dependent: the minimum needed to reproduce is 4 nodes x 4 ppn.
- UCX and ob1 are both affected.
- Just Init+Finalize can trigger it
- v5.0 blocker.
- Behavior of MPI_Comm_disconnect - a lot of discussion with George
- MPI_Finalize - what happens to persistent communication handles that the user didn't explicitly free?
- Option number 3 is under
- C const in header files: Open MPI and MPICH are both doing what's acceptable for ABI definitions, but this has not been discussed in this forum