WeeklyTelcon_20211130
Geoffrey Paulsen edited this page Nov 30, 2021
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoffrey Paulsen (IBM)
- Todd Kordenbrock (Sandia)
- Sam Gutierrez (LLNL)
- Harumi Kuno (HPE)
- Joseph Schuchart
- Naughton III, Thomas (ORNL)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Brendan Cunningham (Cornelis Networks)
- Hessam Mirsadeghi (UCX/nVidia)
- Matthew Dosanjh (Sandia)
- William Zhang (AWS)
- Austen Lauria (IBM)
- Tomislav Janjusic
- Josh Hursey (IBM)
- Brian Barrett (AWS)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia/Mellanox)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Christoph Niethammer (HLRS)
- David Bernhold (ORNL)
- Edgar Gabriel (UH)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Joshua Ladd (nVidia/Mellanox)
- Marisa Roman (Cornelis)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Noah Evans (Sandia)
- Raghu Raja (AWS)
- Ralph Castain (Intel)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Xin Zhao (nVidia/Mellanox)
- Tommy is taking over for Josh Ladd. Please send Mellanox items to him.
- He will also help with v5 RM work.
- We're still waiting on Datatype issues now reported in v4.1.1
- Issue 8856
- Howard took the DT fix and created a PR
- Need an explanation for PR 8810
- Hessam contacted Artem, who said that it's a work in progress.
- Follow up 8818 on datatypes
- Is this also blocker?
- No.
- Raghu has left AWS.
- Brian is stepping up for v4.1.x RM work
- v4.1.1
- Released over the weekend. Got George's datatype fix.
- Brian and Jeff did a bunch of testing, and were happy with it.
- Unfortunately two different folks reported partial roundoff error #8856
- George spent a lot of time trying
- Holding off on merging v4.1.x PRs until we get a better understanding of #8856
- Still haven't done the alpha, and won't do that until we get cherry-picks from master.
- Austen, Tommy, and Geoff will Cherry-pick "easier"
- Issue #8652 RDMA performance problem.
- This is more of an enhancement than a severity: blocker
- Not a blocker, just an issue with the way the user ran.
- If there's a mode that we know has bad performance, useful to call out in UCX section of docs.
- Pushing back the alpha build for v5.0.0 from this Friday to NEXT Friday.
- Issue 8776 - libevent confusion if running with external 3rd party tools
- PR 8792 - Need to move this over to v5.0.x
- Need to check with Brian if this is relevant on v4.0 or v4.1
- compile with --disable-dlopen, or slurp in all of the plugins.
- 3 line change, should be small work.
- Not a linker error, job just hangs and fails, really might want on v4.0 and v4.1
- PR 8799 - should probably be PRed to v5.0
- Howard's concerned about how these package-specific config lookups feed into the way that mpicc is linked (for example, on Cray)
- mpicc --show - shows some long dependencies.
- Just let him know on the ticket.
- Howard will update the ticket.
- Docs - Man pages will be included in this effort.
- Likely include nroff and HTML in the tarball (so users don't need Sphinx, and don't need internet)
- If this doesn't make v5.0.0, it can go into later.
- Packagers need some advice, and need a README, few more weeks at minimum.
- 8808 - same memory backing file.
- what is the failure profile for this?
- Rare. If two users share a node and a failed job leaves backing files behind, another user's attempt to create the backing file can conflict. So we add the user ID for a little more safety against conflicts.
- Does mean that there's a cleanup issue for shared memory files.
- The only reason is that the backing file was moved out of /dev/shm.
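
A minimal sketch of the idea discussed for 8808, assuming the user ID is folded into the backing file's name. The function name, the `shmem_backing` prefix, and the `job_id`/`rank` fields are hypothetical and do not reflect Open MPI's actual naming scheme:

```python
import os


def backing_file_name(base_dir, job_id, rank, uid=None):
    """Build a per-user backing file path (hypothetical naming scheme).

    Including the user ID means a stale file left behind by one user's
    failed job cannot collide with the name another user would generate.
    """
    if uid is None:
        uid = os.getuid()  # Unix only; the real user ID of this process
    return os.path.join(base_dir, f"shmem_backing.{uid}.{job_id}.{rank}")


# Two users running the same job ID and rank get distinct paths.
alice = backing_file_name("/tmp", "job1", 0, uid=1000)
bob = backing_file_name("/tmp", "job1", 0, uid=1001)
```

This only avoids name collisions between users. It does not clean up stale files, which is the separate cleanup issue noted above.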
- PR 8816
- One issue is LARGE macro formats.
- More readable, or less readable with formatting is subjective.
- There are some tokens to surround code with
- Always put a trailing comma on struct or array initializers (little things could be improved)
- I wonder if we should optimize for time here?
- Anything that's controversial, just surround it with these tokens
- Once this is formatted, then we can run clang-tidy, and this can
- Jeff has a CI script to enforce this.
- You can turn this on before.
- Don't want to
- Need to do ompi and oshmem too.
- Not touching 3rd party.
- Large macros in tests (might need that same
- clang-tidy is smaller - pretty small.
- clang-format has to be completely included.
- Some 300 forks of Open MPI on GitHub.
- Anyone else have long-standing branches?
- Worried that we're not making a lot of friends here with changes that don't really help.
- Code readability is important. We have coding standards, but haven't been
- We should have had this convo before we merged into opal.
- Right now we're in this horrible half-state.
- If we jump through this pain, then we can automate it.
- Not going to do clang-format on 3rdParty
- Including ROMIO
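
The "tokens" mentioned in the PR 8816 discussion are presumably clang-format's `// clang-format off` and `// clang-format on` comments, which exclude the code between them from reformatting. As a rough sketch (not Jeff's CI script; the function name and the sample source are made up for illustration), a small helper can report which line ranges of a file are fenced off this way:

```python
OFF = "clang-format off"
ON = "clang-format on"


def excluded_regions(source):
    """Return 1-indexed (first_line, last_line) pairs, marker lines included,
    for spans fenced by clang-format off/on comments."""
    lines = source.splitlines()
    regions = []
    start = None
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        # Only treat the markers as markers when they appear in a comment.
        if not (stripped.startswith("//") or stripped.startswith("/*")):
            continue
        if OFF in stripped and start is None:
            start = number
        elif ON in stripped and start is not None:
            regions.append((start, number))
            start = None
    if start is not None:
        # An unclosed "off" marker disables formatting to the end of the file.
        regions.append((start, len(lines)))
    return regions


# Hypothetical C source with a large macro kept out of the formatter's reach.
EXAMPLE = """\
int a = 1;
// clang-format off
#define BIG_MACRO(x) \\
    do { foo(x); } while (0)
// clang-format on
int b = 2;
"""

regions = excluded_regions(EXAMPLE)
```

A report like this could help reviewers see how much code has been opted out, which bears on the concern about large macros.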
- No update
- No update
- Also some changes with libcurl, especially since this breaks OMPI builds.
- PMIx can interface with REST interfaces (using libcurl)
- JSON
- Build system issue in PMIx when we changed to static DSOs.
- Think this has been resolved
- Ralph was looking at this (private messaged Geoff)
- Jeff and Ralph and Yosi had a good conversation
- rhc has no strong issues either way.
- We prepend LD_LIBRARY_PATH pointing to the PRRTE installation.
- At the moment in OMPI, we overlay this with OMPI library location.
- Seems like the best fix would be to make these two independent.
- PREFIX - enable prefix by default.
- In Open MPI happens to be the same as the OMPI prefix.
- But PRRTE does this by default, because we want the daemons to match the commands.
- OMPI doesn't want to do that. And that's okay
- Instead of --enable prefix-by-default we need --enable mpi-prefix-by-default.
- Looking at it from OMPI perspective
- User asked for prefixing, user wants prefixing... don't care if same or not, just want it to work.
- If user DOESN'T want prefixing, then don't want EITHER prefixing.
- But if have a global PRRTE that might want prefixed.
- PRRTE will prefix by default
- What happens when I want MPI libs redirected?
- Problem is if you build PRRTE INTERNAL, then you can't redirect MPI libraries.
- Gotta set PATH and LD_LIBRARY_PATH correctly
- One of those things, --enable-prefix is NOT default in < v4.0
- There are times when want to redirect OMPIs to a different set of libraries.
- Right now it's a configure / compile time setting, which is problematic. Have to redo all of the subcomponents.
- What would be nice is if this was at runtime, so that ompi's mpirun can find all of the subcomponents at runtime.
- Setting LD_LIBRARY_PATH is the way to point to another set of libraries.
- This breaks because mpirun will overwrite LD_LIBRARY_PATH.
- Personally Doesn't want this as a default.
- Joseph doesn't want us setting LD_LIBRARY_PATH
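
To illustrate the difference between overwriting `LD_LIBRARY_PATH` and prepending to it, here is a small sketch. The function name and the example directories are hypothetical; this is not how mpirun or PRRTE actually implement prefixing, only the general technique of preserving a user's existing entries:

```python
def prepend_library_path(env, new_dir, var="LD_LIBRARY_PATH"):
    """Return a copy of env with new_dir placed first in a colon-separated
    search path, keeping whatever the user had already set."""
    result = dict(env)  # do not mutate the caller's environment
    existing = result.get(var, "")
    # Drop empty entries and any earlier occurrence of new_dir.
    parts = [p for p in existing.split(":") if p and p != new_dir]
    result[var] = ":".join([new_dir] + parts)
    return result


# A user pointing at an alternate set of MPI libraries (hypothetical paths).
user_env = {"LD_LIBRARY_PATH": "/opt/user/mpi/lib"}
launched_env = prepend_library_path(user_env, "/opt/prrte/lib")
```

With prepending, the user's directory is still searched, but only after the prepended one, so a launcher that prepends can still shadow libraries the user meant to redirect. That is part of why keeping the two prefixes independent was suggested.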
- Need to look at the public tests repo for merging in both ULFM and Sessions tests.
- Howard and Geoff will look at this week.
- OMPI docs and manpages, but persistent problem that mpirun is really prrterun
- PMIx and PRRTE now use pandoc. It'd be bad to require both pandoc and Sphinx
- Josh Hursey wrote this up in https://github.com/openpmix/prrte/issues/931, as a way to work out how to provide `man mpirun` for Open MPI
- PR 8329 - convert README, HACKING, and possibly man pages to reStructuredText.
- Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
- Intent is for this to go into v5.0
- mpirun / prrterun - we had quite a bit of details in orte, but are updating as much as possible.
- Ralph has asked about this for PMIx/PRRTE since this is turning out to work
- No update - 3/16
- Could be independent of PMIx and PRRTE.
- PMIx and PRRTE want to follow suit, and not require both pandoc and Sphinx.