WeeklyTelcon_20180109
Geoffrey Paulsen edited this page Jan 15, 2019
- Dialup Info: (Do not post to public mailing list or public wiki)
--- Will fill out as meeting starts
- Geoff Paulsen
- Brian
- Howard Pritchard
- Artem Polyakov
- Josh Hursey
- David Bernholdt
- Edgar Gabriel
- Geoffroy Vallee
- Josh Ladd
- Mathew
- Ralph
- Todd Kordenbrock
- Nathan Hjelm
- Thomas Naughton
- Jan 9th:
- Decided last week to push the date to late February or March.
- Discuss abandoning openib btl.
- Want Chelsio and NVIDIA to be part of the discussion.
- Abandoning might leave some people.
- Nathan has a UCX BTL
- Nathan would like for general sanity.
- In what situations does openib give you functionality that you don't get from libfabric or UCX?
- iWarp.
- GPU support - UCX is working on this, but it's not ready.
- Issue - no one wants to support the code.
- Send an email announcing that openib is looking at end of life, and ask for someone to step up and support it.
- We care until UCX supports GPGPU (soon).
- People are still using it, probably because they haven't
- Idea towards a deprecation path.
- Turn off by default in non-supported paths, etc.
- Summary - leave it in there for now, with some bandaids for now, and
- Should the next release be v3.2 or v4.0?
- Discussion in PR4401
- History:
- Right after v3.0 branched, an ABI break came into master.
- We then merged the ABI break into v3.0, but never downgraded master back to v3.x.
- Need to go back and audit to see if there are any ABI breaks in master.
- Pretty sure there are no ABI breaks between v3.1 and master.
- PROCESS question - Historically, when we rev MAJOR numbers, we've also revved ALL of the shared libraries.
- Decided we'll keep it at v4.0, but not break .so versioning unless the audit determines it's needed, on a library-by-library basis.
- Fortran change - Function prototypes - These couldn't have possibly worked, so treat it like NO ABI change.
- fix and doc, but not
- OSHMEM - pull requests against v3.1, and v3.0 as well.
- Okay to go into v4.0, don't PR back to v3.x releases.
- Process going forward -
- ADD / REMOVE / CHANGED - should make it easier for release managers.
- VERY painful for Release Managers to go through logs to determine if shared library changed.
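The ADD / REMOVE / CHANGED labels map naturally onto the GNU Libtool `current:revision:age` update rules. The following is a minimal sketch under that assumption; the notes do not say how the labels would be consumed, and `bump_version_info` and its parameter names are hypothetical, not part of any Open MPI tooling.

```python
def bump_version_info(current, revision, age, *,
                      source_changed=False, added=False,
                      removed=False, changed=False):
    """Apply the GNU Libtool update rules to a current:revision:age triple.

    Hypothetical helper: the four flags stand in for what a release
    manager would read off ADD / REMOVE / CHANGED labels on merged PRs.
    """
    interface_touched = added or removed or changed

    # Rule: any source change since the last release bumps revision.
    if source_changed or interface_touched:
        revision += 1

    # Rule: any interface added, removed, or changed bumps current
    # and resets revision.
    if interface_touched:
        current += 1
        revision = 0

    # Rule: added interfaces keep backward compatibility, so age grows.
    if added:
        age += 1

    # Rule: removed or changed interfaces break compatibility, so age resets.
    if removed or changed:
        age = 0

    return current, revision, age


if __name__ == "__main__":
    # A bug-fix-only release of a library currently at 3:2:1.
    print(bump_version_info(3, 2, 1, source_changed=True))
    # A release that only adds new interfaces.
    print(bump_version_info(3, 2, 1, added=True))
    # A release that removes an interface (an ABI break).
    print(bump_version_info(3, 2, 1, removed=True))
```

With per-PR labels, the flags for each library could be OR-ed across all PRs merged since the last release, which avoids reading through the logs by hand.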
- Test infrastructure
- Mellanox Jenkins is failing on every PR. What are we going to do about this? Is it real or not?
- can reproduce with a multithreaded test.
- Looks like it's hanging with openib, TCP, (not yalla); Jenkins says Vader hangs.
- Sometimes the infrastructure uses the timeout command; other times ompi_timeout happens.
- One segfault due to vader, rest due to timeout.
- Forget that segfault; it is possibly due to the timeout command.
- Multithreaded test-11bw ??
- looks like an atomic issue (both multi-threaded TCP and Vader).
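The timeout-versus-segfault distinction discussed above can be sketched as an exit-status classifier. This assumes the GNU coreutils `timeout` convention (exit status 124 when the limit is hit), the shell convention of reporting 128 plus the signal number, and Python's `subprocess` convention of a negative return code for a signal-killed child. `classify_exit` is a hypothetical helper, not part of the Jenkins setup, and ompi_timeout's own exit status is not covered here.

```python
import signal

# Exit status GNU coreutils `timeout` returns when the time limit is hit.
TIMEOUT_EXIT = 124


def classify_exit(returncode: int) -> str:
    """Classify a test's exit status as pass, timeout, segfault, or fail.

    Hypothetical helper for triaging CI logs.
    """
    if returncode == 0:
        return "pass"
    if returncode == TIMEOUT_EXIT:
        return "timeout"
    segv = int(signal.SIGSEGV)
    # subprocess reports -SIGSEGV; a shell reports 128 + SIGSEGV.
    if returncode in (-segv, 128 + segv):
        return "segfault"
    return "fail"


if __name__ == "__main__":
    for code in (0, TIMEOUT_EXIT, -int(signal.SIGSEGV), 1):
        print(code, classify_exit(code))
```

Separating the two cases up front makes it easier to tell a real crash from a hang that the harness killed.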
Review All Open Blockers
Review v2.x Milestones v2.1.2
- Shooting for Release on Jan 19th, and RC later this week.
- Issue 4682 - just a better error checking enhancement, so it shouldn't go into the release branch.
- Issue 4336 and 4453 - Edgar to backport a few PRs:
- PR4454 - backport fixes in v3.x branch to v2.x branch.
- PR4351 - memory consumption
Review v3.0.x Milestones v3.0
- Schedule: RC2
- On 3.x series trying to cut RCs on nightly tarballs.
- Will
- Duped issue: Mpool init hang AND Current blocker: Hang on ARM in v3.0.x
- Only hangs in debug. Bad, but not ship-stopper.
- Doesn't happen in optimized mode
- Issue 4563 - not seeing on little ARM boxes here; Jenkins uses --disable-builtin-atomics.
- Because when we disable atomics on powerpc, compiler thinks we have cmp-set128.
- On arm uses old-school lock-based lifo and fifo.
- Fix being worked in PR3988 - bug in PGI compiler
- Issue 4509 madvise hook
- Jeff and Howard will discuss.
- Now that we hook madvise, we need to be more careful.
- Nathan hopes his PR 4576 on master will reduce the occurrences to 0, but needs the user to verify.
- May have to invalidate a LARGE region, even though it's mostly valid, just because glibc invalidated a small part of it.
- Tested PR 4576 in master last week,
- Still need to merge into v2.x, v3.0.x and v3.1.x
- Do we need to Pull PR 4628 into v3.0.x?
- broken in v3.0.0 and later, but it's just launch performance not hang.
- decided NOT to block v3.0.1 for this, and fix this in v3.0.2
Review v3.1.x Milestones v3.1
- SCHEDULE: Like to get out in late January
- BLOCKER:
- 4605 - update PMIx to v2.1.0 - Just a refresh of that directory.
- 4523 - OSC monitoring component when portals is configured.
- Issue 2168 - hasn't that been resolved? Brian will link and close.
- Brian will Issue 4303
Review Master Master Pull Requests
- When your PR has been accepted into a release branch, please go to the issue, and remove the target of the release branch that it was just merged into. Attempting to automate this in the future.
Review Master MTT testing
- Pushed date to late February or March.
- Mellanox, Sandia, Intel
- LANL, Houston, IBM, Fujitsu
- Amazon
- Cisco, ORNL, UTK, NVIDIA