TreePiece replication #164

trquinn · 2024-01-26T01:01:47Z

This is @harshithamenon's code for replicating TreePieces to improve performance by distributing cache requests over several processors.

…plica group. Requests for a key are sent to the corresponding TreePieceReplica instead of to a tree piece. This prevents a single tree piece that receives many requests from becoming the bottleneck.

…quired information from the group instead of the tree piece

This brings tp_replication up to date with master after over a decade of changes. N.B. it does not compile because of charm interface changes over that decade.

This is not production code. But it might work for testing the concept.
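
To illustrate the idea described above, here is a minimal, self-contained sketch of how spreading requests for one hot key over several replica holders reduces the peak load on any single processor. It is not taken from the PR: the function names, the modulo owner mapping, and the evenly strided replica placement are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical mapping: which processor answers a request for `key`.
// With numReplicas == 1 every request goes to the single owner; otherwise
// the requester is mapped onto one of numReplicas evenly strided holders.
int servingPe(std::uint64_t key, int requesterPe, int numPes, int numReplicas) {
    assert(numPes > 0 && numReplicas > 0 && numReplicas <= numPes);
    int owner = static_cast<int>(key % numPes);   // assumed owner mapping
    int slot = requesterPe % numReplicas;         // which replica to ask
    return (owner + slot * (numPes / numReplicas)) % numPes;
}

// Largest number of requests any one processor must answer when every
// processor asks for the same hot key exactly once.
int maxRequestsPerPe(std::uint64_t hotKey, int numPes, int numReplicas) {
    std::vector<int> load(numPes, 0);
    for (int pe = 0; pe < numPes; ++pe)
        ++load[servingPe(hotKey, pe, numPes, numReplicas)];
    return *std::max_element(load.begin(), load.end());
}
```

In this toy model with 64 processors, a single owner answers all 64 requests, while 8 replica holders answer 8 requests each.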

robertwissing · 2024-01-27T22:55:01Z

I ran some tests on the Lamb 80 million case. It works for some numbers of tree pieces (128, 1024, 4096) but breaks for others (960, 16384, 2**16).

Error:

Reason: Ok, before it handled this, but why do we have a null pointer in the tree?!?
[52] Stack Traceback:
[52:0] ChaNGa.mpi.smp.icc.ompi.karolina 0x9c3f96 CmiAbortHelper(char const*, char const*, char const*, int, int)
[52:1] ChaNGa.mpi.smp.icc.ompi.karolina 0x9c3f37 CmiAbort
[52:2] ChaNGa.mpi.smp.icc.ompi.karolina 0x724abf TreePieceReplica::fillRequestNodeFromReplica(CkCacheRequestMsg)
[52:3] ChaNGa.mpi.smp.icc.ompi.karolina 0x81dc93 CkDeliverMessageFree
[52:4] ChaNGa.mpi.smp.icc.ompi.karolina 0x81052c
[52:5] ChaNGa.mpi.smp.icc.ompi.karolina 0x810dbc _processHandler(void, CkCoreState*)
[52:6] ChaNGa.mpi.smp.icc.ompi.karolina 0x96d115 CsdScheduleForever
[52:7] ChaNGa.mpi.smp.icc.ompi.karolina 0x96d07e CsdScheduler
[52:8] ChaNGa.mpi.smp.icc.ompi.karolina 0x9bd4f6
[52:9] libpthread.so.0 0x2b7036ad2ea5
[52:10] libc.so.6 0x2b7038dd7b0d clone

Commit dd456bb: TreePieceReplica::fillRequestNodeFromReplica(): resort to TreePiece to handle NULL node.

trquinn · 2024-02-02T20:36:59Z

@robertwissing see if the recent commit fixes this problem.
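
The commit message for this fix says that fillRequestNodeFromReplica() resorts to the TreePiece when the replica has a NULL node. A rough sketch of that fallback pattern follows. It assumes a replica is a key-to-node table that may be incomplete; the type names and members (OwnerPiece, Replica, fillRequest) are hypothetical stand-ins, not the ChaNGa classes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

using NodeKey = std::uint64_t;
struct NodeData { std::string payload; };

// Stand-in for the tree piece that authoritatively owns the nodes.
struct OwnerPiece {
    std::unordered_map<NodeKey, NodeData> nodes;
    int requestsServed = 0;

    const NodeData* lookup(NodeKey key) {
        ++requestsServed;
        auto it = nodes.find(key);
        return it == nodes.end() ? nullptr : &it->second;
    }
};

// Stand-in for a replica that may hold only part of the owner's nodes.
struct Replica {
    std::unordered_map<NodeKey, NodeData> table;
    OwnerPiece* owner = nullptr;
    int fallbacks = 0;

    // Serve from the local table when possible; on a miss, forward the
    // request to the owner instead of aborting.
    const NodeData* fillRequest(NodeKey key) {
        auto it = table.find(key);
        if (it != table.end()) return &it->second;
        assert(owner != nullptr);
        ++fallbacks;
        return owner->lookup(key);
    }
};
```

The design point is that a miss in the replica is treated as "ask the owner" rather than as a fatal inconsistency, at the cost of one extra hop for those requests.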

robertwissing · 2024-02-11T22:08:03Z

That fixed the problem, but it seems to be slightly slower with tree replication than without. I have not run it on the merger case, but I ran it on a refined dwarf (the one from your 8X benchmark, so 400M particles). I ran it with 1024 and 8192 cores, and it was a bit slower at both core counts. I saw that idle time seems to go up in the tree replication run. Below are the stats for the tenth step:
Tree replication UCX 8N:
Orb3dLB_notopo stats: maxObjLoad 0.657383
Orb3dLB_notopo stats: minWall 32.175646 maxWall 32.445406 avgWall 32.277810 maxWall/avgWall 1.005192
Orb3dLB_notopo stats: minIdle 2.700702 maxIdle 4.229554 avgIdle 3.183818 minIdle/avgIdle 0.848259
Orb3dLB_notopo stats: minPred 27.573949 maxPred 29.088928 avgPred 28.712396 maxPred/avgPred 1.013114
Orb3dLB_notopo stats: minPiece 72.000000 maxPiece 299.000000 avgPiece 104.166667 maxPiece/avgPiece 2.870400
Orb3dLB_notopo stats: minBg 0.154998 maxBg 0.407093 avgBg 0.212711 maxBg/avgBg 1.913825
Orb3dLB_notopo stats: orb migrated 78619 refine migrated 0 objects
took 0.610235 seconds.
Elapsed time: 391.025
Building trees ... took 0.184952 seconds.
Elapsed time: 393.017
Calculating gravity (tree bucket, theta = 0.700000) ... Calculating gravity and SPH took 28.6747 seconds.

Regular UCX 8N:
Orb3dLB_notopo stats: maxObjLoad 0.633993
Orb3dLB_notopo stats: minWall 30.508254 maxWall 30.743554 avgWall 30.563651 maxWall/avgWall 1.005886
Orb3dLB_notopo stats: minIdle 1.446566 maxIdle 2.361893 avgIdle 1.852712 minIdle/avgIdle 0.780783
Orb3dLB_notopo stats: minPred 27.852560 maxPred 28.718140 avgPred 28.442074 maxPred/avgPred 1.009706
Orb3dLB_notopo stats: minPiece 70.000000 maxPiece 299.000000 avgPiece 104.166667 maxPiece/avgPiece 2.870400
Orb3dLB_notopo stats: minBg 0.045028 maxBg 0.256918 avgBg 0.093987 maxBg/avgBg 2.733552
Orb3dLB_notopo stats: orb migrated 78574 refine migrated 0 objects
took 0.534952 seconds.
Elapsed time: 355.906
Building trees ... took 0.181137 seconds.
Elapsed time: 356.088
Calculating gravity (tree bucket, theta = 0.700000) ... Calculating gravity and SPH took 28.4716 seconds.

trquinn · 2024-02-14T18:50:41Z

Looking at the load balancing data, this simulation does not seem to have a difficult time load balancing, so it's not clear that tree replication is needed. Key numbers are: maxPred/avgPred is very close to 1, indicating that the balancer thinks it's about to do a very good job; the final "Calculating gravity" number is slightly less than maxPred, indicating that load balancing was even better than predicted.
I would test on a more clustered simulation where the load balancer is obviously struggling.
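
The two checks described above can be written down as a small helper. This is only a sketch: the struct and function names are made up here, and the inputs are meant to be read off the Orb3dLB_notopo output and the "Calculating gravity" line by hand.

```cpp
#include <cassert>

// Values copied by hand from one step of the log output.
struct LbStats {
    double maxPred;         // "maxPred" from the Orb3dLB_notopo stats
    double avgPred;         // "avgPred" from the Orb3dLB_notopo stats
    double gravitySeconds;  // time reported on the "Calculating gravity" line
};

// Ratio close to 1 means the balancer expects an even distribution.
double predictedImbalance(const LbStats& s) {
    assert(s.avgPred > 0.0);
    return s.maxPred / s.avgPred;
}

// True when the measured gravity time came in under the predicted maximum.
bool beatPrediction(const LbStats& s) {
    return s.gravitySeconds < s.maxPred;
}
```

For the tree replication run quoted earlier (maxPred 29.088928, avgPred 28.712396, gravity 28.6747 s), predictedImbalance gives about 1.013 and beatPrediction is true.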

robertwissing · 2024-02-17T22:07:25Z

I tried to commit but got permission denied. The tree replication needs to be added to the tree build in starform.cpp as well:
// Need to build tree since we just did a drift.
buildTree(PHASE_FEEDBACK);

tpReplicaProxy.clearTable(CkCallbackResumeThread());
treeProxy.replicateTreePieces(CkCallbackResumeThread());

I ran the merger case, which is more clustered, and here I do get quite an improvement, as can be seen below (for 4096 CPUs).

I also ran this simulation with more tree pieces (42000 -> 160000) in an attempt to increase the minPiece number, but instead got minPiece: 0 in these runs. I'm not sure why that is happening.

WITH TREE REPLICATION:

[Orb3dLB_notopo] sorting

Orb3dLB_notopo stats: maxObjLoad 0.749472
Orb3dLB_notopo stats: minWall 2.118554 maxWall 2.219700 avgWall 2.170132 maxWall/avgWall 1.022841
Orb3dLB_notopo stats: minIdle 1.149432 maxIdle 2.167409 avgIdle 1.427119 minIdle/avgIdle 0.805421
Orb3dLB_notopo stats: minPred 0.637064 maxPred 1.917029 avgPred 1.280334 maxPred/avgPred 1.497288
Orb3dLB_notopo stats: minPiece 2.000000 maxPiece 47.000000 avgPiece 10.937500 maxPiece/avgPiece 4.297143
Orb3dLB_notopo stats: minBg 0.047661 maxBg 0.308163 avgBg 0.197008 maxBg/avgBg 1.564221
Orb3dLB_notopo stats: orb migrated 32556 refine migrated 0 objects
took 0.138386 seconds.
Elapsed time: 61.7747
Building trees ... took 0.164258 seconds.
Elapsed time: 62.1046
Calculating gravity (tree bucket, theta = 0.700000) ... Calculating densities/divv ... took 1.099997 seconds.
Calculating pressure gradients ... took 0.302843 seconds.
Kick Close:
Rung 0: 3.35382e-06
uDot update: Rung 0 ... took 0.049003 seconds.
Calculating gravity and SPH took 2.03107 seconds.

REGULAR:

[Orb3dLB_notopo] sorting

Orb3dLB_notopo stats: maxObjLoad 0.762852
Orb3dLB_notopo stats: minWall 2.049248 maxWall 2.137798 avgWall 2.087568 maxWall/avgWall 1.024061
Orb3dLB_notopo stats: minIdle 1.138510 maxIdle 2.127362 avgIdle 1.377826 minIdle/avgIdle 0.826309
Orb3dLB_notopo stats: minPred 0.856364 maxPred 1.810501 avgPred 1.315091 maxPred/avgPred 1.376712
Orb3dLB_notopo stats: minPiece 2.000000 maxPiece 33.000000 avgPiece 10.937500 maxPiece/avgPiece 3.017143
Orb3dLB_notopo stats: minBg 0.006768 maxBg 0.219217 avgBg 0.116041 maxBg/avgBg 1.889133
Orb3dLB_notopo stats: orb migrated 34842 refine migrated 0 objects
took 0.127405 seconds.
Elapsed time: 69.5427
Building trees ... took 0.218447 seconds.
Elapsed time: 69.7612
Calculating gravity (tree bucket, theta = 0.700000) ... Calculating densities/divv ... took 2.152796 seconds.
Calculating pressure gradients ... took 0.310139 seconds.
Kick Close:
Rung 0: 3.35382e-06
uDot update: Rung 0 ... took 0.0361415 seconds.
Calculating gravity and SPH took 2.97707 seconds.

trquinn · 2024-02-20T16:53:58Z

Note: you can always do a pull request on a pull request.
If you can point me to a branch on your fork, I can incorporate your changes.

trquinn · 2024-02-23T18:14:25Z

ParallelGravity.cpp

@@ -2059,6 +2063,9 @@ void Main::advanceBigStep(int iStep) {
    /******** Tree Build *******/
    buildTree(activeRung);

+		tpReplicaProxy.clearTable(CkCallbackResumeThread());


A refactor is needed to avoid code duplication (see also @robertwissing's latest commit in starform.cpp). I suggest we move these lines into Main::buildTree().
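
As a rough picture of the suggested refactor, the sketch below uses a stand-in class that only records the order of calls. It is not the ChaNGa Main class: buildTreeOnly and clearReplicaTable are hypothetical names, and the Charm++ callbacks that the real calls take are left out entirely.

```cpp
#include <string>
#include <vector>

// Stand-in for Main that only records which steps ran, in order.
struct MainSketch {
    std::vector<std::string> calls;

    // Hypothetical name for the existing tree-build work.
    void buildTreeOnly(int phase) {
        (void)phase;  // unused in this sketch
        calls.push_back("buildTree");
    }
    void clearReplicaTable() { calls.push_back("clearTable"); }
    void replicateTreePieces() { calls.push_back("replicateTreePieces"); }

    // Proposed shape: one entry point, so every caller (big step, star
    // formation, feedback) refreshes the replicas after a tree build.
    void buildTree(int phase) {
        buildTreeOnly(phase);
        clearReplicaTable();
        replicateTreePieces();
    }
};
```

With this shape, a caller cannot rebuild the tree and forget to refresh the replicas, because the two replication steps are no longer repeated at each call site.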

trquinn · 2024-08-28T18:33:07Z

Robert reports another issue:
I had an issue with the tree replication code though. When running
multi-timestepping I get this error sometimes:
------------- Processor 2664 Exiting: Called CmiAbort ------------
Reason: Why did we ask for this bucket with no particles?

It seems to happen more frequently when using more treepieces.

spencerw · 2024-10-14T19:52:10Z

I've been having an issue getting the GPU gravity to scale to multiple nodes, and I think this PR might fix it. The GPU likes fewer TreePieces, but doing so puts too heavy of a cache load on a few select cores when scaling to more than one physical node. I tried out this PR and was able to use far fewer TreePieces without running into any load balancing issues.

Unfortunately, the '--with-cuda' flag appears to break the TreePiece replication code when running on multiple nodes. It looks like this PR is still based off of the old WorkRequest GPU code (my changes weren't merged into main until after this PR was opened), so it might be worth trying an upstream merge first.

Init. Accel. ... took 0.031706 seconds.
malloc(): corrupted top size
------------- Processor 85 Exiting: Caught Signal ------------
Reason: Aborted
Calculating gravity (tree bucket, theta = 0.700000) ... [85] Stack Traceback:
[85:0] libc.so.6 0x400018c14650
[85:1] libc.so.6 0x400018bcf86c raise
[85:2] libc.so.6 0x400018bb7030 abort
[85:3] libc.so.6 0x400018c08520
[85:4] libc.so.6 0x400018c1eb48
[85:5] libc.so.6 0x400018c21e80
[85:6] libc.so.6 0x400018c22ac0 malloc
[85:7] ChaNGa.smp.cuda 0x9dae14 CmiAlloc
[85:8] ChaNGa.smp.cuda 0x9bb904 CkAllocMsg
malloc(): corrupted top size

xterm is not installed on Vista, so I can't use gdb to get a more detailed stack trace at the moment. I'll have to try and reproduce this on another machine.

Separately, '--enable-bigkeys' causes the same error if using more than one node.

trquinn · 2024-10-16T19:40:41Z

Upstream merged cleanly. Try the multinode CUDA run again.

spencerw · 2024-10-16T20:09:47Z

The upstream merge appears to have fixed the segfaults I was getting before, both with the CUDA and bigkeys flags.

Running the dwf1.6144 benchmark on two Grace Hopper nodes and 1024 TreePieces, I'm seeing a 2x speedup for gravity with this PR, relative to the master branch. Regardless, we definitely need to rethink how work is sent to the GPU. Splitting up the kernel launches between TreePieces seems to be causing a pretty significant performance penalty.

trquinn · 2024-11-15T21:01:56Z

Robert reports:

I said a few meetings ago that I would try to reproduce the error that I got with the MHD+tp_replication code on the newest tp_replication update. I still get the same error when running multi-timestep runs. It seems to happen mainly after feedback or star formation.

This error:

Reason: Why did we ask for this bucket with no particles?

robertwissing · 2024-11-20T17:31:40Z

I moved the tree replication to Main::buildTree():
tpReplicaProxy.clearTable(CkCallbackResumeThread());
treeProxy.replicateTreePieces(CkCallbackResumeThread());

This seems to have fixed the issue with
"Reason: Why did we ask for this bucket with no particles?"
It was perhaps caused by not replicating the tree after the tree build in the feedback routine.

trquinn · 2024-11-25T01:28:08Z

> I moved the tree replication to Main::buildTree():
> tpReplicaProxy.clearTable(CkCallbackResumeThread());
> treeProxy.replicateTreePieces(CkCallbackResumeThread());
>
> This seems to have fixed the issue with "Reason: Why did we ask for this bucket with no particles?" It was perhaps caused by not replicating the tree after the tree build in the feedback routine.

I've pushed this change.

harshithamenon and others added 5 commits April 2, 2013 19:13

2f52755 Add the code to replicate the tree piece information and fetch the required information from the group instead of the tree piece

e65aa26 Merge branch 'master' into tp_replication
This brings tp_replication up to date with master after over a decade of changes. N.B. it does not compile because of charm interface changes over that decade.

6b2b606 TreePieceReplica code now compiles and runs.
This is not production code. But it might work for testing the concept.

6f2fd2d Merge branch 'master' into tp_replication

trquinn requested a review from robertwissing January 26, 2024 21:56

dd456bb TreePieceReplica::fillRequestNodeFromReplica(): resort to TreePiece to handle NULL node.

54586cc Added tree replication to tree build in starform.cpp aswell

4a06673 Merge branch 'master' into tp_replication

f788642 Move replicateTreePieces() calls into Main:buildTree().
