example code for AMG-DD solver? #1115

Open
BenWibking opened this issue Aug 15, 2024 · 18 comments

Comments

@BenWibking

Are there any examples of AMG-DD usage?

I don't see it used anywhere in https://github.com/hypre-space/hypre/blob/master/src/test/test_ij.c.

I would like to replicate the tests in https://arxiv.org/abs/1906.10575, but there does not appear to be enough information in the paper to choose values for all of the parameters that are exposed by the API.

@waynemitchell
Contributor

@BenWibking, you can use the ij driver to test AMG-DD by passing -solver 90 or -solver 91:
https://github.com/hypre-space/hypre/blob/master/src/test/ij.c#L2392
Which parameters are you unsure about? The paper should discuss the most important parameters (I hope).

@BenWibking
Author

Thanks for the quick response!

I see I was looking at the wrong source file.

I am unsure about the parameter setting the number of ghost layers:

         HYPRE_BoomerAMGDDSetNumGhostLayers(amgdd_solver, amgdd_num_ghost_layers);

Perhaps I missed something in the paper. How many ghost layers were used for the tests shown?

@waynemitchell
Contributor

waynemitchell commented Aug 15, 2024

If I'm remembering this correctly, everything should work correctly with a single ghost layer. The main algorithmic development in the paper was driven by trying to minimize the number of ghost layers required. A single ghost layer is the default set in the ij driver, so you shouldn't need to set anything:
https://github.com/hypre-space/hypre/blob/master/src/test/ij.c#L499
I'm not sure why the number of ghost layers is still exposed as a parameter that the user can set... maybe there are cases that I'm just not thinking of right now when you need to set it higher. But I think it's likely a relic from a previous version of the algorithm.

@BenWibking
Author

Ok, I've looked at that code and tried to modify the AMG2023 code to use AMGDD based on ij.c.

My code is here: BenWibking/AMG2023@c847960#diff-ee753cc8c3a9fe01da6eeade8f8b9aee1d4c7485f3f52f2ae2add0a12222111d

However, I get an MPI_ABORT with no other error message that would enable me to debug it, even when running in a debugger:

Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000000 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (10, 10, 10)
    (Px, Py, Pz) = (1, 1, 1)

=============================================
Generate Matrix:
=============================================
Spatial Operator:
  wall clock time = 0.000295 seconds
  RHS vector has unit components
  Initial guess is 0
=============================================
IJ Vector Setup:
=============================================
RHS and Initial Guess:
  wall clock time = 0.000012 seconds
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[6075,0],0]
  Errorcode: -1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Is there some way to see why Hypre called MPI_ABORT?

@BenWibking
Author

I recompiled without MPI and was able to see this. It looks like a bug:

(lldb) r
Process 30117 launched: '/Users/benwibking/AMG2023/amg' (arm64)
Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000000 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (10, 10, 10)
    (Px, Py, Pz) = (1, 1, 1)

=============================================
Generate Matrix:
=============================================
Spatial Operator:
  wall clock time = 0.000000 seconds
  RHS vector has unit components
  Initial guess is 0
=============================================
IJ Vector Setup:
=============================================
RHS and Initial Guess:
  wall clock time = 0.000000 seconds
[memory.c, 66] hypre_assert failed: 0
Assertion failed: (0), function hypre_OutOfMemory, file memory.c, line 66.
Process 30117 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #4: 0x00000001001f3ae4 amg`hypre_OutOfMemory(size=18446744064213649472) at memory.c:66:4
   63
   64  	   hypre_sprintf(msg, "Out of memory trying to allocate %zu bytes\n", size);
   65  	   hypre_error_w_msg(HYPRE_ERROR_MEMORY, msg);
-> 66  	   hypre_assert(0);
   67  	   fflush(stdout);
   68  	}
   69
Target 0: (amg) stopped.

@BenWibking
Author

The full backtrace is:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #0: 0x00000001856d15f0 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000185709c20 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x0000000185616a30 libsystem_c.dylib`abort + 180
    frame #3: 0x0000000185615d20 libsystem_c.dylib`__assert_rtn + 284
  * frame #4: 0x00000001001f3ae4 amg`hypre_OutOfMemory(size=18446744064213649472) at memory.c:66:4
    frame #5: 0x00000001001f3000 amg`hypre_MAlloc_core(size=18446744064213649472, zeroinit=1, location=hypre_MEMORY_HOST) at memory.c:437:7
    frame #6: 0x00000001001f34f8 amg`hypre_CAlloc(count=18446744072522563848, elt_size=8, location=HYPRE_MEMORY_HOST) at memory.c:948:11
    frame #7: 0x00000001000ab0cc amg`hypre_BoomerAMGDDSetup(amgdd_vdata=0x000000011f815200, A=0x0000600001b20000, b=0x0000600003528080, x=0x0000600003528100) at par_amgdd_setup.c:96:15
    frame #8: 0x000000010007b004 amg`HYPRE_BoomerAMGDDSetup(solver=0x000000011f815200, A=0x0000600001b20000, b=0x0000600003528080, x=0x0000600003528100) at HYPRE_parcsr_amgdd.c:47:13
    frame #9: 0x000000010007121c amg`hypre_GMRESSetup(gmres_vdata=0x0000600001128000, A=0x0000600001b20000, b=0x0000600003528080, x=0x0000600003528100) at gmres.c:241:4
    frame #10: 0x000000010006ff6c amg`HYPRE_GMRESSetup(solver=0x0000600001128000, A=0x0000600001b20000, b=0x0000600003528080, x=0x0000600003528100) at HYPRE_gmres.c:37:13
    frame #11: 0x0000000100005344 amg`main(argc=1, argv=0x000000016fdfeaa8) at amg.c:756:7
    frame #12: 0x000000018537f154 dyld`start + 2476

@waynemitchell
Contributor

waynemitchell commented Aug 16, 2024

Hm... after just a few quick ij driver runs, I'm not able to reproduce this on my side. From your backtrace, it looks like the size passed to the memory allocation is bad (looks like some uninitialized garbage or something). The code should just be allocating a small amount of memory here: basically just a data structure for each level of the AMG hierarchy (the size is num_levels here):
https://github.com/hypre-space/hypre/blob/master/src/parcsr_ls/par_amgdd_setup.c#L96
Maybe the regular AMG setup isn't happening as it should? The AMG-DD setup should perform an underlying AMG setup automatically here:
https://github.com/hypre-space/hypre/blob/master/src/parcsr_ls/par_amgdd_setup.c#L76
But maybe the check in this if statement is not robust for some reason? Can you check whether num_levels at par_amgdd_setup.c:96 has a reasonable value, and if not, check whether the AMG setup call at line 76 is happening?
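As a toy model of the control flow described above (this is not hypre's code: the Mock names, the fixed three-level hierarchy, and the sanity check on num_levels are all made up for illustration), the sketch below shows why a bad num_levels suggests the lazy AMG setup was skipped. The guard only tests whether a pointer is NULL, so a data structure whose pointer field happens to be non-NULL bypasses the setup and leaves num_levels at whatever value it already held.

```c
#include <stdlib.h>

/* Toy stand-in for the underlying AMG data (not hypre's real struct). */
typedef struct {
    double **a_array;    /* per-level operators; NULL until setup has run */
    int      num_levels; /* only meaningful after setup has run */
} MockAMG;

/* Toy stand-in for the AMG-DD data that wraps the AMG data. */
typedef struct {
    MockAMG  amg;
    void   **comp_grid;  /* one pointer per level of the hierarchy */
} MockAMGDD;

/* Pretend AMG setup: builds a 3-level hierarchy. Returns 0 on success. */
int mock_amg_setup(MockAMG *amg)
{
    amg->num_levels = 3;
    amg->a_array = calloc((size_t)amg->num_levels, sizeof *amg->a_array);
    return amg->a_array ? 0 : 1;
}

/* Pretend AMG-DD setup. Returns 0 on success, 1 on failure. */
int mock_amgdd_setup(MockAMGDD *dd)
{
    /* Lazy guard: run the AMG setup only if no hierarchy exists yet. */
    if (!dd->amg.a_array && mock_amg_setup(&dd->amg) != 0)
    {
        return 1;
    }

    /* Hypothetical sanity check, added here for illustration only. */
    if (dd->amg.num_levels <= 0)
    {
        return 1;
    }

    /* The small allocation: one pointer per level. */
    dd->comp_grid = calloc((size_t)dd->amg.num_levels, sizeof *dd->comp_grid);
    return dd->comp_grid ? 0 : 1;
}
```

Starting from a zero-initialized MockAMGDD, the guard fires and three levels are allocated. If a_array is already non-NULL and num_levels holds a nonsensical value, the guard is skipped and the function returns 1 instead of attempting the allocation.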

@BenWibking
Author

num_levels is bad:

Process 46550 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001000ab0bc amg`hypre_BoomerAMGDDSetup(amgdd_vdata=0x0000000131008200, A=0x00006000024a0000, b=0x0000600000aa4080, x=0x0000600000aa4100) at par_amgdd_setup.c:96:15
   93  	   }
   94
   95  	   // Allocate pointer for the composite grids
-> 96  	   compGrid = hypre_CTAlloc(hypre_AMGDDCompGrid *, num_levels, HYPRE_MEMORY_HOST);
   97  	   hypre_ParAMGDDDataCompGrid(amgdd_data) = compGrid;
   98
   99  	   // In the 1 processor case, just need to initialize the comp grids
Target 0: (amg) stopped.
(lldb) p num_levels
(HYPRE_Int) -1186987768
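
The value lldb prints above lines up with the numbers in the earlier backtrace. Read as an unsigned 64-bit count it equals the count in the hypre_CAlloc frame, and multiplied by the 8-byte element size it equals the size reported by hypre_OutOfMemory. A minimal check of that arithmetic, assuming the count is converted to a 64-bit unsigned type as on arm64 (the helper names are hypothetical, not hypre functions):

```c
#include <stdint.h>

/* Reinterpret a signed count as an unsigned 64-bit count.
 * Converting a negative value to an unsigned type wraps modulo 2^64. */
uint64_t count_as_unsigned(int32_t count)
{
    return (uint64_t)(int64_t)count;
}

/* Byte size a calloc-style allocator would request for `count` elements
 * of `elt_size` bytes. Unsigned multiplication also wraps modulo 2^64. */
uint64_t alloc_bytes(int32_t count, uint64_t elt_size)
{
    return count_as_unsigned(count) * elt_size;
}
```

Calling count_as_unsigned(-1186987768) gives 18446744072522563848, and alloc_bytes(-1186987768, 8) gives 18446744064213649472. Both match the backtrace, so the failed allocation is explained by the bad num_levels alone.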

@BenWibking
Author

If I set a breakpoint on line 76, it doesn't trigger, so I assume that means it's not getting executed:

(lldb) breakpoint set --file par_amgdd_setup.c --line 76
Breakpoint 1: where = amg`hypre_BoomerAMGDDSetup + 184 at par_amgdd_setup.c:76:36, address = 0x00000001000ab018
(lldb) r
Process 46833 launched: '/Users/benwibking/AMG2023/amg' (arm64)
Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000000 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (10, 10, 10)
    (Px, Py, Pz) = (1, 1, 1)

=============================================
Generate Matrix:
=============================================
Spatial Operator:
  wall clock time = 0.000000 seconds
  RHS vector has unit components
  Initial guess is 0
=============================================
IJ Vector Setup:
=============================================
RHS and Initial Guess:
  wall clock time = 0.000000 seconds
[memory.c, 66] hypre_assert failed: 0
Assertion failed: (0), function hypre_OutOfMemory, file memory.c, line 66.
Process 46833 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #4: 0x00000001001f3ae4 amg`hypre_OutOfMemory(size=18446744064213649472) at memory.c:66:4
   63
   64  	   hypre_sprintf(msg, "Out of memory trying to allocate %zu bytes\n", size);
   65  	   hypre_error_w_msg(HYPRE_ERROR_MEMORY, msg);
-> 66  	   hypre_assert(0);
   67  	   fflush(stdout);
   68  	}
   69
Target 0: (amg) stopped.

@BenWibking
Author

Ok, it's not setting up BoomerAMG:

(lldb) breakpoint set --file par_amgdd_setup.c --line 74
Breakpoint 1: where = amg`hypre_BoomerAMGDDSetup + 160 at par_amgdd_setup.c:74:9, address = 0x00000001000ab000
(lldb) r
Process 47411 launched: '/Users/benwibking/AMG2023/amg' (arm64)
Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000000 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (10, 10, 10)
    (Px, Py, Pz) = (1, 1, 1)

=============================================
Generate Matrix:
=============================================
Spatial Operator:
  wall clock time = 0.000000 seconds
  RHS vector has unit components
  Initial guess is 0
=============================================
IJ Vector Setup:
=============================================
RHS and Initial Guess:
  wall clock time = 0.000000 seconds
Process 47411 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001000ab000 amg`hypre_BoomerAMGDDSetup(amgdd_vdata=0x0000000122015e00, A=0x0000600001744000, b=0x00006000039440c0, x=0x0000600003944180) at par_amgdd_setup.c:74:9
   71  	   }
   72
   73  	   // If the underlying AMG data structure has not yet been set up, call BoomerAMGSetup()
-> 74  	   if (!hypre_ParAMGDataAArray(amg_data))
   75  	   {
   76  	      hypre_BoomerAMGSetup((void*) amg_data, A, b, x);
   77  	   }
Target 0: (amg) stopped.
(lldb) step
Process 47411 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step in
    frame #0: 0x00000001000ab030 amg`hypre_BoomerAMGDDSetup(amgdd_vdata=0x0000000122015e00, A=0x0000600001744000, b=0x00006000039440c0, x=0x0000600003944180) at par_amgdd_setup.c:80:11
   77  	   }
   78
   79  	   // Get number of processes
-> 80  	   comm = hypre_ParCSRMatrixComm(A);
   81  	   hypre_MPI_Comm_size(comm, &num_procs);
   82
   83  	   // get info from amg about how to setup amgdd
Target 0: (amg) stopped.

@waynemitchell
Contributor

Are you calling HYPRE_BoomerAMGDDCreate() before the setup?

@waynemitchell
Contributor

OK, I think I see your issue. The AMG-DD solver object is called amgdd_solver:
BenWibking/AMG2023@c847960#diff-ee753cc8c3a9fe01da6eeade8f8b9aee1d4c7485f3f52f2ae2add0a12222111dR560
but you pass pcg_precond as the preconditioner to the Krylov solver (GMRES, according to the backtrace above):
BenWibking/AMG2023@c847960#diff-ee753cc8c3a9fe01da6eeade8f8b9aee1d4c7485f3f52f2ae2add0a12222111dR592
So you aren't passing the correct preconditioner object.

@BenWibking
Author

Ah, ok. I was confused about that. Everything works now.

@BenWibking
Author

Does the AMGDD solver work on HIP?

I tried to run it on a single node on Frontier, but I get a segmentation fault (whereas the unmodified AMG2023 runs fine with this build):

Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000005 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (256, 256, 256)
    (Px, Py, Pz) = (2, 2, 2)

=============================================
Generate Matrix:
=============================================
Spatial Operator:
  wall clock time = 0.977577 seconds
  RHS vector has unit components
  Initial guess is 0
=============================================
IJ Vector Setup:
=============================================
RHS and Initial Guess:
  wall clock time = 0.006108 seconds
srun: error: frontier10190: tasks 1,3,5,7: Segmentation fault
srun: Terminating StepId=2239275.0
slurmstepd: error: *** STEP 2239275.0 ON frontier10190 CANCELLED AT 2024-08-17T15:47:01 ***
srun: error: frontier10190: tasks 2,4,6: Terminated
srun: error: frontier10190: task 0: Segmentation fault (core dumped)
srun: Force Terminated StepId=2239275.0

I built Hypre with:

    ./configure --with-hip --with-gpu-arch=gfx90a --with-MPI-lib-dirs="${MPICH_DIR}/lib" --with-MPI-libs="mpi" --with-MPI-include="${MPICH_DIR}/include" --enable-mixedint

@BenWibking BenWibking reopened this Aug 17, 2024
@BenWibking
Author

You can see the full set of changes and the job scripts I used here: LLNL/AMG2023@main...BenWibking:AMG2023:amgdd

@BenWibking
Author

I recompiled Hypre with --enable-unified-memory added:

    ./configure --with-hip --with-gpu-arch=gfx90a --with-MPI-lib-dirs="${MPICH_DIR}/lib" --with-MPI-libs="mpi" --with-MPI-include="${MPICH_DIR}/include" --enable-mixedint --enable-unified-memory

Now it hangs:

Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000004 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (256, 256, 256)
    (Px, Py, Pz) = (2, 2, 2)

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2239289.0 ON frontier10151 CANCELLED AT 2024-08-17T16:06:09 ***
slurmstepd: error: *** JOB 2239289 ON frontier10151 CANCELLED AT 2024-08-17T16:06:09 ***

@waynemitchell
Contributor

It should work fine with HIP (I just tried AMG-DD via the ij driver on an AMD machine with no issues). Your build looks OK to me, and nothing jumps out at me in your changes that would break a GPU run, so I'm not sure where the issue is. Can you try running with valgrind? That might be the easiest way to at least find where the segfault occurs.
