improving scalability #53
Hi Francis,
Ah. Just realized this is 2D, so it's already slab.
BTW, I do not find the results strange, but I cannot really understand them without better knowledge of your PC. Is the memory shared, distributed, heterogeneous? How many cores are there in each CPU? By the looks of it I'm guessing 2 cores per CPU, because the performance is very good on 2 cores, but then there is very little speed-up going to 4. That could be explained by your computer having a fast interconnect between 2 cores on the same CPU, but a slower one from CPU to CPU. But you could also have 4 cores on one CPU, with a fast interconnect between cores 1 and 2 but a slower one between (1, 2) and the two remaining (3, 4). If you look at our MPI paper we discuss this at some length in section 4.

The thing with spectral codes is that they require communication between all cores, because the basis functions are global. So spectral codes are very heavy on communication and require a fast interconnect between all cores. Most computers are not built to handle that. On dedicated supercomputers with very fast interconnects the codes can scale well up to thousands of cores, and I have used this code on up to 70,000 cores with near-perfect scaling. But that does not mean the code will scale at all on a different computer with different hardware. These things are complicated :-)
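A quick way to get a feel for the interconnect is to time a bare `Alltoall`, since the global transposes inside a parallel FFT boil down to exactly that kind of collective. A minimal mpi4py sketch (the 8 MB message size and 10 repetitions are arbitrary illustrative choices, not taken from the thread):

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# 8 MB sent from every rank to every other rank (arbitrary size)
nbytes = 8 * 1024**2
sendbuf = np.ones(size * nbytes // 8, dtype=np.float64)
recvbuf = np.empty_like(sendbuf)

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(10):
    comm.Alltoall(sendbuf, recvbuf)
comm.Barrier()
t1 = MPI.Wtime()

if rank == 0:
    gb_sent = 10 * size * nbytes / 1e9  # data sent by each rank over the run
    print(f"np={size}: {gb_sent / (t1 - t0):.2f} GB/s per-rank Alltoall bandwidth")
```

If the per-rank bandwidth drops sharply when going from 2 to 4 (or 4 to 8) processes, that points at the hardware topology rather than at the code.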
Hello Mikael, and thanks for your responses. Greatly appreciated. Yes, this is a 2D problem, so it is always a slab. I did a 3D version of my code and did some tests with 512^3 and found the scalings for np = 1, 2, 4, 8, 16. I suspect that if I used more degrees of freedom I would get even better performance. This suggests that my architecture can have good scalability. As for the hardware, this is a brand new Inspiron, 1 processor with 18 cores, which I have been led to believe should have great scalability. I might try doing some tests with parts of the code to see how computing the flux/RHS many times scales. If that scales well, then the problem might be with the time stepping. I plan to try this tomorrow and will let you know how things go.
I have some strange results to share that might help to point out where the problem occurs. Below I will copy a simple code that times how long it takes to compute the gradient 20 times. It doesn't do any updating, so it's not physically meaningful, but it should be a good measure of how the effort scales with multiple cores. The results are as follows:

N=2048
N=4096

In both cases the scaling from 1 to 2 is awful: it actually takes longer with two cores than with one. But going from 2 to 4 actually scales very well; the efficiency from 2 to 4 is 0.96 for the first and 0.84 for the second. Again, I would think that with more points it should scale better.

I guess I have some questions. First, do you get similar results on one of your machines that scales well? Second, could there be a problem in what's going on from 1 to 2? Third, I don't remember if I've turned on the optimization or not. Should that make a difference in the scalability?

As an aside, I have tried to install shenfun on one of our servers and have problems both with and without conda, but I will save that for another issue. ;)
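The snippet referred to above did not survive in this copy of the thread; the following is a rough sketch of a benchmark along those lines, assuming a 2D periodic (Fourier x Fourier) discretization and the `FunctionSpace`/`TensorProductSpace` shenfun API. It is not the original code, and the grid size and field data are placeholders:

```python
import numpy as np
from mpi4py import MPI
from shenfun import FunctionSpace, TensorProductSpace, Array, Dx, project

comm = MPI.COMM_WORLD
N = 2048  # grid size per direction (placeholder)

# 2D periodic domain; the last axis uses the real-to-complex transform
K0 = FunctionSpace(N, 'Fourier', dtype='D')
K1 = FunctionSpace(N, 'Fourier', dtype='d')
T = TensorProductSpace(comm, (K0, K1))

ua = Array(T)
ua[:] = np.random.random(ua.shape)  # arbitrary field, only used for timing
u_hat = ua.forward()

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(20):
    ux = project(Dx(u_hat, 0, 1), T).backward()
    uy = project(Dx(u_hat, 1, 1), T).backward()
comm.Barrier()
t1 = MPI.Wtime()

if comm.Get_rank() == 0:
    print(f"N={N}, np={comm.Get_size()}: {t1 - t0:.3f} s for 20 gradients")
```

Run with, e.g., `mpiexec -n 4 python grad_bench.py` and compare the wall times across process counts.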
Hi Francis,
Hello Mikael, all excellent points. I'll start from the first step and go from there. I will keep you posted as I make progress.
Hello Mikael,
I have tested the scalability of my 2D Navier-Stokes code (shared in another issue) and find that it's not doing great. It seems to get worse as the number of degrees of freedom increases. I have copied the results in a text table below.
Just to be clear, I wasn't expecting great efficiency, as I have not been very clever in the setup of the code, but I do find it strange that as the number of points increases, the efficiency decreases. Normally, I would expect it to get better.
Question 1: Does this look odd to you?
Question 2: If I wanted to monitor the efficiency of my code in parallel, what would you recommend I use?
Note: these are done on my new personal desktop with 18 cores and 64 GB of RAM. I am trying to get shenfun installed on a server, which has more cores to play with, but the people in charge have decided not to support conda and don't want us to install conda, so that is becoming more of a chore than it should be. This may make its way into another issue sometime ;)
| N \ Np | 1 | 2 | 4 | 8 | 16 |
| --- | --- | --- | --- | --- | --- |
| 1024 | 1 | 0.68 | 0.61 | 0.40 | 0.26 |
| 2048 | 1 | 0.55 | 0.47 | 0.37 | 0.19 |
| 4096 | 1 | 0.47 | 0.39 | 0.27 | 0.16 |
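The numbers in the table are read here as strong-scaling efficiencies relative to the single-core run, i.e. E(p) = T(1) / (p * T(p)). A trivial helper, with made-up example timings just to show the usage:

```python
def efficiency(t1, tp, p):
    """Strong-scaling efficiency E(p) = T(1) / (p * T(p)); 1.0 is perfect scaling."""
    return t1 / (p * tp)

# Hypothetical timings for illustration only:
# efficiency(100.0, 73.5, 2)  -> ~0.68
# efficiency(100.0, 41.0, 4)  -> ~0.61
```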