ARMV8SVE dgemm kernel on Nvidia Grace slightly slower than generic ARMV8 kernel #4440
Comments
With the same build of Julia as above (BTW, if you're curious you can obtain it from the "Artifacts" tab of https://buildkite.com/julialang/julia-master/builds/32344#018d12b8-4104-4406-b95b-a874b65d18db until the corresponding PR is merged) on A64FX I get:
This is much more reasonable: the armv8sve kernel is noticeably faster than the generic one. I obtained the same numbers on both Isambard and Ookami, so this seems to be consistent across different systems (I haven't tried Fugaku, but could test that too for good measure if it helps). Edit: using 20k x 20k matrices, for closer comparison with the numbers above:
Edit 2: on Fugaku:
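A call of roughly this form reproduces the 20k x 20k measurement (the exact invocation isn't quoted in the thread, so treat this as an assumed sketch):

```julia
using LinearAlgebra

# Double-precision matrix-matrix benchmark on 20000x20000 matrices;
# peakflops runs the multiplication a few times (3 trials by default
# in the linked version) and reports the best rate in flops.
LinearAlgebra.peakflops(20_000)
```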
Hmm, that's a bit disappointing, but I don't think anybody had tested on Neoverse V2 yet, and currently ARMV8SVE uses the same GEMM P and Q parameters that were shown to make poor use of Neoverse V1's cache in #4381 - though the difference is unlikely to lead to as large a speedup as you showed for A64FX (which, incidentally, is what the SVE DGEMM kernel was originally written for).
In that sense I wonder what you'd get with
It's similar to the generic one.
You should expect similar performance, as SVE and ASIMD are the same width here. I started thinking about how to resolve this with #4397, as different implementations have different cache sizes, so it's a bit trickier to specify. For now it'd be safe to use
Ok thanks, at least that trivial tweak of the parameters file would put it ever so slightly ahead... still strange that we see none of the speedup seen on A64FX. (Now I'm wondering if the different number of threads, 48 vs 72, could play a role, although of course it shouldn't.)
Ah, V2 is no longer 256-bit SVE like V1...
Thread scaling is very similar using the different kernels, but the baseline of neoversev1 is just higher.
That's life according to Amdahl's law.
@giordano I assume something like #4444 (comment) would fix this temporarily?
Is it more like Neoverse V1, or like N2, or is generic ARMv8 faster?
Closing as fixed by #4444.
Now that I have a proper build of OpenBLAS 0.3.26 with GCC 11 for Julia (see #4431 for the record), I'm testing it on Nvidia Grace, with surprising results:
Source code of the `peakflops` function is at https://github.com/JuliaLang/julia/blob/b058146cafec8405aa07f9647642fd485ea3b5d7/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L636-L654: it runs a double-precision matrix-matrix multiplication a few times (3 by default) using the underlying BLAS library (OpenBLAS by default), measures the time, and returns the flops. The numbers reported above are reproducible: I always get peakflops on the order of 2.0e12 for armv8sve and 2.2e12 for armv8, so it isn't a one-off result. It appears that the armv8sve dgemm kernel is slower (i.e. lower peakflops) than the generic armv8 one.
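For readers who want to see what that measurement amounts to without digging into the stdlib, here is a minimal sketch (not the exact stdlib code; it assumes the conventional 2n^3 flop count for a square dgemm):

```julia
using LinearAlgebra

# Rough equivalent of peakflops: time C = A*B in double precision a few
# times and report the best rate, counting 2n^3 flops per multiplication.
function gemm_flops_sketch(n::Integer=4096; ntrials::Integer=3)
    a = rand(Float64, n, n)
    b = rand(Float64, n, n)
    c = zeros(Float64, n, n)
    best = Inf
    for _ in 1:ntrials
        best = min(best, @elapsed mul!(c, a, b))
    end
    return 2 * n^3 / best   # flops per second
end

# e.g. gemm_flops_sketch(20_000) on Grace; the numbers reported in this
# issue were ~2.0e12 with armv8sve vs ~2.2e12 with the generic armv8 kernel.
```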
For the record, I get the same results as with Julia's dynamic-arch build also with a native build of OpenBLAS 0.3.26 on the system, driven by Spack, configured with
The compiler used on the system is
So the problem isn't specific to Julia's build of OpenBLAS.