
ARMV8SVE dgemm kernel on Nvidia Grace slightly slower than generic ARMV8 kernel #4440

Closed
giordano opened this issue Jan 16, 2024 · 13 comments

@giordano
Contributor

giordano commented Jan 16, 2024

Now that I have a proper build of OpenBLAS 0.3.26 with GCC 11 for Julia (see #4431 for the record), I'm testing it on Nvidia Grace, with surprising results:

$ OPENBLAS_NUM_THREADS=72 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8sve
julia> peakflops(20_000)
1.975233629673878e12

julia> 
$ OPENBLAS_NUM_THREADS=72 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(20_000)
2.2233581470163237e12

Source code of the peakflops function is at https://github.com/JuliaLang/julia/blob/b058146cafec8405aa07f9647642fd485ea3b5d7/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L636-L654: it basically runs a double-precision matrix-matrix multiplication a few times (3 by default) using the underlying BLAS library (OpenBLAS by default), measures the time, and returns the flops. The numbers reported above are reproducible: I always get peak flops on the order of 2.0e12 for armv8sve and 2.2e12 for armv8, so it isn't a one-off result.
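
For reference, a minimal sketch of what that measurement boils down to (not the actual peakflops implementation; gemm_flops is just an illustrative name):

using LinearAlgebra

# Time an n×n double-precision matrix product through the underlying BLAS
# and convert the best time to flops, roughly what peakflops reports.
function gemm_flops(n::Integer; ntrials::Integer=3)
    a = rand(Float64, n, n)
    c = similar(a)
    best = Inf
    for _ in 1:ntrials
        t = @elapsed mul!(c, a, a)  # dgemm, dispatched to OpenBLAS
        best = min(best, t)
    end
    return 2 * n^3 / best  # a matrix product costs about 2n³ floating-point operations
end

gemm_flops(20_000)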

It appears that the armv8sve dgemm kernel is slower (i.e. lower peakflops) than the generic armv8 one.

For the record, I get the same results not just with Julia's dynamic-arch build but also with a native build of OpenBLAS 0.3.26 on the system, driven by Spack and configured with

 'make' '-j24' '-s' 'CC=/home/cceamgi/repo/spack/lib/spack/env/gcc/gcc' 'FC=/home/cceamgi/repo/spack/lib/spack/env/gcc/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'TARGET=ARMV8SVE' 'USE_LOCKING=1' 'USE_OPENMP=1' 'USE_THREAD=1' 'INTERFACE64=1' 'SYMBOLSUFFIX=64_' 'RANLIB=ranlib' 'NUM_THREADS=512' 'all'

The compiler used on the system is

$ gcc --version
gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

So the problem isn't specific to Julia's build of OpenBLAS.

@giordano
Contributor Author

giordano commented Jan 16, 2024

With the same build of Julia as above (BTW, if you're curious you can obtain it from the "Artifacts" tab of https://buildkite.com/julialang/julia-master/builds/32344#018d12b8-4104-4406-b95b-a874b65d18db until the corresponding PR is merged) on A64FX I get:

$ OPENBLAS_NUM_THREADS=48 OPENBLAS_VERBOSE=2 ./julia -q
Unknown CPU model - implementer 46 part 1
Core: armv8sve
julia> peakflops(10_000)
6.493188698299631e11

julia> 
$ OPENBLAS_NUM_THREADS=48 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(10_000)
1.276729385867521e11

This is much more reasonable: the armv8sve kernel is considerably faster than the generic one. I obtained the same numbers on both Isambard and Ookami, so this seems to be consistent across different systems (I haven't tried Fugaku, but could test that too for good measure if that helps).

Edit: using 20k x 20k matrices, for closer comparison with the numbers above:

$ OPENBLAS_NUM_THREADS=48 OPENBLAS_VERBOSE=2 ./julia -q
Unknown CPU model - implementer 46 part 1
Core: armv8sve
julia> peakflops(20_000)
8.650433108152069e11

julia>
$ OPENBLAS_NUM_THREADS=48 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(20_000)
1.3525547058847133e11

Edit 2: on Fugaku:

$ OPENBLAS_NUM_THREADS=48 OPENBLAS_VERBOSE=2 ./julia -q
Unknown CPU model - implementer 46 part 1
Core: armv8sve
julia> peakflops(20_000)
9.674598962304862e11

julia>
$ OPENBLAS_NUM_THREADS=48 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(20_000)
1.5124337976818753e11

@martin-frbg
Collaborator

Hmm, that's a bit disappointing, but I don't think anybody had tested on Neoverse V2 yet, and currently ARMV8SVE would use the same GEMM P and Q blocking parameters that were shown to make poor use of the NeoverseV1's cache in #4381, though that difference is unlikely to lead to as large a speedup as you showed for A64FX (which, incidentally, the SVE DGEMM kernel was originally written for).

@martin-frbg
Collaborator

In that sense, I wonder what you'd get with OPENBLAS_CORETYPE=NEOVERSEV1 instead of the automatic fallback to ARMV8SVE?

@giordano
Contributor Author

giordano commented Jan 16, 2024

$ OPENBLAS_NUM_THREADS=72 OPENBLAS_CORETYPE=NEOVERSEV1 OPENBLAS_VERBOSE=2 ./julia -q
Core: neoversev1
julia> peakflops(20_000)
2.250280675680324e12

It's similar to the generic one.

@Mousius
Contributor

Mousius commented Jan 16, 2024

You should expect similar performance, as SVE and ASIMD are the same width here. I started thinking about how to resolve this with #4397, as different implementations have different cache sizes, so it's a bit trickier to specify. For now it'd be safe to use NEOVERSEV1, though.
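
Side note, in case it is useful for checking the width claim: on a Linux kernel with SVE support one can read the default SVE vector length from procfs (this is an aside, not something from this thread; the default can in principle be changed by an administrator, but at boot it typically matches the hardware width up to 64 bytes):

# 16 bytes = 128-bit SVE (same width as ASIMD/NEON, e.g. Neoverse V2),
# 64 bytes = 512-bit SVE (e.g. A64FX).
vl_bytes = parse(Int, strip(read("/proc/sys/abi/sve_default_vector_length", String)))
println("default SVE vector length: ", 8 * vl_bytes, " bits")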

@martin-frbg
Collaborator

OK, thanks, at least that trivial tweak of the parameters file would put it ever so slightly ahead... still strange that we see none of the speedup seen on A64FX. (Now I'm wondering if the different number of threads, 48 vs 72, could play a role although of course it shouldn't)

@martin-frbg
Collaborator

Ah, V2 is no longer 256-bit SVE like V1...

@giordano
Contributor Author

> (Now I'm wondering if the different number of threads, 48 vs 72, could play a role although of course it shouldn't)

$ OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8sve
julia> using LinearAlgebra

julia> for threads in (12, 24, 36, 48, 60, 72)
           BLAS.set_num_threads(threads)
           @assert BLAS.get_num_threads() == threads # just to be safe
           @show threads, peakflops(20_000)
       end
(threads, peakflops(20000)) = (12, 3.9528300310608466e11)
(threads, peakflops(20000)) = (24, 7.716683918087646e11)
(threads, peakflops(20000)) = (36, 1.1449729103280854e12)
(threads, peakflops(20000)) = (48, 1.4842484586872903e12)
(threads, peakflops(20000)) = (60, 1.7123865022574075e12)
(threads, peakflops(20000)) = (72, 1.951739414398176e12)

julia> 
$ OPENBLAS_CORETYPE=NEOVERSEV1 OPENBLAS_VERBOSE=2 ./julia -q
Core: neoversev1
julia> using LinearAlgebra

julia> for threads in (12, 24, 36, 48, 60, 72)
           BLAS.set_num_threads(threads)
           @assert BLAS.get_num_threads() == threads # just to be safe
           @show threads, peakflops(20_000)
       end
(threads, peakflops(20000)) = (12, 4.434179171364078e11)
(threads, peakflops(20000)) = (24, 8.723736554950096e11)
(threads, peakflops(20000)) = (36, 1.2741984819577405e12)
(threads, peakflops(20000)) = (48, 1.6504006918821333e12)
(threads, peakflops(20000)) = (60, 1.9505026499028577e12)
(threads, peakflops(20000)) = (72, 2.19212646981316e12)

Thread scaling is very similar using the different kernels, but the baseline of neoversev1 is just higher.
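
As a back-of-the-envelope check of that, using the numbers above, the NEOVERSEV1 kernel is roughly 11–14% faster at every thread count, i.e. a constant offset rather than a scaling difference:

armv8sve   = [3.9528300310608466e11, 7.716683918087646e11, 1.1449729103280854e12,
              1.4842484586872903e12, 1.7123865022574075e12, 1.951739414398176e12]
neoversev1 = [4.434179171364078e11, 8.723736554950096e11, 1.2741984819577405e12,
              1.6504006918821333e12, 1.9505026499028577e12, 2.19212646981316e12]

# ratio at 12, 24, 36, 48, 60, 72 threads ≈ [1.12, 1.13, 1.11, 1.11, 1.14, 1.12]
neoversev1 ./ armv8sve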

@giordano
Contributor Author

And a plot of the above data just to make things more visual:
[plot: measured scaling vs. number of threads for the ARMV8SVE and NEOVERSEV1 kernels, against ideal linear scaling]

Code to generate the plot

using Plots

armv8sve = [3.9528300310608466e11, 7.716683918087646e11, 1.1449729103280854e12, 1.4842484586872903e12, 1.7123865022574075e12, 1.951739414398176e12]
neoversev1 = [4.434179171364078e11, 8.723736554950096e11, 1.2741984819577405e12, 1.6504006918821333e12, 1.9505026499028577e12, 2.19212646981316e12]
threads = [12, 24, 36, 48, 60, 72]

p = plot(threads, threads ./ threads[begin]; linestyle=:dash, label="Ideal", xlabel="Number of threads", ylabel="Scaling", size=(900, 600))
plot!(p, threads, armv8sve ./ armv8sve[begin]; marker=:circle, label="ARMV8SVE")
plot!(p, threads, neoversev1 ./ neoversev1[begin]; marker=:diamond, label="NEOVERSEV1")

@brada4
Contributor

brada4 commented Jan 19, 2024

That's life according to Amdahl's law.
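
To make that concrete, here is a rough sketch (not from this thread) of Amdahl's law against the 12→72-thread numbers above; the ~4% serial fraction is only an illustrative guess that happens to reproduce the observed flattening:

# Amdahl's law relative to the 12-thread baseline: with a serial fraction s,
# the speedup at k times as many threads is 1 / (s + (1 - s) / k).
amdahl(k, s) = 1 / (s + (1 - s) / k)

threads  = [12, 24, 36, 48, 60, 72]
measured = [3.9528300310608466e11, 7.716683918087646e11, 1.1449729103280854e12,
            1.4842484586872903e12, 1.7123865022574075e12, 1.951739414398176e12]

measured ./ measured[begin]               # ≈ [1.0, 1.95, 2.90, 3.75, 4.33, 4.94]
[amdahl(k, 0.04) for k in threads ./ 12]  # ≈ [1.0, 1.92, 2.78, 3.57, 4.31, 5.00]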

@Mousius
Contributor

Mousius commented Jan 19, 2024

@giordano I assume something like #4444 (comment) would fix this temporarily?

@brada4
Contributor

brada4 commented Jan 19, 2024

Is it more like Neoverse V1 or like N2, or is generic ARMv8 faster?

@martin-frbg
Collaborator

Closing as fixed by #4444.
