
ARMV8SVE dgemm kernel on Nvidia Grace slightly slower than generic ARMV8 kernel #4440

Closed
giordano opened this issue Jan 16, 2024 · 13 comments

@giordano
Contributor

giordano commented Jan 16, 2024

Now that I have a proper build of OpenBLAS 0.3.26 with GCC 11 for Julia (see #4431 for the record), I'm testing it on Nvidia Grace, with surprising results:

$ OPENBLAS_NUM_THREADS=72 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8sve
julia> peakflops(20_000)
1.975233629673878e12

julia> 
$ OPENBLAS_NUM_THREADS=72 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(20_000)
2.2233581470163237e12

Source code of the peakflops function is at https://github.com/JuliaLang/julia/blob/b058146cafec8405aa07f9647642fd485ea3b5d7/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L636-L654: it basically runs a double-precision matrix-matrix multiplication a few times (3 by default) using the underlying BLAS library (OpenBLAS by default), measures the time, and returns the flops. The numbers reported above are reproducible: I always get peak flops on the order of 2.0e12 for armv8sve and 2.2e12 for armv8, so it isn't a one-off result.
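
For reference, a minimal sketch of what that measurement boils down to (not the actual peakflops implementation; gemm_flops is just an illustrative name):

using LinearAlgebra

# Time an n×n double-precision matrix product through the underlying BLAS
# and convert the best time to flops, roughly what peakflops reports.
function gemm_flops(n::Integer; ntrials::Integer=3)
    a = rand(Float64, n, n)
    c = similar(a)
    best = Inf
    for _ in 1:ntrials
        t = @elapsed mul!(c, a, a)  # dgemm, dispatched to OpenBLAS
        best = min(best, t)
    end
    return 2 * n^3 / best  # a matrix product costs about 2n³ floating-point operations
end

gemm_flops(20_000)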

It appears that the armv8sve dgemm kernel is slower (i.e. lower peakflops) than the generic armv8 one.

For the record, I get the same results not just with Julia's dynamic-arch build but also with a native build of OpenBLAS 0.3.26 on the system, driven by Spack and configured with

 'make' '-j24' '-s' 'CC=/home/cceamgi/repo/spack/lib/spack/env/gcc/gcc' 'FC=/home/cceamgi/repo/spack/lib/spack/env/gcc/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'TARGET=ARMV8SVE' 'USE_LOCKING=1' 'USE_OPENMP=1' 'USE_THREAD=1' 'INTERFACE64=1' 'SYMBOLSUFFIX=64_' 'RANLIB=ranlib' 'NUM_THREADS=512' 'all'

The compiler used on the system is

$ gcc --version
gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

So the problem isn't specific to Julia's build of OpenBLAS.

@giordano
Contributor Author

giordano commented Jan 16, 2024

With the same build of Julia as above (BTW, if you're curious you can obtain it from the "Artifacts" tab of https://buildkite.com/julialang/julia-master/builds/32344#018d12b8-4104-4406-b95b-a874b65d18db until the corresponding PR is merged) on A64FX I get:

$ OPENBLAS_NUM_THREADS=48 OPENBLAS_VERBOSE=2 ./julia -q
Unknown CPU model - implementer 46 part 1
Core: armv8sve
julia> peakflops(10_000)
6.493188698299631e11

julia> 
$ OPENBLAS_NUM_THREADS=48 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(10_000)
1.276729385867521e11

This is much more reasonable: the armv8sve kernel is considerably faster than the generic one. I obtained the same numbers on both Isambard and Ookami, so this seems to be consistent across different systems (I haven't tried Fugaku, but could test that too for good measure if that helps).

Edit: using 20k x 20k matrices, for closer comparison with the numbers above:

$ OPENBLAS_NUM_THREADS=48 OPENBLAS_VERBOSE=2 ./julia -q
Unknown CPU model - implementer 46 part 1
Core: armv8sve
julia> peakflops(20_000)
8.650433108152069e11

julia>
$ OPENBLAS_NUM_THREADS=48 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(20_000)
1.3525547058847133e11

Edit 2: on Fugaku:

$ OPENBLAS_NUM_THREADS=48 OPENBLAS_VERBOSE=2 ./julia -q
Unknown CPU model - implementer 46 part 1
Core: armv8sve
julia> peakflops(20_000)
9.674598962304862e11

julia>
$ OPENBLAS_NUM_THREADS=48 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(20_000)
1.5124337976818753e11

@martin-frbg
Collaborator

Hmm, that's a bit disappointing, but I don't think anybody had tested on Neoverse V2 yet, and currently ARMV8SVE would use the same GEMM P and Q blocking parameters that were shown to make poor use of the NeoverseV1's cache in #4381, though that difference is unlikely to lead to as large a speedup as you showed for A64FX (which, incidentally, the SVE DGEMM kernel was originally written for).

@martin-frbg
Collaborator

In that sense, I wonder what you'd get with OPENBLAS_CORETYPE=NEOVERSEV1 instead of the automatic fallback to ARMV8SVE?

@giordano
Contributor Author

giordano commented Jan 16, 2024

$ OPENBLAS_NUM_THREADS=72 OPENBLAS_CORETYPE=NEOVERSEV1 OPENBLAS_VERBOSE=2 ./julia -q
Core: neoversev1
julia> peakflops(20_000)
2.250280675680324e12

It's similar to the generic one.

@Mousius
Contributor

Mousius commented Jan 16, 2024

You should expect similar performance, as SVE and ASIMD are the same width here. I started thinking about how to resolve this with #4397, as different implementations have different cache sizes, so it's a bit trickier to specify. For now it'd be safe to use NEOVERSEV1, though.
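
Side note, in case it is useful for checking the width claim: on a Linux kernel with SVE support one can read the default SVE vector length from procfs (this is an aside, not something from this thread; the default can in principle be changed by an administrator, but at boot it typically matches the hardware width up to 64 bytes):

# 16 bytes = 128-bit SVE (same width as ASIMD/NEON, e.g. Neoverse V2),
# 64 bytes = 512-bit SVE (e.g. A64FX).
vl_bytes = parse(Int, strip(read("/proc/sys/abi/sve_default_vector_length", String)))
println("default SVE vector length: ", 8 * vl_bytes, " bits")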

@martin-frbg
Collaborator

OK, thanks, at least that trivial tweak of the parameters file would put it ever so slightly ahead... still strange that we see none of the speedup seen on A64FX. (Now I'm wondering if the different number of threads, 48 vs 72, could play a role although of course it shouldn't)

@martin-frbg
Collaborator

Ah, V2 is no longer 256-bit SVE like V1...

@giordano
Contributor Author

> (Now I'm wondering if the different number of threads, 48 vs 72, could play a role although of course it shouldn't)

$ OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8sve
julia> using LinearAlgebra

julia> for threads in (12, 24, 36, 48, 60, 72)
           BLAS.set_num_threads(threads)
           @assert BLAS.get_num_threads() == threads # just to be safe
           @show threads, peakflops(20_000)
       end
(threads, peakflops(20000)) = (12, 3.9528300310608466e11)
(threads, peakflops(20000)) = (24, 7.716683918087646e11)
(threads, peakflops(20000)) = (36, 1.1449729103280854e12)
(threads, peakflops(20000)) = (48, 1.4842484586872903e12)
(threads, peakflops(20000)) = (60, 1.7123865022574075e12)
(threads, peakflops(20000)) = (72, 1.951739414398176e12)

julia> 
$ OPENBLAS_CORETYPE=NEOVERSEV1 OPENBLAS_VERBOSE=2 ./julia -q
Core: neoversev1
julia> using LinearAlgebra

julia> for threads in (12, 24, 36, 48, 60, 72)
           BLAS.set_num_threads(threads)
           @assert BLAS.get_num_threads() == threads # just to be safe
           @show threads, peakflops(20_000)
       end
(threads, peakflops(20000)) = (12, 4.434179171364078e11)
(threads, peakflops(20000)) = (24, 8.723736554950096e11)
(threads, peakflops(20000)) = (36, 1.2741984819577405e12)
(threads, peakflops(20000)) = (48, 1.6504006918821333e12)
(threads, peakflops(20000)) = (60, 1.9505026499028577e12)
(threads, peakflops(20000)) = (72, 2.19212646981316e12)

Thread scaling is very similar using the different kernels, but the baseline of neoversev1 is just higher.
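
As a back-of-the-envelope check of that, using the numbers above, the NEOVERSEV1 kernel is roughly 11–14% faster at every thread count, i.e. a constant offset rather than a scaling difference:

armv8sve   = [3.9528300310608466e11, 7.716683918087646e11, 1.1449729103280854e12,
              1.4842484586872903e12, 1.7123865022574075e12, 1.951739414398176e12]
neoversev1 = [4.434179171364078e11, 8.723736554950096e11, 1.2741984819577405e12,
              1.6504006918821333e12, 1.9505026499028577e12, 2.19212646981316e12]

# ratio at 12, 24, 36, 48, 60, 72 threads ≈ [1.12, 1.13, 1.11, 1.11, 1.14, 1.12]
neoversev1 ./ armv8sve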

@giordano
Contributor Author

And a plot of the above data just to make things more visual:
[plot: measured scaling vs. number of threads for the ARMV8SVE and NEOVERSEV1 kernels, against ideal linear scaling]

Code to generate the plot

using Plots

armv8sve = [3.9528300310608466e11, 7.716683918087646e11, 1.1449729103280854e12, 1.4842484586872903e12, 1.7123865022574075e12, 1.951739414398176e12]
neoversev1 = [4.434179171364078e11, 8.723736554950096e11, 1.2741984819577405e12, 1.6504006918821333e12, 1.9505026499028577e12, 2.19212646981316e12]
threads = [12, 24, 36, 48, 60, 72]

p = plot(threads, threads ./ threads[begin]; linestyle=:dash, label="Ideal", xlabel="Number of threads", ylabel="Scaling", size=(900, 600))
plot!(p, threads, armv8sve ./ armv8sve[begin]; marker=:circle, label="ARMV8SVE")
plot!(p, threads, neoversev1 ./ neoversev1[begin]; marker=:diamond, label="NEOVERSEV1")

@brada4
Contributor

brada4 commented Jan 19, 2024

That's life according to Amdahl's law.
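
To make that concrete, here is a rough sketch (not from this thread) of Amdahl's law against the 12→72-thread numbers above; the ~4% serial fraction is only an illustrative guess that happens to reproduce the observed flattening:

# Amdahl's law relative to the 12-thread baseline: with a serial fraction s,
# the speedup at k times as many threads is 1 / (s + (1 - s) / k).
amdahl(k, s) = 1 / (s + (1 - s) / k)

threads  = [12, 24, 36, 48, 60, 72]
measured = [3.9528300310608466e11, 7.716683918087646e11, 1.1449729103280854e12,
            1.4842484586872903e12, 1.7123865022574075e12, 1.951739414398176e12]

measured ./ measured[begin]               # ≈ [1.0, 1.95, 2.90, 3.75, 4.33, 4.94]
[amdahl(k, 0.04) for k in threads ./ 12]  # ≈ [1.0, 1.92, 2.78, 3.57, 4.31, 5.00]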

@Mousius
Contributor

Mousius commented Jan 19, 2024

@giordano I assume something like #4444 (comment) would fix this temporarily?

@brada4
Contributor

brada4 commented Jan 19, 2024

Is it more like Neoverse V1 or like N2, or is generic ARMv8 faster?

@martin-frbg
Collaborator

Closing as fixed by #4444.
