
Incorrect result with cblas_dgemv vs reference netlib and other libraries #4324

Open
augusew1 opened this issue Nov 16, 2023 · 14 comments
@augusew1

We recently switched to testing OpenBLAS on a project and are noticing some test case failures due to a matrix multiplication operation returning an incorrect result.

This issue has been observed on a variety of platforms (Ubuntu 22.04, RHEL7, RHEL9, MSYS2 mingw), a variety of compilers (clang-15, mingw-13, gcc-12, gcc-11, gcc-9), a variety of OpenBLAS versions (0.3.3, 0.3.20, 0.3.21, 0.3.24), and a variety of CPUs:

  • Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz
  • Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
  • Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz 2.59 GHz

Reproduction

I have attached a minimal reproducible example (in C++) showing the problem:

Reproduction Code
#include <cblas.h>

#include <cassert>
#include <cstdio>
#include <fstream>

constexpr static size_t size = 16;

void get_values(double* A, double* b, const char* binfile) {
    std::fstream binaryReader;
    binaryReader.open(binfile, std::ios::in | std::ios::binary);
    assert(binaryReader.is_open());

    for (size_t i = 0; i < (size * size); i++) {
        binaryReader.read(reinterpret_cast<char *>(&A[i]), sizeof(double));
    }

    for (size_t i = 0; i < size; i++) {
        binaryReader.read(reinterpret_cast<char *>(&b[i]), sizeof(double));
    }
}

void matrixMultiply(const double* A, const double* b, double* c) {
    const size_t N = size;
    for (size_t i = 0; i < N; ++i) {
        for (size_t j = 0; j < N; ++j){
            c[i] += A[ (i * N) + j] * b[j];
        }
    }
}

void CBlasMatrixMultiply(const double* a, const double* b, double* c){
    int32_t M = size;
    int32_t N = size;
    cblas_dgemv(CblasRowMajor,
                CblasNoTrans,
                M, N,
                1.0, a, N,
                b, 1,
                1.0, c, 1);
}

int main(int argc, char* argv[]) {
    auto* A = new double[size * size]();
    auto* b = new double[size]();

    const char* binfile = argc > 1 ? argv[1] : BIN_FILE;
    get_values(A, b, binfile);

    auto* c_blas = new double[size]();
    CBlasMatrixMultiply(A, b, c_blas);

    auto* c_mat = new double[size]();
    matrixMultiply(A, b, c_mat);

    printf("           BLAS                         MAT\n");
    for (size_t i = 0; i < size; i++) {
        printf("%0.18e\t%0.18e\n", c_blas[i], c_mat[i]);
    }

    // compare against a direct dot product of row 15 with b
    auto c_ddot = cblas_ddot(size, A + (size * 15), 1, b, 1);
    printf("ddot: %0.18e\n", c_ddot);
    printf("dgemv: %0.18e\n", c_blas[15]);
    printf("ddot == dgemv? %s\n", c_ddot == c_blas[15] ? "YES" : "NO");
    printf("dgemv is 0.0? %s\n", c_blas[15] == 0.0 ? "YES" : "NO");

    delete[] A;
    delete[] b;
    delete[] c_blas;
    delete[] c_mat;

    return 0;
}

Compile this code with:

g++ -o blas_test blas_test.cpp -DBIN_FILE=\"path/to/bin\" $(pkg-config --libs --cflags openblas)

And observe the following output:

OpenBLAS result
           BLAS                         MAT
1.175201193643801378e+00        1.175201193643801822e+00
1.103638323514327002e+00        1.103638323514327224e+00
3.578143506473725477e-01        3.578143506473724922e-01
7.045563366848892062e-02        7.045563366848883735e-02
9.965128148869371871e-03        9.965128148869309421e-03
1.099586127207577424e-03        1.099586127207556390e-03
9.945433911373591229e-05        9.945433911360671605e-05
7.620541308983597162e-06        7.620541308896932986e-06
5.064719745540013918e-07        5.064719744437437483e-07
2.971814122565419325e-08        2.971814140421127963e-08
1.560886642160141946e-09        1.560886564391797127e-09
7.419920233786569952e-11        7.419902813631537848e-11
3.221201083647429186e-12        3.221406076314455686e-12
1.292299600663682213e-13        1.289813102624490508e-13
4.440892098500626162e-15        4.440892098500626162e-15
0.000000000000000000e+00        -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES

It is worth noting that only the last value differs outside acceptable numerical precision; every other value agrees within 1e-16. Furthermore, a result of exactly 0.0 is itself suspicious, as there is no obvious way the true value could be exactly zero.

Change the compile command to:

# cblas here is netlib
g++ -o blas_test blas_test.cpp -DBIN_FILE=\"path/to/bin\" $(pkg-config --libs --cflags cblas)

and observe this result:

netlib result
           BLAS                         MAT
1.175201193643801822e+00        1.175201193643801822e+00
1.103638323514327224e+00        1.103638323514327224e+00
3.578143506473724922e-01        3.578143506473724922e-01
7.045563366848883735e-02        7.045563366848883735e-02
9.965128148869309421e-03        9.965128148869309421e-03
1.099586127207556390e-03        1.099586127207556390e-03
9.945433911360671605e-05        9.945433911360671605e-05
7.620541308896932986e-06        7.620541308896932986e-06
5.064719744437437483e-07        5.064719744437437483e-07
2.971814140421127963e-08        2.971814140421127963e-08
1.560886564391797127e-09        1.560886564391797127e-09
7.419902813631537848e-11        7.419902813631537848e-11
3.221406076314455686e-12        3.221406076314455686e-12
1.289813102624490508e-13        1.289813102624490508e-13
4.440892098500626162e-15        4.440892098500626162e-15
-4.440892098500626162e-16       -4.440892098500626162e-16
ddot: -4.440892098500626162e-16
dgemv: -4.440892098500626162e-16
ddot == dgemv? YES
dgemv is 0.0? NO

Here is the binary file that contains a 16x16 matrix and a 16x1 vector:
(NOTE: this is a binary data file; the extension was changed to make GitHub happy)
reproduction.txt

Other Notes

We have done extensive testing in other BLAS-like environments. Both MATLAB (2023a) and numpy (1.26 w/ MKL) return a result very close to the expected -4e-16 and pass our test. And, obviously, the naive matrix multiplication in the reproduction code gives the expected result as well.

The matrix in question is not overly ill-conditioned; it has a condition number of ~10.

@martin-frbg
Collaborator

Somewhat surprising, as the three CPU generations would be using different optimized implementations of the GEMV kernel.

@martin-frbg
Collaborator

On first glance this appears to be some FMA-related effect from letting the compiler use AVX instructions. It is possible to obtain the netlib result by building for TARGET=GENERIC, but if I reconfigure any TARGET to use the same unoptimized, plain-C GEMV kernel without changing the compiler options in Makefile.x86_64, I end up with an "intermediate" result, although there is no other BLAS function in the call graph. (Unfortunately removing -mavx breaks some code that uses intrinsics, so it will take more time to confirm that suspicion.)

@brada4
Contributor

brada4 commented Nov 17, 2023

It is one bit of precision off, a very normal occurrence when computing in a different order.

@augusew1
Author

I don't think that's right: this is roughly -2^-51, and machine precision should be 2^-53. Even if that is "one bit off", the result is exactly 0.0. Surely that's a bug.

If you do:

auto* c_blas = new double[16]();
c_blas[15] = 1.0e-18;
CBlasMatrixMultiply(A, b, c_blas);
printf("%s", c_blas[15] == 1.0e-18 ? "YES ": "NO");

Then it will return YES; it's not even changing the value. Our initial thought was that this was some sort of memory-alignment issue due to the matrix size, but we couldn't fix it with manual alignment to 8 bytes.

@brada4
Contributor

brada4 commented Nov 17, 2023

You abuse machine rounding precision 32 times (or 16 with FMA); discount 5 bits in your check.

It is not a magic symbolic computation that gives an exact result each time.

@augusew1
Author

I do not expect an identical result, but this result is exactly the starting value of c_blas[15], every time. 32 (or 16) floating-point operations and they all cancel out? Every time? For any starting value? Exactly? On multiple CPUs? That is the puzzle here, not why it doesn't exactly match netlib/MKL.

@brada4
Contributor

brada4 commented Nov 17, 2023

The result is rounded to output precision when stored to a register every 1 or 2 FLOPs, 50% up and 50% down, and so the lottery continues until the end of the computation. Yes, workload splitting affects the result.

@brada4
Contributor

brada4 commented Nov 20, 2023

Just a guess: Intel (MKL) uses generic code for small inputs, then gradually jumps to vector code and adds CPU threads as inputs grow. OpenBLAS always uses vector code and switches to all CPUs at one point.

@martin-frbg
Collaborator

Then it will return YES. It's not even changing the value. Our initial thought was that this was some sort of memory alignment issue due to the matrix size. We couldn't fix it with manual alignment to 8 bytes though.

As far as I can tell, all terms cancel with the "right" evaluation order, and y[15] evaluates to an "exact" zero within the limits of precision. Since this is added to what the c_blas array initially contained (the "beta times y" term of the GEMV equation, beta being one in your case), you see no change. There appears to be loss of precision in the AVX2 "microkernel" used for Haswell and newer, due to operand size limitations in the instruction set. Certainly not ideal, but not a catastrophic failure either (which would have shown up in test suites during the almost ten years this code has been in place).

@augusew1
Author

augusew1 commented Nov 27, 2023

There appears to be loss of precision in the AVX2 "microkernel" used for Haswell and newer due to operand size limitations in the instruction set. Certainly not ideal, but not a catastrophic failure either (which would certainly have shown up in testsuites during the almost ten years this code has been in place)

I can't understand this conclusion, given that I can reproduce this with an OPENBLAS_CORETYPE setting that doesn't seem to use AVX2. Am I misunderstanding what this environment variable does? Is AVX2 always checked even when an older CPU is manually set?

Here's what I'm testing on:

lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  6
  On-line CPU(s) list:   0-5
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  1
    Core(s) per socket:  1
    Socket(s):           6
    Stepping:            0
    BogoMIPS:            5786.40
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc
                         a cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall n
                         x pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtop
                         ology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni
                          pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic mov
                         be popcnt tsc_deadline_timer aes xsave avx f16c rdrand
                         hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti
                          ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 sme
                         p bmi2 invpcid avx512f avx512dq rdseed adx smap clflush
                         opt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsa
                         ves arat pku ospke md_clear flush_l1d arch_capabilities
Virtualization features:
  Hypervisor vendor:     VMware
  Virtualization type:   full
Caches (sum of all):
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    6 MiB (6 instances)
  L3:                    132 MiB (6 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-5
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Mitigation; Clear CPU buffers; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT Host state unknown
  Retbleed:              Mitigation; IBRS
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer
                          sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB
                          filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

This is on Ubuntu 22.04, with the libopenblas-pthread-dev package. I can also recreate this with OpenBLAS 0.3.24 on MSYS2, but that machine is Coffee Lake and can't use SkylakeX instructions.

I'm using OPENBLAS_VERBOSE=5 to print the core type at the top of each run.

OPENBLAS_CORETYPE=Nehalem (No AVX2)
Core: Nehalem
           BLAS                         MAT
1.175201193643801378e+00        1.175201193643801822e+00
1.103638323514327002e+00        1.103638323514327224e+00
3.578143506473726032e-01        3.578143506473724922e-01
7.045563366848886511e-02        7.045563366848883735e-02
9.965128148869364932e-03        9.965128148869309421e-03
1.099586127207521913e-03        1.099586127207556390e-03
9.945433911356243994e-05        9.945433911360671605e-05
7.620541308817063708e-06        7.620541308896932986e-06
5.064719744429790893e-07        5.064719744437437483e-07
2.971814136443207133e-08        2.971814140421127963e-08
1.560886614404566330e-09        1.560886564391797127e-09
7.419920233786569952e-11        7.419902813631537848e-11
3.221423128252354218e-12        3.221406076314455686e-12
1.287858708565181587e-13        1.289813102624490508e-13
4.440892098500626162e-15        4.440892098500626162e-15
0.000000000000000000e+00        -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
OPENBLAS_CORETYPE=SandyBridge (No AVX2)
Core: Sandybridge
           BLAS                         MAT
1.175201193643801378e+00        1.175201193643801822e+00
1.103638323514327002e+00        1.103638323514327224e+00
3.578143506473726032e-01        3.578143506473724922e-01
7.045563366848886511e-02        7.045563366848883735e-02
9.965128148869364932e-03        9.965128148869309421e-03
1.099586127207521913e-03        1.099586127207556390e-03
9.945433911356243994e-05        9.945433911360671605e-05
7.620541308817063708e-06        7.620541308896932986e-06
5.064719744429790893e-07        5.064719744437437483e-07
2.971814136443207133e-08        2.971814140421127963e-08
1.560886614404566330e-09        1.560886564391797127e-09
7.419920233786569952e-11        7.419902813631537848e-11
3.221423128252354218e-12        3.221406076314455686e-12
1.287858708565181587e-13        1.289813102624490508e-13
4.440892098500626162e-15        4.440892098500626162e-15
0.000000000000000000e+00        -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
OPENBLAS_CORETYPE=Haswell
Core: Haswell
           BLAS                         MAT
1.175201193643801378e+00        1.175201193643801822e+00
1.103638323514327002e+00        1.103638323514327224e+00
3.578143506473725477e-01        3.578143506473724922e-01
7.045563366848892062e-02        7.045563366848883735e-02
9.965128148869371871e-03        9.965128148869309421e-03
1.099586127207577424e-03        1.099586127207556390e-03
9.945433911373591229e-05        9.945433911360671605e-05
7.620541308983597162e-06        7.620541308896932986e-06
5.064719745540013918e-07        5.064719744437437483e-07
2.971814122565419325e-08        2.971814140421127963e-08
1.560886642160141946e-09        1.560886564391797127e-09
7.419920233786569952e-11        7.419902813631537848e-11
3.221201083647429186e-12        3.221406076314455686e-12
1.292299600663682213e-13        1.289813102624490508e-13
4.440892098500626162e-15        4.440892098500626162e-15
0.000000000000000000e+00        -4.440892098500626162e-16
ddot: 0.000000000000000000e+00
dgemv: 0.000000000000000000e+00
ddot == dgemv? YES
dgemv is 0.0? YES
OPENBLAS_CORETYPE=SkylakeX (AVX512)
Core: SkylakeX
           BLAS                         MAT
1.175201193643801378e+00        1.175201193643801822e+00
1.103638323514327002e+00        1.103638323514327224e+00
3.578143506473725477e-01        3.578143506473724922e-01
7.045563366848892062e-02        7.045563366848883735e-02
9.965128148869371871e-03        9.965128148869309421e-03
1.099586127207577424e-03        1.099586127207556390e-03
9.945433911373591229e-05        9.945433911360671605e-05
7.620541308983597162e-06        7.620541308896932986e-06
5.064719745540013918e-07        5.064719744437437483e-07
2.971814122565419325e-08        2.971814140421127963e-08
1.560886642160141946e-09        1.560886564391797127e-09
7.419920233786569952e-11        7.419902813631537848e-11
3.221201083647429186e-12        3.221406076314455686e-12
1.292299600663682213e-13        1.289813102624490508e-13
4.440892098500626162e-15        4.440892098500626162e-15
0.000000000000000000e+00        -4.440892098500626162e-16
ddot: -8.881784197001252323e-16
dgemv: 0.000000000000000000e+00
ddot == dgemv? NO
dgemv is 0.0? YES

In my testing, I'm seeing that neither AVX2 nor AVX matters, as Nehalem and SandyBridge give identical results. AVX2 on Haswell gives a slightly different result, but one that's well within machine precision. The biggest difference I see is with SkylakeX and, presumably, AVX-512 giving a different result for ddot only. This value is "correct" to us, as it passes our test case and matches very closely what other BLAS libraries give.

Please help me understand how the AVX2 microkernel is the issue here. If OPENBLAS_CORETYPE does not actually control this, does OpenBLAS provide a way to disable AVX2 at runtime?

@brada4
Contributor

brada4 commented Nov 27, 2023

It will fall back to older compute kernels if you do not have AVX2 in CPUID.
The difference is in the 1-2 least significant bits of the significand and is expected. If you want the same result every time, you have to use the unoptimized netlib build.

@oscardssmith

oscardssmith commented Sep 4, 2024

The issue here is actually really simple: OpenBLAS's gemv isn't using FMA, which would also be faster.

Some Julia code demonstrating this:

A = [ 1.1 0 -1.1 0
     -1.1 0  1.1 0]
v = fill(1.33,4)

A*v # returns [0.0, 0.0]

using MKL
A*v # returns [1.0391687510491465e-16, -1.0391687510491465e-16]

function simple_gemv_fma(A, v)
    result=zeros(eltype(A), size(A, 1))
    for i = 1:size(A, 1)
        for j = 1:size(A, 2)
            result[i] = fma(A[i,j], v[j], result[i])
        end
    end
    return result
end
function simple_gemv(A, v)
    result=zeros(eltype(A), size(A, 1))
    for i = 1:size(A, 1)
        for j = 1:size(A, 2)
            result[i] = A[i,j] * v[j] + result[i]
        end
    end
    return result
end

simple_gemv(A, v) # returns [0.0, 0.0]
simple_gemv_fma(A, v) # returns [1.0391687510491465e-16, -1.0391687510491465e-16]

@martin-frbg
Collaborator

The issue here is actually really simple. OpenBLAS for gemv isn't using FMA which

I don't think it is that simple, unless you meant to write "the reference BLAS isn't using FMA", which is trivially true. The OpenBLAS GEMV kernels for the CPUs mentioned here all use FMA instructions; the question is whether they could/should be rewritten to minimize the difference seen in this particular case.

@oscardssmith

Hmm. If OpenBLAS GEMV is using FMA, in what order is it running to match the results of the naive loop without FMA? I saw that the results matched the obvious algorithm and assumed from there.
