Picked up the mantle to implement matrices and matmul in Python, C, and CUDA, then optimize the shit out of it. The inspiration is here.
A gem of knowledge, or where do these GEMM-something names come from?
- Tiled matmul
- Multithreading
- Vectorization with CPU intrinsics
- CUDA
- Numpy (multithreaded, M2 Pro):
(128, 128, 128): 57.52 +/- 28.71 GFLOPS
(512, 512, 512): 253.81 +/- 25.85 GFLOPS
(1024, 1024, 1024): 274.68 +/- 56.40 GFLOPS
- Numpy (single thread, M2 Pro):
(128, 128, 128): 92.89 +/- 9.58 GFLOPS
(512, 512, 512): 100.50 +/- 8.87 GFLOPS
(1024, 1024, 1024): 104.77 +/- 2.32 GFLOPS
- Numpy (multithreaded, Ryzen 3600, averaged over 10x100 iterations):
(128, 128, 128): 113.38 +/- 50.58 GFLOPS
(512, 512, 512): 30.59 +/- 14.48 GFLOPS
(1024, 1024, 1024): 112.05 +/- 20.51 GFLOPS
- Numpy (single thread, Ryzen 3600):
(128, 128, 128): 89.12 +/- 0.66 GFLOPS
(512, 512, 512): 107.86 +/- 2.25 GFLOPS
(1024, 1024, 1024): 106.82 +/- 1.46 GFLOPS
- Python (matrix.py): ~0.0004 GFLOPS (Ryzen 3600), ~0.0010 (M2 Pro)
- C (matrix.c):
- Vanilla:
- M2 Pro: ~0.31 GFLOPS (-O0), higher optimization breaks benchmarking
- Ryzen 3600: ~0.13 (-O0), ~0.55 (-O1), ~1.00-1.6 for small and 0.33 for large (-O2), ~1.9 small, 0.9 medium, 0.33 for large (-O3)
- Optimization: Transpose for row-major optimized access (sketched after this list)
- Ryzen 3600: ~0.45, 0.49, 0.49 (-O0), ~2.70, 2.58, 2.58 (-O1), same for -O2, ~3.34, 2.65, 2.61 (-O3)
- C (strassens.c):
- Vanilla:
- Ryzen 3600: ~0.43, 0.57, 0.65 GFLOPS (-O0), ~3.10, 3.45, 3.80 GFLOPS (-O1), ~3.80, 4.20, 4.90 GFLOPS (-O3)
- Optimization: Transpose for row-major optimized access
- Ryzen 3600: ~0.49, 0.61, 0.69 GFLOPS (-O0), ~3.23, 3.60, 4.20 GFLOPS (-O1), ~4.03, 4.70, 5.88 GFLOPS (-O3)
- Optimization: Transpose + Tiling for squeezing tiny matrices in CPU Cache
- Ryzen 3600: ~0.90, 1.04, 1.20 GFLOPS (-O0), ~4.75, 4.63, 5.77 GFLOPS (-O1), ~9.37, 9.94, 12.01 GFLOPS (-O3)
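The transpose trick in a nutshell: the naive inner loop walks B down a column, i.e. with stride n, so most of every fetched cache line goes unused; transposing B first lets both operands stream row-wise. A minimal sketch, assuming square n x n row-major float matrices (raw arrays here, not the repo's Matrix struct):

```c
// B has been transposed into Bt up front, so both A and Bt are read
// with stride 1 in the hot loop and cache lines are fully used.
void matmul_transpose(const float *A, const float *Bt, float *C, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * Bt[j * n + k];  // row of A * row of Bt
            C[i * n + j] = acc;
        }
}
```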
Matrices are implemented using a 3D row-major strided representation. To create a Matrix object:
m = Matrix(d=2, h=2, w=2, data=[i for i in range(8)])
The above creates a 2x2x2 matrix with the values 0..7 as entries. Operations on matrices work as usual in Python, with @ for matmul.
Testing Python matrix and matmul implementation:
python -m unittest tests.test_matrix.TestMatrix
Same as Python: 3D row-major strided matrices, built around a Matrix struct. Use get, set, and matmul to access, set, and multiply. To create a Matrix:
Matrix A;
Matrix B;
Matrix C;
allocate_matrix_random(&A, 1, M, N);
allocate_matrix_consecutive(&B, 1, N, K);
allocate_matrix_zeros(&C, 1, M, K);
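For reference, here is roughly what the strided representation boils down to. A minimal sketch: the struct fields, float element type, and exact signatures are assumptions, not necessarily what matrix.c uses.

```c
#include <stddef.h>

// 3D row-major strided layout: element (i, j, k) of a d x h x w matrix
// lives at flat offset i*h*w + j*w + k.
typedef struct {
    int d, h, w;   // depth, height, width
    float *data;   // flat buffer of d*h*w elements
} Matrix;

static inline size_t strided_index(const Matrix *m, int i, int j, int k) {
    return ((size_t)i * m->h + j) * m->w + k;
}

static inline float get(const Matrix *m, int i, int j, int k) {
    return m->data[strided_index(m, i, j, k)];
}

static inline void set(Matrix *m, int i, int j, int k, float v) {
    m->data[strided_index(m, i, j, k)] = v;
}
```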
Naive benchmark:
gcc -o matmul benchmarks/naive.c -O0; ./matmul
Strassen benchmark:
gcc -o strassen benchmarks/strassen.c -O0; ./strassen
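The GFLOPS numbers come from the usual accounting: one M x N times N x K matmul does about 2*M*N*K floating-point operations (a multiply and an add per inner step). A sketch of the measurement with a hypothetical helper, not the repo's actual benchmark code:

```c
#include <time.h>

// Hypothetical helper: time a single matmul call and convert to GFLOPS.
double measure_gflops(void (*run_matmul)(void), long M, long N, long K) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    run_matmul();                                   // kernel under test
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (2.0 * M * N * K) / secs / 1e9;
}
```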
Profiling:
gcc -pg -o strassen benchmarks/strassen.c -O0; ./strassen
gprof strassen gmon.out
Function call overhead will kill you. Look at this gprof report (~0.13 GFLOPS):
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
68.36 13.57 13.57 2420113408 0.00 0.00 get
17.88 17.12 3.55 3 1.18 6.61 matmul
13.55 19.81 2.69 2421440512 0.00 0.00 strided_index
0.10 19.83 0.02 1327104 0.00 0.00 set
0.05 19.84 0.01 9 0.00 0.00 free_matrix
0.05 19.85 0.01 _init
0.00 19.85 0.00 6 0.00 0.00 allocate_matrix_random
0.00 19.85 0.00 3 0.00 0.00 allocate_matrix_zeros
And now with the strided indexing, setting, and getting hardcoded into the sub and add functions (0.45 GFLOPS):
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
99.06 5.26 5.26 3 1.75 1.76 matmul
0.56 5.29 0.03 1327104 0.00 0.00 set
0.19 5.30 0.01 6 0.00 0.00 allocate_matrix_random
0.19 5.31 0.01 _init
0.00 5.31 0.00 1327104 0.00 0.00 strided_index
0.00 5.31 0.00 9 0.00 0.00 free_matrix
0.00 5.31 0.00 3 0.00 0.00 allocate_matrix_zeros
Bonkers.
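The fix is mechanical: stop paying a function call (plus a fresh offset computation) for every single element. A sketch of the idea on the plain triple loop, reusing the A, B, C from the creation snippet above (M x N times N x K, depth 1) and the field names assumed in the struct sketch; the repo's code may differ in details.

```c
// Before: every element access goes through get()/set(), which in turn
// call strided_index() -- billions of calls show up in the gprof report.
for (int i = 0; i < M; i++)
    for (int j = 0; j < K; j++) {
        float acc = 0.0f;
        for (int p = 0; p < N; p++)
            acc += get(&A, 0, i, p) * get(&B, 0, p, j);
        set(&C, 0, i, j, acc);
    }

// After: the offsets are hardcoded into the loop, so the hot path is pure
// arithmetic on the flat buffers and the compiler can keep the accumulator
// and indices in registers.
for (int i = 0; i < M; i++)
    for (int j = 0; j < K; j++) {
        float acc = 0.0f;
        for (int p = 0; p < N; p++)
            acc += A.data[i * N + p] * B.data[p * K + j];
        C.data[i * K + j] = acc;
    }
```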
Initial naive Strassen (~0.02 GFLOPS):
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
26.63 7.95 7.95 593362726 0.00 0.00 add
17.92 13.30 5.35 1132783389 0.00 0.00 allocate_matrix_zeros
17.72 18.59 5.29 377594462 0.00 0.00 sub
9.65 21.47 2.88 107884132 0.00 0.00 split
7.05 23.57 2.10 3 0.70 9.85 strassens
6.60 25.55 1.97 647304798 0.00 0.00 get
4.29 26.82 1.28 323652399 0.00 0.00 set
4.02 28.02 1.20 1132783395 0.00 0.00 free_matrix
3.84 29.17 1.15 970957197 0.00 0.00 strided_index
1.24 29.54 0.37 53942066 0.00 0.00 combine
0.67 29.74 0.20 _init
0.13 29.78 0.04 matmul
0.12 29.82 0.04 allocate_matrix_consecutive
0.07 29.84 0.02 6 0.00 0.00 allocate_matrix_random
0.05 29.85 0.01 main
Defaulting to vanilla matmul for matrices smaller than or equal to 64 (0.48, 0.58, 0.65 GFLOPS); the cutoff is sketched right after the profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
85.04 3.07 3.07 2751 0.00 0.00 matmul
6.65 3.31 0.24 5038 0.00 0.00 add
2.77 3.41 0.10 3206 0.00 0.00 sub
2.49 3.50 0.09 916 0.00 0.00 split
1.66 3.56 0.06 11268096 0.00 0.00 set
0.83 3.59 0.03 458 0.00 0.00 combine
0.28 3.60 0.01 11268096 0.00 0.00 strided_index
0.28 3.61 0.01 6 0.00 0.00 allocate_matrix_random
0.00 3.61 0.00 9627 0.00 0.00 free_matrix
0.00 3.61 0.00 9621 0.00 0.00 allocate_matrix_zeros
0.00 3.61 0.00 3 0.00 1.20 strassens
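The cutoff is a single branch at the top of the recursion: below the threshold the O(n^3) loop wins because it skips all the allocate/split/combine traffic that dominated the previous profile. A sketch only, assuming the Matrix struct from earlier and a vanilla matmul(A, B, C); the real signatures in strassens.c may differ.

```c
#define STRASSEN_CUTOFF 64

// Assumes square power-of-2 matrices.
void strassens(const Matrix *A, const Matrix *B, Matrix *C) {
    int n = A->h;
    if (n <= STRASSEN_CUTOFF) {
        matmul(A, B, C);   // vanilla triple loop beats the recursion here
        return;
    }
    // otherwise: split A and B into quadrants, form the 7 Strassen
    // products recursively, and combine them into the quadrants of C
}
```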
So I wanted to get better caching with tiled matmul. Well, it didn't work: about a 30% performance drop...
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
91.32 29.03 29.03 13755 0.00 0.00 matmul_transpose_tiled
3.59 30.17 1.14 25190 0.00 0.00 add
2.49 30.96 0.79 16030 0.00 0.00 sub
1.01 31.28 0.32 4580 0.00 0.00 split
0.98 31.59 0.31 13755 0.00 0.00 transpose_inplace
0.53 31.76 0.17 13755 0.00 0.00 zero_matrix
0.09 31.79 0.03 2290 0.00 0.00 combine
0.00 31.79 0.00 48099 0.00 0.00 free_matrix
0.00 31.79 0.00 48093 0.00 0.00 allocate_matrix_zeros
0.00 31.79 0.00 15 0.00 2.12 strassens
0.00 31.79 0.00 6 0.00 0.00 allocate_matrix_random
Compared to just the transposed Strassen:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
86.05 14.99 14.99 13755 0.00 0.00 matmul_transpose
5.45 15.94 0.95 25190 0.00 0.00 add
3.96 16.63 0.69 16030 0.00 0.00 sub
2.64 17.09 0.46 4580 0.00 0.00 split
1.32 17.32 0.23 13755 0.00 0.00 transpose_inplace
0.52 17.41 0.09 2290 0.00 0.00 combine
0.06 17.42 0.01 6 0.00 0.00 allocate_matrix_random
0.00 17.42 0.00 48099 0.00 0.00 free_matrix
0.00 17.42 0.00 48093 0.00 0.00 allocate_matrix_zeros
0.00 17.42 0.00 15 0.00 1.16 strassens
Turns out that one of the few times I trusted SlopGPT-generated code, I got misguided. We kind of assume here that (a) we always have square matrices, (b) their shapes match, and (c) the shape is a power of 2. That is quite a lot of assumptions. But we have a Need for Speed here, and speed is Most Wanted. Ergo, we don't care about real life, where matrices usually don't fit these requirements. SlopGPT added (1) a check whether tile_size > size at every iteration and (2) an additional loop over the inner dimension. Look what happens when we remove them and come back to our perfect little square, matching, power-of-2 world:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
73.07 7.46 7.46 13755 0.54 0.57 matmul_transpose_tiled
10.87 8.57 1.11 25190 0.04 0.04 add
6.17 9.20 0.63 16030 0.04 0.04 sub
5.29 9.74 0.54 4580 0.12 0.12 split
2.64 10.01 0.27 13755 0.02 0.02 transpose_inplace
1.47 10.16 0.15 13755 0.01 0.01 zero_matrix
0.39 10.20 0.04 2290 0.02 0.02 combine
0.10 10.21 0.01 6 1.67 1.67 allocate_matrix_random
0.00 10.21 0.00 48099 0.00 0.00 free_matrix
0.00 10.21 0.00 48093 0.00 0.00 allocate_matrix_zeros
0.00 10.21 0.00 15 0.00 680.00 strassens
Yessss, that's more like it. Now we are almost 200% faster, landing at 0.89 GFLOPS for small matrices at -O0.
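For the record, the stripped-down kernel is roughly this (square n x n matrices with n and the tile size both powers of 2, so every bound divides evenly and no runtime checks are needed; the names and the tile size are mine, not necessarily the repo's):

```c
#define TILE 32   // three TILE x TILE float blocks fit comfortably in L1

// B is pre-transposed into Bt; C must be zeroed beforehand (cf. zero_matrix
// in the profile) because each kk tile only adds a partial dot product.
void matmul_transpose_tiled(const float *A, const float *Bt, float *C, int n) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++) {
                        float acc = 0.0f;
                        for (int k = kk; k < kk + TILE; k++)
                            acc += A[i * n + k] * Bt[j * n + k];
                        C[i * n + j] += acc;
                    }
}
```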
So the typical time complexity of matrix multiplication is O(n^3). Strassen trades one of the eight recursive block multiplications for a bunch of extra additions, giving T(n) = 7T(n/2) + O(n^2), which works out to O(n^log2(7)) ≈ O(n^2.81).
- great read: https://salykova.github.io/matmul-cpu
- obvious read: https://siboehm.com/articles/22/CUDA-MMM
- Efficient matmul on CPU: https://www.cs.utexas.edu/~flame/pubs/GotoTOMS_final.pdf
- SIMD intro: https://www.codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX
- ARM instructions (supposedly M series too): https://developer.arm.com/documentation/dui0801/l/A64-SIMD-Vector-Instructions/A64-SIMD-Vector-instructions-in-alphabetical-order
- amazing website as a whole: https://en.algorithmica.org/hpc/cpu-cache/associativity/
- A CPU has 2-3 levels of cache: L1 is the fastest and smallest, L2 is bigger and slower, L3 bigger and slower still.
- Cache access is extra fast, a few nanoseconds, while a RAM access takes ~100-200 ns.
- The cache holds memory that the CPU guessed you would need soon (the prefetcher) or that you used recently.
- The CPU looks for what it needs in L1, then L2, then L3, then RAM. If it has to go all the way to RAM, it fetches a whole cache line/cache block, which is more than you asked for, betting that if you needed this you will probably need nearby data too; that line then gets placed in one of the cache levels.
- Every time the CPU doesn't find something at a cache level, that's a cache miss; if it finds it, a cache hit. Misses stack per lookup, so if the thing is in none of L1-L3, a single search has already cost you three misses (the sketch below makes the cost visible).
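A quick, standalone way to feel this on your own machine (not part of the repo): sum the same matrix row by row and then column by column. The row-wise walk uses every byte of each fetched cache line and plays nicely with the prefetcher; the column-wise walk touches one float per line and misses constantly.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Toy demo: an 8192x8192 float matrix summed in row-major order (stride 1,
// cache friendly) vs column-major order (stride 8192 floats, roughly one
// cache line fetched per element). Compile with -O0 to keep the loops honest.
int main(void) {
    const size_t n = 8192;
    float *m = malloc(n * n * sizeof *m);
    for (size_t i = 0; i < n * n; i++) m[i] = 1.0f;

    struct timespec t0, t1;
    double sum = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) sum += m[i * n + j];   // along rows
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("row-major:    %.3f s (sum=%.0f)\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9, sum);

    sum = 0.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++) sum += m[i * n + j];   // down columns
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("column-major: %.3f s (sum=%.0f)\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9, sum);

    free(m);
    return 0;
}
```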