M4 Pro Performance #1598
The function transforms documentation includes a simple benchmark comparing a naive loop against a vectorized `mx.vmap` version, and gives example timings from an M1 Max system:

```python
import timeit

import mlx.core as mx

xs = mx.random.uniform(shape=(4096, 100))
ys = mx.random.uniform(shape=(100, 4096))

def naive_add(xs, ys):
    return [xs[i] + ys[:, i] for i in range(xs.shape[0])]

vmap_add = mx.vmap(lambda x, y: x + y, in_axes=(0, 1))

print(timeit.timeit(lambda: mx.eval(naive_add(xs, ys)), number=100))
print(timeit.timeit(lambda: mx.eval(vmap_add(xs, ys)), number=100))
```

M1 Max results from the docs:

Naive: 0.390s

My results on an M4 Pro (Mac Mini, 48GB):

Naive: 4.450s

The memory bandwidth of an M1 Max (400GB/s) exceeds that of the M4 Pro (273GB/s), which may explain the overall slower results. However, I would like to understand why `naive_add` is more than 10x slower on the more modern chip. Any insights would be appreciated!
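For what it's worth, a back-of-envelope estimate (assuming float32 arrays and counting one read of each input plus one write of the output, which is a simplification) suggests the naive path is nowhere near bandwidth bound on either chip:

```python
# Rough effective-bandwidth estimate for the benchmark above.
# Assumes float32 (4 bytes) and counts xs read + ys read + output write;
# shapes match the example: xs is (4096, 100), ys is (100, 4096).
N, M = 4096, 100
bytes_per_call = 3 * N * M * 4  # ~4.9 MB moved per call

def effective_gb_per_s(total_seconds, number=100):
    """Convert a timeit total over `number` calls into GB/s."""
    return bytes_per_call / (total_seconds / number) / 1e9

print(effective_gb_per_s(0.390))  # naive, M1 Max (docs)  -> ~1.26 GB/s
print(effective_gb_per_s(4.450))  # naive, M4 Pro (above) -> ~0.11 GB/s
```

Both figures are orders of magnitude below the chips' theoretical bandwidth (400 and 273 GB/s), so the naive loop's cost is presumably dominated by something other than memory traffic.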
Replies: 1 comment
The timings in the documentation are outdated. There was a bug in that example (which was fixed), but we never updated the timings. I just ran it on my M1 Max and the results are:

So the M4 Pro is faster than the M1 Max for the naive version (which is overhead bound, so that makes sense) but slower for the vectorized version, which should be memory-bandwidth bound. We should update the timings in the docs.
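To put "overhead bound" in concrete terms: the naive version dispatches 4096 tiny element-wise adds per call, so the M4 Pro timing reported above works out to roughly 11 µs per op. That is a rough calculation rather than a profile, but it is in the range where per-operation dispatch cost, not arithmetic or memory traffic, dominates:

```python
# The naive loop launches one small add per row: 4096 ops per call.
# 4.450 s is the reported M4 Pro total over 100 timeit calls.
total_seconds, number, ops_per_call = 4.450, 100, 4096
per_op_us = total_seconds / number / ops_per_call * 1e6
print(f"{per_op_us:.1f} us per tiny add")  # -> ~10.9 us
```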