very slow on apple silicon? #279
To get high performance out of FFTW, you need to create a plan first and then re-use it (otherwise, you are getting a lot of overhead by re-creating the plan every time), ideally with a pre-allocated output array. Note also that FFTW shares threads with Julia, so you generally need to start Julia with enough threads. (Unfortunately, the current FFTW_jll build is missing the cycle counter on Apple silicon, which disables everything but the default planner.)
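A minimal sketch of the plan-and-reuse pattern described above (the array size and the `FFTW.MEASURE` flag here are illustrative choices, not from the thread; on Apple silicon the non-default planner flags may be ineffective per the note above):

```julia
using FFTW
using LinearAlgebra: mul!
using BenchmarkTools

# Let FFTW use all of Julia's threads (start Julia with e.g. `julia -t 8`).
FFTW.set_num_threads(Threads.nthreads())

x = rand(ComplexF64, 256, 256)

# Plan once — this can be expensive, especially with MEASURE/PATIENT —
# then reuse the plan for every transform of the same size and type.
p = plan_fft(x; flags=FFTW.MEASURE)

# Pre-allocate the output and execute the planned FFT without allocating.
y = similar(x)
mul!(y, p, x)

# Benchmark the transform itself, excluding planning overhead.
@btime mul!($y, $p, $x)
```

Calling `fft(x)` in a loop instead would re-derive a plan (and allocate) on every call, which is the overhead being described.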
Thanks for this — planning with Threads.nthreads() threads is a vast improvement. A locally compiled FFTW (https://github.com/andrej5elin/howto_fftw_apple_silicon) seems to manage 350µs without OpenMP on 4 threads with PATIENT, and there are more gains with OpenMP (for single precision, 210µs drops to 160µs on 4 threads with PATIENT). Any scope for building with OpenMP using Apple's Clang? Looking forward to the release!
Apple Clang doesn't come with OpenMP; the only thing one could do is link an external OpenMP runtime, like LLVM's.
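For reference, the usual workaround with Apple Clang is to pass the OpenMP flag through the preprocessor and link LLVM's libomp explicitly. This is a sketch assuming a Homebrew-installed libomp and a hypothetical source file `omp_test.c`:

```shell
# Install LLVM's OpenMP runtime (Apple Clang ships without one).
brew install libomp

# Apple Clang rejects a bare -fopenmp, so route it via -Xpreprocessor
# and point the compiler/linker at libomp's headers and library.
clang -Xpreprocessor -fopenmp \
      -I"$(brew --prefix libomp)/include" \
      -L"$(brew --prefix libomp)/lib" -lomp \
      omp_test.c -o omp_test
```

Whether FFTW_jll could adopt something like this in its build recipe is a separate question.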
Yes. |
Is there a way to inject that into
The build recipe of fftw is at https://github.com/JuliaPackaging/Yggdrasil/blob/42d73ea1c9e39c6f63bdfe065caad498257d0c6a/F/FFTW/build_tarballs.jl. At the moment OpenMP isn't used anywhere as far as I understand, I guess that's a question for @stevengj. |
Apologies: I realise now that my earlier benchmarks must have been in low-power mode on the laptop. After a charge the times are a bit more comparable, but I notice that even in-place planning gains almost nothing on M1, while it has significant gains on Intel even without MKL. The slowness compared to https://github.com/andrej5elin/howto_fftw_apple_silicon has not gone away, but the gap has closed: 446.69µs on 4 threads (without OpenMP) vs FFTW.jl below running at 699µs for 8 threads.
2021 M1 Max
2019 Intel 8-core i9 (no MKL)
2019 Intel 8-core i9 (with MKL)
In that post they are using FFTW's
Not related to this issue, but just as an fyi - Apple Silicon has been added to the CI now. |
I could have sworn this used to be much faster:
Compare with FFTW installed via Python (SciPy) here: https://github.com/andrej5elin/howto_fftw_apple_silicon, where 4 threads takes about 500µs for double precision, on slightly weaker hardware.
Rosetta with MKL is also significantly (>10x) faster than FFTW.jl according to those benchmarks. Am I missing something?