Add volk_64f_x2_dot_prod_64f #627
Signed-off-by: Gonzalo J. Carracedo Carballal <[email protected]>
cc36044 to 3c4e7e1
@BatchDrake thanks for this PR! I'll look into it.
Again, thanks for your PR. I hope I can add some hints.
#ifdef LV_HAVE_GENERIC

static inline void volk_64f_x2_dot_prod_64f_a_generic(double* result,
I suggest removing this function. Some old kernels have an aligned generic version, but the generic kernel should not rely on any alignment. Also, this kernel yields wildly differing results compared to the "unaligned" one.
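To illustrate the point about alignment, here is a minimal sketch of a generic dot product that makes no alignment assumption. The function name `dot_prod_64f_generic_sketch` and the parameter names are hypothetical and chosen only for illustration; this is not the kernel from this PR.

```c
/* Hypothetical name and signature, for illustration only. */
static inline void dot_prod_64f_generic_sketch(double* result,
                                               const double* input,
                                               const double* taps,
                                               unsigned int num_points)
{
    /* Plain scalar accumulation: valid for any pointer alignment
     * and any num_points, including 0. */
    double acc = 0.0;
    for (unsigned int i = 0; i < num_points; i++) {
        acc += input[i] * taps[i];
    }
    *result = acc;
}
```

Because it only uses scalar loads, the same function works when called with offset pointers such as `input + 1`, which is the property a generic kernel needs.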
for (; number < eighthPoints; number++) {

    a0Val = _mm_load_pd(aPtr);
    a1Val = _mm_load_pd(aPtr + 2);
    a2Val = _mm_load_pd(aPtr + 4);
    a3Val = _mm_load_pd(aPtr + 6);
This might actually be a source of the slow results. Compilers are incredibly smart nowadays, and this kind of manual "loop unrolling" might actually block some compiler optimizations.
You might want to start with godbolt.com and inspect the results for the generic kernel when you compile for a specific SIMD extension. I'm aware that it might not be trivial to find the optimized assembly code in the output. Still, it is a possible starting point.
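As a possible starting point for that kind of inspection, here is a small sketch of a plain loop that could be pasted into a compiler explorer. The function name `dot_prod_64f_probe` is hypothetical. It is deliberately not `static inline`, since an unused `static inline` function usually produces no assembly output at all. The flags in the comment are only examples of common GCC/Clang options, not the flags VOLK builds with.

```c
/* Hypothetical probe function, for illustration only.
 * Example flags to try: -O3 -mavx2 -mfma
 * Adding -ffast-math lets the compiler reorder the additions, which is
 * typically what it needs before it will vectorize a floating-point
 * reduction like this one. Reordering can change the rounded result. */
double dot_prod_64f_probe(const double* restrict input,
                          const double* restrict taps,
                          unsigned int num_points)
{
    double acc = 0.0;
    for (unsigned int i = 0; i < num_points; i++) {
        acc += input[i] * taps[i];
    }
    return acc;
}
```

Comparing the assembly for this loop against the hand-unrolled intrinsics version may show whether the manual unrolling actually buys anything on a given target.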
This is basically the 64-bit version of volk_32f_x2_dot_prod_32f. Since this is my first PR to Volk and I will probably be writing a few more kernels for batched 64-bit 3D plane/rect intersections at some point, all stylistic/performance feedback is more than welcome.

This is the result of the test:
P.S.: While I expected some improvement (33% increase), I see that most of them perform worse than the generic kernel. I don't know whether it makes sense to keep the worst ones.