Optimise OpenCL kernels #7

hjabird · 2019-05-23T10:02:37Z

Surely the code can be faster?

hjabird · 2019-05-23T12:42:56Z

OpenCL particle induced velocity:

See commit ae683f3

Approximately 7.5% faster:

Move 1/4pi term to host
Precompute 1 / regularisation_radius
Winckelmans kernel: g = ... / powr(rho, 2.5) -> ... * rsqrt(pown(rho, 5))
Neaten with radd = length(rad)
Remove the if(all(isequal(mespnt, particlepnt))) check and instead zero the result when it is not a normal number: isnormal(..) ? ret : (zero vector)
Use knowledge that get_global_id(0) == get_local_id(0)

Not used:

Using fast_length instead of length is ~7% faster, but gives far lower precision.

Verified good:

Workgroup size seems good.
Reduction method seems good. I tried a scheme where the reduction was performed in registers rather than in the local data store. This slowed the GPU down considerably.
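
For context, a reduction in the local data store is often written as a tree reduction in which the number of active work-items halves at each step. The sketch below emulates that pattern serially in plain C. It is an assumption that the kernel uses this particular scheme, and WG_SIZE and tree_reduce are hypothetical names.

```c
#include <assert.h>

#define WG_SIZE 8 /* hypothetical work-group size; must be a power of two */

/* Sums scratch[0..WG_SIZE) in place and returns the total.
   The inner loop stands in for the work-items of one work-group:
   at each stride, work-item lid adds its partner element lid + stride. */
float tree_reduce(float scratch[WG_SIZE])
{
    assert((WG_SIZE & (WG_SIZE - 1)) == 0);
    for (int stride = WG_SIZE / 2; stride > 0; stride /= 2) {
        for (int lid = 0; lid < stride; ++lid) {
            scratch[lid] += scratch[lid + stride];
        }
        /* In a kernel, barrier(CLK_LOCAL_MEM_FENCE) would go here. */
    }
    return scratch[0];
}
```

In a real kernel the scratch array would live in __local memory and each work-item would execute one iteration of the inner loop.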
Host memory prep only uses a few % of runtime.

To do:

NB:

Benchmarking was performed using a Julia script. The times are for ind_vel only. The Julia script adds overhead, which dominates the low-end results and adds a couple of seconds at around a million particles.

hjabird · 2019-05-23T21:10:11Z

ind_dvort is approximately 27% faster.

Move 1/(4 pi reg_rad^3) to host
Precompute 1 / reg_rad
Winckelmans kernel: same rsqrt(pown(..)) trick as above for both f and g terms.
Same removal of if with ret = isnormal(ret) ? ret : (float3)(0.f, 0.f, 0.f) as above

As above, the Julia script used for benchmarking adds some overhead: 3% in the worst case.

hjabird · 2019-05-23T21:35:00Z

ind_dvort is 1.5% faster

hjabird · 2019-05-24T21:56:19Z

Changing to the transpose method for the dvort calculations allows a cross product to be reused. 7% faster.

hjabird added the enhancement label May 23, 2019

hjabird self-assigned this May 23, 2019
