Replace 8*PEXTRW with 1*MOVDQU in f32_to_s16 #123

WolfWings · 2024-07-25T03:40:33Z

The existing code stores its result with a series of 8 sequential, unrolled PEXTRW instructions, a pattern that compilers generally cannot detect and collapse into a single MOVDQU instruction.

As such, manually placing the unaligned store intrinsic there is an enormous performance win for the SSE path, with identical output.
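To illustrate the equivalence, here is a minimal sketch, not the actual `f32_to_s16` code. The helper names (`store_s16x8_extract`, `store_s16x8_storeu`, `stores_match`) and the float-to-int16 conversion without any scaling are assumptions made for the example. Only the intrinsics themselves are real SSE2 calls.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the old pattern: eight PEXTRW-style extracts,
 * each followed by a 16-bit scalar store. */
static void store_s16x8_extract(int16_t *dst, __m128i v)
{
    dst[0] = (int16_t)_mm_extract_epi16(v, 0);
    dst[1] = (int16_t)_mm_extract_epi16(v, 1);
    dst[2] = (int16_t)_mm_extract_epi16(v, 2);
    dst[3] = (int16_t)_mm_extract_epi16(v, 3);
    dst[4] = (int16_t)_mm_extract_epi16(v, 4);
    dst[5] = (int16_t)_mm_extract_epi16(v, 5);
    dst[6] = (int16_t)_mm_extract_epi16(v, 6);
    dst[7] = (int16_t)_mm_extract_epi16(v, 7);
}

/* The replacement: one unaligned 128-bit store (MOVDQU). */
static void store_s16x8_storeu(int16_t *dst, __m128i v)
{
    _mm_storeu_si128((__m128i *)dst, v);
}

/* Returns 1 if both variants write identical bytes for 8 input floats.
 * The conversion here (no scaling, saturating pack) is an assumption. */
static int stores_match(const float *src)
{
    __m128i lo = _mm_cvtps_epi32(_mm_loadu_ps(src));      /* floats 0..3 */
    __m128i hi = _mm_cvtps_epi32(_mm_loadu_ps(src + 4));  /* floats 4..7 */
    __m128i packed = _mm_packs_epi32(lo, hi);             /* saturate to int16 */

    int16_t a[8], b[8];
    store_s16x8_extract(a, packed);
    store_s16x8_storeu(b, packed);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling `stores_match` on any eight floats should return 1, since both helpers write the same eight 16-bit lanes in the same order. The only difference is the number of instructions issued.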


WolfWings · 2024-07-25T03:46:12Z

Some recent versions of Clang can identify this construct, but older ones cannot, and no version of GCC I was able to test performs this optimization.

I chose storeu because loadu is already used elsewhere in place of juggling alignment issues. Accepting the occasional extra cycle of latency to match the existing code design seemed appropriate, and it also keeps the code change minimal.
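For reference, a small sketch of the loadu/storeu pairing on a destination that is deliberately not 16-byte aligned. The helper names (`copy_s16x8_unaligned`, `unaligned_copy_ok`) are hypothetical and not taken from the patched code. The aligned `_mm_store_si128` would require a 16-byte-aligned address here, while `_mm_storeu_si128` does not.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: move 8 int16 samples with the unaligned
 * load/store pair (MOVDQU in both directions). */
static void copy_s16x8_unaligned(int16_t *dst, const int16_t *src)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
}

/* Returns 1 if a copy to a misaligned destination round-trips and
 * leaves the neighbouring elements untouched. */
static int unaligned_copy_ok(void)
{
    const int16_t src[8] = {1, -2, 3, -4, 5, -6, 7, -8};
    _Alignas(16) int16_t buf[16];
    memset(buf, 0, sizeof buf);

    /* buf is 16-byte aligned, so buf + 1 is guaranteed to be misaligned. */
    int16_t *dst = buf + 1;
    copy_s16x8_unaligned(dst, src);

    return memcmp(dst, src, sizeof src) == 0 && buf[0] == 0 && buf[9] == 0;
}
```

Using the unaligned form everywhere means callers do not have to guarantee buffer alignment, which is the trade-off described above.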
