optimized memcpy #18912
Conversation
Nice work! Could you share the benchmark script? I am really curious to see how those aligned loads/stores are emitted.
The benchmark script is just a pretty simple loop copying from one buffer to the other a bunch of times for each source alignment offset. I haven't done anything clever in the micro benchmark. The reason aligned vector moves are emitted is the use of […]. Here's the micro benchmark (doesn't look like I can colour it inside the fold):

```zig
pub fn main() !void {
    const allocator = std.heap.page_allocator;
    // […]
}
```
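For reference, here is a hypothetical reconstruction of the kind of loop described (the buffer sizes, iteration count, and output format are made up for illustration and are not the author's actual script):

```zig
const std = @import("std");

pub fn main() !void {
    const allocator = std.heap.page_allocator;
    const len = 64; // copy length under test
    const iterations = 100_000;

    // Page-allocated buffers start page-aligned, so `offset` controls the
    // source alignment (mod 32).
    const src = try allocator.alloc(u8, len + 32);
    defer allocator.free(src);
    const dest = try allocator.alloc(u8, len);
    defer allocator.free(dest);
    @memset(src, 0xaa);

    var offset: usize = 0;
    while (offset < 32) : (offset += 1) {
        var timer = try std.time.Timer.start();
        for (0..iterations) |_| {
            @memcpy(dest[0..len], src[offset..][0..len]);
            std.mem.doNotOptimizeAway(&dest[0]);
        }
        const ns = timer.read();
        std.debug.print("src offset {}: {} ns per copy\n", .{ offset, ns / iterations });
    }
}
```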
Interesting. I have attempted to reproduce this and I see 2, maybe 3 issues.
I recommend you use something like `dest[0..v].* = @as(*const @Vector(v, u8), @alignCast(@ptrCast(aligned_src[offset..][0..v]))).*;` to generate your aligned vector loads 😉
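To make the context of that one-liner explicit, a minimal self-contained version might look like the following (the function name and parameter types are mine; the `@alignCast` is what asserts, and safety-checks, that `aligned_src + offset` really is vector-aligned):

```zig
fn loadVectorAligned(
    comptime v: usize,
    dest: [*]u8,
    aligned_src: [*]const u8,
    offset: usize,
) void {
    // The caller must guarantee aligned_src + offset is aligned to
    // @alignOf(@Vector(v, u8)); @alignCast encodes (and safety-checks)
    // that promise so the backend can emit aligned vector loads.
    dest[0..v].* =
        @as(*const @Vector(v, u8), @alignCast(@ptrCast(aligned_src[offset..][0..v]))).*;
}
```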
I am not convinced this is the case. It is trivial to produce a longer sequence of aligned move operations by unrolling the loop (though this does necessitate extra cleanup code), and when I did this in the past it did not have much impact on performance. Unless I get benchmark results clearly showing a win that seems generalizable I wouldn't want to do it; forcing this sort of unrolling seems more likely to trip up the optimizer, and may only positively impact performance on the machine you run the benchmark on while degrading performance on other machines.
If you let me know precisely how you compiled/ran it I will try on my machine as well. Particular things of interest would be how many iterations you did, the copy length, target cpu features, and optimize mode.
Can you be more specific about what you did when you saw worse performance relative to master, so I can investigate? I have also used compiling the zig compiler itself as a test, and saw a minor improvement in compile time (though I'm not really sure it was outside measurement uncertainty).
Can you post your benchmark? I'm not sure what you mean by "...and also cannot replicate it" along with "Very comparable results to the benchmark script you provided." - those read as contradictory statements to me, so I must be misinterpreting something.
Looks like this somehow broke a bunch of non-x86_64 targets in the linux CI. I wonder if on those targets LLVM is making memcpy produce a call to itself...
I assume you mean […]
It looks like this is the issue, at least for wasm32-wasi. I checked a […]. If anyone knows of a way to trace the wasm code produced back to source lines, similar to […]
I suspect zig’s memcpy could get a whole lot faster than 30%-ish. I think it’s hard to rule out measurement noise without also having a complete benchmark snippet for others to run, and on larger amounts of data. There must already be comprehensive memcpy benchmarks online somewhere you could copy from.
For small sizes this is certainly true, but for large copies I'm a bit skeptical, at least not without using inline asm with something like […]
I've made a bunch of improvements yielding both better performance and smaller code size. The table in the top post has been updated. Edit: Not sure what the problem with riscv64-linux is in the CI - I've disassembled it and memcpy is not doing a recursive call.
This is wrong, recursive […]
The windows CI failure is […]
@dweiller I just ran into that in a different PR as well; seems to be a fluke.
The windows CI failure seems to be the compiler running out of memory - not sure why this would happen, it shouldn't be caused by this PR. |
A decent amount of the code could be simplified using comptime into something like:

```
// shared (dst, src)
copy(blk, offsets):
    for i in offsets:
        dst[i..][0..blk].* = src[i..][0..blk].*

memcpy(len):
    if len == 0: return
    if len < 4: return copy(1, {0, len/2, len-1})
    v = max(4, suggestVectorSize(u8) or 0)
    inline for n in ctz(4)..ctz(v):
        if len <= 1 << (n+1): // @expect(false)
            return copy(1<<n, {0, len - (1<<n)})
    for i in 0..(len / v): copy(v, .{ i * v }) // 4-way auto-vec
    copy(v, .{ len - v })
```

But I think (at a higher level) we should go all the way: Facebook's folly memcpy is heavily optimized but also serves as their […]. Interestingly, they never use aligned loads since it doesn't seem worth it. Copy-forward does both load/store unaligned, but copy-backward uses aligned stores with […]. The general strategy of having memcpy & memmove be the same thing would be nice (perf & maintenance wise). Fast […]
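For concreteness, a rough Zig sketch of the `copy` helper from that pseudocode could look like this (the helper names and the fixed-size-array parameter for the offsets are my assumptions, not something proposed above):

```zig
const std = @import("std");

// Copy `n` blocks of `blk` bytes, one block per runtime offset. Because
// `blk` and `n` are comptime-known, this unrolls into a fixed, branchless
// sequence of fixed-size copies.
inline fn copyBlocks(
    comptime blk: usize,
    comptime n: usize,
    dest: [*]u8,
    src: [*]const u8,
    offsets: [n]usize,
) void {
    inline for (0..n) |i| {
        dest[offsets[i]..][0..blk].* = src[offsets[i]..][0..blk].*;
    }
}

// Example: three (possibly overlapping) single-byte copies cover every
// length in [1, 4) without branching on the exact length.
fn copySmall(dest: [*]u8, src: [*]const u8, len: usize) void {
    std.debug.assert(len >= 1 and len < 4);
    copyBlocks(1, 3, dest, src, .{ 0, len / 2, len - 1 });
}
```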
This looks pretty nice, I'll try it out.
Our memcpy assumes that src/dst don't overlap, so a change like that would be a (non-breaking, but significant) change in semantics that would affect performance.

On my machine, IIRC, the aligned ops did affect performance, but this would also be something machine dependent. I have seen that the current wisdom seems to be that modern x86_64 doesn't really care (though I haven't seen when this started being the case), but what about other platforms? At least for a general algorithm, I would think we should use aligned ops, and if/when we want to individually optimise different platforms we could use unaligned ops. I did also try unrolling with prefetch, but didn't want to over-optimise for my machine - I can't recall how much difference it made for me.
I did some research into […]
I do actually have a memset branch locally as well - I could include it in this PR, but I haven't put as much work into it yet.
I rather mean to keep memcpy, the function with noalias and such, but have it just call memmove internally. This should keep the noalias optimizations applied by the compiler when it replaces memcpy, but keep the vector-based / branch-optimized version at runtime.
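A minimal sketch of that arrangement (signatures simplified relative to the real compiler-rt exports, and the memmove body reduced to a placeholder):

```zig
// Overlap-tolerant copy; in practice this would be the vector-based,
// branch-optimized routine from this PR. The forward byte loop is only a
// placeholder so the sketch compiles (it is not actually overlap-safe).
fn memmove(dest: ?[*]u8, src: ?[*]const u8, len: usize) ?[*]u8 {
    var i: usize = 0;
    while (i < len) : (i += 1) dest.?[i] = src.?[i];
    return dest;
}

// memcpy keeps its noalias parameters, so the compiler can still rely on
// the no-overlap guarantee at call sites, while the runtime code is
// shared with memmove.
fn memcpy(noalias dest: ?[*]u8, noalias src: ?[*]const u8, len: usize) ?[*]u8 {
    return memmove(dest, src, len);
}
```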
Was this in relation to the results above? I think the benchmark could report throughput in cycles/byte rather than ns. This is something used in benchmarks like […]. I only mention this as aligned loads didn't seem to have an effect on other benchmarks, so I'm trying to somehow discover/rationalize the difference in the results. TBF, I also haven't tested on anything outside avx2, avx512f, and apple_m1.
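A trivial illustration of that reporting change (the clock frequency here is an assumed constant; a careful harness would pin the core, fix the frequency, or read the TSC instead):

```zig
// Convert a measured ns-per-iteration into cycles/byte for an assumed,
// fixed core clock; 1 GHz corresponds to exactly 1 cycle per nanosecond.
// e.g. cyclesPerByte(10.0, 64.0, 4.2) is ~0.66 cycles/byte at 4.2 GHz.
fn cyclesPerByte(ns_per_iter: f64, bytes_per_iter: f64, assumed_ghz: f64) f64 {
    return (ns_per_iter * assumed_ghz) / bytes_per_iter;
}
```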
Indeed, it looks like it depends on the specific micro-architecture (being finicky on zen2+). It seemed to be the same speed as the normal vectorized loop, at least on a 5600x and 6900HS.
Yea a separate PR would be best. Just wanted to mention their similarity.
Ah, okay - I understand now.
I think the benchmark has changed since I tested (and the memcpy function has changed quite a bit too) - I'll re-check. Unless I'm missing something, I would lean towards keeping the aligned strategy until/unless we diverge the implementation based on whether a target has slow unaligned access or not. I expect the impact on systems that don't care to be minimal (it costs one branch and one extra vector copy, plus the effects of the increased code size, but only on large copies, which are rare), whereas systems that do have slower unaligned accesses will pay that cost in proportion to the size of a long copy. I can have the benchmark report cycles/byte, and do things like core-pinning and disabling frequency scaling next time I work on improving the benchmarking, but I think adding other benchmarks will probably take priority.
One reason could be the architecture - looking at Agner Fog's tables, my machine (Zen 1) has worse latencies/throughputs on unaligned instructions (at least for integer ones; I can't remember at the moment whether integer or float instructions were generated).
It looks like this is not ready for merging then? Can you close it and open a new one when it's ready for review & merging? Otherwise we can use the issue tracker to discuss ideas, plans, strategies, etc.
I've just done some cleanup of commits and assuming it passes CI, I'm happy for it to be merged. The unchecked boxes are all things that are potential future work and could be broken out into new issues. I would say the real question is what level of effort you want to see put into benchmarking before merging. I can certainly do more on my own - write more synthetic benchmarks, do some proper statistics, check some real-world impacts more carefully (I did see a small benefit in the zig compiler using the c backend, but that was a while ago and I don't remember the details) - but I think any serious benchmarking effort would require help from others (in particular to check on more machines/architectures) and may be better done as part of gotta-go-fast. |
Looking at the table above again - would it be better to not include the optimisations done in this PR on […]?
Before this commit, when compiling for target riscv64-linux-none with `-mcpu baseline`, recursive jalr instructions were generated that caused infinite looping and a stack overflow for copy lengths between 16 and 31 (inclusive). This was caused by LLVM optimisations - in particular, LLVM has an optimisation that can turn copies of a sufficient length into a call to memcpy. This change breaks up some copies in order to try and prevent this optimisation from occurring.
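An illustrative sketch of the failure mode and the general shape of the workaround (not the actual code from this commit): a variable-length byte loop is the classic idiom LLVM's loop-idiom recognition can replace with a call to memcpy, which is self-recursive inside memcpy itself, whereas small copies of comptime-known size (like the 16-byte blocks below) lower to plain loads/stores.

```zig
// The kind of copy LLVM may raise back into a memcpy call:
fn tailNaive(dest: [*]u8, src: [*]const u8, len: usize) void {
    var i: usize = 0;
    while (i < len) : (i += 1) dest[i] = src[i];
}

// Covering 16 <= len <= 32 with two fixed-size, possibly overlapping
// 16-byte copies avoids any length-dependent loop entirely:
fn tailFixed(dest: [*]u8, src: [*]const u8, len: usize) void {
    dest[0..16].* = src[0..16].*;
    dest[len - 16 ..][0..16].* = src[len - 16 ..][0..16].*;
}
```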
On x86_64+avx2 this replaces instructions (`add`, `cmp`, `ja`) with (`dec`, `jne`) in the hot loops.
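One common way to get that kind of codegen (not necessarily how this commit does it) is to drive the hot loop with a counter that runs down to zero, so the decrement itself provides the flags for the branch:

```zig
// Hypothetical loop shape: iterate on a countdown of whole 16-byte blocks
// instead of comparing an index against an end bound. Whether the backend
// actually emits dec/jne for this exact code depends on codegen.
fn copyBlocksCountdown(dest: [*]u8, src: [*]const u8, block_count: usize) void {
    var remaining = block_count;
    var i: usize = 0;
    while (remaining != 0) : ({
        remaining -= 1;
        i += 16;
    }) {
        dest[i..][0..16].* = src[i..][0..16].*;
    }
}
```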
This reverts commit cb4c1d5.
This change does branchless copies of lengths [a, 4 * a) rather than [a, 2 * a] in order to reduce the number of branches for small lengths.
Folly seems to do this in order to not break legacy code that (erroneously) depends on […]. Does Zig's compiler-rt implementation of […]?
There are two […]
This PR does a few things to improve `memcpy` performance:

- […] `std.simd.suggestVectorLength(u8)` […]
- […] `align(1)` vector […]

It is also possible to use aligned vector loads and stores in the misaligned case, but I have not found a nice way to make the compiler do it (for x86_64), and would rather not do it with inline assembly. The simplest way to do it (using `@shuffle`) does improve performance further, however it causes massive bloat in the `memcpy` function (producing >12x more code than without it on my avx2 machine) due to the need for a comptime mask.

Here is a graph of the performance of master (commit f845fa0) and 8688b0a in a microbenchmark. The benchmark times `@memcpy` for a specific length and alignment (mod 32) of source and destination 100 000 times in a loop; this is done for all 256 combinations of source and destination alignment, and the average time per iteration across all alignment combinations is the reported result.

Note that this benchmark is not going to be particularly realistic for most circumstances as branch predictors will be perfectly trained.

In general we want to focus on optimizing small length copies as small lengths are the most frequent—some reported distributions can be found here and in this google paper (LLVM uses these for benchmarking). For small lengths, the above graph indicates between a modest and significant (up to around 3x) improvement, depending on the length.

Here is a graph showing performance across the distributions from the paper linked above (taken from the LLVM repo).

For code size of memcpy: […]

I have only checked the performance of this change on my (x86_64, avx2, znver1) machine—it would be good to check on other architectures as well if someone with one would like to test it. The benchmarks used can be found here—the `average` and `distrib` ones were used to produce the above charts (check out the `tools` directory for data generation scripts).

A few other notes:

- `ReleaseSmall` performance […]
- `rep movsb` for copies over several hundred bytes (I don't have a suitable machine to test this)
- `rep movsb` for all/most cases (I don't have a suitable machine to test this)

Other stuff that can be done (either before merging or as follow up issues):

- see if there is a reasonable way to do aligned vector moves for the misaligned case (doubtful without inline assembly) - pretty sure this can't be done without `@select` taking a runtime mask or using inline assembly
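As a very rough sketch of the kind of main loop the description above implies (the block-size fallback, function name, and branchless tail here are assumptions on my part, not necessarily what the PR's code actually does):

```zig
const std = @import("std");

// Block type chosen from the target's suggested vector length; the
// fallback of 16 is an arbitrary assumption for targets without SIMD.
const block_len = std.simd.suggestVectorLength(u8) orelse 16;
const Block = @Vector(block_len, u8);

// Forward copy for len >= block_len using align(1) vector loads/stores,
// so any source/destination alignment is tolerated, plus one final
// (possibly overlapping) block to cover the tail without a scalar loop.
fn copyForward(dest: [*]u8, src: [*]const u8, len: usize) void {
    std.debug.assert(len >= block_len);
    var i: usize = 0;
    while (i + block_len <= len) : (i += block_len) {
        const s: *align(1) const Block = @ptrCast(src[i..][0..block_len]);
        const d: *align(1) Block = @ptrCast(dest[i..][0..block_len]);
        d.* = s.*;
    }
    const s: *align(1) const Block = @ptrCast(src[len - block_len ..][0..block_len]);
    const d: *align(1) Block = @ptrCast(dest[len - block_len ..][0..block_len]);
    d.* = s.*;
}
```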