Vectorization of find_next_host_delimiter and find_next_host_delimiter_special #548
Seems like a simple lookup table is faster than the current
./benchmarks/benchdata --benchmark_filter=BasicBench_AdaURL_aggregator_href --benchmark_color=true --benchmark_repetitions=30
Simplified:
// : / [ \\ ?
static constexpr std::array<bool, 128> special_host_delimiters{
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, true, false, false, false, false, false, false, false,
false, false, false, true, false, false, false, false, true, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, true, true, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false,
};
// : / [ ?
static constexpr std::array<bool, 128> host_delimiters{
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, true, false, false, false, false, false, false, false,
false, false, false, true, false, false, false, false, true, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, true, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false, false, false, false, false,
false, false, false, false, false, false, false,
};
// starting at index location, this finds the next location of a character
// :, /, \\, ? or [. If none is found, view.size() is returned.
// For use within get_host_delimiter_location.
// ['0x3a', '0x2f', '0x5c', '0x3f', '0x5b']
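ada_really_inline size_t find_next_host_delimiter_special(
    std::string_view view, size_t location) noexcept {
  auto const str = view.substr(location);
  for (auto pos = str.begin(); pos != str.end(); ++pos) {
    // The table has 128 entries: index with an unsigned value and skip
    // non-ASCII bytes (char may be signed, and bytes >= 0x80 are out of range).
    auto const c = static_cast<unsigned char>(*pos);
    if (c < 128 && special_host_delimiters[c]) {
      // pos is relative to str, so add location to get an index into view.
      return size_t(pos - str.begin()) + location;
    }
  }
  return size_t(view.size());
}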
// starting at index location, this finds the next location of a character
// :, /, ? or [. If none is found, view.size() is returned.
// For use within get_host_delimiter_location.
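ada_really_inline size_t find_next_host_delimiter(std::string_view view,
                                                  size_t location) noexcept {
  auto const str = view.substr(location);
  for (auto pos = str.begin(); pos != str.end(); ++pos) {
    // The table has 128 entries: index with an unsigned value and skip
    // non-ASCII bytes (char may be signed, and bytes >= 0x80 are out of range).
    auto const c = static_cast<unsigned char>(*pos);
    if (c < 128 && host_delimiters[c]) {
      // pos is relative to str, so add location to get an index into view.
      return size_t(pos - str.begin()) + location;
    }
  }
  return size_t(view.size());
}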
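As an aside, the hand-written tables above could also be generated at compile time. The following is only a sketch under stated assumptions, not code from this PR: make_delimiter_table, the *_sketch tables, and find_next_host_delimiter_sketch are hypothetical names, and the table is widened to 256 entries so that any byte value is a valid index. It assumes C++17.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of ada): builds a 256-entry lookup table at
// compile time, with true at the byte value of each listed delimiter.
constexpr std::array<bool, 256> make_delimiter_table(
    std::string_view delimiters) {
  std::array<bool, 256> table{};  // all entries start as false
  for (char c : delimiters) {
    table[static_cast<unsigned char>(c)] = true;
  }
  return table;
}

// : / [ \ ?
static constexpr auto special_host_delimiters_sketch =
    make_delimiter_table(":/[\\?");
// : / [ ?
static constexpr auto host_delimiters_sketch = make_delimiter_table(":/[?");

// Character-by-character scan using the generated table. Returns an index
// into view, or view.size() if no delimiter is found at or after location.
inline size_t find_next_host_delimiter_sketch(std::string_view view,
                                              size_t location) noexcept {
  for (size_t i = location; i < view.size(); i++) {
    if (host_delimiters_sketch[static_cast<unsigned char>(view[i])]) {
      return i;
    }
  }
  return view.size();
}
```

With 256 entries, the unsigned char cast alone is enough to keep the index in range, so no separate check for non-ASCII bytes is needed.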
One of the tests is failing.
#if ADA_NEON
ada_really_inline size_t find_next_host_delimiter_special(
    std::string_view view, size_t location) noexcept {
  // first check for short strings in which case we do it naively.
This if statement seems like repetitive code
Can you elaborate? We handle the short string (< 16 characters) naively. What is repetitive?
src/helpers.cpp
@@ -197,9 +197,132 @@ ada_really_inline uint64_t swap_bytes_if_big_endian(uint64_t val) noexcept {
#endif
}

ada_really_inline int trailing_zeroes(uint32_t input_num) {
Can you add a comment on top of this function? If we don't want to expose this, we should add private to comments.
I did. In my view, an inline function defined in a source file and used only locally does not have to be declared in the header file (and indeed, doing so just adds noise, doesn't it?).
(To be clearer: I did add a comment in a later commit.)
Let me know if you want the simplified version as a separate PR; also, I haven't tested the
@the-moisrex Please see #549
Co-authored-by: Yagiz Nizipli <[email protected]>
…om:ada-url/ada into vectorize_find_next_host_delimiter_special
@the-moisrex I have updated this PR. We fall back on your proposal when SIMD is not available. I added credit to you in the comments.
By bitset, I did not mean
The approach in
Yep. The percent-encoding work can be accelerated. We should vectorize it, at least on some systems. We have a PR on this front.
We manually vectorize, for SSE2 and NEON, two functions that relied on SWAR: find_next_host_delimiter and find_next_host_delimiter_special. To assess the performance effect, we use benchdata and the BasicBench_AdaURL_aggregator_href measure.
Apple M2, LLVM 14
Before: speed=484.976M/s
After: speed=496.554M/s
GCC 12, Intel Ice Lake
Before: speed=375.019M/s
After: speed=401.595M/s
Simpler alternative
If we remove optimizations and switch back to the following (which compiles to a bitset lookup, character-by-character), our speed goes down slightly to 372.901M/s (GCC 12, Intel Ice Lake).
This might compile to...
... which is likely highly competitive with a character-by-character table lookup (because of the latency of a table lookup).
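For readers unfamiliar with the term, a bitset lookup can also be written by hand. The sketch below is an assumption-laden illustration, not the code elided above and not actual compiler output: bit_for, low_mask_sketch, high_mask_special_sketch, and is_special_host_delimiter_bitset are hypothetical names. It packs the five special delimiters into two 64-bit masks, so membership is a shift and a mask rather than a memory load.

```cpp
#include <cstdint>

// Hypothetical helper: the bit representing byte value c in a 64-bit mask
// that covers the range [base, base + 64).
constexpr uint64_t bit_for(char c, unsigned base) {
  return uint64_t(1) << (static_cast<unsigned char>(c) - base);
}

// ':' (0x3a), '/' (0x2f) and '?' (0x3f) fall in [0, 64).
constexpr uint64_t low_mask_sketch =
    bit_for(':', 0) | bit_for('/', 0) | bit_for('?', 0);
// '[' (0x5b) and '\' (0x5c) fall in [64, 128).
constexpr uint64_t high_mask_special_sketch =
    bit_for('[', 64) | bit_for('\\', 64);

// True if ch is one of : / ? [ or \; non-ASCII bytes are never delimiters.
constexpr bool is_special_host_delimiter_bitset(char ch) {
  const auto c = static_cast<unsigned char>(ch);
  if (c < 64) return ((low_mask_sketch >> c) & 1) != 0;
  if (c < 128) return ((high_mask_special_sketch >> (c - 64)) & 1) != 0;
  return false;
}
```

Because the masks are compile-time constants, the test needs no table in memory, which is the property the latency remark above refers to.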
Fixes #547
cc @the-moisrex