[perf] hashing function improvements #3242
base: main
Conversation
Have you confirmed that this performs better in all modern engines?
Here is a comparison. The benchmark code is available here and requires bun (for JSC) and gjs (for SpiderMonkey). |
It looks good, cc @emmatown |
```ts
if (input.length > bufferLength) {
  // bufferLength must be a multiple of 4 to satisfy Int32Array constraints
  bufferLength = input.length + (4 - input.length % 4)
```
This isn't allocating enough space for strings where utf8 length != utf16 length and will just ignore some input at the end?
Right, and there's no way to know the byte length until it has been iterated/encoded. I'll allocate 2x the space to ensure there's enough space for the worst case.
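One way to size the scratch buffer conservatively, sketched below with a hypothetical helper name (not code from this PR): UTF-8 never needs more than 3 bytes per UTF-16 code unit (a surrogate pair is 2 units for 4 bytes), so `3 * input.length` rounded up to a multiple of 4 is always enough for the `Int32Array` view.

```typescript
// Hypothetical helper: return a byte count that is always large enough for
// the UTF-8 encoding of `input` and is a multiple of 4, so an Int32Array
// can view the whole buffer.
function requiredBufferLength(input: string): number {
  // Worst case per UTF-16 code unit:
  // - units < 0x80  -> 1 byte
  // - units < 0x800 -> 2 bytes
  // - other BMP units -> 3 bytes
  // - surrogate pairs: 2 units -> 4 bytes (still <= 3 bytes per unit)
  const worstCase = input.length * 3;
  // Round up to the next multiple of 4.
  return (worstCase + 3) & ~3;
}
```
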
```ts
if (hasTextEncoder === false) {
  const bytes = []
  for (let i = 0; i < input.length; i++) {
    const codePoint = input.charCodeAt(i)
    if (codePoint > 0xff) {
      bytes.push(codePoint >>> 8)
      bytes.push(codePoint & 0xff)
    } else {
      bytes.push(codePoint)
    }
  }
  uint8View = new Uint8Array(bytes)
  int32View = new Int32Array(uint8View.buffer, 0, Math.floor(bytes.length / 4))

  return bytes.length
}
```
This needs to encode like TextEncoder does, since the hash needs to return the same results when, e.g., TextEncoder exists on a server-rendered page but not on the client.
Self notes:
- utf16 to unicode
- unicode to utf8
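The two steps above (UTF-16 code units → Unicode code points → UTF-8 bytes) can be sketched as a TextEncoder-compatible fallback. This is a sketch with a hypothetical name (`utf8Encode`), not the PR's actual code:

```typescript
// Sketch of a TextEncoder-compatible fallback: decode UTF-16 code units to
// Unicode code points, then emit UTF-8 bytes, so the hash input is identical
// whether or not TextEncoder is available.
function utf8Encode(input: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < input.length; i++) {
    let codePoint = input.charCodeAt(i);
    // Combine a surrogate pair into a single code point (utf16 -> unicode).
    if (codePoint >= 0xd800 && codePoint <= 0xdbff && i + 1 < input.length) {
      const low = input.charCodeAt(i + 1);
      if (low >= 0xdc00 && low <= 0xdfff) {
        codePoint = 0x10000 + ((codePoint - 0xd800) << 10) + (low - 0xdc00);
        i++;
      }
    }
    // Emit UTF-8 bytes (unicode -> utf8).
    if (codePoint < 0x80) {
      bytes.push(codePoint);
    } else if (codePoint < 0x800) {
      bytes.push(0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f));
    } else if (codePoint < 0x10000) {
      bytes.push(
        0xe0 | (codePoint >> 12),
        0x80 | ((codePoint >> 6) & 0x3f),
        0x80 | (codePoint & 0x3f)
      );
    } else {
      bytes.push(
        0xf0 | (codePoint >> 18),
        0x80 | ((codePoint >> 12) & 0x3f),
        0x80 | ((codePoint >> 6) & 0x3f),
        0x80 | (codePoint & 0x3f)
      );
    }
  }
  return bytes;
}
```
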
```ts
  return encoder!.encodeInto(input, uint8View).written as number;
}
```

```ts
export default function murmur2(input: string): string {
```
While I was exploring the data a bit more I realized that the hashing function is re-hashing the same strings a lot, and that we could speed it up by caching the results in a Map.
One note before the results, I was testing with a dataset made of a bunch of MUI components copy/pasted. For this benchmark I switched to a new dataset that is directly extracted from the MUI dashboard template, which I think is more realistic and reflects production setups even better.
Here are the results, which indicate a substantial improvement for that dataset when caching the murmur2 result. The implementation is very naive, and I found that trying to mutate/update the map with more complex rules actually decreased the gains by a lot, probably because adding bookkeeping on each hash iteration is too much work. For comparison I also added a case for the cached version, but with a dataset with absolutely no duplicates, which means the caching code is pure overhead.
Last note: for this particular dataset, unlike the last one, SpiderMonkey shows a small/moderate performance decrease from `original` to `improved`, not sure why. This new dataset seems to have somewhat shorter style strings, which might explain it. I still think `original` to `improved` is a gain, considering that Firefox accounts for less than 3% of the traffic right now.
So to sum this up, caching the hash result is a gain if we assume that we're going to see a lot of duplicates following each other, which seems to be the case for MUI usage. I would love to have your opinion on this change.
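The naive Map cache described above could look like the sketch below (hypothetical names, not the PR's actual code); deliberately no eviction or bookkeeping, since per-lookup maintenance was found to cost more than it saved:

```typescript
// Naive memoization sketch: cache hash results keyed by the input string.
// No eviction policy -- extra bookkeeping on each lookup erased the gains.
const hashCache = new Map<string, string>();

function cachedHash(input: string, hash: (s: string) => string): string {
  let result = hashCache.get(input);
  if (result === undefined) {
    result = hash(input);
    hashCache.set(input, result);
  }
  return result;
}
```
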
While I was working on #3241 I saw a few issues with your existing JavaScript implementation of murmur2; here are a few fixes. I've included a comparison with the WASM alternatives just for context, but the `murmur2_original` and `murmur2_improved` entries of this summary are the important ones. As you can see, we can more than double the performance with the changes in this PR.
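For context, the pure-JavaScript murmur2 variant widely used by CSS-in-JS libraries has roughly the shape below; this is a sketch of that general pattern, not the exact code being benchmarked here. It hashes the low byte of each UTF-16 code unit and returns a base-36 string:

```typescript
// Sketch of the classic JavaScript murmur2 (32-bit) string hash.
function murmur2(str: string): string {
  let h = 0;
  let k: number;
  let i = 0;
  let len = str.length;
  for (; len >= 4; ++i, len -= 4) {
    // Pack 4 bytes into one 32-bit word.
    k =
      (str.charCodeAt(i) & 0xff) |
      ((str.charCodeAt(++i) & 0xff) << 8) |
      ((str.charCodeAt(++i) & 0xff) << 16) |
      ((str.charCodeAt(++i) & 0xff) << 24);
    // Multiply by the magic constant 0x5bd1e995 in 16-bit halves so the
    // math stays within 32-bit integer precision.
    k = (k & 0xffff) * 0x5bd1e995 + (((k >>> 16) * 0xe995) << 16);
    k ^= k >>> 24;
    h =
      ((k & 0xffff) * 0x5bd1e995 + (((k >>> 16) * 0xe995) << 16)) ^
      ((h & 0xffff) * 0x5bd1e995 + (((h >>> 16) * 0xe995) << 16));
  }
  // Mix in the 1-3 trailing bytes.
  switch (len) {
    case 3:
      h ^= (str.charCodeAt(i + 2) & 0xff) << 16;
    // falls through
    case 2:
      h ^= (str.charCodeAt(i + 1) & 0xff) << 8;
    // falls through
    case 1:
      h ^= str.charCodeAt(i) & 0xff;
      h = (h & 0xffff) * 0x5bd1e995 + (((h >>> 16) * 0xe995) << 16);
  }
  // Final avalanche.
  h ^= h >>> 13;
  h = (h & 0xffff) * 0x5bd1e995 + (((h >>> 16) * 0xe995) << 16);
  return ((h ^ (h >>> 15)) >>> 0).toString(36);
}
```
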