Optimize scalar multiplication with a 4-bit window #61
base: main
Conversation
res = res.double();
}
// Add the bottom-nibble from this byte into the result
res += arr[usize::from(byte & 0b1111)];
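For context, the hunk above is the tail of the windowed loop. Here is a minimal sketch of the overall loop shape, not the PR's exact code: `scalar` is a hypothetical name for the scalar operand, `arr` is assumed to hold the 16 multiples `0*base ..= 15*base`, and `to_repr` is assumed to produce little-endian bytes (hence the reversed iteration).

```rust
let mut res = $name::identity();
let repr = scalar.to_repr();
// Walk the scalar bytes from most significant to least significant.
for &byte in repr.as_ref().iter().rev() {
    // High nibble: shift the accumulator left by four bits, then add the table entry.
    for _ in 0..4 {
        res = res.double();
    }
    res += arr[usize::from(byte >> 4)];
    // Low nibble: the same again with the bottom four bits (the lines shown in the hunk above).
    for _ in 0..4 {
        res = res.double();
    }
    res += arr[usize::from(byte & 0b1111)];
}
```

(The real code can skip the four doublings before the very first nibble, since the accumulator is still the identity at that point.)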
It seems the previous impl targets constant-time multiplication, but here there is an addition of the point at infinity whenever a nibble of byte is zero, which breaks that property.
This should be constant time if the underlying functions are constant time. While it's a bit bold to assume to_repr is constant time, the prior impl also used it. If adding the identity isn't constant time, that's a fault in the addition operation, not this algorithm.
The prior algorithm performed addition by identity for every zero bit, so that specific behavior isn't unique to this algorithm.
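For reference, the prior per-bit loop was shaped roughly like this. This is a reconstruction around the conditional_select line visible in the second hunk below, not a quote of the old code; `scalar_bits_msb_first` stands for a hypothetical iterator yielding the scalar's bits as subtle::Choice values.

```rust
let mut acc = $name::identity();
// Walk the scalar bits from most significant to least significant.
for bit in scalar_bits_msb_first {
    acc = acc.double();
    // `acc + self` is computed unconditionally; while the leading bits are zero
    // this is an addition involving the identity, and for a zero bit the sum is
    // simply discarded by the constant-time select.
    acc = $name::conditional_select(&acc, &(acc + self), bit);
}
```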
It appears addition isn't constant time. I believe that deserves its own issue. Here, I solely note the prior algorithm also performed addition with identity, so that isn't a new issue.
The prior algorithm performed addition by identity for every zero bit, so that specific behavior isn't unique to this algorithm.
Ah, sorry, my bad for assuming the prior impl was constant time.
acc = $name::conditional_select(&acc, &(acc + self), bit);
// Create a table out of the point
// TODO: This can be made more efficient with a 5-bit window
let mut arr = [$name::identity(); 16];
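One plausible way to continue from the declaration above and fill the table so that `arr[i] == i * base` (a sketch only; `base` stands for the point being multiplied, and this is not necessarily the PR's exact construction):

```rust
arr[1] = base;
for i in 2..16 {
    arr[i] = if i % 2 == 0 {
        // Even multiples come from doubling an earlier entry...
        arr[i / 2].double()
    } else {
        // ...and odd multiples add the base to the previous entry.
        arr[i - 1] + arr[1]
    };
}
```

Built this way, the table costs about 7 doublings and 7 additions.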
This is a rather large stack allocation of 1.5 kiB (and it would be even larger with a 5-bit window). It could affect the ability of this library to be used effectively (or at all) in no-std environments. I suspect at a minimum we would need a feature flag to switch back to the previous behaviour.
What we currently do in downstream crates (e.g. here) is use the group::WnafBase and group::WnafScalar types to implement this on the heap (and allow precomputing the table where it makes sense). Maybe we could instead document this approach?
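As a rough sketch of that heap-based approach (the type names come from the group crate as mentioned above, but the exact constructor signatures, trait bounds, and const-generic window parameter shown here are assumptions to be checked against the group docs):

```rust
use group::{WnafBase, WnafGroup, WnafScalar};

/// Multiply `base` by `scalar` via a heap-allocated 4-bit window table.
fn wnaf_mul<G: WnafGroup>(base: G, scalar: &G::Scalar) -> G {
    // Precompute the table on the heap; this is the piece worth caching when
    // the same base is reused across many multiplications.
    let base = WnafBase::<G, 4>::new(base);
    // Decompose the scalar into windowed form once.
    let scalar = WnafScalar::<G::Scalar, 4>::new(scalar);
    &base * &scalar
}
```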
The reason not to simply document this is that every single * call is ~2x as slow as it needs to be. While I do acknowledge the stack allocation may be too large for no-std, I'd argue the solution is to make this function body std-only, not to remove it.
WnafBase/WnafScalar, from what I understand, are intended for points that are frequently reused in multiplication operations. This isn't the reuse case, and I'd argue against using WnafBase here because it uses the heap.
I'd be happy to update the PR with a feature flag switching between this and the old body.
This moves from 255 doubles and 255 additions to 259 doubles and 71 additions. If doubling is twice as fast as addition, which is roughly the case as far as I can tell, this shifts the function from executing in (255 + (255 * 2)) = 765 time to (259 + (71 * 2)) = 401 time, roughly a 48% reduction in runtime.
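(For the curious, those counts presumably break down as roughly 7 doublings and 7 additions to build the 16-entry table, plus 63 × 4 = 252 doublings and 64 additions for the 64 nibbles of a 256-bit scalar, giving 259 doublings and 71 additions in total; with an addition costed at two doublings, that is 259 + 71 × 2 = 401 versus 255 + 255 × 2 = 765 for the prior per-bit loop.)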