Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize scalar multiplication with a 4-bit window #61

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,7 @@ group = { version = "0.13", default-features = false }
rand = { version = "0.8", default-features = false }
static_assertions = "1.1.0"
subtle = { version = "2.3", default-features = false }
zeroize = { version = "^1.5", default-features = false, features = ["derive"] }

# alloc dependencies
blake2b_simd = { version = "1", optional = true, default-features = false }
Expand Down
116 changes: 67 additions & 49 deletions src/curves.rs
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ macro_rules! new_curve_impl {
(($($privacy:tt)*), $name:ident, $name_affine:ident, $iso:ident, $base:ident, $scalar:ident,
$curve_id:literal, $a_raw:expr, $b_raw:expr, $curve_type:ident) => {
/// Represents a point in the projective coordinate space.
#[derive(Copy, Clone, Debug)]
#[derive(Copy, Clone, Debug, zeroize::Zeroize)]
#[cfg_attr(feature = "repr-c", repr(C))]
$($privacy)* struct $name {
x: $base,
Expand Down Expand Up @@ -466,32 +466,41 @@ macro_rules! new_curve_impl {
type Output = $name;

fn mul(self, other: &'b $scalar) -> Self::Output {
// TODO: make this faster

let mut acc = $name::identity();

// This is a simple double-and-add implementation of point
// multiplication, moving from most significant to least
// significant bit of the scalar.
//
// We don't use `PrimeFieldBits::.to_le_bits` here, because that would
// force users of this crate to depend on `bitvec` where they otherwise
// might not need to.
//
// NOTE: We skip the leading bit because it's always unset (we are turning
// the 32-byte repr into 256 bits, and $scalar::NUM_BITS = 255).
for bit in other
.to_repr()
.iter()
.rev()
.flat_map(|byte| (0..8).rev().map(move |i| Choice::from((byte >> i) & 1u8)))
.skip(1)
{
acc = acc.double();
acc = $name::conditional_select(&acc, &(acc + self), bit);
// Create a table out of the point
// TODO: This can be made more efficient with a 5-bit window
let mut arr = [$name::identity(); 16];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a rather large stack allocation of 1.5 kiB (and it would be even larger with a 5-bit window). This could affect the ability for this library to be used effectively (or at all) in no-std environments. I suspect at a minimum we would need a feature flag to switch back to the previous behaviour.

What we currently do in downstream crates (e.g. here) is use the group::WnafBase and group::WnafScalar types to implement this on the heap (and allow precomputing this table where it makes sense). Maybe we could instead document this approach?

Copy link
Author

@kayabaNerve kayabaNerve Mar 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason not to simply document this is because every single * call is ~2x as slow as it needs to be. While I do ack how the stack allocation may be too large for nostd, I'd argue the solution is to make this function body std only, not to remove it.

WnafBase/WnafScalar, from what I understand, is intended for/makes sense for points frequently used in multiplication operatons. This isn't for the reuse case and I'd argue against using WnafBase here because it uses the heap.

I'd be happy to update the PR with a feature flag between this and the old body.

arr[1] = *self;
for i in 2 .. 16 {
arr[i] = if (i % 2) == 0 {
arr[i / 2].double()
} else {
arr[i - 1] + arr[1]
};
}

let mut res = $name::identity();
let mut first = true;
// Iterate from significant byte to least significant byte
for byte in other.to_repr().iter().rev() {
// Shift the result over 4 bits
if !first {
for _ in 0 .. 4 {
res = res.double();
}
}
first = false;

// Add the top-nibble from this byte into the result
res += arr[usize::from(byte >> 4)];
// Shift the result over
for _ in 0 .. 4 {
res = res.double();
}
// Add the bottom-nibble from this byte into the result
res += arr[usize::from(byte & 0b1111)];
}

acc
res
}
}

Expand Down Expand Up @@ -581,32 +590,41 @@ macro_rules! new_curve_impl {
type Output = $name;

fn mul(self, other: &'b $scalar) -> Self::Output {
// TODO: make this faster

let mut acc = $name::identity();

// This is a simple double-and-add implementation of point
// multiplication, moving from most significant to least
// significant bit of the scalar.
//
// We don't use `PrimeFieldBits::.to_le_bits` here, because that would
// force users of this crate to depend on `bitvec` where they otherwise
// might not need to.
//
// NOTE: We skip the leading bit because it's always unset (we are turning
// the 32-byte repr into 256 bits, and $scalar::NUM_BITS = 255).
for bit in other
.to_repr()
.iter()
.rev()
.flat_map(|byte| (0..8).rev().map(move |i| Choice::from((byte >> i) & 1u8)))
.skip(1)
{
acc = acc.double();
acc = $name::conditional_select(&acc, &(acc + self), bit);
// Create a table out of the point
// TODO: This can be made more efficient with a 5-bit window
let mut arr = [$name::identity(); 16];
arr[1] = (*self).into();
for i in 2 .. 16 {
arr[i] = if (i % 2) == 0 {
arr[i / 2].double()
} else {
arr[i - 1] + arr[1]
};
}

let mut res = $name::identity();
let mut first = true;
// Iterate from significant byte to least significant byte
for byte in other.to_repr().iter().rev() {
// Shift the result over 4 bits
if !first {
for _ in 0 .. 4 {
res = res.double();
}
}
first = false;

// Add the top-nibble from this byte into the result
res += arr[usize::from(byte >> 4)];
// Shift the result over
for _ in 0 .. 4 {
res = res.double();
}
// Add the bottom-nibble from this byte into the result
res += arr[usize::from(byte & 0b1111)];
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like previous impl targets constant time multiplication but here there is an addition of infinity point where byte is zero which breaks such feature.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be constant time if the underlying functions are constant time. While it's a bit bold to assume to_repr is constant time, the prior impl also used it. If adding the identity isn't constant time, that's a fault in the addition operation, not this algorithm.

The prior algorithm performed addition by identity for every zero bit, so that specific behavior isn't unique to this algorithm.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears addition isn't constant time. I believe that deserves its own issue. Here, I solely note the prior algorithm also performed addition with identity, so that isn't a new issue.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prior algorithm performed addition by identity for every zero bit, so that specific behavior isn't unique to this algorithm.

Ah sorry it's my bad to figure the prior impl as constant time

}

acc
res
}
}

Expand Down
2 changes: 1 addition & 1 deletion src/fields/fp.rs
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ use crate::arithmetic::SqrtTables;
// The internal representation of this type is four 64-bit unsigned
// integers in little-endian order. `Fp` values are always in
// Montgomery form; i.e., Fp(a) = aR mod p, with R = 2^256.
#[derive(Clone, Copy, Eq)]
#[derive(Clone, Copy, Eq, zeroize::Zeroize)]
#[repr(transparent)]
pub struct Fp(pub(crate) [u64; 4]);

Expand Down
2 changes: 1 addition & 1 deletion src/fields/fq.rs
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ use crate::arithmetic::SqrtTables;
// The internal representation of this type is four 64-bit unsigned
// integers in little-endian order. `Fq` values are always in
// Montgomery form; i.e., Fq(a) = aR mod q, with R = 2^256.
#[derive(Clone, Copy, Eq)]
#[derive(Clone, Copy, Eq, zeroize::Zeroize)]
#[repr(transparent)]
pub struct Fq(pub(crate) [u64; 4]);

Expand Down