faster search algorithm #12

Open
warner opened this issue Jan 23, 2020 · 2 comments · May be fixed by #15

warner commented Jan 23, 2020

At RWC this year, we (maybe @gtank or @tarcieri?) talked about a faster approach to finding suitable keypairs. The current scheme performs a large number of independent trials (in parallel on all available CPU cores), where each one picks a random secret key (a scalar), performs the (expensive) scalarmult operation to transform it into a public key (a point), then examines the base64-encoded public key for a match against the desired prefix.
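
For concreteness, here's a minimal sketch of one such trial, assuming x25519-dalek's `StaticSecret`/`PublicKey` types plus the `rand` and `base64` crates (the parallel dispatch across cores is omitted, and constructor names vary a bit between x25519-dalek versions):

```rust
use rand::rngs::OsRng;
use x25519_dalek::{PublicKey, StaticSecret};

/// One independent trial of the current scheme: random scalar,
/// expensive scalarmult, then a string check on the encoded key.
fn trial(prefix: &str) -> Option<(StaticSecret, PublicKey)> {
    let secret = StaticSecret::new(&mut OsRng); // fresh clamped scalar
    let public = PublicKey::from(&secret);      // the expensive scalarmult
    let encoded = base64::encode(public.as_bytes());
    if encoded.contains(prefix) {
        Some((secret, public))
    } else {
        None
    }
}
```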

The new scheme would take advantage of the fact that point addition is much much faster than scalar multiplication. The initialization step looks like:

  • set s to a randomly-selected scalar, clamped as usual
  • set scalar_offset to 8 (i.e. the Ed25519 group's cofactor)
  • set p to scalarmult(s, BASEPOINT)
  • set point_offset to scalarmult(scalar_offset, BASEPOINT)

Then each step of the loop looks like:

  • base64-encode p and test against the desired prefix. If it matches, print the result and start over again from initialization
  • else, set s = s + scalar_offset and p = p + point_offset
  • repeat

The scalar addition is done modulo the group order, and the point addition is ordinary Edwards point addition.
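
Put together, a minimal sketch of the new scheme using curve25519-dalek's `Scalar` and `EdwardsPoint` (the `search` and `clamp` names are mine, and the clamping/wrapping subtleties listed below are glossed over; note in particular that `Scalar::from_bytes_mod_order` reduces mod the group order, which discards the clamped bit pattern):

```rust
use curve25519_dalek::constants::ED25519_BASEPOINT_POINT;
use curve25519_dalek::edwards::EdwardsPoint;
use curve25519_dalek::scalar::Scalar;
use rand::{rngs::OsRng, RngCore};

/// Standard Ed25519/x25519 clamping of 32 random bytes.
fn clamp(mut b: [u8; 32]) -> [u8; 32] {
    b[0] &= 248;  // clear low 3 bits: scalar is a multiple of the cofactor 8
    b[31] &= 127; // clear the top bit
    b[31] |= 64;  // set bit 254
    b
}

fn search(prefix: &str) -> (Scalar, EdwardsPoint) {
    // Initialization: the only expensive scalarmult per output keypair.
    let mut seed = [0u8; 32];
    OsRng.fill_bytes(&mut seed);
    // NB: reduction mod the group order loses the clamped bit pattern;
    // the real implementation must address the clamping bullet below.
    let mut s = Scalar::from_bytes_mod_order(clamp(seed));
    let mut p = &s * &ED25519_BASEPOINT_POINT;
    let scalar_offset = Scalar::from(8u8); // the cofactor
    let point_offset = &scalar_offset * &ED25519_BASEPOINT_POINT;

    loop {
        // x25519 keys are Montgomery-form, so convert before encoding.
        let encoded = base64::encode(p.to_montgomery().to_bytes());
        if encoded.contains(prefix) {
            return (s, p); // caller restarts from fresh randomness
        }
        s += scalar_offset; // cheap scalar addition mod the group order
        p += point_offset;  // cheap point addition
    }
}
```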

The speed is bounded by the point addition, and we only do one scalarmult per output keypair. Much, much faster. For any sensible scalar (large enough to have wrapped around the group order at least once, which I think is ensured when the clamping step sets the 2^254 bit), adding point_offset makes the point jump to an entirely different part of the numberspace, so successive prefixes won't be noticeably correlated, and we won't lose much yield to correlation. (If the prefix depended only on the high-order bits and the point addition only changed the low-order bits, the prefix would change very slowly.)

Some details to figure out before implementing this:

  • clamping: scalar_offset must be a multiple of the cofactor for this to work at all (assuming s is a valid+safe private key, s + scalar_offset must also be one)
  • wrapping: if clamp(s + scalar_offset) != s + scalar_offset, then we'll also have problems. We could either test for this on each cycle of the loop (one version of that test is sketched after this list), or rely on the fact that the group order is so large (about 2^252) that we'll never be unlucky enough for it to happen.
  • we'd reset the process each time we find a match, because otherwise using two different keypairs from the same run would be mutually unsafe: if someone compromised the first machine and learned its private key, they could search linearly outwards from that scalar and find the other one very quickly. This wasn't a problem when each trial picked a random scalar, but the speedup from using related scalars means we must never use a starting point more than once.
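
One way to realize the per-cycle test from the wrapping bullet, assuming the running scalar is also tracked as raw little-endian bytes rather than only as a reduced `Scalar` (hypothetical helper, reusing the `clamp` from the sketch above):

```rust
/// True if the scalar bytes still satisfy the clamp invariants,
/// i.e. clamp(s) == s; restart from fresh randomness if this fails.
fn survived_clamping(s_bytes: &[u8; 32]) -> bool {
    clamp(*s_bytes) == *s_bytes
}
```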

This is related to #1, because you could give each worker the starting point p (they'd all use the same point_offset) and just ask them to report how many steps they took before reaching a suitable pubkey. Then on the leader machine (the only one that knows the private scalar), you add their step count times scalar_offset to the private scalar to get the resulting private key.
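
On the leader, that reconstruction is a single scalar operation (hypothetical helper; `start` and `scalar_offset` as in the sketch above):

```rust
use curve25519_dalek::scalar::Scalar;

/// Recover the worker's private key from its reported step count:
/// private = start + steps * scalar_offset  (mod the group order)
fn reconstruct(start: Scalar, scalar_offset: Scalar, steps: u64) -> Scalar {
    start + Scalar::from(steps) * scalar_offset
}
```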

warner commented Jan 23, 2020

This will require using curve25519-dalek in addition to the DH-centric x25519-dalek (which treats private keys as opaque objects without + defined on them).

warner added a commit that referenced this issue Feb 18, 2020

warner commented Feb 24, 2020

The 12-count-scalars branch has a functioning implementation of this. It looks like a 4.8x speedup on all my (fairly modern) machines. The slowest part is now the conversion of public-key points from Edwards form to Montgomery form (necessary because x25519, WireGuard's key-agreement algorithm, uses Montgomery-form keys, and it's the base64 encoding of that form we have to examine).

test b1_point_generation              ... bench:         190 ns/iter (+/- 8)
test b2a_point_to_montgomery          ... bench:       3,154 ns/iter (+/- 228)
test b2b_montgomery_to_bytes          ... bench:           0 ns/iter (+/- 0)
test b2c_bytes_to_base64              ... bench:          99 ns/iter (+/- 3)
test b2d_base64_contains              ... bench:          88 ns/iter (+/- 17)
test b2e_total_point_checking         ... bench:       3,391 ns/iter (+/- 187)
test b3_point_generation_and_checking ... bench:       3,678 ns/iter (+/- 172)

Not clean enough to land just yet, but it's looking quite promising.

warner added a commit that referenced this issue Mar 4, 2020
and benchmark smaller pieces, for comparison against an upcoming (faster)
algorithm (#12)
warner added a commit that referenced this issue Mar 4, 2020
The basic operation used to take 17us/iter on my 2019 mac mini (3.2GHz Core
i7). The new approach takes 3.8us/iter.

refs #12
warner linked a pull request Mar 4, 2020 that will close this issue
warner added a commit that referenced this issue Mar 28, 2020