Parallel Hashing #495

Open
mr-bo-jangles opened this issue Jun 29, 2021 · 3 comments

@mr-bo-jangles

for file_path in file_paths {

This may be something you've already ruled out, but I wanted to suggest it just in case.

@casey
Owner

casey commented Jun 30, 2021

It's actually slightly more complex, since the hasher will produce a different output depending on the order that files are passed to it.

In a torrent file, info.pieces contains the bytes of the SHA-1 hashes of the contents of the files, taken one fixed-size piece at a time, and often multiple files will contribute to a single hash.

Consider a torrent with the following files:

a: "xyz",
b: "123",

If the piece size is 2, info.pieces will contain 3 hashes: hash("xy"), hash("z1"), and hash("23").

In imdl's implementation, Hasher::hash_file is first called with the path to a, which adds hash("xy") to the in-progress info.pieces and leaves the trailing z waiting for more data. The next call to Hasher::hash_file must then be passed the path to b, so that it gets the first byte of b, in this case 1, and can push hash("z1") into info.pieces. Since these calls are order-sensitive, the loop can't be parallelized with rayon without some additional refactoring.
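To make the order dependence concrete, here is a minimal sketch of how pieces are cut from the concatenated file contents. It is not imdl's code: `split_into_pieces` is a hypothetical helper that takes in-memory byte slices instead of paths, and it stops at the raw piece bytes rather than hashing them, so it only illustrates where the piece boundaries fall.

```rust
/// Hypothetical helper (not part of imdl): cuts the concatenation of
/// `files` into pieces of `piece_length` bytes. A piece may span a file
/// boundary, and only the last piece may be shorter than `piece_length`.
fn split_into_pieces(files: &[&[u8]], piece_length: usize) -> Vec<Vec<u8>> {
    assert!(piece_length > 0, "piece length must be non-zero");
    let mut pieces = Vec::new();
    // Bytes carried over from earlier files that do not yet fill a piece.
    let mut current = Vec::with_capacity(piece_length);
    for file in files {
        for &byte in *file {
            current.push(byte);
            if current.len() == piece_length {
                // The piece is full: emit it and start a fresh buffer.
                pieces.push(std::mem::replace(
                    &mut current,
                    Vec::with_capacity(piece_length),
                ));
            }
        }
    }
    // Whatever is left over forms the final, possibly short, piece.
    if !current.is_empty() {
        pieces.push(current);
    }
    pieces
}
```

With the files from the example above and a piece size of 2, this yields the pieces "xy", "z1", and "23". Passing b before a instead yields "12", "3x", and "yz", which is why the files can't simply be handed to independent workers.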

Thanks for the suggestion though! I wish it were that simple T_T

@casey casey closed this as completed Jun 30, 2021
@mr-bo-jangles
Author

mr-bo-jangles commented Jun 30, 2021 via email

@casey casey changed the title from "Could making the hashing parallel be as simple as replacing this forloop with something like Rayon?" to "Parallel Hashing" Jul 3, 2021
@casey
Owner

casey commented Jul 3, 2021

I suppose that there would have to be an iterator over file bytes first, and that iterator would need to be parallelized. One question is whether the current hashing algorithm is I/O or CPU bound, since that would suggest whether parallelizing reads or hashing should be the priority.
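As a rough sketch of that idea, and only under several assumptions: once the file bytes are available as one ordered stream (here simply a single in-memory slice), each piece can be hashed independently and the results put back in piece order. Everything below is hypothetical and not imdl's API. `stand_in_hash` is an FNV-1a placeholder used to keep the example dependency-free where a real implementation would use SHA-1, and `std::thread::scope` stands in for what rayon would do with a parallel iterator.

```rust
use std::thread;

/// Placeholder for SHA-1 (assumption: any pure function of the piece
/// bytes works for illustrating the parallel structure). This is FNV-1a.
fn stand_in_hash(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in data {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical helper: hashes each `piece_length`-sized piece of `data`
/// on up to `workers` threads and returns the hashes in piece order.
fn hash_pieces_parallel(data: &[u8], piece_length: usize, workers: usize) -> Vec<u64> {
    assert!(piece_length > 0 && workers > 0);
    // Piece boundaries depend only on the ordered byte stream.
    let pieces: Vec<&[u8]> = data.chunks(piece_length).collect();
    if pieces.is_empty() {
        return Vec::new();
    }
    // Give each worker a contiguous run of pieces, rounding up.
    let per_worker = (pieces.len() + workers - 1) / workers;
    thread::scope(|scope| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = pieces
            .chunks(per_worker)
            .map(|batch| {
                scope.spawn(move || {
                    batch.iter().map(|piece| stand_in_hash(piece)).collect::<Vec<u64>>()
                })
            })
            .collect();
        // Joining in spawn order keeps the hashes in piece order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker panicked"))
            .collect()
    })
}
```

This only parallelizes the CPU side. If reads turn out to be the bottleneck, the bytes would still arrive sequentially and this arrangement would help little, which is the I/O versus CPU question raised above.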

This is discussed a bit in #26, but I think this issue is useful for tracking parallelization of hashing.

@casey casey reopened this Jul 3, 2021