Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do calculations in parallel #17

Open
nweyand opened this issue May 6, 2017 · 10 comments
Open

Do calculations in parallel #17

nweyand opened this issue May 6, 2017 · 10 comments

Comments

@nweyand
Copy link

nweyand commented May 6, 2017

When working on the comparison of two files or directories on different physical drives, it might be intelligent to calculate the hashes on both drives in parallel and not in sequence in order to increase the hashing speed.

On a side note: Have you considered a model where you have one thread reading data per physical disk and a pool of worker threads to handle the hashing of that data? This might be a good model to handle CPU time intensive hashes combined with high data transfer rate disks.

@tedsmith
Copy link
Owner

tedsmith commented May 8, 2017

Thanks for the contact. Your first point is very valid and one I will look into.

Your second point is one I have looked at several times over the years but, ultimately, it involves multi threaded programming which, whenever I look into it, always seems a bit too tricky. But if I ever work it out it will improve the program enormously, I agree.

@nweyand
Copy link
Author

nweyand commented May 8, 2017

You’re right: multi threaded hashing of a single file is difficult... But have you thought about just having multiple threads and spreading the entire files between them? Especially for very small files, you could just have one thread to read the entire files into RAM and then forward them to a few threads that would themselves do singlethreaded hashing of the files forwarded to them. And you could use a file size threshold to determine whether to use the current approach (which I assume doesn’t require loading the entire file into RAM and for conveniences sake could be located within your manager thread) or the new approach with preloaded files.

This would get you around the problem of using complex multithreading while at the same time give you at least some advantages of the speed gain by using several threads. Also, you could probably recycle most of your current code, as the worker threads would basically do the same as your code does now – with just the difference that they would get their data from a preloaded file in RAM instead of a file on disk.

@tedsmith
Copy link
Owner

tedsmith commented May 8, 2017

Yes, good ideas. So if I setup several TProcess instances, and then after the folder scan to identify all the files, for all files under say 100Kb divide them out to the individual TProcess threads which themsevles will call CalcTheHashFile. I suspect it might still be quite tricky, but do-able I'm sure. Might take me a while though. But I suspect it is much easier than multi-threading a single file, as you illustrate.

@nweyand
Copy link
Author

nweyand commented May 8, 2017

Yes indeed. The important thing about that is that there should always be only one thread per disk doing the actual data reading, as the disk threwput drops drastically (especially for magnetic drives) if several threads try to read at several separate locations at a time (as this requires seeker head position changes between each access).

And if you want to make it truly awesome, sort the file access by their disk position, as this will allow disks to optimally use their „read ahead“ feature to hand out the data faster. :)

@tedsmith
Copy link
Owner

tedsmith commented May 8, 2017

Sounds good, but to my knowledge that would mean reading offsets from $mft for ntfs, the fat table for fat32/16 and exFat, inodes for ext3/4 and so on. In turn that means writing readers for all the file systems. That would be an awful lot of work. Or do you know of a better way to do what you suggest? I know there's probably many ways with different languages but From your profile I think you're c++ coder so how would YOU do that in c++, do you know? Then I can perhaps migrate that tO FreePascal.

@nweyand
Copy link
Author

nweyand commented May 8, 2017

Actually, that last part was advice based on theoretical knowledge I have about how disks work... I must admit I have never done something like that myself until now. So if there is not some native OS call that can be used for this I would suggest letting it rest for now. It could always be added to the initial scanning process at a later time.

@tedsmith
Copy link
Owner

tedsmith commented May 9, 2017

https://www.freepascal.org/docs-html/rtl/classes/tthread.html

It seems there is a Thread class. If I create a TThread descendant (i.e. THashThread) that uses the CalcTheHashFile function, it may work, though I might need to make a revised version that is entirely self-contained (thread safe, I gather the terminology is). The hurdle may be creating a thread manager. But I will keep going and see what I might come up with.

v2.8.1 was just released so if I ever manage this, it will be in the next significant update, i.e. 2.9.

@tedsmith
Copy link
Owner

I have made a start on this, with some help from user Taazz at the FPC forums. A small self-contained prototype is getting there, allbeit with some thread count issues, in the latest commit : 4fb3d54

I'm not sure if I will continue with the idea or not. I do like the idea but it's just really quite tricky, and I'm not confident at my ability to incorporate it into the main program in a sufficiently stable way.

It is committed though as a good base point for me and any collaborators to help me with though. I will not close this ticket yet, because I don't want to forget about it. And I am still undecided whether to pursue the concept or not.

@mike-lawrence
Copy link

Any progress on at least the first suggestion of doing the checksums for two physical disks in parallel? Seems like that would be an easy doubling of performance in a common scenario.

@tedsmith
Copy link
Owner

tedsmith commented Jul 2, 2020

Not really, no. I had a lot of discussions about it, and there are lots of folk who ask the same. The problem is that QH is a hobby project that I fit in around everything else, and I'm not really what you would call a "highly capable" programmer, so everything takes me a while to learn to do. The odd update here, the odd feature there. Incorporating multi-threading is a massive thing really. And I just haven't had the time, yet. But I do keep returning to the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants