Do not eliminate candidates using first/last bytes for smaller files. #114
For consideration, could you provide some numbers for what "small files" look like on your large, messy backup directory? E.g. counts, average size of "small" files, ...
As suggested, my reading of the code is that each file open / read is "expensive", especially for spinning-rust disks or, in my case, across a network connection. I thought about trying to combine steps 3, 4, and 5 together for "small" files, but my gut suggests the additional memory and tracking overhead required to manage it outweighs the simplicity of skipping steps 3 and 4. Based on the user's knowledge of their disk contents, I could see this as a potentially big win, especially if checksum calculations were performed in parallel.
I did some tests combining the first / last bytes scans into a single pass (a rough sketch of the idea appears after this comment). There are two trade-offs here. First, the amount of memory used to store bytes doubles, as the 64 bytes from both the first and last positions are now kept. Second, reading the bytes takes a bit longer, which might be "wasted" time if the number of files eliminated using first-bytes alone is significant. My limited testing on a physical disk accessed across the network gave me these numbers:
Time1 is "baseline": rdfind as in this repository. Time2 is after combining the first/last byte reading as described above. Times are in seconds. In short, in this test, this change improves the total time by about 20% and reduces the first/last bytes time by about 50%.
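Since the original snippets did not survive extraction, here is a rough, hypothetical sketch of the combined read described above: one open per file, fetching both 64-byte regions in a single pass. The names and structure are illustrative only and are not taken from rdfind's actual source.

```cpp
// Hypothetical sketch: read both the first and last SomeByteSize bytes of a
// file in a single open, instead of one pass per region.
#include <array>
#include <cstddef>
#include <cstdio>
#include <cstring>

constexpr std::size_t SomeByteSize = 64;

struct FirstLastBytes {
  std::array<char, SomeByteSize> first{};  // zero-initialized
  std::array<char, SomeByteSize> last{};
};

// Returns true on success. Files shorter than SomeByteSize are zero-padded,
// so identical small files still produce identical buffers.
bool readFirstAndLastBytes(const char* path, FirstLastBytes& out) {
  std::FILE* f = std::fopen(path, "rb");
  if (!f) {
    return false;
  }
  std::size_t n = std::fread(out.first.data(), 1, SomeByteSize, f);
  // Seek to the tail; for files shorter than SomeByteSize the seek fails,
  // so just reuse the head bytes.
  if (std::fseek(f, -static_cast<long>(SomeByteSize), SEEK_END) == 0) {
    std::size_t m = std::fread(out.last.data(), 1, SomeByteSize, f);
    (void)m;
  } else {
    std::memcpy(out.last.data(), out.first.data(), n);
  }
  std::fclose(f);
  return true;
}
```

The doubled buffer (128 bytes per file instead of 64) and the unconditional tail read are exactly the two trade-offs mentioned in the comment above.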
@fire-eggs here is a statistic of the file sizes:
My suggestion here is not a duplicate of #89. The idea is to use only the hash for files <= 4K (or whatever threshold), while other files potentially still go through all phases; #89 is about using only the hash phase. You are proposing a different solution (combining the first/last phases together). Note that you could hash the first/last bytes to avoid the memory usage issue.
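A minimal sketch of that last note, assuming nothing about rdfind's internals: instead of keeping the raw first/last bytes per file, keep only a small fixed-size digest of them. FNV-1a is used here purely as an example hash.

```cpp
// Hypothetical sketch: reduce the two 64-byte regions to one 8-byte key so
// per-file memory stays constant. A collision only means a candidate survives
// to the full-checksum phase, so correctness is unaffected.
#include <cstddef>
#include <cstdint>

std::uint64_t fnv1a(const char* data, std::size_t len,
                    std::uint64_t h = 1469598103934665603ULL) {
  for (std::size_t i = 0; i < len; ++i) {
    h ^= static_cast<unsigned char>(data[i]);
    h *= 1099511628211ULL;
  }
  return h;
}

// Chain the hash over both regions to get a single elimination key.
std::uint64_t firstLastKey(const char* first, const char* last, std::size_t n) {
  return fnv1a(last, n, fnv1a(first, n));
}
```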
Clarification appreciated. Apologies for misinterpreting!
Excellent suggestion! Thank you for the idea!
@fziglio :
Based on this, I set up a test folder with these files:
So
I think this is pretty close (except maybe the total size) to your situation. Running against this test folder gave the following times:
Times are in minutes. My takeaway from these numbers is that the cost of opening and reading a file is prohibitively expensive for "slow" media. The cost of the first / last bytes scan is far greater than the checksum calculation, especially when large numbers of "small" files are involved. After trying the "no first / last bytes" test, I didn't see the point of trying any further variations ("combined first/last bytes" or "skipping first/last bytes for small files"). When running against "fast" media, I believe there will be no significant penalty from the first / last byte scan. At this point, I'm convinced I can get a bigger win by improving the checksum speed. An interesting exercise. Thank you for the stats!
I wonder whether there's an "intercept" point where doing a checksum outperforms first + last bytes. I'm thinking of a Big O analysis rather than an empirical finding (although the latter is certainly important). Theoretically, why would reading the first and last bytes of a file be slower than a checksum? In both cases, don't the files need to be read from the hard disk into memory? To that end, perhaps one of the first optimisations should be minimising disk reads, since RAM and processor throughput will continue to outperform disk access for the foreseeable future. I haven't reviewed the code, but reading the same file several times seems extremely inefficient.
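A back-of-envelope model of that "intercept" point, with made-up constants (they are assumptions, not measurements): once a file fits in a single filesystem block, the checksum pass transfers no more data than a 64-byte read, so the per-file cost is dominated by opens/seeks rather than by bytes transferred.

```cpp
// Illustrative cost model only; the constants below are assumptions.
#include <cstdio>

int main() {
  const double seek_ms = 8.0;                 // assumed seek/rotational latency
  const double block_ms = 0.05;               // assumed transfer of one 4 KiB block
  const double checksum_ms_per_block = 0.01;  // assumed CPU cost per block

  // Current pipeline for a small (<= 4 KiB) candidate:
  // first-bytes pass + last-bytes pass + checksum pass
  // = 3 opens/seeks and 3 one-block reads.
  double three_pass = 3 * (seek_ms + block_ms) + checksum_ms_per_block;

  // Proposed for small files: go straight to the checksum
  // = 1 open/seek and 1 one-block read.
  double checksum_only = (seek_ms + block_ms) + checksum_ms_per_block;

  std::printf("3-pass: %.2f ms, checksum-only: %.2f ms per small file\n",
              three_pass, checksum_only);
  return 0;
}
```

Under these assumed numbers the seek term dwarfs everything else, which is consistent with rdfind sitting in the `D` state in the report below.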
Somehow related to #29.
I'm using `rdfind` on some large, messy backup directory. There are plenty of small duplicated files. The statistics are:
Although the first `Now eliminating candidates based on first bytes:` iterations removed quite some files, I think in some cases it would be better to remove these steps for smaller files. Why? Consider the local case (no NFS/SMB) with physical disks. Files are organized in blocks (usually 4 KB today); the disk/OS basically reads these 4 KB at the same speed as 64 bytes (the current `SomeByteSize`), so if for these files we just used the checksum directly, we would potentially save 2 reads of each file at the cost of some more CPU. Looking at `htop`, the `rdfind` command is almost always in the `D` state (waiting for the disk).
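A minimal sketch of the proposal in this issue, assuming a simple size threshold rather than rdfind's real data structures: files at or below one filesystem block skip the first/last-bytes phases and go straight to the checksum, while larger files keep the cheap eliminations.

```cpp
// Hypothetical phase planning; names and the threshold are illustrative.
#include <cstdint>
#include <vector>

struct Candidate {
  std::uint64_t size = 0;
  // ... path, device/inode, byte buffers, etc.
};

constexpr std::uint64_t kSmallFileThreshold = 4096;  // assumed block size

void planPhases(const std::vector<Candidate>& files,
                std::vector<const Candidate*>& firstLastPhase,
                std::vector<const Candidate*>& checksumOnly) {
  for (const Candidate& c : files) {
    if (c.size <= kSmallFileThreshold) {
      checksumOnly.push_back(&c);    // one read total: checksum directly
    } else {
      firstLastPhase.push_back(&c);  // keep first/last-bytes elimination
    }
  }
}
```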