Do not eliminate candidates using first/last bytes for smaller files. #114

Open
fziglio opened this issue Aug 23, 2022 · 7 comments

@fziglio

fziglio commented Aug 23, 2022

Somewhat related to #29.

I'm using rdfind on a large, messy backup directory. There are plenty of small duplicated files. The statistics are:

Now scanning ".", found 1898156 files.
Now have 1898156 files in total.
Removed 2868 files due to nonunique device and inode.
Total size is 434307786245 bytes or 404 GiB
Removed 52470 files due to unique sizes from list. 1842818 files left.
Now eliminating candidates based on first bytes: removed 631798 files from list. 1211020 files left.
Now eliminating candidates based on last bytes: removed 49901 files from list. 1161119 files left.
Now eliminating candidates based on sha1 checksum: removed 61578 files from list. 1099541 files left.
It seems like you have 1099541 files that are not unique

Although the "Now eliminating candidates based on first bytes" / "last bytes" passes removed quite a few files, I think in some cases it would be better to skip these steps for smaller files. Why? Consider the local case (no NFS/SMB) with physical disks. Files are organized in blocks (usually 4 KB today), and the disk/OS reads those 4 KB at essentially the same speed as 64 bytes (the current SomeByteSize value), so for these files, going straight to the checksum potentially saves two reads per file at the cost of some extra CPU. Looking at htop, rdfind is almost always in the D state (waiting for the disk).
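In rough C++ terms the idea would be something like the following sketch (an illustration only, not rdfind's code; `kBlockSize` is an assumed threshold and `std::hash` merely stands in for the real sha1/md5 checksum):

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>

constexpr std::uintmax_t kBlockSize = 4096; // assumed filesystem block size

// Hypothetical dispatch: files that fit in one block skip the first-/last-
// bytes stages and go straight to the checksum stage.
bool goes_straight_to_checksum(std::uintmax_t file_size)
{
  return file_size <= kBlockSize;
}

// Whole-file "checksum" for a small file, done in a single read.
std::size_t checksum_small_file(const std::string& path)
{
  std::ifstream in(path, std::ios::binary);
  const std::string contents{ std::istreambuf_iterator<char>(in),
                              std::istreambuf_iterator<char>() };
  return std::hash<std::string>{}(contents); // placeholder for sha1/md5
}
```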

@fire-eggs

[For consideration, could you provide some numbers for what "small files" look like on your large, messy backup directory? E.g. counts, average size of "small" files ...]

My reading of the code suggests that rdfind runs each "stage" across the current file list. I.e. the code follows a sequence like this (a rough sketch of the pattern follows the list):

  1. Scan for all files.
  2. Ask the file system for the size of each file; remove unique entries.
  3. Open each file in the current list, read the first bytes. Remove non-duplicate files.
  4. Open each file in the current list, read the last bytes. Remove non-duplicate files.
  5. Open each file in the current list, read the file and calculate a checksum. Remove non-duplicate files.
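A minimal sketch of that per-stage filtering pattern, as an illustration only (not rdfind's actual code): each stage computes a key per file and drops every file whose key is unique, since such a file cannot have a duplicate.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct FileEntry {
  std::string path;
  std::uintmax_t size = 0;
};

// Keep only files whose key is shared with at least one other file.
std::vector<FileEntry>
eliminate_unique(const std::vector<FileEntry>& files,
                 const std::function<std::string(const FileEntry&)>& key_of)
{
  std::unordered_map<std::string, std::size_t> counts;
  for (const auto& f : files)
    ++counts[key_of(f)];

  std::vector<FileEntry> survivors;
  for (const auto& f : files)
    if (counts[key_of(f)] > 1)
      survivors.push_back(f);
  return survivors;
}

// The stages then chain: eliminate by size, then by first bytes, then by
// last bytes, then by full checksum -- each stage re-opening and re-reading
// every file still in the list.
```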

As suggested, each file open / read is "expensive", especially for spinning-rust disks, or in my case, across a network connection.

Based on the rdfind output posted, I think this means an option to skip steps 3 and 4 and go directly to step 5 would eliminate (1,842,818 + 1,211,020) file open/read operations, at the expense of having to calculate the checksum for 1,842,818 files instead of 1,161,119 files.

[I thought about trying to combine steps 3, 4, and 5 for "small" files. My gut suggests the additional memory and tracking overhead required to manage it outweighs the simplicity of skipping steps 3 and 4.]

Based on the user's knowledge of their disk contents, I could see this as a potentially big win, especially if checksum calculations were performed in parallel.

@fire-eggs

This is technically a duplicate of #89, where @AlwaysHC has submitted a pull request to allow disabling the first/last bytes scan.

@fire-eggs

I did some tests, combining the first / last bytes scans. I.e., taking this:
3. Open each file in the current list, read the first bytes. Remove non-duplicate files.
4. Open each file in the current list, read the last bytes. Remove non-duplicate files.

and implementing this:
3. Open each file in the current list, read the first bytes. Seek to EOF, read the last bytes.
4. Remove non-duplicate files based on first bytes.
5. Remove non-duplicate files based on last bytes.

There are two trade-offs here. First, the amount of memory used to store bytes doubles, as both the first 64 bytes and the last 64 bytes are now kept per file. Second, the read now takes a bit longer, which might be "wasted" time if the number of files eliminated using first bytes alone is significant.
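For reference, the combined read I tested looks roughly like this sketch (not the exact patch; it assumes the file size is already known from the size stage and uses C++17 `std::string::data()`):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

constexpr std::uintmax_t kSomeByteSize = 64;

struct HeadTail {
  std::string head;
  std::string tail;
};

// Read the first and last kSomeByteSize bytes of `path` with a single open.
HeadTail read_head_and_tail(const std::string& path, std::uintmax_t size)
{
  const std::uintmax_t n = std::min(size, kSomeByteSize);

  HeadTail result;
  result.head.resize(static_cast<std::size_t>(n));
  result.tail.resize(static_cast<std::size_t>(n));

  std::ifstream in(path, std::ios::binary);
  in.read(result.head.data(), static_cast<std::streamsize>(n));

  // Seek to the last n bytes and read them; for files shorter than
  // kSomeByteSize the head and tail simply overlap.
  in.seekg(static_cast<std::streamoff>(size - n), std::ios::beg);
  in.read(result.tail.data(), static_cast<std::streamsize>(n));

  return result;
}
```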

My limited testing on a physical disk accessed across the network gave me these numbers:

| Stage | Files | Time1 | Time2 |
| --- | --- | --- | --- |
| Initial | 465325 | 178 | 202 |
| Sizes | 432519 | 0 | 0 |
| First | 323606 | 2485 | 2039 |
| Last | 316491 | 1604 | 1 |
| Sha1 | | ~6000 | ~6000 |
| Total | | ~10267 | ~8250 |

Time1 is "baseline": rdfind as in this repository. Time2 is after combining the first/last byte reading as described above. Times are in seconds.

In short, in this test, this change improves the total time by about 20%, and reduces the first/last bytes time by about 50%.

@fziglio

fziglio commented Sep 5, 2022

@fire-eggs here are some statistics on the file sizes:

| Size | Number of files |
| --- | --- |
| Up to 4K | 1140103 |
| 4K-16K | 438871 |
| 16K-64K | 225600 |
| Others | 125785 |

My suggestion here is not a duplicate of #89. The idea is to use only the hash for files <= 4K (or whatever threshold), while other files potentially still go through all the phases; #89 is about using only the hash phase for everything.

You are proposing a different solution (combining first/last phases together). Note that you could hash the first/last bytes to avoid the memory usage issue.
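For instance, something like this sketch (std::hash is only a placeholder for whatever digest would actually be used):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Store a small digest of the head/tail buffers instead of the raw bytes:
// 2 x 8 bytes per file instead of 2 x 64 bytes. A hash collision only means
// a file survives to the full-checksum stage, which still catches it.
struct ByteDigests {
  std::size_t first_bytes_hash = 0;
  std::size_t last_bytes_hash = 0;
};

ByteDigests digest_head_tail(const std::string& head, const std::string& tail)
{
  return ByteDigests{ std::hash<std::string>{}(head),
                      std::hash<std::string>{}(tail) };
}
```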

@fire-eggs

My suggestion here is not a duplicate of #89. The idea is to use only the hash for files <= 4K (or whatever threshold), while other files potentially still go through all the phases; #89 is about using only the hash phase for everything.

Clarification appreciated. Apologies for misinterpreting!

You are proposing a different solution (combining first/last phases together). Note that you could hash the first/last bytes to avoid the memory usage issue.

Excellent suggestion! Thank you for the idea!

@fire-eggs

@fziglio :

@fire-eggs here are some statistics on the file sizes:

| Size | Number of files |
| --- | --- |
| Up to 4K | 1140103 |
| 4K-16K | 438871 |
| 16K-64K | 225600 |
| Others | 125785 |

Based on this, I set up a test folder with these files:

| Size (Bytes) | Copies | Characteristic |
| --- | --- | --- |
| 3,175 | 900,000 | Random binary file |
| 12,313 | 75,000 | Random binary file |
| 60,987 | 50,000 | Random binary file |
| 316,895 | 25,000 | Random binary file |
| 60,987+64 | 300,000 | Random binary file, unique in first 64 bytes |
| 316,895+64 | 300,000 | Random binary file, unique in first 64 bytes |
| 316,895+64 | 50,000 | Random binary file, unique in last 64 bytes |

So rdfind produces the following results:

(DRYRUN MODE) Now scanning "/media/kevin/rdfind/base", found 1700000 files.
(DRYRUN MODE) Now have 1700000 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 144003650000 bytes or 134 GiB
Removed 0 files due to unique sizes from list. 1700000 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 600000 files from list. 1100000 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 50000 files from list. 1050000 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum: removed 0 files from list. 1050000 files left.
(DRYRUN MODE) It seems like you have 1050000 files that are not unique

I think this is pretty close (except maybe the total size) to your situation.

Running rdfind against this tree on a 1 TB spinning-rust disk connected via USB, I get the following timings:

| Stage | Baseline | No first/last bytes |
| --- | --- | --- |
| First bytes scan | 202.8 | N/A |
| Last bytes scan | 104.9 | N/A |
| Checksum | 80.5 | 201.1 |
| End | 388.3 | 201.1 |

Times are in minutes.

My takeaway from these numbers is that opening and reading a file is prohibitively expensive on "slow" media. The first / last bytes scan costs far more than the checksum calculation, especially when large numbers of "small" files are involved. After trying the "no first / last bytes" test, I didn't see the point of trying any further variations ["combined first/last bytes" or "skipping first/last bytes for small files"].

Against "fast" media, I believe the first / last byte scan carries no significant penalty.

At this point, I'm convinced I can get a bigger win by improving the checksum speed.

An interesting exercise. Thank you for the stats!

@bordenc

bordenc commented Mar 25, 2023

I wonder whether there's an "intercept" point where doing a checksum outperforms first + last bytes. I'm thinking of a Big-O analysis rather than an empirical finding (although the latter is certainly important). Theoretically, why would reading the first and last bytes of a file be slower than a checksum? In both cases, don't the files need to be read from the hard disk into memory?

To that end, perhaps one of the first optimisations should be minimising disk reads, since RAM and processor throughput will continue to outperform disk access for the foreseeable future. I haven't reviewed the code, but reading the same file several times seems extremely inefficient.
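As a rough back-of-envelope model (not based on profiling the code): on a disk with 4 KiB blocks, a 3 KiB file occupies a single block, and the first-bytes pass, the last-bytes pass, and the checksum pass each have to open the file and fetch that same block, so the two byte passes roughly triple the I/O compared with checksumming alone while saving almost no reading. For a file of N blocks, the byte passes read only about two blocks while the checksum reads all N, so they only pay off when N is large and the first or last bytes actually differ between candidates.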
