Do not eliminate candidates using first/last bytes for smaller files. #114

Open
fziglio opened this issue Aug 23, 2022 · 7 comments

@fziglio

fziglio commented Aug 23, 2022

Somewhat related to #29.

I'm using rdfind on a large, messy backup directory. There are plenty of small duplicated files. The statistics are:

Now scanning ".", found 1898156 files.
Now have 1898156 files in total.
Removed 2868 files due to nonunique device and inode.
Total size is 434307786245 bytes or 404 GiB
Removed 52470 files due to unique sizes from list. 1842818 files left.
Now eliminating candidates based on first bytes: removed 631798 files from list. 1211020 files left.
Now eliminating candidates based on last bytes: removed 49901 files from list. 1161119 files left.
Now eliminating candidates based on sha1 checksum: removed 61578 files from list. 1099541 files left.
It seems like you have 1099541 files that are not unique

Although the "Now eliminating candidates based on first bytes" / "last bytes" passes removed quite a few files, I think in some cases it would be better to skip these steps for smaller files. Why? Consider the local case (no NFS/SMB) with physical disks. Files are organized in blocks (usually 4 KB today), and the disk/OS reads those 4 KB at essentially the same speed as 64 bytes (the current SomeByteSize value), so for these files, going straight to the checksum potentially saves two reads per file at the cost of some extra CPU. Looking at htop, rdfind is almost always in the D state (waiting for the disk).
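In rough C++ terms the idea would be something like the following sketch (an illustration only, not rdfind's code; `kBlockSize` is an assumed threshold and `std::hash` merely stands in for the real sha1/md5 checksum):

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>

constexpr std::uintmax_t kBlockSize = 4096; // assumed filesystem block size

// Hypothetical dispatch: files that fit in one block skip the first-/last-
// bytes stages and go straight to the checksum stage.
bool goes_straight_to_checksum(std::uintmax_t file_size)
{
  return file_size <= kBlockSize;
}

// Whole-file "checksum" for a small file, done in a single read.
std::size_t checksum_small_file(const std::string& path)
{
  std::ifstream in(path, std::ios::binary);
  const std::string contents{ std::istreambuf_iterator<char>(in),
                              std::istreambuf_iterator<char>() };
  return std::hash<std::string>{}(contents); // placeholder for sha1/md5
}
```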

@fire-eggs

[For consideration, could you provide some numbers for what "small files" look like on your large, messy backup directory? E.g. counts, average size of "small" files ...]

My reading of the code suggests that rdfind runs each "stage" across the current file list. I.e. the code follows a sequence like this (a rough sketch of the pattern follows the list):

  1. Scan for all files.
  2. Ask the file system for the size of each file; remove unique entries.
  3. Open each file in the current list, read the first bytes. Remove non-duplicate files.
  4. Open each file in the current list, read the last bytes. Remove non-duplicate files.
  5. Open each file in the current list, read the file and calculate a checksum. Remove non-duplicate files.
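A minimal sketch of that per-stage filtering pattern, as an illustration only (not rdfind's actual code): each stage computes a key per file and drops every file whose key is unique, since such a file cannot have a duplicate.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

struct FileEntry {
  std::string path;
  std::uintmax_t size = 0;
};

// Keep only files whose key is shared with at least one other file.
std::vector<FileEntry>
eliminate_unique(const std::vector<FileEntry>& files,
                 const std::function<std::string(const FileEntry&)>& key_of)
{
  std::unordered_map<std::string, std::size_t> counts;
  for (const auto& f : files)
    ++counts[key_of(f)];

  std::vector<FileEntry> survivors;
  for (const auto& f : files)
    if (counts[key_of(f)] > 1)
      survivors.push_back(f);
  return survivors;
}

// The stages then chain: eliminate by size, then by first bytes, then by
// last bytes, then by full checksum -- each stage re-opening and re-reading
// every file still in the list.
```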

As suggested, each file open / read is "expensive", especially for spinning-rust disks, or in my case, across a network connection.

Based on the rdfind output posted, I think this means an option to skip steps 3 and 4 and go directly to step 5 would eliminate (1,842,818 + 1,211,020) file open/read operations, at the expense of having to calculate the checksum for 1,842,818 files instead of 1,161,119 files.

[I thought about trying to combine steps 3, 4, and 5 for "small" files. My gut suggests the additional memory and tracking overhead required to manage it outweighs the simplicity of skipping steps 3 and 4.]

Based on the user's knowledge of their disk contents, I could see this as a potentially big win, especially if checksum calculations were performed in parallel.

@fire-eggs

This is technically a duplicate of #89, where @AlwaysHC has submitted a pull request to allow disabling the first/last bytes scan.

@fire-eggs

I did some tests, combining the first / last bytes scans. I.e., taking this:
3. Open each file in the current list, read the first bytes. Remove non-duplicate files.
4. Open each file in the current list, read the last bytes. Remove non-duplicate files.

and implementing this:
3. Open each file in the current list, read the first bytes. Seek to EOF, read the last bytes.
4. Remove non-duplicate files based on first bytes.
5. Remove non-duplicate files based on last bytes.

There are two trade-offs here. First, the amount of memory used to store bytes doubles, as both the first 64 bytes and the last 64 bytes are now kept per file. Second, the read now takes a bit longer, which might be "wasted" time if the number of files eliminated using first bytes alone is significant.
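For reference, the combined read I tested looks roughly like this sketch (not the exact patch; it assumes the file size is already known from the size stage and uses C++17 `std::string::data()`):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

constexpr std::uintmax_t kSomeByteSize = 64;

struct HeadTail {
  std::string head;
  std::string tail;
};

// Read the first and last kSomeByteSize bytes of `path` with a single open.
HeadTail read_head_and_tail(const std::string& path, std::uintmax_t size)
{
  const std::uintmax_t n = std::min(size, kSomeByteSize);

  HeadTail result;
  result.head.resize(static_cast<std::size_t>(n));
  result.tail.resize(static_cast<std::size_t>(n));

  std::ifstream in(path, std::ios::binary);
  in.read(result.head.data(), static_cast<std::streamsize>(n));

  // Seek to the last n bytes and read them; for files shorter than
  // kSomeByteSize the head and tail simply overlap.
  in.seekg(static_cast<std::streamoff>(size - n), std::ios::beg);
  in.read(result.tail.data(), static_cast<std::streamsize>(n));

  return result;
}
```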

My limited testing on a physical disk accessed across the network gave me these numbers:

| Stage | Files | Time1 | Time2 |
| --- | --- | --- | --- |
| Initial | 465325 | 178 | 202 |
| Sizes | 432519 | 0 | 0 |
| First | 323606 | 2485 | 2039 |
| Last | 316491 | 1604 | 1 |
| Sha1 | | ~6000 | ~6000 |
| Total | | ~10267 | ~8250 |

Time1 is "baseline": rdfind as in this repository. Time2 is after combining the first/last byte reading as described above. Times are in seconds.

In short, in this test, this change improves the total time by about 20%, and reduces the first/last bytes time by about 50%.

@fziglio

fziglio commented Sep 5, 2022

@fire-eggs here are some statistics on the file sizes:

| Size | Number of files |
| --- | --- |
| Up to 4K | 1140103 |
| 4K-16K | 438871 |
| 16K-64K | 225600 |
| Others | 125785 |

My suggestion here is not a duplicate of #89. The idea is to use only the hash for files <= 4K (or whatever threshold), while other files potentially still go through all the phases; #89 is about using only the hash phase for everything.

You are proposing a different solution (combining first/last phases together). Note that you could hash the first/last bytes to avoid the memory usage issue.
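For instance, something like this sketch (std::hash is only a placeholder for whatever digest would actually be used):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Store a small digest of the head/tail buffers instead of the raw bytes:
// 2 x 8 bytes per file instead of 2 x 64 bytes. A hash collision only means
// a file survives to the full-checksum stage, which still catches it.
struct ByteDigests {
  std::size_t first_bytes_hash = 0;
  std::size_t last_bytes_hash = 0;
};

ByteDigests digest_head_tail(const std::string& head, const std::string& tail)
{
  return ByteDigests{ std::hash<std::string>{}(head),
                      std::hash<std::string>{}(tail) };
}
```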

@fire-eggs

My suggestion here is not a duplicate of #89. The idea is to use only the hash for files <= 4K (or whatever threshold), while other files potentially still go through all the phases; #89 is about using only the hash phase for everything.

Clarification appreciated. Apologies for misinterpreting!

You are proposing a different solution (combining first/last phases together). Note that you could hash the first/last bytes to avoid the memory usage issue.

Excellent suggestion! Thank you for the idea!

@fire-eggs

@fziglio :

@fire-eggs here are some statistics on the file sizes:

| Size | Number of files |
| --- | --- |
| Up to 4K | 1140103 |
| 4K-16K | 438871 |
| 16K-64K | 225600 |
| Others | 125785 |

Based on this, I set up a test folder with these files:

| Size (Bytes) | Copies | Characteristic |
| --- | --- | --- |
| 3,175 | 900,000 | Random binary file |
| 12,313 | 75,000 | Random binary file |
| 60,987 | 50,000 | Random binary file |
| 316,895 | 25,000 | Random binary file |
| 60,987+64 | 300,000 | Random binary file, unique in first 64 bytes |
| 316,895+64 | 300,000 | Random binary file, unique in first 64 bytes |
| 316,895+64 | 50,000 | Random binary file, unique in last 64 bytes |

So rdfind produces the following results:

(DRYRUN MODE) Now scanning "/media/kevin/rdfind/base", found 1700000 files.
(DRYRUN MODE) Now have 1700000 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 144003650000 bytes or 134 GiB
Removed 0 files due to unique sizes from list. 1700000 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 600000 files from list. 1100000 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 50000 files from list. 1050000 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum: removed 0 files from list. 1050000 files left.
(DRYRUN MODE) It seems like you have 1050000 files that are not unique

I think this is pretty close (except maybe the total size) to your situation.

Running rdfind against this tree on a 1 TB spinning-rust disk connected via USB, I get the following timings:

| Stage | Baseline | No first/last bytes |
| --- | --- | --- |
| First bytes scan | 202.8 | N/A |
| Last bytes scan | 104.9 | N/A |
| Checksum | 80.5 | 201.1 |
| End | 388.3 | 201.1 |

Times are in minutes.

My takeaway from these numbers is that opening and reading a file is prohibitively expensive on "slow" media. The first / last bytes scan costs far more than the checksum calculation, especially when large numbers of "small" files are involved. After trying the "no first / last bytes" test, I didn't see the point of trying any further variations ["combined first/last bytes" or "skipping first/last bytes for small files"].

Against "fast" media, I believe the first / last byte scan carries no significant penalty.

At this point, I'm convinced I can get a bigger win by improving the checksum speed.

An interesting exercise. Thank you for the stats!

@bordenc

bordenc commented Mar 25, 2023

I wonder whether there's an "intercept" point where doing a checksum outperforms first + last bytes. I'm thinking of a Big-O analysis rather than an empirical finding (although the latter is certainly important). Theoretically, why would reading the first and last bytes of a file be slower than a checksum? In both cases, don't the files need to be read from the hard disk into memory?

To that end, perhaps one of the first optimisations should be minimising disk reads, since RAM and processor throughput will continue to outperform disk access for the foreseeable future. I haven't reviewed the code, but reading the same file several times seems extremely inefficient.
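As a rough back-of-envelope model (not based on profiling the code): on a disk with 4 KiB blocks, a 3 KiB file occupies a single block, and the first-bytes pass, the last-bytes pass, and the checksum pass each have to open the file and fetch that same block, so the two byte passes roughly triple the I/O compared with checksumming alone while saving almost no reading. For a file of N blocks, the byte passes read only about two blocks while the checksum reads all N, so they only pay off when N is large and the first or last bytes actually differ between candidates.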
