add xxHash checksum and fix a comparison bug #167

Open
wants to merge 4 commits into base: devel

trollkarlen

Add xxHash, a very fast but non-cryptographic checksum library that is well suited to this application.

Also try to fix an issue in the checksum test: the test files were too small, so the test never actually ran the checksum step.
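
One way to make sure a test reaches the checksum stage is to use files that agree in size, first bytes and last bytes but differ somewhere in the middle. The rdfind output later in this thread shows candidates being eliminated by first and last bytes before any checksum is computed. The sketch below is only an illustration: the helper name `make_middle_diff_pair` and the 4096-byte size are assumptions, and how many leading and trailing bytes rdfind compares is not stated in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): create two files of equal size
# that share their first and last bytes and differ only in one middle byte.
make_middle_diff_pair() {
    dir=$1
    size=$2
    half=$((size / 2))
    rest=$((size - half - 1))
    { head -c "$half" /dev/zero; printf 'A'; head -c "$rest" /dev/zero; } >"$dir/a"
    { head -c "$half" /dev/zero; printf 'B'; head -c "$rest" /dev/zero; } >"$dir/b"
}

# Demo in a throwaway directory (the size is an arbitrary guess).
tmp=$(mktemp -d)
make_middle_diff_pair "$tmp" 4096
cmp -s "$tmp/a" "$tmp/b" || echo "files differ, but only in the middle"
rm -rf "$tmp"
```

Files like these cannot be told apart by size or by their ends, so a duplicate finder has to checksum them to see that they are different.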

@trollkarlen
Author

There are issues with workflows that have not been updated. Give me some time and I will fix them.

@trollkarlen
Author

The failing workflows are now fixed!
But I am unsure whether devel or main is the target branch for pull requests?

@pauldreik
Owner

Thanks for the PR! I like the gist of this and will look at it. I can already say that I am hesitant to change the default.

@trollkarlen
Author

trollkarlen commented Jan 10, 2025 via email

@@ -1,19 +1,19 @@
#!/bin/sh
# Test that selection of checksum works as expected.


set -x
Author

Remove this, it's for debugging only.

@pauldreik
Owner

pauldreik commented Jan 12, 2025

I tested this on a fast system (fast CPU, enough RAM and a fast SSD, but inside qemu).

Large files

Creating two files of zeros with `head -c 1000000000 /dev/zero >1G; cp 1G 1G_copy` and invoking rdfind with `time ./rdfind -checksum xxh128 1G*`, I get the following:

  • sha1 takes 1.2 seconds
  • xxh128 takes 0.3 seconds

With two 7G files, I get

  • sha1 8.7 s
  • xxh128 2.4 s

so this is clearly an improvement for large files at least.

Small files

If I create a large number of small files with `head -c100000000 /dev/zero | split --bytes 1000`, I get:

  • sha1 1.5s
  • xxh128 1.5s

which is not worse than before (good).

xxh builtin benchmark

This uses the Debian-provided xxh package. The builtin benchmark on the same system gives:

pauldreik@xxx:~/code/eget/rdfind$ xxhsum  -b
xxhsum 0.8.2 by Yann Collet 
compiled as 64-bit x86_64 autoVec little endian with GCC 14.2.0 
Sample of 100 KB...        
 1#XXH32                         :     102400 ->    81133 it/s ( 7923.1 MB/s)   
 3#XXH64                         :     102400 ->   118104 it/s (11533.6 MB/s)   
 5#XXH3_64b                      :     102400 ->   684392 it/s (66835.1 MB/s)   
11#XXH128                        :     102400 ->   682966 it/s (66695.9 MB/s)

Taking the file system overhead into account, xxh128sum 7G 7G_copy (to simulate the double reading rdfind does) takes 1.1 s.

Modifying rdfind buffer size

Increasing the buffer size inside rdfind from 4096 to 16K gets the time down from 2.4 to 1.7 seconds. Still a bit far from 1.1 seconds, but better.
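
A rough way to see why the buffer size matters: with a fixed-size read buffer, the number of buffer fills per file is the file size divided by the buffer size, rounded up, and each fill carries some fixed overhead. The sketch below does that arithmetic for the 1000000000-byte test file from the large-file test. `chunk_count` is a hypothetical helper, and rdfind's real I/O path may batch reads differently.

```shell
#!/bin/sh
# Hypothetical helper: number of buffer fills needed to read a file of
# size_bytes with a buffer of buf_bytes (ceiling division).
chunk_count() {
    size_bytes=$1
    buf_bytes=$2
    echo $(( (size_bytes + buf_bytes - 1) / buf_bytes ))
}

# Compare the two buffer sizes mentioned above for a 1000000000-byte file.
for buf in 4096 16384; do
    echo "buffer $buf: $(chunk_count 1000000000 "$buf") fills"
done
```

Going from 4096 to 16384 bytes cuts the number of fills to about a quarter, which is consistent with the per-read overhead becoming less dominant once the hash itself is very cheap.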

@trollkarlen
Author

Maybe we could choose the buffer size depending on the checksum method used, to make it as fast as possible.
And also add a CLI switch to change it, in case you have other kinds of media that perform better at other buffer sizes?

@trollkarlen
Author

trollkarlen commented Jan 13, 2025

I also tested this and measured a speedup of between 60% and 500% on Fedora with an SSD and an Intel(R) Xeon(R) CPU X5670. The faster the disk, the bigger the speedup, until we hit the limit of xxhash.

This is the test script:

cat <<'EOF' > test.sh
#!/bin/bash

TEST_DIR=testdir
mkdir -p "$TEST_DIR"

if [[ ! -f "$TEST_DIR/a" ]]; then
    echo "creating test files in $TEST_DIR"
    head -c $((1024*1024*500)) /dev/random >"$TEST_DIR/a"
    cp "$TEST_DIR/a" "$TEST_DIR/b"
    cp "$TEST_DIR/a" "$TEST_DIR/c"
    cp "$TEST_DIR/a" "$TEST_DIR/d"
    cp "$TEST_DIR/a" "$TEST_DIR/e"
fi

echo drop caches
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
echo "test coldcache sha1"
time ./rdfind -checksum sha1 -dryrun true -deleteduplicates true "$TEST_DIR"

echo drop caches
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
echo "test coldcache xxhash"
time ./rdfind -checksum xxh128 -dryrun true -deleteduplicates true "$TEST_DIR"

echo "test hotcache sha1"
time ./rdfind -checksum sha1 -dryrun true -deleteduplicates true "$TEST_DIR"
/usr/bin/time -f '%M kB' ./rdfind -checksum sha1 -dryrun true -deleteduplicates true "$TEST_DIR"

echo "test hotcache xxhash"
time ./rdfind -checksum xxh128 -dryrun true -deleteduplicates true "$TEST_DIR"
/usr/bin/time -f '%M kB' ./rdfind -checksum xxh128 -dryrun true -deleteduplicates true "$TEST_DIR"
EOF
$ sudo bash test.sh
drop caches
test coldcache sha1                                                                                                                                                                                                                        
(DRYRUN MODE) Now scanning "testdir", found 5 files.                                                                                                                                                                                       
(DRYRUN MODE) Now have 5 files in total.                                                                                                                                                                                                   
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.                                                                                                                                                                           
(DRYRUN MODE) Total size is 2621440000 bytes or 2 GiB                                                                                                                                                                                      
Removed 0 files due to unique sizes from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 5 files left.                                                                                                                                     
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum: removed 0 files from list. 5 files left.                                                                                                                                  
(DRYRUN MODE) It seems like you have 5 files that are not unique                                                                                                                                                                           
(DRYRUN MODE) Totally, 2 GiB can be reduced.                                                                                                                                                                                               
(DRYRUN MODE) Now making results file results.txt                                                                                                                                                                                          
(DRYRUN MODE) Now deleting duplicates:                                                                                                                                                                                                     
(DRYRUN MODE) delete testdir/b                                                                                                                                                                                                             
(DRYRUN MODE) delete testdir/c                                                                                                                                                                                                             
(DRYRUN MODE) delete testdir/d                                                                                                                                                                                                             
(DRYRUN MODE) delete testdir/e                                                                                                                                                                                                             
(DRYRUN MODE) Deleted 4 files.                                                                                                                                                                                                             
                                                                                                                                                                                                                                           
real    0m9.722s                                                                                                                                                                                                                           
user    0m7.409s                                                                                                                                                                                                                           
sys     0m1.559s                                                                                                                                                                                                                           
drop caches                                                                                                                                                                                                                                
test coldcache xxhash                                                                                                                                                                                                                      
(DRYRUN MODE) Now scanning "testdir", found 5 files.                                                                                                                                                                                       
(DRYRUN MODE) Now have 5 files in total.                                                                                                                                                                                                   
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.                                                                                                                                                                           
(DRYRUN MODE) Total size is 2621440000 bytes or 2 GiB                                                                                                                                                                                      
Removed 0 files due to unique sizes from list. 5 files left.                                                                                                                                                                               
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 5 files left.                                                                                                                                    
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 5 files left.                                                                                                                                     
(DRYRUN MODE) Now eliminating candidates based on xxh128 checksum: removed 0 files from list. 5 files left.                                                                                                                                
(DRYRUN MODE) It seems like you have 5 files that are not unique                                                                                                                                                                           
(DRYRUN MODE) Totally, 2 GiB can be reduced.                                                                                                                                                                                               
(DRYRUN MODE) Now making results file results.txt                                                                                                                                                                                          
(DRYRUN MODE) Now deleting duplicates:                                                                                                                                                                                                     
(DRYRUN MODE) delete testdir/b                                                                                                                                                                                                             
(DRYRUN MODE) delete testdir/c                                                                                                                                                                                                             
(DRYRUN MODE) delete testdir/d                                                                                                                                                                                                             
(DRYRUN MODE) delete testdir/e                                                                                                                                                                                                             
(DRYRUN MODE) Deleted 4 files.                                                                                                                                                                                                             
                                                                                                                                                                                                                                           
real    0m6.148s                                                                                                                                                                                                                           
user    0m1.375s                                                                                                                                                                                                                           
sys     0m2.175s
test hotcache sha1                                                                                                                                                                                                                         
(DRYRUN MODE) Now scanning "testdir", found 5 files.                                                                                                                                                                                       
(DRYRUN MODE) Now have 5 files in total.                                                                                                                                                                                                   
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.                                                                                                                                                                           
(DRYRUN MODE) Total size is 2621440000 bytes or 2 GiB
Removed 0 files due to unique sizes from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum: removed 0 files from list. 5 files left.
(DRYRUN MODE) It seems like you have 5 files that are not unique
(DRYRUN MODE) Totally, 2 GiB can be reduced.
(DRYRUN MODE) Now making results file results.txt
(DRYRUN MODE) Now deleting duplicates:
(DRYRUN MODE) delete testdir/b
(DRYRUN MODE) delete testdir/c
(DRYRUN MODE) delete testdir/d
(DRYRUN MODE) delete testdir/e
(DRYRUN MODE) Deleted 4 files.

real    0m7.499s
user    0m6.868s
sys     0m0.588s
(DRYRUN MODE) Now scanning "testdir", found 5 files.
(DRYRUN MODE) Now have 5 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 2621440000 bytes or 2 GiB
Removed 0 files due to unique sizes from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum: removed 0 files from list. 5 files left.
(DRYRUN MODE) It seems like you have 5 files that are not unique
(DRYRUN MODE) Totally, 2 GiB can be reduced.
(DRYRUN MODE) Now making results file results.txt
(DRYRUN MODE) Now deleting duplicates:
(DRYRUN MODE) delete testdir/b
(DRYRUN MODE) delete testdir/c
(DRYRUN MODE) delete testdir/d
(DRYRUN MODE) delete testdir/e
(DRYRUN MODE) Deleted 4 files.
4104 kB
test hotcache xxhash
(DRYRUN MODE) Now scanning "testdir", found 5 files.
(DRYRUN MODE) Now have 5 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 2621440000 bytes or 2 GiB
Removed 0 files due to unique sizes from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on xxh128 checksum: removed 0 files from list. 5 files left.
(DRYRUN MODE) It seems like you have 5 files that are not unique
(DRYRUN MODE) Totally, 2 GiB can be reduced.
(DRYRUN MODE) Now making results file results.txt
(DRYRUN MODE) Now deleting duplicates:
(DRYRUN MODE) delete testdir/b
(DRYRUN MODE) delete testdir/c
(DRYRUN MODE) delete testdir/d
(DRYRUN MODE) delete testdir/e
(DRYRUN MODE) Deleted 4 files.

real    0m1.476s
user    0m0.812s
sys     0m0.655s
(DRYRUN MODE) Now scanning "testdir", found 5 files.
(DRYRUN MODE) Now have 5 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 2621440000 bytes or 2 GiB
Removed 0 files due to unique sizes from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on xxh128 checksum: removed 0 files from list. 5 files left.
(DRYRUN MODE) It seems like you have 5 files that are not unique
(DRYRUN MODE) Totally, 2 GiB can be reduced.
(DRYRUN MODE) Now making results file results.txt
(DRYRUN MODE) Now deleting duplicates:
(DRYRUN MODE) delete testdir/b
(DRYRUN MODE) delete testdir/c
(DRYRUN MODE) delete testdir/d
(DRYRUN MODE) delete testdir/e
(DRYRUN MODE) Deleted 4 files.
4044 kB                                                    
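
To relate the "real" times in this output to the quoted speedup range, the ratios can be computed directly. This is a small sketch; `speedup` is a hypothetical helper, and the inputs are the wall-clock times copied from the log above.

```shell
#!/bin/sh
# Hypothetical helper: ratio of two wall-clock times, to one decimal place.
speedup() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", old / new }'
}

# "real" times copied from the output above (seconds): sha1 first, xxh128 second.
echo "cold cache: $(speedup 9.722 6.148)x"
echo "hot cache:  $(speedup 7.499 1.476)x"
```

With a cold cache the run is disk-bound and the ratio is about 1.6, while with a hot cache the hash dominates and the ratio is about 5.1.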

@trollkarlen
Author

I also wrote a patch for setting the buffer size from the CLI. Should I push that as well?

It looks like 32 KB is about optimal for xxhash on my disks and memory.

Testing buffersize 4096                                                                                                                                                                                                                    
test sha1                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                           
real    0m7.370s                                                                                                                                                                                                                           
user    0m6.730s                                                                                                                                                                                                                           
sys     0m0.596s                                                                                                                                                                                                                           
test xxhash                                                                                                                                                                                                                                
                                                                                                                                                                                                                                           
real    0m1.474s                                                                                                                                                                                                                           
user    0m0.842s                                                                                                                                                                                                                           
sys     0m0.623s                                                                                                                                                                                                                           
test xxhash memusage                                                                                                                                                                                                                       
4052 kB                                                                                                                                                                                                                                    
Testing buffersize 8192                                                                                                                                                                                                                    
test sha1                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                           
real    0m7.176s                                                                                                                                                                                                                           
user    0m6.592s                                                                                                                                                                                                                           
sys     0m0.541s                                                                                                                                                                                                                           
test xxhash                                                                                                                                                                                                                                
                                                                                                                                                                                                                                           
real    0m1.105s                                                                                                                                                                                                                           
user    0m0.483s                                                                                                                                                                                                                           
sys     0m0.615s                                                                                                                                                                                                                           
test xxhash memusage                                                                                                                                                                                                                       
4048 kB                                                                                                                                                                                                                                    
Testing buffersize 16384                                                                                                                                                                                                                   
test sha1                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                           
real    0m6.981s                                                                                                                                                                                                                           
user    0m6.398s
sys     0m0.543s
test xxhash

real    0m1.010s
user    0m0.463s
sys     0m0.541s
test xxhash memusage
4116 kB
Testing buffersize 32768
test sha1

real    0m7.147s
user    0m6.585s
sys     0m0.524s
test xxhash

real    0m0.954s
user    0m0.466s
sys     0m0.482s
test xxhash memusage
4192 kB
Testing buffersize 65536
test sha1

real    0m7.160s
user    0m6.595s
sys     0m0.527s
test xxhash

real    0m0.944s
user    0m0.451s
sys     0m0.486s
test xxhash memusage
4144 kB
Testing buffersize 131072
test sha1

real    0m7.067s
user    0m6.525s
sys     0m0.504s
test xxhash

real    0m0.999s
user    0m0.469s
sys     0m0.523s
test xxhash memusage
4240 kB
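
The xxhash timings from this sweep can be reduced to a single "fastest" buffer size with a few lines of awk. This is only a sketch: `fastest_buffersize` is a hypothetical helper, and the input pairs are the xxhash "real" times copied from the runs above.

```shell
#!/bin/sh
# Hypothetical helper: read "buffersize seconds" pairs on stdin and print
# the buffer size with the smallest time.
fastest_buffersize() {
    awk 'NR == 1 || $2 + 0 < best { best = $2 + 0; size = $1 } END { print size }'
}

# xxhash "real" times copied from the runs above.
fastest_buffersize <<'EOF'
4096 1.474
8192 1.105
16384 1.010
32768 0.954
65536 0.944
131072 0.999
EOF
```

On these numbers 32768 and 65536 are only 10 ms apart, which is probably within measurement noise for a single run, so either is a reasonable choice.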

@trollkarlen
Author

I updated the patch set because I had forgotten to update the help string:

- std::cerr << "expected md5/sha1/sha256/sha512, not \""
+ std::cerr << "expected md5/sha1/sha256/sha512/xxh128, not \""

@trollkarlen
Author

Made a new pull request for the buffer size: #177

@pauldreik
Owner

Thanks for this work, I intend to merge this feature.
I fixed the CI issues and the codesize bug on devel separately.

  1. Please make the PR against current devel
  2. xxhash should be optional
  3. the default hash should be unchanged, and the code comment should follow that
  4. let's control the buffer size outside of this PR
  5. don't touch the CI jobs except for adding a with/without job so we know the default works (in presence and absence of the library)
  6. please make separate commits - one for autoconf, one for the code changes, one for the man page, one for the documentation etc.
  7. inline the hash size macro into the cc file if possible

I made a branch here: https://github.com/pauldreik/rdfind/tree/trollkarlens_xxhash to test out the changes I want (work in progress, but you can at least see the autoconf change which makes it optional). I am totally fine with fixing this myself if you don't want to!

@trollkarlen
Author

trollkarlen commented Jan 22, 2025

Updated the patch set to reflect the suggestions in #167 (comment), please review.

The hard part was the ifdef handling, and I tried to keep that to a minimum.
That's why I left the help text and so on independent of with-xxhash.
I also did not move the "hash size macro" to the .cc file, to minimise the need for ifdefs.

Comment on lines +30 to +31
run: WITH_XXHASH=1 make check
- name: WITH_XXHASH=1 make distcheck
Owner

Why is WITH_XXHASH=1 here?

UPDATE: aha, it was for testing. ok, see the comment there.

allchecksumtypes="md5 sha1 sha256 sha512"

if [ "$WITH_XXHASH" = "1" ]; then
Owner

there must be some better way to detect this than using the environment variable.

not sure if autoconf can help here.

Author

@trollkarlen trollkarlen Jan 22, 2025

I did not find any good way to get/set that from autoconf, but if you know one, please advise.

The best would be if this was set as an environment variable in the Makefile directly when HAVE_LIBXXHASH is set, but I could not find a way to do that.
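
One possible alternative to the environment variable, sketched under assumptions: if configure writes `HAVE_LIBXXHASH` into a generated `config.h` (the usual result of `AC_CHECK_LIB` combined with `AC_CONFIG_HEADERS`), the test script could read the setting from that file. The helper name `xxhash_enabled`, the `CONFIG_H` variable and the `config.h` path are hypothetical and have not been checked against this repository.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given config header defines
# HAVE_LIBXXHASH to 1. The header path is an assumption.
xxhash_enabled() {
    grep -Eq '^[[:space:]]*#[[:space:]]*define[[:space:]]+HAVE_LIBXXHASH[[:space:]]+1' "$1" 2>/dev/null
}

# Build the list of checksum types to test, adding xxh128 only when enabled.
allchecksumtypes="md5 sha1 sha256 sha512"
if xxhash_enabled "${CONFIG_H:-config.h}"; then
    allchecksumtypes="$allchecksumtypes xxh128"
fi
echo "$allchecksumtypes"
```

A missing header is treated the same as a build without xxhash, so the script still tests the four default checksum types.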

@@ -21,7 +27,7 @@ if [ ! -e speedtest/largefile1 ] ; then
fi


- for checksumtype in md5 sha1 sha256; do
+ for checksumtype in $allchecksumtypes; do
Owner

wow, sha512 was missing before! nice to get that fixed.

Author

Maybe move allchecksumtypes to common_funcs.sh so it's the same everywhere?

Signed-off-by: Robert Marklund <[email protected]>
Paul Dreik and others added 3 commits January 22, 2025 23:37
add a very very fast non cryptographic checksumming library
https://xxhash.com

Signed-off-by: Robert Marklund <[email protected]>