add xxHash checksum and fix a comparison bug #167
base: devel
There are issues with workflows that were not updated; give me some time and I will fix them.
The failing workflows are now fixed!
Thanks for the PR! I like the gist of this and will look at it. I can already say that I am hesitant to change the default.
The default checksum is still sha1; in the first revision of the patches I
had changed it by accident.
Is the devel branch the right one for pull requests?
On Fri, Jan 10, 2025, 14:15 Paul Dreik wrote:
Thanks for the PR! I like the gist of this and will look at it. I can
already now say that I am hesitant to change the default.
testcases/checksum_options.sh
Outdated
@@ -1,19 +1,19 @@
#!/bin/sh
# Test that selection of checksum works as expected.

set -x
Remove this; it's for debugging only.
I tested this on a fast system (fast CPU, enough RAM and a fast SSD, but inside qemu).
Large files
Creating two files with zeros
With two 7G files, I get
so this is clearly an improvement for large files at least.
Small files
If I create very many small files with
which is not worse than before (good).
xxh builtin benchmark
This is on the Debian-provided xxh package. The builtin benchmark on the same system gives:
Taking the file system overhead into account,
Modifying rdfind buffer size
Increasing the buffer size inside rdfind from 4096 to 16K gets the time down from 2.4 to 1.7 seconds. Still a bit far from 1.1 seconds, but better.
Maybe we could choose the buffer size depending on the checksum method used, to make it as fast as possible.
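One way to picture this idea is a small lookup from checksum name to read-buffer size. The sketch below uses shell only for illustration (rdfind itself is not a shell program). The function name `buffer_size_for` is hypothetical, and the two sizes are simply the 4096 and 16K values from the experiment above, not tuned recommendations.

```shell
#!/bin/sh
# Hypothetical helper: pick a read-buffer size (in bytes) per checksum type.
# The sizes are placeholders taken from the experiment above, not tuned values.
buffer_size_for() {
  case "$1" in
    xxh128) echo 16384 ;;  # fast hash: the 16K size tried above
    *)      echo 4096 ;;   # the 4096 starting point mentioned above
  esac
}

for checksum in sha1 xxh128; do
  echo "$checksum: $(buffer_size_for "$checksum") bytes"
done
```

Real values would have to come from measurements on the target disks, since the best size depends on how fast the storage is compared with the hash.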
I also tested this and measured a speedup of roughly 1.6x (cold cache) to 5x (hot cache) on Fedora with an SSD and an Intel(R) Xeon(R) CPU X5670. The faster the disk, the bigger the speedup, until we hit the limit for xxhash. This is the test script:
cat <<'EOF' > test.sh
#!/bin/bash
TEST_DIR=testdir
mkdir -p "$TEST_DIR"
if [[ ! -f "$TEST_DIR/a" ]]; then
echo "creating test files in $TEST_DIR"
head -c $((1024*1024*500)) /dev/random >"$TEST_DIR/a"
cp "$TEST_DIR/a" "$TEST_DIR/b"
cp "$TEST_DIR/a" "$TEST_DIR/c"
cp "$TEST_DIR/a" "$TEST_DIR/d"
cp "$TEST_DIR/a" "$TEST_DIR/e"
fi
echo drop caches
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
echo "test coldcache sha1"
time ./rdfind -checksum sha1 -dryrun true -deleteduplicates true "$TEST_DIR"
echo drop caches
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
echo "test coldcache xxhash"
time ./rdfind -checksum xxh128 -dryrun true -deleteduplicates true "$TEST_DIR"
echo "test hotcache sha1"
time ./rdfind -checksum sha1 -dryrun true -deleteduplicates true "$TEST_DIR"
/usr/bin/time -f '%M kB' ./rdfind -checksum sha1 -dryrun true -deleteduplicates true "$TEST_DIR"
echo "test hotcache xxhash"
time ./rdfind -checksum xxh128 -dryrun true -deleteduplicates true "$TEST_DIR"
/usr/bin/time -f '%M kB' ./rdfind -checksum xxh128 -dryrun true -deleteduplicates true "$TEST_DIR"
EOF
$ sudo bash test.sh
drop caches
test coldcache sha1
(DRYRUN MODE) Now scanning "testdir", found 5 files.
(DRYRUN MODE) Now have 5 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 2621440000 bytes or 2 GiB
Removed 0 files due to unique sizes from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum: removed 0 files from list. 5 files left.
(DRYRUN MODE) It seems like you have 5 files that are not unique
(DRYRUN MODE) Totally, 2 GiB can be reduced.
(DRYRUN MODE) Now making results file results.txt
(DRYRUN MODE) Now deleting duplicates:
(DRYRUN MODE) delete testdir/b
(DRYRUN MODE) delete testdir/c
(DRYRUN MODE) delete testdir/d
(DRYRUN MODE) delete testdir/e
(DRYRUN MODE) Deleted 4 files.
real 0m9.722s
user 0m7.409s
sys 0m1.559s
drop caches
test coldcache xxhash
(DRYRUN MODE) Now scanning "testdir", found 5 files.
(DRYRUN MODE) Now have 5 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 2621440000 bytes or 2 GiB
Removed 0 files due to unique sizes from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on xxh128 checksum: removed 0 files from list. 5 files left.
(DRYRUN MODE) It seems like you have 5 files that are not unique
(DRYRUN MODE) Totally, 2 GiB can be reduced.
(DRYRUN MODE) Now making results file results.txt
(DRYRUN MODE) Now deleting duplicates:
(DRYRUN MODE) delete testdir/b
(DRYRUN MODE) delete testdir/c
(DRYRUN MODE) delete testdir/d
(DRYRUN MODE) delete testdir/e
(DRYRUN MODE) Deleted 4 files.
real 0m6.148s
user 0m1.375s
sys 0m2.175s
test hotcache sha1
(DRYRUN MODE) Now scanning "testdir", found 5 files.
(DRYRUN MODE) Now have 5 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 2621440000 bytes or 2 GiB
Removed 0 files due to unique sizes from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum: removed 0 files from list. 5 files left.
(DRYRUN MODE) It seems like you have 5 files that are not unique
(DRYRUN MODE) Totally, 2 GiB can be reduced.
(DRYRUN MODE) Now making results file results.txt
(DRYRUN MODE) Now deleting duplicates:
(DRYRUN MODE) delete testdir/b
(DRYRUN MODE) delete testdir/c
(DRYRUN MODE) delete testdir/d
(DRYRUN MODE) delete testdir/e
(DRYRUN MODE) Deleted 4 files.
real 0m7.499s
user 0m6.868s
sys 0m0.588s
(DRYRUN MODE) Now scanning "testdir", found 5 files.
(DRYRUN MODE) Now have 5 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 2621440000 bytes or 2 GiB
Removed 0 files due to unique sizes from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on sha1 checksum: removed 0 files from list. 5 files left.
(DRYRUN MODE) It seems like you have 5 files that are not unique
(DRYRUN MODE) Totally, 2 GiB can be reduced.
(DRYRUN MODE) Now making results file results.txt
(DRYRUN MODE) Now deleting duplicates:
(DRYRUN MODE) delete testdir/b
(DRYRUN MODE) delete testdir/c
(DRYRUN MODE) delete testdir/d
(DRYRUN MODE) delete testdir/e
(DRYRUN MODE) Deleted 4 files.
4104 kB
test hotcache xxhash
(DRYRUN MODE) Now scanning "testdir", found 5 files.
(DRYRUN MODE) Now have 5 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 2621440000 bytes or 2 GiB
Removed 0 files due to unique sizes from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on xxh128 checksum: removed 0 files from list. 5 files left.
(DRYRUN MODE) It seems like you have 5 files that are not unique
(DRYRUN MODE) Totally, 2 GiB can be reduced.
(DRYRUN MODE) Now making results file results.txt
(DRYRUN MODE) Now deleting duplicates:
(DRYRUN MODE) delete testdir/b
(DRYRUN MODE) delete testdir/c
(DRYRUN MODE) delete testdir/d
(DRYRUN MODE) delete testdir/e
(DRYRUN MODE) Deleted 4 files.
real 0m1.476s
user 0m0.812s
sys 0m0.655s
(DRYRUN MODE) Now scanning "testdir", found 5 files.
(DRYRUN MODE) Now have 5 files in total.
(DRYRUN MODE) Removed 0 files due to nonunique device and inode.
(DRYRUN MODE) Total size is 2621440000 bytes or 2 GiB
Removed 0 files due to unique sizes from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on first bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on last bytes: removed 0 files from list. 5 files left.
(DRYRUN MODE) Now eliminating candidates based on xxh128 checksum: removed 0 files from list. 5 files left.
(DRYRUN MODE) It seems like you have 5 files that are not unique
(DRYRUN MODE) Totally, 2 GiB can be reduced.
(DRYRUN MODE) Now making results file results.txt
(DRYRUN MODE) Now deleting duplicates:
(DRYRUN MODE) delete testdir/b
(DRYRUN MODE) delete testdir/c
(DRYRUN MODE) delete testdir/d
(DRYRUN MODE) delete testdir/e
(DRYRUN MODE) Deleted 4 files.
4044 kB
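To put the "real" times from these runs side by side, here is a minimal sketch that divides the sha1 time by the xxh128 time. The helper name `speedup` is made up for this example; the four numbers are copied from the log above.

```shell
#!/bin/sh
# speedup BASELINE_SECONDS NEW_SECONDS: prints baseline/new with two decimals.
# LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
speedup() {
  LC_ALL=C awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

# "real" times copied from the runs above (sha1 first, xxh128 second).
echo "cold cache: $(speedup 9.722 6.148)x"
echo "hot cache:  $(speedup 7.499 1.476)x"
```

This works out to about 1.58x with a cold cache and 5.08x with a hot cache, so the gain is largest when the disk is not the limiting factor.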
I also wrote a patch for setting the buffer size from the CLI; should I push that as well? It looks like 32 KB is optimal for xxhash with my disks/memory.
Testing buffersize 4096
test sha1
real 0m7.370s
user 0m6.730s
sys 0m0.596s
test xxhash
real 0m1.474s
user 0m0.842s
sys 0m0.623s
test xxhash memusage
4052 kB
Testing buffersize 8192
test sha1
real 0m7.176s
user 0m6.592s
sys 0m0.541s
test xxhash
real 0m1.105s
user 0m0.483s
sys 0m0.615s
test xxhash memusage
4048 kB
Testing buffersize 16384
test sha1
real 0m6.981s
user 0m6.398s
sys 0m0.543s
test xxhash
real 0m1.010s
user 0m0.463s
sys 0m0.541s
test xxhash memusage
4116 kB
Testing buffersize 32768
test sha1
real 0m7.147s
user 0m6.585s
sys 0m0.524s
test xxhash
real 0m0.954s
user 0m0.466s
sys 0m0.482s
test xxhash memusage
4192 kB
Testing buffersize 65536
test sha1
real 0m7.160s
user 0m6.595s
sys 0m0.527s
test xxhash
real 0m0.944s
user 0m0.451s
sys 0m0.486s
test xxhash memusage
4144 kB
Testing buffersize 131072
test sha1
real 0m7.067s
user 0m6.525s
sys 0m0.504s
test xxhash
real 0m0.999s
user 0m0.469s
sys 0m0.523s
test xxhash memusage
4240 kB
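A sweep like the one above could be driven by a loop along these lines. This is only a sketch: the script that produced the numbers is not shown in the thread, and `RDFIND`, `BUFFERSIZE_FLAG` and `RUN` are names invented here. In particular, the `-buffersize` flag name is an assumption about what the buffer-size patch exposes and should be checked against it. By default the sketch only prints the commands. With `RUN=1` it times them, which needs a shell with a `time` keyword (such as bash) or a `time` binary on the PATH.

```shell
#!/bin/sh
# Sketch of a buffer-size sweep. RDFIND, BUFFERSIZE_FLAG and RUN are
# assumptions made for this example; check the flag name against the patch.
RDFIND=${RDFIND:-./rdfind}
BUFFERSIZE_FLAG=${BUFFERSIZE_FLAG:--buffersize}
TEST_DIR=${TEST_DIR:-testdir}

# run_one CHECKSUM SIZE: time one dry run, or just print it unless RUN=1.
run_one() {
  set -- "$RDFIND" -checksum "$1" "$BUFFERSIZE_FLAG" "$2" \
    -dryrun true -deleteduplicates true "$TEST_DIR"
  if [ "${RUN:-0}" = "1" ]; then
    time "$@" >/dev/null
  else
    echo "would run: $*"
  fi
}

# Walk the same buffer sizes as the results above, for both checksum types.
sweep() {
  for size in 4096 8192 16384 32768 65536 131072; do
    echo "Testing buffersize $size"
    for checksum in sha1 xxh128; do
      echo "test $checksum"
      run_one "$checksum" "$size"
    done
  done
}

sweep
```

Printing the commands first makes it easy to confirm the flag spelling before spending time on real runs.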
I updated the patch set because I had forgotten to update the help string.
I made a new pull request for the buffer size: #177
Thanks for this work, I intend to merge this feature.
I made a branch here: https://github.com/pauldreik/rdfind/tree/trollkarlens_xxhash to test out the changes I want (work in progress, but you can at least see the autoconf change which makes it optional). I am totally fine with fixing this myself if you don't want to!
I updated the patch set to reflect the suggestions in #167 (comment); please review. The hard part was the ifdefs, and I tried to keep them to a minimum.
run: WITH_XXHASH=1 make check
- name: WITH_XXHASH=1 make distcheck
Why is WITH_XXHASH=1 here?
UPDATE: aha, it was for testing. OK, see the comment there.
testcases/checksum_options.sh
Outdated

allchecksumtypes="md5 sha1 sha256 sha512"

if [ "$WITH_XXHASH" = "1" ]; then
There must be some better way to detect this than using the environment variable.
Not sure if autoconf can help here.
I did not find any good way to get/set that from autoconf, but if you know one, please advise.
The best would be if this was set as an environment variable in the Makefile directly when HAVE_LIBXXHASH is set, but I could not find a way to do that.
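One alternative to the environment variable, sketched here under assumptions, is to ask the built binary itself. The helper `supports_checksum` and the `RDFIND` variable are hypothetical names. The sketch also assumes that rdfind exits non-zero when given a checksum type it was not built with, and zero for a dry run on an empty directory; neither is confirmed in this thread.

```shell
#!/bin/sh
# supports_checksum RDFIND_BINARY CHECKSUM
# Returns 0 if the binary accepts the checksum type, non-zero otherwise.
# Assumption: rdfind exits non-zero for a checksum it does not support.
supports_checksum() {
  probe_dir=$(mktemp -d) || return 2
  "$1" -checksum "$2" -dryrun true -makeresultsfile false "$probe_dir" \
    >/dev/null 2>&1
  status=$?
  rmdir "$probe_dir"
  return "$status"
}

allchecksumtypes="md5 sha1 sha256 sha512"
if supports_checksum "${RDFIND:-rdfind}" xxh128; then
  allchecksumtypes="$allchecksumtypes xxh128"
fi
echo "$allchecksumtypes"
```

The probe scans an empty temporary directory, so it costs almost nothing, and a missing binary simply counts as "not supported".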
@@ -21,7 +27,7 @@ if [ ! -e speedtest/largefile1 ] ; then
fi

-for checksumtype in md5 sha1 sha256; do
+for checksumtype in $allchecksumtypes; do
wow, sha512 was missing before! nice to get that fixed.
Maybe move allchecksumtypes to common_funcs.sh so it's the same everywhere?
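A shared definition could look roughly like the sketch below. The function name `checksum_types` is hypothetical, and the sketch assumes both that each test script sources common_funcs.sh and that the `WITH_XXHASH` branch in the patch appends `xxh128` (the body of that branch is not shown above).

```shell
# Possible addition to common_funcs.sh (hypothetical helper name).
# Prints the checksum types the tests should cover, one list for all scripts.
checksum_types() {
  types="md5 sha1 sha256 sha512"
  # Same condition as in checksum_options.sh; appending xxh128 is an assumption.
  if [ "${WITH_XXHASH:-0}" = "1" ]; then
    types="$types xxh128"
  fi
  echo "$types"
}

# In a test script, after sourcing common_funcs.sh:
for checksumtype in $(checksum_types); do
  echo "would test $checksumtype"
done
```

With a single definition, a checksum type cannot be left out of one test loop by accident, which is what happened with sha512.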
Signed-off-by: Robert Marklund <[email protected]>
add a very very fast non cryptographic checksumming library https://xxhash.com Signed-off-by: Robert Marklund <[email protected]>
Add the very fast but non-cryptographic checksum library xxHash, which is perfect for this application.
Try to fix an issue in the checksum test that kept the checksum step from running because the test files were too small.