-
Notifications
You must be signed in to change notification settings - Fork 83
Hash Testing
Here are some numbers I got running against the same dataset with sha256 (gcrypt), xxhash and murmur3 from the hashes-update branch. Each run came up with the same total number of unique hashes, which is encouraging to see.
Each run was with 16 threads, writing to a hash file on a different partition. I rebooted between runs. SHA256 is our baseline (and the default in duperemove). Both the xxhash amd murmur3 patches aren't quite in mergeable state yet.
SHA256 via libgcrypt:
time duperemove.git/duperemove -rh --hash-threads=16 --write-hashes=hash-logs/hashes-gcrypt.dup /btrfs/
real 26m25.251s
user 78m38.612s
sys 4m52.848s
murmur3:
time duperemove.git/duperemove -rh --hash-threads=16 --write-hashes=hashes-murmur3.dup /btrfs/
real 26m19.224s
user 15m56.144s
sys 5m16.848s
xxhash, with changes (fills the 32 byte digest by splitting the input buffer into chunks for hashing):
time duperemove.git/duperemove -rh --hash-threads=16 --write-hashes=hash-logs/hashes-xxhash.dup /btrfs/
real 26m11.466s
user 1m11.408s
sys 9m57.720s
Some observations:
-
On my system, the overall time to complete the scan + checksum was dominated by I/O (and not cpu), as a result total time to actually run the test didn't change. For a system with slow cpu this might be different (I haven't tested though).
-
The difference in cpu usage is still worth having alternatives to the default hash. sha256 is super heavy for what we're doing. Duperemove can tolerate some small number of false positives - the kernel is always checking data for us anyway.
Here are some numbers for a direct xxhash encoding.
time duperemove.git/duperemove -rh --hash-threads=16 --write-hashes=hash-logs-2014-11-24/hashes-xxhash-direct.dup /btrfs/
real 26m17.420s
user 10m1.050s
sys 5m4.412s
The direct encoding version is so that I can compare our implementation with the reference, and also so that the digest returned from a running checksum would be the same as a one-shot call.
I have to investigate why CPU usage went up. That said, the extent results look ok - we need more testing though.