-
-
Notifications
You must be signed in to change notification settings - Fork 744
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose hash over ChunkListEntry to borg list --format {chunk_ids_$HASH}
#5167
base: master
Are you sure you want to change the base?
Conversation
borg list --format {chunk_ids_$HASH}
borg list --format {chunk_ids_$HASH}
borg list --format {chunk_ids_$HASH}
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5167 +/- ##
==========================================
- Coverage 83.98% 83.86% -0.12%
==========================================
Files 37 37
Lines 9839 9854 +15
Branches 1638 1640 +2
==========================================
+ Hits 8263 8264 +1
- Misses 1097 1110 +13
- Partials 479 480 +1 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the PR!
yeah, I see the use case. but i am somehow concerned that a lot of people might use it incorrectly, draw wrong conclusions and file support tickets if outcome is not as they expected...
the problem is that chunk lists (or hashes of chunk lists) can only be compared if can_compare_chunk_ids is True (see do_diff). as we do not know how people use the chunk list hashes that borg list would display, we can't be sure they are doing valid comparisons.
otoh, chunker params usually do not change that often and also false positives are unlikely (different files showing same hash). but false negatives could happen (same files showing different hash due to different chunker params - or even different hash algo as you let people choose it).
Good point. I will try to add a hashsum for the chunker parameters itself as well, and add a note in the format key documentation that it's only valid to compare if those are equal.
Is there anything here that could cause false positives besides the usual hash algo collisions? Are those more likely here?
Hmm, why can the choice of hash algo influence false negatives? One more question: this is doing the correct thing, isn't it? If chunker parameters are equal, equal checksums of the chunk ids would mean content of item/file is identical? Can you maybe write a test for this, or point me to a test that I can copy&modify that tests similar functionality? |
Yes, i think it is correct. The |
src/borg/helpers/parseformat.py
Outdated
@@ -688,7 +688,8 @@ class ItemFormatter(BaseFormatter): | |||
('size', 'csize', 'dsize', 'dcsize', 'num_chunks', 'unique_chunks'), | |||
('mtime', 'ctime', 'atime', 'isomtime', 'isoctime', 'isoatime'), | |||
tuple(sorted(hash_algorithms)), | |||
tuple(('chunk_ids_%s' % alg for alg in sorted(hash_algorithms))), | |||
tuple(['chunk_ids_%s' % alg for alg in sorted(hash_algorithms)] + [ | |||
'chunker_params_%s' % alg for alg in sorted(hash_algorithms)]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the chunker params are same within 1 archive, so guess you do not want to show them for each file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, true, but this would make parsing the output of list harder. As it is now, you can just compare (item['chunk_ids_checksum'], item['chunker_params']) after parsing the output for each item.
def hash_item(self, hash_function, item): | ||
if 'chunks' not in item: | ||
return "" | ||
def prepare_hash_function(self, hash_function): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not "prepare" the hash function in any way, it just picks one by its name, which is what one passes into it, rather than a function, thus naming this something like get_hash_function(self, name)
would be less misleading and more self-explanatory.
While ideally, a hash that could be quickly generated without having to look at the file content that still guarantees the content to be identical (collisions aside) would of course be what one would prefer, in practice it is not possible. Possible are the existing hashes/checksums that are slow due to looking at the file content. Also possible is this shortcut of quiclky getting the list of chunks, that, once established (by other means) that the actual file content has been put into these chunks and that (by If however one does rechunk a repo, each chunk is read, the data is processed (encryption, compression) and written to a new, different chunk, then without another separate file content verifycation, there is no guarantee that no bit error has happened in that process, so to me it is the desired behaviour for a hash based on the list of chunks to be different, because it is different chunks. |
Yes, my first thought as well, plus for people that do change them at some point, seeing which ones (ie. the current or older ones) where used for that file of list output is a benefit as well.
Minus the potential extra "collision", if two chunks of different data due to different chunker params end up with the same hash (which for a sigle chunk file would be enough). That could be prevented by doing the hash over the chunker params and the chunk IDs, calling it something like
The exact same could be said of the various existing hashes of file contents, yet noone (I hope 😉) would pose this as a valid argument to remove these/substitute them by a single one, thus I do not see this as a valid argument here. |
I meanwhile have the functionality I was after, ie. list of actual chunks, not a hash over that, implemented in shell using
Not sure yet, why borg debug inserts that unicode DELETE char in front of each chunk ID, but above command produces the same output as my own (hackish) patch adding a chunkids format key to list did. 😃 |
https://github.com/borgbackup/borg/blob/1.1-maint/src/borg/helpers.py#L1240 comment about special character. |
Due to @solrize asking about it on IRC, I have thought about such a "full file hash" (computed over the chunkids list) again. Guess, as it is now, the code here can only be used to compare files in the same repo, IF chunker_params are the same. With borg2, we'll get the concept of "related repositories" that share the same chunker secret and the same id-hash secret, thus (if chunker params are the same also), files will get chunked exactly the same way and the chunk ids will also be computed the same way. So we could quickly determine identical files in different related repos if these conditions are met.
Note:
H could be one specific cryptographic hash. Considering sha256 is hw accelerated in modern CPUs and data isn't that much anyway, we could just take that. Also: we could store the fingerprint into item metadata at |
Rationale: because checksums are so slow, and
borg diff
is much faster, I tried to expose toborg list
a hash over the data thatborg diff
seems to use to compare items