Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to specify a checksum while adding a file for integrity testing #7071

Open
RubenKelevra opened this issue Apr 1, 2020 · 10 comments
Open
Labels
kind/feature A new feature need/community-input Needs input from the wider community

Comments

@RubenKelevra
Copy link
Contributor

RubenKelevra commented Apr 1, 2020

When files are shared online, there's often a hash published with it. This allows checking if the integrity was compromised on the mirror server or while downloading.

When I import a file via an URL directly to IPFS it's impossible to check if the file got the correct hash. It would be nice if we can specify the expected hash and at the end of the operation the file is either added or an error message will be displayed.

It makes sense to support at least md5, sha1, sha256 and sha512.

@RubenKelevra RubenKelevra added the kind/feature A new feature label Apr 1, 2020
@RubenKelevra RubenKelevra changed the title Ability to specify a md5/sha1/sha256/sha512 sum while adding a file Ability to specify a checksum while adding a file for integrity testing Apr 1, 2020
@hsanjuan
Copy link
Contributor

hsanjuan commented Apr 1, 2020

When I import a file via an URL directly to IPFS it's impossible to check if the file got the correct hash

Actually in a oneliner:

ipfs add -Q url | ipfs cat | sha256sum

Does not seem like this would need to be a core feature?

@RubenKelevra
Copy link
Contributor Author

When I import a file via an URL directly to IPFS it's impossible to check if the file got the correct hash

Actually in a oneliner:

ipfs add -Q url | ipfs cat | sha256sum

Does not seem like this would need to be a core feature?

This would result in reading the whole file again from the disk. I was planning to import some files in the 80 GB region in the future, I was hoping for a solution which doesn't require to reread the whole file.

Additionally, it would double the traffic when used with --nocopy which I plan to use on some of the older files of the archive - since my hard disk space is limited to 2 TB.

@hsanjuan
Copy link
Contributor

hsanjuan commented Apr 2, 2020

ok, for anything other than nocopy you do something like:

curl https://1.1.1.1 | pee "ipfs add" "sha256sum"

to avoid re-reading. For the nocopy case I don't think there is a way to get around it.

@ribasushi thoughts? A very wild guess is the adder could attach an additional "regular hash calculation" to the output object it emits for every added file if configured to do so...

@ribasushi
Copy link
Contributor

@hsanjuan tehoretically yes - a standard "in-stream checksum" tally could be useful for both CLI and HTTP api calls. The practical trouble implementing this goes back to "redesigning the prerun/run/emit/postrun" discussion in ipfs/go-ipfs-cmds#115 / #7050 . We need a "Post-0.5-brain @Stebalien" to design this and have it implemented. I could take a stab at that now that I have a better idea how the entire system works. But again - design is 95% of the difficulty here.

For the time being your suggestion with various tee-like programs is the best one can do.

@hsanjuan
Copy link
Contributor

hsanjuan commented Apr 2, 2020

@ribasushi I don't mean making this part of the command handling, but modifying somewhere around https://github.com/ipfs/go-ipfs/blob/master/core/coreunix/add.go#L399 to duplicate the reader, calculate the hash and end up emitting (https://github.com/ipfs/go-ipfs/blob/master/core/coreunix/add.go#L251) the regular hash here, along with the usual notification object.

@ribasushi
Copy link
Contributor

If this were only valuable for add - sure. But because this is useful for any PUT call, it makes sense to generalize it instead of keeping adding ad-hoc parts here and there.

On the other hand - yes, your suggestion is very quick to implement. Though I would not want it on by default. Also - which hash? SHA256 is pretty expensive to calculate as it is...

@hsanjuan
Copy link
Contributor

hsanjuan commented Apr 2, 2020

If this were only valuable for add - sure. But because this is useful for any PUT call, it makes sense to generalize it instead of keeping adding ad-hoc parts here and there.

So here we are not getting the sum for the PUT request body though, but for each individual file added with it (every part of the multipart I guess)...

On the other hand - yes, your suggestion is very quick to implement. Though I would not want it on by default. Also - which hash? SHA256 is pretty expensive to calculate as it is...

Disabled by default, configurable sum function.

@mohe2015
Copy link
Contributor

mohe2015 commented Apr 2, 2020

Depending on what hash type you need you may also be able to extract its value from the cid. But I think this only works for sha256 or so

@RubenKelevra
Copy link
Contributor Author

RubenKelevra commented Apr 2, 2020

SHA256 is pretty expensive to calculate as it is..

This depends on the features of the system, a skylake processor is just half as fast on sha2 than on sha1 for example. (according to Wikipedia).

Since I need to run this anyway, it's a necessary calculation. In comparison to doing an additional copy of the data in memory to a different process, a native implementation in ipfs will be much faster. It will also avoid a lot of context switching between processes.

@Stebalien Stebalien added the need/community-input Needs input from the wider community label Apr 2, 2020
@Stebalien
Copy link
Member

While I can see how this feature could be useful, that does not outweigh the cost in both implementing and maintaining it. Every feature is something we need to document, fix, and preserve through refactors.

There's no reason not to leave this issue open for further discussion and no reason not to consider it in the future, but we likely won't tackle this any time soon, if ever.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature A new feature need/community-input Needs input from the wider community
Projects
None yet
Development

No branches or pull requests

5 participants