Improving the Cryptographic Integrity of a Canonical Bindle Invoice #292
Comments
This makes total sense to me. Do you want to add an example? Also, I think we might want to define how to determine the order of parcel fields. You show them in alpha order... that seems right to me. |
@technosophos As an implementer, I would find an example useful! @fibonacci1729 When you talk about "the merkle root," am I right in understanding you are referring to the input of the invoice signing specification (https://github.com/deislabs/bindle/blob/main/docs/signing-spec.md#signing-on-the-invoice)? So this is describing an improvement to the signing algorithm, but not to any other part of the invoice structure or wire protocol? (Forgive me if this is obvious!) |
@itowlson That is correct! The invoice structure is unchanged. I am just proposing we include certain omitted invoice metadata into the signing operation (specifically the construction of the logical canonical invoice). |
@technosophos Do you have any strong objections to dropping the signing information from the canonical invoice, namely |
I'll summarize here, and if we need to go into more details I can do so. But the idea with including that info was that I want to assert not just something about the content, but about the relationship between the content and the entity that signed it.

As pretext: When I verify a bindle, I can choose a variety of methods to apply. I can choose, for example, that I will only accept packages that were created by an entity that I trust. Or I will only trust packages that were hosted by an entity that I trust. Or I could make a really stringent claim like "I will only accept bindles where I can verify every single signature on the bindle". But it is up to the end user to decide what their verification rules are.

The idea of a role, in Bindle, is to allow an entity to assert (with constraints) that they are acting in a particular role. A creator is the one who first created the bindle. So they sign with that key. When a bindle is uploaded to a host, the host signs as a host role (and also verifies that there is exactly one creator whose public key is known to the host). [Note that the creator key verification is currently disabled, but will get enabled once we have all of this ironed out.]

That brings us to the question you asked. When I sign, I am making the assertion that "I, Matt Butcher, am signing as a Proxy at , and this is what the package looks like when I sign." During verification, you should be able to make the assertion that "When Matt Butcher signed at X with role Y, the package looked the same as it looks right now to me."

It is probably the case that the timestamp could be dropped. I'm not sure that it buys us too much when it's in the signature block. The point there was to make it possible to construct an audit trail where one said "Matt's key was compromised at X, so don't trust any packages that Matt signed after X." But admittedly that is a leaky case anyway, and probably shouldn't be considered a good idea. |
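The three verification rules described in the comment above can be sketched as interchangeable policy functions. This is only an illustration: the `Signature` type, its fields, and the function names are hypothetical and are not taken from the Bindle codebase, and the cryptographic check itself is assumed to have happened already and is represented by the `verified` flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    by: str         # signer identity (hypothetical field)
    role: str       # e.g. "creator", "host", "proxy"
    verified: bool  # outcome of checking the signature against a known public key


def creator_trusted(sigs, trusted):
    """Accept only if a verified creator signature comes from a trusted entity."""
    return any(s.role == "creator" and s.verified and s.by in trusted for s in sigs)


def host_trusted(sigs, trusted):
    """Accept only if a verified host signature comes from a trusted entity."""
    return any(s.role == "host" and s.verified and s.by in trusted for s in sigs)


def every_signature_verified(sigs, trusted):
    """The stringent rule: every signature on the bindle must verify."""
    return bool(sigs) and all(s.verified for s in sigs)


sigs = [
    Signature(by="alice", role="creator", verified=True),
    Signature(by="example-host", role="host", verified=False),
]
trusted = {"alice"}
print(creator_trusted(sigs, trusted))           # True
print(host_trusted(sigs, trusted))              # False
print(every_signature_verified(sigs, trusted))  # False
```

The end user picks which of these functions to apply, which matches the point that the verification rules are the verifier's decision and not the signer's.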
I understand the need for a canonical invoice. What I don't understand is why Bindle wants to have non-canonical invoices. Clients will need to grok the canonical invoice in order to validate the crypto. And they should never trust the semantic data without validating the crypto. This means that all clients need to understand the canonical invoice plus some other format. And further, because Bindle doesn't distribute the canonical invoice, the clients first need to parse another format before validating the cryptographic authenticity of that format. This means clients whose parsers have security issues are parsing untrusted input. This is the cause of nearly all the most serious security issues. A much stronger cryptographic model is for Bindle to only ever distribute the canonical invoice. Clients can validate its cryptographic integrity before parsing, reducing the exposure to security-sensitive parser bugs. |
Hey @npmccallum! Thank you for the input! This proposal is simply trying to formalize the ordering and set of relevant metadata that imply the semantic identity of a You can't achieve semantic identity of a structured input that can't be normalized (e.g. D1 It's obvious these are semantically equivalent yet structurally different byte-for-byte, i.e. Therefore, we can't rely on the structural identity of an To achieve semantic identity we need to define an ordering of elements of which comprises the semantically relevant metadata. The semantically relevant metadata of D1 and D2 is "x" and "2". Naturally, Could you open a separate issue about the client working with the canonical invoice rather than the |
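The distinction between structural and semantic identity drawn in the comment above can be illustrated with a minimal sketch. It assumes D1 and D2 are two JSON spellings of the same key/value pair ("x" and 2); the concrete documents, the function names, and the choice of SHA-256 are illustrative assumptions and are not part of the Bindle spec.

```python
import hashlib
import json

# Two hypothetical documents with the same meaning but different bytes.
d1 = '{"x": 2}'
d2 = '{   "x" :\n  2 }'


def structural_digest(text: str) -> str:
    """Hash the bytes exactly as received."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def semantic_digest(text: str) -> str:
    """Parse, re-serialize with a fixed key order and separators, then hash."""
    canonical = json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


print(structural_digest(d1) == structural_digest(d2))  # False
print(semantic_digest(d1) == semantic_digest(d2))      # True
```

Hashing the raw bytes distinguishes the two documents, while hashing a normalized form with a defined ordering treats them as the same input.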
@fibonacci1729 What use case requires semantic identity? |
I think we discussed the semantic identity thing at length on a call a while back, but here's the summary:
To flip the question the other way... how do you envision yanking, signing, storage, etc. working if we hash the bytes? I don't see a way to do those things without introducing a new set of content types. (Yanking, signing, and verifying are not optional for our use cases, so we can't simply punt on them.) |
If all the clients must understand the canonical invoice content type in order to validate the negotiated content type then the negotiated content type doesn't have any value and adds additional complexity. Do the existing use cases need to learn about the canonical encoding in order to validate the negotiated content type? If so, then they don't need TOML/JSON.
Combining mutable and immutable data in the same file seems to me like a choice fraught with security issues. It makes it very difficult for clients to know which data is reliable and which isn't. Security issues will arise from this as clients misunderstand which data is which. IMHO, these should be cleanly delineated to avoid confusion and compromise.
This is an implementation detail. The RDBMS data can always be generated from the on disk invoice. It is an optimization cache, not the canonical record.
Today, an invoice is a mixture of semantic and non-semantic data (SHA256 hashes are non-semantic). The invoice also contains semantic data that confounds semantic equivalence. For example, two bindles that are semantically identical except for their name/version appear to Bindle as two distinct entities. This strikes me as actually contrary to the goal of determining semantic equivalence. Researchers have been studying semantic equivalence and isomorphism to cryptography for decades now. This remains an unsolved problem. Deterministic encodings are the best solution we have. But trying to force semantic equivalence on a mix of semantic and non-semantic data doesn't actually solve this problem.
Signing is just: Names and versions don't belong in the invoice especially if you're trying to do semantic equivalence. Two bindles with differing names, versions and signatures that are otherwise identical are semantically equivalent. And legal entities can, and have, forced renaming. If I've been deploying a bindle for years and a court orders me to change its name, under Bindle's current design I can no longer test for historical semantic equivalence. On the other hand, the signing as I've proposed it retains this property. We should distinguish between yanking and deleting. Both need to be supported.
|
Just catching up here. @npmccallum some elaborations.
This I need to hear more about. I like your explanation of how you see it when you say:
but when you get to
For example, I do not foresee forced renaming; a court may decide to argue that an invoice be deleted and recreated for naming reasons, but renaming a thing that can only be recreated differently seems fine by me. When you want semantic equivalence, you are arguing that naming should be a contrivance; I disagree, at least right now. It is true that otherwise the bindle would be "semantically equivalent", but I'm trying to understand what feature that gets you when you determine that by ignoring the names/versions. I'm assuming that the feature there is "I can do the semantic equivalence without parsing a subtree of some sort?" If so, I would think that we need to rethink the relationship between the bindle identity and computational semantic equivalence. If I understand correctly, the current proposal seeks to make the bindle the object of semantic equivalence, whereas you're arguing we should remove the name and version from that calculation. Is that ☝️ understanding correct, to your way of thinking? |
As to deletion, deletion is done by anyone who owns an instance of the server, and therefore must be a public API. Deleting is a thing anyone may choose to do with their own instance. Or may wish to offer. We might profitably assert that a delete call that could trigger a cascading effect might error out, but have an override (say, if someone didn't care whether another user broke). Most prominent hosted versions might not expose this, I entirely agree. But a producer must not be prevented by the oss version from deleting their material. If someone else wanted it, they can use it another way. I was not aware that users of other people's code must be protected against the other producer deciding enough is enough and deleting (not merely yanking). With great power of usage comes great responsibility, IMHO. LeftPad 4 evah. Is that an objective? I find the DMCA a great example of external force being applied in a "lawful" way. But a single developer deciding that the artifact they built and control must be deleted now for no reason at all I find a completely reasonable scenario. Or perhaps I'm misunderstanding the deletion issue, which is also possible. I would imagine, however, that a prominent service might have different terms of service, and that's where I think that division of responsibility lies, or should lie. |
I'm not sure in the end whether these arguments are about this PR, or about how this project should work more generally. As name and version are already there, this PR doesn't modify that objection in any way that I can see. And as there were mutable and immutable values in the invoice over which bindle worked anyway, I'm not clear how that is modified by this PR. If these thoughts are correct, I'd support implementing this approach and then opening issues on where mutable values should go if they should not live in the invoice along with immutable values as well as whether name/version should be handled a different way. |
The Problem
TL;DR: The canonical invoice omits semantic metadata relevant to the integrity of a `bindle`, thus losing the guarantee of preserving the signed intent.

Currently, from the bindle spec, the merkle root of an invoice is computed by hashing the concatenation of the following pieces of data together in a line-separated (`\n`) UTF-8 string: `by`, `name`, `version`, `role`, `at`, and the `label.sha256` of each parcel.

E.g.:

What is not included in this construction is the integrity of the metadata that is semantically relevant (i.e. `groups`, `features`, `conditions`, `annotations`, and parcel metadata).

A possible attack vector arises for injecting/deselecting parcels from a `bindle`, whereby an attacker that compromises a `Bindle` can change a `bindle`'s invoice to augment how parcels are selected without corrupting the `bindle`'s merkle root.

To preserve the cryptographic integrity of the invoice, all relevant semantic information should be included in the canonical invoice before computing the merkle root digest.

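The gap described above can be shown with a minimal sketch. It assumes the canonical string is exactly the newline-joined `by`, `name`, `version`, `role`, `at`, followed by each parcel's `label.sha256`, and that the digest is SHA-256; the function names, signer details, parcel names, and hash values are made up for illustration and are not the reference implementation.

```python
import hashlib


def current_canonical_invoice(by, name, version, role, at, parcels):
    """Join the signed fields, then each parcel's label.sha256, with newlines."""
    lines = [by, name, version, role, str(at)]
    lines += [p["label"]["sha256"] for p in parcels]
    return "\n".join(lines)


def digest(canonical: str) -> str:
    """Assumed digest: SHA-256 over the UTF-8 bytes of the canonical string."""
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


signer = ("Example Signer <signer@example.com>", "example/bindle", "0.1.0", "creator", 1611960337)

parcels = [
    {"label": {"sha256": "aaaa", "name": "server.wasm"}, "conditions": {"memberOf": ["server"]}},
    {"label": {"sha256": "bbbb", "name": "cli.wasm"}, "conditions": {"memberOf": ["cli"]}},
]

# An attacker swaps how parcels are selected but leaves the content hashes alone.
tampered = [
    {"label": {"sha256": "aaaa", "name": "server.wasm"}, "conditions": {"memberOf": ["cli"]}},
    {"label": {"sha256": "bbbb", "name": "cli.wasm"}, "conditions": {"memberOf": ["server"]}},
]

before = digest(current_canonical_invoice(*signer, parcels))
after = digest(current_canonical_invoice(*signer, tampered))
print(before == after)  # True: the change to conditions is invisible to the digest
```

Because `conditions` never enters the canonical string, the existing signature still verifies after the selection rules have been rewritten.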
Proposed Solution
TL;DR: When creating the canonical invoice of a `bindle`, include all semantically relevant metadata when computing the merkle root digest.

As currently defined by the spec, the canonical invoice of a `bindle` is structured as the concatenation of:

Omitting top-level metadata `groups` and `annotations`, as well as the following parcel metadata: `annotations`, `conditions`, `features`, `media-type`, `name`, and `origin`.

To guarantee integrity of an invoice, we should include these omitted pieces of metadata.

I.e.,

Where lines marked with `*` are the proposed additions.

Reproducibility & Ordering
Reconstructing the canonical invoice from a `bindle`'s `invoice.toml` should be reproducible. To achieve this, the properties in each "block" (delimited by `\n~\n`) should appear in lexicographic order. Likewise, the blocks themselves are ordered lexicographically by each `parcel`'s `content-digest`, e.g. "DEADBEEF...".

Example
WIP
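Until a worked example is filled in, the ordering rules above can be sketched as follows. This is a sketch under several assumptions that the proposal does not settle: properties are serialized as `key=value` lines with dotted keys for nested tables, list items are keyed by index, the `content-digest` is taken to be `label.sha256`, and the digest is SHA-256. All names and values are hypothetical.

```python
import hashlib


def flatten(prefix, value):
    """Yield (dotted_key, string_value) pairs for nested tables and lists."""
    if isinstance(value, dict):
        for k, v in value.items():
            yield from flatten(f"{prefix}.{k}" if prefix else k, v)
    elif isinstance(value, list):
        for i, v in enumerate(value):
            yield from flatten(f"{prefix}.{i}", v)
    else:
        yield prefix, str(value)


def block(props: dict) -> str:
    """One block: properties as key=value lines in lexicographic key order."""
    return "\n".join(f"{k}={v}" for k, v in sorted(flatten("", props)))


def proposed_canonical_invoice(header: dict, parcels: list) -> str:
    """Header block first, then parcel blocks ordered by digest, joined by \\n~\\n."""
    ordered = sorted(parcels, key=lambda p: p["label"]["sha256"])
    return "\n~\n".join([block(header)] + [block(p) for p in ordered])


def digest(canonical: str) -> str:
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


header = {
    "by": "Example Signer <signer@example.com>",
    "name": "example/bindle",
    "version": "0.1.0",
    "role": "creator",
    "at": 1611960337,
    "annotations": {"example.com/note": "demo"},
    "group": [{"name": "server", "required": True}],
}
p1 = {
    "label": {"sha256": "aaaa", "name": "server.wasm", "mediaType": "application/wasm"},
    "conditions": {"memberOf": ["server"]},
}
p2 = {
    "label": {"sha256": "bbbb", "name": "cli.wasm", "mediaType": "application/wasm"},
    "conditions": {"memberOf": ["cli"]},
}

canonical = proposed_canonical_invoice(header, [p2, p1])
print(canonical)
print(digest(canonical))
```

With this construction, reordering parcels or table keys in `invoice.toml` leaves the canonical string unchanged, while editing `conditions` changes the digest. A real encoding would also need to rule out values that contain `\n` or `~` lines, which this sketch does not handle.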