Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IPIP-431: Opt-in Extensible CAR Metadata on Trustless Gateway #431
base: main
Are you sure you want to change the base?
IPIP-431: Opt-in Extensible CAR Metadata on Trustless Gateway #431
Changes from 7 commits
2eb7b9e
3814b6a
cb1d8b3
1c0fbaa
a7e75d7
68715c4
ed86a0f
5056bde
9170c29
65ffcfc
9d1b61f
93c3c28
eacf51a
62fb207
72ed04c
e1fc296
152f4a6
b6069bf
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These
b3*
fields are specific to SPARK retrieval attestation and should not be listed in the trustless gateway spec as a MUST. These may be mandatory for SPARK, but are optional for the rest of IPFS ecosystem.Please move them to "User benefit" section of the IPIP document and explain how
meta=eof
enables SPARK use case by allowing for these custom signatures to be passed along with the data. It makes a good example of extensibility that does not require PL's permission.ps. I know other services like dagHouse use different hash functions for getting "CAR CID", putting all bets on Blake3 feels like an unnecessary divergence.
Perhaps this could be made bit more future-proof and generic if blake3 is represented as Multihash wrapped in CIDv1+car codec (0x0202)? Just an idea, fine to ignore, given these are specific to SPARK.
Either way, this belongs to the "userland benefiting from metadata extensibility" story.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the Spark use case belongs in Userland section. However, the individual keys of the metadata section, and what the servers must do to implement them, feels like something that should be in the trustless gateway spec.
Keys like
car_bytes
,data_bytes
,block_count
will be used by Spark but also may be used by others, and the definition of a key (i.e. what does the server actually return as the value for each key) must be the same for each use case. E.g. if one use case setsdata_bytes
to be the total byte length of blocks and another use case sets it to be the total byte length of the CAR stream then trustless gateway implementers will need to implement different logic for each different use case.What's more, for the Spark use case, we do not want gateway operators to know that they are serving a Spark request and not some other request. Since the Spark ones will be incentivised and other request may not be, servers may simply provide a good retrieval service to Spark clients and a poor service to other clients.
car_bytes
,data_bytes
,block_count
seem generic enough. The troublesome one is then the Blake3Hash and signature.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps what we need to do is leave this IPIP to concern the metadata block being appended without any constraints on what can be included in it. Then in a separate place, we define a canonical way to include a key value object in the metadata block and how the server should implement certain useful keys such as
car_bytes
et alThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be ok to suggest a key name convention for generic things like
car_bytes
data_bytes
andblock_count
in the section that describedmeta
parameter, as long it ismeta=eof[+json]
?)I think it would be also ok to have a documented convention for passing a hash of the CAR stream (aka CAR CID) – maybe name it
car_cid
and use CIDv1 with 0x0202 codec – this convention is already used by .storage folks, no need to invent anything new.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to document
car_cid
.For SPARK, we specifically want a Blake3 hash so that we can use inclusion proofs. That's why we want to use a dedicated field
b3checksum
instead of a more genericcar_cid
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Including content path and CAR export parameters feels generic enough to keep in the spec, but we should not mix content path with car and url parameters as it leads to bugs around things like percent-encoding especially where
?
or/
is involved (cc ipfs/gateway-conformance#115).These should be three separate fields:
content_path
- requested content pathdag_params
- map with DAG params likedag-scope
,entity-bytes
from IPIP-402car_params
- map with CAR content type params likeorder
anddups
from IPIP-412There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lidel what is the reasoning behind splitting up
dag_params
andcar_params
? Could we instead go forcontent_path
andquery_params
to keep it simple and generic (allowing for other query params)?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that was a conscious design choice, to avoid mixing data selector with details of the transport format (not everything is in URL query params):
dag_params
are about what user data was selected, not tied to any specific transport (could be applied to something other than CAR)car_params
are specific to CAR container format, they do not change the user data that was selected, only the way it is represented when sent as CAR@patrickwoodhead that being said, if you want to simplify, this IPIP could go with a single dag-json map named
response_params
(to avoid confusion with URL query params, and account for the fact that server may ignore some of request params when producing a response)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using
retrieval_params
instead ofresponse_params
? In my mind, response parameters don't describe "what user data was selected".What is our motivation in SPARK:
We want the gateway to describe what exactly the client is retrieving (CID, subpath, dag params, car params) and provide a signature over that.
If two SPARK checker clients submit a metadata block with the same retrieval parameters (CID, subpath, dag params, car params) then we want:
When the client requests
GET /ipfs/bafy1234/cat.jpg
, what iscontent_path
?/ipfs/bafy1234/cat.jpg
bafy1234/cat.jpg
/cat.jpg
We need the metadata block to describe both the CID requested (
bafy1234
) and the resource subpath if specified (/cat.jpg
).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate?
We already have a multicodec for CAR. Couldn't you retrieve a CAR file from a gateway by CID e.g. https://w3s.link/ipfs/bagbaierabxhdw7wglmlehzgobjuoq3v3bdv64iagjdhu74ysjvdecrezxldq - you don't need to signal successful transmission if the content hashes to the same CID.
If you're using the graph API (
?format=car
orAccept: application/vnd.ipld.car
) you're being specific about what you want in the request and the client is verifying the blocks...and so they know if the transmission over HTTP completed successfully or not...right?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are a couple rough edges that motivated the desire to have additional signaling / metadata here:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we get the "CAR CID" from?
It does not exist as a concept anywhere in the specs related to retrieval or routing.
AFAIK in the majority of real world use cases:
This works only if there is no HTTP midleware in-between client and server, which is never the case. There is always some HTTP middleware or CDN in production.
Once you are limited to HTTP semantics, you will cache truncated responses, and the client has to be smart enough to (1) detect that (2) be able to retry in a way that does not hit the same cache.
This is why it does not work in places like Rhea/Saturn, where HTTP responses are (last time i checked) cached blindly based only on HTTP semantics without understanding internal Block/DAG structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bajtos Let's go extra mile here and elaborate what happens when CAR response with
0x00
-prefixed suffix is parsed by existing CAR software.My suggestion is to add some clear statement about expected interop, like "libraries and implementations SHOULD ignore the suffix after 0x00", otherwise we will create a bad UX/DX, where developer tries to debug things with existign tooling and the tooling errors.
I imagine we don't want things to fail due to
0x00
suffix, bare minimum being:ipfs dag import
should ignore suffixThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a great idea to think about compatibility with existing & future tooling and clearly describe our thinking. 👍🏻
The most important aspect is avoiding the "0x00 insertion attack" vector. You can find more details in the section Zero-length-block insertion attacks (including the Filecoin-specific logic). I am cross-posting the mitigation I proposed:
When developers use existing tooling, they will never receive a CAR file with the
0x00
suffix.There are two major ways how a CAR with a
0x00
suffix can emerge:Somebody makes an HTTP request to a Trustless Gateway, explicitly asks to receive CAR with
meta=eof+json
, saves the response body to a.car
file and forgets to extract the CAR payload from the container (remove the\x00{metadata}
trailer).Somebody uses a tool that is aware of
meta=eof+json
. The tool opts into this new feature when requesting content from a Trustless Gateway, but does not extract the CAR payload from the container in the response body before returning the content back to the user.I am arguing that (2) is a bug in the tooling, introduced by the change that modified Trustless Gateway requests to opt-into
meta=eof+json
, and therefore, the maintainers of that tool should fix that bug - make the tool adhere to spec.Regarding (1): do you think this will happen frequently enough to justify the effort required to change all libraries you mentioned to start ignoring the
0x00
byte?Maybe it's actually a good thing that the tooling reports an error because it tells the user they are using the new
meta=eof+json
feature incorrectly.As an alternative to silently stripping the
0x00
suffix, the tooling can detect the situation where0x00
is followed by a valid DAG-JSON object and report a more helpful error message to the user, advising them to either change the "accept" header in the request to the Trustless Gateway or else remove the0x00
suffix (unpack CARv1 from the container format).Thoughts?
go-car/cmd/car/inspect.go
seems to always treat0x00
as EOF, if I am reading the source code correctly:https://github.com/ipld/go-car/blob/5c5d432d582564f88fd2124f2fce4f2f3e47a654/cmd/car/inspect.go#L26
js-car
seems to always reject zero-length blocks:https://github.com/ipld/js-car/blob/562c39266edda8422e471b7f83eadc8b7362ea0c/src/decoder.js#L94-L97
I guess I can test how existing tooling handles zero-length blocks and document this behaviour in the IPIP, so that we better understand the current landscape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flagging this TODO to show in the PR discussion.