Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPIP-431: Opt-in Extensible CAR Metadata on Trustless Gateway #431

Open
wants to merge 18 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 28 additions & 1 deletion src/http-gateways/trustless-gateway.md
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,8 @@ Below response types SHOULD be supported:
- [application/vnd.ipld.car](https://www.iana.org/assignments/media-types/application/vnd.ipld.car)
- Disables IPLD/IPFS deserialization, requests a verifiable CAR stream to be
returned, implementations MAY support optional CAR content type parameters
(:cite[ipip-0412]) and the explicit [CAR format signaling in HTTP Request](#car-format-signaling-in-request).
(:cite[ipip-0412]), the explicit [CAR format signaling in HTTP Request](#car-format-signaling-in-request)
and the optional [CAR metadata block](#car-meta-content-type-parameter).

- [application/vnd.ipfs.ipns-record](https://www.iana.org/assignments/media-types/application/vnd.ipfs.ipns-record)
- A verifiable :cite[ipns-record] (multicodec `0x0300`).
Expand Down Expand Up @@ -301,6 +302,32 @@ of their presence in the DAG or the value assigned to the "dups" parameter, as
the raw data is already present in the parent block that links to the identity
CID.

## CAR `meta` (content type parameter)

The `meta=eof` parameter allows clients to request the server to include additional metadata about the
CAR to be included at the end of the response body.
bajtos marked this conversation as resolved.
Show resolved Hide resolved

This parameter SHOULD only be used with CAR `version=1`.
Values other than `eof` SHOULD be ignored.

When the parameter is not set, the server must not add any extra CAR blocks to the response.

The metadata block is a regular CAR block with the following properties:

- CID specifies multicodec `car-metadata` (`0x04ff`), see
[multicodec#334](https://github.com/multiformats/multicodec/pull/334).
bajtos marked this conversation as resolved.
Show resolved Hide resolved

- The payload contains metadata encoded as DAG-CBOR.

The metadata MUST include the following fields:

- `len` - byte length of the CAR data (excluding the metadata block)
bajtos marked this conversation as resolved.
Show resolved Hide resolved
- `b3h` - Blake3 hash (checksum) of the CAR data (excluding the metadata block).
- `b3h_sig` - A signature over `<len><b3h><request>` using server's Ed2559 identity.
- `len` is encoded as `varint`,
- `b3h` is encoded as 32 bytes,
- The effective query as executed by the gateway. This query is the request url - path and query string arguments.
Copy link
Member

@lidel lidel Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These b3* fields are specific to SPARK retrieval attestation and should not be listed in the trustless gateway spec as a MUST. These may be mandatory for SPARK, but are optional for the rest of IPFS ecosystem.

Please move them to "User benefit" section of the IPIP document and explain how meta=eof enables SPARK use case by allowing for these custom signatures to be passed along with the data. It makes a good example of extensibility that does not require PL's permission.

ps. I know other services like dagHouse use different hash functions for getting "CAR CID", putting all bets on Blake3 feels like an unnecessary divergence.

Perhaps this could be made bit more future-proof and generic if blake3 is represented as Multihash wrapped in CIDv1+car codec (0x0202)? Just an idea, fine to ignore, given these are specific to SPARK.

Either way, this belongs to the "userland benefiting from metadata extensibility" story.

Suggested change
- `b3h` - Blake3 hash (checksum) of the CAR data (excluding the metadata block).
- `b3h_sig` - A signature over `<len><b3h><request>` using server's Ed2559 identity.
- `len` is encoded as `varint`,
- `b3h` is encoded as 32 bytes,
- The effective query as executed by the gateway. This query is the request url - path and query string arguments.

Copy link

@patrickwoodhead patrickwoodhead Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the Spark use case belongs in Userland section. However, the individual keys of the metadata section, and what the servers must do to implement them, feels like something that should be in the trustless gateway spec.

Keys like car_bytes, data_bytes, block_count will be used by Spark but also may be used by others, and the definition of a key (i.e. what does the server actually return as the value for each key) must be the same for each use case. E.g. if one use case sets data_bytes to be the total byte length of blocks and another use case sets it to be the total byte length of the CAR stream then trustless gateway implementers will need to implement different logic for each different use case.

What's more, for the Spark use case, we do not want gateway operators to know that they are serving a Spark request and not some other request. Since the Spark ones will be incentivised and other request may not be, servers may simply provide a good retrieval service to Spark clients and a poor service to other clients.

car_bytes, data_bytes, block_count seem generic enough. The troublesome one is then the Blake3Hash and signature.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps what we need to do is leave this IPIP to concern the metadata block being appended without any constraints on what can be included in it. Then in a separate place, we define a canonical way to include a key value object in the metadata block and how the server should implement certain useful keys such as car_bytes et al

Copy link
Member

@lidel lidel Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be ok to suggest a key name convention for generic things like car_bytes data_bytes and block_count in the section that described meta parameter, as long it is

  • scoped to JSON (perhaps list under explicit meta=eof[+json]?)
  • change requirement from MUST to SHOULD (convention, not a hard requirement)

I think it would be also ok to have a documented convention for passing a hash of the CAR stream (aka CAR CID) – maybe name it car_cid and use CIDv1 with 0x0202 codec – this convention is already used by .storage folks, no need to invent anything new.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be also ok to have a documented convention for passing a hash of the CAR stream (aka CAR CID) – maybe name it car_cid and use CIDv1 with 0x0202 codec – this convention is already used by .storage folks, no need to invent anything new.

+1 to document car_cid.

For SPARK, we specifically want a Blake3 hash so that we can use inclusion proofs. That's why we want to use a dedicated field b3checksum instead of a more generic car_cid.

Copy link
Member

@lidel lidel Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including content path and CAR export parameters feels generic enough to keep in the spec, but we should not mix content path with car and url parameters as it leads to bugs around things like percent-encoding especially where ? or / is involved (cc ipfs/gateway-conformance#115).

These should be three separate fields:

  • content_path - requested content path
  • dag_params - map with DAG params like dag-scope, entity-bytes from IPIP-402
  • car_params - map with CAR content type params like order and dups from IPIP-412

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidel what is the reasoning behind splitting up dag_params and car_params? Could we instead go for content_path and query_params to keep it simple and generic (allowing for other query params)?

Copy link
Member

@lidel lidel Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that was a conscious design choice, to avoid mixing data selector with details of the transport format (not everything is in URL query params):

  • dag_params are about what user data was selected, not tied to any specific transport (could be applied to something other than CAR)
    • these are things that land in URL query
  • car_params are specific to CAR container format, they do not change the user data that was selected, only the way it is represented when sent as CAR
    • these are things that land in Accept/Content-Type headers
  • people read "query" and assume URL query :)

@patrickwoodhead that being said, if you want to simplify, this IPIP could go with a single dag-json map named response_params (to avoid confusion with URL query params, and account for the fact that server may ignore some of request params when producing a response)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using retrieval_params instead of response_params? In my mind, response parameters don't describe "what user data was selected".

What is our motivation in SPARK:
We want the gateway to describe what exactly the client is retrieving (CID, subpath, dag params, car params) and provide a signature over that.

If two SPARK checker clients submit a metadata block with the same retrieval parameters (CID, subpath, dag params, car params) then we want:

  • To be able to verify that the clients retrieved the DAG (sub)tree they were expected to retrieve.
  • Confidence that both clients made the same request to the gateway (semantically?) and were supposed to receive exactly the same response from the gateway.

content_path - requested content path

When the client requests GET /ipfs/bafy1234/cat.jpg, what is content_path?

  • /ipfs/bafy1234/cat.jpg
  • bafy1234/cat.jpg
  • /cat.jpg
  • something else?

We need the metadata block to describe both the CID requested (bafy1234) and the resource subpath if specified (/cat.jpg).


## CAR format parameters and determinism

The default header and block order in a CAR format is not specified by IPLD specifications.
Expand Down
145 changes: 145 additions & 0 deletions src/ipips/ipip-0431.md
bajtos marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1,145 @@
---
title: "IPIP-0431: Opt-in Extensible CAR Metadata on Trustless Gateway"
date: 2023-08-08
ipip: proposal
editors:
- name: Miroslav Bajtoš
github: bajtos
affiliation:
name: Protocol Labs
url: https://protocol.ai/
relatedIssues:
- https://github.com/filecoin-project/boost/issues/1597
order: 431
tags: ['ipips']
---

## Summary

Define an optional enhancement of the CARv1 stream that allows a Gateway server to provide
additional metadata about the CARv1 response. Introduce a new content type that allows the client
and the server to signal or negotiate the inclusion of extra metadata.

## Motivation

SPARK is a Filecoin Station module that measures the reputation of Storage Providers by periodically
retrieving a random CID. Since both SPs and SPARK nodes are permissionless, and Proof of Retrieval
is an unsolved problem, we need a way to verify that a SPARK node retrieved the given CID from the
given SP. To enable that, we need the Trustless Gateway serving the retrieval request to include a
retrieval attestation after the entire response was sent to the client.

Aside from this specific use case, the IPFS Ecosystem at large has no reliable
mechanism to signal that a CAR file transmission over HTTP completed successfully.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate?

We already have a multicodec for CAR. Couldn't you retrieve a CAR file from a gateway by CID e.g. https://w3s.link/ipfs/bagbaierabxhdw7wglmlehzgobjuoq3v3bdv64iagjdhu74ysjvdecrezxldq - you don't need to signal successful transmission if the content hashes to the same CID.

If you're using the graph API (?format=car or Accept: application/vnd.ipld.car) you're being specific about what you want in the request and the client is verifying the blocks...and so they know if the transmission over HTTP completed successfully or not...right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are a couple rough edges that motivated the desire to have additional signaling / metadata here:

  1. In a Filecoin deal or car uploaded that has links to blocks that don't exist. How do you differentiate if a requested car has all the blocks that it should have, or if a missing link in the traversal was skipped / incorrectly transmitted?
  2. If the client doesn't have all the codecs of all blocks, so can't parse links in blocks / "follow" the traversal or structure of the car it has asked for, how does it know it got the right blocks?

Copy link
Member

@lidel lidel Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a multicodec for CAR. Couldn't you retrieve a CAR file from a gateway by CID

Where do we get the "CAR CID" from?
It does not exist as a concept anywhere in the specs related to retrieval or routing.

AFAIK in the majority of real world use cases:

  • client does not know the hash of CAR stream ("CAR CID") because CARv1 is not deterministic (example, example), it only knows he CID of actual data
  • server may not know the final hash of CAR stream ("CAR CID") if it is generating CAR stream on the fly

the client is verifying the blocks...and so they know if the transmission over HTTP completed successfully or not...right?

This works only if there is no HTTP midleware in-between client and server, which is never the case. There is always some HTTP middleware or CDN in production.

Once you are limited to HTTP semantics, you will cache truncated responses, and the client has to be smart enough to (1) detect that (2) be able to retry in a way that does not hit the same cache.

This is why it does not work in places like Rhea/Saturn, where HTTP responses are (last time i checked) cached blindly based only on HTTP semantics without understanding internal Block/DAG structure.


However, we need this in order to be able to use CARs as a way of serving streaming
responses for queries. One way of solving this problem is to append an extra block at the end of the
CAR stream with information that clients can use to check whether all CAR blocks have been received.

## Detailed design

CAR content type
([`application/vnd.ipld.car`](https://www.iana.org/assignments/media-types/application/vnd.ipld.car))
already supports optional parameters like `version` and `order`, which allows
HTTP client to opt-in via `Accept` header and Gateway to indicate via
`Content-Type` header which CAR flavor is returned with the response.

The proposed solution introduces a new parameter for the CAR content type in HTTP requests
and responses: `meta`.

When the CAR content type parameter `meta` is set to `eof`, the Gateway will write one additional CAR
block with metadata to the response, after it sent all CAR blocks.

The metadata format is DAG-CBOR and open to extension, allowing standardized
userland experimentation similar to the Extensible Data field from IPNS V2.

See [CAR `meta` (content type parameter)](/http-gateways/trustless-gateway/#car-meta-content-type-parameter)
in Trustless Gateway specification for more details.

## Design rationale

The proposal introduces a minimal change allowing Gateways and retrieval clients to explicitly opt
into receiving additional metadata block at the end of the CAR response stream.

The metadata block is designed to be very flexible and able to support new use-cases that may arise
in the future.

### User benefit

- Clients of trustless gateways can use the fields from the metadata as an attestation that they
performed the retrieval from the given server.

- The `len` field in the metadata block allows clients to verify whether they received all CAR
bytes, which provides a backward-compatible solution for the [CARv1 streaming problem](https://github.com/ipfs/specs/pull/332) until new CAR version is introduced.

### Compatibility
Copy link
Member

@lidel lidel Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bajtos Let's go extra mile here and elaborate what happens when CAR response with 0x00-prefixed suffix is parsed by existing CAR software.

My suggestion is to add some clear statement about expected interop, like "libraries and implementations SHOULD ignore the suffix after 0x00", otherwise we will create a bad UX/DX, where developer tries to debug things with existign tooling and the tooling errors.

I imagine we don't want things to fail due to 0x00 suffix, bare minimum being:

  • >80% of Amino DHT IPFS network (including IPFS Desktop and Brave) is Kubo
    • ipfs dag import should ignore suffix
  • reference CAR libraries ignore 0x00 by default
    • js-car (JS library used by things like custom Service Workers, Helia)
    • go-car v1 and v2 (GO libraries)
      • Caveat: I think @rvagg mentioned this may not be possible, because of Filecoin-specific logic present in the library?
  • CLI tools we recommend to developers, they will try to use these for debugging CAR responses with the suffix:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go extra mile here and elaborate what happens when CAR response with 0x00-prefixed suffix is parsed by existing CAR software.

It's a great idea to think about compatibility with existing & future tooling and clearly describe our thinking. 👍🏻

The most important aspect is avoiding the "0x00 insertion attack" vector. You can find more details in the section Zero-length-block insertion attacks (including the Filecoin-specific logic). I am cross-posting the mitigation I proposed:

Our proposal avoids this attack vector:

  • It does not change the current semantics of CARv1. Zero-length blocks remain invalid.
  • Instead, we treat the response body as a new container format combining the CARv1 file with additional data.
  • Clients must explicitly request this new container format. Existing clients not aware of the new metadata will not receive responses in the new format.

When developers use existing tooling, they will never receive a CAR file with the 0x00 suffix.

There are two major ways how a CAR with a 0x00 suffix can emerge:

  1. Somebody makes an HTTP request to a Trustless Gateway, explicitly asks to receive CAR with meta=eof+json, saves the response body to a .car file and forgets to extract the CAR payload from the container (remove the \x00{metadata} trailer).

  2. Somebody uses a tool that is aware of meta=eof+json. The tool opts into this new feature when requesting content from a Trustless Gateway, but does not extract the CAR payload from the container in the response body before returning the content back to the user.

I am arguing that (2) is a bug in the tooling, introduced by the change that modified Trustless Gateway requests to opt-into meta=eof+json, and therefore, the maintainers of that tool should fix that bug - make the tool adhere to spec.

Regarding (1): do you think this will happen frequently enough to justify the effort required to change all libraries you mentioned to start ignoring the 0x00 byte?

Maybe it's actually a good thing that the tooling reports an error because it tells the user they are using the new meta=eof+json feature incorrectly.

As an alternative to silently stripping the 0x00 suffix, the tooling can detect the situation where 0x00 is followed by a valid DAG-JSON object and report a more helpful error message to the user, advising them to either change the "accept" header in the request to the Trustless Gateway or else remove the 0x00 suffix (unpack CARv1 from the container format).

Thoughts?


go-car/cmd/car/inspect.go seems to always treat 0x00 as EOF, if I am reading the source code correctly:

https://github.com/ipld/go-car/blob/5c5d432d582564f88fd2124f2fce4f2f3e47a654/cmd/car/inspect.go#L26

	rd, err := carv2.NewReader(inStream, carv2.ZeroLengthSectionAsEOF(true))

js-car seems to always reject zero-length blocks:

https://github.com/ipld/js-car/blob/562c39266edda8422e471b7f83eadc8b7362ea0c/src/decoder.js#L94-L97

  let length = decodeVarint(await reader.upTo(8), reader)
  if (length === 0) {
    throw new Error('Invalid CAR section (zero length)')
  }

I guess I can test how existing tooling handles zero-length blocks and document this behaviour in the IPIP, so that we better understand the current landscape.


The new feature requires clients to explicitly ask the server to include the extra block via `Accept` header,
therefore the change is fully backwards-compatible for all existing gateway clients.

Gateways receiving requests for the CAR content type can ignore the `meta` parameter they don't
support and return back a response with one of the CAR content types they support. This makes the
proposed change backwards-compatible for existing gateways too.


### Security

The proposed specification change does not introduce any negative security implications.

### Alternatives
bajtos marked this conversation as resolved.
Show resolved Hide resolved

#### HTTP Trailers

Instead of adding a new content type argument, we were considering sending the additional metadata
in HTTP response trailers. Unfortunately, HTTP trailers are not widely supported by the ecosystem.
Nginx proxy module discards them, [browser `Fetch API` does not allow JS clients to access trailer
headers](https://github.com/mdn/browser-compat-data/issues/14703), neither does the Rust `reqwest` client.

#### New Content-Type

We could introduce a new content type that is not CARv3, but a thin envelope
around CARv1 with purpose of streaming over HTTP (e.g. `Content-Type:
application/vnd.ipld.car-stream`).

It would have three fields:
- `car-stream-header` (optional DAG-CBOR)
- `car` (same as `application/vnd.ipld.car;version=1`)
- `car-stream-end` (optional DAG-CBOR)

This will be enough to append DAG-CBOR manifest at the end of the stream. It
would be effectively the same CAR byte stream, but with different
`Content-Type`.

Upside of this solution:

- does not require registering new codec, or mixing data plane with control
plane, no sniffing the last DAG-CBOR block

Downsides of this solution:

- maintenance cost, requires duplicating of all CAR-related tests and features
- ecosystem opportunity cost, in creating new content type, we increase
cognitive overhead for everyone working with IPFS over HTTP
- no backward-compatible interop with existing tools and gateways that only
speak `application/vnd.ipld.car`
- distracts us away from working on things like large blocks and CARv3

#### Create CARv3

We could admit we've clearly hit limitation of what we can do with HTTP and CARv1 and CARv2 and stop abusing existing CARv1 by mixing data plane with control plane.

Spend energy on creating CARv3 that solves the problems from "Motivation" section and more:
- optional index or key-value metadata before or after data
- native truncation detection and standardized error handling and passing during streaming
- support for things like [Large Blocks](https://discuss.ipfs.tech/t/supporting-large-ipld-blocks/15093/)

TODO: link to some public artifact about CARv3

## Test fixtures

TBD

Using one CID, request the CAR data using various combinations of content type parameters.
Comment on lines +191 to +193
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging this TODO to show in the PR discussion.


### Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).