IPIP-431: Opt-in Extensible CAR Metadata on Trustless Gateway #431

Open: wants to merge 18 commits into base: main
109 changes: 108 additions & 1 deletion src/http-gateways/trustless-gateway.md
@@ -80,7 +80,8 @@ Below response types SHOULD be supported:
- [application/vnd.ipld.car](https://www.iana.org/assignments/media-types/application/vnd.ipld.car)
- Disables IPLD/IPFS deserialization, requests a verifiable CAR stream to be
returned, implementations MAY support optional CAR content type parameters
-(:cite[ipip-0412]) and the explicit [CAR format signaling in HTTP Request](#car-format-signaling-in-request).
+(:cite[ipip-0412]), the explicit [CAR format signaling in HTTP Request](#car-format-signaling-in-request)
+and the optional [metadata block](#meta-content-type-parameter).

- [application/vnd.ipfs.ipns-record](https://www.iana.org/assignments/media-types/application/vnd.ipfs.ipns-record)
- A verifiable :cite[ipns-record] (multicodec `0x0300`).
@@ -301,6 +302,112 @@ of their presence in the DAG or the value assigned to the "dups" parameter, as
the raw data is already present in the parent block that links to the identity
CID.

## `meta` (content type parameter)

The `meta` parameter allows clients to request the server to include additional metadata about the CAR along with the response body.

The value of this parameter combines the location where the metadata is placed (e.g. `eof`) and the type of the metadata (e.g. `json`), separated by a `+`, giving a value such as `meta=eof+json`.
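As an illustration of the `location+type` value format described above, here is a minimal sketch of how a client or gateway might extract the `meta` parameter from a CAR content type. The function name `parse_meta_param` is hypothetical, and the parsing is deliberately simplified (it does not handle quoted parameter values or multiple media ranges in an `Accept` header).

```python
def parse_meta_param(content_type: str):
    """Return (location, data_type) for the `meta` parameter, or None if it is absent."""
    media_type, *params = [part.strip() for part in content_type.split(";")]
    if media_type.lower() != "application/vnd.ipld.car":
        return None
    for param in params:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "meta":
            # The value has the form <location>+<data type>, e.g. eof+json.
            location, plus, data_type = value.strip().partition("+")
            if not plus:
                raise ValueError(f"malformed meta value: {value!r}")
            return location, data_type
    return None


print(parse_meta_param("application/vnd.ipld.car;version=1;meta=eof+json"))
```

A caller would then compare the result against `("eof", "json")`, the only combination this section defines.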

When the location parameter is set to `eof`, which is currently the only supported value, the server SHOULD respond with the following response body:

```
<Response body as CARv1 stream> <0x00 byte> <Metadata>
```

The only supported value for the data type parameter is `json`. This signifies that the metadata MUST be a JSON object.

This parameter MUST only be used with CAR `version=1`.

When the parameter is not set or does not equal `eof+json`, the server SHOULD NOT append anything to the CAR response: neither the `0x00` byte nor any metadata.
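The following is a minimal sketch of how a client that requested `meta=eof+json` might split the response body into the CARv1 stream and the metadata object. It assumes the whole body is already in memory and relies only on the CARv1 framing (each section, including the header, is prefixed with an unsigned varint length). The names `read_varint` and `split_car_and_metadata` are hypothetical, and the section payloads in the usage lines are placeholders rather than real DAG-CBOR headers or CID-prefixed blocks.

```python
import json


def read_varint(buf: bytes, pos: int):
    """Decode an unsigned LEB128 varint at buf[pos:]; return (value, next_pos)."""
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def split_car_and_metadata(body: bytes):
    """Split `<CARv1 stream> <0x00 byte> <Metadata>` into (car_bytes, metadata_dict)."""
    pos = 0
    while pos < len(body):
        length, data_start = read_varint(body, pos)
        if length == 0:
            # The 0x00 byte separates the CARv1 stream from the metadata.
            metadata = json.loads(body[data_start:].decode("utf-8"))
            if not isinstance(metadata, dict):
                raise ValueError("metadata must be a JSON object")
            return body[:pos], metadata
        pos = data_start + length
        if pos > len(body):
            raise ValueError("truncated CAR section")
    raise ValueError("no 0x00 delimiter found: metadata missing or stream truncated")


# Placeholder sections: a 3-byte "header" and a 2-byte "block".
fake_car = b"\x03abc" + b"\x02xy"
car, meta = split_car_and_metadata(fake_car + b"\x00" + b'{"data":{"car_bytes":7}}')
print(len(car), meta)
```

A streaming client would apply the same logic incrementally instead of buffering the whole body.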

When `meta=eof+json`, the JSON object SHOULD conform to the following [JSON schema](https://json-schema.org/).
Author

Discussion points:

  • In the current spec & IPIP, we are formatting metadata as JSON. Should we say DAG-JSON instead?
  • Do we want to serialise the metadata as a CAR block, prefixing the JSON data with varint | CID header?

Member (@lidel, Nov 7, 2023)

@willscott @rvagg thoughts? Value added in DAG-JSON prefixed with own CID is that it allows client to detect truncation beyond 0x00 byte.

Author

I believe clients can already easily detect truncation of the metadata block.

  • The block is a DAG-JSON object, it must start with { and end with a matching }.
  • If the block is truncated, it will not end with the matching } and the JSON parser will throw an error.


```json
{
"type": "object",
"properties": {
"data": {
"type": "object",
"description": "Properties of the response"
},
Comment on lines +329 to +332
Author

Discussion point:

In the current proposal, the top-level "data" object combines fields about "what was requested" (e.g. CAR & DAG params) with "what was returned" (e.g. CARv1 length in bytes).

I'd like to discuss an alternative: split data into two fields req and res. The first will describe what the client requested, the second will describe what the server returned.

Such a division would allow us to shorten field names, e.g. data.car_params.dups can become req.dups.

Member

Splitting into req and res sgtm, improves clarity

"error": {
"type": "string",
"description": "Error message"
},
"sig": {
"type": "string",
"description": "A signature, using the server's Ed25519 identity, over the `data` object serialized as JSON."
Comment on lines +337 to +339
Member

@bajtos

  • HTTP Gateways have no concept of "server Ed25519" introduced here. How does one verify the signature without knowing the pubkey?
    • One way to avoid being prescriptive about key type or its location, is to have sig_key with CID-encoded public libp2p-key that can be used for signature verification.
      • The nice thing about this is that Gateway/client implementation will already have relevant code/library as we use these in IPNS and libp2p.
  • If you sign JSON, you want it to be deterministic variant like DAG-JSON, otherwise someone will run into bugs when they use less strict JSON library in different languages.
Suggested change
"sig": {
"type": "string",
"description": "A signature, using the server's Ed25519 identity, over the `data` object serialized as JSON."
"sig_pubkey": {
"type": "string",
"description": "A libp2p-key used for signing"
},
"sig": {
"type": "string",
"description": "A signature, using the `sig_pubkey`, over the `data` object serialized as DAG-JSON."

Author

Here is our use case:

  1. An untrusted/permissionless client makes a retrieval request to the Storage Provider's booster-http address advertised in IPNI.
  2. The client submits the measurement to the SPARK orchestration layer.
  3. Later, SPARK's evaluation service wants to verify that the client contacted the SP.

To do so, we must not accept signatures from any identity, only the signature from the identity advertised by SP to IPNI.

I am arguing this is true for everybody else who wants to use the signature to verify that a metadata block submitted by an untrusted party was indeed produced by the expected Trustless Gateway.

Consider a simple attack vector: the attacker takes the metadata block produced by the origin gateway and replaces the signature with one created using the attacker's identity. Clients verifying the signature against the sig_pubkey field in the metadata will not notice the attack.

Now I can see how including sig_pubkey can simplify troubleshooting:

  • If sig_pubkey does not match the pubkey we expected, then we know the metadata block was signed by somebody else
  • If sig_pubkey matches but the signature does not, then we know the metadata block was modified from the original.
    Compare that with my proposal:
  • If the signature is not valid, then either the metadata block was tampered with or it was signed by a different identity.

IMO, this improvement is not worth the cost of increasing metadata block size and, thus, egress traffic for Trustless Gateways.

Do you have any other use case for the signature in your mind?

IMO, the clients making retrieval requests don't need this signature for validating the metadata block, as they can rely on guarantees provided by the underlying transport - HTTPS.

  • HTTP Gateways have no concept of "server ED25519" introduced here.

Good point. We don't require all Gateways to sign the metadata block, SPARK needs the signature only from Storage Providers' servers handling retrieval (booster-http).

Let's update the spec to explicitly mention the signature is an optional field.

How does one verify the signature without knowing the pubkey?

  • One way to avoid being prescriptive about key type or its location, is to have sig_key with CID-encoded public libp2p-key that can be used for signature verification.

As I wrote above, if you don't know the expected server identity, then the signature is not useful for you.

Having said that, I like the idea of adding more details about the identity/public key to the spec.

The proposed format CID-encoded public libp2p-key seems like a good candidate, although AFAICT, that's not the format advertised to IPNI. In IPNI, I see identities in the format that can be used in multiaddr's /p2p/{id} part:

12D3KooWAWHEbCQy22d45mKbKSewoB1xksDDhR7o5S4mDrSNKXNk
12D3KooWAy5kaLtHf5uS7PZVLjSYd8sGqJ6fn7bxMjqLLZ1uULp9
12D3KooWEiPRcfjXJVehty8okJGJpBZP8zM5UBoCK5yw2MXfx98x
12D3KooWFpv7LP1MUmjfQ8sAUXgJXG5FRMJLnqnJyR32fVboqspB
12D3KooWHKeaNCnYByQUMS2n5PAZ1KZ9xKXqsb4bhpxVJ6bBJg5V
12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f
12D3KooWSfsqUahHLCmiENT8oN4FkVtz5pSCxKtNEb7wrR1rrRjk
  • If you sign JSON, you want it to be deterministic variant like DAG-JSON, otherwise someone will run into bugs when they use less strict JSON library in different languages.

Makes sense; I'll update the spec to require the metadata to be a DAG-JSON.

},
"required": []
}
}
```
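Because the `sig` field in the schema above covers the serialized `data` object, the signer and the verifier need to produce byte-identical serializations. The sketch below shows why plain JSON output is not stable across key orderings and how sorting keys and dropping whitespace yields stable bytes. This only approximates a deterministic encoding such as DAG-JSON and is not a full implementation of it; `canonical_bytes` is a hypothetical helper name.

```python
import json


def canonical_bytes(data: dict) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(
        data, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


a = {"car_bytes": 7, "block_count": 2}
b = {"block_count": 2, "car_bytes": 7}

# Default serialization preserves insertion order, so the bytes differ.
print(json.dumps(a) == json.dumps(b))
# Canonical serialization is independent of insertion order.
print(canonical_bytes(a) == canonical_bytes(b))
```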

The properties object can include any fields that the server would like to implement. The following JSON schema explicitly mentions certain properties fields in order to reach a convention on their definition as they have existing use cases.

```json
{
"type": "object",
"properties": {
"car_bytes": {
"description": "The total byte length of the CAR stream (excluding the 0x00 byte and the metadata block)",
"type": "integer"
},
"data_bytes": {
"description": "Total byte length of the flat file before it was encoded into a CAR file",
"type": "integer"
},
Comment on lines +356 to +359
Member

@bajtos what happens when returned CAR is for:

  • HAMT-sharded UnixFS directory?
  • a single file under some sub-path of HAMT-sharded UnixFS directory?

Is the semantic meaning here to be "raw bytes of all files, ignoring UnixFS directory metadata", or something else?

Author

Great questions! TBH, I don't know the answers. We don't need data_bytes for SPARK. I think this field was added based on the discussion in this proposal, but I could not find the specific comment requesting it.

I am proposing to remove data_bytes from the spec. We can introduce it later if there is a clear need. We will better understand the desired semantics at that point.

"block_count": {
"description": "Total number of blocks present in the CAR stream (excluding the 0x00 byte and the metadata block, but including duplicates when present)",
"type": "integer"
},
"car_cid": {
"description": "A hash of the CAR stream giving a CIDv1 with 0x0202 codec",
"type": "string"
},
"b3checksum": {
"description": "A Blake3 hash (checksum) of the CAR stream (excluding the 0x00 byte and the metadata block). The value should be serialized as a multihash with multibase prefix, preferably using Base58 encoding.",
"type": "string"
},
Comment on lines +368 to +371
Member (@lidel, Nov 7, 2023)

@bajtos What is the difference between car_cid and this field?

Hardcoding Blake3 in field name and description makes no sense if you use Multihash. It could use functions other than blake3 in the future.

To reduce future confusion, could this be renamed to car_checksum ? (and remove car_cid since it is redundant?)

Author

Here is your description of car_cid, see #431 (comment):

I think it would be also ok to have a documented convention for passing a hash of the CAR stream (aka CAR CID) – maybe name it car_cid and use CIDv1 with 0x0202 codec – this convention is already used by .storage folks, no need to invent anything new.

Regarding b3checksum:

For SPARK, we specifically need the Blake3 hash of the CAR stream, and we need gateways to always return this hash. In particular, clients cannot ask the server to use Blake3 for the CAR checksum because the server could use this information to detect SPARK clients vs. other clients and provide different quality of service.

I agree it's confusing to have both car_cid and b3checksum, but I don't see a better solution. Do you?

"content_path": {
"description": "The URL path in the request as executed by the gateway, e.g. `/ipfs/bafy1234/cat.jpg`. The query string MUST be stripped from the path.",
"type": "string"
},
Comment on lines +372 to +375
Author

Discussion point leading back to #431 (comment):

How do we represent the information about what content was requested?

  • The CID
  • An optional path to a file inside UnixFS

"dag_params": {
"description": "A map with DAG params like dag-scope, entity-bytes from [IPIP-402](https://specs.ipfs.tech/ipips/ipip-0402/)",
"type": "object",
"properties": {
"dag-scope": {
"description": "See [IPIP-402](https://specs.ipfs.tech/ipips/ipip-0402/) for the definition",
"type": "string"
},
"entity-bytes": {
"description": "See [IPIP-402](https://specs.ipfs.tech/ipips/ipip-0402/) for the definition",
"type": "string"
}
},
"required": []
},
"car_params": {
"description": "A map with CAR content type params like order and dups from [IPIP-412](https://specs.ipfs.tech/ipips/ipip-0412/)",
"type": "object",
"properties": {
"order": {
"description": "See [IPIP-412](https://specs.ipfs.tech/ipips/ipip-0412/) for the definition.",
"type": "string"
},
"dups": {
"description": "See [IPIP-412](https://specs.ipfs.tech/ipips/ipip-0412/) for the definition.",
"type": "string"
}
},
"required": []
}
},
"required": []
}
```
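To make the `car_bytes` and `block_count` definitions above concrete, here is a minimal sketch of how a server might derive both values from a complete CARv1 byte string. It assumes the first varint-prefixed section is the CAR header, so it is not counted as a block, while duplicate blocks are counted each time they appear. `summarize_car` is a hypothetical name and the input in the usage line is placeholder data, not a valid CAR.

```python
def summarize_car(car: bytes) -> dict:
    """Return `car_bytes` and `block_count` for a complete CARv1 byte string."""
    pos = 0
    sections = 0
    while pos < len(car):
        # Each section starts with an unsigned LEB128 varint giving its payload length.
        length = shift = 0
        while True:
            if pos >= len(car):
                raise ValueError("truncated varint")
            byte = car[pos]
            pos += 1
            length |= (byte & 0x7F) << shift
            shift += 7
            if byte < 0x80:
                break
        if length == 0:
            raise ValueError("zero-length sections are invalid in CARv1")
        pos += length
        if pos > len(car):
            raise ValueError("truncated CAR section")
        sections += 1
    if sections == 0:
        raise ValueError("missing CAR header")
    # The first section is the CAR header, so it is not counted as a block.
    return {"car_bytes": len(car), "block_count": sections - 1}


# One placeholder header section followed by the same placeholder block twice.
print(summarize_car(b"\x03abc" + b"\x02xy" + b"\x02xy"))
```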

## CAR format parameters and determinism

The default header and block order in a CAR format is not specified by IPLD specifications.
197 changes: 197 additions & 0 deletions src/ipips/ipip-0431.md
@@ -0,0 +1,197 @@
---
title: "IPIP-0431: Opt-in Extensible CAR Metadata on Trustless Gateway"
date: 2023-08-08
ipip: proposal
editors:
- name: Miroslav Bajtoš
github: bajtos
affiliation:
name: Protocol Labs
url: https://protocol.ai/
- name: Patrick Woodhead
github: patrickwoodhead
affiliation:
name: Protocol Labs
url: https://protocol.ai/
relatedIssues:
- https://github.com/filecoin-project/boost/issues/1597
order: 431
tags: ['ipips']
---

## Summary

Define an optional enhancement of the CARv1 response that allows a Gateway server to provide
additional metadata about the CARv1 stream. Introduce a new content type parameter that allows the
client and the server to signal or negotiate the inclusion of extra metadata.

## Motivation

SPARK is a Filecoin Station module that measures the reputation of Storage Providers by periodically
retrieving a random CID. Since both SPs and SPARK nodes are permissionless, and Proof of Retrieval
is an unsolved problem, we need a way to verify that a SPARK node retrieved the given CID from the
given SP. To enable that, we want the Trustless Gateway serving the retrieval request to include a
retrieval attestation after the entire response was sent to the client.

Aside from this specific use case, the IPFS Ecosystem at large has no reliable
mechanism to signal that a CAR file transmission over HTTP completed successfully.
Member

Can you elaborate?

We already have a multicodec for CAR. Couldn't you retrieve a CAR file from a gateway by CID e.g. https://w3s.link/ipfs/bagbaierabxhdw7wglmlehzgobjuoq3v3bdv64iagjdhu74ysjvdecrezxldq - you don't need to signal successful transmission if the content hashes to the same CID.

If you're using the graph API (?format=car or Accept: application/vnd.ipld.car) you're being specific about what you want in the request and the client is verifying the blocks...and so they know if the transmission over HTTP completed successfully or not...right?

Contributor

I think there are a couple rough edges that motivated the desire to have additional signaling / metadata here:

  1. A Filecoin deal or an uploaded CAR may contain links to blocks that don't exist. How do you differentiate whether a requested CAR has all the blocks that it should have, or whether a missing link in the traversal was skipped / incorrectly transmitted?
  2. If the client doesn't have all the codecs of all blocks, so can't parse links in blocks / "follow" the traversal or structure of the car it has asked for, how does it know it got the right blocks?

Member (@lidel, Sep 14, 2023)

We already have a multicodec for CAR. Couldn't you retrieve a CAR file from a gateway by CID

Where do we get the "CAR CID" from?
It does not exist as a concept anywhere in the specs related to retrieval or routing.

AFAIK in the majority of real world use cases:

  • client does not know the hash of CAR stream ("CAR CID") because CARv1 is not deterministic (example, example), it only knows the CID of actual data
  • server may not know the final hash of CAR stream ("CAR CID") if it is generating CAR stream on the fly

the client is verifying the blocks...and so they know if the transmission over HTTP completed successfully or not...right?

This works only if there is no HTTP middleware in-between client and server, which is never the case. There is always some HTTP middleware or CDN in production.

Once you are limited to HTTP semantics, you will cache truncated responses, and the client has to be smart enough to (1) detect that (2) be able to retry in a way that does not hit the same cache.

This is why it does not work in places like Rhea/Saturn, where HTTP responses are (last time i checked) cached blindly based only on HTTP semantics without understanding internal Block/DAG structure.


We need such a signalling mechanism to be able to use CARs as a way of serving streaming
responses for queries. One way of solving this problem is to append an extra block at the end of the
CAR stream with information that clients can use to check whether all CAR blocks have been received.

## Detailed design

The CAR content type
([`application/vnd.ipld.car`](https://www.iana.org/assignments/media-types/application/vnd.ipld.car))
already supports optional parameters like `version` and `order`, which allow an
HTTP client to opt in via the `Accept` header and a Gateway to indicate via the
`Content-Type` header which CAR flavor is returned with the response.

The proposed solution introduces a new parameter for the CAR content type in HTTP requests
and responses: `meta`.

The `meta` parameter allows clients to request the server to include additional metadata about the CAR along with the response body.

The value of this parameter combines the location where the metadata is placed (e.g. `eof`) and the type of the metadata (e.g. `json`), separated by a `+`, giving a value such as `meta=eof+json`.

When the location parameter is set to `eof`, which is currently the only supported value, the server SHOULD respond with the following response body:

```
<Response body as CARv1 stream> <0x00 byte> <Metadata>
```

The only supported value for the data type parameter is `json`. This signifies that the metadata MUST be a JSON object.

This parameter MUST only be used with CAR `version=1`.

When the parameter is not set or does not equal `eof+json`, the server SHOULD NOT append anything to the CAR response: neither the `0x00` byte nor any metadata.

This results in an example content type of `application/vnd.ipld.car;version=1;meta=eof+json`.

See [CAR `meta` (content type parameter)](/http-gateways/trustless-gateway/#car-meta-content-type-parameter)
in Trustless Gateway specification for more details.
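As a hedged illustration of the opt-in described in this section, the sketch below builds (but does not send) a gateway request whose `Accept` header asks for a CARv1 response with the metadata trailer. The helper name `build_car_request`, the gateway host and the CID are placeholders chosen for the example.

```python
from urllib.request import Request


def build_car_request(gateway: str, cid: str, with_metadata: bool = True) -> Request:
    """Build a Trustless Gateway request for a CARv1 stream, optionally opting into `meta`."""
    accept = "application/vnd.ipld.car;version=1"
    if with_metadata:
        accept += ";meta=eof+json"
    return Request(f"{gateway.rstrip('/')}/ipfs/{cid}", headers={"Accept": accept})


# Placeholder gateway host and CID.
req = build_car_request("https://gateway.example", "bafy1234")
print(req.full_url)
print(req.get_header("Accept"))
```

Since a gateway may not support `meta`, a client would still need to inspect the `Content-Type` of the response before expecting the `0x00` byte and the metadata at the end of the body.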

## Design rationale

The proposal introduces a minimal change allowing Gateways and retrieval clients to explicitly opt
into receiving an additional metadata block at the end of the CAR response stream.

The metadata block is designed to be very flexible and able to support new use-cases that may arise
in the future.

### User benefit

- Clients of trustless gateways can use the fields from the metadata as an attestation that they
performed the retrieval from the given server.

- For example, the metadata block could include a `car_bytes` field, the byte length of the CAR stream (excluding the metadata block). This would allow clients to verify whether they received all CAR
bytes, which provides a backward-compatible solution for the [CARv1 streaming problem](https://github.com/ipfs/specs/pull/332) until a new CAR version is introduced.

- As another example, the metadata object includes the `error` field, allowing the server to pass back additional information about why the response is an error, such as why the CAR stream was incomplete.

- In the SPARK use case, retrieval clients would like to prove they have retrieved an entire file from a specific retrieval provider that has implemented the trustless gateway spec. The additional metadata block allows checksums and signatures to be passed along with the data, allowing the retrieval client to create a proof of correct retrieval.

- The metadata `sig` field SHOULD also be populated with a signature, made using the server's Ed25519 identity, over the metadata properties object. This allows gateway clients to submit the metadata block as an attestation of retrieval that 3rd parties can verify.
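The `car_bytes` check mentioned in the list above could look roughly like the following sketch. It assumes the client has already separated the CAR bytes from the metadata object and that the metadata follows the proposed schema, with `car_bytes` nested under the top-level `data` object. `car_is_complete` is a hypothetical helper; it does not verify `sig`.

```python
def car_is_complete(received_car: bytes, metadata: dict) -> bool:
    """Return True when the received CAR length matches the server-reported `car_bytes`."""
    if metadata.get("error"):
        # The server reported a problem, so the stream cannot be treated as complete.
        return False
    expected = metadata.get("data", {}).get("car_bytes")
    if isinstance(expected, bool) or not isinstance(expected, int):
        raise ValueError("metadata does not carry an integer `car_bytes`")
    return len(received_car) == expected


print(car_is_complete(b"\x03abc\x02xy", {"data": {"car_bytes": 7}}))
print(car_is_complete(b"\x03abc", {"data": {"car_bytes": 7}}))
```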

### Compatibility
Member (@lidel, Nov 7, 2023)

@bajtos Let's go extra mile here and elaborate what happens when CAR response with 0x00-prefixed suffix is parsed by existing CAR software.

My suggestion is to add some clear statement about expected interop, like "libraries and implementations SHOULD ignore the suffix after 0x00", otherwise we will create a bad UX/DX, where developer tries to debug things with existing tooling and the tooling errors.

I imagine we don't want things to fail due to 0x00 suffix, bare minimum being:

  • >80% of Amino DHT IPFS network (including IPFS Desktop and Brave) is Kubo
    • ipfs dag import should ignore suffix
  • reference CAR libraries ignore 0x00 by default
    • js-car (JS library used by things like custom Service Workers, Helia)
    • go-car v1 and v2 (GO libraries)
      • Caveat: I think @rvagg mentioned this may not be possible, because of Filecoin-specific logic present in the library?
  • CLI tools we recommend to developers, they will try to use these for debugging CAR responses with the suffix:

Author

Let's go extra mile here and elaborate what happens when CAR response with 0x00-prefixed suffix is parsed by existing CAR software.

It's a great idea to think about compatibility with existing & future tooling and clearly describe our thinking. 👍🏻

The most important aspect is avoiding the "0x00 insertion attack" vector. You can find more details in the section Zero-length-block insertion attacks (including the Filecoin-specific logic). I am cross-posting the mitigation I proposed:

Our proposal avoids this attack vector:

  • It does not change the current semantics of CARv1. Zero-length blocks remain invalid.
  • Instead, we treat the response body as a new container format combining the CARv1 file with additional data.
  • Clients must explicitly request this new container format. Existing clients not aware of the new metadata will not receive responses in the new format.

When developers use existing tooling, they will never receive a CAR file with the 0x00 suffix.

There are two major ways how a CAR with a 0x00 suffix can emerge:

  1. Somebody makes an HTTP request to a Trustless Gateway, explicitly asks to receive CAR with meta=eof+json, saves the response body to a .car file and forgets to extract the CAR payload from the container (remove the \x00{metadata} trailer).

  2. Somebody uses a tool that is aware of meta=eof+json. The tool opts into this new feature when requesting content from a Trustless Gateway, but does not extract the CAR payload from the container in the response body before returning the content back to the user.

I am arguing that (2) is a bug in the tooling, introduced by the change that modified Trustless Gateway requests to opt-into meta=eof+json, and therefore, the maintainers of that tool should fix that bug - make the tool adhere to spec.

Regarding (1): do you think this will happen frequently enough to justify the effort required to change all libraries you mentioned to start ignoring the 0x00 byte?

Maybe it's actually a good thing that the tooling reports an error because it tells the user they are using the new meta=eof+json feature incorrectly.

As an alternative to silently stripping the 0x00 suffix, the tooling can detect the situation where 0x00 is followed by a valid DAG-JSON object and report a more helpful error message to the user, advising them to either change the "accept" header in the request to the Trustless Gateway or else remove the 0x00 suffix (unpack CARv1 from the container format).

Thoughts?


go-car/cmd/car/inspect.go seems to always treat 0x00 as EOF, if I am reading the source code correctly:

https://github.com/ipld/go-car/blob/5c5d432d582564f88fd2124f2fce4f2f3e47a654/cmd/car/inspect.go#L26

	rd, err := carv2.NewReader(inStream, carv2.ZeroLengthSectionAsEOF(true))

js-car seems to always reject zero-length blocks:

https://github.com/ipld/js-car/blob/562c39266edda8422e471b7f83eadc8b7362ea0c/src/decoder.js#L94-L97

  let length = decodeVarint(await reader.upTo(8), reader)
  if (length === 0) {
    throw new Error('Invalid CAR section (zero length)')
  }

I guess I can test how existing tooling handles zero-length blocks and document this behaviour in the IPIP, so that we better understand the current landscape.


The new feature requires clients to explicitly ask the server to include the extra block via `Accept` header,
therefore the change is fully backwards-compatible for all existing gateway clients.

Gateways receiving requests for the CAR content type can ignore the `meta` parameter they don't
support and return back a response with one of the CAR content types they support. This makes the
proposed change backwards-compatible for existing gateways too.

All metadata fields are optional to allow different applications to experiment with different metadata. Future IPIPs may standardize metadata fields that are observed to be widely used.

### Security

#### Zero-length-block insertion attacks

The idea of using the zero-length block (a single byte `0x00`) to signal the end of the CARv1 stream has already been considered in the past.

> CARv1 is nicely sectioned, such that each section has a specific length, you know when it ends. In the [ZeroLengthSectionAsEOF](https://pkg.go.dev/github.com/ipld/go-car/v3#ZeroLengthSectionAsEOF) mode, when it gets to a new section and reads a 0x00, i.e. zero length (sections are prefixed with a length varint), it treats that as the end of the CAR. So all it takes with this turned on is to attach a 0x00 to the end of a stream and you get your EOF.
>
> The background for this is the power-of-two padding that is needed for a Filecoin sector — stick a CAR into the sector and fill it out with zeros but have no way of saying that the CAR is x-bytes long; hence the need for an EOF signal, which is this.

However, introducing a `0x00` into CARv1 spec would create a security vulnerability:
- Tools and services not aware of these new semantics will happily accept a CARv1 payload containing zero-length blocks in the middle.
- Tools and services treating `0x00` as EOF will discard the remaining blocks in such CARv1 file
after encountering the zero-length block.
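
The discrepancy between the two kinds of readers described above can be illustrated with a small sketch. It is not modelled on any particular library: `read_sections` is a hypothetical toy reader that only handles single-byte length prefixes, and the payloads are placeholders. With `zero_length_as_eof=False` it behaves like a lenient tool that skips empty sections; with `True` it stops at the first `0x00`.

```python
def read_sections(car: bytes, zero_length_as_eof: bool) -> list:
    """Return section payloads from a toy CAR (single-byte length prefixes only)."""
    sections, pos = [], 0
    while pos < len(car):
        length = car[pos]
        if length >= 0x80:
            raise ValueError("multi-byte varints are out of scope for this sketch")
        pos += 1
        if length == 0:
            if zero_length_as_eof:
                break  # this reader treats 0x00 as the end of the CAR
            continue  # this reader silently accepts the empty section
        sections.append(car[pos:pos + length])
        pos += length
    return sections


# A zero-length section inserted before the last block.
tampered = b"\x03hdr" + b"\x02aa" + b"\x00" + b"\x02bb"
print(read_sections(tampered, zero_length_as_eof=False))
print(read_sections(tampered, zero_length_as_eof=True))
```

The same bytes yield three sections for the first reader and two for the second, which is the mismatch an attacker could exploit.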

Our proposal avoids this attack vector:
- It does not change the current semantics of CARv1. Zero-length blocks remain invalid.
- Instead, we treat the response body as a new container format combining the CARv1 file with additional data.
- Clients must explicitly request this new container format. Existing clients not aware of the new metadata will not receive responses in the new format.

#### Denial of Service attacks

Computing the signature for the metadata block has a non-negligible performance cost. To mitigate DoS attacks, we designed the metadata to be highly cacheable. When a gateway receives two requests for the same content, it can return the same metadata block in both responses, including the signature. This allows gateway operators to deploy a traditional caching layer operating at the HTTP protocol level; the cache does not need to understand any specifics of the IPFS and Trustless Gateway protocols.

### Alternatives

#### HTTP Trailers

Instead of adding a new content type argument, we were considering sending the additional metadata
in HTTP response trailers. Unfortunately, HTTP trailers are not widely supported by the ecosystem.
Nginx proxy module discards them, [browser `Fetch API` does not allow JS clients to access trailer
headers](https://github.com/mdn/browser-compat-data/issues/14703), neither does the Rust `reqwest` client.

#### New Content-Type

We could introduce a new content type that is not CARv3, but a thin envelope
around CARv1 with the purpose of streaming over HTTP (e.g. `Content-Type:
application/vnd.ipld.car-stream`).

It would have three fields:
- `car-stream-header` (optional DAG-CBOR)
- `car` (same as `application/vnd.ipld.car;version=1`)
- `car-stream-end` (optional DAG-CBOR)

This will be enough to append a DAG-CBOR manifest at the end of the stream. It
would be effectively the same CAR byte stream, but with a different
`Content-Type`.

Upside of this solution:

- does not require registering new codec, or mixing data plane with control
plane, no sniffing the last DAG-CBOR block

Downsides of this solution:

- maintenance cost, requires duplicating all CAR-related tests and features
- ecosystem opportunity cost, in creating new content type, we increase
cognitive overhead for everyone working with IPFS over HTTP
- no backward-compatible interop with existing tools and gateways that only
speak `application/vnd.ipld.car`
- distracts us from working on things like large blocks and CARv3

#### Create CARv3

We could admit we've clearly hit the limits of what we can do with HTTP, CARv1 and CARv2, and stop abusing the existing CARv1 by mixing the data plane with the control plane.

Spend energy on creating CARv3 that solves the problems from "Motivation" section and more:
- optional index or key-value metadata before or after data
- native truncation detection and standardized error handling and passing during streaming
- support for things like [Large Blocks](https://discuss.ipfs.tech/t/supporting-large-ipld-blocks/15093/)

TODO: link to some public artifact about CARv3
Author

Flagging this TODO to show in the PR discussion.

Any suggestions for the artefacts I can link to?

Member

@aschmahmann do we have anything on GH?


#### Create a new multicodec for this metadata block

Initially, we proposed to create a new multicodec for this metadata block called `car-metadata`. This was ruled out due to some concerns that you can find documented [here](https://github.com/multiformats/multicodec/pull/334#issuecomment-1668086641).

#### Using CBOR instead of JSON for the metadata block

We could use CBOR instead of JSON for the metadata block. However, it was [decided](https://github.com/ipfs/specs/pull/431#issuecomment-1719634928) to favor human readability over byte count, since CBOR doesn't greatly reduce the number of bytes in a key-value map compared with JSON.

## Test fixtures

TBD

Using one CID, request the CAR data using various combinations of content type parameters.
Comment on lines +191 to +193
Author

Flagging this TODO to show in the PR discussion.


### Copyright

Copyright and related rights waived via [CC0](https://creativecommons.org/publicdomain/zero/1.0/).