Recommend URI syntax normalization + scheme normalization for identifiers? (Also consider query component rules) #483

trwnh · 2024-11-18T06:14:15Z

For HTTPS scheme normalization, refer to RFC 9110 Section 4.2.3: https://datatracker.ietf.org/doc/html/rfc9110#section-4.2.3

For URI syntax normalization, refer to RFC 3986 Section 6: https://datatracker.ietf.org/doc/html/rfc3986#section-6

Some common considerations in imperative form

For https identifiers, transform the scheme and host to lowercase. (Any userinfo in the authority component is case-sensitive and is left as-is.)
- Example: HTTPs://Domain.EXAMPLE SHOULD be normalized to https://domain.example (excluding other considerations)
For https identifiers, if the port is 443 or empty, then omit the port from the identifier.
- Example: https://domain.example:443 SHOULD be normalized to https://domain.example (excluding other considerations)
- Example: https://domain.example: SHOULD be normalized to https://domain.example (excluding other considerations)
For https identifiers, if the identifier has an empty path component, then use / for the path component.
- Example: https://domain.example SHOULD be normalized to https://domain.example/ (excluding other considerations)
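Percent-encoded octets that correspond to unreserved characters (ALPHA, DIGIT, -, ., _, ~) SHOULD be decoded before comparison, and the hexadecimal digits of any remaining percent-encodings SHOULD be uppercased. Reserved characters must stay encoded, since decoding them can change the meaning of the URI.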
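As a rough illustration, the considerations above could be combined into one function. This is a minimal sketch, not a complete RFC 3986 / RFC 9110 normalizer: the name normalize_https_id is hypothetical, the URI is split with the regular expression from RFC 3986 Appendix B, percent-encodings are only decoded when they stand for unreserved characters (per RFC 3986 Section 6.2.2.2), and dot-segment removal and percent-encodings inside the host are deliberately left out.

```python
import re

# Splitting regex adapted from RFC 3986 Appendix B.
URI_RE = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$"
)
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def _normalize_pct(text):
    """Decode %XX only for unreserved characters; uppercase the rest."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, text)


def normalize_https_id(uri):
    """Hypothetical helper: normalize an https identifier (sketch only)."""
    scheme, authority, path, query, fragment = URI_RE.match(uri).groups()
    if scheme is None or scheme.lower() != "https" or authority is None:
        return uri  # anything that is not an https URI is out of scope here

    userinfo, at, hostport = authority.rpartition("@")
    host, colon, port = hostport.rpartition(":")
    if not colon or hostport.endswith("]"):
        # No port delimiter, or a bare IPv6 literal such as [::1].
        host, port = hostport, ""

    netloc = (userinfo + "@" if at else "") + host.lower()
    if port not in ("", "443"):  # drop an empty or default port
        netloc += ":" + port

    # An empty path becomes "/"; the query and fragment are kept as written,
    # apart from percent-encoding normalization.
    result = "https://" + netloc + (_normalize_pct(path) or "/")
    if query is not None:
        result += "?" + _normalize_pct(query)
    if fragment is not None:
        result += "#" + _normalize_pct(fragment)
    return result


print(normalize_https_id("HTTPs://Domain.EXAMPLE"))      # https://domain.example/
print(normalize_https_id("https://domain.example:443"))  # https://domain.example/
print(normalize_https_id("https://domain.example:"))     # https://domain.example/
```

Because all of the rules are applied together, each example ends with the trailing / that the individual examples above set aside under "excluding other considerations".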

Considerations that do not exist at the URI/HTTPS level and must be addressed at the protocol level

Query component processing

Per https://datatracker.ietf.org/doc/html/rfc3986#section-3.4:

query         = *( pchar / "/" / "?" )
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

Query components are by default opaque. At the level of an HTTPS URI, the first unencoded ? delimits the query component, which ends only when encountering a # (delimiting the start of the fragment component) or the end of the URI.
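The delimiting rule can be sketched in a few lines. The function name query_component is hypothetical; the sketch only shows where the query starts and ends, and returns it as an opaque string without interpreting its contents.

```python
def query_component(uri):
    """Return the raw query string, or None if the URI has no query."""
    # The fragment starts at the first "#", so cut it off first.
    before_fragment, _, _ = uri.partition("#")
    # The query starts after the first "?" that precedes any "#".
    _, question_mark, query = before_fragment.partition("?")
    return query if question_mark else None


print(query_component("https://domain.example/?foo=1&bar=2#frag"))  # foo=1&bar=2
print(query_component("https://domain.example/?a?b/c"))             # a?b/c
print(query_component("https://domain.example/#x?y"))               # None
```

Note that a later ? or / inside the query is ordinary query content, and a ? that appears after # belongs to the fragment.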

Purely by convention, it is common for application servers to parse "query parameters" out of the query component of the URI. Arguably this is a misfeature and an antipattern. When the query is used to extract request parameters, the ordering of those parameters has no bearing on the result: /?foo=1&bar=2 is semantically equivalent to /?bar=2&foo=1. As identifiers, however, the two are completely different, so such "request parameters" should go on the request itself, not on the identifier. Unfortunately, the practice of using = and & to parse a query component as a series of request parameters is very widespread (although at some point around the era of HTML4 it was recommended that the delimiter between such "parameters" be ; instead of &).

ActivityPub should probably also warn about this or give guidance that query components in id are opaque and SHOULD NOT be parsed as parameters for the purposes of reference or equivalence.

If ActivityPub ever prescribes specific query parameter processing, then the ordering of such query parameters will need to be canonicalized with some kind of normalization algorithm.

At the very least, implementers who use the query component to encode request parameters SHOULD normalize/canonicalize the order of these parameters when normalizing/canonicalizing their URIs, before including them as id on any object(s).
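One possible canonicalization is sketched below. Everything about it is an assumption rather than something ActivityPub or any RFC prescribes: the name canonicalize_query_order is hypothetical, parameters are assumed to be &-delimited key=value pairs, and they are sorted by their raw (still encoded) key in code point order, with repeated keys keeping their relative order.

```python
def canonicalize_query_order(uri):
    """Hypothetical helper: sort &-delimited query parameters by key."""
    before_fragment, hash_mark, fragment = uri.partition("#")
    base, _, query = before_fragment.partition("?")
    if not query:
        return uri  # no query, or an empty one: nothing to reorder

    # sorted() is stable, so repeated keys keep their relative order.
    pairs = sorted(query.split("&"), key=lambda pair: pair.partition("=")[0])
    return base + "?" + "&".join(pairs) + hash_mark + fragment


a = "https://domain.example/?foo=1&bar=2"
b = "https://domain.example/?bar=2&foo=1"
print(a == b)                                                      # False
print(canonicalize_query_order(a) == canonicalize_query_order(b))  # True
```

The two inputs are different identifiers as plain strings and only coincide once the producer has applied the same ordering rule to both, which is why the ordering has to be fixed before the URI is used as an id.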

Recommendations

Needs Primer -- https://www.w3.org/wiki/ActivityPub/Primer/Object_identifiers is a natural target for inclusion
Does this rise to the level of recommendation in the spec itself, either via a non-normative note or possibly even via a normative SHOULD? Naive implementations might not consider the importance of producing identifiers in normal form. We could potentially avoid a whole class of issues if we use normative language here.
- Query component considerations in particular seem like something that should be warned against.


bobwyman · 2024-11-22T18:34:55Z

Section 6 of RFC 3986, Uniform Resource Identifier (URI): Generic Syntax, has traditionally been the first place to look for URI normalization and comparison guidance. However, it should be noted that the WHATWG URL Standard is intended to obsolete both RFC 3986 and RFC 3987. Thus, the WHATWG URL Standard may be a better resource going forward.

Also, the Wikipedia page on URI Normalization lists some normalization rules that are or have been employed.

evanp added the labels "Needs Primer Page" (Needs a page in the ActivityPub primer) and "Next version" (Normative change, requires new version of spec) on Nov 22, 2024
