Recommend URI syntax normalization + scheme normalization for identifiers? (Also consider query component rules) #483

Open
trwnh opened this issue Nov 18, 2024 · 1 comment
Labels
Needs Primer Page Needs a page in the ActivityPub primer Next version Normative change, requires new version of spec

trwnh commented Nov 18, 2024

Related:

ActivityPub developers and implementers using HTTPS identifiers ought to be aware of the "normalization and comparison" considerations for HTTPS URIs.

For HTTPS scheme normalization, refer to RFC 9110 Section 4.2.3: https://datatracker.ietf.org/doc/html/rfc9110#section-4.2.3

For URI syntax normalization, refer to RFC 3986 Section 6: https://datatracker.ietf.org/doc/html/rfc3986#section-6

Some common considerations in imperative form

  • For https identifiers, transform the scheme and host to lowercase. (Any userinfo in the authority component is case-sensitive and is left as-is.)
    • Example: HTTPs://Domain.EXAMPLE SHOULD be normalized to https://domain.example (excluding other considerations)
  • For https identifiers, if the port is 443 or empty, then omit the port from the identifier.
    • Example: https://domain.example:443 SHOULD be normalized to https://domain.example (excluding other considerations)
    • Example: https://domain.example: SHOULD be normalized to https://domain.example (excluding other considerations)
  • For https identifiers, if the identifier has an empty path component, then use / for the path component.
    • Example: https://domain.example SHOULD be normalized to https://domain.example/ (excluding other considerations)
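  • Percent-encoded octets that correspond to unreserved characters SHOULD be decoded before comparison. Other percent-encoded octets stay encoded, with their hexadecimal digits in uppercase.
    • Example: https://domain.example/%7euser SHOULD be normalized to https://domain.example/~user (excluding other considerations)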
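
As a rough illustration, the considerations above could be sketched in Python as follows. The function name `normalize_https_id` is hypothetical, and the sketch assumes the percent-encoding handling of RFC 3986 Section 6.2.2 (decode only unreserved characters, uppercase the remaining hex digits). It does not attempt dot-segment removal or internationalized host names, so treat it as a starting point rather than a complete normalizer.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986 Section 2.3.
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")
# Splits "host", "host:", "host:443" and "[::1]:443" into host and port.
_HOSTPORT = re.compile(r"^(.*?)(?::(\d*))?$")


def _normalize_pct(text):
    """Decode percent-encoded unreserved characters; uppercase other escapes."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()
    return _PCT.sub(repl, text)


def normalize_https_id(uri):
    """Hypothetical helper: normalize an https identifier as described above."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    if scheme != "https":
        return uri  # only https identifiers are handled in this sketch

    # Userinfo (if any) is case-sensitive, so only the host is lowercased.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    match = _HOSTPORT.match(hostport)
    if match is None:
        return uri  # unexpected authority syntax: leave the identifier alone
    host, port = match.group(1).lower(), match.group(2)

    # Omit the port when it is empty or the https default.
    netloc = host if port in (None, "", "443") else host + ":" + port
    netloc = userinfo + at + netloc

    # An empty path becomes "/".
    path = _normalize_pct(parts.path) or "/"
    query = _normalize_pct(parts.query)
    return urlunsplit((scheme, netloc, path, query, parts.fragment))
```

Each example in the list above maps to `https://domain.example/` under this sketch, while a non-default port such as `:8443` is preserved.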

Considerations that do not exist at the URI/HTTPS level and must be addressed at the protocol level

Query component processing

Per https://datatracker.ietf.org/doc/html/rfc3986#section-3.4:

query         = *( pchar / "/" / "?" )
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

Query components are by default opaque. At the level of an HTTPS URI, the first unencoded ? delimits the query component, which ends only when encountering a # (delimiting the start of the fragment component) or the end of the URI.
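
To make the delimiting rule concrete, here is a minimal sketch in Python that treats the query as an opaque string. The helper name `split_query` is made up for illustration. It shows that a second `?` belongs to the query, and that a `?` after `#` belongs to the fragment.

```python
def split_query(uri):
    """Hypothetical helper: return (before_query, query, fragment).

    query and fragment are None when the component is absent.
    """
    # The first "#" starts the fragment, so cut it off first.
    rest, hash_mark, fragment = uri.partition("#")
    # The first "?" in what remains starts the query; later "?" are query data.
    before, question_mark, query = rest.partition("?")
    return (
        before,
        query if question_mark else None,
        fragment if hash_mark else None,
    )


before, query, fragment = split_query(
    "https://domain.example/a?x=1?y=2&z#frag?not-a-query"
)
# query is the opaque string "x=1?y=2&z"; fragment is "frag?not-a-query"
```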

Purely by convention, application servers commonly parse "query parameters" out of the query component of the URI. Arguably this is a misfeature and an antipattern, since the ordering of such parameters should have no bearing on the identity of the resource: /?foo=1&bar=2 is semantically equivalent to /?bar=2&foo=1 when used to extract request parameters, yet as identifiers the two are completely different. Such "request parameters" should go on the request itself, not on the identifier. Unfortunately, the practice of using = and & to parse a query component as a series of request parameters is very widespread (although at some point around the era of HTML4 it was recommended that the delimiter between such "parameters" be ; instead of &).
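
The mismatch can be shown with a small Python sketch. It uses the standard library's `parse_qsl`, which follows the `=` and `&` convention; the variable names are only for illustration.

```python
from urllib.parse import parse_qsl, urlsplit

a = "https://domain.example/?foo=1&bar=2"
b = "https://domain.example/?bar=2&foo=1"

# Compared as identifiers (plain strings), the two URIs differ.
same_identifier = a == b

# Compared as sets of request parameters, they carry the same information.
params_a = sorted(parse_qsl(urlsplit(a).query))
params_b = sorted(parse_qsl(urlsplit(b).query))
same_parameters = params_a == params_b
```

Here `same_identifier` is `False` while `same_parameters` is `True`, which is exactly the gap between "identifier" and "request parameters" described above.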

ActivityPub should probably also warn about this or give guidance that query components in id are opaque and SHOULD NOT be parsed as parameters for the purposes of reference or equivalence.

If ActivityPub ever prescribed specific query parameter processing, the ordering of such query parameters would need to be canonicalized by some kind of normalization algorithm.

At the very least, implementers who use the query component to encode request parameters SHOULD canonicalize the order of those parameters when normalizing their URIs, before including them as id on any object(s).
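
One possible way to do this, sketched in Python: split the raw query on `&` and sort the pairs by key with a stable sort, without decoding anything, so the existing percent-encoding is untouched and repeated keys keep their relative order. The function name `sort_query_parameters` and the choice of ordering are assumptions made for illustration; nothing in ActivityPub prescribes this algorithm.

```python
from urllib.parse import urlsplit, urlunsplit


def sort_query_parameters(uri):
    """Hypothetical helper: reorder "&"-separated pairs by key."""
    parts = urlsplit(uri)
    if not parts.query:
        return uri  # nothing to reorder

    pairs = parts.query.split("&")
    # list.sort is stable, so pairs sharing a key keep their original order.
    pairs.sort(key=lambda pair: pair.partition("=")[0])
    return urlunsplit(parts._replace(query="&".join(pairs)))


canonical = sort_query_parameters("https://domain.example/?foo=1&bar=2")
```

A producer would apply this once, when minting the id; consumers would still compare the resulting identifiers as opaque strings.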

Recommendations

  • Needs Primer -- https://www.w3.org/wiki/ActivityPub/Primer/Object_identifiers is a natural target for inclusion
  • Does this rise to the level of recommendation in the spec itself, either via a non-normative note or possibly even via a normative SHOULD? Naive implementations might not consider the importance of producing identifiers in normal form. We could potentially avoid a whole class of issues if we use normative language here.
    • Query component considerations in particular seem like something that should be warned against.
@evanp evanp added Needs Primer Page Needs a page in the ActivityPub primer Next version Normative change, requires new version of spec labels Nov 22, 2024
@bobwyman

Section 6 of RFC 3986, Uniform Resource Identifier (URI): Generic Syntax, has traditionally been the first place to look for URI normalization and comparison guidance. However, the WHATWG's URL Standard is intended to obsolete both RFC 3986 and RFC 3987, so it may be a better resource going forward.

Also, the Wikipedia page on URI Normalization lists some normalization rules that are or have been employed.

3 participants