URI Path Canonicalization

Greg Wilkins edited this page Oct 7, 2021 · 17 revisions

MOVING TO A PR. DO NOT EDIT!!!

Once agreed, the following text will be added to the Servlet specification at the start of section 12. See also issue 18 for background and discussion.

URI Path Canonicalization

The process described here adapts and extends the URI canonicalization process described in RFC 3986 to create a standard Servlet URI path canonicalization process that ensures that URIs can be mapped to Servlets, Filters and security constraints in an unambiguous manner. It is also intended to provide information to reverse proxy implementations so they are aware of how requests they pass to servlet containers will be processed.

Servlet containers may implement the standard Servlet URI path canonicalization in any manner they see fit as long as the end result is identical to the end result of the process described here. Servlet containers may provide container specific configuration options to vary the standard canonicalization process. Any such variations may have security implications and both Servlet container implementors and users are advised to be sure that they understand the implications of any such container specific canonicalization options.

Stage 0: Identify the URI

Stage 0a: HTTP/1.1 requests

The URI is extracted from the request-target as defined by RFC 7230. URIs in origin-form or asterisk-form are passed unchanged to stage 1. URIs in absolute-form have the protocol and authority removed to convert them to origin-form and are then passed to stage 1. URIs in authority-form are outside the scope of this specification.
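
As a rough illustration of Stage 0a, the sketch below reduces a request-target to origin-form. The function name to_origin_form is hypothetical. Two behaviours are assumptions rather than part of the text above: an absolute-form URI with an empty path is mapped to "/" (following the RFC 7230 rule for origin-form), and authority-form is signalled with a ValueError because it is out of scope.

```python
def to_origin_form(request_target: str) -> str:
    """Return the request-target in origin-form (or "*" for asterisk-form)."""
    # origin-form and asterisk-form are passed on unchanged.
    if request_target == "*" or request_target.startswith("/"):
        return request_target

    # absolute-form: strip the scheme and the authority.
    _scheme, sep, rest = request_target.partition("://")
    if not sep:
        # Assumption: anything else is treated as authority-form (out of scope).
        raise ValueError("authority-form is out of scope: %r" % request_target)

    # The authority ends at the first "/" or "?" (or at the end of the string).
    cut = len(rest)
    for ch in "/?":
        index = rest.find(ch)
        if index != -1:
            cut = min(cut, index)
    tail = rest[cut:]

    # Assumption: an empty path becomes "/" so the result is valid origin-form.
    return tail if tail.startswith("/") else "/" + tail
```

For example, to_origin_form("http://example.com/a/b?x=1") returns "/a/b?x=1".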

Stage 0b: HTTP/2 requests

The URI is the :path pseudo-header as defined by RFC 7540 and is passed unchanged to stage 1.

Stage 0c: Other protocols

Containers may support other protocols. Containers should extract an appropriate URI for the request from the protocol and pass it to stage 1.

Stage 1: Separation of query and fragment

The URI is split at the first occurrence of a '?' character into a path and a query. The query is preserved for later handling and the following steps are applied to the path. A fragment in the path is indicated by the first occurrence of a '#' character. Any '#' character and the fragment that follows it are removed from the path and discarded.
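
The split described above can be sketched as follows. The function name split_query_and_fragment is hypothetical, and the sketch follows the text literally: the query is everything after the first '?', and the fragment is stripped from the path only, so a '#' that appears after the '?' stays in the preserved query.

```python
from typing import Optional, Tuple


def split_query_and_fragment(uri: str) -> Tuple[str, Optional[str]]:
    """Split a URI into (path, query); query is None when there is no '?'."""
    # Split at the first '?': everything after it is the query.
    path, sep, query = uri.partition("?")
    preserved_query = query if sep else None

    # Drop the first '#' in the path and everything that follows it.
    path = path.partition("#")[0]
    return path, preserved_query
```

For example, split_query_and_fragment("/a/b#top?x=1") returns ("/a/b", "x=1").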

Stage 2: Decoding of non-special characters

Characters other than /, ; and % that are encoded in %nn form are decoded, and the resulting octet sequence is treated as UTF-8 and converted to a character sequence.

Note that the special characters cannot be part of a multi-octet UTF-8 sequence, as every octet in such a sequence has its high bit set (that is, it is negative when treated as a signed byte).

Note that these special characters are not the reserved characters defined by RFC 3986: that set does not include % and does include many characters that are irrelevant here. Avoiding a second decoding is worthwhile.
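
A minimal sketch of this partial decoding is shown below. The name decode_non_special is hypothetical, and raising ValueError for malformed %nn sequences or invalid UTF-8 is an assumption: the text above does not say how such input is handled.

```python
import string

# Octets for '/', ';' and '%', which stay percent-encoded at this stage.
_KEEP_ENCODED = {0x2F, 0x3B, 0x25}


def decode_non_special(path: str) -> str:
    """Decode every %nn sequence except those encoding '/', ';' or '%'."""
    out = bytearray()
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "%":
            hex_pair = path[i + 1:i + 3]
            if len(hex_pair) != 2 or any(c not in string.hexdigits for c in hex_pair):
                raise ValueError("malformed percent-encoding at index %d" % i)
            octet = int(hex_pair, 16)
            if octet in _KEEP_ENCODED:
                # Leave the three-character %nn text exactly as received.
                out += path[i:i + 3].encode("ascii")
            else:
                out.append(octet)
            i += 3
        else:
            out += ch.encode("utf-8")
            i += 1
    # Interpret the collected octets as UTF-8 (raises on invalid sequences).
    return out.decode("utf-8")
```

For example, decode_non_special("/caf%C3%A9/a%2Fb") returns "/café/a%2Fb": the two-octet sequence is decoded while the encoded '/' is left alone.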

Stage 3: Collapse sequences of multiple "/" characters

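WARNING: Swapping the order of stage 3 and stage 4 may be significant. Consider "/aaa/bbb//../": collapsing the slashes first gives "/aaa/bbb/../" and then "/aaa/", whereas removing dot-segments first lets the ".." consume the empty segment and gives "/aaa/bbb/".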

Any sequence of more than one "/" character in the URI must be replaced with a single "/".
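
The collapse step is small enough to sketch in one line. The function name collapse_slashes is hypothetical.

```python
import re


def collapse_slashes(path: str) -> str:
    """Replace every run of two or more '/' characters with a single '/'."""
    return re.sub(r"/{2,}", "/", path)
```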

Stage 4: Remove dot-segments

URIs that contain segments of the following forms must be rejected with a 400 response:

  • ".." sub-delim *(pchar)
  • "." sub-delim *(pchar)

Sequences of the form "/./" must be replaced with "/". Sequences of the form "/" segment "/../" must be replaced with "/". If there is no preceding segment for a ".." segment then return a 400 response.
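
The rules above can be sketched segment by segment. The names remove_dot_segments and BadRequest are hypothetical, with BadRequest standing in for the 400 response. The sub-delim set is taken from RFC 3986. Treating a trailing "/." or "/.." like "/./" or "/../" is an assumption borrowed from RFC 3986, since the text above only describes the forms with a trailing slash.

```python
SUB_DELIMS = "!$&'()*+,;="  # sub-delims as defined by RFC 3986


class BadRequest(ValueError):
    """Hypothetical stand-in for rejecting the request with a 400 response."""


def remove_dot_segments(path: str) -> str:
    if not path.startswith("/"):
        return path  # e.g. asterisk-form; nothing to do
    segments = path.split("/")[1:]
    out = []
    for index, seg in enumerate(segments):
        last = index == len(segments) - 1
        # Reject "." or ".." immediately followed by a sub-delim, e.g. "..;x".
        for dots in ("..", "."):
            if seg.startswith(dots) and len(seg) > len(dots) and seg[len(dots)] in SUB_DELIMS:
                raise BadRequest("ambiguous dot-segment: %r" % seg)
        if seg == ".":
            if last:
                out.append("")  # keep the trailing slash
        elif seg == "..":
            if not out:
                raise BadRequest("no preceding segment for '..'")
            out.pop()
            if last:
                out.append("")  # keep the trailing slash
        else:
            out.append(seg)
    return "/" + "/".join(out)
```

For example, remove_dot_segments("/a/b/../c") returns "/a/c", while "/../a" and "/a/..;x/b" raise BadRequest.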

Stage 5: Removal of path parameters

Sequences of the form "/" *(unreserved / pct-encoded / ":" / "@") sub-delim *(pchar) "/" must have the characters from and including the sub-delim to the end of the segment removed.

TODO How do we handle URIs like /foo/;/bar? I think as currently written we end up with /foo//bar ?
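
The following sketch applies the Stage 5 rule as written and reproduces the case raised in the TODO. The function name remove_path_parameters is hypothetical. It makes two assumptions: the final segment is trimmed as well, even though the rule as written ends with a "/", and the delimiter set is a parameter so that a reading restricted to ";" can be tried by passing delims=";".

```python
SUB_DELIMS = "!$&'()*+,;="  # sub-delims as defined by RFC 3986


def remove_path_parameters(path: str, delims: str = SUB_DELIMS) -> str:
    """Trim each segment from its first delimiter to the end of the segment."""
    cleaned = []
    for seg in path.split("/"):
        positions = [seg.find(d) for d in delims]
        cut = min((p for p in positions if p != -1), default=len(seg))
        cleaned.append(seg[:cut])
    return "/".join(cleaned)
```

With this sketch, remove_path_parameters("/foo/;/bar") returns "/foo//bar", which matches the concern in the TODO above.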

Stage 6: Final Decoding of remaining %nn sequences

Any remaining %nn sequences should be decoded, although some containers may be configured to leave some specific characters encoded (e.g. the characters '/' and '%' may be left encoded by some container configurations).
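
A minimal sketch of the final decoding, with an optional set of characters to leave encoded, might look like this. The name final_decode and the keep_encoded parameter are hypothetical. The sketch assumes Stage 2 has already run, so the only sequences still present encode ASCII characters.

```python
import re

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def final_decode(path: str, keep_encoded=frozenset()) -> str:
    """Decode remaining %nn sequences, except characters in keep_encoded."""

    def replace(match):
        octet = int(match.group(1), 16)
        ch = chr(octet)
        # Non-ASCII octets are not expected after Stage 2; leave them as-is.
        if octet >= 0x80 or ch in keep_encoded:
            return match.group(0)
        return ch

    # re.sub scans the input once, so "%252F" becomes "%2F" and is not
    # decoded a second time.
    return _PCT.sub(replace, path)
```

For example, final_decode("/a%2Fb%25c") returns "/a/b%c", and passing keep_encoded={"/", "%"} returns the input unchanged.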

Stage 7: Mapping URI to context and resource

The decoded path is used to map the request to a context and resource within the context. This form of the URI path is used for all subsequent mapping (web applications, servlet, filters and security constraints).

Stage 8: Rejecting Suspicious Sequences

If suspicious sequences are discovered during the prior decoding steps, the request can be rejected with a 400 Bad Request response using the error handling of the matched context.
By default, the set of rejected sequences must include:

  • %2F, %2f
  • /.;
  • /..;
  • /%2E, /%2e
  • /%2E%2E, /%2E%2e, /%2e%2E, /%2e%2e

TODO should we also by default reject '\' and '%5c' ?

TODO should we also by default reject non-visible and/or control characters ?

TODO if %2F is allowed, we may now have double '/', '/./' and '/../' segments in the URL. Should stages 3 and 4 be re-run if this is allowed?

A container or context may be configured to have a different set of rejected sequences.
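
One way to picture the default check and its configurability is the sketch below. The names DEFAULT_REJECTED and find_suspicious are hypothetical. Scanning the path as it was received, before any decoding, is an assumption, since the text above only says the sequences are discovered during the prior decoding steps.

```python
# The default list above, lower-cased; matching is done on a lower-cased copy
# of the path so that %2F/%2f and %2E/%2e are treated alike.
DEFAULT_REJECTED = ("%2f", "/.;", "/..;", "/%2e", "/%2e%2e")


def find_suspicious(raw_path: str, rejected=DEFAULT_REJECTED) -> list:
    """Return the rejected sequences that occur in the undecoded path."""
    lowered = raw_path.lower()
    return [seq for seq in rejected if seq in lowered]
```

A container would respond with 400 when the returned list is non-empty. A container or context with a different policy could pass its own rejected tuple, for example one that also contains "%5c".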