URI Path Canonicalization
MOVING TO A PR. DO NOT EDIT!!!
Once agreed, the following text will be added to the Servlet specification at the start of section 12. See also issue 18 for background and discussion.
The process described here adapts and extends the URI canonicalization process described in RFC 3986 to create a standard Servlet URI path canonicalization process that ensures that URIs can be mapped to Servlets, Filters and security constraints in an unambiguous manner. It is also intended to provide information to reverse proxy implementations so they are aware of how requests they pass to servlet containers will be processed.
Servlet containers may implement the standard Servlet URI path canonicalization in any manner they see fit as long as the end result is identical to the end result of the process described here. Servlet containers may provide container-specific configuration options to vary the standard canonicalization process. Any such variation may have security implications, and both Servlet container implementors and users are advised to make sure they understand the implications of any such container-specific canonicalization options.
For HTTP/1.1 requests, the URI is extracted from the request-target as defined by RFC 7230. URIs in origin-form or asterisk-form are passed unchanged to stage 2. URIs in absolute-form have the protocol and authority removed to convert them to origin-form and are then passed to stage 2. URIs in authority-form are outside the scope of this specification.
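The handling of the request-target forms can be sketched as follows. This is a minimal illustration in Python rather than container code; the function name `to_origin_form` is hypothetical, and mapping an empty path to "/" is an assumption taken from the origin-form rules in RFC 7230, not from the text above.

```python
def to_origin_form(request_target: str) -> str:
    """Return the URI that would be passed to stage 2 (illustrative only)."""
    # origin-form ("/path?query") and asterisk-form ("*") pass through unchanged.
    if request_target == "*" or request_target.startswith("/"):
        return request_target

    # absolute-form: "<scheme>://<authority><path>?<query>"
    scheme, sep, rest = request_target.partition("://")
    if not sep or not scheme:
        # e.g. authority-form ("host:port"), which is out of scope here.
        raise ValueError("request-target form not handled: " + request_target)

    # The authority ends at the first "/" or "?" (or at the end of the string).
    cut = min((i for i in (rest.find("/"), rest.find("?")) if i != -1),
              default=len(rest))
    tail = rest[cut:]
    # Assumption: an empty path becomes "/", as RFC 7230 requires for origin-form.
    return tail if tail.startswith("/") else "/" + tail
```

For example, `to_origin_form("http://example.com/a/b?x=1")` returns `"/a/b?x=1"`.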
For HTTP/2 requests, the URI is the :path pseudo-header as defined by RFC 7540 and is passed unchanged to stage 2.
Containers may support other protocols. Containers should extract an appropriate URI for the request from the protocol and pass it to stage 2.
The URI is split at the first occurrence of a '?' character into path and query. The query is preserved for later handling and the following steps are applied to the path. A fragment in the path is indicated by the first occurrence of a '#' character. The '#' character and the fragment that follows it are removed from the path and discarded.
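A minimal sketch of this split, in Python for illustration only. The function name `split_uri` is hypothetical, and returning `None` when no '?' is present is an assumption made here to distinguish "no query" from "empty query".

```python
def split_uri(uri: str):
    """Split a URI into (path, query), discarding any fragment in the path."""
    # Split at the first '?'; anything after it is the query, kept for later.
    path, sep, query = uri.partition("?")
    # The first '#' in the path and everything after it is discarded.
    path = path.partition("#")[0]
    return path, (query if sep else None)
```

Under this reading, `split_uri("/a/b?x=1?y=2")` yields `("/a/b", "x=1?y=2")`, because only the first '?' separates path from query.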
Characters other than '/', ';' and '%' that are encoded in %nn form are decoded, and the resulting octet sequence is treated as UTF-8 and converted to a character sequence.
Note that the special characters cannot appear inside a multi-byte UTF-8 sequence, as every octet of such a sequence has its high bit set (that is, it is negative when treated as a signed byte).
Note that this is not the set of reserved characters defined by RFC 3986, as that set does not include '%' and includes many characters we don't care about. Avoiding a second decoding is worthwhile.
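The selective decoding described above might look like the following sketch. It is written in Python for illustration; the name `decode_non_special` is hypothetical, and letting invalid UTF-8 raise an error is an assumption, since the text does not say how such input is handled.

```python
import re

# Octets that stay percent-encoded at this stage: '/', ';' and '%'.
_KEEP_ENCODED = {0x2F, 0x3B, 0x25}
# A run of consecutive %nn escapes, so multi-byte UTF-8 is decoded as a unit.
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

def decode_non_special(path: str) -> str:
    def decode_run(match):
        out = []
        pending = bytearray()

        def flush():
            if pending:
                # Assumption: invalid UTF-8 raises UnicodeDecodeError.
                out.append(pending.decode("utf-8"))
                pending.clear()

        text = match.group(0)
        for i in range(0, len(text), 3):
            octet = int(text[i + 1:i + 3], 16)
            if octet in _KEEP_ENCODED:
                flush()
                out.append(text[i:i + 3])  # leave the escape untouched
            else:
                pending.append(octet)
        flush()
        return "".join(out)

    return _ESCAPE_RUN.sub(decode_run, path)
```

Because "%25" is left encoded, an input such as "/%2561" stays "/%2561" rather than being decoded twice into "/a".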
WARNING: Swapping the order of stage 3 and stage 4 may be significant. Consider "/aaa/bbb//../".
Any sequence of more than one "/" character in the URI must be replaced with a single "/".
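The following sketch shows the collapse rule and why ordering matters for the example "/aaa/bbb//../". It is illustrative Python only; both function names are hypothetical, and `pop_parent_once` is a deliberately naive stand-in that allows an empty segment before "..", which is what an implementation would see if "/" sequences were not collapsed first.

```python
import re

def collapse_slashes(path: str) -> str:
    # Replace any run of two or more "/" with a single "/".
    return re.sub(r"/{2,}", "/", path)

def pop_parent_once(path: str) -> str:
    # Naive: replace the first "/" segment "/../" with "/",
    # where the segment is allowed to be empty.
    return re.sub(r"/[^/]*/\.\./", "/", path, count=1)

path = "/aaa/bbb//../"
collapse_first = pop_parent_once(collapse_slashes(path))  # "/aaa/"
collapse_last = collapse_slashes(pop_parent_once(path))   # "/aaa/bbb/"
```

The two orders produce different paths, so the order of the steps has to be fixed for the result to be unambiguous.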
URIs that contain segments of the following forms must be rejected with a 400 response:
".." sub-delim *(pchar)
"." sub-delim *(pchar)
Sequences of the form "/./" must be replaced with "/".
Sequences of the form "/" segment "/../" must be replaced with "/". If there is no preceding segment for a ".." segment then return a 400 response.
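The dot-segment rules above, including the rejection of segments such as "..;x", can be sketched as below. This is illustrative Python under stated assumptions: `BadRequest` is a hypothetical stand-in for sending a 400 response, the input is assumed to start with "/" and to have had repeated "/" collapsed already, and a trailing "." or ".." without a following "/" is handled as in RFC 3986 section 5.2.4, which the text above does not cover explicitly.

```python
SUB_DELIMS = "!$&'()*+,;="  # sub-delims as defined by RFC 3986

class BadRequest(Exception):
    """Hypothetical stand-in for rejecting the request with a 400 response."""

def is_ambiguous_dot_segment(segment: str) -> bool:
    # True for ".." or "." immediately followed by a sub-delim, e.g. "..;x".
    for dots in ("..", "."):
        rest = segment[len(dots):]
        if segment.startswith(dots) and rest and rest[0] in SUB_DELIMS:
            return True
    return False

def remove_dot_segments(path: str) -> str:
    segments = path.split("/")[1:]  # drop the empty string before the leading "/"
    out = []
    for seg in segments:
        if is_ambiguous_dot_segment(seg):
            raise BadRequest("ambiguous dot segment: " + seg)
        if seg == ".":
            continue
        if seg == "..":
            if not out:
                raise BadRequest("no preceding segment for '..'")
            out.pop()
            continue
        out.append(seg)
    result = "/" + "/".join(out)
    # Assumption: a trailing "." or ".." leaves a trailing "/" (RFC 3986 style).
    if segments and segments[-1] in (".", "..") and not result.endswith("/"):
        result += "/"
    return result
```

For example, `remove_dot_segments("/a/b/../c")` returns `"/a/c"`, while `remove_dot_segments("/../a")` and `remove_dot_segments("/a/..;/b")` both raise `BadRequest`.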
Sequences of the form "/" *(unreserved / pct-encoded / ":" / "@") sub-delim *(pchar) "/" must have the characters from and including the sub-delim to the end of the segment removed.
TODO How do we handle URIs like /foo/;/bar? I think as currently written we end up with /foo//bar?
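A literal reading of the path parameter rule can be sketched as follows, in illustrative Python. The name `strip_path_parameters` is hypothetical, and applying the rule to the last segment as well (which has no trailing "/") is an assumption. Note that, as written, the rule truncates at any sub-delim, not only at ';'.

```python
import re

# Any RFC 3986 sub-delim: "!" "$" "&" "'" "(" ")" "*" "+" "," ";" "="
_SUB_DELIM = re.compile(r"[!$&'()*+,;=]")

def strip_path_parameters(path: str) -> str:
    # In each segment, drop everything from the first sub-delim onwards.
    return "/".join(
        _SUB_DELIM.split(segment, maxsplit=1)[0]
        for segment in path.split("/")
    )
```

With this sketch, `strip_path_parameters("/foo/;/bar")` returns `"/foo//bar"`, which matches the concern raised in the TODO above.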
Any remaining %nn sequences should be decoded, although some containers may be configured to leave some specific characters encoded (e.g. the characters '/' and '%' may be left encoded by some container configurations).
The decoded path is used to map the request to a context and resource within the context. This form of the URI path is used for all subsequent mapping (web applications, servlet, filters and security constraints).
If suspicious sequences are discovered during the prior decoding steps, the request can be rejected with a 400 (Bad Request) response using the error handling of the matched context.
By default the set of rejected sequences must include:
- %2F, %2f
- /.;
- /..;
- /%2E, /%2e
- /%2E%2E, /%2E%2e, /%2e%2E, /%2e%2e
TODO should we also by default reject '\' and '%5c' ?
TODO should we also by default reject non visible and/or control characters ?
TODO if %2F is allowed, we may now have double '/', '/./' and '/../' segments in the URL, should stage 3 and 4 be re-run if this is allowed?
A container or context may be configured to have a different set of rejected sequences.
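A check for rejected sequences with a configurable set might be sketched like this, in illustrative Python. The names `DEFAULT_REJECTED` and `find_rejected_sequence` are hypothetical, and running the check against the path as received, before any decoding, is an assumption; the text does not fix the point at which the check happens.

```python
# Lower-case forms of the default sequences; matching ignores the case of
# the hex digits. "/%2e%2e" is already covered by "/%2e" but is listed for clarity.
DEFAULT_REJECTED = ("%2f", "/.;", "/..;", "/%2e", "/%2e%2e")

def find_rejected_sequence(raw_path: str, rejected=DEFAULT_REJECTED):
    """Return the first rejected sequence found in raw_path, or None."""
    lowered = raw_path.lower()
    for sequence in rejected:
        if sequence in lowered:
            return sequence
    return None
```

A container or context could pass a different tuple as `rejected` to model a non-default configuration; a non-`None` result would correspond to the 400 response.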