Discrepancies of Percent Encoding for URLs

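After reviewing this previous SO question regarding percent encoding, I'm curious as to which styles of encoding are correct. The Wikipedia article on percent encoding alludes to using + instead of %20 for spaces, while still having an application/x-www-form-urlencoded content type.

This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded. What are the preferred differences between path segments and query strings? Details and references for this specification would be greatly appreciated.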


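Note: I assume that non-alphanumeric characters will be encoded via UTF-8, in that each octet of a character becomes a %XX string. Please correct me if I'm wrong here (e.g. latin-1 instead of utf-8), but I'm more interested in the differences between the encodings of different parts of a URL.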

This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded.

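It depends not only on the particular URL component, but also on the circumstances in which that component is populated with data.

Using '+' to encode a space character is specific to the application/x-www-form-urlencoded format, which applies to webform data submitted in an HTTP request. It does not apply to a URL itself.

The application/x-www-form-urlencoded format is formally defined by the W3C in the HTML specifications. Here is the definition in HTML 4.01: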

Section 17.13.3 Processing form data, Step four: Submit the encoded form data set

This specification does not specify all valid submission methods or content types that may be used with forms. However, HTML 4 user agents must support the established conventions in the following cases:

• If the method is "get" and the action is an HTTP URI, the user agent takes the value of action, appends a `?' to it, then appends the form data set, encoded using the "application/x-www-form-urlencoded" content type. The user agent then traverses the link to this URI. In this scenario, form data are restricted to ASCII codes.

• If the method is "post" and the action is an HTTP URI, the user agent conducts an HTTP "post" transaction using the value of the action attribute and a message created according to the content type specified by the enctype attribute.

Section 17.13.4 Form content types, application/x-www-form-urlencoded

This is the default content type. Forms submitted with this content type must be encoded as follows:

1. Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').

2. The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.

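The corresponding HTML5 definitions (Section 4.10.22.3 Form submission algorithm and Section 4.10.22.6 URL-encoded form data) are more refined and detailed, but for the purposes of this discussion, the gist is roughly the same.

So, when webform data is submitted via an HTTP GET request instead of a POST request, the webform data is encoded using application/x-www-form-urlencoded and placed as-is in the URL query component.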
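
As a rough illustration of the rules quoted above, Python's standard urllib.parse.urlencode produces application/x-www-form-urlencoded output. This is only a sketch: the field names and values are made up for the example, and Python is just one possible client, not something either specification calls for.

```python
from urllib.parse import urlencode, parse_qsl

# Hypothetical form data set; the field names and values are placeholders.
form = [("field", "hello world"), ("note", "line1\r\nline2")]

# urlencode() uses quote_plus() by default: space -> '+', CR LF -> %0D%0A.
encoded = urlencode(form)

# For a GET submission the encoded data set is appended after '?'.
url = "http://server/script?" + encoded

# parse_qsl() reverses the form encoding, turning '+' back into a space.
decoded = parse_qsl(encoded)
```

Here encoded comes out as field=hello+world&note=line1%0D%0Aline2, and decoded is equal to the original list of pairs.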

Per RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.

'+' is a reserved character:

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

The query component explicitly allows unencoded '+' characters, because it allows characters from sub-delims:

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

pct-encoded = "%" HEXDIG HEXDIG

pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"

query       = *( pchar / "/" / "?" )
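
One way to check these rules by hand is to transcribe the ABNF into a regular expression. The following is a minimal sketch in Python; the helper name is_valid_query is made up for this example, and the pattern covers only the query rule shown above, not full URI parsing.

```python
import re

# Character sets transcribed from the RFC 3986 ABNF quoted above.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|%[0-9A-Fa-f]{{2}})"

# query = *( pchar / "/" / "?" )
QUERY_RE = re.compile(rf"(?:{PCHAR}|[/?])*")


def is_valid_query(text: str) -> bool:
    """Return True if text matches the RFC 3986 query rule (hypothetical helper)."""
    return QUERY_RE.fullmatch(text) is not None


plus_ok = is_valid_query("field=hello+world")    # '+' comes from sub-delims
space_ok = is_valid_query("hello world")         # a raw space matches nothing
pct_ok = is_valid_query("hello%20world")         # pct-encoded is allowed
```

With this pattern, plus_ok and pct_ok are True while space_ok is False, which matches the reading of the grammar given here.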

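So, in the context of a webform submission, spaces are encoded using '+' before the data is put as-is into the query component. The URL syntax allows this, because the encoded form of application/x-www-form-urlencoded is compatible with the definition of the query component.

So, for example: http://server/script?field=hello+world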

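However, outside of a webform submission, a space character placed directly into the query component must be encoded using pct-encoded, because ' ' is not included in unreserved, is not included in sub-delims, and is not explicitly allowed by the query definition.

So, for example: http://server/script?hello%20world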
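
The two conventions can be compared side by side with Python's urllib.parse, where quote() follows the generic percent-encoding and quote_plus() follows the form convention. This is a small sketch under the assumption that Python is the tool at hand; the sample strings are arbitrary.

```python
from urllib.parse import quote, quote_plus

text = "hello world"

# Generic percent-encoding: the space becomes %20.
generic = quote(text, safe="")

# Form-style encoding: the space becomes '+'.
form_style = quote_plus(text)

# Because '+' means "space" in form data, a literal '+' has to be escaped.
literal_plus = quote_plus("1+1=2")

generic_url = "http://server/script?" + generic
```

The results are hello%20world, hello+world, and 1%2B1%3D2 respectively. The last one shows why a form decoder can safely treat every remaining '+' as a space.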

Similar rules also apply to the path component, because it uses pchar:

  path          = path-abempty    ; begins with "/" or is empty
                / path-absolute   ; begins with "/" but not "//"
                / path-noscheme   ; begins with a non-colon segment
                / path-rootless   ; begins with a segment
                / path-empty      ; zero characters

  path-abempty  = *( "/" segment )
  path-absolute = "/" [ segment-nz *( "/" segment ) ]
  path-noscheme = segment-nz-nc *( "/" segment )
  path-rootless = segment-nz *( "/" segment )
  path-empty    = 0<pchar>
  segment       = *pchar
  segment-nz    = 1*pchar
  segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                ; non-zero-length segment without any colon ":"

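So, although path does allow unencoded sub-delims characters, a '+' character is treated as-is, not as an encoded space. application/x-www-form-urlencoded is not used with the path component, so a space character has to be encoded as %20 due to the definitions of pchar and segment.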
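
To make the path case concrete, here is a hedged Python sketch. The helper encode_segment and the sample segment names are invented for the example; the choice of safe="+" reflects the reading above that '+' may stay literal in a path segment.

```python
from urllib.parse import quote, unquote, unquote_plus


def encode_segment(segment: str) -> str:
    """Percent-encode one path segment (hypothetical helper)."""
    # safe="+" keeps '+' literal, which pchar permits via sub-delims.
    # '/' is escaped because it cannot appear inside a single segment.
    return quote(segment, safe="+")


# Placeholder segment names chosen to contain a space and a '+'.
path = "/" + "/".join(encode_segment(s) for s in ["my docs", "a+b.txt"])

# Generic decoding leaves '+' alone, as the path rules expect.
as_path = unquote(path)

# Form-style decoding would wrongly turn the '+' into a space.
as_form = unquote_plus(path)
```

Here path is /my%20docs/a+b.txt, as_path is "/my docs/a+b.txt", and as_form is "/my docs/a b.txt", which shows why form decoding should not be applied to a path.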

Now, regarding the charset used to encode characters:

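For a webform submission, the charset is dictated by rules defined in the webform encoding algorithm (more so in HTML5 than in HTML4) that is used to prepare the webform data before it is inserted into the URL. In a nutshell, the HTML can specify an accept-charset attribute or a hidden _charset_ field directly in the <form> itself; otherwise the charset is typically the one used by the parent HTML document.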

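However, outside of a webform submission, there is no formal standard for which charset is used to encode non-ASCII characters in a URL component (the IRI syntax, on the other hand, requires UTF-8, especially when converting an IRI into a URI/URL). Outside of IRI, it is up to particular URI schemes to dictate their charsets (the HTTP scheme does not); otherwise the server decides which charset it wants to use. Most schemes/servers use UTF-8 nowadays, but there are still some that use other charsets, typically based on the server's locale (Latin1, Shift-JIS, etc.). There have been attempts to add charset reporting directly in the URL and/or in HTTP (such as Deterministic URI Encoding), but those are not commonly used.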
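
The effect of the charset choice is easy to see with a single non-ASCII character. The sketch below uses Python's urllib.parse with U+00E9 (é) as an arbitrary sample; it only illustrates how the octets differ, not what any particular server expects.

```python
from urllib.parse import quote, unquote

char = "\u00e9"  # 'é', chosen arbitrarily as a non-ASCII sample

# UTF-8 encodes the character as two octets, so two %XX triplets.
utf8_form = quote(char, encoding="utf-8")

# Latin-1 encodes it as a single octet, so one %XX triplet.
latin1_form = quote(char, encoding="latin-1")

# Decoding with the matching charset recovers the character.
round_trip = unquote(latin1_form, encoding="latin-1")

# Decoding Latin-1 octets as UTF-8 (the default) yields U+FFFD instead.
mismatch = unquote(latin1_form)
```

utf8_form is %C3%A9 and latin1_form is %E9. The mismatch value is the replacement character, which is the kind of corruption that appears when client and server disagree on the charset.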