Discrepancies of Percent Encoding for URLs
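After looking at this previous SO question regarding percent encoding, I'm curious as to which styles of encoding are correct. The Wikipedia article on percent encoding alludes to using + instead of %20 for spaces, while still having an application/x-www-form-urlencoded content type.
This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded. What are the preferred differences between path segments and query strings? Details and references for this specification would be greatly appreciated.
Note: I assume that non-alphanumeric characters will be encoded via UTF-8, in that each octet of a character becomes a %XX string. Correct me if I'm wrong here (for instance latin-1 instead of utf-8), but I am more interested in the differences between the encodings of the different parts of a URL.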
This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded.
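It depends not only on the particular URL component, but also on the circumstances in which that component is populated with data.
The use of '+' to encode a space character is specific to the application/x-www-form-urlencoded format, which applies to webform data being submitted in an HTTP request. It does not apply to a URL itself.
The application/x-www-form-urlencoded format is formally defined by the W3C in the HTML specifications. Here is the definition from HTML 4.01: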
Section 17.13.3 Processing form data, Step four: Submit the encoded form data set
This specification does not specify all valid submission methods or content types that may be used with forms. However, HTML 4 user agents must support the established conventions in the following cases:
• If the method is "get" and the action is an HTTP URI, the user agent takes the value of action, appends a `?' to it, then appends the form data set, encoded using the "application/x-www-form-urlencoded" content type. The user agent then traverses the link to this URI. In this scenario, form data are restricted to ASCII codes.
• If the method is "post" and the action is an HTTP URI, the user agent conducts an HTTP "post" transaction using the value of the action attribute and a message created according to the content type specified by the enctype attribute.
Section 17.13.4 Form content types, application/x-www-form-urlencoded
This is the default content type. Forms submitted with this content type must be encoded as follows:
1. Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
2. The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
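The two steps quoted above can be sketched in Python. This is only an illustration, not the specification's reference algorithm: encode_form_data is a hypothetical helper name, and it relies on urllib.parse.quote_plus, which leaves the characters "_", ".", "-" and "~" unescaped, whereas the HTML 4.01 text speaks of escaping non-alphanumeric characters in general.

```python
import re
from urllib.parse import quote_plus


def encode_form_data(pairs):
    """Encode (name, value) pairs roughly as HTML 4.01 section 17.13.4 describes.

    Hypothetical helper for illustration only.
    """
    encoded = []
    for name, value in pairs:
        # Step 1: normalize line breaks to CR LF, then escape.
        # quote_plus turns spaces into '+' and other reserved characters into %HH.
        value = re.sub(r"\r\n|\r|\n", "\r\n", value)
        encoded.append(quote_plus(name) + "=" + quote_plus(value))
    # Step 2: keep document order, join name=value pairs with '&'.
    return "&".join(encoded)


form_body = encode_form_data([("field", "hello world"), ("note", "a&b\nc")])
print(form_body)  # field=hello+world&note=a%26b%0D%0Ac
```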
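The corresponding HTML5 definitions (Section 4.10.22.3 Form submission algorithm and Section 4.10.22.6 URL-encoded form data) are far more refined and detailed, but for the purposes of this discussion the gist is roughly the same.
So, when a webform is submitted using an HTTP GET request instead of a POST request, the webform data is encoded using application/x-www-form-urlencoded and placed as-is in the URL query component.
Per RFC 3986: Uniform Resource Identifier (URI): Generic Syntax: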
URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.
'+' is a reserved character:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
           / "*" / "+" / "," / ";" / "="
The query component explicitly allows unencoded '+' characters, because it allows characters from sub-delims:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
query = *( pchar / "/" / "?" )
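As a rough check of the grammar quoted above, the query rule can be transcribed into a regular expression. This is a sketch under the assumption that ALPHA, DIGIT and HEXDIG mean the usual ASCII ranges; QUERY_RE and is_valid_query are hypothetical names introduced here, not part of any library.

```python
import re

# Character sets transcribed from the RFC 3986 ABNF quoted above.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="

# query = *( pchar / "/" / "?" ), where pchar adds ":" and "@" and pct-encoded.
QUERY_RE = re.compile(rf"(?:[{UNRESERVED}{SUB_DELIMS}:@/?]|%[0-9A-Fa-f]{{2}})*")


def is_valid_query(text):
    """Return True if text matches the query rule as transcribed above."""
    return QUERY_RE.fullmatch(text) is not None


print(is_valid_query("field=hello+world"))  # True: '+' is a sub-delim
print(is_valid_query("hello%20world"))      # True: pct-encoded space
print(is_valid_query("hello world"))        # False: a raw space is not allowed
```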
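So, in the context of a webform submission, spaces are encoded using '+' before the data is put as-is into the query component. The URL syntax allows this, because the encoded form of application/x-www-form-urlencoded is compatible with the definition of the query component.
So, for example: http://server/script?field=hello+world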
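In Python, urllib.parse.urlencode produces this form-style encoding by default, and parse_qs reverses it. The snippet below is a small sketch that reuses the placeholder host and script names from the example URL above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# urlencode uses quote_plus by default, so a space becomes '+'.
url = "http://server/script?" + urlencode({"field": "hello world"})
print(url)  # http://server/script?field=hello+world

# parse_qs applies form decoding, so '+' is turned back into a space.
decoded = parse_qs(urlsplit(url).query)
print(decoded)  # {'field': ['hello world']}
```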
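However, outside of a webform submission, putting a space character directly into the query component requires the use of pct-encoded, because ' ' is not included in either unreserved or sub-delims, and is not otherwise explicitly allowed by the query definition.
So, for example: http://server/script?hello%20world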
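For data that is not form-encoded, Python's urllib.parse.quote yields the %20 form, and unquote leaves a literal '+' alone. A brief sketch, again using the placeholder URL from the example above:

```python
from urllib.parse import quote, unquote, urlencode

# quote() percent-encodes the space instead of using '+'.
url = "http://server/script?" + quote("hello world")
print(url)  # http://server/script?hello%20world

# urlencode can be asked to use quote() instead of its default quote_plus().
pairs = urlencode({"field": "hello world"}, quote_via=quote)
print(pairs)  # field=hello%20world

# unquote() applies plain percent-decoding, so '+' stays a '+'.
print(unquote("hello%20world"))  # hello world
print(unquote("hello+world"))    # hello+world
```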
Similar rules also apply to the path component, because it uses pchar as well:
path          = path-abempty    ; begins with "/" or is empty
              / path-absolute   ; begins with "/" but not "//"
              / path-noscheme   ; begins with a non-colon segment
              / path-rootless   ; begins with a segment
              / path-empty      ; zero characters
path-abempty  = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty    = 0<pchar>
segment       = *pchar
segment-nz    = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
              ; non-zero-length segment without any colon ":"
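So, while path does allow unencoded sub-delims characters, a '+' character there is treated as-is, not as an encoded space. application/x-www-form-urlencoded is not used with the path component, so a space character has to be encoded as %20 due to the definitions of pchar and path.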
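A short Python sketch of the path case: build_path is a hypothetical helper, and passing safe="+" to quote is a choice made here so that a literal '+' stays unencoded, which the grammar permits. It also shows why form-style decoding is the wrong tool for a path.

```python
from urllib.parse import quote, unquote, unquote_plus


def build_path(segments):
    """Join path segments, percent-encoding each one (hypothetical helper).

    safe="+" keeps a literal '+' as-is; a "/" inside a segment becomes %2F.
    """
    return "/" + "/".join(quote(segment, safe="+") for segment in segments)


path = build_path(["docs", "a b+c"])
print(path)  # /docs/a%20b+c

# Plain percent-decoding keeps '+' literal, as the path component requires.
print(unquote("a%20b+c"))       # a b+c

# Form-style decoding would wrongly turn the '+' into a space.
print(unquote_plus("a%20b+c"))  # a b c
```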
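Now, regarding the charset used to encode characters:
For a webform submission, the charset is dictated by rules defined in the webform encoding algorithm (more so in HTML5 than in HTML4), which prepares the webform data before it is inserted into the URL. In a nutshell, the HTML can specify an accept-charset attribute or a hidden _charset_ field directly in the <form> itself; otherwise the charset is typically the one used by the parent HTML.
Outside of a webform submission, however, there is no formal standard for which charset is used to encode non-ASCII characters in a URL component (the IRI syntax, on the other hand, requires UTF-8, especially when converting an IRI into a URI/URL). Outside of IRI, it is up to particular URI schemes to dictate their charsets (the HTTP scheme does not); otherwise the server decides which charset it wants to use. Most schemes/servers use UTF-8 nowadays, but there are still some that use other charsets, typically based on the server's locale (Latin1, Shift-JIS, etc.). There have been attempts to add charset reporting directly in the URL and/or in HTTP (such as Deterministic URI Encoding), but those are not commonly used.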
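The effect of the charset choice is easy to see with urllib.parse, whose quote and unquote accept an encoding argument. The sample string is an assumption made for this sketch; the point is only that the same character produces different %XX octets depending on the charset, and that decoding with the wrong charset loses data.

```python
from urllib.parse import quote, unquote

text = "café"  # sample value chosen for illustration

utf8_form = quote(text, encoding="utf-8")
latin1_form = quote(text, encoding="latin-1")
print(utf8_form)    # caf%C3%A9  (two octets for the accented letter)
print(latin1_form)  # caf%E9     (one octet for the accented letter)

# Decoding must use the same charset the producer used.
print(unquote(latin1_form, encoding="latin-1"))  # café

# unquote defaults to UTF-8 with errors="replace", so the Latin-1 octet
# becomes U+FFFD (the replacement character).
mismatch = unquote(latin1_form)
print(mismatch == "caf\ufffd")  # True
```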