What character set should be used for URL encoding?

I need to encode URL components. A URL component may contain special characters such as "?", "#", and "=", and it may also contain Chinese characters.

Which character set should I use: UTF-8, UTF-16, or UTF-32? And why?

I assume you mean percent-encoding here.

RFC 3986, section 2.5 is quite clear about this (emphasis mine):

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".

So it should be UTF-8.
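To make the two steps from the RFC concrete, here is a minimal sketch in Java: encode the text as UTF-8 octets, then percent-encode every octet that is not an unreserved character. The class name `PercentEncoder` and the method `encode` are hypothetical names chosen for illustration, not part of any library, and the sketch treats its input as a single component (it escapes reserved characters such as "?" and "#" as well).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a library class.
final class PercentEncoder {
    // Unreserved characters as listed in RFC 3986, section 2.3.
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    static String encode(String component) {
        StringBuilder out = new StringBuilder();
        // Step 1: turn the text into UTF-8 octets.
        for (byte b : component.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            if (octet < 0x80 && UNRESERVED.indexOf(octet) >= 0) {
                // Unreserved ASCII characters are copied through unchanged.
                out.append((char) octet);
            } else {
                // Step 2: every other octet becomes %HH.
                out.append('%').append(HEX[octet >> 4]).append(HEX[octet & 0x0F]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("A"));        // A
        System.out.println(encode("\u00C0"));   // %C3%80 (LATIN CAPITAL LETTER A WITH GRAVE)
        System.out.println(encode("\u30A2"));   // %E3%82%A2 (KATAKANA LETTER A)
        System.out.println(encode("a?b#c=d"));  // a%3Fb%23c%3Dd
    }
}
```

The first three outputs match the examples given in the RFC text quoted above.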

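Also, beware of URLEncoder.encode(). Although it is recommended again and again, the fact is that it is not suitable for URI encoding. Quoting the javadoc of the class itself: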

This class contains static methods for converting a String to the application/x-www-form-urlencoded MIME format

That is not what URI encoding uses. (In case you are wondering, application/x-www-form-urlencoded is what is used in HTTP POST data.) What you want to use instead is URI templates.
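A small sketch of why the form encoding differs from RFC 3986 percent-encoding. It only uses `java.net.URLEncoder` from the JDK; the wrapper class `FormEncodingDemo` and its method `formEncode` are hypothetical names for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; only URLEncoder itself is a real JDK API.
final class FormEncodingDemo {
    static String formEncode(String s) {
        try {
            // Produces application/x-www-form-urlencoded output, not RFC 3986 output.
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("a b")); // a+b  (a URI path would need a%20b)
        System.out.println(formEncode("~"));   // %7E  ("~" is unreserved in RFC 3986)
        System.out.println(formEncode("*"));   // *    (left as-is by the form encoding)
    }
}
```

The "+" for a space is the difference that bites most often: in a URI path, "+" is a literal plus sign, so a form-encoded value placed there decodes to the wrong text.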

UTF-8 (an encoding of Unicode) is the default character encoding in HTML5, because it covers almost all symbols and characters.

Encode your URL to escape special characters. There are several websites that can do this for you, e.g. http://www.url-encode-decode.com/

Go for UTF-8. You can also achieve the same thing with URLEncoder.encode(string, encoding).

In addition, you can refer to this blog, which tries encoding some Chinese characters such as '维也纳恩斯特哈佩尔球场'.

A reference from the HTML point of view.

The HTML4 specification, in the section "Non-ASCII characters in URI attribute values", states (emphasis mine):

We recommend that user agents adopt the following convention for handling non-ASCII characters in such cases:

  1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
  2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

Similarly, the HTML5 specification, in the section "Selecting a form submission encoding", basically says that UTF-8 should be used if no accept-charset attribute is specified.

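On the other hand, I have not found anything stating that UTF-8 must be used. Some older software specifically uses ISO-8859-1. For example, Apache Tomcat before version 8 uses ISO-8859-1 as the default value of its URIEncoding setting.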
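To illustrate what goes wrong when the two sides disagree, here is a sketch that decodes the same percent-encoded bytes once as UTF-8 and once as ISO-8859-1. It uses `java.net.URLDecoder` from the JDK (a form decoder, which is close enough for this input); the class `CharsetMismatchDemo` and the method `decode` are hypothetical names, and this is not a reproduction of what Tomcat does internally.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; only URLDecoder itself is a real JDK API.
final class CharsetMismatchDemo {
    static String decode(String encoded, String charsetName) {
        try {
            // Percent-escapes are turned back into bytes, then read in the given charset.
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = "%C3%A9"; // U+00E9 (e with acute) encoded as UTF-8

        String asUtf8 = decode(encoded, "UTF-8");        // one character: U+00E9
        String asLatin1 = decode(encoded, "ISO-8859-1"); // two characters: U+00C3 U+00A9

        System.out.println(asUtf8.length());   // 1
        System.out.println(asLatin1.length()); // 2
    }
}
```

The client sent one character, but a server that assumes ISO-8859-1 sees two, which is the classic mojibake that appears when the encodings do not match.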