What parts of a URL can be URL-encoded?

My Chrome (version 101) lets me open

but not


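According to the latest specification, which parts of a URL, and which characters, can be URL-encoded?

By “parts” I mean the scheme, username, password, host, port, path, query, fragment, ., :, //, @, ?, #, and so on.

By “which characters” I mean “characters with which values, in which parts.”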

According to the specification

From RFC 3986:


2.1. Percent-Encoding

….

pct-encoded = "%" HEXDIG HEXDIG

The uppercase hexadecimal digits “A” through “F” are equivalent to the lowercase digits “a” through “f”, respectively. If two URIs differ only in the case of hexadecimal digits used in percent-encoded octets, they are equivalent. For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings.

  • The hexadecimal digits in a percent-encoding are case-insensitive.
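
As a small illustration of the case-insensitivity rule, here is a minimal Python sketch. The helper name `uppercase_pct_hex` and the sample URLs are made up for this example; only `re` and `urllib.parse.unquote` come from the standard library.

```python
import re
from urllib.parse import unquote

# Matches one percent-encoded octet: "%" HEXDIG HEXDIG
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def uppercase_pct_hex(uri: str) -> str:
    """Upper-case the hex digits of every percent-encoded octet.

    Everything outside the "%XX" triplets is left untouched.
    (Hypothetical helper, written only for this illustration.)
    """
    return _PCT.sub(lambda m: m.group(0).upper(), uri)


lower = "https://example.com/a%2fb%c3%a9"
upper = "https://example.com/a%2Fb%C3%A9"

# Both spellings decode to the same text.
assert unquote(lower) == unquote(upper)

# Normalizing to uppercase hex, as the RFC recommends for producers.
assert uppercase_pct_hex(lower) == upper
```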

2.2. Reserved Characters

reserved   = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent. Percent-encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications. Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.

A subset of the reserved characters (gen-delims) is used as delimiters of the generic URI components described in Section 3. A component’s ABNF syntax rule will not use the reserved or gen-delims rule names directly; instead, each syntax rule lists the characters allowed within that component (i.e., not delimiting it), and any of those characters that are also in the reserved set are “reserved” for use as subcomponent delimiters within the component. Only the most common subcomponents are defined by this specification; other subcomponents may be defined by a URI scheme’s specification, or by the implementation-specific syntax of a URI’s dereferencing algorithm, provided that such subcomponents are delimited by characters in the reserved set allowed within that component.

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component. If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character’s encoding in US-ASCII.

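  • The characters “:/?#[]@!$&'()*+,;=” are reserved characters.
  • A URL scheme's specification defines which of the reserved characters act as syntactic URL delimiters.
  • Syntactic URL delimiters are not percent-encoded.
  • Reserved characters that are not acting as syntactic URL delimiters may or may not be percent-encoded, but percent-encoding them is recommended.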
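
To make the reserved-character bullets concrete, the following sketch uses Python's `urllib.parse`. The URL `https://example.com/search` and the parameter name `q` are assumptions chosen for the example. It shows that a data value containing reserved characters survives only when those characters are percent-encoded, so the two URLs are not equivalent.

```python
from urllib.parse import parse_qs, quote, urlsplit

# A data value that happens to contain reserved characters.
value = "a&b=c/d?e"

# safe="" tells quote() to percent-encode every reserved character.
encoded_value = quote(value, safe="")
assert encoded_value == "a%26b%3Dc%2Fd%3Fe"

encoded_url = "https://example.com/search?q=" + encoded_value
raw_url = "https://example.com/search?q=" + value

# Encoded: "&" and "=" are data, so the value round-trips intact.
assert parse_qs(urlsplit(encoded_url).query) == {"q": ["a&b=c/d?e"]}

# Unencoded: "&" and "=" act as delimiters, so the query is read differently.
assert parse_qs(urlsplit(raw_url).query) == {"q": ["a"], "b": ["c/d?e"]}
```

Note that the `&` and `=` conventions come from the form-encoding convention commonly used in HTTP queries, not from RFC 3986 itself, which matches the point that subcomponent delimiters are defined per scheme or per application.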

2.3. Unreserved Characters

Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded US-ASCII octet are equivalent: they identify the same resource. However, URI comparison implementations do not always perform normalization prior to comparison (see Section 6). For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or tilde (%7E) should not be created by URI producers and, when found in a URI, should be decoded to their corresponding unreserved characters by URI normalizers.

6. Normalization and Comparison

…URI comparison is performed for some particular purpose. Protocols or implementations that compare URIs for different purposes will often be subject to differing design trade-offs in regards to how much effort should be spent in reducing aliased identifiers. This section describes various methods that may be used to compare URIs, the trade-offs between them, and the types of applications that might use them.

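  • Characters that are allowed in a URL but are not reserved, namely “ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~”, are unreserved characters.
  • Unreserved characters may or may not be percent-encoded, but percent-encoding them is not recommended.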
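
The normalization step described for unreserved characters can be sketched roughly as follows. The helper `decode_unreserved` and the sample URL are hypothetical and written only for this illustration. It decodes a percent-encoded octet only when the result is an unreserved character and leaves every other triplet alone.

```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode percent-encoded octets that correspond to unreserved characters.

    Octets that map to reserved characters (or to non-ASCII bytes) stay
    encoded, because decoding them could change how the URI is interpreted.
    """
    def repl(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)

    return _PCT.sub(repl, uri)


normalized = decode_unreserved("https://example.com/%7Euser/%68ello%2Fworld")

# "%7E" and "%68" become "~" and "h"; "%2F" (a reserved "/") is kept as-is.
assert normalized == "https://example.com/~user/hello%2Fworld"
```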

Summary

  • Syntactic URL delimiters → cannot be percent-encoded.
  • Everything else → may or may not be percent-encoded.
  • The hexadecimal digits in a percent-encoding are case-insensitive.

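What about implementations?

Some implementations do not perform complete, thorough URL normalization. For example, by the reading above “%68%74%74%70%73://example.com” is a valid URL, but Chrome (version 101) does not normalize it to “https://example.com” when it is entered in the omnibox.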
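
Chrome is not the only implementation that behaves this way. As a rough check with a different implementation, the sketch below feeds the same string to `urlsplit` from Python's standard library (this says nothing about Chrome's internals). One possible explanation is that the scheme grammar in RFC 3986 Section 3.1, `ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )`, has no `pct-encoded` alternative, so a parser that follows it will not recognize a percent-encoded scheme.

```python
from urllib.parse import unquote, urlsplit

encoded = "%68%74%74%70%73://example.com"

# "%" is not a valid scheme character for urlsplit, so no scheme or host
# is recognized and the whole string ends up in the path component.
parts = urlsplit(encoded)
assert parts.scheme == ""
assert parts.netloc == ""
assert parts.path == encoded

# Decoding the string first yields the URL a human would expect.
decoded = unquote(encoded)
assert decoded == "https://example.com"

decoded_parts = urlsplit(decoded)
assert decoded_parts.scheme == "https"
assert decoded_parts.netloc == "example.com"
```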