Illegal characters in URI
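The java.net.URI constructor accepts most non-ASCII characters but does not accept the ideographic space (0x3000). The constructor fails with java.net.URISyntaxException: Illegal character in path ...
So my questions are:
- Why does the URI constructor reject 0x3000 but accept other non-ASCII characters?
- Which other characters does it reject?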
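The behaviour described in the question can be reproduced with a small sketch like the one below. The class name, the helper `parses`, and the example.com URLs are made up for illustration; only `java.net.URI` and `URISyntaxException` come from the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

class IdeographicSpaceRepro {
    // Returns true if the single-argument URI constructor accepts the string,
    // false if it throws URISyntaxException.
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+4E2D is a CJK ideograph: non-ASCII, but visible.
        System.out.println(parses("http://example.com/\u4E2D"));
        // U+3000 is the ideographic space.
        System.out.println(parses("http://example.com/a\u3000b"));
    }
}
```

The first call should print true and the second false, matching the "Illegal character in path" failure reported above.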
The JavaDoc documentation for java.net.URI specifies the accepted character set in detail:
Character categories
RFC 2396 specifies precisely which characters are permitted in the various components of a URI reference. The following categories, most of which are taken from that specification, are used below to describe these constraints:
- alpha The US-ASCII alphabetic characters, 'A' through 'Z' and 'a' through 'z'
- digit The US-ASCII decimal digit characters, '0' through '9'
- alphanum All alpha and digit characters
- unreserved All alphanum characters together with those in the string "_-!.~'()*"
- punct The characters in the string ",;:$&+="
- reserved All punct characters together with those in the string "?/[]@"
- escaped Escaped octets, that is, triplets consisting of the percent character ('%') followed by two hexadecimal digits ('0'-'9', 'A'-'F', and 'a'-'f')
- other The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method) (Deviation from RFC 2396, which is limited to US-ASCII)
The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters.
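In particular, "other" does not include space characters, which Character.isSpaceChar defines as characters whose Unicode general category is one of:
- SPACE_SEPARATOR
- LINE_SEPARATOR
- PARAGRAPH_SEPARATOR
According to the page you linked in your question, the ideographic space character is indeed one of these types.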
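To check this locally, a rough sketch along these lines prints what `Character.isSpaceChar` and `Character.getType` report for a few code points. The class name, the helper `separatorCategory`, and the choice of sample code points are illustrative assumptions, not part of the JDK or the question.

```java
class SpaceCharCheck {
    // Maps a code point to the name of its separator category, if it has one.
    static String separatorCategory(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.SPACE_SEPARATOR:     return "SPACE_SEPARATOR";
            case Character.LINE_SEPARATOR:      return "LINE_SEPARATOR";
            case Character.PARAGRAPH_SEPARATOR: return "PARAGRAPH_SEPARATOR";
            default:                            return "not a space character";
        }
    }

    public static void main(String[] args) {
        // Ideographic space, no-break space, line separator,
        // paragraph separator, and a CJK ideograph for contrast.
        int[] samples = {0x3000, 0x00A0, 0x2028, 0x2029, 0x4E2D};
        for (int cp : samples) {
            System.out.printf("U+%04X isSpaceChar=%b category=%s%n",
                    cp, Character.isSpaceChar(cp), separatorCategory(cp));
        }
    }
}
```

U+3000 should be reported as SPACE_SEPARATOR, while U+4E2D is not a space character at all, which is why only the latter falls into the "other" category.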
Please note the 1st example contains the ideographic space rather than a regular space.
The problem is the ideographic space.
Here is the code that allows non-ASCII characters:
    } else if ((c > 128)
               && !Character.isSpaceChar(c)
               && !Character.isISOControl(c)) {
        // Allow unescaped but visible non-US-ASCII chars
        return p + 1;
    }
As you can see, it disallows "funky" non-visible characters.
See also the URI class javadocs, which specify which characters are allowed (by the class!) in each component of a URI.
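Why?
It is probably a safety measure.
What others are disallowed?
Whitespace or control characters ... according to the respective Character predicate methods. (See the Character javadocs for the precise specifications.)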
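For a concrete list, one option is to run the same two predicates over the Basic Multilingual Plane. This is only a sketch: `acceptedAsOther` is a hypothetical helper that copies the condition from the quoted snippet rather than calling into the URI parser, and the exact result depends on the Unicode version your JDK ships with.

```java
import java.util.ArrayList;
import java.util.List;

class RejectedOtherChars {
    // Same condition as the quoted snippet: a char above 128 is accepted
    // unescaped only if it is neither a space char nor an ISO control.
    static boolean acceptedAsOther(char c) {
        return c > 128
                && !Character.isSpaceChar(c)
                && !Character.isISOControl(c);
    }

    // Collects every non-ASCII BMP char that the condition rejects.
    static List<String> rejectedNonAscii() {
        List<String> rejected = new ArrayList<>();
        for (int c = 0x80; c <= 0xFFFF; c++) {
            if (!acceptedAsOther((char) c)) {
                rejected.add(String.format("U+%04X", c));
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println(rejectedNonAscii());
    }
}
```

The output should include the C1 control range starting at U+0080, the no-break space U+00A0, the line and paragraph separators U+2028 and U+2029, and U+3000, among others.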
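You should also note that this is a deviation from the URI specification. The URI specification says that non-ASCII characters are only permitted if you:
- convert the UCS character code to UTF-8, and
- percent-encode the UTF-8 bytes as required by the spec.
My understanding is that if you have a "deviant" java.net.URI object, the URI.toASCIIString() method will take care of this.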
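A small sketch of both points, assuming made-up example.com URLs and a hypothetical helper `ascii`: `toASCIIString()` percent-encodes an accepted non-ASCII character as its UTF-8 bytes, and the ideographic space gets through the constructor once it has been percent-encoded by hand (U+3000 is E3 80 80 in UTF-8).

```java
import java.net.URI;
import java.net.URISyntaxException;

class ToAsciiDemo {
    // Parses the string and returns the US-ASCII form of the resulting URI.
    static String ascii(String s) throws URISyntaxException {
        return new URI(s).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // U+4E2D is accepted unescaped; toASCIIString() encodes its UTF-8 bytes.
        System.out.println(ascii("http://example.com/\u4E2D"));
        // U+3000 must already be escaped before the constructor sees it.
        System.out.println(ascii("http://example.com/a%E3%80%80b"));
    }
}
```

The first line printed should be http://example.com/%E4%B8%AD, and the second URI should come back unchanged because it is already pure ASCII.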