Illegal characters in URI

The java.net.URI constructor accepts most non-ASCII characters but rejects the ideographic space (U+3000). The constructor fails with java.net.URISyntaxException: Illegal character in path ...

So my questions are:

The set of acceptable characters is spelled out in the JavaDoc documentation for java.net.URI:

Character categories

RFC 2396 specifies precisely which characters are permitted in the various components of a URI reference. The following categories, most of which are taken from that specification, are used below to describe these constraints:

  • alpha The US-ASCII alphabetic characters, 'A' through 'Z' and 'a' through 'z'
  • digit The US-ASCII decimal digit characters, '0' through '9'
  • alphanum All alpha and digit characters
  • unreserved All alphanum characters together with those in the string "_-!.~'()*"
  • punct The characters in the string ",;:$&+="
  • reserved All punct characters together with those in the string "?/[]@"
  • escaped Escaped octets, that is, triplets consisting of the percent character ('%') followed by two hexadecimal digits ('0'-'9', 'A'-'F', and 'a'-'f')
  • other The Unicode characters that are not in the US-ASCII character set, are not control characters (according to the Character.isISOControl method), and are not space characters (according to the Character.isSpaceChar method) (Deviation from RFC 2396, which is limited to US-ASCII)

The set of all legal URI characters consists of the unreserved, reserved, escaped, and other characters.

In particular, "other" does not include space characters, which Character.isSpaceChar defines as characters whose Unicode general category is one of:

  • SPACE_SEPARATOR
  • LINE_SEPARATOR
  • PARAGRAPH_SEPARATOR

According to the page you linked in your question, the ideographic space is indeed one of these types (it is a SPACE_SEPARATOR).
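To check this quickly, here is a minimal sketch. The class name, the helper `parses`, and the example.com URLs are made up for illustration; only java.net.URI and java.lang.Character come from the JDK.

```java
import java.net.URI;
import java.net.URISyntaxException;

class IdeographicSpaceCheck {
    // Returns true if the single-argument java.net.URI constructor accepts s as-is.
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        char ideographicSpace = '\u3000';

        // U+3000 has general category SPACE_SEPARATOR, so isSpaceChar reports true.
        System.out.println(Character.getType(ideographicSpace) == Character.SPACE_SEPARATOR); // true
        System.out.println(Character.isSpaceChar(ideographicSpace));                          // true

        // Visible non-ASCII characters fall into the "other" category and are accepted.
        System.out.println(parses("http://example.com/\u65E5\u672C")); // true

        // The ideographic space is excluded from "other", so parsing fails.
        System.out.println(parses("http://example.com/a\u3000b"));     // false
    }
}
```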

Please note the 1st example contains the ideographic space rather than a regular space.

The problem is the ideographic space.

Here is the code that allows non-ASCII characters:

        } else if ((c > 128)
                   && !Character.isSpaceChar(c)
                   && !Character.isISOControl(c)) {
            // Allow unescaped but visible non-US-ASCII chars
            return p + 1;
        }

As you can see, it disallows "funky" invisible characters.
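If you want to find the offending character before handing a string to the constructor, the same condition can be applied by hand. This is only a sketch: `UriCharScanner` and its methods are hypothetical names, and it looks only at non-ASCII characters, not at the ASCII rules for each URI component.

```java
class UriCharScanner {
    // Mirrors the condition quoted above: a non-ASCII char is accepted only if it
    // is neither a space character nor an ISO control character.
    static boolean isAcceptedNonAscii(char c) {
        return c > 128
                && !Character.isSpaceChar(c)
                && !Character.isISOControl(c);
    }

    // Returns the index of the first non-ASCII char the condition rejects, or -1 if none.
    static int firstRejectedNonAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 128 && !isAcceptedNonAscii(c)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstRejectedNonAscii("caf\u00E9"));   // -1: 'é' is visible
        System.out.println(firstRejectedNonAscii("a\u3000b"));    // 1: ideographic space
        System.out.println(firstRejectedNonAscii("x\u00A0"));     // 1: no-break space
    }
}
```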

See also the URI class javadocs, which specify which characters are allowed (by the class!) in each component of a URI.

Why?

It is probably a safety measure.

What others are disallowed?

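Either space characters or control characters ... according to the respective Character predicate methods, Character.isSpaceChar and Character.isISOControl. (See the Character javadocs for the precise specification.)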

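You should also note that this is a deviation from the URI specification. The URI specification says that non-ASCII characters are only allowed when you: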

  • convert the UCS character codes to UTF-8, and
  • percent-encode the UTF-8 bytes as required by the specification.
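The two steps above can be sketched as follows. `Utf8PercentEncoder` and `encodeNonAscii` are hypothetical names, and the sketch deliberately touches only non-ASCII characters; it does not escape illegal ASCII characters such as a regular space.

```java
import java.nio.charset.StandardCharsets;

class Utf8PercentEncoder {
    // Percent-encodes every non-ASCII code point of s as its UTF-8 bytes.
    // ASCII characters pass through unchanged.
    static String encodeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                sb.append((char) cp);
                continue;
            }
            // Step 1: convert the character to UTF-8 bytes.
            byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            // Step 2: write each byte as a %XX triplet.
            for (byte b : utf8) {
                sb.append(String.format("%%%02X", b & 0xFF));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The ideographic space U+3000 is E3 80 80 in UTF-8.
        System.out.println(encodeNonAscii("a\u3000b")); // a%E3%80%80b
    }
}
```

The encoded form contains only ASCII, so it should no longer trip the space-character check when passed to the java.net.URI constructor.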

My understanding is that if you have a "deviant" java.net.URI object, the URI.toASCIIString() method will take care of this.
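A small sketch of that, for a URI whose non-ASCII characters were accepted at construction time. The class name, the `toAscii` helper, and the example.com URL are placeholders.

```java
import java.net.URI;

class ToAsciiDemo {
    // Parses s and returns its US-ASCII form; URI.create throws
    // IllegalArgumentException if s is not a legal URI.
    static String toAscii(String s) {
        return URI.create(s).toASCIIString();
    }

    public static void main(String[] args) {
        // The two CJK characters are "other" characters, so the URI parses,
        // and toASCIIString() percent-encodes their UTF-8 bytes.
        System.out.println(toAscii("http://example.com/\u65E5\u672C"));
        // http://example.com/%E6%97%A5%E6%9C%AC
    }
}
```

Note that this only helps once you have a URI object. A string containing the ideographic space is rejected by the parser before toASCIIString() can be called.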