Is this Google Closure UTF-8 string valid?
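In Google Closure's UTF-8 to byte array tests, the string
\u0000\u007F\u0080\u07FF\u0800\uFFFF
is expected to convert to the array
[0x00, 0x7F, 0xC2, 0x80, 0xDF, 0xBF, 0xE0, 0xA0, 0x80, 0xEF, 0xBF, 0xBF]
I tried a few other JavaScript and TypeScript UTF-8 to byte array implementations, and they claim that this UTF-8 string is invalid.
The string seems to cover values ranging from 1-byte values to 2-byte values to 3-byte values.
Is Google correct, or are the other libraries?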
Google is correct.
The string '\u0000\u007F\u0080\u07FF\u0800\uFFFF'
represents the Unicode code points U+0000 U+007F U+0080 U+07FF U+0800 U+FFFF.
The literal translation of these code points to UTF-8 is indeed the bytes 00 7F C2 80 DF BF E0 A0 80 EF BF BF, just as Google says.
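To check the byte sequence by hand, here is a minimal sketch of a UTF-8 encoder. The function name encodeUtf8 is hypothetical and this is not Closure's implementation; it only applies the standard UTF-8 bit layout for 1-, 2-, 3-, and 4-byte sequences. Throwing on a lone surrogate is a choice made for this sketch, and other encoders may handle that case differently.

```typescript
// Hypothetical helper (not Closure's code): encode a JavaScript string,
// which is a sequence of UTF-16 code units, into UTF-8 bytes.
function encodeUtf8(str: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);

    // Combine a high/low surrogate pair into one supplementary code point.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < str.length) {
      const low = str.charCodeAt(i + 1);
      if (low >= 0xdc00 && low <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (low - 0xdc00);
        i++;
      }
    }

    // Assumption of this sketch: an unpaired surrogate is an error.
    if (cp >= 0xd800 && cp <= 0xdfff) {
      throw new RangeError("lone surrogate in input");
    }

    if (cp < 0x80) {
      // 1 byte: 0xxxxxxx
      bytes.push(cp);
    } else if (cp < 0x800) {
      // 2 bytes: 110xxxxx 10xxxxxx
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (includes U+FFFF)
      bytes.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      bytes.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return bytes;
}

const encoded = encodeUtf8("\u0000\u007F\u0080\u07FF\u0800\uFFFF");
console.log(encoded.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" "));
```

The six code points in the test string sit at the lower and upper ends of the 1-, 2-, and 3-byte ranges, which is why the string makes a good boundary test.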
Note that U+FFFF is a noncharacter code point, per the Unicode standard:
A "noncharacter" is a code point that is permanently reserved in the Unicode Standard for internal use
...
In Unicode 1.0 the code points U+FFFE and U+FFFF were annotated in the code charts as "Not character codes" and instead of having actual names were labeled "NOT A CHARACTER". The term "noncharacter" in later versions of the standard evolved from those early annotations and labels.
In particular:
Q: Are noncharacters intended for interchange?
A: No. They are intended explicitly for internal use. For example, they might be used internally as a particular kind of object placeholder in a string. Or they might be used in a collation tailoring as a target for a weighting that comes between weights for "real" characters of different scripts, thus simplifying the support of "alphabetic index" implementations.
Q: Are noncharacters prohibited in interchange?
A: This question has led to some controversy, because the Unicode Standard has been somewhat ambiguous about the status of noncharacters. The formal wording of the definition of "noncharacter" in the standard has always indicated that noncharacters "should never be interchanged." That led some people to assume that the definition actually meant "shall not be interchanged" and that therefore the presence of a noncharacter in any Unicode string immediately rendered that string malformed according to the standard. But the intended use of noncharacters requires the ability to exchange them in a limited context, at least across APIs and even through data files and other means of "interchange", so that they can be processed as intended. The choice of the word "should" in the original definition was deliberate, and indicated that one should not try to interchange noncharacters precisely because their interpretation is strictly internal to whatever implementation uses them, so they have no publicly interchangeable semantics. But other informative wording in the text of the core specification and in the character names list was differently and more strongly worded, leading to contradictory interpretations.
Given this ambiguity of intent, in 2013 the UTC issued Corrigendum #9, which deleted the phrase "and that should never be interchanged" from the definition of noncharacters, to make it clear that prohibition from interchange is not part of the formal definition of noncharacters. Corrigendum #9 has been incorporated into the core specification for Unicode 7.0.
Q: Are noncharacters invalid in Unicode strings and UTFs?
A: Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF. This can be seen explicitly in the table above, where every noncharacter code point has a well-formed representation in UTF-32, in UTF-16, and in UTF-8. An implementation which converts noncharacter code points between one UTF representation and another must preserve these values correctly. The fact that they are called "noncharacters" and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid.
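The distinction the FAQ draws can be sketched in code. The function names below are hypothetical and chosen for illustration. The ranges are the standard ones: the 66 noncharacters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes, while the surrogates U+D800..U+DFFF are the only code points in range that have no well-formed UTF-8 encoding. A validator that rejects U+FFFF is most likely conflating these two categories.

```typescript
// Hypothetical helpers illustrating the difference between a noncharacter
// (well-formed in every UTF) and a surrogate (not encodable in UTF-8).

function isSurrogate(cp: number): boolean {
  return cp >= 0xd800 && cp <= 0xdfff;
}

function isNoncharacter(cp: number): boolean {
  if (cp < 0 || cp > 0x10ffff) return false;
  // U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane.
  return (cp >= 0xfdd0 && cp <= 0xfdef) || (cp & 0xfffe) === 0xfffe;
}

// A code point has a well-formed UTF-8 encoding exactly when it is a
// Unicode scalar value: in range and not a surrogate. Noncharacters qualify.
function isEncodableInUtf8(cp: number): boolean {
  return Number.isInteger(cp) && cp >= 0 && cp <= 0x10ffff && !isSurrogate(cp);
}

console.log(isNoncharacter(0xffff), isEncodableInUtf8(0xffff)); // true true
console.log(isNoncharacter(0xd800), isEncodableInUtf8(0xd800)); // false false
```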