How do JSON parsers encode Unicode characters not in the Basic Multilingual Plane?
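I'm writing a JSON parser in Xojo. It works, except that I can't figure out how to encode and decode Unicode characters that are not in the Basic Multilingual Plane (BMP). In other words, my parser dies if it encounters anything greater than \uFFFF.
The specification says: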
To escape a code point that is not in the Basic Multilingual Plane,
the character may be represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair corresponding to the code point. So
for example, a string containing only the G clef character (U+1D11E)
may be represented as "\uD834\uDD1E". However, whether a processor of
JSON texts interprets such a surrogate pair as a single code point or
as an explicit surrogate pair is a semantic decision that is
determined by the specific processor.
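What I don't understand is the algorithm for getting from U+1D11E to \uD834\uDD1E. I can't find any explanation of how to "encode the UTF-16 surrogate pair corresponding to the code point".
For example, say I want to encode the smiley face character (U+1F600). What would that be as a UTF-16 surrogate pair, and how is it derived?
Can someone at least point me in the right direction?
From the Wikipedia article linked by Remy Lebeau in the comments above (link):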
To encode U+10437 (𐐷) to UTF-16:
Subtract 0x10000 from the code point, leaving 0x0437. For the high
surrogate, shift right by 10 (divide by 0x400), then add 0xD800,
resulting in 0x0001 + 0xD800 = 0xD801. For the low surrogate, take the
low 10 bits (remainder of dividing by 0x400), then add 0xDC00,
resulting in 0x0037 + 0xDC00 = 0xDC37.
To decode U+10437 (𐐷) from UTF-16:
Take the high surrogate (0xD801) and subtract 0xD800, then multiply by
0x400, resulting in 0x0001 × 0x400 = 0x0400. Take the low surrogate
(0xDC37) and subtract 0xDC00, resulting in 0x37. Add these two results
together (0x0437), and finally add 0x10000 to get the final decoded
UTF-32 code point, 0x10437.
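The steps quoted above can be sketched in Python (the question is about Xojo, but the arithmetic should carry over unchanged). This is only a minimal illustration: the function names to_surrogates, from_surrogates and json_escape are hypothetical, not part of any library.

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def json_escape(code_point: int) -> str:
    """Hypothetical helper: format a code point as JSON \\uXXXX escape(s)."""
    if code_point <= 0xFFFF:
        return "\\u%04X" % code_point
    high, low = to_surrogates(code_point)
    return "\\u%04X\\u%04X" % (high, low)


if __name__ == "__main__":
    print(json_escape(0x1F600))   # the smiley from the question
    print(json_escape(0x1D11E))   # the G clef from the spec
    print(hex(from_surrogates(0xD801, 0xDC37)))
```

For the smiley U+1F600 from the question, the same arithmetic gives 0x1F600 - 0x10000 = 0xF600, a high surrogate of 0xD800 + 0x3D = 0xD83D and a low surrogate of 0xDC00 + 0x200 = 0xDE00, so the escaped form is "\uD83D\uDE00". When decoding, a parser that sees a \uXXXX escape in the range D800..DBFF would read the next escape as the low surrogate and combine the two as in from_surrogates.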