How do JSON parsers encode Unicode characters not in the Basic Multilingual Plane?

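I'm writing a JSON parser in Xojo. It works, except that I can't figure out how to encode and decode Unicode characters that are not in the Basic Multilingual Plane (BMP). In other words, my parser dies if it encounters anything greater than \uFFFF.

The spec says: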

To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.

What I don't understand is the algorithm for getting from U+1D11E to \uD834\uDD1E. I can't find any explanation of how to "encode the UTF-16 surrogate pair corresponding to the code point".

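For example, say I want to encode the smiley face character (U+1F600). What would that be as a UTF-16 surrogate pair, and how is it derived?

Can someone at least point me in the right direction?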

From the Wikipedia article on UTF-16 that Remy Lebeau linked in the comments above:

To encode U+10437 (𐐷) to UTF-16:

Subtract 0x10000 from the code point, leaving 0x0437.

For the high surrogate, shift right by 10 (divide by 0x400), then add 0xD800, resulting in 0x0001 + 0xD800 = 0xD801.

For the low surrogate, take the low 10 bits (remainder of dividing by 0x400), then add 0xDC00, resulting in 0x0037 + 0xDC00 = 0xDC37.

To decode U+10437 (𐐷) from UTF-16:

Take the high surrogate (0xD801) and subtract 0xD800, then multiply by 0x400, resulting in 0x0001 × 0x400 = 0x0400.

Take the low surrogate (0xDC37) and subtract 0xDC00, resulting in 0x37.

Add these two results together (0x0437), and finally add 0x10000 to get the final decoded UTF-32 code point, 0x10437.
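
The steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names (to_surrogate_pair, from_surrogate_pair, json_escape) are made up for this example and are not from any library, and the arithmetic would need to be ported to Xojo by hand.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def json_escape(code_point: int) -> str:
    """Hypothetical helper: format a code point as JSON \\uXXXX escape(s)."""
    if code_point <= 0xFFFF:
        return "\\u%04X" % code_point
    high, low = to_surrogate_pair(code_point)
    return "\\u%04X\\u%04X" % (high, low)


print(json_escape(0x1F600))  # \uD83D\uDE00
```

So for the smiley face in the question, U+1F600 minus 0x10000 is 0xF600; the top 10 bits are 0x3D and the bottom 10 bits are 0x200, which gives the pair \uD83D\uDE00. The same function reproduces the G clef example from the spec: U+1D11E becomes \uD834\uDD1E.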

Tags: javascript, unicode, json, xojo