Why is Unicode restricted to 0x10FFFF?

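Why is the maximum Unicode code point restricted to 0x10FFFF? Is it possible to represent Unicode above this code point (for example 0x10FFFF + 0x000001 = 0x110000) through any encoding scheme, such as UTF-16 or UTF-8?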

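It's because of UTF-16. Characters outside the Basic Multilingual Plane (BMP) are represented in UTF-16 by a surrogate pair: the first code unit (CU) lies in the range 0xD800–0xDBFF and the second in the range 0xDC00–0xDFFF. Each CU carries 10 bits of the code point, giving 20 bits of data in total (0x100000 characters), which are divided into 16 planes (16 × 2^16 characters). The BMP accounts for the remaining 0x10000 characters (code points 0–0xFFFF).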
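
The surrogate-pair mapping described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names `to_surrogate_pair` and `from_surrogate_pair` are made up for this example and are not part of any standard library.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into two UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value in 0x00000..0xFFFFF
    high = 0xD800 + (offset >> 10)       # upper 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits -> 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair maps to the largest code point.
print([hex(cu) for cu in to_surrogate_pair(0x10FFFF)])  # ['0xdbff', '0xdfff']
print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))         # 0x10ffff
```

Since both code units are already at the top of their ranges for U+10FFFF, there is no pair left that could encode 0x110000.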

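Therefore the total number of characters is 17 × 2^16 = 0x100000 + 0x10000 = 0x110000, which allows code points from 0 to 0x110000 - 1 = 0x10FFFF. Alternatively, the last representable code point can be calculated like this: code points in the BMP are in the range 0–0xFFFF, so the offset for characters encoded with a surrogate pair is 0xFFFF + 1 = 0x10000, which means the last code point representable by a surrogate pair is 0xFFFFF + 0x10000 = 0x10FFFF.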
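
For anyone who wants to double-check the arithmetic, here is a small Python sketch that computes the limit both ways. The constant names are chosen just for this illustration; `sys.maxunicode` is the interpreter's own idea of the largest code point.

```python
import sys

BMP_SIZE = 0x10000                    # code points 0..0xFFFF
BITS_PER_SURROGATE = 10               # payload bits carried by each code unit
SUPPLEMENTARY_SIZE = 1 << (2 * BITS_PER_SURROGATE)   # 2^20 = 0x100000

# First way: count all code points, then subtract one.
TOTAL_CODE_POINTS = BMP_SIZE + SUPPLEMENTARY_SIZE
MAX_CODE_POINT = TOTAL_CODE_POINTS - 1

# Second way: largest 20-bit surrogate payload plus the 0x10000 offset.
MAX_VIA_OFFSET = 0xFFFFF + 0x10000

print(hex(TOTAL_CODE_POINTS))         # 0x110000
print(hex(MAX_CODE_POINT))            # 0x10ffff
print(TOTAL_CODE_POINTS == 17 * 2**16)                     # True
print(MAX_CODE_POINT == MAX_VIA_OFFSET == sys.maxunicode)  # True
```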

The Unicode Character Encoding Stability Policies guarantee that the surrogate code points mentioned above will never be assigned to characters:

The General_Category property value Surrogate (Cs) is immutable: the set of code points with that value will never change.

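Historically, UTF-8 allowed code points up to U+7FFFFFFF using 6 bytes, while UTF-32 could store twice as many. However, because of the limit in UTF-16, the Unicode committee decided that UTF-8 can never be longer than 4 bytes, which results in the same range as UTF-16: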

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.

https://en.wikipedia.org/wiki/UTF-8#History
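
As a rough illustration of the restriction on UTF-8, the sketch below feeds a few hand-built byte sequences to Python's strict UTF-8 decoder. The helper name `utf8_decodes` is invented for this example, and the behaviour shown is that of CPython's built-in codec rather than a statement about every UTF-8 implementation.

```python
def utf8_decodes(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+10FFFF is the last code point and needs exactly four bytes.
print(chr(0x10FFFF).encode("utf-8").hex())     # f48fbfbf

print(utf8_decodes(b"\xf4\x8f\xbf\xbf"))       # True:  U+10FFFF
print(utf8_decodes(b"\xf4\x90\x80\x80"))       # False: would be 0x110000
print(utf8_decodes(b"\xed\xa0\x80"))           # False: would be surrogate U+D800
print(utf8_decodes(b"\xf8\x88\x80\x80\x80"))   # False: five-byte form

# The string type itself refuses anything above the limit.
try:
    chr(0x110000)
except ValueError as exc:
    print("chr(0x110000) failed:", exc)
```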

The same applies to UTF-32:

In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the UTF-16 encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32

https://en.wikipedia.org/wiki/UTF-32
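
The UTF-32 case can be checked the same way. A 32-bit code unit has plenty of room for larger values, but a conforming decoder rejects them anyway. The helper `utf32_decodes` below is a made-up name for this sketch, and it assumes Python 3.4 or later, where the built-in UTF-32 decoders also reject surrogates.

```python
import struct


def utf32_decodes(value: int) -> bool:
    """Pack value as one little-endian 32-bit unit and try to decode it."""
    data = struct.pack("<I", value)
    try:
        data.decode("utf-32-le")
        return True
    except UnicodeDecodeError:
        return False


print(utf32_decodes(0x10FFFF))   # True:  last valid code point
print(utf32_decodes(0x110000))   # False: fits in 32 bits but is out of range
print(utf32_decodes(0xD800))     # False: surrogates are not valid in UTF-32
```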

You can also read this more detailed answer.