Why are Unicode code points always written with at least 2 bytes?

Question

Why are Unicode code points always written with 2 bytes (4 hex digits), even when that is not necessary?

$ -> U+0024
¢ -> U+00A2

Answer 1

TL;DR: This is purely a convention of the Unicode Consortium.

This is the formal definition, which can be found in Appendix A: Notational Conventions of the Unicode Standard (I've referenced the latest at this time, version 11):

In running text, an individual Unicode code point is expressed as U+n, where n is four to six hexadecimal digits, using the digits 0–9 and uppercase letters A–F (for 10 through 15, respectively). Leading zeros are omitted, unless the code point would have fewer than four hexadecimal digits—for example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.
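The rule quoted above maps directly onto a zero-padded hexadecimal format. A minimal sketch, assuming a hypothetical helper named `format_code_point` (not part of any library):

```python
def format_code_point(cp: int) -> str:
    """Render an integer code point in U+n notation (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    # 04X pads with zeros up to four uppercase hex digits and never truncates,
    # so five- and six-digit values pass through without a leading zero.
    return f"U+{cp:04X}"


for cp in (0x1, 0x24, 0xA2, 0x1234, 0x12345, 0x102345):
    print(format_code_point(cp))
# U+0001
# U+0024
# U+00A2
# U+1234
# U+12345
# U+102345
```

The minimum width of four is the only thing that makes `$` come out as U+0024 rather than U+24; nothing about storage size is implied.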

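They are hexadecimal digits that represent the Unicode scalar value. Initially only the first plane, called the Basic Multilingual Plane, was available, and it covers the range U+0000 to U+FFFF. So originally the U+ notation always had exactly 4 hexadecimal digits.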

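However, this only allowed 64 Ki (65,536) code points for characters (minus some reserved values). So the single plane was later extended to 17 planes. For values of U+10000 or higher, leading zeros are suppressed, so the code point after U+FFFF is written U+10000, not U+010000. Currently there are 17 planes of 64 Ki code points each (some of which may be reserved), starting at U+0000, U+10000, ..., U+F0000, and finally U+100000.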
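The plane layout is simple arithmetic on the code point value. A small sketch, where `PLANE_SIZE`, `PLANE_COUNT`, and `plane_of` are illustrative names chosen here:

```python
PLANE_SIZE = 0x10000   # 64 Ki code points per plane
PLANE_COUNT = 17       # planes 0 through 16


def plane_of(cp: int) -> int:
    """Return the plane number of a code point (hypothetical helper)."""
    return cp >> 16


# First code point of each plane, in U+n notation.
plane_starts = [f"U+{p * PLANE_SIZE:04X}" for p in range(PLANE_COUNT)]
print(plane_starts[0], plane_starts[1], plane_starts[15], plane_starts[16])
# U+0000 U+10000 U+F0000 U+100000

# The last valid code point is the end of plane 16.
print(f"U+{PLANE_COUNT * PLANE_SIZE - 1:04X}")
# U+10FFFF
```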

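The U+xxxx notation does not follow the UTF-8 encoding. Nor does it follow UTF-16, UTF-32, or the deprecated UCS encodings, whether big- or little-endian. However, for characters in the Basic Multilingual Plane, the hex digits are identical to the UTF-16(BE) encoding. Note that UTF-16 may contain surrogate code units, which act as an escape to encode characters in the other planes. The range of those code units is not mapped to characters, so they will not appear in the textual code point representation.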
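To see where the notation and UTF-16 agree and where they part ways, one can compare a BMP character with one from a supplementary plane. This is only an illustration; `utf16be_hex` is a helper name made up for this sketch, and U+1F600 is an arbitrarily chosen code point outside the BMP:

```python
def utf16be_hex(ch: str) -> str:
    """Hex dump of the UTF-16BE encoding of a string (hypothetical helper)."""
    return ch.encode("utf-16-be").hex().upper()


bmp = "\u00b1"        # inside the Basic Multilingual Plane
supp = "\U0001F600"   # plane 1, needs a surrogate pair in UTF-16

print(f"U+{ord(bmp):04X}", utf16be_hex(bmp))
# U+00B1 00B1
print(f"U+{ord(supp):04X}", utf16be_hex(supp))
# U+1F600 D83DDE00
```

For the BMP character the digits match; for the supplementary one, UTF-16 emits the surrogate code units D83D and DE00, neither of which shows up in the U+ notation.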

For example, take the plus-minus sign, ±:

Unicode code point: U+00B1 (as a textual string)
UTF-8             : 0xC2 0xB1 (as two bytes)
UTF-16            : 0x00B1
UTF-16BE          : 0x00B1 as 0x00 0xB1 (as two bytes)
UTF-16LE          : 0x00B1 as 0xB1 0x00 (as two bytes)
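The byte sequences above can be reproduced with Python's built-in codecs. A quick check, with `encoded_bytes` as an illustrative helper name; note that Python's plain `"utf-16"` codec also prepends a byte order mark, so the explicit BE and LE variants are used here:

```python
def encoded_bytes(ch: str, codec: str) -> str:
    """Space-separated hex bytes of ch in the given codec (hypothetical helper)."""
    # bytes.hex() with a separator argument requires Python 3.8 or later.
    return ch.encode(codec).hex(" ").upper()


ch = "\u00b1"  # plus-minus sign
print(f"U+{ord(ch):04X}")
for codec in ("utf-8", "utf-16-be", "utf-16-le"):
    print(codec, encoded_bytes(ch, codec))
# U+00B1
# utf-8 C2 B1
# utf-16-be 00 B1
# utf-16-le B1 00
```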

https://www.fileformat.info/info/unicode/char/00b1/index.htm

Most of this information can be found at sil.org.

unicode

encoding

utf-8

utf-16