Length of a single character encoded in UTF-32

Question

Wikipedia tells me that the UTF-32 encoding uses 32 bits per code point, so why does this give me a length of 64 bits?

>>> from bitstring import Bits  # third-party bitstring package
>>> Bits(bytes = 'a'.encode('utf-32')).bin
'1111111111111110000000000000000001100001000000000000000000000000'
>>> len(Bits(bytes = 'a'.encode('utf-32')).bin)
64

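UTF-32 is supposed to be a fixed-width, 4-byte encoding, so as I understand it every character is represented in exactly 32 bits, yet the code above outputs 64 bits. How come?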

Answer 1

Encoding to UTF-32 usually includes a Byte Order Mark (BOM), so you have two characters encoded to UTF-32. The BOM is usually required because it lets the decoder know whether the data was encoded in little-endian or big-endian order. The BOM is really just the U+FEFF ZERO WIDTH NO-BREAK SPACE code point, which in your example is encoded as '11111111111111100000000000000000' (little-endian).
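To make the BOM visible without the bitstring package, here is a minimal sketch using only the standard library. The variable names are illustrative. It assumes CPython's behaviour of writing the BOM in the platform's native byte order for plain 'utf-32', which is why it compares against `codecs.BOM_UTF32` rather than a hard-coded byte string.

```python
import codecs

encoded = 'a'.encode('utf-32')

# Split the result into the leading 4-byte BOM and the 4-byte payload.
bom, payload = encoded[:4], encoded[4:]

print(len(encoded))                # total bytes: BOM + one character
print(bom == codecs.BOM_UTF32)     # BOM in the platform's native byte order
print(len(payload) * 8)            # bits used by the character itself

# The BOM is simply U+FEFF encoded like any other code point.
print('\ufeff'.encode('utf-32-le') == codecs.BOM_UTF32_LE)
```

On any platform the payload alone is 32 bits; the other 32 bits in the question's output belong to the BOM.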

Encode to one of the two endianness-specific variants that Python provides ('utf-32-le' or 'utf-32-be') to get just the single character:

>>> Bits(bytes = 'a'.encode('utf-32-le')).bin
'01100001000000000000000000000000'
>>> len(Bits(bytes = 'a'.encode('utf-32-le')).bin)
32

The -le and -be variants let you encode or decode UTF-32 without a BOM, because you are setting the byte order explicitly.
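The decoding side can be sketched the same way. This is an illustration rather than an exhaustive description of the codecs: it shows that the explicit variants round-trip without a BOM, that plain 'utf-32' uses a leading BOM to pick the byte order and then drops it, and that an explicit variant treats a leading BOM as an ordinary U+FEFF character.

```python
import codecs

# Explicit byte order: exactly 4 bytes per character, no BOM.
data_le = 'a'.encode('utf-32-le')
data_be = 'a'.encode('utf-32-be')
print(data_le, data_be)

# Each variant decodes its own output without needing a BOM.
print(data_le.decode('utf-32-le'), data_be.decode('utf-32-be'))

# Plain 'utf-32' reads a leading BOM to choose the byte order and strips it.
with_bom = codecs.BOM_UTF32_BE + data_be
print(with_bom.decode('utf-32'))

# An explicit variant keeps the BOM as a regular U+FEFF character.
print(repr(with_bom.decode('utf-32-be')))
```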

If you encode more than one character, you will notice that the result is always 4 bytes longer than the characters alone require:

>>> len('abcd'.encode('utf-32'))  # (BOM + 4 chars) * 4 bytes == 20 bytes
20
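The same arithmetic can be checked for other strings. The helper name `utf32_size` below is hypothetical, and the sketch assumes strings made of valid code points (no lone surrogates), for which plain 'utf-32' produces one BOM plus 4 bytes per code point.

```python
def utf32_size(text):
    """Expected byte length of text.encode('utf-32'): BOM + 4 bytes per code point."""
    return 4 * (len(text) + 1)

samples = ['a', 'abcd', 'héllo', '😀']

for s in samples:
    with_bom = len(s.encode('utf-32'))
    without_bom = len(s.encode('utf-32-le'))
    # The BOM accounts for the constant 4-byte difference.
    print(repr(s), with_bom, without_bom, with_bom == utf32_size(s))
```

Note that even a character outside the Basic Multilingual Plane, such as the emoji above, still takes exactly 4 bytes, which is what makes UTF-32 fixed-width.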

python

unicode

python-3.x

utf-32