Length of a single character encoded in UTF-32

Wikipedia tells me that the UTF-32 encoding uses 32 bits per character, so why does this give me a length of 64 bits?

>>> from bitstring import Bits  # assuming Bits comes from the third-party bitstring package
>>> Bits(bytes = 'a'.encode('utf-32')).bin
'1111111111111110000000000000000001100001000000000000000000000000'
>>> len(Bits(bytes = 'a'.encode('utf-32')).bin)
64

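UTF-32 is supposed to be a fixed-width, 4-byte encoding, so as I understand it every character should be represented in exactly 32 bits, yet the code above outputs 64 bits. How come?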

Encoding to UTF-32 usually includes a Byte Order Mark (BOM), so you actually have two characters encoded to UTF-32. The BOM is usually required because it lets the decoder know whether the data was encoded in little-endian or big-endian order. The BOM is really just the U+FEFF ZERO WIDTH NO-BREAK SPACE code point, encoded in your example as '11111111111111100000000000000000' (little endian).
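To see the two halves for yourself, you can split the encoded bytes at the 4-byte boundary. This is only a small sketch using the standard-library `codecs` constants; the variable names `bom` and `payload` are mine, and it assumes Python 3:

```python
import codecs

encoded = 'a'.encode('utf-32')

# The first 4 bytes are the BOM, the remaining 4 bytes are the letter 'a'.
bom, payload = encoded[:4], encoded[4:]

# codecs.BOM_UTF32 is the BOM in the machine's native byte order, which is
# what the plain 'utf-32' codec writes when encoding.
print(bom == codecs.BOM_UTF32)          # True
print(len(bom) * 8, len(payload) * 8)   # 32 32

# The payload on its own is the same as encoding with an explicit byte order.
print(payload in ('a'.encode('utf-32-le'), 'a'.encode('utf-32-be')))  # True
```

So the 64 bits are 32 bits of BOM followed by 32 bits for the character itself.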

Encode to one of the two byte-order-specific variants that Python provides ('utf-32-le' or 'utf-32-be') to get just the single character:

>>> Bits(bytes = 'a'.encode('utf-32-le')).bin
'01100001000000000000000000000000'
>>> len(Bits(bytes = 'a'.encode('utf-32-le')).bin)
32

The -le and -be variants let you encode or decode UTF-32 without a BOM, because you set the byte order explicitly.
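As a rough illustration of what setting the byte order explicitly buys you, the sketch below encodes the same character both ways and decodes it again. It assumes Python 3; `data_le`, `data_be`, and `with_bom` are just names picked for the example:

```python
import codecs

data_le = 'a'.encode('utf-32-le')
data_be = 'a'.encode('utf-32-be')

# Same code point, opposite byte order, and no BOM in either case.
print(data_le)  # b'a\x00\x00\x00'
print(data_be)  # b'\x00\x00\x00a'

# Decoding with the matching variant needs no BOM.
print(data_le.decode('utf-32-le'))  # a
print(data_be.decode('utf-32-be'))  # a

# If a BOM is present, the generic 'utf-32' decoder reads the byte order
# from it and strips it from the result.
with_bom = codecs.BOM_UTF32_BE + data_be
print(with_bom.decode('utf-32'))    # a
```

The trade-off is that with the explicit variants the byte order has to be agreed on out of band, since nothing in the data records it.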

If you encode more than one character, you'll notice that the result is always 4 bytes longer than the characters themselves require:

>>> len('abcd'.encode('utf-32'))  # (BOM + 4 chars) * 4 bytes == 20 bytes
20
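Put differently, the size follows a simple rule: 4 bytes per code point, plus 4 for the BOM when you use the plain 'utf-32' codec. A quick sketch to check that, assuming Python 3 (where `len()` of a `str` counts code points); `utf32_size` is a hypothetical helper, not something from the standard library:

```python
def utf32_size(text, with_bom=True):
    """Expected UTF-32 size in bytes: 4 per code point, plus 4 for a BOM."""
    return 4 * (len(text) + (1 if with_bom else 0))

for sample in ['a', 'abcd', 'héllo', '😀']:
    # The plain 'utf-32' codec prepends a BOM ...
    assert len(sample.encode('utf-32')) == utf32_size(sample)
    # ... while the explicit byte-order variants do not.
    assert len(sample.encode('utf-32-le')) == utf32_size(sample, with_bom=False)

print(utf32_size('abcd'))                  # 20
print(utf32_size('abcd', with_bom=False))  # 16
```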