Length of a string in Python 3.5 with different encodings

Question

I tried this in Python to get the length of a string in bytes.

>>> s = 'a'
>>> s.encode('utf-8')
b'a'
>>> s.encode('utf-16')
b'\xff\xfea\x00'
>>> s.encode('utf-32')
b'\xff\xfe\x00\x00a\x00\x00\x00'
>>> len(s.encode('utf-8'))
1
>>> len(s.encode('utf-16'))
4
>>> len(s.encode('utf-32'))
8

UTF-8 uses one byte to store an ASCII character, as expected, but why does UTF-16 use 4 bytes? What exactly is len() measuring?

Answer 1

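The lengths look odd because the UTF-16 and UTF-32 codecs prepend a byte order mark (BOM) to the string during encoding. That is why the length appears to be twice what you expected: the encoded output holds two code points, the BOM (U+FEFF) and 'a'. The byte order mark tells a reader a few things, mainly the byte order and the encoding. So len is behaving as you would expect: it measures the number of bytes used in the encoded representation.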
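
As a minimal sketch of the point above, and assuming you only care about the size of the character data itself, the endianness-specific codecs `utf-16-le` and `utf-32-le` from the standard library write no BOM. The variable names below are illustrative only.

```python
s = 'a'

# Endianness-specific codecs write no BOM, so only the character is counted.
len_utf16_le = len(s.encode('utf-16-le'))
len_utf32_le = len(s.encode('utf-32-le'))

# The generic codecs prepend a BOM (2 bytes for UTF-16, 4 bytes for UTF-32).
len_utf16 = len(s.encode('utf-16'))
len_utf32 = len(s.encode('utf-32'))

print(len_utf16_le, len_utf16)  # 2 4
print(len_utf32_le, len_utf32)  # 4 8
```

The difference between each pair is exactly the size of the BOM.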

Answer 2

TL;DR:

UTF-8 : 1 byte 'a'
UTF-16: 2 bytes 'a' + 2 bytes BOM
UTF-32: 4 bytes 'a' + 4 bytes BOM

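UTF-8 is a variable-length encoding; a character is encoded in 1 to 4 bytes. It was designed to match ASCII for the first 128 characters, so 'a' is a single byte wide.
UTF-16 is a variable-length encoding; code points are encoded with one or two 16-bit code units (i.e. 2 or 4 bytes), and 'a' is 2 bytes wide.
UTF-32 is fixed width: every code point takes exactly 32 bits, so every character is 4 bytes wide, and 'a' is 4 bytes wide.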
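
The widths described above can be checked with a short sketch. The helper `encoded_widths` and the sample characters are my own illustration, not part of the question; the `-le` codecs are used so that no BOM is counted.

```python
# Sample characters: ASCII, Latin-1 range, rest of the BMP, outside the BMP.
samples = ['a', '\u00e9', '\u20ac', '\U0001F600']


def encoded_widths(ch):
    """Return byte lengths of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(ch.encode(codec))
                 for codec in ('utf-8', 'utf-16-le', 'utf-32-le'))


for ch in samples:
    # Print the code point followed by its three encoded widths.
    print('U+%04X' % ord(ch), encoded_widths(ch))
```

UTF-8 grows from 1 to 4 bytes across these samples, UTF-16 jumps from 2 to 4 bytes once the code point leaves the Basic Multilingual Plane, and UTF-32 stays at 4 bytes throughout.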

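You might have expected lengths of 1, 2 and 4 for 'a' encoded in UTF-8, UTF-16 and UTF-32 respectively. The actual results of 1, 4 and 8 are inflated because in the last two cases the output includes the BOM: that \xff\xfe prefix is the byte order mark, used to indicate the endianness of the data.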

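The Unicode standard permits a BOM in UTF-8 but neither requires nor recommends it (byte order has no meaning for a byte-oriented encoding), which is why you don't see a BOM in the first example. The UTF-16 BOM is 2 bytes wide and the UTF-32 BOM is 4 bytes wide (in the little-endian output shown here it is the UTF-16 BOM followed by two padding null bytes).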
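
A small sketch of the BOM sizes just mentioned, using the constants in the standard `codecs` module and the `utf-8-sig` codec, which is the standard codec that does write a UTF-8 BOM. The names `plain` and `with_sig` are illustrative.

```python
import codecs

# UTF-8 output has no BOM unless the 'utf-8-sig' codec is used.
plain = 'a'.encode('utf-8')
with_sig = 'a'.encode('utf-8-sig')
print(plain, with_sig)  # b'a' b'\xef\xbb\xbfa'

# The little-endian UTF-32 BOM is the UTF-16 one plus two null bytes.
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF32_LE)  # b'\xff\xfe\x00\x00'
```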

>>> 'a'.encode('utf-16')  # length 4: 2 bytes BOM + 2 bytes a
b'\xff\xfea\x00'
  BOM.....a....
>>> 'aaa'.encode('utf-16')  # length 8: 2 bytes BOM + 3*2 bytes of a
b'\xff\xfea\x00a\x00a\x00'
  BOM.....a....a....a....

Seeing the BOM in the data may be clearer if you look at the raw bits with the bitstring module:

>>> # pip install bitstring
>>> from bitstring import Bits
>>> Bits(bytes='a'.encode('utf-32')).bin
'1111111111111110000000000000000001100001000000000000000000000000'
>>> Bits(bytes='aaa'.encode('utf-32')).bin
'11111111111111100000000000000000011000010000000000000000000000000110000100000000000000000000000001100001000000000000000000000000'
 BOM.............................a...............................a...............................a...............................

Answer 3

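len() returns the length (the number of items) of an object. When you encode the string with s.encode('utf-16'), Python returns the encoded version of the string with a byte order mark prepended. This counts toward the length. To illustrate my point: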

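s = 'a'
encoded = s.encode('utf-16')
for i in range(len(encoded) + 1):  # +1 so the last slice is the full string
  print(encoded[:i])

Result:

b''
b'\xff'
b'\xff\xfe'  # the two-byte byte order mark
b'\xff\xfea'
b'\xff\xfea\x00'  # BOM plus the two bytes that encode 'a'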
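
To tie this back to what len() measures, here is a minimal sketch contrasting the length of the str with the length of the encoded bytes, and showing that decoding with the same generic codec removes the BOM again. The variable names are illustrative.

```python
s = 'a'
encoded = s.encode('utf-16')

print(len(s))        # 1 (code points in the str)
print(len(encoded))  # 4 (bytes, BOM included)

# Decoding with the same generic codec consumes the BOM again.
decoded = encoded.decode('utf-16')
print(decoded == s, len(decoded))  # True 1
```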

