Why does Python 2 think these bytes are the microphone emoji but Python 3 doesn't?

Question

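I have some data in a database that was entered by a user: "BTS⚾️>BTS🎤", i.e. "BTS" + the baseball emoji + ">BTS" + the microphone emoji. When I read it from the database, decode it, and print it in Python 2, the emojis display correctly. But when I try to decode the same bytes in Python 3, it fails with a UnicodeDecodeError.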

The bytes in Python 2:

>>> data
'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

Decoding these as UTF-8 outputs this unicode string:

>>> 'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
u'BTS\u26be\ufe0f>BTS\U0001f3a4'

Printing that unicode string on my Mac shows the baseball and microphone emojis:

>>> print u'BTS\u26be\ufe0f>BTS\U0001f3a4'
BTS⚾️>BTS🎤

But in Python 3, decoding the same bytes as UTF-8 gives me an error:

>>> b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 13: invalid continuation byte

In particular, the last 6 bytes (the microphone emoji) seem to be the problem:

>>> b'\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

Other tools, such as this online hex-to-Unicode converter, also tell me that these bytes are not a valid Unicode character:

https://onlineutf8tools.com/convert-bytes-to-utf8?input=ed%20a0%20bc%20ed%20be%20a4

Why do Python 2 and whichever program encoded the user input think these bytes are the microphone emoji, while Python 3 and other tools do not?

Answer 1

Try encoding the string u'BTS\u26be\ufe0f>BTS\U0001f3a4' as UTF-8 again in Python 3:

text = u'BTS\u26be\ufe0f>BTS\U0001f3a4'
result = text.encode('utf_8')
print(result)
result.decode('utf_8')

result contains these bytes:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'

This differs from the bytes you have in Python 2:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'

But if you decode result, b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4', as UTF-8 in Python 3, you get the result you want.

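In short, Python 2 and Python 3 handle these bytes differently: Python 2's decoder accepts the surrogate-based form, while Python 3 only accepts the standard 4-byte form. The database should therefore store the bytes that Python 3 produces when encoding the string.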
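To find out which stored values are affected, one option is to attempt a strict decode and see whether it raises. The following is a minimal sketch for Python 3; the helper name `is_strict_utf8` is made up for illustration and is not part of any library.

```python
def is_strict_utf8(raw):
    """Return True if raw decodes under Python 3's strict UTF-8 codec."""
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


# The bytes as stored by the Python 2 side (surrogates encoded individually).
py2_bytes = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
# The bytes Python 3 produces for the same text.
py3_bytes = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'

print(is_strict_utf8(py2_bytes))  # False
print(is_strict_utf8(py3_bytes))  # True
```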

Answer 2

A couple of web pages appear to help answer your question:

https://bugs.python.org/issue9133 (about Python 2's overly permissive UTF-8 handling)
(about dealing with that permissiveness)

If I decode the bytes you got from Python 2 using Python 3's "surrogatepass" error handler, i.e.:

b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8',
    errors='surrogatepass')

then I get the string 'BTS⚾️>BTS\ud83c\udfa4', where '\ud83c\udfa4' is the surrogate pair that is supposed to represent the microphone emoji.
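To see where that surrogate pair comes from, the arithmetic can be worked through by hand. The sketch below (Python 3; `decode_three_bytes` is a name chosen only for this illustration) unpacks each 3-byte group using the usual 1110xxxx 10xxxxxx 10xxxxxx layout and then combines the two surrogates with the UTF-16 formula.

```python
def decode_three_bytes(seq):
    """Unpack a 3-byte 1110xxxx 10xxxxxx 10xxxxxx sequence into a code point."""
    b0, b1, b2 = seq  # iterating bytes yields ints in Python 3
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


high = decode_three_bytes(b'\xed\xa0\xbc')
low = decode_three_bytes(b'\xed\xbe\xa4')
print(hex(high), hex(low))  # 0xd83c 0xdfa4

# Combine the high and low surrogates the way UTF-16 does.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))  # 0x1f3a4
```

Each surrogate was encoded as its own 3-byte sequence, a form often referred to as CESU-8. Strict UTF-8 instead encodes U+1F3A4 directly as the 4 bytes \xf0\x9f\x8e\xa4, which is why Python 3 rejects the 6-byte form.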

You can get the microphone back in Python 3 by encoding the string containing the surrogate pair as UTF-16 with "surrogatepass" and then decoding it as UTF-16:

>>> string_as_utf_8 = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'.decode('utf_8', errors='surrogatepass')
>>> bytes_as_utf_16 = string_as_utf_8.encode('utf_16', errors='surrogatepass')
>>> string_as_utf_16 = bytes_as_utf_16.decode('utf_16')
>>> print(string_as_utf_16)
BTS⚾️>BTS🎤
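The same round trip can be wrapped in a small helper so that stored values can be converted to standard UTF-8 once. This is only a sketch for Python 3; the name `repair_surrogates` is hypothetical, and it assumes the input is either valid UTF-8 or UTF-8 with individually encoded surrogates.

```python
def repair_surrogates(raw):
    """Decode bytes that may contain individually encoded surrogates.

    Surrogates are let through on decode, then paired up by a UTF-16
    round trip, giving a normal str with real astral characters.
    """
    text = raw.decode('utf-8', errors='surrogatepass')
    return text.encode('utf-16', errors='surrogatepass').decode('utf-16')


legacy = b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xed\xa0\xbc\xed\xbe\xa4'
fixed_text = repair_surrogates(legacy)
fixed_bytes = fixed_text.encode('utf-8')  # standard UTF-8, safe to store

print(fixed_bytes)
# b'BTS\xe2\x9a\xbe\xef\xb8\x8f>BTS\xf0\x9f\x8e\xa4'
```

Input that is already strict UTF-8 passes through unchanged, so the helper can be applied to every row rather than only to the ones that fail a strict decode.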

python

unicode

python-2.x

python-3.x

emoji