Unicode decode mismatch on emojis when using json loads

Question

I have a list of UTF-8 encoded objects such as:

test = [b'{"abc\xf0\x9f\x94\xa5\xf0\x9f\x91\xbd\xf0\x9f\xa7\x83": 123}',
 b'{"abc\xf0\x9f\xa7\x83": 234}']

which I decode as follows:

import json
result = list(map(lambda x: json.loads(x.decode('utf-8','ignore')),test))

I noticed that some emojis are not converted as expected, as shown below:

[{'abc🔥👽\U0001f9c3': 123}, {'abc\U0001f9c3': 234}]

However, when I decode an individual string, I get the expected output:

print(b"abc\xf0\x9f\x94\xa5\xf0\x9f\x91\xbd\xf0\x9f\xa7\x83".decode('utf-8'))
abc🔥👽🧃

I am not sure why the first approach, using json.loads, produces unexpected output. Can anyone provide some pointers?

Answer 1

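You are printing a list after json.loads(). A list displays its strings using their debug representation (repr()), which consults the Unicode table bundled with the interpreter to decide whether each code point is printable. If the code point is unknown to that table, you get an escape code in the list display. Print a string directly to see its "user-friendly" representation (str()), which has no escape codes.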

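U+1F9C3 BEVERAGE BOX was added in Unicode 12.0. Python 3.7 uses the Unicode 11.0 definitions, which is why you see it as an escape code. Python 3.8 uses Unicode 12.1, and its updated table marks the character as printable. If your terminal supports the character and an appropriate font is in use, it will be displayed.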
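To check which Unicode version a given interpreter bundles, and whether it treats this character as printable, something like the following may help. It is only a sketch: the variable names are illustrative, and the values printed depend on the Python build that runs it.

```python
import sys
import unicodedata

# U+1F9C3 BEVERAGE BOX, the character from the question.
ch = "\U0001f9c3"

# Python version and the version of the Unicode database it ships with.
print(sys.version_info[:2], unicodedata.unidata_version)

# repr() leaves a character unescaped only if the bundled table
# marks it as printable; otherwise it emits a \U escape.
printable = ch.isprintable()
shown_raw = repr(ch) == "'" + ch + "'"
print(printable, shown_raw)
```

The two flags should always agree: an interpreter whose table predates Unicode 12.0 is expected to report both as False, and a newer one both as True.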

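For example, I'm using Python 3.10, which supports Unicode 13.0. U+1F978 is defined in Unicode 13.0, but U+1F979 was added in Unicode 14.0. Your browser may or may not display the actual emoji, depending on its Unicode support and the font used (Chrome 99 didn't). If it doesn't, a replacement character is shown. Either way, this still demonstrates the difference between the repr() display of a string and the str() used by print: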

>>> s = '\U0001f978\U0001f979'
>>> s                      # The REPL shows the repr (debug) representation
'🥸\U0001f979'
>>> print(repr(s))         # forcing print to use the repr as well.
'🥸\U0001f979'
>>> [s]                    # repr() is also used for list content.
['🥸\U0001f979']
>>> print(s)               # no escape codes here.
🥸🥹
>>> print(ascii(s))        # forcing all non-ASCII to escape codes
'\U0001f978\U0001f979'
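Applied to the data in the question, one possible way to see the decoded text without escape codes is to print the strings themselves rather than the enclosing list, or to serialize the result back to JSON with ensure_ascii=False. This is a sketch and not part of the original answer: it assumes a terminal or output stream that can encode the emojis, and whether a glyph actually appears still depends on the font.

```python
import json

# The byte strings from the question.
test = [b'{"abc\xf0\x9f\x94\xa5\xf0\x9f\x91\xbd\xf0\x9f\xa7\x83": 123}',
        b'{"abc\xf0\x9f\xa7\x83": 234}']

# The decoded data is already correct; only its repr() display differs.
result = [json.loads(raw.decode('utf-8')) for raw in test]

# str() path: print each key directly instead of printing the list.
for item in result:
    for key, value in item.items():
        print(key, value)

# Alternative: dump to JSON text while keeping non-ASCII characters as-is.
text = json.dumps(result, ensure_ascii=False)
print(text)
```

Neither approach consults the interpreter's printable-character table, so the output should be the same on Python 3.7 and later versions.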

python

json

utf-8

unicode-escapes