Why are some emojis not converted back into their representation?

Question

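I'm developing an emoji detection module. For some emojis I'm observing strange behavior: after encoding them to UTF-8, they do not convert back to their original representation. I need to send their exact colored representation in the API response rather than a Unicode-escaped string. Any leads?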

In [1]: x = "example1: 🤭 and example2: 😁 and example3: 🥺"

In [2]: x.encode('utf8')
Out[2]: b'example1: \xf0\x9f\xa4\xad and example2: \xf0\x9f\x98\x81 and example3: \xf0\x9f\xa5\xba'

In [3]: x.encode('utf8').decode('utf8')
Out[3]: 'example1: \U0001f92d and example2: 😁 and example3: \U0001f97a'

In [4]: print(x.encode('utf8').decode('utf8'))
example1: 🤭 and example2: 😁 and example3: 🥺

Link Emoji used in example

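Update 1: This example should make the problem clearer. Here, two of the emojis were rendered when I sent the Unicode-escaped string, but the third example failed to convert the emoji correctly. What should I do in this case?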

Answer 1

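'\U0001f92d' == '🤭' is True. The first is an escape code, but it is still the same character; these are just two ways to enter or display it. The escaped form is the repr() of the string, whereas print() uses str(). Example: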

>>> s = '🤭'
>>> print(repr(s))
'\U0001f92d'
>>> print(str(s))
🤭
>>> s
'\U0001f92d'
>>> print(s)
🤭

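When Python generates the repr(), it uses an escape code if it considers the character non-printable, that is, if it assumes the display may not be able to handle it. The contents of the string are unchanged: the same Unicode code points.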

This is a debugging feature. For example, is that white space made of spaces or a tab? The repr() of the string makes it clear by using \t as an escape code.

>>> s = 'a\tb'
>>> print(s)
a       b
>>> s
'a\tb'
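
To check this for the emoji in question, a minimal sketch along these lines should do (the variable names are only illustrative). It compares the escaped and literal spellings, does the UTF-8 round trip from the question, and uses the built-in ascii(), which always escapes non-ASCII characters, so the output is safe on any terminal:

```python
escaped = "\U0001f92d"   # escape-code spelling
literal = "🤭"           # literal spelling of the same character

# Both spellings produce the same one-code-point string.
same = escaped == literal
print(same, len(escaped), hex(ord(escaped)))

# Encoding to UTF-8 and decoding again leaves the contents unchanged.
encoded = escaped.encode("utf-8")
round_trip = encoded.decode("utf-8")
print(encoded, round_trip == literal)

# ascii() always shows the escape; repr() may or may not, depending on
# which characters this interpreter considers printable.
print(ascii(round_trip))
```

Whatever the interactive prompt shows, the bytes that would go out in a response are the same.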

As for why one emoji is shown as an escape code and another is not, that depends on the Unicode version supported by the version of Python in use.

Python 3.6 uses Unicode 9.0, and two of your three emojis are not defined at that version level:

>>> import unicodedata as ud
>>> ud.unidata_version
'9.0.0'
>>> ud.name('😁')
'GRINNING FACE WITH SMILING EYES'
>>> ud.name('🤭')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name
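
To see how your own interpreter classifies each of the three characters, a small helper like the one below may be useful. This is just a sketch: describe() is a made-up name, not a library function. It relies on unicodedata.name() accepting a default value and on str.isprintable(), which the Python docs describe as the test for characters that repr() should not escape.

```python
import unicodedata

def describe(ch):
    """Return (code point, name or None, printable flag) for one character."""
    code_point = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch, None)   # None if this Unicode database has no name
    return (code_point, name, ch.isprintable())

print("Unicode database version:", unicodedata.unidata_version)
for ch in "\U0001f92d\U0001f601\U0001f97a":
    print(describe(ch))
```

A character reported with no name and a printable flag of False is one that repr() will show as an escape on that interpreter; a newer Python with a newer Unicode database should report all three as printable.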

unicode

utf-8

unicode-escapes

python-3.x

python-unicode