Python 3 UnicodeEncodeError for characters and smileys in Tweets

Question

I am working with the Twitter API and fetching tweets about a specific word (right now it is 'flafel'). Everything works fine except for this tweet:

b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\xf0\x9f\x98\x82'

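I use print ("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8'))) to view tweets, but this one gives me a UnicodeEncodeError every time. If I remove decode() from that line, as in print ("Tweet info: {}".format(str(tweet.text).encode('utf-8'))), I can see the actual tweet as shown above, but I want to convert the \xf0\x9f\x98\x82 part to a str. I have tried everything, every combination of decode and encode. How can I solve this?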

Edit: I just visited that user's Twitter account to see what the non-ASCII part is, and it turns out to be a smiley.

Is it possible to convert that smiley?

Edit 2: The code is:

...
...
api = tweepy.API(auth)
for tweet in tweepy.Cursor(api.search,
                           q = "flafel",
                           result_type = "recent",
                           include_entities = True,
                           lang = "en").items():

    print ("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8')))

Answer 1

The problem probably arises because you are trying to use the Unicode character \U0001f602 on Windows. Python 3 can convert it from UTF-8 to full Unicode and back, but Windows cannot display it.

I tried this code in several ways on a Windows 7 box:

>>> b = b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\xf0\x9f\x98\x82'
>>> u = b.decode('utf8')
>>> u
'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001f602'
>>> print(u)

Here is what happened:

In IDLE (Python's Tk-based GUI interpreter), I got this error:

UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 139-139: Non-BMP character not supported in Tk

In a console using a non-Unicode code page, I got this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f602' in position 139: character maps to <undefined>

(For attentive readers: BMP here means the Basic Multilingual Plane, that is, code points U+0000 through U+FFFF.)
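Both failure modes can be reproduced without a Windows console by encoding explicitly. This is a minimal sketch, assuming cp1252 as a stand-in for a non-Unicode code page; the actual console code page is not stated in the answer and may differ.

```python
smiley = '\U0001f602'

# The code point is above 0xFFFF, so it lies outside the BMP.
print(hex(ord(smiley)))      # 0x1f602
print(ord(smiley) > 0xFFFF)  # True

# UTF-8 round-trips the character without loss.
assert smiley.encode('utf-8') == b'\xf0\x9f\x98\x82'
assert b'\xf0\x9f\x98\x82'.decode('utf-8') == smiley

# A legacy single-byte code page has no mapping for it.
# cp1252 is an assumed example, not necessarily the console's code page.
try:
    smiley.encode('cp1252')
except UnicodeEncodeError as exc:
    print('cannot encode:', exc.reason)
```

The string itself is valid; only the conversion to the output device's encoding fails.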

In a console using the UTF-8 code page (chcp 65001), I got no error, but the display was odd:

>>> u
'And when I\'m thinking about getting the chili sauce on my flafel and the waitr
ess, a Pinay, tells me not to get it cos "hindi yan masarap."ðŸ˜‚'
>>> print(u)
And when I'm thinking about getting the chili sauce on my flafel and the waitres
s, a Pinay, tells me not to get it cos "hindi yan masarap."ðŸ˜‚
>>>

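My conclusion is that the error is not in the UTF-8 <-> Unicode conversion. Rather, it looks like the Windows Tk version does not support this character, and neither does any console code page (except 65001, which simply tries to display the individual UTF-8 bytes!).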

TL;DR: The problem is not in core Python processing, nor in the UTF-8 converter, but only in the system conversion used to display the character '\U0001f602'.
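Since the failure happens only when the text is converted for display, one possible workaround is to escape whatever the output encoding cannot represent before printing. The sketch below is not from the original answer; safe_text is a hypothetical helper name, and falling back to ASCII when sys.stdout.encoding is unavailable is an assumption.

```python
import sys

def safe_text(text, encoding=None):
    """Escape characters that the target encoding cannot represent."""
    # Assumption: fall back to ASCII if stdout reports no encoding.
    encoding = encoding or sys.stdout.encoding or 'ascii'
    # backslashreplace turns unencodable characters into escape sequences
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

tweet_text = 'tells me not to get it cos "hindi yan masarap."\U0001f602'
print(safe_text(tweet_text))
```

On a terminal that can encode the smiley, the text passes through unchanged; on a legacy code page the smiley is shown as an escape sequence rather than aborting the loop.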

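Fortunately, because core Python has no problem with the character, you can easily change the offending '\U0001f602' to ':D', for example with a simple string replace (continuing from the code shown above):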

>>> print (u.replace(U'\U0001f602', ':D'))

And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":D

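If you want special handling for all characters outside the BMP, it is enough to know that the highest BMP code point is 0xFFFF. So you can use code like this: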

import io

def convert(t):
    """Replace every character outside the BMP with '!'."""
    with io.StringIO() as fd:
        for c in t:
            # Code points above 0xFFFF lie outside the BMP.
            fd.write(c if ord(c) < 0x10000 else '!')
        return fd.getvalue()
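The same filtering can also be expressed with a regular expression. This is an alternative sketch rather than the answer's method; NON_BMP and replace_non_bmp are illustrative names.

```python
import re

# Matches any single code point above U+FFFF, i.e. outside the BMP.
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def replace_non_bmp(text, placeholder='!'):
    """Replace each non-BMP character in text with placeholder."""
    return NON_BMP.sub(placeholder, text)

print(replace_non_bmp('masarap."\U0001f602'))  # masarap."!
```

Characters inside the BMP, such as © or ™, are left untouched by this pattern.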

Answer 2

As I mentioned in the comments, you can use the standard unicodedata module to get the names of Unicode code points. Here is a small demo:

import unicodedata as ud

test = ('And when I\'m thinking about getting the chili sauce on my flafel and the '
    'waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001F602')

def convert_special(c):
    if c > '\uffff':
        c = ':{}:'.format(ud.name(c).lower().replace(' ', '_')) 
    return c

def convert_string(s):
    return ''.join([convert_special(c) for c in s])

for s in (test, 'Some special symbols \U0001F30C, ©, ®, ™, \U0001F40D, \u2323'): 
    print('{}\n{}\n'.format(s.encode('unicode-escape'), convert_string(s)))

Output

b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001f602'
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":face_with_tears_of_joy:

b'Some special symbols \U0001f30c, \xa9, \xae, \u2122, \U0001f40d, \u2323'
Some special symbols :milky_way:, ©, ®, ™, :snake:, ⌣

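Another option is to test whether the character is in the Unicode "Symbol_Other" category. We can do that by replacing the test

if c > '\uffff':

in convert_special with the test

if ud.category(c) == 'So':

When we do that, we get this output: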

b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001f602'
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":face_with_tears_of_joy:

b'Some special symbols \U0001f30c, \xa9, \xae, \u2122, \U0001f40d, \u2323'
Some special symbols :milky_way:, :copyright_sign:, :registered_sign:, :trade_mark_sign:, :snake:, :smile:
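One caveat: unicodedata.name raises ValueError for code points that have no name (private-use characters, for example), which the c > '\uffff' test can let through. A hedged variant that supplies a default is sketched below; describe is a hypothetical helper, and the U+XXXX fallback format is an arbitrary choice.

```python
import unicodedata as ud

def describe(c):
    """Return ':name:' for a named character, else a 'U+XXXX' fallback."""
    # Passing a default as the second argument avoids the ValueError
    # that ud.name() raises for unnamed code points.
    name = ud.name(c, None)
    if name is None:
        return 'U+{:04X}'.format(ord(c))
    return ':{}:'.format(name.lower().replace(' ', '_'))

print(describe('\U0001F602'))  # :face_with_tears_of_joy:
print(describe('\U000F0000'))  # U+F0000 (private use, no name)
```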

