Python 3 UnicodeEncodeError for characters and smileys in Tweets

I am building something on the Twitter API that fetches tweets about a specific word (currently 'flafel'). Everything works fine except for this tweet:

b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\xf0\x9f\x98\x82'

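I use print ("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8'))) to look at the tweets, but this one gives me a UnicodeEncodeError every time. If I remove decode() from that line, as in print ("Tweet info: {}".format(str(tweet.text).encode('utf-8'))), I can see the actual tweet as shown above, but I want to convert the \xf0\x9f\x98\x82 part to a str. I have tried everything, every variation of decode/encode and so on. How can I solve this?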

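Edit: Okay, I just went to that user's Twitter account to see what the non-ASCII part was, and it turned out to be a smiley.

Is it possible to convert that smiley?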

Edit 2: The code is:

...
...
api = tweepy.API(auth)
for tweet in tweepy.Cursor(api.search,
                           q = "flafel",
                           result_type = "recent",
                           include_entities = True,
                           lang = "en").items():

    print ("Tweet info: {}".format(str(tweet.text).encode('utf-8').decode('utf-8')))

The problem probably arises because you are trying to use the Unicode character \U0001f602 on Windows. Python 3 can convert it from UTF-8 to full Unicode and back, but Windows cannot display it.
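
To make the "and back" part concrete, here is a minimal sketch (the variable names are only illustrative, and the byte string is just the tail of the tweet). It suggests that the encode/decode round trip from the question is a no-op, so the exception has to come from print() writing to the console rather than from the conversion itself:

```python
# Tail of the tweet as UTF-8 bytes; the last four bytes are the emoji.
raw = b'cos "hindi yan masarap."\xf0\x9f\x98\x82'
text = raw.decode('utf-8')          # bytes -> str, works on any platform

# Encoding to UTF-8 and decoding again gives back the very same string.
round_trip = text.encode('utf-8').decode('utf-8')
print(round_trip == text)           # True

# ascii() gives an escape-only view that any console can display.
print(ascii(text[-1]))              # '\U0001f602'
```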

I tried the following in several environments on a Windows 7 box:

>>> b = b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\xf0\x9f\x98\x82'
>>> u = b.decode('utf8')
>>> u
'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001f602'
>>> print(u)

Here is what happened:

  • In IDLE (Python's Tk-based GUI interpreter), I got this error:

UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 139-139: Non-BMP character not supported in Tk

  • In a console using a non-Unicode code page, I got this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f602' in position 139: character maps to <undefined>

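(For the attentive reader: BMP in the Tk error message stands for Basic Multilingual Plane.)

  • In a console using the UTF-8 code page (chcp 65001), I got no error, but the display was odd: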

    >>> u
    'And when I\'m thinking about getting the chili sauce on my flafel and the waitr
    ess, a Pinay, tells me not to get it cos "hindi yan masarap."😂'
    >>> print(u)
    And when I'm thinking about getting the chili sauce on my flafel and the waitres
    s, a Pinay, tells me not to get it cos "hindi yan masarap."😂
    >>>
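
The three outcomes above come down to whether the output encoding can represent the code point at all. The following is a rough sketch, not part of the original session: can_encode is a hypothetical helper, and cp437 and cp1252 are only assumed here as examples of legacy Windows code pages.

```python
emoji = b'\xf0\x9f\x98\x82'.decode('utf-8')

code_point = ord(emoji)                            # 0x1F602
is_outside_bmp = code_point > 0xFFFF               # the BMP ends at U+FFFF
utf16_units = len(emoji.encode('utf-16-le')) // 2  # 2: a surrogate pair

def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

print(hex(code_point), is_outside_bmp, utf16_units)
print(can_encode(emoji, 'utf-8'))    # UTF-8 covers every code point
print(can_encode(emoji, 'cp437'))    # legacy code page: no mapping
```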
    

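My conclusion is that the error is not in the UTF-8 <-> Unicode conversion. Rather, it looks like the Windows Tk version does not support this character, and neither does any console code page (except 65001, which just tries to display the individual UTF-8 bytes!).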

TL;DR: The problem is neither in core Python processing nor in the UTF-8 converter, but only in the system conversion used to display the character '\U0001f602'.

Fortunately, since core Python has no problem with it, you can easily change the offending '\U0001f602' to, say, ':D' with a plain str.replace (continuing from the session shown above):

>>> print (u.replace(U'\U0001f602', ':D'))
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":D
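
If you would rather not pick a replacement for each character, another possibility is to let the codec escape whatever the target encoding cannot represent. This is only a sketch: console_safe is a hypothetical helper name, and 'cp437' merely stands in for a legacy console code page.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters missing from the encoding shown as escapes."""
    # Fall back to ASCII if stdout does not report an encoding.
    encoding = encoding or sys.stdout.encoding or 'ascii'
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

# The emoji is not in cp437, so it is shown as a \U escape instead.
print(console_safe('masarap."\U0001f602', 'cp437'))
```

With an encoding that does contain the character (UTF-8, for example) the text comes back unchanged.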

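If you want special handling for all characters outside the BMP, all you need to know is that the highest code point in the BMP is 0xFFFF. So you can use code like this: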

import io

def convert(t):
    """Return t with every character outside the BMP replaced by '!'."""
    with io.StringIO() as fd:
        for c in t:
            # Code points above 0xFFFF lie outside the BMP.
            fd.write(c if ord(c) < 0x10000 else '!')
        return fd.getvalue()
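
The same idea can also be written with a regular expression. This is an alternative sketch rather than the approach used above; NON_BMP and replace_non_bmp are names made up for the example.

```python
import re

# Character class covering every code point above the BMP.
NON_BMP = re.compile('[\U00010000-\U0010FFFF]')

def replace_non_bmp(text, replacement='!'):
    """Replace each character outside the BMP with the replacement string."""
    return NON_BMP.sub(replacement, text)

print(replace_non_bmp('hindi yan masarap."\U0001f602'))
```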

As I mentioned in the comments, you can use the standard unicodedata module to get the names of Unicode code points. Here is a small demo:

import unicodedata as ud

test = ('And when I\'m thinking about getting the chili sauce on my flafel and the '
    'waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001F602')

def convert_special(c):
    if c > '\uffff':
        c = ':{}:'.format(ud.name(c).lower().replace(' ', '_')) 
    return c

def convert_string(s):
    return ''.join([convert_special(c) for c in s])

for s in (test, 'Some special symbols \U0001F30C, ©, ®, ™, \U0001F40D, \u2323'): 
    print('{}\n{}\n'.format(s.encode('unicode-escape'), convert_string(s)))

Output

b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001f602'
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":face_with_tears_of_joy:

b'Some special symbols \U0001f30c, \xa9, \xae, \u2122, \U0001f40d, \u2323'
Some special symbols :milky_way:, ©, ®, ™, :snake:, ⌣

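Another option is to test whether a character is in the Unicode "Other_Symbol" (So) category. We can do that by replacing the test

if c > '\uffff':

in convert_special with the test

if ud.category(c) == 'So':

When we do that, we get this output: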

b'And when I\'m thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap."\U0001f602'
And when I'm thinking about getting the chili sauce on my flafel and the waitress, a Pinay, tells me not to get it cos "hindi yan masarap.":face_with_tears_of_joy:

b'Some special symbols \U0001f30c, \xa9, \xae, \u2122, \U0001f40d, \u2323'
Some special symbols :milky_way:, :copyright_sign:, :registered_sign:, :trade_mark_sign:, :snake:, :smile:
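
One caveat: unicodedata.name raises ValueError for code points that have no name (private-use characters, for instance) unless a default is passed as the second argument. A more defensive variant might look like the sketch below; convert_special_safe and the ':u+xxxx:' fallback format are assumptions made for illustration.

```python
import unicodedata as ud

def convert_special_safe(c):
    """Like convert_special, but tolerate code points without a name."""
    if c > '\uffff':
        name = ud.name(c, None)          # None instead of ValueError
        if name is None:
            # Hypothetical fallback format: the code point in hex.
            return ':u+{:04x}:'.format(ord(c))
        return ':{}:'.format(name.lower().replace(' ', '_'))
    return c

# U+F0000 is a private-use code point and has no name.
print(''.join(convert_special_safe(c) for c in 'ok \U0001f602 \U000f0000'))
```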