Remove unicode characters python

Question

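I am extracting tweets in Python using tweepy. It returns all of the data as type unicode. For example, print type(data) gives me <type 'unicode'>

The data contains unicode characters. For example: hello\u2026 im am fine\u2019s

I want to remove all of these unicode characters. Is there a regular expression I can use? str.replace is not a viable option, because the unicode characters can be anything from smileys to unicode apostrophes.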

Answer 1

In [10]: from unicodedata import normalize

In [11]: out_text = normalize('NFKD', input_text).encode('ascii','ignore')

Try this.
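To make the effect concrete, here is a minimal sketch that applies the same two calls to the sample string from the question. The helper name strip_non_ascii is hypothetical (it does not appear in the original answer), and the trailing .decode('ascii') is an added assumption so that the snippet returns text on both Python 2.7 and Python 3.

```python
from unicodedata import normalize

def strip_non_ascii(text):
    """Hypothetical helper: decompose with NFKD, then drop anything non-ASCII."""
    # NFKD replaces compatibility characters with simpler equivalents where
    # they exist; encode(..., 'ignore') silently discards whatever is left
    # that has no ASCII representation.
    ascii_bytes = normalize('NFKD', text).encode('ascii', 'ignore')
    # decode() is an added step so the result is text, not bytes, on Python 3.
    return ascii_bytes.decode('ascii')

sample = u'hello\u2026 im am fine\u2019s'
cleaned = strip_non_ascii(sample)
print(cleaned)  # hello... im am fines
```

The ellipsis (U+2026) has a compatibility decomposition into three full stops, so it survives as "...", while the curly apostrophe (U+2019) has no ASCII equivalent and is simply dropped.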

EDIT

Actually, normalize returns the normal form `form` for the Unicode string unistr. Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'. If you want to know more about NFKD, go to this link
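As a rough illustration of how the forms differ, the sketch below normalizes two characters chosen purely for this example (an accented letter and a ligature). It is written to run on both Python 2.7 and Python 3.

```python
from unicodedata import normalize

accented = u'\u00e9'  # LATIN SMALL LETTER E WITH ACUTE
ligature = u'\ufb01'  # LATIN SMALL LIGATURE FI

# NFD splits the accented letter into a base letter plus a combining accent.
nfd = normalize('NFD', accented)
# NFC recomposes it into the single precomposed code point.
nfc = normalize('NFC', nfd)
# The ligature only has a compatibility decomposition, so NFD leaves it
# alone while NFKD expands it to the two plain letters.
nfd_ligature = normalize('NFD', ligature)
nfkd_ligature = normalize('NFKD', ligature)

print(repr(nfd), repr(nfc), repr(nfd_ligature), repr(nfkd_ligature))
```

This is why NFKD is a reasonable choice before encoding to ASCII: it is the form that breaks characters into the most basic pieces, so more of the text survives the 'ignore' step.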

In [12]: u = unichr(40960) + u'abcd' + unichr(1972)
In [13]: u.encode('utf-8')
Out[13]: '\xea\x80\x80abcd\xde\xb4'
In [14]: u
Out[14]: u'\ua000abcd\u07b4'
In [16]: u.encode('ascii', 'ignore')
Out[16]: 'abcd'

From the code above, you can see what encode('ascii','ignore') does.

Reference: https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize
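Since the question asked about a regular expression, here is a hedged comparison that is not part of the original answer: a plain regex that deletes every non-ASCII character, next to the normalize-then-encode approach. The sample text and variable names are made up for illustration, and the .decode('ascii') step is an added assumption so that both results are text on Python 2.7 and Python 3.

```python
import re
from unicodedata import normalize

# Matches any run of characters outside the ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7f]+')

text = u'caf\u00e9 \u2026 ok'

# Regex only: every non-ASCII character is deleted outright.
regex_only = NON_ASCII.sub(u'', text)

# Normalize first: decomposable characters keep their ASCII parts.
normalized = normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(repr(regex_only))  # 'caf  ok' (a u prefix is shown on Python 2)
print(repr(normalized))  # 'cafe ... ok' (a u prefix is shown on Python 2)
```

The regex works, but it loses the "e" of the accented letter and the whole ellipsis, whereas normalizing first keeps their ASCII parts.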

python

unicode

unicode-string

python-2.7