Python: convert unicode into its "print" form

I scraped this passage from a web page:

It doesn’t look like a controversial new case management system is going anywhere. So the city plans to spend the next few months helping local social assistance workers learn to live with it.

In the HTML data I downloaded, the Python unicode string looks like this:

mystr = u'It doesn\u2019t look like a controversial new case management system is going anywhere. So\xa0the city plans to spend the next few months helping local social assistance workers learn to live with it.'

My plan is to use something like mystr.find("doesn't") to find the position of a word. Currently, mystr.find("doesn't") returns -1 because what mystr actually contains is doesn\u2019t.

Is there a quick way to convert mystr into exactly the form of the paragraph above, so that all the unicode "characters" are replaced with "normal" characters and I can use str.find()?

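So far, the best advice I have found online is to replace u'\u2019' with "'" and then replace u'\xa0' with ' '. Is there a more convenient way, so that I don't have to write a method and build a translation dictionary myself?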

PS: I also tried things like unicodedata.normalize, but that didn't seem to work.

Edit: I forgot to mention that the Python version is 2.7.

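The string does not contain doesn't but doesn’t: the quote is a non-ASCII Unicode character. If you pass doesn’t as a plain byte string, Python 2 tries to decode it as ASCII and raises UnicodeDecodeError. So you need to add the u prefix to the doesn’t string literal:
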
>>> mystr.find("doesn’t")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5: ordinal not in range(128)
>>> mystr.find(u"doesn’t")
3

You already have what the web page contains. \u2019 is U+2019 RIGHT SINGLE QUOTATION MARK, a fancy single quote, but you are searching for a simple ASCII single quote instead, namely the lowly U+0027 APOSTROPHE.

If you print the value, you'll see that it produces something that looks a lot like a single quote, only slightly curved:

>>> mystr = u'It doesn\u2019t look like a controversial new case management system is going anywhere. So\xa0the city plans to spend the next few months helping local social assistance workers learn to live with it.'
>>> print mystr
It doesn’t look like a controversial new case management system is going anywhere. So the city plans to spend the next few months helping local social assistance workers learn to live with it.

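All Python did was echo the representation of the string, which replaces anything non-printable or non-ASCII with escape sequences so that the value is reproducible; you can copy and paste that representation into any Python interpreter or script and it will produce the same value. Because the default source encoding in Python 2 is ASCII, only ASCII characters are used to describe the value.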
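To make the distinction between the stored value and its escaped representation concrete, here is a minimal sketch. The shortened string `sample` is a hypothetical stand-in for mystr, and the `unicode_escape` codec is used only because it yields the same escape form on both Python 2.7 and Python 3; it is not what the interactive prompt calls internally.

```python
# Hypothetical shortened sample containing the same two code points as mystr.
sample = u'doesn\u2019t So\xa0the'

# Encoding with unicode_escape gives an ASCII-only byte string in which
# the non-ASCII code points are spelled out as escape sequences.
escaped = sample.encode('unicode_escape')

# The value itself is unaffected: each escape stands for exactly one character.
apostrophe_index = sample.find(u'\u2019')
length = len(sample)
```

The escaped form is only a spelling of the value; searching has to target the actual code point U+2019, not the ASCII apostrophe.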

You can look for that text instead:

>>> u'doesn\u2019t' in mystr
True

Or you can use a library like unidecode to replace non-ASCII code points with ASCII "lookalikes"; it will replace the fancy quote with a plain ASCII quote:

>>> from unidecode import unidecode
>>> unidecode(mystr)
"It doesn't look like a controversial new case management system is going anywhere. So the city plans to spend the next few months helping local social assistance workers learn to live with it."
>>> "doesn't" in unidecode(mystr)
True