How to remove all Unicode representations in Python

Question

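I am trying to remove the representations of all special characters from my document. For example, part of the document reads "world\u2019s", and when I split it I get ['world', '\u2019', 's'], but I only need the word (with the Unicode escape and the 's' removed).
I have already removed all punctuation, which works for ordinary punctuation but not for the characters that show up as these Unicode representations. I also tried a regular expression that matches everything starting with '\', but that does not seem to work either.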

Answer 1

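import re

string = "world\u2019s"

# Turn each non-ASCII character into a literal escape such as \u2019 (Python 3).
escaped = str(string.encode('ascii', 'backslashreplace'), 'ascii')
# Keep only the part of each word that precedes the backslash.
print(re.sub(r"\b([^\s]+)\\([^\s]+)\b", r'\1', escaped))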

Output:

world

You can apply this to the whole document string, and it should work the same way.

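import re

string = "world\u2019s h\u2018e"

# Same approach: escape the non-ASCII characters, then trim each word at the backslash.
escaped = str(string.encode('ascii', 'backslashreplace'), 'ascii')
print(re.sub(r"\b([^\s]+)\\([^\s]+)\b", r'\1', escaped))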

Output:

world h
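
One point that may explain why a regex searching for '\' finds nothing: in a Python 3 string, "\u2019" is a single character (a right single quotation mark), not a backslash followed by letters. The following is a minimal sketch of an alternative that works on the characters directly, without the encode step. It assumes that every non-ASCII character is unwanted, so an accented word such as "café" would also be cut short. The helper name strip_non_ascii_tails is hypothetical and is not part of the answer above.

```python
import re


def strip_non_ascii_tails(text):
    """Cut each whitespace-delimited token at its first non-ASCII character.

    Hypothetical helper for illustration only.
    """
    # [^\s\x00-\x7F] matches one character that is neither whitespace nor ASCII;
    # \S* then consumes whatever remains of that token.
    return re.sub(r"[^\s\x00-\x7F]\S*", "", text)


print(strip_non_ascii_tails("world\u2019s h\u2018e"))  # world h
```

Unlike the pattern in the answer, this sketch cuts at the first special character in a token rather than the last, and it removes a token entirely if the token begins with such a character.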

python

unicode

ascii