Remove special quotation marks and other characters

Question

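I'm using Article from newspaper to download articles and nltk's word_tokenize to tokenize the words. The problem is that when I print the parsed article text, some articles contain special quotation marks such as “, ”, and ’, which the tokenizer does not filter out the way it does the regular ' and ".

Is there a way to replace these special quotes with normal ones or, better still, to remove every special character the tokenizer might miss?

I tried removing these special characters by listing them explicitly in my code, but that gave me the error Non-UTF-8 code starting with '\x92'.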

Answer 1

Using the unidecode package will usually replace these characters with their closest ASCII equivalents.

from unidecode import unidecode
text = unidecode(text)

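One drawback, however, is that you will also change some characters you may want to keep (accented characters, for example). If that is the case, one option is to use regular expressions to remove (or replace) a pre-identified set of special characters: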

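import re
exotic_quotes = ['\u2018', '\u2019']  # fill this up; \u escapes keep the source file valid UTF-8
pattern = '[' + ''.join(re.escape(q) for q in exotic_quotes) + ']'  # re.sub needs a pattern string, not a list
text = re.sub(pattern, "'", text)  # change the second argument to the kind of quote you want instead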
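
As a rough sketch of the same idea without regular expressions, the snippet below maps curly quotes to straight ones with str.translate and then drops tokens that consist only of punctuation. The names QUOTE_MAP, normalize_quotes and drop_punctuation_tokens are made up for illustration, and the mapping is an assumed starting set, not a complete list. The Non-UTF-8 error from the question usually means the curly quote was typed literally into a source file saved in an encoding such as Windows-1252, so the characters are written as \u escapes here.

```python
import unicodedata

# Assumed starting set; extend it with whatever characters show up in your articles.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also used as an apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text):
    """Replace curly quotes with straight ones; accented letters are left alone."""
    return text.translate(QUOTE_MAP)


def drop_punctuation_tokens(tokens):
    """Remove tokens made up only of Unicode punctuation (categories starting with 'P')."""
    return [
        tok for tok in tokens
        if not all(unicodedata.category(ch).startswith("P") for ch in tok)
    ]
```

One way to use it would be to call normalize_quotes on the article text before passing it to nltk's word_tokenize, and then run drop_punctuation_tokens on the resulting token list. Unlike unidecode, this leaves characters such as é untouched.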

Hope this helps!

Tags: python, nltk, python-newspaper