Encoding issue using NLTK

Question

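I am scraping a very 'right side' website to study hate and racism detection, so my test content may be offensive.

I am trying to remove stop words and punctuation in Python using NLTK, but I have run into an encoding issue. I am using Python 2.7, and the data comes from a file that I filled with articles from the website I crawled: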

import nltk

stop_words = set(nltk.corpus.stopwords.words("english"))
# data is a dict of articles loaded from the file
for key, value in data.iteritems():
    print type(value), value
    tokenized_article = nltk.word_tokenize(value.lower())
    print tokenized_article
    break

The output looks like this (I added ... to shorten the example):

<type 'str'>   A Negress Bernie ... they’re not going to take it anymore.

['a', 'negress', 'bernie', ... , 'they\u2019re', 'not', 'going', 'to', 'take', 'it', 'anymore', '.']

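I don't understand why this '\u2019' appears when it shouldn't be there. Could someone tell me how to get rid of it? I tried encoding as UTF-8, but I still have the same problem.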

Answer 1

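import nltk

stop_words = set(nltk.corpus.stopwords.words("english"))
for key, value in data.iteritems():
    print type(value), value
    # A Python 2 byte string must be decoded before it can be re-encoded
    # (UTF-8 is assumed here).
    if isinstance(value, str):
        value = value.decode('utf-8')
    # Re-encode as ASCII, silently dropping characters such as u'\u2019'.
    value = value.encode('ascii', 'ignore')
    tokenized_article = nltk.word_tokenize(value.lower())
    print tokenized_article
    break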
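
Note that the 'ignore' handler deletes the right single quotation mark (U+2019) outright, so "they’re" becomes "theyre". If you would rather keep the apostrophe, one option is to map typographic punctuation to ASCII before dropping whatever is left. The following is only a minimal sketch: `PUNCT_MAP` and `to_ascii_text` are hypothetical names, the mapping is not exhaustive, and UTF-8 is assumed as the file encoding. It uses only the standard library and should run on both Python 2.7 and Python 3.

```python
# Hypothetical mapping of common typographic punctuation to ASCII equivalents.
PUNCT_MAP = {
    u"\u2018": u"'",   # left single quotation mark
    u"\u2019": u"'",   # right single quotation mark (the one in the question)
    u"\u201c": u'"',   # left double quotation mark
    u"\u201d": u'"',   # right double quotation mark
    u"\u2013": u"-",   # en dash
    u"\u2014": u"-",   # em dash
}


def to_ascii_text(value, encoding="utf-8"):
    """Return value as text with typographic punctuation mapped to ASCII."""
    # Byte strings (Python 2 str, Python 3 bytes) are decoded first;
    # the encoding is an assumption about how the file was written.
    if isinstance(value, bytes):
        value = value.decode(encoding)
    for src, dst in PUNCT_MAP.items():
        value = value.replace(src, dst)
    # Drop anything still outside ASCII, like the 'ignore' handler above.
    return value.encode("ascii", "ignore").decode("ascii")


sample = u"they\u2019re not going to take it anymore."
print(to_ascii_text(sample))                             # they're not going to take it anymore.
print(sample.encode("ascii", "ignore").decode("ascii"))  # theyre not going to take it anymore.
```

In the loop above, `value = to_ascii_text(value)` would then take the place of the `encode('ascii', 'ignore')` line before the text is passed to `nltk.word_tokenize`.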

Tags: python, encoding, nltk, stop-words, python-2.7