UnicodeDecodeError: 'ascii' codec can't decode byte - NLTK

The following code prints the data:

import codecs

# No encoding is given, so read() returns raw bytes (a Python 2 str)
f = codecs.open('scrapeddata.csv', 'r')
data = f.read()
print data

The data looks like this:

Foul by Fabian Sch�r (Switzerland).       Wayne Rooney (England) wins a free kick in the attacking half.       Attempt missed. Xherdan Shaqiri (Switzerland) right footed shot from outside the box is high and wide to the right. Assisted by Josip Drmic.       Booking       James Milner (England) is shown the yellow card for a bad foul.       Stephan Lichtsteiner (Switzerland) wins a free kick in the defensive half.       Foul by James Milner (England).       Offside, Switzerland. G�khan Inler tries a through ball, but Xherdan Shaqiri is caught offside.

Then I tried a simple word-frequency analysis with the following code:

from nltk import FreqDist, sent_tokenize, word_tokenize

data = word_tokenize(data)
freq = FreqDist(data)

freq

This returns:

----> 3 data = word_tokenize(data)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 14: ordinal not in range(128)

Any help?

Provide an explicit encoding when opening the file. You said it's UTF-8, so tell Python:

f = codecs.open('scrapeddata.csv', 'r', 'utf-8')
data = f.read()
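
As a minimal sketch of the same idea, assuming the file really is UTF-8: `io.open` also accepts an `encoding` argument and behaves the same on Python 2 and Python 3. The `read_text` helper, the temporary file and the sample sentence below are made up for illustration; they are not from the question.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole file and return decoded (unicode) text."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Hypothetical sample: a throwaway file containing a non-ASCII name.
sample = u"Foul by Fabian Sch\u00e4r (Switzerland)."
fd, demo_path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(sample)

# Decoding happens once, at read time, with the codec we chose.
text = read_text(demo_path)
os.remove(demo_path)
```

Because `text` is already decoded, passing it to `word_tokenize` should no longer trigger an implicit ASCII decode.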

The raw data was collected by web scraping, so I changed how the raw data is saved to the CSV file, and that fixed the ascii error.

import csv

data = ['scraped data here']

w = csv.writer(open('scrapeddata.csv', 'wb'))

for sentence in data:
    # 'ignore' drops every character that is not plain ASCII
    w.writerow([sentence.encode('ascii','ignore')])
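
One caveat with this approach: `encode('ascii', 'ignore')` silently drops every non-ASCII character, so names in the data lose letters. A small sketch of the difference, using a made-up sample string. Encoding to UTF-8 instead is one possible alternative, assuming whatever reads the CSV later also decodes it as UTF-8:

```python
# Hypothetical sample; the real rows come from the scraper.
name = u"Fabian Sch\u00e4r"

# 'ignore' throws away anything outside ASCII: the a-umlaut disappears.
stripped = name.encode("ascii", "ignore")

# UTF-8 keeps the character as two bytes and can be decoded back losslessly.
preserved = name.encode("utf-8")
round_trip = preserved.decode("utf-8")
```

Here `stripped` is the bytes `Fabian Schr`, while `round_trip` equals the original string.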