UnicodeDecodeError: unexpected end of data while stemming over dataset

Question

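I am new to Python and am trying to work on a small chunk of the Yelp! dataset. It was in JSON, but I converted it to CSV, and I am using the pandas library and NLTK.

While preprocessing the data, I first try to remove all punctuation and the most common stop words. After that, I want to apply the Porter stemming algorithm, which is readily available in nltk.stem.

Here is my code: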

import re
from nltk.corpus import stopwords

def stopWords(review):
    """A method for removing the noise in the data and the most common stop words (NLTK)."""
    stopset = set(stopwords.words("english"))
    review = review.lower()
    review = review.replace(".","")
    review = review.replace("-"," ")
    review = review.replace(")","")
    review = review.replace("(","")
    review = review.replace("i'm"," ")
    review = review.replace("!","")
    review = re.sub("[$!@#*;:<+>~-]", '', review)
    row = review.split()

    # Keep only the words that are not stop words
    tokens = ' '.join([word for word in row if word not in stopset])
    return tokens

I feed the tokens from here into the stemming method I wrote:

from nltk import stem

def stemWords(impWords):
    """A method for stemming the words to their roots using the Porter algorithm (NLTK)."""
    stemmer = stem.PorterStemmer()
    tok = stopWords(impWords)
    # ========================================================================
    stemmed = " ".join([stemmer.stem(str(word)) for word in tok.split(" ")])
    # ========================================================================
    return stemmed

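But I get the error UnicodeDecodeError: 'utf8' codec can't decode byte 0xc2 in position 0: unexpected end of data. The line between the '==' markers is the one that raises it.

I have tried cleaning the data and removing all special characters such as !@#$^&* to make this work. The stop-word removal works fine, but the stemming does not. Can someone tell me what I am doing wrong?

If my data is not clean, or a unicode string is broken somewhere, is there any way I can clean or fix it so that it stops giving me this error? I want to do stemming, so any suggestions would help.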

Answer 1

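Read up on unicode string handling in Python. There is the type str, but there is also the type unicode.

I suggest:

decoding each line immediately after reading it, to narrow down the incorrect characters in your input data (real data contains errors)
working with unicode and u" " strings everywhere.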
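
As a rough illustration of the first suggestion, here is a minimal sketch that decodes raw lines one at a time and records which ones fail. It assumes Python 3, where the text type is str and raw data is bytes, and UTF-8 input. The helper name decode_lines and the sample lines are made up for this example and do not come from the question.

```python
def decode_lines(raw_lines, encoding="utf-8"):
    """Decode byte lines one by one and collect the ones that fail.

    Returns (decoded, bad), where bad holds (line_number, error_message) pairs.
    """
    decoded, bad = [], []
    for number, raw in enumerate(raw_lines, start=1):
        try:
            decoded.append(raw.decode(encoding))
        except UnicodeDecodeError as exc:
            # Remember where the broken bytes are instead of failing later
            bad.append((number, str(exc)))
    return decoded, bad


# Hypothetical sample: the second line ends in a lone 0xc2 byte,
# which is a truncated two-byte UTF-8 sequence.
sample = [b"great food\n", b"price was 5\xc2", b"caf\xc3\xa9 was nice\n"]
decoded, bad = decode_lines(sample)

for number, message in bad:
    print("line", number, "could not be decoded:", message)
```

Decoding this early means a broken byte is reported together with its line number, rather than surfacing much later inside the stemmer.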

Answer 2

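There is a simple way to filter out these annoying errors. You can preprocess each review with

review = review.encode('ascii', errors='ignore')

to remove all invalid characters. Judging by your code, ASCII characters are all you want.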
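
For what it is worth, here is a small sketch of the same idea, assuming Python 3 and a review that is already a text string. There, encode returns bytes, so the sketch decodes the result back to text. The helper name to_ascii and the sample review are hypothetical.

```python
def to_ascii(text):
    """Drop every non-ASCII character from a text string."""
    # encode(..., errors="ignore") silently skips characters ASCII cannot represent;
    # decoding the result turns the bytes back into a str.
    return text.encode("ascii", errors="ignore").decode("ascii")


cleaned = to_ascii("caf\u00e9 \u00a35 great!")
print(cleaned)
```

Keep in mind that this also strips accented letters and currency symbols, so "café" becomes "caf". That is acceptable only if plain ASCII text really is all you need.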

python

unicode

stemming

nltk

pandas