Pickling with unicode in Python3

Question

I am trying to pickle a dictionary of the form {word : {docId : int}}. My code is below:

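import re
import _pickle

from nltk.corpus import reuters, stopwords
from nltk.stem import PorterStemmer

def vocabProcess(documents):
    word_splitter = re.compile(r"\w+", re.VERBOSE)
    stemmer = PorterStemmer()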
    stop_words = set(stopwords.words('english'))

    wordDict = {}
    for docId in documents:
        processedDoc = [stemmer.stem(w.lower()) for w in 
        word_splitter.findall(reuters.raw(docId)) if not w in stop_words]

        for w in processedDoc:
            if w not in wordDict:
                wordDict[w] = {docId : processedDoc.count(w)}
            else:
                wordDict[w][docId] = processedDoc.count(w)
    with open("vocabListings.txt", "wb") as f:
        _pickle.dump(wordDict, f)

if __name__ == "__main__":
    documents = reuters.fileids()
    with open("vocabListings.txt", "r") as f:
        vocabulary = _pickle.load(f)

When I run this code, I get the error

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2399: 
character maps to <undefined>

Why does this happen when none of the Reuters docs/docids contain unicode? How can I fix this so that I can still use the _pickle module?

Answer 1

You need to use binary mode for both writing and reading pickles. Your problem is:

with open("vocabListings.txt", "r") as f:
    vocabulary = _pickle.load(f)

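On Python 3, reading in text mode gives you str (a text type), not bytes (the binary type pickle works with). It also tries to decode the data as if it were text in your locale encoding; a raw binary stream is unlikely to be valid in many encodings, so you get the error before pickle even sees the data.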
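To see why the decode step fails, here is a minimal sketch. It assumes the public pickle module rather than _pickle, a made-up sample dictionary, and UTF-8 as the text encoding (your traceback shows 'charmap', a Windows code page, but the mechanism is the same). A pickle written with protocol 2 or later starts with the byte 0x80, which is not valid as the start of UTF-8 text:

```python
import pickle

# Hypothetical sample data in the same {word: {docId: int}} shape.
payload = pickle.dumps({"oil": {"test/14826": 2}})

# Protocol 2+ pickles begin with the PROTO opcode, byte 0x80.
first_byte = payload[:1]

# Text-mode reading has to decode the bytes before pickle sees them.
# Decoding by hand shows the same kind of failure.
try:
    payload.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(first_byte, decoded_ok)
```

The failure comes from the pickle framing bytes themselves, so it can happen even when every word and docId is plain ASCII.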

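On Python 2 on Windows, reading in text mode would sometimes work, unless the binary data happened to contain a \r\n sequence, in which case the data would be corrupted (the sequence would be replaced with a single \n in the data pickle sees).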
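The same kind of newline translation can be reproduced on Python 3 as a rough illustration. This is only a sketch: the byte string is an arbitrary blob (not a real pickle), the file name is made up, and latin-1 is chosen purely because it can decode any byte, so the only change left is the newline handling:

```python
import os
import tempfile

# Arbitrary binary data that happens to contain a \r\n pair.
raw = b"\x80\x02K\r\n."

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "blob.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(raw)

    # Text mode with default newline handling turns \r\n into \n.
    with open(path, "r", encoding="latin-1") as f:
        as_text = f.read()

    # Binary mode returns the bytes untouched.
    with open(path, "rb") as f:
        as_bytes = f.read()

print(len(raw), len(as_text), len(as_bytes))
```

The text-mode result is one character shorter than what was written, which is exactly the sort of silent damage a binary format cannot tolerate.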

Either way, read with mode "rb" (just as you wrote with "wb") and you'll be fine.
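A minimal round trip with matching binary modes might look like the sketch below. The sample dictionary, the temporary directory, and the .pkl file name are stand-ins for illustration. It uses the public pickle module, which relies on the C accelerator _pickle when that is available; _pickle.dump and _pickle.load should accept the same file objects.

```python
import os
import pickle
import tempfile

# Hypothetical data in the {word: {docId: int}} shape from the question.
word_dict = {
    "oil": {"test/14826": 2},
    "price": {"test/14826": 1, "test/14828": 4},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocabListings.pkl")

    # Write in binary mode.
    with open(path, "wb") as f:
        pickle.dump(word_dict, f)

    # Read in binary mode as well, so no decoding or newline
    # translation happens before pickle sees the data.
    with open(path, "rb") as f:
        vocabulary = pickle.load(f)

print(vocabulary == word_dict)
```

In your script, the only change needed is replacing "r" with "rb" in the open call that precedes _pickle.load.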
