What could cause this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 568: invalid start byte

Question

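I'm quite new to coding in Python, so this error really confuses me. Here is my exercise code; I need to find the most used word across a directory containing multiple files: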

import pathlib

directory = pathlib.Path('/Users/k/files/Code/exo')

stats ={}

for path in directory.iterdir():
    file = open(str(path))
    text = file.read().lower()

    punctuation  = (";", ".")
    for mark in punctuation:
        text = text.replace(mark, "")


    for word in text.split():
        if word in stats:

            stats[word] = stats[word] + 1
        else:
            stats[word] = 1

most_used_word = None
score_max = 0
for word, score in stats.items():
    if score > score_max:
        score_max = score
        most_used_word = word

print("The most used word is : ", most_used_word, score_max)

This is what I get:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    text = file.read().lower()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 568: invalid start byte

What could cause this error?

Answer 1

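Your file probably contains bytes that are not valid UTF-8 (the byte 0xff never occurs in valid UTF-8, so the file is likely saved in another encoding or is not a text file at all). Decoding it as UTF-8 text is what raises the UnicodeDecodeError. You can try reading it in 'rb' mode, which returns raw bytes that you then decode yourself, as follows: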

file = open(str(path), 'rb')

On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.

(from the docs)
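
To make the 'rb' suggestion concrete, here is a minimal sketch of the word count rewritten so that one undecodable file does not abort the run. The helper names read_text_lenient and count_words are hypothetical (they are not from the question), and it assumes the files are mostly UTF-8 and that replacing stray bytes with U+FFFD is acceptable for counting words. If the files are really in another encoding, passing that encoding is the better fix.

```python
import pathlib
from collections import Counter


def read_text_lenient(path, encoding="utf-8"):
    """Read a file as raw bytes, then decode without raising on bad bytes."""
    raw = pathlib.Path(path).read_bytes()  # same data as open(path, 'rb').read()
    # errors="replace" turns each undecodable byte into U+FFFD
    # instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace")


def count_words(directory, punctuation=(";", ".")):
    """Count lowercased words across all regular files in `directory`."""
    stats = Counter()
    for path in pathlib.Path(directory).iterdir():
        if not path.is_file():  # skip subdirectories
            continue
        text = read_text_lenient(path).lower()
        for mark in punctuation:
            text = text.replace(mark, "")
        stats.update(text.split())
    return stats
```

With this, something like `most_used_word, score_max = count_words('/Users/k/files/Code/exo').most_common(1)[0]` would give the result the question is after (it raises IndexError if no words were found). Since `iterdir()` returns every entry in the directory, it may also be worth checking whether a non-text file (for instance a hidden file) is the one that triggers the error.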

Tags: python, error-handling, utf-8, unicode-string, python-3.x