How to solve this unicode problem in reading text file?

Question

I have a huge text file to process, and my code is as follows:

import collections

freq_counter = collections.Counter()
unigram_counter = collections.Counter()
with open(filename, 'r', encoding='utf8') as f:
    for i, batch in enumerate(read_batch(f, batch_size=1000)):
        logger.info("Batch {}".format(i))
        frequency_with_batch(batch, freq_counter, unigram_counter, enable)
        # Stop once data_batch batches have been processed
        if data_batch == i + 1:
            break

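def read_batch(file_handle, batch_size=1000):
    """Yield lists of up to batch_size lines from an open text file."""
    batch = []
    for line in file_handle:
        # Lines read from a file keep their trailing newline, so this check
        # only skips completely empty strings
        if not line:
            continue

        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            # Start a fresh list so a caller that keeps the batch is not affected
            batch = []
    if batch:
        yield batch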

encoding='utf8' worked fine until somewhere in the middle of the large text file. At some line it raised the following error:

  File "/data5/congmin/tool/utils/my_utils.py", line 484, in read_batch
    for line in file_handle:
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 6084-6085: invalid continuation byte

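Does this mean that within the same file some text is UTF-8 encoded and some is not? I did some searching online, and some people suggest using encoding='latin-1' or encoding="ISO-8859-1". I normally always use 'utf8' to read and write files. For a large file like this, or for many text files, what should the encoding parameter be if the content is UTF-8 most of the time?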

Edit: I changed encoding=utf8 to ISO-8859-1 and the error message went away. However, the characters written to the output text file are unreadable.

Addition:

I installed the 'file' command on Ubuntu and used it to find the file's encoding:

file all.txt
 all.txt: UTF-8 Unicode text, with very long lines

So it actually is a UTF-8 file. If it is UTF-8, why does it raise an error?

Answer 1

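Your file appears to be UTF-8, but it contains some illegal bytes. To suppress the exception and find out what the problem is, open the file with errors='backslashreplace'. This will let you read the whole file and look at the troublesome parts. From the error message you posted, you already know the position of the first illegal byte. It could be something as simple as a quotation from another document in a different encoding. (Shouldn't happen, but it does.) Or it could be a corrupted file. (Ditto.)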
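To make the suggestion concrete, here is a minimal sketch of one way to apply it. The helper names `find_bad_lines` and `iter_lines_lenient` are made up for illustration and are not part of the original code. The first helper reads the file in binary mode and reports which lines fail strict UTF-8 decoding; the second opens the file with errors='backslashreplace' so that iteration never raises. One caveat, as far as I can tell: the position in the traceback (6084-6085) is relative to the chunk of data the decoder was handed, so it is not necessarily the byte offset within the file, which is why the sketch reports line numbers instead.

```python
def find_bad_lines(path, encoding="utf-8", limit=10):
    """Return up to `limit` tuples describing lines that are not valid `encoding`.

    Each tuple is (line_number, offset_in_line, reason, escaped_text), where
    escaped_text shows the offending bytes as \\xNN escapes.
    """
    bad = []
    # Binary mode: iterating yields raw bytes per line, so nothing is decoded
    # (and nothing can fail) until we decode each line ourselves.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                escaped = raw.decode(encoding, errors="backslashreplace")
                bad.append((lineno, exc.start, exc.reason, escaped))
                if len(bad) >= limit:
                    break
    return bad


def iter_lines_lenient(path, encoding="utf-8"):
    """Yield decoded lines; undecodable bytes become \\xNN escapes instead of raising."""
    with open(path, "r", encoding=encoding, errors="backslashreplace") as f:
        for line in f:
            yield line
```

In the question's code, this would amount to changing the open call to open(filename, 'r', encoding='utf8', errors='backslashreplace') and leaving read_batch as it is. Running find_bad_lines on the file first should show whether the bad bytes are a handful of stray characters (for example text pasted in from a Latin-1 source) or a larger damaged region.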

python

encoding

character-encoding

python-3.x