UnicodeDecodeError when parsing a file with a for loop in Python 3

Question

I get a UnicodeDecodeError when looping over the lines of a file.

with open(somefile, 'r') as f:
    for line in f:
        pass  # do something with the line

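This happens on Python 3.4. I have some files that contain characters that are not valid UTF-8. I want to parse a file line by line, find the lines where the problem occurs, and get the exact index within each line where the non-UTF-8 character appears. I already have code for this: it works under Python 2.7.9, but under Python 3.4 the for loop itself raises UnicodeDecodeError. Any ideas?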

Answer 1

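You need to open the file in binary mode and decode it one line at a time. In text mode, Python 3 decodes the data while the loop is reading it, so the exception is raised by the for statement before your own code sees the line. Try this: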

with open('badutf.txt', 'rb') as f:
    for i, line in enumerate(f, 1):
        try:
            line.decode('utf-8')
        except UnicodeDecodeError as e:
            # e.start is the 0-based byte offset of the bad byte within this line
            print('Line: {}, Offset: {}, {}'.format(i, e.start, e.reason))

This is the result I get in Python 3:

Line: 16, Offset: 6, invalid start byte

Sure enough, the bad byte is at offset 6 on line 16.
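
If you want to try the approach without a real broken file, here is a minimal self-contained sketch. The helper name find_bad_utf8_lines and the sample bytes are made up for illustration and are not from the original code. It collects the problems in a list instead of printing them, and it only reports the first undecodable byte on each line, because decode() stops at the first error.

```python
import os
import tempfile


def find_bad_utf8_lines(path):
    """Return (line_number, byte_offset, reason) for each line that is not valid UTF-8."""
    problems = []
    # Binary mode: iteration yields raw bytes, so no decoding happens in the loop itself.
    with open(path, 'rb') as f:
        for lineno, raw in enumerate(f, 1):
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError as e:
                # e.start is the 0-based byte offset of the first bad byte in this line.
                problems.append((lineno, e.start, e.reason))
    return problems


# Hypothetical sample data: line 2 carries a stray 0xff byte at offset 3.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'good line\nbad\xffbyte\nanother good line\n')
    sample_path = tmp.name

result = find_bad_utf8_lines(sample_path)
os.remove(sample_path)
print(result)  # [(2, 3, 'invalid start byte')]
```

Keep in mind the offset counts bytes, not characters, so it can be larger than the character position if the line has multi-byte UTF-8 sequences before the bad byte.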

