'utf-8' codec can't decode byte when reading a file in Python3.4, but not in Python2.7

Question

I am reading a file in Python2.7 and it is read perfectly. The problem is that when I run the same program in Python3.4, I get this error:

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte

Also, when I run the program on Windows (with Python3.4), the error does not appear. The first line of the document is: Codi;Codi_lloc_anonim;Nom

The code of my program is:

def lectdict(filename,colkey,colvalue):
    f = open(filename,'r')
    D = dict()

    for line in f:
        if line == '\n': continue
        D[line.split(';')[colkey]] = D.get(line.split(';')[colkey],[]) + [line.split(';')[colvalue]]

    f.close()
    return D

Traduccio = lectdict('Noms_departaments_centres.txt',1,2)

Answer 1

In Python2,

f = open(filename,'r')
for line in f:

reads lines from the file as bytes.

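In Python3, the same code reads lines from the file as strings. Python3 strings are what Python2 calls unicode objects: bytes decoded according to some encoding. When no encoding is passed, open() uses the platform's preferred encoding (locale.getpreferredencoding(False)), which is utf-8 on most Linux systems but usually a legacy code page such as cp1252 on Windows. That would explain why the same program runs without error on Windows.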

The error message

'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte

shows that Python3 is trying to decode the bytes as utf-8. Since that raises an error, the file evidently does not contain utf-8 encoded bytes.
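To make the failure concrete, here is a minimal sketch that reproduces the same kind of error on a made-up byte string. The sample word is an assumption (the real file's contents are not shown in the question); it only illustrates that 0xf2 is a perfectly good character in a single-byte code page but the start of a multi-byte sequence in utf-8.

```python
import locale

# Hypothetical sample: "Història" stored in cp1252, where "ò" is the single byte 0xf2.
raw = b"Hist\xf2ria"

# The encoding open() falls back to when none is given (platform dependent).
default_encoding = locale.getpreferredencoding(False)
print("default text encoding:", default_encoding)

# In utf-8, 0xf2 announces a multi-byte sequence, but the next byte ("r")
# is not a continuation byte, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc
    print(exc)

# The same bytes decode cleanly with the code page they were written in.
decoded = raw.decode("cp1252")
print(decoded)
```

Printing default_encoding on both machines is a quick way to check whether the Linux and Windows runs are really decoding with different codecs.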

To fix this, you need to specify the correct encoding of the file:

with open(filename, encoding=enc) as f:
    for line in f:

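If you don't know the correct encoding, you can run this program to simply try every encoding known to Python. If you are lucky, one of them will turn the bytes into recognizable characters. Sometimes more than one encoding may appear to work, in which case you'll need to check and compare the results carefully.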

# Python3
import pkgutil
import os
import encodings

def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

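filename = '/tmp/test'
# iterate directly so the name of the encodings module is not shadowed
for enc in sorted(all_encodings()):
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        # not a text encoding, or the bytes are invalid in it: skip
        pass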
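Before trying every codec, it can help to look at the raw bytes around the position reported in the error. The following is only a sketch: show_decode_failure is a hypothetical helper name, and the sample bytes stand in for open(filename, 'rb').read(), since the real file is not available here.

```python
def show_decode_failure(data, encoding="utf-8", context=20):
    """Return the bytes around the first undecodable position, or None if data decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start / exc.end delimit the offending byte(s) within data
        start = max(exc.start - context, 0)
        return data[start:exc.end + context]
    return None

# Hypothetical stand-in for open(filename, 'rb').read()
sample = "Codi;Codi_lloc_anonim;Nom\n".encode("ascii") + b"1;A;Hist\xf2ria\n"
snippet = show_decode_failure(sample, context=8)
print(snippet)
```

Seeing the neighbouring ASCII letters usually makes it easier to guess which accented character the byte was meant to be, and therefore which encodings are plausible.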

Answer 2

OK, I did what @unutbu suggested. Many encodings produced a result, and one of them was cp1250, so I changed:

f = open(filename,'r')

to

f = open(filename,'r', encoding='cp1250')

as @triplee suggested. Now I can read my file.
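For completeness, here is one way the function from the question could look with the encoding passed through. This is a sketch rather than the poster's actual code: the encoding parameter, the use of with and setdefault, and the two-line temporary file are all assumptions made for illustration. Unlike the original, it also strips the trailing newline from the last column.

```python
import os
import tempfile

def lectdict(filename, colkey, colvalue, encoding):
    """Map column colkey to the list of colvalue entries, one per non-empty line."""
    result = {}
    # the with block closes the file even if a line fails to parse
    with open(filename, "r", encoding=encoding) as f:
        for line in f:
            if not line.strip():
                continue
            fields = line.rstrip("\n").split(";")
            result.setdefault(fields[colkey], []).append(fields[colvalue])
    return result

# Hypothetical two-line file standing in for Noms_departaments_centres.txt
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"Codi;Codi_lloc_anonim;Nom\n1;A;Hist\xf2ria\n")

Traduccio = lectdict(tmp.name, 1, 2, encoding="cp1250")
os.remove(tmp.name)
print(Traduccio)
```

Note that cp1250 and cp1252 both accept the byte 0xf2 without error but map it to different characters, so it is worth checking that the accented letters in the output look right for the language of the file.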

Answer 3

In my case I couldn't change the encoding, because my file really is UTF-8 encoded. But some lines were corrupted and caused the same error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 7092: invalid continuation byte

My solution was to open the file in binary mode:

open(filename, 'rb')
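In binary mode each line arrives as bytes, so it still has to be decoded before it can be used as text. A minimal sketch, assuming it is acceptable to replace the corrupted bytes rather than reject the line; the in-memory stream and its contents are made up and stand in for the real file:

```python
import io

# Hypothetical stand-in for open(filename, 'rb'): valid UTF-8 with one corrupted line.
stream = io.BytesIO(
    "bon dia\n".encode("utf-8") + b"l\xd0nia\n" + "adéu\n".encode("utf-8")
)

lines = []
for raw_line in stream:
    # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
    lines.append(raw_line.decode("utf-8", errors="replace").rstrip("\n"))
print(lines)
```

Passing errors='replace' (or errors='ignore') directly to open(filename, encoding='utf-8', errors='replace') has a similar effect while keeping the file in text mode.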
