UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

Question

我试图将字典的具体内容保存到一个文件中，但是当我尝试写入它时，出现以下错误：

Traceback (most recent call last):
  File "P4.py", line 83, in <module>
    outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

这是代码：

from collections import Counter

with open("corpus.txt") as inf:
    wordtagcount = Counter(line.decode("latin_1").rstrip() for line in inf)

with open("lexic.txt", "w") as outf:
    outf.write('Palabra\tTag\tApariciones\n'.encode("utf-8"))
    for word,count in wordtagcount.iteritems():
        outf.write(u"{}\t{}\n".format(word, count).encode("utf-8"))
"""
2) TAGGING USING THE MODEL
Dados los ficheros de test, para cada palabra, asignarle el tag mas
probable segun el modelo. Guardar el resultado en ficheros que tengan
este formato para cada linea: Palabra  Prediccion
"""
file=open("lexic.txt", "r") # abrimos el fichero lexic (nuestro modelo) (probar con este)
data=file.readlines()
file.close()
diccionario = {}

"""
In this portion of code we iterate the lines of the .txt document and we create a dictionary with a word as a key and a List as a value
Key: word
Value: List ([tag, #ocurrencesWithTheTag])
"""
for linea in data:
    aux = linea.decode('latin_1').encode('utf-8')
    sintagma = aux.split('\t')  # Here we separate the String in a list: [word, tag, ocurrences], word=sintagma[0], tag=sintagma[1], ocurrences=sintagma[2]
    if (sintagma[0] != "Palabra" and sintagma[1] != "Tag"): #We are not interested in the first line of the file, this is the filter
        if (diccionario.has_key(sintagma[0])): #Here we check if the word was included before in the dictionary
            aux_list = diccionario.get(sintagma[0]) #We know the name already exists in the dic, so we create a List for every value
            aux_list.append([sintagma[1], sintagma[2]]) #We add to the list the tag and th ocurrences for this concrete word
            diccionario.update({sintagma[0]:aux_list}) #Update the value with the new list (new list = previous list + new appended element to the list)
        else: #If in the dic do not exist the key, que add the values to the empty list (no need to append)
            aux_list_else = ([sintagma[1],sintagma[2]])
            diccionario.update({sintagma[0]:aux_list_else})

"""
Here we create a new dictionary based on the dictionary created before, in this new dictionary (diccionario2) we want to keep the next
information:
Key: word
Value: List ([suggestedTag, #ocurrencesOfTheWordInTheDocument, probability])

For retrieve the information from diccionario, we have to keep in mind:

In case we have more than 1 Tag associated to a word (keyword ), we access to the first tag with keyword[0], and for ocurrencesWithTheTag with keyword[1],
from the second case and forward, we access to the information by this way:

diccionario.get(keyword)[2][0] -> with this we access to the second tag
diccionario.get(keyword)[2][1] -> with this we access to the second ocurrencesWithTheTag
diccionario.get(keyword)[3][0] -> with this we access to the third tag
...
..
.
etc.
"""
diccionario2 = dict.fromkeys(diccionario.keys())#We create a dictionary with the keys from diccionario and we set all the values to None
with open("estimation.txt", "w") as outfile:
    for keyword in diccionario:
        tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8')) #tagSugerido is the tag with more ocurrences for a concrete keyword
        maximo = float(diccionario.get(keyword)[1]) #maximo is a variable for the maximum number of ocurrences in a keyword
        if ((len(diccionario.get(keyword))) > 2): #in case we have > 2 tags for a concrete word
            suma = float(diccionario.get(keyword)[1])
            for i in range (2, len(diccionario.get(keyword))):
                suma += float(diccionario.get(keyword)[i][1])
                if (diccionario.get(keyword)[i][1] > maximo):
                    tagSugerido = unicode(diccionario.get(keyword)[i][0]).decode('utf-8'))
                    maximo = float(diccionario.get(keyword)[i][1])
            probabilidad = float(maximo/suma);
            diccionario2.update({keyword:([tagSugerido, suma, probabilidad])})

        else:
            diccionario2.update({keyword:([diccionario.get(keyword)[0],diccionario.get(keyword)[1], 1])})

        outfile.write(u"{}\t{}\n".format(keyword, tagSugerido).encode("utf-8"))

所需的输出将如下所示：

keyword(String)  tagSugerido(String):
Hello    NC
Friend   N
Run      V
...etc

冲突行是：

outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))

谢谢。

Answer 1

由于您没有给出简单明了的代码来说明您的问题，我只会给您一个一般性的建议，告诉您应该是什么错误：

如果您遇到解码错误，那是因为 tagSugerido 被读取为 ASCII 而不是 Unicode。要解决这个问题，您应该这样做：

tagSugerido = unicode(diccionario.get(keyword[0]).decode('utf-8'))

将其存储为 unicode。

那么你很可能在 write() 阶段遇到编码错误，你应该按以下方式修复你的写入：

outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))

应该是：

outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))

我随便回答了一个非常相似的问题moments ago。当使用 unicode 字符串时，切换到 python3，它会让您的生活更轻松！

如果您现在还不能切换到 python3，您可以使用 python-future 导入语句让您的 python2 表现得接近 python3：

from __future__ import absolute_import, division, print_function, unicode_literals

N.B.: 而不是做：

file=open("lexic.txt", "r") # abrimos el fichero lexic (nuestro modelo) (probar con este)
data=file.readlines()
file.close()

这将无法在 readlines 失败时正确关闭文件描述符，你最好这样做：

with open("lexic.txt", "r") as f:
    data=f.readlines()

即使失败也会始终关闭文件。

N.B.2: 避免使用 file 因为这是你要隐藏的 python 类型，但是使用 f 或 lexic_file…

Answer 2

赞zmo建议：

outfile.write(u"{}\t{}\n".format(keyword, str(tagSugerido)).encode("utf-8"))

应该是：

outfile.write(u"{}\t{}\n".format(keyword, tagSugerido.encode("utf-8")))

Python2

中关于 unicode 的说明

您的软件应仅在内部使用 unicode 字符串，在输出时转换为特定编码。

不要一遍又一遍地犯同样的错误你应该确保你理解 ascii 和 utf-8 之间的区别编码以及 str 和 unicode 对象之间 Python.

ASCII和UTF-8编码的区别：

Ascii 只需要一个字节来表示 ascii charset/encoding 中所有可能的字符。 UTF-8 最多需要四个字节来表示完整的字符集。

ascii (default)
1    If the code point is < 128, each byte is the same as the value of the code point.
2    If the code point is 128 or greater, the Unicode string can’t be represented in this encoding. (Python raises a UnicodeEncodeError exception in this case.)

utf-8 (unicode transformation format)
1    If the code point is <128, it’s represented by the corresponding byte value.
2    If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
3    Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.

str和unicode对象的区别：

可以说str基本上是字节串，unicode是unicode字符串。两者都可以有不同的编码，如 ascii 或 utf-8。

str vs. unicode
1   str     = byte string (8-bit) - uses \x and two digits
2   unicode = unicode string      - uses \u and four digits
3   basestring
        /\
       /  \
    str    unicode

如果您遵循一些简单的规则，您应该能够很好地处理 str/unicode 不同编码的对象，例如 ascii 或 utf-8 或您必须使用的任何编码：

Rules
1    encode(): Gets you from Unicode -> bytes
     encode([encoding], [errors='strict']), returns an 8-bit string version of the Unicode string,
2    decode(): Gets you from bytes -> Unicode
     decode([encoding], [errors]) method that interprets the 8-bit string using the given encoding
3    codecs.open(encoding=”utf-8″): Read and write files directly to/from Unicode (you can use any encoding, not just utf-8, but utf-8 is most common).
4    u”: Makes your string literals into Unicode objects rather than byte sequences.
5    unicode(string[, encoding, errors])

警告：不要对字节使用 encode() 或对 Unicode 对象使用 decode()

再说一次：软件应该只在内部使用 Unicode 字符串，在输出时转换为特定的编码。

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 40: ordinal not in range(128)

python

io

file-io

output