UnicodeEncodeError when writing to file
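I have a Python script that works fine on my local machine (OS X), but when I copy it to a server (Debian) it does not work as expected. The script reads an XML file and prints the contents in a new format. On my local machine I can run the script with stdout going to the terminal or to a file (i.e. > myFile.txt), and both work fine.
However, on the server (over ssh), printing to the terminal works fine, but printing to a file (which is what I really need) raises a UnicodeEncodeError:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
All files are UTF-8 encoded, and UTF-8 is declared in the magic comment.
If I print the str objects inside a list (a trick I usually use to deal with encoding problems), it throws the same error.
If I use print( x.encode('utf-8') ), it prints the bytes representation instead (e.g. b'1' b'\xd0\x9a\xd0\xb0\xd0\xbc\xd0\xb0').
If I run $ export PYTHONIOENCODING=utf-8 in the shell (as suggested in some SO posts), I get a binary file: 1 <D0><9A><D0><B0><D0><BC><D0><B0>.
I checked all the locale variables, and the relevant ones match those on my local machine.
I could simply process the file locally and upload it, but I really want to understand what is happening here. Since the Python code works on one machine, I'm not sure it is relevant, but I'm adding it below: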
# -*- encoding: utf-8 -*-
import sys, xml.etree.ElementTree as ET
corpus = ET.parse('file.xml')
text = corpus.getroot()
for body in text :
    for sent in body :
        depDOMs = [(0,'') for i in range(len(sent)+1)]
        for word in sent :
            if word.tag == 'LF' :
                pass
            elif 'ID' in word.attrib and 'FEAT' in word.attrib and 'DOM' in word.attrib :
                ID = word.attrib['ID']
                try :
                    Form = word.text.replace(' ','_')
                except AttributeError :
                    Form = '_'
                try :
                    Lemma = word.attrib['LEMMA'].replace(' ', '_')
                except KeyError :
                    Lemma = '*NULL*'
                CPOS = word.attrib['FEAT'].split()[0]
                POS = word.attrib['FEAT'].replace( ' ' , '_' )
                Feats = '_'
                Head = word.attrib['DOM']
                if Head == '_root' :
                    Head = '0'
                try :
                    DepRel = word.attrib['LINK']
                except KeyError :
                    DepRel = 'ROOT'
                PHead = '_'
                PDepRel = '_'
                try:
                    if word.attrib['NODETYPE'] == 'FANTOM' :
                        word.attrib['LEMMA'] = '*'+word.attrib['LEMMA']+'*'
                except KeyError :
                    pass
                print( ID , Form , Lemma , Feats, CPOS , POS , Head , DepRel , PHead , PDepRel , sep='\t' )
            else :
                print( 'WARNING: what is this?',sent.attrib['ID'],word.attrib)
        print()
You can find important information about the error you are hitting in the attributes of UnicodeError-based exceptions.
Quoting the docs:
UnicodeError has attributes that describe the encoding or decoding error. For example, err.object[err.start:err.end] gives the particular invalid input that the codec failed on.
encoding: The name of the encoding that raised the error.
reason: A string describing the specific codec error.
object: The object the codec was attempting to encode or decode.
start: The first index of invalid data in object.
end: The index after the last invalid data in object.
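
The attributes quoted above can be read straight off a caught exception. The following is a minimal sketch: it forces the same failure as in the question by encoding a Cyrillic string with the ascii codec. The helper name describe_unicode_error and the sample string are illustrative assumptions, not part of the original script.

```python
def describe_unicode_error(err):
    """Collect the diagnostic attributes of a UnicodeError into a dict."""
    return {
        "encoding": err.encoding,
        "reason": err.reason,
        "start": err.start,
        "end": err.end,
        # The slice of the input that the codec could not handle
        "bad_input": err.object[err.start:err.end],
    }


sample = "\u041a\u0430\u043c\u0430"  # hypothetical sample: four Cyrillic letters

try:
    sample.encode("ascii")
except UnicodeEncodeError as err:
    info = describe_unicode_error(err)
    for key, value in info.items():
        # ascii() escapes non-ASCII characters, so this print cannot
        # itself fail on a stdout that only accepts ASCII
        print(key, ascii(value))
```

In the real script, the same try/except could be wrapped around the print call to see which encoding Python picked for stdout and which characters triggered the failure.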
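The underlying problem may be a misconfigured locale on Linux, which makes Python overly cautious about printing non-ASCII characters and fall back to ASCII.
Run locale to confirm the locale configuration. If something is wrong, you will see output similar to this:
$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: No such file or directory
LANG=en_US.UTF-8
LANGUAGE=
To fix it:
$ sudo locale-gen "en_US.UTF-8"
(replace "en_US.UTF-8" with whichever locale is reported as invalid). For details, see: https://askubuntu.com/questions/162391/how-do-i-fix-my-locale-issue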
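
To check from inside Python what the interpreter derived from the environment, and to sidestep the locale entirely if fixing it is not an option, something along these lines may help. This is a workaround sketch under assumptions, not the fix proposed in this answer: the helper name utf8_text_stream is hypothetical, and the demonstration writes to an in-memory buffer rather than to the real stdout.

```python
import io
import locale
import sys


def utf8_text_stream(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(
        binary_stream, encoding="utf-8", newline="\n", line_buffering=True
    )


# What Python derived from the environment; an ASCII-family value here
# would be consistent with the error in the question
print("stdout encoding:", sys.stdout.encoding)
print("preferred encoding:", locale.getpreferredencoding())

# Demonstration on an in-memory buffer standing in for the redirected file
buffer = io.BytesIO()
out = utf8_text_stream(buffer)
print("1", "\u041a\u0430\u043c\u0430", sep="\t", file=out)
out.flush()
encoded = buffer.getvalue()
```

In the script itself the wrapper would be applied as sys.stdout = utf8_text_stream(sys.stdout.buffer); on Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") should achieve the same effect.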