lxml.etree.XML ValueError for Unicode string
I am transforming an XML document with XSLT. When I do this with Python 3 I get the following error, but with Python 2 there is no error at all.
-> % python3 cstm/artefact.py
Traceback (most recent call last):
  File "cstm/artefact.py", line 98, in <module>
    simplify_this_dataset('fisheries-service-des-peches.xml')
  File "cstm/artefact.py", line 85, in simplify_this_dataset
    xslt_root = etree.XML(xslt_content)
  File "lxml.etree.pyx", line 3012, in lxml.etree.XML (src/lxml/lxml.etree.c:67861)
  File "parser.pxi", line 1780, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:102420)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
#!/usr/bin/env python3
# vim:fileencoding=UTF-8:ts=4:sw=4:sta:et:sts=4:ai
# -*- coding: utf-8 -*-
import os

from lxml import etree


def simplify_this_dataset(dataset):
    """Create a simplified version of an XML file.

    It removes all the attributes and assigns them as elements instead.
    """
    module_path = os.path.dirname(os.path.abspath(__file__))
    data = open(module_path+'/data/ex-fire.xslt')
    xslt_content = data.read()
    xslt_root = etree.XML(xslt_content)
    dom = etree.parse(module_path+'/../CanSTM_dataset/'+dataset)
    transform = etree.XSLT(xslt_root)
    result = transform(dom)
    f = open(module_path+ '/../CanSTM_dataset/otra.xml', 'w')
    f.write(str(result))
    f.close()
data = open(module_path+'/data/ex-fire.xslt')
xslt_content = data.read()
This implicitly decodes the bytes in the file to Unicode text, using the default encoding. (This might give wrong results if the XML file isn't in that encoding.)
xslt_root = etree.XML(xslt_content)
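XML has its own handling and signalling for encodings: the <?xml encoding="..."?> prolog. If you pass a Unicode string that starts with <?xml encoding="..."?> to a parser, the parser would like to reinterpret the rest of the byte string using that encoding... but it can't, because you have already decoded the byte input to a Unicode string.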
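To see the distinction in isolation, here is a minimal sketch that feeds the same document to etree.XML once as str and once as bytes. The sample document and the variable names are made up for illustration; only etree.XML and the error message come from the question.

```python
from lxml import etree

# A small made-up document with an encoding declaration and one non-ASCII character.
xml_text = '<?xml version="1.0" encoding="UTF-8"?><root>caf\u00e9</root>'

# Passing the already-decoded str is rejected by lxml.
try:
    etree.XML(xml_text)
    raised = False
    message = ""
except ValueError as exc:
    raised = True
    message = str(exc)

# Passing bytes works: the parser reads the declaration and decodes the data itself.
root = etree.XML(xml_text.encode("utf-8"))

print(raised)
print(root.text)
```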
Instead, you should pass the undecoded byte string to the parser:
data = open(module_path+'/data/ex-fire.xslt', 'rb')
xslt_content = data.read()
xslt_root = etree.XML(xslt_content)
Or, better, just have the parser read straight from the file:
xslt_root = etree.parse(module_path+'/data/ex-fire.xslt')
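Putting that together, the function from the question could be restructured roughly as follows. This is only a sketch: the name simplify_dataset and its three path parameters are hypothetical, and writing the result with result.write(...) in UTF-8 is my assumption about the desired output, not something the question specifies.

```python
from lxml import etree


def simplify_dataset(xslt_path, source_path, output_path):
    """Apply the stylesheet at xslt_path to source_path and save the result."""
    # etree.parse opens the file itself, so the parser sees the raw bytes
    # and can honor the encoding declaration in the prolog.
    xslt_doc = etree.parse(xslt_path)
    transform = etree.XSLT(xslt_doc)

    # The source document is parsed from the file in the same way.
    result = transform(etree.parse(source_path))

    # Let lxml serialize to bytes instead of writing str(result) in text mode
    # (assumption: UTF-8 output with an XML declaration is wanted).
    result.write(output_path, encoding="utf-8", xml_declaration=True)
```

No open() call is needed at all this way, so the question of text mode versus binary mode does not come up.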
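You can also decode the UTF-8 string and encode it as ASCII before passing it to etree.XML. Note that this needs data.read() to return a byte string (for example, a file opened with 'rb'), and encode('ascii') raises UnicodeEncodeError if the stylesheet contains any non-ASCII character: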
xslt_content = data.read()
xslt_content = xslt_content.decode('utf-8').encode('ascii')
xslt_root = etree.XML(xslt_content)
I got it working by simply re-encoding with the default options:
xslt_content = data.read().encode()
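One caveat worth checking with this approach: in Python 3, str.encode() with no argument produces UTF-8, so it is only safe when the prolog declares UTF-8 (or the content is plain ASCII). The sketch below uses a made-up document that declares ISO-8859-1 to show what a parser trusting that declaration would see; it uses only the standard library.

```python
# Made-up document whose prolog declares ISO-8859-1.
declared_latin1 = '<?xml version="1.0" encoding="ISO-8859-1"?><root>caf\u00e9</root>'

# encode() with no argument means UTF-8 in Python 3.
utf8_bytes = declared_latin1.encode()

# A reader that believes the declaration decodes those bytes as ISO-8859-1,
# turning the two UTF-8 bytes of the accented letter into two characters.
misread = utf8_bytes.decode("iso-8859-1")

# Encoding with the declared encoding keeps bytes and declaration consistent.
matching_bytes = declared_latin1.encode("iso-8859-1")

print(misread)
print(matching_bytes.decode("iso-8859-1"))
```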