lxml.etree.XMLSyntaxError, Document labelled UTF-16 but has UTF-8 content
I get this error when using the lxml library in Python. The other solutions/hacks I found replace utf-16 with utf-8 in the file using PHP. What is the pythonic way to solve this problem?
Python code:
import lxml.etree as etree
tree = etree.parse("req.xml")
req.xml:
<?xml version="1.0" encoding="utf-16"?>
<test
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</test>
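You can use BeautifulSoup to parse the XML content, which is the more pythonic way you are looking for.
NOTE: If your data is labelled utf-16 but is actually UTF-8 encoded, it can be parsed easily by decoding it as utf-8 while reading the file content.
Here is the code:
sample.xml contains the following data: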
<?xml version="1.0" encoding="utf-16"?>
<test
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
</test>
Code:
from bs4 import BeautifulSoup
with open("sample.xml", "rb") as f:  # open the XML file in binary mode to get raw bytes
    # the declaration says utf-16, but the bytes are UTF-8, so decode them as UTF-8
    content = f.read().decode("utf-8", "ignore")
soup = BeautifulSoup(content, "html.parser")  # hand the decoded text to BeautifulSoup
data = [tag.attrs for tag in soup.find_all("test")]  # attribute dict of every <test> element
print(data)
Output (Python 3):
[{'xmlns:xsd': 'http://www.w3.org/2001/XMLSchema', 'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance'}]
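The same decode-as-UTF-8 idea can also be fed to a real XML parser instead of an HTML parser. The following is only a minimal sketch using the standard library's xml.etree.ElementTree; the helper name parse_mislabelled, the XML_DECL pattern, and the inline RAW bytes (standing in for the file on disk) are assumptions made for illustration, not part of the answer above.

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for the file contents: the declaration claims utf-16,
# but the bytes themselves are plain UTF-8.
RAW = (
    b'<?xml version="1.0" encoding="utf-16"?>\n'
    b'<test\n'
    b'xmlns:xsd="http://www.w3.org/2001/XMLSchema"\n'
    b'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n'
    b'</test>\n'
)

# Matches a leading XML declaration such as <?xml version="1.0" encoding="utf-16"?>
XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>')


def parse_mislabelled(raw: bytes) -> ET.Element:
    """Hypothetical helper: decode as UTF-8, drop the misleading declaration, parse."""
    text = raw.decode("utf-8-sig")           # also tolerates a UTF-8 byte order mark
    text = XML_DECL.sub("", text, count=1)   # remove the incorrect encoding label
    return ET.fromstring(text)


root = parse_mislabelled(RAW)
print(root.tag)
```

Dropping the declaration only makes sense when you are sure the bytes really are UTF-8; for a file that is genuinely UTF-16, the decode step would fail or produce garbage.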
See the documentation for the XMLParser constructor:
>>> help(etree.XMLParser)
Among other options, there is an encoding parameter, which allows you to "override the document encoding", as the docs put it.
That is exactly what you need:
parser = etree.XMLParser(encoding='UTF-8')
tree = etree.parse("req.xml", parser=parser)
If the error message is accurate (i.e. there is nothing else wrong with the document), I expect this to work.
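To try the override without touching a real file, here is a small self-contained sketch. It assumes lxml is installed, and it uses an in-memory io.BytesIO buffer (an assumption for illustration) in place of req.xml; the RAW bytes mirror the document from the question.

```python
import io

from lxml import etree

# Stand-in for req.xml: labelled utf-16, but actually UTF-8 bytes.
RAW = (
    b'<?xml version="1.0" encoding="utf-16"?>\n'
    b'<test\n'
    b'xmlns:xsd="http://www.w3.org/2001/XMLSchema"\n'
    b'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n'
    b'</test>\n'
)

# Tell the parser to treat the input as UTF-8 regardless of the declaration.
parser = etree.XMLParser(encoding="UTF-8")

# etree.parse accepts a file-like object as well as a filename.
tree = etree.parse(io.BytesIO(RAW), parser=parser)
root = tree.getroot()

print(root.tag)    # element name
print(root.nsmap)  # namespace prefixes declared on the root element
```

The override applies to every document parsed with this parser object, so it is best kept separate from the parser used for correctly labelled files.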