lxml.etree.XMLSyntaxError, Document labelled UTF-16 but has UTF-8 content

Question

lxml.etree.XMLSyntaxError, Document labelled UTF-16 but has UTF-8 content

I get this error when using the lxml library in Python. The other solutions/hacks I found replace utf-16 with utf-8 in the file using PHP. What is the Pythonic way to solve this?

Python code:

import lxml.etree as etree

tree = etree.parse("req.xml")

req.xml:

<?xml version="1.0" encoding="utf-16"?>
<test 
    xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> 
</test>

Answer 1

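You can parse the XML content with BeautifulSoup, which is the more Pythonic way you are looking for.

NOTE: If your file is labelled utf-16 but its bytes are actually UTF-8, you can decode it as UTF-8 while reading the file and then parse the decoded text.

Here is the code:

sample.xml contains the following data: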

<?xml version="1.0" encoding="utf-16"?>
<test 
    xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> 
</test>

Code:

from bs4 import BeautifulSoup

# Read the file as UTF-8 text, ignoring the utf-16 label in its XML declaration
with open("sample.xml", encoding="utf-8", errors="ignore") as f:
    content = f.read()

soup = BeautifulSoup(content, "html.parser")  # parse the decoded text with BeautifulSoup
data = [tag.attrs for tag in soup.find_all("test")]  # attribute dict of every <test> tag
print(data)

Output:

[{'xmlns:xsd': 'http://www.w3.org/2001/XMLSchema', 'xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance'}]
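
Before decoding as UTF-8, it can help to confirm that the bytes really are UTF-8 and not genuine UTF-16. The following is only a rough sketch using the standard library; guess_encoding is a hypothetical helper name, and the heuristic (a byte order mark, or NUL bytes near the start) is an assumption that suits ASCII-heavy XML, not a general detector.

```python
import codecs


def guess_encoding(raw: bytes) -> str:
    """Guess whether raw XML bytes are UTF-16 or UTF-8 (hypothetical helper)."""
    # Genuine UTF-16 files normally start with a byte order mark.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # UTF-16 without a BOM: ASCII markup is interleaved with NUL bytes.
    if b"\x00" in raw[:4]:
        return "utf-16"
    # Otherwise make sure the bytes decode cleanly as UTF-8.
    raw.decode("utf-8")  # raises UnicodeDecodeError if they do not
    return "utf-8"


# UTF-8 bytes carrying a utf-16 label, as in the question
mislabelled = b'<?xml version="1.0" encoding="utf-16"?>\n<test></test>'
# The same text actually encoded as UTF-16 (the codec adds a BOM)
genuine = '<?xml version="1.0" encoding="utf-16"?>\n<test></test>'.encode("utf-16")

print(guess_encoding(mislabelled))
print(guess_encoding(genuine))
```

If the helper reports utf-16, the label is correct and decoding as UTF-8 would be the wrong fix.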

Answer 2

Have a look at the documentation of the XMLParser constructor:

>>> help(etree.XMLParser)

Among other options, there is an encoding parameter, which allows you to "override the document encoding", as the docs say.

That's exactly what you need:

parser = etree.XMLParser(encoding='UTF-8')
tree = etree.parse("req.xml", parser=parser)

If the error message is right (i.e. there aren't any other problems with the document), then I expect this to work.
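
The same override can be tried without touching a file on disk. This is a small sketch, assuming lxml is installed; the bytes literal is an in-memory stand-in for req.xml, and the variable names raw and root are illustrative only.

```python
import lxml.etree as etree

# UTF-8 bytes whose XML declaration claims utf-16, like req.xml in the question
raw = (
    b'<?xml version="1.0" encoding="utf-16"?>\n'
    b'<test xmlns:xsd="http://www.w3.org/2001/XMLSchema"\n'
    b'      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">\n'
    b'</test>'
)

# The parser-level encoding overrides the one declared in the document
parser = etree.XMLParser(encoding="UTF-8")
root = etree.fromstring(raw, parser=parser)

print(root.tag)    # name of the root element
print(root.nsmap)  # prefix -> namespace URI mapping declared on the root
```

Passing the parser to etree.fromstring works the same way as passing it to etree.parse, so the override applies to bytes from any source, not just files.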