How to parse an xml file encoded in iso-8859-1 with lxml?

Question

Hello, I am trying to parse an XML file encoded in ISO-8859-1 with the lxml library, but so far without success.

To check that the encoding is right, I opened the file with:

f = open(file, mode = "r", encoding = "ISO-8859-1")
print(f.read())

and the content prints correctly.

But when I use the code below, I get wrong characters:

from lxml import etree
import json
file = "test.xml"
parser = etree.XMLParser(encoding = "iso-8859-1")
tree = etree.parse(file, parser)
root = tree.getroot()
top_asset = root.find("Asset")
asset_metadata = top_asset.find("Metadata")
series_names = []
dict = {}
for app_data in asset_metadata.findall("App_Data"):
    if app_data.attrib["Name"].lower() == "asset_name":
        series_names.append(app_data.attrib["Value"])
        key = app_data.attrib["Name"].lower()
        value = series_names
        dict[key] = value
print(json.dumps(dict,indent = 2))

I get the following output:

{"asset_name": [
    "Todo en 90 d\u00edas: Antes del viaje",
    "90 Day Fiance: Before the 90 Days"
  ]}

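I also tried opening the file with the first snippet and then calling etree.fromstring(f), but because the first line of the XML file is <?xml version="1.0" encoding="ISO-8859-1"?>, I get an error when trying to parse it. If I remove that line I can parse the file, but the output contains the same wrong characters.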
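A note on the fromstring attempt: lxml generally refuses a decoded str that still carries an encoding declaration, and the usual way around that is to hand the parser raw bytes so it can read the declaration and decode the document itself. The sketch below illustrates the bytes-in idea with the standard library's xml.etree.ElementTree as a stand-in, so that it runs without third-party packages. lxml's etree.fromstring also accepts bytes. The sample document is a hypothetical reduction of the file in the question, and in practice the bytes would come from open(path, "rb").read().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the question's file, stored as ISO-8859-1 bytes.
xml_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    '<ADI><Asset><Metadata>\n'
    '<App_Data Value="Todo en 90 días: Antes del viaje" Name="Asset_Name" App="MOD"/>\n'
    '</Metadata></Asset></ADI>'
).encode("iso-8859-1")

# Pass bytes, not str: the parser honours the encoding declaration on its own.
root = ET.fromstring(xml_bytes)
value = root.find("Asset/Metadata/App_Data").attrib["Value"]
print(value)
```

If the attribute value comes back with the accented character intact, the parsing step is not where the characters go wrong.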

Here is the part of the XML file that contains the problematic character:

<?xml version="1.0" encoding="ISO-8859-1"?>
<ADI>
    <Asset>
        <Metadata>
            <App_Data Value="Todo en 90 días: Antes del viaje" Name="Asset_Name" App="MOD"/>
            <App_Data Value="90 Day Fiance: Before the 90 Days" Name="Asset_Name" App="MOD"/>
        </Metadata>

I checked the whole file at https://validator.w3.org/ and the output was:

Warning: Documents encoded as windows-1252 are often mislabeled as ISO-8859-1, which is the declared encoding of this document.

At line 1, column 41

I tried both iso-8859-1 and windows-1252 as the encoding argument to XMLParser.
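A small sketch of why switching between these two encodings may make no visible difference for this file: ISO-8859-1 and windows-1252 agree on the byte used for "í" and differ only in the range 0x80 to 0x9F. The byte values below are chosen for illustration and are not taken from the actual file.

```python
# 0xED is the single byte that stores "í" in both encodings.
raw = b"\xed"
as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("windows-1252")
print(as_latin1 == as_cp1252)

# The two encodings only disagree for bytes 0x80-0x9F, for example 0x80.
ctrl = b"\x80".decode("iso-8859-1")    # a C1 control character
euro = b"\x80".decode("windows-1252")  # the euro sign
print(repr(ctrl), repr(euro))
```

So for text like "días", either encoding argument should yield the same decoded string, which suggests the validator warning is unrelated to the symptom.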

Answer 1

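It looks to me like the characters are changed at the json.dumps() stage rather than during parsing: by default json.dumps() escapes non-ASCII characters, so í is written as \u00ed. I would try adding the ensure_ascii parameter: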

print(json.dumps(dict, indent = 2, ensure_ascii=False))
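As a minimal sketch of the difference, assuming a dictionary shaped like the one in the question (the variable names here are illustrative), the following compares the default output with ensure_ascii=False and checks that both forms decode back to the same data:

```python
import json

# Same shape as the question's result; the value contains a non-ASCII "í".
data = {"asset_name": ["Todo en 90 días: Antes del viaje"]}

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keeps "í" as-is

print(escaped)
print(readable)

# Both strings are valid JSON for the same data; only the spelling differs.
print(json.loads(escaped) == json.loads(readable) == data)
```

The escaped form is still correct JSON, so any JSON consumer reads it back as "días". ensure_ascii=False mainly matters when the output is meant to be read by people or written to a file with an explicit encoding such as UTF-8.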

python

xml

lxml