reference to invalid character number: (Python ElementTree parse)

Question

I have an XML file that contains the following:

    <word>vegetation</word>
    <word>cover</word>
    <word>(&#x2;31%</word>
    <word>split_identifier ;</word>
    <word>Still</word>
    <word>and</word>

When I read the file with ElementTree's parse, I get this error:

xml.etree.ElementTree.ParseError: reference to invalid character number

This is because of &#x2; (it is a "~").

How can I handle this? I am not sure how many other symbols like this I will get in the future.

Answer 1

If you want to get rid of those special characters, you can scrub them from the input XML while it is still a string:

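    import re
    import xml.etree.ElementTree as ET

    # response is assumed to be an HTTP response object whose body is UTF-16 XML
    respXML = response.content.decode("utf-16")
    # Strip numeric character references such as &#x2; or &#31;
    # (the original pattern '&.+[0-9]+;' was greedy and missed hex digits a-f)
    scrubbedXML = re.sub(r'&#(?:x[0-9A-Fa-f]+|[0-9]+);', '', respXML)
    respRoot = ET.fromstring(scrubbedXML)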
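The scrubbing approach above drops whatever the pattern matches, valid references included. If only the references that XML 1.0 forbids should go, a callback can check each code point before deciding. What follows is a minimal sketch rather than part of the original answer: the names is_valid_xml_char and scrub_invalid_char_refs and the sample string are made up for illustration, and the allowed ranges are taken from the Char production of the XML 1.0 specification.

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#31;) and hexadecimal (&#x1F;) character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def is_valid_xml_char(codepoint):
    """Return True if the code point is allowed by the XML 1.0 Char production."""
    return (
        codepoint in (0x9, 0xA, 0xD)
        or 0x20 <= codepoint <= 0xD7FF
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )


def scrub_invalid_char_refs(xml_text, replacement=""):
    """Replace references to forbidden characters; leave valid ones untouched."""
    def _sub(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return match.group(0) if is_valid_xml_char(codepoint) else replacement
    return _CHAR_REF.sub(_sub, xml_text)


# Hypothetical sample: one forbidden reference (&#x2;) and one valid one (&#233;).
sample = "<doc><word>(&#x2;31%</word><word>caf&#233;</word></doc>"
root = ET.fromstring(scrub_invalid_char_refs(sample))
print([w.text for w in root.iter("word")])
```

Passing something like replacement="?" instead of the empty string keeps a visible marker where a character was dropped. Be aware that this works on the raw text, so a reference that appears literally inside a CDATA section would be touched as well.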

If you would rather keep the special characters, you can resolve them beforehand. In your case it looks like HTML, so you can use Python's html module:

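    import html
    import xml.etree.ElementTree as ET

    # response is assumed to be an HTTP response object, as in the snippet above
    respRoot = ET.fromstring(html.unescape(response.content.decode("utf-16")))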
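One caveat worth checking before relying on html.unescape for a whole document: it decodes every reference it recognises, including &amp; and &lt;, so text that was correctly escaped can turn into markup the XML parser rejects. In CPython, references to control characters such as &#x2; also appear to be dropped rather than kept. The sketch below uses a made-up sample string to show both effects; it is an illustration, not part of the original answer.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample: a forbidden reference plus a correctly escaped ampersand.
sample = "<doc><word>(&#x2;31%</word><word>a &amp; b</word></doc>"

unescaped = html.unescape(sample)
print(repr(unescaped))

parse_failed = False
try:
    ET.fromstring(unescaped)
except ET.ParseError as exc:
    # The bare "&" produced from "&amp;" is no longer well-formed XML.
    parse_failed = True
    print("ParseError:", exc)
```

If the data can contain escaped ampersands or angle brackets, handling only the numeric references yourself is likely the safer route.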

python

elementtree