Parsing an XML file with invalid xml:id values (starting with a number)

Question

Suppose I have XML like the following. Note that the xml:id attribute values are strings that start with a digit:

<node1>
    <text xml:id='7865ft6zh67'>
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="40">
               A House that has:
                   <p xml:id="45">- a window;</p>
                   <p xml:id="46">- a door</p>
                   <p xml:id="46">- a door</p>
               its a beuatiful house
               </p>
          </div>
       </div>
    </text>
</node1>

I want to locate the text element whose title is "book" and get all the text from the first p tag that appears inside that text node.

The first part can be done using the answer here: (my own question)

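But in this new XML (compared with the one in that question) the xml:id values start with a digit and, as one of the answers there pointed out, running the code produces the following error: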

 xml:id : attribute value 7865ft6zh67 is not an NCName, line 3, column 31

How can I still parse the XML despite its non-compliant xml:id values?

The only solution I have come up with so far is to read the XML into a string and prepend a letter to every xml:id value, for example:

newXML = '...hange><change xml:id="6f58f74883d55b...'
newXML_repared = newXML.replace('xml:id="','xml:id="XXid')
newXML_repared

from lxml import etree
XML_tree = etree.fromstring(newXML_repared,parser=parser)

But when I do that I get:

 ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Any suggestions?

Note: I noticed that the string itself starts with:

<?xml version="1.0" encoding="UTF-8"?>
<teiCorpus subtype="simple"  ...etc

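The lxml documentation says: "This requires, however, that unicode strings do not specify a conflicting encoding themselves and thus lie about their real encoding" (https://lxml.de/parsing.html)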

But I still don't know how to solve this.

Thanks.

Answer 1

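There is an option for this in the documentation you linked (https://lxml.de/parsing.html).

Specifically, the recover option listed under the parser options, which tells the parser to try hard to parse through broken XML.

Example: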

from lxml import etree

XML_content = """
<node1>
    <text xml:id='7865ft6zh67' title="book">
       <div chapter='0'>
          <div id='theNode'>
              <p xml:id="40">
               A House that has:
                   <p xml:id="45">- a window;</p>
                   <p xml:id="46">- a door</p>
                   <p xml:id="46">- a door</p>
               its a beuatiful house
               </p>
          </div>
       </div>
    </text>
</node1>
"""

parser = etree.XMLParser(recover=True)

XML_tree = etree.fromstring(XML_content, parser=parser)
text = XML_tree.xpath('normalize-space(//text[@title="book"]/div/div/p)')
# text = XML_tree.xpath('string(//text[@title="book"]/div/div/p)')
print(text)

Note: I added title="book" so that the XPath from your related question still works.
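As for the ValueError in the question, here is a minimal sketch of one way around it, assuming the content arrives as a Python str that begins with an encoding declaration. The sample document below is a trimmed, hypothetical stand-in for the real file, and the variable names are illustrative only. The idea is to encode the str to bytes before handing it to the same recovering parser:

```python
from lxml import etree

# Hypothetical, trimmed-down sample: a str that carries an encoding
# declaration and an xml:id value that starts with a digit.
xml_text = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<node1>\n'
    '    <text xml:id="7865ft6zh67" title="book">\n'
    '        <p xml:id="40">A House</p>\n'
    '    </text>\n'
    '</node1>\n'
)

# recover=True asks the parser to keep going past the xml:id complaint.
parser = etree.XMLParser(recover=True)

# Pass bytes, not str, so the encoding declaration is accepted.
# The codec used here should match the one named in the declaration.
root = etree.fromstring(xml_text.encode("utf-8"), parser=parser)

text = root.xpath('normalize-space(//text[@title="book"]/p)')
print(text)
```

If the XML lives in a file, opening it in binary mode (or passing the file name to etree.parse with the same parser) should avoid the str/bytes issue altogether, since the parser then reads the declared encoding itself.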

