How to parse text from an XML tag containing "<?>"

Question

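My goal is to get the text: 27. The method according to claim 23 wherein...
How do I retrieve the text inside a tag that contains <? markers? From a Google search, I believe they are called PHP short tags.

I am using lxml with XPath, and these markers do not seem to register as tags or nodes. I tried itertext(), but it did not work well.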

 <claim id="CLM-00027" num="00027">
            <claim-text>                <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.                <?insert-end id="REI-00005" ?></claim-text>
        </claim>

Answer 1

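Here is a snippet that uses XPath to reach the deepest regular element, then takes that element's first child (the <?insert-start ...?> processing instruction) and reads the child's tail to get the actual text.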

from lxml import etree  # a bare "import lxml" does not load the etree submodule

xml = """ <claim id="CLM-00027" num="00027">
            <claim-text>                <?insert-start id="REI-00005" date="20191203" ?>27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.                <?insert-end id="REI-00005" ?></claim-text>
        </claim>"""

root = etree.fromstring(xml)
# XPath selects the deepest regular element
e = root.xpath("/claim/claim-text")
# Its first child is the <?insert-start ...?> node; the text after that node is its tail
res = e[0][0].tail
print(res)

Output:

27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.
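
As an alternative to indexing children, the same text can probably be reached with XPath alone. The sketch below is not part of the original answer: it assumes an abbreviated copy of the question's XML (the claim text is shortened to the form quoted in the question) and uses the standard XPath functions normalize-space() and processing-instruction(). The variable names are illustrative only.

```python
from lxml import etree

# Abbreviated copy of the XML from the question (claim text shortened).
xml_text = (
    '<claim id="CLM-00027" num="00027">'
    '<claim-text>   <?insert-start id="REI-00005" date="20191203" ?>'
    '27. The method according to claim 23 wherein...   '
    '<?insert-end id="REI-00005" ?></claim-text>'
    '</claim>'
)

root = etree.fromstring(xml_text)

# The XPath string-value of an element joins its text nodes and skips
# processing instructions; normalize-space() trims the surrounding padding.
claim_text = root.xpath("normalize-space(/claim/claim-text)")
print(claim_text)

# Alternatively, select the processing instruction by its target name
# and read the text that follows it (its tail).
pi = root.xpath("/claim/claim-text/processing-instruction('insert-start')")[0]
tail_text = pi.tail.strip()
print(pi.target, "->", tail_text)
```

Note that normalize-space() also collapses runs of whitespace inside the text, so it suits cases where exact spacing does not matter.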

Answer 2

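Access the specific child node by index. This works with xml.etree.ElementTree because its default parser skips processing instructions, so the surrounding text ends up in the element's .text.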

from xml.etree import ElementTree as ET
tree = ET.parse('path_to_your.xml')

root = tree.getroot()

print(root[0].text)

Output:

        27. The method according to claim 23 wherein the amorphous metal is selected from the group consisting of Zr based alloys, Ti based alloys, Al based alloys, Fe based alloys, La based alloys, Cu based alloys, Mg based alloys, Pt based alloys, and Pd based alloys.
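
To make the behaviour above easier to check, here is a minimal self-contained sketch that parses from a string instead of a file. It is not from the original answer: the XML is an abbreviated copy of the question's sample, the variable names are illustrative, and the second half assumes Python 3.8 or later, where TreeBuilder accepts insert_pis.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the XML from the question (claim text shortened).
xml_text = (
    '<claim id="CLM-00027" num="00027">'
    '<claim-text>   <?insert-start id="REI-00005" date="20191203" ?>'
    '27. The method according to claim 23 wherein...   '
    '<?insert-end id="REI-00005" ?></claim-text>'
    '</claim>'
)

# Default parser: processing instructions are skipped, so the text on
# both sides of them is merged into claim-text's .text.
root = ET.fromstring(xml_text)
default_text = root[0].text.strip()
print(default_text)

# Python 3.8+: ask the tree builder to keep processing instructions.
# They then appear as children, and the claim text becomes the tail
# of the first one (the same layout lxml produces).
parser = ET.XMLParser(target=ET.TreeBuilder(insert_pis=True))
root_with_pis = ET.fromstring(xml_text, parser=parser)
first_pi = root_with_pis[0][0]
pi_tail = first_pi.tail.strip()
print(pi_tail)
```

The strip() calls remove the indentation that surrounds the claim text in the source file.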

python

xml

lxml

processing-instruction