Parse certain XML tags sequentially using ElementTree
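I am trying to parse an XML file sequentially, considering only the XML tags I am interested in. A sample XML file (stored as file.xml) is shown below. I am only interested in certain XML tags with known paths, as shown in the Python snippet below (e.g. header/para/paratext, body/section/intro/text). Different XML files may have the tags in a different order, so I do not want to hard-code the order in which my known XML tags appear. Any suggestions on how to do this efficiently without iterating over the whole XML file?
XML file: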
<data>
<header>
<para>
<paratext>0 - extract this</paratext>
</para>
</header>
<body>
<section>
<intro>
<text>1 - extract this</text>
</intro>
<para>
<paratext>2 - extract this</paratext>
</para>
<items>
<paratext>do not extract this</paratext>
<part>
<para>
<paratext>3 - extract this</paratext>
</para>
</part>
</items>
</section>
<section>
<text>do not extract this</text>
<intro>
<text>4 - extract this</text>
</intro>
<para>
<paratext>5 - extract this</paratext>
</para>
<para>
<paratext>6 - extract this</paratext>
</para>
</section>
</body>
</data>
Expected output: ['0 - extract this', '1 - extract this', '2 - extract this', '3 - extract this', '4 - extract this', '5 - extract this', '6 - extract this']
Sample Python script:
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
### Paths I would like to extract (but sequentially)
[i.text for i in root.findall('header/para/paratext')]
# ['0 - extract this']
[i.text for i in root.findall('body/section/intro/text')]
# ['1 - extract this', '4 - extract this']
[i.text for i in root.findall('body/section/para/paratext')]
# ['2 - extract this', '5 - extract this', '6 - extract this']
[i.text for i in root.findall('body/section/items/part/para/paratext')]
# ['3 - extract this']
I think the best approach is to use the union operator ("|") in XPath. It selects the desired elements in document order.
Unfortunately, ElementTree has limited XPath support.
If you can use lxml, it has much better XPath support.
Example...
Python
from lxml import etree
tree = etree.parse("file.xml")
print([i.text for i in tree.xpath('header/para/paratext|'
'body/section/intro/text|'
'body/section/para/paratext|'
'body/section/items/part/para/paratext')])
Printed output:
['0 - extract this', '1 - extract this', '2 - extract this', '3 - extract this', '4 - extract this', '5 - extract this', '6 - extract this']
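If lxml is not an option, something similar can probably be done with the standard library alone. The following is only a minimal sketch, not part of the lxml approach above: it walks the tree once in document order, tracks each element's path relative to the root, and descends only into branches that can still lead to a wanted path. The names `WANTED`, `SAMPLE`, and `extract_in_order` are made up for illustration, and `SAMPLE` is a shortened inline copy of the question's file.xml. Note that the whole document is still parsed into memory; only the walk afterwards is pruned.

```python
import xml.etree.ElementTree as ET

# Paths of interest, relative to the root element (taken from the question).
WANTED = {
    "header/para/paratext",
    "body/section/intro/text",
    "body/section/para/paratext",
    "body/section/items/part/para/paratext",
}

# Shortened inline copy of the question's file.xml (illustration only).
SAMPLE = """
<data>
  <header><para><paratext>0 - extract this</paratext></para></header>
  <body>
    <section>
      <intro><text>1 - extract this</text></intro>
      <para><paratext>2 - extract this</paratext></para>
      <items>
        <paratext>do not extract this</paratext>
        <part><para><paratext>3 - extract this</paratext></para></part>
      </items>
    </section>
    <section>
      <text>do not extract this</text>
      <intro><text>4 - extract this</text></intro>
      <para><paratext>5 - extract this</paratext></para>
      <para><paratext>6 - extract this</paratext></para>
    </section>
  </body>
</data>
"""


def extract_in_order(root, wanted):
    """Return the text of elements whose root-relative path is in `wanted`,
    in document order."""
    # Every proper prefix of a wanted path marks a branch worth descending into.
    prefixes = set()
    for path in wanted:
        parts = path.split("/")
        for i in range(1, len(parts)):
            prefixes.add("/".join(parts[:i]))

    results = []

    def walk(element, path):
        # Children are visited in the order they appear in the document.
        for child in element:
            child_path = f"{path}/{child.tag}" if path else child.tag
            if child_path in wanted:
                results.append(child.text)
            if child_path in prefixes:
                walk(child, child_path)

    walk(root, "")
    return results


root = ET.fromstring(SAMPLE)
print(extract_in_order(root, WANTED))
# ['0 - extract this', '1 - extract this', '2 - extract this', '3 - extract this',
#  '4 - extract this', '5 - extract this', '6 - extract this']
```

For a real file, `ET.parse('file.xml').getroot()` would take the place of `ET.fromstring(SAMPLE)`. The sketch handles plain tag paths only (no namespaces, wildcards, or predicates), so the lxml union expression remains the more general solution.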