Fast traversal of an lxml tree by skipping a specific branch
Suppose I have an etree like the following:
my_data.xml
<?xml version="1.0" encoding="UTF-8"?>
<data>
<country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
<rank updated="yes">2</rank>
<holidays>
<christmas>Yes</christmas>
</holidays>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
<continent>Asia</continent>
<holidays>
<christmas>Yes</christmas>
</holidays>
<rank updated="yes">5</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama" xmlns="aaa:bbb:ccc:panama:eee">
<rank updated="yes">69</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
<ethnicity xmlns="aaa:bbb:ccc:ethnicity:eee">
<malay>
<holidays>
<ramadan>Yes</ramadan>
</holidays>
</malay>
</ethnicity>
</data>
Parsing:
from lxml import etree

xtree = etree.parse('my_data.xml')
xroot = xtree.getroot()
I want to traverse the tree and do something with every branch except certain ones. In this example, I want to exclude the ethnicity branch:
node_to_exclude = xroot.xpath('.//*[local-name()="ethnicity"]')
exclude_path = xtree.getelementpath(node_to_exclude[0])
for element in xroot.iter('*'):
    if exclude_path not in xtree.getelementpath(element):
        pass  # ...do stuff...
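But this still walks the whole tree. Is there a better/faster way than this (i.e. one that ignores the entire ethnicity branch)? I'm looking for a syntactic solution, not a recursive algorithm.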
XPath can do this for you:
for element in xroot.xpath('.//*[not(ancestor-or-self::*[local-name()="ethnicity"])]'):
    pass  # ...do stuff...
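As a quick check of what that expression selects, here is a minimal sketch on a trimmed-down copy of the sample document. The inline `XML` string and the `kept_names` variable are illustrative assumptions, not part of the original question.

```python
from lxml import etree

# Trimmed-down stand-in for my_data.xml (hypothetical, for illustration only).
XML = b"""<data>
  <country name="Panama" xmlns="aaa:bbb:ccc:panama:eee">
    <rank updated="yes">69</rank>
  </country>
  <ethnicity xmlns="aaa:bbb:ccc:ethnicity:eee">
    <malay><holidays><ramadan>Yes</ramadan></holidays></malay>
  </ethnicity>
</data>"""

xroot = etree.fromstring(XML)

# Keep every descendant that neither is, nor sits below, an element
# whose local name is "ethnicity".
kept = xroot.xpath('.//*[not(ancestor-or-self::*[local-name()="ethnicity"])]')

# QName(...).localname strips the namespace part of the tag.
kept_names = [etree.QName(el).localname for el in kept]
print(kept_names)
```

Note that `.//*` starts below `xroot`, so the root `<data>` element itself is not in the result.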
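It might (or might not; measure it) improve performance to specify which ancestor you mean. For example, if <ethnicity xmlns="..."> is always a child of the top-level element, i.e. the "second-to-last ancestor", you could do: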
for element in xroot.xpath('.//*[not(ancestor-or-self::*[last()-1][local-name()="ethnicity"])]'):
    pass  # ...do stuff...
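To act on the "measure it" advice, a rough timing sketch might look like the following. The synthetic tree built by `build_tree`, its size, and the names `any_ancestor` and `second_to_last` are assumptions made for illustration; real numbers depend on the shape of your own document, so no timings are claimed here.

```python
import timeit
from lxml import etree

def build_tree(n_countries=200):
    """Build a synthetic tree shaped roughly like my_data.xml (hypothetical data)."""
    root = etree.Element("data")
    ns = "{aaa:bbb:ccc:country:eee}"
    for i in range(n_countries):
        country = etree.SubElement(root, ns + "country", name=f"c{i}")
        etree.SubElement(country, ns + "rank").text = str(i)
        etree.SubElement(country, ns + "year").text = "2011"
    eth_ns = "{aaa:bbb:ccc:ethnicity:eee}"
    ethnicity = etree.SubElement(root, eth_ns + "ethnicity")
    malay = etree.SubElement(ethnicity, eth_ns + "malay")
    etree.SubElement(malay, eth_ns + "holidays")
    return root

xroot = build_tree()

# Precompile both expressions so only evaluation time is measured.
any_ancestor = etree.XPath(
    './/*[not(ancestor-or-self::*[local-name()="ethnicity"])]')
second_to_last = etree.XPath(
    './/*[not(ancestor-or-self::*[last()-1][local-name()="ethnicity"])]')

t_any = timeit.timeit(lambda: any_ancestor(xroot), number=20)
t_second = timeit.timeit(lambda: second_to_last(xroot), number=20)

print(f"any ancestor:          {t_any:.4f}s")
print(f"second-to-last only:   {t_second:.4f}s")
```

Both expressions select the same elements only when the excluded element is always a direct child of the root, as it is in this synthetic tree.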
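Of course, you could also do this:
for child in xroot.iterchildren('*'):
    # Compare the local name only; the namespace URI also contains "ethnicity".
    if etree.QName(child).localname == 'ethnicity':
        continue
    # Use a relative path here: '//*' would select every element in the document.
    for element in child.xpath('descendant-or-self::*'):
        pass  # ...do stuff...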
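Another non-recursive option, not mentioned in the answer itself, is lxml's `etree.iterwalk`, whose walker object offers `skip_subtree()` in reasonably recent lxml releases (assumed here; check your installed version). The sketch below wraps it in a hypothetical helper, `iter_skipping`, so the excluded branch is pruned at its root instead of being filtered element by element.

```python
from lxml import etree

def iter_skipping(root, skip_localname):
    """Yield elements in document order, pruning any subtree whose root
    element has the given local name (hypothetical helper)."""
    walker = etree.iterwalk(root, events=("start",))
    for _event, element in walker:
        if etree.QName(element).localname == skip_localname:
            # Do not descend into this element's children.
            walker.skip_subtree()
            continue
        yield element

# Trimmed-down stand-in for my_data.xml (illustrative only).
XML = b"""<data>
  <country name="Panama" xmlns="aaa:bbb:ccc:panama:eee">
    <rank updated="yes">69</rank>
  </country>
  <ethnicity xmlns="aaa:bbb:ccc:ethnicity:eee">
    <malay><holidays><ramadan>Yes</ramadan></holidays></malay>
  </ethnicity>
</data>"""

xroot = etree.fromstring(XML)
visited = [etree.QName(el).localname for el in iter_skipping(xroot, "ethnicity")]
print(visited)
```

Unlike the `.//*` XPath variants, this walk also yields the starting element (`<data>` here), so skip it explicitly if you only want descendants.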