Fast traversal of an lxml tree by skipping a specific branch

Question

Suppose I have an etree like this:

my_data.xml

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <country name="Liechtenstein" xmlns="aaa:bbb:ccc:liechtenstein:eee">
    <rank updated="yes">2</rank>
    <holidays>
      <christmas>Yes</christmas>
    </holidays>
    <year>2008</year>
    <gdppc>141100</gdppc>
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Singapore" xmlns="aaa:bbb:ccc:singapore:eee">
    <continent>Asia</continent>
    <holidays>
      <christmas>Yes</christmas>
    </holidays>
    <rank updated="yes">5</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
    <neighbor name="Malaysia" direction="N"/>
  </country>
  <country name="Panama" xmlns="aaa:bbb:ccc:panama:eee">
    <rank updated="yes">69</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
    <neighbor name="Costa Rica" direction="W"/>
    <neighbor name="Colombia" direction="E"/>
  </country>
  <ethnicity xmlns="aaa:bbb:ccc:ethnicity:eee">
    <malay>
      <holidays>
        <ramadan>Yes</ramadan>
      </holidays>
    </malay>
  </ethnicity>
</data>

I parse it with:

from lxml import etree

xtree = etree.parse('my_data.xml')
xroot = xtree.getroot()

I want to traverse the tree and do something with every branch except certain ones. In this example, I want to exclude the ethnicity branch:

node_to_exclude = xroot.xpath('.//*[local-name()="ethnicity"]')
exclude_path = xtree.getelementpath(node_to_exclude[0])

for element in xroot.iter('*'):
    # substring test on the element path: true for everything outside the excluded branch
    if exclude_path not in xtree.getelementpath(element):
        ...  # do stuff

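But this still walks the entire tree. Is there a better or faster way to do this, i.e. one that ignores the whole ethnicity branch? I am looking for a syntactic solution, not a recursive algorithm.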

Answer 1

XPath can do this for you:

for element in xroot.xpath('.//*[not(ancestor-or-self::*[local-name()="ethnicity"])]'):
    # ...do stuff...
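To see which elements that expression keeps, here is a minimal sketch run against a cut-down copy of my_data.xml. The inline XML and the names `XML`, `kept` and `visited` are illustrative assumptions, not part of the question's code.

```python
from lxml import etree

# Cut-down version of my_data.xml, inlined so the snippet runs on its own.
XML = b"""<data>
  <country name="Panama" xmlns="aaa:bbb:ccc:panama:eee">
    <rank updated="yes">69</rank>
    <year>2011</year>
  </country>
  <ethnicity xmlns="aaa:bbb:ccc:ethnicity:eee">
    <malay>
      <holidays>
        <ramadan>Yes</ramadan>
      </holidays>
    </malay>
  </ethnicity>
</data>"""

xroot = etree.fromstring(XML)

# Every descendant that is not <ethnicity> and has no <ethnicity> ancestor.
kept = xroot.xpath('.//*[not(ancestor-or-self::*[local-name()="ethnicity"])]')

# QName(...).localname strips the "{namespace}" part of element.tag.
visited = [etree.QName(el).localname for el in kept]
print(visited)  # ['country', 'rank', 'year']
```

Note that the filtering happens inside the XPath engine, so the Python loop only ever sees the elements outside the excluded branch.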

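It may (or may not, so measure it) improve performance to specify which ancestor you mean. For example, if <ethnicity xmlns="..."> is always a child of the top-level element, i.e. the "second-to-last ancestor", you can do: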

for element in xroot.xpath('.//*[not(ancestor-or-self::*[last()-1][local-name()="ethnicity"])]'):
    # ...do stuff...
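Since the advice is to measure, a rough timing sketch might look like the following. The synthetic document, its sizes, and the names `any_ancestor` and `top_level_only` are assumptions made for illustration; real numbers depend on the shape of your data, so treat the output as indicative only.

```python
import timeit
from lxml import etree

# Synthetic document (hypothetical sizes): many <country> branches
# plus one large <ethnicity> branch directly under the root.
parts = ['<data>']
for i in range(200):
    parts.append('<country xmlns="aaa:bbb:ccc:c%d:eee">'
                 '<rank>1</rank><year>2011</year></country>' % i)
parts.append('<ethnicity xmlns="aaa:bbb:ccc:ethnicity:eee">'
             + '<malay><holidays/></malay>' * 200
             + '</ethnicity>')
parts.append('</data>')
xroot = etree.fromstring(''.join(parts))
tree = xroot.getroottree()

# Precompile both expressions so parsing the XPath is not part of the timing.
any_ancestor = etree.XPath(
    './/*[not(ancestor-or-self::*[local-name()="ethnicity"])]')
top_level_only = etree.XPath(
    './/*[not(ancestor-or-self::*[last()-1][local-name()="ethnicity"])]')

# Sanity check: both expressions should select the same elements here.
paths_any = [tree.getpath(el) for el in any_ancestor(xroot)]
paths_top = [tree.getpath(el) for el in top_level_only(xroot)]
same = paths_any == paths_top

t_any = timeit.timeit(lambda: any_ancestor(xroot), number=20)
t_top = timeit.timeit(lambda: top_level_only(xroot), number=20)
print(same, round(t_any, 4), round(t_top, 4))
```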

Of course, you can also do this:

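for child in xroot:  # direct children of the root (getchildren() is deprecated)
    if 'ethnicity' in child.tag:
        continue  # skip the whole branch
    # relative path: '//*' would select from the document root again
    for element in child.xpath('descendant-or-self::*'):
        # ...do stuff...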
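As another non-recursive option that is not part of the answer above: lxml 4.1 and later provide `skip_subtree()` on `etree.iterwalk`, which stops the walker from descending into the current element. The following is a sketch under that version assumption, again using a cut-down inline copy of the data; `XML`, `walker` and `visited` are illustrative names.

```python
from lxml import etree

# Cut-down version of my_data.xml, inlined so the snippet runs on its own.
XML = b"""<data>
  <country name="Panama" xmlns="aaa:bbb:ccc:panama:eee">
    <rank updated="yes">69</rank>
    <year>2011</year>
  </country>
  <ethnicity xmlns="aaa:bbb:ccc:ethnicity:eee">
    <malay>
      <holidays>
        <ramadan>Yes</ramadan>
      </holidays>
    </malay>
  </ethnicity>
</data>"""

xroot = etree.fromstring(XML)

visited = []
# 'start' events fire when an element is entered, before its children.
walker = etree.iterwalk(xroot, events=('start',))
for _event, element in walker:
    if element is xroot:
        continue  # the XPath versions do not yield the root either
    if etree.QName(element).localname == 'ethnicity':
        walker.skip_subtree()  # do not descend into this branch
        continue
    visited.append(etree.QName(element).localname)

print(visited)  # ['country', 'rank', 'year']
```

Unlike the XPath filter, this never visits the descendants of the excluded element at all, which may matter when the skipped branch is large.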
