Trying to traverse nested XML tags, but the recursive function does not traverse to full depth

Question

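I have the following XML data as a string, and I am parsing it into XML with Python's lxml package.

I need to traverse this XML data and generate output in a specific format. The XML looks like this: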

<A xmlns="dfjdlfkdjflsd">
  <B>
    <B1>B_1</B1>
    <B2>B_2</B2>
    <B3>
      <B31>B3_1</B31>
      <B32>B3_2</B32>
      <B33>
        <B331>
          <B3311></B3311>
        </B331>
        <B332>
          <B3321></B3321>
        </B332>
      </B33>
      <B34>
        <B341>
          <B3411></B3411>
        </B341>
        <B342>
          <B3421></B3421>
        </B342>
      </B34>
      <B35>
        <B351>B35_1</B351>
        <B352>
          <B3521>
            <B35211></B35211>
            <B35211></B35212>
          </B3521>
        </B352>
      </B35>
      <B36>
        <B361>B36_1</B361>
        <B362>B36_2</B362>
      </B36>
    </B3>
  </B>
</A>

I want the output in the following format:

{
    'B1': 'B_1',
    'B2': 'B_2',
    'B3_B31': 'B3_1',
    'B3_B32': 'B3_2',
    'B3_B33_B331_B3311': '-',
    'B3_B33_B332_B3321': '-',
    'B3_B34_B341_B3411': '-',
    'B3_B34_B342_B3421': '-',
    'B3_B35_B351': 'B35_1',
    'B3_B35_B352_B3521_B35211': '-',
    'B3_B35_B352_B3521_B35212': '-',
    'B3_B36_B361': 'B36_1',
    'B3_B36_B362': 'B36_2',
}

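This is just an example. In the real scenario, each XML tag can be nested to a different depth, so I decided to use a recursive approach. Here is how far I have got with the code: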

class ParseXML:
    main_output = []
    output = {}

    def __init__(self, xml_input):
        parser = ET.XMLParser(recover=True)
        tree = ET.ElementTree(ET.fromstring(xml_input, parser=parser))
        self.root = tree.getroot()

    def parse_outer_xml(self):
        for children in self.root:
            output = self.parse_xml(children, output={})
            self.main_output.append(output)
        return self.main_output

    def parse_xml(self, children, tag=None, output={}):
        if len(children):
            for child in children.getchildren():
                if child.tag.split('}')[1] in GLOBAL_DICT:
                    output['{0}_{1}'.format(tag, child.tag.split('}')[1]) if tag else child.tag.split('}')[1]] = child.text
                else:
                    if child.tag.split('}')[1] not in GLOBAL_EXCLUDE_DICT:
                        if len(child):
                            if children.tag.split('}')[1] == 'B':
                                tag = child.tag.split('}')[1]
                            else:
                                tag = "{0}_{1}".format(tag, child.tag.split('}')[1]) if tag else "{0}_{1}".format(children.tag.split('}')[1], child.tag.split('}')[1])
                            return self.parse_xml(child, tag, output)
                        else:
                            output['{0}_{1}'.format(tag, child.tag.split('}')[1]) if tag else child.tag.split('}')[1]] = child.text if child.text else "-"
        else:
            output['{0}_{1}'.format(tag, children.tag.split('}')[1]) if tag else children.tag.split('}')[1]] = children.text if children.text else "-"
        return output


if __name__ == '__main__':
    parse = ParseXML(data)
    temp = parse.parse_outer_xml()
    pprint(temp)

When I run this, I get the following output:

[{'B1': 'B_1',
  'B2': 'B_2',
  'B3_B31': 'B3_1',
  'B3_B32': 'B3_2',
  'B3_B33_B331_B3311': '-'}]

But this code does not traverse to full depth. Could anyone look into this and give some guidance on how to traverse this XML data down to its full depth?

Answer 1

You can use a recursive generator function:

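import re
import xml.etree.ElementTree as ET

# s_xml holds the XML string from the question (made well formed first).
# Strip the default namespace declaration so tags are plain names.
t = ET.fromstring(re.sub(r'\sxmlns="\w+"', '', s_xml))

def flatten(t, p=()):
    # Leaf: yield the underscore-joined path and the text ('-' if there is none).
    if not (c := list(t)):
        yield ('_'.join([*p, t.tag]), '-' if t.text is None else t.text)
    else:
        # Branch: recurse into every child, extending the path.
        for k in c:
            yield from flatten(k, [*p, t.tag])

r = dict(j for k in list(t)[0] for j in flatten(k))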

Output:

{'B1': 'B_1', 'B2': 'B_2', 'B3_B31': 'B3_1', 'B3_B32': 'B3_2', 'B3_B33_B331_B3311': '-', 'B3_B33_B332_B3321': '-', 'B3_B34_B341_B3411': '-', 'B3_B34_B342_B3421': '-', 'B3_B35_B351': 'B35_1', 'B3_B35_B352_B3521_B35211': '-', 'B3_B36_B361': 'B36_1', 'B3_B36_B362': 'B36_2'}
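
As for why the code in the question stops early: `parse_xml` executes `return self.parse_xml(child, tag, output)` inside its `for` loop, so it returns as soon as it reaches the first child that has children, and the remaining siblings are never visited. It also reassigns `tag` inside the loop, so the prefix would leak from one sibling to the next. Below is a minimal sketch of the same dict-building recursion without those two problems. It assumes the standard-library `xml.etree.ElementTree` instead of lxml, leaves out the `GLOBAL_DICT` / `GLOBAL_EXCLUDE_DICT` checks, and uses hypothetical names (`local_name`, `flatten`) and a shortened sample document.

```python
import xml.etree.ElementTree as ET

def local_name(tag):
    # '{namespace}B1' -> 'B1'; tags without a namespace pass through unchanged
    return tag.rpartition('}')[2]

def flatten(elem, prefix=None, output=None):
    """Collect every leaf under elem as {'underscore_joined_path': text}."""
    if output is None:
        output = {}
    for child in elem:
        # Build a fresh key per child instead of reassigning the shared prefix
        key = local_name(child.tag)
        if prefix:
            key = f"{prefix}_{key}"
        if len(child):
            # No `return` here, so the loop continues with the next sibling
            flatten(child, key, output)
        else:
            text = (child.text or "").strip()
            output[key] = text or "-"
    return output

# Shortened, well-formed sample (hypothetical, cut down from the question's XML)
sample = """<A xmlns="dfjdlfkdjflsd">
  <B>
    <B1>B_1</B1>
    <B3>
      <B31>B3_1</B31>
      <B33><B331><B3311></B3311></B331><B332><B3321></B3321></B332></B33>
      <B36><B361>B36_1</B361></B36>
    </B3>
  </B>
</A>"""

root = ET.fromstring(sample)
result = [flatten(b) for b in root]  # one dict per child of <A>
```

Passing `output` down and returning it only after the loop finishes is what lets every sibling branch contribute to the same dict.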

Answer 2

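First, the sample XML in your question is not well formed. Assuming that is fixed, you next have to deal with the fact that the XML contains a namespace declaration. All together, something like the following (using lxml) should at least get you close enough: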

from lxml import etree

# xml_string is a placeholder for your XML above, made well formed
doc = etree.XML(xml_string)

# remove the namespace
for elem in doc.iter():
    elem.tag = etree.QName(elem).localname

# from here, get the path of each element and massage it a bit to fit
# what I believe are your requirements
tree = etree.ElementTree(doc)
targets = []
for e in doc.iter():
    path = tree.getpath(e).replace("/A/B/", "").replace("/", "_")
    if "A" not in path:  # skip the <A> and <B> wrapper elements
        if e.text is not None and len(e.text.strip()) > 0:
            targets.append(path + " : " + e.text.strip())
        elif not e.text:
            # empty leaf; containers hold whitespace-only text and are skipped
            targets.append(path + ": -")

for target in targets:
    print(target)

The output (at least for the sample XML) should be your expected output.
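
If a dict like the one in the question is wanted rather than printed lines, the same path idea can be sketched without recursion by keeping an explicit stack, which also sidesteps Python's recursion limit on very deep documents. This is a swapped-in variant under stated assumptions, not the lxml approach above: it uses the standard-library `xml.etree.ElementTree`, strips the namespace by splitting the tag on `}`, and the name `flatten_iterative` and the shortened sample are hypothetical.

```python
import xml.etree.ElementTree as ET

def flatten_iterative(parent):
    """Map each leaf under parent to its underscore-joined path, without recursion."""
    result = {}
    # Each stack entry is (element, list of local tag names leading to it)
    stack = [(child, []) for child in reversed(list(parent))]
    while stack:
        elem, path = stack.pop()
        path = path + [elem.tag.rpartition('}')[2]]  # drop '{namespace}' if present
        children = list(elem)
        if children:
            # Push in reverse so children are popped in document order
            stack.extend((c, path) for c in reversed(children))
        else:
            result['_'.join(path)] = (elem.text or '').strip() or '-'
    return result

# Shortened, well-formed sample (hypothetical, cut down from the question's XML)
doc = ET.fromstring(
    '<A xmlns="dfjdlfkdjflsd"><B><B1>B_1</B1><B3><B35><B351>B35_1</B351>'
    '<B352><B3521><B35211></B35211><B35212></B35212></B3521></B352></B35></B3></B></A>'
)
flat = flatten_iterative(doc[0])  # doc[0] is the <B> element
```

Each stack entry carries its own path list, so sibling branches cannot overwrite one another's prefix.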

