在递归嵌套 XML 中获取元素的最大嵌入级别

Question

对于任意递归嵌套的每个元素 XML 我需要找到它的最大嵌入级别。

所以例如这个 XML

<chorus>
    <l>Alright now lose it <ah>aah <i>aah <ah>a<ah>a</ah>h</ah> aah</i> aah</ah></l>
    <l>Just lose it aah aah aah aah aah</l>
    <l>Go crazy aah aah aah aah aah</l>
    <l>Oh baby <ah>aah aah</ah>, oh baby baby <ah>aah aah</ah></l>
</chorus>

输出应如下所示：{"chorus": 0, "l": 0, "ah": 2, "i": 0}

不幸的是，解决方案仅限于使用 xml.etree.ElementTree。

我尝试了几个小时的不同方法，但我无法理解它。

Answer 1

我认为这是函数 str.find(string, beginning, end) 的一个很好的用例：https://www.tutorialspoint.com/python/string_find.htm

因此，如果我对 <> 之间的每个元素的理解正确，我们将查看该元素的最大嵌入深度，因此：

我们寻找元素，我们保存找到它的位置
然后我们寻找结束匹配的元素，保存位置
再次查找该元素，但从我们第一次找到它的位置开始
如果元素第二次出现的位置在结束元素第一次出现之前找到，我们将深度加1，我们将开始寻找第二次出现的位置
如果元素第二次出现的位置是在第一次出现结束元素之后找到的，则它们不是嵌入的，所以我们不更新深度，我们将关注结束元素的位置找到下一个结束元素
重复直到在子字符串中找不到元素为止
对每个不同的元素重复

我认为它是完全可以优化的，但经过一些测试后它起作用了：

import re

string = ("<chorus>"
          "<l>Alright now lose it <ah>aah <i>aah <ah>a<ah>a</ah>h</ah> aah</i> aah</ah></l>"
          "<l>Just lose it aah aah aah aah aah</l>"
          "<l>Go crazy aah aah aah aah aah</l>"
          "<l>Oh baby <ah>aah aah</ah>, oh baby baby <ah>aah aah</ah></l>"
          "</chorus>")

# Start looking for the different nodes
regex = re.compile('\<[a-z]+\>')
# create a set of the nodes and closing nodes so we can iterate trough them
nodes = set(regex.findall(string))
closing_nodes = [node.replace('<', '</') for node in nodes]


depth = {node: '0' for node in nodes}

for node, closing_node in zip(nodes, closing_nodes):
    pos_node = string.find(node) + len(node)
    pos_closing_node = 0
    node_depth = 0
    max_node_depth = 0
    # we keep looking until we do not find our node (then str.find(node) returns -1)
    while pos_node >= 0:
        pos_closing_node = string.find(closing_node, pos_closing_node)
        pos_node = string.find(node, pos_node)
        # if we didnt find the node at all of if we found an occurence of a closing node before the node, we reduce depth by 1
        # and we will be looking for the next closing node next time, so we add the lengh of the closing node to the starting position of our search
        if pos_closing_node < pos_node or pos_node == -1:
            node_depth -= 1
            pos_closing_node = pos_closing_node + len(closing_node)
        else:
            node_depth += 1
            pos_node = pos_node + len(node)
        # we want the max depth so we take the max of the depth values
        max_node_depth = max(max_node_depth, node_depth)
    depth[node] = max_node_depth

print(depth)

Answer 2

您可以使用 this example from the docs 的修改版本：

尝试将 maxDepth 和 depth 更改为使用键的元素名称（标签）的字典...

Python

from xml.etree.ElementTree import XMLParser


class MaxDepth:  # The target object of the parser
    maxDepth = {}
    depth = {}

    def start(self, tag, attrib):  # Called for each opening tag.
        try:
            self.depth[tag] += 1
        except KeyError:
            self.depth[tag] = 0
            self.maxDepth[tag] = 0
        if self.depth[tag] > self.maxDepth[tag]:
            self.maxDepth[tag] = self.depth[tag]

    def end(self, tag):  # Called for each closing tag.
        self.depth[tag] -= 1

    def data(self, data):
        pass  # We do not need to do anything with data.

    def close(self):  # Called when all data has been parsed.
        return self.maxDepth


target = MaxDepth()
parser = XMLParser(target=target)
exampleXml = """
<chorus>
    <l>Alright now lose it <ah>aah <i>aah <ah>a<ah>a</ah>h</ah> aah</i> aah</ah></l>
    <l>Just lose it aah aah aah aah aah</l>
    <l>Go crazy aah aah aah aah aah</l>
    <l>Oh baby <ah>aah aah</ah>, oh baby baby <ah>aah aah</ah></l>
</chorus>"""
parser.feed(exampleXml)
print(parser.close())

输出

{'chorus': 0, 'l': 0, 'ah': 2, 'i': 0}

已编辑 Python（其中 chorus 已经是一个 ElementTree.Element 对象）

import xml.etree.ElementTree as ET
from xml.etree.ElementTree import XMLParser


class MaxDepth:  # The target object of the parser
    maxDepth = {}
    depth = {}

    def start(self, tag, attrib):  # Called for each opening tag.
        try:
            self.depth[tag] += 1
        except KeyError:
            self.depth[tag] = 0
            self.maxDepth[tag] = 0
        if self.depth[tag] > self.maxDepth[tag]:
            self.maxDepth[tag] = self.depth[tag]

    def end(self, tag):  # Called for each closing tag.
        self.depth[tag] -= 1

    def data(self, data):
        pass  # We do not need to do anything with data.

    def close(self):  # Called when all data has been parsed.
        return self.maxDepth


exampleXml = """
<chorus>
    <l>Alright now lose it <ah>aah <i>aah <ah>a<ah>a</ah>h</ah> aah</i> aah</ah></l>
    <l>Just lose it aah aah aah aah aah</l>
    <l>Go crazy aah aah aah aah aah</l>
    <l>Oh baby <ah>aah aah</ah>, oh baby baby <ah>aah aah</ah></l>
</chorus>"""

chorus_element = ET.fromstring(exampleXml)

target = MaxDepth()
parser = XMLParser(target=target)
parser.feed(ET.tostring(chorus_element))
print(parser.close())

在递归嵌套 XML 中获取元素的最大嵌入级别

Get maximal level of embedding of element in recursively nested XML

python

xml

elementtree

xml-parsing

python-3.x