在递归嵌套 XML 中获取元素的最大嵌入级别
Get maximal level of embedding of element in recursively nested XML
对于任意递归嵌套的每个元素 XML 我需要找到它的最大嵌入级别。
所以例如这个 XML
<chorus>
<l>Alright now lose it <ah>aah <i>aah <ah>a<ah>a</ah>h</ah> aah</i> aah</ah></l>
<l>Just lose it aah aah aah aah aah</l>
<l>Go crazy aah aah aah aah aah</l>
<l>Oh baby <ah>aah aah</ah>, oh baby baby <ah>aah aah</ah></l>
</chorus>
输出应如下所示:{"chorus": 0, "l": 0, "ah": 2, "i": 0}
不幸的是,解决方案仅限于使用 xml.etree.ElementTree
。
我尝试了几个小时的不同方法,但我无法理解它。
我认为这是函数 str.find(string, beginning, end) 的一个很好的用例:https://www.tutorialspoint.com/python/string_find.htm
因此,如果我对 <> 之间的每个元素的理解正确,我们将查看该元素的最大嵌入深度,因此:
- 我们寻找元素,我们保存找到它的位置
- 然后我们寻找结束匹配的元素,保存位置
- 再次查找该元素,但从我们第一次找到它的位置开始
- 如果元素第二次出现的位置在结束元素第一次出现之前找到,我们将深度加1,我们将开始寻找第二次出现的位置
- 如果元素第二次出现的位置是在第一次出现结束元素之后找到的,则它们不是嵌入的,所以我们不更新深度,我们将关注结束元素的位置找到下一个结束元素
- 重复直到在子字符串中找不到元素为止
- 对每个不同的元素重复
我认为它是完全可以优化的,但经过一些测试后它起作用了:
import re
string = ("<chorus>"
"<l>Alright now lose it <ah>aah <i>aah <ah>a<ah>a</ah>h</ah> aah</i> aah</ah></l>"
"<l>Just lose it aah aah aah aah aah</l>"
"<l>Go crazy aah aah aah aah aah</l>"
"<l>Oh baby <ah>aah aah</ah>, oh baby baby <ah>aah aah</ah></l>"
"</chorus>")
# Start looking for the different nodes
regex = re.compile('\<[a-z]+\>')
# create a set of the nodes and closing nodes so we can iterate trough them
nodes = set(regex.findall(string))
closing_nodes = [node.replace('<', '</') for node in nodes]
depth = {node: '0' for node in nodes}
for node, closing_node in zip(nodes, closing_nodes):
pos_node = string.find(node) + len(node)
pos_closing_node = 0
node_depth = 0
max_node_depth = 0
# we keep looking until we do not find our node (then str.find(node) returns -1)
while pos_node >= 0:
pos_closing_node = string.find(closing_node, pos_closing_node)
pos_node = string.find(node, pos_node)
# if we didnt find the node at all of if we found an occurence of a closing node before the node, we reduce depth by 1
# and we will be looking for the next closing node next time, so we add the lengh of the closing node to the starting position of our search
if pos_closing_node < pos_node or pos_node == -1:
node_depth -= 1
pos_closing_node = pos_closing_node + len(closing_node)
else:
node_depth += 1
pos_node = pos_node + len(node)
# we want the max depth so we take the max of the depth values
max_node_depth = max(max_node_depth, node_depth)
depth[node] = max_node_depth
print(depth)
您可以使用 this example from the docs 的修改版本:
尝试将 maxDepth
和 depth
更改为使用键的元素名称(标签)的字典...
Python
from xml.etree.ElementTree import XMLParser
class MaxDepth: # The target object of the parser
maxDepth = {}
depth = {}
def start(self, tag, attrib): # Called for each opening tag.
try:
self.depth[tag] += 1
except KeyError:
self.depth[tag] = 0
self.maxDepth[tag] = 0
if self.depth[tag] > self.maxDepth[tag]:
self.maxDepth[tag] = self.depth[tag]
def end(self, tag): # Called for each closing tag.
self.depth[tag] -= 1
def data(self, data):
pass # We do not need to do anything with data.
def close(self): # Called when all data has been parsed.
return self.maxDepth
target = MaxDepth()
parser = XMLParser(target=target)
exampleXml = """
<chorus>
<l>Alright now lose it <ah>aah <i>aah <ah>a<ah>a</ah>h</ah> aah</i> aah</ah></l>
<l>Just lose it aah aah aah aah aah</l>
<l>Go crazy aah aah aah aah aah</l>
<l>Oh baby <ah>aah aah</ah>, oh baby baby <ah>aah aah</ah></l>
</chorus>"""
parser.feed(exampleXml)
print(parser.close())
输出
{'chorus': 0, 'l': 0, 'ah': 2, 'i': 0}
已编辑 Python(其中 chorus
已经是一个 ElementTree.Element
对象)
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import XMLParser
class MaxDepth: # The target object of the parser
maxDepth = {}
depth = {}
def start(self, tag, attrib): # Called for each opening tag.
try:
self.depth[tag] += 1
except KeyError:
self.depth[tag] = 0
self.maxDepth[tag] = 0
if self.depth[tag] > self.maxDepth[tag]:
self.maxDepth[tag] = self.depth[tag]
def end(self, tag): # Called for each closing tag.
self.depth[tag] -= 1
def data(self, data):
pass # We do not need to do anything with data.
def close(self): # Called when all data has been parsed.
return self.maxDepth
exampleXml = """
<chorus>
<l>Alright now lose it <ah>aah <i>aah <ah>a<ah>a</ah>h</ah> aah</i> aah</ah></l>
<l>Just lose it aah aah aah aah aah</l>
<l>Go crazy aah aah aah aah aah</l>
<l>Oh baby <ah>aah aah</ah>, oh baby baby <ah>aah aah</ah></l>
</chorus>"""
chorus_element = ET.fromstring(exampleXml)
target = MaxDepth()
parser = XMLParser(target=target)
parser.feed(ET.tostring(chorus_element))
print(parser.close())
对于任意递归嵌套的每个元素 XML 我需要找到它的最大嵌入级别。
所以例如这个 XML
<chorus>
<l>Alright now lose it <ah>aah <i>aah <ah>a<ah>a</ah>h</ah> aah</i> aah</ah></l>
<l>Just lose it aah aah aah aah aah</l>
<l>Go crazy aah aah aah aah aah</l>
<l>Oh baby <ah>aah aah</ah>, oh baby baby <ah>aah aah</ah></l>
</chorus>
输出应如下所示:{"chorus": 0, "l": 0, "ah": 2, "i": 0}
不幸的是,解决方案仅限于使用 xml.etree.ElementTree
。
我尝试了几个小时的不同方法,但我无法理解它。
我认为这是函数 str.find(string, beginning, end) 的一个很好的用例:https://www.tutorialspoint.com/python/string_find.htm
因此,如果我对 <> 之间的每个元素的理解正确,我们将查看该元素的最大嵌入深度,因此:
- 我们寻找元素,我们保存找到它的位置
- 然后我们寻找结束匹配的元素,保存位置
- 再次查找该元素,但从我们第一次找到它的位置开始
- 如果元素第二次出现的位置在结束元素第一次出现之前找到,我们将深度加1,我们将开始寻找第二次出现的位置
- 如果元素第二次出现的位置是在第一次出现结束元素之后找到的,则它们不是嵌入的,所以我们不更新深度,我们将关注结束元素的位置找到下一个结束元素
- 重复直到在子字符串中找不到元素为止
- 对每个不同的元素重复
我认为它是完全可以优化的,但经过一些测试后它起作用了:
import re
string = ("<chorus>"
"<l>Alright now lose it <ah>aah <i>aah <ah>a<ah>a</ah>h</ah> aah</i> aah</ah></l>"
"<l>Just lose it aah aah aah aah aah</l>"
"<l>Go crazy aah aah aah aah aah</l>"
"<l>Oh baby <ah>aah aah</ah>, oh baby baby <ah>aah aah</ah></l>"
"</chorus>")
# Start looking for the different nodes
regex = re.compile('\<[a-z]+\>')
# create a set of the nodes and closing nodes so we can iterate trough them
nodes = set(regex.findall(string))
closing_nodes = [node.replace('<', '</') for node in nodes]
depth = {node: '0' for node in nodes}
for node, closing_node in zip(nodes, closing_nodes):
pos_node = string.find(node) + len(node)
pos_closing_node = 0
node_depth = 0
max_node_depth = 0
# we keep looking until we do not find our node (then str.find(node) returns -1)
while pos_node >= 0:
pos_closing_node = string.find(closing_node, pos_closing_node)
pos_node = string.find(node, pos_node)
# if we didnt find the node at all of if we found an occurence of a closing node before the node, we reduce depth by 1
# and we will be looking for the next closing node next time, so we add the lengh of the closing node to the starting position of our search
if pos_closing_node < pos_node or pos_node == -1:
node_depth -= 1
pos_closing_node = pos_closing_node + len(closing_node)
else:
node_depth += 1
pos_node = pos_node + len(node)
# we want the max depth so we take the max of the depth values
max_node_depth = max(max_node_depth, node_depth)
depth[node] = max_node_depth
print(depth)
您可以使用 this example from the docs 的修改版本:
尝试将 maxDepth
和 depth
更改为使用键的元素名称(标签)的字典...
Python
from xml.etree.ElementTree import XMLParser
class MaxDepth: # The target object of the parser
maxDepth = {}
depth = {}
def start(self, tag, attrib): # Called for each opening tag.
try:
self.depth[tag] += 1
except KeyError:
self.depth[tag] = 0
self.maxDepth[tag] = 0
if self.depth[tag] > self.maxDepth[tag]:
self.maxDepth[tag] = self.depth[tag]
def end(self, tag): # Called for each closing tag.
self.depth[tag] -= 1
def data(self, data):
pass # We do not need to do anything with data.
def close(self): # Called when all data has been parsed.
return self.maxDepth
target = MaxDepth()
parser = XMLParser(target=target)
exampleXml = """
<chorus>
<l>Alright now lose it <ah>aah <i>aah <ah>a<ah>a</ah>h</ah> aah</i> aah</ah></l>
<l>Just lose it aah aah aah aah aah</l>
<l>Go crazy aah aah aah aah aah</l>
<l>Oh baby <ah>aah aah</ah>, oh baby baby <ah>aah aah</ah></l>
</chorus>"""
parser.feed(exampleXml)
print(parser.close())
输出
{'chorus': 0, 'l': 0, 'ah': 2, 'i': 0}
已编辑 Python(其中 chorus
已经是一个 ElementTree.Element
对象)
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import XMLParser
class MaxDepth: # The target object of the parser
maxDepth = {}
depth = {}
def start(self, tag, attrib): # Called for each opening tag.
try:
self.depth[tag] += 1
except KeyError:
self.depth[tag] = 0
self.maxDepth[tag] = 0
if self.depth[tag] > self.maxDepth[tag]:
self.maxDepth[tag] = self.depth[tag]
def end(self, tag): # Called for each closing tag.
self.depth[tag] -= 1
def data(self, data):
pass # We do not need to do anything with data.
def close(self): # Called when all data has been parsed.
return self.maxDepth
exampleXml = """
<chorus>
<l>Alright now lose it <ah>aah <i>aah <ah>a<ah>a</ah>h</ah> aah</i> aah</ah></l>
<l>Just lose it aah aah aah aah aah</l>
<l>Go crazy aah aah aah aah aah</l>
<l>Oh baby <ah>aah aah</ah>, oh baby baby <ah>aah aah</ah></l>
</chorus>"""
chorus_element = ET.fromstring(exampleXml)
target = MaxDepth()
parser = XMLParser(target=target)
parser.feed(ET.tostring(chorus_element))
print(parser.close())