How to iterate through XML data to remove the next duplicate element using lxml
I am struggling to come up with a simple solution that iterates over XML data and removes the next element if it is a duplicate of the current one.
Example:
From this "input":
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data2</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data5</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data6</c>
</b>
</root>
I would like to get this "output":
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
</root>
To do this, I came up with the following code:
from lxml import etree
from io import StringIO
xml = '''
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data2</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data5</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data6</c>
</b>
</root>'''
# this is to simulate that the above xml was read from a file
# (Python 3: str is already unicode, so no unicode() call is needed)
file = StringIO(xml)
# reading the xml from a file
tree = etree.parse(file)
root = tree.getroot()
# iterate over all "b" elements
for element in root.iter('b'):
    # checks if the last "b" element has been reached.
    # on the last element it raises an "AttributeError" exception and terminates the for loop
    try:
        # attributes of the current element
        elem_attrib_ACT = element.attrib
        # attributes of the next element
        elem_attrib_NEXT = element.getnext().attrib
    except AttributeError:
        # if there is no other element, break
        break
    print('attributes of ACTUAL elem:', elem_attrib_ACT, 'attributes of NEXT elem:', elem_attrib_NEXT)
    if elem_attrib_ACT == elem_attrib_NEXT:
        print('next elem is duplicate of actual one -> remove it')
        # I would like to remove the next element but this approach is not working
        # if you uncomment, it removes the element of "data2" but stops
        # how to remove the next duplicate element?
        #element.getparent().remove(element.getnext())
    else:
        print('next elem is not a duplicate of actual')
print('result:')
print(etree.tostring(root, encoding='unicode'))
Uncommenting the line
#element.getparent().remove(element.getnext())
removes the element around "data2", but then the loop stops. The resulting XML is this:
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data5</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data6</c>
</b>
</root>
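I have the impression that I am "cutting the branch on which I am sitting"...
Any suggestions on how to solve this?
I think your suspicion is correct. If you put a print statement before the break in the except block, you can see that the loop breaks early because this element has already been removed (I think):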
<b attrib1="abc" attrib2="def">
<c>data2</c>
</b>
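To make the early exit visible, here is a small sketch (Python 3, assuming lxml is installed). The helper name trace_iteration and the shortened attribute k are made up for illustration. It runs the same loop with the remove line active and records, for each visited element, its data and whether it still has a parent. As far as I can tell, the iterator has already fetched the next element before the loop body runs, so the element that was just removed is still handed out once, and its getnext() is then None.

```python
from lxml import etree

# Same duplicate pattern as the question, with a shortened attribute (hypothetical "k").
XML = (
    '<root>'
    '<b k="1"><c>data1</c></b>'
    '<b k="1"><c>data2</c></b>'
    '<b k="2"><c>data3</c></b>'
    '<b k="1"><c>data4</c></b>'
    '<b k="1"><c>data5</c></b>'
    '<b k="1"><c>data6</c></b>'
    '</root>'
)

def trace_iteration(xml_text):
    """Run the question's loop with the remove line active and log each visit."""
    root = etree.fromstring(xml_text)
    visits = []
    for element in root.iter('b'):
        # Record the element's data and whether it is still attached to the tree.
        visits.append((element.findtext('c'), element.getparent() is not None))
        nxt = element.getnext()
        if nxt is None:
            # Same effect as the AttributeError / break in the question.
            break
        if element.attrib == nxt.attrib:
            root.remove(nxt)
    remaining = [b.findtext('c') for b in root.iter('b')]
    return visits, remaining

visits, remaining = trace_iteration(XML)
print(visits)     # which elements the loop saw, and whether they were attached
print(remaining)  # which <b> elements are left in the tree
```

If the second visit shows a detached element, that is the "branch" being cut: the loop is standing on the element it removed a moment earlier.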
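Try using getprevious() instead of getnext(). I also switched to a list comprehension that skips the first element, to avoid an error there (getprevious() returns None for it, so accessing .attrib would of course raise an exception):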
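for element in [e for e in root.iter('b')][1:]:
    try:
        # compare with the previous sibling and drop the current element if it is a duplicate
        if element.getprevious().attrib == element.attrib:
            element.getparent().remove(element)
    except AttributeError:
        # no previous sibling
        print('except')
print(etree.tostring(root, encoding='unicode'))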
Result:
<root>
<b attrib1="abc" attrib2="def">
<c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
<c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
<c>data4</c>
</b>
</root>
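The same idea can also be written without try/except. This is only a sketch under a few assumptions: the function name drop_consecutive_duplicates is made up, the "b" elements are direct children of the root, and each element is compared with the last "b" that was kept rather than with its immediate previous sibling. For the sample data both comparisons give the same result; they would differ if other tags sat between two "b" elements.

```python
from lxml import etree

def drop_consecutive_duplicates(root, tag='b'):
    """Remove each <tag> child whose attributes equal those of the last kept <tag> child."""
    last_kept = None
    # list(...) takes a snapshot first, so removing children does not disturb the loop.
    for element in list(root.iterchildren(tag)):
        if last_kept is not None and dict(element.attrib) == dict(last_kept.attrib):
            root.remove(element)
        else:
            last_kept = element
    return root

xml = (
    '<root>'
    '<b attrib1="abc" attrib2="def"><c>data1</c></b>'
    '<b attrib1="abc" attrib2="def"><c>data2</c></b>'
    '<b attrib1="uvw" attrib2="xyz"><c>data3</c></b>'
    '<b attrib1="abc" attrib2="def"><c>data4</c></b>'
    '<b attrib1="abc" attrib2="def"><c>data5</c></b>'
    '<b attrib1="abc" attrib2="def"><c>data6</c></b>'
    '</root>'
)

root = etree.fromstring(xml)
drop_consecutive_duplicates(root)
kept = [b.findtext('c') for b in root]
print(kept)
```

Iterating over a snapshot is the key point: the tree can be modified freely because the loop no longer depends on the live iterator.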