如何使用 lxml 遍历 xml 数据以删除下一个重复元素

how to iterate through xml data to remove next duplicate element using lxml

我正在努力想出一个简单的解决方案,该解决方案迭代 xml 数据以删除下一个元素(如果它是实际元素的副本)。

示例:

从这个"input":

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data2</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>

我想开始这个 "output":

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
</root>'''

为此,我想出了以下代码:

from lxml import etree
from io import StringIO


xml = '''
<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data2</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>'''

# this is to simulate that above xml was read from a file
file = StringIO(unicode(xml))

# reading the xml from a file
tree = etree.parse(file)
root = tree.getroot()

# iterate over all "b" elements
for element in root.iter('b'):
    # checks if the last "b" element has been reached.
    # on last element it raises "AttributeError" eception and terminates the for loop
    try:
        # attributes of actual element
        elem_attrib_ACT = element.attrib
        # attributes of next element
        elem_attrib_NEXT = element.getnext().attrib
    except AttributeError:
        # if no other element, break
        break
    print('attributes of ACTUAL elem:', elem_attrib_ACT, 'attributes of NEXT elem:', elem_attrib_NEXT)
    if elem_attrib_ACT == elem_attrib_NEXT:
        print('next elem is duplicate of actual one -> remove it')
        # I would like to remove next element but this approach is not working
        # if you uncomment, it removes the elements of "data2" but stops
        # how to remove the next duplicate element?
        #element.getparent().remove(element.getnext())
    else:
        print('next elem is not a duplicate of actual')

print('result:')
print(etree.tostring(root))

取消注释行

#element.getparent().remove(element.getnext())

删除 "data2" 周围的元素但停止执行。结果 xml 是这个:

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>

我的印象是我 "cut the branch on which I am sitting"...

有什么解决这个问题的建议吗?

我认为你的怀疑是正确的,如果你在中断 except 块之前放置一个打印语句,你可以看到它提前中断了,因为这个元素已被删除(我认为)

<b attrib1="abc" attrib2="def">
    <c>data2</c>
</b>

尝试使用 getprevious() 而不是 getnext()。我还更新为使用列表理解来避免第一个元素上的错误(这当然会在 .getprevious() 处引发异常):

for element in [e for e in root.iter('b')][1:]:
    try:
        if element.getprevious().attrib == element.attrib:
            element.getparent().remove(element)
    except:
        print 'except  '
print etree.tostring(root)

结果:

<root>
<b attrib1="abc" attrib2="def">
    <c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
    <c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
    <c>data4</c>
</b>
</root>