How do you remove duplicate XML nodes throughout an XML file in Python?

Question

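I know this is almost a duplicate of another question, but I can't work out why my example (which resembles my real data) does not have all of its duplicates removed.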

The code I'm using removes 2 of the 4 duplicates rather than 3. Why?

I'm trying to write a Python script that removes duplicates from XML files.

Code:

from lxml import etree

# path: location of the input XML file (defined elsewhere)
tree = etree.parse(path)
root = tree.getroot()

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag:
        return False
    if e1.text != e2.text:
        return False
    if e1.tail != e2.tail:
        return False
    if e1.attrib != e2.attrib:
        return False
    if len(e1) != len(e2):
        return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])

prev = ""
for page in root:
    elems_to_remove = []
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)
            elems_to_remove.append(elem)
            continue
        prev = elem
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
tree.write("clean.xml")

XML:

<?xml version="1.0" encoding="UTF-8"?>  
<emails>  
<email>  
  <to>Vimal</to>  
  <from>Sonoo</from>  
  <heading>Hello</heading>  
    <body>Hello brother, how are you!</body>  
    <body>Hello brother, how are you!</body>  
    <body>Hello brother, how are you!</body>  
    <body>Hello brother, how are you!</body>  
</email>  
<email>  
  <to>Peter</to>  
  <from>Jack</from>  
  <heading>Birth day wish</heading>  
  <body>Happy birth day Tom!</body>  
</email>  
<email>  
  <to>James</to>  
  <from>Jaclin</from>  
  <heading>Morning walk</heading>  
  <body></body>  
</email>  
<email>  
  <to>Kartik</to>  
  <from>Kumar</from>  
  <heading>Health Tips</heading>  
  <body>Smoking is injurious to health!</body>  
</email>  
</emails>

Hopefully I'm just missing something obvious, so I can learn what it is and move on.

Answer 1

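You get this result because the 3rd and 4th <body> elements are not identical: their tail attributes (the whitespace that follows each closing tag) have different lengths, 7 and 3 characters respectively. As a result, the check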

if e1.tail != e2.tail:
    return False

makes elements_equal return False for that pair, so the fourth <body> is not treated as a duplicate.
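To see the difference directly, the tails can be printed. This is a minimal sketch that assumes a reduced copy of the first <email> from the question, with its trailing spaces and indentation preserved. It uses the standard library's xml.etree.ElementTree so that it runs without lxml installed; lxml elements expose the same tail attribute. The names SAMPLE and tail_lengths are only for illustration.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the first <email>. The trailing spaces and the indentation
# are kept on purpose: they are exactly what ends up in each element's tail.
SAMPLE = (
    "<email>  \n"
    "  <heading>Hello</heading>  \n"
    "    <body>Hello brother, how are you!</body>  \n"
    "    <body>Hello brother, how are you!</body>  \n"
    "    <body>Hello brother, how are you!</body>  \n"
    "    <body>Hello brother, how are you!</body>  \n"
    "</email>"
)

email = ET.fromstring(SAMPLE)
bodies = email.findall("body")

# The tail is the text between an element's closing tag and the next tag.
tail_lengths = [len(b.tail) for b in bodies]
for b in bodies:
    print(repr(b.tail), len(b.tail))
```

The first three tails are two spaces, a newline and the four-space indent of the next <body>; the last one is followed directly by </email>, so it has no indent.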

You can handle this either by dropping the tail equality test or by modifying the XML itself.
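A minimal sketch of the first option follows. It is not the asker's exact code: it ignores the tail of the elements being compared, treats whitespace-only text as empty, and resets prev for each <email>. These are assumptions that may not suit every document, for example one where surrounding whitespace is meaningful. It uses xml.etree.ElementTree from the standard library; the same logic should carry over to lxml elements. The helper names _norm and remove_adjacent_duplicates are hypothetical.

```python
import xml.etree.ElementTree as ET

def _norm(s):
    """Treat None and whitespace-only strings as the same empty value."""
    return (s or "").strip()

def elements_equal(e1, e2, compare_tail=False):
    """Compare two elements; the top-level tail is ignored by default."""
    if e1.tag != e2.tag or e1.attrib != e2.attrib:
        return False
    if _norm(e1.text) != _norm(e2.text):
        return False
    if compare_tail and _norm(e1.tail) != _norm(e2.tail):
        return False
    if len(e1) != len(e2):
        return False
    # Tails of children are part of the parent's content, so compare them.
    return all(elements_equal(c1, c2, compare_tail=True)
               for c1, c2 in zip(e1, e2))

def remove_adjacent_duplicates(root):
    """Remove children that equal the last kept sibling; return the count."""
    removed = 0
    for page in root:
        prev = None
        for elem in list(page):  # iterate over a copy because page is mutated
            if prev is not None and elements_equal(elem, prev):
                page.remove(elem)
                removed += 1
            else:
                prev = elem
    return removed

SAMPLE = (
    "<emails>\n"
    "<email>  \n"
    "  <to>Vimal</to>  \n"
    "    <body>Hello brother, how are you!</body>  \n"
    "    <body>Hello brother, how are you!</body>  \n"
    "    <body>Hello brother, how are you!</body>  \n"
    "    <body>Hello brother, how are you!</body>  \n"
    "</email>  \n"
    "<email>  \n"
    "  <to>Peter</to>  \n"
    "  <body>Happy birth day Tom!</body>  \n"
    "</email>  \n"
    "</emails>"
)

root = ET.fromstring(SAMPLE)
removed = remove_adjacent_duplicates(root)
print("removed:", removed)
```

Like the original loop, this only catches duplicates that sit next to each other; finding non-adjacent duplicates would need a comparison against every sibling already kept.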


python

xml

lxml