Get tags and elements, and change/add/delete those elements, with lxml

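I want to get the different tags used across the whole document, then get their attributes and compare them with the attributes I want them to have (for example, the title tag has an id attribute, but I want to change that attribute's value and also give it a columns attribute).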

Here is a sample of the XML:

 <dita>
        <topic id="id15CDB0PL09E">
            <title id="id15CDB0R0VYB"><?FM MARKER [Header/Footer ] All?>Control
            </title>
            <shortdesc>CONTROL</shortdesc>
            <concept id="id15CDB0Q0Q4G">
                <title id="id15CDB0R0VHA">General
                </title>
                <conbody>
                    <paragraph>This section
                    </paragraph>
                </conbody>
                <concept id="id156F7H00GIE">
                    <title id="id15CDB0R0V1W">System
                    </title>
                    <conbody>
                        <paragraph>Engine
                        </paragraph>
                        <paragraph>The ECU
                        </paragraph>
                        <paragraph>The aircraft
                        </paragraph>
                        <paragraph>The system
                        </paragraph>
                   </conbody>
                </concept>
            </concept>
        </topic>
    </dita>

This is what I have coded so far:

from lxml import etree
import numpy as np

tree = etree.parse("File.xml")
root = tree.getroot()
# Lists to store the tags and attributes
Lista = []
Atributos = []
tags = []
attributes = []
# Show tag - attributes : text as a dictionary
for element in root.iter():
    # Print tag - attributes : text
    #print("%s - %s : %s" % (element.tag, element.attrib, element.text))
    Lista.append(element.tag)
    Atributos.append(element.attrib)
# Get the unique values of the existing tags
tags = np.unique(Lista)
attributes = np.unique(Atributos)
print(tags)
print(Atributos)
tree.write("Resultado.xml")

But it raises:

TypeError: '<' not supported between instances of 'dict' and 'dict'

The expected output is something like this:

tags[topic,title,shortdesc,concept,conbody,para]
attributes[topic:{id} title:{id,columns},shortdesc:None,concept:None,conbody:id,para:id]

If I understand you correctly, something like this should work:

#first, modify the xml to its desired form:
for t in root.xpath('//title'):
    t.attrib['columns']=''
for p in root.xpath('//paragraph'):
    p.attrib['id']=''
for cy in root.xpath('//conbody'):
    cy.attrib['id']=''
for ct in root.xpath('//concept'):
    ct.attrib.pop("id", None)

#check to make sure it worked:
print(etree.tostring(root).decode())

Output:

<dita>
        <topic id="id15CDB0PL09E">
            <title id="id15CDB0R0VYB" columns=""><?FM MARKER [Header/Footer ] All?>Control
            </title>
            <shortdesc>CONTROL</shortdesc>
            <concept>
                <title id="id15CDB0R0VHA" columns="">General
                </title>
                <conbody id="">
                    <paragraph id="">This section
                    </paragraph>
                </conbody>
                <concept>
                    <title id="id15CDB0R0V1W" columns="">System
                    </title>
                    <conbody id="">
                        <paragraph id="">Engine
                        </paragraph>
                        <paragraph id="">The ECU
                        </paragraph>
                        <paragraph id="">The aircraft
                        </paragraph>
                        <paragraph id="">The system
                        </paragraph>
                   </conbody>
                </concept>
            </concept>
        </topic>
    </dita>
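
The loops above hard-code one rule per tag. As a rough sketch, the same edits could be driven from a single table of desired attributes. The names `DESIRED` and `apply_spec` and the shortened inline sample are hypothetical, and the empty-string value for new attributes simply mirrors the code above.

```python
from lxml import etree

# Hypothetical spec: tag name -> attribute names each such element should carry.
DESIRED = {
    "title": {"id", "columns"},
    "paragraph": {"id"},
    "conbody": {"id"},
    "concept": set(),
}

def apply_spec(root, desired):
    """Add missing attributes (empty value) and drop attributes not in the spec."""
    for tag, wanted in desired.items():
        for el in root.iter(tag):
            for name in wanted - set(el.attrib):
                el.set(name, "")        # add with a placeholder value
            for name in set(el.attrib) - wanted:
                del el.attrib[name]     # remove anything the spec does not list
    return root

xml = b"""<dita><topic id="t1"><title id="a">Control</title>
<concept id="c1"><title id="b">General</title>
<conbody><paragraph>This section</paragraph></conbody></concept></topic></dita>"""
root = apply_spec(etree.fromstring(xml), DESIRED)
print(etree.tostring(root).decode())
```

Tags that are not listed in the table (here `topic`) are left untouched, and existing values such as the `id` on `title` are preserved.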

Now, check the unique tags:

tags = set([top.tag for top in root.xpath('/dita//*')])
print(tags)

Output:

{'concept', 'shortdesc', 'title', 'conbody', 'paragraph', 'topic'}
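
As for the original `TypeError`: `np.unique` sorts its input, and the attribute mappings collected in `Atributos` cannot be ordered with `<`, which is what the message reports. A plain `set` of tag names needs no ordering. If you prefer to stay with `root.iter()` instead of XPath, keep in mind that it also yields comments and processing instructions such as `<?FM MARKER ...?>`, whose `.tag` is not a string. The sketch below filters those out; `unique_tags` is a hypothetical helper name and the inline XML is a shortened sample.

```python
from lxml import etree

def unique_tags(root):
    """Return the set of element tag names, skipping comments and PIs."""
    # For comments and processing instructions, .tag is a function, not a str.
    return {el.tag for el in root.iter() if isinstance(el.tag, str)}

xml = b"""<dita><topic id="t1"><title id="a"><?FM MARKER x?>Control</title>
<shortdesc>CONTROL</shortdesc></topic></dita>"""
root = etree.fromstring(xml)
print(sorted(unique_tags(root)))
```

Unlike the `/dita//*` XPath above, this version also includes the root tag `dita`, because `iter()` starts at the element it is called on.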

Finally, list the attributes:

for tag in tags:
    print(tag,root.xpath(f'//{tag}')[0].attrib.keys())

Output:

concept []
shortdesc []
title ['id', 'columns']
conbody ['id']
paragraph ['id']
topic ['id']

Obviously, you can modify the output by changing the order, adding the results to a dictionary, and so on.
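
For instance, the attribute names could be gathered into a dictionary keyed by tag, looking at every element rather than only the first one of each tag, and then compared with the attributes you expect. This is only a sketch: `collect_attributes`, `expected`, and the inline sample are illustrative names and data, not part of the original code.

```python
from collections import defaultdict
from lxml import etree

def collect_attributes(root):
    """Map each tag name to the set of attribute names seen on any such element."""
    found = defaultdict(set)
    for el in root.iter():
        if isinstance(el.tag, str):     # skip comments / processing instructions
            found[el.tag].update(el.attrib.keys())
    return dict(found)

xml = b"""<dita><topic id="t1"><title id="a">Control</title>
<shortdesc>CONTROL</shortdesc>
<concept id="c1"><conbody><paragraph>Text</paragraph></conbody></concept>
</topic></dita>"""
found = collect_attributes(etree.fromstring(xml))

# Hypothetical target: attribute names each tag should end up with.
expected = {"title": {"id", "columns"}, "concept": set(), "paragraph": {"id"}}
for tag, wanted in expected.items():
    have = found.get(tag, set())
    print(tag, "missing:", sorted(wanted - have), "extra:", sorted(have - wanted))
```

The "missing" and "extra" lists tell you which attributes still have to be added to or removed from each tag.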