Obtain tags and attributes, and change/add/delete those attributes, with lxml
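I want to get the distinct tags in the whole document, then get their attributes and compare them with the attributes I want them to have (e.g. the title tag has an id attribute, but I want to change that attribute's value and also want it to have a columns attribute).
Here is a sample of the XML: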
<dita>
<topic id="id15CDB0PL09E">
<title id="id15CDB0R0VYB"><?FM MARKER [Header/Footer ] All?>Control
</title>
<shortdesc>CONTROL</shortdesc>
<concept id="id15CDB0Q0Q4G">
<title id="id15CDB0R0VHA">General
</title>
<conbody>
<paragraph>This section
</paragraph>
</conbody>
<concept id="id156F7H00GIE">
<title id="id15CDB0R0V1W">System
</title>
<conbody>
<paragraph>Engine
</paragraph>
<paragraph>The ECU
</paragraph>
<paragraph>The aircraft
</paragraph>
<paragraph>The system
</paragraph>
</conbody>
</concept>
</concept>
</topic>
</dita>
This is what I have coded so far:
from lxml import etree
import numpy as np

tree = etree.parse("File.xml")
root = tree.getroot()

# Lists to store the tags and attributes
Lista = []
Atributos = []
tags = []
attributes = []

# Collect the tag and attributes of every node
for element in root.iter():
    # Show tag - attributes : text
    #print("%s - %s : %s" % (element.tag, element.attrib, element.text))
    Lista.append(element.tag)
    Atributos.append(element.attrib)

# Show the unique values of the existing tags
tags = np.unique(Lista)
attributes = np.unique(Atributos)
print(tags)
print(Atributos)
tree.write("Resultado.xml")
But it results in:
TypeError: '<' not supported between instances of 'dict' and 'dict'
The expected output is something like this:
tags[topic,title,shortdesc,concept,conbody,para]
attributes[topic:{id} title:{id,columns},shortdesc:None,concept:None,conbody:id,para:id]
If I understand you correctly, something like this should work:
#first, modify the xml to its desired form:
for t in root.xpath('//title'):
    t.attrib['columns'] = ''
for p in root.xpath('//paragraph'):
    p.attrib['id'] = ''
for cy in root.xpath('//conbody'):
    cy.attrib['id'] = ''
for ct in root.xpath('//concept'):
    ct.attrib.pop("id", None)
#check to make sure it worked:
print(etree.tostring(root).decode())
Output:
<dita>
<topic id="id15CDB0PL09E">
<title id="id15CDB0R0VYB" columns=""><?FM MARKER [Header/Footer ] All?>Control
</title>
<shortdesc>CONTROL</shortdesc>
<concept>
<title id="id15CDB0R0VHA" columns="">General
</title>
<conbody id="">
<paragraph id="">This section
</paragraph>
</conbody>
<concept>
<title id="id15CDB0R0V1W" columns="">System
</title>
<conbody id="">
<paragraph id="">Engine
</paragraph>
<paragraph id="">The ECU
</paragraph>
<paragraph id="">The aircraft
</paragraph>
<paragraph id="">The system
</paragraph>
</conbody>
</concept>
</concept>
</topic>
</dita>
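The question also mentions changing the value of an existing attribute, which the snippet above does not do. A minimal sketch of that step, assuming a made-up renaming rule (prefixing the old id with "title-") and a placeholder value for columns; neither comes from the question:

```python
from lxml import etree

# Small stand-in document; the real file would be loaded with etree.parse().
root = etree.fromstring(
    b'<dita><topic id="old1"><title id="old2">Control</title></topic></dita>'
)

# Hypothetical rule: prefix every existing title id with "title-".
for title in root.xpath("//title[@id]"):
    title.set("id", "title-" + title.get("id"))
    title.set("columns", "2")  # assumed placeholder value

print(etree.tostring(root).decode())
```

`set()` overwrites the attribute if it exists and creates it otherwise, so the same call covers both "change" and "add".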
Now, check the unique tags:
tags = set([top.tag for top in root.xpath('/dita//*')])
print(tags)
Output:
{'concept', 'shortdesc', 'title', 'conbody', 'paragraph', 'topic'}
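If you would rather keep the `root.iter()` loop from the question, note that it also yields comments and processing instructions (such as the `<?FM MARKER ...?>` in the sample), whose `tag` is not a string. A small sketch that filters those out, using a shortened stand-in document:

```python
from lxml import etree

root = etree.fromstring(
    b'<dita><topic id="t1"><title id="a"><?FM MARKER All?>Control</title>'
    b'<!-- a comment --><shortdesc>CONTROL</shortdesc></topic></dita>'
)

# Keep only real elements: comments and processing instructions
# have a non-string .tag in lxml.
tags = {el.tag for el in root.iter() if isinstance(el.tag, str)}
print(sorted(tags))
```

Unlike the `/dita//*` XPath, `root.iter()` starts at the root itself, so `dita` ends up in the set as well.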
Finally, list the attributes:
for tag in tags:
    print(tag, root.xpath(f'//{tag}')[0].attrib.keys())
Output:
concept []
shortdesc []
title ['id', 'columns']
conbody ['id']
paragraph ['id']
topic ['id']
Obviously, you can modify the output by changing the order, adding the results to a dictionary, and so on.
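As one possible way to do the dictionary variant, here is a sketch that collects the attribute names of every element per tag (not just the first one) and compares them with a desired specification. The `DESIRED` mapping and the helper names are my own assumptions, modelled on the expected output in the question. Using sets also avoids `np.unique`, which sorts its input and is presumably what fails on the dictionaries:

```python
from collections import defaultdict
from lxml import etree

SAMPLE = b"""<dita>
<topic id="t1">
<title id="a">Control</title>
<shortdesc>CONTROL</shortdesc>
<concept id="c1">
<title id="b">General</title>
<conbody><paragraph>This section</paragraph></conbody>
</concept>
</topic>
</dita>"""

# Hypothetical specification of the attributes each tag should carry.
DESIRED = {
    "topic": {"id"},
    "title": {"id", "columns"},
    "shortdesc": set(),
    "concept": set(),
    "conbody": {"id"},
    "paragraph": {"id"},
}

def attributes_by_tag(root):
    """Map each tag to the union of attribute names seen on its elements."""
    found = defaultdict(set)
    for el in root.iter():
        if isinstance(el.tag, str):  # skip comments and processing instructions
            found[el.tag].update(el.attrib.keys())
    return dict(found)

def compare(found, desired):
    """Return, per tag, the attribute names that are missing or unexpected."""
    report = {}
    for tag, wanted in desired.items():
        have = found.get(tag, set())
        missing, extra = wanted - have, have - wanted
        if missing or extra:
            report[tag] = {"missing": missing, "extra": extra}
    return report

root = etree.fromstring(SAMPLE)
found = attributes_by_tag(root)
report = compare(found, DESIRED)
for tag, diff in sorted(report.items()):
    print(tag, "missing:", sorted(diff["missing"]), "extra:", sorted(diff["extra"]))
```

Tags that already match the specification are left out of the report, so an empty report means nothing needs changing.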