Delete all contents and child elements of an XML node using ElementTree
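I have an XML file and I want to delete everything inside the nodes that have a given attribute=value, but I can't get the ElementTree .remove() method to work. I get a list.remove(x): x not in list error.
If I have a div that contains several paragraph and list elements and carries the attributes v1-9,deleted, I want to be able to delete the whole div and all of its contents.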
import xml.etree.ElementTree as ET

# get target file
tree = ET.parse('tested.htm')
# pull into element tree
root = tree.getroot()
# confirm output
print(root)
# define xmlns tags
MadCap = {'MadCap': 'http://www.madcapsoftware.com/Schemas/MadCap.xsd'}
i = 1
j = 6
# specify state
state = "state.deleted-in-vers"
# specify version
vers = "version-number.v{}-{}".format(i, j)
# combine to get conditional string; might need to double up b/c of order mattering here???
search = ".//*[@MadCap:conditions='{},{}']".format(vers, state)
# get matching elements
for elem in root.findall(search, MadCap):
    print('---PARENT---')
    print(elem)
    print('attributes:', elem.attrib)
    print('text:', elem.text)
    elem.text = " "
    elem.attrib = {}
    for child in elem.iter():
        print('-child element-')
        print(child)
        # this is the call that raises "list.remove(x): x not in list"
        elem.remove(child)
    print('==========')
For simplicity, I have left out the loops over i and j above.
Here is a fragment of the target XML, so you can see how the attributes are used.
<div MadCap:conditions="state.deleted-in-vers,version-number.v1-9">
<h4>Example with password prompts</h4>
<p>In the following example:</p>
<ul>
<li>We have included the value <code>connection.ask-pass</code>, so are being prompted for the password of the setup user. </li>
<li>This host has an installation user <code>hub-setup</code>. </li>
<li>We are installing to the host <code>hub.example.com</code>. We must provide the FQDN of the host.</li>
<li>The KeyStore we are installing to the <MadCap:variable name="Components/gateway-hub.gateway-hub-name" /> hosts is located at <code>/tmp/ssl_keystore</code> on the installation machine.</li>
<li>The TrustStore we are installing to the <MadCap:variable name="Components/gateway-hub.gateway-hub-name" /> hosts is located at <code>/tmp/ssl_truststore</code> on the installation machine.</li>
<li>We are not providing any of the password key-value pairs, and therefore are being prompted for the passwords. </li>
<li>This host has a runtime user <code>hub</code>.<ul><li>The runtime user is in group <code>gateway-hub</code>.</li></ul></li>
</ul>
<p>The <MadCap:variable name="3rd-party-products/formats.json-name" /> configuration file is the following:</p><pre xml:space="preserve">{
"connection": {
"ask_pass": true,
"user": "hub-setup"
},
"hosts": ["hub.example.com"],
"hub": {<MadCap:conditionalText MadCap:conditions="state.new-in-vers,version-number.v1-6">
"user" : "hub",
"group" : "gateway-hub",</MadCap:conditionalText>
"ssl": {
"key_store": "/tmp/ssl_keystore",
"trust_store": "/tmp/ssl_truststore"
}
}<MadCap:conditionalText MadCap:conditions="version-number.v1-6,state.deleted-in-vers">
"ansible" : {
"variables" : {
"hub_user": "hub",
"hub_group": "gateway-hub"
}
}</MadCap:conditionalText>
}</pre>
</div>
<div MadCap:conditions="state.deleted-in-vers,version-number.v1-9">
<h4>Example using SSH key</h4>
<p>In the next example:</p>
<ul>
<li>The SSH key for the setup user is located at <code>~/.ssh/HUB-SETUP-KEY.pem</code> on the installation machine, specified with <code>connection.private_key</code>. </li>
<li>The hosts have an installation user <code>hub-setup</code>. We must provide the FQDN of the host.</li>
<li>The hosts are specified in a list in a newline-delimited file at <code>/tmp/hosts</code> on the installation machine. </li>
<li>The KeyStore we are installing to the <MadCap:variable name="Components/gateway-hub.gateway-hub-name" /> hosts is located at <code>/tmp/ssl_keystore</code> on the installation machine.</li>
<li>The TrustStore we are installing to the <MadCap:variable name="Components/gateway-hub.gateway-hub-name" /> hosts is located at <code>/tmp/ssl_truststore</code> on the installation machine.</li>
<li>We are providing the passwords.</li>
<li>There is a runtime user on every host called <code>hub</code>.<ul><li>The runtime user is in group <code>gateway-hub</code>.</li></ul></li>
</ul>
<p>The <MadCap:variable name="3rd-party-products/formats.json-name" /> configuration file is the following:</p><pre xml:space="preserve">{
"connection": {
"private_key": "~/.ssh/HUB-SETUP-KEY.pem",
"user": "hub-setup"
},
"hosts_file": "/tmp/hosts",
"hub": {<MadCap:conditionalText MadCap:conditions="state.new-in-vers,version-number.v1-6">
"user" : "hub",
"group" : "gateway-hub",</MadCap:conditionalText>
"ssl": {
"key_store": "/tmp/ssl_keystore",
"key_store_password" "hub123",
"trust_store": "/tmp/ssl_truststore",
"trust_store_password": "hub123",
"key_password": "hub123"
}
}<MadCap:conditionalText MadCap:conditions="version-number.v1-6,state.deleted-in-vers">
"ansible" : {
"variables" : {
"hub_user": "hub",
"hub_group": "gateway-hub"
}
}</MadCap:conditionalText>
}</pre>
</div>
I found this task easier to do with lxml, because removing elements is simpler there.
Try the code below:
from lxml import etree as et

def remove_element(el):
    # Remove el from its parent, but keep the tail text that follows it
    parent = el.getparent()
    # el.tail is None when no text follows the element, so check it first
    if el.tail and el.tail.strip():
        prev = el.getprevious()
        if prev is not None:
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)

# Read source XML
parser = et.XMLParser(remove_blank_text=True)
tree = et.parse('Input.xml', parser)
root = tree.getroot()
# Replace the below namespace with your proper one
ns = {'mc': 'http://dummy.com'}
# Processing
for it in root.findall('.//*[@mc:conditions]', ns):
    attr = it.attrib
    attrTxt = ', '.join([f'{key}: {value}'
                         for key, value in attr.items()])
    print(f'Elem.: {et.QName(it).localname:6}: {attrTxt}')
    delFlag = False
    cond = attr.get('{http://dummy.com}conditions')
    if cond:
        # Split on the first dot only, so each part gives exactly one key/value pair
        dct = {k: v for k, v in (x.split('.', 1)
                                 for x in cond.split(','))}
        vn = dct.get('version-number', '')
        st = dct.get('state', '')
        if vn == 'v1-6' and st.startswith('deleted'):
            delFlag = True
        print(f"    {vn}, {st:15} {'Delete' if delFlag else 'Keep'}")
    if delFlag:
        remove_element(it)
# Print the result
print(et.tostring(tree, method='xml',
                  encoding='unicode', pretty_print=True))
Of course, in the target version you should add code that saves this tree to an output file.
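Saving could look roughly like the sketch below. To stay free of dependencies it uses the standard-library xml.etree.ElementTree rather than lxml; lxml's tree.write() accepts the same encoding and xml_declaration arguments and additionally pretty_print. The file name Output.xml and the tiny sample tree are assumptions made for illustration only.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Stand-in for the tree left over after the unwanted elements were removed
root = ET.fromstring("<main><p>kept</p></main>")
tree = ET.ElementTree(root)

# Hypothetical output location (a temporary directory, to keep the sketch harmless)
out_path = os.path.join(tempfile.mkdtemp(), "Output.xml")

# Write the tree with an XML declaration and an explicit encoding
tree.write(out_path, encoding="utf-8", xml_declaration=True)

# Read the file back to confirm that it is well-formed
reloaded = ET.parse(out_path).getroot()
print(reloaded.tag, reloaded.find("p").text)
```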
To get well-formed XML with a single root element, I wrapped your content in:
<main xmlns:MadCap="http://dummy.com">
...
</main>
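As a side note on the error from the question: Element.iter() yields the element itself first and then every descendant, while Element.remove() only accepts direct children, so the loop fails on its first step. If you prefer to stay with the standard library, one possible approach (a sketch, not the method used in this answer) is to build a child-to-parent map and remove each matching element from its parent. The miniature document and the placeholder namespace URI below are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the namespace URI is a placeholder
XML = """<main xmlns:MadCap="http://dummy.com">
  <div MadCap:conditions="version-number.v1-6,state.deleted-in-vers">
    <p>old <code>text</code></p>
  </div>
  <div><p>kept</p></div>
</main>"""

ns = {"MadCap": "http://dummy.com"}
root = ET.fromstring(XML)

# Reproduce the failure: the first item from iter() is the element itself,
# which is not one of its own children, so remove() raises ValueError.
target = root.find(".//*[@MadCap:conditions]", ns)
try:
    for node in target.iter():
        target.remove(node)
    error = None
except ValueError as exc:
    error = str(exc)
print("error:", error)

# ElementTree has no getparent(), so build a child -> parent map instead
parent_of = {child: parent for parent in root.iter() for child in parent}

for elem in root.findall(".//*[@MadCap:conditions]", ns):
    # Removing elem from its parent drops it together with all of its content
    parent_of[elem].remove(elem)

remaining = [p.text for p in root.iter("p")]
print(remaining)
```

Note that this plain removal also discards any tail text that follows the removed element.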
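Edit
In my previous solution I used it.getparent().remove(it) to remove the element in question.
But then I found a flaw: if the source XML contains "mixed content", the "tail" text that follows the removed element is removed as well (and it should not be).
To prevent this, I added the remove_element function, which removes only the element itself, and I call it instead of the previous it.getparent().remove(it).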
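The tail problem can be seen in a small, self-contained sketch. It uses the standard-library xml.etree.ElementTree, which follows the same tail model as lxml, so the parent has to be passed in explicitly; the helper name remove_keep_tail and the sample string are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def remove_keep_tail(parent, el):
    """Remove el from parent but keep the text that follows it (its tail)."""
    if el.tail:
        siblings = list(parent)
        idx = siblings.index(el)
        if idx > 0:
            # Append the tail to the tail of the previous sibling
            prev = siblings[idx - 1]
            prev.tail = (prev.tail or "") + el.tail
        else:
            # No previous sibling: append the tail to the parent's text
            parent.text = (parent.text or "") + el.tail
    parent.remove(el)

SRC = "<p>before <b>gone</b> after</p>"

# Plain remove(): the tail " after" disappears together with <b>
p1 = ET.fromstring(SRC)
p1.remove(p1.find("b"))
plain = ET.tostring(p1, encoding="unicode")

# Tail-preserving removal: only the <b> element itself is dropped
p2 = ET.fromstring(SRC)
remove_keep_tail(p2, p2.find("b"))
kept = ET.tostring(p2, encoding="unicode")

print(plain)  # <p>before </p>
print(kept)   # <p>before  after</p>
```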
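Explanation following a question in the comments
attrTxt is built from the contents of the attr dictionary (the attributes of the current element).
This snippet effectively prints that dictionary without the braces.
It is used only for tracing and nowhere else.
dct, on the other hand, plays a more important role.
Its source is cond, which holds the content of the conditions attribute (of the current element), e.g. state.new-in-vers,version-number.v1-6.
This code:
- splits the content on commas.
- splits each of the resulting parts on the dot.
- creates a dictionary from these pairs.
Then vn receives the version number (v1-6) and st the state (new-in-vers).
This is the key information embedded here.
Since these two fragments can occur in either order, a single attribute-equality XPath expression cannot match all possible cases.
But once you check the variables above, it is clear whether the element should be removed or not.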
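The parsing step described above can be tried out on its own, without any XML. The sketch below is one way to write it; the function names parse_conditions and should_delete and the default version "v1-6" are chosen for illustration and do not appear in the code above.

```python
def parse_conditions(cond):
    """Turn 'key.value,key.value' into a dict, whatever the order of the parts."""
    result = {}
    for part in cond.split(","):
        part = part.strip()
        if not part:
            continue
        # Split on the first dot only: 'version-number.v1-6' -> ('version-number', 'v1-6')
        key, _, value = part.partition(".")
        result[key] = value
    return result

def should_delete(cond, version="v1-6"):
    """True if the conditions mark the element as deleted in the given version."""
    parsed = parse_conditions(cond)
    return (parsed.get("version-number") == version
            and parsed.get("state", "").startswith("deleted"))

a = parse_conditions("state.new-in-vers,version-number.v1-6")
print(a)

# Both orderings of the same two conditions give the same decision
print(should_delete("version-number.v1-6,state.deleted-in-vers"))
print(should_delete("state.deleted-in-vers,version-number.v1-6"))
print(should_delete("state.new-in-vers,version-number.v1-6"))
```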