Remove XML Parent Elements Based on Condition of Child Element - Python

Question

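I am trying to remove parent XML elements based on the text of specific child elements that contain the value "nan". The input XML uses namespaces, which makes this trickier than expected. I can remove selected child elements individually, but not their associated/adjacent parent elements. So far I can only remove the "nan" values associated with gam:String elements, but I want to remove every child element whose text is "nan" together with its associated parent elements.

Below are the script I am using and the input and (desired) output XML. Any help is greatly appreciated!

Script: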

from lxml import etree
import os

# Raw string so the backslashes are not treated as escape sequences
path = r"C:\users\mdl518\Desktop"

### Removing "nan" values
tree = etree.parse(os.path.join(path, "metadata_info.xml"))

# The namespace URI here must match the xmlns:gam declaration in the input file
for elem in tree.findall('.//{http://standards.iso.org/iso/19115/-3/gam/1.0}String'):
    if elem.text == 'nan':
        parent = elem.getparent()
        parent.remove(elem)  # removes only the gam:String child, not its parent

with open(".//metadata_output.xml", "wb") as f:
    f.write(etree.tostring(tree, xml_declaration=True, encoding='utf-8'))  ## Removes elements with "nan" values

Input XML:

<?xml version='1.0' encoding='utf-8'?>
<nas:metadata xmlns:nas="http://www.arcgis.com/schema/nas/base"
xmlns:mcc="http://standards.org/iso/19115/-3/mcc/1.0"
xmlns:mdl="http://standards.org/iso/19115/-3/mdl/1.0"
xmlns:mnl="http://standards.org/iso/19115/-3/mnl/1.0"
xmlns:lan="http://standards.org/iso/19115/-3/lan/1.0"
xmlns:lis="http://standards.org/iso/19115/-3/lis/1.0"
xmlns:gam="http://standards.org/iso/19115/-3/gam/1.0">
  <mdl:metadataIdentifier>
    <mcc:MD_Identifier>
      <mnl:name>
        <mnl:type>
          <gam:String>The Metadata File</gam:String>
        </mnl:type>
        <mnl:description>
          <mcc:listing codeList="http://arcgis.com/codelist/ScopeCode" codeListValue="dataset"></mcc:listing>
        </mnl:description>
      </mnl:name>
      <mnl:address>
        <mnl:defaultLocale>
          <lan:location>nan</lan:location>
        </mnl:defaultLocale>
      </mnl:address>
      <lan:language>
        <lan:type>
          <lis:name>English</lis:name>
        </lan:type>
      </lan:language>
    </mcc:MD_Identifier>
    <mcc:contactInfo>
      <mdl:POC>
        <mnl:name>
          <lis:person>Tom</lis:person>
        </mnl:name>
        <mnl:age>
          <gam:String>nan</gam:String>
        </mnl:age>
        <mnl:status>
          <lis:employment>nan</lis:employment>
        </mnl:status>
      </mdl:POC>
    </mcc:contactInfo>
  </mdl:metadataIdentifier>
</nas:metadata>

Output XML:

<?xml version='1.0' encoding='utf-8'?>
<nas:metadata xmlns:nas="http://www.arcgis.com/schema/nas/base"
xmlns:mcc="http://standards.org/iso/19115/-3/mcc/1.0"
xmlns:mdl="http://standards.org/iso/19115/-3/mdl/1.0"
xmlns:mnl="http://standards.org/iso/19115/-3/mnl/1.0"
xmlns:lan="http://standards.org/iso/19115/-3/lan/1.0"
xmlns:lis="http://standards.org/iso/19115/-3/lis/1.0"
xmlns:gam="http://standards.org/iso/19115/-3/gam/1.0">
  <mdl:metadataIdentifier>
    <mcc:MD_Identifier>
      <mnl:name>
        <mnl:type>
          <gam:String>The Metadata File</gam:String>
        </mnl:type>
        <mnl:description>
          <mcc:listing codeList="http://arcgis.com/codelist/ScopeCode" codeListValue="dataset"></mcc:listing>
        </mnl:description>
      </mnl:name>
      <lan:language>
        <lan:type>
          <lis:name>English</lis:name>
        </lan:type>
      </lan:language>
    </mcc:MD_Identifier>
    <mcc:contactInfo>
      <mdl:POC>
        <mnl:name>
          <lis:person>Tom</lis:person>
        </mnl:name>
      </mdl:POC>
    </mcc:contactInfo>
  </mdl:metadataIdentifier>
</nas:metadata>

Answer 1

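This has to be done in two stages: first remove every node whose text is "nan", then go over the empty nodes left behind by the first step and remove those as well: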

# Step 1 - remove the nan nodes
for n in tree.xpath('//*[.="nan"]'):
    n.getparent().remove(n)

# Step 2 - select the empty nodes and remove them as well
empty = tree.xpath('//*[not(normalize-space())]')

for emp in empty:
    try:
        emp.getparent().remove(emp)
    # Step 1 leaves one nested empty node; this step selects both the outer and the inner node, hence the try/except
    except (AttributeError, ValueError):
        continue
print(etree.tostring(tree).decode())

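This should get you close to the desired output. One caveat: the XPath in step 2 matches every element without text, so it also selects mcc:listing (which carries only attributes) and its parent mnl:description. Exclude them from the selection if they must be kept.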
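
As a complement to the two-step approach, here is a minimal sketch that removes each "nan" leaf and then walks up only through that leaf's own ancestors, dropping any that are left with no children, no attributes, and no text. It assumes lxml; the sample document, its urn:example namespaces, and the helper names is_hollow and prune_nan are hypothetical and not taken from the question.

```python
from lxml import etree

# Hypothetical, trimmed-down sample; the namespace URIs are placeholders.
SAMPLE = b"""<root xmlns:a="urn:example:a" xmlns:b="urn:example:b">
  <a:keep><b:flag code="x"/></a:keep>
  <a:address><a:locale><b:location>nan</b:location></a:locale></a:address>
  <a:person>
    <a:name><b:text>Tom</b:text></a:name>
    <a:age><b:text>nan</b:text></a:age>
  </a:person>
</root>"""


def is_hollow(elem):
    """True if elem has no children, no attributes, and no non-blank text."""
    return len(elem) == 0 and not elem.attrib and not (elem.text or "").strip()


def prune_nan(tree, marker="nan"):
    """Remove leaf elements whose text equals marker, then ancestors left hollow."""
    root = tree.getroot()
    # The XPath tests the string value, so no namespace prefixes are needed.
    leaves = tree.xpath('//*[not(*) and normalize-space(.)=$m]', m=marker)
    for leaf in leaves:
        node = leaf
        while node is not root:
            parent = node.getparent()
            if parent is None:
                break
            parent.remove(node)
            # Stop climbing as soon as the parent still holds something.
            if not is_hollow(parent):
                break
            node = parent
    return tree


tree = prune_nan(etree.ElementTree(etree.fromstring(SAMPLE)))
print(etree.tostring(tree, pretty_print=True).decode())
```

Because only ancestors of removed elements are ever considered, text-less elements that carry attributes (like mcc:listing in the question) are left untouched, and the root element is never removed.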
