Python: Build the different paths/trees from an XML file

Question

Here is an example of an XML file:

<?xml version="1.0" encoding="utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
          <REVISION>001</REVISION>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
          <requestKey>R3278458</requestKey>
          <SubmittedBy>EN4871</SubmittedBy>
          <submittedOn>2015/01/06 4:20:11 PM</submittedOn>
          <LandIndex>
            <agreementdetail>
              <agreementid>001       4860</agreementid>
              <agreementtype>NATURAL GAS</agreementtype>
              <currentstatus>
                <status>ACTIVE</status>
                <statuseffectivedate>1965/02/18</statuseffectivedate>
                <termdate>1965/02/18</termdate>
              </currentstatus>
              <designatedrepresentative>
              </designatedrepresentative>
            </agreementdetail>
          </LandIndex>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

I want to store in a list every distinct path in my XML file that leads to an element containing text. So I want something like this:

['Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status', 'Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN', ...]

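I tried some code, but it doesn't work. When the traversal moves to a different branch partway down the tree, I don't see how to pick up only the last element of a branch, or how to restart each path from the root (i.e. Envelope/Body/ADD_LandIndex_001/DATAAREA...).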

import xml.etree.ElementTree as et
import os
import pandas as pd
from re import search

filename = 'file_try.xml'
element_tree = et.parse(filename)
root = element_tree.getroot()
namespace = "{http://schemas.xmlsoap.org/soap/envelope/}"


def remove_namespace(string,namespace) :
    
    if search(namespace, string) :
        new_string = string.replace(namespace,'')
    else : 
        new_string= string
    return new_string

dico = {}
title = root.tag
print(title)

for element in root.findall('.//') :
    #print(element)
    if len(list(element)) > 0 :
        #print('True ') 
        title= remove_namespace(title + '/' + element.tag, namespace)
        print(title+ '\n')

    else :
        
        title = root.tag

Can anyone help me?

Thanks

Answer 1

You can adapt this to your actual code, but basically it should look like this:

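from lxml import etree

soap = """[your xml above]"""
# Encode to bytes: lxml rejects str input that carries an encoding declaration
root = etree.XML(soap.encode())
tree = etree.ElementTree(root)
# '//text()' selects every text node in the document
for target in root.xpath('//text()'):
    # Skip whitespace-only nodes (indentation between tags)
    if target.strip():
        # Path of the element holding the text, with the namespace prefix removed
        print(tree.getpath(target.getparent()).replace('SOAP-ENV:', ''))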

Output:

/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status
/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN
/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/SubmittedBy
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/submittedOn
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementid
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementtype
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/status
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/statuseffectivedate
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/termdate
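
Since the question uses xml.etree.ElementTree and asks for the paths in a list, here is a rough sketch of the same idea with only the standard library. It assumes a recursive walk that carries the parent path along, which is one way around the "restart the path from the root" problem. The names SAMPLE, local_name and text_paths are made up for this illustration, and SAMPLE is a shortened copy of the XML from the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (illustration only)
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
          <designatedrepresentative>
          </designatedrepresentative>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""


def local_name(tag):
    """Drop a leading '{namespace-uri}' from an ElementTree tag, if present."""
    return tag.rsplit('}', 1)[-1]


def text_paths(element, parent_path=''):
    """Return the paths of all elements with non-blank text, in document order."""
    name = local_name(element.tag)
    path = f'{parent_path}/{name}' if parent_path else name
    paths = []
    # Whitespace-only text (indentation between tags) does not count as text
    if element.text and element.text.strip():
        paths.append(path)
    # Each child receives the full path of its parent, so no state is shared
    # between branches and nothing has to be reset when a branch ends
    for child in element:
        paths.extend(text_paths(child, path))
    return paths


root = ET.fromstring(SAMPLE.encode('utf-8'))
# dict.fromkeys drops duplicate paths while keeping first-seen order
paths = list(dict.fromkeys(text_paths(root)))
print(paths)
```

Unlike lxml's getpath, this sketch adds no positional index such as [2] for repeated sibling tags, so repeated elements collapse into a single path. That seems to match the "distinct paths" in the question, but it loses information if you need to address each element individually.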
