Python: Build the different paths/trees from an XML file
Here is an example of an XML file:
<?xml version="1.0" encoding="utf-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
<SOAP-ENV:Header />
<SOAP-ENV:Body>
<ADD_LandIndex_001>
<CNTROLAREA>
<BSR>
<status>ADD</status>
<NOUN>LandIndex</NOUN>
<REVISION>001</REVISION>
</BSR>
</CNTROLAREA>
<DATAAREA>
<LandIndex>
<reportId>AMI100031</reportId>
<requestKey>R3278458</requestKey>
<SubmittedBy>EN4871</SubmittedBy>
<submittedOn>2015/01/06 4:20:11 PM</submittedOn>
<LandIndex>
<agreementdetail>
<agreementid>001 4860</agreementid>
<agreementtype>NATURAL GAS</agreementtype>
<currentstatus>
<status>ACTIVE</status>
<statuseffectivedate>1965/02/18</statuseffectivedate>
<termdate>1965/02/18</termdate>
</currentstatus>
<designatedrepresentative>
</designatedrepresentative>
</agreementdetail>
</LandIndex>
</LandIndex>
</DATAAREA>
</ADD_LandIndex_001>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
I want to store in a list all the distinct paths in my XML file that lead to an element containing text. So I want something like this:
['Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status', 'Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN', ...]
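I tried some code that doesn't work. When I move to a different node partway down the tree, I can't see how to pick up just the last element of a branch, or how to restart each path from the root (i.e. Envelope/Body/ADD_LandIndex_001/DATAAREA...).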
import xml.etree.ElementTree as et
import os
import pandas as pd
from re import search

filename = 'file_try.xml'
element_tree = et.parse(filename)
root = element_tree.getroot()
namespace = "{http://schemas.xmlsoap.org/soap/envelope/}"

def remove_namespace(string,namespace) :
    if search(namespace, string) :
        new_string = string.replace(namespace,'')
    else :
        new_string= string
    return new_string

dico = {}
title = root.tag
print(title)
for element in root.findall('.//') :
    #print(element)
    if len(list(element)) > 0 :
        #print('True ')
        title= remove_namespace(title + '/' + element.tag, namespace)
        print(title+ '\n')
    else :
        title = root.tag
Can anyone help me?
Thanks.
You can adapt this to your actual code, but basically it should look like this:
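from lxml import etree

soap = """[your xml above]"""
# etree.XML() wants bytes when the document carries an encoding declaration
root = etree.XML(soap.encode())
tree = etree.ElementTree(root)
# '//text()' selects every text node in the document
for target in root.xpath('//text()'):
    # skip text nodes that contain only whitespace
    if len(target.strip())>0:
        # getpath() gives the XPath of the owning element; drop the namespace prefix
        print(tree.getpath(target.getparent()).replace('SOAP-ENV:',''))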
Output:
/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status
/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN
/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/SubmittedBy
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/submittedOn
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementid
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementtype
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/status
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/statuseffectivedate
/Envelope/Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/termdate
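If you would rather stay with the standard library's xml.etree.ElementTree (which has no getpath()) and collect the paths in a list instead of printing them, a recursive walk is one option. This is only a sketch, not the method used above: text_paths and local_name are hypothetical helper names, and the sample is a shortened copy of the XML from the question.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a leading '{namespace-uri}' from an ElementTree tag (hypothetical helper)."""
    return tag.rsplit('}', 1)[-1]


def text_paths(element, prefix=''):
    """Return the paths of all elements whose own text is non-blank (hypothetical helper)."""
    name = local_name(element.tag)
    path = f"{prefix}/{name}" if prefix else name
    paths = []
    # Record this element only if it directly holds non-whitespace text.
    if element.text and element.text.strip():
        paths.append(path)
    # Each child restarts from the full path of its parent.
    for child in element:
        paths.extend(text_paths(child, path))
    return paths


# Shortened sample for illustration; with a real file use ET.parse(filename).getroot().
sample = """
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
          <designatedrepresentative>
          </designatedrepresentative>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
"""

paths = text_paths(ET.fromstring(sample.strip()))
# dict.fromkeys() removes repeated paths while keeping document order.
unique_paths = list(dict.fromkeys(paths))
for p in unique_paths:
    print(p)
```

Passing the parent's path down as prefix is what lets every branch start again from Envelope, and whitespace-only elements such as designatedrepresentative are skipped by the strip() check.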