列出 xml 文件中的路径和数据以存储在数据框中

Question

这是一个 xml 文件：

<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Header />
  <SOAP-ENV:Body>
    <ADD_LandIndex_001>
      <CNTROLAREA>
        <BSR>
          <status>ADD</status>
          <NOUN>LandIndex</NOUN>
          <REVISION>001</REVISION>
        </BSR>
      </CNTROLAREA>
      <DATAAREA>
        <LandIndex>
          <reportId>AMI100031</reportId>
          <requestKey>R3278458</requestKey>
          <SubmittedBy>EN4871</SubmittedBy>
          <submittedOn>2015/01/06 4:20:11 PM</submittedOn>
          <LandIndex>
            <agreementdetail>
              <agreementid>001       4860</agreementid>
              <agreementtype>NATURAL GAS</agreementtype>
              <currentstatus>
                <status>ACTIVE</status>
                <statuseffectivedate>1965/02/18</statuseffectivedate>
                <termdate>1965/02/18</termdate>
              </currentstatus>
              <designatedrepresentative></designatedrepresentative>
            </agreementdetail>
          </LandIndex>
        </LandIndex>
      </DATAAREA>
    </ADD_LandIndex_001>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

我想保存在数据框中：1) 路径和 2) 与路径对应的元素的文本。为了做这个数据框，我想做一个字典来存储两者。所以首先我想得到一个这样的字典（我有与相应路径关联的值）。

{'/Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/status': 'ADD', /Envelope/Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN: 'LandIndex',...}

就像那样，我只需要使用函数 df=pd.DataFrame() 创建一个数据框，我可以将其导出到 excel sheet 中。我已经有了路径列表的一部分，但是我无法从这些路径中获取文本。我不明白 lxml 库是如何工作的。我尝试了 .text() 和 text_content() 函数，但出现错误。

这是我的代码：

from lxml import etree
import xml.etree.ElementTree as et
from bs4 import BeautifulSoup
import pandas as pd

filename = 'file_try.xml'

with open(filename, 'r') as f: 
    soap = f.read() 

root = etree.XML(soap.encode())    
tree = etree.ElementTree(root)


mylist_path = []
mylist_data = []
mydico = {}
mylist = []

for target in root.xpath('//text()'):

    if len(target.strip())>0:       
        path = tree.getpath(target.getparent()).replace('SOAP-ENV:','')
        mydico[path] = target.text()

        mylist_path.append(path)
        mylist_data.append(target.text())
        mylist.append(mydico)

df=pd.DataFrame(mylist)
df.to_excel("data_xml.xlsx") 

print(mylist_path)
print(mylist_data)

感谢您的帮助！

Answer 1

下面是一个遍历XML树的例子。为此，将需要递归函数。幸运的是 lxml 为此提供了所有功能。

from lxml import etree as et
from collections import defaultdict
import pandas as pd

d = defaultdict(list)
root = et.fromstring(xml)
tree = et.ElementTree(root)

def traverse(el, d):
    if len(list(el)) > 0:
        for child in el:
            traverse(child, d)
    else:
      if el.text is not None:
        d[tree.getelementpath(el)].append(el.text)

traverse(root, d)

df = pd.DataFrame(d)

df.head()

输出：

{
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/status': ['ADD'],
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/NOUN': ['LandIndex'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/CNTROLAREA/BSR/REVISION': ['001'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/reportId': ['AMI100031'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/requestKey': ['R3278458'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/SubmittedBy': ['EN4871'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/submittedOn': ['2015/01/06 4:20:11 PM'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementid': ['001       4860'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/agreementtype': ['NATURAL GAS'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/status': ['ACTIVE'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/statuseffectivedate': ['1965/02/18'], 
    '{http://schemas.xmlsoap.org/soap/envelope/}Body/ADD_LandIndex_001/DATAAREA/LandIndex/LandIndex/agreementdetail/currentstatus/termdate': ['1965/02/18']
}

请注意，字典 d 包含列表作为值。那是因为元素可以在 XML 中重复，否则最后一个值将覆盖前一个。如果您的特定 XML 不是这种情况，请使用常规 dict 而不是 defaultdict d = {} 并使用赋值而不是附加 d[tree.getelementpath(el)] = el.text.

读取文件时相同：

d = defaultdict(list)

with open('output.xml', 'rb') as file:
    root = et.parse(file).getroot()
    
tree = et.ElementTree(root)

def traverse(el, d):
    if len(list(el)) > 0:
        for child in el:
            traverse(child, d)
    else:
      if el.text is not None:
        d[tree.getelementpath(el)].append(el.text)

traverse(root, d)

df = pd.DataFrame(d)

print(d)

列出 xml 文件中的路径和数据以存储在数据框中

Listing path and data from a xml file to store in a dataframe

python

xml

xpath

elementtree

dataframe