I'm trying to load the contents of an XML file into a pandas DataFrame, with an element's attribute value in one column alongside the text content of its child elements

I have an XML file as follows:

<BIBDS>
 <METADATA-TABLE Resource="ACTIVE">
    <COLUMNS>   Metadata    SystemID    Standard </COLUMNS>
    <DATA>  ydfgfbcs12  dq_EMAIL    mail    </DATA>
    <DATA>  asiuertb45  ss_FIRST_NAME   FirstName   </DATA>
    <DATA>  pojkeu12er  fg_LAST_NAME    LastName    </DATA>
 </METADATA-TABLE>
 <METADATA-TABLE Resource="OFFICIAL">
    <COLUMNS>   Metadata    SystemID    Standard </COLUMNS>
    <DATA>  thsgdqw9uq  dk_EMAIL    mail    </DATA>
    <DATA>  okjnsdqw12  kl_FIRST_NAME   FirstName   </DATA>
    <DATA>  tgetiq34er  ll_LAST_NAME    LastName    </DATA>
 </METADATA-TABLE>
</BIBDS>

This is the code I have come up with so far:

import xml.etree.ElementTree as et
import pandas as pd

tree = et.parse('filepath')
root = tree.getroot()

# Column names: taken from the COLUMNS element of the first METADATA-TABLE only
column_metadata_table = []
for mt in root.findall('METADATA-TABLE'):
  columntable = mt.find('COLUMNS').text
  column_metadata_table.append(columntable.split('\t'))
  break

# Rows: one per DATA element, across all METADATA-TABLE elements
data_metadata_table = []
for mt in tree.iter('METADATA-TABLE'):
  datatable = mt.findall("DATA")
  for dat in datatable:
    data_metadata_table.append(dat.text.split('\t'))
df_metadata_table = pd.DataFrame(data_metadata_table,columns = column_metadata_table)

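This gives me output with the column names taken from the COLUMNS tag and the data taken from the DATA tags, but I need one more column, named Resource, that holds the value of the Resource attribute.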

Expected output as a DataFrame:

Metadata     SystemID         Standard      Resource
ydfgfbcs12   dq_EMAIL         mail          ACTIVE
asiuertb45   ss_FIRST_NAME    FirstName     ACTIVE
pojkeu12er   fg_LAST_NAME     LastName      ACTIVE
thsgdqw9uq   dk_EMAIL         mail          OFFICIAL
okjnsdqw12   kl_FIRST_NAME    FirstName     OFFICIAL
tgetiq34er   ll_LAST_NAME     LastName      OFFICIAL

You can do this efficiently using BeautifulSoup:

from bs4 import BeautifulSoup
import pandas as pd
import re

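with open('filepath') as f:
    # An HTML parser lowercases tag and attribute names, hence the lowercase lookups below
    soup = BeautifulSoup(f.read(), 'html.parser')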

whitespace_rx = re.compile(r'\s+')
df = pd.concat([
    pd.DataFrame(
        data=[whitespace_rx.split(row.text.strip()) for row in mt.find_all('data')],
        columns=whitespace_rx.split(mt.find('columns').text.strip())
    ).assign(Resource=mt.get('resource'))
    for mt in soup.find_all('metadata-table')
]).reset_index(drop=True)

Output:

>>> df
     Metadata       SystemID   Standard  Resource
0  ydfgfbcs12       dq_EMAIL       mail    ACTIVE
1  asiuertb45  ss_FIRST_NAME  FirstName    ACTIVE
2  pojkeu12er   fg_LAST_NAME   LastName    ACTIVE
3  thsgdqw9uq       dk_EMAIL       mail  OFFICIAL
4  okjnsdqw12  kl_FIRST_NAME  FirstName  OFFICIAL
5  tgetiq34er   ll_LAST_NAME   LastName  OFFICIAL

The code below seems to work (note that it does not use any external library for parsing):

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<BIBDS>
 <METADATA-TABLE Resource="ACTIVE">
    <COLUMNS>   Metadata    SystemID    Standard </COLUMNS>
    <DATA>  ydfgfbcs12  dq_EMAIL    mail    </DATA>
    <DATA>  asiuertb45  ss_FIRST_NAME   FirstName   </DATA>
    <DATA>  pojkeu12er  fg_LAST_NAME    LastName    </DATA>
 </METADATA-TABLE>
 <METADATA-TABLE Resource="OFFICIAL">
    <COLUMNS>   Metadata    SystemID    Standard </COLUMNS>
    <DATA>  thsgdqw9uq  dk_EMAIL    mail    </DATA>
    <DATA>  okjnsdqw12  kl_FIRST_NAME   FirstName   </DATA>
    <DATA>  tgetiq34er  ll_LAST_NAME    LastName    </DATA>
 </METADATA-TABLE>
</BIBDS>'''

LOOKUP = {0:'Metadata',1:'SystemID',2:'Standard'}
root = ET.fromstring(xml)
df_data = []
for meta in root.findall('.//METADATA-TABLE'):
  res = meta.attrib['Resource']
  for data in meta.findall('DATA'):
    entry = {'Resource':res}
    elements = data.text.split()
    for idx,element in enumerate(elements):
      entry[LOOKUP[idx]] = element
    df_data.append(entry)
df = pd.DataFrame(df_data)
print(df)

Output:

   Resource    Metadata       SystemID   Standard
0    ACTIVE  ydfgfbcs12       dq_EMAIL       mail
1    ACTIVE  asiuertb45  ss_FIRST_NAME  FirstName
2    ACTIVE  pojkeu12er   fg_LAST_NAME   LastName
3  OFFICIAL  thsgdqw9uq       dk_EMAIL       mail
4  OFFICIAL  okjnsdqw12  kl_FIRST_NAME  FirstName
5  OFFICIAL  tgetiq34er   ll_LAST_NAME   LastName  
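
The `LOOKUP` dictionary above hard-codes the three column names. As a possible variation, the names could instead be read from each table's `COLUMNS` element. The following is only a minimal sketch under that assumption: `table_rows` is a hypothetical helper name, the XML is a shortened copy of the sample from the question, and it assumes every `DATA` row has the same number of fields as its `COLUMNS` header.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question.
xml = '''<BIBDS>
 <METADATA-TABLE Resource="ACTIVE">
    <COLUMNS>   Metadata    SystemID    Standard </COLUMNS>
    <DATA>  ydfgfbcs12  dq_EMAIL    mail    </DATA>
 </METADATA-TABLE>
 <METADATA-TABLE Resource="OFFICIAL">
    <COLUMNS>   Metadata    SystemID    Standard </COLUMNS>
    <DATA>  thsgdqw9uq  dk_EMAIL    mail    </DATA>
 </METADATA-TABLE>
</BIBDS>'''


def table_rows(xml_text):
    """Return one dict per DATA element, keyed by that table's COLUMNS names.

    Hypothetical helper, not part of the original answer.
    """
    rows = []
    for table in ET.fromstring(xml_text).iter('METADATA-TABLE'):
        # split() with no argument splits on any run of whitespace
        names = table.findtext('COLUMNS', default='').split()
        for data in table.findall('DATA'):
            row = dict(zip(names, (data.text or '').split()))
            row['Resource'] = table.get('Resource')
            rows.append(row)
    return rows


rows = table_rows(xml)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should then give the columns in the order Metadata, SystemID, Standard, Resource, since the dictionaries are built in that order.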

Tweak the OP's code slightly by adding the missing column name and values to the relevant lists:

import xml.etree.ElementTree as et
import pandas as pd

tree = et.parse('tmp.xml')
root = tree.getroot()
column_metadata_table = []
for mt in root.findall('METADATA-TABLE'):
  columntable = mt.find('COLUMNS').text
  column_metadata_table = columntable.strip().split('\t')
  break

column_metadata_table.append('RESOURCE')

data_metadata_table = []
for mt in root.findall('METADATA-TABLE'):
  datatable = mt.findall("DATA")
  for dat in datatable:
    td = dat.text.strip().split('\t')
    td.append(mt.attrib['Resource'])
    data_metadata_table.append(td)
  
  #data_metadata_table

print(column_metadata_table)
print(data_metadata_table)

df_metadata_table = pd.DataFrame(data_metadata_table,columns = column_metadata_table)

print(df_metadata_table)

Result:

     Metadata       SystemID   Standard  RESOURCE
0  ydfgfbcs12       dq_EMAIL       mail    ACTIVE
1  asiuertb45  ss_FIRST_NAME  FirstName    ACTIVE
2  pojkeu12er   fg_LAST_NAME   LastName    ACTIVE
3  thsgdqw9uq       dk_EMAIL       mail  OFFICIAL
4  okjnsdqw12  kl_FIRST_NAME  FirstName  OFFICIAL
5  tgetiq34er   ll_LAST_NAME   LastName  OFFICIAL
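
One caveat with `split('\t')`: it only separates the fields when the file really uses tab characters. If the fields are separated by spaces, as they appear in the XML as posted here, the whole line comes back as a single field. A small sketch of the difference, using made-up sample strings (`tab_line` and `space_line` are illustrative names only):

```python
# The same record, once with tab separators and once with runs of spaces.
tab_line = "\tydfgfbcs12\tdq_EMAIL\tmail\t"
space_line = "  ydfgfbcs12  dq_EMAIL    mail    "

# split('\t') depends on the separator actually being a tab.
tab_fields = tab_line.strip().split('\t')      # three fields
space_fields = space_line.strip().split('\t')  # one field: the whole line

# split() with no argument splits on any run of whitespace
# and ignores leading and trailing whitespace.
tab_fields_any = tab_line.split()      # three fields
space_fields_any = space_line.split()  # three fields

print(len(tab_fields), len(space_fields))
print(len(tab_fields_any), len(space_fields_any))
```

Using `dat.text.split()` in place of `dat.text.strip().split('\t')` would therefore handle both cases, provided that no field value itself contains whitespace.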