I'm trying to get the contents of an XML file into a pandas DataFrame, with an element's attribute value in one column alongside the text content of its child elements
I have an XML file as follows:
<BIBDS>
<METADATA-TABLE Resource="ACTIVE">
<COLUMNS> Metadata SystemID Standard </COLUMNS>
<DATA> ydfgfbcs12 dq_EMAIL mail </DATA>
<DATA> asiuertb45 ss_FIRST_NAME FirstName </DATA>
<DATA> pojkeu12er fg_LAST_NAME LastName </DATA>
</METADATA-TABLE>
<METADATA-TABLE Resource="OFFICIAL">
<COLUMNS> Metadata SystemID Standard </COLUMNS>
<DATA> thsgdqw9uq dk_EMAIL mail </DATA>
<DATA> okjnsdqw12 kl_FIRST_NAME FirstName </DATA>
<DATA> tgetiq34er ll_LAST_NAME LastName </DATA>
</METADATA-TABLE>
</BIBDS>
This is the code I've come up with so far:
import xml.etree.ElementTree as et
import pandas as pd

tree = et.parse('filepath')
root = tree.getroot()

column_metadata_table = []
for mt in root.findall('METADATA-TABLE'):
    columntable = mt.find('COLUMNS').text
    column_metadata_table.append(columntable.split('\t'))
    break

data_metadata_table = []
for mt in tree.iter('METADATA-TABLE'):
    datatable = mt.findall("DATA")
    for dat in datatable:
        data_metadata_table.append(dat.text.split('\t'))

df_metadata_table = pd.DataFrame(data_metadata_table, columns=column_metadata_table)
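This gives me output with column names taken from the COLUMNS tag and data taken from the DATA tags, but I need one more column, named Resource, containing the value of the Resource attribute.
Expected output as a DataFrame: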
Metadata    SystemID       Standard   Resource
ydfgfbcs12  dq_EMAIL       mail       ACTIVE
asiuertb45  ss_FIRST_NAME  FirstName  ACTIVE
pojkeu12er  fg_LAST_NAME   LastName   ACTIVE
thsgdqw9uq  dk_EMAIL       mail       OFFICIAL
okjnsdqw12  kl_FIRST_NAME  FirstName  OFFICIAL
tgetiq34er  ll_LAST_NAME   LastName   OFFICIAL
You can do this efficiently with BeautifulSoup:
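from bs4 import BeautifulSoup
import pandas as pd
import re

with open('filepath') as f:
    # html.parser lowercases tag and attribute names, hence the lowercase lookups below
    soup = BeautifulSoup(f.read(), 'html.parser')

# Fields are separated by runs of whitespace
whitespace_rx = re.compile(r'\s+')

# Build one DataFrame per METADATA-TABLE, tag it with its Resource, then stack them
df = pd.concat([
    pd.DataFrame(
        data=[whitespace_rx.split(row.text.strip()) for row in mt.find_all('data')],
        columns=whitespace_rx.split(mt.find('columns').text.strip())
    ).assign(Resource=mt.get('resource'))
    for mt in soup.find_all('metadata-table')
]).reset_index(drop=True)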
Output:
>>> df
     Metadata       SystemID   Standard  Resource
0  ydfgfbcs12       dq_EMAIL       mail    ACTIVE
1  asiuertb45  ss_FIRST_NAME  FirstName    ACTIVE
2  pojkeu12er   fg_LAST_NAME   LastName    ACTIVE
3  thsgdqw9uq       dk_EMAIL       mail  OFFICIAL
4  okjnsdqw12  kl_FIRST_NAME  FirstName  OFFICIAL
5  tgetiq34er   ll_LAST_NAME   LastName  OFFICIAL
The code below seems to work (note that it does not use any external library for parsing):
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<BIBDS>
<METADATA-TABLE Resource="ACTIVE">
<COLUMNS> Metadata SystemID Standard </COLUMNS>
<DATA> ydfgfbcs12 dq_EMAIL mail </DATA>
<DATA> asiuertb45 ss_FIRST_NAME FirstName </DATA>
<DATA> pojkeu12er fg_LAST_NAME LastName </DATA>
</METADATA-TABLE>
<METADATA-TABLE Resource="OFFICIAL">
<COLUMNS> Metadata SystemID Standard </COLUMNS>
<DATA> thsgdqw9uq dk_EMAIL mail </DATA>
<DATA> okjnsdqw12 kl_FIRST_NAME FirstName </DATA>
<DATA> tgetiq34er ll_LAST_NAME LastName </DATA>
</METADATA-TABLE>
</BIBDS>'''

# Position of each whitespace-separated field -> column name
LOOKUP = {0: 'Metadata', 1: 'SystemID', 2: 'Standard'}

root = ET.fromstring(xml)
df_data = []
for meta in root.findall('.//METADATA-TABLE'):
    res = meta.attrib['Resource']
    for data in meta.findall('DATA'):
        # One dict per DATA row, starting with the parent table's Resource
        entry = {'Resource': res}
        elements = data.text.split()
        for idx, element in enumerate(elements):
            entry[LOOKUP[idx]] = element
        df_data.append(entry)

df = pd.DataFrame(df_data)
print(df)
Output:
   Resource    Metadata       SystemID   Standard
0    ACTIVE  ydfgfbcs12       dq_EMAIL       mail
1    ACTIVE  asiuertb45  ss_FIRST_NAME  FirstName
2    ACTIVE  pojkeu12er   fg_LAST_NAME   LastName
3  OFFICIAL  thsgdqw9uq       dk_EMAIL       mail
4  OFFICIAL  okjnsdqw12  kl_FIRST_NAME  FirstName
5  OFFICIAL  tgetiq34er   ll_LAST_NAME   LastName
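
The answer above hard-codes the column names in LOOKUP. As a rough sketch of a variant, the names could instead be read from each table's COLUMNS element. The function name rows_from_xml and the shortened XML_TEXT sample are made up for illustration and are not part of the original answer; only the standard library is used here.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample with the same structure as the question's file
XML_TEXT = """<BIBDS>
<METADATA-TABLE Resource="ACTIVE">
<COLUMNS> Metadata SystemID Standard </COLUMNS>
<DATA> ydfgfbcs12 dq_EMAIL mail </DATA>
</METADATA-TABLE>
<METADATA-TABLE Resource="OFFICIAL">
<COLUMNS> Metadata SystemID Standard </COLUMNS>
<DATA> thsgdqw9uq dk_EMAIL mail </DATA>
</METADATA-TABLE>
</BIBDS>"""


def rows_from_xml(xml_text):
    """Return one dict per DATA element, keyed by that table's COLUMNS names."""
    rows = []
    root = ET.fromstring(xml_text)
    for table in root.findall('METADATA-TABLE'):
        # Column names come from the table itself, not from a fixed mapping
        columns = table.findtext('COLUMNS', default='').split()
        for data in table.findall('DATA'):
            # Pair each whitespace-separated field with its column name
            row = dict(zip(columns, (data.text or '').split()))
            row['Resource'] = table.get('Resource')
            rows.append(row)
    return rows


rows = rows_from_xml(XML_TEXT)
print(rows[0])
```

Passing rows to pd.DataFrame(rows) should then give the columns in the order Metadata, SystemID, Standard, Resource, since each dict is filled in that order.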
A slight tweak of the OP's code, appending the missing column name and values to the relevant lists:
import xml.etree.ElementTree as et
import pandas as pd

tree = et.parse('tmp.xml')
root = tree.getroot()

# Take the column names from the first METADATA-TABLE only
column_metadata_table = []
for mt in root.findall('METADATA-TABLE'):
    columntable = mt.find('COLUMNS').text
    column_metadata_table = columntable.strip().split('\t')
    break
column_metadata_table.append('RESOURCE')

data_metadata_table = []
for mt in root.findall('METADATA-TABLE'):
    datatable = mt.findall("DATA")
    for dat in datatable:
        td = dat.text.strip().split('\t')
        # Append the parent table's Resource attribute to each row
        td.append(mt.attrib['Resource'])
        data_metadata_table.append(td)

print(column_metadata_table)
print(data_metadata_table)
df_metadata_table = pd.DataFrame(data_metadata_table, columns=column_metadata_table)
print(df_metadata_table)
Result:
     Metadata       SystemID   Standard  RESOURCE
0  ydfgfbcs12       dq_EMAIL       mail    ACTIVE
1  asiuertb45  ss_FIRST_NAME  FirstName    ACTIVE
2  pojkeu12er   fg_LAST_NAME   LastName    ACTIVE
3  thsgdqw9uq       dk_EMAIL       mail  OFFICIAL
4  okjnsdqw12  kl_FIRST_NAME  FirstName  OFFICIAL
5  tgetiq34er   ll_LAST_NAME   LastName  OFFICIAL
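
Note that split('\t') in the code above only works when the fields in the file are separated by exactly one tab. A small sketch of the difference from a bare split(), which splits on any run of whitespace; the raw string here is an invented sample, not taken from the original file:

```python
# Hypothetical field text: a tab between the first two fields, two spaces before the third
raw = '\n ydfgfbcs12\tdq_EMAIL  mail \n'

# Splitting on tabs only leaves the space-separated fields glued together
tab_fields = raw.strip().split('\t')

# split() with no argument splits on any whitespace run and ignores leading/trailing whitespace
any_ws_fields = raw.split()

print(tab_fields)     # ['ydfgfbcs12', 'dq_EMAIL  mail']
print(any_ws_fields)  # ['ydfgfbcs12', 'dq_EMAIL', 'mail']
```

If the separators in the real file are not guaranteed to be tabs, the bare split() is probably the safer choice, provided no field itself contains whitespace.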