Convert dynamic XML file to pandas DataFrame
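Thank you for your time and attention. This is my first post on Stack Overflow, so please forgive any clumsiness.
I mostly code in Python, but this is my first time parsing an XML file. I have been working on this for several weeks and I am stuck at one or more points.
My sample is: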
<record date="2017-12-01" time="10:13:40.913" id="ALARM:Ctrl">
<field name="inst">run</field>
<field name="name">run0</field>
<field name="group">toto</field>
</record>
<record date="20197-12-02" time="21:07:06.66" id="ALARM:SFC">
<field name="inst">run</field>
<field name="name">run</field>
<field name="group">tata</field>
</record>
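The number of records is dynamic, and the names of the field tags can change from record to record. Here is my code that parses this XML file into a pandas DataFrame: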
import pandas as pd
import xml.etree.ElementTree as et
import itertools

# The file has no single root element, so wrap its content in <root>...</root>
with open('Alarm.xml') as f:
    it = itertools.chain('<root>', f, '</root>')
    root = et.fromstringlist(it)

df_cols = ["date", "time", "id", "inst", "name", "group"]
rows = []

for record in root.findall('record'):
    ListDate = record.get('date')
    ListTime = record.get('time')
    ListId = record.get('id')
    # Children are read by position (getchildren() was removed in Python 3.9),
    # so this fails or mixes up columns when a field is missing
    inst = record[0].text
    name = record[1].text
    group = record[2].text
    rows.append({"date": ListDate, "time": ListTime, "id": ListId, "inst": inst,
                 "name": name, "group": group})

out_df = pd.DataFrame(rows, columns=df_cols)
print(out_df)
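However, each record may or may not be missing different fields, and in that case I would like "None" in the DataFrame. I have not found a solution so far.
Thanks again for your help.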
This is a very interesting question, thank you!
What you want can be done with lxml and XPath. I'll try to explain as I go:
from lxml import etree
import pandas as pd

records = """[your xml above]"""  # must be wrapped in a single root element, e.g. <root>...</root>
root = etree.fromstring(records)
num_recs = int(root.xpath('count(//record)'))  # count the number of records; 2 in this case
rec_grid = [[] for __ in range(num_recs)]  # initialize a list of sublists (2 in this case), each holding one record's fields
fields = ["date","time","id","system_inst", "system_name", "flags", "alias", "group", "priority", "text", "trtext", "end", "duration", "ackts", "user", "acktext", "ivtext"]
paths = root.xpath('//record')  # list of the 2 record elements
counter = 0
for path in paths:
    for fld in fields[:3]:  # the first 3 fields are attributes of <record>, unlike the other 14
        target = f'(./@{fld})'  # use an f-string to build the XPath expression
        if path.xpath(target):
            rec_grid[counter].append(path.xpath(target)[0])  # populate the current sublist with the value found
        else:
            rec_grid[counter].append('NA')
    for fld in fields[3:]:  # and now the rest of the fields, stored as <field name="..."> children
        target = f'(./field[@name="{fld}"]/text())'
        if path.xpath(target):
            rec_grid[counter].append(path.xpath(target)[0])
        else:
            rec_grid[counter].append('NA')
    counter += 1
df = pd.DataFrame(rec_grid, columns=fields)  # now that we have our lists, create a df
df
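The output is too long to reproduce here, but it looks like the image linked in your question.
You can simplify it (slightly) with a helper function like this: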
def proc_target(path, target, counter):
    if path.xpath(target):
        rec_grid[counter].append(path.xpath(target)[0])
    else:
        rec_grid[counter].append('NA')
and change the for loop to:
for path in paths:
    for fld in fields[:3]:
        target = f'(./@{fld})'
        proc_target(path, target, counter)
    for fld in fields[3:]:
        target = f'(./field[@name="{fld}"]/text())'
        proc_target(path, target, counter)
    counter += 1
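Since the field names can differ between records, the hard-coded fields list could also be collected from the document itself. The following is only a minimal sketch, assuming the records are wrapped in a <root> element; the sample data is shortened, and the names attr_names and field_names are illustrative rather than part of the answer above:

```python
from lxml import etree

# Shortened sample, wrapped in a single root element (an assumption)
xml_text = """<root>
<record date="2017-12-01" time="10:13:40.913" id="ALARM:Ctrl">
<field name="inst">run</field>
<field name="name">run0</field>
</record>
<record date="2017-12-02" time="21:07:06.66" id="ALARM:SFC">
<field name="inst">run</field>
<field name="group">tata</field>
</record>
</root>"""

root = etree.fromstring(xml_text)
records = root.xpath('//record')

# Attribute names of <record>, deduplicated while keeping first-seen order
attr_names = list(dict.fromkeys(k for rec in records for k in rec.attrib))

# Every distinct value of field/@name across all records, in document order
field_names = list(dict.fromkeys(root.xpath('//record/field/@name')))

fields = attr_names + field_names
print(fields)
```

The resulting list could then stand in for the hard-coded one in the loops above, with len(attr_names) replacing the fixed split at 3.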
I wonder whether this would work.
First:
pip install pandas_read_xml
Then:
import pandas_read_xml as pdx
df = pdx.read_xml('Alarm.xml')
You may also need:
df = pdx.flatten(df)
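Another option, sketched here under the assumption that every field element carries a name attribute and that the text has no XML declaration, is to build one dictionary per record with the standard library and let pandas align the columns. records_to_frame is a hypothetical helper name, and the second sample record is altered to show both a missing and an extra field:

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Illustrative data: the second record lacks "name" and "group"
# and carries an extra "alias" field
SAMPLE = """
<record date="2017-12-01" time="10:13:40.913" id="ALARM:Ctrl">
<field name="inst">run</field>
<field name="name">run0</field>
<field name="group">toto</field>
</record>
<record date="2017-12-02" time="21:07:06.66" id="ALARM:SFC">
<field name="inst">run</field>
<field name="alias">a1</field>
</record>
"""

def records_to_frame(xml_text):
    # Wrap the rootless fragment so it parses as one document
    root = ET.fromstring(f"<root>{xml_text}</root>")
    rows = []
    for record in root.iter("record"):
        row = dict(record.attrib)  # date, time, id
        for field in record.findall("field"):
            row[field.get("name")] = field.text  # keyed by name, not position
        rows.append(row)
    # pandas takes the union of all keys as columns; absent keys become NaN
    df = pd.DataFrame(rows)
    # Swap NaN for None, since the question asks for None in empty cells
    return df.astype(object).where(df.notna(), None)

df = records_to_frame(SAMPLE)
print(df)
```

Because each value is looked up by its name attribute rather than by position, a missing or reordered field does not shift the other columns.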