Convert dynamic XML file to pandas DataFrame

Question

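Thank you for your time and attention. This is my first post on Stack Overflow, so please forgive any clumsiness.

I mostly code in Python, but this is my first time parsing an XML file. I have been working on this for several weeks and am stuck on one or more points.

My sample is: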

<record date="2017-12-01" time="10:13:40.913" id="ALARM:Ctrl">
  <field name="inst">run</field>
  <field name="name">run0</field>
  <field name="group">toto</field>

</record>
<record date="20197-12-02" time="21:07:06.66" id="ALARM:SFC">
    <field name="inst">run</field>
    <field name="name">run</field>
    <field name="group">tata</field>

</record>

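The number of records is dynamic, and the names of the field elements can vary from record to record. Here is my code, which parses this XML file into a pandas DataFrame: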

import pandas as pd
import xml.etree.ElementTree as et
import re
import itertools


with open('Alarm.xml') as f:
    it = itertools.chain('<root>', f, '</root>')
    root = et.fromstringlist(it)

    df_cols = ["date", "time", "id", "inst","name",  'group']
    rows = []

    system_inst = []
    system_name = []
    group = []


    for record in root.findall('record'):

      ListDate = record.get('date')
      ListTime = record.get('time')
      ListId   = record.get('id')

      # Element.getchildren() was removed in Python 3.9; index the element directly
      inst = record[0].text
      name = record[1].text
      group = record[2].text


      rows.append({"date": ListDate, "time": ListTime, "id": ListId, "inst" : inst,
                  "name" : name,  "group" : group})

    out_df = pd.DataFrame(rows, columns = df_cols)
    print(out_df)

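However, any given record may or may not be missing some fields, and when a field is missing I want None in the DataFrame. So far I have not found a solution.

Thanks again for your help.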

Answer 1

This is a really interesting question - thanks! What you want can be done with lxml and XPath. I'll try to explain as we go:

from lxml import etree
import pandas as pd
records = """[your xml above]"""  # must have a single root element, e.g. wrapped in <root>...</root>

root = etree.fromstring(records)
num_recs = int(root.xpath('count(//record)')) #count the number of records; 2, in this case
rec_grid = [[] for __ in range(num_recs)] #initialize a list of sublists (2 in this case), with each sublist holding the relevant fields
fields = ["date","time","id","system_inst", "system_name", "flags", "alias", "group", "priority", "text", "trtext", "end", "duration", "ackts", "user", "acktext", "ivtext"]

paths = root.xpath('//record') #this contains a list of the 2 locations of the records
counter = 0
for path in paths:    
    for fld in fields[:3]: #the first 3 fields are in a different sub-location than the other 14             
        target = f'(./@{fld})' #using f-strings to populate the full path
        if path.xpath(target):
                rec_grid[counter].append(path.xpath(target)[0]) #we start populating our current sublist with the relevant info            
        else:
                rec_grid[counter].append('NA')

    for fld in fields[3:]:  # and now for the rest of the fields            
        target = f'(./field[@name="{fld}"]/text())'
        if path.xpath(target):
            rec_grid[counter].append(path.xpath(target)[0]) 
        else:
            rec_grid[counter].append('NA')
    counter+=1

df = pd.DataFrame(rec_grid, columns=fields) #now that we have our lists, create a df
df

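The output is too long to reproduce here, but it looks like the image linked in your question.

You can simplify it (slightly) by using a helper function like this: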

def proc_target(path,target,counter):
    if path.xpath(target):
        rec_grid[counter].append(path.xpath(target)[0])               
    else:
        rec_grid[counter].append('NA')

and change the for loop to:

for path in paths:    
    for fld in fields[:3]:              
        target = f'(./@{fld})'
        proc_target(path,target,counter)
    for fld in fields[3:]:              
        target = f'(./field[@name="{fld}"]/text())'
        proc_target(path,target,counter)        
    counter+=1
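
If the field names are not known in advance, the same lookup-with-fallback idea can be sketched with only the standard library. This is a minimal sketch, not the method used above: `records_to_rows` and `SAMPLE` are hypothetical names, the sample data is made up for illustration (the second record deliberately lacks "name" and "group" and carries an extra "priority" field), and missing values become None rather than 'NA'.

```python
import xml.etree.ElementTree as ET

# Illustrative data only: two top-level records with different field sets.
SAMPLE = """
<record date="2017-12-01" time="10:13:40.913" id="ALARM:Ctrl">
  <field name="inst">run</field>
  <field name="name">run0</field>
  <field name="group">toto</field>
</record>
<record date="2017-12-02" time="21:07:06.66" id="ALARM:SFC">
  <field name="inst">run</field>
  <field name="priority">3</field>
</record>
"""


def records_to_rows(xml_text):
    """Return (columns, rows); every row has every column, None if absent."""
    # The fragment has several top-level elements, so wrap it in one root.
    root = ET.fromstring(f"<root>{xml_text}</root>")
    columns = ["date", "time", "id"]  # attributes of <record> come first
    raw_rows = []
    for record in root.iter("record"):
        # Element.get returns None when the attribute is missing.
        row = {attr: record.get(attr) for attr in ("date", "time", "id")}
        for field in record.findall("field"):
            name = field.get("name")
            if name is None:
                continue  # skip <field> elements without a name attribute
            row[name] = field.text
            if name not in columns:
                columns.append(name)  # remember field names in order seen
        raw_rows.append(row)
    # Second pass: fill the gaps so each row has the full set of columns.
    rows = [{col: row.get(col) for col in columns} for row in raw_rows]
    return columns, rows


columns, rows = records_to_rows(SAMPLE)
```

Passing the result to `pd.DataFrame(rows, columns=columns)` should then give one row per record, with None wherever a record lacks a field.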

Answer 2

I wonder whether this would work.

First:

pip install pandas_read_xml

Then:

import pandas_read_xml as pdx

df = pdx.read_xml('Alarm.xml')

You may need:

df = pdx.flatten(df)
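
One caveat: the file in the question has several top-level `<record>` elements and no single root, which standard XML parsers reject. Whether pandas_read_xml needs a well-formed file is an assumption here, but if it does, a wrapped copy can be written first. The sketch below uses only the standard library; `write_wrapped_copy` is a hypothetical helper, the demo data is made up, and it assumes the source file has no XML declaration line.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_wrapped_copy(src_path, dst_path, root_tag="root"):
    """Copy src_path to dst_path inside a single enclosing root element.

    Assumes src_path does not start with an <?xml ...?> declaration.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"<{root_tag}>\n")
        for line in src:
            dst.write(line)
        dst.write(f"\n</{root_tag}>\n")


# Demo on a throwaway file shaped like the one in the question.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "Alarm.xml")
dst = os.path.join(workdir, "Alarm_wrapped.xml")
with open(src, "w", encoding="utf-8") as f:
    f.write('<record id="a"><field name="inst">run</field></record>\n'
            '<record id="b"><field name="inst">run</field></record>\n')

# Parsing the multi-root file directly fails with a ParseError.
try:
    ET.parse(src)
    unwrapped_ok = True
except ET.ParseError:
    unwrapped_ok = False

write_wrapped_copy(src, dst)
wrapped_root = ET.parse(dst).getroot()  # parses once a root is present
```

The wrapped path could then be tried in place of 'Alarm.xml' in the call above.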
