How do I parse attribute values from multiple XML files into one pandas DataFrame?

Question

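I have several XML files in a folder. They are system generated and appear every night; there can be anywhere from 1 to 200 of them per night. The structure is rigid and never changes. The files contain more data than the samples below show, but the timestamp attribute is enough to illustrate my problem.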

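I am writing a script (only the part I am having trouble with is shown below) that loads the data into a pandas DataFrame for further processing and then deletes the files from the folder.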

My XML files look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:13:42" more_attributes="more_values"/>
    </scans>
</scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:22:55" more_attributes="more_values"/>
    </scans>
</scan>


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:29:27" more_attributes="more_values"/>
    </scans>
</scan>


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:41:41" more_attributes="more_values"/>
    </scans>
</scan>

My script looks like this:

import os
import pandas as pd 
import xml.etree.ElementTree as et 

path = r'my\path'
df_cols = ['timestamp']
rows = []

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        xtree = et.parse(fullname)
        xroot = xtree.getroot() 
        scans = xroot.find('scans')
        scan = scans.findall('scan')
        for n in scan:
            s_timestamp = n.attrib.get('timestamp')
                
            rows.append({'timestamp': s_timestamp})                    
            out_df = pd.DataFrame(rows, columns = df_cols)

Now if I print(s_timestamp) I get:

20200909T08:13:42
20200909T08:22:55
20200909T08:29:27
20200909T08:41:41

That is what I want my DataFrame to contain once everything has been appended. But if I print(rows) I get this:

[{'timestamp': '20200909T08:13:42'}]
[{'timestamp': '20200909T08:13:42'}, {'timestamp': '20200909T08:22:55'}]
[{'timestamp': '20200909T08:13:42'}, {'timestamp': '20200909T08:22:55'}, {'timestamp': '20200909T08:29:27'}]
[{'timestamp': '20200909T08:13:42'}, {'timestamp': '20200909T08:22:55'}, {'timestamp': '20200909T08:29:27'}, {'timestamp': '20200909T08:41:41'}]

As a result, print(out_df) also gives me four results:

              timestamp
0     20200909T08:13:42
              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
2     20200909T08:29:27
              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
2     20200909T08:29:27
3     20200909T08:41:41

Whereas the result I am looking for is:

              timestamp
0     20200909T08:13:42
1     20200909T08:22:55
2     20200909T08:29:27
3     20200909T08:41:41

I know something in the loop and the append is causing this, but I don't understand why.

Answer 1

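In your script, the pd.DataFrame(...) call sits inside the inner for loop, so out_df is rebuilt from the growing rows list on every iteration, and a print placed inside the loop shows each intermediate state. Create the DataFrame once, after the loops have finished, by dedenting the last line: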

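import os
import pandas as pd
import xml.etree.ElementTree as et

path = r'my\path'
df_cols = ['timestamp']
rows = []

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        xtree = et.parse(fullname)
        xroot = xtree.getroot()
        scans = xroot.find('scans')
        scan = scans.findall('scan')
        for n in scan:
            s_timestamp = n.attrib.get('timestamp')
            # collect one dict per <scan> element
            rows.append({'timestamp': s_timestamp})

# build the DataFrame once, after every file has been read
out_df = pd.DataFrame(rows, columns=df_cols)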
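
If you want something reusable, the same idea can be wrapped in a function. This is only a minimal sketch, assuming the layout from the question (a root element with a scans child that holds the scan elements). The names scans_to_frame, folder and columns are made up for illustration and do not come from the question or from any library.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

import pandas as pd


def scans_to_frame(folder, columns=None):
    """Collect the attributes of every <scan> element found in the
    .xml files of `folder` into a single DataFrame (hypothetical helper)."""
    rows = []
    # sorted() makes the file order deterministic
    for xml_file in sorted(Path(folder).glob("*.xml")):
        root = ET.parse(xml_file).getroot()
        # "scans/scan" matches <scan> children of the root's <scans> child
        for scan in root.findall("scans/scan"):
            rows.append(dict(scan.attrib))
    # Build the DataFrame once, after all files have been read.
    # columns=None keeps every attribute; pass a list to keep only some.
    return pd.DataFrame(rows, columns=columns)
```

With the question's variables this would be called as out_df = scans_to_frame(path, columns=df_cols). Leaving columns out keeps every attribute of each scan element as its own column, which may be handy since the real files carry more attributes than the timestamp.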

Tags: python, xml, elementtree, python-3.x, pandas