How do I parse attribute values from multiple XML files into one pandas DataFrame?
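I have several XML files in a folder. They are system generated and appear every night. There can be anywhere from 1 to 200 of them per night. The structure is rigid and never changes. They contain more data than the example I provide, but the timestamp data is enough to illustrate my problem.
I am writing a script (only the part I am having trouble with is included below) that loads their data into a pandas DataFrame for further processing and then deletes the files from the folder.
My XML files look like this: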
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:13:42" more_attributes="more_values"/>
    </scans>
</scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:22:55" more_attributes="more_values"/>
    </scans>
</scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:29:27" more_attributes="more_values"/>
    </scans>
</scan>

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<scan>
    <scans>
        <scan timestamp="20200909T08:41:41" more_attributes="more_values"/>
    </scans>
</scan>
My script looks like this:
import os
import pandas as pd
import xml.etree.ElementTree as et

path = r'my\path'
df_cols = ['timestamp']
rows = []

for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        xtree = et.parse(fullname)
        xroot = xtree.getroot()
        scans = xroot.find('scans')
        scan = scans.findall('scan')
        for n in scan:
            s_timestamp = n.attrib.get('timestamp')
            rows.append({'timestamp': s_timestamp})
            out_df = pd.DataFrame(rows, columns = df_cols)
Now if I print(s_timestamp), I get:
20200909T08:13:42
20200909T08:22:55
20200909T08:29:27
20200909T08:41:41
That is what I want my DataFrame to contain after appending. But if I print(rows), I get this:
[{'timestamp': '20200909T08:13:42'}]
[{'timestamp': '20200909T08:13:42'}, {'timestamp': '20200909T08:22:55'}]
[{'timestamp': '20200909T08:13:42'}, {'timestamp': '20200909T08:22:55'}, {'timestamp': '20200909T08:29:27'}]
[{'timestamp': '20200909T08:13:42'}, {'timestamp': '20200909T08:22:55'}, {'timestamp': '20200909T08:29:27'}, {'timestamp': '20200909T08:41:41'}]
So when I print(out_df), I also get four results:
           timestamp
0  20200909T08:13:42
           timestamp
0  20200909T08:13:42
1  20200909T08:22:55
           timestamp
0  20200909T08:13:42
1  20200909T08:22:55
2  20200909T08:29:27
           timestamp
0  20200909T08:13:42
1  20200909T08:22:55
2  20200909T08:29:27
3  20200909T08:41:41
The result I am looking for is:
           timestamp
0  20200909T08:13:42
1  20200909T08:22:55
2  20200909T08:29:27
3  20200909T08:41:41
I know something in the loop and the append is causing this, but I don't understand why it happens.
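Create the df once, after the loops have finished. If the pd.DataFrame(...) call (and the print) sits inside the loop, a new frame is built from the growing rows list after every append, which is why you see four outputs. Use the following lines: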
for filename in os.listdir(path):
    if filename.endswith('.xml'):
        fullname = os.path.join(path, filename)
        xtree = et.parse(fullname)
        xroot = xtree.getroot()
        scans = xroot.find('scans')
        scan = scans.findall('scan')
        for n in scan:
            s_timestamp = n.attrib.get('timestamp')
            rows.append({'timestamp': s_timestamp})

# Build the DataFrame once, after every file has been read
out_df = pd.DataFrame(rows, columns = df_cols)
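The same idea can also be written as a small helper that only gathers the rows, which leaves the DataFrame construction as a single call made afterwards. This is just a sketch under assumptions: the function name collect_scan_rows is made up for illustration, it assumes the layout shown in the question (a root element with one scans child holding the scan elements), and it keeps every attribute rather than only the timestamp.

```python
import os
import xml.etree.ElementTree as et


def collect_scan_rows(folder):
    """Return one dict of attributes per <scan> element under <scans>.

    Hypothetical helper for illustration; not part of any library.
    """
    rows = []
    # sorted() gives a stable row order; os.listdir order is arbitrary
    for filename in sorted(os.listdir(folder)):
        if not filename.endswith('.xml'):
            continue
        xroot = et.parse(os.path.join(folder, filename)).getroot()
        # 'scans/scan' selects <scan> children of the root's <scans> child
        for n in xroot.findall('scans/scan'):
            rows.append(dict(n.attrib))
    return rows
```

Passing the returned list to pd.DataFrame(collect_scan_rows(path), columns = df_cols) once, outside any loop, should then give the single four-row frame from the question.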