Generating duplicate values (fill down?) when parsing XML into a DataFrame

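I'm having trouble parsing XML into a DataFrame with Python. When I print the values, some of them seem to 'fill down', i.e. repeat themselves (see the adres column). Does anyone know what might be going wrong?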

import xml.etree.ElementTree as et
import pandas as pd
import xmltodict
import json

tree = et.parse('20191125_DMG_PI.xml')
root = tree.getroot()

df_cols = ["status", "priref", "full_name", "detail", "adres"]
rows = []

for record in root:
    for child in record:
        s_priref = child.get('priref')
        for field in child.findall('Address'):
            s_address = field.find('address').text
            #for sub in field.findall('address.country'):
            #   s_country = sub.find('value').text if s_country is not None else None
        for field in child.findall('name'):
            s_full_name = field.find('value').text
        for field in child.findall('name.status'):
            s_status = field.find('value').text
        for field in child.findall('level_of_detail'):
            s_detail = field.find('value').text
        rows.append({"status": s_status,
                     "priref": s_priref,
                     "full_name": s_full_name,
                     "detail": s_detail,
                     "adres": s_address},)

out_df = pd.DataFrame(rows, columns=df_cols)
print(out_df)

First, findall() returns an empty list when nothing matches the search, so in a loop like

for field in child.findall("..."):
    # this is only performed if child.findall() doesn't return empty

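the body never runs when there is no match. The consequence here is that s_address, s_full_name, s_status and s_detail are not necessarily assigned on every iteration of the outer loop. Each one therefore keeps the value from the most recent iteration in which the corresponding child.findall() call returned a non-empty list.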
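To make the carry-over concrete, here is a minimal sketch that reproduces it. The XML string is a hypothetical stand-in (the structure of the real file is not shown in the question), and collect_addresses is an illustrative helper name, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; record 2 has no <Address> element.
SAMPLE = """
<root>
  <record priref="1"><Address><address>Main St 1</address></Address></record>
  <record priref="2"></record>
  <record priref="3"><Address><address>Side Rd 9</address></Address></record>
</root>
"""

def collect_addresses(xml_text):
    rows = []
    for child in ET.fromstring(xml_text):
        # findall() returns [] for record 2, so this loop body is skipped
        for field in child.findall('Address'):
            s_address = field.find('address').text
        # s_address still holds whatever an earlier iteration assigned
        rows.append((child.get('priref'), s_address))
    return rows

stale_rows = collect_addresses(SAMPLE)
print(stale_rows)
# [('1', 'Main St 1'), ('2', 'Main St 1'), ('3', 'Side Rd 9')]
```

Record 2 has no address of its own, yet it shows the address of record 1, which is the 'fill down' effect described in the question.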

A simple fix is to assign all of them an initial value on every iteration of the outer loop, i.e.

for child in record:
    s_priref = child.get('priref')
    # Reset every field so nothing carries over from the previous child
    s_address = ''
    s_full_name = ''
    s_detail = ''
    s_status = ''
    # ...
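As a quick check of the reset, the sketch below applies it to a small hypothetical XML string (the real file's structure is an assumption here, and collect_addresses is an illustrative helper name).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; record 2 has no <Address> element.
SAMPLE = """
<root>
  <record priref="1"><Address><address>Main St 1</address></Address></record>
  <record priref="2"></record>
  <record priref="3"><Address><address>Side Rd 9</address></Address></record>
</root>
"""

def collect_addresses(xml_text):
    rows = []
    for child in ET.fromstring(xml_text):
        s_address = ''  # reset on every iteration of the outer loop
        for field in child.findall('Address'):
            s_address = field.find('address').text
        rows.append((child.get('priref'), s_address))
    return rows

fixed_rows = collect_addresses(SAMPLE)
print(fixed_rows)
# [('1', 'Main St 1'), ('2', ''), ('3', 'Side Rd 9')]
```

Record 2 now gets an empty string instead of inheriting the address of record 1.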

Although it might be better (and perhaps more 'pythonic') to do it like this:

# Map each child.findall() tag to the field.find() tag nested inside it
tags = {'Address': 'address',
        'name': 'value',
        'name.status': 'value',
        'level_of_detail': 'value'}

# Output column names, in the same order as the keys of tags
ref = ["adres", "full_name", "status", "detail"]

for record in root:
    for child in record:
        # Build a fresh dict from the same keys, mapping to empty
        # strings, so no value carries over from the previous child
        s = dict.fromkeys(tags, '')
        for key, sub in tags.items():
            for field in child.findall(key):
                s[key] = field.find(sub).text
        row = dict(zip(ref, s.values()))
        row["priref"] = child.get('priref')
        rows.append(row)

This should work the same as the other approach, but makes it easier to add more keys/fields as needed.
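Another option, not part of the answer above, is Element.findtext(), which takes a path and a default and so avoids the inner loops entirely. The sketch below is only an illustration: the sample XML, the PATHS mapping and the parse_rows name are all assumptions. Note one difference in behaviour: findtext() returns the first match, whereas the loops above keep the last one.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the second record has no <Address> element.
SAMPLE = """
<root>
  <record priref="1">
    <name><value>Alice</value></name>
    <Address><address>Main St 1</address></Address>
  </record>
  <record priref="2">
    <name><value>Bob</value></name>
  </record>
</root>
"""

# Column name -> path relative to each record (paths are assumptions)
PATHS = {"full_name": "name/value", "adres": "Address/address"}

def parse_rows(xml_text):
    rows = []
    for child in ET.fromstring(xml_text):
        row = {"priref": child.get("priref")}
        for column, path in PATHS.items():
            # findtext() returns the default when the path matches nothing
            row[column] = child.findtext(path, default="")
        rows.append(row)
    return rows

parsed = parse_rows(SAMPLE)
print(parsed)
```

The resulting list of dicts can be passed to pd.DataFrame(rows, columns=...) in the same way as in the question.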