How to parse XML into a table with sibling elements?

I have XML that looks like this:

xml = """
<portfolio>
    <assets>600000</assets>
    <assetClassDetails>
        <assetClassName>Bonds</assetClassName>
        <assetAmount>100000</assetAmount>
    </assetClassDetails>
    <assetClassDetails>
        <assetClassName>Equities</assetClassName>
        <assetAmount>500000</assetAmount>
    </assetClassDetails>
    <rateOfReturn>6.3</rateOfReturn>
</portfolio>
"""

I parse each element into a table like this:

from lxml import etree
import pandas as pd

root = etree.fromstring(xml)

tag = []
text = []
parent = []
double_parent = []

for element in root.iter():
    try:
        element_parent = element.getparent().tag
    except AttributeError:
        element_parent = 'none'
    try:
        element_double_parent = element.getparent().getparent().tag
    except AttributeError:
        element_double_parent = 'none'
    tag.append(element.tag)
    text.append(element.text)
    parent.append(element_parent)
    double_parent.append(element_double_parent)

df = pd.DataFrame({'tag' : tag, 'text' : text, 'parent' : parent, 'double_parent' : double_parent})

The result is:

tag                 text      parent            double_parent
portfolio           \n        none              none
assets              600000    portfolio         none
assetClassDetails   \n        portfolio         none
assetClassName      Bonds     assetClassDetails portfolio
assetAmount         100000    assetClassDetails portfolio
assetClassDetails   \n        portfolio         none
assetClassName      Equities  assetClassDetails portfolio
assetAmount         500000    assetClassDetails portfolio
rateOfReturn        6.3       portfolio         none

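I am struggling to pivot the data so that each asset class name is paired with its amount and tied to the portfolio tag (and its direct children). How can I get the sibling tags paired in the result?

The result I want looks like this: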

type        assets  rateOfReturn    assetClassName  assetAmount
portfolio   600000  6.3             Bonds           100000
portfolio   600000  6.3             Equities        500000

Try something like this:

rows = []
columns = ['assets', 'rateOfReturn', 'assetClassName', 'assetAmount']
# One output row per <assetClassDetails> element
for entry in root.xpath('//assetClassDetails'):
    row = []
    # Portfolio-level values come from sibling elements, class-level values from children
    row.extend([entry.xpath('preceding-sibling::assets/text()')[0],
                entry.xpath('following-sibling::rateOfReturn/text()')[0],
                entry.xpath('./assetClassName/text()')[0],
                entry.xpath('./assetAmount/text()')[0]])
    rows.append(row)
pd.DataFrame(rows, columns=columns)

Output:

   assets  rateOfReturn  assetClassName  assetAmount
0  600000  6.3           Bonds           100000
1  600000  6.3           Equities        500000

Another interesting approach, using a different library:

import pandas as pd
import pandas_read_xml as pdx
df1 = pdx.read_xml(r'path\to\myfile.xml',['portfolio','assetClassDetails'])
df2 = pdx.read_xml(r'path\to\myfile.xml',['portfolio'])
pd.concat([df2[['assets','rateOfReturn']],df1], axis=1)

Output:

   assets  rateOfReturn  assetClassName  assetAmount
0  600000  6.3           Bonds           100000
1  600000  6.3           Equities        500000

The following uses no external libraries:

import xml.etree.ElementTree as ET

xml = """
<portfolio>
    <assets>600000</assets>
    <assetClassDetails>
        <assetClassName>Bonds</assetClassName>
        <assetAmount>100000</assetAmount>
    </assetClassDetails>
    <assetClassDetails>
        <assetClassName>Equities</assetClassName>
        <assetAmount>500000</assetAmount>
    </assetClassDetails>
    <rateOfReturn>6.3</rateOfReturn>
</portfolio>
"""
data = []
root = ET.fromstring(xml)
global_properties = {'assets': root.find('assets').text, 'rateOfReturn': root.find('rateOfReturn').text,
                     'type': root.tag}
for asset in root.findall('.//assetClassDetails'):
    entry = {x.tag: x.text for x in list(asset)}
    for k, v in global_properties.items():
        entry[k] = v
    data.append(entry)
for entry in data:
    print(entry)

Output:

{'assetClassName': 'Bonds', 'assetAmount': '100000', 'assets': '600000', 'rateOfReturn': '6.3', 'type': 'portfolio'}
{'assetClassName': 'Equities', 'assetAmount': '500000', 'assets': '600000', 'rateOfReturn': '6.3', 'type': 'portfolio'}
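
If you would rather not hard-code which tags are portfolio-level, here is a minimal sketch of the same idea. It assumes that every leaf child of the root (an element with no children of its own) is shared by all rows, and that the repeated groups are direct children of the root. The function name `portfolio_rows` and its `repeated_tag` parameter are made up for this illustration.

```python
import xml.etree.ElementTree as ET

xml = """
<portfolio>
    <assets>600000</assets>
    <assetClassDetails>
        <assetClassName>Bonds</assetClassName>
        <assetAmount>100000</assetAmount>
    </assetClassDetails>
    <assetClassDetails>
        <assetClassName>Equities</assetClassName>
        <assetAmount>500000</assetAmount>
    </assetClassDetails>
    <rateOfReturn>6.3</rateOfReturn>
</portfolio>
"""


def portfolio_rows(xml_text, repeated_tag="assetClassDetails"):
    """Return one dict per repeated element, merged with the root's leaf values."""
    root = ET.fromstring(xml_text)

    # Leaf children of the root are treated as values shared by every row.
    shared = {"type": root.tag}
    for child in root:
        if len(child) == 0:
            shared[child.tag] = child.text

    rows = []
    # Only direct children of the root are searched (assumption, see above).
    for group in root.findall(repeated_tag):
        row = dict(shared)
        row.update({leaf.tag: leaf.text for leaf in group})
        rows.append(row)
    return rows


rows = portfolio_rows(xml)
for row in rows:
    print(row)
```

Each dict has its keys in the order type, assets, rateOfReturn, assetClassName, assetAmount, so passing `rows` to `pd.DataFrame` should give the column layout asked for in the question.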

Another way to use the package that @JackFleeting mentioned:

import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

df = (pdx.read_xml(r'path\to\myfile.xml', ['portfolio'])
      .pipe(fully_flatten))

Flattening expands lists (sibling tags in the XML) into separate rows, and dictionaries (child tags in the XML) into separate columns.
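
To make the rows-versus-columns idea concrete, here is a rough pure-Python sketch of that kind of flattening on a nested dict. This is not how pandas_read_xml implements `fully_flatten`, and it does not reproduce the package's column naming; the `flatten` helper and the hand-written `portfolio` dict are assumptions made only for illustration.

```python
def flatten(record):
    """Expand a nested dict into a list of flat dicts, one per output row."""
    rows = [{}]
    for key, value in record.items():
        if isinstance(value, dict):
            # Child tags: the nested keys become additional columns.
            expanded = flatten(value)
        elif isinstance(value, list):
            # Sibling tags: every list item becomes its own row.
            expanded = []
            for item in value:
                if isinstance(item, dict):
                    expanded.extend(flatten(item))
                else:
                    expanded.append({key: item})
            # Keep the row when the list is empty instead of dropping it.
            expanded = expanded or [{key: None}]
        else:
            expanded = [{key: value}]
        # Combine every existing row with every expansion of this key.
        rows = [{**row, **extra} for row in rows for extra in expanded]
    return rows


# Hand-written stand-in for the parsed <portfolio> element.
portfolio = {
    "assets": "600000",
    "assetClassDetails": [
        {"assetClassName": "Bonds", "assetAmount": "100000"},
        {"assetClassName": "Equities", "assetAmount": "500000"},
    ],
    "rateOfReturn": "6.3",
}

flat_rows = flatten(portfolio)
for row in flat_rows:
    print(row)
```

The two-item list turns into two rows, and the scalar values are repeated on each of them, which is the pairing the question asks for.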