How to parse XML into a table with sibling elements?
I have XML that looks like this:
xml = """
<portfolio>
    <assets>600000</assets>
    <assetClassDetails>
        <assetClassName>Bonds</assetClassName>
        <assetAmount>100000</assetAmount>
    </assetClassDetails>
    <assetClassDetails>
        <assetClassName>Equities</assetClassName>
        <assetAmount>500000</assetAmount>
    </assetClassDetails>
    <rateOfReturn>6.3</rateOfReturn>
</portfolio>
"""
I parse every element into a table like this:
from lxml import etree
import pandas as pd

root = etree.fromstring(xml)
tag = []
text = []
parent = []
double_parent = []
for element in root.iter():
    # getparent() returns None for the root, so .tag raises AttributeError
    try:
        element_parent = element.getparent().tag
    except AttributeError:
        element_parent = 'none'
    # same idea one level further up (the grandparent)
    try:
        element_double_parent = element.getparent().getparent().tag
    except AttributeError:
        element_double_parent = 'none'
    tag.append(element.tag)
    text.append(element.text)
    parent.append(element_parent)
    double_parent.append(element_double_parent)
df = pd.DataFrame({'tag': tag, 'text': text, 'parent': parent, 'double_parent': double_parent})
The result is:
tag                text      parent             double_parent
portfolio          \n        none               none
assets             600000    portfolio          none
assetClassDetails  \n        portfolio          none
assetClassName     Bonds     assetClassDetails  portfolio
assetAmount        100000    assetClassDetails  portfolio
assetClassDetails  \n        portfolio          none
assetClassName     Equities  assetClassDetails  portfolio
assetAmount        500000    assetClassDetails  portfolio
rateOfReturn       6.3       portfolio          none
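I am struggling to work out how to pivot the data so that the asset class names and amounts are paired with each other and tied to the portfolio tag (and its direct children). How do I get the paired sibling tags in the result?
The result I want looks like this:
type       assets  rateOfReturn  assetClassName  assetAmount
portfolio  600000  6.3           Bonds           100000
portfolio  600000  6.3           Equities        500000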
Try something like this:
rows = []
columns = ['assets', 'rateOfReturn', 'assetClassName', 'assetAmount']
for entry in root.xpath('//assetClassDetails'):
    # sibling axes reach the portfolio-level values, ./ reaches the entry's own children
    row = [entry.xpath('preceding-sibling::assets/text()')[0],
           entry.xpath('following-sibling::rateOfReturn/text()')[0],
           entry.xpath('./assetClassName/text()')[0],
           entry.xpath('./assetAmount/text()')[0]]
    rows.append(row)
pd.DataFrame(rows, columns=columns)
Output:
   assets  rateOfReturn  assetClassName  assetAmount
0  600000  6.3           Bonds           100000
1  600000  6.3           Equities        500000
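One thing to watch in the XPath version: each lookup indexes with [0], which raises IndexError as soon as an element is missing. Below is a minimal sketch of a more forgiving variant, written with the standard library's xml.etree.ElementTree and findtext (lxml elements offer a findtext method as well). It also adds the type column from the desired output. The sample XML is deliberately modified so the second entry has no assetAmount, and using an empty string as the fallback is only an assumption about what you would want there.

```python
import xml.etree.ElementTree as ET

# Same structure as the question, but the second entry lacks <assetAmount>
# (modified on purpose to show the missing-element case).
xml = """
<portfolio>
    <assets>600000</assets>
    <assetClassDetails>
        <assetClassName>Bonds</assetClassName>
        <assetAmount>100000</assetAmount>
    </assetClassDetails>
    <assetClassDetails>
        <assetClassName>Equities</assetClassName>
    </assetClassDetails>
    <rateOfReturn>6.3</rateOfReturn>
</portfolio>
"""

root = ET.fromstring(xml)
columns = ['type', 'assets', 'rateOfReturn', 'assetClassName', 'assetAmount']
rows = []
for entry in root.iter('assetClassDetails'):
    rows.append([
        root.tag,                                     # the "type" column
        root.findtext('assets', default=''),          # portfolio-level value
        root.findtext('rateOfReturn', default=''),    # portfolio-level value
        entry.findtext('assetClassName', default=''),
        entry.findtext('assetAmount', default=''),    # '' when the tag is absent
    ])

for row in rows:
    print(row)
```

The rows and columns lists can be passed to pd.DataFrame(rows, columns=columns) exactly as in the answer above.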
Another interesting approach, using a different library:
import pandas as pd
import pandas_read_xml as pdx

df1 = pdx.read_xml(r'path\to\myfile.xml', ['portfolio', 'assetClassDetails'])
df2 = pdx.read_xml(r'path\to\myfile.xml', ['portfolio'])
pd.concat([df2[['assets', 'rateOfReturn']], df1], axis=1)
Output:
   assets  rateOfReturn  assetClassName  assetAmount
0  600000  6.3           Bonds           100000
1  600000  6.3           Equities        500000
The following works without any external library:
import xml.etree.ElementTree as ET

xml = """
<portfolio>
    <assets>600000</assets>
    <assetClassDetails>
        <assetClassName>Bonds</assetClassName>
        <assetAmount>100000</assetAmount>
    </assetClassDetails>
    <assetClassDetails>
        <assetClassName>Equities</assetClassName>
        <assetAmount>500000</assetAmount>
    </assetClassDetails>
    <rateOfReturn>6.3</rateOfReturn>
</portfolio>
"""

data = []
root = ET.fromstring(xml)
# values that belong to the portfolio as a whole and repeat on every row
global_properties = {'assets': root.find('assets').text,
                     'rateOfReturn': root.find('rateOfReturn').text,
                     'type': root.tag}
for asset in root.findall('.//assetClassDetails'):
    # one dict per assetClassDetails, built from its child tags
    entry = {x.tag: x.text for x in asset}
    entry.update(global_properties)
    data.append(entry)
for entry in data:
    print(entry)
Output:
{'assetClassName': 'Bonds', 'assetAmount': '100000', 'assets': '600000', 'rateOfReturn': '6.3', 'type': 'portfolio'}
{'assetClassName': 'Equities', 'assetAmount': '500000', 'assets': '600000', 'rateOfReturn': '6.3', 'type': 'portfolio'}
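The loop above names assets and rateOfReturn explicitly. If you would rather not hard-code the tag names, here is a rough sketch of a generalization: leaf children of the root are treated as shared values, and children that have sub-elements are treated as one row each. The helper name rows_from_root is made up for this example, and the leaf-versus-group rule is an assumption that fits this XML but may not fit deeper nesting.

```python
import xml.etree.ElementTree as ET

def rows_from_root(root):
    """Pair each nested child of root with root's leaf children (hypothetical helper)."""
    shared = {'type': root.tag}
    groups = []
    for child in root:
        if len(child):               # has sub-elements: becomes a row
            groups.append(child)
        else:                        # leaf: value shared by every row
            shared[child.tag] = child.text
    # Merge the shared values with each group's own child tags
    return [{**shared, **{leaf.tag: leaf.text for leaf in group}}
            for group in groups]

xml = """
<portfolio>
    <assets>600000</assets>
    <assetClassDetails>
        <assetClassName>Bonds</assetClassName>
        <assetAmount>100000</assetAmount>
    </assetClassDetails>
    <assetClassDetails>
        <assetClassName>Equities</assetClassName>
        <assetAmount>500000</assetAmount>
    </assetClassDetails>
    <rateOfReturn>6.3</rateOfReturn>
</portfolio>
"""

data = rows_from_root(ET.fromstring(xml))
for entry in data:
    print(entry)
```

Since the result is a list of dicts, pd.DataFrame(data) should turn it into the table the question asks for, with one column per key.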
Another way to do this with the package @JackFleeting mentioned:
import pandas_read_xml as pdx
from pandas_read_xml import fully_flatten

df = (pdx.read_xml(r'path\to\myfile.xml', ['portfolio'])
      .pipe(fully_flatten))
Flattening expands lists (sibling tags in the XML) into separate rows, and dicts (child tags in the XML) into separate columns.
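To make the idea concrete, here is a small plain-Python sketch of that kind of flattening on a nested dict shaped like the portfolio XML. It is only an illustration of the concept and not how pandas_read_xml implements fully_flatten: the function name flatten and the record literal are made up for this example, and unlike a real implementation it does not prefix column names, so colliding keys would overwrite each other.

```python
def flatten(record):
    """Expand a nested record into a list of flat dicts, one per row."""
    rows = [{}]
    for key, value in record.items():
        if isinstance(value, list):
            # A list multiplies the rows: every existing row is copied per item
            expanded = []
            for item in value:
                item_rows = flatten(item) if isinstance(item, dict) else [{key: item}]
                for row in rows:
                    for item_row in item_rows:
                        expanded.append({**row, **item_row})
            rows = expanded
        elif isinstance(value, dict):
            # A dict adds columns to every existing row
            sub_rows = flatten(value)
            rows = [{**row, **sub} for row in rows for sub in sub_rows]
        else:
            # A scalar is simply a new column on every row
            for row in rows:
                row[key] = value
    return rows

# Hand-written stand-in for the parsed <portfolio> element
record = {
    'assets': '600000',
    'assetClassDetails': [
        {'assetClassName': 'Bonds', 'assetAmount': '100000'},
        {'assetClassName': 'Equities', 'assetAmount': '500000'},
    ],
    'rateOfReturn': '6.3',
}

flat = flatten(record)
for row in flat:
    print(row)
```

The two items in the assetClassDetails list become two rows, and the scalar values assets and rateOfReturn are repeated on each of them.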