XML to Pandas DataFrame conversion

Question

XML file:

<start>
    <Hit>
         <hits path="xxxxx" id="xx" title="xxx"/>
         <hits path="aaaaa" id="aa" title="aaa"/>
    </Hit>
    <Hit>
         <hits path="bbbbb" id="bb" title="bbb"/>
    </Hit>
    <Hit>
         <hits path="qqqqq" id="qq" title="qqq"/>
         <hits path="wwwww" id="ww" title="www"/>
         <hits path="ttttt" id="tt" title="ttt"/>
    </Hit>
</start>

Python code:

import xml.etree.ElementTree as et
import pandas as pd

tree = et.parse(xml_data)  # xml_data: path to (or file object for) the XML file
root = tree.getroot()

all_records = []
for child in root:
    record = child.attrib.values()
    all_records.append(record)
    pd1 = pd.DataFrame(all_records, columns=subchild.attrib.keys())

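I have an unstructured XML file. A Hit element can have any number of child hits elements.
I want to list the first hits child of every Hit element.

Expected result:
DataFrame contents: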

   path    id    title
0  xxxxx   xx    xxx
1  bbbbb   bb    bbb
2  qqqqq   qq    qqq

That's all. All other items should be ignored.

record = child.attrib.values()

This line of code takes the values of all the hits elements, i.e. 6 values in total. I only want 3 values, because there are only 3 Hit tags.

How can I do this?

Answer 1

I think you need to change:

record = child.attrib.values()

to:

record = child[0].attrib.values()

This selects only the first hits child of each Hit element.
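For reference, here is a minimal end-to-end sketch of the child[0] approach. It makes a few assumptions that are not in the original post: the XML is given inline as a string named XML_TEXT, with the hits elements written as self-closing tags so that it parses, and each record is stored as a dict rather than as attrib.values(), so pandas can take the column names from the attribute names.

```python
import xml.etree.ElementTree as et

import pandas as pd

# Sample data mirroring the question. The <hits> elements are written as
# self-closing tags so the string is well-formed XML (an assumption).
XML_TEXT = """
<start>
    <Hit>
        <hits path="xxxxx" id="xx" title="xxx"/>
        <hits path="aaaaa" id="aa" title="aaa"/>
    </Hit>
    <Hit>
        <hits path="bbbbb" id="bb" title="bbb"/>
    </Hit>
    <Hit>
        <hits path="qqqqq" id="qq" title="qqq"/>
        <hits path="wwwww" id="ww" title="www"/>
        <hits path="ttttt" id="tt" title="ttt"/>
    </Hit>
</start>
"""

root = et.fromstring(XML_TEXT)

# child[0] is the first <hits> element under each <Hit>;
# copying its attributes into a dict gives one row per <Hit>.
all_records = [dict(child[0].attrib) for child in root]

# Column names come from the dict keys (path, id, title).
df = pd.DataFrame(all_records)
print(df)
```

Note that child[0] raises IndexError for a Hit element with no children, so this version assumes that every Hit contains at least one hits element.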

List comprehension solution:

all_records = [child[0].attrib.values() for child in root]

If some Hit elements may be empty:

all_records = []
for child in root:
    if len(child) > 0:
        record = child[0].attrib.values()
        all_records.append(record)

List comprehension solution:

all_records = [child[0].attrib.values() for child in root if len(child) > 0]
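As an alternative sketch for the empty-element case, Element.find("hits") returns the first direct hits child or None, which avoids indexing altogether. The sample string, the empty Hit element in it, and the names records and df_safe are illustrative assumptions, not part of the original question.

```python
import xml.etree.ElementTree as et

import pandas as pd

# Hypothetical sample with one empty <Hit/> to exercise the guard.
XML_TEXT = """
<start>
    <Hit>
        <hits path="xxxxx" id="xx" title="xxx"/>
        <hits path="aaaaa" id="aa" title="aaa"/>
    </Hit>
    <Hit/>
    <Hit>
        <hits path="bbbbb" id="bb" title="bbb"/>
    </Hit>
</start>
"""

root = et.fromstring(XML_TEXT)

records = []
for hit in root.iter("Hit"):
    first = hit.find("hits")  # first direct <hits> child, or None if absent
    if first is not None:
        records.append(dict(first.attrib))

# Passing the columns explicitly keeps the frame shape stable
# even if no <Hit> element has any children.
df_safe = pd.DataFrame(records, columns=["path", "id", "title"])
print(df_safe)
```

Compared with the len(child) > 0 check, find("hits") also skips a Hit whose first child is some element other than hits, which may or may not be what you want.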

xml

elementtree

pandas