Parse deeply nested XML to pandas dataframe

Question

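I am trying to extract specific parts of an XML file and load them into a pandas dataframe. After following some tutorials on xml.etree, I am still stuck on getting any output. So far I have managed to find the child nodes, but I can't access them (i.e. I can't get the actual data out of them). Here is what I have so far.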

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()
root.tag
# find all nodes within the XML data
tree.findall(".//")
# access the node
tree.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")

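What I want is to get the data from the programDescriptions node, specifically the child node programDescriptionText with xml:lang="nl", and later some additional data as well. But let's focus on this one first.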

Some data to work with:

<?xml version="1.0" encoding="UTF-8"?>
<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://someUrl.nl/schema/enterprise/program http://someUrl.nl/schema/enterprise/program.xsd">
<customizableOnRequest>true</customizableOnRequest>
<editor>webmaster@url</editor>
<expires>2019-04-21</expires>
<format>Edu-dex 1.0</format>
<generator>www.Url.com</generator>
<includeInCatalog>Catalogs</includeInCatalog>
<inPublication>true</inPublication>
<lastEdited>2019-04-12T20:03:09Z</lastEdited>
<programAdmission>
    <applicationOpen>true</applicationOpen>
    <applicationType>individual</applicationType>
    <maxNumberOfParticipants>12</maxNumberOfParticipants>
    <minNumberOfParticipants>8</minNumberOfParticipants>
    <paymentDue>up-front</paymentDue>
    <requiredLevel>academic bachelor</requiredLevel>
    <startDateDetermination>fixed starting date</startDateDetermination>
</programAdmission>
<programCurriculum>
    <instructionMode>training</instructionMode>
    <teacher>
        <id>{D83FFC12-0863-44A6-BDBB-ED618627F09D}</id>
        <name>SomeName</name>
        <summary xml:lang="nl">
        Long text of the summary. Not needed.
        </summary>
    </teacher>
    <studyLoad period="hour">26</studyLoad>
</programCurriculum>
<programDescriptions>
    <programName xml:lang="nl">Program Course Name</programName>
    <programSummaryText xml:lang="nl">short Program Course Name summary</programSummaryText>
    <programSummaryHtml xml:lang="nl">short Program Course Name summary in HTML format</programSummaryHtml>
    <programDescriptionText xml:lang="nl">This part is needed from the XML.
        Big program description text. This part is needed to parse from the XML file.
    </programDescriptionText>
    <programDescriptionHtml xml:lang="nl">Not needed;
        Not needed as well;
    </programDescriptionHtml>
    <subjectText>
        <subject>curriculum</subject>
        <header1 xml:lang="nl">Beschrijving</header1>
        <descriptionHtml xml:lang="nl">Yet another HTML desscription;
            Not necessarily needed;</descriptionHtml>
    </subjectText>
    <searchword xml:lang="nl">search word</searchword>
    <webLink xml:lang="nl">website-url</webLink>
</programDescriptions>
<programSchedule>
    <programRun>
        <id>PR-019514</id>
        <status>application opened</status>
        <startDate isFinal="true">2019-06-26</startDate>
        <endDate isFinal="true">2020-02-11</endDate>
    </programRun>
</programSchedule>
</program>
</programs>

Answer 1

Try the code below (55703748.xml contains the XML you posted):

import xml.etree.ElementTree as ET

tree = ET.parse('55703748.xml')
root = tree.getroot()
nodes = root.findall(".//{http://someUrl.nl/schema/enterprise/program}programSummaryText")
for node in nodes:
    print(node.text)

Output

short Program Course Name summary
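
The same approach can be extended to the programDescriptionText node the question asks about. What follows is only a minimal sketch, not a definitive implementation: the "p" prefix, the helper names extract_rows and to_dataframe, the chosen columns, and the shortened inline SAMPLE document are all assumptions made for illustration. It filters on xml:lang="nl" and collects one row per program element.

```python
import xml.etree.ElementTree as ET

# "p" is an arbitrary local alias for the default namespace declared on <programs>.
NS = {"p": "http://someUrl.nl/schema/enterprise/program"}
# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Hypothetical choice of columns; add more child tags of programDescriptions as needed.
FIELDS = ("programName", "programDescriptionText")

# Shortened stand-in for the posted file, so the sketch runs on its own.
SAMPLE = """<programs xmlns="http://someUrl.nl/schema/enterprise/program">
<program>
<programDescriptions>
    <programName xml:lang="nl">Program Course Name</programName>
    <programDescriptionText xml:lang="nl">This part is needed from the XML.
        Big program description text.
    </programDescriptionText>
</programDescriptions>
</program>
</programs>"""


def extract_rows(root, lang="nl"):
    """Return one dict per <program>, holding the text of each field in FIELDS."""
    rows = []
    for program in root.findall("p:program", NS):
        desc = program.find("p:programDescriptions", NS)
        if desc is None:
            continue
        row = dict.fromkeys(FIELDS)
        for tag in FIELDS:
            for el in desc.findall(f"p:{tag}", NS):
                if el.get(XML_LANG) == lang:
                    # Collapse line breaks and indentation inside the text.
                    row[tag] = " ".join((el.text or "").split())
                    break
        rows.append(row)
    return rows


def to_dataframe(rows):
    """Build a DataFrame with one column per field (requires pandas)."""
    import pandas as pd  # imported here so the parsing step needs only the standard library
    return pd.DataFrame(rows)


rows = extract_rows(ET.fromstring(SAMPLE))
```

For the real file, replace ET.fromstring(SAMPLE) with ET.parse('55703748.xml').getroot() and pass the resulting rows to to_dataframe(rows) to get one dataframe row per program.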
