使用 ElementTree 解析嵌套 XML

Question

我有以下 XML 格式，我想使用 python 的 xml.etree.ElementTree 模块提取名称、区域和状态的值。

但是，到目前为止，我尝试获取此信息的尝试没有成功。

<feed>
    <entry>
        <id>uuid:asdfadsfasdf123123</id>
        <title type="text"></title>
        <content type="application/xml">
            <NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
                <Name>instancename</Name>
                <Region>US</Region>
                <Status>Active</Status>
            </NamespaceDescription>
        </content>
    </entry>
    <entry>
        <id>uuid:asdfadsfasdf234234</id>
        <title type="text"></title>
        <content type="application/xml">
            <NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
                <Name>instancename2</Name>
                <Region>US2</Region>
                <Status>Active</Status>
            </NamespaceDescription>
        </content>
    </entry>
</feed>

我的代码尝试：

NAMESPACE = '{http://www.w3.org/2005/Atom}'
root = et.fromstring(XML_STRING)
entry_root = root.findall('{0}entry'.format(NAMESPACE))
for child in entry_root:
    content_node = child.find('{0}content'.format(NAMESPACE))
    for content in content_node:
        for desc in content.iter():
            print desc.tag
            name = desc.find('{0}Name'.format(NAMESPACE))
            print name

desc.tag 为我提供了我想要访问的节点，但返回的是名称 None。知道我的代码有什么问题吗？

desc.tag 的输出：

{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Name
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Region
{http://schemas.microsoft.com/netservices/2010/10/servicebus/connect}Status

Answer 1

您可以使用 lxml.etree 和默认命名空间映射来解析 XML，如下所示：

content = '''
<feed>
    <entry>
        <id>uuid:asdfadsfasdf123123</id>
        <title type="text"></title>
        <content type="application/xml">
            <NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
                <Name>instancename</Name>
                <Region>US</Region>
                <Status>Active</Status>
            </NamespaceDescription>
        </content>
    </entry>
    <entry>
        <id>uuid:asdfadsfasdf234234</id>
        <title type="text"></title>
        <content type="application/xml">
            <NamespaceDescription xmlns="http://schemas.microsoft.com/netservices/2010/10/servicebus/connect" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
                <Name>instancename2</Name>
                <Region>US2</Region>
                <Status>Active</Status>
            </NamespaceDescription>
        </content>
    </entry>
</feed>'''

from lxml import etree

tree = etree.XML(content)
ns = {'default': 'http://schemas.microsoft.com/netservices/2010/10/servicebus/connect'}

names = tree.xpath('//default:Name/text()', namespaces=ns)
regions = tree.xpath('//default:Region/text()', namespaces=ns)
statuses = tree.xpath('//default:Status/text()', namespaces=ns)

print(names)
print(regions)
print(statuses)

输出

['instancename', 'instancename2']
['US', 'US2']
['Active', 'Active']

此 XPath/namespace 功能可以调整为以您需要的任何格式输出数据。

Answer 2

我不知道为什么我以前没有看到这个。但是，我能够得到这些值。

root = et.fromstring(XML_STRING)
entry_root = root.findall('{0}entry'.format(NAMESPACE))
for child in entry_root:
    content_node = child.find('{0}content'.format(NAMESPACE))
    for descr in content_node:
        name_node = descr.find('{0}Name'.format(NAMESPACE))
        print name_node.text

使用 ElementTree 解析嵌套 XML

Parsing nested XML with ElementTree

python

xml

elementtree

xml-parsing