How can I parse an XML document into Python objects?

Question

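I am trying to consume an XML API, and I would like some Python objects that represent the XML data. I have several XSDs and some sample API responses from the documentation.

Here is a sample XML response: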

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<serial:serialHeaderType xmlns:isan="http://www.isan.org/ISAN/isan"
                         xmlns:title="http://www.isan.org/schema/v1.11/common/title"
                         xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
                         xmlns:externalid="http://www.isan.org/schema/v1.11/common/externalid"
                         xmlns:common="http://www.isan.org/schema/v1.11/common/common"
                         xmlns:participant="http://www.isan.org/schema/v1.11/common/participant"
                         xmlns:language="http://www.isan.org/schema/v1.11/common/language"
                         xmlns:country="http://www.isan.org/schema/v1.11/common/country">
    <common:status>
        <common:DataType>SERIAL_HEADER_TYPE</common:DataType>
        <common:ISAN root="0000-0002-3B9F"/>
        <common:WorkStatus>ACTIVE</common:WorkStatus>
    </common:status>
    <serial:SerialHeaderId root="0000-0002-3B9F"/>
    <serial:MainTitles>
        <title:TitleDetail>
            <title:Title>Braquo</title:Title>
            <title:Language>
                <language:LanguageLabel>French</language:LanguageLabel>
                <language:LanguageCode>
                    <language:CodingSystem>ISO639_2</language:CodingSystem>
                    <language:ISO639_2Code>FRE</language:ISO639_2Code>
                </language:LanguageCode>
            </title:Language>
            <title:TitleKind>ORIGINAL</title:TitleKind>
        </title:TitleDetail>
    </serial:MainTitles>
    <serial:TotalEpisodes>11</serial:TotalEpisodes>
    <serial:TotalSeasons>0</serial:TotalSeasons>
    <serial:MinDuration>
        <common:TimeUnit>MIN</common:TimeUnit>
        <common:TimeValue>45</common:TimeValue>
    </serial:MinDuration>
    <serial:MaxDuration>
        <common:TimeUnit>MIN</common:TimeUnit>
        <common:TimeValue>144</common:TimeValue>
    </serial:MaxDuration>
    <serial:MinYear>2009</serial:MinYear>
    <serial:MaxYear>2009</serial:MaxYear>
    <serial:MainParticipantList>
        <participant:Participant>
            <participant:FirstName>Frédéric</participant:FirstName>
            <participant:LastName>Schoendoerffer</participant:LastName>
            <participant:RoleCode>DIR</participant:RoleCode>
        </participant:Participant>
        <participant:Participant>
            <participant:FirstName>Karole</participant:FirstName>
            <participant:LastName>Rocher</participant:LastName>
            <participant:RoleCode>ACT</participant:RoleCode>
        </participant:Participant>
    </serial:MainParticipantList>
    <serial:CompanyList>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>R.T.B.F.</common:CompanyName>
        </common:Company>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>Capa Drama</common:CompanyName>
        </common:Company>
        <common:Company>
            <common:CompanyKind>PRO</common:CompanyKind>
            <common:CompanyName>Marathon</common:CompanyName>
        </common:Company>
    </serial:CompanyList>
</serial:serialHeaderType>

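I first tried simply ignoring the XSDs and using lxml.objectify on the XML returned by the API. I ran into namespace problems: having to refer to every child node by its explicit namespace is a real pain and does not make for readable code.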

from lxml import objectify

obj = objectify.fromstring(response)
print(obj.MainTitles.TitleDetail)
# This will fail to find the element because you need to specify the namespace
print(obj.MainTitles['{http://www.isan.org/schema/v1.11/common/title}TitleDetail'])
# Or something like that; I couldn't get it to work, and I'd much rather use attributes and not specify the namespace

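Next I tried generateDS to create some Python class definitions for me. I have lost the error message that attempt gave me, but I could not get it to work. It generated a module for each XSD I gave it, but it would not parse the sample XML.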

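I am now trying PyXB, and so far it looks much better. It generates nicer definitions than generateDS (it splits them into multiple reusable modules), but it will not parse the XML: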

from models import serial
obj = serial.CreateFromDocument(response)

Traceback (most recent call last):
  ...
  File "/vagrant/isan/isan.py", line 58, in lookup
    return serial.CreateFromDocument(resp.content)
  File "/vagrant/isan/models/serial.py", line 69, in CreateFromDocument
    instance = handler.rootObject()
  File "/home/vagrant/venv/lib/python2.7/site-packages/pyxb/binding/saxer.py", line 285, in rootObject
    raise pyxb.UnrecognizedDOMRootNodeError(self.__rootObject)
UnrecognizedDOMRootNodeError: <pyxb.utils.saxdom.Element object at 0x2b53664dc850>

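The unrecognized node is the <serial:serialHeaderType> node from the sample. Looking at the PyXB source, this error seems to be raised "if the top-level element got processed as a DOM instance", but I don't know what that means or how to prevent it.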

I have run out of steam exploring this, and I don't know what to try next.

Answer 1

I have had good luck using Beautiful Soup to parse XML into Python. It is very simple to use, and the project provides thorough documentation. Take a look here: http://www.crummy.com/software/BeautifulSoup/ http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Answer 2

UnrecognizedDOMRootNodeError means that PyXB could not find the element in any namespace for which bindings are registered. In your case it fails on the very first element, {http://www.isan.org/schema/v1.21/common/serial}serialHeaderType.

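The schema for that namespace defines a complex type named SerialHeaderType, but it does not define an element named serialHeaderType. In fact, it defines no top-level elements at all. So PyXB cannot recognize the root element, and the XML does not validate.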
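One way to confirm this for yourself is to list what a schema actually declares at the top level. The following is a minimal sketch using only the standard library. SAMPLE_XSD is a hypothetical, heavily reduced stand-in for the real serial schema (only the target namespace and the SerialHeaderType name come from this thread), and top_level_names is an illustrative helper, not part of PyXB.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical, heavily reduced stand-in for the real serial XSD.
SAMPLE_XSD = """<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.isan.org/schema/v1.21/common/serial">
    <xs:complexType name="SerialHeaderType">
        <xs:sequence>
            <xs:element name="TotalEpisodes" type="xs:int"/>
        </xs:sequence>
    </xs:complexType>
</xs:schema>
"""


def top_level_names(xsd_text, kind):
    """Return the names of the schema's direct children of the given kind."""
    root = ET.fromstring(xsd_text.encode("utf-8"))
    # findall with a plain tag only matches direct children of <xs:schema>,
    # so elements nested inside a complex type are not counted.
    return [child.get("name") for child in root.findall("{%s}%s" % (XS, kind))]


elements = top_level_names(SAMPLE_XSD, "element")
types = top_level_names(SAMPLE_XSD, "complexType")
print("top-level elements:", elements)
print("complex types:", types)
```

Only top-level element declarations can serve as a document root, so an empty list of top-level elements means no document in that namespace can validate on its own.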

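Either you need to find an additional schema for that namespace that provides the element, or the messages you are receiving really cannot be validated. That could be because someone expects an implicit mapping from a complex type to an element of that type, or because this is a fragment that would normally appear inside some other element in which the QName is a member element name.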

UPDATE: You can hack the generated bindings by adding the following to serial.py:

serialHeaderType = pyxb.binding.basis.element(pyxb.namespace.ExpandedName(Namespace, 'serialHeaderType'), SerialHeaderType)
Namespace.addCategoryObject('elementBinding', serialHeaderType.name().localName(), serialHeaderType)

If you do that, you will no longer get UnrecognizedDOMRootNodeError, but you will get an IncompleteElementContentError at:

<common:status>
    <common:DataType>SERIAL_HEADER_TYPE</common:DataType>
    <common:ISAN root="0000-0002-3B9F"/>
    <common:WorkStatus>ACTIVE</common:WorkStatus>
</common:status>

which provides the following details:

The containing element {http://www.isan.org/schema/v1.11/common/common}status is defined at common.xsd[243:3].
The containing element type {http://www.isan.org/schema/v1.11/common/common}StatusType is defined at common.xsd[289:1]
The {http://www.isan.org/schema/v1.11/common/common}StatusType automaton is not in an accepting state.
Any accepted content has been stored in instance
The following element and wildcard content would be accepted:
    An element {http://www.isan.org/schema/v1.11/common/common}ActiveISAN per common.xsd[316:3]
    An element {http://www.isan.org/schema/v1.11/common/common}MatchingISANs per common.xsd[317:3]
    An element {http://www.isan.org/schema/v1.11/common/common}Description per common.xsd[318:3]
No content remains unconsumed

Looking at the schema confirms that, at a minimum, a {http://www.isan.org/schema/v1.11/common/common}Description element is missing even though it is required.

So these documents do not appear to be intended to validate, and PyXB is probably the wrong technology to use.
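If validation is not needed, one non-validating option is to drop the namespaces and build plain attribute-style objects directly from the parsed tree. The following is a rough sketch using only the standard library, not a PyXB feature. SAMPLE is a trimmed copy of the response in the question, and local_name and to_object are hypothetical helpers introduced here for illustration.

```python
import xml.etree.ElementTree as ET
from types import SimpleNamespace

# Trimmed copy of the sample response from the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<serial:serialHeaderType xmlns:serial="http://www.isan.org/schema/v1.21/common/serial"
                         xmlns:title="http://www.isan.org/schema/v1.11/common/title"
                         xmlns:common="http://www.isan.org/schema/v1.11/common/common">
    <common:status>
        <common:ISAN root="0000-0002-3B9F"/>
        <common:WorkStatus>ACTIVE</common:WorkStatus>
    </common:status>
    <serial:MainTitles>
        <title:TitleDetail>
            <title:Title>Braquo</title:Title>
        </title:TitleDetail>
    </serial:MainTitles>
    <serial:TotalEpisodes>11</serial:TotalEpisodes>
    <serial:CompanyList>
        <common:Company><common:CompanyName>R.T.B.F.</common:CompanyName></common:Company>
        <common:Company><common:CompanyName>Capa Drama</common:CompanyName></common:Company>
    </serial:CompanyList>
</serial:serialHeaderType>
"""


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def to_object(element):
    """Recursively convert an element into a SimpleNamespace or a string."""
    if len(element) == 0 and not element.attrib:
        return (element.text or "").strip()      # leaf element: keep its text
    node = SimpleNamespace(**element.attrib)     # XML attributes become Python attributes
    for child in element:
        name, value = local_name(child.tag), to_object(child)
        if hasattr(node, name):                  # repeated tag: collect values in a list
            existing = getattr(node, name)
            if not isinstance(existing, list):
                existing = [existing]
                setattr(node, name, existing)
            existing.append(value)
        else:
            setattr(node, name, value)
    return node


obj = to_object(ET.fromstring(SAMPLE))
print(obj.MainTitles.TitleDetail.Title)
print(obj.status.ISAN.root)
print([company.CompanyName for company in obj.CompanyList.Company])
```

This sketch has several limits. All values stay strings (TotalEpisodes is "11", not 11). A tag that occurs only once is not wrapped in a list. Elements with the same local name in different namespaces collide. Text inside an element that also has attributes or children is discarded. Whether those trade-offs are acceptable depends on how regular the API responses are.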

python

xml

xsd

lxml

pyxb