Find nodes defined in corrupted namespace

Question

I have downloaded this XML file.

I am trying to get includingNote as follows:

...
namespaces = { "skos" : "http://www.w3.org/2004/02/skos/core#", "xml" : "http://www.w3.org/XML/1998/namespace", 
                 "udc" : "http://udcdata.info/udc-schema#" }
...


includingNote = child.find("udc:includingNote[@xml:lang='en']", namespaces)
if includingNote:
  print includingNote.text.encode("utf8")

The schema is located here and appears to be corrupted.

Is there a way to print includingNote for each child node?

Answer 1

It is true that the skos prefix is not declared in udc-scheme, but that is not a problem for searching the XML document.
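One way to see why the schema's condition does not matter: ElementTree treats a namespace URI as an opaque string and never fetches whatever lives at that address. The sketch below is written for Python 3 and uses a made-up document and a made-up URI (both are assumptions for illustration, not part of the UDC data). It also shows that the prefix used in the search only has to map to the same URI string as in the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the namespace URI below does not resolve to anything.
SAMPLE = """<root xmlns:x="http://example.invalid/no-such-schema#">
  <x:note>hello</x:note>
</root>"""

root = ET.fromstring(SAMPLE)

# Parsed tags use the "{uri}local" form; the document's own prefix "x" is gone.
tags = [child.tag for child in root]

# The prefix "anything" is local to this mapping; it only needs to point
# at the same URI string that the document uses.
mapping = {"anything": "http://example.invalid/no-such-schema#"}
note = root.find("anything:note", mapping)

print(tags)
print(note.text)
```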

The following program extracts 639 includingNote elements:

from xml.etree import cElementTree as ET

namespaces = {"udc" : "http://udcdata.info/udc-schema#",
              "xml" : "http://www.w3.org/XML/1998/namespace"}

doc = ET.parse("udcsummary-skos.rdf")
includingNotes = doc.findall(".//udc:includingNote[@xml:lang='en']", namespaces)

print len(includingNotes)   # 639

for i in includingNotes:
    print i.text

Note the use of findall() and of .// in front of the element name, in order to search the whole document.
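The effect of .// can be checked on a small scale. This is a minimal sketch for Python 3 (where xml.etree.ElementTree replaces cElementTree), run against a hypothetical three-note document that only imitates the shape of the real file. The sample data is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; only the shape mimics the UDC file.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:udc="http://udcdata.info/udc-schema#">
  <skos:Concept>
    <udc:includingNote xml:lang="en">Dictionaries</udc:includingNote>
    <udc:includingNote xml:lang="fr">Dictionnaires</udc:includingNote>
  </skos:Concept>
  <skos:Concept>
    <udc:includingNote xml:lang="en">Atlases</udc:includingNote>
  </skos:Concept>
</rdf:RDF>"""

namespaces = {"udc": "http://udcdata.info/udc-schema#",
              "xml": "http://www.w3.org/XML/1998/namespace"}

root = ET.fromstring(SAMPLE)

# Without ".//" the path is matched against direct children of root only.
direct = root.findall("udc:includingNote[@xml:lang='en']", namespaces)

# With ".//" every descendant of root is searched.
nested = root.findall(".//udc:includingNote[@xml:lang='en']", namespaces)

print(len(direct))
print([n.text for n in nested])
```

Here the direct search finds nothing because the notes sit one level down, inside the Concept elements, while the .// search returns both English notes in document order.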

Here is a variant that returns the same information by first finding all Concept elements:

from xml.etree import cElementTree as ET

namespaces = {"udc" : "http://udcdata.info/udc-schema#",
              "skos" : "http://www.w3.org/2004/02/skos/core#",
              "xml" : "http://www.w3.org/XML/1998/namespace"}

doc = ET.parse("udcsummary-skos.rdf")
concepts = doc.findall(".//skos:Concept", namespaces)

for c in concepts:
    includingNote = c.find("udc:includingNote[@xml:lang='en']", namespaces)
    if includingNote is not None:
        print includingNote.text

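Note the use of is not None. Without it, the code does not work: an Element that has no child elements is treated as false in a boolean test, so a plain "if includingNote:" skips the element even though find() located it. This seems to be a peculiarity of ElementTree. See Why does bool(xml.etree.ElementTree.Element) evaluate to False?.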
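A short sketch of the same point for Python 3, again on a hypothetical two-concept document (the sample data is an assumption). It avoids the boolean test entirely. It compares against None, inspects len() to show why the element would have counted as empty, and uses findtext(), which returns a default value when nothing matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; only the shape mimics the UDC file.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#"
         xmlns:udc="http://udcdata.info/udc-schema#">
  <skos:Concept>
    <udc:includingNote xml:lang="en">Dictionaries</udc:includingNote>
  </skos:Concept>
  <skos:Concept/>
</rdf:RDF>"""

namespaces = {"udc": "http://udcdata.info/udc-schema#",
              "skos": "http://www.w3.org/2004/02/skos/core#",
              "xml": "http://www.w3.org/XML/1998/namespace"}

path = "udc:includingNote[@xml:lang='en']"
root = ET.fromstring(SAMPLE)
first, second = root.findall("skos:Concept", namespaces)

note = first.find(path, namespaces)
found = note is not None    # the element was located
child_count = len(note)     # it holds text but no child elements

# findtext() returns the default (None here) when nothing matches,
# so no truth test on an Element is needed.
texts = [c.findtext(path, None, namespaces) for c in (first, second)]

print(found, child_count, texts)
```

The second Concept has no includingNote, so its entry in texts is None, while the first yields the note text.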


python

xml

elementtree

xml-namespaces