root.findall('foo') and root.iter('foo') not returning any results

Question

I have a large Swedish dictionary in XML. I am trying to find all the nouns, which are marked 'subst.' in the file.

Here is part of the file, showing the entry (Article) for the word 'a':

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl"  href="transform_lexin.xsl"?>
<Dictionary xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="lexinAA.xsd">
  <Article ID="1000002" Sortkey="a">
    <Lemma Value="a" Variant="" Type="subst." ID="1000002" LemmaID="1" VariantID="3, 4" Rank="350">
      <Phonetic File="a.swf">a:</Phonetic>
      <Inflection Form="best.f.sing.">a:et</Inflection>
      <Inflection Form="obest.f.pl.">a:n</Inflection>
      <Inflection Form="best.f.pl.">a:na</Inflection>
      <Index Value="a" />
      <Index Value="a:et" />
      <Index Value="a:n" />
      <Index Value="a:na" />
      <Index Value="as" />
      <Index Value="a:ets" />
      <Index Value="a:ns" />
      <Index Value="a:nas" />
      <Lexeme ID="1" Lexemeno="1" LexemeID="1000006" VariantID="3">
        <Definition>första bokstaven i alfabetet</Definition>
        <Idiom ID="1000008" OldID="2">a och o<Definition ID="1000009">det viktigaste</Definition></Idiom>
        <Idiom ID="1000010" OldID="1">har man sagt a får man också säga b<Definition ID="1000011">har man börjat får man fortsätta</Definition></Idiom>
      </Lexeme>
      <Lexeme ID="2" Lexemeno="2" LexemeID="1000013" VariantID="4">
        <Definition>sjätte tonen i C-durskalan</Definition>
        <Compound OldID="" ID="2000667">a-moll</Compound>
        <Compound OldID="" ID="2000668">A-dur</Compound>
        <Index Value="a-moll" />
        <Index Value="a-molls" />
        <Index Value="a moll" />
        <Index Value="a molls" />
        <Index Value="A-dur" />
        <Index Value="A-durs" />
        <Index Value="A dur" />
        <Index Value="A durs" />
      </Lexeme>
    </Lemma>
  </Article>

When I try to find the nouns with the findall or iter methods, they return nothing.

import xml.etree.ElementTree as ET
import sys

tree = ET.parse(sys.argv[1])
root = tree.getroot()

for noun in root.findall('subst.'):
      print(noun.attrib)

I get the same empty result whether I use findall() or iter().

However, when I search for 'Article' instead of 'subst.', I get all the dictionary entries:

for noun in root.iter('Article'):
      print(noun.attrib)
{'ID': '1179604', 'Sortkey': 'övning'}
{'ID': '1179617', 'Sortkey': 'övningskörning'}
{'ID': '1179637', 'Sortkey': 'övre'}
{'ID': '1179644', 'Sortkey': 'övrig'}
{'ID': '1179656', 'Sortkey': 'övärld'}

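I have tried other keywords such as 'Lemma', but that returns nothing. 'Idiom' returns items when I use iter(), but not when I use findall().

I am clearly missing something obvious about how these methods work.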

Answer 1

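Here is an XSLT transformation solution. Since the XML source is large, you may get better performance by letting libxml2 and libxslt (through lxml) do the heavy lifting. To try it, copy the following into a file named swedish-dictionary.xsl: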

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" />
    <xsl:strip-space elements="*"/>

    <xsl:template match="/">
        <xsl:apply-templates />
    </xsl:template>

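    <!-- Emit one line per Article whose Lemma is marked as a noun -->
    <xsl:template match="Dictionary/Article/Lemma[@Type = 'subst.']">
        <xsl:text>{'ID': '</xsl:text>
        <xsl:value-of select="../@ID" />
        <xsl:text>', 'Sortkey': '</xsl:text>
        <xsl:value-of select="../@Sortkey" />
        <xsl:text>'}&#10;</xsl:text>
    </xsl:template>

    <!-- Override the built-in rule that would otherwise copy the text of every non-matching entry -->
    <xsl:template match="text()" />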
</xsl:stylesheet>

If the XML source file is named swedish-dictionary.xml, the Python code looks like this:

from lxml import etree

# Compile the stylesheet; passing file names lets lxml deal with the encoding itself
transform = etree.XSLT(etree.parse('swedish-dictionary.xsl'))

# Apply the transformation to the dictionary and print the text result
print(transform(etree.parse('swedish-dictionary.xml')))

Result for the sample XML:

{'ID': '1000002', 'Sortkey': 'a'}

You can also get the same result with the xsltproc utility that ships with libxslt:

xsltproc swedish-dictionary.xsl swedish-dictionary.xml

Answer 2

Here is a way to find the Article elements that have a Lemma child whose Type attribute has the value subst.:

import xml.etree.ElementTree as ET

tree = ET.parse("dictionary.xml")

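for article in tree.findall("Article"):
    lemma = article.find("Lemma")
    # Compare with None explicitly: an Element's truth value depends on whether it has children
    if lemma is not None and lemma.get("Type") == "subst.":
        print(article.attrib)

Output: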

{'ID': '1000002', 'Sortkey': 'a'}
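
The standard library's ElementTree also understands a small subset of XPath, including the `[@attrib='value']` predicate, so the attribute test can be moved into the path itself. Below is a minimal sketch, assuming a trimmed-down sample with a second, made-up verb entry for contrast. The `SAMPLE` string and the `noun_articles` helper are illustrative names, not part of the dictionary file or of any library.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the dictionary; the second Article is invented for contrast.
SAMPLE = """<Dictionary>
  <Article ID="1000002" Sortkey="a">
    <Lemma Value="a" Type="subst." />
  </Article>
  <Article ID="0000000" Sortkey="äta">
    <Lemma Value="äta" Type="verb" />
  </Article>
</Dictionary>"""


def noun_articles(root):
    """Return the Article elements that have a Lemma child with Type='subst.'."""
    return [
        article
        for article in root.findall("Article")
        # find() returns None when no child matches the path
        if article.find("Lemma[@Type='subst.']") is not None
    ]


root = ET.fromstring(SAMPLE)
for article in noun_articles(root):
    print(article.attrib)
```

This keeps the explicit loop over Article elements, so it should behave like the version above while avoiding the separate `get("Type")` comparison.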

With lxml, you can get the same result using a compact XPath expression:

from lxml import etree

tree = etree.parse("dictionary.xml")

for article in tree.xpath("Article[Lemma[@Type='subst.']]"):
    print(article.attrib)
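
As for why the original calls came back empty: both `findall('foo')` and `iter('foo')` match on tag names, and `subst.` is an attribute value, not a tag. In addition, `findall()` with a bare tag name only looks at the direct children of the element it is called on, while `iter()` walks all descendants, which would explain why `Idiom` shows up with `iter()` but not with `findall()`. The sketch below illustrates this on a cut-down sample; the `SAMPLE` string is an assumption made for the demonstration, not the real file.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the dictionary, keeping one deeply nested Idiom.
SAMPLE = """<Dictionary>
  <Article ID="1000002" Sortkey="a">
    <Lemma Value="a" Type="subst.">
      <Lexeme ID="1">
        <Idiom ID="1000008">a och o</Idiom>
      </Lexeme>
    </Lemma>
  </Article>
</Dictionary>"""

root = ET.fromstring(SAMPLE)

# 'subst.' is an attribute value, so no element carries it as a tag name.
by_value = root.findall("subst.")

# findall() with a plain tag name inspects the direct children of root only.
direct_articles = root.findall("Article")
direct_idioms = root.findall("Idiom")

# iter() visits every descendant, however deep it is nested.
all_idioms = list(root.iter("Idiom"))

# A './/' prefix makes findall() search the whole subtree;
# the predicate then filters on the attribute value.
nouns = root.findall(".//Lemma[@Type='subst.']")

print(len(by_value), len(direct_articles), len(direct_idioms),
      len(all_idioms), len(nouns))
```

In short, search for the tag (`Lemma`) and filter on the attribute (`Type`), and use `.//` or `iter()` when the element is not a direct child of the root.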

elementtree

python-3.x