Tag unrecognized during iterparsing using lxml

Question

I have a really strange problem with lxml. I am trying to parse my XML file with iterparse as follows:

for event, elem in etree.iterparse(input_file, events=('start', 'end')):
    if elem.tag == 'tuv' and event == 'start':
        if elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
            if elem.find('seg') is not None:
                write_in_some_file
        elif elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'de':
            if elem.find('seg') is not None:
                write_in_some_file

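It's pretty simple and works almost perfectly. In short, it goes through my XML file; if an element is a <tuv>, it checks whether its language attribute is 'en' or 'de', then checks whether it has a <seg> child, and if so writes that child's value to a file.

There is one <seg> in the file that seems not to exist: elem.find('seg') returns None for it. You can see it in its context below: <seg>! keine Spalten und Ventile</seg>.

I don't understand why this seemingly fine tag causes a problem (since I can't use its .text). Note that every other tag works fine.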

<tu tuid="235084307" datatype="Text">
<prop type="score">1.67647</prop>
<prop type="score-zipporah">0.6683</prop>
<prop type="score-bicleaner">0.7813</prop>
<prop type="lengthRatio">0.740740740741</prop>
<tuv xml:lang="en">
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
 <seg>! no gaps and valves</seg>
</tuv>
<tuv xml:lang="de">
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
 <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
 <seg>! keine Spalten und Ventile</seg>
</tuv>
</tu>

Answer 1

I'm not sure if this is what you're looking for (I'm fairly new to this myself), but

from lxml import etree

for event, elem in etree.iterparse('xml_try.txt', events=('start', 'end')):
    if elem.tag == 'tuv' and event == 'start':
        if elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'en':
            if elem.find('seg') is not None:
                print(elem[2].text)  # seg is the third child of tuv
        elif elem.get('{http://www.w3.org/XML/1998/namespace}lang') == 'de':
            if elem.find('seg') is not None:
                print(elem[2].text)  # seg is the third child of tuv

produces this output:

! no gaps and valves
! keine Spalten und Ventile

Sorry again if this isn't what you wanted.

Answer 2

The lxml docs contain this warning:

WARNING: During the 'start' event, any content of the element, such as the descendants, following siblings or text, is not yet available and should not be accessed. Only attributes are guaranteed to be set.
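A minimal sketch of what that warning means in practice. This is an illustration rather than the asker's exact setup: it uses lxml's XMLPullParser and hand-picked chunk boundaries (both are assumptions made for the demo) to mimic the input being read piece by piece. If a chunk of the real file happens to end just after a <tuv> start tag, the same thing would plausibly happen inside iterparse, which would explain why only one seg appears to be missing.

```python
from lxml import etree


def observe_seg_visibility():
    """Record what find('seg') yields on tuv 'start' versus tuv 'end'."""
    parser = etree.XMLPullParser(events=("start", "end"))
    seen = {}

    # First chunk stops right after the <tuv> start tag (hypothetical boundary).
    parser.feed(b'<tu><tuv xml:lang="de">')
    for event, elem in parser.read_events():
        if event == "start" and elem.tag == "tuv":
            # The seg child has not been parsed yet at this point.
            seen["start"] = elem.find("seg")

    # Second chunk delivers the rest of the element.
    parser.feed(b"<seg>! keine Spalten und Ventile</seg></tuv></tu>")
    for event, elem in parser.read_events():
        if event == "end" and elem.tag == "tuv":
            # By the 'end' event the whole subtree is available.
            seen["end"] = elem.find("seg").text

    parser.close()
    return seen


seen = observe_seg_visibility()
print(seen)
```

On the 'start' event only the attributes can be relied on, so the lookup for seg is expected to come back empty there, while the 'end' event sees the complete element.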

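Maybe instead of calling find() on the tuv to get the seg element, change your "if" statement to match seg and the "end" event.

You can use getparent() to get the xml:lang attribute value from the parent tuv.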

Example ("test.xml", with an extra "tu" element added for testing)

<tus>
    <tu tuid="235084307" datatype="Text">
        <prop type="score">1.67647</prop>
        <prop type="score-zipporah">0.6683</prop>
        <prop type="score-bicleaner">0.7813</prop>
        <prop type="lengthRatio">0.740740740741</prop>
        <tuv xml:lang="en">
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
            <seg>! no gaps and valves</seg>
        </tuv>
        <tuv xml:lang="de">
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
            <seg>! keine Spalten und Ventile</seg>
        </tuv>
    </tu>
    <tu tuid="235084307A" datatype="Text">
        <prop type="score">1.67647</prop>
        <prop type="score-zipporah">0.6683</prop>
        <prop type="score-bicleaner">0.7813</prop>
        <prop type="lengthRatio">0.740740740741</prop>
        <tuv xml:lang="en">
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
            <seg>! no gaps and valves #2</seg>
        </tuv>
        <tuv xml:lang="de">
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34/7969ccc9b6/bevi-clean-ball.html</prop>
            <prop type="source-document">http://www.beviclean.de/en/shop/product-details/artikel/bevi-accessoires/34//bevi-clean-ball.html</prop>
            <seg>! keine Spalten und Ventile #2</seg>
        </tuv>
    </tu>
</tus>

Python 3.x

from lxml import etree

for event, elem in etree.iterparse("test.xml", events=("start", "end")):

    if elem.tag == "seg" and event == "end":
        current_lang = elem.getparent().get("{http://www.w3.org/XML/1998/namespace}lang")
        if current_lang == "en":
            print(f"Writing en text \"{elem.text}\" to file...")
        elif current_lang == "de":
            print(f"Writing de text \"{elem.text}\" to file...")
        else:
            print(f"Unable to determine language. Not writing \"{elem.text}\" to any file.")

    if event == "end":
        elem.clear()

Printed output

Writing en text "! no gaps and valves" to file...
Writing de text "! keine Spalten und Ventile" to file...
Writing en text "! no gaps and valves #2" to file...
Writing de text "! keine Spalten und Ventile #2" to file...
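A variant closer to the structure of the question's code, offered as a sketch only: keep matching on tuv, but react to the "end" event, when the seg child is guaranteed to be parsed. The helper name collect_segments, the SAMPLE data and the in-memory io.BytesIO source are assumptions made so the example is self-contained; with a real file you would pass its path instead.

```python
import io

from lxml import etree

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened stand-in for the real file (hypothetical sample data).
SAMPLE = b"""<tus>
  <tu tuid="1">
    <tuv xml:lang="en"><prop type="source-document">a</prop><seg>! no gaps and valves</seg></tuv>
    <tuv xml:lang="de"><prop type="source-document">a</prop><seg>! keine Spalten und Ventile</seg></tuv>
  </tu>
</tus>"""


def collect_segments(source, languages=("en", "de")):
    """Return {language: [seg texts]} using only 'end' events."""
    segments = {lang: [] for lang in languages}
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "tuv":
            lang = elem.get(XML_LANG)
            seg = elem.find("seg")  # safe here: the tuv subtree is complete
            if lang in segments and seg is not None and seg.text:
                segments[lang].append(seg.text)
        elif elem.tag == "tu":
            # Free the finished translation unit to keep memory use low.
            elem.clear()
    return segments


segments = collect_segments(io.BytesIO(SAMPLE))
for lang, texts in segments.items():
    print(lang, texts)
```

Clearing only at the end of each tu, rather than on every "end" event, keeps the tuv attributes and seg text intact until they have been read.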
