Get the entire parent tag's text in ElementTree

Question

While using the xml.etree.ElementTree Python package (imported as ET), I want to get the entire text within an XML tag that contains some child nodes. Consider the following XML:

<p>This is the start of parent tag...
        <ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 
</p>

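Suppose the XML above is in node. Then node.text only gives me This is the start of parent tag.... However, I want to capture all the text inside the p tag (along with the text of its child tags), which would result in: This is the start of parent tag... child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2.

Is there a way to solve this? I looked through the documentation but couldn't really find anything useful.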

Answer 1

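This is indeed a very awkward peculiarity of ElementTree. The gist is: if an element contains both text and child elements, and a child element sits between two stretches of text, then the text that follows the child's closing tag is stored as that child's tail, not as part of the parent's text.

To collect all the text that belongs to an element, directly or through its descendants, you need to read the element's own text plus the text and tail of every descendant element.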

>>> from lxml import etree

>>> s = '<p>This is the start of parent tag...<ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 </p>'

>>> root = etree.fromstring(s)
>>> child1, child2 = list(root)  # getchildren() is deprecated; iterate the element instead

>>> root.text
'This is the start of parent tag...'

>>> child1.text, child1.tail
('child 1', '. blah1 blah1 blah1 ')

>>> child2.text, child2.tail
('child2', ' blah2 blah2 blah2 ')

As for a complete solution, I found that this answer does something very similar, which you can easily adapt to your use case (by not printing the element names).
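For illustration, the text/tail traversal described above could be sketched roughly as follows with the standard-library xml.etree.ElementTree. The helper name collect_text is hypothetical (it is not part of ElementTree or lxml, and it is not taken from the linked answer); it simply walks the tree recursively and concatenates each text and tail in document order.

```python
import xml.etree.ElementTree as ET


def collect_text(element):
    """Hypothetical helper: concatenate the text of element and all its descendants."""
    # Text that appears before the first child (may be None).
    parts = [element.text or ""]
    for child in element:
        # Recurse to pick up the child's own text and its descendants' text.
        parts.append(collect_text(child))
        # Text following the child's closing tag is stored on the child as tail.
        parts.append(child.tail or "")
    return "".join(parts)


s = ('<p>This is the start of parent tag...<ref type="chlid1">child 1</ref>'
     '. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2 </p>')
root = ET.fromstring(s)
print(repr(collect_text(root)))
# 'This is the start of parent tag...child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2 '
```

Note that the tail of the starting element itself is deliberately left out, since that text lies outside the element.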

Edit: Actually, by far the simplest solution, in my opinion, is to use itertext:

>>> ''.join(root.itertext())
'This is the start of parent tag...child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2 '

Answer 2

You can do something like this with ElementTree:

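import xml.etree.ElementTree as ET

# The XML string from the question
data = """<p>This is the start of parent tag...
        <ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2
</p>"""
tree = ET.fromstring(data)
print(' '.join(tree.itertext()).strip())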

Output:

This is the start of parent tag...
         child 1 . blah1 blah1 blah1  child2  blah2 blah2 blah2
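The output above keeps the newline and indentation from the source XML, and joining with ' ' puts an extra space before the period after child 1. If the goal is the single-line string from the question, one possible approach (a sketch, not the only way) is to join the pieces with an empty string and then collapse runs of whitespace. The variable names raw and normalized are just illustrative.

```python
import xml.etree.ElementTree as ET

# The XML from the question, including its line break and indentation.
data = """<p>This is the start of parent tag...
        <ref type="chlid1">child 1</ref>. blah1 blah1 blah1 <ref type="chlid2">child2</ref> blah2 blah2 blah2
</p>"""

node = ET.fromstring(data)

# Join without a separator so no spaces are invented between pieces.
raw = "".join(node.itertext())

# str.split() with no arguments splits on any whitespace run and drops
# leading/trailing whitespace, so re-joining yields single spaces only.
normalized = " ".join(raw.split())

print(normalized)
# This is the start of parent tag... child 1. blah1 blah1 blah1 child2 blah2 blah2 blah2
```

This only produces a space between two pieces when the XML itself has whitespace there, so adjacent pieces with no whitespace between them (as in the one-line string used in Answer 1) stay glued together.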
