Python etree.ElementTree extracted XML text is truncated when text contains HTML tags

Question

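I am scraping PubMed XML documents with Python's xml.etree.ElementTree. HTML formatting elements embedded in the text cause only a fragment of the text to be returned for a given XML element. For the following XML element, only the text before the italic tag is returned.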

<AbstractText>Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>

Here is sample code that runs but fails to return the full record when the text contains HTML.

import xml.etree.ElementTree as ET
xmldata = 'directory/to/data.xml'
tree = ET.parse(xmldata)
root = tree.getroot()

titles = {}

for i in range(len(root)):
    for child in root[i].iter():
        if child.tag == 'ArticleTitle':
            # child.text holds only the text before the first nested tag
            title = child.text
            titles[i] = title

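I also tried something similar with lxml.etree and child.xpath('//AbstractText/text()'). This returns all the text in the document as list items, but there is no clear way to combine those items back into the original abstracts (i.e., 3 abstracts might return 3x list items).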

Answer 1

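The answer is itertext(), which collects the inner text of an element, including the text of its nested child elements, in document order.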

So the code looks like this:

import xml.etree.ElementTree as ET
from io import StringIO

raw_data="""
<AbstractText>Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>
"""
tree = ET.parse(StringIO(raw_data))
root = tree.getroot()
# The element contains a child element (<i>), which is why .text stops just before <i>
for e in root.findall("."):
    print(e.text,type(e))

Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <class 'xml.etree.ElementTree.Element'>
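To see where the rest of the sentence went, it can help to inspect the child element directly. The following is a minimal sketch, assuming a shortened version of the abstract above; the names elem and child are only illustrative. ElementTree stores the text that follows a closing tag in the tail attribute of that child, not in the parent's text.

```python
import xml.etree.ElementTree as ET

# Shortened version of the abstract, used only for illustration
elem = ET.fromstring(
    "<AbstractText>of which <i>Microdochium</i> species are the most harmful.</AbstractText>"
)
child = elem[0]  # the nested <i> element

print(repr(elem.text))   # text before the first child element
print(repr(child.text))  # text inside <i>...</i>
print(repr(child.tail))  # text after </i>, stored on the child as its tail
```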

Using itertext():

"".join(root.find(".").itertext()) # "".join(element.itertext())

'Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which Microdochium species are the most harmful.'
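Applied to the loop from the question, the same idea might look like the sketch below. It assumes a simplified, hypothetical stand-in for a PubMed export (real records are nested more deeply), and the names sample, abstracts, and parts are illustrative, not taken from the original code. Each AbstractText element is joined on its own, so the text of different abstracts is never mixed together.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a PubMed export
sample = """
<PubmedArticleSet>
  <PubmedArticle>
    <AbstractText>Snow mold is caused by <i>Microdochium</i> species.</AbstractText>
  </PubmedArticle>
  <PubmedArticle>
    <AbstractText>Plain abstract without markup.</AbstractText>
  </PubmedArticle>
</PubmedArticleSet>
"""

root = ET.fromstring(sample)
abstracts = {}

for i, article in enumerate(root):
    # An article may contain several AbstractText sections;
    # itertext() yields the full text of each one, nested tags included.
    parts = ["".join(node.itertext()) for node in article.iter("AbstractText")]
    abstracts[i] = " ".join(parts)

print(abstracts)
```

Because the join happens per element, one entry is produced per article regardless of how many formatting tags the abstract contains.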
