Python xml.etree.ElementTree: extracted XML text is truncated when the text contains HTML tags
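I am scraping PubMed XML documents with Python's xml.etree.ElementTree. HTML formatting elements embedded in the text cause only a fragment of the text to be returned for a given XML element. For the following XML element, only the text before the italic tag is returned.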
<AbstractText>Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>
Here is sample code that runs but fails to return the complete record when it contains HTML:
import xml.etree.ElementTree as ET

xmldata = 'directory/to/data.xml'
tree = ET.parse(xmldata)
root = tree.getroot()
titles = {}
for i in range(len(root)):
    for child in root[i].iter():
        if child.tag == 'ArticleTitle':
            # .text only holds the text before the first child element (e.g. <i>)
            title = child.text
            titles[i] = title
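I also tried something similar using lxml.etree with child.xpath('//AbstractText/text()'). This returns all of the text in the document as list elements, but there is no clear way to recombine those elements into the original abstracts (i.e., 3 abstracts may return more than 3 list elements).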
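The answer is itertext(), which collects the inner text of an element, including the text inside its child elements.
So the code looks like this: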
import xml.etree.ElementTree as ET
from io import StringIO
raw_data="""
<AbstractText>Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>
"""
tree = ET.parse(StringIO(raw_data))
root = tree.getroot()
# The element contains a child element (<i>), which is why .text stops just before <i>
for e in root.findall("."):
    print(e.text, type(e))
Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which <class 'xml.etree.ElementTree.Element'>
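The truncation follows from how ElementTree stores mixed content: an element's .text holds only the text before its first child, and the text that follows a child is kept on that child's .tail. The short sketch below illustrates this on a shortened version of the abstract (the shortened string and the variable names are illustrative assumptions, not part of the original question):

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative version of the abstract from the question.
elem = ET.fromstring(
    "<AbstractText>caused by fungi, of which <i>Microdochium</i> "
    "species are the most harmful.</AbstractText>"
)
italic = elem.find("i")

head = elem.text      # text before the first child element
inner = italic.text   # text inside <i>...</i>
tail = italic.tail    # text after </i>, stored on the child itself

# Stitching the three pieces together gives the same result as itertext().
rebuilt = head + inner + tail
print(repr(head))
print(repr(inner))
print(repr(tail))
print(rebuilt == "".join(elem.itertext()))
```

Joining .text and .tail by hand only works for one level of nesting, which is why itertext() is the more convenient choice.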
Using itertext():
"".join(root.find(".").itertext()) # "".join(element.itertext())
'Snow mold is a severe plant disease caused by psychrophilic or psychrotolerant fungi, of which Microdochium species are the most harmful.'
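To tie this back to the loop in the question, here is a minimal sketch that builds one complete abstract per article, so the text never ends up in a flat list that has to be regrouped. The SAMPLE document is a hypothetical, heavily simplified stand-in for a PubMed export, and full_text is a helper name introduced only for this example. Joining several AbstractText elements of one article with a space is also an assumption about the desired output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a PubMed export (not real data).
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <ArticleTitle>Snow mold caused by <i>Microdochium</i></ArticleTitle>
    <AbstractText>Snow mold is caused by fungi, of which <i>Microdochium</i> species are the most harmful.</AbstractText>
  </PubmedArticle>
  <PubmedArticle>
    <ArticleTitle>A plain title</ArticleTitle>
    <AbstractText>No inline markup here.</AbstractText>
  </PubmedArticle>
</PubmedArticleSet>"""


def full_text(element):
    """Return all text inside element, including text in nested tags."""
    return "".join(element.itertext())


root = ET.fromstring(SAMPLE)
abstracts = {}
titles = {}
for i, article in enumerate(root):
    # Each article is handled on its own, so its text stays grouped.
    parts = [full_text(node) for node in article.iter("AbstractText")]
    if parts:
        abstracts[i] = " ".join(parts)
    title_node = article.find(".//ArticleTitle")
    if title_node is not None:
        titles[i] = full_text(title_node)

for i in abstracts:
    print(i, titles.get(i), "|", abstracts[i])
```

Because the loop works on one article element at a time, each dictionary entry corresponds to exactly one record, regardless of how many inline tags its abstract contains.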