Extracting text from an XML document with Python ElementTree
I have an XML document in the following format:
<samples>
<sample count="10" intentref="none">
Remember to
<annotation conceptref="cf1">
<annotation conceptref="cf2">record</annotation>
</annotation>
the
<annotation conceptref="cf3">movie</annotation>
<annotation conceptref="cf4">Taxi driver</annotation>
</sample>
</samples>
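I want to extract all of the text, both the text that is not wrapped in annotation tags and the text inside them, so that I can reconstruct the original phrase.
So my expected output is --> Remember to record the movie Taxi driver
The problem is that I apparently cannot retrieve the token 'the'.
Here is my code snippet: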
import xml.etree.ElementTree as ET
samples = ET.fromstring("""
<samples>
<sample count="10" intentref="none">Remember to<annotation conceptref="cf1"><annotation conceptref="cf2">record</annotation></annotation>the<annotation conceptref="cf3">movie</annotation><annotation conceptref="cf4">Taxi driver</annotation></sample>
</samples>
""")
for sample in samples.iter("sample"):
    print('***' + sample.text + '***' + sample.tail)
    for annotation in sample.iter('annotation'):
        print(annotation.text)
        # Iterate over the element directly; getchildren() was removed in Python 3.9
        for nested_annotation in annotation:
            print(nested_annotation.text)
I thought the nested annotation loop would do the trick, but it did not. This is the result:
***Remember to***
None
record
record
movie
Taxi driver
You were very close. I would do it like this:
import xml.etree.ElementTree as ET
samples = ET.fromstring("""<samples>
<sample count="10" intentref="none">
Remember to
<annotation conceptref="cf1">
<annotation conceptref="cf2">record</annotation>
</annotation>
the
<annotation conceptref="cf3">movie</annotation>
<annotation conceptref="cf4">Taxi driver</annotation>
</sample>
</samples>
""")
for page in samples.findall('.//'):
    text = page.text if page.text else ''
    tail = page.tail if page.tail else ''
    print(text + tail)
This gives you:
Remember to
the
record
movie
Taxi driver
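You may notice that the words do not come out in the order you want. You can work around this by remembering each element that has both text and a tail, and emitting the tail only after that element's children have been handled. I am not sure that is the right approach, though.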
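The reason 'the' goes missing is that ElementTree stores text that follows an element's closing tag in that element's .tail, so 'the' is the tail of the outer annotation rather than text of the sample. Below is a minimal sketch of the ordering fix described above, assuming the multi-line XML from the question. The helper name text_in_document_order and the whitespace normalization with split() are illustrative choices, not part of the original answer.

```python
import xml.etree.ElementTree as ET

xml_source = """<samples>
<sample count="10" intentref="none">
Remember to
<annotation conceptref="cf1">
<annotation conceptref="cf2">record</annotation>
</annotation>
the
<annotation conceptref="cf3">movie</annotation>
<annotation conceptref="cf4">Taxi driver</annotation>
</sample>
</samples>"""


def text_in_document_order(element):
    """Hypothetical helper: collect text pieces in the order they appear."""
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        # Everything inside the child comes first ...
        parts.extend(text_in_document_order(child))
        # ... and the child's tail (text after its closing tag) comes next.
        if child.tail:
            parts.append(child.tail)
    return parts


samples = ET.fromstring(xml_source)
sample = samples.find('sample')

# 'the' lives in the tail of the outer annotation element.
outer_tail = sample.find('annotation').tail.strip()

# Join the pieces, then collapse runs of whitespace into single spaces.
ordered_phrase = ' '.join(''.join(text_in_document_order(sample)).split())
print(outer_tail)
print(ordered_phrase)
```

The element's own tail is deliberately not included at the top level, because that text belongs to the parent, not to the sample itself.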
I think you are looking for the itertext method:
# Iterate over every sample element
for sample in tree.xpath('//sample'):
    print(''.join(sample.itertext()))
Full code:
# Load module
import lxml.etree as etree
# Load data
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('data.xml', parser)
# Iterate over every sample element
for sample in tree.xpath('//sample'):
    print(''.join(sample.itertext()))
# programmer l'
# enregistreur
# des
# oeuvres
# La Chevauchée de Virginia
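The standard library's xml.etree.ElementTree also provides Element.itertext(), so the same idea should work without lxml. The following is a sketch under that assumption, using the multi-line XML from the question. The split()/join() step that collapses line breaks into single spaces is an addition for readability and is not part of the answer above.

```python
import xml.etree.ElementTree as ET

xml_source = """<samples>
<sample count="10" intentref="none">
Remember to
<annotation conceptref="cf1">
<annotation conceptref="cf2">record</annotation>
</annotation>
the
<annotation conceptref="cf3">movie</annotation>
<annotation conceptref="cf4">Taxi driver</annotation>
</sample>
</samples>"""

samples = ET.fromstring(xml_source)

phrases = []
for sample in samples.iter('sample'):
    # itertext() yields text and tail pieces in document order
    raw = ''.join(sample.itertext())
    # Collapse line breaks and repeated spaces into single spaces
    phrases.append(' '.join(raw.split()))

print(phrases)
```

Note that this relies on whitespace between the pieces. With the single-line XML from the question's snippet, where the words touch the tags directly, the pieces would be joined without spaces.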
Another solution:
from simplified_scrapy import SimplifiedDoc
html = '''
<samples>
<sample count="10" intentref="none">
Remember to
<annotation conceptref="cf1">
<annotation conceptref="cf2">record</annotation>
</annotation>
the
<annotation conceptref="cf3">movie</annotation>
<annotation conceptref="cf4">Taxi driver</annotation>
</sample>
</samples>
'''
doc = SimplifiedDoc(html)
print(doc.selects('sample').text) # Extract all the text
# Another example
for sample in doc.selects('sample'):
    print(sample.count, sample.annotation.text)
Result:
['Remember to record the movie Taxi driver']
10 record
More examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples