Parsing an XML file with an emphasis tag in it in Python
I'm currently writing a Python script that extracts all the text from an XML file. I'm using the ElementTree library to interpret the data, but I run into a problem when the data is structured like this:
<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>
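When I try to read out the text, I only get the first half of the segment, the part before the Pause tag ("Alright. So what we had").
What I'm trying to figure out is whether there is a way to ignore the tags inside the segment and print out all of the text.
Another solution.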
from simplified_scrapy import SimplifiedDoc,req,utils
html = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''
doc = SimplifiedDoc(html)
print(doc.Segment)
print(doc.Segment.text)
Result:
{'StartTime': '639.752', 'EndTime': '642.270', 'Participant': 'fe016', 'tag': 'Segment', 'html': "\n But I bet it's a good <Pause /> superset of it.\n"}
But I bet it's a good superset of it.
More examples here: https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples
xml = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''
# solution using ETree
from xml.etree import ElementTree as ET
root = ET.fromstring(xml)
# root.text only holds the text before <Pause/>; the text after it is stored in pause.tail
pause = root.find('./Pause')
print(root.text + pause.tail)
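If a segment can contain more than one inline tag, or none at all, looking up a single Pause element may not be enough. A minimal sketch of a more general approach, assuming the standard library's Element.itertext() is acceptable here: it walks the element and yields every piece of text (both text and tail) in document order, so the child tags are effectively ignored. The helper name segment_text is only illustrative, and collapsing the whitespace is an assumption about the desired output.

```python
from xml.etree import ElementTree as ET

xml = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
But I bet it's a good <Pause/> superset of it.
</Segment>'''


def segment_text(element):
    """Return all text inside element, skipping child tags.

    itertext() yields the text and tail strings in document order;
    split() + join() collapses the leftover newlines and double spaces.
    """
    return " ".join("".join(element.itertext()).split())


root = ET.fromstring(xml)
print(segment_text(root))
```

Unlike root.text + pause.tail, this does not raise an error when a segment has no Pause child, since it never has to look one up.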