Parsing an XML file with an emphasis tag in it in Python

Question

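I'm currently writing a Python script that extracts all the text from an XML file. I'm using the ElementTree library to parse the data, but I run into a problem when the data is structured like this: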

<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
  But I bet it's a good <Pause/> superset of it.
</Segment>

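When I try to read out the text, I only get the first half of the segment, the part before the Pause tag ("Alright. So what we had").

What I'm trying to figure out is whether there is a way to ignore the tags inside the segment and print out all of the text.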

Answer 1

Another solution, using the simplified_scrapy library.

from simplified_scrapy import SimplifiedDoc
html = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
  But I bet it's a good <Pause/> superset of it.
</Segment>'''
doc = SimplifiedDoc(html)
print(doc.Segment)
print(doc.Segment.text)

Result:

{'StartTime': '639.752', 'EndTime': '642.270', 'Participant': 'fe016', 'tag': 'Segment', 'html': "\n  But I bet it's a good <Pause /> superset of it.\n"}
But I bet it's a good superset of it.

More examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples

Answer 2

xml = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
  But I bet it's a good <Pause/> superset of it.
</Segment>'''

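# Solution using ElementTree
from xml.etree import ElementTree as ET

root = ET.fromstring(xml)
# root.text holds only the text before the first child element;
# the text that follows <Pause/> is stored in that element's .tail.
pause = root.find('./Pause')
print(root.text + pause.tail)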
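
The snippet above relies on the segment containing exactly one Pause element. As a hedged sketch of a more general variant, `Element.itertext()` from the standard library walks the element and yields every piece of text (both `.text` and `.tail`) in document order, so any number of inline tags can be skipped. The helper name `segment_text` is made up for this illustration, and collapsing the whitespace is an assumption about the output you want.

```python
from xml.etree import ElementTree as ET

xml = '''<Segment StartTime="639.752" EndTime="642.270" Participant="fe016">
  But I bet it's a good <Pause/> superset of it.
</Segment>'''


def segment_text(element):
    """Return all text inside element, ignoring tags (hypothetical helper)."""
    # itertext() yields the element's text and the text/tail of every descendant
    raw = ''.join(element.itertext())
    # Assumption: runs of whitespace and newlines should collapse to one space
    return ' '.join(raw.split())


root = ET.fromstring(xml)
print(segment_text(root))
```

This also works when a segment has several inline tags or none at all, because it never looks up a specific child element.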

python

xml

elementtree