lxml/beautifulsoup: Extracting text between two different tags
I have an "XML" document that contains a whole bunch of stuff like this:
Here is some text about a frog. <hello ref="1"/>This frog is <hello ref="2"/>orange<goodbye idref="2"/> and has polka-dots.<goodbye idref="1"/> Isn't this interesting?
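From this, I need:
- This frog is orange and has polka-dots.
- orange
Short of doing something crazy with regular expressions, is there a way to do this with some combination of lxml and/or BeautifulSoup? Thanks :D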
You can parse the XML with the standard library:
https://docs.python.org/2/library/xml.etree.elementtree.html
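The standard-library route might look roughly like the sketch below. It assumes the fragment is wrapped in a single root element (a `<data>` wrapper is added here so the sample is well-formed) and that every `hello`/`goodbye` marker is an empty, direct child of that root. The helper name `text_between` is made up for illustration. Because ElementTree stores the text that follows an element in that element's `.tail`, collecting the tails between the two markers is enough.

```python
import xml.etree.ElementTree as ET

# Sample from the question, wrapped in a <data> root so it is well-formed XML.
DATA = (
    '<data>Here is some text about a frog. <hello ref="1"/>This frog is '
    '<hello ref="2"/>orange<goodbye idref="2"/> and has polka-dots.'
    '<goodbye idref="1"/> Isn\'t this interesting?</data>'
)


def text_between(root, ref):
    """Collect the text between <hello ref=ref/> and <goodbye idref=ref/>.

    Hypothetical helper: assumes both markers are empty, direct children
    of `root`, so all text of interest lives in the children's tails.
    """
    parts = []
    collecting = False
    for child in root:
        # Stop at the matching closing marker.
        if collecting and child.tag == "goodbye" and child.get("idref") == ref:
            break
        # Start collecting at the matching opening marker.
        if child.tag == "hello" and child.get("ref") == ref:
            collecting = True
        if collecting:
            # .tail is the text that follows this element, or None.
            parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(DATA)
print(text_between(root, "1"))  # This frog is orange and has polka-dots.
print(text_between(root, "2"))  # orange
```

Nested markers such as `ref="2"` inside `ref="1"` need no special handling here, since they contribute only their tail text to the outer range.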
You can iterate over the next siblings of the hello tag with ref="1" until you reach the goodbye element with idref="1":
from bs4 import BeautifulSoup, Tag

data = """
<data>Here is some text about a frog. <hello ref="1"/>This frog is <hello ref="2"/>orange<goodbye idref="2"/> and has polka-dots.<goodbye idref="1"/> Isn't this interesting?</data>
"""

# The "xml" parser requires lxml to be installed.
soup = BeautifulSoup(data, "xml")

text = ""
for elm in soup.find("hello", ref="1").next_siblings:
    # Stop at the matching closing marker.
    if isinstance(elm, Tag) and elm.name == "goodbye" and elm.get("idref") == "1":
        break
    # Tags contribute their inner text; plain strings are appended as-is.
    text += elm.get_text() if isinstance(elm, Tag) else elm

print(text)
Prints:
This frog is orange and has polka-dots.