lxml/beautifulsoup: Extracting text between two different tags

Question

I have an "XML" document that contains a whole bunch of things like this:

Here is some text about a frog.  <hello ref="1"/>This frog is <hello ref="2"/>orange<goodbye idref="2"/> and has polka-dots.<goodbye idref="1"/>  Isn't this interesting?

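From this, I need:

This frog is orange and has polka-dots.
orange

Short of doing something crazy with regular expressions, is there a way to do this with some combination of lxml and/or BeautifulSoup? Thanks :D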

Answer 1

The standard library can parse XML; see xml.etree.ElementTree:

https://docs.python.org/2/library/xml.etree.elementtree.html
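
For what it's worth, here is a rough sketch of how that could look with ElementTree. It is not from the answer above: the text_between helper is a made-up name, and it assumes the snippet is wrapped in a single root element (called data here) with the hello/goodbye markers as direct children of it. ElementTree keeps the text that follows an element in that element's .tail, so the span is rebuilt by joining the tails from the opening marker up to the closing one.

```python
import xml.etree.ElementTree as ET

# Sample from the question, wrapped in an assumed <data> root element.
DATA = (
    '<data>Here is some text about a frog.  <hello ref="1"/>This frog is '
    '<hello ref="2"/>orange<goodbye idref="2"/> and has polka-dots.'
    '<goodbye idref="1"/>  Isn\'t this interesting?</data>'
)


def text_between(parent, ref):
    """Join the text between <hello ref=ref/> and <goodbye idref=ref/>.

    Hypothetical helper: assumes both markers are direct children of parent.
    """
    parts = []
    inside = False
    for child in parent:
        if child.tag == "goodbye" and child.get("idref") == ref:
            break
        if inside:
            # Text held inside an intermediate element (empty for the markers).
            parts.append("".join(child.itertext()))
        elif child.tag == "hello" and child.get("ref") == ref:
            inside = True
        if inside:
            # The text that follows an element is stored in its .tail.
            parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(DATA)
print(text_between(root, "1"))  # This frog is orange and has polka-dots.
print(text_between(root, "2"))  # orange
```

If the opening marker is missing this returns an empty string, and if the closing marker is missing it runs to the end of the parent, so real data may need stricter checks.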

Answer 2

You can iterate over the next siblings of the hello tag with ref="1" until you hit the goodbye element with idref="1":

from bs4 import BeautifulSoup, Tag

data = """
<data>Here is some text about a frog.  <hello ref="1"/>This frog is <hello ref="2"/>orange<goodbye idref="2"/> and has polka-dots.<goodbye idref="1"/>  Isn't this interesting?</data>
"""
# The "xml" parser requires lxml to be installed.
soup = BeautifulSoup(data, "xml")

text = ""
for elm in soup.find("hello", ref="1").next_siblings:
    # Stop at the goodbye marker that closes the span.
    if isinstance(elm, Tag) and elm.name == "goodbye" and elm.get("idref") == "1":
        break

    # Tags contribute their inner text; plain strings are appended as-is.
    text += elm.get_text() if isinstance(elm, Tag) else elm

print(text)

Prints:

This frog is orange and has polka-dots.
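
The question also asks for the inner span ("orange"). One possible way to get every span is to wrap the sibling walk in a function and call it once per hello tag. This is only a sketch: text_between is a hypothetical helper name, and it uses the built-in "html.parser" so that lxml is not needed (pass "xml" instead, as in the code above, if lxml is installed).

```python
from bs4 import BeautifulSoup, Tag

DATA = (
    '<data>Here is some text about a frog.  <hello ref="1"/>This frog is '
    '<hello ref="2"/>orange<goodbye idref="2"/> and has polka-dots.'
    '<goodbye idref="1"/>  Isn\'t this interesting?</data>'
)


def text_between(soup, ref):
    """Return the text between <hello ref=ref/> and <goodbye idref=ref/>.

    Hypothetical helper; returns None when no matching hello tag exists.
    """
    start = soup.find("hello", ref=ref)
    if start is None:
        return None
    parts = []
    for node in start.next_siblings:
        if isinstance(node, Tag):
            # Stop at the goodbye marker that closes this span.
            if node.name == "goodbye" and node.get("idref") == ref:
                break
            parts.append(node.get_text())
        else:
            # Plain text node between the markers.
            parts.append(str(node))
    return "".join(parts)


soup = BeautifulSoup(DATA, "html.parser")
# Map each ref value to the text enclosed by its hello/goodbye pair.
spans = {tag["ref"]: text_between(soup, tag["ref"]) for tag in soup.find_all("hello")}
print(spans["1"])  # This frog is orange and has polka-dots.
print(spans["2"])  # orange
```

Like the original loop, this only looks at siblings, so it assumes each hello and its matching goodbye share the same parent element.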
