lxml/beautifulsoup: Extracting text between two different tags

I have an "XML" document that contains a whole lot of things like this:

Here is some text about a frog.  <hello ref="1"/>This frog is <hello ref="2"/>orange<goodbye idref="2"/> and has polka-dots.<goodbye idref="1"/>  Isn't this interesting?

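From this, I need:

  1. This frog is orange and has polka-dots.
  2. orange

Is there a way to do this with some combination of lxml and/or BeautifulSoup, short of doing something crazy with regular expressions? Thanks :D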

XML can be parsed with the standard library's xml.etree.ElementTree module:

https://docs.python.org/2/library/xml.etree.elementtree.html
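
As a rough sketch of how the standard-library route could look: ElementTree stores the text that follows an element in that element's `tail` attribute, so the text between two markers can be collected from the tails of the elements in between. The `<data>` wrapper and the helper name `text_between` are assumptions made for illustration, and the sketch assumes the markers are empty elements that are all direct children of the same parent.

```python
import xml.etree.ElementTree as ET

# The sample from the question, wrapped in a <data> root so it is well-formed XML.
data = (
    '<data>Here is some text about a frog.  <hello ref="1"/>This frog is '
    '<hello ref="2"/>orange<goodbye idref="2"/> and has polka-dots.'
    '<goodbye idref="1"/>  Isn\'t this interesting?</data>'
)


def text_between(root, ref):
    """Join the text between <hello ref=ref/> and <goodbye idref=ref/>.

    Hypothetical helper: only looks at direct children of `root` and
    ignores any text nested inside the intermediate elements.
    """
    parts = []
    collecting = False
    for child in root:
        # Stop before the closing marker's tail is added.
        if child.tag == "goodbye" and child.get("idref") == ref:
            break
        if child.tag == "hello" and child.get("ref") == ref:
            collecting = True
        if collecting:
            # `tail` is the text that follows this element, or None.
            parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(data)
print(text_between(root, "1"))
print(text_between(root, "2"))
```

Calling the helper with "1" and "2" should give the two strings asked for in the question. If the markers can appear at different nesting levels, this flat walk is not enough and a document-order traversal would be needed instead.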

You can iterate over the next siblings of the hello tag with ref="1" until you reach the goodbye element with idref="1":

from bs4 import BeautifulSoup, Tag

data = """
<data>Here is some text about a frog.  <hello ref="1"/>This frog is <hello ref="2"/>orange<goodbye idref="2"/> and has polka-dots.<goodbye idref="1"/>  Isn't this interesting?</data>
"""
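soup = BeautifulSoup(data, "xml")  # the "xml" parser requires lxml to be installed

text = ""
# Walk the siblings that follow <hello ref="1"/> and stop at the matching <goodbye idref="1"/>.
for elm in soup.find("hello", ref="1").next_siblings:
    if isinstance(elm, Tag) and elm.name == "goodbye" and elm.get("idref") == "1":
        break

    # Tags contribute their inner text; plain strings are appended as-is.
    text += elm.get_text() if isinstance(elm, Tag) else elm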

print(text)

Prints:

This frog is orange and has polka-dots.
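
The loop above is hard-coded to ref="1", while the question asks for the text of every hello/goodbye pair. One possible way to cover both cases is to wrap the same sibling walk in a function that takes the reference value. The function name `text_between` is an assumption made for illustration, and the sketch still assumes that each goodbye marker is a sibling of its hello marker.

```python
from bs4 import BeautifulSoup, Tag

data = """
<data>Here is some text about a frog.  <hello ref="1"/>This frog is <hello ref="2"/>orange<goodbye idref="2"/> and has polka-dots.<goodbye idref="1"/>  Isn't this interesting?</data>
"""


def text_between(soup, ref):
    """Return the text between <hello ref=ref/> and <goodbye idref=ref/>.

    Hypothetical helper; returns None when no matching hello tag exists.
    """
    start = soup.find("hello", ref=ref)
    if start is None:
        return None

    parts = []
    for elm in start.next_siblings:
        if isinstance(elm, Tag):
            # Stop at the goodbye marker that closes this reference.
            if elm.name == "goodbye" and elm.get("idref") == ref:
                break
            parts.append(elm.get_text())
        else:
            # Plain text node between the markers.
            parts.append(str(elm))
    return "".join(parts)


soup = BeautifulSoup(data, "xml")  # needs lxml installed
for ref in ("1", "2"):
    print(ref, text_between(soup, ref))
```

With the sample data this should yield the full sentence for "1" and just "orange" for "2".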