How to select only this text node using BeautifulSoup and Python?

Question

I have this HTML structure:

<div class="foo">
    <h3>Title</h3>
    <br>Some text I want to retrieve. <br><br> This text too.
    <br> (numbers and position of "br" tag indetermined) And this one too.
    <div class="subfoo">Some other text I don't want.</div>
</div>

In my Python script, I wrote:

import bs4

# res is assumed to be an HTTP response fetched earlier (e.g. with requests)
exampleSoup = bs4.BeautifulSoup(res.text, "html.parser")
elems = exampleSoup.select('.foo')
print(elems[0].getText())

As expected, I get the full text:

Title
Some text I want to retrieve.
Some other text I don't want.

How can I get only the strings in the div that are not wrapped in a tag, i.e. "Some text I want to retrieve. This text too. And this one too."? Thanks for your help.

Answer 1

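You can use .next_sibling to get the next node at the same level of the tree. That node can be a bare text node, such as the string that follows the <h3> here.

Example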

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> print(soup.prettify())
<html>
 <body>
  <div class="foo">
   <h3>
    Title
   </h3>
   Some text I want to retrieve.
   <div class="subfoo">
    Some other text I don't want.
   </div>
  </div>
 </body>
</html>

>>> print(soup.find('div', { 'class' : 'foo' } ).h3.next_sibling.strip())
Some text I want to retrieve.
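
Note that .next_sibling returns a single node, so with the markup from the question, where <br> tags are interleaved with the text, it only reaches the node directly after the <h3>. One possible way to collect every untagged string is to walk the direct children of the div and keep only the text nodes. This is a minimal sketch, not a definitive solution: it assumes the "html.parser" backend, that the wanted text always sits directly inside div.foo, and a shortened copy of the question's markup stored in a hypothetical variable named html.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the markup from the question (assumption: same layout).
html = """
<div class="foo">
    <h3>Title</h3>
    <br>Some text I want to retrieve. <br><br> This text too.
    <br> And this one too.
    <div class="subfoo">Some other text I don't want.</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.select_one(".foo")

# Keep only direct children that are bare strings (text nodes),
# skipping tags such as <h3>, <br> and the nested <div>,
# and dropping strings that are whitespace only.
parts = [
    child.strip()
    for child in div.children
    if isinstance(child, NavigableString) and child.strip()
]

text = " ".join(parts)
print(text)
```

Because only direct children are inspected, the text inside <h3> and inside div.subfoo is never visited, however many <br> tags separate the wanted strings.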

python

beautifulsoup

data-extraction