Isolating divs in Beautiful Soup

Question

I have a small problem with BeautifulSoup and Python.

I'm trying to isolate the title inside the link in each box div:

<a href="">TITLE</a>

The code I'm using is:

for box in soup('div', {'class': 'box'}):
    for a in box.findnext('a'):
        print(a)

This works well, but one div is causing a problem. The usual one is:

<div class='box'>

The awkward one is:

<div class='box sponsored'>

How can I select only the first kind of box and not the sponsored one?

Thanks

Answer 1

BeautifulSoup has special handling for the class attribute:

Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes.
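To see why the question's search also returns the sponsored div, here is a minimal sketch. It assumes bs4 is installed and uses the built-in 'html.parser'; the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one plain box and one sponsored box.
html = """
<div class='box'>test1</div>
<div class='box sponsored'>test2</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The class attribute is parsed into a list of values, one per CSS class.
class_lists = [div["class"] for div in soup.find_all("div")]

# Searching by class matches a tag if ANY of its classes equals "box",
# so the sponsored div is returned as well.
matched = [div.get_text(strip=True) for div in soup.find_all("div", class_="box")]

print(class_lists)  # [['box'], ['box', 'sponsored']]
print(matched)      # ['test1', 'test2']
```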

One way to force BeautifulSoup to match only div elements whose class attribute is exactly box is to use the following CSS selector:

soup.select('div[class=box]')

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <div class='box'>
...     test1
... </div>
... <div class='box sponsored'>
...     test2
... </div>
... """
>>> 
>>> soup = BeautifulSoup(data, 'html.parser')
>>> 
>>> for div in soup.select('div[class=box]'):
...     print(div.text.strip())
... 
test1
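Two other ways to get the same result, sketched under assumptions: a filter function passed to find_all, and a :not() selector, which needs a reasonably recent bs4 whose select() is backed by soupsieve. The helper name has_only_box_class and the sample links are hypothetical, added only to mirror the question.

```python
from bs4 import BeautifulSoup

# Illustrative markup mirroring the question: each box holds a link with a title.
data = """
<div class='box'><a href="">TITLE 1</a></div>
<div class='box sponsored'><a href="">TITLE 2</a></div>
"""
soup = BeautifulSoup(data, "html.parser")


def has_only_box_class(tag):
    """Hypothetical helper: true for a div whose class list is exactly ['box']."""
    return tag.name == "div" and tag.get("class") == ["box"]


# Option 1: a filter function compares the whole class list, not any single class.
exact = [div.find("a").get_text(strip=True) for div in soup.find_all(has_only_box_class)]

# Option 2: keep box divs but exclude the ones that also carry 'sponsored'.
excluded = [
    div.find("a").get_text(strip=True)
    for div in soup.select("div.box:not(.sponsored)")
]

print(exact)     # ['TITLE 1']
print(excluded)  # ['TITLE 1']
```

The two options differ when a div has some other extra class: the filter function rejects it, while the :not() selector only rejects sponsored boxes.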

python

beautifulsoup