Isolating divs in Beautiful Soup
I have a small problem with BeautifulSoup and Python.
I am trying to isolate the title in
<a href="">TITLE</a>
The code I am using is:
for box in soup('div', {'class': 'box'}):
    for a in box.find_next('a'):
        print(a)
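This works well, but one div is causing a problem. The usual one is:
<div class='box'>
The awkward one is:
<div class='box sponsored'>
How can I select only the first kind of box and not the sponsored one?
Thanks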
BeautifulSoup has special handling for the class attribute:
Remember that a single tag can have multiple values for its “class”
attribute. When you search for a tag that matches a certain CSS class,
you’re matching against any of its CSS classes.
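The behaviour described in the quote can be seen in a small sketch. The sample markup and variable names here are made up for illustration and are not taken from the asker's page. Searching for the class box also returns the sponsored div, because box is one of its classes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the asker's real page
html = """
<div class="box"><a href="">TITLE</a></div>
<div class="box sponsored"><a href="">AD</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# A class search matches any one of a tag's classes, so both divs are found
matches = soup.find_all("div", {"class": "box"})

# Each tag exposes its class attribute as a list of values
class_lists = [div["class"] for div in matches]
print(class_lists)
```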
One way to force BeautifulSoup to match only div elements whose class attribute is exactly box is to use the following CSS selector:
soup.select('div[class=box]')
Demo:
>>> from bs4 import BeautifulSoup
>>>
>>> data = """
... <div class='box'>
... test1
... </div>
... <div class='box sponsored'>
... test2
... </div>
... """
>>>
>>> soup = BeautifulSoup(data, 'html')
>>>
>>> for div in soup.select('div[class=box]'):
...     print(div.text.strip())
...
test1
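To tie this back to the original goal of pulling the link title out of each non-sponsored box, here is a minimal sketch. It assumes the link sits inside the div, which the question does not confirm. The markup and the helper name is_plain_box are hypothetical. It shows two options: a function filter that compares the whole class list, and the same attribute selector combined with a descendant selector.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; assumes each link is nested inside its box
html = """
<div class="box"><a href="">TITLE</a></div>
<div class="box sponsored"><a href="">AD</a></div>
"""

soup = BeautifulSoup(html, "html.parser")


def is_plain_box(tag):
    # Hypothetical helper: true only for <div> tags whose class list is exactly ["box"]
    return tag.name == "div" and tag.get("class") == ["box"]


# Option 1: function filter passed to find_all
titles = []
for box in soup.find_all(is_plain_box):
    link = box.find("a")  # first <a> among the box's descendants
    if link is not None:
        titles.append(link.get_text(strip=True))

# Option 2: exact-match attribute selector plus a descendant selector
css_titles = [a.get_text(strip=True) for a in soup.select('div[class="box"] a')]

print(titles)
print(css_titles)
```

Both options compare against the full class value, so a div with class="box sponsored" is skipped. Note that a div written as class="box " or with the classes in another order would also be skipped by the exact match, which may or may not be what you want.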