How to detect the bottom of the page using BeautifulSoup and get to the next page?
I'm trying to scrape a web page and get the URL of every article. Here is my code:
import requests
from bs4 import BeautifulSoup

main_url = "https://www.rfa.org/vietnamese/news/programs/story_archive?year=2006&month=1"
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
article_links = soup.find_all("div", {"class": "sectionteaser archive"})
for div in article_links:
    links = div.find_all('a')
    for a in links:
        print(a['href'])
The code above only handles the first page, but there are more pages to process. How can I detect how many articles there are and fetch them all?
You can loop for as long as there is next-page pagination, which can be tested by the presence of an element with class next. On each iteration, increase the offset in the request by 15.
import requests
from bs4 import BeautifulSoup as bs

n = 0
with requests.Session() as s:
    while True:
        url = f'https://www.rfa.org/vietnamese/news/programs/story_archive?year=2006&month=1&b_start:int={n*15}'
        r = s.get(url)
        soup = bs(r.text, 'lxml')
        print([i.text.strip() for i in soup.select('.sectionteaser a > span')])
        if soup.select_one('.next') is None:
            break
        n += 1
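The stopping condition itself can be checked offline, without hitting the site: `select_one('.next')` returns `None` when no next-page control exists in the HTML. A minimal sketch, using simplified pagination markup that is an assumption for illustration (not RFA's actual page structure):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified pagination markup for illustration only.
page_with_next = """
<div class="pagination">
  <span class="previous"><a href="?b_start:int=0">Prev</a></span>
  <span class="next"><a href="?b_start:int=30">Next</a></span>
</div>
"""
last_page = """
<div class="pagination">
  <span class="previous"><a href="?b_start:int=15">Prev</a></span>
</div>
"""

def has_next_page(html):
    # Same test the scraping loop uses: keep paginating
    # while an element with class "next" is present.
    return BeautifulSoup(html, "html.parser").select_one(".next") is not None

print(has_next_page(page_with_next))  # True
print(has_next_page(last_page))       # False
```

Testing the selector against saved or hand-written HTML like this is a cheap way to confirm the loop will terminate before running the full crawl.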