How to detect the bottom of the page using BeautifulSoup and get to the next page?
I'm trying to scrape a web page and get the URL of every article. Here is my code:
import requests
from bs4 import BeautifulSoup

main_url = "https://www.rfa.org/vietnamese/news/programs/story_archive?year=2006&month=1"
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
article_links = soup.find_all("div", {"class": "sectionteaser archive"})
for div in article_links:
    links = div.find_all('a')
    for a in links:
        print(a['href'])
The code above only handles the first page, but there are more pages to process. How can I detect how many articles there are and fetch them all?
You can loop for as long as there is next-page pagination, which can be tested by the presence of an element with class next. On each iteration, increase the offset in the request by 15.
import requests
from bs4 import BeautifulSoup as bs

n = 0
with requests.Session() as s:
    while True:
        url = f'https://www.rfa.org/vietnamese/news/programs/story_archive?year=2006&month=1&b_start:int={n*15}'
        r = s.get(url)
        soup = bs(r.text, 'lxml')
        print([i.text.strip() for i in soup.select('.sectionteaser a > span')])
        if soup.select_one('.next') is None:
            break
        n += 1
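The stopping condition itself can be checked offline, without hitting the site: `select_one('.next')` returns `None` when no next-page control exists in the HTML. A minimal sketch, using simplified pagination markup that is an assumption for illustration (not RFA's actual page structure):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified pagination markup for illustration only.
page_with_next = """
<div class="pagination">
  <span class="previous"><a href="?b_start:int=0">Prev</a></span>
  <span class="next"><a href="?b_start:int=30">Next</a></span>
</div>
"""
last_page = """
<div class="pagination">
  <span class="previous"><a href="?b_start:int=15">Prev</a></span>
</div>
"""

def has_next_page(html):
    # Same test the scraping loop uses: keep paginating
    # while an element with class "next" is present.
    return BeautifulSoup(html, "html.parser").select_one(".next") is not None

print(has_next_page(page_with_next))  # True
print(has_next_page(last_page))       # False
```

Testing the selector against saved or hand-written HTML like this is a cheap way to confirm the loop will terminate before running the full crawl.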