How to get post links from the whole page using BeautifulSoup and Selenium
I'm running into a problem while web scraping with BeautifulSoup and Selenium. I want to extract data from pages 1-20, but somehow I only manage to pull data from the first 10 pages. The listing actually has more than 20 pages, yet my code stops returning data after page 10. Does anyone know how to extract all the data without running into this page limit?
import time
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver', options=options)

apartment_urls = []

try:
    for page in range(1, 20):
        print(f"Extraction Page# {page}")
        page = "https://www.99.co/id/sewa/apartemen/jakarta?kamar_tidur_min=1&kamar_tidur_maks=4&kamar_mandi_min=1&kamar_mandi_maks=4&tipe_sewa=bulanan&hlmn=" + str(page)
        driver.get(page)
        time.sleep(5)

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        apart_info_list = soup.select('h2.search-card-redesign__address a[href]')
        for link in apart_info_list:
            get_url = '{0}{1}'.format('https://www.99.co', link['href'])
            print(get_url)
            apartment_urls.append(get_url)
except:
    print("Good Bye!")
This is the output of the code: from page 10, 11, 12 and so on I can no longer get any data.
With the following approach, pagination works fine with no page limit: instead of building page URLs, the script keeps clicking the site's next button until there is none left.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.99.co/id/sewa/apartemen/jakarta?kamar_tidur_min=1&kamar_tidur_maks=4&kamar_mandi_min=1&kamar_mandi_maks=4&tipe_sewa=bulanan')
time.sleep(5)
driver.maximize_window()

while True:
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    apart_info_list = soup.select('h2.search-card-redesign__address a')
    for link in apart_info_list:
        get_url = '{0}{1}'.format('https://www.99.co', link['href'])
        print(get_url)

    # find_elements returns an empty list when there is no next button,
    # so the loop stops cleanly on the last page instead of raising
    # NoSuchElementException.
    next_button = driver.find_elements(By.CSS_SELECTOR, 'li.next > a')
    if next_button:
        next_button[0].click()
        time.sleep(3)
    else:
        break
The snippet above is written for webdriver_manager, in case you'd rather let ChromeDriverManager install the driver for you.
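The question also wants the links collected into apartment_urls rather than just printed. A minimal sketch of a helper that does this, reusing the same selector (the set-based de-duplication is an assumption, not part of the original answer):

from bs4 import BeautifulSoup

def collect_links(page_source, base_url='https://www.99.co', seen=None):
    """Parse one results page and return absolute listing URLs, skipping duplicates."""
    seen = set() if seen is None else seen
    soup = BeautifulSoup(page_source, 'html.parser')
    links = []
    for a in soup.select('h2.search-card-redesign__address a[href]'):
        url = base_url + a['href']
        if url not in seen:  # the same listing can reappear on a later page
            seen.add(url)
            links.append(url)
    return links

Inside the while loop it could be called as apartment_urls += collect_links(driver.page_source, seen=seen), with seen = set() and apartment_urls = [] initialised before the loop.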
Alternative solution: since the next-page URL is not dynamic, building the URL from the next link instead of clicking it also works fine.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://www.99.co/id/sewa/apartemen/jakarta?kamar_tidur_min=1&kamar_tidur_maks=4&kamar_mandi_min=1&kamar_mandi_maks=4&tipe_sewa=bulanan')
time.sleep(5)
driver.maximize_window()

while True:
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    apart_info_list = soup.select('h2.search-card-redesign__address a')
    for link in apart_info_list:
        get_url = '{0}{1}'.format('https://www.99.co', link['href'])
        print(get_url)

    # Instead of clicking the next button, build the next page's URL
    # from the href of the "next" link and navigate to it directly.
    next_page = soup.select_one('li.next > a[href]')
    if next_page:
        driver.get(f"https://www.99.co{next_page['href']}")
        time.sleep(3)
    else:
        break
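A fixed time.sleep(5) can also wait too little or too long. As a sketch (not part of the original answer), Selenium's explicit wait can be used so the loop only continues once at least one listing card is actually present:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one listing card to appear,
# instead of always sleeping for a fixed number of seconds.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, 'h2.search-card-redesign__address a')
    )
)

This would replace the time.sleep(5) call after driver.get(...) in either version of the loop.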