How to stop the Selenium WebDriver after reaching the last page while scraping a website?
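The amount of data (the number of pages) on the site keeps changing, and I need to scrape all the pages by looping through the pagination.
Website: https://monentreprise.bj/page/annonces
Code I tried: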
xpath = "//*[@id='yw3']/li[12]/a"
while True:
    next_page = driver.find_elements(By.XPATH, xpath)
    if len(next_page) < 1:
        print("No more pages")
        break
    else:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath))).click()
        print('ok')
ok is printed endlessly, because the condition if len(next_page) < 1 is always False.
For example, I tried the URL monentreprise.bj/page/annonces?Company_page=99999999999999999999999 and it returned page 13, which is the last page.
You could try checking whether the "next page" button is disabled.
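There are several issues here:
- //*[@id='yw3']/li[12]/a is not the correct locator for the next pagination button.
- A better indication that the last page has been reached is to check whether the element matched by the CSS selector .pagination .next has the disabled class.
- You have to scroll down the page before you can click the next page button.
- You have to add a delay after clicking the pagination button; otherwise it will not work.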
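The disabled-class check from the second point can be isolated in a small helper. This is only a minimal sketch: has_class is a hypothetical name, not part of Selenium, and the class strings in the example are assumptions about what the page might return. It compares whole class tokens, so a class such as not-disabled would not be mistaken for disabled, and it tolerates a missing class attribute.

```python
def has_class(class_attr, name):
    """Return True if `name` is one of the whitespace-separated class tokens."""
    # get_attribute("class") may return None when the attribute is absent.
    return name in (class_attr or "").split()


# Example class strings; the exact values on the real page are assumptions.
print(has_class("next disabled", "disabled"))      # True
print(has_class("next", "disabled"))               # False
print(has_class("next not-disabled", "disabled"))  # False
```

With Selenium, the first argument would be the value of get_attribute("class") on the .pagination .next element.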
This code worked for me:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Chrome()
my_url = "https://monentreprise.bj/page/annonces"
driver.get(my_url)

# The "next" list item and the link inside it
next_page_parent = '.pagination .next'
next_page_parent_arrow = '.pagination .next a'

while True:
    # Scroll to the bottom so the pagination controls can be clicked
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(0.5)
    parent = driver.find_element(By.CSS_SELECTOR, next_page_parent)
    class_name = parent.get_attribute("class")
    # On the last page the "next" item carries the disabled class
    if "disabled" in class_name:
        print("No more pages")
        break
    else:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, next_page_parent_arrow))).click()
        # Give the next page time to load before checking again
        time.sleep(1.5)
        print('ok')
The output is:
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
ok
No more pages
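One caveat with a while True loop is that it never ends if the site's markup changes and the disabled class stops appearing. A possible guard, sketched here under assumptions, is to cap the number of clicks. The names click_through_pages, is_last_page, go_next, max_pages and FakePager are all hypothetical, and FakePager merely stands in for the browser so the sketch runs without Selenium.

```python
def click_through_pages(is_last_page, go_next, max_pages=1000):
    """Call go_next() until is_last_page() is true; return the number of clicks.

    max_pages is a safety cap so a markup change cannot cause an endless loop.
    """
    clicks = 0
    while not is_last_page():
        if clicks >= max_pages:
            raise RuntimeError("page limit reached before the last page")
        go_next()
        clicks += 1
    return clicks


# Stand-in for the browser: a fake site that starts on page 1.
class FakePager:
    def __init__(self, total_pages):
        self.page = 1
        self.total_pages = total_pages

    def is_last_page(self):
        return self.page >= self.total_pages

    def go_next(self):
        self.page += 1


pager = FakePager(total_pages=5)
print(click_through_pages(pager.is_last_page, pager.go_next))  # 4
```

With a real driver, is_last_page would read the class of the .pagination .next element, and go_next would perform the scroll, wait and click steps from the loop in the answer.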