Crawl data by Selenium but throws errors TimeoutException

Question

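I'm trying to scrape the reviews from a website. For a single page it runs fine. However, when I loop over many pages, it throws an error:

raise TimeoutException(message, screen, stacktrace) TimeoutException

I have tried increasing the wait time from 30 to 50 seconds, but it still fails. Here is my code: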

import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from datetime import datetime

start_time = datetime.now()

result = pd.DataFrame()
df = pd.read_excel(r'D:\check_bols.xlsx')
ids = df['ids'].values.tolist() 

link = "https://www.bol.com/nl/ajax/dataLayerEndpoint.html?product_id="

for i in ids:
    
    link3 = link + str(i[-17:].replace("/",""))
    op = webdriver.ChromeOptions()
    op.add_argument('--ignore-certificate-errors')
    op.add_argument('--incognito')
    op.add_argument('--headless')
    driver = webdriver.Chrome(executable_path='D:/chromedriver.exe',options=op)
    driver.get(i)
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()

    soup = BeautifulSoup(driver.page_source, 'lxml')

    product_attributes = requests.get(link3).json()

    reviewtitle = [i.get_text() for i in soup.find_all("strong", class_="review__title") ]

    url = [i]*len(reviewtitle)

    productid = [product_attributes["dmp"]["productId"]]*len(reviewtitle)
  
    content= [i.get_text().strip()  for i in soup.find_all("div",attrs={"class":"review__body"})]
    
    author = [i.get_text() for i in soup.find_all("li",attrs={"data-test":"review-author-name"})]

    date  = [i.get_text() for i in soup.find_all("li",attrs={"data-test":"review-author-date"})]

    output = pd.DataFrame(list(zip(url, productid,reviewtitle, author, content, date )))
    
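    # DataFrame.append returned a new frame (and was removed in pandas 2.0), so reassign
    result = pd.concat([result, output], ignore_index=True)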
    
    result.to_excel(r'D:\bols.xlsx', index=False)
    
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Here are some of the links I'm trying to scrape:

link1 link2

Answer 1

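As mentioned in the comments, you are timing out because you are looking for a button that doesn't exist.

You need to catch the error and skip the rows that fail. You can do this with a try and except.

I've put together an example for you. It is hard-coded to one URL (because I don't have your data sheet), and it uses a fixed loop that deliberately keeps trying to click the "show more" button even after it has disappeared.

With this solution, be careful about the synchronization time. Each time WebDriverWait is called for an element that doesn't exist, it waits for the full duration. You will need to exit the expand loop once you're done (the first time you hit the error) and keep the sync time tight, otherwise the script will be slow.

First, add these to your imports: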

from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException

Then this will run instead of raising an error:

# note: a fixed URL
driver.get('https://www.bol.com/nl/p/Matras-180x200-7-zones-koudschuim-premium-plus-tijk-15-cm-hard/9200000130825457/')

# accept the cookie once
WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()

# fixed loop: keeps trying even after the button has gone
for _ in range(10):
    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
        print("I pressed load more")
    except (TimeoutException, StaleElementReferenceException):
        print("No more to load - but i didn't fail")

The console output looks like this:

DevTools listening on ws://127.0.0.1:51223/devtools/browser/4b1a0033-8294-428d-802a-d0d2127c4b6f

I pressed load more

I pressed load more

No more to load - but i didn't fail

No more to load - but i didn't fail

No more to load - but i didn't fail

No more to load - but i didn't fail (and so on).

This is how it looked in my browser. Note the size of the scrollbar for the link I used: it looks like the page contains all the reviews.
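
The note above about exiting the expand loop on the first error and keeping the wait short could be sketched roughly as follows. With a 50 second wait, every click attempt after the button has gone costs the full 50 seconds, so ten attempts can waste several minutes per page. The helper names `click_until_gone` and `expand_reviews`, the 5 second timeout and the `max_clicks` cap are my own assumptions for illustration, not something from the question; tune the timeout to how slowly the page really loads.

```python
def click_until_gone(click_once, stop_on, max_clicks=100):
    """Call click_once() until it raises one of the stop_on exceptions.

    Returns the number of successful clicks. max_clicks is a safety cap
    (an assumed value) so a page that never runs out cannot loop forever.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            click_once()
        except stop_on:
            break  # first failure: the button is gone, so stop waiting for it
        clicks += 1
    return clicks


def expand_reviews(driver, timeout=5):
    """Click the 'load more' button until it disappears (hypothetical helper)."""
    # Imported lazily so click_until_gone stays usable without Selenium installed.
    from selenium.common.exceptions import (
        StaleElementReferenceException,
        TimeoutException,
    )
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Same selector as in the question.
    locator = (By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button")

    def click_once():
        # Short wait: a missing button now costs `timeout` seconds, not 50.
        WebDriverWait(driver, timeout).until(EC.element_to_be_clickable(locator)).click()

    return click_until_gone(
        click_once, (TimeoutException, StaleElementReferenceException)
    )
```

Calling `expand_reviews(driver)` after accepting the cookie banner would then replace the fixed `range(10)` loop, and the return value tells you how many times the button was pressed.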

Answer 2

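I would suggest using an infinite while loop together with a try..except block. If the element is found, it is clicked; otherwise execution goes to the except block and breaks out of the while loop.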

from selenium.common.exceptions import TimeoutException

driver.get("https://www.bol.com/nl/p/Matras-180x200-7-zones-koudschuim-premium-plus-tijk-15-cm-hard/9200000130825457/")
WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
while True:
    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
        print("Load more button found and clicked")
    except TimeoutException:
        # the wait timed out, so the button is no longer on the page
        print("No more load more button available on the page. Exiting...")
        break

Your console output will look like this:

Load more button found and clicked
Load more button found and clicked
Load more button found and clicked
Load more button found and clicked
No more load more button available on the page. Exiting...
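
To carry this over to the loop over many product URLs from the question, one option is to wrap the per-page work in a try..except so that a page that times out is recorded and skipped instead of stopping the whole run. This is only a minimal sketch: `scrape_all`, `scrape_one` and `skip_on` are names I made up for illustration, and `scrape_reviews` in the comment is a placeholder for the per-page logic from the question.

```python
def scrape_all(urls, scrape_one, skip_on):
    """Run scrape_one(url) for every URL, skipping URLs that raise skip_on.

    Returns (rows, failed): rows is a flat list of everything scrape_one
    returned, failed is a list of (url, error message) pairs.
    """
    rows, failed = [], []
    for url in urls:
        try:
            rows.extend(scrape_one(url))
        except skip_on as exc:
            failed.append((url, str(exc)))  # remember the failure and move on
    return rows, failed


# Hypothetical wiring with Selenium (scrape_reviews is a placeholder):
#
#   from selenium.common.exceptions import TimeoutException
#   rows, failed = scrape_all(
#       ids, lambda url: scrape_reviews(driver, url), TimeoutException
#   )
```

Collecting the rows in a plain list and building the DataFrame once at the end also avoids writing the Excel file on every iteration, and the `failed` list shows which URLs to retry or inspect by hand.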

