Crawling data with Selenium throws TimeoutException
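I am trying to crawl the reviews on a website. For one page it runs fine. However, when I create a loop to crawl many pages, it throws this error:

raise TimeoutException(message, screen, stacktrace)
TimeoutException

I have tried increasing the wait time from 30 seconds to 50 seconds, but it still does not work.
Here is my code: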

import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from datetime import datetime

start_time = datetime.now()
result = pd.DataFrame()
df = pd.read_excel(r'D:\check_bols.xlsx')
ids = df['ids'].values.tolist()
link = "https://www.bol.com/nl/ajax/dataLayerEndpoint.html?product_id="
for i in ids:
    link3 = link + str(i[-17:].replace("/",""))
    op = webdriver.ChromeOptions()
    op.add_argument('--ignore-certificate-errors')
    op.add_argument('--incognito')
    op.add_argument('--headless')
    driver = webdriver.Chrome(executable_path='D:/chromedriver.exe',options=op)
    driver.get(i)
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
    WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
    soup = BeautifulSoup(driver.page_source, 'lxml')
    product_attributes = requests.get(link3).json()
    reviewtitle = [i.get_text() for i in soup.find_all("strong", class_="review__title") ]
    url = [i]*len(reviewtitle)
    productid = [product_attributes["dmp"]["productId"]]*len(reviewtitle)
    content= [i.get_text().strip() for i in soup.find_all("div",attrs={"class":"review__body"})]
    author = [i.get_text() for i in soup.find_all("li",attrs={"data-test":"review-author-name"})]
    date = [i.get_text() for i in soup.find_all("li",attrs={"data-test":"review-author-date"})]
    output = pd.DataFrame(list(zip(url, productid,reviewtitle, author, content, date )))
    result.append(output)
result.to_excel(r'D:\bols.xlsx', index=False)
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))

Here are some of the links I am trying to crawl:
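As mentioned in the comments, you are timing out because you are looking for a button that does not exist.
You need to catch the error and skip the rows that fail. You can do this with a try and except.
I put together an example for you. It is hard-coded to one URL (because I don't have your data sheet) and it is a fixed loop whose purpose is to keep trying to click the "show more" button even after it has gone.
Be careful with the synchronisation time in this solution. Every call to WebDriverWait waits for the full timeout when the element is not there. Once this works, you will want to exit the expand loop the first time an error occurs and keep the wait time tight, otherwise the script will be slow.
First, add these to your imports: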
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException
Then this will run without raising the error:

# note: a fixed URL
driver.get('https://www.bol.com/nl/p/Matras-180x200-7-zones-koudschuim-premium-plus-tijk-15-cm-hard/9200000130825457/')
# accept the cookie once
WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
for i in range(10):
    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
        print("I pressed load more")
    except (TimeoutException, StaleElementReferenceException):
        # the button is gone (or went stale): report it and carry on
        print("No more to load - but i didn't fail")

The console output looks like this:
DevTools listening on ws://127.0.0.1:51223/devtools/browser/4b1a0033-8294-428d-802a-d0d2127c4b6f
I pressed load more
I pressed load more
No more to load - but i didn't fail
No more to load - but i didn't fail
No more to load - but i didn't fail
No more to load - but i didn't fail
(and so on).
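
Each "No more to load" line above costs one full WebDriverWait timeout, which is why leaving the loop at the first failure matters. A minimal sketch of that idea follows. The names `click_until_gone`, `click_once`, `stop_on` and `max_clicks` are hypothetical, not part of Selenium, and the 5 second timeout in the commented wiring is an assumed value. The Selenium call is passed in as a callable so the loop logic can be checked without a browser.

```python
def click_until_gone(click_once, stop_on, max_clicks=100):
    """Call click_once() until it raises one of stop_on or max_clicks is reached.

    Returns the number of clicks that succeeded.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            click_once()
        except stop_on:
            # First failure: assume the button is gone and leave at once,
            # so the wait timeout is paid only a single time.
            break
        clicks += 1
    return clicks


# Hypothetical wiring with Selenium (assumes `driver` and the imports
# from the answer above already exist; not executed here):
#
#   load_more = (By.CSS_SELECTOR,
#                "a.review-load-more__button.js-review-load-more-button")
#   n = click_until_gone(
#       lambda: WebDriverWait(driver, 5)
#       .until(EC.element_to_be_clickable(load_more))
#       .click(),
#       stop_on=(TimeoutException, StaleElementReferenceException),
#   )
```

The `max_clicks` cap is only a safety net in case the button never disappears. A shorter timeout makes the final, failing wait cheaper, but it has to stay long enough for the next batch of reviews to load.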
This is what my browser looks like. Note the size of the scroll bar for the link I used; it looks like all the reviews have been loaded.
I would suggest using an infinite while loop with a try..except block. If the element is found it is clicked; otherwise execution goes to the except block and exits the while loop.

from selenium.common.exceptions import TimeoutException

driver.get("https://www.bol.com/nl/p/Matras-180x200-7-zones-koudschuim-premium-plus-tijk-15-cm-hard/9200000130825457/")
WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-test='consent-modal-confirm-btn']>span"))).click()
while True:
    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a.review-load-more__button.js-review-load-more-button"))).click()
        print("Load more button found and clicked")
    except TimeoutException:
        # the wait timed out, so the button is no longer on the page
        print("No more load more button available on the page. Please exit...")
        break

Your console output will look like this:
Load more button found and clicked
Load more button found and clicked
Load more button found and clicked
Load more button found and clicked
No more load more button available on the page. Please exit...
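
For the original problem of looping over many product pages, the same try/except idea can be applied per URL, so that one page without a button does not stop the whole run. This is only a sketch: `scrape_all`, `scrape_one` and `skip_on` are hypothetical names, and `scrape_one` stands in for the Selenium and BeautifulSoup steps from the question.

```python
def scrape_all(urls, scrape_one, skip_on=Exception):
    """Run scrape_one(url) for every URL and skip the ones that fail.

    scrape_one must return a list of rows for a single page.
    Returns (rows, failed), where failed holds (url, error) pairs.
    """
    rows, failed = [], []
    for url in urls:
        try:
            rows.extend(scrape_one(url))
        except skip_on as exc:
            # Remember the failure and move on to the next URL.
            failed.append((url, repr(exc)))
    return rows, failed
```

With Selenium, `skip_on` would typically be `TimeoutException`. The collected rows can be turned into a DataFrame once at the end with `pd.DataFrame(rows)`. Note that `result.append(output)` in the question does not modify `result` in place, so its return value would have to be kept for any data to reach the Excel file.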