Selenium download entire HTML

Question

I have been trying to use Selenium to scrape entire web pages. I expect at least a handful of them to be SPAs (Angular, React, Vue), which is why I am using Selenium.

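I need to download the entire page (if some lazy-loaded content is missing because the page was not scrolled down, that is fine). I have tried adding a time.sleep() delay, but that did not work. Once I have the page, I want to hash it and store it in a database, so that I can compare later and check whether the content has changed. Currently the hash is different every time, because Selenium is not downloading the entire page: a different portion is missing on each run. I have confirmed this on several web pages, not just one.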

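I also have potentially 1000+ web pages to go through by hand just to collect all the links, so I do not have time to find an element on each of them to make sure it has loaded.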

How long this process takes is not important. If it takes more than an hour, so be it; speed does not matter, only accuracy.

If you have an alternative idea, please share it as well.

My driver declaration

 from selenium import webdriver
 from selenium.common.exceptions import WebDriverException

 # Note: this is the Selenium 3 style; Selenium 4 replaces executable_path
 # and chrome_options with a Service object and the options= keyword.
 driverPath = '/usr/lib/chromium-browser/chromedriver'

 def create_web_driver():
     options = webdriver.ChromeOptions()
     options.add_argument('headless')

     # set the window size (Chrome expects width,height)
     options.add_argument('window-size=1200,600')

     # try to initialize the driver
     try:
         driver = webdriver.Chrome(executable_path=driverPath, chrome_options=options)
     except WebDriverException:
         print("failed to start driver at path: " + driverPath)
         raise  # re-raise so an unassigned driver is never returned

     return driver

My URL call (my timeout = 20)

 import hashlib
 import time

 timeout = 20

 driver.get(url)
 time.sleep(timeout)
 content = driver.page_source

 content = content.encode('utf-8')
 hashed_content = hashlib.sha512(content).hexdigest()

^ I get a different hash here every time, because the same URL does not produce the same web page

Answer 1

In my experience, time.sleep() does not work well with dynamic loading times. If the page is JavaScript-heavy, you have to use a WebDriverWait clause.

Something like this:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get(url)

element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "[my-attribute='my-value']")))

Change 10 to whatever timeout you want, and change By.CSS_SELECTOR and its value to whatever type you want to use as a locator reference.

You can also wrap the WebDriverWait in a try/except statement that catches TimeoutException, which you can import from the submodule selenium.common.exceptions, in case you want to set a hard limit.

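If you really want it to keep checking until the page has loaded, you can put it inside a while loop, since I could not find anything in the docs about waiting "forever", but you will have to experiment with it.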
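The two suggestions above (catching the timeout, and retrying in a while loop) can be sketched together. This is only a minimal sketch: `retry_until_success` and `wait_for_element` are hypothetical helper names, and `per_wait`, `pause`, and `max_rounds` are illustrative parameters. The Selenium imports are kept inside `wait_for_element` so the generic loop can be tried without a browser.

```python
import time


def retry_until_success(attempt, retry_on, pause=1.0, max_rounds=None):
    """Call attempt() in a while loop until it stops raising retry_on.

    max_rounds=None retries forever; an integer acts as a hard limit,
    after which the last exception is re-raised to the caller.
    """
    rounds = 0
    while True:
        try:
            return attempt()
        except retry_on:
            rounds += 1
            if max_rounds is not None and rounds >= max_rounds:
                raise  # hard limit reached
            time.sleep(pause)


def wait_for_element(driver, css_selector, per_wait=10, max_rounds=None):
    """Selenium wiring (hypothetical helper; needs selenium and a live driver)."""
    # Imported locally so retry_until_success stays usable without Selenium.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.CSS_SELECTOR, css_selector)
    return retry_until_success(
        lambda: WebDriverWait(driver, per_wait).until(
            EC.presence_of_element_located(locator)
        ),
        retry_on=TimeoutException,
        pause=0,  # WebDriverWait already blocks for per_wait seconds
        max_rounds=max_rounds,
    )
```

Calling `wait_for_element(driver, "[my-attribute='my-value']")` retries indefinitely, while passing `max_rounds=6` gives up after roughly a minute with the default `per_wait`.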

Answer 2

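Since the Application Under Test (AUT) is based on Angular, React, or Vue, Selenium seems to be the perfect choice in that case.

Now, the fact that you are fine with "some content isn't loaded from lazy loading because of not scrolling" makes the use case feasible. However, "...do not have time to find an element on them to make sure it is loaded..." cannot really be compensated for by inducing time.sleep(), as time.sleep() has certain drawbacks. You can find a detailed discussion in "How to sleep webdriver in python for milliseconds". It is worth mentioning that the state of the HTML DOM will be different for each of the 1000-odd web pages.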

Solution

A couple of viable solutions:

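A potential solution would be to induce WebDriverWait and ensure that some HTML elements are loaded, as per the discussion "How can I make sure if some HTML elements are loaded for Selenium + Python?", validating at least one of the following:
- Page title
- Page heading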
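As a rough illustration of that first option, the check can be written as a custom wait condition that works on any page without a page-specific selector. This is a sketch under assumptions: `title_or_heading_present` and `wait_for_title_or_heading` are hypothetical names, and treating an `<h1>` element as the "page heading" is an assumption that may not hold for every site.

```python
def title_or_heading_present(driver):
    """Wait condition: True once the page has a non-empty title or an <h1>.

    Assumption: the page heading is an <h1> element.
    """
    title = driver.title
    if title and title.strip():
        return True
    # "tag name" is the string value behind selenium's By.TAG_NAME
    return len(driver.find_elements("tag name", "h1")) > 0


def wait_for_title_or_heading(driver, timeout=30):
    """Block until the condition holds (needs selenium and a live driver)."""
    # Imported locally so the condition above can be tried without Selenium.
    from selenium.webdriver.support.ui import WebDriverWait

    # until() polls the callable and raises TimeoutException after `timeout`.
    return WebDriverWait(driver, timeout).until(title_or_heading_present)
```

After `driver.get(url)`, calling `wait_for_title_or_heading(driver)` before reading `driver.page_source` replaces the fixed `time.sleep()`. It only confirms that the title or heading exists, not that every script on the page has finished.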
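Another solution would be to tweak the capability pageLoadStrategy. You can set the pageLoadStrategy for all the 1000-odd web pages to a common point by assigning one of these values:
- normal (full page load)
- eager (interactive)
- none
You can find a detailed discussion in "How to make Selenium not wait till full page load, which has a slow script?".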

If you implement pageLoadStrategy, the page_source method will be triggered at the same trip point each time, and you may see identical hashed_content.
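A minimal sketch of how the capability might be set and combined with the hashing step from the question. The helper names (`apply_page_load_strategy`, `hash_page`, `make_driver`) are hypothetical, and the sketch assumes a Selenium version whose options object provides `set_capability` and whose `webdriver.Chrome` accepts `options=`.

```python
import hashlib

VALID_STRATEGIES = ("normal", "eager", "none")


def apply_page_load_strategy(options, strategy="normal"):
    """Set the pageLoadStrategy capability on a Selenium options object."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError("pageLoadStrategy must be one of %r" % (VALID_STRATEGIES,))
    options.set_capability("pageLoadStrategy", strategy)
    return options


def hash_page(driver, url):
    """Load url and return the SHA-512 hex digest of the page source."""
    driver.get(url)  # returns at the point chosen by pageLoadStrategy
    return hashlib.sha512(driver.page_source.encode("utf-8")).hexdigest()


def make_driver(strategy="normal"):
    """Build a headless Chrome driver (needs selenium and chromedriver)."""
    # Imported locally so the helpers above can be tried without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=apply_page_load_strategy(options, strategy))
```

With this, `hash_page(make_driver("normal"), url)` reads the source at the same load point on each run. Content that scripts inject after that point can still differ between runs, so identical hashes are possible but not guaranteed.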
