Selenium + Geckodriver troubleshooting

Question

I'm scraping forum post titles with Selenium in Python and the Firefox geckodriver, but I've hit a snag that I can't figure out.

~$ geckodriver --version
geckodriver 0.19.0

The source code of this program is available from
testing/geckodriver in https://hg.mozilla.org/mozilla-central.

This program is subject to the terms of the Mozilla Public License 2.0.
You can obtain a copy of the license at https://mozilla.org/MPL/2.0/.

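I'm trying to collect several years' worth of post titles from a forum, and my code works fine for a while. I sat and watched it run for about 20-30 minutes and it did exactly what it was supposed to. Then I started the script and went to bed, and when I woke up the next morning I found it had crashed after processing ~22,000 posts. The site I'm scraping shows 25 posts per page, so it got through ~880 separate URLs before crashing.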

When it crashes, it throws the following error:

WebDriverException: Message: Tried to run command without establishing a connection

Originally my code looked like this:

FirefoxProfile = webdriver.FirefoxProfile('/home/me/jupyter-notebooks/FirefoxProfile/')
firefox_capabilities = DesiredCapabilities.FIREFOX
firefox_capabilities['marionette'] = True

driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
driver.close()

I've also tried:

driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
    driver.close()

and

for url in urls:
    driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
    driver.get(url)
    ### code to process page here ###
    driver.close()

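I get the same error in all 3 scenarios, but only after the script has run successfully for quite a while, and I'm not sure how to determine why it fails.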

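How can I determine why this error occurs after hundreds of URLs have been processed successfully? Or is there some best practice for processing this many pages with Selenium/Firefox that I'm not following?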

Answer 1

All 3 code blocks are nearly perfect, but each has a minor flaw, as follows:

Your first code block is:

driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
driver.close()

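This code block looks promising and has no real problem. As the last step, though, best practice is to call driver.quit() instead of driver.close(), which prevents a dangling webdriver instance from remaining in system memory. It is worth looking up the difference between the two: driver.close() closes the current browser window, while driver.quit() ends the whole WebDriver session and shuts down the driver process.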
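As an illustration of the point about the first code block, here is a minimal sketch that puts driver.quit() in a finally clause, so the session is released even when a page raises an exception partway through the run. The names scrape_all, make_driver, and process_page are hypothetical and not part of the question's code; the driver is passed in as a factory so the loop does not depend on a particular Selenium version.

```python
def scrape_all(urls, make_driver, process_page):
    """Visit every URL with one browser session and always shut it down.

    make_driver and process_page are hypothetical callables supplied by
    the caller: make_driver() returns a WebDriver-like object, and
    process_page(driver) returns whatever was scraped from the page.
    """
    driver = make_driver()
    results = []
    try:
        for url in urls:
            driver.get(url)
            results.append(process_page(driver))
    finally:
        # Runs on success and on failure, so no browser/geckodriver
        # instance is left dangling if an exception escapes the loop.
        driver.quit()
    return results
```

With the question's setup, this might be called as scrape_all(urls, lambda: webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities), process_page), reusing the constructor call shown above.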

Your second code block is:

driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
for url in urls:
    driver.get(url)
    ### code to process page here ###
    driver.close()

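This block is error-prone. Once execution enters the for loop and finishes working on the first url, driver.close() closes the browser session/instance. When the loop then starts its second iteration, the script fails at driver.get(url) because there is no active browser session.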

Your third code block is:

for url in urls:
    driver = webdriver.Firefox(FirefoxProfile, capabilities=firefox_capabilities)
    driver.get(url)
    ### code to process page here ###
    driver.close()

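This code block looks well composed and has no issues apart from the one in the first code block: as the last step we must call driver.quit() instead of driver.close(), which prevents dangling webdriver instances from remaining in system memory. Because this block creates a new instance on every iteration, the dangling instances pile up and keep occupying ports, until at some point WebDriver cannot find a free port or cannot open a new browser session/connection. That is why you see the error WebDriverException: Message: Tried to run command without establishing a connection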

Solution:

Following best practice, call driver.quit() instead of driver.close(), then open a new WebDriver instance and a new web browser session.
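One way to apply this to a long run is to give each batch of URLs a fresh session and quit it before the next one opens. The sketch below is only an illustration under assumptions: scrape_in_batches, make_driver, process_page, and the default batch_size of 100 are hypothetical choices, not values from the question or from Selenium.

```python
def scrape_in_batches(urls, make_driver, process_page, batch_size=100):
    """Process a list of URLs in batches, one fresh browser session per batch.

    make_driver() and process_page(driver) are hypothetical callables
    supplied by the caller; batch_size=100 is an arbitrary assumption.
    """
    results = []
    for start in range(0, len(urls), batch_size):
        driver = make_driver()  # new WebDriver instance and browser session
        try:
            for url in urls[start:start + batch_size]:
                driver.get(url)
                results.append(process_page(driver))
        finally:
            # quit() rather than close(): the session is ended before the
            # next batch creates its own driver, so instances do not pile up.
            driver.quit()
    return results
```

Setting batch_size=1 reproduces the third code block with quit() in place of close(); a larger batch trades fewer browser start-ups against longer-lived sessions.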
