How to scrape https website data using Python and Selenium WebDriver

Question

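I have been trying to scrape www.zomato.com for more than a week. I have searched the web for my problem but could not find a suitable solution, so I am posting my question here.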

Here is my web scraper code.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import NoAlertPresentException
import sys
import lxml
import unittest, time, re

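class Sel(unittest.TestCase):
    def setUp(self):
        # Raw string so the backslash in the Windows path is kept literally.
        self.driver = webdriver.PhantomJS(executable_path=r'\phantomjs.exe')  # PhantomJS
        self.driver.implicitly_wait(30)
        self.base_url = "https://www.zomato.com"
        self.verificationErrors = []
        self.accept_next_alert = True

    def test_sel(self):
        driver = self.driver
        delay = 3
        # base_url has no trailing slash, so add one before the city name.
        driver.get(self.base_url + "/hyderabad")
        driver.find_element_by_link_text("All").click()
        # Scroll to the bottom repeatedly so lazily loaded results are fetched.
        for i in range(1, 100):
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(4)
        html_source = driver.page_source
        data = html_source.encode('utf-8')

    def tearDown(self):
        # Shut down the browser process after each test.
        self.driver.quit()


if __name__ == "__main__":
    unittest.main()
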
When I run this under Python 3.4 (directory/py -3.4 selenium.py), I get this error:
[Screenshot: selenium-python-phantomJS-SSL]
Can anyone help me solve this problem?
Best regards.

Answer 1

You need to add the appropriate accept-encoding headers to your request.

'accept-encoding':'gzip, deflate, sdch, br'
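
With the PhantomJS driver from the question, one possible way to send this header is through GhostDriver's "phantomjs.page.customHeaders.<name>" capability keys. The sketch below only builds the capabilities dictionary. The helper names build_phantomjs_capabilities and make_driver are hypothetical, and make_driver assumes a Selenium 3.x install that still ships webdriver.PhantomJS. Whether the extra header actually resolves the error in the question is not confirmed here.

```python
# Prefix GhostDriver uses for per-page custom request headers.
HEADER_PREFIX = "phantomjs.page.customHeaders."


def build_phantomjs_capabilities(headers):
    """Return a capabilities dict asking PhantomJS to send `headers`."""
    caps = {"browserName": "phantomjs"}
    for name, value in headers.items():
        caps[HEADER_PREFIX + name] = value
    return caps


def make_driver(caps, executable_path="phantomjs"):
    """Start PhantomJS with the given capabilities (assumes Selenium 3.x)."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver
    return webdriver.PhantomJS(executable_path=executable_path,
                               desired_capabilities=caps)


# Header value taken from the answer above.
caps = build_phantomjs_capabilities(
    {"accept-encoding": "gzip, deflate, sdch, br"}
)
print(caps)
```

Passing caps to make_driver would then replace the bare webdriver.PhantomJS(...) call in setUp.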

Answer 2

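First, the screenshot with the error that you posted is not from the code you posted. Your code sample shows you calling webdriver.PhantomJS, but the screenshot clearly shows that you hit the error while calling webdriver.Firefox.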

Also, the error message in the screenshot tells you exactly what the problem is and how to fix it: "geckodriver executable needs to be in PATH".

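To use Firefox with Selenium, you need to install geckodriver and make it available on your PATH. geckodriver (like chromedriver) is an external component that does not ship with Firefox or Selenium... it must be installed separately.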

You can download geckodriver here: https://github.com/mozilla/geckodriver/releases
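
Before starting Firefox, a quick way to check whether geckodriver is visible on PATH is the standard library's shutil.which. This is a minimal sketch: the helper names find_driver and add_to_path are hypothetical, and add_to_path only changes PATH for the current Python process, not system-wide.

```python
import os
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


def add_to_path(directory):
    """Prepend `directory` to PATH for the current process only."""
    os.environ["PATH"] = directory + os.pathsep + os.environ.get("PATH", "")


location = find_driver()
if location is None:
    print("geckodriver is not on PATH; download it, then call "
          "add_to_path() with the folder that contains it")
else:
    print("geckodriver found at", location)
```

If the check reports that the driver is missing, calling add_to_path with the folder that holds the downloaded executable before creating webdriver.Firefox() should let Selenium locate it.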

https

selenium

ssl-certificate

phantomjs

python-3.4