Submit form that renders dynamically with Scrapy?

I'm trying to use Scrapy to submit a dynamically generated user login form and then parse the HTML on the page that corresponds to a successful login.

I was wondering how I could do that with Scrapy, or with a combination of Scrapy and Selenium. Selenium makes it possible to find the element on the DOM, but I was wondering if it would be possible to "give control back" to Scrapy after getting the full HTML, in order to allow it to carry out the form submission and save the necessary cookies, session data, etc. in order to scrape the page.

Basically, the only reason I thought Selenium was necessary is that I need the page to render from Javascript before Scrapy looks for the <form> element. Are there any alternatives, though?

Thank you!

Edit: This question is similar to this one, but unfortunately the accepted answer deals with the Requests library instead of Selenium or Scrapy. Though that scenario may be possible in some cases (watch this to learn more), as alecxe pointed out, Selenium may be required if "parts of the page [such as forms] are loaded via API calls and inserted into the page with the help of javascript code being executed in the browser".

Scrapy is not actually a great fit for the coursera site since it is extremely asynchronous. Parts of the page are loaded via API calls and inserted into the page with the help of javascript code being executed in the browser. Scrapy is not a browser and cannot handle it.

Which raises the point - why not use the publicly available Coursera API?

Aside from what is documented, there are other endpoints that you can see being called in the browser developer tools - you need to be authenticated to be able to use them. For example, if you are logged in, you can see the list of courses you've taken:

There is a call to the memberships.v1 endpoint.

For the sake of an example, let's start up selenium, log in and grab the cookies with get_cookies(). Then, let's yield a Request to the memberships.v1 endpoint to get the list of archived courses, providing the cookies we've got from selenium:
import json

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


LOGIN = 'email'
PASSWORD = 'password'

class CourseraSpider(scrapy.Spider):
    name = "courseraSpider"
    allowed_domains = ["coursera.org"]

    def start_requests(self):
        self.driver = webdriver.Chrome()
        self.driver.maximize_window()
        self.driver.get('https://www.coursera.org/login')

        form = WebDriverWait(self.driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@data-js='login-body']//div[@data-js='facebook-button-divider']/following-sibling::form")))
        email = WebDriverWait(form, 10).until(EC.visibility_of_element_located((By.ID, 'user-modal-email')))
        email.send_keys(LOGIN)

        password = form.find_element(By.NAME, 'password')
        password.send_keys(PASSWORD)

        login = form.find_element(By.XPATH, '//button[. = "Log In"]')
        login.click()

        WebDriverWait(self.driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h2[. = 'My Courses']")))

        self.driver.get('https://www.coursera.org/')
        cookies = self.driver.get_cookies()

        self.driver.close()

        courses_url = 'https://www.coursera.org/api/memberships.v1'
        params = {
            'fields': 'courseId,enrolledTimestamp,grade,id,lastAccessedTimestamp,role,v1SessionId,vc,vcMembershipId,courses.v1(display,partnerIds,photoUrl,specializations,startDate,v1Details),partners.v1(homeLink,name),v1Details.v1(sessionIds),v1Sessions.v1(active,dbEndDate,durationString,hasSigTrack,startDay,startMonth,startYear),specializations.v1(logo,name,partnerIds,shortName)&includes=courseId,vcMembershipId,courses.v1(partnerIds,specializations,v1Details),v1Details.v1(sessionIds),specializations.v1(partnerIds)',
            'q': 'me',
            'showHidden': 'false',
            'filter': 'archived'
        }

        params = '&'.join(key + '=' + value for key, value in params.items())
        yield scrapy.Request(courses_url + '?' + params, cookies=cookies)

    def parse(self, response):
        data = json.loads(response.body)

        for course in data['linked']['courses.v1']:
            print(course['name'])

For me, it prints:

Algorithms, Part I
Computing for Data Analysis
Pattern-Oriented Software Architectures for Concurrent and Networked Software
Computer Networks

Which proves that we can give Scrapy the cookies from selenium and successfully extract the data from the "for logged in users only" pages.
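As a side note on that hand-off: Selenium's get_cookies() returns a list of dicts carrying name, value, domain, path and a few other keys, and Scrapy's cookies argument accepts such a list directly. If you prefer to pass only the essentials, a tiny helper (hypothetical name, minimal sketch) can trim the dicts down:

```python
def selenium_to_scrapy_cookies(selenium_cookies):
    """Keep only the keys Scrapy needs from Selenium's cookie dicts."""
    return [{'name': c['name'], 'value': c['value']}
            for c in selenium_cookies]

# Example input in the shape Selenium's get_cookies() returns
# (cookie name and value here are made up for illustration):
raw = [{'name': 'CAUTH', 'value': 'abc123', 'domain': '.coursera.org',
        'path': '/', 'secure': True}]
print(selenium_to_scrapy_cookies(raw))
# [{'name': 'CAUTH', 'value': 'abc123'}]
```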

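A note on the query string: the spider builds it with a manual '&'.join, which works here because these particular values happen to need no escaping. In general, urllib.parse.urlencode is the safer choice since it percent-encodes values for you; a minimal sketch with a subset of the parameters above:

```python
from urllib.parse import urlencode

params = {
    'q': 'me',
    'showHidden': 'false',
    'filter': 'archived',
}

# urlencode joins and percent-encodes key/value pairs;
# dict insertion order is preserved (Python 3.7+)
query = urlencode(params)
print(query)
# q=me&showHidden=false&filter=archived
```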

Also, make sure you are not breaking any rules from the Terms of Use, specifically:

In addition, as a condition of accessing the Sites, you agree not to ... (c) use any high-volume, automated or electronic means to access the Sites (including without limitation, robots, spiders, scripts or web-scraping tools);