Scrapy Splash Crawling Javascript Website

Question

I can scrape a JavaScript-rendered page with the following code:

import scrapy
from scrapy_splash import SplashRequest

class QuotejscrawlerSpider(scrapy.Spider):
    name = 'quotejscrawler'

    def start_requests(self):
        yield SplashRequest(
            url='http://www.horsedeathwatch.com/',
            callback=self.parse,
        )

    def parse(self, response):
        for quote in response.xpath("//tr"):
            item = {
                'horse': quote.xpath('td[@data-th="Horse"]/a/text()').extract(),
                'date': quote.xpath('td[@data-th="Date"]/text()').extract(),
                'cause': quote.xpath('td[@data-th="Cause of Death"]/text()').extract(),
            }
            yield item

I want to scrape multiple pages by clicking the "Next" button on each page. I am new to this. Any suggestions?

Answer 1

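As far as I can tell, there seem to be 2 (non-Python) ways to get a script running on the page:

Pass JavaScript code through the js_source parameter
Pass Lua code through the lua_source parameter (there are some examples showing how to do this with scrapy-splash)

That said, I think it would be much simpler (at least in this case) to reverse-engineer the requests the website makes and implement them in your Python code, which avoids the need for Splash entirely.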

Tags: python, scrapy, scrapy-splash