Scrapy Shell and Scrapy Splash

Question

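We've been using the scrapy-splash middleware to pass the scraped HTML source through the Splash JavaScript engine running inside a Docker container.

If we want to use Splash in the spider, we configure several required project settings and yield a Request specifying specific meta arguments: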

yield Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,

            # 'url' is prefilled from request url
        },

        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.json
        'splash_url': '<url>',      # overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,  # requires "import scrapy_splash" (package formerly named scrapyjs)
    }
})
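
For reference, the "required project settings" mentioned above usually look something like the sketch below, which follows the scrapy-splash README as far as I recall it. The SPLASH_URL value is an assumption (Splash listening on localhost:8050); adjust it to wherever your container is reachable.

```python
# settings.py (sketch): settings that scrapy-splash expects in a Scrapy project

# Assumption: the Splash container is reachable on localhost:8050
SPLASH_URL = 'http://localhost:8050'

# Enable the Splash downloader middlewares and move HttpCompressionMiddleware
# so that it runs in the right order relative to them
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplicates repeated Splash arguments to save space and traffic
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make request fingerprinting and the HTTP cache aware of Splash arguments
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```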

This works as documented. But how can we use scrapy-splash inside the Scrapy Shell?

Answer 1

Just wrap the URL you want to open in the shell in the Splash HTTP API.

So you would want something like:

scrapy shell 'http://localhost:8050/render.html?url=http://domain.com/page-with-javascript.html&timeout=10&wait=0.5'

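where:
localhost:port is where your Splash service is running
url is the URL you want to crawl; don't forget to URL-quote it!
render.html is one of the possible HTTP API endpoints; in this case it returns the rendered HTML page
timeout is the timeout in seconds
wait is the time in seconds to wait for JavaScript to execute before the HTML is read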
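
Since the target URL has to be URL-quoted, one way to build the wrapped address is to let Python's urllib do the quoting. This is only a minimal sketch: the helper name splash_render_url and its defaults are made up for illustration, and it assumes Splash is listening on localhost:8050.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, splash_base="http://localhost:8050",
                      timeout=10, wait=0.5):
    """Wrap target_url in a Splash render.html request (hypothetical helper)."""
    # urlencode percent-encodes the target URL, so any '?' or '&' inside it
    # is not mistaken for part of the Splash query string
    query = urlencode({"url": target_url, "timeout": timeout, "wait": wait})
    return f"{splash_base}/render.html?{query}"


wrapped = splash_render_url("http://domain.com/page-with-javascript.html?a=1&b=2")
print(wrapped)
```

The printed address can then be passed, in quotes, to scrapy shell.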

Answer 2

You can run scrapy shell without arguments inside a configured Scrapy project, then create req = scrapy_splash.SplashRequest(url, ...) and call fetch(req).
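
As a rough sketch of that approach, the two steps can be wrapped in a small helper that you paste into the shell. The name fetch_with_splash and the wait default are my own choices for illustration; it assumes scrapy-splash is installed and the project settings already point at a running Splash instance.

```python
def fetch_with_splash(fetch, url, wait=0.5):
    """Build a SplashRequest for url and hand it to the shell's fetch() (hypothetical helper)."""
    # Imported inside the function so the import only happens when the helper is called
    from scrapy_splash import SplashRequest

    # 'wait' is passed through to Splash as a rendering argument
    request = SplashRequest(url, args={"wait": wait})
    fetch(request)
    return request
```

Inside scrapy shell you would call fetch_with_splash(fetch, 'http://domain.com/page-with-javascript.html'); after that, the shell's response variable should hold the rendered page.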

Answer 3

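For Windows users who use Docker Toolbox:

Change the single quotes to double quotes to prevent the invalid hostname:http error.
Change localhost to the Docker IP address shown below the whale logo. For me it was 192.168.99.100.

Finally I got this: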

scrapy shell "http://192.168.99.100:8050/render.html?url="https://samplewebsite.com/category/banking-insurance-financial-services/""
