Run Splash From File

I've been researching this for a few days now and have found a lot of answers similar to my problem, but not quite, so I decided to go ahead and post the question. I'm using scrapy-splash to scrape KBB. I was able to get past the silly first-use popup thing by using send_text and send_keys, which works great in the browser version of Splash. It pulls the dynamic content just like I want it to, awesome!

Here's the simple code for copy-paste-ability:

function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  -- dismiss the first-use popup by typing a ZIP code and pressing Enter
  splash:send_text("24153")
  splash:send_keys("<Return>")
  assert(splash:wait(5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end

Now I'm trying to get this working in-script, because I want to be able to render multiple HTML files at once. Here's the code I have so far; I only have two URLs in it to test with right now:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "cars"
    start_urls = ["https://www.kbb.com/ford/escape/2017/titanium/", "https://www.kbb.com/honda/cr-v/2017/touring/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5, 'send_text':24153, 'send_keys':'<Return>', 'wait': 5.0},
            )

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'car-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
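One thing worth noting about the `args` dict in the spider above: it repeats the `'wait'` key, and a Python dict literal silently keeps only the last value, so the `0.5` never reaches Splash at all. A quick demonstration (the values mirror the spider's args):

```python
# A dict literal with a repeated key keeps only the last value,
# so 'wait': 0.5 is silently dropped here.
# (Also note send_text is the int 24153 here, while the Lua script
# that worked sent the string "24153".)
args = {'wait': 0.5, 'send_text': 24153, 'send_keys': '<Return>', 'wait': 5.0}

print(args['wait'])  # 5.0 -- the first 'wait' is gone
print(len(args))     # 3 keys remain, not 4
```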

When I try to run it, it just keeps telling me it timed out:

2018-01-16 19:34:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:35:02 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://192.168.65.0:8050/robots.txt> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:35:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:36:17 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://192.168.65.0:8050/robots.txt> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:36:17 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://192.168.65.0:8050/robots.txt>: TCP connection timed out: 60: Operation timed out.
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2018-01-16 19:36:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:37:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:37:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.kbb.com/ford/escape/2017/titanium/ via http://192.168.65.0:8050/render.html> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:37:32 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.kbb.com/honda/cr-v/2017/touring/ via http://192.168.65.0:8050/render.html> (failed 1 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:38:31 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-01-16 19:38:48 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.kbb.com/ford/escape/2017/titanium/ via http://192.168.65.0:8050/render.html> (failed 2 times): TCP connection timed out: 60: Operation timed out.
2018-01-16 19:38:48 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.kbb.com/honda/cr-v/2017/touring/ via http://192.168.65.0:8050/render.html> (failed 2 times): TCP connection timed out: 60: Operation timed out.

Here are my customizations at the bottom of settings.py. Not sure if you need all of it, since most of the rest is commented out:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://localhost:8050/'
SPLASH_URL = 'http://192.168.65.0:8050' 

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
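The timeout log points at a likely culprit: settings.py assigns SPLASH_URL twice, and in Python the second assignment silently overrides the first, so every request goes to http://192.168.65.0:8050, which is exactly the host the retries are failing against. A minimal fix, assuming Splash is actually running locally on the default port:

```python
# settings.py (fragment): keep only one SPLASH_URL entry. The second of
# two plain assignments always wins, which is why requests were going to
# 192.168.65.0 instead of the local Splash instance.
SPLASH_URL = 'http://localhost:8050'
```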

I've been following several tutorials trying to get this to work. I assume it has something to do with the SPIDER_MIDDLEWARES thing, but I don't know what needs to change there. I'm still pretty new to spiders, so any help is greatly appreciated.

It took almost two weeks, but I finally got what I was after. I had to switch to AutoBlog because KBB didn't have everything I needed. The problem with AutoBlog is that it only loads the bottom of the page when you actually scroll down to it, so I used mouse_click to click a navigation button that scrolls the page down to the section I need. Then I waited a few seconds before rendering.

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = "cars"
    start_urls = ["https://www.autoblog.com/buy/2017-Ford-Escape-SE__4dr_4x4/", "https://www.autoblog.com/buy/2017-Honda-CR_V-EX_L__4dr_Front_wheel_Drive/"]

    script = """
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(10.0))
      -- click the nav button that scrolls the page to the section we need
      splash:mouse_click(800, 335)
      assert(splash:wait(10.0))
      return {
        html = splash:html()
      }
    end
    """

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='execute',
                args={'lua_source': self.script, 'wait': 1.0},
            )

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'car-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
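A small note on `parse`: `response.url.split("/")[-2]` only picks out the car slug because these URLs happen to end with a slash. A slightly more defensive version (`slug_from_url` is a hypothetical helper, not part of the original spider):

```python
from urllib.parse import urlsplit

def slug_from_url(url):
    # Take the last non-empty path segment, so a trailing slash
    # (or the lack of one) doesn't matter.
    parts = [p for p in urlsplit(url).path.split('/') if p]
    return parts[-1] if parts else 'index'

print(slug_from_url("https://www.autoblog.com/buy/2017-Ford-Escape-SE__4dr_4x4/"))
# 2017-Ford-Escape-SE__4dr_4x4
```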

There's still some polishing to do, and more URLs to add, but it's a functioning block of code!
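For adding more URLs without editing the spider every time, one option is to read them from a text file, one per line, which also matches the "from file" idea in the title. A minimal sketch (`load_urls` and the `urls.txt` path are hypothetical names, not from the original spider):

```python
# Hypothetical helper: read one URL per line from a text file,
# skipping blank lines, so start_urls doesn't have to be hard-coded.
def load_urls(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# In the spider this would replace the hard-coded list, e.g.:
# start_urls = load_urls('urls.txt')
```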