scrapy.FormRequest.from_response vs. SplashFormRequest.from_response

I'm trying to log in with scrapy-splash in exactly the same way I would with plain scrapy. I looked at the documentation, which says "SplashFormRequest.from_response is also supported, and works as described in scrapy documentation". However, simply changing one line of code and updating the settings as described in the splash docs gets me nothing. What am I doing wrong? Code:

import scrapy
from scrapy_splash import SplashRequest, SplashFormRequest

class MySpider(scrapy.Spider):
    name = 'lost'
    start_urls = ["myurl",]

    def parse(self, response):
        return SplashFormRequest.from_response(
            response,
            formdata={'username': 'pass', 'password': 'pass'},
            callback=self.after_login
        )

    def after_login(self, response):
        print(response.body)
        if b"keyword" in response.body:
            self.logger.error("Success")
        else:
            self.logger.error("Failed")
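One detail worth noting here: in Python 3, Scrapy's response.body is bytes, so a plain str membership test like "keyword" in response.body raises a TypeError there; compare against a bytes literal or use the decoded text instead. A Scrapy-free illustration:

```python
# Illustration only (no Scrapy needed): bytes vs. str membership.
body = b"<html>keyword</html>"   # response.body is bytes in Python 3

assert b"keyword" in body                  # compare bytes against bytes
assert "keyword" in body.decode("utf-8")   # or decode first, then compare str
```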

Added to settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
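The numbers in DOWNLOADER_MIDDLEWARES are ordering priorities: Scrapy merges these entries with its built-in defaults and runs the middlewares sorted by value. A rough sketch of that ordering step (simplified; the real merge also handles defaults and None values):

```python
# Sketch: Scrapy orders middlewares by sorting the {class_path: priority}
# mapping on its numeric values (lower numbers sit closer to the engine).
middlewares = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

ordered = [path for path, prio in sorted(middlewares.items(), key=lambda kv: kv[1])]
# SplashCookiesMiddleware comes before SplashMiddleware,
# which comes before HttpCompressionMiddleware.
```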

Error log:

python@debian:~/Python/code/lostfilm$ scrapy crawl lost
2017-01-26 20:24:22 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: lostfilm)
2017-01-26 20:24:22 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'lostfilm.spiders', 'ROBOTSTXT_OBEY': True, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'SPIDER_MODULES': ['lostfilm.spiders'], 'BOT_NAME': 'lostfilm', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage'}
2017-01-26 20:24:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
Unhandled error in Deferred:
2017-01-26 20:24:22 [twisted] CRITICAL: Unhandled error in Deferred:

2017-01-26 20:24:22 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1299, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 90, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 72, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 97, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/misc.py", line 49, in load_object
    raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))
NameError: Module 'scrapy.downloadermiddlewares.httpcompression' doesn't define any object named 'HttpCompresionMiddlerware'
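This NameError has nothing to do with Splash: the traceback shows Scrapy looking up 'HttpCompresionMiddlerware' (misspelled), so the settings file actually being loaded contains a typo, even though the settings pasted above spell 'HttpCompressionMiddleware' correctly. Scrapy's load_object imports the module from the dotted path and then fetches the final name with getattr; a simplified sketch (not the actual implementation) of why a typo fails this way:

```python
from importlib import import_module

def load_object(path):
    # Simplified sketch of scrapy.utils.misc.load_object:
    # split "pkg.module.Name" into module path and object name,
    # import the module, then look the object up by name.
    module_path, _, name = path.rpartition('.')
    module = import_module(module_path)
    try:
        return getattr(module, name)
    except AttributeError:
        raise NameError("Module '%s' doesn't define any object named '%s'"
                        % (module_path, name))

load_object('json.JSONDecoder')    # correct path: resolves fine
# load_object('json.JSONDecodr')   # misspelled name -> NameError, as in the log
```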

You probably also need to perform the first request through Splash.

By default, the start_urls attribute issues "plain" scrapy.Request objects, not SplashRequest.

You need to override the start_requests method in your spider:

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'lost'
    start_urls = ["myurl",]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url)
    ...