Getting twisted.defer.CancelledError when using Scrapy

Whenever I run the scrapy crawl command, I get the following error:

2016-03-12 00:16:56 [scrapy] ERROR: Error downloading <GET http://XXXXXXX/rnd/sites/default/files/Agreement%20of%20FFCCA(1).pdf>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 246, in _cb_bodyready
    raise defer.CancelledError()
CancelledError
2016-03-12 00:16:56 [scrapy] ERROR: Error downloading <GET http://XXXXXX/rnd/sites/default/files/S&P_Chemicals,etc.20150903.doc>
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 246, in _cb_bodyready
    raise defer.CancelledError()
CancelledError

I have tried searching the Internet for this error, but to no avail.

My spider code is as follows:

import os
import StringIO
import sys
import scrapy
from scrapy.conf import settings
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class IntSpider(CrawlSpider):
    name = "intranetspidey"
    allowed_domains = ["*****"]
    start_urls = [
        "******"
    ]
    rules = (
        Rule(LinkExtractor(deny_extensions=["ppt","pptx"],deny=(r'.*\?.*') ),
             follow=True,
             callback='parse_webpage'),
    )


    def get_pdf_text(self, response):
        """ Peek inside PDF to check possible violations.
        @return: PDF content as searchable plain-text string
        """
        try:
            from pyPdf import PdfFileReader
        except ImportError:
            print "Needed: easy_install pyPdf"
            raise
        stream = StringIO.StringIO(response.body)
        reader = PdfFileReader(stream)
        text = u""

        if reader.getDocumentInfo().title:
            # Title is optional, may be None
            text += reader.getDocumentInfo().title

        for page in reader.pages:
            # XXX: Does this handle unicode properly?
            text += page.extractText()

        return text

    def parse_webpage(self, response):

        ct = response.headers.get("content-type", "").lower()
        if "pdf" in ct or ".pdf" in response.url:
            data = self.get_pdf_text(response)

        elif "html" in ct:
            pass  # do something with the HTML page here

I have only just started using Scrapy and would really appreciate an informed solution.

Do you have a line like this in your output/log:

Expected response size X larger than download max size Y.

It sounds like the response you requested is larger than 1 GB. The error comes from the download handler, which defaults to one gig but can easily be overridden in the settings.

Ah, simple! :)

I just opened the source code where the error is thrown... it seems the page is larger than maxsize... which leads us here.

So, the problem is that you are trying to fetch large documents. Increase the DOWNLOAD_MAXSIZE limit in your settings and you should be fine.
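For example, a settings.py sketch (the numbers are only illustrative; pick limits that fit your documents):

# settings.py -- raise the response size limits (example values)
DOWNLOAD_MAXSIZE = 2 * 1024 * 1024 * 1024   # allow responses up to 2 GB (default is 1 GB)
DOWNLOAD_WARNSIZE = 64 * 1024 * 1024        # warn above 64 MB (default is 32 MB)

If you only want to relax the limit for the document links, the same limit can also be set per request via the download_maxsize key in Request.meta.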

Note: your performance will suffer, because the CPU is blocked doing PDF decoding and no further requests are issued while that happens. Scrapy's architecture is strictly single-threaded. Here are two (of many) solutions, sketched below:

a) Use the files pipeline to download the files and batch-process them with another system.
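A rough sketch of option a), using the FilesPipeline that ships with Scrapy (the storage path is a placeholder; on older Scrapy versions the pipeline lives under scrapy.contrib.pipeline.files instead, and dict items need Scrapy 1.0+):

# settings.py -- let the built-in files pipeline store the documents on disk
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = '/data/crawl/files'   # placeholder path

# in the spider callback: hand the URL to the pipeline instead of decoding in-process
def parse_webpage(self, response):
    ct = response.headers.get("content-type", "").lower()
    if "pdf" in ct or ".pdf" in response.url:
        yield {'file_urls': [response.url]}

The pipeline saves each file and records its location in the item's files field, so a separate batch job can do the PDF decoding later.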

b) Use reactor.spawnProcess() and do the PDF decoding in a separate process (see here). This lets you use Python or any other command-line tool for the decoding.
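A minimal sketch of what option b) could look like, assuming the pdftotext tool from poppler-utils at an explicit path (both the tool and the path are placeholders for whatever decoder you prefer):

# pdf_subprocess.py -- offload PDF decoding to a child process via Twisted
from twisted.internet import defer, protocol, reactor

class PdfTextProtocol(protocol.ProcessProtocol):
    """Collects the child's stdout and fires a Deferred when it exits."""
    def __init__(self):
        self.deferred = defer.Deferred()
        self._chunks = []

    def outReceived(self, data):
        self._chunks.append(data)

    def processEnded(self, reason):
        self.deferred.callback("".join(self._chunks))

def extract_pdf_text(path):
    """Run `pdftotext <path> -` in a separate process; returns a Deferred firing with the text."""
    proto = PdfTextProtocol()
    reactor.spawnProcess(proto, "/usr/bin/pdftotext",
                         ["pdftotext", path, "-"])
    return proto.deferred

The reactor keeps serving other requests while the child process does the decoding, and the Deferred lets you pick the extracted text up asynchronously.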