Getting twisted.defer.CancelledError when using Scrapy
Whenever I run the scrapy crawl command, I get the following errors:
2016-03-12 00:16:56 [scrapy] ERROR: Error downloading <GET http://XXXXXXX/rnd/sites/default/files/Agreement%20of%20FFCCA(1).pdf>
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 246, in _cb_bodyready
raise defer.CancelledError()
CancelledError
2016-03-12 00:16:56 [scrapy] ERROR: Error downloading <GET http://XXXXXX/rnd/sites/default/files/S&P_Chemicals,etc.20150903.doc>
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 246, in _cb_bodyready
raise defer.CancelledError()
CancelledError
I have tried searching the internet for this error, but to no avail.
My spider code is as follows:
import os
import StringIO
import sys

import scrapy
from scrapy.conf import settings
from scrapy.selector import Selector
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class IntSpider(CrawlSpider):
    name = "intranetspidey"
    allowed_domains = ["*****"]
    start_urls = [
        "******"
    ]

    rules = (
        Rule(LinkExtractor(deny_extensions=["ppt", "pptx"], deny=(r'.*\?.*')),
             follow=True,
             callback='parse_webpage'),
    )

    def get_pdf_text(self, response):
        """ Peek inside PDF to check possible violations.
        @return: PDF content as searchable plain-text string
        """
        try:
            from pyPdf import PdfFileReader
        except ImportError:
            print "Needed: easy_install pyPdf"
            raise

        stream = StringIO.StringIO(response.body)
        reader = PdfFileReader(stream)

        text = u""
        if reader.getDocumentInfo().title:
            # Title is optional, may be None
            text += reader.getDocumentInfo().title

        for page in reader.pages:
            # XXX: Does handle unicode properly?
            text += page.extractText()

        return text

    def parse_webpage(self, response):
        ct = response.headers.get("content-type", "").lower()
        if "pdf" in ct or ".pdf" in response.url:
            data = self.get_pdf_text(response)
        elif "html" in ct:
            # do something
            pass
I have just started using Scrapy and would greatly appreciate a knowledgeable solution.
Do you have a line like this in your output/log:
Expected response size X larger than download max size Y.
It sounds like you are requesting a response larger than 1 GB. The error comes from the download handler, which defaults to one gigabyte but can easily be overridden in:
- the project's settings.py file, with DOWNLOAD_MAXSIZE,
- the spider's custom_settings, with download_maxsize, or
- a manually set Request meta key, download_maxsize (all three are sketched below).
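For example, a minimal sketch of the three override points; the 2 GB figure is purely illustrative, and the per-request snippet belongs inside a spider callback:

# 1) Project-wide, in the project's settings.py
#    (the built-in default is 1073741824 bytes, i.e. 1 GB):
DOWNLOAD_MAXSIZE = 2 * 1024 * 1024 * 1024

# 2) Per spider, via the custom_settings class attribute:
class IntSpider(CrawlSpider):
    custom_settings = {
        "DOWNLOAD_MAXSIZE": 2 * 1024 * 1024 * 1024,
    }

# 3) Per request, inside a callback, via the download_maxsize meta key:
yield scrapy.Request(url,
                     meta={"download_maxsize": 2 * 1024 * 1024 * 1024},
                     callback=self.parse_webpage)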
Ah, that was easy! :)
I just opened the source code where the error is thrown... it seems the page is larger than maxsize
... which leads us here.
So the problem is that you are trying to fetch large documents. Increase the DOWNLOAD_MAXSIZE limit in your settings and you should be fine.
Note: your performance will suffer, because you block the CPU with the PDF decoding and no further requests are issued while that happens. Scrapy's architecture is strictly single-threaded. Here are two (out of many) solutions, with sketches after this list:
a) Use the file pipeline to download the files and batch-process them later with some other system.
b) Use reactor.spawnProcess() and do the PDF decoding in a separate process (see here). This lets you use Python or any other command-line tool for the PDF decoding.
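For option a), a minimal sketch of the built-in files pipeline, assuming Scrapy 1.0 or newer (older releases ship the pipeline as scrapy.contrib.pipeline.files.FilesPipeline); the storage path and the link filter below are illustrative:

# settings.py -- enable the files pipeline and tell it where to store downloads
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "/path/to/downloaded/docs"

# In the spider, yield items with a file_urls field; the pipeline downloads
# the listed documents and records the results under the "files" field.
from urlparse import urljoin  # Python 2, to match the spider above

def parse_webpage(self, response):
    for href in response.css("a::attr(href)").extract():
        if href.lower().endswith((".pdf", ".doc")):
            yield {"file_urls": [urljoin(response.url, href)]}

For option b), a sketch of off-loading the PDF decoding from an item pipeline: process_item() may return a Deferred, and twisted.internet.utils.getProcessOutput() is a convenience wrapper around reactor.spawnProcess(). The pdftotext binary, its path, and the item field names are assumptions for illustration:

# pipelines.py -- decode PDFs in a child process so the reactor is never blocked
from twisted.internet import utils

class PdfTextPipeline(object):

    def process_item(self, item, spider):
        path = item.get("local_pdf_path")  # hypothetical field filled in earlier
        if not path:
            return item
        # getProcessOutput() spawns the child process and returns a Deferred
        # firing with its stdout; Scrapy keeps crawling while it runs.
        d = utils.getProcessOutput("/usr/bin/pdftotext", [path, "-"])
        d.addCallback(self._attach_text, item)
        return d

    def _attach_text(self, text, item):
        item["pdf_text"] = text
        return item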