Scrapy not calling the assigned pipeline when run from a script
I have a piece of code to test Scrapy. My goal is to use Scrapy without having to invoke the scrapy command from a terminal, so that I can embed this code somewhere else.

Here is the code:
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.item import Item, Field
from scrapy.crawler import CrawlerProcess
import json


class JsonWriterPipeline(object):

    file = None

    def open_spider(self, spider):
        self.file = open('items.json', 'wb')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


class StackItem(Item):
    title = Field()
    url = Field()


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["whosebug.com"]
    start_urls = ["http://whosebug.com/questions?pagesize=50&sort=newest"]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="summary"]/h3')
        for question in questions:
            item = StackItem()
            item['title'] = question.xpath('a[@class="question-hyperlink"]/text()').extract()[0]
            item['url'] = question.xpath('a[@class="question-hyperlink"]/@href').extract()[0]
            yield item


if __name__ == '__main__':
    settings = dict()
    settings['USER_AGENT'] = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    settings['ITEM_PIPELINES'] = {'JsonWriterPipeline': 1}

    process = CrawlerProcess(settings=settings)
    spider = StackSpider()
    process.crawl(spider)
    process.start()
As you can see, the code is self-contained and I override two settings: USER_AGENT and ITEM_PIPELINES. However, when I set breakpoints in the JsonWriterPipeline class, I can see that the code runs but the breakpoints are never reached, so the custom pipeline is not being used.

How can I fix this?
Running your script with Scrapy 1.3.2 and Python 3.5, I get 2 errors.

The first one:
Unhandled error in Deferred:
2017-02-21 13:47:23 [twisted] CRITICAL: Unhandled error in Deferred:
2017-02-21 13:47:23 [twisted] CRITICAL:
Traceback (most recent call last):
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/utils/misc.py", line 39, in load_object
dot = path.rindex('.')
ValueError: substring not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
result = g.send(result)
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/crawler.py", line 72, in crawl
self.engine = self._create_engine()
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/crawler.py", line 97, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/core/engine.py", line 70, in __init__
self.scraper = Scraper(crawler)
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/core/scraper.py", line 71, in __init__
self.itemproc = itemproc_cls.from_crawler(crawler)
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/middleware.py", line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/middleware.py", line 34, in from_settings
mwcls = load_object(clspath)
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/scrapy/utils/misc.py", line 41, in load_object
raise ValueError("Error loading object '%s': not a full path" % path)
ValueError: Error loading object 'JsonWriterPipeline': not a full path
You need to provide the full path to the pipeline class. For example, here, the __main__ namespace works:
settings['ITEM_PIPELINES'] = {'__main__.JsonWriterPipeline': 1}
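For context, Scrapy resolves each key in ITEM_PIPELINES with its load_object helper, which splits the dotted path at the last '.' into a module path and an attribute name; a bare class name has no dot, which is exactly the ValueError: substring not found in the traceback above. A simplified sketch of that lookup (not Scrapy's actual implementation):

import importlib

def load_object_sketch(path):
    # Split 'some.module.ClassName' at the last dot. A bare name like
    # 'JsonWriterPipeline' has no dot, which triggers the
    # "not a full path" error shown above.
    module_path, dot, name = path.rpartition('.')
    if not dot:
        raise ValueError("Error loading object '%s': not a full path" % path)
    module = importlib.import_module(module_path)  # '__main__' imports fine
    return getattr(module, name)

Since the pipeline is defined in the running script itself, its module is __main__, hence the '__main__.JsonWriterPipeline' key.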
Second (with the pipeline class fix above), you get:
2017-02-21 13:47:52 [scrapy.core.scraper] ERROR: Error processing {'title': 'Apply Remote Commits to a Local Pull Request',
'url': '/questions/42367647/apply-remote-commits-to-a-local-pull-request'}
Traceback (most recent call last):
File "/home/paul/.virtualenvs/scrapy13.py3/lib/python3.5/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "test.py", line 20, in process_item
self.file.write(line)
TypeError: a bytes-like object is required, not 'str'
You can fix this by writing the item JSON as bytes:
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line.encode('ascii'))
        return item
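The .encode('ascii') is safe here because json.dumps escapes non-ASCII characters by default (ensure_ascii=True), so the dumped string is ASCII-only. An alternative (not from the original answer, just another common option) is to open the file in text mode and keep writing str:

    def open_spider(self, spider):
        # Text mode instead of 'wb': json.dumps() returns str, which can
        # then be written without an explicit .encode() call.
        self.file = open('items.json', 'w')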