Scrapy (1.0) - Signals not received
What I'm trying to do is trigger a function (abc) when a Scrapy spider is opened, driven by Scrapy's 'signals'.
(Later I'd like to change this to 'closed' so I can save each spider's stats to a database for daily monitoring.)
For now I've tried this simple solution, which just prints something I'd expect to see in the console once the crawler process opens the spider.
What happens is that the crawler runs fine, but the output of 'abc' is never printed when the spider is opened, even though that event should trigger it. At the end I've posted what the console shows, which is simply the spider running fine.
Why isn't the abc function triggered by the signal where 'INFO: Spider opened' appears in the log (or anywhere at all)?
MyCrawlerProcess.py:
from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
def abc():
    print '######################works!######################'

def from_crawler(crawler):
    crawler.signals.connect(abc, signal=signals.spider_opened)

process.crawl('Dissident')
process.start()  # the script will block here until the crawling is finished
Console output:
2016-03-17 13:00:14 [scrapy] INFO: Scrapy 1.0.4 started (bot: Chrome 41.0.2227.1. Mozilla/5.0 (Macintosh; Intel Mac Osource)
2016-03-17 13:00:14 [scrapy] INFO: Optional features available: ssl, http11
2016-03-17 13:00:14 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapytry.spiders', 'SPIDER_MODULES': ['scrapytry.spiders'], 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'Chrome 41.0.2227.1. Mozilla/5.0 (Macintosh; Intel Mac Osource'}
2016-03-17 13:00:14 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-17 13:00:14 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-17 13:00:14 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-17 13:00:14 [scrapy] INFO: Enabled item pipelines: ImagesPipeline, FilesPipeline, ScrapytryPipeline
2016-03-17 13:00:14 [scrapy] INFO: Spider opened
2016-03-17 13:00:14 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-17 13:00:14 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-17 13:00:14 [scrapy] DEBUG: Crawled (200) <GET http://www.xyz.zzm/> (referer: None)
Simply defining from_crawler isn't enough, because it never gets hooked into the Scrapy framework. See the docs here, which show how to create an extension that does exactly what you want. Be sure to follow the instructions there for enabling the extension via the MYEXT_ENABLED setting.
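For reference, here is a minimal sketch of such an extension, following the pattern from the Scrapy extension docs. The file location and class name (scrapytry/extensions.py, SpiderOpenedLogging) are illustrative assumptions based on the scrapytry project visible in the log above, not part of the original post.

extensions.py (assumed to live at scrapytry/extensions.py):

from scrapy import signals
from scrapy.exceptions import NotConfigured

class SpiderOpenedLogging(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy itself calls this classmethod when it instantiates the
        # extension; that call is what hooks the handler into the framework,
        # and it is exactly what the module-level from_crawler above never got
        if not crawler.settings.getbool('MYEXT_ENABLED'):
            raise NotConfigured
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_opened(self, spider):
        # runs once per spider when the 'Spider opened' log line appears
        print '######################works!######################'

Then enable it in settings.py (the dict value is the extension's load order):

MYEXT_ENABLED = True
EXTENSIONS = {
    'scrapytry.extensions.SpiderOpenedLogging': 500,  # illustrative path
}

Saving per-spider stats on close, as mentioned in the question, would then just mean connecting a second handler to signals.spider_closed in the same from_crawler and reading crawler.stats there.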