Can't get Scrapy Stats from scrapy.CrawlerProcess
I'm running a Scrapy spider from another script, and I need to retrieve the stats from the Crawler and save them to a variable. I've looked into the docs and other Stack Overflow questions, but I haven't been able to solve this problem.
This is the script I'm running the crawl from:
import scrapy
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess({})
process.crawl(spiders.MySpider)
process.start()
stats = CrawlerProcess.stats.getstats() # I need something like this
I expect stats to contain data like this (from scrapy.statscollectors):
{'downloader/request_bytes': 44216,
'downloader/request_count': 36,
'downloader/request_method_count/GET': 36,
'downloader/response_bytes': 1061929,
'downloader/response_count': 36,
'downloader/response_status_count/200': 36,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 9, 16, 31, 2, 382546),
'log_count/DEBUG': 37,
'log_count/ERROR': 35,
'log_count/INFO': 9,
'memusage/max': 62623744,
'memusage/startup': 62623744,
'request_depth_max': 1,
'response_received_count': 36,
'scheduler/dequeued': 36,
'scheduler/dequeued/memory': 36,
'scheduler/enqueued': 36,
'scheduler/enqueued/memory': 36,
'start_time': datetime.datetime(2018, 11, 9, 16, 30, 38, 140469)}
I've looked into CrawlerProcess: it returns deferreds, and it removes the crawlers from its 'crawlers' attribute once the crawl process finishes.
Is there a way to solve this?
Best,
Peter
According to the documentation, CrawlerProcess.crawl accepts either a crawler or a spider class, and you can create a crawler from the spider class via CrawlerProcess.create_crawler.
This way you can instantiate the crawler before starting the crawl process, and retrieve the expected attributes from it afterwards.
Below is an example, editing a few lines of your original code:
import scrapy
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        self.crawler.stats.inc_value('foo')

process = CrawlerProcess({})
crawler = process.create_crawler(TestSpider)
process.crawl(crawler)
process.start()

stats_obj = crawler.stats
stats_dict = crawler.stats.get_stats()
# perform the actions you want with the stats object or dict
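For reference, get_stats() returns a plain dict, so you can pull out individual entries with ordinary dict access. A minimal sketch, using a hand-made sample dict standing in for a real crawl result (the keys mirror the scrapy.statscollectors output shown in the question):

```python
from datetime import datetime

# Hand-made sample standing in for crawler.stats.get_stats();
# after a real crawl, these keys are filled in by Scrapy.
stats_dict = {
    'downloader/request_count': 36,
    'start_time': datetime(2018, 11, 9, 16, 30, 38),
    'finish_time': datetime(2018, 11, 9, 16, 31, 2),
}

# .get() with a default guards against keys that may be absent
# (e.g. on a crawl that made no requests).
request_count = stats_dict.get('downloader/request_count', 0)
elapsed = stats_dict['finish_time'] - stats_dict['start_time']

print(request_count)            # 36
print(elapsed.total_seconds())  # 24.0
```

The stats object itself (crawler.stats) also exposes get_value('key', default) if you'd rather query single values without taking the whole dict.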