Scrapy pausing and resuming crawls, results directory
I have finished a crawl using resume mode, but I don't know where the results are.

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

I looked at https://docs.scrapy.org/en/latest/topics/jobs.html, but it doesn't say anything about this.

Where is the file that contains the results?
2020-09-10 23:31:31 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-10 23:31:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'bans/error/scrapy.core.downloader.handlers.http11.TunnelError': 22,
'bans/error/twisted.internet.error.ConnectionRefusedError': 2,
'bans/error/twisted.internet.error.TimeoutError': 6891,
'bans/error/twisted.web._newclient.ResponseNeverReceived': 8424,
'bans/status/500': 9598,
'bans/status/503': 56,
'downloader/exception_count': 15339,
'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 22,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 6891,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 8424,
'downloader/request_bytes': 9530,
'downloader/request_count': 172,
'downloader/request_method_count/GET': 172,
'downloader/response_bytes': 1848,
'downloader/response_count': 170,
'downloader/response_status_count/200': 169,
'downloader/response_status_count/500': 9,
'downloader/response_status_count/503': 56,
'elapsed_time_seconds': 1717,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 9, 11, 2, 31, 31, 32),
'httperror/response_ignored_count': 67,
'httperror/response_ignored_status_count/500': 67,
'item_scraped_count': 120,
'log_count/DEBUG': 357,
'log_count/ERROR': 119,
'log_count/INFO': 1764,
'log_count/WARNING': 240,
'proxies/dead': 1,
'proxies/good': 1,
'proxies/mean_backoff': 0.0,
'proxies/reanimated': 0,
'proxies/unchecked': 0,
'response_received_count': 169,
'retry/count': 1019,
'retry/max_reached': 93,
'retry/reason_count/500 Internal Server Error': 867,
'retry/reason_count/twisted.internet.error.TimeoutError': 80,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 72,
'scheduler/dequeued': 1722,
'scheduler/dequeued/disk': 1722,
'scheduler/enqueued': 1722,
'scheduler/enqueued/disk': 1722,
'start_time': datetime.datetime(2015, 9, 9, 2, 48, 56, 908)}
2020-09-10 23:31:31 [scrapy.core.engine] INFO: Spider closed (finished)
(Face python 3.8) D:\Selenium\Face python 3.8\TORBUSCADORDELINKS\TORBUSCADORDELINKS\spiders>
Your command,

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

does not specify an output file path, so your scraped items are not written anywhere. Use the -o command-line switch to specify an output path.

See also the Scrapy tutorial, which covers this, or run scrapy crawl --help.
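For example, a resumed crawl that also writes its items to a feed file might look like this (items.json is just an illustrative filename; Scrapy infers the feed format from the extension):

```shell
# Resume the same job (JOBDIR keeps the scheduler state on disk)
# and append scraped items to a JSON feed file via -o.
scrapy crawl somespider -s JOBDIR=crawls/somespider-1 -o items.json
```

Note that -o appends to an existing file across runs, which is usually what you want when resuming a paused job; if you rerun from scratch and want the file overwritten instead, recent Scrapy versions provide -O for that.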