Scrapy only processes first 10 requests in iterable
I have a spider that starts from a sitemap, scrapes (a few) hundred unique URLs, and then does further processing on each of those pages. However, I only get callbacks for the first 10 URLs. The spider log shows HTTP GETs being issued for only the first 10 URLs.
import scrapy

class MySpider(scrapy.spider.BaseSpider):
    # settings ...

    def parse(self, response):
        urls = [...]  # the ~100 unique URLs pulled from the sitemap
        for url in urls:
            request = scrapy.http.Request(url, callback=self.parse_part2)
            print url
            yield request

    def parse_part2(self, response):
        print response.url
        # do more parsing here
I've already considered the following options (sketched below), none of which helped:
- shuffling the list
- setting a download delay (I'm fairly sure I'm not being rate-limited)
- the dont_filter=True parameter
- returning a list of requests instead of yielding them
- disabling concurrent requests

Is there some mysterious max_branching_factor flag I don't know about?
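For reference, a minimal sketch of what those experiments looked like, assuming Scrapy's standard DOWNLOAD_DELAY and CONCURRENT_REQUESTS settings and the dont_filter flag on Request (the exact values here are made up):

# settings.py -- values used while experimenting (hypothetical)
DOWNLOAD_DELAY = 0.5      # throttle requests in case of rate limiting
CONCURRENT_REQUESTS = 1   # effectively disable parallel requests

# in parse(): bypass Scrapy's duplicate filter on each request
request = scrapy.http.Request(url, callback=self.parse_part2,
                              dont_filter=True)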
Edit: the log, which looks completely normal.
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url1>
yay callback!
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url2>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url3>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url4>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url5>
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url6>
yay callback!
yay callback!
yay callback!
yay callback!
yay callback!
2015-02-11 02:05:12-0800 [mysite] DEBUG: Crawled (200) <GET url7>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url8>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url9>
yay callback!
2015-02-11 02:05:13-0800 [mysite] DEBUG: Crawled (200) <GET url10>
yay callback!
2015-02-11 02:05:13-0800 [mysite] INFO: Closing spider (finished)
2015-02-11 02:05:13-0800 [mysite] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4590,
'downloader/request_count': 11,
'downloader/request_method_count/GET': 11,
'downloader/response_bytes': 638496,
'downloader/response_count': 11,
'downloader/response_status_count/200': 11,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 2, 11, 10, 5, 13, 260322),
'log_count/DEBUG': 17,
'log_count/INFO': 3,
'request_depth_max': 1,
'response_received_count': 11,
'scheduler/dequeued': 11,
'scheduler/dequeued/memory': 11,
'scheduler/enqueued': 11,
'scheduler/enqueued/memory': 11,
'start_time': datetime.datetime(2015, 2, 11, 10, 5, 12, 492811)}
2015-02-11 02:05:13-0800 [mysite] INFO: Spider closed (finished)
Try setting LOG_LEVEL to DEBUG and you'll see more log output. If you do, please paste it here.
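For example (assuming a standard Scrapy project layout), either in settings.py:

LOG_LEVEL = 'DEBUG'

or as a one-off override on the command line:

scrapy crawl mysite -s LOG_LEVEL=DEBUG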
So I found this attribute in one of my settings files:

max_requests / MAX_REQUESTS = 10

It was responsible for the spider exiting early (oops).
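For anyone hitting the same wall: Scrapy's built-in CloseSpider extension provides the same kind of cap via the CLOSESPIDER_PAGECOUNT setting, so a stray limit like this is worth grepping your settings files for:

# settings.py
# Built-in counterpart of the stray MAX_REQUESTS above: closes the
# spider after N responses have been crawled; 0 (the default) means
# no limit.
CLOSESPIDER_PAGECOUNT = 0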