Scrapy: sending multiple requests
I am writing code that must read and process date and time information from a remote JSON file on demand.
The code I wrote is as follows:
import scrapy

class TimeSpider(scrapy.Spider):
    name = 'getTime'
    allowed_domains = ['worldtimeapi.org']
    start_urls = ['http://worldtimeapi.org']

    def parse(self, response):
        time_json = 'http://worldtimeapi.org/api/timezone/Asia/Tehran'
        for i in range(5):
            print(i)
            yield scrapy.Request(url=time_json, callback=self.parse_json)

    def parse_json(self, response):
        print(response.json())
The output it produces is as follows:
0
1
2
3
4
{'abbreviation': '+0430', 'client_ip': '45.136.231.43', 'datetime': '2022-04-22T22:01:44.198723+04:30', 'day_of_week': 5, 'day_of_year': 112, 'dst': True, 'dst_from': '2022-03-21T20:30:00+00:00', 'dst_offset': 3600, 'dst_until': '2022-09-21T19:30:00+00:00', 'raw_offset': 12600, 'timezone': 'Asia/Tehran', 'unixtime': 1650648704, 'utc_datetime': '2022-04-22T17:31:44.198723+00:00', 'utc_offset': '+04:30', 'week_number': 16}
As you can see, the program calls the parse_json function only once, whereas it should call it on every loop iteration.
Can anyone help me solve this problem?
The extra requests are being dropped by Scrapy's default duplicate filter.
The simplest way to avoid this is to pass the dont_filter argument:
yield scrapy.Request(url=time_json, callback=self.parse_json, dont_filter=True)
From the docs:

dont_filter (bool) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
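The filtering behavior can be sketched with a toy model (standard library only). Note this is a deliberate simplification: Scrapy's real RFPDupeFilter fingerprints the request method, canonical URL, body, and selected headers, while this sketch hashes only method and URL for illustration.

```python
import hashlib

class ToyDupeFilter:
    """Toy version of Scrapy's duplicate filter: remembers request
    fingerprints and rejects any request it has already seen."""

    def __init__(self):
        self.seen = set()

    def fingerprint(self, method, url):
        # Simplified fingerprint: hash of method + URL only.
        return hashlib.sha1(f"{method} {url}".encode()).hexdigest()

    def should_schedule(self, method, url, dont_filter=False):
        fp = self.fingerprint(method, url)
        if dont_filter:
            return True   # bypass the filter entirely
        if fp in self.seen:
            return False  # duplicate: dropped, callback never fires
        self.seen.add(fp)
        return True

f = ToyDupeFilter()
url = "http://worldtimeapi.org/api/timezone/Asia/Tehran"

# Without dont_filter, only the first of five identical requests passes,
# which is why parse_json ran once in the question's output.
print([f.should_schedule("GET", url) for _ in range(5)])

# With dont_filter=True, all five requests are scheduled.
print([f.should_schedule("GET", url, dont_filter=True) for _ in range(5)])
```

This mirrors why the loop in parse printed 0 through 4 (five requests were yielded) yet parse_json ran only once: the scheduler silently discarded the four duplicates before they were ever sent.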