Scrapy skip request based on previous crawl from same spider
In the example below, each bucket contains many balls. Any given bucket may or may not contain a red ball. To determine whether a ball is red, we crawl it.
If a red ball is found in a bucket, I want to stop crawling the remaining balls in that bucket (i.e. I don't want requests for the next balls to be sent, since I've already found the red one).
The bucket and ball identifiers are query parameters on the base URL.
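For reference, the add_or_replace_parameter and url_query_parameter helpers used below are the ones from w3lib.url; a request URL gets built roughly like this (the identifier values here are just made up):

    from w3lib.url import add_or_replace_parameter, url_query_parameter

    base_url = 'https://bucketswithballs.com'
    url = add_or_replace_parameter(base_url, 'bucket', '42')
    url = add_or_replace_parameter(url, 'ball', '7')
    # url -> 'https://bucketswithballs.com?bucket=42&ball=7'
    assert url_query_parameter(url, 'bucket') == '42'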
What I've tried #1
Keep state on the class and check whether the bucket already has a red ball:
import scrapy
from w3lib.url import add_or_replace_parameter, url_query_parameter


class BucketsBallsSpider(scrapy.Spider):
    name = 'test_spider'
    base_url = 'https://bucketswithballs.com'
    buckets = []
    balls = []
    buckets_with_red_balls = []

    def start_requests(self):
        for bucket in self.buckets:
            for ball in self.balls:
                # Skip the rest of this bucket once a red ball has been recorded
                if bucket in self.buckets_with_red_balls:
                    break
                url = add_or_replace_parameter(self.base_url, 'bucket', bucket)
                url = add_or_replace_parameter(url, 'ball', ball)
                yield scrapy.Request(url, self.parse)

    def parse(self, response, **kwargs):
        is_red_ball = response.xpath('//*[@id="is_red_ball"]').extract()
        if is_red_ball:
            bucket_id = url_query_parameter(response.url, 'bucket')
            self.buckets_with_red_balls.append(bucket_id)
            yield {'bucket_with_red_ball': bucket_id}
What I've tried #2
Generate the requests in the parse method:
import scrapy
from w3lib.url import add_or_replace_parameter, url_query_parameter


class BucketsBallsSpider(scrapy.Spider):
    name = 'test_spider'
    base_url = 'https://bucketswithballs.com'
    buckets = []
    balls = []
    buckets_with_red_balls = []

    def start_requests(self):
        # Start from the first bucket and the first ball
        url = add_or_replace_parameter(self.base_url, 'bucket', self.buckets[0])
        url = add_or_replace_parameter(url, 'ball', self.balls[0])
        yield scrapy.Request(url, self.parse)

    def parse(self, response, **kwargs):
        is_red_ball = response.xpath('//*[@id="is_red_ball"]').extract()
        if is_red_ball:
            bucket_id = url_query_parameter(response.url, 'bucket')
            self.buckets_with_red_balls.append(bucket_id)
            yield {'bucket_with_red_ball': bucket_id}
        # Scrapy's duplicate filter will skip URLs that were already requested
        for bucket in self.buckets:
            for ball in self.balls:
                if bucket in self.buckets_with_red_balls:
                    break
                url = add_or_replace_parameter(self.base_url, 'bucket', bucket)
                url = add_or_replace_parameter(url, 'ball', ball)
                yield scrapy.Request(url, self.parse)
With each example, Scrapy tells me in the console that it crawled every URL. I'd like to avoid that for performance reasons.
This won't work the way you want, because Scrapy works asynchronously: I don't think you can stop the other requests, since they may already be in progress. You could raise a CloseSpider() exception to terminate the spider when a red ball is found, but concurrent requests would still finish before the spider closes. See the Scrapy architecture here.
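For illustration, a rough sketch of that CloseSpider approach, reusing the XPath check and the w3lib helper from your spiders (untested; the spider name is a placeholder, and requests already in flight will still complete):

    import scrapy
    from scrapy.exceptions import CloseSpider
    from w3lib.url import url_query_parameter


    class StopOnRedBallSpider(scrapy.Spider):
        name = 'stop_on_red_ball'

        def parse(self, response, **kwargs):
            # Same red-ball check as in your spiders
            if response.xpath('//*[@id="is_red_ball"]').extract():
                bucket_id = url_query_parameter(response.url, 'bucket')
                yield {'bucket_with_red_ball': bucket_id}
                # Asks Scrapy to stop scheduling new requests; in-flight ones still finish
                raise CloseSpider('red ball found')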
If you need it to stop as soon as the red ball is found and not issue any further requests at all, I think you want it to be synchronous. It might be easier to do that with plain Python requests, for example.
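Something along these lines, as a very rough sketch (the is_red_ball marker and the w3lib helper are taken from your code; checking the raw HTML with a substring is just a stand-in for real parsing):

    import requests
    from w3lib.url import add_or_replace_parameter

    base_url = 'https://bucketswithballs.com'
    buckets = []  # your bucket identifiers
    balls = []    # your ball identifiers

    buckets_with_red_balls = []
    for bucket in buckets:
        for ball in balls:
            url = add_or_replace_parameter(base_url, 'bucket', bucket)
            url = add_or_replace_parameter(url, 'ball', ball)
            html = requests.get(url).text
            if 'id="is_red_ball"' in html:
                buckets_with_red_balls.append(bucket)
                break  # no more requests for the remaining balls in this bucket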
That said, I've updated your example to work synchronously (I haven't tested it). This forces Scrapy to make the requests one by one, regardless of how many concurrent requests it's configured to make, so it isn't very efficient:
import scrapy
from w3lib.url import add_or_replace_parameter, url_query_parameter


class BucketsBallsSpider(scrapy.Spider):
    name = 'test_spider'
    base_url = 'https://bucketswithballs.com'
    buckets = []
    balls = []
    current_bucket_idx = 0
    current_ball_idx = 0

    def start_requests(self):
        # Start from the first bucket and the first ball
        url = add_or_replace_parameter(self.base_url, 'bucket', self.buckets[0])
        url = add_or_replace_parameter(url, 'ball', self.balls[0])
        yield scrapy.Request(url, self.parse)

    def parse(self, response, **kwargs):
        is_red_ball = response.xpath('//*[@id="is_red_ball"]').extract()
        if is_red_ball:
            bucket_id = url_query_parameter(response.url, 'bucket')
            yield {'bucket_with_red_ball': bucket_id}
            return
        # Only schedule the next request once this response has been processed
        next_bucket, next_ball = self._get_next_bucket_and_ball()
        if not next_bucket:
            return
        url = add_or_replace_parameter(self.base_url, 'bucket', next_bucket)
        url = add_or_replace_parameter(url, 'ball', next_ball)
        yield scrapy.Request(url, self.parse)

    def _get_next_bucket_and_ball(self):
        if self.current_ball_idx < len(self.balls) - 1:
            self.current_ball_idx += 1
        else:
            self.current_ball_idx = 0
            if self.current_bucket_idx < len(self.buckets) - 1:
                self.current_bucket_idx += 1
            else:
                # No more buckets/balls to try
                return None, None
        next_bucket = self.buckets[self.current_bucket_idx]
        next_ball = self.balls[self.current_ball_idx]
        return next_bucket, next_ball
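If you run it from a script, something like this should work with the standard CrawlerProcess API (the bucket/ball identifiers here are placeholders):

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    process.crawl(BucketsBallsSpider, buckets=['1', '2'], balls=['a', 'b', 'c'])
    process.start()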