Scrapy skip request based on previous crawl from same spider
In the example below, each bucket contains many balls. Any given bucket may or may not contain a red ball. To determine whether a ball is red, we crawl it.
If a red ball is found in a bucket, I want to stop crawling the remaining balls in that bucket (i.e. I don't want requests for the next balls to be sent, since I've already found the red one).
The bucket and ball identifiers are query parameters on the base URL.
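For reference, the add_or_replace_parameter and url_query_parameter helpers used below are the ones from w3lib.url; a request URL gets built roughly like this (the identifier values here are just made up):

    from w3lib.url import add_or_replace_parameter, url_query_parameter

    base_url = 'https://bucketswithballs.com'
    url = add_or_replace_parameter(base_url, 'bucket', '42')
    url = add_or_replace_parameter(url, 'ball', '7')
    # url -> 'https://bucketswithballs.com?bucket=42&ball=7'
    assert url_query_parameter(url, 'bucket') == '42'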
What I've tried #1
Keep state on the class and check whether the bucket already has a red ball:
import scrapy
from w3lib.url import add_or_replace_parameter, url_query_parameter


class BucketsBallsSpider(scrapy.Spider):
    name = 'test_spider'
    base_url = 'https://bucketswithballs.com'
    buckets = []
    balls = []
    buckets_with_red_balls = []

    def start_requests(self):
        for bucket in self.buckets:
            for ball in self.balls:
                # Skip the rest of this bucket once a red ball has been recorded
                if bucket in self.buckets_with_red_balls:
                    break
                url = add_or_replace_parameter(self.base_url, 'bucket', bucket)
                url = add_or_replace_parameter(url, 'ball', ball)
                yield scrapy.Request(url, self.parse)

    def parse(self, response, **kwargs):
        is_red_ball = response.xpath('//*[@id="is_red_ball"]').extract()
        if is_red_ball:
            bucket_id = url_query_parameter(response.url, 'bucket')
            self.buckets_with_red_balls.append(bucket_id)
            yield {'bucket_with_red_ball': bucket_id}
What I've tried #2
Generate the requests in the parse method:
import scrapy
from w3lib.url import add_or_replace_parameter, url_query_parameter


class BucketsBallsSpider(scrapy.Spider):
    name = 'test_spider'
    base_url = 'https://bucketswithballs.com'
    buckets = []
    balls = []
    buckets_with_red_balls = []

    def start_requests(self):
        # Start from the first bucket and the first ball
        url = add_or_replace_parameter(self.base_url, 'bucket', self.buckets[0])
        url = add_or_replace_parameter(url, 'ball', self.balls[0])
        yield scrapy.Request(url, self.parse)

    def parse(self, response, **kwargs):
        is_red_ball = response.xpath('//*[@id="is_red_ball"]').extract()
        if is_red_ball:
            bucket_id = url_query_parameter(response.url, 'bucket')
            self.buckets_with_red_balls.append(bucket_id)
            yield {'bucket_with_red_ball': bucket_id}
        # Scrapy's duplicate filter will skip URLs that were already requested
        for bucket in self.buckets:
            for ball in self.balls:
                if bucket in self.buckets_with_red_balls:
                    break
                url = add_or_replace_parameter(self.base_url, 'bucket', bucket)
                url = add_or_replace_parameter(url, 'ball', ball)
                yield scrapy.Request(url, self.parse)
With each example, Scrapy tells me in the console that it crawled every URL. I'd like to avoid that for performance reasons.
This won't work the way you want, because Scrapy works asynchronously: I don't think you can stop the other requests, since they may already be in progress. You could raise a CloseSpider() exception to terminate the spider when a red ball is found, but concurrent requests would still finish before the spider closes. See the Scrapy architecture here.
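For illustration, a rough sketch of that CloseSpider approach, reusing the XPath check and the w3lib helper from your spiders (untested; the spider name is a placeholder, and requests already in flight will still complete):

    import scrapy
    from scrapy.exceptions import CloseSpider
    from w3lib.url import url_query_parameter


    class StopOnRedBallSpider(scrapy.Spider):
        name = 'stop_on_red_ball'

        def parse(self, response, **kwargs):
            # Same red-ball check as in your spiders
            if response.xpath('//*[@id="is_red_ball"]').extract():
                bucket_id = url_query_parameter(response.url, 'bucket')
                yield {'bucket_with_red_ball': bucket_id}
                # Asks Scrapy to stop scheduling new requests; in-flight ones still finish
                raise CloseSpider('red ball found')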
If you need it to stop as soon as the red ball is found and not issue any further requests at all, I think you want it to be synchronous. It might be easier to do that with plain Python requests, for example.
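Something along these lines, as a very rough sketch (the is_red_ball marker and the w3lib helper are taken from your code; checking the raw HTML with a substring is just a stand-in for real parsing):

    import requests
    from w3lib.url import add_or_replace_parameter

    base_url = 'https://bucketswithballs.com'
    buckets = []  # your bucket identifiers
    balls = []    # your ball identifiers

    buckets_with_red_balls = []
    for bucket in buckets:
        for ball in balls:
            url = add_or_replace_parameter(base_url, 'bucket', bucket)
            url = add_or_replace_parameter(url, 'ball', ball)
            html = requests.get(url).text
            if 'id="is_red_ball"' in html:
                buckets_with_red_balls.append(bucket)
                break  # no more requests for the remaining balls in this bucket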
That said, I've updated your example to work synchronously (I haven't tested it). This forces Scrapy to make the requests one by one, regardless of how many concurrent requests it's configured to make, so it isn't very efficient:
import scrapy
from w3lib.url import add_or_replace_parameter, url_query_parameter


class BucketsBallsSpider(scrapy.Spider):
    name = 'test_spider'
    base_url = 'https://bucketswithballs.com'
    buckets = []
    balls = []
    current_bucket_idx = 0
    current_ball_idx = 0

    def start_requests(self):
        # Start from the first bucket and the first ball
        url = add_or_replace_parameter(self.base_url, 'bucket', self.buckets[0])
        url = add_or_replace_parameter(url, 'ball', self.balls[0])
        yield scrapy.Request(url, self.parse)

    def parse(self, response, **kwargs):
        is_red_ball = response.xpath('//*[@id="is_red_ball"]').extract()
        if is_red_ball:
            bucket_id = url_query_parameter(response.url, 'bucket')
            yield {'bucket_with_red_ball': bucket_id}
            return
        # Only schedule the next request once this response has been processed
        next_bucket, next_ball = self._get_next_bucket_and_ball()
        if not next_bucket:
            return
        url = add_or_replace_parameter(self.base_url, 'bucket', next_bucket)
        url = add_or_replace_parameter(url, 'ball', next_ball)
        yield scrapy.Request(url, self.parse)

    def _get_next_bucket_and_ball(self):
        if self.current_ball_idx < len(self.balls) - 1:
            self.current_ball_idx += 1
        else:
            self.current_ball_idx = 0
            if self.current_bucket_idx < len(self.buckets) - 1:
                self.current_bucket_idx += 1
            else:
                # No more buckets/balls to try
                return None, None
        next_bucket = self.buckets[self.current_bucket_idx]
        next_ball = self.balls[self.current_ball_idx]
        return next_bucket, next_ball
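If you run it from a script, something like this should work with the standard CrawlerProcess API (the bucket/ball identifiers here are placeholders):

    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})
    process.crawl(BucketsBallsSpider, buckets=['1', '2'], balls=['a', 'b', 'c'])
    process.start()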