How to use Scrapy to do pagination and visit all links found on each page

I have the following spider, in which I try to combine pagination with rules in order to visit the links on every page.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    start_urls = ['https://ausschreibungen-deutschland.de/1/']

    # Visit all 10 links (it recognizes all 10 sublinks starting with a number and an underscore).
    # CrawlSpider reserves parse() for its own logic, so the rule callback gets a different name.
    rules = (
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Just get all the text on the page.
        all_text = response.xpath("//text()").getall()

        yield {
            "text": " ".join(all_text),
            "url": response.url
        }

        # Visit the next page (attempt; note the /@href to get the URL rather than the element):
        # next_page_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        # if next_page_url is not None:
        #     yield scrapy.Request(response.urljoin(next_page_url))

I would like to achieve the following behavior:

Start on page 1, https://ausschreibungen-deutschland.de/1/, visit all 10 links and fetch the text. (already implemented)

Go to page 2, https://ausschreibungen-deutschland.de/2/, visit all 10 links and fetch the text.

Go to page 3, https://ausschreibungen-deutschland.de/3/, visit all 10 links and fetch the text.

Go to page 4 ...

How would I combine these two concepts?

I have handled the pagination in start_urls; you can increase or decrease the page range as needed.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    # Generate the list pages /1/ through /10/ up front; widen or narrow the range as needed.
    start_urls = ['https://ausschreibungen-deutschland.de/' + str(x) + '/' for x in range(1, 11)]

    # Visit all 10 links on each list page (sublinks starting with a number and an underscore).
    # CrawlSpider reserves parse() for its own logic, so the rule callback gets a different name.
    rules = (
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        yield {
            # To collect all page text instead, use " ".join(response.xpath("//text()").getall()).
            'title': response.xpath('//*[@class="card-body bg-primary text-white card-body-title"]/h2//text()').get(),
            'url': response.url
        }
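
Alternatively, the pagination itself can be expressed as a second Rule, so the spider discovers new list pages on its own instead of relying on a hard-coded range. Here is a minimal sketch, assuming the list pages link to each other and their URLs end in /<number>/ (the class name PagingAllPages is illustrative):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PagingAllPages(CrawlSpider):
    name = "paging_all_pages"
    start_urls = ['https://ausschreibungen-deutschland.de/1/']

    rules = (
        # Follow pagination pages such as /2/, /3/, ... without a callback;
        # assumes list-page URLs end in /<number>/.
        Rule(LinkExtractor(allow=r"/[0-9]+/$"), follow=True),
        # Parse the detail pages (number followed by an underscore) found on each list page.
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Collect all visible text of the detail page, as in the question.
        yield {
            "text": " ".join(response.xpath("//text()").getall()),
            "url": response.url
        }

With two rules the crawl stops as soon as no further pagination links are found, so the total number of pages never has to be known in advance.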