How to use Scrapy to do pagination and visit all links found on each page

I have the following spider, in which I try to combine pagination with rules in order to visit the links on every page.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    start_urls = ['https://ausschreibungen-deutschland.de/1/']

    # Visit all 10 links (it recognizes all 10 sublinks starting with a number and an underscore).
    # CrawlSpider reserves parse() for its own logic, so the rule callback gets a different name.
    rules = (
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Just get all the text on the page.
        all_text = response.xpath("//text()").getall()

        yield {
            "text": " ".join(all_text),
            "url": response.url
        }

        # Visit the next page (attempt; note the /@href to get the URL rather than the element):
        # next_page_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        # if next_page_url is not None:
        #     yield scrapy.Request(response.urljoin(next_page_url))

I would like to achieve the following behavior:

Start on page 1, https://ausschreibungen-deutschland.de/1/, visit all 10 links and fetch the text. (already implemented)

Go to page 2, https://ausschreibungen-deutschland.de/2/, visit all 10 links and fetch the text.

Go to page 3, https://ausschreibungen-deutschland.de/3/, visit all 10 links and fetch the text.

Go to page 4 ...

How would I combine these two concepts?

I have handled the pagination in start_urls; you can increase or decrease the page range as needed.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Paging(CrawlSpider):
    name = "paging"
    # Generate the list pages /1/ through /10/ up front; widen or narrow the range as needed.
    start_urls = ['https://ausschreibungen-deutschland.de/' + str(x) + '/' for x in range(1, 11)]

    # Visit all 10 links on each list page (sublinks starting with a number and an underscore).
    # CrawlSpider reserves parse() for its own logic, so the rule callback gets a different name.
    rules = (
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        yield {
            # To collect all page text instead, use " ".join(response.xpath("//text()").getall()).
            'title': response.xpath('//*[@class="card-body bg-primary text-white card-body-title"]/h2//text()').get(),
            'url': response.url
        }
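
Alternatively, the pagination itself can be expressed as a second Rule, so the spider discovers new list pages on its own instead of relying on a hard-coded range. Here is a minimal sketch, assuming the list pages link to each other and their URLs end in /<number>/ (the class name PagingAllPages is illustrative):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PagingAllPages(CrawlSpider):
    name = "paging_all_pages"
    start_urls = ['https://ausschreibungen-deutschland.de/1/']

    rules = (
        # Follow pagination pages such as /2/, /3/, ... without a callback;
        # assumes list-page URLs end in /<number>/.
        Rule(LinkExtractor(allow=r"/[0-9]+/$"), follow=True),
        # Parse the detail pages (number followed by an underscore) found on each list page.
        Rule(LinkExtractor(allow=r"/[0-9]+_"), callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Collect all visible text of the detail page, as in the question.
        yield {
            "text": " ".join(response.xpath("//text()").getall()),
            "url": response.url
        }

With two rules the crawl stops as soon as no further pagination links are found, so the total number of pages never has to be known in advance.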