Scrapy CrawlSpider iterating through entire site
I have a simple CrawlSpider that scrapes the front page of a particular site. I would like the spider to keep going through ?p=1, ?p=2, and so on until it detects that it has reached the end of the site's pagination. How can I do that?
import logging

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PomosCrawlSpider(CrawlSpider):
    name = 'crawlobituaries'
    allowed_domains = ['some.at']
    start_urls = [
        'https://www.some.at',
    ]

    rules = (
        Rule(LinkExtractor(allow='traueranzeigen'), callback='parse_obi'),
    )

    def parse_obi(self, response):
        logging.info(response.url)
        for post in response.css('.homepage_unterseiten_layout_todesanzeigen'):
            for entry in post.css('a'):
                item = {
                    'name': entry.css('.homepage_unterseiten_layout_titel::text').get(),
                    'date': entry.css('.homepage_unterseiten_layout_datum::text').get()
                }
                yield item
Your spider only scrapes the first page because you did not add follow=True to your Rule definition, so the spider never follows the extracted links to find more of them. You also need a second Rule that follows the next-page links; those can be targeted with the restrict_css argument of LinkExtractor, using the class of the pagination div. See the example code below.
import logging

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class PomosCrawlSpider(CrawlSpider):
    name = 'crawlobituaries'
    allowed_domains = ['bestattung-aichinger.at']
    start_urls = [
        'https://www.bestattung-aichinger.at',
    ]

    rules = (
        # Follow links whose text matches 'Traueranzeigen' (the obituaries section)
        Rule(LinkExtractor(restrict_text='Traueranzeigen'), callback='parse_obi', follow=True),
        # Follow the pagination links inside the ".seitenzahlen" navigation div
        Rule(LinkExtractor(restrict_css=".seitenzahlen"), callback='parse_obi', follow=True),
    )

    def parse_obi(self, response):
        logging.info(response.url)
        for post in response.css('.homepage_unterseiten_layout_todesanzeigen'):
            for entry in post.css('a'):
                item = {
                    'name': entry.css('.homepage_unterseiten_layout_titel::text').get(),
                    'date': entry.css('.homepage_unterseiten_layout_datum::text').get()
                }
                yield item
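If you want to try the spider outside a full Scrapy project, one option is to drive it with CrawlerProcess and export the yielded items via the FEEDS setting. This is just a minimal sketch with illustrative settings and an assumed output file name; running scrapy crawl crawlobituaries -o obituaries.json from a regular project works just as well.

from scrapy.crawler import CrawlerProcess

# Minimal run sketch: export every item yielded by parse_obi to obituaries.json.
# The file name and log level here are illustrative, not part of the answer above.
process = CrawlerProcess(settings={
    'FEEDS': {'obituaries.json': {'format': 'json'}},
    'LOG_LEVEL': 'INFO',
})
process.crawl(PomosCrawlSpider)
process.start()  # blocks until the crawl finishes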