Scrapy Crawl Spider is not following the links

Question

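I wrote this script to get data from https://de.rs-online.com/web/c/automation/elektromechanische-magnete/hubmagnete-linear/. My goal is to follow all the links and extract items from all of those pages. But I can't tell what is wrong with the script: it is not following the links. With a basic spider I can easily get items from a page, but with a crawl spider it does not work. It doesn't throw any errors; it only prints the following messages: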

2022-02-19 21:36:56 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-02-19 21:36:56 [scrapy.core.engine] INFO: Spider opened
2022-02-19 21:36:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-02-19 21:36:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ProductSpider(CrawlSpider):
    name = 'product'
    allowed_domains = ['de.rs-online.com']
    start_urls = ['https://de.rs-online.com/web/c/automation/elektromechanische-magnete/hubmagnete-linear/']

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//tr/td/div/div/div[2]/div/a"), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'Title': response.xpath("//h1/text()").get(),
            'Categories': response.xpath("(//ol[@class='Breadcrumbstyled__StyledBreadcrumbList-sc-g4avu2-1 gHzygm']/li/a)[4]/text()").get(),
            'RS Best.-Nr.': response.xpath("//dl[@data-testid='key-details-desktop']/dd[1]/text()").get(),
            'URL': response.url
        }

Answer 1

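If you want to follow all links without any filtering, you can simply omit the restrict_xpaths argument from the LinkExtractor in your Rule definition. Note, however, that the XPaths in your parse_item callback are not correct, so you will still get empty items. Recheck your XPaths and define them correctly to get the information you want.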

rules = (
    Rule(LinkExtractor(), callback='parse_item', follow=True),
)

scrapy