Scrapy rules do not call the parse method

Question

I am new to Scrapy and am trying to crawl a domain, following all internal links and scraping the title of every URL that matches the pattern /example/.*

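The crawling works, but the title scraping does not: the output file is empty. Most likely I got the rules wrong. Is this the correct rule syntax for what I am trying to achieve?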

items.py

import scrapy


class BidItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()

spider.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bid.items import BidItem

class GetbidSpider(CrawlSpider):
    name = 'getbid'
    allowed_domains = ['domain.de']
    start_urls = ['https://www.domain.de/']

    rules = (
        Rule(
            LinkExtractor(), 
            follow=True
        ),
        Rule(
            LinkExtractor(allow=['example/.*']), 
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        href = BidItem()
        href['url'] = response.url
        href['title'] = response.css("h1::text").extract()
        return href

Crawl command: scrapy crawl getbid -o 012916.csv

Answer 1

From the CrawlSpider docs:

If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.

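Because your first rule matches every link, it is always the one used, and all other rules are ignored. The second rule, the only one with a callback, therefore never gets a link, and parse_item is never called.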
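To make the first-match behaviour concrete, here is a small plain-Python illustration. It is not Scrapy's own code: `assign_rules`, the rule names, and the two sample URLs are hypothetical, and regular expressions stand in for link extractors. It only mimics the idea that a link claimed by an earlier rule is no longer available to later ones.

```python
import re


def assign_rules(links, rules):
    """Map each link to the name of the first rule whose pattern matches it."""
    seen = set()      # links already claimed by an earlier rule
    assigned = {}
    for name, pattern in rules:
        for link in links:
            if link not in seen and re.search(pattern, link):
                seen.add(link)
                assigned[link] = name
    return assigned


# Hypothetical sample links, for illustration only.
about_url = "https://www.domain.de/about"
item_url = "https://www.domain.de/example/item-1"
links = [about_url, item_url]

# Same order as in the question: the catch-all rule comes first.
original_order = [("follow_only", r".*"), ("parse_item", r"example/.*")]
# Specific rule first, catch-all last.
swapped_order = [("parse_item", r"example/.*"), ("follow_only", r".*")]

print(assign_rules(links, original_order)[item_url])  # follow_only
print(assign_rules(links, swapped_order)[item_url])   # parse_item
```

With the catch-all rule first, the /example/ link never reaches the rule that has the callback, which matches the empty output file described in the question.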

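Fixing the problem is as simple as switching the order of the rules, so that the more specific example/.* rule comes before the catch-all one.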

python

scrapy

scrapy-spider