Scrapy rules do not call the parse method
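I am new to Scrapy and am trying to crawl a domain, following all internal links and scraping the title of every URL that matches the pattern /example/.*
The crawl itself works, but the titles are not scraped: the output file is empty. Most likely I got the rules wrong. Is this the right syntax for using rules to achieve what I am after?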
items.py

    import scrapy


    class BidItem(scrapy.Item):
        url = scrapy.Field()
        title = scrapy.Field()

spider.py

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    from bid.items import BidItem


    class GetbidSpider(CrawlSpider):
        name = 'getbid'
        allowed_domains = ['domain.de']
        start_urls = ['https://www.domain.de/']

        rules = (
            Rule(
                LinkExtractor(),
                follow=True
            ),
            Rule(
                LinkExtractor(allow=['example/.*']),
                callback='parse_item'
            ),
        )

        def parse_item(self, response):
            href = BidItem()
            href['url'] = response.url
            href['title'] = response.css("h1::text").extract()
            return href

Crawl command: scrapy crawl getbid -o 012916.csv
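From the Scrapy documentation on CrawlSpider rules:
    If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.
Because your first rule matches every link, it is always the one used, so the second rule and its parse_item callback never run.
Fixing this is as simple as switching the order of the rules.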
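To see why the order matters, here is a minimal sketch of the first-match behaviour written with the standard library only, not with Scrapy itself. `SketchRule` and `assign_links` are hypothetical names invented for this illustration, the two sample URLs are made up, and the catch-all pattern `.*` merely stands in for a `LinkExtractor()` with no arguments. It only models the idea that a link claimed by an earlier rule is skipped by every later rule.

```python
import re


class SketchRule:
    """Hypothetical stand-in for a Scrapy Rule: a regex plus an optional callback name."""

    def __init__(self, allow, callback=None):
        self.pattern = re.compile(allow)
        self.callback = callback


def assign_links(rules, links):
    """Give each link to the first rule (in definition order) whose pattern matches it."""
    seen = set()
    assigned = {}
    for rule in rules:
        for link in links:
            # A link already claimed by an earlier rule is never offered to a later one.
            if link not in seen and rule.pattern.search(link):
                seen.add(link)
                assigned[link] = rule.callback
    return assigned


# Made-up sample links, for illustration only.
links = [
    "https://www.domain.de/about",
    "https://www.domain.de/example/page-1",
]

catch_all = SketchRule(allow=r".*")  # plays the role of LinkExtractor()
example = SketchRule(allow=r"example/.*", callback="parse_item")

original_order = assign_links([catch_all, example], links)
swapped_order = assign_links([example, catch_all], links)

print(original_order["https://www.domain.de/example/page-1"])  # None
print(swapped_order["https://www.domain.de/example/page-1"])   # parse_item
```

In the spider from the question, the equivalent change is to list the Rule with callback='parse_item' before the catch-all Rule(LinkExtractor(), follow=True) inside the rules tuple.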