How can I reuse the parse method of my scrapy Spider-based spider in an inheriting CrawlSpider?

I currently have a Spider-based spider that I wrote to crawl an input JSON array of start_urls:

from scrapy.spider import Spider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader

import json
import datetime
import re

class AtlanticFirearmsSpider(Spider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls='[]', *args, **kwargs):
        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)
        self.start_urls = json.loads(start_urls)

    def parse(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product

I can call it from the command line like so, and it does its job well:

scrapy crawl atlantic_firearms -a start_urls='["http://www.atlanticfirearms.com/component/virtuemart/shipping-rifles/ak-47-receiver-aam-47-detail.html", "http://www.atlanticfirearms.com/component/virtuemart/shipping-accessories/nitride-ak47-7-62x39mm-barrel-detail.html"]'

However, I'm now trying to add a CrawlSpider-based spider that crawls the entire site, inheriting from it and reusing the parse method logic. My first attempt looked like this:

class AtlanticFirearmsCrawlSpider(CrawlSpider, AtlanticFirearmsSpider):
    name = "atlantic_firearms_crawler"
    start_urls = [
        "http://www.atlanticfirearms.com"
    ]
    rules = (
        # I know, I need to update these to LxmlLinkExtractor
        Rule(SgmlLinkExtractor(allow=['detail.html']), callback='parse'),
        Rule(SgmlLinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion'])),
    )

Running this spider with

scrapy crawl atlantic_firearms_crawler

crawls the site but never parses any items. I think it's because CrawlSpider apparently has its own definition of parse, and I'm clobbering it.
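That guess is right: CrawlSpider reserves parse for its own crawling machinery (that is where it applies the rules and dispatches matched links to your named callbacks), while the plain Spider base class leaves parse unimplemented. A scrapy-free sketch of the name collision, using hypothetical stand-in classes rather than the real ones:

```python
# Hypothetical stand-ins for scrapy's classes, just to illustrate the clash.
class Spider:
    def parse(self, response):
        # plain Spider: you MUST override parse, or requests fail with this
        raise NotImplementedError

class CrawlSpider(Spider):
    def parse(self, response):
        # CrawlSpider claims parse for itself: this is where it would apply
        # the rules and route matched links to your callbacks
        return "applying rules to " + response

class MyCrawler(CrawlSpider):
    # Overriding parse here shadows the rule-dispatching logic entirely,
    # so the rules are never applied.
    def parse(self, response):
        return "my item from " + response
```

With callback='parse' in a Rule, your item callback and the crawler's internal entry point collide, which is why renaming the callback fixes the CrawlSpider while breaking the plain Spider.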

When I change callback='parse' to callback='parse_item' and rename the parse method in AtlanticFirearmsSpider to parse_item, it works wonderfully: it crawls the whole site and parses the items successfully. But if I then try to run my original atlantic_firearms spider again, it errors out with a NotImplementedError, apparently because Spider-based spiders really do expect the parse method to be named parse.

What's the best way to reuse my logic between these spiders so that I can both supply a JSON array of start_urls and also do full-site crawling?

You can avoid multiple inheritance here.

Merge the two spiders into one. If start_urls is passed from the command line, it behaves like a regular spider; otherwise, it behaves like a CrawlSpider:

from scrapy import Item
from scrapy.contrib.spiders import CrawlSpider, Rule

from foo.items import AtlanticFirearmsItem
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.linkextractors import LinkExtractor

import json


class AtlanticFirearmsSpider(CrawlSpider):
    name = "atlantic_firearms"
    allowed_domains = ["atlanticfirearms.com"]

    def __init__(self, start_urls=None, *args, **kwargs):
        if start_urls:
            self.start_urls = json.loads(start_urls)
            self.rules = []
            self.parse = self.parse_response
        else:
            self.start_urls = ["http://www.atlanticfirearms.com/"]
            self.rules = [
                Rule(LinkExtractor(allow=['detail.html']), callback='parse_response'),
                Rule(LinkExtractor(allow=[], deny=['/bro', '/news', '/howtobuy', '/component/search', 'askquestion']))
            ]

        super(AtlanticFirearmsSpider, self).__init__(*args, **kwargs)

    def parse_response(self, response):
        l = ItemLoader(item=AtlanticFirearmsItem(), response=response)
        product = l.load_item()
        return product

Or, alternatively, just extract the logic from the parse() method into a library function and call it from both spiders, keeping them as separate, unrelated spiders.
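A minimal sketch of that second option, with a stdlib-only stand-in for the ItemLoader logic (extract_product and the title regex are illustrative placeholders, not the real implementation):

```python
import re

# Shared module (e.g. a hypothetical foo/parsing.py): the page-parsing logic
# lives in one plain function that any spider can call.
def extract_product(html):
    # Stand-in for the real ItemLoader-based logic: grab the page title.
    m = re.search(r"<title>(.*?)</title>", html, re.S)
    return {"title": m.group(1).strip() if m else None}

# Each spider then just delegates, under whatever method name its base
# class expects:
#
#   class AtlanticFirearmsSpider(Spider):
#       def parse(self, response):            # Spider wants "parse"
#           return extract_product(response.body)
#
#   class AtlanticFirearmsCrawlSpider(CrawlSpider):
#       def parse_item(self, response):       # any name but "parse"
#           return extract_product(response.body)
```

Since neither spider owns the parsing logic, renaming a callback in one no longer breaks the other.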