How to add instance variable to Scrapy CrawlSpider?

Question

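I am setting up a CrawlSpider, and I want to implement some logic that stops following certain links mid-run by passing a function to process_request.

This function uses the spider's class variables to keep track of the current state, and depending on that state (and on the referring URL), links are either dropped or continue to be processed: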

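from scrapy.exceptions import IgnoreRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BroadCrawlSpider(CrawlSpider):
    name = 'bitsy'
    start_urls = ['http://scrapy.org']
    foo = 5  # class attribute used to track the current state

    rules = (
        Rule(LinkExtractor(), callback='parse_item', process_request='filter_requests', follow=True),
    )

    def parse_item(self, response):
        pass  # <some code>

    def filter_requests(self, request):
        # someval is a placeholder for the referring URL to match against
        if self.foo == 6 and request.headers.get('Referer', None) == someval:
            raise IgnoreRequest("Ignored request: bla %s" % request)
        return request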

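I think that if I run several spiders on the same machine, they will all use the same class variables, which is not what I intend.

Is there a way to add instance variables to CrawlSpiders? Is only a single instance of the spider created when I run Scrapy?

I could probably work around this with a dictionary that holds a value per process ID, but that would be ugly...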
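
The sharing concern above can be checked in plain Python, without Scrapy. The following is only a minimal sketch: DemoSpider and its two methods are hypothetical names made up for illustration. It shows that a class attribute is shared between instances living in the same Python process, while assigning through self creates a separate per-instance value. Spiders started as separate OS processes do not share memory at all, so the sharing only matters for instances inside one process.

```python
class DemoSpider:
    """Hypothetical stand-in for a spider class; not Scrapy's own Spider."""

    foo = 5  # class attribute: the fallback value for every instance in this process

    def bump_on_instance(self):
        # Assigning through self creates an instance attribute that shadows the class one.
        self.foo = self.foo + 1

    @classmethod
    def bump_on_class(cls):
        # Assigning through the class changes the value all instances fall back to.
        cls.foo = cls.foo + 1


a, b = DemoSpider(), DemoSpider()

a.bump_on_instance()
print(a.foo, b.foo, DemoSpider.foo)  # 6 5 5

DemoSpider.bump_on_class()
print(a.foo, b.foo, DemoSpider.foo)  # 6 6 6
```

After the first call only a carries its own foo; b keeps reading the class attribute, which is why the second call changes what b sees but not what a sees.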

Answer 1

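I think spider arguments would be the solution in your case.

When invoking scrapy like scrapy crawl some_spider, you can add arguments like scrapy crawl some_spider -a foo=bar, and the spider will receive the values via its constructor, e.g.: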

import scrapy

class SomeSpider(scrapy.Spider):
    name = 'some_spider'
    def __init__(self, foo=None, *args, **kwargs):
        super(SomeSpider, self).__init__(*args, **kwargs)
        # Do something with foo

What's more, since scrapy.Spider actually sets all additional arguments as instance attributes, you don't even need to explicitly override the __init__ method; just access the .foo attribute. :)
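
To make the mechanism concrete, here is a rough sketch that imitates it without requiring Scrapy. MiniSpider is a hypothetical stand-in that only mimics the keyword-argument handling described above, and BitsySpider and the bar argument are likewise made-up names. It assumes the values come from -a options, which arrive as strings, so foo is converted explicitly.

```python
class MiniSpider:
    """Hypothetical stand-in imitating how extra keyword arguments are stored."""

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # Every extra keyword argument becomes an instance attribute.
        self.__dict__.update(kwargs)


class BitsySpider(MiniSpider):
    name = 'bitsy'

    def __init__(self, foo='5', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Values passed on the command line are strings, so convert explicitly.
        self.foo = int(foo)


# Roughly what "-a foo=6 -a bar=baz" would hand to the constructor.
first = BitsySpider(foo='6', bar='baz')
# A second instance started without arguments keeps its own default state.
second = BitsySpider()

print(first.foo, first.bar)  # 6 baz
print(second.foo)            # 5
```

Each instance ends up with its own foo, so a filter method that reads self.foo sees per-instance state rather than a value shared through the class.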

python

scrapy

scrapy-spider