How to access pipeline database pool in scrapy spider

Question

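First, here is what I am trying to do:

I have an XMLFeedSpider that iterates over a list of products (nodes) in an XML file and creates items, which are saved to my database by a pipeline. The first time I see a product, I need to create requests to scrape the product's url field for images and so on. On subsequent reads of the feed, if I see the same product, I don't want to waste time/resources doing this again and just want to skip those extra requests. To know which products to skip, I need to access my database to check whether the product already exists.

Here are the ways I can think of to do this:

Just make a database request for each product in the spider. This seems like a bad idea.
In my item store pipeline I already create a database pool like this: dbpool = adbapi.ConnectionPool('psycopg2', cp_max=2, cp_min=1, **dbargs). It seems more efficient to just use that, so I am not constantly creating new database connections. I don't know how to access the instantiated pipeline class from my spider (this is probably more of a general Python question).
Note: this person is asking basically the same question, but didn't really get the answer he was looking for: How to get the pipeline object in Scrapy spider
Maybe load all the product urls into memory before starting the crawl, so I can compare against them while processing products? Where would be a good place to do this?
Other suggestions?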

Update: here is my pipeline with the database pool

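from pydispatch import dispatcher
from scrapy import signals
from twisted.enterprise import adbapi


class PostgresStorePipeline(object):
    """A pipeline to store the item in a PostgreSQL database.
    This implementation uses Twisted's asynchronous database API.
    """

    def __init__(self, dbpool):
        print("Opening connection pool...")
        # spider_closed is a method of this class that is not shown in this excerpt
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        self.dbpool = dbpool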

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['MYSQL_HOST'],
            database=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            password=settings['MYSQL_PASSWD'],
            #charset='utf8',
            #use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('psycopg2', cp_max=2, cp_min=1, **dbargs)
        return cls(dbpool)

Answer 1

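I think by item you mean URL. Remember that in scrapy an item is the output data, and a pipeline is the mechanism that processes those output items.

Of course you don't need to open many connections to run your database queries, but you do have to run the necessary queries. Whether to run a single query or one query per URL depends on how many records are in your database; you should test which works better for your case.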
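The two options can be compared with a small sketch. It uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere; the products table, its url column, and the helper names load_known_urls and url_in_db are assumptions made for illustration, not part of the original setup.

```python
import sqlite3


def load_known_urls(conn):
    """Single query up front: fetch every stored product URL into a set."""
    rows = conn.execute("SELECT url FROM products").fetchall()
    return {row[0] for row in rows}


def url_in_db(conn, url):
    """One query per URL: ask the database each time a product is seen."""
    row = conn.execute(
        "SELECT 1 FROM products WHERE url = ? LIMIT 1", (url,)
    ).fetchone()
    return row is not None


# In-memory database with a hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (url TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO products (url) VALUES (?)",
    [("http://example.com/p/1",), ("http://example.com/p/2",)],
)

known_urls = load_known_urls(conn)
print("http://example.com/p/1" in known_urls)      # True
print(url_in_db(conn, "http://example.com/p/3"))   # False
```

The set lookup costs memory proportional to the number of stored products but avoids a round trip per product; the per-URL query keeps memory flat at the cost of one query each time.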

I would recommend setting your own DUPEFILTER_CLASS, something like:

from scrapy.dupefilters import RFPDupeFilter

class DBDupeFilter(RFPDupeFilter):

    def __init__(self, *args, **kwargs):
        # self.cursor = .....                       # instantiate your cursor
        super(DBDupeFilter, self).__init__(*args, **kwargs)

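    def request_seen(self, request):
        # "myquery" is a placeholder for a lookup of request.url in your table.
        # DB-API cursors do not return a useful value from execute(), so fetch a row.
        self.cursor.execute("myquery")
        if self.cursor.fetchone() is not None:      # if exists
            return True
        else:
            return super(DBDupeFilter, self).request_seen(request)

    def close(self, reason):
        self.cursor.close()                         # close your cursor
        super(DBDupeFilter, self).close(reason)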
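For Scrapy to use a filter like this, the DUPEFILTER_CLASS setting has to point at it. A minimal settings fragment, assuming the class lives in a hypothetical module named myproject.dupefilters:

```python
# settings.py
# "myproject.dupefilters" is a hypothetical module path; use wherever
# DBDupeFilter is actually defined in your project.
DUPEFILTER_CLASS = "myproject.dupefilters.DBDupeFilter"
```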

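Update

The problem here is that DUPEFILTER_CLASS doesn't offer the spider in its request_seen method, or even in the constructor, so I think your best bet is a Downloader Middleware, where you can raise an IgnoreRequest exception.

Instantiate the db connection on the spider. You can do this in the spider itself (in the constructor), or you can add it through a signal in a middleware or pipeline; here we add it in the middleware: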

from scrapy import signals
from scrapy.exceptions import IgnoreRequest
from twisted.enterprise import adbapi

class DBMiddleware(object):

    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        o = cls()
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

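    def spider_opened(self, spider):
        # dbargs is not defined here; it stands for the same connection
        # arguments that the pipeline builds from the settings
        spider.dbpool = adbapi.ConnectionPool('psycopg2', cp_max=2, cp_min=1, **dbargs)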

    def process_request(self, request, spider):
        if spider.dbpool... # check if request.url inside the database
            raise IgnoreRequest()

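Now in your pipeline, remove the instantiation of dbpool and get it from the spider argument when needed. Remember that process_item receives the item and the spider as arguments, so you should be able to use spider.dbpool to reach your db connection.
Remember to activate your middleware.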
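Activation goes through the DOWNLOADER_MIDDLEWARES setting. A minimal fragment, assuming the class lives in a hypothetical module named myproject.middlewares; the order value 543 is only an illustrative choice:

```python
# settings.py
# "myproject.middlewares" is a hypothetical module path, and 543 is an
# arbitrary order value picked for illustration.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.DBMiddleware": 543,
}
```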

That way, you should only create one database connection instance, on the spider object.
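On the pipeline side, reading the pool from the spider could look roughly like the sketch below. It assumes spider.dbpool is a Twisted adbapi.ConnectionPool set elsewhere; the products table and the url and name item fields are hypothetical and only there to make the example concrete.

```python
class PostgresStorePipeline(object):
    """Stores items through the pool attached to the spider (spider.dbpool)."""

    def process_item(self, item, spider):
        # runInteraction runs _insert inside a transaction on a pool thread
        # and returns a Deferred.
        d = spider.dbpool.runInteraction(self._insert, item)
        # Pass the item on to the next pipeline once the insert has finished.
        d.addCallback(lambda _: item)
        return d

    def _insert(self, tx, item):
        # Hypothetical table and columns, for illustration only.
        tx.execute(
            "INSERT INTO products (url, name) VALUES (%s, %s)",
            (item["url"], item["name"]),
        )
```

Because the pipeline no longer owns the pool, it does not need from_settings or its own cleanup for the connection; whoever created spider.dbpool is the natural place to close it.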

python

scrapy

scrapy-spider