Passing Variables to Scrapy

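I have a basic Scrapy project in which I have hard-coded two variables, pProd and pReviews. I now want to read these variables from a CSV file, or pass them in when I call the spider. I have been trying for the past few hours, but using the -a option when calling the spider does not seem to get me anywhere. For example: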

scrapy crawl myspider -a Prod="P123" -a Revs="200" -o test.csv

Here is my code with the hard-coded variables:

import scrapy
from scrapy import Spider, Request
import re
import json

class myspider(Spider):
    name = 'myspider'
    allowed_domains = ['mydom.com']
    start_urls = ['https://api.mydom.com']

    def start_requests(self):
        urls = ["https://api.mydom.com"]
        pProd = "P123"
        pReviews = 200
        for url in urls:
            #Generate URL as API only brings back 100 at a time
            for i in range(0, pReviews, 100):
                links = 'https://api.mydom.com/data/reviews.json?Filter=ProductId%3A' + pProd + '&Offset=' + str(i) + '&passkey=123qwe'
                yield scrapy.Request(
                    url=str(links),
                    cb_kwargs={'ProductID' : pProd},
                    callback=self.parse_reviews,
                )
                
    def parse_reviews(self, response, ProductID):
        data = json.loads(response.text)
        proddata = data['Includes']
        reviews = data['Results']
        p_prodid = ProductID
        try:
            p_prodcat = proddata['Products'][ProductID]['CategoryId']
        except:
            p_prodcat = None
                                
        for review in reviews:
            try:
                r_reviewdate = review['SubmissionTime']
            except:
                r_reviewdate = None
                        
            yield{
                'prodid' : p_prodid,
                'prodcat' : p_prodcat,
                'reviewdate' : r_reviewdate,
            }

I have tried a few different approaches, including adding the variable names to def start_requests, for example:

def start_requests(self, pProd='', pReviews='', **kwargs):

But I am not getting anywhere. Any guidance on where I am going wrong would be appreciated.

You don't have to declare a constructor (__init__) every time. You can specify the arguments on the command line as before:

scrapy crawl myspider -a parameter1=value1 -a parameter2=value2

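In your spider code, you can then use them as spider attributes:

class MySpider(Spider):
    name = 'myspider'
    ...
    def parse(self, response):
        ...
        # values passed with -a arrive as strings
        if self.parameter1 == 'value1':
            ...  # this is True

        # or, with a default in case the argument was not passed
        if getattr(self, 'parameter2', None) == 'value2':
            ...  # this is also True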

*From "How to pass a user defined argument in scrapy spider"
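
Applied to the spider in the question: Scrapy calls start_requests() without arguments, so adding parameters to its signature has no effect. Values given with -a go to the spider's constructor, which by default stores them as attributes (self.Prod and self.Revs for the command above), always as strings. The sketch below is one possible way to build the paginated URLs from such values and to read the same pairs from a CSV file. It is not from the quoted answer. The helper names review_urls and read_products and the CSV column names "Prod" and "Revs" are assumptions made for illustration, and the code uses only the standard library so it can be tried without Scrapy.

```python
import csv
import io
from urllib.parse import urlencode

API_URL = "https://api.mydom.com/data/reviews.json"  # endpoint from the question
PAGE_SIZE = 100  # the API returns at most 100 reviews per request


def review_urls(prod, revs, passkey="123qwe"):
    """Yield one API URL per page of reviews (hypothetical helper).

    `revs` is converted with int() because values passed via -a are strings.
    """
    for offset in range(0, int(revs), PAGE_SIZE):
        query = urlencode(
            {"Filter": f"ProductId:{prod}", "Offset": offset, "passkey": passkey}
        )
        yield f"{API_URL}?{query}"


def read_products(csv_file):
    """Yield (product_id, review_count) pairs from a CSV with a header row.

    The column names "Prod" and "Revs" are assumed, not taken from the question.
    """
    for row in csv.DictReader(csv_file):
        yield row["Prod"], int(row["Revs"])


# Usage with an in-memory CSV standing in for a real file:
sample = io.StringIO("Prod,Revs\nP123,200\nP456,100\n")
for prod, revs in read_products(sample):
    for url in review_urls(prod, revs):
        print(url)
```

Inside the spider, start_requests could loop over review_urls(getattr(self, 'Prod', 'P123'), getattr(self, 'Revs', 200)) and yield a scrapy.Request for each URL, passing cb_kwargs={'ProductID': ...} as in the original code. The getattr defaults keep the spider working when no -a arguments are supplied.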