Passing Variables to Scrapy
I have a basic Scrapy project in which I have hard-coded two variables, pProd and pReviews. I now want to read these variables from a CSV file, or pass them in when calling the spider. I've been trying for the past few hours, but passing them with the -a attribute when calling the spider seems to get me nowhere. For example:
scrapy crawl myspider -a Prod="P123" -a Revs="200" -o test.csv
Here is my code with the hard-coded variables:
import scrapy
from scrapy import Spider, Request
import re
import json

class myspider(Spider):
    name = 'myspider'
    allowed_domains = ['mydom.com']
    start_urls = ['https://api.mydom.com']

    def start_requests(self):
        urls = ["https://api.mydom.com"]
        pProd = "P123"
        pReviews = 200
        for url in urls:
            # Generate URLs as the API only brings back 100 reviews at a time
            for i in range(0, pReviews, 100):
                links = 'https://api.mydom.com/data/reviews.json?Filter=ProductId%3A' + pProd + '&Offset=' + str(i) + '&passkey=123qwe'
                yield scrapy.Request(
                    url=str(links),
                    cb_kwargs={'ProductID': pProd},
                    callback=self.parse_reviews,
                )

    def parse_reviews(self, response, ProductID):
        data = json.loads(response.text)
        proddata = data['Includes']
        reviews = data['Results']
        p_prodid = ProductID
        try:
            p_prodcat = proddata['Products'][ProductID]['CategoryId']
        except KeyError:
            p_prodcat = None
        for review in reviews:
            try:
                r_reviewdate = review['SubmissionTime']
            except KeyError:
                r_reviewdate = None
            yield {
                'prodid': p_prodid,
                'prodcat': p_prodcat,
                'reviewdate': r_reviewdate,
            }
I have tried several different approaches, including adding the variable names to the def start_requests signature, e.g.:
def start_requests(self, pProd='', pReviews='', **kwargs):
but it seems to get me nowhere. I'd appreciate some guidance on where I'm going wrong.
You don't have to declare the constructor (__init__) every time; you can pass the arguments just as you did:
scrapy crawl myspider -a parameter1=value1 -a parameter2=value2
and in your spider code you can use them as spider attributes:
class MySpider(Spider):
    name = 'myspider'
    ...
    def parse(self, response):
        ...
        if self.parameter1 == value1:
            # this is True
        # or also
        if getattr(self, 'parameter2') == value2:
            # this is also True
*From How to pass a user defined argument in scrapy spider
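Applying this to the spider above: Scrapy's Spider.__init__ stores each -a name=value pair as a string attribute on the spider instance, so start_requests can read them with getattr. The sketch below is a minimal, Scrapy-free stand-in (the MiniSpider class only mimics that attribute-setting behavior); the fallback defaults and the review_urls helper name are illustrative assumptions, not part of the Scrapy API.

```python
# Minimal sketch (no Scrapy needed): scrapy.Spider.__init__ stores every
# "-a name=value" pair on the instance via setattr, which is all this
# hypothetical stand-in class reproduces.
class MiniSpider:
    def __init__(self, **kwargs):
        for name, value in kwargs.items():
            setattr(self, name, value)   # -a values always arrive as strings

    def review_urls(self):
        # Fall back to the old hard-coded values when no -a was given.
        prod = getattr(self, 'Prod', 'P123')
        revs = int(getattr(self, 'Revs', 200))  # cast: -a gives a str
        return [
            'https://api.mydom.com/data/reviews.json?Filter=ProductId%3A'
            + prod + '&Offset=' + str(i) + '&passkey=123qwe'
            for i in range(0, revs, 100)
        ]

# Mimics: scrapy crawl myspider -a Prod="P123" -a Revs="200"
spider = MiniSpider(Prod='P123', Revs='200')
for url in spider.review_urls():
    print(url)
```

In the real spider, the same getattr calls would go at the top of start_requests in place of the hard-coded pProd and pReviews. The key detail is that -a values are always strings, so pReviews must be cast to int before being used in range().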