Scrapy with Django - Pass more information than just the URL
I'm trying to get Scrapy to run a spider over a large number of URLs that I already have stored in a database.
The "spiders" themselves all work fine.
What I can't do is get Scrapy to "remember" which object it is working on. The code below matches each item back to my Django database using the URL field.
The problem is that URLs often change when visited in a browser (redirects), so Scrapy doesn't know where to put the data.
Ideally, I could 'tell' the spider the object's primary key, eliminating any room for error.
import sys, os, scrapy, django
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import DropItem

## Django init ##
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))  ## direct to where manage.py is
os.environ['DJANGO_SETTINGS_MODULE'] = 'XYZDB.settings'
django.setup()
#################

## Settings ##
#queryset_chunksize = 1000
##############

from XYZ import models
from parsers import dj, asos, theiconic

stores = [dj, asos, theiconic]
parsers = dict((i.domain, i) for i in stores)


def urls():
    for i in models.Variation.objects.iterator():
        yield i.link_original if i.link_original else i.link


class Superspider(scrapy.Spider):
    name = 'Superspider'
    start_urls = urls()

    def parse(self, response):
        for i in parsers:
            if i in response.url:
                return parsers[i].parse(response)


## Reference - models
'''
Stock_CHOICES = (
    (1, 'In Stock'),
    (2, 'Low Stock'),
    (3, 'Out of Stock'),
    (4, 'Discontinued'),
)
'''


class ProductPipeline:
    def process_item(self, item, spider):
        var = models.Variation.objects.get(link_original=item['url'])
        size = models.Size.objects.get(variation=var)
        # Compare against the stored field values, not the model instance/queryset.
        if item['stock'] != models.Stock.objects.filter(size=size)[0].stock:
            models.Stock(size=size, stock=item['stock']).save()
        if int(item['price']) != int(models.Price.objects.filter(variation=var)[0].price):
            models.Price(variation=var, price=item['price']).save()
        return item  # pipelines should return the item (or raise DropItem)


if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'ITEM_PIPELINES': {'__main__.ProductPipeline': 1},
        'DOWNLOAD_DELAY': 0.4,
    })
    process.crawl(Superspider)
    process.start()
You can make use of Scrapy's response.meta attribute. Replace the definition of start_urls with a start_requests(self) method in which you yield Request(your_url, meta={'pk': primary_key}). You can then access the metadata in the parse() routine via item['pk'] = response.meta['pk']. See the start_requests() docs.
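A minimal sketch of that change, applied to the spider from the question. It assumes the models and parsers objects defined there, and that each store parser returns a single dict-like item:

class Superspider(scrapy.Spider):
    name = 'Superspider'

    def start_requests(self):
        # Attach each row's primary key to its request via meta, so the
        # pipeline no longer has to match responses back by URL.
        for variation in models.Variation.objects.iterator():
            url = variation.link_original or variation.link
            yield scrapy.Request(url, meta={'pk': variation.pk})

    def parse(self, response):
        for domain in parsers:
            if domain in response.url:
                item = parsers[domain].parse(response)
                item['pk'] = response.meta['pk']  # carry the key to the pipeline
                return item

The pipeline can then look the object up directly with models.Variation.objects.get(pk=item['pk']), regardless of any redirects the URL went through.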