How to execute a specific spider from the Pipeline without activating it again
Introduction
The website I want to scrape has two URLs:
/top, which lists the top players
/player/{name}, which shows the information of the player named {name}
From the first URL I get the player names and positions, and I can then call the second URL with each player's name. My current goal is to store all of this data in a database.
Problem
I created two spiders. The first crawls /top, and the second crawls /player/{name} for every player found by the first one. However, to insert the first spider's data into the database, I first need to call the profile spider, because the player record is referenced through a foreign key, as the following queries show:
INSERT INTO top_players (player_id, position) values (1, 1)
INSERT INTO players (name) values ('John Doe')
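For illustration, here is a minimal sketch of the kind of schema these queries imply; the table and column names come from the two queries above, while the types and the id column are assumed:

import sqlite3

conn = sqlite3.connect('players.db')  # hypothetical database file
conn.executescript("""
CREATE TABLE IF NOT EXISTS players (
    id   INTEGER PRIMARY KEY,   -- referenced by top_players.player_id
    name TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS top_players (
    player_id INTEGER NOT NULL REFERENCES players(id),  -- foreign key: the player row must exist first
    position  INTEGER NOT NULL
);
""")
conn.close()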
Question
Is it possible to execute a spider from a pipeline and get its results back? I mean, the spider being called should not trigger the pipeline again.
I suggest taking more control over the scraping process, in particular scraping the name and position from the first page and the detail from the detail page. Try this:
# -*- coding: utf-8 -*-
import scrapy


class MyItem(scrapy.Item):
    name = scrapy.Field()
    position = scrapy.Field()
    detail = scrapy.Field()


class MySpider(scrapy.Spider):
    name = '<name of spider>'
    allowed_domains = ['mywebsite.org']
    start_urls = ['http://mywebsite.org/<path to the page>']

    def parse(self, response):
        rows = response.xpath('//a[contains(@href,"<div id or class>")]')
        # loop over all links to player pages
        for row in rows:
            myItem = MyItem()  # create a new item
            myItem['name'] = row.xpath('./text()').extract()  # assign name from the link text
            myItem['position'] = row.xpath('./text()').extract()  # assign position from the link text
            detail_url = response.urljoin(row.xpath('./@href').extract()[0])  # extract url from the link
            request = scrapy.Request(url=detail_url, callback=self.parse_detail)  # request the detail page
            request.meta['myItem'] = myItem  # pass the item along with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem']  # retrieve the item (with name and position) from the response
        text_raw = response.xpath('//font[@size=3]//text()').extract()  # extract the detail text
        myItem['detail'] = ' '.join(map(unicode.strip, text_raw))  # clean up the text (Python 2; use str.strip on Python 3)
        yield myItem  # return the completed item
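Since each yielded item now carries the name, position and detail together, a single item pipeline can do both database inserts in the correct order, which removes the need to call a second spider from the pipeline. Below is a minimal sketch assuming the sqlite3 schema above; the pipeline class name, database file and column handling are illustrative, not part of the original answer:

import sqlite3


class PlayerDbPipeline(object):
    """Hypothetical pipeline: inserts the player first, then the top_players row that references it."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect('players.db')  # assumed database file

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        cur = self.conn.cursor()
        name = item['name'][0] if item['name'] else None          # extract() returned a list; take the first value
        position = item['position'][0] if item['position'] else None
        # insert the player first so the foreign key target exists...
        cur.execute("INSERT INTO players (name) VALUES (?)", (name,))
        player_id = cur.lastrowid
        # ...then the row that references it
        cur.execute("INSERT INTO top_players (player_id, position) VALUES (?, ?)",
                    (player_id, position))
        return item

The pipeline would be enabled through the ITEM_PIPELINES setting in settings.py. On Scrapy 1.7+ the item could also be passed to parse_detail with cb_kwargs instead of request.meta.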