How to send objects from one rule to another with Scrapy
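I'm trying to scrape company ratings from Glassdoor, and at some point I need to send some objects from one rule to another.
This is the main search link: https://www.glassdoor.com/Reviews/lisbon-reviews-SRCH_IL.0,6_IM1121.htm
I access this page in the first rule and get some information. From that page I then need to go to another link, the one matched by the XPath expression //a[@class='eiCell cell reviews '], to reach the reviews page.
Here is the problem: how can I follow this link from inside parse_item, using the XPath expression, without losing the information I already collected?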
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class GetComentsSpider(CrawlSpider):
        name = 'get_coments'
        allowed_domains = ['www.glassdoor.com']
        start_urls = ['https://www.glassdoor.com/Reviews/portugal-reviews-SRCH_IL.0,8_IN195.htm/']
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
        download_delay = 0.1

        rules = (
            # Access the page, get the link for each company and pass it to parse_item
            Rule(LinkExtractor(restrict_xpaths="//div[@class=' margBotXs']/a"), callback='parse_item', follow=True),
            Rule(LinkExtractor(restrict_xpaths="//a[@class='eiCell cell reviews ']"), callback='parse_item', follow=True),
            # Pagination
            Rule(LinkExtractor(restrict_xpaths="//li[@class='next']/a"), follow=True),
        )

        def parse_item(self, response):
            # Get the company name and rating
            name = response.xpath("(//span[@class='updateBy'])[1]").get()
            rating = response.xpath("//span[@class='bigRating strong margRtSm h1']/text()").get()
            # Here I need to go to the link of //a[@class='eiCell cell reviews '] to get more data
            # without losing the name and rating
            yield {
                "Name": name,
                "Rating": rating,
            }
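You can use Request(..., meta=...) to send the data to another parser. The item travels with the request and is available again in the callback as response.meta.
(You also don't need a Rule to get the URL for this request; extract it inside parse_item.)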
    from scrapy import Request

    # Both methods belong inside the spider class.

    def parse_item(self, response):
        name = response.xpath("(//span[@class='updateBy'])[1]").get()
        rating = response.xpath("//span[@class='bigRating strong margRtSm h1']/text()").get()
        item = {
            "Name": name,
            "Rating": rating,
        }
        url = ...  # the link of //a[@class='eiCell cell reviews '], where the extra data lives
        # Request needs a callable here; string callbacks are only resolved for Rule
        yield Request(url, callback=self.other_parser, meta={"item": item})

    def other_parser(self, response):
        item = response.meta['item']  # the dict built in parse_item
        item['other'] = ...  # add values to item
        yield item