How to use the yield function to scrape data from multiple pages
I am trying to scrape data from Amazon India. I am not able to collect the response and parse the elements using the yield() method when:
1) I have to move from the product page to the review page
2) I have to move from one review page to the next review page
Code flow:
1) customerReviewData() calls getCustomerRatingsAndComments(response)
2) getCustomerRatingsAndComments(response) finds the URL of the review page and calls the yield request method with the callback method getCrrFromReviewPage(request) for this review page URL
3) getCrrFromReviewPage() gets a new response for the first review page, scrapes all the elements from the first review page (the page loaded) and adds them to customerReviewDataList[]
4) If a next page exists, it fetches that URL and calls the getCrrFromReviewPage() method recursively, crawling the elements from the next page, until all review pages are crawled
5) All the reviews get added to customerReviewDataList[]
I have tried changing the parameters passed to yield(), and I have also looked through the Scrapy documentation for yield() and Request/Response.
# -*- coding: utf-8 -*-
import scrapy
import logging

customerReviewDataList = []
customerReviewData = {}

# Get product name in <H1>
def getProductTitleH1(response):
    titleH1 = response.xpath('normalize-space(//*[@id="productTitle"]/text())').extract()
    return titleH1

def getCustomerRatingsAndComments(response):
    # Fetches the relative url
    reviewRelativePageUrl = response.css('#reviews-medley-footer a::attr(href)').extract()[0]
    if reviewRelativePageUrl:
        # get absolute URL
        reviewPageAbsoluteUrl = response.urljoin(reviewRelativePageUrl)
        yield Request(url=reviewPageAbsoluteUrl, callback=getCrrFromReviewPage())
        self.log("yield request complete")
    return len(customerReviewDataList)

def getCrrFromReviewPage():
    userReviewsAndRatings = response.xpath('//div[@id="cm_cr-review_list"]/div[@data-hook="review"]')
    for userReviewAndRating in userReviewsAndRatings:
        customerReviewData[reviewTitle] = response.css('#cm_cr-review_list .review-title span ::text').extract()
        customerReviewData[reviewDescription] = response.css('#cm_cr-review_list .review-text span::text').extract()
        customerReviewDataList.append(customerReviewData)
    reviewNextPageRelativeUrl = response.css('#cm_cr-pagination_bar .a-pagination .a-last a::attr(href)')[0].extract()
    if reviewNextPageRelativeUrl:
        reviewNextPageAbsoluteUrl = response.urljoin(reviewNextPageRelativeUrl)
        yield Request(url=reviewNextPageAbsoluteUrl, callback=getCrrFromReviewPage())

class UsAmazonSpider(scrapy.Spider):
    name = 'Test_Crawler'
    allowed_domains = ['amazon.in']
    start_urls = ['https://www.amazon.in/Philips-Trimmer-Cordless-Corded-QT4011/dp/B00JJIDBIC/ref=sr_1_3?keywords=philips&qid=1554266853&s=gateway&sr=8-3']

    def parse(self, response):
        titleH1 = getProductTitleH1(response),
        customerReviewData = getCustomerRatingsAndComments(response)
        yield {
            'Title_H1': titleH1,
            'customer_Review_Data': customerReviewData
        }
I am getting the following response:
{'Title_H1': (['Philips Beard Trimmer Cordless and Corded for Men QT4011/15'],), 'customer_Review_Data': <generator object getCustomerRatingsAndComments at 0x048AC630>}
"customer_Review_Data" should be a list of dicts of title and review.
I am not able to figure out what mistake I am making here.
When I use log() or print() to see what data is captured in customerReviewDataList[], I cannot see the data in the console either.
I am able to scrape all the reviews into customerReviewDataList[] when they are present on the product page, but in this scenario, where I have to use the yield function, I am getting output like this [https://ibb.co/kq8w6cf]
This is the type of output I am looking for:
{'customerReviewTitle': ['Difficult to find a charger adapter'],'customerReviewComment': ['I already have a phillips trimmer which was only cordless. ], 'customerReviewTitle': ['Good Product'],'customerReviewComment': ['Solves my need perfectly HK']}]}
Any help is appreciated. Thanks in advance.
You should complete the Scrapy tutorial. The Following links section should be especially helpful to you.
This is a simplified version of your code:
def data_request_iterator():
    yield Request('https://example.org')

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {
            'title': response.css('title::text').get(),
            'data': data_request_iterator(),
        }
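Yielding that dict never executes the generator's body: calling a generator function only creates a generator object, and Scrapy simply serializes that object into the item, which is exactly the <generator object getCustomerRatingsAndComments at 0x...> you see in your output. A minimal plain-Python illustration of the same mistake (hypothetical names):

```python
def make_requests():
    # The body of a generator function does not run until it is iterated.
    yield 'pretend this is a Request'

item = {'title': 'Philips Trimmer', 'data': make_requests()}

# The dict holds a generator object, not the yielded values.
print(type(item['data']).__name__)  # -> generator
```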
Instead, it should look like this:
class MySpider(Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        item = {
            'title': response.css('title::text').get(),
        }
        yield Request('https://example.org', meta={'item': item}, callback=self.parse_data)

    def parse_data(self, response):
        item = response.meta['item']
        # TODO: Extend item with data from this second response as needed.
        yield item