Scrapy requests - Callback function not being called in nested requests
I am trying to scrape some products from Amazon to get some information about my competitors. This is the flow I am following:
Make a query in the search bar ->
Visit every product page of the results gotten from the query ->
Gather information from that product ->
Check if the product matches the quantity that we looked for (i.e. we might want to collect only products sold in a pack of n items, like a kit of n toner cartridges)
-> If it does, yield the item.
-> If not, find a variation in that ad that represents a pack of such n items
-> If such a variation exists, go visit that variation of the product, modify some information of the item (such as price and asin) and yield that item.
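I have a particular case here. I won't post all the functions I have, only some representative ones (to keep this shorter and more general, so that it may be useful to others later).
This is the structure of my code: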
def start_requests(self):
    for i, prod in enumerate(products):
        url = 'https://www.amazon.it/s?' + urlencode({'k': prod['query']})
        competitors = scrapy.Request(url=url, callback=self.parse_keyword_response, meta={'prod': prod})
        yield competitors

def parse_keyword_response(self, response):
    # Function that loops on the results of the query made,
    # and collects all the products that actually match our search
    products = response.xpath('//*[@data-asin]')
    prod = response.meta['prod']
    competitors = []
    for product in products:
        asin = product.xpath('@data-asin').extract_first()
        product_url = f"https://www.amazon.it/dp/{asin}"
        competitor = scrapy.Request(url=product_url, callback=self.parse_competitor_product_page, meta={'asin': asin, 'prod': prod})
        yield competitor
        competitors.append(competitor)

def parse_competitor_product_page(self, response):
    # Function that scrapes information from a product page and yields the competitor
    # only if it actually matches our search.

    # Do some work and scrape required product attributes

    competitor = ProductItem()
    competitor['product'] = prod_name
    competitor['asin'] = asin
    competitor['Title'] = title
    competitor['producer'] = producer
    competitor['MainImage'] = image
    competitor['Rating'] = rating
    competitor['NumberOfReviews'] = number_of_reviews
    competitor['price'] = price
    competitor['AvailableSizes'] = sizes
    competitor['AvailableColors'] = colors
    competitor['Varieties'] = varieties
    competitor['BulletPoints'] = bullet_points
    competitor['SellerRank'] = seller_rank

    if self.is_right_product(prod, competitor, response):
        yield competitor

def is_right_product(self, product, competitor, response):
    # Function that checks whether a resulting competitor actually matches the product that
    # we looked for. It returns a boolean if it does. It also alters some attributes of that
    # competitor if a right variation is found on its page.
    #
    # I will omit some if/else branches as those work well, and I will only post the faulty
    # branch (which happens to be the one that should modify the competitor object because
    # a right variation is found on its page).
    if product_is_right_quantity(competitor):
        return True
    else:
        variation = find_variation_of_right_quantity(product['quantity'], competitor)
        if variation is not None:
            competitor = self.update_product_to_right_variation(competitor, variation, response)
            print("variation check done")
            return True
        else:
            return False

def update_product_to_right_variation(self, product, variation_name, response):
    print("IN UPDATE PRODUCT TO RIGHT VARIATION")
    variation_asin = response.xpath(f'//div[@id="variation_color_name"]/ul/li[contains(@title, \'{variation_name}\')]/@data-defaultasin').get()
    product_url = f"https://www.amazon.it/dp/{variation_asin}"
    print(product_url)
    yield scrapy.Request(url=product_url, callback=self.update_competitor_from_product_page, errback=self.errback_http, meta={'prod': product, 'asin': variation_asin})

def update_competitor_from_product_page(self, response):
    print("INSIDE UPDATE COMPETITOR FROM PRODUCT PAGE")
    prod = response.meta['prod']
    asin = response.meta['asin']
    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
    prod['price'] = price
    prod['Title'] = title
    prod['asin'] = asin
    response.meta['prod'] = prod
    print(prod['price'])
    return prod

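As you can see, I put in some print statements for debugging purposes.
The print statements in update_competitor_from_product_page never produce any output.
All the others do. So the function that should serve as the callback for the request issued in update_product_to_right_variation is never called, and the competitor object stays unchanged.
I am new to asynchronous programming and also new to Scrapy.
First, I would like to know why my callback function is never called. Second, how can I achieve what I have in mind?
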
I can't test it, but the problem is probably that you yield the Request inside update_product_to_right_variation(), which is called from is_right_product(), which in turn is called from parse_competitor_product_page(). A yield/return in update_product_to_right_variation() cannot send the Request directly to the Scrapy engine. It only hands it back to the calling function, is_right_product(). That function should yield/return it to its own caller, parse_competitor_product_page(), and parse_competitor_product_page() should yield it. Only then does it reach the Scrapy engine, which will execute it.

In your code the Request goes from update_product_to_right_variation() to is_right_product(), but is_right_product() only does return True / return False, so the Request never gets to parse_competitor_product_page() and cannot be sent to the Scrapy engine.
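
To see the effect in isolation, here is a minimal sketch in plain Python, with no Scrapy involved. The names make_request, helper, is_right_product_broken, parse_broken and parse_fixed are made up for illustration, and a dict stands in for scrapy.Request. It also shows a related detail: a function that contains yield is a generator function, so calling it without iterating over the result does not even run its body.

```python
calls = []  # records whether the helper's body actually executed


def make_request(url):
    # Stand-in for scrapy.Request (assumption: a dict is enough for this demo).
    return {"url": url}


def helper(url):
    # Contains `yield`, so calling helper(url) only creates a generator object.
    calls.append("helper body ran")
    yield make_request(url)


def is_right_product_broken(url):
    helper(url)  # generator created and discarded: the request goes nowhere
    return True


def parse_broken(url):
    # Mirrors the question: only the unchanged item is passed upward.
    if is_right_product_broken(url):
        yield {"item": url}


def parse_fixed(url):
    # Pass everything the helper yields up to the caller (the engine, in Scrapy).
    yield from helper(url)


broken = list(parse_broken("https://example.com/dp/A"))
calls_after_broken = list(calls)
fixed = list(parse_fixed("https://example.com/dp/B"))

print(broken)              # [{'item': 'https://example.com/dp/A'}]
print(calls_after_broken)  # []
print(fixed)               # [{'url': 'https://example.com/dp/B'}]
```

parse_broken still produces the unchanged item, while the request built in helper only reaches the caller when it is explicitly passed up, here with yield from.
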
I think you need something like this:
def parse_competitor_product_page(self, response):
    # Function that scrapes information from a product page and yields the competitor
    # only if it actually matches our search.

    # Do some work and scrape required product attributes

    competitor = ProductItem()
    competitor['product'] = prod_name
    competitor['asin'] = asin
    competitor['Title'] = title
    competitor['producer'] = producer
    competitor['MainImage'] = image
    competitor['Rating'] = rating
    competitor['NumberOfReviews'] = number_of_reviews
    competitor['price'] = price
    competitor['AvailableSizes'] = sizes
    competitor['AvailableColors'] = colors
    competitor['Varieties'] = varieties
    competitor['BulletPoints'] = bullet_points
    competitor['SellerRank'] = seller_rank

    variation = self.is_right_product(prod, competitor)

    if variation is True:
        # send to Scrapy's engine: ProductItem without changes
        yield competitor
    elif variation is not None:
        # send to Scrapy's engine: Request to the page with the variation
        yield self.update_product_to_right_variation(competitor, variation, response)
    # if `variation` is None there is no suitable variation, so nothing is yielded

def is_right_product(self, product, competitor):
    # Function that checks whether a resulting competitor actually matches the product that
    # we looked for. It returns True if it does. Otherwise it returns the variation of the
    # right quantity found on the page, or None if there is no such variation.
    #
    # (Other if/else branches from the question are omitted here.)
    if product_is_right_quantity(competitor):
        return True  # assigns `True` to `variation = ...` in `parse_competitor_product_page()`

    # assigns the variation or `None` to `variation = ...` in `parse_competitor_product_page()`
    return find_variation_of_right_quantity(product['quantity'], competitor)

def update_product_to_right_variation(self, competitor, variation_name, response):
    print("IN UPDATE PRODUCT TO RIGHT VARIATION")
    variation_asin = response.xpath(f'//div[@id="variation_color_name"]/ul/li[contains(@title, \'{variation_name}\')]/@data-defaultasin').get()
    product_url = f"https://www.amazon.it/dp/{variation_asin}"
    print(product_url)

    # return (not yield) the Request to `parse_competitor_product_page()`, which yields it
    return scrapy.Request(url=product_url,
                          callback=self.update_competitor_from_product_page,
                          errback=self.errback_http,
                          meta={'prod': competitor, 'asin': variation_asin})

def update_competitor_from_product_page(self, response):
    print("INSIDE UPDATE COMPETITOR FROM PRODUCT PAGE")

    prod = response.meta['prod']
    asin = response.meta['asin']

    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
    # title = ...

    prod['price'] = price
    # prod['Title'] = title  # enable once `title` is scraped above
    prod['asin'] = asin

    # response.meta['prod'] = prod  # useless

    print(prod['price'])

    # send to Scrapy's engine: item with changes
    yield prod
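
To make the whole round trip easier to follow, here is a toy model of what happens after a callback yields. This is not Scrapy's engine or API: Request, Response, run, the example.com URLs and FAKE_PRICES are all hypothetical stand-ins written for this sketch. It only illustrates the idea that yielded Requests get scheduled and their callbacks are invoked later, that anything else is collected as an item, and that the partially filled item travels to the second callback through meta.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Request:
    # Toy stand-in for scrapy.Request (only the fields this sketch needs).
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class Response:
    # Toy stand-in for a response; it just carries the request's url and meta.
    url: str
    meta: dict


# Hypothetical data: the price "scraped" from the variation page.
FAKE_PRICES = {"https://example.com/dp/PACK3": "29,90"}


def parse_product(response):
    # First callback: build the item, then ask for the variation page.
    item = {"asin": "SINGLE", "price": "9,90"}
    variation_asin = "PACK3"  # pretend the page lists a 3-pack variation
    yield Request(
        url=f"https://example.com/dp/{variation_asin}",
        callback=parse_variation,
        meta={"item": item, "asin": variation_asin},
    )


def parse_variation(response):
    # Second callback: update the item carried in meta and emit it.
    item = response.meta["item"]
    item["asin"] = response.meta["asin"]
    item["price"] = FAKE_PRICES[response.url]
    yield item


def run(start_request):
    # Toy engine loop: schedule yielded Requests, collect everything else.
    queue, items = deque([start_request]), []
    while queue:
        request = queue.popleft()
        response = Response(url=request.url, meta=request.meta)
        for result in request.callback(response):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return items


items = run(Request(url="https://example.com/dp/SINGLE", callback=parse_product))
print(items)  # [{'asin': 'PACK3', 'price': '29,90'}]
```

The second callback only runs because parse_product yields the Request to the loop that called it. If the Request were created in a nested helper and never yielded from the callback itself, run would never see it.
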