Scrapy requests - Callback function not being called in nested requests

I am trying to scrape some products from Amazon to get some information about my competitors. This is the process I am following:

Make a query in the search bar ->
Visit every product page of the results returned by the query ->
Gather information from that product ->
Check if the product matches the quantity that we looked for (i.e. we might want to collect only products sold in a pack of n items ... like a kit of n toner cartridges)
    -> If it does, yield the item.
    -> If not, find a variation in that ad that represents a pack of such n items
         -> If such a variation exists, go visit that variation of the product, modify some information of the item (such as price and asin) and yield that item.

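I have a special case here. I won't post all the functions I have; I would rather post a few representative ones (to keep it shorter and more general, so that it may be useful to others later).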

This is the structure of my code:

def start_requests(self):
        for i, prod in enumerate(products):
            url = 'https://www.amazon.it/s?' + urlencode({'k': prod['query']})
            competitors = scrapy.Request(url=url, callback=self.parse_keyword_response, meta={'prod':prod})
            yield competitors


def parse_keyword_response(self, response):
        # Function that loops on the results of the query made, 
        # and collects all the products that actually match our search
        products = response.xpath('//*[@data-asin]')
        prod = response.meta['prod']

        competitors =[]

        for product in products:
            asin = product.xpath('@data-asin').extract_first()
            product_url = f"https://www.amazon.it/dp/{asin}"
            competitor = scrapy.Request(url=product_url, callback=self.parse_competitor_product_page, meta={'asin': asin, 'prod':prod})
            yield competitor
            competitors.append(competitor)


def parse_competitor_product_page(self, response):
        # Function that scrapes information from a product page and yields the competitor
        # only if it actually matches our search.

        ' Do some work and scrape required product attributes'

        competitor = ProductItem()
        competitor['product'] = prod_name
        competitor['asin'] = asin
        competitor['Title'] = title
        competitor['producer'] = producer
        competitor['MainImage'] = image
        competitor['Rating'] = rating
        competitor['NumberOfReviews'] = number_of_reviews
        competitor['price'] = price
        competitor['AvailableSizes'] = sizes
        competitor['AvailableColors'] = colors
        competitor['Varieties'] = varieties
        competitor['BulletPoints'] = bullet_points
        competitor['SellerRank'] = seller_rank

        if self.is_right_product(prod, competitor, response):
            yield competitor

def is_right_product(self, product, competitor, response):
       # Function that checks whether a resulting competitor actually matches the product that 
       # we looked for. It returns a boolean if it does. It also alters some attributes of that
       # competitor if a right variation is found on its page.

      ''' I will omit some if else branches as those work well and I will only post the faulty
           branch (which happens to be the one that should modify the competitor object because
           a right variation is found on its page). '''

      if product_is_right_quantity(competitor):
           return True
      else:
           variation = find_variation_of_right_quantity(product['quantity'], competitor)
           if variation is not None:
                competitor = self.update_product_to_right_variation(competitor, variation, response)
                print("variation check done")
                return True
           else:
                return False

def update_product_to_right_variation(self, product, variation_name, response):
        print("IN UPDATE PRODUCT TO RIGHT VARIATION")
        variation_asin = response.xpath(f'//div[@id="variation_color_name"]/ul/li[contains(@title, \'{variation_name}\')]/@data-defaultasin').get()
        product_url = f"https://www.amazon.it/dp/{variation_asin}"
        print(product_url)
        yield scrapy.Request(url=product_url, callback=self.update_competitor_from_product_page, errback=self.errback_http, meta={'prod':product, 'asin':variation_asin})

def update_competitor_from_product_page(self, response):
        print("INSIIDE UPDATE COMPETITOR FROM PRODUCT PAGE")
        prod = response.meta['prod']
        asin = response.meta['asin']

        price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()

        prod['price'] = price
        prod['Title'] = title
        prod['asin'] = asin

        response.meta['prod'] = prod
        print(prod['price'])
        return prod

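As you can see, I placed some print statements for debugging purposes.

The print statements in update_competitor_from_product_page never produce any output, while all the others do. So the function that should serve as the callback of the request made in update_product_to_right_variation is never called, and the competitor object remains unmodified.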

I am new to asynchronous programming, and new to Scrapy as well.

First, I would like to know why my callback function is never called. Second, how can I achieve what I have in mind?

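I can't test it, but the problem may be that you yield the Request in update_product_to_right_variation(), which is called from is_right_product(), which in turn is called from parse_competitor_product_page(). A yield/return in update_product_to_right_variation() cannot send the Request directly to the Scrapy engine; it only hands it to the calling function, is_right_product(), which should yield/return it to its own caller, parse_competitor_product_page(). Only when parse_competitor_product_page() yields it does the Request reach the Scrapy engine, which then executes it. Note also that calling a function that contains yield only creates a generator object; nothing it yields goes anywhere unless that generator is iterated.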

In your code you yield the Request from update_product_to_right_variation() to is_right_product(), but is_right_product() only does return True / return False, so it never passes the Request on to parse_competitor_product_page(), and that function cannot send it to the Scrapy engine.
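Here is a plain-Python sketch of the same effect, with no Scrapy involved. All function names are made up and only mirror the roles of the three methods in the question; list() stands in for the Scrapy engine consuming whatever the callback yields.

```python
def make_request():
    # Stand-in for update_product_to_right_variation(): because it contains
    # `yield`, calling it only builds a generator object.
    yield "REQUEST"


def check_dropping():
    # Stand-in for is_right_product(): the generator is created but never
    # iterated, and only a boolean goes back to the caller.
    pending = make_request()
    return True


def callback_dropping():
    # Stand-in for parse_competitor_product_page() as written in the question.
    if check_dropping():
        yield "ITEM"


def callback_forwarding():
    # Same role, but the callback passes on everything the helper produces.
    yield from make_request()


# list() plays the part of the engine that iterates over a callback's output.
dropped = list(callback_dropping())
forwarded = list(callback_forwarding())
print(dropped)
print(forwarded)
```

The first list contains only the item, because the generator holding the request was created and then thrown away. The second contains the request, because the callback passed it on to whoever was iterating.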


I think you need something like this:

def parse_competitor_product_page(self, response):
    # Function that scrapes information from a product page and yields the competitor
    # only if it actually matches our search.

    ' Do some work and scrape required product attributes'

    competitor = ProductItem()
    competitor['product'] = prod_name
    competitor['asin'] = asin
    competitor['Title'] = title
    competitor['producer'] = producer
    competitor['MainImage'] = image
    competitor['Rating'] = rating
    competitor['NumberOfReviews'] = number_of_reviews
    competitor['price'] = price
    competitor['AvailableSizes'] = sizes
    competitor['AvailableColors'] = colors
    competitor['Varieties'] = varieties
    competitor['BulletPoints'] = bullet_points
    competitor['SellerRank'] = seller_rank

    variation = self.is_right_product(prod, competitor)
    if variation is True:
        # send to Scrapy's Engine: ProductItem without changes
        yield competitor
    elif variation is not None:
        # send to Scrapy's Engine: Request to page with variation
        yield self.update_product_to_right_variation(competitor, variation, response)
    # if variation is None, nothing matches the requested quantity, so nothing is yielded


def is_right_product(self, product, competitor):
    # Function that checks whether a resulting competitor actually matches the product that
    # we looked for. It returns True if it does; otherwise it returns the matching
    # variation found on its page, or None if there is none.

    '''I will omit some if else branches as those work well and I will only post the faulty
       branch (which happens to be the one that should modify the competitor object because
       a right variation is found on its page). '''

    if product_is_right_quantity(competitor):
        return True  # it will assign `True` to `variation = ...` in `parse_competitor_product_page()`

    # it will assign `variation` or `None` to `variation = ...` in `parse_competitor_product_page()`
    return find_variation_of_right_quantity(product['quantity'], competitor)


def update_product_to_right_variation(self, competitor, variation_name, response):
    print("IN UPDATE PRODUCT TO RIGHT VARIATION")

    variation_asin = response.xpath(f'//div[@id="variation_color_name"]/ul/li[contains(@title, \'{variation_name}\')]/@data-defaultasin').get()
    
    product_url = f"https://www.amazon.it/dp/{variation_asin}"
    
    print(product_url)
    
    # send back to `parse_competitor_product_page()`
    return scrapy.Request(url=product_url,
                          callback=self.update_competitor_from_product_page,
                          errback=self.errback_http,
                          meta={'prod':competitor, 'asin':variation_asin})


def update_competitor_from_product_page(self, response):
    print("INSIIDE UPDATE COMPETITOR FROM PRODUCT PAGE")
    prod = response.meta['prod']
    asin = response.meta['asin']

    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()
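    # title = ...  # scrape the variation's title here (selector not shown in the question)

    prod['price'] = price
    # prod['Title'] = title  # enable once `title` is extracted above
    prod['asin'] = asin

    # response.meta['prod'] = prod  # not needed: `prod` is yielded below
    print(prod['price'])

    # send to Scrapy's Engine: item with changes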
    yield prod
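To see the whole chain without running a crawl, here is a toy simulation of the flow above. It is only a sketch: FakeRequest, FakeResponse, ToySpider and run() are hypothetical stand-ins I made up, the URLs and prices are invented, and the loop in run() is a much simplified picture of a scheduler, not how Scrapy's engine is actually implemented.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    # Hypothetical stand-in for scrapy.Request: a URL, a callback and meta.
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    # Hypothetical stand-in for the response passed to a callback.
    url: str
    meta: dict


class ToySpider:
    def parse_competitor_product_page(self, response):
        competitor = {"asin": "A1", "price": "10"}  # made-up scraped values
        variation_asin = response.meta.get("variation_asin")
        if variation_asin is None:
            yield competitor
        else:
            # The helper RETURNS the request, and the callback yields it.
            yield self.update_product_to_right_variation(competitor, variation_asin)

    def update_product_to_right_variation(self, competitor, variation_asin):
        return FakeRequest(
            url=f"https://example.invalid/dp/{variation_asin}",
            callback=self.update_competitor_from_product_page,
            meta={"prod": competitor, "asin": variation_asin},
        )

    def update_competitor_from_product_page(self, response):
        prod = response.meta["prod"]
        prod["asin"] = response.meta["asin"]
        prod["price"] = "25"  # made-up price of the variation
        yield prod


def run(first_request):
    # Toy scheduler: follow every yielded request, collect everything else.
    queue = deque([first_request])
    items = []
    while queue:
        request = queue.popleft()
        response = FakeResponse(url=request.url, meta=request.meta)
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


spider = ToySpider()
start = FakeRequest(
    url="https://example.invalid/dp/A1",
    callback=spider.parse_competitor_product_page,
    meta={"variation_asin": "B2"},
)
items = run(start)
print(items)
```

The second callback runs only because the first callback yielded the request to the loop that iterates over it. If the helper used yield and its result were assigned to a variable instead, the queue would never see the request.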