Understanding infinite loading when using Scrapy - what's wrong?

Context

I'm trying to scrape all the data from this website so that I can use it later in a model-training (ML) project.

I chose to do it with Scrapy + Python 3.7. So far, so good. I've set up my Scrapy project structure and started working on the scraper. To do that, I wrote down a few steps to follow in order to fetch the data I need accordingly.

Steps

  1. First, by visiting the site's sitemap we can get all the categories we need. (There's also a direct Products page, but unfortunately the categories can't be obtained that way, so it's not a solution.)

  2. Next, we need to visit each sub-category, which takes us to the products page (where the infinite loading lives). I'll take the first sub-category as an example.

  3. As we scroll down through the products, we can see the infinite loading kick in and a request being made to pull more products into the front end (see the quick check right after this list).

  4. Finally, click into each product and grab some data from it (this part isn't related to what I'm asking, so you can skip the Product class in the code I paste below).

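To see what the infinite loading actually requests, the endpoint can be hit directly outside Scrapy, roughly like this (the URL and payload fields are the same ones my spider uses below; the inCatID value here is just an example taken from a sub-category URL):

import requests

# Endpoint the infinite scroll POSTs to, as used in the spider below.
URL = 'https://bannersolutions.com/catalog.aspx/GetProducts'

payload = {
    'pageIndex': '1',      # which batch of products to load
    'inViewType': 'grid',
    'inPageSize': '8',     # the site serves products 8 at a time
    'inCatID': '100',      # example sub-category id (last segment of a category URL)
    'inFilters': '',
    'inSortType': ''
}

# requests sends the payload as JSON and sets the application/json content type.
resp = requests.post(URL, json=payload)

# The response is JSON whose 'd' key holds an HTML fragment with the next batch of products.
print(resp.json().get('d', '')[:300])
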
Code

Here is the code I wrote to reproduce the steps above:

import json
import re

import scrapy


PRODUCTS_XPATH = "//div[@class='col-md-3']//a/@href"


class Product:
    def __init__(self, response):
        self.response = response

    def get_brand_name(self):
        brand_name = self.response.xpath(
            "normalize-space(//*[@class='product-brand-name-details']/text())"
        ).extract()
        if not brand_name[0]:
            brand_name = self.response.xpath(
                "normalize-space(//h3[@class='font-weight-bold']/text())"
            ).extract()
        return brand_name[0] if brand_name else 'Could not get product brand name.'

    def get_brand_name_details(self):
        brand_name_details = self.response.xpath(
            "normalize-space(//*[@class='product-name-details']/text())"
        ).extract()
        if not brand_name_details[0]:
            brand_name_details = self.response.xpath(
                "normalize-space(//h1[@class='title font-weight-bold']/text())"
            ).extract()
        return brand_name_details[0] if brand_name_details else 'Could not get product brand name details.'

    def get_real_category(self):
        return self.response.meta.get('product_category')

    def get_sku_details(self):
        sku_details = self.response.xpath(
            "normalize-space(//*[@class='product-sku-details']/text())"
        ).extract()
        if not sku_details[0]:
            sku_details = self.response.xpath(
                "normalize-space(//h5[@class='font-weight-bold']/text())"
            ).extract()
        return sku_details[0] if sku_details else 'Could not get product sku details.'

    def get_short_desc_details(self):
        short_desc_details = self.response.xpath(
            "normalize-space(//p[@class='pt-2']/text())"
        ).extract()
        return short_desc_details[0] if short_desc_details else 'Could not get product short desc details.'

    def get_detail_list_price(self):
        detail_list_price = self.response.xpath(
            "normalize-space(//*[@class='product-detail-list-price']//text())"
        ).extract()
        return detail_list_price[0] if detail_list_price else 'Could not get product detail list price.'

    def get_price(self):
        price = self.response.xpath(
            "normalize-space(//*[@class='price']//text())"
        ).extract()
        return price[0] if price else 'Could not get product price.'

    def get_detail_price_save(self):
        detail_price_save = self.response.xpath(
            "normalize-space(//*[@class='product-detail-price-save']//text())"
        ).extract()
        return detail_price_save[0] if detail_price_save else 'Could not get product detail price save.'

    def get_detail_note(self):
        detail_note = self.response.xpath(
            "normalize-space(//*[@class='product-detail-note']//text())"
        ).extract()
        return detail_note[0] if detail_note else 'Could not get product detail note.'

    def get_detail_long_desc(self):
        detail_long_descriptions = self.response.xpath(
            "//*[@id='desc']/node()"
        ).extract()

        detail_long_desc = ''.join([x.strip() for x in detail_long_descriptions if x.strip()])
        return detail_long_desc if detail_long_desc else 'Could not get product detail long desc.'

    def get_image(self):
        image = self.response.xpath(
            "normalize-space(//*[@id='mainContent_imgDetail']/@src)"
        ).extract()
        return f'https://bannersolutions.com{image[0]}' if image else 'Could not get product image.'

    def get_pieces_in_stock(self):
        pieces_in_stock = self.response.xpath(
            "normalize-space(//*[@class='badge-success']//text())"
        ).extract()
        return pieces_in_stock[0] if pieces_in_stock else 'Unknown pieces in stock.'

    def get_meta_description(self):
        meta_description = self.response.xpath(
            "normalize-space(//*[@name='description']/@content)"
        ).extract()
        return meta_description[0] if meta_description else 'Could not get product meta description.'

    def to_json(self):
        return {
            'product_brand_name_details': self.get_brand_name_details(),
            'product_brand_name': self.get_brand_name(),
            'product_category': self.get_real_category(),
            'product_sku_details': self.get_sku_details(),
            'product_short_desc_details': self.get_short_desc_details(),
            'product_detail_list_price': self.get_detail_list_price(),
            'product_price': self.get_price(),
            'product_detail_price_save': self.get_detail_price_save(),
            'product_detail_note': self.get_detail_note(),
            'product_detail_long_desc': self.get_detail_long_desc(),
            'product_image': self.get_image(),
            'product_in_stock': self.get_pieces_in_stock(),
            'product_meta_description': self.get_meta_description()
        }


class BannerSolutionsSpider(scrapy.Spider):
    name = 'bannersolutions'
    start_urls = ['https://bannersolutions.com/Sitemap']

    allowed_domains = ['bannersolutions.com']

    def start_crawl(self, response):
        for url in self.start_urls:
            yield scrapy.Request(url)

    def parse(self, response):
        for category in response.xpath('(//div[@class="col-md-3"])[1]/ul/li'):
            main_category_name = category.xpath('./a/text()').get()
            sub_category_name = category.xpath('./ul/li/a/text()').get()
            category_url = category.xpath('./ul/li/a/@href').get()

            if category_url:
                yield scrapy.Request(f'https://bannersolutions.com{category_url}', callback=self.parse_categories,
                                     meta={'product_category': f'{main_category_name}/{sub_category_name}'})

    def parse_categories(self, response):
        title = response.xpath('//h1[@class="title"]/text()').get()
        products_in_category = re.match(r'.*\((\d+)\)', title).group(1)
        no_of_requests = int(products_in_category) // 8 + 1
        in_cat_id = response.url.split('/')[-1]

        for i in range(1, no_of_requests):
            payload = {
                'pageIndex': str(i),
                'inViewType': 'grid',
                'inPageSize': '8',
                'inCatID': in_cat_id,
                'inFilters': '',
                'inSortType': ''
            }

            yield scrapy.Request(
                'https://bannersolutions.com/catalog.aspx/GetProducts',
                method='POST',
                headers={"content-type": "application/json"},
                body=json.dumps(payload),
                callback=self.parse_plm,
                meta={'product_category': response.meta.get('product_category')}
            )

    def parse_plm(self, response):
        products_str_html = json.loads(response.body).get('d')
        product_url = scrapy.selector.Selector(text=products_str_html).xpath(
            '//div[@class="product-image-container"]//a/@href'
        ).get()

        yield scrapy.Request(
            f'https://bannersolutions.com{product_url}',
            callback=self.parse_product,
            meta={'product_category': response.meta.get('product_category')}
        )

    def parse_product(self, response):
        product = Product(response).to_json()
        yield product

The problem

The problem with my code is that not all of the products get parsed, only ~3k out of ~70k. I think the problem lies somewhere between lines 148 and 165 of my code. I've stepped through it with a debugger, but I still can't figure out what's wrong.

Can someone explain what's wrong with the logic of my code?

Not sure if this is the only problem, since I didn't have time to test it further, but it looks like you only parse the first product of each batch of 8 that you load here:

# ...
product_url = scrapy.selector.Selector(text=products_str_html).xpath(
    '//div[@class="product-image-container"]//a/@href'
).get()
# ...

The .get() method doesn't return all the URLs. You can use the getall() method instead, which returns a list with all of them:

# ...
product_url = scrapy.selector.Selector(text=products_str_html).xpath(
    '//div[@class="product-image-container"]//a/@href'
).getall()
# ...

Then loop over the returned list and yield what you were yielding before:

# ...
products_urls = scrapy.selector.Selector(text=products_str_html).xpath(
    '//div[@class="product-image-container"]//a/@href'
).getall()

for product_url in products_urls:
    yield scrapy.Request(
        f'https://bannersolutions.com{product_url}',
        callback=self.parse_product,
        meta={'product_category': response.meta.get('product_category')}
    )

You're making the same mistake in the parse method of your BannerSolutionsSpider class as in the parse_plm method (as pointed out by @Cajuu). Instead of using getall to fetch all the hyperlinks, you're using get, which only returns the first sub-category URL for each main category.

You could try the solution below, which yields all the sub-category URLs for parsing:

for category in response.xpath('(//div[@class="col-md-3"])[1]/ul/li'):
    main_category_name = category.xpath('./a/text()').get()
    for sub_category in category.xpath('./ul/li'):
        sub_category_name = sub_category.xpath('./a/text()').get()
        sub_category_url = sub_category.xpath('./a/@href').get()
        yield scrapy.Request(
            f'https://bannersolutions.com{sub_category_url}',
            callback=self.parse_categories,
            meta={'product_category': f'{main_category_name}/{sub_category_name}'}
        )
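
Combined with the getall() change from the other answer, the whole parse_plm method would then look roughly like this (untested sketch; everything else in the spider stays as it is):

def parse_plm(self, response):
    # 'd' holds the HTML fragment returned by the GetProducts endpoint.
    products_str_html = json.loads(response.body).get('d')

    # Collect every product link in the batch, not just the first one.
    product_urls = scrapy.selector.Selector(text=products_str_html).xpath(
        '//div[@class="product-image-container"]//a/@href'
    ).getall()

    for product_url in product_urls:
        yield scrapy.Request(
            f'https://bannersolutions.com{product_url}',
            callback=self.parse_product,
            meta={'product_category': response.meta.get('product_category')}
        )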