Scraping dynamic amazon page with scrolling
I am trying to scrape the products on Amazon's Best Sellers 100 list for a particular category, for example:
https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_nav_0
The 100 products are split across two pages, with 50 products per page.
Earlier the page was static and all 50 products appeared on it at once. Now, however, the page is dynamic and I need to scroll down to see all 50 products.
I was using scrapy to scrape the page. I would really appreciate it if you could help me with this. Thanks!
Adding my code below -
import scrapy
from scrapy_splash import SplashRequest

class BsrNewSpider(scrapy.Spider):
    name = 'bsr_new'
    allowed_domains = ['www.amazon.in']
    #start_urls = ['https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_0']

    script = '''
        function main(splash, args)
            splash.private_mode_enabled = false
            url = args.url
            assert(splash:go(url))
            assert(splash:wait(0.5))
            return splash:html()
        end
    '''

    def start_requests(self):
        url = 'https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_0'
        yield SplashRequest(url, callback=self.parse, endpoint="execute", args={
            'lua_source': self.script
        })

    def parse(self, response):
        for rev in response.xpath("//div[@id='gridItemRoot']"):
            yield {
                'Segment': "Home",  # Enter name of the segment here
                #'Sub-segment': segment,
                'ASIN': rev.xpath(".//div/div[@class='zg-grid-general-faceout']/div/a[@class='a-link-normal']/@href").re(r'\S*/dp/(\S+)_\S+')[0][:10],
                'Rank': rev.xpath(".//span[@class='zg-bdg-text']/text()").get(),
                'Name': rev.xpath("normalize-space(.//a[@class='a-link-normal']/span/div/text())").get(),
                'No. of Ratings': rev.xpath(".//span[contains(@class,'a-size-small')]/text()").get(),
                'Rating': rev.xpath(".//span[@class='a-icon-alt']/text()").get(),
                'Price': rev.xpath(".//span[@class='a-size-base a-color-price']//text()").get()
            }

        next_page = response.xpath("//a[text()='Next page']/@href").get()
        if next_page:
            url = response.urljoin(next_page)
            yield SplashRequest(url, callback=self.parse, endpoint="execute", args={
                'lua_source': self.script
            })
Regards,
Srijan
Here is an alternative approach that does not need Splash.
The ASINs of all 50 products are hidden in the first page itself. You can extract those ASINs and build all 50 product URLs from them.
import scrapy
import json

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': ''  # Important
    }
    start_urls = ['https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_pg_1?_encoding=UTF8&pg=1']

    def parse(self, response):
        raw_data = response.css('[data-client-recs-list]::attr(data-client-recs-list)').get()
        data = json.loads(raw_data)
        for item in data:
            url = 'https://www.amazon.com/dp/{}'.format(item['id'])
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        ...
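To make the hidden-data step concrete, here is a minimal standalone sketch of what parsing the `data-client-recs-list` attribute looks like. The sample JSON below is hand-made for illustration; on the real page the attribute holds roughly 50 such entries, each an object with an `id` key carrying the ASIN, which is exactly what the spider above relies on:

```python
import json

# Hand-made sample of the JSON stored in the data-client-recs-list
# attribute; the live page carries ~50 entries like these.
raw_data = '[{"id": "B07W55DDFB"}, {"id": "B08GC1G4Y5"}]'

def build_product_urls(raw):
    """Parse the attribute value and build one /dp/ URL per ASIN."""
    data = json.loads(raw)
    return ['https://www.amazon.com/dp/{}'.format(item['id']) for item in data]

urls = build_product_urls(raw_data)
```

Since the list spans two pages, the same extraction can be repeated on the `pg=2` URL to cover all 100 products.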