Scraping dynamic amazon page with scrolling
I am trying to scrape the products on Amazon's Best Sellers 100 list for a particular category, for example:
https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_nav_0
The 100 products are split across two pages, with 50 products per page.
Earlier the page was static and all 50 products appeared on it at once. Now, however, the page is dynamic and I need to scroll down to see all 50 products.
I was using scrapy to scrape the page. I would really appreciate it if you could help me with this. Thanks!
Adding my code below -
import scrapy
from scrapy_splash import SplashRequest

class BsrNewSpider(scrapy.Spider):
    name = 'bsr_new'
    allowed_domains = ['www.amazon.in']
    #start_urls = ['https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_0']

    script = '''
        function main(splash, args)
            splash.private_mode_enabled = false
            url = args.url
            assert(splash:go(url))
            assert(splash:wait(0.5))
            return splash:html()
        end
    '''

    def start_requests(self):
        url = 'https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_0'
        yield SplashRequest(url, callback=self.parse, endpoint="execute", args={
            'lua_source': self.script
        })

    def parse(self, response):
        for rev in response.xpath("//div[@id='gridItemRoot']"):
            yield {
                'Segment': "Home",  # Enter name of the segment here
                #'Sub-segment': segment,
                'ASIN': rev.xpath(".//div/div[@class='zg-grid-general-faceout']/div/a[@class='a-link-normal']/@href").re(r'\S*/dp/(\S+)_\S+')[0][:10],
                'Rank': rev.xpath(".//span[@class='zg-bdg-text']/text()").get(),
                'Name': rev.xpath("normalize-space(.//a[@class='a-link-normal']/span/div/text())").get(),
                'No. of Ratings': rev.xpath(".//span[contains(@class,'a-size-small')]/text()").get(),
                'Rating': rev.xpath(".//span[@class='a-icon-alt']/text()").get(),
                'Price': rev.xpath(".//span[@class='a-size-base a-color-price']//text()").get()
            }

        next_page = response.xpath("//a[text()='Next page']/@href").get()
        if next_page:
            url = response.urljoin(next_page)
            yield SplashRequest(url, callback=self.parse, endpoint="execute", args={
                'lua_source': self.script
            })
Regards,
Srijan
Here is an alternative approach that does not need Splash.
The ASINs of all 50 products are hidden in the first page itself. You can extract those ASINs and build all 50 product URLs from them.
import scrapy
import json

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': ''  # Important
    }
    start_urls = ['https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_pg_1?_encoding=UTF8&pg=1']

    def parse(self, response):
        raw_data = response.css('[data-client-recs-list]::attr(data-client-recs-list)').get()
        data = json.loads(raw_data)
        for item in data:
            url = 'https://www.amazon.com/dp/{}'.format(item['id'])
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        ...
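To make the hidden-data step concrete, here is a minimal standalone sketch of what parsing the `data-client-recs-list` attribute looks like. The sample JSON below is hand-made for illustration; on the real page the attribute holds roughly 50 such entries, each an object with an `id` key carrying the ASIN, which is exactly what the spider above relies on:

```python
import json

# Hand-made sample of the JSON stored in the data-client-recs-list
# attribute; the live page carries ~50 entries like these.
raw_data = '[{"id": "B07W55DDFB"}, {"id": "B08GC1G4Y5"}]'

def build_product_urls(raw):
    """Parse the attribute value and build one /dp/ URL per ASIN."""
    data = json.loads(raw)
    return ['https://www.amazon.com/dp/{}'.format(item['id']) for item in data]

urls = build_product_urls(raw_data)
```

Since the list spans two pages, the same extraction can be repeated on the `pg=2` URL to cover all 100 products.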