我如何抓取这种动态生成的网站数据?
how do i scrape this kind of dynamic generated website data?
我正在尝试抓取电子商务网站,
示例 link:https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1
数据是通过 React 呈现的,当我在少数 link 上执行 抓取 时,大部分数据都作为 null
返回,当我查看页面源我实际上找不到 HTML 可通过检查元素获得,只是 Javascript 标签内的 json。我在相同的 link 上测试了几次 运行 scrapy scraper 和以前没有找到的数据,实际上是 returns 内容,所以它是随机的.我不知道我应该如何抓取这种网站。
我也在使用用户代理池并在请求之间中断。
script = '''
function main(splash, args)
assert(splash:go(args.url))
assert(splash:wait(1.5))
return splash:html()
end
'''
def start_requests(self):
url= [
'https://www.lazada.sg/products/esogoal-tactical-sling-bag-outdoor-chest-pack-shoulder-backpack-military-sport-bag-for-trekking-camping-hiking-rover-sling-daypack-for-men-women-i204814494-s353896924.html?mp=1',
'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1',
'https://www.lazada.sg/products/esogoal-selfie-stick-tripod-extendable-selfie-stick-monopod-with-integrated-tripod-and-bluetooth-remote-shutter-wireless-selfie-stick-tripod-for-cellphonecameras-i205279097-s309050125.html?mp=1',
'https://www.lazada.sg/products/esogoal-mini-umbrella-travel-umbrella-sun-rain-umbrella8-ribs-98cm-big-surface-lightweight-compact-parasol-uv-protection-for-men-women-i204815487-s308312226.html?mp=1',
'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1'
]
for link in url:
yield SplashRequest(url=link, callback=self.parse, endpoint='render.html', args={'wait' : 0.5, 'lua_source' : self.script}, dont_filter=True)
def parse(self, response):
yield {
'title' : response.xpath("//span[@class='pdp-mod-product-badge-title']/text()").extract_first(),
'price' : response.xpath("//span[contains(@class, 'pdp-price')]/text()").extract_first(),
'description' : response.xpath("//div[@id='module_product_detail']").extract_first()
}
我试试这个:
将 'execute' 作为 splash 方法的参数而不是 'render html'
from scrapy_splash import SplashRequest
class DynamicSpider(scrapy.Spider):
name = 'products'
url = [
'https://www.lazada.sg/products/esogoal-tactical-sling-bag-outdoor-chest-pack-shoulder-backpack-military-sport-bag-for-trekking-camping-hiking-rover-sling-daypack-for-men-women-i204814494-s353896924.html?mp=1',
'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1',
'https://www.lazada.sg/products/esogoal-selfie-stick-tripod-extendable-selfie-stick-monopod-with-integrated-tripod-and-bluetooth-remote-shutter-wireless-selfie-stick-tripod-for-cellphonecameras-i205279097-s309050125.html?mp=1',
'https://www.lazada.sg/products/esogoal-mini-umbrella-travel-umbrella-sun-rain-umbrella8-ribs-98cm-big-surface-lightweight-compact-parasol-uv-protection-for-men-women-i204815487-s308312226.html?mp=1',
'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1',
]
script = """
function main(splash, args)
assert(splash:go(args.url))
assert(splash:wait(1.5))
return {
html = splash:html()
}
end
"""
def start_requests(self):
for link in self.url:
yield SplashRequest(
url=link,
callback=self.parse,
endpoint='execute',
args={'wait': 0.5, 'lua_source': self.script},
dont_filter=True,
)
def parse(self, response):
yield {
'title': response.xpath("//span[@class='pdp-mod-product-badge-title']/text()").extract_first(),
'price': response.xpath("//span[contains(@class, 'pdp-price')]/text()").extract_first(),
'description': response.xpath("//div[@id='module_product_detail']/h2/text()").extract_first()
}
这是结果
我正在尝试抓取电子商务网站, 示例 link:https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1
数据是通过 React 呈现的,当我在少数 link 上执行 抓取 时,大部分数据都作为 null
返回,当我查看页面源我实际上找不到 HTML 可通过检查元素获得,只是 Javascript 标签内的 json。我在相同的 link 上测试了几次 运行 scrapy scraper 和以前没有找到的数据,实际上是 returns 内容,所以它是随机的.我不知道我应该如何抓取这种网站。
我也在使用用户代理池并在请求之间中断。
script = '''
function main(splash, args)
assert(splash:go(args.url))
assert(splash:wait(1.5))
return splash:html()
end
'''
def start_requests(self):
url= [
'https://www.lazada.sg/products/esogoal-tactical-sling-bag-outdoor-chest-pack-shoulder-backpack-military-sport-bag-for-trekking-camping-hiking-rover-sling-daypack-for-men-women-i204814494-s353896924.html?mp=1',
'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1',
'https://www.lazada.sg/products/esogoal-selfie-stick-tripod-extendable-selfie-stick-monopod-with-integrated-tripod-and-bluetooth-remote-shutter-wireless-selfie-stick-tripod-for-cellphonecameras-i205279097-s309050125.html?mp=1',
'https://www.lazada.sg/products/esogoal-mini-umbrella-travel-umbrella-sun-rain-umbrella8-ribs-98cm-big-surface-lightweight-compact-parasol-uv-protection-for-men-women-i204815487-s308312226.html?mp=1',
'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1'
]
for link in url:
yield SplashRequest(url=link, callback=self.parse, endpoint='render.html', args={'wait' : 0.5, 'lua_source' : self.script}, dont_filter=True)
def parse(self, response):
yield {
'title' : response.xpath("//span[@class='pdp-mod-product-badge-title']/text()").extract_first(),
'price' : response.xpath("//span[contains(@class, 'pdp-price')]/text()").extract_first(),
'description' : response.xpath("//div[@id='module_product_detail']").extract_first()
}
我试试这个:
将 'execute' 作为 splash 方法的参数而不是 'render html'
from scrapy_splash import SplashRequest class DynamicSpider(scrapy.Spider): name = 'products' url = [ 'https://www.lazada.sg/products/esogoal-tactical-sling-bag-outdoor-chest-pack-shoulder-backpack-military-sport-bag-for-trekking-camping-hiking-rover-sling-daypack-for-men-women-i204814494-s353896924.html?mp=1', 'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1', 'https://www.lazada.sg/products/esogoal-selfie-stick-tripod-extendable-selfie-stick-monopod-with-integrated-tripod-and-bluetooth-remote-shutter-wireless-selfie-stick-tripod-for-cellphonecameras-i205279097-s309050125.html?mp=1', 'https://www.lazada.sg/products/esogoal-mini-umbrella-travel-umbrella-sun-rain-umbrella8-ribs-98cm-big-surface-lightweight-compact-parasol-uv-protection-for-men-women-i204815487-s308312226.html?mp=1', 'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1', ] script = """ function main(splash, args) assert(splash:go(args.url)) assert(splash:wait(1.5)) return { html = splash:html() } end """ def start_requests(self): for link in self.url: yield SplashRequest( url=link, callback=self.parse, endpoint='execute', args={'wait': 0.5, 'lua_source': self.script}, dont_filter=True, ) def parse(self, response): yield { 'title': response.xpath("//span[@class='pdp-mod-product-badge-title']/text()").extract_first(), 'price': response.xpath("//span[contains(@class, 'pdp-price')]/text()").extract_first(), 'description': response.xpath("//div[@id='module_product_detail']/h2/text()").extract_first() }
这是结果