Scrapy-splash response.css() can't get an element
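I am trying to scrape a website with dynamic JS content, and I want to get the breadcrumbs of the current page.
The breadcrumbs consist of 4 elements with the class '.breadcrumbs-link'.
To do this, I wrote the following code using scrapy-splash: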
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "quotes4"
    start_urls = ["https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html', args={'wait': 10})

    def parse(self, response):
        print('Result:')
        print(len(response.css('.breadcrumbs-link').extract()))  # OUTPUT: 0
        print(response.css('.breadcrumbs-link').extract())  # OUTPUT: []
What could be wrong with my approach?
This website (https://www.woolworths.com.au) uses Angular. If you go to the Splash FAQ page, there is a section "Website is not rendered correctly" where we can see:
non-working localStorage in Private Mode. This is a common issue e.g.
for websites based on AngularJS. If rendering doesn’t work, try
disabling Private mode (see How do I disable Private mode?).
Following that link, we can see:
How do I disable Private mode?
With Splash>=2.0, you can disable Private mode (which is “on” by
default). There are two ways to go about it:
- at startup, with the --disable-private-mode argument, e.g., if you’re using Docker:
  $ sudo docker run -it -p 8050:8050 scrapinghub/splash --disable-private-mode
- at runtime when using the /execute endpoint and setting splash.private_mode_enabled attribute to false
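
To check outside Scrapy whether a given Splash instance renders the page, one option is to request the render.html endpoint directly and look for the breadcrumbs in the returned HTML. The sketch below only builds that URL. It assumes Splash is reachable on localhost:8050, as with the Docker command above, and build_render_url is a hypothetical helper name, not part of Splash or scrapy-splash.

```python
from urllib.parse import urlencode

# Assumption: Splash was started locally and listens on port 8050.
SPLASH_BASE = "http://localhost:8050"


def build_render_url(page_url, wait=10, splash_base=SPLASH_BASE):
    """Return a render.html URL asking Splash to render page_url after `wait` seconds."""
    # urlencode percent-encodes the target URL so it survives as a query value.
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


if __name__ == "__main__":
    target = "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas"
    print(build_render_url(target))
```

Opening the printed URL in a browser, once with and once without --disable-private-mode, should show whether private mode is what keeps the breadcrumbs from rendering.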
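The simplest way is to disable private mode with --disable-private-mode, but if you don't want to do that, you can pass a Lua script that temporarily disables private mode for your spider and re-enables it when it is done: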
import scrapy
from scrapy_splash import SplashRequest

# Lua script run by Splash: disable private mode, render the page,
# then re-enable private mode before returning the HTML.
LUA_SCRIPT = """
function main(splash)
    splash.private_mode_enabled = false
    splash:go(splash.args.url)
    splash:wait(2)
    local html = splash:html()
    splash.private_mode_enabled = true
    return html
end
"""


class MySpider(scrapy.Spider):
    name = "quotes4"
    start_urls = ["https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas"]

    def start_requests(self):
        for url in self.start_urls:
            # The 'execute' endpoint runs the Lua script passed as lua_source.
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='execute',
                args={'wait': 1, 'lua_source': LUA_SCRIPT},
            )

    def parse(self, response):
        print('Result:')
        print(".breadcrumbs-link len = %d" % (len(response.css('.breadcrumbs-link').extract())))  # OUTPUT: 4
        print(".breadcrumbs-link = %s" % (response.css('.breadcrumbs-link').extract()))  # OUTPUT: [...HTML ELEMENTS...]
Disabling private mode worked for me. The result:
Result:
.breadcrumbs-link len = 4
.breadcrumbs-link = ['<li class="breadcrumbs-link" ng-repeat="link ....
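
One detail in the Lua script above: it always waits 2 seconds, while the request also passes 'wait': 1, which the script never reads. A possible variation, shown as a sketch rather than as the answer's own method, reads the wait time from splash.args so it can be set per request. The names LUA_SCRIPT_WITH_WAIT and build_execute_args are hypothetical helpers introduced here for illustration.

```python
# Variation of the Lua script: the wait time comes from the request args
# (splash.args.wait) and falls back to 2 seconds when none is given.
LUA_SCRIPT_WITH_WAIT = """
function main(splash)
    splash.private_mode_enabled = false
    splash:go(splash.args.url)
    splash:wait(splash.args.wait or 2)
    local html = splash:html()
    splash.private_mode_enabled = true
    return html
end
"""


def build_execute_args(wait=None):
    """Build the args dict for a request to Splash's 'execute' endpoint.

    Hypothetical helper: 'wait' is included only when given, so the
    Lua-side default applies otherwise.
    """
    args = {"lua_source": LUA_SCRIPT_WITH_WAIT}
    if wait is not None:
        args["wait"] = wait
    return args


if __name__ == "__main__":
    print(sorted(build_execute_args(wait=5)))
```

In the spider this would be used as SplashRequest(url=url, callback=self.parse, endpoint='execute', args=build_execute_args(wait=5)). How long an Angular page needs before the breadcrumbs appear depends on the site, so the value is worth tuning.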