Scrapy response is nothing like the page source
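I am trying to use scrapy shell to open "ykc1.greatwestlife.com", which should be a public website. There is a lot of content when I view the page source manually, but I can't get the right response using scrapy.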
[Screenshot: scrapy shell response result]
Does this case require scrapy-splash?
Any ideas? Thanks.
You can actually see two back-to-back requests, caused by:
<head>
<script language="javascript">
document.cookie = "cmsUserPortalLocale=en;path=/";
document.cookie = "cmsTheme=advgwl;path=/";
document.cookie = "siteBrand="+escape(location.hostname)+"; path=/";
window.location.reload(true);
</script>
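The first response is much smaller, and is probably what causes the problem you are seeing. Thankfully, since the cookies appear to be static, you can reproduce the behavior quite easily: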
def parse(self, response):
    # This is required because the response that arrives at parse()
    # has the session cookies, but we need to add 3 more to them.
    new_cookies = {
        "cmsUserPortalLocale": "en",
        "cmsTheme": "advgwl",
        "siteBrand": "ykc1.greatwestlife.com",
    }
    # Re-request the same URL. dont_filter=True keeps Scrapy's duplicate
    # filter from dropping the repeat request.
    yield response.follow(url=response.url, cookies=new_cookies,
                          callback=self.parse_home, dont_filter=True)

def parse_home(self, response):
    # and now you have the full body
    self.logger.info("Got %d bytes", len(response.body))
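
If you would rather not hard-code the hostname, the cookie dict can be derived from the URL, mirroring what the inline script does with `escape(location.hostname)`. This is only a rough sketch: `bootstrap_cookies` and `_JS_ESCAPE_SAFE` are names made up for illustration, the `https://` scheme in the demo is an assumption, and `urllib.parse.quote` only approximates JavaScript's legacy `escape()` (the two agree for ordinary hostnames made of letters, digits, dots and hyphens).

```python
from urllib.parse import quote, urlsplit

# Characters that JavaScript's legacy escape() leaves unencoded,
# in addition to ASCII letters and digits.
_JS_ESCAPE_SAFE = "@*_+-./"


def bootstrap_cookies(url):
    """Return the three cookies the page's inline script would set for `url`.

    Hypothetical helper, not part of Scrapy.
    """
    # urlsplit(...).hostname is lowercased and excludes any port.
    hostname = urlsplit(url).hostname or ""
    return {
        "cmsUserPortalLocale": "en",
        "cmsTheme": "advgwl",
        # Approximates escape(location.hostname) from the page's script.
        "siteBrand": quote(hostname, safe=_JS_ESCAPE_SAFE),
    }


if __name__ == "__main__":
    # The scheme is assumed here; use whatever URL your spider starts from.
    print(bootstrap_cookies("https://ykc1.greatwestlife.com/"))
```

Inside the spider, the result could then be passed as `cookies=bootstrap_cookies(response.url)` in place of the literal dict.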