scrapy response is nothing like the page source

Question

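I'm trying to use scrapy shell to open "ykc1.greatwestlife.com", which should be a public website. Although I can see a lot of content when I view the page source manually, I can't get the right response with scrapy.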

[Screenshot: scrapy shell response result]

Do I need scrapy-splash in this case? Any ideas? Thanks.

Answer 1

You can actually see two back-to-back requests, caused by this script in the page head:

      <head>
        <script language="javascript">
            document.cookie = "cmsUserPortalLocale=en;path=/";
            document.cookie = "cmsTheme=advgwl;path=/";    
            document.cookie = "siteBrand="+escape(location.hostname)+"; path=/";
            window.location.reload(true);
        </script>

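The response to the first request is much smaller and is probably the cause of the problem you're seeing. Thankfully, since the cookies look static, you can reproduce that behavior quite easily: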

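def parse(self, response):
    # The response that reaches parse() already carries the session cookies,
    # but the site expects three more, normally set by the inline script.
    new_cookies = {
        "cmsUserPortalLocale": "en",
        "cmsTheme": "advgwl",
        "siteBrand": "ykc1.greatwestlife.com",
    }
    # Re-request the same URL. dont_filter=True keeps the duplicate filter
    # from dropping a request for a URL that has already been fetched.
    yield response.follow(url=response.url, cookies=new_cookies,
                          callback=self.parse_home, dont_filter=True)

def parse_home(self, response):
    # The full page body should be available here.
    self.logger.info("Got %d bytes from %s", len(response.body), response.url)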
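
If you would rather not hard-code the three cookies, a rough sketch along these lines could derive them from the first (small) response instead. It uses only the standard library, so it can be tried outside Scrapy. The names `static_cookies`, `bootstrap_cookies` and `is_reload_stub` are made up for illustration, the regex only handles the literal `document.cookie = "name=value;..."` form shown in the snippet above, and `quote()` is only an approximation of JavaScript's `escape()` that should be good enough for ordinary ASCII hostnames.

```python
import re
from urllib.parse import quote, urlsplit

# Matches fully literal assignments such as
#   document.cookie = "cmsTheme=advgwl;path=/";
# Assignments built by string concatenation (like siteBrand) do not match.
STATIC_COOKIE_RE = re.compile(
    r'document\.cookie\s*=\s*"([^"=;]+)=([^";]*);[^"]*"\s*;'
)


def is_reload_stub(html):
    """Guess whether the page is the small stub that sets cookies and reloads."""
    return "window.location.reload" in html


def static_cookies(html):
    """Collect name/value pairs from literal document.cookie assignments."""
    return dict(STATIC_COOKIE_RE.findall(html))


def bootstrap_cookies(html, url):
    """Build the cookie dict the inline script would have set in a browser."""
    cookies = static_cookies(html)
    host = urlsplit(url).hostname or ""
    # Assumption: escape(location.hostname) leaves a plain ASCII hostname
    # unchanged; quote() with these safe characters approximates it.
    cookies["siteBrand"] = quote(host, safe="@*_+-./")
    return cookies
```

Inside `parse()` you could then check `is_reload_stub(response.text)` and pass `bootstrap_cookies(response.text, response.url)` as the `cookies` argument of the follow-up request, which avoids running a JavaScript renderer such as scrapy-splash just for this one reload.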

python

scrapy

web-scraping

scrapy-splash