How to get cookies from a scrapy-splash response
I want to get the cookie values from the Splash response object, but it does not work the way I expected.
Here is the spider code:
import scrapy
from scrapy_splash import SplashRequest


class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']

    def start_requests(self):
        url = 'https://www.amazon.com/gp/goldbox?ref_=nav_topnav_deals'
        yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        print(response.headers)
Output log:
2019-08-17 11:53:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/robots.txt> (referer: None)
2019-08-17 11:53:08 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://192.168.99.100:8050/robots.txt> (referer: None)
2019-08-17 11:53:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/gp/goldbox?ref_=nav_topnav_deals via http://192.168.99.100:8050/render.html> (referer: None)
{b'Date': [b'Sat, 17 Aug 2019 06:23:09 GMT'], b'Server': [b'TwistedWeb/18.9.0'], b'Content-Type': [b'text/html; charset=utf-8']}
2019-08-17 11:53:24 [scrapy.core.engine] INFO: Closing spider (finished)
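You can try the following approach:
- Write a small Lua script that returns the HTML together with the cookies, and store it as an attribute on the spider class: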
lua_request = """
    function main(splash)
        -- load any cookies passed in through the request args
        splash:init_cookies(splash.args.cookies)
        assert(splash:go(splash.args.url))
        splash:wait(0.5)
        return {
            html = splash:html(),
            cookies = splash:get_cookies()
        }
    end
    """
- Change your request to the following, so that it goes through the execute endpoint and runs the script:
yield SplashRequest(
    url,
    self.parse,
    endpoint='execute',
    args={'lua_source': self.lua_request}
)
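The script reads splash.args.cookies, which scrapy-splash fills in through its cookie middleware, so that middleware probably needs to be enabled in settings.py. The fragment below is a sketch based on the setup described in the scrapy-splash README; the SPLASH_URL value is an assumption taken from the address in the log above, so adjust it to wherever your Splash instance runs.

```python
# settings.py (sketch; values follow the scrapy-splash README)

# Assumed from the log output above; point this at your own Splash instance.
SPLASH_URL = 'http://192.168.99.100:8050'

DOWNLOADER_MIDDLEWARES = {
    # Passes cookies to the Lua script via splash.args.cookies and
    # merges the 'cookies' field returned by the script back into the jar.
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```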
- Then read the cookies in your parse method, like this:
def parse(self, response):
    # 'cookies' is the key returned by the Lua script
    cookies = response.data['cookies']
    headers = response.headers
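
splash:get_cookies() returns the cookies as a list of HAR-style tables, so response.data['cookies'] should arrive in Python as a list of dicts with keys such as name and value. If you only need a name-to-value mapping, a small helper along these lines may be convenient. The function name cookies_to_dict and the sample cookie values are made up for illustration and do not come from a real response.

```python
def cookies_to_dict(har_cookies):
    """Build a {name: value} mapping from a list of HAR-style cookie dicts."""
    return {
        cookie["name"]: cookie["value"]
        for cookie in har_cookies
        # skip malformed entries that lack a name or a value
        if "name" in cookie and "value" in cookie
    }


# Hypothetical sample shaped like the output of splash:get_cookies()
sample = [
    {"name": "session-id", "value": "123-456", "domain": ".amazon.com", "path": "/"},
    {"name": "i18n-prefs", "value": "USD", "domain": ".amazon.com", "path": "/"},
]

print(cookies_to_dict(sample))
```

Inside the spider you would call it as cookies_to_dict(response.data['cookies']).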