scrapy-splash returns its own headers and not the original headers from the site
I build my spider with scrapy-splash. What I need now is to maintain the session, so I use scrapy.downloadermiddlewares.cookies.CookiesMiddleware, and it handles the set-cookie header. I know it handles the set-cookie header because I set COOKIES_DEBUG=True, which makes CookiesMiddleware print output about the set-cookie headers.
Problem: as soon as I add Splash to the picture, the set-cookie printouts disappear, and the response headers I actually get are
{'Date': ['Sun, 25 Sep 2016 12:09:55 GMT'], 'Content-Type': ['text/html; charset=utf-8'], 'Server': ['TwistedWeb/16.1.1']}
This is related to the Splash render engine, which is built on TwistedWeb.
Is there a directive to make Splash also give me the original response headers?
To get the original response headers you can write a Splash Lua script, as described in the scrapy-splash README; see the examples there:
Use a Lua script to get an HTML response with cookies, headers, body and method set to correct values; lua_source argument value is cached on Splash server and is not sent with each request (it requires Splash 2.1+):
import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
  splash:init_cookies(splash.args.cookies)
  assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
  })
  assert(splash:wait(0.5))

  local entries = splash:history()
  local last_response = entries[#entries].response
  return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
"""
class MySpider(scrapy.Spider):

    # ...
        yield SplashRequest(url, self.parse_result,
            endpoint='execute',
            cache_args=['lua_source'],
            args={'lua_source': script},
            headers={'X-My-Header': 'value'},
        )

    def parse_result(self, response):
        # here response.body contains result HTML;
        # response.headers are filled with headers from last
        # web page loaded to Splash;
        # cookies from all responses and from JavaScript are collected
        # and put into Set-Cookie response header, so that Scrapy
        # can remember them.
scrapy-splash also provides built-in helpers for cookie handling; they are enabled in this example as soon as scrapy-splash is configured, as described in the README.
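For completeness, the example above assumes scrapy-splash is already wired into the project settings. A minimal settings.py sketch following the scrapy-splash README (the SPLASH_URL value is an assumption for a local Docker instance; adjust it to where your Splash server runs):

```python
# settings.py -- minimal scrapy-splash configuration sketch.
# SPLASH_URL assumes a local Splash instance, e.g. started with:
#   docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = 'http://localhost:8050'

# SplashCookiesMiddleware is what makes the init_cookies/get_cookies
# round-trip in the Lua script work with Scrapy's cookie handling.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Needed for cache_args=['lua_source'] to deduplicate the script
# between requests.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```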