Proxy servers with Scrapy-Splash
I am trying to get a proxy server working with my local Splash instance. I have read several documents but have not found a single working example. I suspected https://github.com/scrapy-plugins/scrapy-splash/issues/107 was the cause; I no longer get that traceback, but I still cannot use Splash with a proxy. Below is the new error message. Thanks in advance to anyone who can help. None of my requests even make it through Splash.
def parse_json(self, response):
    json_data = response.body
    load = json.loads(json_data.decode('utf-8'))
    dump = json.dumps(load, sort_keys=True, indent=2)
    # Lua script: route every Splash sub-request through the Crawlera proxy
    # and keep the Crawlera session alive between requests.
    LUA_SOURCE = """
    function main(splash)
        local host = "proxy.crawlera.com"
        local port = 8010
        local user = "APIKEY"
        local password = ""
        local session_header = "X-Crawlera-Session"
        local session_id = "create"
        splash:on_request(function (request)
            request:set_header("X-Crawlera-UA", "desktop")
            request:set_header(session_header, session_id)
            request:set_proxy{host, port, username=user, password=password}
        end)
        splash:on_response_headers(function (response)
            if response.headers[session_header] ~= nil then
                session_id = response.headers[session_header]
            end
        end)
        splash:go(splash.args.url)
        return splash:html()
    end
    """
    for link in load['d']['blogtopics']:
        link = link['Uri']
        yield SplashRequest(link, self.parse_blog, endpoint='execute',
                            args={'wait': 3, 'lua_source': LUA_SOURCE})
2017-03-29 09:26:37 [scrapy.core.engine] DEBUG: Crawled (503) <GET http://community.martindale.com/legal-blogs/Practice_Areas/b/corporate__securities_law/archive/2011/08/11/sec-adopts-new-rules-replacing-credit-ratings-as-a-criterion-for-the-use-of-short-form-shelf-registration.aspx via http://localhost:8050/execute> (referer: None)
The problem appears to be caused by the Crawlera middleware: it has no special handling for SplashRequest, so it tries to reach my local Splash host through the proxy.
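Given that diagnosis, one common workaround is to disable the scrapy-crawlera downloader middleware entirely and let the Lua script be the only place the proxy is configured, so Scrapy talks to the local Splash endpoint directly. A minimal settings sketch, assuming scrapy-splash's standard middleware setup from its README and that Crawlera was enabled via scrapy-crawlera's `CRAWLERA_ENABLED` flag:

```python
# settings.py -- a sketch, not a verified fix.
# With CRAWLERA_ENABLED off, Scrapy sends requests straight to the local
# Splash endpoint; the Lua script inside the spider then routes Splash's
# own sub-requests through the Crawlera proxy via request:set_proxy{}.
CRAWLERA_ENABLED = False  # keep scrapy_crawlera.CrawleraMiddleware out of the path

SPLASH_URL = 'http://localhost:8050'

# Standard scrapy-splash middleware configuration (from the scrapy-splash README)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With this setup, only the Splash container itself ever contacts proxy.crawlera.com, which avoids the middleware trying to proxy the request to localhost:8050.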