How to authenticate with a Scrapy spider when Zyte Smart Proxy Manager (formerly Crawlera) is enabled?
I integrated proxy usage into my spider following the scrapy-zyte-smartproxy documentation. Now my spider can no longer log in.
To make login work, we have to use Crawlera sessions, and we also need to disable Crawlera cookies. There is an old PR for this, but it is still not merged and does not work. You need to create your own Scrapy spider middleware in your_project/middlewares.py that attaches the Crawlera headers to every spider request.
from scrapy import Request


class ZyteSmartProxySessionMiddleware(object):

    def process_spider_output(self, response, result, spider):
        def _set_session(request_or_item):
            # Pass items through untouched; only requests get headers.
            if not isinstance(request_or_item, Request):
                return request_or_item
            request = request_or_item
            header = b'X-Crawlera-Session'
            session = response.headers.get(header)
            error = response.headers.get(b'X-Crawlera-Error')
            session_is_bad = error == b'bad_session_id'
            if session is not None and not session_is_bad:
                request.headers[header] = session
                request.headers['X-Crawlera-Cookies'] = 'disable'
            return request

        return (_set_session(request_or_item)
                for request_or_item in result or ())
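The propagation rule the middleware implements can be illustrated without Scrapy, using plain dicts as stand-ins for Scrapy's header objects (a sketch for clarity, not part of the spider):

```python
# Sketch of the session-propagation rule, with plain dicts instead of
# scrapy's Headers objects: copy X-Crawlera-Session from the incoming
# response onto the outgoing request, unless the proxy reported a bad
# session id, and disable Crawlera cookie handling for session requests.
def propagate_session(response_headers, request_headers):
    session = response_headers.get(b'X-Crawlera-Session')
    session_is_bad = (
        response_headers.get(b'X-Crawlera-Error') == b'bad_session_id')
    if session is not None and not session_is_bad:
        request_headers[b'X-Crawlera-Session'] = session
        request_headers[b'X-Crawlera-Cookies'] = b'disable'
    return request_headers


# A response carrying a session id: the session is copied onto the request.
print(propagate_session({b'X-Crawlera-Session': b'12345'}, {}))
# -> {b'X-Crawlera-Session': b'12345', b'X-Crawlera-Cookies': b'disable'}

# A response flagged with bad_session_id: nothing is copied.
print(propagate_session(
    {b'X-Crawlera-Session': b'12345',
     b'X-Crawlera-Error': b'bad_session_id'}, {}))
# -> {}
```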
Enable this middleware in your settings.py file:
SPIDER_MIDDLEWARES = {
    # The value is an integer order, not a boolean; 610 places the
    # middleware after Scrapy's built-in spider middlewares.
    'your_project.middlewares.ZyteSmartProxySessionMiddleware': 610,
}
To start a session, attach an X-Crawlera-Session: create header to the login request in your Scrapy spider:
from scrapy import FormRequest


def parse(self, response):
    auth_data = {'username': self.user, 'password': self.password}
    request = FormRequest.from_response(response, formdata=auth_data,
                                        callback=self.redirect_to_select)
    request.headers.setdefault('X-Crawlera-Session', 'create')
    return request
Note that, according to the documentation, the spider becomes slower after this:
There is a default delay of 12 seconds between each request using the same IP. These delays can differ for more popular domains.
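Since only one request can be in flight per session at a time, it may help to cap Scrapy's own concurrency so requests don't pile up behind that delay. A minimal settings.py sketch (the exact tuning is my assumption, not taken from the quoted docs):

```python
# settings.py -- sketch: with a single proxy session, only one request is
# processed at a time, so lower Scrapy's concurrency to match.
CONCURRENT_REQUESTS = 1
# Smart Proxy Manager does its own throttling, so Scrapy's AutoThrottle
# is usually left disabled when routing requests through the proxy.
AUTOTHROTTLE_ENABLED = False
```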