How to authenticate using a Scrapy spider with Zyte Smart Proxy Manager (formerly Crawlera) enabled?

I followed the scrapy-zyte-smartproxy documentation to integrate proxy usage into my spider. Now my spider can't log in.

To be able to do that, you have to use Crawlera sessions, and you also need to disable Crawlera cookies. There's an old PR, but it's still not merged and doesn't work. You need to create your own Scrapy middleware in your_project/middlewares.py to attach the Crawlera headers to each spider request.

from scrapy import Request


class ZyteSmartProxySessionMiddleware:
    """Copy the proxy session from each response onto the requests it yields."""

    def process_spider_output(self, response, result, spider):
        def _set_session(request_or_item):
            # Items pass through untouched; only requests get headers.
            if not isinstance(request_or_item, Request):
                return request_or_item

            request = request_or_item
            header = b'X-Crawlera-Session'
            session = response.headers.get(header)
            error = response.headers.get(b'X-Crawlera-Error')
            session_is_bad = error == b'bad_session_id'

            # Reuse the session only if the proxy returned one and did not reject it.
            if session is not None and not session_is_bad:
                request.headers[header] = session
                # Disable the proxy's own cookie handling, as described above.
                request.headers['X-Crawlera-Cookies'] = 'disable'

            return request

        return (_set_session(request_or_item)
                for request_or_item in result or ())
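
The rule the middleware applies can be checked without running Scrapy. The following is only a minimal sketch that uses plain dicts in place of Scrapy's header objects; the function name session_headers is hypothetical and is not part of Scrapy or scrapy-zyte-smartproxy.

```python
def session_headers(response_headers):
    """Return the headers to add to a follow-up request.

    Mirrors the middleware's rule: reuse the session id from the
    response unless the proxy reported it as bad.
    """
    session = response_headers.get(b'X-Crawlera-Session')
    error = response_headers.get(b'X-Crawlera-Error')

    # No session yet, or the proxy rejected it: add nothing.
    if session is None or error == b'bad_session_id':
        return {}

    return {
        b'X-Crawlera-Session': session,
        b'X-Crawlera-Cookies': b'disable',
    }
```

A response that carries a session id yields both headers; a response without one, or one flagged with bad_session_id, yields an empty dict, so the follow-up request goes out unchanged.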

Enable this middleware in your settings.py file.

SPIDER_MIDDLEWARES = {
    # The value is the middleware's order and should be an integer.
    'your_project.middlewares.ZyteSmartProxySessionMiddleware': 543,
}

To start a session, attach the X-Crawlera-Session: create header to the login request in your Scrapy spider.

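# Requires at the top of the spider module: from scrapy import FormRequest
def parse(self, response):
    auth_data = {'username': self.user, 'password': self.password}
    # Fill in and submit the login form found in the response.
    request = FormRequest.from_response(response, formdata=auth_data,
                                        callback=self.redirect_to_select)
    # Ask the proxy to open a new session for this login request.
    request.headers.setdefault('X-Crawlera-Session', 'create')
    return request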

Note that, according to the documentation, the spider will become slower after this:

There is a default delay of 12 seconds between each request using the same IP. These delays can differ for more popular domains.
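
To get a feel for the slowdown, here is a rough back-of-the-envelope sketch. It assumes the 12 second default applies uniformly to every request in one session and ignores response time; the function name min_session_duration is hypothetical.

```python
def min_session_duration(num_requests, delay_seconds=12.0):
    """Lower bound, in seconds, for sequential requests in one session.

    The delay applies between consecutive requests, so N requests
    incur N - 1 delays.
    """
    if num_requests < 0:
        raise ValueError("num_requests must be non-negative")
    return max(num_requests - 1, 0) * delay_seconds
```

Under these assumptions, 101 requests sent through a single session need at least 1200 seconds, i.e. 20 minutes, before any download time is counted.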