Unable to send requests in the right way after replacing redirected url with original one using middleware

I've created a script using scrapy to fetch some fields from a webpage. The url of the landing page and the urls of the inner pages get redirected very often, so I created a middleware to handle that redirection. However, when I came across the return request in [...], I could understand that I need to replace the redirected url with the original one within process_request().
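For reference, here is a minimal sketch of what "return request" inside process_request() amounts to: when a downloader middleware's process_request() returns a Request, Scrapy stops processing the current request and schedules the returned one instead. The original_url meta key below is purely hypothetical:

class ReplaceUrlMiddleware:
    # Minimal sketch, not the actual middleware from the question
    def process_request(self, request, spider):
        original_url = request.meta.get('original_url')  # hypothetical meta key
        if original_url and request.url != original_url:
            # Returning a Request here makes Scrapy reschedule the returned
            # request instead of downloading the current one
            return request.replace(url=original_url, dont_filter=True)
        return None  # let the current request proceed unchanged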

When sending requests, the spider always uses meta={'dont_redirect': True, 'handle_httpstatus_list': [301, 302, 307, 429]}.
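In other words, the spider issues its requests roughly like this (the spider name, url and callback below are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    # Illustrative spider; only the meta flags mirror the setup described above
    name = 'example'

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com',  # placeholder url
            callback=self.parse,
            meta={
                'dont_redirect': True,
                'handle_httpstatus_list': [301, 302, 307, 429],
            },
        )

    def parse(self, response):
        pass  # field extraction would happen here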

Since none of the requests actually get redirected, I tried replacing the redirected urls within the _retry() method instead.

from fake_useragent import UserAgent

class CustomRetryMiddleware:
    # Class wrapper and __init__ added for completeness; the original snippet
    # showed only the methods. self.ua is assumed to be a fake_useragent UserAgent.
    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # Rotate the User-Agent on every outgoing request
        request.headers['User-Agent'] = self.ua.random

    def process_exception(self, request, exception, spider):
        return self._retry(request, spider)

    def _retry(self, request, spider):
        request.dont_filter = True
        if request.meta.get('redirect_urls'):
            # Swap the redirected url back to the original one
            redirect_url = request.meta['redirect_urls'][0]
            redirected = request.replace(url=redirect_url)
            redirected.dont_filter = True
            return redirected
        return request

    def process_response(self, request, response, spider):
        if response.status in [301, 302, 307, 429]:
            return self._retry(request, spider)
        return response

Question: How can I send requests in the right way after replacing the redirected url with the original one using middleware?

EDIT:

I'm putting this at the beginning of the answer because it's a quicker, one-shot solution that may work for you.

Scrapy 2.5 introduced get_retry_request, which allows you to retry requests from a spider callback.

From the docs:

Returns a new Request object to retry the specified request, or None if retries of the specified request have been exhausted.

So you could do something like this:

from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
    if response.status in [301, 302, 307, 429]:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='tried to redirect',
            max_retry_times=10,
        )
        if new_request_or_none:
            yield new_request_or_none
        else:
            # exhausted all retries
            ...

But then again, you should make sure to only retry the status codes starting with 3 if the site throws them to indicate some non-permanent event, like a redirect to a maintenance page. As for status 429, see my suggestion below about using a delay.

EDIT 2:

On Twisted versions older than 21.7.0, the coroutine async_sleep implementation using deferLater probably won't work. Use this instead:

from twisted.internet import defer, reactor

async def async_sleep(delay, return_value=None):
    deferred = defer.Deferred()
    reactor.callLater(delay, deferred.callback, return_value)
    return await deferred

Original answer:

If I understood you correctly, you just want to retry the original request whenever a redirect occurs, right?

In that case, you can force retries for requests that would otherwise be redirected by using a custom RedirectMiddleware like this:
# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class CustomRedirectMiddleware(RedirectMiddleware):
    """
    Modifies RedirectMiddleware to set response status to 503 on redirects.
    Make sure this appears in the DOWNLOADER_MIDDLEWARES setting with a lower priority (higher number) than RetryMiddleware
    (or whatever the downloader middleware responsible for retrying on status 503 is called).
    """

    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):  # 429 already is in scrapy's default retry list
            return response.replace(status=503)  # Now this response is RetryMiddleware's problem

        return super().process_response(request, response, spider)

However, retrying on every occurrence of those status codes may lead to other problems, so you might want to add some additional condition to that if, e.g. checking for the presence of some header that may indicate the site is under maintenance, or something along those lines.
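For instance, here is a minimal sketch of such a condition; the '/maintenance' path check is purely an assumption for illustration, so adapt it to whatever your target site actually does:

# middlewares.py (illustrative sketch only)
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class ConditionalRedirectRetryMiddleware(RedirectMiddleware):
    """
    Like CustomRedirectMiddleware above, but only turns a redirect into a
    retryable 503 when the redirect target looks like a maintenance page.
    The '/maintenance' check is a placeholder assumption.
    """

    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):
            location = response.headers.get('Location', b'').decode()
            if '/maintenance' in location:  # hypothetical condition
                return response.replace(status=503)
        return super().process_response(request, response, spider)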

While we're at it, since you included status code 429 in your list, I assume you might be getting some "Too Many Requests" responses. You should probably make your spider wait some time before retrying in that particular case. That can be achieved with the following RetryMiddleware:
# middlewares.py
from twisted.internet import task, reactor
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

async def async_sleep(delay, callable=None, *args, **kw):
    return await task.deferLater(reactor, delay, callable, *args, **kw)

class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """
    Modifies RetryMiddleware to delay retries on status 429.
    """

    DEFAULT_DELAY = 10  # Delay in seconds. Tune this to your needs
    MAX_DELAY = 60  # Sometimes, RETRY-AFTER has absurd values

    async def process_response(self, request, response, spider):
        """
        Like RetryMiddleware.process_response, but, if response status is 429,
        retry the request only after waiting at most self.MAX_DELAY seconds.
        Respect the Retry-After header if it's less than self.MAX_DELAY.
        If Retry-After is absent/invalid, wait only self.DEFAULT_DELAY seconds.
        """

        if request.meta.get('dont_retry', False):
            return response

        if response.status in self.retry_http_codes:
            if response.status == 429:
                retry_after = response.headers.get('retry-after')
                try:
                    retry_after = int(retry_after)
                except (ValueError, TypeError):
                    delay = self.DEFAULT_DELAY
                else:
                    delay = min(self.MAX_DELAY, retry_after)
                spider.logger.info(f'Retrying {request} in {delay} seconds.')

                spider.crawler.engine.pause()
                await async_sleep(delay)
                spider.crawler.engine.unpause()

            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response

        return response

Don't forget to tell Scrapy to use these middlewares by editing DOWNLOADER_MIDDLEWARES in your project's settings.py:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project_name.middlewares.TooManyRequestsRetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'your_project_name.middlewares.CustomRedirectMiddleware': 600
}