Unable to send requests in the right way after replacing redirected url with original one using middleware
I've created a script using scrapy to fetch some fields from a webpage. The urls of the landing page and of the inner pages get redirected very often, so I created a middleware to handle that redirection. However, I understand that I need to return request within process_request() after replacing the redirected url with the original one.
The spider always sends its requests with meta={'dont_redirect': True, "handle_httpstatus_list": [301,302,307,429]}.
As the requests are not being redirected automatically, I tried to replace the redirected urls within the _retry() method instead.
def process_request(self, request, spider):
    request.headers['User-Agent'] = self.ua.random

def process_exception(self, request, exception, spider):
    return self._retry(request, spider)

def _retry(self, request, spider):
    request.dont_filter = True
    if request.meta.get('redirect_urls'):
        redirect_url = request.meta['redirect_urls'][0]
        redirected = request.replace(url=redirect_url)
        redirected.dont_filter = True
        return redirected
    return request

def process_response(self, request, response, spider):
    if response.status in [301, 302, 307, 429]:
        return self._retry(request, spider)
    return response
Question: How can I send requests after replacing redirected url with original one using middleware?
Edit:
I'm putting this at the beginning of the answer because it's a quicker one-shot solution that might work for you.
Scrapy 2.5 introduced get_retry_request, which allows you to retry requests from a spider callback.
From the docs:
Returns a new Request object to retry the specified request, or None if retries of the specified request have been exhausted.
So you could do something like this:
from scrapy.downloadermiddlewares.retry import get_retry_request

def parse(self, response):
    if response.status in [301, 302, 307, 429]:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='tried to redirect',
            max_retry_times=10,
        )
        if new_request_or_none:
            yield new_request_or_none
        else:
            # exhausted all retries
            ...
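Note that for those statuses to reach parse() at all, the responses must not be consumed by the redirect/retry middlewares first; the meta={'dont_redirect': True, "handle_httpstatus_list": [...]} you already pass on your requests takes care of that.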
But then again, you should make sure you only retry the 3xx status codes when the website throws them to signal some non-permanent event, such as a redirect to a maintenance page. As for status 429, see my suggestion about using a delay below.
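As a rough sketch of such a check in the callback (this assumes the get_retry_request import shown above, and the '/maintenance' marker in the Location header is purely hypothetical; adapt it to whatever the target site actually does):

def parse(self, response):
    location = response.headers.get('Location', b'').decode()
    # Only treat the redirect as temporary if it points at a (hypothetical)
    # maintenance page; other redirects are not retried here.
    if response.status in [301, 302, 307] and '/maintenance' in location:
        new_request_or_none = get_retry_request(
            response.request,
            spider=self,
            reason='redirected to maintenance page',
        )
        if new_request_or_none:
            yield new_request_or_none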
Edit 2:
On Twisted versions older than 21.7.0, the coroutine async_sleep implementation using deferLater probably won't work. Use this instead:
from twisted.internet import defer, reactor

async def async_sleep(delay, return_value=None):
    deferred = defer.Deferred()
    reactor.callLater(delay, deferred.callback, return_value)
    return await deferred
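This drop-in replacement only relies on a plain Deferred plus reactor.callLater, so the call site in the middleware further down, await async_sleep(delay), stays exactly the same.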
Original answer:
If I understood you correctly, you just want to retry the original request whenever a redirect happens, right?
In that case, you can force retries of requests that would otherwise be redirected by using this custom RedirectMiddleware:
# middlewares.py
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class CustomRedirectMiddleware(RedirectMiddleware):
    """
    Modifies RedirectMiddleware to set response status to 503 on redirects.
    Make sure this appears in the DOWNLOADER_MIDDLEWARES setting with a lower priority
    (higher number) than RetryMiddleware (or whatever downloader middleware is
    responsible for retrying on status 503).
    """
    def process_response(self, request, response, spider):
        if response.status in (301, 302, 303, 307, 308):  # 429 is already in Scrapy's default retry list
            return response.replace(status=503)  # Now this response is RetryMiddleware's problem
        return super().process_response(request, response, spider)
However, retrying on every occurrence of these status codes may lead to other problems. So you might want to add some additional condition to that if, such as checking for the existence of a header that could indicate the site is under maintenance, or something similar.
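For instance, you could change CustomRedirectMiddleware.process_response to something along these lines (the X-Maintenance header is purely hypothetical; check what the real site actually sends):

    def process_response(self, request, response, spider):
        # Only hand the response over to RetryMiddleware when the redirect looks
        # temporary, e.g. when a (hypothetical) maintenance header is present.
        is_maintenance = b'X-Maintenance' in response.headers
        if response.status in (301, 302, 303, 307, 308) and is_maintenance:
            return response.replace(status=503)
        return super().process_response(request, response, spider)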
While we're at it: since you included status code 429 in your list, I assume you may be getting some "Too Many Requests" responses. You should probably make your spider wait some time before retrying in that particular case. That can be achieved with the following RetryMiddleware:
# middlewares.py
from twisted.internet import task, reactor
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message
async def async_sleep(delay, callable=None, *args, **kw):
    return await task.deferLater(reactor, delay, callable, *args, **kw)

class TooManyRequestsRetryMiddleware(RetryMiddleware):
    """
    Modifies RetryMiddleware to delay retries on status 429.
    """
    DEFAULT_DELAY = 10  # Delay in seconds. Tune this to your needs
    MAX_DELAY = 60      # Sometimes, Retry-After has absurd values

    async def process_response(self, request, response, spider):
        """
        Like RetryMiddleware.process_response, but, if response status is 429,
        retry the request only after waiting at most self.MAX_DELAY seconds.
        Respect the Retry-After header if it's less than self.MAX_DELAY.
        If Retry-After is absent/invalid, wait only self.DEFAULT_DELAY seconds.
        """
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            if response.status == 429:
                retry_after = response.headers.get('retry-after')
                try:
                    retry_after = int(retry_after)
                except (ValueError, TypeError):
                    delay = self.DEFAULT_DELAY
                else:
                    delay = min(self.MAX_DELAY, retry_after)
                spider.logger.info(f'Retrying {request} in {delay} seconds.')
                spider.crawler.engine.pause()
                await async_sleep(delay)
                spider.crawler.engine.unpause()
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        return response
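One design consequence to be aware of: spider.crawler.engine.pause() pauses the whole crawl, so no new requests are scheduled while the middleware sleeps. If that is too coarse for your use case, you might prefer relying on Scrapy's DOWNLOAD_DELAY or AutoThrottle settings instead.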
Don't forget to tell Scrapy to use these middlewares by editing DOWNLOADER_MIDDLEWARES in your project's settings.py:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'your_project_name.middlewares.TooManyRequestsRetryMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'your_project_name.middlewares.CustomRedirectMiddleware': 600,
}
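With these priorities, responses pass through CustomRedirectMiddleware (600) before TooManyRequestsRetryMiddleware (550) on their way back to the engine, which is exactly the ordering the CustomRedirectMiddleware docstring asks for: the redirect is first rewritten to status 503 and then picked up for retrying.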