Get a response when a site isn't crawled due to robots.txt

I am trying to crawl user-defined websites, but sites that block crawling via robots.txt are not crawled. That's fine, but I want to get a response so that I can show the user "the site you have entered doesn't allow crawling due to robots.txt".

There are three other kinds of blocking for which I get a corresponding response code and handle accordingly, but this one case, blocking via robots.txt, is the only exception I can't handle. So please let me know if there is any way to handle this case and show an appropriate error message.
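For context, the kind of handling I have in mind looks roughly like this sketch (the spider name and URL are placeholders; it assumes the robots.txt block reaches the request's errback as scrapy.exceptions.IgnoreRequest, which is what RobotsTxtMiddleware raises for forbidden requests):

import scrapy
from scrapy.exceptions import IgnoreRequest

class UserSiteSpider(scrapy.Spider):
    name = "user_site"

    def start_requests(self):
        # example.com stands in for the user-supplied URL
        yield scrapy.Request(
            "https://example.com/",
            callback=self.parse,
            errback=self.on_error,
        )

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)

    def on_error(self, failure):
        # RobotsTxtMiddleware drops a forbidden request by raising
        # IgnoreRequest; failure.check() tells us if that is what happened.
        if failure.check(IgnoreRequest):
            self.logger.error(
                "the site you have entered doesn't allow crawling due to robots.txt"
            )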

I am using Python 3.5.2 and Scrapy 1.5.

You should use the ROBOTSTXT_OBEY setting:

ROBOTSTXT_OBEY = False
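A minimal sketch of where that setting goes (spider name and URL are placeholders): either project-wide in settings.py, or per spider via the standard custom_settings attribute.

import scrapy

class NoRobotsSpider(scrapy.Spider):
    name = "no_robots"
    # Per-spider override; putting ROBOTSTXT_OBEY = False in settings.py
    # instead applies it project-wide.
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def start_requests(self):
        yield scrapy.Request("https://example.com/", callback=self.parse)

    def parse(self, response):
        self.logger.info("Fetched %s without a robots.txt check", response.url)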

More information about RobotsTxtMiddleware:

This middleware filters out requests forbidden by the robots.txt exclusion standard.

To make sure Scrapy respects robots.txt make sure the middleware is enabled and the ROBOTSTXT_OBEY setting is enabled.

If Request.meta has dont_obey_robotstxt key set to True the request will be ignored by this middleware even if ROBOTSTXT_OBEY is enabled.
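For example, a sketch of that per-request escape hatch (spider name and URL are placeholders):

import scrapy

class SelectiveSpider(scrapy.Spider):
    name = "selective"

    def start_requests(self):
        # robots.txt enforcement stays on globally (ROBOTSTXT_OBEY = True),
        # but the documented dont_obey_robotstxt meta key skips the check
        # for this one request.
        yield scrapy.Request(
            "https://example.com/",
            meta={"dont_obey_robotstxt": True},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)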