Get a response when a site isn't crawled due to robots.txt

I am trying to crawl user-defined websites, but sites that block crawling via robots.txt are not crawled. That's fine, but I want to get a response so that I can show the user "the site you have entered doesn't allow crawling due to robots.txt".

There are three other kinds of blocking for which I get a corresponding response code and handle accordingly, but this one case, blocking via robots.txt, is the only exception I can't handle. So please let me know if there is any way to handle this case and show an appropriate error message.
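For context, the kind of handling I have in mind looks roughly like this sketch (the spider name and URL are placeholders; it assumes the robots.txt block reaches the request's errback as scrapy.exceptions.IgnoreRequest, which is what RobotsTxtMiddleware raises for forbidden requests):

import scrapy
from scrapy.exceptions import IgnoreRequest

class UserSiteSpider(scrapy.Spider):
    name = "user_site"

    def start_requests(self):
        # example.com stands in for the user-supplied URL
        yield scrapy.Request(
            "https://example.com/",
            callback=self.parse,
            errback=self.on_error,
        )

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)

    def on_error(self, failure):
        # RobotsTxtMiddleware drops a forbidden request by raising
        # IgnoreRequest; failure.check() tells us if that is what happened.
        if failure.check(IgnoreRequest):
            self.logger.error(
                "the site you have entered doesn't allow crawling due to robots.txt"
            )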

I am using Python 3.5.2 and Scrapy 1.5.

You should use the ROBOTSTXT_OBEY setting:

ROBOTSTXT_OBEY = False
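A minimal sketch of where that setting goes (spider name and URL are placeholders): either project-wide in settings.py, or per spider via the standard custom_settings attribute.

import scrapy

class NoRobotsSpider(scrapy.Spider):
    name = "no_robots"
    # Per-spider override; putting ROBOTSTXT_OBEY = False in settings.py
    # instead applies it project-wide.
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def start_requests(self):
        yield scrapy.Request("https://example.com/", callback=self.parse)

    def parse(self, response):
        self.logger.info("Fetched %s without a robots.txt check", response.url)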

More information about RobotsTxtMiddleware:

This middleware filters out requests forbidden by the robots.txt exclusion standard.

To make sure Scrapy respects robots.txt make sure the middleware is enabled and the ROBOTSTXT_OBEY setting is enabled.

If Request.meta has dont_obey_robotstxt key set to True the request will be ignored by this middleware even if ROBOTSTXT_OBEY is enabled.
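For example, a sketch of that per-request escape hatch (spider name and URL are placeholders):

import scrapy

class SelectiveSpider(scrapy.Spider):
    name = "selective"

    def start_requests(self):
        # robots.txt enforcement stays on globally (ROBOTSTXT_OBEY = True),
        # but the documented dont_obey_robotstxt meta key skips the check
        # for this one request.
        yield scrapy.Request(
            "https://example.com/",
            meta={"dont_obey_robotstxt": True},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)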