Scrapy: How to tell if robots.txt exists

Question

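I know I can use Python and fire an HTTP(S) request myself to check whether a robots.txt file exists. Since Scrapy already checks for it and downloads it so that the spider obeys its rules, is there a property, a method, or anything else in the Spider class that tells me whether robots.txt exists for a given website being crawled?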

I tried using the crawler stats:

See here:

self.crawler.stats.inc_value(f'robotstxt/response_status_count/{response.status}')

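I ran several tests against websites with and without robots.txt, and the stats correctly reflect whether robots.txt exists. For example, when I log self.crawler.stats.__dict__ from the spider_closed signal handler in my Spider class, I see: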

'robotstxt/response_status_count/200': 1 for a website with robots.txt
'robotstxt/response_status_count/404': 1 for a website without robots.txt
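For the single-domain case, the check described above could be wrapped in a small helper. This is only a sketch: the function names `robots_status_counts` and `robots_txt_found` are made up for illustration, and it assumes the stats are passed in as a plain dict, for example the one returned by `self.crawler.stats.get_stats()`.

```python
import re

# Matches the stat keys written by Scrapy's robots.txt middleware,
# e.g. "robotstxt/response_status_count/404".
_STATUS_KEY = re.compile(r"^robotstxt/response_status_count/(\d{3})$")


def robots_status_counts(stats: dict) -> dict:
    """Return {http_status: count} for robots.txt responses in a stats dict."""
    counts = {}
    for key, value in stats.items():
        match = _STATUS_KEY.match(key)
        if match:
            counts[int(match.group(1))] = value
    return counts


def robots_txt_found(stats: dict) -> bool:
    """True if at least one robots.txt request was answered with HTTP 200."""
    return robots_status_counts(stats).get(200, 0) > 0


# Hypothetical stats from a crawl of one site that has no robots.txt.
example = {"robotstxt/response_status_count/404": 1, "downloader/request_count": 3}
print(robots_status_counts(example))  # {404: 1}
print(robots_txt_found(example))      # False
```

Because the stat keys carry only the status code, this helper is meaningful only when the crawl touches a single domain.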

However, this does not work if the spider encounters more than one domain during the crawl. The stats then look something like this:

"robotstxt/response_status_count/200": 1,
"robotstxt/response_status_count/301": 6,
"robotstxt/response_status_count/404": 9,
"robotstxt/response_status_count/403": 1

But I can't map the HTTP status codes of those responses back to domains...

Answer 1

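I don't think so. You will probably have to write a custom middleware based on RobotsTxtMiddleware. It has the methods _parse_robots and _robots_error, which you can probably use to determine whether robots.txt exists.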

https://github.com/scrapy/scrapy/blob/e27eff47ac9ae9a9b9c43426ebddd424615df50a/scrapy/downloadermiddlewares/robotstxt.py

python

robots.txt

scrapy