What are the rules for the robots.txt file?

I'm trying to build a robots.txt parser, and I'm stuck on a simple question: what are the rules for the robots.txt file?

I started searching and found a document from 1996 on robotstxt.org that defines some rules for the robots.txt file. This document clearly defines all the rules for User-agent, Allow, and Disallow.

Looking at some example robots.txt files, I also found fields such as Sitemap and Host.

I kept searching and found this document on Wikipedia, which explains some additional fields.

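But my point is: since I can't fully trust Wikipedia, and web crawler technology keeps evolving and creating new rules for robots.txt files, is there a place where I can find every rule that can be defined in a robots.txt file?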

The most official thing you'll find is: http://www.robotstxt.org

But I think what actually works and is actually used in practice for robots.txt matters more than what someone wrote in some specification.

Google's robots.txt information page and its online checker are a good starting point: https://support.google.com/webmasters/answer/6062608?rd=1 (as is also recommended at http://www.robotstxt.org/checker.html)

http://www.robotstxt.org/orig.html is the official/original robots.txt specification.¹

It defines the fields User-agent and Disallow, and specifies that clients must ignore unknown fields. This allows others to create extensions (for example, the Sitemap field defined by the Sitemaps protocol).
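For a parser, that rule translates fairly directly into code. Here is a minimal sketch in Python, assuming the record structure of the original specification (User-agent lines followed by Disallow lines, records separated by blank lines). The function name `parse_robots` and the returned data layout are my own choices for illustration, not part of any specification. Unknown fields are not treated as errors; they are only collected so you can decide later which extensions to support.

```python
def parse_robots(text):
    """Parse robots.txt text into (records, extensions).

    records:    list of {"agents": [...], "disallow": [...]} dicts
    extensions: list of (field, value) pairs for fields the original
                specification does not define (e.g. Sitemap, Host)
    """
    records = []
    extensions = []
    current = None  # record currently being filled, if any

    for raw_line in text.splitlines():
        if not raw_line.strip():
            current = None  # a blank line ends the current record
            continue

        # Drop comments; a line holding only a comment is not a record boundary.
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue

        field, value = line.split(":", 1)
        field = field.strip().lower()  # field names are case-insensitive
        value = value.strip()

        if field == "user-agent":
            # Start a new record if none is open or the open one already has rules.
            if current is None or current["disallow"]:
                current = {"agents": [], "disallow": []}
                records.append(current)
            current["agents"].append(value)
        elif field == "disallow":
            if current is not None:
                current["disallow"].append(value)
        else:
            # Unknown field: keep it aside instead of failing.
            extensions.append((field, value))

    return records, extensions


sample = """\
# example file
User-agent: *
Disallow: /private/  # keep out
Sitemap: http://example.com/sitemap.xml

User-agent: BadBot
Disallow: /
"""

records, extensions = parse_robots(sample)
print(records)
print(extensions)
```

Keeping the unknown fields in a separate list means that adding support for, say, Sitemap later on does not require touching the core record parsing.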

There is no registry (so there is a risk of name collisions), and there is no standards body responsible for collecting all extensions.

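In 2008, Google (their announcement), Microsoft², and Yahoo!³ (their announcement) came together and agreed on a set of features they would support (note that they introduced reserved characters for Disallow values, whereas in the original specification all characters would be interpreted literally).
However, this only documents their interpretation (for their bots); it is not a specification that other bots have to follow. But looking at their documentation (e.g., from Bing, from Google Search, from Yandex) can give you an idea of what is out there.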
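To make the difference in Disallow interpretation concrete, here is a rough sketch of both matching styles. I'm assuming the reserved characters are `*` (matches any sequence of characters) and a trailing `$` (anchors the end of the URL), which is how Google and Bing document them; the function names `literal_match` and `wildcard_match` are made up for this example, and real crawlers have more rules (Allow precedence, percent-encoding) that are left out here.

```python
import re


def literal_match(rule, path):
    """Original-specification style: the Disallow value is a plain path prefix."""
    # An empty Disallow value means nothing is disallowed.
    return rule != "" and path.startswith(rule)


def wildcard_match(rule, path):
    """Extended style (assumed): '*' is a wildcard, a trailing '$' anchors the end."""
    if rule == "":
        return False
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape everything except '*', which becomes '.*'.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += r"\Z"
    # re.match anchors at the start, so unanchored rules act as prefixes.
    return re.match(pattern, path) is not None


rule = "/*.pdf$"
for path in ("/docs/file.pdf", "/docs/file.pdf?x=1"):
    print(path, literal_match(rule, path), wildcard_match(rule, path))
```

A parser that wants to serve both kinds of bots probably needs to make the matching strategy selectable, since the same Disallow value can block different URLs under the two interpretations.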

¹ http://www.robotstxt.org/norobots-rfc.txt is an initial draft for an RFC, but as far as I know it was never submitted/published.

² Their announcement seems to be 404.

³ Their announcement originally seems to have been at http://www.ysearchblog.com/archives/000587.html, but it is now 404.