Usage of 'Allow' in robots.txt

Question

I recently came across a site whose robots.txt looks like this:

User-agent: *
Allow: /login
Allow: /register

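I can only find Allow entries in it, and no Disallow entries.

From this, I understand that robots.txt is essentially a blacklist: Disallow lists the pages that must not be crawled. Allow is therefore only used to re-open a sub-part of a site that Disallow has already blocked, like this: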

Allow: /crawlthis
Disallow: /

However, the robots.txt above has no Disallow entries at all. So does it let Google crawl every page? Or does it permit only the pages marked with Allow?

Answer 1

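You are right, this robots.txt file allows Google to crawl every page on the site. A full guide is available at http://www.robotstxt.org/robotstxt.html

If you want Googlebot to crawl only the specified pages, the correct format would be: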

User-agent: *
Disallow: /
Allow: /login
Allow: /register

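(I would normally disallow those particular pages, though, since they offer little value to searchers.)

Note that the Allow directive is only honored by some robots (including Googlebot).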
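As a rough illustration of how a crawler that supports Allow might evaluate the record above, the sketch below assumes the longest-match precedence that Google documents: the most specific matching path wins, and Allow wins a tie. The function name is_allowed and the rule tuples are hypothetical, wildcards are not handled, and a crawler that reads rules in file order may decide differently.

```python
def is_allowed(path, rules):
    """Decide whether `path` may be crawled under longest-match precedence.

    `rules` is a list of (directive, prefix) tuples for one user-agent group,
    for example ("allow", "/login"). This is a simplified sketch: wildcards
    and percent-encoding are ignored.
    """
    best_len = -1
    allowed = True  # no matching rule means crawling is permitted
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value or a non-matching prefix is skipped
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on equal length, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("disallow", "/"), ("allow", "/login"), ("allow", "/register")]

for path in ("/login", "/register/step2", "/about"):
    print(path, is_allowed(path, rules))
```

Under these assumptions only /login, /register and anything beneath them stay crawlable; every other path falls back to the Disallow: / rule.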

Answer 2

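A robots.txt record that has Allow lines but no Disallow lines makes no sense. Everything is allowed to be crawled by default anyway.

According to the original robots.txt specification (which does not define Allow), the record is even invalid, because at least one Disallow line is required: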

The record starts with one or more User-agent lines, followed by one or more Disallow lines […]

At least one Disallow field needs to be present in a record.

In other words, a record like

User-agent: *
Allow: /login
Allow: /register

is equivalent to the record

User-agent: *
Disallow:

i.e., everything is allowed to be crawled, including (but not limited to) URLs whose paths start with /login and /register.
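The equivalence can be checked with Python's standard-library urllib.robotparser. This is only a sketch of how one particular parser behaves: the host example.com, the sample paths, and the helper parse_record are placeholders chosen for illustration, and other crawlers may implement the rules differently.

```python
from urllib.robotparser import RobotFileParser


def parse_record(text):
    """Parse robots.txt content given as a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The record from the question: Allow lines only.
allow_only = parse_record("""User-agent: *
Allow: /login
Allow: /register
""")

# The record it is said to be equivalent to: an empty Disallow.
empty_disallow = parse_record("""User-agent: *
Disallow:
""")

# Compare both parsers on listed and unlisted paths.
for path in ("/", "/login", "/register", "/private/report"):
    url = "https://example.com" + path
    print(path,
          allow_only.can_fetch("Googlebot", url),
          empty_disallow.can_fetch("Googlebot", url))
```

Both parsers should report every sample path as fetchable, which matches the point that the Allow lines add nothing when no Disallow line is present.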
