robots.txt: Does Wildcard mean no characters too?

Question

I have the following example robots.txt and a question about the wildcard:

User-agent: *

Disallow: /*/admin/*

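Does this rule now apply to both of these pages:

http://www.example.org/admin
http://www.example.org/es/admin

So can the wildcard also stand for no characters?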

Answer 1

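In the original robots.txt specification, * in a Disallow value has no special meaning; it is just a character like any other. So bots following the original specification would crawl http://www.example.org/admin as well as http://www.example.org/es/admin.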

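Some bots support "extensions" of the original robots.txt specification, and a popular extension is to interpret * in Disallow values as a wildcard. However, these extensions are not standardized anywhere, and each bot may interpret them differently.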

The most popular definition is arguably the one from Google Search (Google says that Bing, Yahoo, and Ask use the same definition):

* designates 0 or more instances of any valid character

When * is interpreted according to the above definition, both of your URLs would still be allowed to be crawled.

Your /*/admin/* requires three slashes in the path, but http://www.example.org/admin has only one, and http://www.example.org/es/admin has only two.
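To check this by hand, here is a minimal Python sketch of the wildcard definition quoted above. It assumes that a Disallow value is matched from the start of the path and that * is the only special character; the function name rule_matches is made up for this illustration and is not part of any bot's actual implementation.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a Disallow pattern matches a URL path.

    Assumption: * stands for zero or more characters, and the
    pattern only has to match a prefix of the path.
    """
    # Escape the literal parts, then join them with ".*" for each *.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    # re.match anchors at the start only, which gives prefix matching.
    return re.match(regex, path) is not None

rule = "/*/admin/*"
for path in ["/admin", "/es/admin", "/es/admin/", "/es/admin/users", "//admin/"]:
    print(path, rule_matches(rule, path))
```

Under these assumptions the first two paths are not matched, because the rule needs a slash after admin. The path //admin/ is matched, which shows that * can indeed stand for no characters at all.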

(Also note that an empty line between the User-agent and Disallow lines is not allowed.)

You might want to use this:

User-agent: *
Disallow: /admin
Disallow: /*/admin

This would block at least the same URLs, but possibly more than you want to block (depending on your URLs):

User-agent: *
Disallow: /*admin
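The difference between the two variants can be sketched in Python as follows. This again assumes the Google-style reading (* is zero or more characters, rules match a prefix of the path); is_blocked and the sample paths are hypothetical and only serve the comparison.

```python
import re

def is_blocked(patterns, path):
    """True if any Disallow pattern matches the path.

    Assumption: * means zero or more characters, prefix match.
    """
    for pattern in patterns:
        regex = ".*".join(re.escape(part) for part in pattern.split("*"))
        if re.match(regex, path):
            return True
    return False

two_rules = ["/admin", "/*/admin"]  # the two-line variant
one_rule = ["/*admin"]              # the shorter, broader variant

# Sample paths are made up for illustration.
for path in ["/admin", "/es/admin", "/administrator", "/badmin", "/es/sysadmin"]:
    print(path, is_blocked(two_rules, path), is_blocked(one_rule, path))
```

With these sample paths, both variants block /admin, /es/admin and /administrator, but only /*admin also blocks /badmin and /es/sysadmin.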

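Keep in mind that bots following the original robots.txt specification would ignore it, as they interpret * literally. If you want to cover both kinds of bots, you would have to add multiple records: a record with User-agent: * for bots that follow the original specification, and a record that lists (in User-agent) all user agents that support the wildcard.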
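A file with two such records might look like the sketch below. The tokens Googlebot and Bingbot are only examples of bots assumed to support the wildcard, and /es/admin is an assumed path; the record for all other bots cannot use *, so it has to list each path prefix literally.

```text
# Bots assumed to support the * wildcard (example names)
User-agent: Googlebot
User-agent: Bingbot
Disallow: /admin
Disallow: /*/admin

# All other bots: * in a path would be taken literally,
# so every prefix is spelled out (/es/admin is an assumed example)
User-agent: *
Disallow: /admin
Disallow: /es/admin
```

A bot that finds a record naming it uses that record and ignores the User-agent: * record, so each record has to be complete on its own.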
