Confusing wildcard in robots.txt: *+*, *%2B*, *%2b*

What do these 3 lines in robots.txt mean (specifically the *+*, *%2B*, and *%2b* parts)?

Disallow: /collections/*+*
Disallow: /collections/*%2B*
Disallow: /collections/*%2b*

The original "standard" only defines:

Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

This means that every path is matched literally as a prefix (no character has a special meaning, as it would in pattern matching).
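As a rough illustration of that literal prefix rule, here is a minimal Python sketch. The function name is_disallowed_literal is made up for this example and is not part of any robots.txt library; it only mirrors the /help example quoted above.

```python
def is_disallowed_literal(path, disallow_value):
    """Original-standard semantics: plain prefix comparison, no wildcards."""
    # In the original standard an empty Disallow value blocks nothing.
    if not disallow_value:
        return False
    return path.startswith(disallow_value)


# The examples from the quoted definition:
print(is_disallowed_literal("/help.html", "/help"))         # True
print(is_disallowed_literal("/help/index.html", "/help"))   # True
print(is_disallowed_literal("/help/index.html", "/help/"))  # True
print(is_disallowed_literal("/help.html", "/help/"))        # False
```

Under these rules a * in a Disallow value would simply be compared as a literal asterisk.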

But it also states:

It is not an official standard backed by a standards body, or owned by any commercial organisation. It is not enforced by anybody...


The more recent Google documentation explains:

Google, Bing, Yahoo, and Ask support a limited form of "wildcards" for path values. These are:

  • * designates 0 or more instances of any valid character.

  • $ designates the end of the URL.

So

Disallow: /collections/*+*
Disallow: /collections/*%2B*
Disallow: /collections/*%2b*

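will disallow every path that starts with /collections/ and contains one of the following anywhere after that prefix:

  • +
  • %2B
  • %2b

These characters have no special meaning in path patterns, so they are matched literally; only the * acts as a wildcard.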
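To make the wildcard behaviour concrete, here is a small Python sketch that translates such a rule into a regular expression. It is only an approximation of the "*" and "$" semantics quoted above, not how any particular crawler is implemented; the helper names rule_to_regex and is_disallowed are invented for this example, and it compares strings as-is, whereas real crawlers may normalize percent-encoding before matching.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path rule using '*' and '$' into a regex."""
    # A trailing '$' anchors the rule to the end of the URL.
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape every literal piece (so '+' stays a plain plus sign),
    # then join the pieces with '.*' in place of each '*'.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path, rules):
    # re.match anchors at the start of the path, which gives prefix semantics.
    return any(rule_to_regex(rule).match(path) for rule in rules)


RULES = [
    "/collections/*+*",
    "/collections/*%2B*",
    "/collections/*%2b*",
]

print(is_disallowed("/collections/shoes+red", RULES))    # True
print(is_disallowed("/collections/shoes%2Bred", RULES))  # True
print(is_disallowed("/collections/shoes%2bred", RULES))  # True
print(is_disallowed("/collections/shoes", RULES))        # False
print(is_disallowed("/products/shoes+red", RULES))       # False
```

Since "*" can match zero characters, the trailing "*" in each rule is redundant under these semantics: /collections/*+ would block the same paths.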