Bots not following robots.txt file

Question

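It seems that some bots are not following my robots.txt file, including MJ12bot, which is the one from majestic.com and is supposed to follow the instructions.

The file looks like this: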

User-agent: google
User-agent: googlebot
Disallow: /results/
Crawl-Delay: 30

User-agent: *
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 30

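What I want to tell the bots is:

Only google may crawl URLs containing /travel/, /viajar/ or /reisen/.
None of them should access any URL containing /results/.
The time between two requests should be at least 30 seconds.

However, MJ12bot is crawling URLs containing /travel/, /viajar/ or /reisen/ anyway, and in addition it does not wait 30 seconds between requests.

mydomain.com/robots.txt shows the file as expected.

Is there anything wrong with the file?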

Answer 1

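Your robots.txt is correct.

For example, MJ12bot should not crawl http://example.com/reisen/42/, but it may crawl http://example.com/42/reisen/.

If you have checked that the host is the same (https vs. http, www vs. no www, same domain name), you could consider sending Majestic a message: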
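One way to check this locally is to feed the file from the question to Python's standard urllib.robotparser and ask it about both URLs. This is only a sketch: it shows how Python's parser reads the rules (including its substring matching of user-agent names), not how MJ12bot itself behaves. The helper name allowed is made up for this example.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the question, copied verbatim.
ROBOTS_TXT = """\
User-agent: google
User-agent: googlebot
Disallow: /results/
Crawl-Delay: 30

User-agent: *
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 30
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str) -> bool:
    """Hypothetical helper: ask the parsed rules whether agent may fetch url."""
    return parser.can_fetch(agent, url)


# Path starts with /reisen/, so the "*" group disallows it.
blocked_prefix = allowed("MJ12bot", "http://example.com/reisen/42/")
# Path starts with /42/, so no Disallow prefix matches.
allowed_nested = allowed("MJ12bot", "http://example.com/42/reisen/")
# Googlebot falls into the first group, which only disallows /results/.
google_travel = allowed("Googlebot", "http://example.com/travel/")
# Crawl delay that applies to bots covered by the "*" group.
delay = parser.crawl_delay("MJ12bot")

print(blocked_prefix, allowed_nested, google_travel, delay)
```

If the second URL comes back as allowed, the rules are being matched from the start of the path, which is the behavior described here.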

We are keen to see any reports of potential violations of robots.txt by MJ12bot.

If you don't want to wait, you could try whether it works when targeting MJ12bot directly:

User-agent: MJ12bot
Disallow: /results/
Disallow: /travel/
Disallow: /viajar/
Disallow: /reisen/
Crawl-Delay: 20

(I changed the Crawl-Delay to 20 because that is the maximum value they support. But specifying a higher value should be no problem, they round it down.)

Update

Why might they crawl http://example.com/42/reisen/? That might be actually my problem, since the url has the form example.com/de/reisen/ or example.com/en/travel/... Should I change to */travel/ then?

The Disallow value is always the beginning of the URL path.

If you want to disallow crawling of http://example.com/de/reisen/, all of the following lines would achieve it:

Disallow: /

Disallow: /d

Disallow: /de

Disallow: /de/

Disallow: /de/r

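etc.

In the original robots.txt specification, * has no special meaning in Disallow values, so Disallow: /*/travel/ would literally block http://example.com/*/travel/.

Some bots support it, though (including Googlebot). The documentation about MJ12bot says: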
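The prefix rule above can be sketched in a few lines of Python. The function is_disallowed is a hypothetical helper written for this illustration, and it ignores details such as percent-encoding and Allow lines.

```python
def is_disallowed(path: str, disallow_value: str) -> bool:
    """Plain prefix match, as in the original robots.txt convention (no wildcards)."""
    # An empty Disallow value means that nothing is disallowed.
    return bool(disallow_value) and path.startswith(disallow_value)


path = "/de/reisen/"
candidates = ["/", "/d", "/de", "/de/", "/de/r"]

# Every candidate is a prefix of the path, so each one blocks it.
results = {value: is_disallowed(path, value) for value in candidates}
print(results)

# /reisen/ is not a prefix of /de/reisen/, so this rule does not block it.
print(is_disallowed(path, "/reisen/"))
```

This is why a rule like Disallow: /reisen/ has no effect on URLs of the form example.com/de/reisen/.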

Simple pattern matching in Disallow directives compatible with Yahoo's wildcard specification

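I don't know the Yahoo specification they refer to, but it seems they support it, too.

But if possible, it is of course better to rely on the standard features, e.g.: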
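To make the difference concrete, here is a rough Python sketch that contrasts a literal prefix match with a wildcard match. It assumes that * stands for any sequence of characters, which is how Googlebot documents it. Whether MJ12bot interprets * in exactly the same way is an assumption, and both function names are invented for this example.

```python
import re


def literal_match(rule: str, path: str) -> bool:
    """Original convention: the rule is a plain prefix and * is an ordinary character."""
    return path.startswith(rule)


def wildcard_match(rule: str, path: str) -> bool:
    """Assumed extension: * matches any sequence of characters; still anchored at the start."""
    regex = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(regex, path) is not None


rule = "/*/travel/"

# Without wildcard support only a path containing a literal "*" segment is blocked.
print(literal_match(rule, "/en/travel/paris"))
print(literal_match(rule, "/*/travel/paris"))

# With wildcard support any first path segment is accepted.
print(wildcard_match(rule, "/en/travel/paris"))
print(wildcard_match(rule, "/travel/paris"))
```

Under these assumptions, a bot without wildcard support silently ignores /*/travel/ for real URLs, which is one more reason to prefer plain prefixes.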

User-agent: *
Disallow: /en/travel/
Disallow: /de/reisen/
