What does it mean if robots.txt allows everything and disallows everything?

Question

I am trying to scrape several websites using the Beautiful Soup and mechanize libraries in Python. However, I came across a site with the following robots.txt:

User-Agent: *
Allow: /$
Disallow: /

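According to Wikipedia, an Allow directive counteracts a following Disallow directive. I have read through some simpler examples and understand how that works, but this case confuses me a bit. Am I right in thinking that my crawler is allowed to access everything on this site? If so, it seems really odd that the site would bother writing a robots.txt in the first place...

Additional info: when I tried to scrape this site, mechanize gave me an error along the lines of Http error 403, crawling is prohibited because of robots.txt. If my assumption above is correct, then I think mechanize returned the error either because it is not equipped to handle this kind of robots.txt or because it follows a different standard for interpreting robots.txt files. (In that case I would just have to make my crawler ignore robots.txt.)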

Update:

I just stumbled upon this question:

robots.txt allow root only, disallow everything else?

In particular, I read @eywu's answer, and now I think my initial assumption was wrong: I am only allowed to access website.com, not website.com/other-stuff.

Answer 1

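No, your crawler can only access the home page.

The Allow directive gives you access to /$; the $ is important here! It means that only the literal / path matches. Every other path (such as /foo/bar) is disallowed by the Disallow directive, which matches all paths (it has no $).

See the Google documentation on path matching: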

$ designates the end of the URL

Mechanize is interpreting the robots.txt file correctly.
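To make the effect of the $ concrete, here is a minimal sketch of the path matching described above. It assumes Google-style semantics (a * matches any run of characters, a trailing $ anchors the end of the path, everything else is a prefix match). The helper names pattern_to_regex and matches are made up for this illustration and are not part of mechanize or any other library.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: Google-style matching, where '*' matches any run of
    characters and a trailing '$' anchors the end of the URL path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, which gives prefix semantics.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/$", "/"))         # True: the home page matches Allow: /$
print(matches("/$", "/foo/bar"))  # False: the Allow rule does not apply
print(matches("/", "/foo/bar"))   # True: Disallow: / matches every path
```

Under these assumptions, the only path the Allow rule ever matches is / itself, so every other path falls through to Disallow: /.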

Answer 2

Your update is correct. You can access http://example.com/, but not http://example.com/page.htm.

This is from the Robots.txt Specifications; see the section at the bottom of the page titled "Order of precedence for group-member records", which states:

URL                             allow:     disallow:  Verdict    Comments
http://example.com/page         /p         /          allow
http://example.com/folder/page  /folder/   /folder    allow
http://example.com/page.htm     /page      /*.htm     undefined
http://example.com/             /$         /          allow
http://example.com/page.htm     /$         /          disallow
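The verdicts in the table can be reproduced with a small sketch. It assumes that when both an allow and a disallow pattern match, the longer (more specific) pattern wins, with allow winning a tie, and that a conflict involving a * wildcard is reported as undefined, as in the table quoted above. The function names matches and verdict are hypothetical and only meant to illustrate the rule; this is not the code of any real parser.

```python
import re


def matches(pattern: str, path: str) -> bool:
    """Google-style match: '*' is a wildcard, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in core.split("*"))
    return re.match(regex + (r"\Z" if anchored else ""), path) is not None


def verdict(path: str, allow: str, disallow: str) -> str:
    """Decide between one allow and one disallow pattern for a path."""
    allow_hit = matches(allow, path)
    disallow_hit = matches(disallow, path)
    if not disallow_hit:
        return "allow"      # nothing forbids the path
    if not allow_hit:
        return "disallow"   # only the disallow rule applies
    # Both rules match: the quoted table leaves wildcard conflicts undefined.
    if "*" in allow or "*" in disallow:
        return "undefined"
    # Assumption: the longer pattern is the more specific one and wins.
    return "allow" if len(allow) >= len(disallow) else "disallow"


rows = [
    ("/page", "/p", "/"),
    ("/folder/page", "/folder/", "/folder"),
    ("/page.htm", "/page", "/*.htm"),
    ("/", "/$", "/"),
    ("/page.htm", "/$", "/"),
]
for path, allow, disallow in rows:
    print(path, allow, disallow, verdict(path, allow, disallow))
```

The last two rows are the case from the question: / is allowed because /$ matches it and is longer than /, while /page.htm is disallowed because only Disallow: / matches.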

python

robots.txt

mechanize