Allow only one file of a directory in robots.txt?
I want to allow only one file of the directory /minsc, but I want the rest of the directory to be disallowed.
My robots.txt currently looks like this:
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/
The file that I want to allow is /minsc/menu-leaf.png.
I'm afraid of causing damage, so I don't know whether I should use:
A)
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/
Allow: /minsc/menu-leaf.png
or
B)
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/*    (added "*")
Allow: /minsc/menu-leaf.png
?
Thanks, and sorry for my English.
Robots.txt is an 'informal' standard and can be interpreted in different ways. The only interesting 'standard' is really how the major players interpret it.
I found this source saying that globbing ('*'-style wildcards) is not supported:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
http://www.robotstxt.org/robotstxt.html
So according to that source, you should stick with your alternative (A).
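As a quick illustration, Python's bundled urllib.robotparser follows those original rules and treats a '*' in a path literally rather than as a wildcard (some crawlers, e.g. Googlebot, do support wildcards, so behaviour varies by parser). A minimal sketch, with example.com as a placeholder domain:

from urllib.robotparser import RobotFileParser

# Variant (B): a literal "*" at the end of the path
rules_b = """User-agent: *
Disallow: /minsc/*
"""

rp = RobotFileParser()
rp.parse(rules_b.splitlines())

# A strict parser does not expand the "*", so this file blocks nothing in /minsc/:
print(rp.can_fetch("*", "https://example.com/minsc/foo.png"))  # True (not blocked!)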
To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively, you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
According to Wikipedia, if you are going to use the Allow directive, it should go before Disallow for maximum compatibility:
Allow: /directory1/myfile.html
Disallow: /directory1/
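You can see the order mattering with a parser that applies the first matching rule, such as Python's urllib.robotparser (Google instead prefers the most specific matching rule, which is exactly why this is a compatibility question). A small sketch, again with example.com as a placeholder:

from urllib.robotparser import RobotFileParser

def can_fetch(rules: str, url: str) -> bool:
    # Parse the given robots.txt text and ask whether any robot may fetch url.
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch("*", url)

allow_first = "User-agent: *\nAllow: /directory1/myfile.html\nDisallow: /directory1/\n"
disallow_first = "User-agent: *\nDisallow: /directory1/\nAllow: /directory1/myfile.html\n"

url = "https://example.com/directory1/myfile.html"
print(can_fetch(allow_first, url))     # True  - the Allow rule matches first
print(can_fetch(disallow_first, url))  # False - the Disallow prefix matches first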
Also, according to Yandex, you should put Crawl-delay last:
To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Crawl-delay directive needs to be added to the group that starts with the User-Agent record right after the Disallow and Allow directives.
So, finally, your robots.txt file should look like this:
User-agent: *
Allow: /minsc/menu-leaf.png
Disallow: /minsc/
Crawl-delay: 10
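If you want to double-check the result, here is a minimal sketch using Python's urllib.robotparser (just one well-known first-match implementation; example.com is a placeholder, and other crawlers may interpret the file differently):

from urllib.robotparser import RobotFileParser

robots_txt = """User-agent: *
Allow: /minsc/menu-leaf.png
Disallow: /minsc/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/minsc/menu-leaf.png"))  # True  - explicitly allowed
print(rp.can_fetch("*", "https://example.com/minsc/other.png"))      # False - rest of /minsc/ is blocked
print(rp.crawl_delay("*"))                                           # 10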