Allow only one file in a directory in robots.txt?

I want to allow only one file in the directory /minsc and disallow the rest of that directory.

Right now robots.txt looks like this:

User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/

The file I want to allow is /minsc/menu-leaf.png.

I'm afraid of doing damage, so I don't know whether I have to use:

A)

User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/
Allow: /minsc/menu-leaf.png

B)

User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/*    # added "*"
Allow: /minsc/menu-leaf.png

?

Thanks, and sorry for my English.

robots.txt is an 'informal' standard and open to different interpretations. The only 'standard' that matters in practice is how the major players interpret it.

I found this source saying that globbing ('*'-style wildcards) is not supported:

Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".

http://www.robotstxt.org/robotstxt.html

So according to this source, you should stick with your alternative (A).
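If you want to see that behavior for yourself, Python's standard-library parser (urllib.robotparser) follows the original specification and treats '*' in a path as a literal character, not a wildcard. A minimal sketch, assuming the asker's paths and a placeholder host example.com (major crawlers like Googlebot do support wildcards, so results there will differ):

from urllib import robotparser

def allowed(rules, url):
    # Feed a robots.txt body to the stdlib parser and test one URL.
    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch("*", url)

url = "http://example.com/minsc/other.png"  # a page that should stay blocked

# Option (B): a strict parser reads "/minsc/*" as a literal prefix,
# so it matches nothing under /minsc/ and the page slips through.
print(allowed("User-agent: *\nDisallow: /minsc/*", url))  # True (not blocked!)

# Option (A): plain prefix matching works everywhere.
print(allowed("User-agent: *\nDisallow: /minsc/", url))   # False (blocked)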

According to the robots.txt website:

To exclude all files except one

This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:

User-agent: *
Disallow: /~joe/stuff/

Alternatively you can explicitly disallow all disallowed pages:

User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
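If the directory holds many files, a short script can generate those Disallow lines instead of typing them by hand. A minimal sketch, assuming a local copy of the site under the hypothetical web root /var/www/html:

from pathlib import Path

docroot = Path("/var/www/html")   # hypothetical web root
keep = "minsc/menu-leaf.png"      # the one file that stays crawlable

print("User-agent: *")
for entry in sorted((docroot / "minsc").iterdir()):
    rel = entry.relative_to(docroot).as_posix()
    if entry.is_file() and rel != keep:
        print(f"Disallow: /{rel}")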

According to Wikipedia, if you are going to use the Allow directive, it should come before Disallow for maximum compatibility:

Allow: /directory1/myfile.html
Disallow: /directory1/
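You can observe why the order matters with a strict parser: urllib.robotparser returns the allowance of the first rule whose path matches the URL, so the Allow line only wins when it comes first. (Longest-match crawlers such as Googlebot should let the more specific Allow win in either order, which is why this is about compatibility rather than correctness.) A small sketch with a placeholder host:

from urllib import robotparser

png = "http://example.com/minsc/menu-leaf.png"

for rules in (
    "User-agent: *\nAllow: /minsc/menu-leaf.png\nDisallow: /minsc/",  # Allow first
    "User-agent: *\nDisallow: /minsc/\nAllow: /minsc/menu-leaf.png",  # Disallow first
):
    rp = robotparser.RobotFileParser()
    rp.parse(rules.splitlines())
    print(rp.can_fetch("*", png))

# Prints True, then False: with first-match semantics the order decides.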

Furthermore, according to Yandex, you should put Crawl-delay last:

To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Crawl-delay directive needs to be added to the group that starts with the User-Agent record (right after the Disallow and Allow directives).

So, in the end, your robots.txt file should look like this:

User-agent: *
Allow: /minsc/menu-leaf.png
Disallow: /minsc/
Crawl-delay: 10
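Before uploading, you can sanity-check that final file with the same standard-library parser (Python 3.6+ is needed for crawl_delay(); example.com is a placeholder for your host):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""User-agent: *
Allow: /minsc/menu-leaf.png
Disallow: /minsc/
Crawl-delay: 10
""".splitlines())

print(rp.can_fetch("*", "http://example.com/minsc/menu-leaf.png"))  # True
print(rp.can_fetch("*", "http://example.com/minsc/anything-else"))  # False
print(rp.crawl_delay("*"))                                          # 10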