Allow only one file of a directory in robots.txt?
I want to allow only one file of the directory /minsc, but I want the rest of the directory to be disallowed.
My robots.txt currently looks like this:
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/
The file that I want to allow is /minsc/menu-leaf.png.
I'm afraid of causing damage, so I don't know whether I should use:
A)
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/
Allow: /minsc/menu-leaf.png
or
B)
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /minsc/*    (added "*")
Allow: /minsc/menu-leaf.png
?
Thanks, and sorry for my English.
Robots.txt is an 'informal' standard and can be interpreted in different ways. The only interesting 'standard' is really how the major players interpret it.
I found this source saying that globbing ('*'-style wildcards) is not supported:
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: bot", "Disallow: /tmp/*" or "Disallow: *.gif".
http://www.robotstxt.org/robotstxt.html
So according to that source, you should stick with your alternative (A).
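As a quick illustration, Python's bundled urllib.robotparser follows those original rules and treats a '*' in a path literally rather than as a wildcard (some crawlers, e.g. Googlebot, do support wildcards, so behaviour varies by parser). A minimal sketch, with example.com as a placeholder domain:

from urllib.robotparser import RobotFileParser

# Variant (B): a literal "*" at the end of the path
rules_b = """User-agent: *
Disallow: /minsc/*
"""

rp = RobotFileParser()
rp.parse(rules_b.splitlines())

# A strict parser does not expand the "*", so this file blocks nothing in /minsc/:
print(rp.can_fetch("*", "https://example.com/minsc/foo.png"))  # True (not blocked!)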
To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:
User-agent: *
Disallow: /~joe/stuff/
Alternatively, you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/junk.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
According to Wikipedia, if you are going to use the Allow directive, it should go before Disallow for maximum compatibility:
Allow: /directory1/myfile.html
Disallow: /directory1/
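You can see the order mattering with a parser that applies the first matching rule, such as Python's urllib.robotparser (Google instead prefers the most specific matching rule, which is exactly why this is a compatibility question). A small sketch, again with example.com as a placeholder:

from urllib.robotparser import RobotFileParser

def can_fetch(rules: str, url: str) -> bool:
    # Parse the given robots.txt text and ask whether any robot may fetch url.
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return rp.can_fetch("*", url)

allow_first = "User-agent: *\nAllow: /directory1/myfile.html\nDisallow: /directory1/\n"
disallow_first = "User-agent: *\nDisallow: /directory1/\nAllow: /directory1/myfile.html\n"

url = "https://example.com/directory1/myfile.html"
print(can_fetch(allow_first, url))     # True  - the Allow rule matches first
print(can_fetch(disallow_first, url))  # False - the Disallow prefix matches first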
Also, according to Yandex, you should put Crawl-delay last:
To maintain compatibility with robots that may deviate from the standard when processing robots.txt, the Crawl-delay directive needs to be added to the group that starts with the User-Agent record right after the Disallow and Allow directives.
So, finally, your robots.txt file should look like this:
User-agent: *
Allow: /minsc/menu-leaf.png
Disallow: /minsc/
Crawl-delay: 10
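If you want to double-check the result, here is a minimal sketch using Python's urllib.robotparser (just one well-known first-match implementation; example.com is a placeholder, and other crawlers may interpret the file differently):

from urllib.robotparser import RobotFileParser

robots_txt = """User-agent: *
Allow: /minsc/menu-leaf.png
Disallow: /minsc/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/minsc/menu-leaf.png"))  # True  - explicitly allowed
print(rp.can_fetch("*", "https://example.com/minsc/other.png"))      # False - rest of /minsc/ is blocked
print(rp.crawl_delay("*"))                                           # 10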