How to disallow a short list of URLs from being crawled by Google's crawler using robots.txt

Question

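I have a few pages and URLs that I don't want Google's crawler to crawl.

I know this can be done with robots.txt. I searched Google and found that the rules need to be laid out in robots.txt as follows to block the crawler, but I'm not sure whether this is the right way to do it.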

User-Agent: *
Disallow: /music?
Disallow: /widgets/radio?

Disallow: /affiliate/
Disallow: /affiliate_redirect.php
Disallow: /affiliate_sendto.php
Disallow: /affiliatelink.php
Disallow: /campaignlink.php
Disallow: /delivery.php

Disallow: /music/+noredirect/
Disallow: /user/*/library/music/
Disallow: /*/+news/*/visit
Disallow: /*/+wiki/diff

# AJAX content
Disallow: /search/autocomplete
Disallow: /template
Disallow: /ajax
Disallow: /user/*/tasteomatic

Can I give URLs like this? I mean, can I specify a full URL as the Disallow value?

Disallow: http://www.bba-reman.com/admin/feedback.htm

Edit

My current robots.txt entries look like this:

User-Agent: *
Disallow: /CheckLogin
Disallow: /DTC.pdf
Disallow: /catalogue/bmw.htm
Disallow: /auto-mine/bmw/index.htm
Disallow: /forums/parent.Jmp('i100')
Disallow: /forums/parent.Jmp('i040')
Disallow: /forums/CodeDescriptions.html
Disallow: /forums/parent.Jmp('i050')
Disallow: /forums/parent.Scl('000','24601')
Disallow: /forums/parent.Jmp('i030')
Disallow: /catalogue/peugeot.htm

Is this OK? Please let me know. Thanks.

Answer 1

The value of the Disallow field is always the beginning of the URL path.

So if your robots.txt is accessible at http://example.com/robots.txt and it contains this line

Disallow: http://example.com/admin/feedback.htm

then URLs like these would be disallowed:

http://example.com/http://example.com/admin/feedback.htm
http://example.com/http://example.com/admin/feedback.html
http://example.com/http://example.com/admin/feedback.htm_foo
http://example.com/http://example.com/admin/feedback.htm/bar
…

So if you want to disallow the URL http://example.com/admin/feedback.htm, you have to use

Disallow: /admin/feedback.htm

This blocks URLs like these:

http://example.com/admin/feedback.htm
http://example.com/admin/feedback.html
http://example.com/admin/feedback.htm_foo
http://example.com/admin/feedback.htm/bar
…
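
The prefix behaviour described above can be checked with a small sketch. It is a simplified illustration, not how Googlebot is actually implemented: the helper name `is_disallowed` is hypothetical, it only does plain prefix matching on the path (plus query string), and it ignores wildcards, Allow rules, and User-Agent groups.

```python
from urllib.parse import urlsplit


def is_disallowed(url: str, disallow_values: list[str]) -> bool:
    """Return True if the URL's path starts with any Disallow value.

    Hypothetical helper: plain prefix matching only, no wildcards,
    no Allow rules, no per-crawler groups.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # An empty Disallow value blocks nothing, so it is skipped.
    return any(value and path.startswith(value) for value in disallow_values)


path_rule = ["/admin/feedback.htm"]
full_url_rule = ["http://example.com/admin/feedback.htm"]

urls = [
    "http://example.com/admin/feedback.htm",
    "http://example.com/admin/feedback.html",
    "http://example.com/admin/feedback.htm_foo",
    "http://example.com/admin/feedback.htm/bar",
]

for url in urls:
    # The path rule matches all four URLs; the full-URL rule matches none,
    # because no path on this host starts with "http://example.com/...".
    print(url, is_disallowed(url, path_rule), is_disallowed(url, full_url_rule))
```

Under these assumptions, the rule written as a path blocks every URL in the list, while the rule written as a full URL does not block the page it was meant to block.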
