robots.txt and disallowing absolute path URL

Question

I'm using Heroku pipelines. When I push my application, it is pushed to the staging app

https://appname.herokuapp.com/

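and if everything is correct, I promote that app to production. There is no new build process; it is the same app that was built the first time for staging.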

https://appname.com/

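The problem is that this causes a duplicate content issue. The sites are clones of each other, exactly the same. I want to exclude the staging app from Google's index and from search engines.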

One way I thought of is to use a robots.txt file.

For this to work, should I write it like this

User-agent: *
Disallow: https://appname.herokuapp.com/

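using an absolute path? The file will be on the server in both the staging and production apps, and I only want to remove the staging app from Google's index without touching the production one.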

Is this the right way to do it?

Answer 1

No, using what you proposed would block all search engines/bots from accessing https://appname.herokuapp.com/.

What you should use is:

User-agent: Googlebot
Disallow: /

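This will block only Googlebot from accessing https://appname.herokuapp.com/. Keep in mind that bots can ignore the robots.txt file; it is more of a "please" than anything else. Google, however, will follow what you request.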

Edit

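After seeing unor's advice: it is not possible to disallow by URL, so I have changed that in my answer. You can, however, block by a specific path, for example /appname/, or use / to stop Googlebot from accessing anything.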

Answer 2

No, the Disallow field cannot contain a full URL reference. Your robots.txt would block URLs like these:

https://example.com/https://appname.herokuapp.com/
https://example.com/https://appname.herokuapp.com/foo

The Disallow value always represents the beginning of the URL path.

To block all URLs under https://appname.herokuapp.com/, you would need:

Disallow: /

So you would have to use different robots.txt files for https://appname.herokuapp.com/ and https://appname.com/.
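Because the same build runs on both hosts, one possible way to get two different robots.txt responses is to generate the file at request time from the Host header. The sketch below is an assumption about how this could be done, not something the question's stack is known to use: it is a plain WSGI app using only the standard library, and `PRODUCTION_HOST`, `robots_txt_for_host`, and `app` are hypothetical names.

```python
PRODUCTION_HOST = "appname.com"  # assumption: the only host that should be crawled


def robots_txt_for_host(host: str) -> str:
    """Return the robots.txt body for the given Host header value."""
    hostname = host.split(":")[0].lower()  # drop an optional port
    if hostname == PRODUCTION_HOST:
        # An empty Disallow value allows everything.
        return "User-agent: *\nDisallow:\n"
    # Any other host (e.g. the staging app) blocks all compliant bots.
    return "User-agent: *\nDisallow: /\n"


def app(environ, start_response):
    """Minimal WSGI app that serves only /robots.txt."""
    if environ.get("PATH_INFO") == "/robots.txt":
        body = robots_txt_for_host(environ.get("HTTP_HOST", "")).encode("utf-8")
        headers = [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ]
        start_response("200 OK", headers)
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found"]
```

Defaulting to "block everything" for unknown hosts means a new staging or review app is not crawled by accident.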

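If you don't mind bots crawling https://appname.herokuapp.com/, you could use noindex instead. But this would also require different behavior for the two sites. An alternative that doesn't require different behavior is canonical. It conveys to crawlers which URL is preferred for indexing.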

 <link rel="canonical" href="https://appname.com/foobar" />
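As a rough sketch of how the canonical element could be produced identically on both hosts, the snippet below rewrites whatever URL was requested onto the production origin. `CANONICAL_ORIGIN` and `canonical_link` are hypothetical names chosen for this example, and keeping the query string in the canonical URL is an assumption that may not suit every site.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit

CANONICAL_ORIGIN = "https://appname.com"  # assumption: the production origin


def canonical_link(request_url: str) -> str:
    """Build a canonical <link> element pointing at the production host."""
    requested = urlsplit(request_url)
    origin = urlsplit(CANONICAL_ORIGIN)
    # Keep path and query, replace scheme and host, drop any fragment.
    canonical = urlunsplit(
        (origin.scheme, origin.netloc, requested.path or "/", requested.query, "")
    )
    return f'<link rel="canonical" href="{escape(canonical, quote=True)}" />'


print(canonical_link("https://appname.herokuapp.com/foobar"))
# <link rel="canonical" href="https://appname.com/foobar" />
```

Since the output never depends on which host served the page, staging and production can share the same template.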

