Google still indexing unique URLs

Question

I have a robots.txt file set up like this:

User-agent: *
Disallow: /*

This is for a site built entirely around unique URLs, a bit like https://jsfiddle.net/, where saving a new fiddle gives it a unique URL. I want all of my unique URLs to be invisible to Google. No indexing.

Google has indexed all of my unique URLs, even though each result says "A description for this result is not available because of the site's robots.txt file. - learn more"

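But this is still bad, because all the URLs are listed and clickable, so all the data behind them is available. What can I do to 1) get these off Google and 2) stop Google from indexing these URLs?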

Answer 1

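Robots.txt tells search engines not to crawl a page, but it does not stop them from indexing it, especially when other sites link to that page. If your main goal is to guarantee that these pages never appear in search results, use robots meta tags instead. A robots meta tag with 'noindex' means "Do not index this page at all". Blocking a page in robots.txt means "Do not request this page from the server."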

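After adding the robots meta tag, you need to change your robots.txt file so that it no longer disallows these pages. Otherwise robots.txt will stop the crawler from loading the page, which means it never sees the meta tag. In your case, you can simply change robots.txt to: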

User-agent: *
Disallow:

(or remove the robots.txt file entirely)
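To illustrate why the Disallow rule has to go, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The URL `https://example.com/abc123/` and the helper name `can_crawl` are made up for illustration. The sketch uses `Disallow: /` rather than `Disallow: /*` because the standard-library parser does plain prefix matching and does not interpret wildcards the way Google does; that substitution is an assumption of the example, not part of the original setup.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if a well-behaved crawler may request `url`.

    `can_crawl` is a hypothetical helper, not part of any library.
    """
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Blocks every path: the crawler never loads the page, so it can never
# see a noindex meta tag inside it.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# An empty Disallow permits everything: the crawler can load the page
# and read its robots meta tag.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# Hypothetical unique URL, for illustration only.
PAGE = "https://example.com/abc123/"
```

With these definitions, `can_crawl(BLOCK_ALL, PAGE)` is `False` and `can_crawl(ALLOW_ALL, PAGE)` is `True`. Only in the second case does a crawler get far enough to read the page's noindex instruction.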

If for some reason you cannot use robots meta tags, you can also use the X-Robots-Tag header to accomplish the same thing.
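As a rough sketch of both options, the following uses Python's standard-library `http.server` to serve a page that carries the noindex meta tag and also sends the `X-Robots-Tag: noindex` response header. The page content and the names `PAGE_HTML`, `noindex_headers` and `NoIndexHandler` are assumptions made for illustration; a real site would set the same tag or header in its own framework or web server configuration, and either one on its own should be enough.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import List, Tuple

# Hypothetical page body. The meta tag in <head> is the noindex instruction.
PAGE_HTML = b"""<!DOCTYPE html>
<html>
<head>
  <meta name="robots" content="noindex">
  <title>Saved item</title>
</head>
<body>Saved content</body>
</html>
"""


def noindex_headers(body: bytes) -> List[Tuple[str, str]]:
    """Build response headers, including the header form of noindex."""
    return [
        ("Content-Type", "text/html; charset=utf-8"),
        ("X-Robots-Tag", "noindex"),
        ("Content-Length", str(len(body))),
    ]


class NoIndexHandler(BaseHTTPRequestHandler):
    """Answers every GET with the same page, marked noindex."""

    def do_GET(self) -> None:
        self.send_response(200)
        for name, value in noindex_headers(PAGE_HTML):
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(PAGE_HTML)


if __name__ == "__main__":
    # Port 8000 is an arbitrary choice for local experimentation.
    HTTPServer(("127.0.0.1", 8000), NoIndexHandler).serve_forever()
```

The header form is useful mainly for responses that cannot contain a meta tag, such as non-HTML files.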

indexing

robots.txt

google-search-console