Robots.txt

Question

I want to prevent the *.html files on our site from being indexed, so that only the clean URLs get indexed.

So I want www.example.com/en/login to be indexed, but not www.example.com/en/login/index.html.

Currently I have:

User-agent: *
Disallow: /
Disallow: /**.html   - not working
Allow: /$
Allow: /*/login*

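I know I can disallow individual files, e.g. Disallow: /*/login/index.html, but my problem is that I have a lot of these .html files that I don't want indexed. Is there a way to disallow all of them at once rather than listing each one individually?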

Answer 1

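First, you keep using the word "indexed", so I want to make sure you're aware that the robots.txt convention only advises automated crawlers to avoid certain URLs on your domain. Pages listed in a robots.txt file can still show up in search engine indexes if the engines have other data about the page. For example, Google explicitly states that it will still index and list a URL even if it is not allowed to crawl it. I just wanted you to be aware of that in case you're using "indexed" to mean "listed in a search engine" rather than "getting crawled by an automated program".

Second, there is no standard way to accomplish what you're asking for. Per "The Web Robots Pages":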

Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".
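
To make the quoted restriction concrete, here is a minimal Python sketch of the plain prefix matching that the original convention implies. The function name is_disallowed_prefix and the sample rules are hypothetical, chosen only for illustration; this is not how any particular crawler is implemented.

```python
def is_disallowed_prefix(path: str, disallow_rules: list[str]) -> bool:
    """Original-convention matching: a rule blocks any path that starts with it."""
    for rule in disallow_rules:
        # An empty Disallow value means "nothing is disallowed", so skip it.
        if rule and path.startswith(rule):
            return True
    return False


# Hypothetical rules: one literal prefix and one attempted wildcard.
rules = ["/tmp/", "/*.html"]

print(is_disallowed_prefix("/tmp/report.html", rules))      # True: literal prefix /tmp/
print(is_disallowed_prefix("/en/login/index.html", rules))  # False: "*" is just a character here
```

Under this reading, a rule such as /*.html only blocks paths that literally begin with the characters "/*.html", which is why it does nothing useful for a crawler that follows the original convention strictly.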

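That said, wildcard support is a common extension that many crawlers implement. For example, Google's documentation of the directives it supports describes pattern matching that treats * as a wildcard. So you can add a Disallow: /*.html$ directive, and Google will then not crawl URLs ending in .html, though they may still end up in search results.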
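
As a rough illustration of what the /*.html$ pattern matches, the following Python sketch approximates this style of matching, where * stands for any run of characters and a trailing $ anchors the end of the URL. The helper google_style_match is a hypothetical name, and the sketch checks a single rule only; it does not model how a real crawler resolves conflicts between Allow and Disallow lines.

```python
import re


def google_style_match(pattern: str, path: str) -> bool:
    """Match a path against one rule using '*' (any characters) and a trailing '$' (end of URL)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' in place of '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        # With '$', the rule must cover the whole path.
        return re.fullmatch(regex, path) is not None
    # Without '$', the rule only has to match from the start of the path.
    return re.match(regex, path) is not None


rule = "/*.html$"
for path in ["/en/login", "/en/login/index.html", "/en/page.html?x=1"]:
    print(path, google_style_match(rule, path))
```

In this sketch the clean path /en/login is not matched, /en/login/index.html is matched, and /en/page.html?x=1 is not matched because the $ requires the URL to end in .html.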

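However, if your main goal is to tell search engines which URL you consider "clean" and preferred, then what you're really looking for is canonical URLs. You can put a link rel="canonical" element on each page with your preferred URL for that page, and search engines that honor the element will use it to determine the preferred path to show for that page.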
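
One way to produce such an element is sketched below in Python. The helper canonical_link is hypothetical, and it rests on two assumptions that are not in the question: that the site is served over https, and that the clean URL is simply the same path with a trailing /index.html removed.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link(url: str) -> str:
    """Return a <link rel="canonical"> element pointing at the clean form of url."""
    parts = urlsplit(url)
    path = parts.path
    # Assumption: the clean URL is the same path without a trailing "/index.html".
    if path.endswith("/index.html"):
        path = path[: -len("/index.html")] or "/"
    # Query string and fragment are dropped in this sketch.
    clean = urlunsplit((parts.scheme, parts.netloc, path, "", ""))
    return f'<link rel="canonical" href="{escape(clean, quote=True)}">'


print(canonical_link("https://www.example.com/en/login/index.html"))
# <link rel="canonical" href="https://www.example.com/en/login">
```

The resulting element would go in the head of both /en/login and /en/login/index.html, so that either address points search engines at the same preferred URL.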
