How to stop Google from crawling /fileadmin

Question

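I use TYPO3 on a site with about 4000 pages. In /fileadmin I store HTML pages that TYPO3 fetches with the "fetchurl" plugin. The folder structure has the same hierarchy as the TYPO3 page tree: fileadmin/folder1/folder2/folder3/file.html is rendered as www.example.com/folder1/folder2/folder3/file.html. In every case the tree structure corresponds exactly to the site's navigation structure.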

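The HTML pages contain only minimal formatting tags such as p, div, img, etc. No CSS, no header, no footer. TYPO3 does the rest. I protect /fileadmin with robots.txt to keep crawlers from indexing it. Yes, I know that some crawlers grab everything regardless of robots.txt. (By the way, in Apache I have already blocked access for many crawlers.)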

This approach worked fine for 20 years without any problems, but today I received an email from Google with the following content:

Top Warnings. ... Some warnings can affect your appearance on Search; some might be reclassified as errors in the future (emphasis mine). The following warnings were found on your site:

Indexed, though blocked by robots.txt

We recommend that you fix these issues when possible to enable the best experience and coverage in Google Search.

The question is: what is the best way to solve this?

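- Replace the .html extension with something else and restrict access with FilesMatch?
- Use folder permissions to block external access?
- Move /fileadmin outside public_html? (For years I have wanted to move a lot of folders outside public_html.)
- Use a 'noindex' tag? (Does it work in a filename.html that has no !DOCTYPE declaration and no head tag?)
Any other ideas?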

Thanks

Answer 1

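Ideally, you want to move the files outside the document root (public_html). I don't know the fetchurl extension, but judging from the description, it needs a URL to access these files. So this may not be an option unless you replace that extension.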

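If the files cannot be moved out of the document root, I would restrict access by IP address. You can do this by adding an .htaccess file to fileadmin (assuming .htaccess support is not disabled on your server) with the following content: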

<RequireAny>
  Require local
</RequireAny>

Or, if you are using Apache < 2.4:

Order deny,allow
Deny from all
Allow from 127.0.0.1
Allow from ::1
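
One caveat worth checking: `Require local` matches only loopback addresses (127.0.0.0/8 and ::1) and connections where the client and server address are the same. If the fetchurl requests reach Apache from a different address (for example through a proxy or NAT), they would be denied as well. A minimal sketch for Apache 2.4 that also allows one extra address is shown below; the address 203.0.113.10 is a placeholder from the documentation range, not something taken from the question, and whether it is needed at all is an assumption about your setup.

```apache
# Sketch for Apache 2.4+, placed in fileadmin/.htaccess.
# 203.0.113.10 is a hypothetical placeholder: replace it with the
# address your server's own requests actually arrive from
# (the Apache access log shows this).
<RequireAny>
  # Loopback, or client address equal to server address
  Require local
  # Assumed extra address used by the server when it fetches its own URLs
  Require ip 203.0.113.10
</RequireAny>
```

After applying it, requesting one of the fileadmin URLs from an outside machine should return 403 Forbidden, while the TYPO3 pages that embed the fetched content should still render.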

html

apache

typo3

typo3-7.6.x