Prevent bots from crawling dynamic JavaScript files

Question

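I need to prevent bots from crawling .js files. As you know, Google is able to crawl .js files. There is only one .js file, but it changes with each new deployment and update.

For example: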

<script type="text/javascript" src="/7c2af7d5829e81965805cc932aeacdea8049891f.js?js_resource=true"></script>

I want to confirm, since I don't know how to verify it, whether this is correct:

// robots.txt
Disallow: /*.js$

Also, does the same apply if the .js file is served from a CDN?

Answer 1

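The original robots.txt standard does not support wildcards or regular expressions, although some crawlers, Google's among them, accept "*" and "$" as extensions. From http://www.robotstxt.org: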

Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

You should move your JavaScript files into a directory that is disallowed in the robots file:

User-agent: *
Disallow: /hidden-javascript/
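
One way to check a plain directory rule like this locally is Python's standard urllib.robotparser, which does simple prefix matching. This is only a sketch: the host example.com and the file names app.js and index.html are made up for illustration, and a real crawler may interpret rules differently.

```python
from urllib.robotparser import RobotFileParser

# The rules from the answer above, as they would appear in robots.txt.
rules = """\
User-agent: *
Disallow: /hidden-javascript/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical URLs, used only to exercise the rule.
js_allowed = parser.can_fetch("*", "https://example.com/hidden-javascript/app.js")
page_allowed = parser.can_fetch("*", "https://example.com/index.html")

print(js_allowed)    # False: the path starts with /hidden-javascript/
print(page_allowed)  # True: no Disallow rule covers this path
```

Note that urllib.robotparser does not interpret "*" or "$" inside paths, so it is only suitable for checking prefix rules such as this one.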

Answer 2

# robots.txt
Disallow: /*.js?js_resource

This works fine. You can test your robots.txt in Google Search Console (formerly Google Webmaster Tools).
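
For a quick local sanity check, here is a minimal sketch of Google-style pattern matching, assuming the documented behaviour that "*" matches any run of characters, a trailing "$" anchors the end of the URL, and the query string is part of what gets matched. The function name rule_matches is hypothetical, and this is not Google's actual implementation.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a Google-style robots.txt path pattern matches path."""
    # A trailing '$' anchors the pattern to the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # Rules match from the start of the path (prefix match unless anchored).
    return re.match(regex, path) is not None


url = "/7c2af7d5829e81965805cc932aeacdea8049891f.js?js_resource=true"

print(rule_matches("/*.js$", url))             # False: the URL ends in "=true", not ".js"
print(rule_matches("/*.js?js_resource", url))  # True
```

Under these assumptions, the rule from the question (/*.js$) would not cover the example URL, because the query string follows ".js", while the rule in this answer does.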
