Google 'Sitemap contains urls which are blocked by robots.txt' warning

Question

The problem is that a robots.txt whitelist built on Disallow: / doesn't work with Google as expected.

Google has trouble with these restrictive robots.txt rules:

User-agent: *
Host: sitename
Allow: /$
Allow: /sitemap.xml
Allow: /static/
Allow: /articles/
Disallow: /
Disallow: /static/*.js$

Here sitemap.xml contains / and numerous /articles/... URLs:

<url><loc>http://sitename/</loc><changefreq>weekly</changefreq></url>
<url><loc>http://sitename/articles/some-article</loc><changefreq>weekly</changefreq></url>
<url><loc>http://sitename/articles/...</loc><changefreq>weekly</changefreq></url>
...
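To see exactly which paths have to pass the robots.txt rules, the sitemap entries can be reduced to their URL paths. This is only a minimal sketch: the `<urlset>` wrapper and its namespace are assumed from the standard sitemap format (the excerpt above omits them), and `sitemap_paths` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Sample built from the excerpt above; the <urlset> wrapper is an assumption.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://sitename/</loc><changefreq>weekly</changefreq></url>
<url><loc>http://sitename/articles/some-article</loc><changefreq>weekly</changefreq></url>
</urlset>"""

# Namespace used by standard sitemap files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_paths(xml_text):
    """Return the path component of every <loc> entry in the sitemap."""
    root = ET.fromstring(xml_text)
    return [
        urlparse(loc.text.strip()).path
        for loc in root.findall("sm:url/sm:loc", NS)
    ]


if __name__ == "__main__":
    for path in sitemap_paths(SITEMAP):
        print(path)
```

The robots.txt rules are matched against these paths (here / and /articles/some-article), not against the full URLs.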

Crawl / robots.txt Tester in Google Search Console interprets the file correctly and shows these URLs as allowed ('Fetch as Google' works as well):

sitename/

sitename/articles/some-article

However, the report in Crawl / Sitemaps shows that sitemap.xml has a problem with every /articles/... URL. The warning is:

Sitemap contains urls which are blocked by robots.txt

As a result, only / is indexed (at some point it was even removed from the index, although Google never complained about it in the sitemap report).

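The reason for this setup is that Google fails to render SPA routes properly, so some SPA routes (/ and /articles/...) were prerendered as fragments and allowed for crawling. The other routes are not prerendered yet, and I don't want them to be crawled for now.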

I temporarily replaced Disallow: / with a blacklist of all known routes that have no fragments, and the problem disappeared:

User-agent: *
Host: sitename
Allow: /$
Allow: /sitemap.xml
Allow: /static/
Allow: /articles/
Disallow: /blacklisted-route1
Disallow: /blacklisted-route2
...
Disallow: /static/*.js$

What is wrong with the former approach? Why does Google behave like this?

The robots.txt rules are quite unambiguous, and Google's robots.txt Tester only confirms this.
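For reference, the precedence that Google documents for Allow and Disallow rules (the most specific rule, measured by path length, wins, and on a tie the less restrictive rule is used) can be approximated in a few lines of Python. This is only a sketch under assumptions, not Google's actual implementation: the helper names `pattern_to_regex` and `is_allowed` are hypothetical, the Host line is left out, and the raw pattern length (including `*` and `$`) stands in for specificity.

```python
import re

# Rules copied from the first robots.txt above (the Host line is omitted).
RULES = [
    ("allow", "/$"),
    ("allow", "/sitemap.xml"),
    ("allow", "/static/"),
    ("allow", "/articles/"),
    ("disallow", "/"),
    ("disallow", "/static/*.js$"),
]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' and trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + (r"\Z" if anchored else ""))


def is_allowed(path, rules=RULES):
    """Apply longest-match precedence; on equal length, allow wins."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            allow = kind == "allow"
            # Assumption: pattern length approximates rule specificity.
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


if __name__ == "__main__":
    for path in ["/", "/articles/some-article", "/static/app.js", "/other-route"]:
        print(path, is_allowed(path))
```

Under this reading, / and /articles/some-article come out as allowed, while /static/app.js and routes outside the whitelist are blocked, which matches what the robots.txt Tester reports.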

Answer 1

When you allow /$ and disallow /, the disallow wins (see the order of precedence for group-member records documented at https://developers.google.com/search/reference/robots_txt).

Forget my earlier comment about the last rule taking precedence over the first. It doesn't apply in your case.

To get rid of the fragments, use a canonical tag. If you don't want Google to crawl your pages, set nofollow.

sitemap

robots.txt

google-search

single-page-application

google-search-console